
DATA ANALYTICS - FOUNDATION

Inference, Regression and Stochastic Processes


Data Analytics- Foundation

Inference, Regression and

Stochastic Processes

Man Van Minh Nguyen

LAP LAMBERT Academic Publishing


Imprint

Any brand names and product names mentioned in this book are subject to trademark, brand or patent
protection and are trademarks or registered trademarks of their respective holders. The use of brand
names, product names, common names, trade names, product descriptions etc., even without a particular
marking in this work, is in no way to be construed to mean that such names may be regarded as
unrestricted in respect of trademark and brand protection legislation and could thus be used by anyone.

Cover image: www.ingimage.com

Publisher:
LAP LAMBERT Academic Publishing
is a trademark of
International Book Market Service Ltd., member of OmniScriptum Publishing Group
17 Meldrum Street, Beau Bassin 71504, Mauritius
Printed at: see last page

Nguyen, Man VM.


Data Analytics- Foundation: Inference, Regression and Stochastic Processes / Man VM. Nguyen – 1st
ed.
ISBN: 978-620-2-79791-7
1. Modeling - Statistics. 2. Stochastic analysis.

Copyright © 2020 Man VM. Nguyen


All rights reserved

Copyright © 2020 International Book Market Service Ltd., member of OmniScriptum Publishing Group
To the memory of my late parents,

to my inspiring mentors, and my wife with love


DATA ANALYTICS for SCIENCE and ENGINEERING

New Looks, Approaches and Frontiers

Subject headings

statistical modeling / statistical inference


statistical designs / statistical optimization
probabilistic modeling / probabilistic optimization
mathematical programming / operations research
stochastic analysis / stochastic process

——————————————————————————

DATA ANALYTICS is a series of textbooks that provide the background of concepts, statistical methods,
probabilistic models and practical research problems in science, engineering and sustainable development.
The main point is to develop solutions based on Statistical Science in specific areas of engineering,
actuarial science, industrial production and traffic science, in the context of sustainable economic
development.
Statistical and probabilistic methods essentially reveal important information and knowledge in data sets
observed across a wide range of application domains and sectors, such as actuarial science, financial
science, quality control, government bodies, supply chain management, urban traffic management, software
and industrial manufacturing.
The books in the series provide statistical support to practitioners, experts and researchers who work
in various fields, including not only actuaries, computer scientists, environmentalists, finance analysts,
government decision makers and corporate managers, but also process engineers, software engineers, and
traffic engineers.
The series also supports students from master's to doctoral level who are taking courses in Data Science
and/or Data Analytics, and anyone who must daily find practical solutions and make optimal decisions
using actual observational data sets that are becoming ever larger in size and more complex in structure.
DATA ANALYTICS- STATISTICAL FOUNDATION

Inference, Linear Regression and

Stochastic Processes

Man Van Minh Nguyen

Department of Mathematics

Faculty of Science, Mahidol University

Copyright © 2020 by Man Van Minh Nguyen

All rights reserved.


Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv

Content Briefs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

Organization of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

Part A: Probability - Probability Distribution 1


Chapter 1 Probability Theory and Probability Distributions
Can uncertainty be evaluated? 3

1.1 Specific problems need Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.1 Basic operations and rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.2 Assign probabilities to events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.2.3 Independent events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.4 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.5 Bayes’ Theorem and usages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Random Variables and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.1 Key definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.2 Types of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.4 Important Discrete Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.1 Bernoulli distribution 𝐵 ∼ B(𝑝) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.2 Binomial distribution 𝑋 ∼ Bin(𝑛, 𝑝) . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.4.3 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.5 Continuous Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.5.1 Continuous Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

1.5.2 Normal (or Gauss) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.5.3 The standard normal distribution and properties . . . . . . . . . . . . . . . . . . . 26


1.5.4 Exponential distribution- the second important one . . . . . . . . . . . . . . . . . . 31

1.5.5 Gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

1.5.6 Weibull distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

1.5.7 Beta distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

1.6.1 Popular numerical sets for quantitative measurements and counts . . . . . . . . . . 37

1.6.2 Computing Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

1.6.3 Key probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

1.6.4 On Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

1.7 Basic Probability Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

1.7.1 How to compute probability of an event? . . . . . . . . . . . . . . . . . . . . . . . 41

1.7.2 Basic type problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

1.7.3 Using expectation in practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

1.7.4 Self-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Chapter 2 Statistical Science for Data Analytics


Does data make sense in services? 53

2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

2.1.1 Practical problems require statistical analysis . . . . . . . . . . . . . . . . . . . . . 55

2.1.2 Statistics in Engineering and Science . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.1.3 Roles of statisticians, data scientists or practitioners . . . . . . . . . . . . . . . . . . 58

2.2 Populations, samples and statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

2.3 Scientific data and characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.3.1 Characteristics of various data sets- Our approach . . . . . . . . . . . . . . . . . . 60

2.3.2 Statistical analysis of observed data sets . . . . . . . . . . . . . . . . . . . . . 61

2.3.3 What are various data sources used for? . . . . . . . . . . . . . . . . . . . . . . 62

Part B: Data Exploration and Statistical Inference 63


Chapter 3 Exploratory Data Analysis
Making observed data meaningful 65

3.1 Introduction and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.1.1 What is Exploratory Data Analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.1.2 Key statistical concepts for EDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.2 Visualize sample data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


3.2.1 Enumeration table (frequency table) . . . . . . . . . . . . . . . . . . . . . . . . . . 70


3.2.2 Charts - a representation of a variable’s values . . . . . . . . . . . . . . . . . . . . 71
3.2.3 Examining a distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3 Software for exploratory data analysis (EDA) . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Measure of Central and Spreading Tendency . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Measure of Central Tendency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.2 Measures of spreading tendency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5 Measure of Dispersion (Variability) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.5.1 Measure of Dispersion- Variance and Standard deviation . . . . . . . . . . . . . . . 81
3.5.2 Dispersion- Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5.3 Dispersion- Interquartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5.4 Summary- Critical thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.6 Visualization for Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.6.1 Statistics of the ordered samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.6.2 Box-plot- definition and drawing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.6.3 Extra indicators for the shape of a distribution of observations . . . . . . . . . . . . 88
3.6.4 Quantile plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.7 Prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.8 Association Between Two Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.8.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.8.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.9.1 On sample variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.9.2 Percentiles- Mathematical formula . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.9.3 Summary of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Chapter 4 Statistical Parameter Estimation


Estimating parameters of a population 99
4.1 Overview and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2.1 Fundamental notation and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.2 Population parameters and their estimator . . . . . . . . . . . . . . . . . . . . . . 103
4.2.3 Statistical parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104


4.3 Point Estimation- Concepts and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


4.3.1 Point Estimation - Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.2 The Point Estimators of the mean 𝜇 and variance 𝜎² . . . . . . . . . . . . . . . 106
4.3.3 Error types of a random sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3.4 The Method of Moment Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.3.5 The Method of Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 112
4.4 The Law of Large Numbers and Central Limit Theorem . . . . . . . . . . . . . . . . . . . 115
4.4.1 Sample sizes for estimating the sample mean . . . . . . . . . . . . . . . . . . . . . 115
4.4.2 The law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4.3 Central Limit Theorem - Sampling Distribution . . . . . . . . . . . . . . . . . . . . 116
4.5 Confidence Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.1 Components of a confidence interval estimation . . . . . . . . . . . . . . . . . . . 118
4.5.2 Computing Confidence Interval in R . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.6 Estimation of Population Mean- 𝜎 known case . . . . . . . . . . . . . . . . . . . . . . . . 120
4.6.1 A problem in Business Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.6.2 Use Sampling Distribution of the sample mean 𝑋 . . . . . . . . . . . . . . . . . . . 123
4.6.3 Confidence Interval: Two-sided case . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.6.4 Confidence Interval: One-sided case . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.7 Interval Estimation of Population Mean- 𝜎 unknown . . . . . . . . . . . . . . . . . . . . 127
4.7.1 𝑡 distribution- Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.7.2 𝑡 distribution- Usage for interval estimation . . . . . . . . . . . . . . . . . . . . . . 128
4.7.3 Find sample size given error and variance . . . . . . . . . . . . . . . . . . . . . . . 131
4.8 Summary of Terms and Main Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.9 Chapter’s Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Chapter 5 Statistical Hypothesis Testing


Confirming your claims or beliefs about population parameters 137
5.1 Introduction and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.2 Sampling and Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2.1 Steps in the decision-making process for research . . . . . . . . . . . . . . . . . . . 142
5.2.2 Key sampling distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.3 Hypothesis Testings for Single Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3.1 Hypothesis Testing for the Population Mean- general . . . . . . . . . . . . . . . . . 145
5.3.2 Hypothesis Testing- one side test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157


5.3.3 Hypothesis Tests and Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 158


5.4 Interval Estimation and Hypothesis Testing
for the Population Proportion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.4.1 Key distribution for proportion problems . . . . . . . . . . . . . . . . . . . . . . . . 160
5.4.2 Interval Estimation of the proportion 𝑃 . . . . . . . . . . . . . . . . . . . . . . . . 161
5.5 Testing for two populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.5.1 Test hypothesis for population means . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.5.2 Testing for means from independent samples . . . . . . . . . . . . . . . . . . . . . . 164
5.5.3 Testing for means from matched pairs . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.5.4 Choice of sample sizes for inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.6 Summary for Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.6.1 Terms of Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.6.2 Sampling distributions of popular statistics . . . . . . . . . . . . . . . . . . . . . . 169
5.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.7.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.7.2 Hypothesis Testing in Manufacturing . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.7.3 Hypothesis Testing in Insurance Industry . . . . . . . . . . . . . . . . . . . . . . . . 172
5.7.4 Critical thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.8 Few Case studies with Data Analytics Approach . . . . . . . . . . . . . . . . . . . . . . . 174
5.8.1 Case 1: Inference for political science . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.8.2 Case 2: Inference for natural disaster control . . . . . . . . . . . . . . . . . . . . 175

PART C: Analysis by Statistical Designs and Linear Regression 178


Chapter 6 Experimental Designs in Engineering
Causal analysis with ordinal variables 179
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.2 Fisher variable and statistic- a reminder . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.3 Completely Randomized Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.3.1 CRD, Completely Randomized Design- Theory . . . . . . . . . . . . . . . . . . . . . 184
6.3.2 Specific problem in industrial manufacturing . . . . . . . . . . . . . . . . . . . . . 185
6.3.3 Use linear model for CRD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.3.4 Variance analysis for the response . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.4 Theory of Block Designs (RCBD and BIBD) . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.4.1 Blocking and randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187


6.4.2 The analysis of randomized complete block designs (RCBD) . . . . . . . . . . . . . . 189


6.4.3 Concepts of Balanced Incomplete Block Designs (BIBD) . . . . . . . . . . . . . . . . 190
6.4.4 The ANOVA for a BIBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
6.5 Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.5.1 Statistical model of a factorial experiment . . . . . . . . . . . . . . . . . . . . . . . 195
6.5.2 The ANOVA for Full Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.5.3 The Full Binary Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.5.4 Factorial Designs in Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.6 Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.6.1 Balanced factorial designs with more than two factors . . . . . . . . . . . . . . . . 208
6.7 Summary of Terms- Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

Chapter 7 Experimental Designs II


Analysis with Random Effects Model 217
7.1 Random effects model of a single-factor experiment . . . . . . . . . . . . . . . . . . . . . 218
7.1.1 The linear model of a single-factor . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.1.2 Estimating the Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.2 Random effects model of two-factor experiment . . . . . . . . . . . . . . . . . . . . . . . 221
7.2.1 The linear model of two factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
7.2.2 Testing hypotheses with Fisher statistic . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.3 The Two-Stage Nested Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.3.1 The Statistical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.3.2 The ANOVA Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

Chapter 8 Multivariate Probability Distributions


Simultaneously study random variables 227
8.1 Random vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
8.2 Joint and marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8.2.1 Joint and marginal distributions- the discrete case . . . . . . . . . . . . . . . . . . 230
8.2.2 Joint and marginal distributions- the continuous case . . . . . . . . . . . . . . . . . 233
8.2.3 Distributions of a function of random variables . . . . . . . . . . . . . . . . . . . . 236
8.3 Covariance and correlation of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
8.3.1 Independence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.3.2 IID sequence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


8.4 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


8.4.1 Conditional probability density function (p.d.f.) . . . . . . . . . . . . . . . . . . . 239
8.4.2 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
8.5 Chapter’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

Chapter 9 Regression Analysis I


Simple Linear Regression 243
9.1 Introduction and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
9.2 Correlation and Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
9.2.1 Co-variance and correlation of two samples . . . . . . . . . . . . . . . . . . . . . . 245
9.2.2 Statistical prediction with field-trip observation based data . . . . . . . . . . . . . . 249
9.2.3 Statistical model building and Model fitting . . . . . . . . . . . . . . . . . . . . . . 250
9.3 Regression and Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.3.1 Statistical Model for Linear Regression Analysis . . . . . . . . . . . . . . . . . . . . 253
9.3.2 Fitting regression lines to data- least squares method . . . . . . . . . . . . . . . . . 255
9.3.3 Computation and Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9.4 Analysis of variance for regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
9.4.1 ANOVA and the correlation coefficient R . . . . . . . . . . . . . . . . . . . . . . . . 264
9.4.2 Goodness of fit with the coefficient of determination . . . . . . . . . . . . . . . . . . 266
9.5 Chapter’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

Chapter 10 Regression Analysis II:


Inference for Regression 271
10.1 Introduction and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.2 Empirical Models and Their Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
10.2.1 Essential role of realistic data in regression analysis . . . . . . . . . . . . . . . . . . 273
10.2.2 Testing for Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.3 Tests and estimations of regression coefficients . . . . . . . . . . . . . . . . . . . . . . . 276
10.3.1 T-statistic for testing significance of a model . . . . . . . . . . . . . . . . . . . . . . 281
10.3.2 Testing hypotheses on the regression slope . . . . . . . . . . . . . . . . . . . . . . . 283
10.3.3 Goodness of fit and Coefficient of determination . . . . . . . . . . . . . . . . . . . . 284
10.3.4 ANOVA F-statistic for testing significance of a model . . . . . . . . . . . . . . . . . . 286
10.3.5 Relationship of F-test and T-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
10.4 Estimation of responses using regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 288


10.4.1 Confidence interval for the mean of responses . . . . . . . . . . . . . . . . . . . . . 290

10.4.2 Prediction interval for the individual response . . . . . . . . . . . . . . . . . . . . . 291

10.5 Adequacy of the Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

10.5.1 Major Assumptions for regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

10.5.2 Residual analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

10.6 Chapter’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296

Chapter 11 Regression Analysis III


Multiple Regression Analysis 297

11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

11.1.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

11.1.2 Computing regression coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

11.2 Multiple linear regression (MLR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

11.2.1 Standard assumptions of multiple linear regression . . . . . . . . . . . . . . . . . . 303

11.2.2 Statistical multiple linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

11.2.3 Principle of Least Squares- the multivariate case . . . . . . . . . . . . . . . . . . . . 308

11.3 Regression on few predictor variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

11.3.1 Theory for the case of two predictors . . . . . . . . . . . . . . . . . . . . . . . . . . 310

11.3.2 Regression on three and four predictors . . . . . . . . . . . . . . . . . . . . . . . . 312

11.4 Aspects of Multiple Regression Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

11.4.1 Models with Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

11.4.2 Polynomial Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

11.5 Chapter’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

11.5.1 Computational problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

11.5.2 Theoretic problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

11.5.3 Model fitting on three predictor variables with interaction . . . . . . . . . . . . . . 318

11.5.4 Model fitting in oil industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

11.6 Step-wise regression for a Climate Change study . . . . . . . . . . . . . . . . . . . . . . . 321

11.7 Chapter’s Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326

Part D: Stochastic Process Based Analysis 328


Chapter 12 Stochastic Process
Characterizing systems with randomly spatial-temporal fluctuations 329

12.1 What and Why Stochastic Processes? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330


12.2 Introductory Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332


12.2.1 Classification of stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . 333
12.2.2 Key characteristics of stochastic process . . . . . . . . . . . . . . . . . . . . . . . . 338
12.3 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
12.3.1 Markov processes and Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . 341
12.3.2 Homogeneous Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
12.3.3 Chapman Kolmogorov equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
12.3.4 Compute the probability distribution at stage 𝑛 . . . . . . . . . . . . . . . . . . . . 345
12.3.5 Describing and using a Markov chain . . . . . . . . . . . . . . . . . . . . . . . . . 347
12.4 Limiting distributions and Classification of states . . . . . . . . . . . . . . . . . . . . . . 348
12.4.1 Limiting distribution at states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
12.4.2 Accessible states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
12.4.3 Recurrent states and Transient states . . . . . . . . . . . . . . . . . . . . . . . . . . 351
12.4.4 Periodic states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
12.4.5 Absorption problems: states and probabilities . . . . . . . . . . . . . . . . . . . . . 355
12.5 Theory of stochastic matrix for Markov chains . . . . . . . . . . . . . . . . . . . . . . . . 357
12.5.1 On eigenspace of a square matrix (or linear operator) . . . . . . . . . . . . . . . . 357
12.5.2 Characterization for Diagonalizable Matrices . . . . . . . . . . . . . . . . . . . . . 358
12.5.3 Properties of stochastic matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
12.5.4 Ergodicity and regularity of stochastic matrices . . . . . . . . . . . . . . . . . . . . 360
12.6 Markov Process’s Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
12.6.1 Stationary or time homogeneous Markov processes . . . . . . . . . . . . . . . . . . 364
12.6.2 The transition matrix P(𝑡) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
12.6.3 The Chapman-Kolmogorov equations . . . . . . . . . . . . . . . . . . . . . . . . . 367
12.6.4 Transition rates (forces of transition)- Transition rate matrix . . . . . . . . . . . . . 368
12.7 Kolmogorov’s differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370

Chapter 13 Statistical Simulation


Describing systems with algorithms 379
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
13.1.1 Engineering problems worked out by simulation . . . . . . . . . . . . . . . . . . . . 380
13.1.2 Basic concepts and topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
13.2 How to generate random numbers with R commands? . . . . . . . . . . . . . . . . . . . 382
13.2.1 Generating random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382


13.2.2 Computing probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 383


13.3 Generation of random numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
13.3.1 Transformation random numbers into input data . . . . . . . . . . . . . . . . . . . 386
13.3.2 Two most important usages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
13.4 How to measure/record output data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
13.5 Synchronous and asynchronous simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 394
13.6 Basic Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
13.7 Solving problems by Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . . . . . 399
13.7.1 Basic Monte Carlo Procedure- Methodology . . . . . . . . . . . . . . . . . . . . . . 399
13.7.2 Estimating Probabilities by Monte Carlo simulation . . . . . . . . . . . . . . . . . . 399
13.8 Chapter 13’s Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401

Chapter 14 Poisson Process and Variations


Systems changed by random arrivals in time 405
14.1 Introduction and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
14.2 The Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
14.2.1 Counting Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
14.2.2 Poisson process and its properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
14.2.3 Postulates of Poisson Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
14.3 Arrival-Type Processes: Few variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.3.1 Multiple Independent Poisson Processes . . . . . . . . . . . . . . . . . . . . . . . . 412
14.3.2 Thinning of a Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
14.4 Conditional Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
14.4.1 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
14.4.2 Conditional variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
14.4.3 Hypergeometric random variable and its mean . . . . . . . . . . . . . . . . . . . . 417
14.4.4 Examples for conditional expectation and variance . . . . . . . . . . . . . . . . . . 418
14.5 Compound and Nonhomogeneous Poisson process . . . . . . . . . . . . . . . . . . . . . 420
14.5.1 Expectation of a sum of random number of random variables . . . . . . . . . . . . 420
14.5.2 Compound Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
14.5.3 Nonhomogeneous Poisson process- NHPP . . . . . . . . . . . . . . . . . . . . . . . . 421
14.6 Summary and Problems on Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . 422
14.7 The Birth and Death processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
14.7.1 Description of a birth and death process . . . . . . . . . . . . . . . . . . . . . . . . 424


14.7.2 Transient Analysis of Birth and Death Processes . . . . . . . . . . . . . . . . . . . . 425


14.7.3 Local Balance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
14.8 Chapter’s Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427

Chapter 15 Branching Processes


And Renewal Processes 433
15.1 Key concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
15.2 The variance V[𝑋𝑛 ] of a branching process . . . . . . . . . . . . . . . . . . . . . . . . . 436
15.3 Ultimate Extinction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
15.4 Generations of Offsprings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
15.4.1 Probability-generating function of a compound variable . . . . . . . . . . . . . . 438
15.4.2 Probability-generating function (PGF) for Galton-Watson process . . . . . . . . . 439
15.4.3 Compute the expected value 𝜇𝑛 = E[𝑋𝑛 ] . . . . . . . . . . . . . . . . . . . . . . . 439
15.5 Introduction to Renewal Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
15.5.1 The distribution of 𝑁 (𝑡) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
15.5.2 Find the distribution of the 𝑛-th renewal 𝑇𝑛 . . . . . . . . . . . . . . . . . . . . 443
15.6 Transform Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
15.6.1 Probability-generating function (p.g.f. or PGF) of a discrete variable . . . . . . . . . 443
15.6.2 Moment generating function of a continuous variable . . . . . . . . . . . . . . . . . 444
15.6.3 Laplace transform and Fourier transform . . . . . . . . . . . . . . . . . . . . . . . 445
15.6.4 Sums of Random Variables- Convolutions . . . . . . . . . . . . . . . . . . . . . . 447
15.7 The Renewal Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
15.8 Markov Renewal Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
15.8.1 The Markov Renewal Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
15.8.2 The one-step transition probability . . . . . . . . . . . . . . . . . . . . . . . . . . 451
15.8.3 Computing the Markov renewal function 𝑀𝑖,𝑘 (𝑡) . . . . . . . . . . . . . . . . . . 451
15.9 Chapter’s problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452

Chapter 16 Statistical Data Analytics in Practice


Uncertainty-accepted decision making 455
16.1 Data Analytics 1 with Emergency Medical Services Data . . . . . . . . . . . . . . . . . . 456
16.1.1 The problem and related study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
16.1.2 The observed data and its structure . . . . . . . . . . . . . . . . . . . . . . . . . . 457
16.1.3 Method for Solving Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459


16.1.4 Coding all variables and Defining the response variable . . . . . . . . . . . . . . . . 462
16.1.5 Choosing suitable statistical models via predictors and response . . . . . . . . . . . 464
16.1.6 Discussion from computational outcome of R . . . . . . . . . . . . . . . . . . . . . 467
16.2 Data Analytics Project 2: Bridge Health Monitoring . . . . . . . . . . . . . . . . . . . . . 471
16.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
16.2.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
16.2.3 Selecting the Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
16.2.4 Selecting the Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
16.2.5 Time Series Modeling for BHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
16.2.6 Data clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
16.2.7 Damage extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
16.2.8 Sequential Probability Ratio Test for Damage Identification . . . . . . . . . . . . . . 478
16.2.9 Closing remarks and open problems . . . . . . . . . . . . . . . . . . . . . . . . . . 481

Chapter A Transform Methods 489
A.1 Probability-generating function of a discrete variable . . . . . . . . . . . . . . . . . . . . 489
A.2 Laplace transform for continuous variable . . . . . . . . . . . . . . . . . . . . . . . . . . 491

Chapter B Statistical Software and Computation


Powerful tools for dealing with large data 493
B.1 Introductory R- a popular Statistical Software . . . . . . . . . . . . . . . . . . . . . . . . 493
B.2 Basic R commands for Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 497

Chapter C Generalized Linear Model 499


C.1 Model components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
C.2 Model’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
C.3 Poisson Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
C.4 Measuring the goodness of fit and Comparing models . . . . . . . . . . . . . . . . . . . . 503
C.4.1 What is model fitting? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
C.4.2 What is Deviance? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
C.4.3 Deviance for key response distributions of exponential family . . . . . . . . . . . . 505
C.4.4 Information Criteria - Akaike’s Information Criterion . . . . . . . . . . . . . . . . . 506
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515



Preface

Probability Theory decodes the uncertainty of events in our world; it provides mathematical concepts and
methods for formulating and computing the likelihood of events and processes under the influence of many
factors. Statistical Science (Statistics) can be briefly described as the science of problem solving in the
presence of variability or uncertainty. This identifies Statistics as a scientific discipline, one which demands
the same type of rigor and adherence to basic principles as Physics or Chemistry.
The newly emerging field of Data Analytics refers to the techniques used to analyze data to enhance
productivity and business gain. In practice, Data Analytics is the synergy of many building components:
mostly tools from Statistics and Probability Theory, together with algorithmic ideas from Computing,
powerful fundamentals of Mathematics, and, last but not least, specific knowledge of application domains.
We can define Data Analytics as the science and the art of making 'good or meaningful' decisions based on
data sets, within a limited time range, a finite budget and/or limited computational resources.
Combining Probability Theory and Statistics provides a powerful approach to scientific endeavors
and technological breakthroughs. This can be seen by viewing in parallel the scientific discoveries of at
least the 19th and 20th centuries, from the theory of evolution (Darwin) to modern biology (Watson, 1953),
quantum mechanics (Dirac, Bohr, Heisenberg, Hawking during 1930-1970...) and the social sciences; and,
on the other side, the developments of Statistics, from the Monte Carlo simulation algorithm inspired by
the casino gamblers of Monte Carlo, through the foundations of probability (Kolmogorov, 1933),
ANOVA and the design of experiments (Fisher, 1920s), exploratory data analysis (John Tukey), and Bayesian
inference (Lehmann, 1970), to bootstrapping (Efron, 1980).
Now, in the 21st century, the marriage of mathematics and computing with probability and statistics
gives us machine learning and artificial intelligence (AI); these newborn disciplines play fundamentally
meaningful roles in solving hard and urgent problems, typically in traffic, communication, and the life
sciences. The major forces in artificial intelligence have been computationally intensive algorithms and,
since the 1960s, causal inference (Pearl [16]).

However, the probability- and statistics-based topics of causal analysis and causal inference
turn out to be a critical component of developing any useful and complex tool, since AI won't be
very smart if computers don't grasp cause and effect; see the comments by Pearl's followers in a recent MIT
Technology Review [15].
This book presents an essential body of knowledge for data analytics, consisting of statistical inference,
linear regression and stochastic analysis. It conveys basic knowledge, theoretical methods and practical
approaches to students and professionals, especially those working in the sciences (computer,
environmental, actuarial, economic), engineering and industrial manufacturing.
The book has been written thanks to the generous support of the Department of Mathematics, Faculty of
Science, Mahidol University, and the Center of Excellence in Mathematics, Thailand.
The author expresses his sincere thanks to Vietnamese colleagues for contributing many beautiful
figures as well as useful ideas. They bring an aesthetic sense to readers and perhaps help readers
better understand abstract ideas in this science. Particular thanks are sent to An Khuong Nguyen, Tuong
Nguyen Huynh, Phu Le Vo, Hoai Van Tran and Trung Van Le at HCMUT, VNHCM, Vietnam; to Hien
Trong Huu Phan (Melbourne, Australia), Linh V. Huynh and Vinh Tan Tran (in the U.S.).
Moreover, some theoretical topics have been included in the book because of the special attraction of
actual monitoring data sets, encountered through communicating with engineers and listening to
experimental researchers after seminars or academic exchanges. In these contexts, the interaction itself
and the ideas from communicating with colleagues are the main factors that shaped the structure and
content of this text. Most gracious thanks are due to John Borkowski in Montana and Nabendu Pal in
Louisiana, the U.S., for introducing diverse new approaches in statistical science.

Despite the fine efforts of these individuals, a few errors may remain in the text, and these
possible flaws are entirely my responsibility. I would appreciate receiving comments from readers so that
this series can become more useful to them.

The author hopes readers find it joyful to employ the methods in this book, the first text of this data
analytics series, when making important or optimal decisions based on huge data sets with
complex or unusual structure. He sincerely thanks colleagues and friends for their collaboration,
and wishes to hear their opinions and comments in the future.

Man VM. Nguyen


Bangkok
Summer 2020
Content Briefs

Part A: PROBABILITY- PROBABILITY DISTRIBUTIONS

Chapter 1: Probability Theory and Random Variables


Chapter 2: Statistical Science for Data Analytics

Part B: DATA EXPLORATION and STATISTICAL INFERENCE

Chapter 3: Exploratory Data Analysis


Chapter 4: Statistical Parameter Estimation
Chapter 5 : Statistical Hypothesis Testing

Part C: DATA ANALYSIS- STATISTICAL DESIGNS and LINEAR REGRESSION

Chapter 6 : Experimental Designs in Engineering


Chapter 7 : Experimental Designs II - Random Effects Model
Chapter 8 : Multivariate Probability Distributions
Chapter 9 : Regression Analysis I- Simple Linear Regression
Chapter 10 : Regression Analysis II - Inference for Regression
Chapter 11 : Regression Analysis III- Multiple Regression Analysis

Part D: STOCHASTIC PROCESS BASED ANALYSIS

Chapter 12: Stochastic Process


Chapter 13: Statistical Simulation
Chapter 14: Poisson Process and Variations
Chapter 15: Branching Processes and Renewal Processes
Chapter 16: Statistical Data Analytics in Practice

Appendices A, B and C.


Organization of the Book

The book has sixteen chapters and three appendices, which can be grouped roughly into four parts,
ordered according to increasing difficulty. The level of difficulty is far from uniform: the first and
second parts are intended to be accessible with less background.
Part A provides the foundation for Statistical Science and Data Analytics; it starts with the theory of
probability, random variables, and probability distributions (Chapter 1), and ends with an introduction to
Statistical Science for Data Analytics in Chapter 2. A few probabilistic tools are recalled in Appendix A,
and an introduction to the software R is given in Appendix B.

FLOWCHART OF THE BOOK

[Flowchart showing the four parts: Part A: PROBABILITY - PROBABILITY DISTRIBUTIONS;
Part B: DATA EXPLORATION and STATISTICAL INFERENCE;
Part C: DATA ANALYSIS - STATISTICAL DESIGNS and LINEAR REGRESSION;
Part D: STOCHASTIC PROCESS BASED ANALYSIS.]

Part B, on data exploration and statistical inference, opens with Exploratory Data Analysis in Chapter
3, which is meaningful for all later developments in Data Analytics, not just those in this book. Parameter
estimation and hypothesis testing are elaborated in Chapters 4 and 5. In Part C we start
the discussion of Causal Data Analysis, first with qualitative predictors (Chapters 6 and 7), then
with quantitative predictors in Chapters 8 to 11 via linear regression models.
Finally, Part D introduces Stochastic Analysis, based on the theory of stochastic processes, with Chapters
12 to 16 discussing Stochastic Processes, Statistical Simulation, Poisson Processes, Branching
Processes, and a few case studies, respectively. Appendix C briefly presents the generalized linear model, a
key approach for linear regression analysis when responses are counts.
Guidelines for Instructors and Self-learners

• Chapters 1 to 5 provide an essential body of knowledge for a one-semester course in any scientific
and engineering program at the undergraduate level.

• Chapters 6 and 7, on statistical designs, contribute another view of causal analysis (but not causal
inference) when we are interested in discrete-valued predictors. Professional readers can extend their
contents to a full course on quality control for academic or industrial sectors.¹

• Chapters 8 to 11 of Part C build up a brief background for regression, and particularly linear
regression modeling, which could be studied in one semester at the undergraduate level.²

• Last but not least, the advanced Part D presents diverse angles of stochastic analysis.
Chapters 12 to 15 could be merged into a solid lecture course for master's or doctoral students in one
semester.³ Chapter 16 is especially designed as a seminar-based course for graduate students.

Keywords:

confidence interval estimation, descriptive statistics,


experimental design, hypothesis testing, linear regression,
point estimation, probability distributions, Poisson process, queuing theory,
simulation, statistical inference, stochastic process

¹ These chapters are motivated by the papers [34], [24], [21], the thesis [36], and the book [62].
² These chapters are motivated by the papers [26], [20], and the books [38], [46].
³ These chapters are motivated by the papers [19], [22], [25], [32], [33], and the texts [42], [44] and [45].
Part A
Probability- Probability Distribution

Chapter 1: Probability Theory and Probability Distributions.

Chapter 2: Statistical Science for Data Analytics

The world is full of uncertainty, and this is certainly true in engineering, service, commerce and
business. A key aspect of solving real problems is appropriately dealing with uncertainty. This
involves explicitly recognizing that uncertainty exists and using quantitative methods to model
uncertainty. If you want to develop realistic models of industry, technology and/or business
problems, you should not simply act as if uncertainty doesn't exist.

Part A provides the foundation for the whole book. The objective of Part A specifically is to in-
troduce fundamental theoretical concepts of random variables and probability distributions. This
presentation will establish the necessary link between statistics and probability.

• Firstly, basic probability theory (e.g. concepts, rules, Bayes' theorem) and random variables
(components and laws of random variables) are formally defined in Chapter 1. Discrete and
continuous types are then treated separately, followed by a description of their properties and
use. Multiple random variables are treated extensively in more advanced data analytics
courses.

• Specific types of discrete and continuous models of importance in actuarial science, medicine,
finance, civil and environmental engineering ... are briefly motivated in Chapter 2. Finally,
Appendix B introduces the popular statistical software R.
Chapter 1

Probability Theory and Probability


Distributions
Can uncertainty be evaluated?

Figure 1.1: Would we reveal information and knowledge behind this beautiful picture?

[Source [9]]

1.1 Specific problems need Probability Theory

Probabilistic Methods help us to solve realistic problems as follows.

Education: Given that the probability that a randomly selected student in a class is female is 56%, what
is the chance that a selected student is male?

Farming: Eggs sold at a market are usually packaged in boxes of six. The number 𝑋 of broken eggs
in a box has the probability distribution given by the table:

𝑥          0     1     2     3     4     5     6
P[𝑋 = 𝑥]   0.80  0.14  0.03  0.02  0.01  0     0

Denote by 𝑌 the number of unbroken eggs in a box. What are the averages of 𝑋 and of 𝑌, respectively?
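
For a discrete distribution, the mean is the probability-weighted sum E[𝑋] = ∑ 𝑥 P[𝑋 = 𝑥], and since
𝑌 = 6 − 𝑋, linearity of expectation gives E[𝑌] = 6 − E[𝑋]. A minimal R sketch of this computation
(the object names are ours, for illustration only):

  # distribution of X, the number of broken eggs in a box of six
  x <- 0:6
  p <- c(0.80, 0.14, 0.03, 0.02, 0.01, 0, 0)

  EX <- sum(x * p)    # E[X] = 0.30 broken eggs per box
  EY <- 6 - EX        # Y = 6 - X, so E[Y] = 5.70 by linearity of expectation
  c(EX = EX, EY = EY)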

Industry: The mass, 𝑋 kg, of silicon produced in a PC manufacturing process is modeled by the
probability density function (pdf)

𝑓𝑋(𝑥) = (3/32)(4𝑥 − 𝑥²)  if 0 ≤ 𝑥 ≤ 4;
𝑓𝑋(𝑥) = 0  otherwise.

What is the mean of the mass of silicon produced?
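
Since E[𝑋] = ∫₀⁴ 𝑥 𝑓𝑋(𝑥) d𝑥, the answer can be checked numerically with R's integrate function
(a sketch; the closed-form value is 2):

  # pdf of the silicon mass X on [0, 4]
  fX <- function(x) (3 / 32) * (4 * x - x^2)

  # sanity check: the density integrates to 1 over its support
  integrate(fX, lower = 0, upper = 4)$value        # 1

  # the mean E[X] is the integral of x * fX(x) over [0, 4]
  integrate(function(x) x * fX(x), 0, 4)$value     # 2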

Financial and Actuarial Science: Suppose an insurance company 𝐵 has thousands of customers,
and each customer is charged $500 a year. Since the customers' businesses are risky, from past
experience the company 𝐵 estimates that about 15% of its customers would get into serious trouble (e.g.
fire, accident ...) and, as a result, will submit a claim in any given year.
We assume that the claim will always be $3000 for each customer.

a/ Model the amount of money that the insurance company expects to obtain from each customer.
How much can the company expect to make per customer in a given year?

b/ Now suppose that there are 𝑁 = 300 customers, and the amount of money that the 𝑖-th customer
could receive from 𝐵 is a random variable 𝑋𝑖, for 𝑖 = 1, 2, . . . , 𝑁. The {𝑋𝑖} form an I.I.D. sequence
with the same distribution as a random variable 𝑋 (see Section 1.4), and 𝑋 has the observed values

Range(𝑋) = {0, 1, 2, 3, . . .},

given by the probability distribution

𝑓𝑗 = P(𝑋 = 𝑗) = (3/4) (1/4)^𝑗,  𝑗 ∈ Range(𝑋)  (unit: 100 USD).

Are {𝑓𝑗} really a probability distribution of 𝑋?


c/ Determine the random variable 𝑆𝑁 representing the total amount of claims that the company 𝐵
has to pay to its customers, and compute E[𝑆𝑁].
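
For part b/, the distribution {𝑓𝑗} can be checked numerically in R; a sketch that truncates the infinite
range (the tail beyond 𝑗 = 100 is negligible):

  # pmf f_j = (3/4) * (1/4)^j on j = 0, 1, 2, ...
  j  <- 0:100
  fj <- (3 / 4) * (1 / 4)^j

  sum(fj)         # ~1, so {f_j} is a genuine probability distribution
  sum(j * fj)     # E[X] ~ 1/3 (in units of 100 USD)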


1.2 Probability Theory

What is probability? The probability of an event or a phenomenon is, informally, a numerical measure of how
much chance that event has of happening.

Experiments. An experiment E is a specific trial or activity (performed by scientists, or any human being)
whose outcomes possess randomness.

Simple examples are:

• Coin throwing- throw a coin, random outcomes are head (H) or tail (T)

• Temperature measurement- observe temperatures at noon in HCMC on 10 days of Summer 2007;
the random outcomes are recorded in the list

[34, 29, 28, 32, 31, 32, 30, 31, 30, 33] (in degrees Celsius).

Basic concepts

1. Sample space Ω- set of all possible outcomes.

Ex. 1: Coin throwing −→ Ω = {𝐻, 𝑇 }

2. Event- is subset 𝐴 of sample space Ω: 𝐴 ⊂ Ω.

Usually we include all events into a set, called the event set
Q := { events 𝐴 : 𝐴 ⊂ Ω}.

When an experiment E is performed and an outcome 𝑎 is observed we say that event 𝐴 has occurred
if 𝑎 ∈ 𝐴.

3. Probability function- a map P from Q to the interval [0, 1]:

P : Q −→ [0, 1]
𝐴 ∈ Q ↦−→ P(𝐴) = Prob(𝐴) = the probability or chance that 𝐴 occurs.

1.2.1 Basic operations and rules

Events are sets of outcomes. Therefore, to learn how to compute probabilities of events, we shall
discuss some set operations. Namely, we shall define unions, intersections, differences, and comple-
ments.
Let 𝐴 and 𝐵 be events; we can make new events as below.

• Union of 𝐴 and 𝐵, written 𝐴 ∪ 𝐵 = {𝑥 | 𝑥 ∈ 𝐴 ∨ 𝑥 ∈ 𝐵}: the event consisting of elements belonging
to 𝐴 or to 𝐵.


• Intersection of 𝐴 and 𝐵, written 𝐴 ∩ 𝐵 = {𝑥 | 𝑥 ∈ 𝐴 ∧ 𝑥 ∈ 𝐵}: the event consisting of elements
belonging to both 𝐴 and 𝐵. We may also write 𝐴𝐵 = 𝐴 ∩ 𝐵.

• 𝐴 and 𝐵 are disjoint if 𝐴 ∩ 𝐵 = ∅, that is, they contain no common element. Obviously, by definition,
any event is disjoint from its complementary event, i.e. 𝐴 ∩ 𝐴𝑐 = ∅.

[Venn diagrams: the union 𝐴 ∪ 𝐵 (left) and the intersection 𝐴 ∩ 𝐵 (right)]

Examples:
{1, 2, 3} ∪ {2, 4} = {1, 2, 3, 4}     {1, 2, 3} ∩ {2, 4} = {2}
{1, 2, 3} ∪ ∅ = {1, 2, 3}             {1, 2, 3} ∩ N = {1, 2, 3}

We extend the definition of the union (and intersection) of the pairs of events to the case of a finite
number of events.

⋃_{𝑖=1}^{𝑛} 𝐴𝑖 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = {𝑥 | 𝑥 ∈ 𝐴1 ∨ 𝑥 ∈ 𝐴2 ∨ ... ∨ 𝑥 ∈ 𝐴𝑛 },

⋂_{𝑖=1}^{𝑛} 𝐴𝑖 = 𝐴1 ∩ 𝐴2 ∩ ... ∩ 𝐴𝑛 = {𝑥 | 𝑥 ∈ 𝐴1 ∧ 𝑥 ∈ 𝐴2 ∧ ... ∧ 𝑥 ∈ 𝐴𝑛 }.

We also write the intersection of many events as a product

⋂_{𝑖=1}^{𝑛} 𝐴𝑖 = 𝐴1 ∩ 𝐴2 ∩ ... ∩ 𝐴𝑛 ≡ 𝐴1 𝐴2 ... 𝐴𝑛 .

Unions and intersections of events satisfy the commutative, associative and distributive laws,
expressed in turn by the following equations:

𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴; and 𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴 (commutation law)

𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶 (association law) (1.1)

𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) (distribution law)

and the De Morgan laws:

(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵 𝑐 , (𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵 𝑐 . (1.2)


Figure 1.2: Complement and sub event: (a) an event and its complement; (b) sub events.

Axioms of Probability Theory (A. Kolmogorov, 1933).

A1. Probabilities are nonnegative, 0 ≤ P(𝐴) ≤ 1, where P(𝐴) := Prob(𝐴).


A2. The sample space Ω has probability 1, that is P(Ω) = 1

A3. Probabilities of disjoint events 𝐴, 𝐵, 𝐴 ∩ 𝐵 = ∅:

P(𝐴 ∪ 𝐵) = P(𝐴 or 𝐵) = P(𝐴) + P(𝐵),

in which 𝐴, 𝐵 ⊂ Ω are events, and the sample space Ω is formed from a random experiment E. More
generally, we have

P(𝐴1 ∪ 𝐴2 ∪ · · · ∪ 𝐴𝑚 ) = P(𝐴1 ) + P(𝐴2 ) + · · · + P(𝐴𝑚 )

for 𝑚 mutually disjoint events, i.e. 𝐴𝑖 ∩ 𝐴𝑗 = ∅ when 1 ≤ 𝑖 ̸= 𝑗 ≤ 𝑚.

1.2.2 Assign probabilities to events

Possible ways to assign probabilities to events:

a) Frequency interpretation: probability is based on history (data obtained or observed). For any event
𝐴 ⊂ Ω, its probability is the relative frequency
P(𝐴) = Prob(𝐴) = ∑_{𝑠∈𝐴} P(𝑠).

Example 2: If Temperature measurements form the sample space

[34, 29, 28, 32, 31, 32, 30, 31, 30, 33] (in Celsius degree),

and define the event 𝐴 = ‘temperatures higher than 30°C’.

The sample space Ω is the above list, and if we suppose the chance to see any temperature 𝑡 ∈ Ω is
the same, then P(𝐴) = ∑_{𝑠∈𝐴} P(𝑠) = 6/10.


b) Classical interpretation: compute probability based on the assumption that all outcomes have equal
probability. This applies when the sample space Ω satisfies |Ω| = 𝑛 < ∞; then for any event 𝐴 ⊂ Ω, its
probability is the fraction found by counting methods:

P(𝐴) = Prob(𝐴) = |𝐴|/|Ω|.

Example 3: In coin throwing, Ω = {𝐻, 𝑇 }, then P(𝐻) = P({𝐻}) = 1/2 = P(𝑇 ).

c) Subjective interpretation: use a model, or hypothesize about the phenomenon possessing randomness.

Example 4: P(survival after a serious surgery) is estimated by the doctor

Probability of a single event. For finite sample spaces, we assume Ω = {𝑠1 , 𝑠2 , · · · , 𝑠𝑛 } and next
define 𝑝𝑖 = P(𝑠𝑖 ); then

𝑝𝑖 ≥ 0, and ∑_{𝑖=1}^{𝑛} 𝑝𝑖 = 1.

If all outcomes have equal probabilities, then

P(𝐴) = Prob(𝐴) = 𝑛𝐴 /𝑛, where 𝑛𝐴 = |𝐴|.
Example 1.1. On a single toss of a die, we get only one of six possible outcomes 1,2,3,4,5 or 6; then
the sample space
Ω = {1, 2, 3, 4, 5, 6}, and 𝑝𝑖 = P(𝑖) = 1/6, for all 𝑖 = 1..6

Multiple events– Union or addition rule. What are mutually exclusive and not mutually exclusive
events?
Two events 𝐴 and 𝐵 are mutually exclusive if 𝐴 ∩ 𝐵 = ∅, i.e. the occurrence of 𝐴 precludes the
occurrence of 𝐵. Then
P(𝐴 and 𝐵) = P(𝐴 ∩ 𝐵) = 0

* For mutually exclusive events:

P(𝐴 ∪ 𝐵) = P(𝐴 or 𝐵) = P(𝐴) + P(𝐵).

* Non-mutually exclusive events satisfy

P(𝐴 and 𝐵) = P(𝐴 ∩ 𝐵) ̸= 0,

and then

P(𝐴 ∪ 𝐵) = P(𝐴 or 𝐵) = P(𝐴) + P(𝐵) − P(𝐴 and 𝐵).

Example 1.2.


If the die is fair, when tossing the die, the probability of seeing number 𝑖 is 𝑝𝑖 = P(𝑖) = 1/6. The
probability of the event 𝑍 = ‘getting 2 or 3 or 4’ is

P(𝑍) = P(2 or 3 or 4) = ∑_{𝑠∈𝑍} P(𝑠) = 3/6. ∎

Fact 1.1. [Total probability rule.]

• Observe that
P[𝐵] = P[𝐵𝐴] + P[𝐵𝐴𝑐 ],

here 𝐴𝑐 = 𝐴′ is the complement of 𝐴.

• If, more generally, events 𝐸1 , 𝐸2 , · · · , 𝐸𝑛 (𝑛 ≥ 1) form a partition of Ω, i.e. they are pairwise
disjoint and

⋃_{𝑖} 𝐸𝑖 = 𝐸1 ∪ 𝐸2 ∪ · · · ∪ 𝐸𝑛 = Ω,

then for any event 𝐵, we have

P[𝐵] = ∑_{𝑖}^{𝑛} P[𝐵 ∩ 𝐸𝑖 ]. (1.3)

Figure 1.3: General union rule

Indeed, by the distribution law

𝐵 = 𝐵 ∩ Ω = 𝐵 ∩ (⋃_{𝑖} 𝐸𝑖 ) = ⋃_{𝑖} 𝐵 𝐸𝑖 .

Because 𝐸1 , 𝐸2 , · · · , 𝐸𝑛 are pairwise disjoint events, we get

𝐵 𝐸𝑖 ∩ 𝐵 𝐸𝑗 = 𝐵 ∩ 𝐸𝑖 ∩ 𝐸𝑗 = ∅


for all 𝑖 ̸= 𝑗. By Axiom A3,

P[𝐵] = P[⋃_{𝑖} 𝐵 𝐸𝑖 ] = ∑_{𝑖}^{𝑛} P[𝐵 𝐸𝑖 ] = ∑_{𝑖}^{𝑛} P[𝐵 ∩ 𝐸𝑖 ]. ∎
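The rule is easy to apply numerically; here is a minimal R sketch with hypothetical numbers (the three-supplier setting and all rates below are invented for illustration):

# Law of total probability: P(B) = sum_i P(B | E_i) * P(E_i)
pE   <- c(0.5, 0.3, 0.2)    # P(E_i): market shares of three suppliers (a partition)
pB_E <- c(0.01, 0.02, 0.05) # P(B | E_i): defect rate of each supplier
sum(pB_E * pE)              # P(B) = 0.021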

1.2.3 Independent events

Event 𝐴 and event 𝐵 are dependent if the appearance of one event is related (in a certain way) to the
occurrence of the other event.

For instance, if 𝐵 is the event that a girl appears in a photo and 𝐴 is the event that a boy appears,
then 𝐴 may be dependent on 𝐵.
So what happens when 𝐴 is not related to the occurrence of 𝐵?

 CONCEPT 1.

Events 𝐴 and 𝐵 are independent if the occurrence of 𝐴 is not connected in any way to the occurrence
of 𝐵. Then P(𝐴𝐵) = P(𝐴) · P(𝐵).
• The joint probability of two independent events 𝐴 and 𝐵 is

P(𝐴𝐵) = P(𝐴 ∩ 𝐵) = P(𝐴) · P(𝐵). (1.4)

• In general, events 𝐸1 , 𝐸2 , . . . , 𝐸𝑛 are independent if they occur independently of each other, i.e.,
occurrence of one event does not affect the probabilities of others.

1.2.4 Conditional probability

Given that event 𝐵 “happened”, what is the probability that event 𝐴 also happened?
Brainstorming thought: narrow down the sample space to the part where 𝐵 has occurred, and compare
𝐴 ∩ 𝐵 with 𝐵.
The formula: Conditional probability of event 𝐴 given event 𝐵

P(𝐴 | 𝐵) = P(𝐴𝐵)/P(𝐵) = P(𝐴 ∩ 𝐵)/P(𝐵). (1.5)


 CONCEPT 2.

The joint probability of two events 𝐴 and 𝐵 in general is

P(𝐴 and 𝐵) = P(𝐴 ∩ 𝐵) = P(𝐵) · P(𝐴 | 𝐵).

If 𝐴 and 𝐵 are independent we get their joint probability as presented in Equation 1.4: P(𝐴 ∩ 𝐵) =
P(𝐴) · P(𝐵).

Example 1.3. Suppose that two balls are to be withdrawn, without replacement, from an urn that
contains 9 blue and 7 yellow balls.
If each ball drawn is equally likely to be any of the balls in the urn at that time, what is the probability
that both balls are blue?

GUIDANCE for solving. Two types of random sampling are:

1. Sampling with replacement: you take a unit out of the population of interest, and return it to the
population before you take the next unit; repeat until you obtain a sample of the given size 𝑛 ≥ 1.

2. Sampling without replacement: you take a unit out of the population of interest, and keep taking
further units from the remaining population (the units available for picking decline each time) until
you obtain a sample of the given size 𝑛 ≥ 1.

Here the drawing is without replacement, so by CONCEPT 2, P(both blue) = (9/16) · (8/15) = 3/10.
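A minimal R simulation of Example 1.3, confirming the exact answer above (the urn encoding is our own; sample() draws without replacement by default):

# Draw 2 balls without replacement from 9 blue and 7 yellow, many times
set.seed(1)
urn <- c(rep("blue", 9), rep("yellow", 7))
mean(replicate(1e5, all(sample(urn, 2) == "blue")))  # ~ 0.30 = (9/16)*(8/15)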

1.2.5 Bayes’ Theorem and usages

Theorem 1.1 (Bayes’ Theorem). For any pair of events 𝐴, 𝐵, since

P(𝐴 ∩ 𝐵) = P(𝐵 ∩ 𝐴) = P(𝐴) · P(𝐵 | 𝐴),

we always have

P(𝐴 | 𝐵) = P(𝐴) · P(𝐵 | 𝐴) /P(𝐵). (1.6)
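A minimal R sketch of Equation 1.6 with hypothetical diagnostic-test numbers (the prevalence and test rates are invented for illustration):

# Bayes' rule: A = "has condition", B = "test positive"
pA    <- 0.01                         # P(A)
pB_A  <- 0.95                         # P(B | A)
pB_Ac <- 0.05                         # P(B | A^c)
pB    <- pB_A * pA + pB_Ac * (1 - pA) # P(B), by the total probability rule
pA * pB_A / pB                        # P(A | B) ~ 0.161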

What are dependent events? Events 𝐴 and 𝐵 are dependent if the occurrence of one is connected
in some way to the occurrence of the other. Then the joint probability of 𝐴 and 𝐵 is

P(𝐴𝐵) = P(𝐵) · P(𝐴 | 𝐵) = P(𝐴) · P(𝐵 | 𝐴) (1.7)

since (viewing 𝐴 as a given condition)

P(𝐴𝐵) = P(𝐵𝐴) = P(𝐴) · P(𝐵 | 𝐴)

How about the independent case? We know from the above section that events 𝐴 and 𝐵 are
independent if the occurrence of 𝐴 is not connected in any way to the occurrence of 𝐵. Then

P(𝐴 | 𝐵) = P(𝐴) and P(𝐵 | 𝐴) = P(𝐵). (1.8)


Plug this result back to Equation 1.7 we obtain the joint probability of independent events 𝐴 and 𝐵
being given in Equation 1.4.

Independence for 𝑚 > 2 events 𝐴1 , 𝐴2 , ..., 𝐴𝑚 : The mutual independence for 𝑚 > 2 events
means:

• they are pairwise independent: P[𝐴𝑖 𝐴𝑗 ] = P[𝐴𝑖 ] P[𝐴𝑗 ], for all pairs of indices 1 ≤ 𝑖 ̸= 𝑗 ≤ 𝑚;

• they are 3-wise independent: P[𝐴𝑖 𝐴𝑗 𝐴𝑙 ] = P[𝐴𝑖 ] P[𝐴𝑗 ] P[𝐴𝑙 ] for all triples of distinct indices
1 ≤ 𝑖, 𝑗, 𝑙 ≤ 𝑚; and so on, up to 𝑚-wise independence.

When 𝐴1 , 𝐴2 , ..., 𝐴𝑚 are mutually independent then we get


P[⋂_{𝑖=1}^{𝑚} 𝐴𝑖 ] = ∏_{𝑖=1}^{𝑚} P[𝐴𝑖 ]. (1.9)

1.3 Random Variables and Classification

A variable, such as the strength of concrete or any other material or physical quantity, whose value is
uncertain, unpredictable or nondeterministic, is called a random variable or a variate if its distribution
is known. In practice, a random variable may assume some value whose magnitude depends on
a particular occurrence or outcome (usually noted by an observation or measurement) of an experiment
in which tests are made and records maintained.
A random variable is formally viewed as a function defined on the sample space of an experiment
such that there is a numerical value of the random variable corresponding to each possible outcome;
that is there is a probability associated with each occurrence in the sample space.

1.3.1 Key definitions

A random variable 𝑋 is a map from a sample space Ω to the reals R; that is, for 𝑤 ∈ Ω,
𝑋(𝑤) ∈ R.

• The domain of a random variable is the sample space Ω.

• The range of a random variable is the set of all observations

𝑆𝑋 = Range(𝑋) = {𝑋(𝑤)}.

Its range 𝑆𝑋 can be the set of all real numbers R, or only the positive numbers R+ = (0, +∞), or
the integers Z, or the interval (0, 1), etc., depending on what possible values the random variable can
potentially take.
For example, if 𝑋 measures the height of male students in Europe, here

Ω = {𝑤 : 𝑤 is a male student in a European country}


then 𝑋(𝑤) (in meter) is the height of student 𝑤, and possibly 𝑆𝑋 = Range(𝑋) = (1, 2.2).

Definition 1.1.

For any 𝑏 ∈ R, the preimage

𝐸 := 𝑋 −1 (𝑏) = {𝑤 ∈ Ω : 𝑋(𝑤) = 𝑏} ⊂ Ω

is an event, we define
P[𝑋 = 𝑏] = Prob{𝑋 = 𝑏} := Prob(𝐸).

For a finite sample space Ω, obviously

P[𝑋 = 𝑏] := Prob(𝐸) = |𝐸|/|Ω|.

P[𝑋 = 𝑏] indicates how much chance the observation 𝑏 has to happen, and is called the probability
density (mass) function (p.d.f. or p.m.f.) of 𝑋 at the observation (or observed value) 𝑏.
In Picture 1.4,
event 𝐸 := 𝑋 −1 (𝑏) is the violet square on the left, and
event 𝐴 := 𝑋 −1 (ARIN) is the green oval.
Let us consider two simple examples below.

1. Our sample space Ω is the set of all King Mongkuk University’s students, define a map ‘most liked
singer’

𝑋 : Ω −→ { famous singers in Thailand },

we ask each student 𝑤 ∈ Ω to know his (her) most liked singer 𝑏 in Thailand, 𝑋(𝑤) = 𝑏, then the set
Range(𝑋) of the map 𝑋 is the set of all famous singers in Thailand.

If KM University has 30000 students, and 1500 students like, say, the singer ‘ARIN’, then

𝐴 := 𝑋 −1 (‘ARIN’) = {𝑤 ∈ Ω : 𝑋(𝑤) = ‘ARIN’} ⊂ Ω

has cardinality (i.e. the number of elements) 1500, and

P[𝑋 = ‘ARIN’] := Prob(𝐴) = |𝐴|/|Ω| = 1500/30000 = 1/20.


Figure 1.4: Visualization of a random variable 𝑋 as a map 𝑋 : Ω → R from the sample space Ω to the
observed value set Range(𝑋).

In Picture 1.4, event 𝐴 := 𝑋 −1 (ARIN) (the green oval) consists of precisely 1500 students 𝑤.

2. Now let our sample space Ω be the set of all Honda cars produced in Japan; inspect each car 𝑤 ∈ Ω
to record its fault (defect) 𝑑, 𝑋(𝑤) = 𝑑. Then the set Range(𝑋) of the map ‘car defect’ 𝑋 is the set
of all popular car defects. If Honda Japan produced 1 million cars, of which 2000 cars have the fault
𝑑 = ‘defective piston’, then

𝐴 := 𝑋⁻¹(𝑑) = {𝑤 ∈ Ω : 𝑋(𝑤) = ‘defective piston’} ⊂ Ω

has cardinality 2000, and P[𝑋 = 𝑑] := Prob(𝐴) = |𝐴|/|Ω| = 2000/10⁶ = 0.002.

For any random variable 𝑋, we write 𝑆𝑋 = Range(𝑋) with the same meaning for the set of observed
values (observations) of 𝑋.

 CONCEPT 3.

A random variable 𝑋(.) is discrete when it has a discrete range, meaning that the set of values
consists of no more than a countable number of elements.
In finite element case, we usually write set of values

𝑆𝑋 = Range(𝑋) = {𝑥1 , 𝑥2 , 𝑥3 , · · · , 𝑥𝑚−1 , 𝑥𝑚 }, 𝑚 ∈ N.

In countably infinite element case, we write set of values

𝑆𝑋 = Range(𝑋) = {𝑥1 , 𝑥2 , 𝑥3 , · · · , 𝑥𝑚−1 , 𝑥𝑚 , . . .}


For example,

• the ‘most liked singer’ variable 𝑋 above is discrete, and its range 𝑆𝑋 is finite. If Thailand has
15 famous singers with names 𝑥𝑖 then

𝑆𝑋 = Range(𝑋) = {𝑥1 , 𝑥2 , 𝑥3 , · · · , 𝑥14 , 𝑥15 }.

• The number of banks in the US is a discrete variable 𝑋, then 𝑆𝑋 of all observed values is finite
(bank 1 = Citibank, ...)

• The number of defect types of Honda cars is a discrete variable 𝑋, 𝑆𝑋 is finite. In the above car
fault example, suppose that Honda’s cars has 5 specific defects then

𝑆𝑋 = Range(𝑋) = { defected piston, software, brake, mirror, tyre}.

• The number of traffic accidents in Asia each year is a discrete variable, but 𝑆𝑋 is a countably infinite set.

1.3.2 Types of random variables

Discrete random variables

So far, we are dealing with discrete random variables. These are variables whose range is finite or
countable (i.e. countably infinite). In particular, it means that their values can be listed, or arranged in
a sequence, as in Concept 3. Examples include
the number of jobs submitted to a printer,
the number of corruption-free departments in a government,
the number of failed components of a software, and so on.
Discrete variables don’t have to be integers.

Continuous random variables

Continuous random variables assume a whole interval of values. This could be a bounded interval
(𝑎, 𝑏), or an unbounded interval such as
(𝑎, +∞), (−∞, 𝑏), or (−∞, +∞).
Sometimes, it may be a union of several such intervals. Intervals are uncountable, therefore, all
values of a random variable cannot be listed in this case.
Examples of continuous variables include

• various times (software installation time, code execution time, connection time, waiting time, lifetime),
also

• physical variables like weight, height, voltage, temperature, distance, the number of miles per gallon,
etc.

We discuss both discrete and continuous random variables in detail in Chapter ??.


1.4 Important Discrete Probability Distributions

1.4.1 Bernoulli distribution 𝐵 ∼ B(𝑝)

Bernoulli variable describes a random variable 𝐵 that can take only two possible values, i.e.
𝑆𝐵 = {0, 1}.
Its probability density function is given by

𝑝(1) = P(𝐵 = 1) = 𝑝,

𝑝(0) = P(𝐵 = 0) = 1 − 𝑝 for some 𝑝 ∈ [0, 1].

Using the definitions of expectation and variance, it is easy to check that

E(𝐵) = 𝑝, V(𝐵) = 𝑝(1 − 𝑝).

Definition 1.2 (I.I.D. sequence of random variables).

• Two random variables 𝐴, 𝐵 with ranges 𝑆𝐴 , 𝑆𝐵 are said to be independent if the joint probability
density function satisfies:

𝑓 (𝑥, 𝑦) = P[𝐴 = 𝑥, 𝐵 = 𝑦] = P[𝐴 = 𝑥 ∩ 𝐵 = 𝑦]

= P[𝐴 = 𝑥] P[𝐵 = 𝑦], for all observations 𝑥 ∈ 𝑆𝐴 , 𝑦 ∈ 𝑆𝐵 .

• A sequence of 𝑛 random variables 𝑋𝑖 are mutually independent if they are independent in pairs, in
triple of random variables, and in general 𝑘-wise independent in the sense of Equation 1.9, that is

P[𝑋1 = 𝑎1 , 𝑋2 = 𝑎2 , . . . , 𝑋𝑘 = 𝑎𝑘 · · · ]

= P[𝑋1 = 𝑎1 ] P[𝑋2 = 𝑎2 ] P[𝑋3 = 𝑎3 ] ..., 𝑎𝑖 ∈ Range(𝑋𝑖 ).

• A sequence of 𝑛 random variables 𝑋𝑖 are identically distributed if they follow the same distribution of
a common random variable 𝑋. More precisely they have the same ranges Range(𝑋) and the same
p.d.f. 𝑓𝑋 (). We write 𝑋𝑖 ∼ 𝑋.

Independent, identical random variables

A sequence of many random variables 𝑋𝑖 are independently and identically distributed (write
I.I.D. or i.i.d.) if they are both mutually independent and identically distributed.


1.4.2 Binomial distribution 𝑋 ∼ Bin(𝑛, 𝑝)

Binomial variable describes a random variable 𝑋 that is the number of successes in 𝑛 independent
and identical Bernoulli trials with probability of success 𝑝. In other words, 𝑋 is a sum of
𝑛 i.i.d. Bernoulli random variables:

𝑋 = 𝐵1 + 𝐵2 + . . . + 𝐵𝑛 , where each 𝐵𝑖 ∼ 𝐵 ∼ B(𝑝)

Therefore, 𝑋 takes values in the range

Range(𝑋) = {0, 1, 2, ..., 𝑛}

and its distribution is given by the probability density function


𝑝(𝑘) = P(𝑋 = 𝑘) = \binom{𝑛}{𝑘} 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘} . (1.10)
It is easy to check that the mean and variance are

E(𝑋) = 𝑛𝑝, V(𝑋) = 𝑛𝑝(1 − 𝑝).
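Both formulas can be checked against the built-in binomial p.d.f. in R; a minimal sketch with the values 𝑛 = 50, 𝑝 = 0.25 of Figure 1.5:

# Mean and variance of Bin(n, p), computed from the p.d.f. of Eq. (1.10)
n <- 50; p <- 0.25
k <- 0:n
pk <- dbinom(k, size = n, prob = p)
sum(k * pk)               # n*p = 12.5
sum((k - n*p)^2 * pk)     # n*p*(1-p) = 9.375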

PRACTICE 1.

Two fair dice are tossed. If the total is 7, we win $100; if the total is 2 or 12, we lose $100; otherwise
we lose $10. What is the expected value of the game?

Reminder: if 𝑉 : Ω → R is an assignment of values to the points in the sample space Ω, then

E(𝑉 ) = ∑_{𝑤∈Ω} P(𝑤) · 𝑉 (𝑤).
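For PRACTICE 1, the reminder formula can be evaluated by enumerating the 36 equally likely outcomes of two dice; a minimal R sketch (the payoff encoding is our own):

# Expected value of the dice game
totals <- rowSums(expand.grid(die1 = 1:6, die2 = 1:6))
win <- ifelse(totals == 7, 100, ifelse(totals %in% c(2, 12), -100, -10))
mean(win)                 # = 120/36, about $3.33 per game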

Example 1.4 (Fair bet).

A bet whose expected winnings equals to 0 is called a fair bet. Let the random variable 𝑋 denote
the amount that we win when we make a certain bet.
Find the expectation (or mean) E(𝑋) if there is
a 60% chance that we lose 1 USD,
a 20% chance that we win 1 USD, and
a 20% chance that we win 2 USD.
Is this a fair bet?

HINT: We can solve these two examples by using Bernoulli random variable, with some modifications.
Proof of Equation A.5. Let 𝐻 and 𝑇 be two outcomes of a Bernoulli trial with value space Range(B(𝑝)) =
{𝐻, 𝑇 }, and
P(𝐻) = 𝑝; P(𝑇 ) = 1 − 𝑝 =: 𝑞.


Assume that we perform 𝑛 trials of the experiment and each trial is independent of the others. For
example,
the event “𝐻 on the first trial” is independent from
the event “𝐻 on the second trial.” So both events have probability 𝑝.
The value space of sequences of 𝑛 trials is:

{𝑥1 𝑥2 · · · 𝑥𝑛−1 𝑥𝑛 | 𝑥𝑖 ∈ {𝐻, 𝑇 }}.

Since the trials are independent, we assign probabilities to these sequences by

P(𝑥1 𝑥2 · · · 𝑥𝑛−1 𝑥𝑛 ) = P(𝑥1 ) P(𝑥2 ) · · · P(𝑥𝑛−1 ) P(𝑥𝑛 ).

Our question now is: what is the probability of exactly 𝑘 successes in 𝑛 trials of a binomial experiment
where
P(success = 𝐻) = 𝑝, P(failure = 𝑇 ) = 1 − 𝑝?

Since 𝑋 = Bin(𝑛, 𝑝) is the sum of 𝑛 independent Bernoulli r.v.,

𝑋 = 𝐵1 + 𝐵2 + . . . + 𝐵𝑛 , each 𝐵𝑖 ∼ B(𝑝)

then 𝑋 takes values in {0, 1, ..., 𝑛}. Therefore exactly 𝑘 successes in 𝑛 trials means 𝑋 = 𝑘 ∈ {0, 1, ..., 𝑛}. By
combinatorial reasoning, the binomial distribution Bin(𝑛, 𝑝) is given by the probability density function

𝑝(𝑘) = P(𝑋 = 𝑘) = \binom{𝑛}{𝑘} 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘} .

Figure 1.5 shows pdf of the binomial Bin(𝑛, 𝑝) when 𝑛 = 50, and with 𝑝 = 0.25, 0.50, 0.75.

1.4.3 Poisson distribution

Practical Problem 1.

Suppose we observe a telephone station; the statistics show that on average in 1 minute there are 2
phone users. We need to calculate the probability that in 1 minute there are 4 phone users.

• Let 𝑋 be the number of phone users in one minute;

• then 𝑋 takes values among the countably infinite numbers 0, 1, 2, 3, ...; hence it is a discrete random
variable with range Range(𝑋) = N.

The Poisson variable is denoted by 𝑋 ∼ Pois(𝜆); it is also known as the variable of rare events, playing an
important role in discrete event modeling in actuarial science and quality control.

The value 𝜆 > 0 denotes the rate of the rare events (such as the average number of customers
arriving at an insurance firm) in each time period or per spatial sample.


Figure 1.5: The probability density function of Bin(50, 𝑝) with 𝑝 = .25, .50, .75

• The observed values 𝑆𝑋 = {0, 1, 2, 3, 4, . . . , 𝑚, 𝑚 + 1, . . .}.

• Probability density function of 𝑋 is

𝑝(𝑖; 𝜆) = P(𝑋 = 𝑖) = 𝜆^𝑖 𝑒^{−𝜆} /𝑖!, 𝑖 = 0, 1, 2, ... (1.11)

• Probability cumulative function of 𝑋 is


𝑃 (𝑥; 𝜆) = P(𝑋 ≤ 𝑥) = ∑_{𝑖=0}^{𝑥} P(𝑋 = 𝑖) = ∑_{𝑖=0}^{𝑥} 𝑝(𝑖; 𝜆), 𝑥 = 0, 1, 2, . . . (1.12)

• Expectation and variance of Poisson variable Pois(𝜆)

E(𝑋) = 𝜇 = 𝜆, V(𝑋) = E[(𝑋 − E(𝑋))2 ] = 𝜆. (1.13)

If the variance of a data set is much greater than its mean, the Poisson distribution would not be a
good model for the random variable’s distribution.

Lemma 1.2 (A fundamental result).

For a Poisson arrival process, the number of arrivals 𝑋(𝑡) occurring in a time interval of length 𝑡 is
a Poisson-distributed random variable with mean 𝜆𝑡. It means that the probability density function
of 𝑋(𝑡) is given by:

P[𝑋(𝑡) = 𝑘] = 𝑒^{−𝜆𝑡} (𝜆𝑡)^𝑘 /𝑘!, 𝑘 = 0, 1, 2, ...


Figure 1.6: Histogram of Poisson (p.d.f bar plot)- Case 𝜆 = 4.

Example 1.5.

If customers come to an SCB branch in Bangkok according to a Poisson distribution with constant rate
𝜆 = 4 (see Figure 1.6), then

• the mean is 𝜇 = 4 customers per hour, and the variance is also V(𝑋) = 𝜎² = 4,

• so the standard deviation of the arriving customers is 𝜎 = √4 = 2.
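A minimal R sketch for Example 1.5, checking Equation 1.13 numerically (truncating the range at 20 ignores only a negligible tail):

# Mean and variance of Pois(lambda = 4) from the p.d.f. of Eq. (1.11)
lambda <- 4
k <- 0:20
sum(k * dpois(k, lambda))                # ~ 4 (mean)
sum((k - lambda)^2 * dpois(k, lambda))   # ~ 4 (variance)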

1.5 Continuous Probability Distributions

What is a continuous probability distribution?

• We informally say a random variable 𝑋 is a continuous random variable if and only if the range set
Range(𝑋) is a continuous set (such as the reals R or its interval subsets)

• A continuous probability distribution refers to the range Range(𝑋) of all possible values of 𝑋, along
with the associated probabilities P(𝑋 ≤ 𝑡).

• Note that Range(𝑋) and 𝑆𝑋 both mean the value space of 𝑋.

Here is the formal definition.

Definition 1.3.


A random variable 𝑋 is continuous if and only if its range 𝑆𝑋 is a continuous set (uncountable
infinite set), such as the reals R or interval subsets of R ).

• The probability cumulative distribution function (cdf) of 𝑋 is the function defined by


𝐹𝑋 (𝑡) = P(𝑋 ≤ 𝑡) = ∫_{−∞}^{𝑡} 𝑓 (𝑥) 𝑑𝑥, −∞ < 𝑡 < ∞. (1.14)

• 𝑓 (𝑥) is said to be the probability density function - pdf of 𝑋.

CONVENTION: If 𝑋 is clear from the context, write 𝐹 (𝑡) = 𝐹𝑋 (𝑡). Also we can change the variable 𝑡
to 𝑥 and write 𝐹 (𝑥), 𝑓 (𝑥). In a few books, 𝑓 (𝑥) is simply called a probability function, and 𝐹 (𝑥) the
distribution function.

Relationship between the cdf 𝐹 and the pdf 𝑓

• The probability cumulative function (cdf)

𝐹 (𝑥) = P(𝑋 ≤ 𝑥) = ∫_{−∞}^{𝑥} 𝑓 (𝑢) 𝑑𝑢,

is the probability that 𝑋 is less than or equal to 𝑥, and equals the area under the curve 𝑓 (𝑥) between −∞
and 𝑥. 𝐹 (𝑥) has derivative

𝑑𝐹 (𝑥)/𝑑𝑥 =: 𝑓 (𝑥). (1.15)

• This probability density function 𝑓 (𝑥) is defined almost everywhere and is piecewise continuous;
it is given by a smooth curve 𝐶. The pdf 𝑓 (𝑥) then satisfies 2 conditions: i) 𝑓 (𝑥) ≥ 0, ∀𝑥, and ii) the
whole area below the pdf curve is

∫_{−∞}^{∞} 𝑓 (𝑥) 𝑑𝑥 = 1 = 𝐹 (+∞).

But the probability of any specific value is zero: P(𝑋 = 𝑥) = 0.


Geometric view of the relationship between the pdf 𝑓 (𝑥) and the cdf 𝐹 (𝑥).

• The left white area in Figure 1.7.a) is 𝐹 (𝑎) = ∫_{−∞}^{𝑎} 𝑓 (𝑥) 𝑑𝑥.

• The probability that a continuous random variable 𝑋 takes any value within a given interval, say
[𝑎, 𝑏], i.e. 𝑎 ≤ 𝑋 ≤ 𝑏, is measured by the blue area under the curve 𝐶 within that interval.
In other words, the probability of the event “𝑎 ≤ 𝑋 ≤ 𝑏” is

P(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_{𝑎}^{𝑏} 𝑓 (𝑥) 𝑑𝑥 = P(𝑎 < 𝑋 < 𝑏).

The total area under the curve 𝑓 (𝑥) is the whole area including:


a) Blue area is the chance that a <= X <= b

b)Meaning of expectation as a center of gravity

Figure 1.7: Geometric view of cumulative probability and expectation

• the left white area 𝐹 (𝑎) = P(𝑋 ≤ 𝑎) = ∫_{−∞}^{𝑎} 𝑓 (𝑢) 𝑑𝑢,

• the blue area in the middle P(𝑎 ≤ 𝑋 ≤ 𝑏), (see Figure 1.7.a)

• and the right white area P(𝑏 ≤ 𝑋) = ∫_{𝑏}^{+∞} 𝑓 (𝑥) 𝑑𝑥 = P(𝑋 ≥ 𝑏).

This total area (probability) is ∫_{−∞}^{∞} 𝑓 (𝑥) 𝑑𝑥 = 1 = 𝐹 (+∞).

MEAN and VARIANCE: The mean 𝜇 of a continuous variable 𝑋 with pdf 𝑓 (𝑥) is

𝜇 = E(𝑋) = ∫_{𝑥∈Range(𝑋)} 𝑥 𝑓 (𝑥) 𝑑𝑥, (1.16)

see Figure 1.7.b), and the variance is

V(𝑋) = 𝜎² = ∫_{𝑥∈Range(𝑋)} [𝑥 − 𝜇]² 𝑓 (𝑥) 𝑑𝑥. (1.17)


Example 1.6. [Industry- Manufacturing.]

A new type of biological catalyst used in making bread has an efficient working time 𝑋 (in hours).
The random variable 𝑋 is modeled by the pdf

𝑓 (𝑥) = 𝑘 𝑥^{−2} if 400 ≤ 𝑥 ≤ 900; 𝑓 (𝑥) = 0 otherwise.

a) Find 𝑘 (DIY). b) Compute the approximate mean working time E[𝑋] of this biological catalyst.
Answer b): about 584 hours. ∎
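A minimal R sketch for Example 1.6 using numerical integration (the anonymous integrand functions are our own):

# a) k makes the total probability 1:  k * Int_400^900 x^(-2) dx = 1
k <- 1 / integrate(function(x) x^-2, lower = 400, upper = 900)$value  # k = 720
# b) mean working time E[X] = Int_400^900 x * k * x^(-2) dx
integrate(function(x) k / x, lower = 400, upper = 900)$value          # ~ 583.9 hours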
Popular continuous probability distributions include

• Continuous Uniform distribution: used mostly in engineering.

• Normal or Gaussian distribution: found to be useful in many areas like petroleum engineering,
environmental and medical sciences.

• Exponential distribution: useful in numerous other areas

• Beta distribution: used mostly in finance

1.5.1 Continuous Uniform Distribution

A continuous random variable 𝑋 with probability density function

𝑓 (𝑥) = 1/(𝑏 − 𝑎), 𝑎 ≤ 𝑥 ≤ 𝑏 (1.18)

is a continuous uniform random variable. We write 𝑋 ∼ Uniform(𝑎, 𝑏).


The probability density function of a continuous uniform random variable is shown in Figure 1.8.
In summary,
* the probability density function of Uniform(𝑎, 𝑏) is

𝑓𝑈 (𝑥) = 𝑓 (𝑥; 𝑎, 𝑏) = 1/(𝑏 − 𝑎) if 𝑎 ≤ 𝑥 ≤ 𝑏, and 0 otherwise; (1.19)

* and the probability cumulative function is

𝐹𝑈 (𝑥) = 𝐹 (𝑥; 𝑎, 𝑏) = 0 if 𝑥 < 𝑎; (𝑥 − 𝑎)/(𝑏 − 𝑎) if 𝑎 ≤ 𝑥 < 𝑏; and 1 if 𝑏 ≤ 𝑥. (1.20)

Practice.


Continuous uniform probability density function

Figure 1.8: The probability density function 𝑓 of Uniform(𝑎, 𝑏)

Find the mean E[𝑋] and variance V[𝑋] of the continuous uniform random variable 𝑋, using Formula
1.16, 1.17.
SOLUTION- ANSWER:

𝜇 = (1/(𝑏 − 𝑎)) ∫_{𝑎}^{𝑏} 𝑥 𝑑𝑥 = (1/(𝑏 − 𝑎)) [𝑥²/2]_{𝑎}^{𝑏} = (𝑏² − 𝑎²)/(2(𝑏 − 𝑎)) = (𝑎 + 𝑏)/2.

As a result, with 𝜇2 = E[𝑋²] and 𝜇1 = E[𝑋],

𝜎² = 𝜇2 − 𝜇1² = (1/3)(𝑎² + 𝑎𝑏 + 𝑏²) − (1/4)(𝑎² + 2𝑎𝑏 + 𝑏²) = (1/12)(𝑏 − 𝑎)².
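A quick simulation check of both formulas in R (a minimal sketch; the endpoints 𝑎 = 2, 𝑏 = 8 are arbitrary):

# Uniform(a, b): mean (a+b)/2 and variance (b-a)^2/12
set.seed(1)
a <- 2; b <- 8
x <- runif(1e6, min = a, max = b)
mean(x)                   # ~ (a+b)/2 = 5
var(x)                    # ~ (b-a)^2/12 = 3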

1.5.2 Normal (or Gauss) distribution

A normal random variable 𝑋 is described by the probability density function

𝑓 (𝑥) = (1/(𝜎√(2𝜋))) 𝑒^{−(1/2)[(𝑥−𝜇)/𝜎]²} , (1.21)

in which −∞ < 𝑥 < ∞, 𝜇 ∈ R, 𝜎² > 0. We write 𝑋 ∼ N(𝜇, 𝜎²), where

𝑓 (𝑥) = the height of the normal curve; 𝑒 ≈ 2.71 and 𝜋 ≈ 3.14 are constants,
𝜇 is the mean, and 𝜎² is the variance of the normal distribution.
The probability density function’s curve of N(𝜇, 𝜎²) is shown in the following figure.

Normal distribution- Properties.


Figure 1.9: The probability density function of N(𝜇, 𝜎) with 𝜇 = 10 and 𝜎 = 1, 2, 3

On parameters 𝜇 and 𝜎 2 . These parameters can be proved to be

𝜇 = E(𝑋), and 𝜎 2 = Var(𝑋).

On the normal curve. The normal curve (of the probability function 𝑓 (𝑥)) is

i/ bell-shaped,

ii/ symmetrical about the mean, and

iii/ when we move further away from the mean in both directions, the normal curve approaches the
horizontal axis.

These properties are described in the three most useful cases (see Figure 1.10):

• The area of 𝐴1 = {𝑥 : |𝑥 − 𝜇| ≤ 𝜎} takes 68.26% of the whole area.

• The area of 𝐴2 = {𝑥 : |𝑥 − 𝜇| ≤ 2𝜎} takes 95.44% of the area.

• The area of 𝐴3 = {𝑥 : |𝑥 − 𝜇| ≤ 3𝜎} takes 99.74% of the area.


Figure 1.10: Areas within radius 1, 2 and 3 𝜎 around the mean, when data approximate the Gauss
distribution.

1.5.3 The standard normal distribution and properties

Definition 1.4.

Let 𝑍 ∼ N(0, 1) be a standard normal random variable, with pdf

𝑓 (𝑧) = (1/√(2𝜋)) 𝑒^{−𝑧²/2} , −∞ < 𝑧 < ∞,

then the function Φ(𝑥), defined for all real numbers 𝑥 by

Φ(𝑥) = P[𝑍 ≤ 𝑥] = ∫_{−∞}^{𝑥} 𝑓 (𝑧) 𝑑𝑧,

is called the standard normal distribution function.


Thus the Laplace function Φ(𝑥) = P[𝑍 ≤ 𝑥], the probability that the standard r.v. 𝑍 is less than or
equal to 𝑥, is equal to the area under the standard normal density function

𝑓 (𝑧) = (1/√(2𝜋)) 𝑒^{−𝑧²/2} , −∞ < 𝑧 < ∞

between −∞ and 𝑥, as in the figure below.

Fact 1.2. If 𝑍 ∼ N(0, 1) then P[𝑍 < −𝑥] = P[𝑍 > 𝑥]


or equivalently Φ(−𝑥) = 1 − Φ(𝑥).

Fact 1.3. When 𝑎 < 𝑏, we get P[𝑎 < 𝑍 ≤ 𝑏] = Φ(𝑏) − Φ(𝑎).


Fact 1.4. Due to the 𝑍-transformation 𝑍 = (𝑋 − 𝜇)/𝜎, we have

Case 1 : 𝑋 = 𝜇 ⇐⇒ 𝑍 = 0; 𝑋 = 𝜇 + 𝜎 ⇐⇒ 𝑍 = 1. Since

𝑋 − 𝜇 ≤ 𝑎𝜎 ⇔ 𝑍 = (𝑋 − 𝜇)/𝜎 ≤ 𝑎,

we get P(|𝑋 − 𝜇| ≤ 𝜎) = P(−1 ≤ 𝑍 ≤ 1) = 0.6826.

Case 2 : 𝑋 = 𝜇 + 𝑘𝜎 ⇐⇒ 𝑍 = 𝑘. So
P(|𝑋 − 𝜇| ≤ 𝑘 𝜎) = P(|𝑍| ≤ 𝑘) = P(−𝑘 ≤ 𝑍 ≤ 𝑘) = Φ(𝑘) − Φ(−𝑘).
Normal distribution- Computation using the 𝑍-transformation.
Observe that the probability cumulative function of a normal variable 𝑋 ∼ N(𝜇, 𝜎²), given by

𝐹 (𝑎) = P(𝑋 ≤ 𝑎) = ∫_{−∞}^{𝑎} 𝑓 (𝑥) 𝑑𝑥,

cannot be evaluated symbolically!

The Z-transformation standardizes the normal variable

We can only compute probabilities related to 𝑋 if we use

𝑍 = (𝑋 − 𝜇)/𝜎. (1.22)

• 𝑍 standardizes variable 𝑋, and 𝑍 has 𝜇 = 0 and 𝜎 = 1.


• We write 𝑍 ∼ N(0, 1); it is now named the standardized or standard normal variable.

For instance, a normal variable 𝑋 ∼ N(𝜇, 𝜎²) with 𝜇 = 10, 𝜎 = 2 can be standardized to

𝑍 = (𝑋 − 𝜇)/𝜎 = (𝑋 − 10)/2 ∼ N(0, 1).

Example 1.7. [Actuarial Science.]


(Figures: standardizing the Gauss variable 𝑋 to 𝑍, and computing the c.d.f. using the Laplace function.)

Suppose that the cost measurements of customer’s claims in Actuarial Science are assumed to follow
a normal distribution 𝑋 with 𝜇 = $1000 and 𝜎 = $200. What is the probability that the cost measurement
is between $1000 and $1400?

GUIDANCE for solving.

First we scale everything down by a factor of 100 USD: $1000 becomes 10, $1400 becomes 14, and
𝜎 = $200 becomes 2.
We assumed 𝑋 ∼ N(𝜇, 𝜎²) with 𝜇 = 10, 𝜎 = 2; then using

𝑍 = (𝑋 − 𝜇)/𝜎 = (𝑋 − 10)/2

we have 10 ≤ 𝑋 ⇒ 0 ≤ 𝑍, and 𝑋 ≤ 14 ⇒ 𝑍 ≤ 2, which gives us

P(10 ≤ 𝑋 ≤ 14) = P(0 ≤ 𝑍 ≤ 2) = Φ(2) − Φ(0).


Tabulated values of Φ(𝑥) are employed to extract probabilities.

Table 1.1: Tabulated values of Φ(𝑥)

𝑝 = Φ(𝑧𝛼 ) 99.5% 99% 97.72% 97.5% 95% 90% 80% 75% 0.5
𝑧𝛼 2.58 2.33 2.00 1.96 1.645 1.28 0.84 0.6745 0

Table 1.1 says that: 𝑧 = 2 ⇒ Φ(2) = 0.9772, and Φ(0) = 0.5. So

P(10 ≤ 𝑋 ≤ 14) = P(0 ≤ 𝑍 ≤ 2) = Φ(2) − Φ(0) = 0.4772.

Furthermore, by Fact 1.4:
P(|𝑋 − 𝜇| ≤ 2𝜎) = P(|𝑋 − 10| ≤ 4)

= P(−2 ≤ 𝑍 ≤ 2) = Φ(2) − Φ(−2) = 0.9544.

Few most practically well known cases, see Figure 1.10 are:

• 𝑥 = 𝜇 + 1.645𝜎 ⇐⇒ 𝑧 = 1.645 ⇒ P(−1.645 ≤ 𝑍 ≤ 1.645) = 90%;

• 𝑥 = 𝜇 + 1.96𝜎 ⇐⇒ 𝑧 = 1.96 ⇒ P(−1.96 ≤ 𝑍 ≤ 1.96) = 95%; and

• 𝑥 = 𝜇 + 3𝜎 ⇐⇒ 𝑧 = 3 ⇒ P(−3 ≤ 𝑍 ≤ 3) = 99.7%. 
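In R the whole of Example 1.7 is a one-liner, since pnorm plays the role of the tabulated Laplace function Φ (a minimal sketch):

# P(1000 <= X <= 1400) for X ~ N(1000, 200^2), with no manual rescaling
pnorm(1400, mean = 1000, sd = 200) - pnorm(1000, mean = 1000, sd = 200)  # 0.4772
pnorm(2) - pnorm(0)       # the same value via the Z-transformation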

Example 1.8. [Education.]

IQ examination scores for freshmen are normally distributed with mean value 𝜇 = 100 and standard
deviation 𝜎 = 14.2. What is the probability that a randomly chosen freshman has an IQ score greater
than 130?

GUIDANCE for solving.

Let 𝑋 be the score of a randomly chosen freshman. Exploiting

𝑍 = (𝑋 − 𝜇)/𝜎,

we get P(𝑋 > 130) = P(𝑍 > 2.113) = 1 − Φ(2.113) = 1 − 0.983 = 0.017. ∎
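The same computation in R (a minimal sketch):

# P(X > 130) for IQ scores X ~ N(100, 14.2^2)
1 - pnorm(130, mean = 100, sd = 14.2)    # ~ 0.017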
 SUMMARY:

1. The standard Gaussian density 𝑓 (𝑥) is even because 𝑓 (𝑥) = 𝑓 (−𝑥), ∀𝑥.

2. Standard curve (of the probability function 𝑓 (𝑥)- Figure 1.11) is

i/ bell-shaped, ii/ symmetrical about the mean 𝜇 = 0.

The cdf Φ(𝑧) = 𝐹 (𝑧) is called Laplace function.


3. Tables of the Gaussian distribution usually do not list the values of Φ(𝑥) for 𝑥 < 0. The reason is
that the density function 𝑓 (𝑥) is symmetric about the line 𝑥 = 0,

P[𝑍 < −𝑥] = P[𝑍 > 𝑥].

We have a relation
Φ(−𝑥) = 1 − Φ(𝑥) for every 𝑥. (1.23)

Figure 1.11: Symmetry of the standard Gauss distribution

Thus, Figure 1.11 shows the probability that 𝑍 < −1, indeed

P[𝑍 < −1] = Φ(−1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.

4. The 𝑝-th percentile of the standard Gaussian distribution is number 𝑧𝑝 that meets the equation

Φ(𝑧𝑝 ) = P[𝑍 ≤ 𝑧𝑝 ] = 𝑝. (1.24)

If 𝑋 ∼ N(𝜇, 𝜎 2 ) we denote the 𝑝-th percentile of 𝑋 by 𝑥𝑝 . We can see that 𝑥𝑝 is related to the
normalized quantile 𝑧𝑝 by
𝑥𝑝 = 𝜇 + 𝑧𝑝 𝜎.


1.5.4 Exponential distribution- the second important one

• An exponential random variable 𝑋 with parameter 𝜆 is given by the probability density function

𝑓 (𝑡) = 𝑓 (𝑡; 𝜆) = 𝜆𝑒^{−𝜆𝑡} , 𝑡 ≥ 0 (1.25)

where 𝜆 > 0 is a constant. The mean and the variance are

𝜇 = 1/𝜆; 𝜎² = 𝜇² = 1/𝜆².

• The exponential cumulative distribution function (cdf) is

𝐹 (𝑡) = P(𝑋 ≤ 𝑡) = ∫_{0}^{𝑡} 𝑓 (𝑥) 𝑑𝑥 = ∫_{0}^{𝑡} 𝜆𝑒^{−𝜆𝑥} 𝑑𝑥 = 1 − 𝑒^{−𝜆𝑡} , 𝑡 ≥ 0.

Notes:

1. In practice, an exponential random variable 𝑋 is used to describe the distance between successive
events of a Poisson process with mean number of events 𝜆 > 0 per unit interval. [See details in
Section 1.6.4, and an application in Section 5.4]

2. For Poisson distributions, the mean and variance are the same; while for exponential ones, the mean
and standard deviation are the same.

We have the following key concept.

Let 𝑇 be any continuous random variable with non-negative values and cdf 𝐹𝑇 . The survival
function of 𝑇 is one minus the cdf, defined as

𝑆(𝑡) = 1 − 𝐹𝑇 (𝑡) = P(𝑇 > 𝑡) = ∫_{𝑡}^{∞} 𝑓 (𝑥) 𝑑𝑥, with 𝑡 ≥ 0; (1.26)

it gives the probability that the component will fail after 𝑡 time units.

Definition 1.5.

* The instantaneous hazard function of a person or system, also called the failure rate function, is
defined as

𝜆(𝑡) = 𝑓 (𝑡)/𝑆(𝑡), 𝑡 ≥ 0. (1.27)

* The function

Λ(𝑡) = ∫_{0}^{𝑡} 𝜆(𝑢) 𝑑𝑢 (1.28)

is called the cumulative hazard rate.
* The expected life length E[𝑇 ] of a product is called the mean time till death or mean time till failure
(MTTF). This quantity is given by

𝜇 = E[𝑇 ] = MTTF = ∫_{0}^{∞} 𝑡𝑓 (𝑡) 𝑑𝑡 = ∫_{0}^{∞} P(𝑇 > 𝑡) 𝑑𝑡 = ∫_{0}^{∞} 𝑆(𝑡) 𝑑𝑡. (1.29)


Exponential variables 𝑇 ∼ E(𝜆) are widely used to represent the lifetime of a subject or a system.

Example 1.9 (Lifetime follows an exponential distribution).

In applications the exponential distribution with mean 𝛽 is used for 𝑇 , with pdf

𝑓 (𝑡) = 𝑓 (𝑡; 𝛽) = (1/𝛽) 𝑒^{−𝑡/𝛽} , for 𝑡 ≥ 0 (1.30)

and the survival function

𝑆(𝑡) = 1 − 𝐹 (𝑡) = 𝑒^{−𝑡/𝛽} , 𝑡 ≥ 0.

In this model the survival function diminishes from 1 to 0 exponentially fast, relative to 𝛽. The hazard
rate function is

𝜆(𝑡) = 𝑓 (𝑡)/𝑆(𝑡) = ((1/𝛽) 𝑒^{−𝑡/𝛽} ) / 𝑒^{−𝑡/𝛽} = 1/𝛽, 𝑡 ≥ 0.

That is, the exponential model is valid for cases where the hazard rate function 𝜆(𝑡) = 1/𝛽 = 𝜆 > 0 is a
constant independent of time.
If the MTTF is E[𝑇 ] = 𝛽 = 100 [hr], we expect 1 failure per 100 hours, i.e., 𝜆(𝑡) = 1/100 [1/hr]. ∎
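A minimal R sketch of this lifetime model (the probe time of 50 hours is arbitrary):

# T ~ E(lambda) with MTTF beta = 1/lambda = 100 hours
beta <- 100
pexp(50, rate = 1/beta)   # F(50) = 1 - exp(-0.5), about 0.393
exp(-50/beta)             # survival S(50) = P(T > 50), about 0.607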

1.5.5 Gamma distribution

Reminder: the relation between the pdf 𝑓 and the cdf 𝐹 is

𝑓 (𝑡) = 𝑑𝐹𝑋 (𝑡)/𝑑𝑡 ⇐⇒ ∫ 𝑓 𝑑𝑡 = 𝐹 (𝑡).
Two important distributions for studying the reliability and failure rates of systems are the gamma and
the Weibull distributions. We will need these distributions in our study of reliability methods. These
distributions are discussed here as further examples of continuous distributions.

♣ Practical motivation 1.

Suppose we use in a manufacturing process a machine which mass-produces a particular part. In


a random manner, it produces defective parts at a rate of 𝜆 per hour. The number of defective parts
produced by this machine in a time period [0, 𝑡] is a random variable 𝑋(𝑡) having a Poisson distribution
with mean 𝜆𝑡. By Lemma 1.2, the probability density function of 𝑋(𝑡) is

P[𝑋(𝑡) = 𝑗] = 𝑒^{−𝜆𝑡} (𝜆𝑡)^𝑗 /𝑗!, 𝑗 = 0, 1, 2, ...

Now we wish to study the distribution of the time until the 𝑘-th defective part is produced.
Call this continuous random variable 𝑌𝑘 .


We use the fact that the 𝑘-th defect will occur before time 𝑡 (i.e., 𝑌𝑘 ≤ 𝑡) if and only if at least 𝑘 defects
occur up to time 𝑡 (i.e. 𝑋(𝑡) ≥ 𝑘). Therefore,

𝑌𝑘 ≤ 𝑡 ⇔ 𝑋(𝑡) ≥ 𝑘,

thus the c.d.f. for 𝑌𝑘 is

𝐺(𝑡; 𝑘, 𝜆) = P[𝑌𝑘 ≤ 𝑡] = P[𝑋(𝑡) ≥ 𝑘] = 1 − ∑_{𝑗=0}^{𝑘−1} P(𝑋(𝑡) = 𝑗) = 1 − ∑_{𝑗=0}^{𝑘−1} (𝜆𝑡)^𝑗 𝑒^{−𝜆𝑡} /𝑗! (1.31)

The corresponding p.d.f. for 𝑌𝑘 is

𝑔(𝑡; 𝑘, 𝜆) = (𝜆^𝑘 /(𝑘 − 1)!) 𝑡^{𝑘−1} 𝑒^{−𝜆𝑡} when 𝑡 ≥ 0, and 0 when 𝑡 < 0. (1.32)

This p.d.f. is clearly a member of a general family of gamma distributions 𝐺(𝜈, 𝛽) which depend on two
parameters 𝜈 and 𝛽. The probability density function of 𝐺(𝜈, 𝛽), generalized from Equation 1.32, is

𝑔(𝑥; 𝜈, 𝛽) = (1/(𝛽^𝜈 Γ(𝜈))) 𝑥^{𝜈−1} 𝑒^{−𝑥/𝛽} if 𝑥 ≥ 0, and 0 if 𝑥 < 0. (1.33)

In the R software, the function pgamma computes the c.d.f. of a gamma distribution having shape 𝜈
and scale 𝛽, 0 < 𝜈, 𝛽 < ∞. If we use 𝜈 = shape = 1 = 𝛽 = scale then the cdf is 𝐹𝐺 (1) = 0.6321206.
> pgamma(q=1, shape=1, scale=1)
[1] 0.6321206

The expected value and variance of the gamma distribution 𝐺(𝜈, 𝛽) are, respectively,

𝜇 = 𝜈𝛽, 𝜎 2 = 𝜈𝛽 2 . (1.34)
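Equation 1.34 can be checked by simulation in R (a minimal sketch; the parameter values 𝜈 = 2, 𝛽 = 3 are arbitrary):

# Gamma G(nu, beta): mean nu*beta and variance nu*beta^2
set.seed(1)
nu <- 2; beta <- 3
x <- rgamma(1e6, shape = nu, scale = beta)
c(mean(x), nu * beta)     # both ~ 6
c(var(x), nu * beta^2)    # both ~ 18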

Property.

1. Γ(𝜈) is called the gamma function of 𝜈 and is defined as the integral

Γ(𝑥) = ∫_{0}^{∞} 𝑡^{𝑥−1} 𝑒^{−𝑡} 𝑑𝑡, 𝑥 > 0. (1.35)

The gamma function satisfies the relationship Γ(1) = 1 and

Γ(𝑥 + 1) = 𝑥Γ(𝑥), ∀𝑥 > 0. (1.36)

Hence, for every positive integer 𝑛 ∈ N, Γ(𝑛 + 1) = 𝑛!. Besides, Γ(1/2) = √𝜋.

We note also that the exponential distribution E(𝛽) is a special case of the gamma distribution with
𝜈 = 1, write E(𝛽) = 𝐺(1, 𝛽).


Figure 1.12: The pdf with 𝛽 = 1 and 𝜈 = 0.5, 1, 2.

2. If 𝑋𝑖 ∼ 𝐺(𝜈𝑖 , 𝛽) are independent, then

∑_{𝑖=1}^{𝑛} 𝑋𝑖 ∼ 𝐺(∑_{𝑖=1}^{𝑛} 𝜈𝑖 , 𝛽).

Therefore if, in particular, the 𝑋𝑖 ∼ E(𝛽) = 𝐺(1, 𝛽) are iid exponential, the sum 𝑇 = ∑_{𝑖=1}^{𝑛} 𝑋𝑖 ∼ 𝐺(𝑛, 𝛽).
Hence,

𝑇 = 𝑡(X) = ∑_{𝑖=1}^{𝑛} 𝑋𝑖

has pdf

𝑓𝑇 (𝑡; 𝑛, 𝛽) = (1/(𝛽^𝑛 Γ(𝑛))) 𝑡^{𝑛−1} 𝑒^{−𝑡/𝛽} if 𝑡 ≥ 0; or (𝜃^𝑛 /Γ(𝑛)) 𝑡^{𝑛−1} 𝑒^{−𝜃𝑡} if we put 𝜃 = 1/𝛽. (1.37)

1.5.6 Weibull distribution

Weibull distributions 𝑊 (𝛼, 𝛽) are often used in reliability models in which the system either “ages” with
time or becomes “younger”.


The Weibull family of distributions will be denoted by 𝑊 (𝛼, 𝛽). The positive parameters 𝛼, 𝛽 > 0 are
called the shape and the scale parameters, respectively. Figure 1.13 draws two p.d.f. of 𝑊 (𝛼, 𝛽) with
𝛼 = 1.5; 2, and 𝛽 = 1. Note that, 𝑊 (1, 𝛽) = E(𝛽) is the exponential distribution.

Figure 1.13: The pdf of Weibull with 𝛼 = 1.5; 2 and 𝛽 = 1.

• The p.d.f. and corresponding c.d.f. of 𝑊 (𝛼, 𝛽) respectively are

𝑤(𝑡; 𝛼, 𝛽) = (𝛼/𝛽^𝛼 ) 𝑡^{𝛼−1} 𝑒^{−(𝑡/𝛽)^𝛼 } when 𝑡 ≥ 0, and 0 when 𝑡 < 0; (1.38)

𝑊 (𝑡; 𝛼, 𝛽) = 1 − 𝑒^{−(𝑡/𝛽)^𝛼 } when 𝑡 ≥ 0, and 0 when 𝑡 < 0. (1.39)

• The mean and variance of this distribution are

𝜇 = 𝛽 · Γ(1 + 1/𝛼), (1.40)

and

𝜎² = 𝛽² {Γ(1 + 2/𝛼) − [Γ(1 + 1/𝛼)]²}. (1.41)
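A minimal R sketch checking the mean formula 1.40 for one of the curves of Figure 1.13:

# W(alpha = 1.5, beta = 1): theoretical mean vs. a simulated one
alpha <- 1.5; beta <- 1
beta * gamma(1 + 1/alpha)                          # = Gamma(5/3), about 0.9027
mean(rweibull(1e6, shape = alpha, scale = beta))   # close to the value above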


1.5.7 Beta distribution

The probability density function of Beta(𝜈1 , 𝜈2 ) is

𝑓 (𝑥; 𝜈1 , 𝜈2 ) = (1/𝐵(𝜈1 , 𝜈2 )) 𝑥^{𝜈1 −1} (1 − 𝑥)^{𝜈2 −1} when 0 < 𝑥 < 1, and 0 otherwise, (1.42)

with shape parameters 𝜈1 , 𝜈2 > 0; the integral

𝐵(𝜈1 , 𝜈2 ) = ∫_{0}^{1} 𝑥^{𝜈1 −1} (1 − 𝑥)^{𝜈2 −1} 𝑑𝑥. (1.43)

When 𝜈1 = 𝜈2 = 1, Beta becomes the uniform 𝑈 (0, 1).

Figure 1.14: The pdf 𝑓 (𝑥; 𝜈1 , 𝜈2 ) of Beta(𝜈1 , 𝜈2 ) when 𝜈1 = 2.5, 𝜈2 = 2.5; 𝜈1 = 2.5, 𝜈2 = 5.00.

The probability cumulative function of Beta(𝜈1 , 𝜈2 ) is

𝐼𝑥 (𝜈1 , 𝜈2 ) = (1/𝐵(𝜈1 , 𝜈2 )) ∫_{0}^{𝑥} 𝑢^{𝜈1 −1} (1 − 𝑢)^{𝜈2 −1} 𝑑𝑢, (1.44)

with 0 ≤ 𝑥 ≤ 1. Note that 𝐼𝑥 (𝜈1 , 𝜈2 ) = 1 − 𝐼_{1−𝑥} (𝜈2 , 𝜈1 ). Figure 1.14 shows the graph of 𝑓 for Beta(2.5, 5.0)
and Beta(2.5, 2.5). If 𝜈1 = 𝜈2 then the pdf 𝑓 of Beta is symmetric about the vertical line 𝑥 = 𝜇 = 1/2.
The beta distribution has an important role in the theory of statistics. As will be seen later, many
methods of statistical inference are based on the order statistics, and their distributions are related to
the beta distribution.
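In R the Beta p.d.f. and c.d.f. of Equations 1.42 and 1.44 are built in; a minimal sketch for the case Beta(2.5, 5.0) of Figure 1.14 (the evaluation point 0.3 is arbitrary):

dbeta(0.3, shape1 = 2.5, shape2 = 5.0)   # f(0.3; 2.5, 5.0)
pbeta(0.3, shape1 = 2.5, shape2 = 5.0)   # I_0.3(2.5, 5.0)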


1.6 Summary

1.6.1 Popular numerical sets for quantitative measurements and counts

• Discrete (countably infinite): counts, such as


the numbers of faulty parts in a software,
the numbers of students, of telephone calls etc.
- Their values usually are naturals in the set N = {0, 1, 2, 3, . . . , 𝑛, 𝑛 + 1, . . .}, or in
- the set of integers Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}.

• The set of rational numbers

Q = {𝑚/𝑛 : 𝑚, 𝑛 ∈ Z and 𝑛 ̸= 0}

is also discrete.

• Continuous: the real numbers (rational and irrational numbers) denoted by R = (−∞, +∞).

Elucidation.
a/ The numbers 0, 1, 2, 3, and so on are called the natural numbers N. However, if we subtract or divide
two natural numbers, the result is not always a natural number. To overcome the limitation of subtraction,
we extend the natural number system to the system Z of integers. We include in Z all the natural
numbers, all of their negatives, and the number zero (0). Thus, Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}.
b/ We still can not always divide any two integers. For example 8/(-2) = -4 is an integer, but 8/3 is
not an integer. To overcome this problem, we extend the system of integers to the system of rational
numbers Q.
We define a number as rational if it can be expressed as a ratio of two integers. Thus, all four
basic arithmetic operations (addition, subtraction, multiplication and division) are all possible in the
rational number system Q. Some numbers in everyday use are not rational number; i.e. they can not
be expressed as a ratio of two integers. E.g. 𝜋 ≈ 3.14, 𝑒 ≈ 2.71, etc. are not rational numbers; such
numbers are called irrational numbers.
c/ The term real number is used to describe a number that is either rational or irrational. To give a
complete definition of real numbers R would involve the introduction of a number of new ideas, and we
shall not do this task now. However, it is a good idea to think about a real number in terms of decimals.

1.6.2 Computing Probability

In summary, denote ‘events’ or ‘outcomes’ with capital letters 𝐴, 𝐵..., we have that

• Probability of any event is a number between 0 and 1. If 𝐴 is an event, P(𝐴) is the probability that
the event 𝐴 occurs: 0 ≤ P(𝐴) ≤ 1. The empty event has probability 0, and the sample space has
probability 1.


Rule 1. P(𝐴) + P(𝐴𝑐 ) = 1 or P(𝐴𝑐 ) = 1 − P(𝐴)

• There are rules for computing probabilities of unions, intersections, and complements.

If an event 𝐵 is a subset of event 𝐴, then P(𝐵) ≤ P(𝐴).

* For a union of disjoint events 𝐴 and 𝐵, probabilities are added:

Rule 2. P(𝐴 or 𝐵) = P(𝐴) + P(𝐵).

* Independent events 𝐴 and 𝐵 lead to the probability P(𝐴 𝐵):

Rule 3. P(𝐴 𝐵) = P(𝐴 ∩ 𝐵) = P(𝐴) · P(𝐵).

• Unconditional probability of 𝐵 can be computed from its conditional probabilities by the Law of Total
Probability, set up as follows.

When events 𝐸1 , 𝐸2 , · · · , 𝐸𝑛 (𝑛 ≥ 1) form a partition of a sample space Ω, that is

Ω = ⋃_{𝑖} 𝐸𝑖 = 𝐸1 ∪ 𝐸2 ∪ · · · ∪ 𝐸𝑛 ,

then for any event 𝐵 we have

Rule 4. the Law of Total Probability:

P[𝐵] = P[⋃_{𝑖} 𝐵 𝐸𝑖 ] = ∑_{𝑖}^{𝑛} P[𝐵 𝐸𝑖 ] = ∑_{𝑖}^{𝑛} P[𝐵 ∩ 𝐸𝑖 ].

• Given occurrence of event 𝐵, one can compute conditional probability of event 𝐴 as in Eqn. 1.5.

• The Bayes Rule, often used in testing and diagnostics, relates the conditional probabilities of 𝐴 given 𝐵
and of 𝐵 given 𝐴, as in

P(𝐴 | 𝐵) = P[𝐴] · P[𝐵 | 𝐴] /P[𝐵]. (1.45)

Replacing 𝐴 by 𝐸𝑖 and using

P[𝐵] = ∑_{𝑖}^{𝑛} P[𝐵 ∩ 𝐸𝑖 ] = ∑_{𝑗=1}^{𝑛} P[𝐸𝑗 ] · P[𝐵 | 𝐸𝑗 ],

we get

Rule 5. General Bayes Rule.

P(𝐸𝑖 | 𝐵) = P(𝐸𝑖 ) · P(𝐵 | 𝐸𝑖 ) / ∑_{𝑗=1}^{𝑛} P(𝐸𝑗 ) · P(𝐵 | 𝐸𝑗 ), 𝑖 = 1, · · · , 𝑛. (1.46)

How can we compute probabilities?


1. Method A. The relative-frequency interpretation of probability applies to situations in which we can


observe results over and over again. For example, it is easy to envision flipping a coin over and over
again and observing whether it lands heads or tails.

The probability that the coin lands heads up is the relative frequency, over the long run, with which
the coin lands heads up.

Here are some more similar interesting situations where probability can be applied:

* Buying a weekly lottery ticket and observing whether it is a winner

* Commuting to work daily and observing whether a certain traffic signal is red when we see it

* Testing individuals in a population and observing whether they carry a gene for a certain disease

* Observing births and noting if the baby is male or female.

Two important remarks are:

i/ The interpretation does not apply to situations where the outcome one time is influenced by or in-
fluences the outcome the next time because the probability would not remain the same from one
time to the next. We cannot determine a number that is always changing.

ii/ Probability cannot be used to determine whether the outcome will occur on a single occasion but
can be used to predict the long-term proportion of the times the outcome will occur.

2. Method B. Applying simple or classical probability rules.

We should use the basic Rules 1 to 3, recalled here.

Rule 1: If there are only two possible outcomes in an uncertain situation, their probabilities must add
to 1: P(𝐴) + P(𝐴𝑐 ) = 1.
Rule 2: If two outcomes or events cannot happen simultaneously, they are said to be mutually
exclusive. The probability of one or the other of two mutually exclusive outcomes is

𝐴 ∩ 𝐵 = ∅ ⇒ P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵)

Rule 3: If two events 𝐴, 𝐵 do not influence each other, the events are said to be independent of
each other.
If two events 𝐴, 𝐵 are independent, the probability that they both happen is

P(𝐴 ∩ 𝐵) = P(𝐴) · P(𝐵).

Without independence, in general P(𝐴 ∩ 𝐵) ̸= P(𝐴) · P(𝐵).

3. Method C. The personal-probability interpretation of an event being the degree to which a given
individual believes the event will happen.


1.6.3 Key probability distributions

Name of distribution | Notation | Parameters | Expectation E[𝑋] | Variance V[𝑋] = Var(𝑋)
Bernoulli | 𝑋 ∼ B(𝑝) | 𝑝 = probability of success | 𝑝 | 𝑝(1 − 𝑝)
Binomial | 𝑋 ∼ Bin(𝑛, 𝑝) | 𝑛, 𝑝 | 𝑛𝑝 | 𝑛𝑝(1 − 𝑝)
Poisson | 𝑋 ∼ Pois(𝜆) | 𝜆 | 𝜆 | 𝜆
Gauss | 𝑋 ∼ N(𝜇, 𝜎²) | 𝜇, 𝜎² | 𝜇 | 𝜎²
Exponential | 𝑋 ∼ E(𝜆) | 𝜆 | 1/𝜆 | 1/𝜆²
𝜒² | 𝑋 ∼ 𝜒²𝑛 | 𝑛 | 𝜇 = 𝑛 | 2𝑛
Student | 𝑇 ∼ 𝑡𝑛,𝑝 | 𝑛, 𝑝 | 𝜇𝑇 = 0 | 𝑛/(𝑛 − 2)
Gamma | 𝐺(𝜈, 𝛽) | 𝜈, 𝛽 | 𝜇 = 𝜈𝛽 | 𝜎² = 𝜈𝛽²
Beta | Beta(𝜈1 , 𝜈2 ) | 𝜈1 , 𝜈2 | |

1.6.4 On Poisson distribution

The Poisson distribution is given by the probability density function

𝑝(𝑥) = 𝑒^{−𝜆} 𝜆^𝑥 /𝑥!, 𝑥 = 0, 1, 2, ... (1.47)

where
𝑥 = designated number of successes, 𝑒 ≈ 2.71 the natural base,
𝜆 > 0 = the average number of successes per unit of time period.
Poisson distribution can be used in the followings.

• To model the number of occurrences of some event or phenomenon in the time interval (0, 𝑡], we
can use Formula 1.47. Now 𝑥 = 0 implies that there are no occurrences of the event in (0, 𝑡], and
Prob(𝑥 = 0) = 𝑝(0) = 𝑒^{−𝜆} .

• To model the number of defects or non-conformities that occur in a unit of product (unit area, volume
...) say, a semiconductor device, by a Poisson distribution.

USAGE: Consider a queue where customers are buses and server is a bus station.

• Arrivals (buses) at a bus-stop follow a Poisson distribution with an average of 𝜆 = 4.5 buses every
15 minutes.


• Can we a) obtain a bar-plot (histogram) of the distribution (assume a maximum of 20 arrivals in 15
minutes); and b) calculate the probability of fewer than 3 arrivals in 15 minutes?

GUIDANCE for solving.

The probabilities of fewer than 3 (meaning from 0 up to 2) arrivals can be calculated directly from
Formula 1.11 with 𝜆 = 4.5:

𝑝(𝑖; 𝜆) = P(𝑋 = 𝑖) = 𝑒^{−𝜆} 𝜆^𝑖 /𝑖!,

hence 𝑝(0; 𝜆) = 𝑃 (𝑋 = 0) = 𝑒^{−4.5} · 4.5⁰/0! = 0.01111. Similarly 𝑝(1; 𝜆) = 0.04999 and 𝑝(2; 𝜆) = 0.11248.
Therefore the probability of fewer than 3 arrivals is the cdf

𝑃 (2; 𝜆) = P(𝑋 ≤ 2) = ∑_{𝑖=0}^{2} P(𝑋 = 𝑖) = 0.17358 = 17.36%.

The next diagram shows the case of 𝜆 = 10 buses every 15 minutes; there we see that the probability
of 20 arrivals is no longer negligible.
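Both parts of the question can also be answered with built-in R functions (a minimal sketch):

# a) bar-plot of the p.d.f., assuming at most 20 arrivals in 15 minutes
barplot(dpois(0:20, lambda = 4.5), names.arg = 0:20)
# b) probability of fewer than 3 arrivals in 15 minutes
ppois(2, lambda = 4.5)    # P(X <= 2), about 0.1736
dpois(0:2, lambda = 4.5)  # the three individual terms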

1.7 Basic Probability Problems

1.7.1 How to compute probability of an event?

How to find P(𝐴), for any 𝐴 ⊂ Ω? Use Counting Techniques:


1. Multiplication rule
2. Permutation rule
3. Combination rule


The right tool for the right job!


Counting Techniques
1. Multiplication rule:

• Let an operation consist of 𝑘 steps and there are

• 𝑛1 ways of completing step 1,

• 𝑛2 ways of completing step 2, . . . and

• 𝑛𝑘 ways of completing step k,

• Then, the total number of ways to perform 𝑘 steps is:

𝑁 = 𝑛1 · 𝑛2 · . . . · 𝑛𝑘 . (1.48)

Web Site Design for your new Insurance Firm?

• In the design for a website, we can choose to use among:

4 colors, 3 fonts, and 3 positions for an image.

• How many designs are possible? Answer via Rule 1: 4 · 3 · 3 = 36

2. Permutation rule:

• A permutation is a unique sequence of distinct items.

• Number of permutations for a set of 𝑛 items is 𝑛!. By definition: 0! = 1.

• If 𝑆 = {𝑎, 𝑏, 𝑐}, then there are 6 permutations, namely: 𝑎𝑏𝑐, 𝑎𝑐𝑏, 𝑏𝑎𝑐, 𝑏𝑐𝑎, 𝑐𝑎𝑏, 𝑐𝑏𝑎 (order matters)

Subset Permutations
For a sequence of 𝑘 items from a set of 𝑛 items,
the number of choosing sequences is:

𝑃_𝑘^𝑛 = 𝑛!/(𝑛 − 𝑘)!

Permutation rule in INDUSTRY: Printed Circuit Board (in PC)

• A printed circuit board has eight different locations in which a component can be placed.

• If four different components are to be placed on the board,

how many designs are possible?


• Answer: Order is important, so use the permutation formula with 𝑛 = 8, 𝑘 = 4:

𝑃_4^8 = 8!/(8 − 4)! = 8 · 7 · 6 · 5 = 1680

3. Combination rule:

• A combination is a selection of 𝑘 items from a set of 𝑛 where order does not matter.

• If 𝑆 = {𝑎, 𝑏, 𝑐}, 𝑛 = 3, then

* If 𝑘 = 3, there is 1 combination, namely: 𝑎𝑏𝑐

* If 𝑘 = 2, there are 3 combinations, namely 𝑎𝑏, 𝑎𝑐, and 𝑏𝑐

• # of permutations ≥ # of combinations

• Since order does not matter with combinations, we divide the number of permutations by 𝑘!:

𝐶_𝑘^𝑛 = 𝑃_𝑘^𝑛 /𝑘! = 𝑛!/(𝑘! (𝑛 − 𝑘)!) = \binom{𝑛}{𝑘} . (1.49)
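The counting rules map directly onto base R; a minimal sketch reproducing the two worked answers above:

factorial(8) / factorial(8 - 4)  # P(8,4) = 1680 circuit-board designs
choose(6, 3)                     # C(6,3) = 20, reused in the signals example below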

1.7.2 Basic type problems

Outline:
A/ Experiment and sample space
B/ Basic rules of operations with events
C/ Computation of probability
D/ Independent events- Conditional probability

A/ Experiment and sample space

1. The experiment is to select a sequence of 5 letters for transmission of a code in a money transfer
operation. Let 𝑎1 , 𝑎2 , . . . , 𝑎5 denote the first, second, ..., fifth letter chosen.

The sample space Ω is the set of all possible sequences of five letters. Formally,

Ω = {(𝑎1 , 𝑎2 , . . . , 𝑎5 ) : 𝑎𝑖 ∈ {𝑎, 𝑏, 𝑐, · · · , 𝑥, 𝑦, 𝑧}, 𝑖 = 1, · · · , 5}.

* This is a finite sample space containing 26⁵ possible sequences of 5 letters, due to the
multiplication rule in Equation 1.48.

* A sample point (unit/element) is any such sequence (𝑎1 , 𝑎2 , . . . , 𝑎5 ) in Ω.

Quiz: Let 𝐸 be the event that all the 5 letters in the sequence are the same. Describe 𝐸 and
find P[𝐸].


2. (Industrial Production) Our experiment now is to choose a steel bar from a specific production process,
and to measure its weight.

- The sample space Ω is the continuous interval (𝑤0 , 𝑤1 ) ⊂ R+ of possible weights.

- The weight of a particular bar is a sample point.

CONCLUSION

Thus, sample spaces could be


finite sets of sample points, or
countable sets (as N or Z) or
non-countable infinite sets (as R).

B/ Basic rules of operations with events

Consider randomly choosing a real number from 0 to 1, then


we get the sample space
Ω = {𝑢 : 0 ≤ 𝑢 ≤ 1}.

Call the events


𝐸1 = {𝑢 : 0 ≤ 𝑢 ≤ 0.5}, 𝐸2 = {𝑢 : 0.35 ≤ 𝑢 ≤ 1}.

The union of these two events is


𝐸3 = 𝐸1 ∪ 𝐸2 = {𝑢 : 0 ≤ 𝑢 ≤ 1} = Ω,
and their intersection is 𝐸4 = 𝐸1 ∩ 𝐸2 = {𝑢 : 0.35 ≤ 𝑢 ≤ 0.5} ̸= ∅.

• Thus, 𝐸1 and 𝐸2 are not disjoint. The complementary events are

𝐸1𝑐 = {𝑢 : 0.5 < 𝑢 ≤ 1}, and 𝐸2𝑐 = {𝑢 : 0 ≤ 𝑢 < 0.35}.

• The fact 𝐸1𝑐 ∩ 𝐸2𝑐 = ∅ means the complementary events are disjoint.

• By De Morgan’s law

(𝐸1 ∩ 𝐸2 )𝑐 = 𝐸1𝑐 ∪ 𝐸2𝑐 = {𝑢 : 𝑢 < 0.35 or 𝑢 > 0.5}.

FOR THOSE WHO LIKE FORMULAS


Denote “events” or “outcomes” with capital letters 𝐴, 𝐵, 𝐶, and so on.
𝐴 complement = 𝐴𝑐 .
P(𝐴) is the probability that the event or outcome 𝐴 occurs.

• Rule 0: For any event 𝐴, 0 ≤ P(𝐴) ≤ 1.

• Rule 1: P(𝐴) + P(𝐴𝑐 ) = 1 or P(𝐴𝑐 ) = 1 − P(𝐴)


• Rule 2: If events 𝐴 and 𝐵 are mutually exclusive, then

P(𝐴 or 𝐵) = P(𝐴) + P(𝐵)

• Rule 3: If events 𝐴 and 𝐵 are independent, then

P(𝐴𝐵) = P(𝐴) · P(𝐵)

• Rule 4: If event 𝐵 is a subset of event 𝐴, then P(𝐵) ≤ P(𝐴).

C/ Computation of probability

EXAMPLE: An experiment in telecommunication (radar, cellphone, military satellites...) consists of


randomly transmitting a sequence of binary signals, 0 or 1.

What is the probability that 3 out of 6 signals are 1’s?

Let 𝐸3 denote this event.

• The sample space of 6 signals consists of 26 points.

• Each point is equally probable. Each point is a combination of 3 signals assigned to 6 positions of a
binary sequence.
• The number of combinations of 3 chosen from 6 is 𝐶_3^6 = \binom{6}{3},
due to Equation 1.49. The probability of 𝐸3 is

P[𝐸3 ] = \binom{6}{3} (1/2⁶) = (6 · 5 · 4)/(1 · 2 · 3) · (1/64) = 20/64 = 0.3125.

Problem 1- Medical science:


In a medical study, patients are classified in 8 ways according to whether they have blood type
𝐴𝐵 + , 𝐴𝐵 − , 𝐴+ , 𝐴− , 𝐵 + , 𝐵 − , 𝑂+ , or 𝑂− , and also according to whether their blood pressure is low,
normal, or high.
Find the number of ways in which a patient can be classified.
HINT: Use multiplication rule.
Problem 2 - Education: A class for engineers consists of 25 industrial, 10 mechanical, 10 electrical,
and 8 civil engineering students. If a person is randomly selected by the instructor to answer a question,
find the probability that the student chosen is (a) an industrial engineering major and
(b) a civil engineering or an electrical engineering major.

HINT: represent students in classes of industrial, mechanical, electrical and civil engineering by
events 𝐼, 𝑀 , 𝐸 and 𝐶; then the whole statistics class for engineers 𝑆 is the union of these events.


Problem 3- Industrial research:


Interest centers around the life of an electronic component.

• Suppose it is known that the probability that the component survives for more than 6000 hours is
0.42.

• Suppose also that the probability that the component survives no longer than 4000 hours is 0.04.

(a) What is the probability that the life of the component is less than or equal to 6000 hours?

(b) What is the probability that the life is greater than 4000 hours?

HINT: use the complement rule P(𝐴) + P(𝐴𝑐 ) = 1 or P(𝐴𝑐 ) = 1 − P(𝐴).

Problem 4 (*) - Medical research:


A study in Germany concluded that following 7 simple health rules can extend
- a man’s life by 11 years on the average and
- a woman’s life by 7 years.
These 7 rules are as follows:
i/ no smoking, ii/ get regular exercise, iii/ use alcohol only in moderation,
iv/ get 7 to 8 hours of sleep, v/ maintain proper weight,
vi/ eat breakfast, and vii/ do not eat between meals.
In how many ways can a person adopt 5 of these rules to follow
(a) if the person presently violates all 7 rules?
(b) if the person never drinks and always eats breakfast?

D/ Independent events- Conditional probability

Problem 5 - Disaster control:


A small town has one fire engine and one ambulance available for emergencies.
The probability that the fire engine is available when needed is 0.98, and the probability that the
ambulance is available when called is 0.92.
In the event of an injury resulting from a burning building, find the probability that both the ambulance
and the fire engine will be available, assuming they operate independently.

HINT: Let 𝐴 and 𝐵 be the respective events that the fire engine and the ambulance are available.
Problem 6 (*)- Quality control


ASSUMPTION. Five identical departments are designed in a given commercial bank.


Let 𝐸1 , 𝐸2 , · · · , 𝐸5 be the events that these five departments comply with the quality specifications
(non-defective, non bugs...). Under the model of mutual independence the probability that all the five
departments are indeed non-defective is

P(𝐸1 ∩ 𝐸2 ∩ · · · ∩ 𝐸5 ) = P(𝐸1 ) P(𝐸2 ) · · · P(𝐸5 ).

Since these departments come from the same production process (i.e. construction company), we
can assume that P(𝐸𝑖 ) = 𝑝, all 𝑖 = 1, · · · , 5. Thus, the probability that all the 5 departments are
non-defective is 𝑝5 .
What is the probability that one department is defective and
all the other four are non-defective?
HINTS: Let 𝐴1 be the event that exactly one of the five departments is defective. In order to simplify the notation, we write the intersection of events as their product. Thus,

𝐴1 = 𝐸1𝑐 𝐸2 𝐸3 𝐸4 𝐸5 ∪ 𝐸1 𝐸2𝑐 𝐸3 𝐸4 𝐸5 ∪ 𝐸1 𝐸2 𝐸3𝑐 𝐸4 𝐸5

∪ 𝐸1 𝐸2 𝐸3 𝐸4𝑐 𝐸5 ∪ 𝐸1 𝐸2 𝐸3 𝐸4 𝐸5𝑐 .

𝐴1 is the union of five disjoint events. Therefore

P(𝐴1 ) = P(𝐸1𝑐 𝐸2 𝐸3 𝐸4 𝐸5 ) + · · · + P(𝐸1 𝐸2 𝐸3 𝐸4 𝐸5𝑐 ) = 5𝑝4 (1 − 𝑝).

Can you explain why? Since 𝐸1 , 𝐸2 , · · · , 𝐸5 are independent of each other:

P(𝐸1𝑐 𝐸2 𝐸3 𝐸4 𝐸5 ) = P(𝐸1𝑐 ) P(𝐸2 ) P(𝐸3 ) P(𝐸4 ) P(𝐸5 ) = (1 − 𝑝)𝑝4 .

Similarly,
P(𝐸1 𝐸2𝑐 𝐸3 𝐸4 𝐸5 ) = · · · = P(𝐸1 𝐸2 𝐸3 𝐸4 𝐸5𝑐 ) = (1 − 𝑝)𝑝4 .

More generally, if 𝐽5 is the number of defective departments out of the total of five, then

$$\mathrm{P}(J_5 = i) = \binom{5}{i}\, p^{\,5-i}\,(1-p)^{i}.$$
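This is a binomial probability; note that here 𝑝 is the probability of being non-defective, so the count of defectives 𝐽5 has "success" probability 1 − 𝑝. A minimal R sketch with an illustrative value 𝑝 = 0.95 (this value is assumed, not given in the problem):

p <- 0.95                           # illustrative non-defective probability (assumed)
dbinom(1, size = 5, prob = 1 - p)   # P(J5 = 1), about 0.204
5 * p^4 * (1 - p)                   # the same value, matching the formula above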

1.7.3 Using expectation in practice

Problem 1.1. How to measure profit in Insurance Industry?

An insurance company charges $50 per customer in a year. Let 𝑋 be a discrete random variable (customer’s injury level) with three values (outcomes): Death, Disability and Good.
Assume that the company did research on 1000 people and obtained the following table:


Outcome 𝑥      Payout 𝑚      Probability 𝑝 = P[𝑋 = 𝑥]
Death          10,000        1/1000
Disability      5,000        2/1000
Good                0        997/1000

Or horizontally

𝑋                    Death          Disability     Good
𝑀 (Payout)           10,000         5,000          0
𝑝𝑖 = P[𝑋 = 𝑥𝑖 ]      𝑝1 = 1/1000    𝑝2 = 2/1000    𝑝3 = 997/1000

Can the company make a profit?


Use Expectation = Expected Value: The Center

• 𝑋 is a discrete random variable (customer’s injury level) with three values (outcomes) Death, Disabil-
ity and Good.

• Let 𝑀 be the random variable indicating the money that the company pays to a customer, corresponding to the values Death, Disability and Good of 𝑋.

• The company expects to pay each customer, on average:

$$\mathrm{E}[X] = \mathrm{E}[M] = \sum_{m} m \cdot \mathrm{P}(M = m) = \$10{,}000\left(\frac{1}{1000}\right) + \$5000\left(\frac{2}{1000}\right) + \$0\left(\frac{997}{1000}\right) = \$20.$$

𝑋                    Death          Disability     Good
𝑀 (Payout)           10,000         5,000          0
𝑝𝑖 = P[𝑀 = 𝑚𝑖 ]      𝑝1 = 1/1000    𝑝2 = 2/1000    𝑝3 = 997/1000

Expected money amount E[𝑋] = E[𝑀 ] = 𝜇 = $20 per customer.


Variance: The Spread (measure uncertain cost)

• Of course, the expected value $20 will not happen in reality

• There will be variability (uncertainty). Let’s calculate!


• Variance:

$$\mathrm{V}[X] = \sigma^2 = \mathrm{E}[(X-\mu)^2] = \sum_{x_k \in S_X} (x_k - \mu)^2\, p_k$$

$$\mathrm{V}[X] = \sum_{m} (m - \mathrm{E}[M])^2 \cdot \mathrm{P}(M = m) = \mathrm{V}[M]$$

$$\mathrm{V}[M] = 9980^2\left(\frac{1}{1000}\right) + 4980^2\left(\frac{2}{1000}\right) + (-20)^2\left(\frac{997}{1000}\right) = 149{,}600$$

• Standard deviation: $\sigma = \sqrt{\mathrm{V}[M]} = \sqrt{149{,}600} \approx \$386.78$

The company expects to pay out $20 per customer, and so to make $30 (= $50 − $20). However, the standard deviation of $386.78 indicates that the profit is no sure thing: that is a pretty big spread (and risk) for an average profit of $30 per customer.
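The expected payout, variance and standard deviation above are easy to reproduce; a minimal R sketch:

payout <- c(10000, 5000, 0)        # Death, Disability, Good
p <- c(1, 2, 997) / 1000
mu <- sum(payout * p); mu          # expected payout per customer: 20
v <- sum((payout - mu)^2 * p); v   # variance: 149600
sqrt(v)                            # standard deviation: about 386.78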

Problem 1.2.

As batches of water filters at a brewery firm B often contain defective parts, inspectors are required to detect the number of defective parts over several consecutive days. Each day they check one batch only, then count and record the number of defective parts of that batch. After many days they are able to estimate the corresponding probabilities
𝑝𝑥 = P[𝑋 = 𝑥], 𝑥 = 0, 1, 2, · · · , 5. For example, 𝑝5 = 0.2 means that 20% of all batches contain 5 damaged parts. They finally report the status to the manager, as listed in the following table:

𝑋 0 1 2 3 4 5

𝑝𝑥 = P[𝑋 = 𝑥] 0 𝑎 0.3 0.2 0.1 0.2

Compute probability 𝑎 and the expected value 𝜇 (the average number of defective parts in each
shipment).
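HINT: the probabilities must sum to one, so 𝑎 follows immediately; a short R sketch:

p.known <- c(0, 0.3, 0.2, 0.1, 0.2)   # probabilities for x = 0, 2, 3, 4, 5
a <- 1 - sum(p.known); a              # a = 0.2
x <- 0:5
p <- c(0, a, 0.3, 0.2, 0.1, 0.2)
sum(x * p)                            # expected number of defective parts: mu = 2.8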

Problem 1.3.

A discrete random variable 𝑅 indicates four health insurance types 𝑟 = 1, 2, 3, 4 being provided by
an actuarial company D. Each type 𝑟 has a corresponding percentage of P[𝑅 = 𝑟] (out of total cases
provided annually), as shown in the following table:

𝑟 1 2 3 4

P[𝑅 = 𝑟] 0.1 𝑎 0.3 𝑏

You further know that the higher 𝑟 is, the higher the quality of service company D provides. The company’s goal is to achieve the expected value (the average quality level of service) E[𝑅] = 3.
Find the highest quality percentage 𝑏 (associated with level 4).
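HINT: the two conditions (probabilities sum to 1, and E[𝑅] = 3) give two linear equations in 𝑎 and 𝑏; a minimal sketch solving them in R:

# a + b = 1 - 0.1 - 0.3   and   2a + 4b = 3 - 1(0.1) - 3(0.3)
A <- matrix(c(1, 1, 2, 4), nrow = 2, byrow = TRUE)
rhs <- c(0.6, 2)
solve(A, rhs)     # a = 0.2, b = 0.4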

Problem 1.4.

Given that the number of wire-bonding defects per unit 𝑋 is Poisson distributed with parameter 𝜆 = 4, compute the probability that a randomly selected semiconductor device will contain two or fewer wire-bonding defects.
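In R the cumulative Poisson probability is given by ppois(); a one-line check:

ppois(2, lambda = 4)          # P(X <= 2) for Poisson(4), about 0.238
sum(dpois(0:2, lambda = 4))   # the same value, summed term by term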


1.7.4 Self-test

1. Given that P(𝐴) = 0.9, P(𝐵) = 0.8, and P(𝐴 ∩ 𝐵) = 0.75. Find

(a) P(𝐴 ∪ 𝐵); (b) P(𝐴 ∩ 𝐵 𝑐 ); and (c) P(𝐴𝑐 ∩ 𝐵 𝑐 ).

2. Prove Bonferroni inequality: P(𝐴 ∩ 𝐵) ≥ P(𝐴) + P(𝐵) − 1.

3. There are 𝑛 persons in a room.

(a) What is the probability that at least two persons have the same birthday?

(b) Calculate this probability for 𝑛 = 50.

(c) How large need 𝑛 be for this probability to be greater than 0.5?
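HINT for part (a): work with the complement "all 𝑛 birthdays are distinct". A minimal R sketch (ignoring leap years):

p.shared <- function(n) 1 - prod((365 - 0:(n - 1)) / 365)
p.shared(50)                                 # (b) about 0.97
min(which(sapply(1:100, p.shared) > 0.5))    # (c) smallest n with probability > 0.5: 23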

4. A committee of 5 persons is to be selected randomly from a group of 5 men and 10 women. Find

(a) the probability that the committee consists of 2 men and 3 women.

(b) the probability that the committee consists of all women.

5. Conditional probability. The conditional probability of an event 𝐴 given event 𝐵, denoted by P(𝐴|𝐵), is defined as

$$\mathrm{P}(A|B) = \frac{\mathrm{P}(A \cap B)}{\mathrm{P}(B)}$$

where P(𝐴 ∩ 𝐵) is the joint probability of 𝐴 and 𝐵. Similarly,

$$\mathrm{P}(B|A) = \frac{\mathrm{P}(B \cap A)}{\mathrm{P}(A)}.$$

Show that P(𝐴|𝐵) just defined satisfies the three axioms of a probability.

6. Two manufacturing plants produce similar parts. Plant 1 produces 1,000 parts, 100 of which are
defective. Plant 2 produces 2,000 parts, 150 of which are defective. A part is selected at random and
found to be defective.

What is the probability that it came from plant 1?

7. Show that for any events 𝐴, 𝐵 in a sample space Ω,

a) if P(𝐴|𝐵) > P(𝐴), then P(𝐵|𝐴) > P(𝐵). b) Show that P(𝐵) = P(𝐵|𝐴) P(𝐴) + P(𝐵|𝐴𝑐 ) P(𝐴𝑐 ).

c) Now suppose that a medical laboratory test to detect a certain disease has the following statistics.
Let 𝐴 = ‘event that the tested person has the disease’, and 𝐵 = ‘event that the test result is positive’.

It is known that
P(𝐵|𝐴) = 0.99 and P(𝐵|𝐴𝑐 ) = 0.005,

and 0.1 percent of the population actually has the disease.

What is the probability that a person has the disease given that the test result is positive?
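HINT: combine Bayes' rule with the total-probability identity of part b). A numerical sketch in R:

pA    <- 0.001    # prevalence P(A)
pBgA  <- 0.99     # P(B|A)
pBgAc <- 0.005    # P(B|A^c)
pB <- pBgA * pA + pBgAc * (1 - pA)   # total probability of a positive test
pBgA * pA / pB                       # P(A|B), about 0.165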


8. (AVIATION) The probability that a regularly scheduled flight departs on time is P(𝐷) = 0.83; the
probability that it arrives on time is P(𝐴) = 0.82; and the probability that it departs and arrives on time
is P(𝐷 ∩ 𝐴) = 0.78.

Find the probability that a plane

(a) arrives on time, given that it departed on time, and

(b) departed on time, given that it has arrived on time.

9. (BIO-MEDICAL Engineering)
A manufacturer of a flu vaccine is concerned about the quality of its flu serum. Batches of serum are
processed by three different departments having rejection rates of 0.10, 0.08, and 0.12, respectively.
The inspections by the three departments are sequential and independent.

• What is the probability that a batch of serum survives the first departmental inspection but is
rejected by the second department?

• What is the probability that a batch of serum is rejected by the third department?

Chapter 2

Statistical Science for Data Analytics


Does data make sense in services?

(Chapter opening image source: [9])

2.1 Overview

Randomness and variability (together called uncertainty) are phenomena that students of engineering, actuarial science, finance and economics face both in daily life and in professional environments. The text introduces a few key techniques and fundamental methodologies, together with their basic formalization, in statistical data analysis for engineers and scientists. These are aimed at both undergraduates and graduates in

• Statistics and Applied Mathematics

• Chemistry, Biology and Life Sciences related subjects

• Environment, Ecology and natural resource management

• Computing and Information Technology

• Agriculture - Industrial production and Logistics

These techniques and foundations help them to understand and efficiently resolve theoretical and practical problems that are random by nature. Imagine, for example, that the chapter-opening image showed the color spectrum of the Gulf of Thailand: would the pattern, or chaos, in the picture suggest or inspire us to choose sample drilling areas while searching for oil or gas?

The definition of the discipline of Statistics

Statistics is the science of problem-solving in


the presence of variability found from observed data sets.

In practice, statistics is a system of methods for collecting, synthesizing and presenting data, and for calculating features of the object under study, for the purposes of analysis, prediction and decision making. We can even give a more general definition:

Statistics is the problem-solving science that allows for change or uncertainty to exist.

Statistics and statistic:


A statistic (no ‘s’ at the end) expresses a quantitative or qualitative feature of the phenomenon or object under study at a specific time and place. In this text we introduce basic statistical concepts and terminology that are fundamental to the use of statistics in experimental work.
These concepts include:

• Probability theory: random variables and probability distributions.

• Descriptive statistics is concerned with summarizing and describing numerically a body of data.


• Inferential statistics, a key part of statistical science: the process of reaching generalizations about the whole (called the population) by examining one portion or many portions of it (called samples).

Hence we can use the following more elaborated and popular definition:

Statistics is a collection of methods used


to collect and clean data,
to visualize and compute various characteristics of the data,
to analyze and infer complex relationships among related factors,
to make decisions and to model explicit and implicit phenomena, and
last but not least to deliver optimal solutions,
all tasks are based on the collected data.

2.1.1 Practical problems require statistical analysis

Problem 2.1 (Environmental Study).

Trace metals in drinking water affect the flavor, and unusually high concentrations can pose a health
hazard. The article “Trace Metals of South Indian River” (Environmental Studies, 1982: 62–66) reports
on a study in which six river locations were selected (six experimental objects) and the zinc concen-
tration (mg/L) determined for both surface water and bottom water at each location. The six pairs of
observations are displayed in the table below.

Location

1 2 3 4 5 6

Zinc concentration in bottom water (𝑥) .430 .266 .567 .531 .707 .716

Zinc concentration in surface water (𝑦) .415 .238 .390 .410 .605 .609

Difference .015 .028 .177 .121 .102 .107

Does true average concentration in bottom water exceed that of surface water?

Problem 2.2 (Flood monitoring and urban management).

A city’s manager wishes to determine whether it is equally likely that floods (traffic jams, dead holes ...) will take place in the major districts of a metropolis like Bangkok or Hanoi. He records the number of floods that happened during one year in Bangkok and obtains the following frequencies:
District 2, 20; District 3, 14; District 4, 18;
District TD, 17; District BC, 22; and District BT, 29.


Do the data indicate that there is a difference with respect to the number of floods that happened in different districts of Bangkok?
A test statistic, the chi-square statistic, will be introduced for analyzing these data, which are measured on a nominal scale.

Problem 2.3 (Finance and Insurance).

Suppose an insurance company 𝐵 has thousands of customers, and each customer is charged $500 a year. Since the customers’ businesses are risky, from past experience the company estimates that about 15% of its customers will get into fatal trouble (e.g. fire, accident ...) and, as a result, will submit a claim in any given year. We assume that the claim will always be $3000 for each customer.

• Model the amount of money that the insurance company believes it obtains from each customer.

• Determine the random variable 𝑆𝑁 representing the total amount of claims that company 𝐵 has to pay to its customers. Compute E[𝑆𝑁 ].

Problem 2.4 (Studying Climate change and Agriculture).

BPH = Brown Plant-hoppers


What factors have most significant impacts on the BPH growth?

1. Longitude (𝑥1 ), Latitude (𝑥2 )

2. Leaf color (𝑥3 , categorical)

3. Number of leaves (ind/m2) (𝑥4 )

4. Seeding density (kg/ha) (𝑥5 )

5. Temperature (C) (𝑥6 )

6. Humidity (%) (𝑥7 )

7. Water level (cm) (𝑥8 )

8. Rice species (𝑥9 , categorical)

9. Grass density (ind/m2) (𝑥10 )

10. Number of buds (ind/m2) (𝑥11 )

The above predictors, with their measured values, could affect BPH growth. A realistic data set from the Mekong Delta follows.


Long.   Lat.    Rice species   Seed. den.   Temp   Humi.   Water level   Leaf color   Grass den.   No. buds   No. leaves   No. BPH

5562 11260 JASMIN 85 15 24 90 6 4 24 962 4329 0

5557 11260 JASMIN 85 21 25 90 0 5 0 1058 4768 0

5563 11259 OM2000 21 26 90 7 5 0 1046 4707 0

5561 11257 OM1490 15 25 90 0 4 0 1070 4815 0

5559 11256 OMCS1490 21 29 75 0 5 48 1050 4725 0

5559 11254 VD20 18 28 82 6 5 0 966 4347 0

5565 11256 VD20 19 25 90 3 5 0 982 4419 0

5565 11254 JASMIN 85 18 26 82 0 4 0 992 4464 0

5566 11253 JASMIN 85 20 27 82 7 5 0 981 4414 0

5564 11251 JASMIN 85 19 27 82 0 4 0 972 4374 0


...

5640 10780 MT1240 18 29 90 6 5 2 528 2112 106

5641 10779 OM1490 16 29 90 6 4 0 608 2432 109

5642 10778 OM1490 16 29 90 5 4 0 604 2416 326

5644 10779 OM1490 12 29 90 5 3 0 512 2048 72

5640 10777 OM3240 16 29 90 8 4 12 576 2304 184

Impact of climate change on BPH outbreak?


Farmers raised the question: how could we control BPH growth?
Statistician’s Aim: Prediction of Brown Plant-hoppers (BPH)?

Our research concerns could possibly be

a/ which factors (predictors) most affect BPH growth/births?

b/ which statistical models predict the number of BPHs well?

c/ if some assumptions of linear models turn out wrong (e.g. the random errors are not i.i.d. normal), what can we do?

2.1.2 Statistics in Engineering and Science

The term scientific suggests a process of objective investigation that ensures that valid conclusions
can be drawn from an experimental study.


(Figure: the BPH data, where the factor of interest 𝑌 is the number of BPHs.)

Scientific investigations are important not only in laboratories of research universities but also in the
engineering laboratories of industrial manufacturers.

Statistical methods are applied in an enormous diversity of problems in:

• Economics (how are the living standards changing?)

• Traffic engineering - based on Structural Health Monitoring

• Environmental Studies (do polluted water sources induce higher cancer rates?). See Practical Problem 2.1 above for data and details.

• Flooding monitoring and urban management - wise decision making based on counting frequency, see Practical Problem 2.2 above.

2.1.3 Roles of statisticians, data scientists or practitioners

The roles of a statistician or quantitative scientist include:

a/ Design the data collection in a way that minimizes bias and confounders and, at the same time, maximizes the information content,

b/ Check the quality of the data after it is collected, and


c/ Analyze the data, using computed statistics and methods that provide insight or knowledge, supporting engineers, industrialists, administrators, or specialist researchers in making decisions.

2.2 Populations, samples and statistics

We will use the key concepts below (see [12]).

• A population or statistical population includes all of the entities of interest in a study. A population
includes all of the entities of common interest, whether they be people, households, machines, or
whatever. A unit is a specific element in a population.

For example, the gathering of all US citizens on January 1, 2010, is a statistical population. Generally this includes many communities (subpopulations), for example, all men between the ages of 19 and 25 living in Illinois, etc.

In this example the population of US citizens as of January 1, 2010, is finite and real; while the population of concrete blocks of fixed sizes that can be produced by a specific production process is infinite and hypothetical (an assumed population).

• A sample is a subset of the population, often randomly chosen and preferably representative of the
population as a whole.

A sample is usually selected from a population with the aim of observing properties/ characteristics
of that population, for gathering information, data and then making statistical decisions related to the
corresponding characteristics.

In addition, we should distinguish the following important pair of terms-


parameters / statistic, and individuals / variables.

1. Parameter and statistic

A parameter is a constant that defines a certain characteristic of the distribution of a random variable
/ observation, of a population or a process.

A statistic or statistical attribute/criterion is a value that can be calculated from sample data (observed or experimental).

In practice, a statistical criterion describes a characteristic of the elements of a population. Each attribute has its expressed values, on the basis of which one divides attributes into two categories:

Qualitative attribute: reflects the type or nature of the unit (such as gender).

Quantitative attribute: a characteristic of the population unit expressed as a number (yield, height of the crop); it can be discrete (finite or countably infinite), or continuous.


2. Individuals and variables


Individuals are the objects described in a set of data. Individuals are sometimes people. When the
objects that we want to study are not people, we often call them cases.

A variable is any characteristic of an individual on which information is obtained in a survey, field


observation or a designed experiment. A variable can take different values for different individuals.
There are two major kinds of variables:
a. Qualitative variables (factors, class variables); these variables classify objects into groups.

• categorical (such as eye color, country of birth); there is no sense of order;


• ordinal (such as income classified as high, medium or low); there is natural order for the values
of the variable.

b. Quantitative variables (measurements and counts)

• continuous (such as heights, weights, temperatures); their values are often real numbers; there
are few repeated values;
• discrete (counts, such as the number of faulty parts in a car, the number of telephone calls to
you per day, etc); their values are usually integers; there may be many repeated values.

2.3 Scientific data and characteristics

Recall the following elaborated and popular definition of statistics:

Statistics is a collection of methods used to collect and clean data, to visualize and compute
various characteristics of the data, to analyze and infer (deduce) complex relationships among
related factors, to make decisions and to model explicit implicit phenomena, and last but not
least to deliver optimal solutions, all tasks are based on the collected data.

You have seen that every task (calculating, analyzing or making decision) is based on observed data.
But what would be typical characteristics of our scientific, technical, economic, financial data?

2.3.1 Characteristics of various data sets- Our approach

• Various data sets come from many sources (from environment, chemistry, bridges, computers, biotech-
nology, food technology...), showing very different chemical and physical characteristics, depending
on where and when they are observed, as well as how to collect samples.

• Our common perception is that data sets are differently structured, could be quantitative or qualitative
... They require a completely different set of expressions, modeling, as well as statistical interpreta-
tions, as a result, methods for making decisions or conclusions are distinct.


• Actuarial, economic and financial data are often susceptible to the effects of abnormalities, and the
graphing method for presenting quantitative and qualitative data will be different.

Our statistical approach used in this text includes:

Describing data First of all, we start from description, using criteria (or sample characteristics) that measure centrality (central tendency), such as the mean, median, and mode. Then we measure the spreading tendency (or dispersion), such as the variance, standard deviation ...

Decoding dataset’s uncertainty The next step is to use the theory of random variables and the corresponding probability distributions to quantify the uncertainty numerically. A fundamental characteristic of the data collected in actuarial science and economics is its size (large or very large) and structural complexity, with correlations among the factors of the underlying process. The estimation of parameters from large sample data is discussed in Part B.

Discover implicit relationships in data Finally, if we want to explain the phenomenon, discuss the effect of factors on the response of the process, find the root of the problem, or make management decisions, we must use more difficult and sophisticated techniques, such as statistical inference and the estimation of population parameters (in Chapters 4 and 5).

2.3.2 Statistically analysis of observed data sets

From a mathematical viewpoint, the statistical analysis of observed data sets should comply with the following steps:

1. removing the raw numerical error; calculating sample characteristics (mean, min, max, deviation) in
Chapter 3; evaluating the value of the population parameters [using both point estimation and interval
estimation], as seen from Chapter 4,

2. testing statistical hypotheses on one or more populations, see Chapter 5,

3. analyzing correlations between factors that influence the outcome of a process (groundwater contamination, river salinity, drought, cash flow of a bank ...),

4. knowing some basic models of your specific application domains; e.g. in Environment: water quality,
the spread of pollutants in the air, solid deposit in liquid.

Technologically, we need measuring equipment to collect data:

In environment/ resource management - water quality indicators, geographic data (GIS), satellite
images (GPS) ...

In medicine/ health care - sensors to record biological indices of the human body,

In industry/ manufacturing - devices and cameras to online (24/24) observe processes,

In economics/ finance - mechanisms to keep track daily changes of selling and buying, stock ex-
changes/ fluctuations... (see more in [14]).


2.3.3 What are various data sources used for?

Engineers or scientists can use field observation data to answer many practical questions, to forecast trends, and to resolve or make decisions quantitatively for a series of problems. These scientific solutions are positively related to the well-being and sustainability of people’s lives, especially in countries where natural resources and manpower have not been properly governed in a scientific way. A complete procedure for using monitoring data is provided in Section 5.2.
First, let us try some practical issues in using field observation data.

Decision making in pharmaceutical engineering A manager of a pharmaceutical company evalu-


ated the effectiveness of an upgraded production line by collecting a random sample 𝑥 consisting of
the weights of 50 pills from population 𝑋 of pills before upgrading, and a random sample 𝑦 recording
the weights of 50 pills from population 𝑌 of pills after upgrading.

Based on these data, he obtained the average weight x = 8.5 mg of pills before the upgrade, and the
average weight y = 7.2 mg of pills after the upgrade.

Given the population standard deviation 𝜎𝑋 = 𝜎𝑌 = 1.8 mg both before and after the upgrade, what
is a 90% confidence interval CI of the population mean difference 𝜇𝑋 − 𝜇𝑌 ? If we select significance
level 𝛼 = 0.05, can you test the following pair of hypotheses

𝐻0 : 𝜇𝑋 − 𝜇𝑌 = 0, 𝐻1 : 𝜇𝑋 > 𝜇𝑌 ?



Part B

DATA EXPLORATION and

STATISTICAL INFERENCE

Chapter 3: Exploratory Data Analysis

Chapter 4: Statistical Parameter Estimation

Chapter 5: Hypothesis Testing for one sample and two samples

Statistical inference, mathematically, is a group of important tools used to estimate and test hypotheses on parameters. In the following sections, further approaches and statistical thinking allow you to decipher the uncertainty of the phenomena in our world.

Part B is designed to develop knowledge and skills for explaining fundamental concepts of esti-
mation and hypothesis testing, and then apply statistical inference into real world problems.

• Chapter 3 presents basic concepts and tools for describing data. We emphasize graphical techniques to explore and summarize variation in observations. This chapter introduces readers to examples in the R software.

• Chapter 4 presents basic inferential methods, including sampling distributions and parameter
estimation.

• In Chapter 5 we discuss how to test of statistical hypotheses for one and two populations.
Chapter 3

Exploratory Data Analysis


Making observed data meaningful

(Chapter opening image source: [56])

3.1 Introduction and Overview

Exploratory Data Analysis (EDA) covers the following parts:

• Fundamental EDA

• Basic plots (table and charts)

• Numerical measures

Measures of Central tendency and Measures of dispersion (variability)

• Graphical visualization

Statistics of the ordered samples

Box-and-whisker plot - Percentile diagram

• Measures of association between two variables: Covariance and Correlation

The goal of this chapter and the next is to make sense out of data by constructing appropriate sum-
mary measures, tables, and graphs. Our purpose here is to take a set of data that at first glance might
have little meaning and to present the data in a form that makes sense to people.

There are many ways to do this, tools used most often are:

1. a variety of graphs, including bar and pie charts, histograms, scatter and time series plots;

2. numerical summary measures such as counts, percentages, averages, variability; and

3. tables of summarized measures such as totals, averages, and counts, grouped by categories.

3.1.1 What is Exploratory Data Analysis?

Statistical tools and ideas help us examine data in order to describe their main features. This
examination is called exploratory data analysis.
Like an explorer crossing unknown lands, we want first to simply describe what we see.
Hence, EDA is also called Descriptive Statistics.

Here are two basic strategies that help us organize our exploration of a data:

• Begin by examining each variable by itself. Then move on to study the relationships among the
variables.

• Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.


Classification of variables

Let us recall terms useful for this chapter and subsequent ones.

• A variable is any characteristic of an individual on which information is obtained in a survey, field


observation or a designed experiment.

• A variable can take different values for different individuals.

Two major kinds of variables are:

a. Qualitative variables (categorical, factors, class variables)

b. Quantitative variables (measurements and counts)

1. Qualitative variables- classify objects into groups.

• categorical: there is no sense of order; such as nations, blood types...


• ordinal: there is natural order for the values of the variable; such as average income classified
as high, medium or low.

2. Quantitative variables: measurements with values in the reals R and counts in the naturals N; see Section 3.4 onwards.

Graphs for categorical variables

The values of a categorical variable are labels for the categories, such as
‘female’ and ‘male’ in biology; ‘sell’ or ‘buy’ in stock market;
‘nations’ in the world, ‘car producers’ in Thailand,
‘investment for industry’ or ‘investment for agriculture’ in macroeconomics...

• Distribution. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall in each category; it measures how frequently the categories occur in a process or sample.

• Frequency distribution. In any sample data 𝑥 of size 𝑛, the number of observations 𝑛𝐴 of a particular value 𝐴 is its absolute frequency.
The heights of the bars in a histogram of a frequency distribution show the counts of the categories.

• Relative frequency distribution. The relative frequency of 𝐴 is 𝑛𝐴 /𝑛. The heights of the bars in a histogram of a relative frequency distribution show the percents in the categories.

• Description of the values of a categorical variable uses some tabular or graphical structures like
histogram. A histogram is a bar graph of a frequency distribution.
We can draw histogram for frequency distribution or relative frequency distribution. Any statistical
software package will of course make a histogram for you.


Example 3.1. (Gender distribution in Higher- Education)

An engineering course has 𝑛 = 160 students enrolled, in which there are 60 female students. If
denote 𝑛𝐹 and 𝑛𝑀 for the number of female and male students, 𝑝𝐹 and 𝑝𝑀 for the frequency of
female and male students, respectively then we can make (a table of) frequency distribution and relative
frequency distribution as follows.

Gender                          Female                              Male
Frequency (No. of students)     𝑛𝐹 = 60                             𝑛𝑀 = 𝑛 − 𝑛𝐹
Relative frequency              𝑝𝐹 = 𝑛𝐹 /𝑛 = 60/160 = 3/8           𝑝𝑀 = 𝑛𝑀 /𝑛 = 1 − 𝑝𝐹

Can you draw a histogram for this data? 
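One way to draw it in R (a minimal sketch, with the counts taken from the table above):

counts <- c(Female = 60, Male = 100)
barplot(counts, ylab = "Number of students")                 # frequency version
barplot(counts / sum(counts), ylab = "Relative frequency")   # relative version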

Example 3.2. [Economics - Commerce.]

Making a histogram from raw data sometimes is a long but interesting process, as follows.

Economists of the IMF compare how rich developed countries are in comparison with developing countries via the GDP per capita of 4 typical countries: USA, UK, Mexico and India.
Denote 𝐴, 𝐵, 𝐶, 𝐷 respectively be the monthly average income of citizens at countries USA, UK,
Mexico and India. If IMF make a survey census in 𝑛 months, let 𝑎𝑖 , 𝑏𝑖 , 𝑐𝑖 , 𝑑𝑖 be specific income in
month 𝑖, for 𝑖 = 1, 2, . . . , 𝑛. Write
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

for the average of the sequence 𝑥1 , 𝑥2 , · · · , 𝑥𝑛 .
They can get the answer by using a sample data of size 𝑛 = 10, observations are monthly
income, having been recorded in 10 months at the four countries above in 2016, shown in Table
3.1, where 𝑎, 𝑏, 𝑐 and 𝑑 are yearly GDP per capita of 4 countries.

A histogram of this data is given in Figure 3.1. 

3.1.2 Key statistical concepts for EDA

• Observation:

The collection of information in an experiment, or actual values obtained on variables in an exper-


iment. Response variables are outcomes or observed values of an experiment.

• Data set:

Generally a data set is a rectangular array where the columns contain variables, such as height, gender, and income, and


Table 3.1: GDP in USA, UK, Mexico and India via monthly income

Nation

The US United Kingdom Mexico India

Month 𝐴 𝐵 𝐶 𝐷

1 𝑎1 𝑏1 𝑐1 𝑑1

2 𝑎2 𝑏2 𝑐2 𝑑2
...      ...      ...      ...      ...

10 𝑎10 𝑏10 𝑐10 𝑑10

Yearly income 𝑎 = $46720 𝑏 = $43090 𝑐 = $10210 𝑑 = $1070

Figure 3.1: A popular graphical type of distribution — a bar graph (histogram) comparing the average income in four countries.

the rows contain observations or measurements (the attributes of particular members of the population).

A population (also called a statistical population), from Section 2.2, is a set of elements having one or many common properties.


A sample is a subset of some specific units in the population, often randomly chosen and prefer-
ably representative of the whole population.

A sample is usually selected from a population with the aim of


observing properties/ characteristics of that population,
gathering information, data and analyzing them, then
making statistical decisions related to the corresponding characteristics.

Example 3.3.

• The collection of all rare animals living in Thailand on January 1, 2017 is a statistical population. This
population includes many communities (sub-populations), such as all elephants aged 1-15 years,
living in Chiang Mai National Park and so on.

• Another statistical population: all sets of smart phones that have a defined configuration, and that
can be manufactured under specific conditions by Samsung’s factories (or any other manufacturer
on the world market). 

3.2 Visualize sample data

Reminder on popular numerical sets:


The natural numbers N = {0, 1, 2, 3, . . . , 𝑛, 𝑛 + 1, . . .},
the set of integers is Z = {..., −3, −2, −1, 0, 1, 2, 3, ...},
the rational numbers Q = {𝑚/𝑛 : 𝑚, 𝑛 ∈ Z and 𝑛 ̸= 0},

and the real numbers (rational and irrational numbers) denoted by R.


Now we present a few geometrical patterns to visualize sample data. To describe sample data with other graphical type distributions in detail, we can use enumeration tables and charts.

3.2.1 Enumeration table (frequency table)

We put the data into a table according to a certain rule. An enumeration table usually starts with a header/title and ends with a source/origin.
+ Title: a simple description of the contents of the table.
+ Origin: records the source of the data in the table.

Example 3.4. [Industry- Manufacturing.]


Thailand’s government wants to compare how competitive the car producers are, in order to design macroeconomic policy for automobile manufacturing.
They can get the answer by using sample data of size 𝑛 = 10; the observations are brand names, recorded from 10 producers in 2008, shown in Table 3.2. From this enumeration table we can draw many charts to support their decisions. 

Table 3.2: Market share of car producers in Thailand

Producer   Output (unit: 100 cars/year)   Percent (relative output)
Honda      𝑓1 = 183                       𝑝1 = 183/1000 = 0.183
Toyota     𝑓2 = 100                       𝑝2 = 100/1000 = 0.1 = 10%
...        ...                            ...
Ford       𝑓𝑖                             𝑝𝑖 = 𝑓𝑖 /𝑛
...        ...                            ...
GM         𝑓𝑚 = 106                       𝑝𝑚 = 𝑓𝑚 /𝑛
Sum        𝑛 = 1000                       100%

Source : organization XYZ

3.2.2 Charts - a representation of a variable’s values

Charts are graphs that present statistical information - recorded in variables - in a more graphical and dynamic way, including:

• Bar chart (histogram, for qualitative variable), Pie chart (for qualitative variable), and

• Line chart (time series plot, for both)

Bar chart (discrete histogram) of qualitative variables- example:


A qualitative variable [with 4 discrete values: USA, UK, Mexico, India] gives the average annual
incomes per capita of the countries in 2016, see data in Table 3.3 and visualization in Fig. 3.1. 
Can we draw more graphic plots? YES.
Pie chart and Time series plot can be made by using info in enumeration table.
Pie chart on contributions of various sectors in an economy, in Figure 3.2(a).


Table 3.3: Annual incomes per capita of four countries

Nation GDP per capita (in 𝑈 𝑆𝐷/1𝑦𝑒𝑎𝑟) Note

USA $46720 developed country

UK $43090

Mexico $10210
...      ...      ...

India $1070 developing country

Source: IMF or WB or CIA

Time series plot: a two-dimensional chart representing a relation between two quantities, usually drawn on two axes: a horizontal axis (𝑡, time ...) and a vertical axis (the quantity that varies with time 𝑡).

Time series plot usage - to predict change. E.g., as seen in Figure 3.2(b): with a continuously monitored industrial productivity data set, we obtain a time series plot, which reflects the variability (fluctuation) and trend of the changing industrial productivity of Bangkok City over many years.

3.2.3 Examining a distribution

Making a statistical graph is not an end in itself. The purpose of the graph is to help us understand the
data. After you make a graph, always ask, “What do I see?” Once you have displayed a distribution,
you can see its important features as follows.

In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern of a distribution by its shape, center, and spread.
An important kind of deviation is an outlier, an individual value that falls outside the overall
pattern.

Summary on variable types

Qualitative (categorical, factors, class variables); these variables classify objects into groups.

• categorical (nominal): such as methods of transport to Salaya campus; there is no sense of


order - no comparison among values;

• ordinal: such as Thai citizen’s income, being classified as high, medium or low; there is a natural
order - there exists comparison for the values of the variable.


(a) Pie chart

(b) Time-series graph

Figure 3.2: Pie-chart and Time series graph

Quantitative (measurements and counts)

• continuous (measurements): such as


heights, weights of human being, temperatures of a room, a city; their values are often real
numbers in set R.

• discrete: counts, such as the numbers of faulty parts in a software, the numbers of students,
of telephone calls etc; their values are usually naturals N or integers Z.

3.3 Software for exploratory data analysis (EDA)

We introduce the most popular software package R.


• Package R: being available and downloaded from the site

CRAN (link at https://cran.r-project.org/).

All R applications mentioned in this subject are contained in a package called mistat, coupled with
the book Modern Industrial Statistics with applications in R, MINITAB, 2nd edition,

by Professors Ron S. Kenett, Shelemyahu Zacks, 2014, Wiley.

Basic steps for using the package R include:

1. Download from the link https://cran.r-project.org/

Click on the Installer icon and run the R installation procedure.

2. Starting and quitting R: Start by double-clicking on the R shortcut.

Type commands in the R console. Symbol # used for comments to human, R doesn’t read.

Exit via the File menu, or the q( ) command.

R prompts you to save an image of your current work space.

3. Running - try in the command console of R interface:

install.packages("mistat", dependencies=TRUE)

# Install mistat package and its dependencies


library(mistat) # load the most useful library

data(CYCLT) # Load specified data set CYCLT, a vector of values

help(CYCLT) # Read the help page about CYCLT

CYCLT # print CYCLT to Console

Package R - basic commands for numbers, and from numbers to vectors


In the command console of R try:

• x = 1:10 # store the first 10 naturals into list x

• y = log(x); # R allows mathematical calculations

• y # To see what is in any vector, simply type its name.

• z = x*y # compute the inner product of two vectors

• objects() # list the objects in the current workspace

• marks = c(53, 67, 81, 25, 72, 40) # store numbers in a vector with c() (combine)

• mean(marks) # Calculate the mean of this data set


Table creation and operations, use data.frame()

• age = c(23, 43, 17, 52, 28, 31, 15, 31)

• insulin <- c(10, 12, 19, 23, 21, 20, 17, 10)

• insudata <- data.frame(age, insulin) # combine the two vectors into a data frame

• hist(age) # draw a histogram of a variable

Managing your work in R


* The R workspace is the current working environment, primarily the user-defined objects (variables,
data-sets, functions).
* On exiting R, you are prompted to save an image of your current work-space. This is automatically
reloaded when R is next started.

Using R for basic plots

Graphs are created automatically by the appropriate R functions, e.g. plot and hist(). A graph may
be resized, be copied and pasted directly into a MS Word document, Excel spreadsheet, or other
applications.

• salaries <- c(19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24)

• x= salaries; length(x);

• hist(x); # draw frequency distribution of data set, see Section 3.1.2

• hist(x, br = 5, col=”lightblue”, border=”pink”)

• hist(x,br = 7, xlim=c(10,30), xlab=”IM Engineer salaries”)

• mean(x); median(x); # what are they?

• var(x); sd(x); summary(x) # what do they provide?, see Sections 3.5 and 3.6

• boxplot(x) # box-and-whisker plot, a technique popularized by the statistician John W. Tukey

• barplot(x)

Package R for basic plots - Summary


Five-number summary

• mean(x); median(x) Produces the mean and the median of the elements in vector x

• sd(x); var(x) Gives the standard deviation and the variance of x


• summary(x) Produces a summary of the elements in vector x

Graphs In the most basic form,

• hist(x) Produces a histogram of the results in x

Or, with certain conditions

• barplot(z) Produces a barplot with one “bar” for every individual component in vector z.

• barplot(table(z)) more visually helpful than just barplot. It gathers together like terms.

3.4 Measure of Central and Spreading Tendency

♣ Practical motivation 2. Salaries of IT engineers

Two data sets, representing monthly salaries of IT engineers in the US:

𝑥 = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325], and

a second set in which the largest salary is replaced by an extreme value: 𝑥* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].

What is the average salary of the US’s IT engineers? What is the salary at which half (50%) of the IT engineers have monthly income smaller than that value?

Mean and median of sample data are concepts providing answers!

3.4.1 Measure of Central Tendency

In any sample 𝑥 = 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 of size 𝑛, we have the measures below.


Central tendency- Mean or the average value

The sample mean 𝑥 describes the central tendency of a sample of size 𝑛:


$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

For instance, the data 𝑥 above

𝑥 = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325],

give its sample mean $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 2940$.

* For the entire population of size 𝑁 , the population mean is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}.$$


Central tendency- Median (see more in [12])


The median 𝑀 is the midpoint of a distribution. Half the observations are smaller than the median
and the other half are larger than the median.

The median 𝑀 is the value in the middle when the data 𝑥1 , · · · , 𝑥𝑛 of size 𝑛 is sorted in ascending
order (smallest to largest).
- If 𝑛 is odd, then the median 𝑀 is the middle value.
- If 𝑛 is even, 𝑀 is the average of the two middle values.

For instance, the median of two data sets 𝑥, 𝑥* above

𝑥* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].

is the same, though their sample means are different:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 2940 \ne \bar{x}^{*}.$$

Indeed, since 𝑛 = 12 is even, the middle two values of data 𝑥* are 2890 and 2920; the median 𝑀 is the average of these values:

$$M = \frac{2890 + 2920}{2} = 2905 = M(x^{*}).$$
Remark: Whenever a data set contains extreme values, the median is often preferred to the mean as a measure of central location.

𝑥* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].

Data 𝑥* contains an extreme value (outlier), $10000, so the new sample mean is

$$\bar{x}^{*} = \frac{\sum_{i=1}^{n} x_i^{*}}{n} \approx 3496 \gg 2940 = \text{the old mean of data } x.$$
But the median is unchanged, reflecting central tendency better:

$$M(x) = M(x^{*}) = \frac{2890 + 2920}{2} = 2905.$$
Mean versus median
The median and mean are the most common measures of the center of a distribution.
The mean and median of a symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed
distribution, the mean is farther out in the long tail than is the median.

Frequency distribution and Mode


In any sample data 𝑥 = 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 of size 𝑛,


the mode of 𝑥 is the value that occurs with greatest frequency.

HOW TO? We know that

• the number of observations 𝑛𝐴 of a particular value 𝐴 is its absolute frequency,

• the relative frequency of 𝐴 is 𝑛𝐴 /𝑛,

• a histogram is a bar graph of a (relative) frequency distribution.

So we just choose a specific value 𝐴 with greatest frequency (or greatest relative frequency) from
the histogram.

Table 3.4: Frequency distribution of 𝑥

Grades Absolute frequency Relative frequency

5 1 0.1

6 4 0.4

7 2 0.2

8 1 0.1

9 2 0.2

Size of data = 𝑛 = 10 relative frequency = 𝑛𝐴 /𝑛

Example 3.5.

Professor 𝐾𝑎𝑛𝑗𝑎𝑛𝑎 received the following grades of her students in the first semester 2017:

𝑥 = [6, 7, 6, 8, 5, 7, 6, 9, 9, 6].

Choose 𝐴 = 6. Hence, the mode of our grade data 𝑥 is 𝑀 𝑜𝑑𝑒 = 6, its absolute frequency is 4, its
relative frequency is 0.4.
R code: hist(x, br = 6, col="blue", border="pink");
draws a histogram of this grade data. 
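The mode can also be read off numerically with table(); a minimal sketch:

x <- c(6, 7, 6, 8, 5, 7, 6, 9, 9, 6)
table(x)                      # absolute frequency of each grade
names(which.max(table(x)))    # the mode: "6"
max(table(x)) / length(x)     # its relative frequency: 0.4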

3.4.2 Measures of spreading tendency

We employ basic measures to describe spreading tendency of data:


a/ Percentiles and b/ Quartiles, a specific case of percentiles.


(Figure: Student’s grades and the data Mode, for Example 3.5)

a) Percentiles provide information about how the data are spread over the interval from the smallest value to the largest value.

Given a sample 𝑥 of increasingly ordered observations, formally we have

Definition 3.1. The 𝑝th percentile, for any 0 < 𝑝 < 1, is a value 𝑚 such that

• 100𝑝 percent of the observations are less than or equal to 𝑚, i.e.

P[𝑋 ≤ 𝑚] = 𝑝;

• and 100(1 − 𝑝) percent of the observations are greater than this value.

• In theory 𝑝 is a real number in (0, 1) (equivalently, the percentile level 100𝑝 lies in [0, 100]); in practice we usually allow 𝑝 to be a rational value.

Example 3.6. [Education.]

Universities frequently report admission test scores in terms of percentiles. Suppose an applicant 𝐾 obtained a raw score 𝑚 = 54 (on a scale of 100) on an admission test.
Would we know his chance to pass the exam in comparison with his friends?
YES, if we know how many percent the value 𝑚 corresponds to on the set of all applicant scores!
If the value 𝑚 = 54 corresponds to, say 75th percentile (of the whole students scores), we know

• that approximately 75% of students scored at most mark of applicant 𝐾, that is 54 marks,

• and approximately 25% of students scored higher than his score. 


b) Quartiles are percentiles obtained when we choose value 𝑝


to be a multiple of 0.25 = 25%.

Often we divide data into four equal parts, each part contains approximately one-fourth, or 25% of the
observations. The division points are called the quartiles, and defined as:

𝑄1 = first quartile, or 25th percentile

𝑄2 = second quartile, or 50th percentile (also the median)

𝑄3 = third quartile, or 75th percentile

For any 𝑝 ∈ (0, 1) the 𝑝th percentile of data 𝑥 is found by R code:


quantile(𝑥, p);

Example 3.7. Our sample data on salaries of IT engineers, given in Practical motivation 2, is divided into four parts:

𝑥 = [2710, 2755, 2850, ‖2880, 2880, 2890, ‖2920, 2940, 2950, ‖3050, 3130, 3325]

↑ Q1 = 2865              ↑ Q2 = 2905              ↑ Q3 = 3000

In summary,

• the first quartile 𝑄1 or 25% is located


such that 25% of the data lie below 𝑄1 = 2865, and 75% = 100 − 25 of the data lie above 𝑄1 ;

• the second quartile 𝑄2 or the median is located


such that half (50%) of the data lie at most 𝑄2 , and the other half of the data lie above 𝑄2 ;

• the Interquartile range is the stretch of data lying between the first quartile 𝑄1 and the third quartile 𝑄3 ; it accounts for the middle 50% of the data. 
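These quartiles can be checked with quantile(); note that R implements several interpolation types, so the default may differ slightly from the hand computation above. A minimal sketch:

x <- c(2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325)
quantile(x, c(0.25, 0.50, 0.75))             # default interpolation (type 7)
quantile(x, c(0.25, 0.50, 0.75), type = 2)   # averaging rule: 2865, 2905, 3000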

3.5 Measure of Dispersion (Variability)

Three measures are

1. Variance and Standard deviation (most important in engineering),

2. Range (easiest to use),

3. Interquartile Range.


3.5.1 Measure of Dispersion- Variance and Standard deviation

Definition 3.2. The variance of a data set is a measure of variability that utilizes all the data.

We denote 𝑠2 = V[𝑥] for the sample variance of data 𝑥, and $s = \sqrt{s^2}$ for its standard deviation.

• The sample variance of a data 𝑥 = [𝑥1 , · · · , 𝑥𝑛 ] of size 𝑛 is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad (3.1)$$

or

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}{n-1};$$

where the sample mean is

$$\bar{x} := \bar{x}_n = (x_1 + \ldots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (3.2)$$

• The sample standard deviation is $s = \sqrt{s^2}$.

• The population variance of a population of size 𝑁 , with population mean 𝜇, is

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.$$
Definition 3.3.

Coefficient of Variation 𝐶𝑉 measures relative dispersion, i.e. compares how large the standard deviation is relative to the mean:

$$CV = \left(\frac{\sigma}{\mu}\right) \times 100\,\% \quad \text{for populations}$$

and

$$CV = \left(\frac{\sigma_x}{\mu_x}\right) \times 100\,\% \quad \text{for samples } x.$$

♣ Practical motivation 3. Retail Business and Suppliers (CFM)

You are the purchasing agent of Maximart in SG; you regularly place orders with two distinct suppliers of good, luxurious ceramic, say
Minh Long ceramic, denoted M, and another foreign brand, denoted F.
After several months of operation, you obtained the following DATA, given in Table 3.5, represented in terms of frequency distributions of the times the two suppliers met Maximart’s requests.


Question 1.

Can we use data analytics to make the business decision of which supplier - 𝑀 or 𝐹 - Maximart should go along with in the long run?
First observations:
a) the 7- or 8-day deliveries shown for the foreign brand F are viewed as favorable, meanwhile
b) the slow 12- to 15-day deliveries of the foreign brand F could be disastrous in terms of keeping your business running smoothly, e.g. keeping the workforce busy, or selling big during peak seasons ...
Data set of the two suppliers M and F is given in

Table 3.5: Frequency distributions of the days needed to deliver products

Working days 𝑘     Supplier M (times met 𝑘 days)     Supplier F (times met 𝑘 days)
𝑥1 = 7             0                                 𝑣1 = 2
𝑥2 = 8             0                                 𝑣2 = 1
𝑥3 = 9             𝑤3 = 1                            0
𝑥4 = 10            𝑤4 = 5                            3
𝑥5 = 11            𝑤5 = 4                            1
𝑥6 = 12            0                                 1
𝑥7 = 13            0                                 1
𝑥8 = 14            0                                 𝑣8 = 0
𝑥9 = 15            0                                 𝑣9 = 1

Steps of analysis when Dispersion is Max - Mean


Data explanation:

𝑥𝑖 = 𝑘 = Number of working days that suppliers need to deliver products

𝑤𝑖 = Number of times (frequency) that supplier M met 𝑘 days

𝑣𝑖 = Number of times (frequency) that supplier F met 𝑘 days

We will show that although the sample means of Minh Long (M) and F are the same,

• Minh Long ceramic has smaller dispersion than the brand F.


• Conclusion: the degree of reliability of Minh Long is higher...

———————————–

GUIDANCE for solving.

Q: Will the sample means of M and F be the same?


You find that the mean number of days required to fill orders is 𝑥 = 10.3 days for both suppliers. Why? Since

$$\bar{x} = \frac{\text{sum of delivering days}}{\text{total times of delivering}} = \frac{\sum_i w_i x_i}{\sum_i w_i}.$$

Here 𝑤𝑖 is nothing other than the frequency of the value 𝑘 = 𝑥𝑖 ; and $\sum_i w_i = \sum_i v_i = 10$.

In total, Minh Long ceramic provides its sum of delivery days

$$\sum_i w_i x_i = 1\cdot 9 + 5\cdot 10 + 4\cdot 11 = 9 + 50 + 44 = 103 \ \text{days}$$

and the foreign brand F got

$$\sum_i v_i x_i = 2\cdot 7 + 1\cdot 8 + 3\cdot 10 + 1\cdot(11 + 12 + 13 + 15) = 14 + 8 + 30 + 51 = 103 \ \text{days}.$$

Obviously $\bar{x}_M = \bar{x}_F = 10.3$ days.

• Do the two suppliers M and F demonstrate the same degree of reliability in terms of making deliveries
on schedule?

• Which supplier would you prefer?

With our data, if we use sample standard deviations then

$$s_M = \sqrt{\mathrm{Var}[M]} = \sqrt{0.45} \approx 0.67; \qquad s_F = \sqrt{\mathrm{Var}[F]} = \sqrt{6.67} \approx 2.58;$$

so the coefficients of variation of the Suppliers M and F are

$$CV_M = \left(\frac{\sigma_M}{\mu_M}\right) \times 100\,\% = \frac{0.67}{10.3} \times 100\% \approx 6.5\%$$

$$CV_F = \left(\frac{\sigma_F}{\mu_F}\right) \times 100\,\% = \frac{2.58}{10.3} \times 100\% \approx 25\%.$$
• Although Minh Long ceramic and the foreign brand F have the same mean x 𝑀 = x 𝐹 = 10.3,

• but F has dispersion 𝑠𝐹 = 2.58 > 0.67 = 𝑠𝑀 , M’s dispersion.

• In that sense, the foreign brand F has large dispersion, so less reliable
(than Minh Long firm) in terms of making deliveries on schedule!

Hence Supplier F is less reliable than Supplier M with a ratio of almost 4 times! Reliability here
means providing goods on time/day, without too much delay!
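The whole comparison can be reproduced in R from the frequency table, by expanding each supplier's delivery days with rep(); a minimal sketch:

days.M <- rep(c(9, 10, 11), times = c(1, 5, 4))
days.F <- rep(c(7, 8, 10, 11, 12, 13, 15), times = c(2, 1, 3, 1, 1, 1, 1))
mean(days.M); mean(days.F)          # both 10.3
sd(days.M); sd(days.F)              # about 0.67 and 2.58
100 * sd(days.M) / mean(days.M)     # CV of M, about 6.5%
100 * sd(days.F) / mean(days.F)     # CV of F, about 25%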


3.5.2 Dispersion- Range

Range is the largest minus the smallest. That is


Range of the data 𝑥 = [𝑥1 , · · · , 𝑥𝑛 ] is 𝑀 𝑎𝑥(𝑥) − 𝑀 𝑖𝑛(𝑥).

For the salary data of Practical motivation 1

𝑥 = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]

the range of the data is 3325 − 2710 = 615. For the extreme one

𝑥* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000]

the range of the data now is 10000 − 2710 = 7290!

Hence the range is not such a descriptive indicator of dispersion!

3.5.3 Dispersion- Interquartile Range

The Interquartile Range (IQR) is the range of the middle 50% of the data:

IQR = 𝑄3 − 𝑄1 .

This indicator overcomes the dependency on extreme values. E.g., from

[2710, 2755, 2850, ‖2880, 2880, 2890, ‖2920, 2940, 2950, ‖3050, 3130, 3325]

↑ 𝑄1 = 2865              ↑ 𝑄2 = 2905              ↑ 𝑄3 = 3000

the Interquartile Range of the data is 𝑄3 − 𝑄1 = 3000 − 2865 = 135; it does not indicate how far the extreme data values are from the center!

3.5.4 Summary- Critical thinking

The sample standard deviation is used more often since its units (cm., lb.) are the same as those of
the original measurements. In the next section we will discuss some ways of interpreting the sample
standard deviation. Presently we remark only that data sets with greater dispersion about the mean will
have larger standard deviations. The sample standard deviation and sample mean provide information
on the variability and central tendency of observation.

1. Why do we square the deviations 𝑥𝑖 − x ?

2. Why do we emphasize the standard deviation 𝑠 rather than the sample variance 𝑠2 ? ANS: 𝑠 is a natural measure of spread for Normal distributions, as we will learn in later chapters.

3. Why do we average by dividing by 𝑛 − 1 rather than 𝑛 in calculating the variance? Read the book for the answer.


3.6 Visualization for Exploratory Data Analysis

In this section we learn some common graphical techniques available today for exploratory data analysis. These techniques include the box plot and the quantile plot. We also discuss the sensitivity of the sample mean and sample standard deviation to abnormal observations (outliers), as well as robust statistics.

3.6.1 Statistics of the ordered samples

In general, statistics are calculated from an observed sample and used to infer the characteristics of
the population containing that sample.
We now identify several characteristic values of a sample of observations that are arranged in increasing order. These sample characteristics are called order statistics. Statistics that do not require arranging the observed values will be discussed later.
Denote by 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 a series of observed values according to a random sampling procedure.
For example, consider the following 10 values of the cutting resistance of stainless steel welds [lb /
weld]:
2385, 2400, 2285, 2765, 2410, 2360, 2750, 2200, 2500, 2550.

What can we do to describe the change and location of these values?


The first step is to sort the sample values in ascending order, i.e, rewrite the list of sample values,
such as
2200, 2285, 2360, 2385, 2400, 2410, 2500, 2550, 2750, 2765.

The sorted values of these observations 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 are denoted by

𝑋(1) < 𝑋(2) < · · · < 𝑋(𝑖) < 𝑋(𝑖+1) < · · · < 𝑋(𝑛) .

We call 𝑋(𝑖) the 𝑖-th order statistic of the sample.

E.g. 𝑋(1) = 2200 is the smallest value, 𝑋(2) = 2285 the second smallest, ..., 𝑋(10) = 2765 the largest value.

Next we identify some characteristic values that depend on the ordered (sequential) statistics, namely:
sample minimum and sample maximum values, sample range, sample median, and sample quartiles.

• Sample min is 𝑋(1) ,

• Sample max is 𝑋(𝑛) .

• Sample range is 𝑅 = 𝑋(𝑛) − 𝑋(1) .


• We can also represent the average of two consecutive statistics by

𝑋(𝑖.5) = [𝑋(𝑖) + 𝑋(𝑖+1) ]/2.

E.g. 𝑋(2.5) = [𝑋(2) + 𝑋(3) ]/2.

• The “middle” value in the ordered pattern is called sample median, computed as

𝑀𝑒 = 𝑋(𝑚) , 𝑚 = (𝑛 + 1)/2.

When 𝑛 is even, 𝑀𝑒 is the mean of the two middle values, according to the formula for 𝑋(𝑖.5) above.

The median characterizes the center (midpoint) of the dispersion of the sample values, and is therefore called a statistic for central tendency, or a location statistic. About 50% of the sample values are less than the median.
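In R these order statistics are immediate; a small sketch on the weld data above (base functions only):

x <- c(2385, 2400, 2285, 2765, 2410, 2360, 2750, 2200, 2500, 2550)
y <- sort(x)        # the order statistics X(1) <= ... <= X(10)
y[1]; y[10]         # sample min 2200 and sample max 2765
diff(range(x))      # sample range R = X(10) - X(1) = 565
median(x)           # sample median = (X(5) + X(6))/2 = 2405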

3.6.2 Box-plot- definition and drawing

A) The Five-Number Summary.


The five-number summary indicates the centre and the spread of the sample. It divides the ordered
sample 𝑥1 < 𝑥2 < · · · < 𝑥𝑛 into four sections; the five numbers are the borders of the sections.

The lengths of the sections tell us about the spread of the sample. The five numbers are:

• The lowest or minimum value, Min = 𝑥1 ;

• Lower Quartile denoted by Q1 , which ‘cuts off’ a quarter of the ordered data;

• Median, Med, also denoted by Q2, is the number such that half of the values are above it and half
are below it.
If there are an odd number of values in the data set, the median is simply the middle value in the
ordered list.
If there is an even number of values, the median is the average of the middle two values.

• Upper Quartile denoted by Q3, which ‘cuts off’ three quarters of the ordered data;

• The highest or maximum value, Max = 𝑥𝑛 .

B) The boxplot
is a graphical technique, popularized by the statistician John W. Tukey, that displays the distribution of variables. It helps us see the location, spread, tail length and outlying points or outliers.

• An extreme value is considered an outlier if it lies outside the range [𝑄1, 𝑄3] (greater than Q3 or less than Q1) by more than 1.5 times the interquartile range 𝐼𝑄 = Q3 − Q1.


• The boxplot is a graphical representation of the Five Number Summary, and is particularly useful in
comparing different batches.

Construction of the Boxplot:


1. Draw a box with borders (edges) at Q1 and Q3 (i.e., 50% of the data are in this box).
2. Draw the median as a solid line (|) and the mean as a dotted line (—).
3. Show extreme values-outliers as “∘” depending on whether they are outside of

[𝑄1 − 1.5 𝐼𝑄, 𝑄3 + 1.5 𝐼𝑄]

where 𝐼𝑄 is the interquartile. Label them if possible.

C) Draw a box-plot
Use R to draw a box plot with data 19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24:

• salaries <- c(19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24)

• x <- salaries

• boxplot(x)   # draw the box plot of the salary data

• barplot(x)   # compare: a bar plot of the raw values

• y <- c(1, 2, 2, 2, 3, 1, 1, 4, 4, 1, 3, 2, 5); barplot(table(y))

• barplot(table(y), main="Television Survey", xlab="Channel", ylab="Votes")


3.6.3 Extra indicators for the shape of a distribution of observations

Additional information pertaining to the shape of a distribution of observations is derived from the
sample skewness and sample kurtosis.

 CONCEPT 4. The sample skewness is defined as the index


𝛽3 = (1/𝑛) ∑𝑖=1..𝑛 (𝑋𝑖 − X̄)³ / 𝑆³ ,   (3.3)

and the sample kurtosis (steepness) is defined as

𝛽4 = (1/𝑛) ∑𝑖=1..𝑛 (𝑋𝑖 − X̄)⁴ / 𝑆⁴ .   (3.4)

Figure 3.3: Symmetric and asymmetric distributions.

If a distribution is symmetric (around its mean), then skewness = 0.


If skewness > 0, we say that the distribution is positively skewed or skewed to the right.
If skewness < 0, then the distribution is negatively skewed or skewed to the left.
We should also comment that in distributions which are positively skewed X̄ > 𝑀𝑒 (the median), while in those which are negatively skewed X̄ < 𝑀𝑒. In symmetric distributions X̄ = 𝑀𝑒.
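A minimal R sketch implementing Eqs. (3.3)-(3.4) directly (the function name skew.kurt is ours; add-on packages such as e1071 offer variants with slightly different denominators):

skew.kurt <- function(x) {
  n <- length(x); m <- mean(x); s <- sd(x)
  b3 <- sum((x - m)^3) / (n * s^3)    # sample skewness, Eq. (3.3)
  b4 <- sum((x - m)^4) / (n * s^4)    # sample kurtosis, Eq. (3.4)
  c(skewness = b3, kurtosis = b4)
}
skew.kurt(rnorm(10000))   # roughly 0 and 3 for normal data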

3.6.4 Quantile plots

The quantile plot is a plot of the sample quantiles 𝑥𝑝 against 𝑝, 0 < 𝑝 < 1 and 𝑥𝑝 = 𝑋(𝑝(𝑛+1)) .

In Figure 3.5 we see the quantile plot of the log yarn-strength. From such a plot one can obtain
graphical estimates of the quantiles of the distribution. For example, from Figure 3.5 we immediately


Figure 3.4: Normal, steep, and flat distribution.

Figure 3.5: Quantile plot of log yarn-strength data.

obtain the estimate 2.8 for the median, 2.23 for the first quartile 𝑄1 and 3.58 for the third quartile 𝑄3 .
These are close to the values presented earlier.
We see also in Figure 3.5 that the maximal point of this data set is an outlier.


Figure 3.6: Box whiskers plot of log yarn-strength data.

3.7 Prediction intervals

When the data 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 represents a sample of observations from some population, we can
use the sample statistics discussed in the previous sections to predict how future measurements will
behave. Of course, our ability to predict accurately depends on the size of the sample.
Prediction using order statistics is very simple and is valid for any type of distribution. Since the
ordered measurements partition the real line R into 𝑛 + 1 subintervals,

(−∞, 𝑋(1) ), (𝑋(1) , 𝑋(2) ), · · · , (𝑋(𝑛) , +∞),

we can predict that 100/(𝑛 + 1)% of all future observations will fall in any one of these subintervals.
Hence 100𝑖/(𝑛 + 1)% of future sample values are expected to be less than the 𝑖-th order statistic 𝑋(𝑖).
It is interesting to note that the sample minimum, 𝑋(1), is not the smallest possible value. Instead we expect one out of every 𝑛 + 1 future measurements to be less than 𝑋(1). Similarly, one out of every 𝑛 + 1 future measurements is expected to be greater than 𝑋(𝑛).
Predicting future measurements using sample skewness and kurtosis is a bit more difficult because
it depends on the type of distribution that the data follow.

Normal data: If the distribution is symmetric (skewness ≈ 0) and somewhat “bell-shaped” or “normal”
(Gaussian, with kurtosis ≈ 3) as in Figure 3.7, for the log yarn strength data, we can make the
following statements:

1. Approximately 68% of all future values will lie within one standard deviation 𝜎 of the mean.

2. Approximately 95% of all future measurements will lie within two 𝜎 of the mean.

3. Approximately 99.7% of all future measurements will lie within three 𝜎 of the mean.


Figure 3.7: Histogram of log yarn strength.


Non-normal data: When the data does not follow a normal distribution, we may use the following
Chebyshev’s Inequality.

Chebyshev’s Inequality.

Type 1: Given a nonnegative random variable 𝑌 and 𝑚 > 0, then P[𝑌 > 𝑚] ≤ E(𝑌 )/𝑚.

Type 2: For any number 𝑘 > 1, the percentage of future measurements within 𝑘 standard deviations of the mean will be at least 100(1 − 1/𝑘²)%:

P[ |𝑌 − E[𝑌 ]| < 𝑘 𝜎 ] ≥ 1 − 1/𝑘² .

ELUCIDATION

• This means that at least 75% of all future measurements will fall within 2 standard deviations (𝑘 = 2). Similarly, at least 89% will fall within 3 standard deviations (𝑘 = 3). These statements are true for any distribution; however, the actual percentages may be considerably larger. Notice that for data which are normally distributed, 95% of the values fall in the interval [X̄ − 2𝑆, X̄ + 2𝑆]. The Chebyshev inequality gives only the lower bound of 75%, and is therefore very conservative.

• Any prediction statements, using the order statistics or the sample mean and standard deviation, can only be made with the understanding that they are based on a sample of data. They are accurate only to the degree that the sample is representative of the entire population.

Figure 3.8: Areas within radius 1, 2 and 3 𝜎 of the mean, computed from the Gauss pdf (symmetric around the mean), when the data approximately follow a Gauss distribution.

When the sample size is small, we cannot be very confident in our prediction. In Section 4.5 we will
discuss theoretical and computerized statistical inference whereby we assign a “confidence level” to
such statements. This confidence level will depend on the sample size.
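As a quick numerical illustration of the prediction bounds above, the following R sketch compares the Chebyshev lower bound with the exact normal percentage for 𝑘 = 2 and 𝑘 = 3 standard deviations:

k <- c(2, 3)
chebyshev <- 1 - 1 / k^2         # at least 75% and 89%, for any distribution
normal <- pnorm(k) - pnorm(-k)   # about 95.4% and 99.7% in the Gauss case
rbind(k, chebyshev, normal)

The gap between the last two rows shows how conservative the Chebyshev bound is when the data are in fact normal.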

3.8 Association Between Two Variables

We now consider the relationship between two variables via the two most important descriptive measures: covariance, which measures the co-movement of two distributions, and correlation. Let us start by looking at the example below.
Practical motivation 4. [Sale trend]
A manager of a sound equipment store wants to determine the relationship between


• the number 𝑥 of weekend television commercials shown, and

• the sales 𝑦 at his store during the following weeks.

Data of size 𝑛 = 10 has been recorded in 10 weeks, shown in Table 3.6.

Table 3.6: Sample data for the sound equipment store

Week   Number of commercials (𝑥)   Sales Volume 𝑦 (×$100s)
 1              2                        50
 2              5                        57
 3              1                        41
 4              3                        54
 5              4                        54
 6              1                        38
 7              5                        63
 8              3                        48
 9              4                        59
10              2                        46

3.8.1 Covariance

Covariance is the first descriptive measure of association between the two variables 𝑋, 𝑌 .

Definition 3.4. For sample data of size 𝑛 with the observations (𝑥, 𝑦) = {(𝑥1 , 𝑦1 ), · · · , (𝑥𝑛 , 𝑦𝑛 )}, the sample covariance is defined as

𝑠𝑥𝑦 = ∑𝑖 (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) / (𝑛 − 1).

In our example we have x̄ = 30/10 = 3 and ȳ = 510/10 = 51, and the sample covariance 𝑠𝑥𝑦 = 99/9 = 11.
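We can verify this in R with the data of Table 3.6 (a sketch; the vector names are ours):

x <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)             # commercials
y <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)   # sales
mean(x); mean(y)   # 3 and 51
cov(x, y)          # sample covariance 99/9 = 11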

Obviously for the entire population, the population covariance is

𝜎𝑥𝑦 = ∑𝑖 (𝑥𝑖 − 𝜇𝑥 )(𝑦𝑖 − 𝜇𝑦 ) / 𝑁.


A positive covariance indicates that 𝑋 and 𝑌 move together in relation to their means.
A negative covariance indicates that they move in opposite directions.

Remark that, relative to the point (x̄, ȳ):
(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) > 0 ⇐⇒ the point (𝑥𝑖 , 𝑦𝑖 ) lies in quadrants 𝐼 & 𝐼𝐼𝐼;
(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) < 0 ⇐⇒ the point (𝑥𝑖 , 𝑦𝑖 ) lies in quadrants 𝐼𝐼 & 𝐼𝑉 .

As a result,

1. 𝑠𝑥𝑦 > 0 indicates a positive linear association (relationship) between 𝑥 and 𝑦

2. 𝑠𝑥𝑦 ≈ 0: 𝑥 and 𝑦 are not linearly associated

3. 𝑠𝑥𝑦 < 0 then 𝑥 and 𝑦 are negatively linearly associated

3.8.2 Correlation

In our example, 𝑠𝑥𝑦 = 99/9 = 11 > 0, indicating a positive linear relationship between the number 𝑥 of television commercials shown and the sales 𝑦 at the multimedia equipment store.

But the value of the covariance depends on the measurement units for 𝑥 and 𝑦. Is there a more precise measure of this relationship?

Correlation coefficient- the second descriptive measure.

𝑟𝑥𝑦 = 𝑠𝑥𝑦 / (𝑠𝑥 𝑠𝑦 )

We get −1 ≤ 𝑟𝑥𝑦 ≤ 1. Moreover, if 𝑥 and 𝑦 are linearly related by the equation 𝑦 = 𝑎 + 𝑏𝑥, then
𝑟𝑥𝑦 = 1 when 𝑏 is positive, and
𝑟𝑥𝑦 = −1 when 𝑏 is negative.

Application of Correlation coefficients. See Assignments.
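Continuing the R sketch above (x, y as in Table 3.6), the correlation coefficient follows either from its definition or from the built-in cor():

cov(x, y) / (sd(x) * sd(y))   # r_xy by its definition
cor(x, y)                     # about 0.93: a strong positive linear association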

3.9 Summary

3.9.1 On sample variance

Given a data set 𝑥 = [𝑥1 , · · · , 𝑥𝑛 ] of size 𝑛, we know that

Var(𝑥) = 𝜎𝑥² = ∑𝑖=1..𝑛 (𝑥𝑖 − x̄)² / (𝑛 − 1) = ( ∑𝑖=1..𝑛 𝑥𝑖² − 𝑛 x̄² ) / (𝑛 − 1)   (3.5)


is the sample variance of 𝑥. Moreover, the sample standard deviation is

𝜎𝑥 = √Var(𝑥).

If we also denote 𝑆² = Var(𝑥) for the sample variance, then 𝑆 = √𝑆² = 𝜎𝑥 is exactly our sample standard deviation. In general, the sample standard deviation 𝑆 is used more frequently than the sample variance 𝑆² because it has the same unit (kg, cm, lb, Newton, ton, hour, ...) as our initial (original) measurements.

3.9.2 Percentiles- Mathematical formula

Our concern now is: Given 𝑝%, where 𝑝 ∈ Q ∩ [1, 100], find the value 𝑚 by locating its position (index)
in the observed sample data 𝑥 of size 𝑛.

Calculating the 𝑝th percentile, in 3 steps:

1. Arrange the data 𝑥 in ascending order to obtain the sorted sample data 𝑦: 𝑦 = 𝑠𝑜𝑟𝑡(𝑥).

2. Compute an index 𝑖

𝑖 = (𝑝/100) 𝑛.

Note that some textbooks use 𝑖 = (𝑝/100)(𝑛 + 1); this does not change the mathematical meaning of the concept of percentile. We do not use this one, since when 𝑝 = 100 the index 𝑖 = 𝑛 + 1 is out of the sample’s index range: 𝑛 + 1 ∉ {1, 2, 3, . . . , 𝑛 − 1, 𝑛}!

3. Locate 𝑚 from 𝑖:

• If 𝑖 is not an integer, round up to the ceiling ⌈𝑖⌉ =: 𝑗 (the smallest integer bigger than 𝑖). Then 𝑚 = 𝑦[𝑗].

Remark 1. As mentioned above, if we use 𝑖 = (𝑝/100)(𝑛 + 1), then when 𝑖 is not an integer we must round down to the floor ⌊𝑖⌋ =: 𝑗 (the biggest integer smaller than 𝑖).

• If 𝑖 is an integer, then 𝑚 = (𝑦[𝑖] + 𝑦[𝑖 + 1])/2.

In particular, when 𝑝 = 100, 𝑖 = 𝑛, we set 𝑚 = 𝑦[𝑛].

Example. Let us find the 75th percentile for the salary data given in Practical motivation 1.

𝑥 = [2710, 2755, 2850, ‖2880, 2880, 2890, ‖2920, 2940, 2950, ‖3050, 3130, 3325]

Q1 ↑ = 2865 Q2 ↑ = 2905 Q3 ↑ = 3000


- Arrange 𝑥 to get 𝑦 = 𝑥 (since the data is sorted already).

- Compute the index 𝑖 = (𝑝/100) 𝑛 = (75/100) · 12 = 9 ∈ N. Then, as we have seen before:

𝑚 = (𝑦[𝑖] + 𝑦[𝑖 + 1])/2 = (𝑦[9] + 𝑦[10])/2 = (2950 + 3050)/2 = 3000 = 𝑄3

* The 50th percentile for the same data is similarly computed as:

𝑖 = (𝑝/100) 𝑛 = (50/100) · 12 = 6 ∈ N

So 𝑚 = (𝑦[6] + 𝑦[7])/2 = (2890 + 2920)/2 = 2905.

* The 85th percentile for the same data is similarly computed as:

𝑖 = (𝑝/100) 𝑛 = (85/100) · 12 = 10.2 ∉ N

Round it up to the ceiling 𝑗 = ⌈𝑖⌉ = 11; then 𝑚 = 𝑦[11] = 3130.
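The three-step rule with 𝑖 = (𝑝/100) 𝑛 can be written as a short R function (a sketch; it intentionally differs from R's built-in quantile(), which implements several other interpolation types):

percentile <- function(x, p) {
  y <- sort(x); n <- length(y)
  if (p == 100) return(y[n])                 # special case p = 100
  i <- (p / 100) * n
  if (i == floor(i)) (y[i] + y[i + 1]) / 2   # i integer: average two values
  else y[ceiling(i)]                         # otherwise round up to j
}
x <- c(2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325)
percentile(x, 75); percentile(x, 50); percentile(x, 85)   # 3000, 2905, 3130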

3.9.3 Summary of Terms

Further details are explained now.

1. Population- Sample- Sample space- statistical experiment.

A sample is a subset of a population. A population is the collection of items under discussion. It may
be finite or infinite; it may be real or hypothetical.

In reality, to understand what a real population is, we conduct a statistical experiment (by observing, surveying, interviewing or measuring some process/phenomenon of interest) and can use the set of all possible outcomes of that statistical experiment as a population. This set of possible outcomes is called the sample space, usually denoted by Ω.

2. Exploring Uni-variate Data.

We examine the collected data, to see any surprising features, before attempting to answer any for-
mal questions. This is the exploratory stage of data analysis. There are two major kinds of variables:

✓ Quantitative Variables (measurements and counts)

• continuous (such as heights, weights, temperatures); their values are often real numbers; there are few repeated values;

• discrete (counts, such as numbers of faulty parts, numbers of telephone calls, etc.); their values are usually integers; there may be many repeated values.

✓ Qualitative Variables (factors, class variables); these variables classify objects into groups.

• categorical (such as methods of transport to College); there is no sense of order;

• ordinal (such as income classified as high, medium or low); there is natural order for the values of
the variable.


3. Locating the Centre of the Data.

Three main measures of centre are:


Mean: the average value of the sample, denoted by x̄ = ∑𝑖=1..𝑛 𝑥𝑖 / 𝑛.

Median: the middle value of the ordered data set 𝑥1 < · · · < 𝑥𝑛 of size 𝑛 (smallest to largest), denoted by Med.

- If 𝑛 is odd, then the median Med is the middle value.

- If 𝑛 is even, then the median Med is the average of the two middle values.

Mode: the value with the highest frequency.

Unimodal or Bimodal. If there is a single prominent peak in a histogram, the shape is called unimodal,
meaning “one mode.” If there are two prominent peaks, the shape is called bimodal, meaning two
modes.

4. Measures of spread- variability- dispersion: standard deviation.

Besides the mean, which represents the center, we need the standard deviation, which represents the spread or variability of the values. Sometimes the variance is given instead of the standard deviation.

The standard deviation is simply the square root of the variance, so once you have one you can easily compute the other.

Computing the standard deviation of a data sample: follow the steps

a. Find the mean 𝜇𝑥 .

b. Find the deviation of each value 𝑥𝑖 from the mean: 𝑥𝑖 − 𝜇𝑥 .

c. Square the deviations and take the sum of the squared deviations

𝑆𝑆 = ∑𝑖=1..𝑛 (𝑥𝑖 − 𝜇𝑥 )²

d. Divide the sum 𝑆𝑆 by the number of values minus 1, i.e. 𝑛 − 1, resulting in the variance 𝑉 𝑎𝑟𝑥 = 𝑆𝑆/(𝑛 − 1).

e. Take the square root of the variance; √𝑉 𝑎𝑟𝑥 is the standard deviation.

Computing the standard deviation of a population: keep the same steps, just replace

a. Find the mean 𝜇;

b. compute 𝑥𝑖 − 𝜇, then

c. take the sum over the whole population: ∑𝑖=1..𝑁 (𝑥𝑖 − 𝜇)² ;

d. Divide the sum by 𝑁 ; e. the same as above.
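The sample version of steps a-e can be written out explicitly in R (a sketch; the function name sd.steps is ours):

sd.steps <- function(x) {
  m <- mean(x)                 # a. the mean
  dev <- x - m                 # b. deviations from the mean
  SS <- sum(dev^2)             # c. sum of squared deviations
  v <- SS / (length(x) - 1)    # d. sample variance: divide by n - 1
  sqrt(v)                      # e. square root = standard deviation
}
x <- c(19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24)
sd.steps(x); sd(x)             # both give the same value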


3.10 Problems

1. Given the list 70, 75, 85, 86, 87 and 85, find the five-number summary

If the list had an additional value of 90 in it, what is the five-number summary?

Can you draw the boxplot of this data list?

2. The following table gives the grades on a test for a class of 40 students.

7 5 6 2 8 7 6 7 3 9 10 4 5 5 4

6 7 4 8 2 3 5 6 7 9 8 2 4 7 9

4 6 7 8 3 6 7 9 10 5

Table 3.7: Raw grades of a class of 40 students

(a) Arrange these grades (raw data set) into an array from the lowest grade to the highest grade.

(b) Construct a table showing class intervals, class midpoints.

(c) Construct a table showing the absolute, relative, and cumulative frequencies for each grade.

(d) Present the data in the form of a histogram, relative-frequency histogram.

3. A firm pays a wage of $4 per hour to its 25 unskilled workers, $6 to its 15 semiskilled workers, and
$8 to its 10 skilled workers. What is

a/ the mean wage, the median and the mode wage paid by this firm?

b/ the weighted average, or weighted mean x̄𝑤 = ∑ 𝑤𝑥 / ∑ 𝑤, wage paid by this firm? Hint: the weights 𝑤 have the same function as the frequency in finding the weighted mean for grouped data.

4. a) Suppose that engineers of Bangkok Insurance Firm (BIF) have observed claims of customers in
23 weeks from January 2015 till November 2016. They recorded the following data sample

𝑦 =(21, 17, 14, 26, 15, 19, 16, 12, 13, 67, 18, 16,
29, 25, 32, 24, 30, 25, 33, 50, 43, 48, 32) [in 1000 USD].

Find the observation error 𝑠 and the standard error of sample mean 𝑠y .

b) Assume that the Quality Assurance department of Singha Beer Cooperation recorded the number
of bad Heineken bottles (sourer, less volume ...) in 12 months of 2016 as in the following sample

𝐷 = 1, 2, 4, 5, 2, 0, 4, 4, 9, 8, 8, 8.

i/ Compute the constant 𝐼𝑄𝑅 defined by 𝑄3 − 𝑄1 and

ii/ the percentage of associated observations in the interval (𝑄1 , 𝑄3 ].



Chapter 4

Statistical Parameter Estimation


Estimating parameters of a population

Figure 4.1: Would we infer useful knowledge behind this beautiful picture?

[Source [56]]

4.1 Overview and motivation

We first outline the topics of statistical parameter estimation.

• Statistical Inference: Fundamental concepts- Statistical parameter estimation

• Point Estimation- Concepts and Methods

• The Law of Large Numbers and Central Limit Theorem

• Confidence Interval of a population mean 𝜇

* When 𝜎 is known

* When 𝜎 is unknown or sample size 𝑛 < 30, see Section 4.7

• Find Sample Size for specified error on the mean

After careful study of Chapter 4, you should be able to

• Describe major blocks of Statistical Inference

• Compute Point Estimation:

Point Estimators of the mean 𝜇 and variance 𝜎 2

• Use the methods of Moment Estimation and Maximum Likelihood Estimation

• Determine the distribution of sample mean by employing

The Law of Large Numbers and Central Limit Theorem

• Construct confidence intervals on the mean of a normal distribution,

using either Gauss (normal) distribution

or Student’s 𝑡 distribution method.

• Compute the Margin of Error from standard deviation and sample size

We next take a look at few problems in various sectors.

Problem 4.1 (Business).

A charter airplane company is asked to carry regular loads of 100 ‘small sporty’ motorbikes. The plane available for this work has a carrying capacity of 5000 kg. Records of the weights of about 1000 such motorbikes, typical of those that might be carried, show that the distribution of motorbike weight has a mean of 45 kg and a standard deviation of 3 kg.
Can the company take the order?


Problem 4.2 (Actuarial Science).

AIA firm, a well known insurance company in the US recorded monthly salaries of 27 new customers
as in the list 𝑥 below:

𝑥 = [6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5, 6.5,

9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9]

with unit of 1000 USD. Assume that the population mean 𝜇 = 8.


Do you believe that the sample mean 𝑥 is a good approximation of 𝜇, the population mean? [con-
stant, but unknown!] Explain.

Problem 4.3 (Tournament).

1. Tom and George are playing in the club golf tournament.


2. Their scores vary as they play the course repeatedly. Tom’s score X has the N(110, 10) distribution, and George’s score Y varies from round to round according to the N(100, 8) distribution.

3. If they play independently, what is the probability that Tom will score lower than George and thus
do better in the tournament?

This chapter’s methods can solve the above problems, using the powerful approach of statistical
inference. Informally, Statistical Inference draws conclusions about a population or process based on
sample data. It also provides a statement, expressed in terms of probability, of how much confidence
we can place in our conclusions.

4.2 Statistical Inference

Statistical inference, in short, is sample-based analysis and conclusion making. That means a process in which we infer, from information contained in a sample, properties of the population (from which the observations are taken).
Through this inference we better understand, and are able to model, the underlying process which generates the data. This is a major objective in statistics. A sample is almost never identical to the population, hence any conclusion in statistical inference is a probabilistic conclusion.
The techniques of statistical inference can be classified into two broad categories:

Parameter Estimation is the process of inferring or estimating a population parameter (its mean,
variance) from the corresponding statistic of a sample drawn from the population.

Hypothesis Testing is the process of determining, on the basis of sample information, whether to
accept or reject a hypothesis or assumption with regard to the value of a parameter.


4.2.1 Fundamental notation and concepts

A population is the collection of items under discussion. It may be finite or infinite; it may be real or
hypothetical. A sample is a subset of a population.
A sample should be chosen to be representative of the population because we usually want to draw
conclusions or inferences about that population based on the sample selected.

Figure 4.2: Data collection and statistical inference: a schematic diagram showing population parameters and their sample estimators, obtained from samples that represent the whole population.

 NOTATION 1.

Formally, a sample data consists of 𝑛 observations

𝑋 = [𝑋𝑖 ]𝑖 := [𝑋𝑖 ]𝑛𝑖=1 = 𝑋1 , 𝑋2 , . . . , 𝑋𝑛

considered as a list of random variables.

• If all 𝑋𝑖 are associated with the same population random variable 𝑋 with a pdf 𝑓 (𝑥), we say the
observations are identical.

• If 𝑋𝑖 are mutually independent then we say the observations are independent. [See Definition 1.2
in Section 1.4.]


Definition 4.1 (Random sample).

A sample data is called a random sample if the sample observations 𝑋 are both identical and
independent, i.e. [𝑋𝑖 ]𝑖 is a list of independently and identically distributed random variables. We
shortly write 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ∼𝑖.𝑖.𝑑. 𝑋.

This concept is applicable for both finite or infinite populations, and where sampling is performed
with replacement (write RSWR).

CONVENTION.

After observing realistic phenomena, our complete data resulting from a random sample 𝑋1 , . . . , 𝑋𝑛
now is written as 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 , considered as a list of numerical values of the 𝑋𝑖 ’s.

4.2.2 Population parameters and their estimator

A population parameter is a characteristic of a population or a process.


A statistical estimator of a population parameter is numerical characteristic that is computed from a
sample of observations, see Figure 4.2.

a/ Specifically, the population mean 𝜇, standard deviation 𝜎, variance 𝜎² ... are symbolized with Greek characters and they refer to “true”, but hidden or unknown, values which we cannot know exactly. These are characteristic parameters or population parameters.

b/ The sample average X̄, the sample standard deviation 𝑆, the variance 𝑆², etc. are represented by Latin characters and they refer to values which we can calculate from a certain sample data.

c/ We use such values to estimate the corresponding true (but unknown) population parameter. Each such value (of X̄, 𝑆, 𝑆², ...) is called a statistical estimator, an estimator, a sample statistic, or just a statistic.

Definition 4.2 (Statistic and its sampling distribution).

Mathematically, a sample statistic or shortly a statistic is defined as any function 𝑇 (𝑋1 , . . . , 𝑋𝑛 ) of


the sample 𝑋1 , . . . , 𝑋𝑛 .
A statistic is a random variable. Its distribution is a sampling distribution.

Sampling and Random sampling.

• Sampling: the act of selectively choosing units from a population to form a sample so that we can
compute statistical estimators and make inferences about a population.


Random sampling: means making a random sample, that is, when observations are identically and independently distributed. Only with random samples are our estimation and inference guaranteed to be mathematically correct! And the process of estimation and inference requires information about sample statistics, as below.

We will work with the following three popular statistics:

1. The sample mean describes the central tendency:

X̄ = 𝑇 (𝑋1 , . . . , 𝑋𝑛 ) = ∑𝑖=1..𝑛 𝑋𝑖 / 𝑛   (4.1)

approximating the population mean 𝜇.

2. The sample variance

𝑆² = 𝑇 (𝑋1 , . . . , 𝑋𝑛 ) = V(𝑋) = ∑𝑖=1..𝑛 (𝑋𝑖 − X̄)² / (𝑛 − 1),   (4.2)

approximates the population variance 𝜎²; and

3. the sample standard deviation

𝑆 = √V(𝑋)   (4.3)

approximates the population standard deviation 𝜎.

The last two describe the variability of data, a process or a system. Based on these popular statistics, we will develop procedures to estimate the parameters of a population or probability distributions, and solve other inference or decision-oriented problems.

4.2.3 Statistical parameter estimation

In general, the parameters of a population are unknown and constant, such as the mean 𝜇 and variance 𝜎². To approximate a population parameter 𝜃 we must have random sample data, and then build a single estimate 𝜃ˆ of it.

Statistical parameter estimation

Statistical parameter estimation is the process of deriving the population parameter values from
the sample statistics, when the sample statistics oscillate (vary) around the actual value of the
parameter.

In science and technology, the three statistics above are very common:

• X̄ varies randomly around the true (but unknown) value 𝜇;


• 𝑆² varies randomly around the true (but unknown) value 𝜎²;

• 𝑆 = √V(𝑋) oscillates around 𝜎.

−∞ − − − − − − − −𝑋 − − − −𝜇 − − − −𝑋 − − − − − −∞

0 − − − − − − − −𝑆 2 − − − − − − − 𝜎 2 − − − −𝑆 2 − − − +∞

0 − − − − − − − −𝑆 − − − − − − − 𝜎 − − − −𝑆 − − − − + ∞

4.3 Point Estimation- Concepts and Methods

In point estimation, a numerical value 𝜃ˆ for each 𝜃 is calculated; meanwhile in interval estimation, a 𝑘 ≥ 1-dimensional region is determined. We describe point estimation here; see Section 4.5 for interval estimation.

4.3.1 Point Estimation - Concepts

Definition 4.3 (Point estimator and Unbiased estimator).

Let 𝜃 be a population parameter (e.g. we can take 𝜃 = 𝜇, 𝜎² or 𝜎, or any population parameter of interest), and make a random sample 𝑀 = 𝑋1 , . . . , 𝑋𝑛 .

• A point estimator 𝜃ˆ of 𝜃 is a statistic 𝑇 (𝑋1 , . . . , 𝑋𝑛 ). It is a random variable, so it has a distribution, see Definition 4.2.2.

Sampling distribution: saying that 𝜃ˆ has a distribution means that it has a range, a pdf, a cdf, an expectation E(𝜃ˆ) and a variance V(𝜃ˆ).

• An estimator 𝜃ˆ of 𝜃 is unbiased if E(𝜃ˆ) = 𝜃, where the expected value is taken over all possible samples.

• When we have obtained realistic values 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 of the random sample 𝑀 , the value 𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) becomes a real number 𝜃ˆ𝑛 := 𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ), called a point estimate of 𝜃.

We take a few examples from specific populations such as: all cars in Thailand, all customers of JP Morgan bank in the USA, all students at Harvard University, all households in New York... In such a population let us choose a certain random sample 𝑀 = 𝑋1 , . . . , 𝑋𝑛 ; then we have the 3 most used point estimators.

• The population mean 𝜃 = 𝜇 has a point estimator

𝜃ˆ = X̄𝑀 = (𝑋1 + . . . + 𝑋𝑛 )/𝑛 = X̄,

and its point estimate is

x̄𝑀 = ( ∑𝑖=1..𝑛 𝑥𝑖 ) / 𝑛 = x̄.

• The population variance 𝜃 = 𝜎² has a point estimator

𝜃ˆ = 𝑆²𝑀 = ∑𝑖=1..𝑛 (𝑋𝑖 − X̄𝑀 )² / (𝑛 − 1),

and its point estimate is

𝑠² = 𝑠²𝑀 = ∑𝑖=1..𝑛 (𝑥𝑖 − x̄)² / (𝑛 − 1).

• The population standard deviation 𝜃 = 𝜎 has a point estimator 𝜃ˆ = 𝑆𝑀 , and the sample standard deviation 𝑠 is its point estimate. We also call

S.E.[𝑥] = 𝑠 = √( ∑𝑖=1..𝑛 (𝑥𝑖 − x̄)² / (𝑛 − 1) )   (4.4)

the observation error.

4.3.2 The Point Estimators of the mean 𝜇 and variance 𝜎 2

Now we fix a random sample 𝑋1 , . . . , 𝑋𝑛 from a population with mean 𝜇 and finite variance 𝜎 2 . We
have the following results on unbiased estimators.

Theorem 4.1 (Two specific unbiased estimators).

a) The sample mean X̄ is an unbiased estimator of the population mean 𝜇, i.e.

E[X̄] = 𝜇.

b) The sample variance 𝑆² = (1/(𝑛 − 1)) ∑𝑖 [𝑋𝑖 − X̄]² is an unbiased estimator of the population variance 𝜎².

PROOF: Item a) uses the definition of a random sample; Item b) holds since E[𝑆²] = 𝜎². [Check this as an exercise!]

−∞ − − − − − − − −X̄ − − − − − 𝜇 − − − − − − − − − −∞

0 − − − − − − − −𝑆² − − − − − 𝜎² − − − − − − − − − − + ∞

Example 4.1. Expectation of the sample mean X̄ [Environmental Science.]

A very small population of SO₂ pollution indices of the Thames River in London consists of the values {2, 4, 6, 8, 10}, unit mg/l, hence 𝑁 = 5. We see the population mean

𝜇 = ( ∑𝑖 𝑥𝑖 ) / 𝑁 = (2 + 4 + 6 + 8 + 10)/5 = 6.

If we take all samples of size 𝑛 = 2, then the number of random samples is C(𝑁, 𝑛) = C(5, 2) = 10, as in the table

Sample  {2,4} {2,6} {2,8} {2,10} {4,6} {4,8} {4,10} {6,8} {6,10} {8,10}
X̄         3     4     5     6      5     6     7      7     8      9

The sample mean X̄ has Range(X̄) = {3, 4, 5, 6, 7, 8, 9} =: 𝑢, and its probability density values 𝑝x̄ are 0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1 =: 𝑣.
Will the expectation of the sample mean satisfy E[X̄] = 𝜇?

E[X̄] = ∑x̄ ∈ Range(X̄) x̄ 𝑝x̄ = 𝑢 · 𝑣 (inner product of two vectors)
= 3 · 0.1 + 4 · 0.1 + · · · + 9 · 0.1 = 6 = 𝜇!
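The whole table can be generated in R, confirming E[X̄] = 𝜇 by enumeration (a sketch using the base function combn()):

pop <- c(2, 4, 6, 8, 10)
xbars <- combn(pop, 2, mean)   # the 10 sample means: 3 4 5 6 5 6 7 7 8 9
mean(xbars)                    # 6, equal to the population mean mu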

4.3.3 Error types of a random sample

Theorem 4.2 (Standard error of a sample mean).

1. More importantly, the variance of the sample mean X̄ is

V[X̄] = 𝜎²/𝑛.

2. The standard error of the sample mean X̄ is

S.E.[X̄] = 𝑆X̄ = √V[X̄] = 𝜎/√𝑛   (4.5)

or with observed data 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 :

S.E.[x̄] = 𝑠x̄ = √V[x̄] = 𝑠/√𝑛.

Proof. Item 1:

V[X̄] = V[(𝑋1 + . . . + 𝑋𝑛 )/𝑛] = (1/𝑛²) [V[𝑋1 ] + · · · + V[𝑋𝑛 ]] = (𝜎² + · · · + 𝜎²)/𝑛² = 𝜎²/𝑛.

Hence, there are a few error types associated with random data 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 .

1. Real errors 𝑒𝑖 = 𝑥𝑖 − 𝜇.

2. Random errors by measuring: 𝑣𝑖 := 𝑥𝑖 − x̄.

3. Observation error (summarized from the random errors)

S.E.(𝑋) = 𝑠 = √( ∑𝑖=1..𝑛 𝑣𝑖² / (𝑛 − 1) ).

4. The standard error of the sample mean

S.E.[x̄] = 𝑠x̄ = √V[x̄] = 𝑠/√𝑛.

Example 4.2. [Actuarial Science.]

AIA firm, a well known insurance company in the US recorded monthly salaries of 27 new customers
as in the list below:

𝑥 = [6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5, 6.5,

9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9]

with unit of 1000 USD. Assume that the population mean 𝜇 = 8 (obtained from a national census).

• Find the sample mean 𝑥, the sample variance 𝑠2 ,

• Find the observation error S.E.(𝑥) = 𝑠 of the sample.

• Compute the standard error S.E.[𝑥] of 𝑋. Your comment?

GUIDANCE for solving. In software R:

> salarydata = c(6.9,7.8,8.9,5.2,7.7,9.6,8.7,6.7,4.8,


8.0,10.1,8.5,6.5,9.2,7.4,6.3,5.6,7.3,8.3,7.2,7.5,
6.1,9.4,5.4,7.6,8.1,7.9);
> x= salarydata; length(x); summary(x);
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.800 6.600 7.600 7.507 8.400 10.100
> s = sd(x); s; var(x)
[1] 1.383398
[1] 1.913789

Hence in real life the sample mean x̄ = $7507, the observation error

S.E.[𝑥] = 𝑠 = $1383

and the standard error of the sample mean

S.E.[x̄] = 𝑠/√𝑛 = $266.


4.3.4 The Method of Moment Estimation

The method of moments dates back at least to Karl Pearson in the late 1800s. It has the virtue of being quite simple to use and almost always yields some sort of estimate when other methods prove intractable.

Definition 4.4.

Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 be a random sample (i.i.d. random variables) from a pmf or pdf 𝑓 (𝑥). For 𝑘 = 1, 2, 3, . . . then

• the 𝑘-th population moment, or 𝑘-th moment of the distribution 𝑓 (𝑥), is 𝜇𝑘 = E[𝑋^𝑘 ];

• the 𝑘-th sample moment is

𝑀𝑘 = (𝑋1^𝑘 + . . . + 𝑋𝑛^𝑘 )/𝑛 = ( ∑𝑖=1..𝑛 𝑋𝑖^𝑘 ) / 𝑛.   (4.6)

When 𝑘 = 1: The first population moment is E[𝑋] = 𝜇 and the first sample moment is X̄ = ∑𝑖=1..𝑛 𝑋𝑖 /𝑛.

When 𝑘 = 2: The second population and sample moments respectively are E[𝑋²] and ∑𝑖=1..𝑛 𝑋𝑖² /𝑛.

For 𝑘 ≥ 2, the law of large numbers generalizes to moments of order 𝑘 > 1, giving the strong law of large numbers (SLLN).

One parameter: The sample 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 has pdf 𝑓 (𝑥; 𝜃), where 𝜃 is the sole parameter whose value is unknown. The moment estimator 𝜃ˆ is obtained by equating 𝑀1 to 𝜇1 , which is a function of 𝜃, and solving for 𝜃.

Many parameters: Generally, if 𝑓 (𝑥; 𝜃) depends on parameters 𝜃 = (𝜃1 , · · · , 𝜃𝑘 ), we set up 𝑘 equations

𝑀1 = 𝜇1 (𝜃1 , · · · , 𝜃𝑘 )
𝑀2 = 𝜇2 (𝜃1 , · · · , 𝜃𝑘 )
...
𝑀𝑘 = 𝜇𝑘 (𝜃1 , · · · , 𝜃𝑘 )   (4.7)

to estimate/solve for 𝜃1 , · · · , 𝜃𝑘 . The RHS depends on the parameters 𝜃, and the LHS can be computed from the data 𝑥1 , 𝑥2 , · · · , 𝑥𝑛 .

Example 4.3 (Bernoulli).


Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 ∼ 𝑋 = B(𝑝), the Bernoulli distribution, depending on one parameter 𝜃 = 𝑝. Then

𝜇1 (𝜃) = E[𝑋] = 𝑝

and 𝑀1 = X̄𝑛 . The estimator is obtained by equating 𝑀1 = 𝜇1 (𝜃) ⇔ X̄𝑛 = 𝑝.

The moment estimator of 𝑝 is then 𝑝ˆ = X̄𝑛 = ( ∑𝑖 𝑋𝑖 )/𝑛 

Example 4.4 (Exponential).

Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 represent a random sample of service times of 𝑛 customers at a certain facility, where the underlying distribution is assumed exponential with parameter 𝜆.

Since there is only one parameter to be estimated, the estimator is obtained by equating 𝑀1 = X̄ to 𝜇1 = 𝜇 = E[𝑋] = 1/𝜆.

The moment estimator is then 𝜆ˆ = 1/X̄.
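A minimal R sketch of this moment estimator on simulated service times (the true rate 2 is an assumption made just for illustration):

set.seed(1)
x <- rexp(500, rate = 2)     # simulated sample, true lambda = 2
lambda.hat <- 1 / mean(x)    # equate M1 = x-bar to mu = 1/lambda
lambda.hat                   # close to 2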

Linear combinations of random variables

Consider a function 𝑊 = 𝑔(𝑋, 𝑌 ) for a pair of random variables 𝑋, 𝑌 .

Linearity When 𝑊 = 𝑋 + 𝑌 , for any 𝑋, 𝑌 we get

E(𝑋 + 𝑌 ) = E[𝑋] + E[𝑌 ].

Independence When 𝑊 = 𝑔(𝑋, 𝑌 ) = 𝑋 𝑌 then we have

E(𝑋𝑌 ) = E[𝑋] E[𝑌 ]

only for a pair of independent variables 𝑋, 𝑌 .

Now let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be random variables having a joint distribution, with joint pdf 𝑓 (𝑥1 , ..., 𝑥𝑛 ) := 𝑓𝑋1 ,...,𝑋𝑛 (𝑥1 , ..., 𝑥𝑛 ), with means 𝜇𝑖 = E(𝑋𝑖 ) and variances 𝜎𝑖² = V(𝑋𝑖 ). Let 𝛼1 , 𝛼2 , . . . , 𝛼𝑛 be given constants. Then

𝑊 = ∑𝑖=1..𝑛 𝛼𝑖 𝑋𝑖

is a linear combination of the 𝑋’s. We discuss in the present section only the formulae for the expected value and variance of 𝑊 .

• Linearity in the generic case: It is straightforward to show that

E(𝑊 ) = ∑𝑖=1..𝑛 𝛼𝑖 E(𝑋𝑖 ) = ∑𝑖=1..𝑛 𝛼𝑖 𝜇𝑖 .   (4.8)

That is, the expected value of a linear combination is the same linear combination of the expectations.


• The formula for the variance is given by

V[𝑊 ] = ∑𝑖=1..𝑛 𝛼𝑖² V(𝑋𝑖 ) + ∑𝑖≠𝑗 𝛼𝑖 𝛼𝑗 Cov(𝑋𝑖 , 𝑋𝑗 )   (4.9)

or equivalently

V[𝑊 ] = ∑𝑖=1..𝑛 𝛼𝑖² V(𝑋𝑖 ) + 2 ∑𝑖<𝑗 𝛼𝑖 𝛼𝑗 Cov(𝑋𝑖 , 𝑋𝑗 ).

• If, furthermore, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 make a random sample, i.e. the 𝑋𝑖 ’s are i.i.d., then

Cov(𝑋𝑖 , 𝑋𝑗 ) = 0 if 𝑖 ≠ 𝑗;
Cov(𝑋𝑖 , 𝑋𝑖 ) = 𝜎𝑖² = 𝜎², ∀𝑖 = 1, . . . , 𝑛.

Variance of a linear combination of a random sample

We obtain

V[𝑊 ] = ∑𝑖=1..𝑛 𝛼𝑖² V(𝑋𝑖 ) = ∑𝑖=1..𝑛 𝛼𝑖² 𝜎𝑖² = ( ∑𝑖=1..𝑛 𝛼𝑖² ) 𝜎²;

and so the variance V[X̄] of the sample mean

X̄ = (𝑋1 + 𝑋2 + · · · + 𝑋𝑛 )/𝑛

becomes

V[X̄] = 𝜎²/𝑛.

Problem 4.4.

The drought indexes of 10 provinces in Thailand are denoted by a sample 𝑋1 , 𝑋2 , · · · , 𝑋10 . Find the variance of the sum of these 10 random variables
if each has the same standard deviation 5 and
if each pair has correlation coefficient 0.5.

GUIDANCE for solving. Note that the standard deviations 𝜎𝑖 = 5 = 𝜎, 𝑖 = 1, 2, . . . , 10.

Use
Cov(𝑋𝑖 , 𝑋𝑗 ) = 𝜌(𝑋𝑖 , 𝑋𝑗 ) 𝜎𝑖 𝜎𝑗 = 𝜌(𝑋𝑖 , 𝑋𝑗 ) 𝜎² = 25(1/2) = 12.5.

The sum is 𝑆10 = 10 X̄; then apply formula (4.9).


4.3.5 The Method of Maximum Likelihood Estimation

The main principle of MLE is that the data we observe are associated with some probability distribution function whose parameters are unknown. We then estimate these parameters based on the chance, or likelihood, of seeing the data samples.
Let us consider a family of distributions F = {𝑃𝜃 } indexed by a parameter (possibly a vector of parameters) 𝜃 ∈ Θ. We can think of

• 𝑥 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) as a realization of an i.i.d. sample 𝑋 = (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 );

• 𝑓 (𝑥; 𝜃) as measuring the chance to observe the data 𝑥;

• the likelihood function of observing the data vector 𝑥 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) with respect to 𝜃, given as

𝐿(𝜃) = 𝑓 (𝑥; 𝜃) = 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ; 𝜃) = ∏𝑖=1..𝑛 𝑓 (𝑥𝑖 ; 𝜃).

Definition 4.5.

• The log likelihood function of observing data 𝑥 is

𝐿* (𝜃) = 𝑙𝑛 (𝜃) = log 𝐿(𝜃) = ∑𝑖=1..𝑛 log 𝑓 (𝑥𝑖 ; 𝜃).   (4.10)

• The maximum likelihood estimate (MLE) of the parameter 𝜃 is a numerical value 𝜃ˆ that maximizes the function 𝐿(𝜃), or equivalently the log likelihood function 𝐿* (𝜃) = log 𝐿(𝜃), given the data 𝑥:

𝜃ˆ = Argmax𝜃∈Θ 𝐿(𝜃) = Argmax𝜃∈Θ 𝑙𝑛 (𝜃).   (4.11)

Note that, if 𝑥 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) is just a sample (not i.i.d) from a density 𝑓 (𝑥; 𝜃), the likelihood
𝐿(𝜃) = 𝑓 (𝑥; 𝜃) is called the sample density considered as a function of 𝜃 for fixed sample 𝑥.

1. Write 𝐿* (𝜃) = 𝑙𝑛 (𝜃) to indicate the log-likelihood, a function of both sample 𝑥 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) and
parameter 𝜃.

2. The notation Argmax means that both 𝐿(𝜃) and the log-likelihood 𝐿* (𝜃) achieve their maximum values at 𝜃ˆ, see Figure 4.3.

3. If the pdf 𝑓 (𝑥; 𝜃) depends on parameters 𝜃 = (𝜃1 , 𝜃2 , · · · , 𝜃𝑝 ), then the maximum likelihood estimates

𝜃ˆ1 , 𝜃ˆ2 , · · · , 𝜃ˆ𝑝

are those values of the 𝜃𝑖 ’s that maximize the likelihood function, so that

𝑓 (𝑥; 𝜃ˆ1 , 𝜃ˆ2 , · · · , 𝜃ˆ𝑝 ) ≥ 𝑓 (𝑥; 𝜃1 , 𝜃2 , · · · , 𝜃𝑝 ), for all 𝜃1 , 𝜃2 , · · · , 𝜃𝑝 .


4. When the 𝑋𝑖 ’s are substituted in place of the 𝑥𝑖 ’s in 𝜃ˆ(𝑥), the maximum likelihood estimator 𝜃ˆ(𝑋) results. We write 𝜃ˆ𝑛 for 𝜃ˆ and 𝐿*𝑛 (𝜃) for the log-likelihood 𝐿* (𝜃) if the sample size 𝑛 of the data 𝑥 matters for interpretation.

Figure 4.3: The curves of the likelihood function and its log, plotted against 𝜃.

Knowledge box 1. Calculating the MLE 𝜃ˆ

We employ the fact that the observed data 𝑥 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) is a realization of the i.i.d. sample (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ), with the same pdf 𝑓 (𝑥; 𝜃). Since the 𝑋𝑖 are i.i.d., we can write the likelihood function in the parameter 𝜃 as

𝐿(𝜃) = 𝑓 (𝑥; 𝜃) = 𝑓 (𝑥1 ; 𝜃) · 𝑓 (𝑥2 ; 𝜃) · · · 𝑓 (𝑥𝑛 ; 𝜃).

• To find a numerical value 𝜃ˆ that maximizes the function 𝐿(𝜃), or equivalently the log likelihood 𝑙(𝜃) = log 𝐿(𝜃), use calculus:

* take the derivative of 𝐿* (𝜃) = log 𝐿(𝜃) with respect to the variable 𝜃, and

* solve the equation 𝜕𝑙(𝜃)/𝜕𝜃 = 0.

Example 4.5 (MLE for the Bernoulli distribution).

Suppose that 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 ∼ B(𝑝), the Bernoulli distribution. The probability function, for 𝑥 = 0, 1, is

𝑓 (𝑥; 𝑝) = 𝑝^𝑥 (1 − 𝑝)^(1−𝑥) .

The unknown parameter is 𝜃 = 𝑝. Then,

𝐿(𝑝) = ∏𝑖=1..𝑛 𝑓 (𝑥𝑖 ; 𝑝) = 𝑝^𝑆 (1 − 𝑝)^(𝑛−𝑆)

where 𝑆 = ∑𝑖 𝑥𝑖 . Prove that the MLE of 𝑝 is 𝑝ˆ𝑛 = 𝑆/𝑛. 


GUIDANCE for solving. We find the log-likelihood

𝐿* = log 𝐿(𝑝) = log[𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ; 𝑝)] = log[𝑝^𝑆 (1 − 𝑝)^(𝑛−𝑆) ] = ...,

then find the derivative

(𝑑/𝑑𝑝) log[𝑝^𝑆 (1 − 𝑝)^(𝑛−𝑆) ] = 𝐴 = ...;

and solve 𝐴 = 0 with unknown 𝑝 ∈ [0, 1] to find the MLE 𝑝ˆ.
ANSWER: The MLE of 𝑝 is the proportion of sample successes:

𝑝ˆ = 𝑆/𝑛 = X̄ = ( ∑𝑖 𝑋𝑖 )/𝑛.

EXTRA QUESTION: suppose that we know in advance that, instead of 𝑝 ∈ [0, 1], 𝑝 is restricted by the inequalities 0 ≤ 𝑝 ≤ 𝑐 < 1. Prove that the new MLE is 𝑝ˆ = min{x̄, 𝑐}.

Example 4.6.

A sample of 10 new bike helmets manufactured by a firm is obtained. Upon testing, it is found that
the 1st, 3rd, and 10th helmets are flawed, whereas the others are not.
Let 𝑝 = P[ flawed helmet ] and define 𝑋1 , 𝑋2 , · · · , 𝑋10 by
𝑋𝑖 = 1 if the 𝑖-th helmet is flawed and 𝑋𝑖 = 0 otherwise.
Then the observed 𝑥𝑖 ’s are 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
so the joint pmf of the sample is

𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥10 ; 𝑝) = 𝑝3 (1 − 𝑝)7 .

♣ QUESTION 4.1.

We now ask, “For what value of 𝑝 is the observed sample most likely to have occurred?” That is, we
wish to find the value of 𝑝 that maximizes the above pmf, or equivalently, maximizes its natural log.

GUIDANCE for solving. We have the log-likelihood

𝐿* = log[𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥10 ; 𝑝)] = 3 ln(𝑝) + 7 ln(1 − 𝑝)   (4.12)

and this is a differentiable function of 𝑝; equating the derivative of (4.12) to zero gives the maximizing value:

(𝑑/𝑑𝑝) log[𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥10 ; 𝑝)] = 3/𝑝 − 7/(1 − 𝑝) = 0
=⇒ 𝑝 = 3/10 = 𝑥/𝑛,

where 𝑥 is the observed number of successes (flawed helmets), see Figure 4.3.
The estimate of 𝑝 is now 𝑝ˆ = 3/10. It is called the maximum likelihood estimate because for fixed 𝑥1 , 𝑥2 , · · · , 𝑥10 it is the parameter value that maximizes the likelihood (joint pmf) 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥10 ; 𝑝). 
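The same answer can be found numerically in R, by maximizing the log-likelihood (4.12) over (0, 1) with the base function optimize() (a sketch):

loglik <- function(p) 3 * log(p) + 7 * log(1 - p)   # Eq. (4.12)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
# about 0.3, matching the closed-form MLE p-hat = 3/10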


4.4 The Law of Large Numbers and Central Limit Theorem

4.4.1 Sample sizes for estimating the sample mean

Example 4.7. Public Health

We would like to estimate the mean height 𝜇 of the population of all Thai women between the ages
of 18 and 24 years.

• This 𝜇 is the mean 𝜇𝑋 of the random variable 𝑋 obtained by choosing a young woman at random
and measuring her height.

• To estimate 𝜇, we choose an SRS 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 of young women and use the sample mean X to
estimate the unknown population mean 𝜇.

• In our statistical language, 𝜇 is a parameter and X is a statistic.

Reminder: Statistics obtained from probability samples are random variables because their values
vary in repeated sampling. The sampling distributions of statistics are just the probability distributions
of these random variables.
Why do we choose X to estimate 𝜇? Three reasons:

1. An SRS should fairly represent the population, so the mean X of the sample should be somewhere
near the mean 𝜇 of the population.

2. X is unbiased, and we can control its variability by choosing the sample size, as we saw in Equation
4.5.

3. We know that if we keep on measuring more women, eventually we will estimate the mean height of all young women very accurately.

This key fact is called the law of large numbers. It is remarkable because it holds for any population,
not just for some special class such as Normal distributions.


4.4.2 The law of large numbers

Formulation of the Law of Large Numbers - LLN

• With an i.i.d. sequence 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 , · · · ∼ 𝑋 of random variables, and E(|𝑋|) < ∞,

• the sample mean X̄𝑛 = (𝑋1 + 𝑋2 + · · · + 𝑋𝑛 )/𝑛 satisfies

lim𝑛→∞ X̄𝑛 = 𝜇, in probability.

We could write, for any 𝜀 > 0,

lim𝑛→∞ P[ |X̄𝑛 − 𝜇| > 𝜀 ] = 0.

• We draw independent observations at random from any population with finite mean 𝜇, then decide
how accurately we will estimate 𝜇.

• As the number of observations drawn increases, the mean

X̄ = (𝑋1 + 𝑋2 + · · · + 𝑋𝑛 )/𝑛

of the observed values eventually approaches the mean 𝜇 of the population as closely as you specified, and then stays that close.

Heights of young women: revisited

• The distribution of the heights of all young women is close to the Normal distribution with 𝜇 = 64.5 inches, 𝜎 = 2.5 inches.

• Suppose that 𝜇 = 64.5 were exactly true. The figure below shows the behavior of the mean height X̄𝑛 of 𝑛 women chosen at random from a population whose heights follow the N(64.5, 2.5²) distribution.

• The graph plots the values of X̄ as we add women to our sample.

The first woman drawn had height 64.21 inches, so the line starts there. The second had height 64.35 inches, so for 𝑛 = 2 the mean is x̄ = (𝑥1 + 𝑥2 )/2 = 64.28. This is the second point on the line in the graph.
• Eventually, however, the mean of the observations gets close to the population mean 𝜇 = 64.5 and
settles down at that value. The law of large numbers says that this always happens.
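The behavior described above is easy to reproduce in R (a simulation sketch; the seed and sample size are arbitrary choices):

set.seed(2020)
x <- rnorm(5000, mean = 64.5, sd = 2.5)    # simulated heights, in inches
running.mean <- cumsum(x) / seq_along(x)   # x-bar after 1, 2, ..., n draws
plot(running.mean, type = "l", xlab = "n", ylab = "sample mean")
abline(h = 64.5, lty = 2)                  # the running mean settles near mu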

4.4.3 Central Limit Theorem - Sampling Distribution

The Central Limit Theorem (C.L.T.) informally says that, in selecting random samples of size 𝑛, if 𝑛 is large (≥ 30) the sampling distribution of X̄ can be approximated by a normal distribution, regardless of the distribution of the individual observations. Formally we have the following.


Figure: The law of large numbers in action. As we take more observations, the sample mean always approaches the mean 𝜇 of the population.

Central Limit Theorem - C.L.T.

Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 ∼𝑖.𝑖.𝑑. 𝑋 with the same mean 𝜇 and the same standard deviation 𝜎. (𝑋 in general does not need to be normal.)
But the sample mean follows a normal distribution approximately, i.e. X̄ ∼ N(𝜇, 𝜎²/𝑛) when 𝑛 is large (𝑛 > 30).
As a result, the standardized version of X̄,

𝑍𝑛 = (X̄𝑛 − 𝜇)/(𝜎/√𝑛),

satisfies lim𝑛→∞ 𝑍𝑛 = 𝑍 ∼ N(0, 1) in distribution.
It means the cumulative distribution function 𝐹𝑍𝑛 goes to Φ(𝑧): 𝐹𝑍𝑛 −→ Φ(𝑧), i.e.

lim𝑛→∞ P[𝑍𝑛 ≤ 𝑧] = Φ(𝑧),   (4.13)

where Φ(𝑧) is the cumulative distribution function of N(0, 1).
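A short R simulation illustrates the C.L.T. (a sketch with arbitrary choices of distribution and 𝑛): even though the individual observations are exponential, the sample means look normal.

set.seed(7)
xbars <- replicate(10000, mean(rexp(50, rate = 1)))   # mu = sigma = 1, n = 50
mean(xbars); sd(xbars)    # close to mu = 1 and sigma/sqrt(n) = 1/sqrt(50)
hist(xbars, breaks = 50)  # roughly bell-shaped, although X is not normal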

4.5 Confidence Interval Estimation

Why do we study Confidence Interval?

• In Statistics, a point estimate of a population parameter is a sample statistic used to estimate that
population parameter.


• But a point estimate cannot be expected to provide the exact value of the population parameter.

Mathematically, an interval estimate refers to a range of values together with the probability, called
confidence level, that the interval includes the unknown population parameter.

Problems solved by confidence interval estimation

Problem 4.5 (Traffic engineering).

An engineer wishes to estimate the mean velocity 𝜇 (in km/h) at which motorbikers pass an observation point, given the width of the confidence interval 𝑤 = 3 km/h and a 1 − 𝛼 = 99% confidence level.
If he knows 𝜎 = 1.5 km/h, compute the minimum required 𝑛 (the number of motorbikers we need to observe), given that 𝑛 > 30.

Problem 4.6 (Environmental pollution).

Environmentalists observed the sample

6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5, 6.5, 9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9 [in mg/l]

of 𝑛 = 27 metal concentrations in the Chao Phraya river near a paper factory.
Assuming a confidence level of 95%, give an interval estimate of the population mean 𝜇 of that metal pollutant.

Problem 4.7 (Politics).

A poll of 1,200 voters in Bangkok asked what the most significant issue was in the upcoming election.
Sixty-five percent answered the economy. We are interested in the population proportion of voters
who feel the economy is the most important.
Which probability distribution should you use for this problem?

4.5.1 Components of a confidence interval estimation

What is a Confidence Interval of a parameter?

A confidence interval (or interval estimate) of a (population) parameter 𝜃 is computed by adding


and subtracting a margin of error to the point estimate:

An interval estimate = Point estimate 𝜃ˆ ± Margin of error


A Confidence Interval (CI ) consists of

• Margin of error 𝑅; (also called radius)

• Two statistics (bounds) being determined as


𝐿 = Point estimate − Margin of error,
𝑈 = Point estimate + Margin of error.

𝐿| − − − − − 𝜃 − − − −𝜃ˆ − − − − − − − − − |𝑈

• Confidence probability or coefficient: the bounds 𝐿, 𝑈 surround the possible values of 𝜃 with a certain probability, and this probability is called the confidence coefficient (level).

• Four values 𝐿, 𝑈 , 𝜃ˆ and 𝑅 (the margin of error) all must be computed from sample data.

We conclude that a CI of 𝜃 is:

[𝐿, 𝑈 ] = ( 𝜃ˆ − 𝑅, 𝜃ˆ + 𝑅 ).

What is a Confidence Interval of the population mean?

• Set 𝜃 := 𝜇 the population mean (unknown).


A confidence interval of 𝜇 consists of three components:
two statistics 𝐿, 𝑈 and the confidence level 𝐶𝐿 = 1 − 𝛼.

• The value 𝛼 is the significance level.

• Three components and the concerned parameter satisfy:

P{𝐿 ≤ 𝜇 ≤ 𝑈 } = 1 − 𝛼   (4.14)

𝐿| − − − − − 𝜇 − − 𝜇ˆ − − − − − − − |𝑈

Mathematically, 𝛼 + 𝐶𝐿 = 1.

Example 4.8.

If 𝜇 is the mean of the most productive age of human beings, with the significance level 𝛼 = 0.1, so 1 − 𝛼 = 0.9, 𝐿 = 35, 𝑈 = 45, then the interval

[35, 45] = 35 ≤ 𝜇 ≤ 45

is called a 100(1 − 𝛼)% = 90% confidence interval for the population mean 𝜇.

𝐶𝐿 = P{35 ≤ 𝜇 ≤ 45} = 1 − 𝛼 = 90%.

𝐿 = 35| − − − − − 𝜇ˆ − − − 𝜇 − − − − − |𝑈 = 45


Write (𝐿, 𝑈 ) = (35, 45) = 90% CI of 𝜇. The interval [𝐿, 𝑈 ] = {𝐿 ≤ 𝜇 ≤ 𝑈 } is a 100(1 − 𝛼)% confidence interval for the unknown population mean 𝜇:

[𝐿, 𝑈 ] = 𝜇ˆ ± Margin of error 𝑅(𝛼).

 It means that the confidence level

𝐶𝐿 = P{𝐿 ≤ 𝜇 ≤ 𝑈 } = 1 − 𝛼,

where the bounds are calculated as

𝐿 = 𝜇ˆ − 𝑅(𝛼), 𝑈 = 𝜇ˆ + 𝑅(𝛼),

𝜇ˆ is the center of our CI, and the probability that 𝜇 ∈ [𝐿, 𝑈 ] is 1 − 𝛼.

 Here 𝑅(𝛼) = 𝑈 − 𝜇ˆ measures how large the bounding area of 𝜇 is!

4.5.2 Computing Confidence Interval in R

age <- rnorm(100, mean=42, sd=10); summary(age);

> t.test(age);

One Sample t-test

data: age

t = 44.119, df = 99, p-value < 2.2e-16

95 percent confidence interval: 40.39988 44.20492

sample estimates: mean of x: 42.3024

4.6 Estimation of Population Mean- 𝜎 known case

QUESTION. What would be the mathematical basis for this computation?

We consider estimation of a population mean of a normal distribution when the population variance 𝜎² is known.
* The case of 𝜎² unknown will be treated in Section 4.7.

4.6.1 A problem in Business Intelligence

Consider a customer survey conducted by AIA, an insurance firm in Bangkok. The firm’s quality assurance team uses a customer survey to measure satisfaction of customers.


Summarizing data, how? We rate satisfaction of customers by asking their satisfaction scores, in
range 0..100.

A sample data of 𝑛 = 100 customers are surveyed, a sample mean 𝑥 = 42 of customer satisfaction
is then computed.

𝑥 = satis-score = [48, 55, 35, 31, · · · , 29, 31, 29, 39, 32, 44, 50.]

𝑁 = the number of all customers, and

𝑛 = 100 (the number of customer we asked).

A confidence interval of 𝜇 is
[𝐿, 𝑈 ] = 𝑥 ± 𝑅(𝛼).

How could we determine

- the Margin of Error (or EBM- Error Bound Margin, or just error) 𝐸 = 𝑅(𝛼), with

𝑅(𝛼) = Critical value × S.E.(x̄)?

[where S.E.(𝑀 ) = Std(𝑀 ) is the standard error (deviation) of a quantity 𝑀 ]; and

- the CI [𝐿, 𝑈 ] of the mean 𝜇?

Key accepted result to compute 𝑅(𝛼) when 𝜎 is known

* Assume that random samples are used in the analysis. Then the error (EBM) of the CI [𝐿, 𝑈 ] is

𝑅(𝛼) = 𝑧𝛼/2 · 𝜎/√𝑛,

where 𝑧𝛼/2 = the critical point of the upper tail probability 𝛼/2.

That means, a two-sided 100(1 − 𝛼)% CI of 𝜇 approximately satisfies:

P[ X̄ − 𝑧𝛼/2 · 𝜎/√𝑛 ≤ 𝜇 ≤ X̄ + 𝑧𝛼/2 · 𝜎/√𝑛 ] = 1 − 𝛼

or P[ |𝜇 − x̄| < 𝑧𝛼/2 · 𝜎/√𝑛 ] = 1 − 𝛼. (See Proposition 4.3 for details.)
Hence, the confidence level is the percent of confidence intervals that contain the true population
parameter when repeated samples are taken.

Example 4.9 (from PROBLEM in Business Intelligence).

Assume the population is normal. When 𝛼 = 5% = 0.05, then

𝑧𝛼/2 = 𝑧0.025 = 1.96, 𝑅(𝛼) = 1.96 · 𝜎/√𝑛.
𝑛


Figure 4.4: Critical point of the upper tail probability

Figure 4.5: A CI of 𝜇 with Margin of Error 𝑅(𝛼)


The 95% CI for the mean 𝜇 is

( x̄ − 1.96 · 𝜎/√𝑛 , x̄ + 1.96 · 𝜎/√𝑛 )   (4.15)

equivalently P( x̄ − 1.96 𝜎/√𝑛 < 𝜇 < x̄ + 1.96 𝜎/√𝑛 ) = 0.95.

If we don’t know 𝜎 we use its estimate 𝑠, given by

𝑠² = ∑𝑖=1..𝑛 (𝑥𝑖 − x̄)² / (𝑛 − 1).

In our AIA case, with data 𝑥 = [48, 55, 35, 31, · · · , 29, 31, 29, 39, 32, 44, 50] of size 𝑛 = 100 customers, the sample mean 𝑋 has value x̄ = 42, and assuming 𝜎 = 10, we get

S.E.(X̄) = 𝜎/√𝑛 = 10/√100 = 10/10 = 1.


Figure 4.6: Different samples give distinct CI of 𝜇

Hence, with the confidence level 1 − 𝛼 = 0.95,

𝜇 ∈ [ 𝑥̄ ± 𝑧𝛼/2 · 𝜎/√𝑛 ] = [42 − 1.96, 42 + 1.96] = [40.04, 43.96].
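
This computation is easy to replicate in R; a small sketch under the stated assumptions (𝜎 = 10, 𝑛 = 100, 𝑥̄ = 42):

xbar <- 42; sigma <- 10; n <- 100; alpha <- 0.05
z <- qnorm(1 - alpha/2)          # critical value z_{alpha/2} = 1.96
R <- z * sigma / sqrt(n)         # Margin of Error R(alpha)
c(L = xbar - R, U = xbar + R)    # 95% CI, approximately [40.04, 43.96]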

4.6.2 Use Sampling Distribution of the sample mean 𝑋

Consider a process of selecting random samples (all size 𝑛) from a population.

• The sample mean 𝑋̄ is a random variable, and so 𝑋̄ has

i/ a mean 𝜇𝑋̄ = E[𝑋̄], which takes the specific value 𝑥̄,

ii/ a standard deviation [also named the standard error] 𝑆𝑋̄ = √V[𝑋̄], and

iii/ a probability distribution. [See Theorem 4.1 and Theorem 4.2]

• Since various possible values of 𝑋 are results of distinct random samples 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 , the
probability distribution of 𝑋 is called the Sampling Distribution of 𝑋.

In all cases, 𝑋1, 𝑋2, · · · , 𝑋𝑛 are random (i.i.d.); 𝑋̄ is determined by the mean E[𝑋̄] = 𝜇 and the variance Var[𝑋̄] = 𝜎²/𝑛, so the standard error of 𝑋̄ is

S.E.(𝑋̄) = 𝜎/√𝑛.

With the above observed sample data 𝑥 = [48, 55, 35, 31, · · · , 29, 31, 29, 39, 32, 44, 50], the standard error of the sample mean 𝑥̄ is

𝑠𝑥̄ = 𝜎/√𝑛 ≈ 𝑠/√𝑛.


Here, 𝑥̄ provides a point estimate of 𝜇 for all AIA customers' scores. From the survey we found that the population of scores is normally distributed, with 𝜎 ≈ 𝑠 (the sample standard deviation). So

𝑠𝑥̄ = 𝑠/√𝑛 = 𝑠/√100 = 𝑠/10.

4.6.3 Confidence Interval: Two-sided case

The sampling distribution of 𝑋̄ allows us to make probability statements about how close the sample mean 𝑥̄ is to 𝜇, as described by the two cases in Proposition 4.3.

Proposition 4.3 (Generic cases).

In general, for any significance level 0 < 𝛼 < 1, the Margin of Error is given by

𝑅(𝛼) = 𝑧𝛼/2 · 𝜎/√𝑛  or  𝑅(𝛼) = 𝑧𝛼/2 · 𝑠/√𝑛.

I/ A small-sample confidence interval: When the population is normally distributed and 𝜎 is known, we establish a 100(1 − 𝛼)% CI of 𝜇 as in Eq. 4.15 above:

𝑋̄ − 𝑅(𝛼) < 𝜇 < 𝑋̄ + 𝑅(𝛼)

with the Margin of Error 𝑅(𝛼) = 𝑧𝛼/2 · 𝜎/√𝑛. It also means |𝑋̄ − 𝜇| = |𝜇 − 𝑋̄| < 𝑅(𝛼) = 𝑧𝛼/2 · 𝜎/√𝑛, or

−𝑧𝛼/2 < (𝜇 − 𝑋̄)/(𝜎/√𝑛) < 𝑧𝛼/2,

and the probability of this event is 1 − 𝛼.

Figure 4.7: Two-side testing with Z statistic

II/ A large sample confidence interval:


Assume the population now is either normal or not normal.

When 𝑛 is large, 𝑛 > 40, by the CLT (Theorem 4.4), the standardized quantity

𝑍 = (𝑋̄ − 𝜇)/(𝑆/√𝑛)

has an approximate standard normal distribution. Hence, a 100(1 − 𝛼)% CI of 𝜇 is

𝑥̄ − 𝑅(𝛼) < 𝜇 < 𝑥̄ + 𝑅(𝛼)

with the Margin of Error

𝑒 = 𝑅(𝛼) = 𝑧𝛼/2 · 𝑠/√𝑛.

That is,

𝑥̄ − 𝑧𝛼/2 · 𝑠/√𝑛 ≤ 𝜇 ≤ 𝑥̄ + 𝑧𝛼/2 · 𝑠/√𝑛   (4.16)

with

P[ |𝜇 − 𝑥̄| < 𝑧𝛼/2 · 𝑠/√𝑛 ] = 1 − 𝛼,
where

• 1 − 𝛼 is the confidence level, and

• 𝑧𝛼/2 is the 𝑧 value providing an area 𝛼/2 in the upper tail of the Gauss probability density function 𝑓𝑍.

When 𝑛 ≤ 30 (small sample size) and the population is normal but 𝜎 is unknown, we must replace the Gauss distribution 𝑍 by the Student distribution 𝑇, discussed in Section 4.7.

Table 4.1: Tabulated values of the Laplace function Φ(𝑧)

𝑝 = 1 − 𝛼/2 = Φ(𝑧)   99.5%   97.5%   95%     90%    80%    75%      0.5
𝑧𝑝                    2.58    1.96    1.645   1.28   0.84   0.6745   0

Computing 𝑧𝛼/2 in the Margin of Error

𝑅(𝛼) = 𝑧𝛼/2 · 𝜎/√𝑛

• where a): the critical value 𝑧𝛼/2 provides an area 𝛼/2 in the upper tail of the standard Gauss 𝑓𝑍;

• or equivalently b): we can say this critical value 𝑧𝛼/2 = 𝑞1−𝛼/2, the standard normal quantile that determines an area (1 − 𝛼/2) in the lower tail of 𝑓𝑍.

In interpretation b) we view −𝑧𝛼/2 = 𝑞𝛼/2, 𝑧𝛼/2 = 𝑞1−𝛼/2.
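
In R, interpretation b) maps directly onto qnorm, the standard normal quantile function; a quick sketch (not tied to any particular data set):

alpha <- c(0.10, 0.05, 0.01)
qnorm(1 - alpha/2)   # z_{alpha/2} = q_{1-alpha/2}: 1.645, 1.960, 2.576
qnorm(alpha/2)       # -z_{alpha/2} = q_{alpha/2}: -1.645, -1.960, -2.576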


E.g., as in Table 4.1, when 𝛼 = 0.1, 𝛼/2 = 0.05, 1 − 𝛼 = 0.9 = 90%,

⇒ (1 − 𝛼/2) = 0.95 = 𝑝 = Φ(𝑧) =⇒ 𝑧 = Φ⁻¹(𝑝) = 1.645.

Some well known cases include:

• The 95% confidence level for 𝜇: P( 𝑥̄ − 1.96 𝜎𝑥̄ < 𝜇 < 𝑥̄ + 1.96 𝜎𝑥̄ ) = 0.95.

Similarly, we have two other popular confidence levels:

• Confidence level = 90%: 𝛼 = 10%, 𝛼/2 = 5%, 𝑧 = 𝑧𝛼/2 = 1.645, so that

P( |𝜇 − 𝑥̄| < 1.645 𝜎𝑥̄ ) = 0.90.

• Confidence level = 99%: 𝑧 = 𝑧𝛼/2 = 2.58, so that

P( |𝜇 − 𝑥̄| < 2.58 𝜎𝑥̄ ) = 0.99.

Figure 4.8: Standard normal quantiles split the area under the Gauss density curve

In practice the standard deviation 𝜎 is often unknown. Under such conditions we must modify our
approach, see Section 4.7.

4.6.4 Confidence Interval: One-sided case

• A 100(1 − 𝛼)% upper-confidence bound for 𝜇 is

𝜇 ≤ 𝑈 = 𝑋̄ + 𝑅(𝛼) = 𝑋̄ + 𝑧𝛼 · 𝜎/√𝑛,  with  P[ 𝜇 ≤ 𝑋̄ + 𝑧𝛼 · 𝜎/√𝑛 ] = 1 − 𝛼.   (4.17)


• A 100(1 − 𝛼)% lower-confidence bound for 𝜇 is

𝐿 = 𝑋̄ − 𝑅(𝛼) = 𝑋̄ − 𝑧𝛼 · 𝜎/√𝑛 ≤ 𝜇,  with  P[ 𝜇 ≥ 𝑋̄ − 𝑧𝛼 · 𝜎/√𝑛 ] = 1 − 𝛼.   (4.18)

Table 4.2: Critical values for a one-sided test; 𝛼 is the significance level

𝑝 = 1 − 𝛼 = Φ(𝑧)   99.5%   95%     90%    80%    75%      0.5
𝑧𝑝                  2.58    1.645   1.28   0.84   0.6745   0

With the sample x = 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3, suppose that our population is normally distributed with 𝜎 = 1; find a lower, one-sided 100(1 − 𝛼)% = 95% CI for 𝜇.

Recall that 𝑧𝛼 = 𝑧0.05 = 1.64, 𝑛 = 10, and 𝑥̄ = 64.46; the CI is

𝜇 ≥ 𝑥̄ − 𝑅(𝛼) = 𝑥̄ − 𝑧𝛼 · 𝜎/√𝑛 = 64.46 − 1.64 · 1/√10 = 63.94.
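
A short R check of this lower bound (using the exact quantile qnorm(0.95) ≈ 1.645 rather than the rounded 1.64, so the third decimal may differ slightly):

x <- c(64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3)
sigma <- 1; alpha <- 0.05
mean(x) - qnorm(1 - alpha) * sigma / sqrt(length(x))   # about 63.94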

4.7 Interval Estimation of Population Mean- 𝜎 unknown

If the sample size 𝑛 is large, see Case 1. If 𝑛 is small, we must depart from the normal 𝑍 statistic; see Case 2.

Case 1: See Proposition 4.3.

If the population is arbitrary (non-normal) and the sample size is large, say 𝑛 > 40, the quantity

(𝑋̄ − 𝜇)/(𝑆/√𝑛)

still has an approximate standard normal distribution [by the CLT]. Apply

𝑥̄ − 𝑧𝛼/2 · 𝑠/√𝑛 ≤ 𝜇 ≤ 𝑥̄ + 𝑧𝛼/2 · 𝑠/√𝑛,

where 𝑠 is the sample standard deviation.

Case 2: If the population is normal but 𝑛 is small, we apply the Student 𝑡 distribution.¹

¹ Historical fact: originated by William S. Gosset, who worked for the Guinness brewery in Dublin around 1900 and wrote under the pseudonym Student.

The Student variable 𝑡 with 𝜈 degrees of freedom is given by the probability density function

𝑓(𝑢) = [ Γ((𝜈 + 1)/2) / (√(𝜈𝜋) Γ(𝜈/2)) ] · [ 1 + 𝑢²/𝜈 ]^(−(𝜈+1)/2),  𝑢 ∈ ℝ.

𝑓(𝑢; 𝜈) is even, since 𝑓(−𝑢; 𝜈) = 𝑓(𝑢; 𝜈); because 𝑓 is symmetric, 𝑡1−𝑝 = −𝑡𝑝. For 𝑝 ∈ (0, 1) the 𝑝-percentile satisfies

P(𝑇 ≤ 𝑡𝜈,𝑝) = 𝑝 = P(𝑇 ≤ 𝑡𝑝).


4.7.1 𝑡 distribution- Properties

Let 𝑥 = 𝑥1, 𝑥2, . . . , 𝑥𝑛 be a random sample of size 𝑛, taken from a normal distribution with mean 𝜇. Similarly to using the 𝑍 transform 𝑍 = (𝑋̄ − 𝜇)/(𝜎/√𝑛) to standardize the sample mean 𝑋̄, we use the 𝑇 variable

𝑇 = (𝑋̄ − 𝜇)/(𝑆/√𝑛) ∼ 𝑡𝑛−1   (4.19)

• Thus for a normal sample of size 𝑛, 𝑇 has the Student or 𝑡 distribution with 𝑣 = (𝑛 − 1) degrees of freedom. The sample standard deviation

𝑠 = √( ∑ᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̄)² / (𝑛 − 1) )

is used in place of 𝜎.

• For each 𝑣 ≥ 1, the 𝑡 distribution is symmetrical with zero mean.

When 𝑣 becomes large, the 𝑡 distribution approaches the standard Gauss.

• The 𝑡 distribution enables inferences to be made from samples of any size.

The probability density function of 𝑡[𝜈] is symmetric around 0 (Fig. 4.9). Therefore, the expectation and variance are

E(𝑇) = 0 if 𝑣 > 1;   V(𝑇) = 𝑣/(𝑣 − 2) if 𝑣 > 2.   (4.20)
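
The approach of the 𝑡 distribution to the standard Gauss as 𝑣 grows can be checked numerically in R; a quick sketch comparing upper 2.5% quantiles:

sapply(c(5, 10, 30, 100, 1000), function(v) qt(0.975, df = v))
# 2.571 2.228 2.042 1.984 1.962 : shrinking toward the Gauss limit
qnorm(0.975)   # 1.960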

4.7.2 𝑡 distribution- Usage for interval estimation

Setting. Let 𝑥̄ and

𝑠 = √( ∑ᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̄)² / (𝑛 − 1) )

be the sample mean and standard deviation of a random sample

𝑥 = 𝑥1, 𝑥2, . . . , 𝑥𝑛

drawn from a normal distribution with unknown standard deviation 𝜎.

With the 𝑇 transformation

𝑇 = (𝑋̄ − 𝜇)/(𝑆/√𝑛),

since P[−𝑡𝛼/2 < 𝑇 < 𝑡𝛼/2] = 1 − 𝛼, it equivalently means that

P( 𝑋̄ − 𝑡𝑛−1,𝛼/2 · 𝑆/√𝑛 ≤ 𝜇 ≤ 𝑋̄ + 𝑡𝑛−1,𝛼/2 · 𝑆/√𝑛 ) = 1 − 𝛼,

hence we can compute the CI (confidence interval) of 𝜇 with a significance level 𝛼 as follows.


Figure 4.9: Probability density function of 𝑡[𝜈] with 𝜈 = 5, 50

Confidence interval of 𝜇 when 𝜎 is unknown and 𝑛 is small.

The 100(1 − 𝛼)% central two-sided CI for the mean 𝜇 is:

( 𝑥̄ − 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛 ,  𝑥̄ + 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛 )

where 𝑡𝑛−1,𝛼/2 is a Gosset 𝑡 variate with 𝑛 − 1 degrees of freedom and right-tail probability 𝛼/2.

Figure 4.10: Visualization of 𝑡-distribution with 2-tailed probability

The right-tail and left-tail probabilities are both 𝛼/2; as a result, obviously we have

P( 𝑥̄ − 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛 ≤ 𝜇 ≤ 𝑥̄ + 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛 ) = 1 − 𝛼.


Percentiles of 𝑡[𝜈] are denoted 𝑡𝛼[𝜈]; they can be read from Table 4.3 or determined by software.

Example 4.10. [Environmental Science.] Suppose you observed the sample

6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5,
6.5, 9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9 [in mg/l]

of 𝑛 = 27 metal concentrations in a river near a paper factory.

From the data we get 𝑥̄ = 7.51 mg/l and sample standard deviation 𝑠 = 1.38 mg/l.

The score 𝑡1−𝛼/2[𝑛 − 1] satisfies

P[ 𝑇 > 𝑡1−𝛼/2[𝑛 − 1] ] = 𝛼/2,

where 𝑇 is a random variable having the 𝑡[𝑛 − 1] distribution with 𝑣 = (𝑛 − 1) degrees of freedom.

With 𝐶𝐿 = 1 − 𝛼 = 95%, 𝛼/2 = 0.025, and 𝑝 = 1 − 𝛼/2 = 0.975,

the 𝑡 score (or critical value) is 𝑡1−𝛼/2[𝑛 − 1] = 𝑡0.975[26] = 2.056.

Table 4.3: Percentiles of the probability density function of 𝑇

1-Tailed, 𝛼 =        0.20    0.15    0.10    0.05    0.025   0.01
2-Tailed, 𝛼 =        0.40    0.30    0.20    0.10    0.05    0.02
Confidence level 𝛽   0.60    0.70    0.80    0.90    0.95    0.98

𝑑𝑓 = 1               1.376   .       .       .       .       31.82
𝑑𝑓 = 2               1.06    .       .       .       .       6.96
𝑑𝑓 = 3               0.978   .       .       .       .       .
𝑑𝑓 = 5               0.92    1.156   1.476   2.015   2.571   3.365
𝑑𝑓 = 7               .       .       1.415   1.895   2.365   2.998
𝑑𝑓 = 8               .       .       .       1.86    2.306   2.896
𝑑𝑓 = 9               0.883   1.1     1.38    1.83    2.262   2.821
𝑑𝑓 = 21              .       .       .       1.72    2.080   2.518
𝑑𝑓 = 26              .       .       .       1.70    2.056   2.479

Hence the standard error of 𝑋̄ is

S.E.(𝑋̄) = 𝑠/√𝑛 = 1.38/√27 = 0.27,


hence the margin is

𝑅(𝛼) = 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛,

and finally the 95% CI of 𝜇 is

( 𝑥̄ − 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛 ,  𝑥̄ + 𝑡𝑛−1,𝛼/2 · 𝑠/√𝑛 )

or

7.51 − 2.056 · (0.27) < 𝜇 < 7.51 + 2.056 · (0.27)  ⇐⇒  6.96 < 𝜇 < 8.06.
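
The whole computation of Example 4.10 can be replayed in R from the summary statistics (small rounding differences from the hand computation are expected):

xbar <- 7.51; s <- 1.38; n <- 27; alpha <- 0.05
tcrit <- qt(1 - alpha/2, df = n - 1)   # t_{26, 0.025} = 2.056
R <- tcrit * s / sqrt(n)               # margin of error, about 0.546
c(L = xbar - R, U = xbar + R)          # about (6.96, 8.06)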

4.7.3 Find sample size given error and variance

Given 𝑥̄ and a specified error 𝐸, can we find a sample size 𝑛 such that

𝑅(𝛼) = |𝑥̄ − 𝜇| ≤ 𝐸?

If 𝑥̄ is used as an estimate of 𝜇, we can be 100(1 − 𝛼)% confident that the error

𝑅(𝛼) = |𝑥̄ − 𝜇| ≤ 𝐸,

i.e. 𝑅(𝛼) will not exceed 𝐸, when the sample size 𝑛 (an integer) is

𝑛 = ( 𝑧𝛼/2 𝜎 / 𝐸 )².   (4.21)

Example 4.11.

Ten measurements of impact energy (J, joule) on specimens of stainless steel cut at 60°C are as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3 (in J).

The impact energy is normally distributed with 𝜎 = 1 J.

Determine how many specimens must be tested to ensure that the 95% CI on 𝜇 for this stainless steel cut at 60°C has a length of at most 1.0 J.

GUIDANCE for solving.

The bound on the error in estimation 𝐸 is one-half of the length of the CI. Use Equation 4.21,

𝑛 = ( 𝑧𝛼/2 𝜎 / 𝐸 )²,

to determine 𝑛 with 𝐸 = 0.5, 𝜎 = 1, and 𝑧𝛼/2 = 1.96; then the solution is

𝑛 = ⌈(1.96 · 1/0.5)²⌉ = ⌈15.37⌉ = 16.

The length of the CI = the width of the CI = 2𝐸.
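
A one-line R version of this sample-size calculation, under the example's assumptions (𝐸 = 0.5, 𝜎 = 1, 𝛼 = 0.05):

E <- 0.5; sigma <- 1; alpha <- 0.05
n_exact <- (qnorm(1 - alpha/2) * sigma / E)^2   # 15.37
ceiling(n_exact)                                # 16 specimens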


4.8 Summary of Terms and Main Points

Theorem 4.4 (Sampling distribution of 𝑋̄ in general).

Assume that the random variables 𝑋1, 𝑋2, · · · , 𝑋𝑛 ∼ i.i.d. 𝑋 (the 𝑋𝑖 are independent and have the same distribution as a common random variable 𝑋) with expected value 𝜇 and variance 𝜎².

The normal population: If the population 𝑋 ∼ N(𝜇, 𝜎²), then for any 𝑛 the sample mean

𝑋̄ = ( ∑ᵢ₌₁ⁿ 𝑋𝑖 ) / 𝑛

has expectation E[𝑋̄] = 𝜇 and variance V[𝑋̄] = 𝜎²/𝑛. Moreover, 𝑋̄𝑛 is normal for any 𝑛:

𝑋̄𝑛 ∼ N( 𝜇, 𝜎²/𝑛 ).

Generic population: If the population 𝑋 is not normal, then 𝑋̄𝑛 is approximately N( 𝜇, 𝜎²/𝑛 ) only when 𝑛 is large (𝑛 > 30). In brief, we say the sampling distribution of the sample mean tends to normality asymptotically; see the C.L.T. above.

Only one question remains: if the population 𝑋 is not normal but 𝑛 ≤ 30, what is the sampling distribution of 𝑋̄? The next chapter will answer this.

1. Confidence Interval (CI or Interval Estimate): an interval estimate for an unknown population pa-
rameter. This depends on:

• the desired confidence level,

• information that is known about the distribution (for example, known standard deviation),

• the sample and its size.

2. Error Bound for a Population Mean (EBM) or the Margin of error 𝑅(𝛼): depends on confidence level
1 − 𝛼, sample size 𝑛, and known population standard deviation 𝜎 or estimated 𝑠.

3. Confidence Level (CL): the probability 1 − 𝛼 that the confidence interval contains the true population parameter. For example, if CL = 90%, then in 90 out of 100 samples the interval estimate will enclose the true population parameter. The CI is:

( 𝑥̄ − 𝑅(𝛼), 𝑥̄ + 𝑅(𝛼) ).

𝛼 is the significance level, the probability that the interval does not contain the unknown population parameter.

4. Degrees of Freedom (df): the number of objects in a sample that are free to vary.

5. Parameter: a numerical characteristic of a population


6. Point Estimate: a single number computed from a sample and used to estimate a population pa-
rameter

7. 𝑡-distribution with 𝑛 − 1 degrees of freedom.

• The 𝑡-score (statistic) has the same interpretation as the 𝑧-score: it measures how far 𝑥̄ is from 𝜇.
• For each sample size 𝑛, there is a different Student’s 𝑡-distribution.
• For example, if we have a sample of size 𝑛 = 20 items, then we calculate the degrees of freedom
as
𝑑𝑓 = 𝑛 − 1 = 20 − 1 = 19

and we write variable 𝑇 ∼ 𝑡19 .

4.9 Chapter’s Problems

Problem 4.8. Business

Yoonie is a personnel manager in a large corporation. Each month she must review 16 of the em-
ployees. From past experience, she has found that the reviews take her approximately four hours each
to do with a population standard deviation of 1.2 hours.
Let 𝑋 be the random variable representing the time it takes her to complete one review.
Assume that 𝑋 is normally distributed. Let X be the random variable representing the mean time to
complete the 16 reviews. Assume that the 16 reviews represent a random set of reviews.
What are the mean, the standard deviation, and the sample size?
Solution: Mean = 4 hours; Standard deviation = 1.2 hours; Sample size = 16. 

Problem 4.9. [Industry- Manufacturing.]

A random sample of 𝑛 bike helmets manufactured by a company is selected. Let 𝑋 be the number
of helmets among the 𝑛 that are flawed, and let 𝑝 = 𝑃 (flawed). Assume that only 𝑋 is observed,
rather than the sequence of 𝑆’s and 𝐹 ’s.

1. Derive the maximum likelihood estimator of 𝑝. If 𝑛 = 20 and 𝑥 = 3, what is the estimate? Is this
estimator unbiased?

2. If 𝑛 = 20 and 𝑥 = 3, what is the MLE of the probability (1–𝑝)5 that none of the next five helmets
examined is flawed? 

Problem 4.10 (Method of moments).


Let 𝑋 be the proportion of allotted time that a randomly selected student spends working on a
certain aptitude test.

Suppose the pdf of 𝑋 is

𝑓(𝑥; 𝜃) = (𝜃 + 1) 𝑥^𝜃 for 0 ≤ 𝑥 ≤ 1, and 0 otherwise,   (4.22)

where 𝜃 > −1. A random sample of ten students yields data

𝑥1 = 0.92, 𝑥2 = 0.79, 𝑥3 = 0.90, 𝑥4 = 0.65, 𝑥5 = 0.86,

𝑥6 = 0.47, 𝑥7 = 0.73, 𝑥8 = 0.97, 𝑥9 = 0.94, 𝑥10 = 0.77.

1. Use the method of moments to obtain an estimator of 𝜃; compute the estimate for this data.

2. Obtain the maximum likelihood estimator of 𝜃, then compute the estimate for the data. 

Problem 4.11.

A random sample of the annual salaries of 144 mathematicians at Los Alamos National Lab (USA) has a mean 𝑥̄ = 100 (in units of $1000) and a known standard deviation 𝜎 = 16.
Determine the CI of the population mean 𝜇, given the confidence level 1 − 𝛼 = 95%.

Problem 4.12.

A manager wishes to estimate the mean number of minutes that workers take to complete a specific
manufacturing task within ±1 min and with 1 − 𝛼 = 90% confidence. From past data, he knows the
standard deviation 𝜎 = 15 min.
Compute the minimum required sample size 𝑛, known that 𝑛 > 30.

Problem 4.13. (Sample size determination)

An engineer wishes to estimate the mean velocity 𝜇 (in km/h) at which motorbikers pass an observation point, given the width of the confidence interval 𝑤 = 3 km/h and a 1 − 𝛼 = 99% confidence level.

If he knows 𝜎 = 1.5 km/h, compute the minimum required 𝑛 (the number of motorbikers we need to observe), given that 𝑛 > 30.

Problem 4.14.

The standard deviation of the weights of ‘baby’ elephants is known to be approximately 15 pounds.
We wish to construct a 95% confidence interval for the mean weight of newborn elephant calves. Fifty
newborn elephants are weighed. The sample mean is 244 pounds. The sample standard deviation is
11 pounds. Identify the following:


a) 𝑥̄ = ?  b) 𝑠 = ?  c) 𝑛 = ?  d) In words, define the random variables 𝑋 and 𝑋̄.

ANS: 𝑋 is the weight in pounds of a newborn elephant.

𝑋̄ = (𝑋1 + 𝑋2 + · · · + 𝑋50)/50 is the average of the weights of the sample of 50 baby elephants.

e) Which distribution should you use for this problem? ANSWER: N( 244, 15/√50 ).

Problem 4.15.

The standard deviation of the weights of elephants is known to be approximately 15 pounds. We


wish to construct a 95% confidence interval for the mean weight of newborn elephant calves. Fifty
newborn elephants are weighed. The sample mean is 244 pounds. The sample standard deviation is
11 pounds.
Construct a 95% confidence interval for the population mean weight of newborn elephants. State the
confidence interval, sketch the graph, and calculate the error bound.

ANSWER: CI (239.84, 248.16) and EBM = 4.16.

Chapter 5

Statistical Hypothesis Testing


Confirming your claims or beliefs about
population parameters

You can use a hypothesis test to decide if a dog breeder’s claim


that every Dalmatian has 35 spots is statistically sound.


5.1 Introduction and Background

There are many situations in which we have to make decisions based on observations or data that are random variables. What we are examining concerns the parameters or the form of the probability distribution that yields the observations. This involves making a declaration or statement, called a hypothesis, about a population. The theory behind the solutions for these situations is known as decision theory or statistical hypothesis testing.
Major topics we will learn in this part include:

1. Problems need statistically testing hypotheses

2. Hypothesis Testings- Introduction in Section 5.1

3. Sampling and Inference, in Section 5.2

4. Statistical inference for a single sample, on the population mean of a normal distribution (the case of known 𝜎), in Section 5.3

5. Interval Estimation and Hypothesis Testing for Proportion, Section 5.4

6. Testing for two populations, in Section 5.5

7. Summary for Statistical Inference, in Section 5.6

Motivation

First let’s look at few problems that can be solved by the chapter’s methods.

Problem 5.1. [Industry- Manufacturing.]

In the past, machines of a production company produced bearings (small mechanical components used in devices such as digital cameras and tablets) with an average thickness of 0.05 cm. To determine whether these machines were still operating normally, a sample of 10 bearings was selected; an average thickness of 0.053 cm and a standard deviation of 0.003 cm were calculated.
Assume a significance level of 𝛼 = 0.05.

1. Test the hypothesis that this machine works normally.


2. Find the 𝑃-value of the above test.

Problem 5.2. [Computing.]

The Telegraph newspaper reported that, on average, United Kingdom (U.K.) subscribers with third
generation technology (3G) phones in 2006 spent an average of 8.3 hours per month listening to coun-
try music on their cell phones.


To study what happens in the U.S., researchers draw the following random sample 𝑥 of music listening time per month from the U.S. population of 3G subscribers:

𝑥 = 5, 6, 0, 4, 11, 9, 2, 3 (hours per month).

Do these data suggest that, on average, an American subscriber listens to country music less than a U.K. subscriber? Explain your conclusion clearly.

Problem 5.3.

• In communication or radar technology, for instance, decision theory or hypothesis testing is known as (signal) detection theory.

• In politics, during an election year, we see articles in the newspaper that state confidence intervals
in terms of proportions or percentages.

The scientific method, briefly, states that only by following a careful and specific process can an assertion be included in the accepted body of knowledge. This process begins with a set of assumptions upon which a theory, sometimes called a model, is built. This theory, if it has any validity, will lead to predictions, which we call hypotheses.

General setting

• A hypothesis is a statement about a population parameter. More precisely, a statistical hypothesis is


a statement about the values of the population parameters of a probability distribution.

• The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis
𝐻0 and the alternative hypothesis 𝐻1 or 𝐻𝐴 .

Hypothesis testing is a decision process establishing the validity of a formulated (or conjectured) hy-
pothesis. Mathematically, suppose we observe a random sample (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) of a random variable
𝑋 whose probability density function

𝑓 (𝑥; 𝜃) = 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ; 𝜃)

depends on a parameter 𝜃, where 𝜃 denotes a population parameter. We wish to test the assumption 𝜃 = 𝜃0, denoted by 𝐻0, against the assumption 𝜃 ≠ 𝜃0, denoted by 𝐻1:

𝐻0 : 𝜃 = 𝜃0 (Null hypothesis)   (5.1)

𝐻1 : 𝜃 ≠ 𝜃0 (Alternative hypothesis)   (5.2)

The general format of the two hypotheses are

𝐻0 : 𝜃 ∈ 𝐴 and 𝐻1 : 𝜃 ∈ 𝐴ᶜ

where


• 𝐴 is a subset of the parameter space Θ and

• 𝐴ᶜ is its complement.

Figure 5.1: Acceptance region 𝐴 and its complement 𝐴𝑐

How to test, in general?

Hypothesis-testing procedures rely on using the information in a random sample from the population
of interest.

• If this information is consistent with the hypothesis, we conclude that the hypothesis is true;

• if this information is inconsistent with the hypothesis, we conclude that the hypothesis is false.

Hence, let 𝑥 = (𝑥1, 𝑥2, . . . , 𝑥𝑛) be the observed data vector, 𝜃 our parameter, and 𝜃̂ an estimate of 𝜃 from the data 𝑥. Then

• if 𝜃̂ ∈ 𝐴, known as the Acceptance region (the white area in Figure 5.1), we statistically confirm 𝐻0; otherwise

• if 𝜃̂ ∈ 𝐴ᶜ, known as the Rejection region, we statistically conclude that 𝐻1 is true, since the null 𝐻0 is rejected.

Example 5.1.

A hypothesis can be that the mean age 𝜇 of students taking the SDA course is greater than 19 years. Writing 𝜇 for the mean age, the two hypotheses are

𝐻0 : 𝜇 ≤ 19.0, and the alternative 𝐻1 : 𝜇 > 19.0.

Main aim: to draw valid conclusions from the data gathered, or more precisely, to determine whether a statement about the value of a population parameter should or should not be rejected. Thus, in the age example, formally

Θ = (0, ∞), 𝐴 = (0, 19], 𝐴ᶜ = (19, ∞).

(0 ----------------- 19](19 ----------------- +∞)


5.2 Sampling and Inference

Overview of Sampling process

Figure 5.2: The learning process produces data and knowledge that allow humans to solve problems. Source [61]

In brief, Figure 5.2 diagrams the learning process of humans for solving problems in relation to two types of studies: observational studies (based on observation) and experimental studies. Both types produce data.

Observational Studies. In many cases, the only viable way to collect data is to observe individuals under natural conditions. This is true in studies of wildlife behavior in natural habitat conditions (see Figure 5.2). In observational studies, researchers observe or measure the main characteristics of interest of individuals in natural conditions (and, therefore, do not attempt to influence or change these characteristics).

Experimental Studies. Used when one influences or changes the characteristics of a phenomenon,
by altering the levels of major factors influencing the phenomenon or process, and then collecting the
data.


5.2.1 Steps in the decision-making process for research

There are eight basic steps in the decision-making process in both types of research, including:

1. Define research objectives to select the right sample:

- What are objects of investigation? What’s the target?

- Sampling time (Weather (seasons)? In production shifts or normal?) and Sampling locations

2. Design sampling methods

+ Date and time; Place; collector’s name; sample symbols;

+ Recognize characteristics of the place of sampling

+ Early description of the analysis phase: analysis date, analyst, analyzer ...

3. Sample selection: structure and size

4. Field work?

5. Data preparation: data cleaning, storing and coding

6. Data analysis:

+ Use descriptive/ inferential statistics

+ Compare with actuarial (financial, industrial ...) standard regulations!

7. Outcome presentation/ interpretation: use histogram or graph ...

8. Reporting the results: purpose, method of implementation, statistical evaluation, confirming corre-
lation between factors, making decisions...

Remark 2. The first three basic steps are important! If we perform them correctly, the valuable information and knowledge hidden behind the phenomenon can be discovered.

5.2.2 Key sampling distributions

How do we verify a hypothesis 𝐻0 ? Suppose we have collected some data. Do they support the
null hypothesis or contradict it? We need a criterion, based on the collected data, which will help us
answer the question.

(a) Gauss (normal) distribution: If 𝑋 is a normal variable, written 𝑋 ∼ N(𝜇, 𝜎²), then its pdf is

𝑓(𝑥) = ( 1/(𝜎√(2𝜋)) ) exp( −½ [(𝑥 − 𝜇)/𝜎]² ),   (5.3)

where −∞ < 𝑥 < ∞, 𝜇 ∈ ℝ, 𝜎² > 0,


𝑓(𝑥) is the height of the normal curve, 𝑒 ≈ 2.71 and 𝜋 ≈ 3.14 are constants, 𝜇 is the mean, and 𝜎² is the variance of the normal distribution.

Figure 5.3: The probability density function of N(𝜇, 𝜎) with 𝜇 = 10 and 𝜎 = 1, 2, 3

Fact 5.1. Distribution of a linear combination of Gauss variables

Assume that 𝑋1, 𝑋2, . . . , 𝑋𝑛 are normal and independent random variables with means 𝜇1, 𝜇2, . . . , 𝜇𝑛 and variances 𝜎1², 𝜎2², . . . , 𝜎𝑛².

Then the distribution of the linear combination 𝑌 = 𝑎1𝑋1 + 𝑎2𝑋2 + · · · + 𝑎𝑛𝑋𝑛 is normal, i.e. 𝑌 ∼ N(𝜇𝑌, 𝜎𝑌²), with mean

𝜇𝑌 = 𝑎1𝜇1 + 𝑎2𝜇2 + · · · + 𝑎𝑛𝜇𝑛,

and variance 𝜎𝑌² = 𝑎1²𝜎1² + 𝑎2²𝜎2² + · · · + 𝑎𝑛²𝜎𝑛².

(b) Student distribution: see Section 4.7.

The Student variable 𝑡 with 𝜈 degrees of freedom is given by the probability density function

𝑓(𝑢) = [ Γ((𝜈 + 1)/2) / (√(𝜈𝜋) Γ(𝜈/2)) ] · [ 1 + 𝑢²/𝜈 ]^(−(𝜈+1)/2),  𝑢 ∈ ℝ.

5.3 Hypothesis Testings for Single Sample

Definition 5.1. A hypothesis test is a rule that specifies


• for which sample values the decision is made to reject 𝐻0 , i.e. accept 𝐻1 ,

• and for which sample values not to reject 𝐻0 .

To be able to extract meaningful facts from huge amounts of data, two key points to keep in mind are: 1) a clear formulation of the hypotheses to be tested, and 2) a sound understanding of the mathematical phenomena involved. Let's see a few cases below.

A) Oil reservoir exploration

Drills manufactured for oil reservoir exploration are supposed to have a mean drilling speed of 50 cm/s. From past experience we know

- the standard deviation 𝜎 = 10 cm/s;

- the speeds are normally distributed.

A random sample 𝑥 of 10 drills had a mean 𝑥̄ = 45 cm/s.

Test the hypothesis that the mean 𝜇 is 50 cm/s with 𝛼 = 0.05.

GUIDANCE for solving. A statistical hypothesis is a statement about the parameters of one or more
populations.

Let 𝐻0 : 𝜇 = 50 centimeters per second and

𝐻1 : 𝜇 ≠ 50 centimeters per second.

The statement 𝐻0 : 𝜇 = 50 is called the null hypothesis.

The statement 𝐻1 : 𝜇 ≠ 50 is called the alternative hypothesis.

Can we conclude that 𝜇 = 50 centimeters per second using the data?

B) Manufacturing

A tire manufacturer developed a new tire designed to provide an increase in mileage over the firm's current line of tires. To estimate the mean number of miles provided by the new tires, the manufacturer selected a sample of 120 new tires for testing. The test provided a sample mean of 𝑥̄ = 36,500 miles. Hence, an estimate of the population mean tire mileage 𝜇 (for the population of new tires) was 𝑥̄ = 36,500 miles.

Can we conclude that 𝜇 = 36,500 miles?

C) Market research

Teenagers in Thailand seem to favor a particular mobile phone brand 𝐴. We want to know an estimate of the proportion of surveyed respondents favoring the product. How could we do this?


5.3.1 Hypothesis Testing for the Population Mean- general

Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the
data, and a conclusion. To perform a hypothesis test, we will:

1. Set up two contradictory hypotheses.

2. Collect sample data (yourself, or the data or summary statistics will be provided).

3. Determine the correct distribution to perform the test.

4. Analyze sample data by performing the calculations that ultimately will allow you to reject or decline
to reject the null hypothesis.

5. Make a decision and write a meaningful conclusion.

In practice we need to conduct the following steps.

How to do? 7 steps procedure.

Step 1 Define the problem, identify the parameter of interest.

Formulate the null hypothesis 𝐻0 , and specify an appropriate alternative hypothesis, 𝐻1 .

Step 2 Specify a significance level, 𝛼.

Step 3 Collect the sample data and compute the value of the sample mean

Step 4 Determine an appropriate test statistic

Step 5 State the rejection criteria for the statistic.

Step 6 Compute necessary sample quantities

Step 7 Make decision (conclusion) using the critical value approach, or 𝑝-value approach

Step 1: Develop the null hypothesis 𝐻0 and the alternative one 𝐻1 .

The general aim: we consider a test of 𝐻0 versus 𝐻1 for a distribution with an unknown parameter 𝜃. There are 4 possible situations, only two of which lead to a correct decision.


The other twos are called Type I and Type II errors, their descriptions and corresponding probabilities
are given by:

Type I error occurs when we reject the true null hypothesis; and it has probability

𝛼 = P( reject 𝐻0 | H0 true ) = P(𝐻1 | 𝐻0 )

Probability 𝛼 of type I error is also the significance level of a test.

Type II error occurs when we accept the false null hypothesis; and it has probability

𝛽 = P( accept 𝐻0 | H0 false ) = P(𝐻0 | 𝐻1 )

We would like to have test procedures which make both kinds of errors small.

Result of the test from observed data:

Decision        𝐻0 true                    𝐻1 true
𝐻0 accepted     Correct decision (1 − 𝛼)   False decision (𝛽)
𝐻1 accepted     False decision (𝛼)         Correct decision (1 − 𝛽)

Definition 5.2 (The Power of the test).


The power of the test is the probability of correctly rejecting 𝐻0 . Obviously

𝑃 𝑜𝑤𝑒𝑟 = P( accept 𝐻1 | 𝐻1 true ) = P( reject 𝐻0 | 𝐻0 false )

= 1 − P(𝑎𝑐𝑐𝑒𝑝𝑡 𝐻0 | 𝐻0 false )

= 1 − P( Type II error ) = 1 − 𝛽.

Hence, the power of the test is 1 − 𝛽. And a test has high power if the probability 1 − 𝛽 (rejecting a
false null hypothesis) is large.

Hypothesis types for a parameter 𝜃


Reminder: now consider a population mean 𝜇 =: 𝜃 (the parameter can also be a population proportion 𝑝 or a population variance 𝜎²).

 CONCEPT 5.


𝐻0 : 𝜃 = 𝜃0   (the null hypothesis)

𝐻1 = 𝐻𝐴      (the alternative hypothesis)

• Alternative of the type 𝐻𝐴 : 𝜇 ̸= 𝜇0 covering regions on both sides of the hypothesis (𝐻0 : 𝜇 = 𝜇0 )
is a two-sided alternative.

• Alternative of the type 𝐻𝐴 : 𝜇 < 𝜇0 covering the region to the left of 𝐻0 is a one-sided alterna-
tive, left-tail.

• Alternative of the type 𝐻𝐴 : 𝜇 > 𝜇0 covering the region to the right of 𝐻0 is a one-sided
alternative, right-tail.

Critical thinking: How can we write these 𝐻0 , 𝐻𝐴 ?

• A null hypothesis is always an equality, express a usual statement that people have believed in for
years. In order to overturn the common belief and to reject the hypothesis 𝐻0 , we need significant
evidence. Such evidence can only be provided by data.

• Only when such evidence is found, and when it strongly supports the alternative 𝐻𝐴 , can the hypoth-
esis 𝐻0 be rejected in favor of 𝐻𝐴 .

E.g., the managing board of KMITL in Thailand believes that the average height 𝑋 of freshmen in
IMSE (Industrial and Management Systems Engineering) Program is 1.7m; they write the hypothesis
𝐻0 : 𝜇 = 1.7 (𝜇 = E[𝑋]).
But if you collect a sample of heights of your friends and see that their heights are around 1.65 .. 1.8
m then you might accept the alternative 𝐻𝐴 : 𝜇 > 1.7.

Example in engineering and quality control


Choosing the right-tail or left-tail alternative depends on specific data!

Example 5.2.

In oil reservoir exploration, firms use drills whose drilling speeds are assumed to come from a population with mean 𝜇 = 50 cm/s. From past experience we know
- the standard deviation 𝜎 = 10 cm/s, and
- the drilling speeds 𝑋𝑖 are normally distributed.
A random sample x = 𝑥1, 𝑥2, · · · , 𝑥10 of 10 drill speeds had a sample mean 𝑥̄ = 45 cm/s. You can write the hypothesis 𝐻0 : 𝜇 = 50; but could you accept the alternative 𝐻𝐴 : 𝜇 < 50? 

Example 5.3.


To verify if the proportion of defective products (computers, cars...) is at most 3%, we test 𝐻0 :
𝑝 = 0.03 vs. 𝐻𝐴 : 𝑝 > 0.03, where 𝑝 is the (population) proportion of defects in the whole shipment.
Why do we choose the right-tail alternative 𝐻𝐴 : 𝑝 > 0.03?
That is because we reject the shipment only if significant/meaningful evidence supporting this 𝐻𝐴
is collected.

Shoes qualified for sale to the market must have sizes that fit into the well-designed boxes.

Figure 5.4: Testing hypotheses in quality control

If the data suggest that 𝑝 ≤ 0.03, the shipment will still be accepted.
The alternative is 𝐻1 : 𝑠𝑖𝑧𝑒𝑠 > 𝑚𝑎𝑥𝑠𝑖𝑧𝑒𝑠, or better,
𝐻1 : the defective proportion 𝑝(𝑠𝑖𝑧𝑒𝑠 > 𝑚𝑎𝑥𝑠𝑖𝑧𝑒𝑠) > 3%. 

Process engineers must test the hypothesis

𝐻0 : 𝑠𝑖𝑧𝑒𝑠 = 𝑚𝑎𝑥𝑠𝑖𝑧𝑒𝑠,

where (𝑠𝑖𝑧𝑒𝑠 = 𝑚𝑎𝑥𝑠𝑖𝑧𝑒𝑠) ⇔ 𝑙𝑒𝑛𝑔𝑡ℎ = 𝑙0, 𝑤𝑖𝑑𝑡ℎ = 𝑤0, ℎ𝑒𝑖𝑔ℎ𝑡 = ℎ0.

Step 2: Level of significance - Type I, Type II Errors.

Significance level

The probability of a type I error, 𝛼, is called the significance level of the test. It is controlled by
the location of the rejection (critical) region. So it can be set as small as required.


A Key Procedure of Hypothesis Testing consists of:

• direct control: specify a value of the Type I error probability 𝛼;

• indirect control: design the test process so that a small value of the Type II error probability 𝛽 is obtained, that is, so that the power 𝑓(𝑛) = 1 − 𝛽, as a function of the sample size, is large!

Step 3: Collect the data and compute the value of the sample mean.
In this step researchers go on field trips and use surveys and measurement devices to capture or observe the values of units in a sample. Then they employ the various popular formulas of central and spreading tendency (in Chapter 2) to compute the value of the test statistic.

Standard deviation 𝜎𝑥̄ of 𝑥̄, given an obtained sample 𝑥1, 𝑥2, . . . , 𝑥𝑛.

Use the notation below:
𝜎𝑥̄ = the standard deviation of 𝑥̄, 𝜎 = the standard deviation of the population,
𝑛 = the sample size, 𝑁 = the population size.
Then the standard deviation of 𝑥̄ is

𝜎𝑥̄ = 𝜎/√𝑛 (for an infinite population, or when 𝑛 is small relative to 𝑁).

Example 5.4. Assume that a population is composed of 900 elements with a mean of 𝜇 = 20 units and a standard deviation of 𝜎 = 12.
Find P[18 < 𝑥̄ < 24] for a sample of size 36.

The mean and standard error of the sampling distribution of the mean 𝑥̄ are:

E(𝑥̄) = 𝜇𝑥̄ = 𝜇 = 20;  𝜎𝑥̄ = 𝜎/√𝑛 = 12/√36 = 2.

The probability that the sample mean 𝑥̄ of a random sample of 36 elements falls in the interval [18, 24] is computed as follows:

𝑧1 = (𝑥̄1 − 𝜇)/𝜎𝑥̄ = (18 − 20)/2 = −1

𝑧2 = (𝑥̄2 − 𝜇)/𝜎𝑥̄ = (24 − 20)/2 = 2

Looking up these values in a table of 𝑧 values, we get

P[18 < 𝑋̄ < 24] = P[−1 < 𝑍 < 2] = 0.3413 + 0.4772 = 0.8185, or 81.85%.
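
The same probability can be obtained directly in R with pnorm, bypassing the table lookup (hence the tiny difference from 0.8185):

mu <- 20; sigma <- 12; n <- 36
se <- sigma / sqrt(n)                    # standard error = 2
pnorm(24, mu, se) - pnorm(18, mu, se)    # about 0.8186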

Step 4: Determine an appropriate test statistic

a/ Sampling from a Normal Distribution.


Suppose that 𝑥 is a normally distributed random variable with mean 𝜇 ∈ ℝ and variance 𝜎² > 0, given by the pdf

𝑓(𝑥) = ( 1/(𝜎√(2𝜋)) ) exp( −½ [(𝑥 − 𝜇)/𝜎]² ),  −∞ < 𝑥 < ∞.   (5.4)
If 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 is a random sample of size 𝑛 from this process, then the distribution of the sample
mean 𝑋 is 𝑁 (𝜇, 𝜎 2 /𝑛), due to Fact 5.1.

– Use the test statistic

𝑍 = (𝑋̄ − 𝜇)/𝜎𝑋̄ ;   (5.5)

and when we have obtained the sample 𝑥1, 𝑥2, . . . , 𝑥𝑛 we get the 𝑧-value

𝑧 = (𝑥̄ − 𝜇)/𝜎𝑥̄ = (𝑥̄ − 𝜇)/(𝜎/√𝑛).

– If the sample size 𝑛 < 30, replace the above 𝑍 statistic by the Student statistic

𝑇 = (𝑋̄ − 𝜇)/𝑆𝑋̄ ;

and when we have obtained the sample 𝑥1, 𝑥2, . . . , 𝑥𝑛 we get the 𝑡-value

𝑡 = (𝑥̄ − 𝜇)/𝑠𝑥̄ = (𝑥̄ − 𝜇)/(𝑠/√𝑛).
b/ Sampling from a generic distribution:
The Central Limit Theorem (CLT, in Section 4.4.3) says that the sum 𝑌 = 𝑛𝑋̄ = 𝑋1 + 𝑋2 + · · · + 𝑋𝑛 of independent random variables is approximately normal if 𝑛 > 30, regardless of the distributions of the component variables. Similarly, use the test statistic 𝑍 as in Equation 5.5 above.

If 𝑛 ≤ 30, we use the Bootstrap method. Formally, the reasoning is based on Theorem 5.1.

Step 5: State the rejection criteria for the statistic

5a) If the variance 𝜎² is known - two-sided test (below).

5b) If the variance 𝜎² is known - one-sided test, see Section 5.3.2.

A two-sided hypothesis test consists of the null and alternative hypotheses

𝐻0 : 𝜇 = 𝜇0,

𝐻1 : 𝜇 ≠ 𝜇0.

Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from a population 𝑋, and let

E[𝑋] = 𝜇, and

Var[𝑋] = 𝜎 2

where 𝜇 is unknown and 𝜎 2 is known. There are two cases to consider.


Case a/ If the population 𝑋 is normal, then we know that 𝑋̄ ∼ N(𝜇, 𝜎²/𝑛).

Case b/ If the population 𝑋 is not normal but has finite mean and variance, then for 𝑛 large, due to the CLT, we have 𝑋̄ ∼ N(𝜇, 𝜎²/𝑛).

– Compute the 𝑍-statistic

𝑍 = (𝑋̄ − 𝜇0)/(𝜎/√𝑛) ∼ N(0, 1),

exactly in case a/ and approximately in case b/.

– If the null hypothesis is true, i.e. 𝜇 = 𝜇0, then the 𝑍 value

𝑧 = (𝑥̄ − 𝜇0)/(𝜎/√𝑛) ∼ N(0, 1).

The set {𝑧 : |𝑧| ≤ 𝑧𝛼/2} is named the acceptance region (the white area in Figure 5.5).

The rejection region is

{𝑧 : 𝑧 < −𝑧𝛼/2 or 𝑧 > 𝑧𝛼/2} = {𝑧 : |𝑧| > 𝑧𝛼/2}.

Step 6: Compute the test score

From the sample 𝑥1, . . . , 𝑥𝑛, we calculate 𝑥̄ and find the test score 𝑧0 by

𝑧0 = (𝑥̄ − 𝜇0)/(𝜎/√𝑛).

Then we decide:

– If 𝑧0 < −𝑧𝛼/2 or 𝑧0 > 𝑧𝛼/2, we reject 𝐻0.

– If −𝑧𝛼/2 ≤ 𝑧0 ≤ 𝑧𝛼/2, we accept (or fail to reject) 𝐻0.

----- −𝑧𝛼/2 ----- 0 ----- 𝑧𝛼/2 -----> 𝑍

The rejection region, as a reminder, is

{𝑧 : 𝑧 < −𝑧𝛼/2 or 𝑧 > 𝑧𝛼/2} = {𝑧 : |𝑧| > 𝑧𝛼/2}.

-- H1 -- −𝑧𝛼/2 ----- 0 ----- 𝑧𝛼/2 -- H1 --> 𝑍

EXPLANATION and REASONING:


Figure 5.5: Critical region for the two-sided test with the 𝑍 statistic

Indeed, if the null hypothesis is true, 𝑧 should be close to zero, so large values of |𝑧| would tend to contradict the hypothesis. Suppose we find the value 𝑧𝛼/2 such that

P(𝑍 > 𝑧𝛼/2) = 𝛼/2;

then by the symmetry of the standard normal distribution N(0, 1) we have

P(𝑍 < −𝑧𝛼/2) = 𝛼/2.

The two bold areas in Figure 5.5 correspond to these probabilities. Therefore, we finally set the rejection region as {𝑧 : |𝑧| > 𝑧𝛼/2}, or

{𝑧 : 𝑧 < −𝑧𝛼/2 or 𝑧 > 𝑧𝛼/2}.

– We see that the probability of a Type I error is the probability 𝑝 that 𝑍 lies in the rejection region when the null hypothesis 𝐻0 is true, and

𝑝 = P( Type I error ) = P(𝐻1 | 𝐻0) = 𝛼

exactly. Sometimes this probability is called the significance level, the 𝛼-error, or the size of the test.

– The rejection region for this alternative hypothesis consists of the two tails of the standard normal distribution and, for this reason, we call it a two-tailed test.


Step 7: Give the decision using the critical value approach

For the two-tailed case the rejection region is

{𝑧 : |𝑧| > 𝑧𝛼/2};

for the one-tailed cases see Section 5.3.2.

Example 5.5. [Aviation Engineering.]

– Air crew escape systems are powered by a solid propellant.


The burning rate of this propellant is an important product characteristic.

– Specifications require that the mean burning rate must be 50 centimeters per second and
the standard deviation is 𝜎 = 2 centimeters per second.

– Test the hypothesis that the mean 𝜇 is 50 cm/s

with a significance level of 𝛼 = 0.05, given that a random sample of 𝑛 = 25 has a sample average burning rate of 𝑥̄ = 51.3 centimeters per second.

GUIDANCE for solving.

1. Identify the parameter of interest.


The parameter of interest is 𝜇, the mean burning rate.

2. Formulate the null hypothesis 𝐻0 : 𝜇 = 50 centimeters per second, and its alternative hypothe-
sis, 𝐻1 : 𝜇 ̸= 50.

3. Choose a significance level 𝛼 = 0.05.


4. Determine an appropriate test statistic: 𝑍0 = (𝑋̄ − 𝜇0)/(𝜎/√𝑛).
5. State the rejection criteria for the statistic.
Reject 𝐻0 if 𝑧0 < −𝑧𝛼/2 or 𝑧0 > 𝑧𝛼/2 .
Note: if use software, this condition is equivalent with the P-value 𝑝 < 0.05.
The boundaries of the critical region would be

{𝑧 : 𝑧 < −𝑧𝛼/2 or 𝑧 > 𝑧𝛼/2}

where 𝑧𝛼/2 = 𝑧0.025 = 1.96 and −𝑧𝛼/2 = −𝑧0.025 = −1.96.

6. Compute the test score; precisely, compute 𝑧0 associated with 𝛼 = 0.05, using the two-sided test in Step 5a.
Since 𝑥̄ = 51.3 and 𝜎 = 2, with 𝑧0 = √𝑛 · (𝑥̄ − 𝜇0)/𝜎:

𝑧0 = √25 · (51.3 − 50)/2 = 3.25.


We see that 𝑧0 = 3.25 > 1.96 = 𝑧0.025 = 𝑧𝛼/2

Table 5.1: Tabulated values of the Laplace function Φ(𝑧)

𝑝 = 1 − 𝛼/2 = Φ(𝑧)   99.94%   99.5%   97.5%   95%     90%    80%    75%      0.5
𝑧𝑝                    3.25     2.58    1.96    1.645   1.28   0.84   0.6745   0

----- −𝑧𝛼/2 ----- 0 ----- 𝑧𝛼/2 --- 𝑧0 ---> 𝑍

7. Draw appropriate conclusions:

we reject 𝐻0 : 𝜇 = 50 at the 0.05 level of significance, and equivalently accept the alternative 𝐻1 : 𝜇 ≠ 50; see Figure 5.6. It means the mean burning rate differs from 50 centimeters per second, based on a sample of 25 measurements. 

Figure 5.6: Critical region for the two-sided test with the 𝑍 statistic
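
A minimal R sketch of Steps 4 to 7 for this example (the last line adds the two-tailed 𝑝-value 2[1 − Φ(|𝑧0|)] as an extra check):

xbar <- 51.3; mu0 <- 50; sigma <- 2; n <- 25; alpha <- 0.05
z0 <- (xbar - mu0) / (sigma / sqrt(n))   # test score, 3.25
zcrit <- qnorm(1 - alpha/2)              # 1.96
abs(z0) > zcrit                          # TRUE, so reject H0
2 * (1 - pnorm(abs(z0)))                 # two-sided p-value, about 0.0012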

Theorem 5.1 (Distribution of the sum of independent normal variables).

If 𝑋1, 𝑋2, . . . , 𝑋𝑛 are independent random variables with means E(𝑋𝑖) = 𝜇𝑖 and variances 𝜎𝑖², and if

𝑌 = 𝑋1 + 𝑋2 + · · · + 𝑋𝑛, then the probability density function 𝑓𝑛(𝑧) of

𝑍 = ( 𝑌 − ∑ᵢ₌₁ⁿ 𝜇𝑖 ) / √( ∑ᵢ₌₁ⁿ 𝜎𝑖² )   (5.6)

approaches that of N(0, 1) as 𝑛 approaches ∞. That is,

𝑓𝑛(𝑧) → 𝑓(𝑧) ∼ N(0, 1).

In the case of i.i.d. variables 𝑋1, 𝑋2, . . . , 𝑋𝑛 with the same means E(𝑋𝑖) = 𝜇 and variances Var(𝑋𝑖) = 𝜎², then

𝑍 = (𝑌 − 𝑛𝜇)/√(𝑛𝜎²) = (𝑌 − 𝑛𝜇)/(𝜎√𝑛) = (𝑋̄ − 𝜇)/𝜎𝑋̄   (5.7)

approaches N(0, 1) in the sense that

𝑓𝑛(𝑧) → 𝑓(𝑧) ∼ N(0, 1).

Example 5.6. Computing Type I and Type II error probabilities; compare Example 5.2.

Drills have 𝜇0 = 50 cm/s and the speeds are normally distributed; 𝑛 = 10 drills are sampled, and the example takes the standard error of the sample mean to be 𝜎𝑥̄ = 0.79 cm/s (i.e. 𝜎 = 2.5 cm/s, since 2.5/√10 ≈ 0.79). We accept 𝐻0 whenever 48.5 ≤ 𝑥̄ ≤ 51.5, for the hypotheses

𝐻0 : 𝜇 = 50 centimeters per second

𝐻1 : 𝜇 ≠ 50 centimeters per second

1. Computing the Type I error probability P(𝐻1 | 𝐻0) = 𝛼:

𝛼 = P[ 𝑋̄ < 48.5 when 𝜇 = 50 ] + P[ 𝑋̄ > 51.5 when 𝜇 = 50 ] = ?

The 𝑧-values that correspond to the critical values 48.5 and 51.5 are

𝑧1 = (𝑥̄1 − 𝜇)/𝜎𝑥̄ = (48.5 − 50)/0.79 = −1.9


𝑧2 = (𝑥̄2 − 𝜇)/𝜎𝑥̄ = (51.5 − 50)/0.79 = 1.9

Therefore

𝛼 = P(𝑍 < −1.90) + P(𝑍 > 1.90) = 0.028717 + 0.028717 = 0.057434,

which implies that 5.74% of all random samples would lead to rejection of the hypothesis 𝐻0 : 𝜇 = 50.

The probability of a Type I error is the sum of the two bolded areas.

2. Computing the Type II error probability P(𝐻0 | 𝐻1) = 𝛽:

If the true value of the mean is 𝜇 = 52 cm/s, then

𝛽 = P[ 48.5 ≤ 𝑋̄ ≤ 51.5 when 𝜇 = 52 ] = ?

The 𝑧-values corresponding to 48.5 and 51.5 when 𝜇 = 52 are

𝑧1 = (𝑥̄1 − 𝜇)/𝜎𝑥̄ = (48.5 − 52)/0.79 = −4.43

𝑧2 = (𝑥̄2 − 𝜇)/𝜎𝑥̄ = (51.5 − 52)/0.79 = −0.63

𝛽 = P(−4.43 ≤ 𝑍 ≤ −0.63) = 0.2643 − 0.00 = 0.2643,

which means that the probability that we will fail to reject the false null hypothesis 𝐻0 is 0.2643, or 26.43%. The power of our statistical test is 𝑝𝑜𝑤𝑒𝑟 = 1 − 𝛽 = 1 − 0.2643 = 0.7357.
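
Both error probabilities in this example are plain normal tail areas, so they can be verified in R (with 𝜎𝑥̄ = 0.79 as assumed above; the exact values differ slightly from the table-based ones):

se <- 0.79                                           # sigma_xbar assumed in the example
a <- pnorm(48.5, 50, se) + 1 - pnorm(51.5, 50, se)   # Type I error, about 0.057
b <- pnorm(51.5, 52, se) - pnorm(48.5, 52, se)       # Type II error, about 0.264
c(alpha = a, beta = b, power = 1 - b)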


5.3.2 Hypothesis Testing- one side test

Steps for one side test- known 𝜎.

1. Select population parameter, as 𝜇 (𝜎 2 or 𝜎).

2. Determine hypotheses few options


* a/ 𝐻0 : 𝜇 ≤ 𝜇0 and the alternative 𝐻1 : 𝜇 > 𝜇0 , see Fig. 5.7
* b/ 𝐻0 : 𝜇 = 𝜇0 and the alternative 𝐻1 : 𝜇 > 𝜇0 .
* c/ 𝐻0 : 𝜇 = 𝜇0 and the alternative 𝐻1 : 𝜇 < 𝜇0 .

Acceptance and Rejection regions

Figure 5.7: Acceptance region and rejection region in a one-sided test

3. Choose significant level 𝛼 and compute critical value 𝑧1−𝛼 with the table

Signif. level 𝛼 1% 2% 2.5% 5% 10% 25%

Confid. level 1 − 𝛼 99% 98% 97.5% 95% 90% 75%

𝑧1−𝛼 2.33 2.05 1.96 1.65 1.28 0.675

4. Compute the sample mean 𝑋̄ and the standardized 𝑍0 of 𝑋̄:

𝑍0 = (𝑋̄ − 𝜇0)/(𝜎/√𝑛).

5. Case a/ If 𝐻0 : 𝜇 ≤ 𝜇0 and 𝐻1 : 𝜇 > 𝜇0, then the rejection region is {𝑍0 : 𝑍0 > 𝑧𝛼}, which satisfies the equation P[𝑍0 > 𝑧𝛼] = 𝛼. This is equivalent to the region

{𝑋̄𝑛 : 𝑋̄𝑛 ≥ 𝜇0 + 𝑧𝛼 · 𝜎/√𝑛}.


− − − − − − − − −0 − − − − − 𝑧𝛼 − − −− > 𝑍

6. Compute the test score

7. Making conclusion uses Figure 5.8.a or b.

Figure 5.8: Testing a hypothesis using the critical value approach

5.3.3 Hypothesis Tests and Confidence Intervals

A close relationship exists between hypothesis testing for 𝜃 and the confidence interval for 𝜃.
If [𝐿, 𝑈] is a 100(1 − 𝛼)% confidence interval for the parameter 𝜃, the test at significance level 𝛼 of the hypothesis pair

𝐻0 : 𝜃 = 𝜃0 versus the alternative 𝐻1 : 𝜃 ≠ 𝜃0

will lead to rejection of 𝐻0 if and only if 𝜃0 is not in the 100(1 − 𝛼)% CI [𝐿, 𝑈].

SUMMARY: General Procedure for Hypothesis Tests


1. Identify the parameter of interest.

Formulate the null hypothesis 𝐻0 , and specify an appropriate alternative hypothesis, 𝐻1 .

2. Choose a significance level, 𝛼.

3. Collect the data and compute the value of the sample statistic

4. Determine an appropriate test statistic.


5. State the rejection criteria for the statistic.

6. Compute the test score

7. Draw appropriate conclusions.

Using the 𝑝-value approach (optional)

Using the 𝑝-value approach is controversial, as noted by the ASA; but software such as R and SPSS still reports it, hence we introduce it in this section.

Definition 5.3. The 𝑝-value is the smallest level of significance that would lead to rejection of the 𝐻0 .
It is the tail area beyond the value of the test statistic 𝑡0 for a one-sided test, or twice this area for a
two-sided test.

We discussed the use of 𝑝-values when we looked at goodness of fit tests. They can be useful as a
hypothesis test with a fixed significance level 𝛼 does not give any indication of the strength of evidence
against the null hypothesis.
The 𝑝-value depends on whether we have a one-sided or two-sided test.

𝑝-value Interpretation

𝑝 > 0.10 No evidence against H0

0.05 < 𝑝 < 0.10 Weak evidence against H0

0.01 < 𝑝 < 0.05 Moderate evidence against H0

0.001 < 𝑝 < 0.01 Strong evidence against H0

𝑝 < 0.001 Very strong evidence against H0

5.4 Interval Estimation and Hypothesis Testing


for the Population Proportion

Motivation

Investors in the stock market are interested in the true proportion of stocks that go up and down each
week. Businesses that sell personal computers are interested in the proportion of households in the
United States that own personal computers. Confidence intervals can be calculated for the true proportion of stocks
that go up or down each week and
for the true proportion of households in the United States that own personal computers.


Figure 5.9: Testing a hypothesis using the 𝑝-value approach

• The procedure to find the confidence interval for a population proportion is similar to that for the
population mean, but the formulas are a bit different although conceptually identical.

• While the formulas are different, they are based upon the same mathematical foundation given to us
by the Central Limit Theorem. Because of this we will see the same basic format using the same
three pieces of information:

- the sample value of the parameter in question,

- the standard deviation of the relevant sampling distribution, and

- the number of standard deviations we need to have the confidence in our estimate that we desire.

5.4.1 Key distribution for proportion problems

How do you know you are dealing with a proportion problem?

Distribution used for a proportion 𝑃 : First, the underlying distribution of a proportion 𝑃 of interest
is a binomial distribution. Why?

We knew that if 𝑋 represents the number of successes in 𝑛 trials, then 𝑋 is a binomial random
variable, and 𝑋 ∼ Bin(𝑛, 𝑝) where 𝑛 is the number of trials and 𝑝 is the probability of a success.

• So a point estimator of the proportion 𝑃 in a binomial experiment is given by the statistic

𝑃̂ = 𝑋/𝑛.

• Therefore, the sample proportion

𝑝̂ = 𝑥/𝑛

will be used as the point estimate of the parameter 𝑃, where

𝑥 = the number of successes in the sample, and 𝑛 = the size of the sample.


• The formula for the confidence interval for a population proportion follows the same format as
that for an estimate of a population mean.

Mean and standard deviation (standard error) of the estimator 𝑃̂: secondly, by the CLT, these are given by

𝜇𝑃̂ = E[𝑃̂] = E[𝑋/𝑛] = 𝑛𝑝/𝑛 = 𝑝   (5.8)

𝜎𝑃̂² = Var(𝑋/𝑛) = 𝜎𝑋²/𝑛² = 𝑛𝑝𝑞/𝑛² = 𝑝𝑞/𝑛,   (5.9)

so the standard error (of the sampling distribution) of 𝑝̂ is

𝜎𝑝̂ = √(𝑝𝑞/𝑛) = √( 𝑝(1 − 𝑝)/𝑛 ).

5.4.2 Interval Estimation of the proportion 𝑃

Therefore, we can assert that

P[ −𝑧𝛼/2 < 𝑍 < 𝑧𝛼/2 ] = 1 − 𝛼,  with  𝑍 = (𝑃̂ − 𝑃)/𝜎𝑝̂,   (5.10)

where 𝑧𝛼/2 is the value above which we find an area of 𝛼/2 under the standard normal curve. Substituting for 𝑍, we write:

P[ −𝑧𝛼/2 < (𝑃̂ − 𝑃)/𝜎𝑝̂ < 𝑧𝛼/2 ] = 1 − 𝛼,   (5.11)

and this gives us the CI of 𝑃 with significance level 𝛼:

𝑃̂ − 𝑧𝛼/2 √(𝑝𝑞/𝑛) < 𝑃 < 𝑃̂ + 𝑧𝛼/2 √(𝑝𝑞/𝑛).

When 𝑛 is large, and we don't know the unknown population proportions 𝑝, 𝑞 beforehand, very little error is introduced by substituting the point estimate 𝑝̂ = 𝑥/𝑛 for the 𝑝 under the radical sign.

More precisely, 𝑝̂ is the numerical value of the statistic 𝑃̂, i.e. the estimated proportion of successes or the sample proportion of successes;

𝑝̂ also is a point estimate for 𝑃, the true population proportion.

Then we can write

𝑝̂ − 𝑧𝛼/2 √(𝑝̂𝑞̂/𝑛) < 𝑃 < 𝑝̂ + 𝑧𝛼/2 √(𝑝̂𝑞̂/𝑛).
Example 5.7.


Suppose that a market research firm is hired to estimate the percent of adults living in a large city
who have cell phones.
Five hundred randomly selected adult residents in this city are surveyed to determine whether they
have cell phones. Of the 500 people sampled, 421 responded yes - they own cell phones.
Using a 95% confidence level, compute a confidence interval estimate for the true proportion of adult
residents of this city who have cell phones.

GUIDANCE for solving.

The solution, step by step. Let 𝑋 = the number of people in the sample who have cell phones. 𝑋 is binomial: the random variable is binary; people either have a cell phone or they do not.
To calculate the confidence interval, we must find 𝑝̂ and 𝑞̂:
𝑛 = 500, 𝑥 = the number of successes in the sample = 421, so 𝑝̂ = 421/500 = 0.842 and 𝑞̂ = 1 − 𝑝̂ = 0.158. Answer: 0.810 ≤ 𝑝 ≤ 0.874.
Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult residents of this city have cell phones.
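
A sketch of this large-sample (Wald) interval in R; note that the built-in prop.test uses a slightly different (Wilson-type) interval, so the manual formula is shown instead:

x <- 421; n <- 500; alpha <- 0.05
phat <- x / n                               # 0.842
se <- sqrt(phat * (1 - phat) / n)           # standard error of phat
phat + c(-1, 1) * qnorm(1 - alpha/2) * se   # about (0.810, 0.874)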

5.5 Testing for two populations

Many decision making problems require to determine whether the means or proportions of two popu-
lation are the same or different. In general, the two sample problem arises when two populations are
to be compared e.g.,
1. IQ’s of babies bottle or breast fed.
2. Time to complete task when instructed in two ways.
3. Weight gain of anorexic girls when subjected to two different treatments.
In Example 1, IQ is the response variable (measured at a specified age); the explanatory variable or
treatment is the type of feeding at infancy. Examples 2,3 could be designed to be carried out under
experimental conditions.

• Taking example 3, suppose a random sample of 16 girls is taken all of whom are anorexic. Randomly
allocate eight of the girls to treatment 1 and the remaining eight to treatment 2.

• To assess the effectiveness of the treatments it is usual that one of the treatments is a ‘control’ i.e.,
no treatment or the usual treatment. Thus

a) the experiment has replication, randomization and a control treatment, and

b) a significant difference does not automatically imply a meaningful difference. 

After careful study of this section, you should be able to do the following.


• Construct comparative experiments involving two samples as tests.

• Test hypotheses and construct confidence intervals on the difference in means of two normal
distributions.

• Use the P-value approach for making decisions

• Compute power, Type II error probability, and make sample size decisions for two-sample
tests on means.

5.5.1 Test hypothesis for population means

Suppose we investigate a common feature of two populations 𝑋 and 𝑌 .

Assumptions

1. Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑛1 be a random sample from 𝑋.

2. Let 𝑌1 , 𝑌2 , · · · , 𝑌𝑛2 be a random sample from 𝑌 .

3. The two populations 𝑋 and 𝑌 are independent.

4. Both 𝑋 and 𝑌 are normal.

We then know that

E[𝑋̄ − 𝑌̄] = E[𝑋̄] − E[𝑌̄] = 𝜇𝑋 − 𝜇𝑌   (5.12)

V[𝑋̄ − 𝑌̄] = V[𝑋̄] + V[𝑌̄] = 𝜎𝑋²/𝑛1 + 𝜎𝑌²/𝑛2.   (5.13)

Key result: The standardized variable

𝑍 = ( 𝑋̄ − 𝑌̄ − (𝜇𝑋 − 𝜇𝑌) ) / √( 𝜎𝑋²/𝑛1 + 𝜎𝑌²/𝑛2 )

has a N(0, 1) distribution.


Null hypothesis: 𝐻0 : 𝜇𝑋 − 𝜇𝑌 = ∆0

Test criterion: Compute the test statistic (from a data set):

a/ if we know 𝜎𝑋, 𝜎𝑌 then use

𝑍 = ( 𝑥̄ − 𝑦̄ − ∆0 ) / √( 𝜎𝑋²/𝑛1 + 𝜎𝑌²/𝑛2 )

(we can replace 𝜎𝑋, 𝜎𝑌 by the sample standard deviations 𝑠𝑋, 𝑠𝑌);

b/ or, if we do not know 𝜎𝑋, 𝜎𝑌 and the sample sizes are small, use

𝑇 = ( 𝑥̄ − 𝑦̄ − ∆0 ) / ( 𝑠𝑝 √(1/𝑛1 + 1/𝑛2) ),

which has a 𝑡 distribution with 𝑛1 + 𝑛2 − 2 degrees of freedom, where the pooled variance

𝑠𝑝² = [ (𝑛1 − 1)𝑠𝑋² + (𝑛2 − 1)𝑠𝑌² ] / (𝑛1 + 𝑛2 − 2)

depends on the sample variances 𝑠𝑋², 𝑠𝑌².

Alternative Hypothesis      P-Value                                   Rejection Criterion for Fixed-Level Tests

H₁: μ₁ − μ₂ ≠ Δ₀           P = 2[1 − Φ(|z₀|)]                        z₀ > z_{α/2} or z₀ < −z_{α/2}
                           (probability above |z₀| plus
                            probability below −|z₀|)

H₁: μ₁ − μ₂ > Δ₀           P = 1 − Φ(z₀) (probability above z₀)      z₀ > z_α

H₁: μ₁ − μ₂ < Δ₀           P = Φ(z₀) (probability below z₀)          z₀ < −z_α

Summary of the two-population test based on normality

a/ If using Z, see the table above.

b/ If using T, apply the same argument, but remember to use n₁ + n₂ − 2 degrees of freedom when locating the suitable critical value t_α or t_{α/2}.

5.5.2 Testing for means from independent samples

Here it is assumed that the experimental units are relatively homogeneous. The two treatments are
randomly assigned to the experimental material and the appropriate measurements taken.


Example 5.8.

Vitamin D deficiency, it is conjectured, could be related to the amount of fibre in the diet.
Two groups of healthy adults are randomly assigned to one of two diets: Normal or High Fibre.
A measure of vitamin D is then obtained for each of the subjects:

Normal Diet 𝑋 19.1 24.0 28.6 29.7 30.0 34.8


High Fibre 𝑌 12.0 13.0 13.6 20.5 22.7 23.7 24.8

Assumptions:

• The 𝑋, 𝑌 measures of vitamin D are normal.

• The variances of the two populations are equal 𝜎 2 .

Notation. X₁ = X, X₂ = Y.
For the Normal diet,
let x̄₁ be the sample mean and s₁² the sample variance;
let μ₁ be the population mean, with common variance σ².

For the High Fibre diet,
let x̄₂ be the sample mean and s₂² the sample variance;
let μ₂ be the population mean, with variance σ².

Hypotheses:
H₀: μ₁ = μ₂, or μ₁ − μ₂ = Δ₀ = 0,

versus H₁: μ₁ ≠ μ₂.

Test Statistic: Since we do not know σ₁, σ₂ and the sample sizes are small, we use the sample variances:

T = (x̄₁ − x̄₂) / [s_p √(1/n₁ + 1/n₂)],

where the pooled variance is

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2).
Make decision: If 𝐻0 is true then 𝑇 ∼ 𝑡𝑛1 +𝑛2 −2 , where

𝑛1 + 𝑛2 − 2 = 6 + 7 − 2 = 11

Reject 𝐻0 if the Test Statistic


𝑇 > 𝑡𝑛1 +𝑛2 −2;𝛼/2 or 𝑇 < −𝑡𝑛1 +𝑛2 −2;𝛼/2 ; that means 𝑇 > 2.20 or 𝑇 < −2.20.

From samples, we get 𝑥1 = 27.70, 𝑥2 = 18.61, 𝑠21 = 29.63, 𝑠22 = 30.80 and 𝑠2𝑝 = 30.27.
Hence the test statistic 𝑇 = ... = 2.97 > 2.20: we reject the null hypothesis 𝐻0 .
We conclude that there is a significant difference between the vitamin D levels of the two
groups. (Vitamin D levels are lower by an estimated 9.09 in the high fibre diet group.)
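
This pooled two-sample test is a one-liner in R; the following sketch with the data of Example 5.8 should reproduce T ≈ 2.97 on 11 degrees of freedom:

x <- c(19.1, 24.0, 28.6, 29.7, 30.0, 34.8)         # Normal diet
y <- c(12.0, 13.0, 13.6, 20.5, 22.7, 23.7, 24.8)   # High Fibre diet
t.test(x, y, var.equal = TRUE)   # pooled t-test: T ≈ 2.97, df = 11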

SUMMARY: General Procedure for Two population Hypothesis Tests

1. Identify the parameter of interest.

2. Formulate the null hypothesis 𝐻0 , and specify an appropriate alternative hypothesis, 𝐻1 .

3. Choose a significance level, 𝛼.

4. Determine an appropriate test statistic:

a/ if we know 𝜎1 , 𝜎2 then

Z = (x̄₁ − x̄₂ − Δ₀) / √(σ₁²/n₁ + σ₂²/n₂)

b/ or if we do not know σ₁, σ₂ and the sample sizes are small, then we use

T = (x̄₁ − x̄₂ − Δ₀) / [s_p √(1/n₁ + 1/n₂)],

where the pooled standard deviation is

s_p = √{[(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)}.

5. State the rejection criteria for the statistic.

6. Compute the test score (defined from Step 4).

7. Draw appropriate conclusions.

Recall that Gosset's T statistic is given by

T = (X̄ − μ) / (S/√n) ∼ t_{n−1}    (5.14)

where S is the sample standard deviation.


5.5.3 Testing for means from matched pairs

Here the two samples are not independent. Subjects to be allocated to the two treatments are paired.
Pairing is sometimes achieved
by using twins, using the same subject twice, or
simply pairing by some other characteristic.

Example 5.9. Six healthy male subjects were taken and their growth hormone level measured at night
under two conditions: Post daily exercise and without daily exercise. The results were:

Subject         1    2    3    4    5    6
Post Exercise  13.6 14.7 42.8 20.0 19.2 17.3
Control         8.5 12.6 21.6 19.4 14.7 13.6

Differences     5.1  2.1 21.2  0.6  4.5  3.7


Here we consider the sample of 6 differences, assuming that the population of differences is Normally
distributed, but the population variance is unknown.
Hypotheses: 𝐻0 : 𝜇1 = 𝜇2 , vs 𝐻1 : 𝜇1 ̸= 𝜇2 .
Test Statistic

T = d̄ / (s_d √(1/n))

[due to Equation (5.14)], where d̄ = x̄₁ − x̄₂ is the mean difference,
s_d is the sample standard deviation of the differences, and
n is the sample size.

• If 𝐻0 is true then 𝑇 ∼ 𝑡𝑛−1 .

• Reject 𝐻0 if the statistic 𝑇 > 𝑡𝑛−1;𝛼/2 = 2.57 or 𝑇 < −𝑡𝑛−1;𝛼/2 ;

• Observed values are: 𝑛 = 6, 𝑑 = 6.2 and 𝑠𝑑 = 7.53

• Observed
𝑇 = ... = 2.017 ∈ [−2.57, 2.57],

so no evidence to reject 𝐻0 . Conclude that on average there is not a significant difference in the
growth hormone level when exercise is taken compared to that when there is no exercise. 
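
A matching R sketch for this paired test (data typed in from Example 5.9):

post <- c(13.6, 14.7, 42.8, 20.0, 19.2, 17.3)
ctrl <- c(8.5, 12.6, 21.6, 19.4, 14.7, 13.6)
t.test(post, ctrl, paired = TRUE)   # T ≈ 2.017, df = 5; |T| < 2.57, do not reject H0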

5.5.4 Choice of sample sizes for inference

For inference on two populations


When making inference for a difference in means of two normal distributions with variances known, we can find a suitable sample size (per group) as

n = (z_{α/2}/E)² (σ₁² + σ₂²).    (5.15)

DATA ANALYTICS- FOUNDATION


CHAPTER 5. STATISTICAL HYPOTHESIS TESTING
168 CONFIRMING YOUR CLAIMS OR BELIEFS ABOUT POPULATION PARAMETERS

where E is the maximum margin of error, as usual. Remember also to round up if n is not an integer; this ensures that the level of confidence does not drop below 100(1 − α)%, where α is the significance level.

For inference on one population

Given type I error 𝛼 and type II error 𝛽 we can determine sample size 𝑛.

For a two-sided alternative hypothesis:

n = [(z_{α/2} + z_β) σ / δ]²,    (5.16)

where δ = μ − μ₀, and n must be rounded up to an integer.

For a one-sided alternative hypothesis:

n = [(z_α + z_β) σ / δ]²,    (5.17)

where δ = μ − μ₀, and n must be rounded up to an integer.
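
For instance, Equation (5.16) is a one-liner in R; the inputs below (α, β, σ, δ) are illustrative assumptions, not from the text:

alpha <- 0.05; beta <- 0.10; sigma <- 2; delta <- 1
n <- ((qnorm(1 - alpha/2) + qnorm(1 - beta)) * sigma / delta)^2
ceiling(n)   # round up to guarantee the required power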

5.6 Summary for Statistical Inference

5.6.1 Terms of Hypothesis Testing

1. Parameter: a numerical characteristic of a population.

2. Point Estimate: a single number computed from a sample and used to estimate a population param-
eter

3. Confidence Interval (CI): an interval estimate for an unknown population parameter. This depends on:
• the desired confidence level,
• information that is known about the distribution,
• the sample and its size.

4. Confidence Level (CL): the percent expression for the probability that the confidence interval contains
the true population parameter; for example, if the CL = 90%, then in 90 out of 100 samples the interval
estimate will enclose the true population parameter.

5. Hypothesis: a statement about the value of a population parameter.

The actual test begins by considering two hypotheses, called the null hypothesis (notation 𝐻₀) and the alternative hypothesis (notation 𝐻ₐ). These hypotheses contain opposing viewpoints.

6. 𝐻0 - The null hypothesis: It is a statement about the population that either is believed to be true or
is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.

7. 𝐻₁ or 𝐻ₐ - The alternative hypothesis: It is a claim about the population that is contradictory to 𝐻₀ and what we conclude when we reject 𝐻₀.


Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if
you have enough evidence to reject the null hypothesis or not.
The evidence is in the form of sample data.

8. After you have determined which hypothesis the sample supports, you make a decision. There are
two options for a decision.
1/ ”reject 𝐻0 ” if the sample information favors the alternative hypothesis or
2/ ”do not reject 𝐻0 ” or ”decline to reject 𝐻0 ” if the sample information is insufficient to reject the null
hypothesis.

9. 𝑡-distribution with 𝑛 − 1 degrees of freedom.

• The 𝑡-score (statistic) has the same interpretation as the 𝑧-score: it measures how far x̄ is from μ.
• For each sample size 𝑛, there is a different Student’s 𝑡-distribution.
• For example, if we have a sample of size 𝑛 = 20 items, then we calculate the degrees of freedom
as
𝑑𝑓 = 𝑛 − 1 = 20 − 1 = 19
and we write variable 𝑇 ∼ 𝑡19 .

5.6.2 Sampling distributions of popular statistics

Under appropriate conditions, generally if S is a statistic of interest, then approximately

(S − E[S]) / S.E.(S) = (S − E[S]) / √V[S] = (S − E[S]) / σ_S ∼ N(0, 1)    (5.18)

or equivalently

[S − E[S]]² / V[S] ∼ χ²(1).    (5.19)

Examples include the CLT: the sample mean follows a normal distribution, X̄ ∼ N(μ, σ²/n). When n is large (n > 30) the standardized version of X̄,

Z_n = (X̄_n − μ) / (σ/√n),  satisfies  lim_{n→∞} Z_n = Z ∼ N(0, 1).
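
A short simulation sketch (illustrative only; an exponential population with μ = σ = 1 is assumed here) shows the standardization of Equation (5.18) at work:

set.seed(1)
n <- 50
z <- replicate(10000, (mean(rexp(n, rate = 1)) - 1) / (1 / sqrt(n)))
c(mean(z), sd(z))   # close to 0 and 1, as N(0,1) predicts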

5.7 Problems

5.7.1 Fundamentals

1. You are testing that the mean speed of your cable Internet connection is more than three Megabits
per second.


a/ What is the random variable? Describe in words.

ANS: The random variable is the mean Internet speed in Megabits per second.

b/ State the null and alternative hypotheses.

ANS: DIY

2. The mean entry level salary of an employee at a company in USA is $58,000. You believe it is higher
for IT professionals in the company.

State the null and alternative hypotheses.

ANS: 𝐻0 : 𝜇 = $58, 000

𝐻1 : 𝜇 > $58, 000

3. In a population of fish, approximately 42% are female. A test is conducted to see if, in fact, the
proportion is less. State the null and alternative hypotheses.

ANS: DIY

4. A group of doctors is deciding whether or not to perform an operation. Suppose the null hypothesis
𝐻0 is: the surgical procedure will go well.

State the Type I and Type II errors in complete sentences.

ANS: Type I: The procedure will go well, but the doctors think it will not.

Type II: The procedure will not go well, but the doctors think it will.

(a) Consider a hypothesis testing problem: It is believed that 70% of males pass their drivers test in
the first attempt, while 65% of females pass the test in the first attempt. Of interest is whether
the proportions are in fact equal.
Indicate if the hypothesis test is for
a. independent group means, population standard deviations, and/or variances known
b. independent group means, population standard deviations, and/or variances unknown
c. matched or paired samples; d. single mean; e. two proportions; f. single proportion.
ANS: e. two proportions

(b) A study is done to determine which of two soft drinks has more sugar. There are 13 cans of
Beverage A in a sample and six cans of Beverage B. The mean amount of sugar in Beverage A
is 36 grams with a standard deviation of 0.6 grams. The mean amount of sugar in Beverage B
is 38 grams with a standard deviation of 0.8 grams.
The researchers believe that Beverage B has more sugar than Beverage A, on average. Both
populations have normal distributions.
Is this a one-tailed or two-tailed test? ANS: This is a one-tailed test.


(c) The Telegraph newspaper reported that United Kingdom (U.K.) subscribers with third-generation technology (3G) phones in 2006 spent an average of 8.3 hours per month listening to country music on their cell phones.
To study what happens in the U.S., researchers drew the following random sample 𝑥 of monthly music listening times from the U.S. population of 3G subscribers:

𝑥 = 5, 6, 0, 4, 11, 9, 2, 3 (hours per month).

Do these data suggest that, on average, an American subscriber listens to country music less
than a U.K subscriber? Explain clearly your conclusion.
Hints: If use confidence interval or hypothesis testing you must choose a confidence level of
95%.

5.7.2 Hypothesis Testing in Manufacturing

In the past, machines of a production company produced bearings (small mechanical components used in devices such as digital cameras and tablets) with an average thickness of 0.05 cm. To determine whether these machines were still operating normally, a sample of 10 bearings was selected; the average thickness of 0.053 cm and the standard deviation of 0.003 cm were calculated. Assume a significance level of 𝛼 = 0.05.
1. Test the hypothesis that this machine works normally.
2. Find the value 𝑃 of the above test.

GUIDANCE for solving.

1. We choose the two-sided test to satisfy the problem’s requirements.


Let X be the thickness of bearings produced by the company. The pair of hypotheses is
H₀: μ = 0.05 (the machine works normally)
and H₁: μ ≠ 0.05 (the machine works abnormally).
The sample size is n = 10, the sample mean x̄ = 0.053, and the sample standard deviation s = 0.003.
With the significance level α = 5% = 0.05 and a small sample (n = 10 < 30) we use the Student distribution; the two-sided test gives t_α = 2.2622.
By the Student distribution,

T = (X̄ − μ)/σ_X̄ = [(X̄ − μ)/S]·√n ∼ t_{n−1},

we get the value

t = [(x̄ − 0.05)/s]·√n = 3.1623 > t_α = 2.2622 (see Table 5.2),

so we reject H₀ and accept H₁. The machine is not working properly.
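
An R sketch for both parts (the test score, then the two-sided P-value of part 2) from the summary statistics:

n <- 10; xbar <- 0.053; s <- 0.003; mu0 <- 0.05
t0 <- (xbar - mu0) / (s / sqrt(n))                           # 3.1623
p.value <- 2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)   # about 0.0115
c(t0, p.value)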


Table 5.2: Percentiles 𝑡_α of the probability density function of 𝑇

1-Tailed, α =        0.20    0.15    0.10    0.05    0.025   0.01
2-Tailed, α =        0.40    0.30    0.20    0.10    0.05    0.02
Confidence level β   0.60    0.70    0.80    0.90    0.95    0.98

df = 1               1.376   .       .       .       .       31.82
     2               1.06    .       .       .       .       6.96
     5               0.92    1.156   1.476   2.015   2.571   3.365
     7               .       .       1.415   1.895   2.365   2.998
     8               .       .       .       1.86    2.306   2.896
     9               0.883   1.1     1.38    1.83    2.262   2.821
     21              .       .       .       1.72    2.080   2.518
     26              .       .       .       1.70    2.056   2.479

5.7.3 Hypothesis Testing in Insurance Industry

The insurance premium (or insurance fee) 𝐾 paid by a customer is thought to be affected by the risk preference (or risk type) of that customer. The standard deviation of 𝐾 is known to be 𝜎 = 3 (unit: 10 USD per year) regardless of risk type. The types are

Type 1 (customers are risk neutral) and

Type 2 (customers are risk seeking).


Denote by 𝜇1 and 𝜇2 respectively the insurance premium means of Type 1 and Type 2 populations.
For each of the risk types we collect ten observations, given as:
* Type 1 (risk neutral customers):

𝑢 = 66.4, 71.7, 70.3, 69.3, 64.8, 69.6, 68.6, 69.4, 65.3, 68.8

* Type 2 (risk seeking customers):

𝑣 = 57.9, 66.2, 65.4, 65.4, 65.2, 62.6, 67.6, 63.7, 67.2, 71.0

• Find a two-sided 95% confidence interval of 𝜇1 .


• If we select significance level 𝛼 = 0.05, for which of the following pairs of hypotheses can you draw a conclusion from the given data sets?
𝐻₀: 𝜇₁ = 𝜇₂, 𝐻₁: 𝜇₁ ≠ 𝜇₂;  or  𝐻₀: 𝜇₁ − 𝜇₂ = 0, 𝐻₁: 𝜇₁ > 𝜇₂?

Hint: for a one-sided hypothesis test of the difference of the population means 𝜇₁, 𝜇₂, one needs to find the Student critical value 𝑡_{𝛼, 𝑛₁+𝑛₂−2} and the score 𝑡₀.
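
A hedged R sketch for this problem (since σ = 3 is known, a z statistic is natural; a pooled t test as in the hint, via t.test(u, v, var.equal = TRUE), would work similarly):

u <- c(66.4, 71.7, 70.3, 69.3, 64.8, 69.6, 68.6, 69.4, 65.3, 68.8)
v <- c(57.9, 66.2, 65.4, 65.4, 65.2, 62.6, 67.6, 63.7, 67.2, 71.0)
sigma <- 3; n <- length(u)
mean(u) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # 95% z-interval for mu1
z0 <- (mean(u) - mean(v)) / (sigma * sqrt(2 / n))     # two-sample z score
z0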

5.7.4 Critical thinking

1. Which two distributions can you use for hypothesis testing for this chapter? ANS: A normal distribution
or a Student’s 𝑡-distribution

2. Which distribution do you use when you are testing a population mean and the population standard
deviation is known? Assume a normal distribution, with 𝑛 ≥ 30.
ANS: Use a normal distribution.

3. A population mean is 13. The sample mean is 12.8, and the sample standard deviation is two. The
sample size is 20.
What distribution should you use to perform a hypothesis test?
Assume the underlying population is normal. ANS: a Student’s 𝑡-distribution

4. A particular brand of tires claims that its deluxe tire averages at least 50,000 miles before it needs to
be replaced. From past studies of this tire, the standard deviation is known to be 8,000. A survey of
owners of that tire design is conducted. From the 28 tires surveyed, the mean lifespan was 46,500
miles with a standard deviation of 9,800 miles.
Using 𝛼 = 0.05, is the data highly inconsistent with the claim?
HINT: Use the 7-step procedure for a one-sided test, with

𝐻₀: 𝜇 ≥ 50,000  vs  𝐻ₐ: 𝜇 < 50,000.

ANS: There is sufficient evidence to conclude that the mean lifespan of the tires is less than 50,000 miles. The CI is (43,537, 49,463).

5. The Nation, a key press in Thailand, recently reported that the average PM2.5 pollution index in the air of the BKK metropolis was 24 (in micrograms/m³) in Summer. In Bangkok City, Thai researchers observed the following random sample

𝑥 = 24, 22, 26, 34, 35, 32, 33, 29, 19, 36, 30, 15, 17, 28, 38, 40, 37, 27

of this index on 18 consecutive days of Summer 2017.

At a 5% level of significance, based on the sample data, write the pair of hypotheses expressing the severity level of PM2.5 pollution in the Bangkok metropolis, and compute the 𝑡₀ score.
ANS: DIY to get 𝐻₀: 𝜇 ≤ 24, 𝐻₁: 𝜇 > 24; 𝑡₀ = 2.87
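
The reported 𝑡₀ can be checked with a short R sketch:

x <- c(24, 22, 26, 34, 35, 32, 33, 29, 19, 36, 30, 15, 17, 28, 38, 40, 37, 27)
t0 <- (mean(x) - 24) / (sd(x) / sqrt(length(x)))   # one-sided test of H1: mu > 24
t0   # approximately 2.87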


5.8 Few Case studies with Data Analytics Approach

5.8.1 Case 1: Inference for political science

• Employing binomial distribution in political science

Consider a political election in the US (or any democratic nation having two parties), where each
people either say YES to vote for the 2nd term of President T

or say NO to vote for a new president.

Write X ∼ Bin(k, θ) for the binomial, formed from a Bernoulli sample of size k = 3,

Z₁, Z₂, ⋯, Z_k ∼ B(θ);

then X = Σ_{i=1}^{k} Zᵢ is the number of individuals in each sample of size k who say "Yes".

• DATA: We observed a random sample X = 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 with 𝑛 = 40, each 𝑋𝑖 follows the same
binomial 𝑋 ∼ Bin(3, 𝜃), known E[𝑋] = 𝑘 𝜃.

The forty binomial values are summarized in the following table.

Table 5.3: Data produced from binomial distribution

𝑋=𝑗 0 1 2 3

Frequency 6 10 14 10

𝑛0 𝑛1 𝑛2 𝑛3

• QUESTIONS:

a/ Find the mle 𝜃̂︀ of 𝜃

b/ Compute the corresponding probability P(𝑋 ≥ 2), indicating an estimated proportion of Americans
who support President T.

GUIDANCE for solving.

You may employ all concepts provided in Section 4.3.


a/ Use the MLE machinery to get θ̂ = (Σ_{j=0}^{k} j·n_j)/(nk) = 68/120.

We observed n = 40 random samples; each sample i gives Xᵢ ∼ X = Bin(3, θ), with Range(X) = {0, 1, 2, ..., k}, where k = 3. The pmf of Xᵢ is

P[Xᵢ = x] = P[X = x] = C(k, x) θˣ(1 − θ)^{k−x},  x ∈ R_X = {0, 1, ..., k},  θ ∈ M = (0, 1).


Suppose the value j has frequency n_j; then n = Σ_{j=0}^{k} n_j, with j ∈ Range(X). We prove that

θ̂_n = (Σ_{i=1}^{n} Xᵢ)/(nk) = (Σ_{j=0}^{k} j·n_j)/(nk)

with n_j = #{i = 1..n : Xᵢ = j}. Indeed, θ = Prop(Yes) = the proportion of people in the population who say "Yes"; then from Example 4.5, the ML estimator of θ is

θ̂_n = (Σ_{i=1}^{n} Xᵢ)/(nk) ∈ {0, 1/(nk), 2/(nk), ..., nk/(nk)}.

With the data in Table 5.3 the ML estimate is

θ̂_n = (Σ_{j=0}^{k} j·n_j)/(nk) = 68/120 = 17/30.

Write S = Σ_{i=1}^{n} Xᵢ; then the sample proportion of people answering "Yes" is S/(nk) = θ̂_n, and its mean and variance respectively are

E[θ̂_n] = E[S/(nk)] = nkθ/(nk) = θ,

V[θ̂_n] = V[S/(nk)] = V[S]/(n²k²) = θ(1 − θ)/(nk).

Then, by the CLT, for 'large n' we also get, approximately,

θ̂_n = S/(nk) ∼ N(θ, θ(1 − θ)/(nk)).

b/ Now the corresponding probability is

P[X ≥ 2] = P[X = 2] + P[X = 3] = 1 − P[X = 0] − P[X = 1] = ...

This P(X ≥ 2) measures how likely Americans are to vote for President T.

In general we view

P[X ≥ c] = Σ_{c ≤ j ≤ k} P[X = j],

where

P[X = j] = C(k, j) θ̂ʲ(1 − θ̂)^{k−j}    (5.20)

is just the pmf of the binomial distribution Bin(k, θ̂).
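
Both answers can be computed in a few lines of R (a sketch, with Table 5.3 typed in):

freq <- c(6, 10, 14, 10)                    # counts of X = 0, 1, 2, 3 (Table 5.3)
n <- sum(freq); k <- 3
theta.hat <- sum((0:3) * freq) / (n * k)    # 68/120 = 17/30
pbinom(1, size = k, prob = theta.hat, lower.tail = FALSE)   # P(X >= 2)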


̂︀

5.8.2 Case 2: Inference for natural disaster control

• Using Poisson distribution in natural disaster study


ASEAN nations- in particular the Mekong River Basin with countries of Lao, Thailand, Cambodia and
Vietnam- are experiencing extreme natural threats or disasters (to be defined as either flooding or
drought) for years. Those extreme phenomena are called rare events, and a Poisson distribution
Poiss(𝜃) is used to model them.
Scientists of the Mekong River Committee (MRC) observed 𝑋, the monthly number of natural disas-
ters continuously in the last 5 years in the entire region. They code threat levels in five codes of 0, 1,
2, 3, 4 where the larger value means the more serious level the disaster is.
The following table shows observed data

Table 5.4: Data produced from Poisson distribution

𝑥 0 1 2 3 4

Frequency 9 14 14 12 11

summarizing a random sample of size 𝑛 = 60.

• QUESTIONS:
a/ Find the maximum likelihood estimate (MLE) 𝜃̂︀ of 𝜃.
b/The most fatal events are expressed by codes 3 and 4.
Compute the probability P(𝑋 ≥ 3) (measuring how much likelihood the most fatal events happened
in the past 5 years), from which the MRC could workout risk-control solutions.

GUIDANCE for solving.

a/ Carry out the ML estimation process to get λ_ML = θ̂ = X̄:

• Since the Xᵢ are IID, we can write the likelihood function (of observing the data vector x = (x₁, x₂, ..., x_n)) with respect to θ by the formula

L(θ) = f(x; θ) = Π_i p(Xᵢ; θ).

• Take the derivative of L* = log L(θ) with respect to the variable θ and solve

∂ log L(θ)/∂θ = 0 ⟺ θ̂ = X̄.

Check that ∂²L*(θ)/∂θ² < 0. The estimate is

θ̂_ML = θ̂ = X̄ = (Σᵢ xᵢ)/n = (0·9 + 1·14 + 2·14 + 3·12 + 4·11)/60 = 122/60 ≈ 2.03.



b/ The probability P[X ≥ 3] of the most fatal events, evaluated at the ML estimate λ̂_ML, is

P[X ≥ 3] = Σ_{x ≥ 3} P[X = x] = 1 − P[X < 3] = 1 − F(2),

where

P[X = x] = λ̂ˣ e^{−λ̂} / x!,  x = 0, 1, 2, ...    (5.21)

is just the pmf of the Poisson distribution Pois(λ̂).

So P(X ≥ 3) = 1 − F(2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)] = 0.3314.
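
A sketch in R (data from Table 5.4):

freq <- c(9, 14, 14, 12, 11)                 # counts of x = 0..4 (Table 5.4)
lambda.hat <- sum((0:4) * freq) / sum(freq)  # 122/60
ppois(2, lambda.hat, lower.tail = FALSE)     # 1 - F(2), about 0.33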



PART C
DATA ANALYSIS
Statistical Designs and Linear Regression

Chapter 6: Experimental Designs in Engineering

Chapter 8 : Multivariate Probability Distributions

Chapter 9: Simple Linear Regression

Chapter 10: Inference for Linear Regression

Chapter 11: Multiple Linear Regression Analysis

Engineering and economics students nowadays need to know a few key quantitative methods in order to properly design and test their prototypes before handing products (cars, mobile phones, sensors...) or services over to industry for mass manufacturing. These tasks essentially require a causal analysis between predictors and responses (dependent variables).

1. Data Analysis when predictors are qualitative:

Chapter 6 discusses a group of methods for this task, called statistically designed experiments,
an essential knowledge body for students in engineering institutions.

Chapter 8 presents Multivariate Probability Distributions, specifically with concepts

Random vector- Joint and marginal distributions

Covariance and correlation of variables, and Conditional distributions.

2. Data Analysis when predictors are quantitative:

What if predictors (independent variables) are quantitative? The causal relationship between
predictors and a response variable will be studied from Chapter 9, to Chapter 11. Chapter
9 covers correlation and regression analysis, linear regression, and analysis of variance for
regression. Chapter 10 and Chapter 11 discuss advanced aspects of linear regression.
Chapter 6

Experimental Designs in Engineering


Causal analysis with ordinal variables

[Source [56]]

6.1 Introduction

Statistically designed experiments (or Experimental Designs, in brief) have been used to accelerate learning since their introduction by R.A. Fisher in the first half of the twentieth century. One of Fisher's major contributions was the development of factorial experiments, which can simultaneously study several factors; see the major methods in the texts [6], [?] and [73].
New challenges and opportunities arose when Fisher's ideas were applied in the industrial environment. Experimental Designs are now recognized as essential for rapid learning and manufacturing, thereby reducing the time-to-market of products while preserving high quality and peak performance.

Figure 6.1: Sir R.A. Fisher, the inventor of Experimental Designs

Nowadays, statistically based experimental design techniques are particularly useful in the engi-
neering world for solving many important problems: discovery of new basic phenomena that can lead
to new products and commercialization of new technology including new product development, new
process development, and improvement of existing products and processes.

Repeated experiments

Experiments (or repeated experiments) are used in industry to improve productivity, reduce variability
and obtain robust products/processes. In this section we study how to design and analyze experi-
ments which are aimed at testing hypotheses (both scientific and technological).
These hypotheses are concerned with

• the effects of procedures or treatments on the yield; the relationship between variables;

• and the conditions under which a production process yields maximum output or optimum results.


Briefly, every experiment involves a sequence of activities:

1. Conjecture: the original hypothesis that motivates the experiment.

2. Experiment: the test performed to investigate the conjecture.

3. Analysis: the statistical analysis of the data from the experiment.

4. Conclusion: what has been learned about the original conjecture from the experiment.

Often the experiment will lead to a revised conjecture, a new experiment, and so forth.

Standard concepts of Experimental Designs in Engineering

The following are guiding principles which we follow in designing experiments. They are designed to
ensure high information quality (InfoQ) of the study, and robust products in engineering and manufac-
turing (see Kenett [?]).

1. The objectives of the study should be well stated, and criteria established to test whether these
objectives have been met.

2. The response variable(s) should be clearly defined so that the study objectives are properly trans-
lated. At this stage measurement uncertainty should be established.

3. All factors which might affect the response variable(s) should be listed, and specified. We call these
the controllable factors. This requires interactive brainstorming with content experts.

4. The type of measurements or observations on all variables should be specified.

5. The levels of the controllable factors to be tested should be determined.

6. A statistical model should be formulated concerning the relationship between the pertinent vari-
ables, and their error distributions. This can rely on prior knowledge or literature search.

7. An experimental layout or experimental array should be designed so that the inference from the
gathered data will be:

(a) valid; (b) precise; c) generalizable; (d) easy to obtain.

8. The trials should be performed if possible in a random order, to avoid bias by factors which are not
taken into consideration.

9. A protocol of execution should be prepared, as well as the method of analysis.

The method of analysis and data collection depends on the design.

10. The execution of the experiment should follow the protocol with proper documentation.


11. The results of the experiments should be carefully analyzed and reported ensuring proper documen-
tation and traceability.

12. Confirmatory experiments should be conducted to validate the inference (conclusions) of the main
experiments.

6.2 Fisher variable and statistic- a reminder

The Fisher random variable is a ratio of independent mean squares, denoted by 𝐹[𝜈₁, 𝜈₂], where 𝜈₁ > 0 and 𝜈₂ > 0 are the numerator and denominator degrees of freedom, respectively.
Its density function 𝑓(𝑥) is shown in Figure 6.2, with Range = {𝑥 : 𝑥 > 0}.

Expectation: E[F[ν₁, ν₂]] = ν₂/(ν₂ − 2) for ν₂ > 2;
Variance:

V[F[ν₁, ν₂]] = 2ν₂²(ν₁ + ν₂ − 2) / [ν₁(ν₂ − 2)²(ν₂ − 4)],  if ν₂ > 4.

Figure 6.2: The pdf of 𝐹(5, 5), 𝐹(5, 15) and percentiles (left panel: probability density functions of two F distributions; right panel: upper and lower percentage points of the F distribution)

Concept: x* is the p-th quantile (0 < p < 1) of a distribution if

P[X < x*] = p ⟺ F(x*) = p ⟺ x* = F⁻¹(p) = q(p),

where F is the cdf and F⁻¹ = q is the quantile function.


The α-th percentile x* = q_α[ν₁, ν₂] of the Fisher variable is found with R.

To get q_α[ν₁, ν₂] when α < 0.5, we apply

q_α[ν₁, ν₂] = 1 / q_{1−α}[ν₂, ν₁].    (6.1)

For example, q_{0.05}[15, 10] = 1/q_{0.95}[10, 15] = 1/2.54 = 0.3932, due to

q_{1−α}[ν₁, ν₂] = 1 / q_α[ν₂, ν₁].
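
In R these percentiles come from qf(); a quick sketch verifying Equation (6.1):

qf(0.95, df1 = 10, df2 = 15)        # q_0.95[10, 15], about 2.54
1 / qf(0.95, df1 = 10, df2 = 15)    # about 0.3932
qf(0.05, df1 = 15, df2 = 10)        # the same value, computed directly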

Knowledge box 2. (Generic Fisher distribution)

• The F statistic and distribution:

When s₁² and s₂² are sample variances from independent SRSs of sizes n₁ and n₂ drawn from normal populations, the statistic

F = (s₁²/σ₁²) / (s₂²/σ₂²)

has the F distribution with ν₁ = n₁ − 1 and ν₂ = n₂ − 1 degrees of freedom.

• Relationship:

For independent random variables χ²[ν₁] and χ²[ν₂], the ratio of the mean squares

(χ²[ν₁]/ν₁) / (χ²[ν₂]/ν₂) = F[ν₁, ν₂].

In the subsequent parts, we discuss a few of the most popular designs in engineering:

* the Completely Randomized Design (CRD) with one factor,
* Randomized Complete Block Designs (RCBD) using additive models,
* Balanced Incomplete Block Designs (BIBD), Factorial Designs and Fractional Factorial Designs.

6.3 Completely Randomized Designs

The use of experimental design in the engineering design process can result in

• products that are easier to manufacture,

• products that have better field performance and reliability than their competitors, and

• products that can be designed, developed, and produced in less time.


6.3.1 CRD, Completely Randomized Design- Theory

For the sake of explanation, let us start first with CRD- the simplest repeated experiment, in which the
response depends on one factor only. Thus, let 𝐴 designate some factor, which is applied at different
levels or categories 𝐴1 , · · · , 𝐴𝑎 . The levels of 𝐴 are also called ‘treatments.’

• Suppose that at each level of 𝐴 we make 𝑛 independent repetitions (replicas) of the experiment.

Let 𝑌𝑖𝑗 (𝑖 = 1, 2, · · · , 𝑎 and 𝑗 = 1, 2, · · · , 𝑛) denote the observed yield at the 𝑗-th replication of level
𝐴𝑖 . We model the random variables 𝑌𝑖𝑗 by

𝑌𝑖𝑗 = 𝜇 + 𝜏𝑖𝐴 + 𝑒𝑖𝑗 , 𝑖 = 1, 2, · · · , 𝑎, 𝑗 = 1, 2, · · · , 𝑛, (6.2)

where μ and τ₁ᴬ, ⋯, τ_aᴬ are unknown parameters, satisfying

Σ_{i=1}^{a} τᵢᴬ = 0.    (6.3)

• The errors e_ij (for i = 1, 2, ⋯, a; j = 1, 2, ⋯, n) are independent random variables with zero mean and constant variance:

E[e_ij] = 0 and V[e_ij] = σ².    (6.4)

Put

Ȳᵢ = (1/n) Σ_{j=1}^{n} Y_ij,  i = 1, 2, ⋯, a.

The expected values of these means are

E[Ȳᵢ] = μ + τᵢᴬ,  i = 1, 2, ⋯, a.    (6.5)

Parameter 𝜏𝑖𝐴 is called the main effect of factor 𝐴 at level 𝐴𝑖 .


The mean of all N = a × n observations (the grand mean) is

Ȳ = (1/a) Σ_{i=1}^{a} Ȳᵢ.    (6.6)

Because Σ_{i=1}^{a} τᵢᴬ = 0, we obtain the expectation of the grand mean

E[Ȳ] = μ.    (6.7)

CRD design with experimental data can be shown in the following table:


Table 6.1: CRD design

Observation

Treatment 1 2 ··· 𝑛−1 𝑛 Sum Mean Y 𝑖.

1 𝑦11 𝑦12 ··· 𝑦1𝑛−1 𝑦1𝑛 𝑦1· y 1·

2 𝑦21 𝑦22 ··· 𝑦2𝑛−1 𝑦2𝑛 𝑦2· y 2·


..
.

𝑎 𝑦𝑎1 𝑦𝑎2 ··· 𝑦𝑎𝑛−1 𝑦𝑎𝑛 𝑦𝑎· y 𝑎·


∑︀
𝑦.. = 𝑖𝑗 𝑦𝑖𝑗 y .. = 𝑦

6.3.2 Specific problem in industrial manufacturing

Problem 1. CRD design for ‘green’ package production

A manufacturer of paper used for making grocery bags is interested in improving the product’s tensile
strength.

• Product engineers believe that tensile strength is a function of the hardwood concentration in the pulp and that the range of hardwood concentrations of practical interest is between 5% and 20%.

Table 6.2: CRD design - experimental data [14]

Observations

Hardwood concentration (%)   1   2   3   4   5   6   Totals   Means Ȳᵢ.

5 7 8 15 11 9 10 60 10.00

10 12 17 13 18 19 15 94 15.67

15 14 18 19 17 16 18 102 17.00

20 19 25 22 23 18 20 127 21.17
∑︀
𝑖𝑗 𝑦𝑖𝑗 = 383 𝑦 = 15.96

• A team of engineers responsible for the study decides to investigate four levels of hardwood concentration: 5%, 10%, 15%, and 20%.


• They decide to make up six test specimens at each concentration level by using a pilot plant. All 24
specimens are tested on a laboratory tensile tester in random order.

The data from this experiment is shown in the Table 6.2.

• This is an example of a completely randomized single-factor experiment (CRD) with a = 4 levels of the factor A. A has a = 4 levels or treatments, each treatment has n = 6 observations (or replicates), and the runs are performed in a completely random order.

• The role of randomization in this experiment is extremely important. By randomizing the order of
the 24 runs, the effect of any nuisance variable that may influence the observed tensile strength is
approximately balanced out.

• The mean (average) corresponding to the i-th treatment is

Ȳᵢ = (1/6) Σ_{j=1}^{6} Y_ij,  i = 1, 2, ⋯, 4;

these are the numbers in the last column of Table 6.2.

6.3.3 Use linear model for CRD

We assume a linear model for CRD.

• The linear statistical model for the strength variable of the (i, j)-th observation is

Y_ij = μ + τᵢ + e_ij,  i = 1, 2, ⋯, 4,  j = 1, 2, ⋯, 6,    (6.8)

where μ is called the overall mean, a parameter common to all treatments, and

e_ij is a normal random error component such that

E[e_ij] = 0 and V[e_ij] = σ²,  ∀ i = 1, 2, ⋯, 4; j = 1, 2, ⋯, 6

[identically and independently distributed with mean zero and variance σ²].

• Since E[e_ij] = 0, the expected values of these averages are

E[Ȳᵢ] = μ + τᵢ,  i = 1, 2, ⋯, 4,    (6.9)

where τᵢ is the main effect of hardwood factor A at level i.

6.3.4 Variance analysis for the response

We can use analysis of variance (ANOVA) to test the hypothesis that different hardwood concentrations do not affect the mean tensile strength of the bag paper. The hypotheses are

𝐻₀: τ₁ = τ₂ = τ₃ = τ₄ = 0,
𝐻₁: τᵢ ≠ 0 for at least one i ∈ {1, 2, 3, 4}.


With the above data we have the boxplot in Figure (a). If it is assumed that the 4 populations corresponding to the 4 treatments are all Gaussian and have the same variance σ², then the graph of the effect of treatments on bag strength Y is given in Figure (b).

The ANOVA partitions the total variability in the sample data into two component parts. The total variability in the data is described by the total sum of squares [of observed deviations from Ȳ]

SST = Σ_{i=1}^{4} Σ_{j=1}^{6} (y_ij − ȳ)² = SSTreatm + SSE = 512.96,    (6.10)

where

SSTreatm = n Σ_{i=1}^{4} (ȳᵢ − ȳ)² = 382.79,

SSE = SST − SSTreatm = 512.96 − 382.79 = 130.17

are respectively the sums of squares of treatments and of random errors.

are respectively the sum of squares of treatments and of random errors.

The ANOVA is summarized in Table 6.3. Because the F statistic from the data is

F₀ = MSB/MSW = 19.6 > f_{α; a−1, N−a} = f_{0.01; 3, 20} = 4.94,

we reject 𝐻₀ and conclude that, at the significance level α = 1%, the hardwood content significantly affects the mean bag strength.
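
The whole analysis can be checked with a one-way ANOVA in R; the sketch below (data typed in from Table 6.2) should reproduce the sums of squares and the F statistic up to rounding:

strength <- c(7, 8, 15, 11, 9, 10,      # 5% hardwood
              12, 17, 13, 18, 19, 15,   # 10%
              14, 18, 19, 17, 16, 18,   # 15%
              19, 25, 22, 23, 18, 20)   # 20%
conc <- factor(rep(c(5, 10, 15, 20), each = 6))
summary(aov(strength ~ conc))   # SS about 382.8 and 130.2, F about 19.6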

6.4 Theory of Block Designs (RCBD and BIBD)

6.4.1 Blocking and randomization

Blocking and randomization are devices in planning of experiments, which are aimed at increasing the
precision of the outcome and ensuring the validity of the inference. Blocking is used to reduce errors.


Table 6.3: ANOVA of the factor of hardwood

Source of D.F. S.S. M.S. 𝐹 statistic

Variation

Hardwood 𝑎−1=3 𝑆𝑆𝑇 𝑟𝑒𝑎𝑡𝑚 = 382.79 𝑀 𝑆𝐵 19.60


𝑆𝑆𝑇 𝑟𝑒𝑎𝑡𝑚
treatments 𝐴 = = 127.60
𝑎−1
Error 𝑁 − 𝑎 = 20 𝑆𝑆𝐸 = 130.17 𝑀 𝑆𝑊
𝑆𝑆𝐸
= = 6.51
𝑁 −𝑎
Total 𝑁 − 1 = 23 𝑆𝑆𝑇 = 512.96 -

A block is a portion of the experimental material that is expected to be more homogeneous than the
whole aggregate.

For example, if the experiment is designed to test the effect of polyester coating of electronic circuits
on their current output, the variability between circuits could be considerably bigger than the effect
of the coating on the current output. In order to reduce this component of variance, one can block
by circuit. Each circuit will be tested under two treatments: no-coating and coating. We first test the
current output of a circuit without coating. Later we coat the circuit, and test again.

• Such a comparison of before and after a treatment, of the same units, is called paired-comparison.
Other examples of blocks could be machines, shifts of production, days of the week, operators, etc.

• Generally, if there are 𝑡 treatments to compare, and 𝑏 blocks, and if all 𝑡 treatments can be performed
within a single block, we assign all the 𝑡 treatments to each block.


The order of applying the treatments within each block should be randomized.

Such a design is called a randomized complete block design (RCBD). We will see later how a
proper analysis of the yield can validly test for the effects of the treatments.

• If not all treatments can be applied within each block, it is desirable to assign treatments to blocks in
some balanced fashion. Such designs, to be discussed later, are called balanced incomplete block
designs (BIBD).

6.4.2 The analysis of randomized complete block designs (RCBD)

We consider the case of several blocks, 𝑡 treatments per block.


As said earlier, the randomized complete block designs (RCBD) are those in which each block con-
tains all the 𝑡 treatments. The treatments are assigned to the experimental units in each block at
random.
Let 𝑏 denote the number of blocks. The linear model for these designs is

𝑌𝑖𝑗 = 𝜇 + 𝜏𝑖 + 𝛽𝑗 + 𝑒𝑖𝑗 , 𝑖 = 1, 2, · · · , 𝑡, 𝑗 = 1, 2, · · · , 𝑏, (6.11)

where Y_ij is the yield of the i-th treatment in the j-th block.

• The error terms e_ij are independent random variables satisfying

E[e_ij] = 0 and V[e_ij] = σ²,    (6.12)

for all i = 1, 2, ⋯, t, j = 1, 2, ⋯, b.

• The main effect of the 𝑖-th treatment is 𝜏𝑖 , and the main effect of the 𝑗-th block is 𝛽𝑗 .

• It is assumed that the effects are additive (no interaction). Under this assumption, each treatment is
tried only once in each block. The different blocks serve the role of replicas.

However, since the blocks may have additive effects, 𝛽𝑗 , we have to adjust for the effects of blocks in
estimating 𝜎 2 . This is done as shown in the ANOVA table below.

Here,

SST = Σ_{i=1}^{t} Σ_{j=1}^{b} (Y_ij − Ȳ)²,    (6.13)

SSTR = b Σ_{i=1}^{t} (Ȳᵢ. − Ȳ)²  [sum of squared errors by treatments],    (6.14)

SSBL = t Σ_{j=1}^{b} (Ȳ.ⱼ − Ȳ)²  [sum of squared errors by blocks],    (6.15)

and

SSE = SST − SSTR − SSBL  [sum of squared errors by randomness].    (6.16)


Table 6.4: ANOVA table for RCBD

Source of Variation   DF               SS     MS     E(MS)
Treatments            t − 1            SSTR   MSTR   σ² + (b/(t−1)) Σ_{i=1}^{t} τᵢ²
Blocks                b − 1            SSBL   MSBL   σ² + (t/(b−1)) Σ_{j=1}^{b} βⱼ²
Errors                (t − 1)(b − 1)   SSE    MSE    σ²
Total                 tb − 1           SST    -

Besides,

Ȳᵢ. = (1/b) Σ_{j=1}^{b} Y_ij,   Ȳ.ⱼ = (1/t) Σ_{i=1}^{t} Y_ij,    (6.17)

and Ȳ is the grand mean of all outcomes.

6.4.3 Concepts of Balanced Incomplete Block Designs (BIBD)

As mentioned before, it is often the case that the blocks are not sufficiently large to accommodate all
the t treatments. For example, in testing the wear-out of fabric one uses a special machine (Martindale wear tester) which can accommodate only four pieces of cloth simultaneously. Here the block size is fixed at k = 4, while the number of treatments t is the number of types of cloth to be compared.
Balanced Incomplete Block Designs (BIBD) are designs which assign t treatments to b blocks of size k (k < t) in the following manner.

1. Each treatment is assigned only once to any one block.

2. Each treatment appears in 𝑟 blocks. 𝑟 is the number of replicas.

3. Every pair of two different treatments appears in 𝜆 blocks.

4. The order of treatments within each block is randomized.

5. The order of blocks is randomized.

According to these requirements there are, altogether, 𝑁 = 𝑡 𝑟 = 𝑏 𝑘 trials. Moreover, the following
equality should hold
𝜆(𝑡 − 1) = 𝑟(𝑘 − 1). (6.18)

The question is how to design a BIBD, for a given 𝑡 and 𝑘.


• One can obtain a BIBD by the complete combinatorial listing of the C(t, k) selections without replacement of k out of t letters. In this case, the number of blocks is

b = C(t, k).    (6.19)

• The number of replicas is r = C(t−1, k−1), and λ = C(t−2, k−2).

• The total number of trials is

N = t·r = t·C(t−1, k−1) = t!/[(k−1)!(t−k)!] = k·C(t, k) = k·b.    (6.20)

• Such designs of BIBD are called combinatoric designs. A BIB design is said to be symmetric if
𝑡 = 𝑏 and consequently 𝑟 = 𝑘; otherwise, asymmetric.

DISCUSSION.

Combinatoric designs might generally be too big. For example, if t = 8 and k = 4 we are required to have b = C(8, 4) = 70 blocks. Thus, the total number of trials is N = 70 × 4 = 280, with r = C(7, 3) = 35 and λ = C(6, 2) = 15.

There are advanced algebraic methods which can yield smaller designs for t = 8 and k = 4. But it is not always possible to have a BIBD smaller in size than a complete combinatoric design. Such a case is t = 8 and k = 5: here the smallest number of blocks possible is b = C(8, 5) = 56, and N = kb = 5 × 56 = 280.
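
The counting rules of Equations (6.19)-(6.20) are easy to verify in R with choose(); a small sketch for the t = 8, k = 4 case:

t <- 8; k <- 4
b <- choose(t, k)                # 70 blocks
r <- choose(t - 1, k - 1)        # 35 replicas
lambda <- choose(t - 2, k - 2)   # 15
c(b = b, r = r, lambda = lambda, N = k * b)   # N = 280 = t * r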

6.4.4 The ANOVA for a BIBD

Let 𝐵𝑖 denote the set of treatments in the i-th block. For example, if block 1 contains the treatments 1,
2, 3, 4, then 𝐵1 = {1, 2, 3, 4}. Let 𝑌𝑖𝑗 be the yield of treatment 𝑗 ∈ 𝐵𝑖 . The effect model is

𝑌𝑖𝑗 = 𝜇 + 𝛽𝑖 + 𝜏𝑗 + 𝑒𝑖𝑗 , 𝑖 = 1, · · · , 𝑏, 𝑗 ∈ 𝐵𝑖 (6.21)

{𝑒𝑖𝑗 } are random experimental errors, with E(𝑒𝑖𝑗 ) = 0 and V(𝑒𝑖𝑗 ) = 𝜎 2 for all (𝑖, 𝑗).

• The block effects β₁, ⋯, β_b and treatment effects τ₁, ⋯, τ_t satisfy the constraints

Σ_{j=1}^{t} τⱼ = 0  and  Σ_{i=1}^{b} βᵢ = 0.

• Let T_j be the set of all indices of blocks containing the j-th treatment. The least squares estimates of the treatment effects are obtained in the following manner.

Let W_j = Σ_{i∈T_j} Y_ij be the sum of all Y values under the j-th treatment. Let

W_j* = Σ_{i∈T_j} Σ_{l∈B_i} Y_il


Table 6.5: ANOVA for a BIBD

Source of Variation    DF              SS     MS     E[MS]
Blocks                 b − 1           SSBL   MSBL   σ² + (t/(b−1)) Σ_{i=1}^{b} βᵢ²
Treatments adjusted    t − 1           SSTR   MSTR   σ² + (b/(t−1)) Σ_{j=1}^{t} τⱼ²
Error                  N − t − b + 1   SSE    MSE    σ²
Total                  N − 1           SST    -      -

be the sum of the values in all the r blocks which contain the j-th treatment. Compute

Q_j = kW_j − W_j*,  j = 1, ⋯, t.    (6.22)

The LSE of τ_j is

τ̂ⱼ = Q_j/(tλ),  j = 1, ⋯, t.    (6.23)

Notice that Σ_{j=1}^{t} Q_j = 0, thus Σ_{j=1}^{t} τ̂ⱼ = 0. Let Ȳ = (1/N) Σ_{i=1}^{b} Σ_{l∈B_i} Y_il.

The adjusted treatment average is defined as

Ȳ*ⱼ = Ȳ + τ̂ⱼ,  j = 1, ⋯, t.

The ANOVA for a BIBD is given in Table 6.5. Here,

SST = Σ_{i=1}^{b} Σ_{l∈B_i} Y_il² − (Σ_{i=1}^{b} Σ_{l∈B_i} Y_il)² / N,

SSBL = (1/k) Σ_{i=1}^{b} (Σ_{l∈B_i} Y_il)² − N·Ȳ²,    (6.24)

SSTR = (1/(λkt)) Σ_{j=1}^{t} Q_j²,  and  SSE = SST − SSBL − SSTR.

The significance of the treatment effects is tested by the statistic F = MSTR/MSE.
Example 6.1.

Six different adhesives (t = 6) are tested for bond strength in a lamination process, under a curing pressure of 200 [psi]. Lamination can be done in blocks of size k = 4. A combinatoric design (listed in Table 6.6) has b = C(6, 4) = 15 blocks, with

r = C(5, 3) = 10,  λ = C(4, 2) = 6,


Table 6.6: Block sets

𝑖 𝐵𝑖 𝑖 𝐵𝑖

1 1, 2, 3, 4 9 1, 3, 5, 6

2 1, 2, 3, 5 10 1, 4, 5, 6

3 1, 2, 3, 6 11 2, 3, 4, 5

4 1, 2, 4, 5 12 2, 3, 4, 6

5 1, 2, 4, 6 13 2, 3, 5, 6

6 1, 2, 5, 6 14 2, 4, 5, 6

7 1, 3, 4, 5 15 3, 4, 5, 6

8 1, 3, 4, 6

and N = kb = 60. The treatment indices of the 15 blocks are listed in Table 6.6.

The observed bond strengths in these trials are listed in Table 6.7:

Table 6.7: Values of 𝑌𝑖𝑙 , 𝑖 ∈ 𝐵𝑖

𝑖 𝑌𝑖𝑙 𝑖 𝑌𝑖𝑙

1 24.7, 20.8, 29.4, 24.9 8 23.1, 29.3, 27.1, 34.4

2 24.1, 20.4, 29.8, 30.3 9 22.0, 29.8, 31.9, 36.1

3 23.4, 20.6, 29.2, 34.4 10 22.8, 22.6, 33.2, 34.8

4 23.2, 20.7, 26.0, 30.8 11 21.4, 29.6, 24.8, 31.2

5 21.5, 22.1, 25.3, 35.4 12 21.3, 28.9, 25.3, 35.1

6 21.4, 20.1, 30.1, 34.1 13 21.6, 29.5, 30.4, 33.6

7 23.2, 28.7, 24.9, 31.0 14 20.1, 25.1, 32.9, 33.9

15 30.1, 24.0, 30.8, 36.5

The grand mean of the bond strength is 𝑌 = 27.389.


The sets 𝑇𝑗 and the sums 𝑊𝑗 , 𝑊𝑗* are in Table 6.8. Table 6.9 shows the ANOVA table.
The adjusted mean effects of the adhesives are in Table 6.10.


Table 6.8: The set 𝑇𝑗 and the statistics 𝑊𝑗 , 𝑊𝑗*

𝑗 𝑇𝑗 𝑊𝑗 𝑊𝑗* 𝑄𝑗

1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 229.536 1077.7 −159.56

2 1, 2, 3, 4, 5, 6, 11, 12, 13, 14 209.023 1067.4 −231.31

3 1, 2, 3, 7, 8, 9, 11, 12, 13, 15 294.125 1107.60 68.90

4 1, 4, 5, 7, 8, 10, 11, 12, 14, 15 249.999 1090.90 −90.90

5 2, 4, 6, 7, 9, 10, 11, 13, 14, 15 312.492 1107.50 142.47

6 3, 5, 6, 8, 9, 10, 12, 13, 14, 15 348.176 1123.80 268.90

Table 6.9: ANOVA for BIBD

Source of Variation DF SS MS F

Blocks 14 161.78 11.556 23.99

Treat. adjusted 5 1282.76 256.552 532.54

Error 40 19.27 0.48175 –

Total 59 1463.81

Table 6.10: Mean Effects and their S.E.

Treatment   Ȳ*ⱼ   S.E.(Ȳ*ⱼ)

1 22.96 1.7445

2 20.96 1.7445

3 29.33 1.7445

4 24.86 1.7445

5 31.35 1.7445

6 34.86 1.7445

The variance of each adjusted mean effect is

V[Ȳ*ⱼ] = kσ²/(tλ),  j = 1, ⋯, t.    (6.25)

Thus, the S.E. of Ȳ*ⱼ is

S.E.(Ȳ*ⱼ) = [k·MSE/(tλ)]^{1/2},  j = 1, ⋯, t.    (6.26)

It seems that there are two homogeneous groups of treatments {1, 2, 4} and {3, 5, 6}. 

For the remainder of Part C, to get insights of causal data analysis, we will study factorial designs
and linear regression models.

6.5 Factorial Designs

Factorial designs (a specific kind of experimental design) are a very useful solution for industrial manufacturing problems.
Regression analysis captures relationships between random variables, determines the magnitude of the relationships between variables, and is used to make predictions based on the models.
Linear regression describes the linear relationship between a response variable and a set of other variables, called regressors, explanatory or predictor variables, from which we wish to predict the values of the response variable of interest.

When several factors are of interest in an experiment, a factorial experiment should be used. In these
experiments, factors are varied together.

• By factorial design or experiment, we mean that in each complete trial or replicate of the exper-
iment, all possible combinations of the levels of the factors are investigated. Thus, if there are two
factors 𝐴 and 𝐵 with 𝑎 levels of factor 𝐴 and 𝑏 levels of factor 𝐵, each replicate 𝐷 = 𝐴 × 𝐵 contains
all 𝑎𝑏 treatment combinations. A full factorial design with 3 binary factors is visualized in Figure 6.3.

• The effect of a factor is defined as the change in response produced by a change in the level of the
factor. It is called a main effect because it refers to the primary factors in the study.

• A fractional factorial design 𝐹 is a subset of a (full) factorial 𝐷, in which repeated runs are allowed.

6.5.1 Statistical model of a factorial experiment

In full factorial experiments, the number of levels of different factors do not have to be the same. Some
factors might be tested at two levels and others at three or four levels. Full factorial, or certain fractional
factorials which will be discussed later, are necessary, if the statistical model is not additive.

• In order to estimate or test the effects of interactions, one needs to perform factorial experiments,
full or fractional. In a full factorial experiment, all the main effects and interactions can be tested or
estimated.


Figure 6.3: A factorial design with 3 binary factors (the 2³ design)

• Recall that if there are p factors A, B, C, ⋯, there are

p types of main effects,

C(p, 2) types of pairwise interactions AB, AC, BC, ⋯,

C(p, 3) interactions between three factors, ABC, ABD, ⋯, and so on.

On the whole there are, together with the grand mean μ, 2^p parameters.

Statistical model of designs with two factors

Suppose that there are two factors, 𝐴 and 𝐵, at 𝑎 and 𝑏 levels respectively, there are 𝑎 𝑏 treatment
combinations (𝐴𝑖 , 𝐵𝑗 ), 𝑖 = 1, 2, · · · , 𝑎, 𝑗 = 1, · · · , 𝑏.
Suppose also that 𝑛 independent replicas are made at each one of the treatment combinations.
The yield at the 𝑘-th replication of treatment combination (𝐴𝑖 , 𝐵𝑗 ) is given by

Y_ijk = μ + τᵢᴬ + τⱼᴮ + τᵢⱼᴬᴮ + e_ijk,    (6.27)

where μ is the overall mean effect,

τᵢᴬ is the effect of the i-th level of factor A, τⱼᴮ is the effect of the j-th level of factor B, and
τᵢⱼᴬᴮ is the effect of the interaction between Aᵢ and Bⱼ.
The error terms 𝑒𝑖𝑗𝑘 are independent random variables satisfying

E[𝑒𝑖𝑗𝑘 ] = 0 and V[𝑒𝑖𝑗𝑘 ] = 𝜎 2 , (6.28)

for all 𝑖 = 1, 2, · · · , 𝑎, 𝑗 = 1, 2, · · · , 𝑏, 𝑘 = 1, 2, · · · , 𝑛.
In the following section we discuss the structure of the Analysis of Variance (ANOVA) for testing the
significance of main effects and interactions.


6.5.2 The ANOVA for Full Factorial Designs

The analysis of variance for full factorial designs is done to test the hypotheses that main-effects or
interaction parameters are equal to zero. We present the ANOVA for a two factor situation, 𝐴 and 𝐵
with a statistical model given in Equation (6.27). The method can be generalized to any number of
factors. Let

Ȳᵢⱼ = (1/n) Σ_{k=1}^{n} Y_ijk,    (6.29)

Ȳᵢ. = (1/b) Σ_{j=1}^{b} Ȳᵢⱼ,  i = 1, ⋯, a,    (6.30)

Ȳ.ⱼ = (1/a) Σ_{i=1}^{a} Ȳᵢⱼ,  j = 1, ⋯, b,    (6.31)

and

Ȳ = (1/(ab)) Σ_{i=1}^{a} Σ_{j=1}^{b} Ȳᵢⱼ.    (6.32)

The ANOVA analysis includes 3 steps:

1. The ANOVA first partitions the total sum of squares of deviations from Ȳ,

SST = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Y_ijk − Ȳ)²  [total sum of squared errors],    (6.33)

into two components,

SSW = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Y_ijk − Ȳᵢⱼ)²  [sum of squared errors within the whole design]    (6.34)

and

SSB = n Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳᵢⱼ − Ȳ)²  [sum of squared errors among factor levels].    (6.35)

It is straightforward to show that

SST = SSW + SSB.    (6.36)

2. In the second stage, the sum of squares of deviations SSB is partitioned into three components SSI, SSMA, SSMB, as SSB = SSI + SSMA + SSMB, where

SSI = n Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳᵢⱼ − Ȳᵢ. − Ȳ.ⱼ + Ȳ)²  [errors caused by interaction effects],

SSMA = nb Σ_{i=1}^{a} (Ȳᵢ. − Ȳ)²  [sum of squared errors caused by the effect of factor A],    (6.37)

SSMB = na Σ_{j=1}^{b} (Ȳ.ⱼ − Ȳ)²  [sum of squared errors caused by the effect of factor B].


All these terms are collected in an ANOVA table (Table 6.11). Thus,

MSA = SSMA/(a − 1),  MSB = SSMB/(b − 1),    (6.38)

and

MSAB = SSI/[(a − 1)(b − 1)],  MSW = SSW/[ab(n − 1)].    (6.39)

Table 6.11: Table of ANOVA for a 2-factor factorial experiment

Source of variation DF SS MS F

𝐴 𝑎−1 𝑆𝑆𝑀 𝐴 𝑀 𝑆𝐴 𝐹𝐴

𝐵 𝑏−1 𝑆𝑆𝑀 𝐵 𝑀 𝑆𝐵 𝐹𝐵

𝐴𝐵 (𝑎 − 1)(𝑏 − 1) 𝑆𝑆𝐼 𝑀 𝑆𝐴𝐵 𝐹𝐴𝐵

Between 𝑎𝑏 − 1 𝑆𝑆𝐵 - -

Within 𝑎𝑏(𝑛 − 1) 𝑆𝑆𝑊 𝑀 𝑆𝑊 -

Total 𝑁 −1 𝑆𝑆𝑇 - -

3. Finally, we compute the F-statistics

F_A = MSA/MSW,    (6.40)

F_B = MSB/MSW,    (6.41)

and

F_AB = MSAB/MSW.    (6.42)

𝐹𝐴 , 𝐹𝐵 and 𝐹𝐴𝐵 are test statistics to test, respectively, the significance of the main effects of 𝐴, the
main effects of 𝐵 and the interactions 𝐴𝐵 on the response. Few cases to consider:

1. If F_A < F_{1−α}[a − 1, ab(n − 1)], the null hypothesis

H₀ᴬ: τ₁ᴬ = τ₂ᴬ = ⋯ = τ_aᴬ = 0

cannot be rejected.


2. If F_B < F_{1−α}[b − 1, ab(n − 1)], the null hypothesis

H₀ᴮ: τ₁ᴮ = τ₂ᴮ = ⋯ = τ_bᴮ = 0

cannot be rejected.

3. Also, if F_AB < F_{1−α}[(a − 1)(b − 1), ab(n − 1)], we cannot reject the null hypothesis

H₀ᴬᴮ: τ₁₁ᴬᴮ = ⋯ = τ_abᴬᴮ = 0.

See a concrete analysis in Section 6.5.4.

6.5.3 The Full Binary Factorial Designs

As an essential illustration of full factorial designs, we present 2^m factorial designs, the simplest full factorials of m factors, each at two levels. The levels of the factors are labeled as "Low" and "High", or 1 and 2. If the factors are categorical, then the labeling of the levels is arbitrary and the ordering of the values of the main effects and interaction parameters depends on this arbitrary labeling.
We will discuss here experiments in which the levels of the factors are measured on a continuous
scale, like in the case of the factors effecting the piston cycle time.

• The levels of the i-th factor (i = 1, ⋯, m) are fixed at x_{i1} and x_{i2}, where x_{i1} < x_{i2}. By a simple transformation all factor levels can be reduced to

cᵢ = +1 if x = x_{i2},  cᵢ = −1 if x = x_{i1},  i = 1, ⋯, m.

• In such a factorial experiment there are 2^m treatment combinations (or treatments). Let (i₁, ⋯, i_m) denote a treatment combination, where i₁, ⋯, i_m are indices such that

i_j = 0 if c_j = −1,  i_j = 1 if c_j = 1.

Thus, if there are m = 3 factors, the number of possible treatment combinations is 2³ = 8. These are given in Table 6.12. The index ν of the standard order is given by the formula

ν = Σ_{j=1}^{m} i_j 2^{j−1}.    (6.43)

Notice that 𝜈 ranges from 0 to 2𝑚 − 1. This produces tables of the treatment combinations for a 2𝑚
factorial design, arranged in a standard order (see Table 6.13).
In R we obtain a fraction of a full factorial design with:


Table 6.12: Treatment combinations of a 23 experiment

𝜈 𝑖1 𝑖2 𝑖3

0 0 0 0

1 1 0 0

2 0 1 0

3 1 1 0

4 0 0 1

5 1 0 1

6 0 1 1

7 1 1 1

> install.packages("FrF2")
> library(FrF2)
> FrF2(nfactors=5, resolution=5)
A B C D E
1 -1 1 -1 1 1
2 1 1 1 -1 -1
3 1 -1 -1 1 1 ...
class=design, type= FrF2

which is a half-fractional replication 25−1 of a 25 design.


In Table 6.13 we present the design of a 2⁵ full factorial experiment derived with R:

> Design <- fac.design(nlevels=2, nfactors=5)


creating full factorial with 32 runs ...

> head(Design, 2)
A B C D E
1 2 2 2 1 2
2 1 2 1 1 1

> tail(Design, 2)
   A B C D E
31 1 2 2 2 2
32 2 1 2 1 1


Table 6.13: The labels in standard order for a 25 factorial design

𝜈 𝐼1 𝐼2 𝐼3 𝐼4 𝐼5 𝜈 𝐼1 𝐼2 𝐼3 𝐼4 𝐼5

0 1 1 1 1 1 16 1 1 1 1 2

1 2 1 1 1 1 17 2 1 1 1 2

2 1 2 1 1 1 18 1 2 1 1 2

3 2 2 1 1 1 19 2 2 1 1 2

4 1 1 2 1 1 20 1 1 2 1 2

5 2 1 2 1 1 21 2 1 2 1 2

6 1 2 2 1 1 22 1 2 2 1 2

7 2 2 2 1 1 23 2 2 2 1 2

8 1 1 1 2 1 24 1 1 1 2 2

9 2 1 1 2 1 25 2 1 1 2 2

10 1 2 1 2 1 26 1 2 1 2 2

11 2 2 1 2 1 27 2 2 1 2 2

12 1 1 2 2 1 28 1 1 2 2 2

13 2 1 2 2 1 29 2 1 2 2 2

14 1 2 2 2 1 30 1 2 2 2 2

15 2 2 2 2 1 31 2 2 2 2 2

> rm(Design)

Let Y_ν, ν = 0, 1, ⋯, 2^m − 1, denote the yield of the ν-th treatment combination. We now discuss the estimation of the main effects and interaction parameters. Starting with the simple case of 2 factors, the variables are presented schematically in Table 6.14.
According to our previous definition there are four main effects τ₁ᴬ, τ₂ᴬ, τ₁ᴮ, τ₂ᴮ and four interaction effects τ₁₁ᴬᴮ, τ₁₂ᴬᴮ, τ₂₁ᴬᴮ, τ₂₂ᴬᴮ. But since τ₁ᴬ + τ₂ᴬ = τ₁ᴮ + τ₂ᴮ = 0, it is sufficient to represent the main effects of A and B by τ₂ᴬ and τ₂ᴮ. Similarly, since

τ₁₁ᴬᴮ + τ₁₂ᴬᴮ = 0 = τ₁₁ᴬᴮ + τ₂₁ᴬᴮ,

DATA ANALYTICS- FOUNDATION


CHAPTER 6. EXPERIMENTAL DESIGNS IN ENGINEERING
202 CAUSAL ANALYSIS WITH ORDINAL VARIABLES

Table 6.14: Treatment means in a 2² design

                 Factor A
Factor B         1       2       Row means
1                Ȳ₀      Ȳ₁      Ȳ₁.
2                Ȳ₂      Ȳ₃      Ȳ₂.
Column means     Ȳ.₁     Ȳ.₂     Ȳ

τ₂₁ᴬᴮ + τ₂₂ᴬᴮ = 0 = τ₁₂ᴬᴮ + τ₂₂ᴬᴮ,

it is sufficient to represent the interaction effects by τ₂₂ᴬᴮ. The main effect τ₂ᴬ is estimated by

τ̂₂ᴬ = Ȳ.₂ − Ȳ = (1/2)(Ȳ₁ + Ȳ₃) − (1/4)(Ȳ₀ + Ȳ₁ + Ȳ₂ + Ȳ₃)
     = (1/4)(−Ȳ₀ + Ȳ₁ − Ȳ₂ + Ȳ₃).

The estimator of τ₂ᴮ is

τ̂₂ᴮ = Ȳ₂. − Ȳ = (1/2)(Ȳ₂ + Ȳ₃) − (1/4)(Ȳ₀ + Ȳ₁ + Ȳ₂ + Ȳ₃)
     = (1/4)(−Ȳ₀ − Ȳ₁ + Ȳ₂ + Ȳ₃).

Finally, the estimator of τ₂₂ᴬᴮ is

τ̂₂₂ᴬᴮ = Ȳ₃ − Ȳ₂. − Ȳ.₂ + Ȳ
      = Ȳ₃ − (1/2)(Ȳ₂ + Ȳ₃) − (1/2)(Ȳ₁ + Ȳ₃) + (1/4)(Ȳ₀ + Ȳ₁ + Ȳ₂ + Ȳ₃)
      = (1/4)(Ȳ₀ − Ȳ₁ − Ȳ₂ + Ȳ₃).

The parameter 𝜇 is estimated by the grand mean 𝑌 = 14 (Y 0 + Y 1 + Y 2 + Y 3 ). All these estimators


of $[\mu, \tau_2^A, \tau_2^B, \tau_{22}^{AB}]' =: \theta$ can be presented in matrix form as

$$\hat\theta = \begin{bmatrix} \hat\mu \\ \hat\tau_2^A \\ \hat\tau_2^B \\ \hat\tau_{22}^{AB} \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}\begin{bmatrix} \bar Y_0 \\ \bar Y_1 \\ \bar Y_2 \\ \bar Y_3 \end{bmatrix}.$$

The indices in a $2^2$ design are given in the following matrix

$$D_{2^2} = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 1 & 2 \\ 2 & 2 \end{bmatrix}.$$

The corresponding $C$ coefficients are the 2nd and 3rd columns in the matrix

$$C_{2^2} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}.$$

The 4th column of this matrix is the product of the elements in the 2nd and 3rd columns. Notice also that the linear model for the yield vector is

$$\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} \mu \\ \tau_2^A \\ \tau_2^B \\ \tau_{22}^{AB} \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}$$

where 𝑒𝑖 are independent random variables, with E[𝑒𝑖 ] = 0 and V[𝑒𝑖 ] = 𝜎 2 for 𝑖 = 1, 2, 3, 4.
Let $Y^{(4)} = [Y_0, Y_1, Y_2, Y_3]'$, $\theta^{(4)} = [\mu, \tau_2^A, \tau_2^B, \tau_{22}^{AB}]'$, and $e^{(4)} = [e_1, e_2, e_3, e_4]'$;

then the model is

$$Y^{(4)} = C_{2^2}\,\theta^{(4)} + e^{(4)}.$$

This is the usual linear model for multiple regression. The least squares estimator of $\theta^{(4)}$ is

$$\hat\theta^{(4)} = [C_{2^2}' C_{2^2}]^{-1} C_{2^2}' Y^{(4)}.$$

The matrix $C_{2^2}$ has orthogonal column (row) vectors and $C_{2^2}' C_{2^2} = 4 I_4$, where $I_4$ is the identity matrix of rank 4. Therefore

$$\hat\theta^{(4)} = \frac{1}{4}(C_{2^2})' Y^{(4)} = \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}\begin{bmatrix} \bar Y_0 \\ \bar Y_1 \\ \bar Y_2 \\ \bar Y_3 \end{bmatrix}.$$

This is identical with the solution obtained earlier. The estimators of the main effects and interactions
are the least squares estimators, as has been mentioned before.
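As a numeric check, $\hat\theta = \frac{1}{4}C'\bar Y$ can be computed directly; the following is a minimal sketch in which the treatment means are assumed example numbers, not data from the text:

# Least squares estimates in a 2^2 design via the orthogonal matrix C.
C22  <- matrix(c(1, -1, -1,  1,
                 1,  1, -1, -1,
                 1, -1,  1, -1,
                 1,  1,  1,  1), nrow = 4, byrow = TRUE)
Ybar <- c(10, 14, 11, 17)          # assumed treatment means Y0, ..., Y3
theta.hat <- t(C22) %*% Ybar / 4   # same as solve(t(C22) %*% C22) %*% t(C22) %*% Ybar
rownames(theta.hat) <- c("mu", "tau2A", "tau2B", "tau22AB")
theta.hat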

6.5.4 Factorial Designs in Engineering

Definition 6.1. Formally, we fix $d$ finite sets $Q_1, Q_2, \ldots, Q_d$ called factors, where $1 < d \in \mathbb{N}$. The elements of a factor are called its levels. The (full) factorial design (also factorial experiment design, FED) with respect to these factors is the Cartesian product $D = Q_1 \times Q_2 \times \ldots \times Q_d$.
A fractional design or fraction $F$ of $D$ is a subset consisting of elements of $D$ (possibly with multiplicities). Put $r_i := |Q_i|$, the number of levels of the $i$th factor. We say that $F$ is symmetric if $r_1 = r_2 = \cdots = r_d$; otherwise $F$ is mixed.

A specific example using factorial designs in engineering

We wrap up the developed theory by illustrating the useful role of factorial designs in the following case
study (modified from [?, Example 11.7]).
Seven prediction factors for the piston cycle time, in an industrial design problem, were listed. These
are

A. Piston weight, 30 - 60 [kg].

B. Piston surface area, 0.005 - 0.020 [𝑚2 ].

C. Initial gas volume, 0.002 - 0.010 [𝑚3 ].

D. Spring coefficient, 1000 - 5000 [N/m].

E. Atmospheric pressure, 90000 - 100000 [N/𝑚2 ].

F. Ambient temperature, 290 - 296 [°K].

G. Filling gas temperature, 340 - 360 [°K].

We are interested to test the effects of the piston weight (A) and the spring coefficient (D) on the
cycle time (seconds). For this purpose we designed a factorial experiment at three levels of A, and
three levels of D. The levels are 𝐴1 = 30[𝑘𝑔], 𝐴2 = 45[𝑘𝑔], and 𝐴3 = 60[𝑘𝑔]; and

𝐷1 = 1500[𝑁/𝑚], 𝐷2 = 3000[𝑁/𝑚] and 𝐷3 = 4500[𝑁/𝑚].


Five replicas were performed at each treatment combination (𝑛 = 5). The five factors which were not
under study were kept at the levels

$B = 0.01[m^2]$, $C = 0.005[m^3]$, $E = 95000[N/m^2]$, $F = 293[°K]$, and $G = 350[°K]$.

The data can be obtained by using R, as follows.

> install.packages("DoE.base")
> library(DoE.base)
> library(mistat)  # assumption: pistonSimulation() below comes from the mistat package

> Factors <- list( m=c(30, 45, 60), k=c(1500, 3000, 4500))
> FacDesign <- fac.design(
factor.names=Factors,
randomize=TRUE, replications=5, repeat.only=TRUE)
creating full factorial with 9 runs ...

> Levels <- data.frame(


lapply( lapply(FacDesign, as.character), as.numeric),
s=0.01, v0=0.005, p0=95000, t=293, t0=350)
> Ps <- pistonSimulation(m=Levels$m,
s=Levels$s, v0=Levels$v0,
k=Levels$k, p0=Levels$p0,
t=Levels$t, t0=Levels$t0, each=1, seed=123)
> FacDesign <- add.response( design=FacDesign, response=Ps$seconds)
> summary( aov(Ps.seconds ~ m*k, data=FacDesign) )
Df Sum Sq Mean Sq F value Pr(>F)
m 2 0.1064 0.05320 1.718 0.1939
k 2 0.3018 0.15089 4.872 0.0134 *
m:k 4 0.3674 0.09184 2.966 0.0324 *
Residuals 36 1.1149 0.03097
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We start with the two-way ANOVA presented in Table 6.15.


Figure 6.4 shows the effect of the factor spring coefficient on cycle time. Spring coefficient at 4500 [N/m] reduces the mean cycle time and its variability. Figure 6.5 is an interaction plot showing the effect of piston weight on the mean cycle time, at each level of the spring coefficient. To draw Figures 6.4 and 6.5, we can use the R code:


Table 6.15: Two-way ANOVA for cycle time

Source DF SS MS 𝐹 𝑝

𝑆𝑝𝑟𝐶𝑜𝑓 2 1.01506 0.50753 8.66 0.001

𝑃 𝑖𝑠𝑡𝑊 𝑔 2 0.09440 0.04720 0.81 0.455

𝑆𝑝𝑟𝐶𝑜𝑓 * 𝑃 𝑖𝑠𝑡𝑊 𝑔 4 0.06646 0.01662 0.28 0.887

Error 36 2.11027 0.05862

Total 44 3.28619 - -

> summary(aov(Ps.seconds ~ m*k, data=FacDesign))


> boxplot(seconds ~ k, data=Ps)
> with(FacDesign,
interaction.plot(x.factor=m,
trace.factor=k, trace.label = "Spring coefficient",
response=Ps.seconds, type="b", pch=15:18,
ylab ="Mean of cycle time", xlab = "Piston weight (m)",xpd = NA ))
> rm(Ps); rm(FacDesign); rm(Levels, Factors)

Figure 6.4: Effect of spring coefficient on cycle time. (The 𝑦 axis corresponds to cycle time in minutes).


Figure 6.5: Interaction plot of piston weight with spring coefficient.

• The P-values are computed with the appropriate Fisher $F$ distribution. We see in the ANOVA table that only the main effects of the spring coefficient (D) are significant. Since the effects of the piston weight (A) and that of the interaction are not significant, we can estimate $\sigma^2$ by a pooled estimator

$$\hat\sigma^2 = \frac{SS_W + SS_I + SS_{MA}}{36 + 4 + 2} = \frac{2.2711}{42} = 0.0541.$$

• To estimate the main effects of D we pool all data from samples having the same level of D together.
We obtain pooled samples of size 𝑛𝑝 = 15. The means of the cycle time for these samples are

                  D1       D2        D3        Grand
$\bar Y$          0.743    0.509     0.380     0.544
Main effects      0.199    −0.035    −0.164    -

The standard error of these main effects is $\mathrm{S.E.}\{\hat\tau_j^D\} = 0.0425$.

• Since we estimate on the basis of the pooled samples, and the main effects $\hat\tau_j^D$ ($j = 1, 2, 3$) are contrasts of 3 means, the coefficient $S_\alpha$ for the simultaneous confidence intervals is

$$S_\alpha = (2 F_{.95}[2, 42])^{1/2} = (2 \times 3.22)^{1/2} = 2.538.$$
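Both the coefficient and the resulting interval for $\tau_1^D$ can be reproduced in R; a quick sketch:

# Coefficient for the simultaneous confidence intervals, from the F quantile.
S.alpha <- sqrt(2 * qf(0.95, df1 = 2, df2 = 42))   # = 2.538
# interval for tau1D, using the main effect 0.199 and S.E. 0.0425 above
0.199 + c(-1, 1) * S.alpha * 0.0425                # (0.0911, 0.3069)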


• The simultaneous confidence intervals for $\tau_j^D$, at $\alpha = 0.05$, are

              Lower limit    Upper limit
$\tau_1^D$    0.0911         0.3069
$\tau_2^D$    −0.1429        0.0729
$\tau_3^D$    −0.2619        −0.0561

• We see that the confidence interval for $\tau_2^D$ covers zero. Thus $\tau_2^D$ is not significant. The significant main effects are $\tau_1^D$ and $\tau_3^D$.

• Hence, we just studied the effects of two factors on the cycle time of a piston, keeping all the other five factors fixed. In the present analysis we perform a $2^5$ experiment with the piston varying factors A, B, C, D and F at two levels, keeping the atmospheric pressure (factor E) fixed at 90000 [N/m²] and the filling gas temperature (factor G) at 340 [°K].

• The two levels of each factor are those specified, in the above part, as the limits of the experimental range.

6.6 Fractional Factorial Designs

6.6.1 Balanced factorial designs with more than two factors

Fractional factorial designs, especially the balanced factorial designs (also called orthogonal array)
are employed in various industries and engineering, for instance, see [13] for testing software, see [49]
for military science, and [50] for cDNA microarray experiments. We firstly consider a small balanced
factorial design.
A start-up company wants to launch a new type of product (mobile phones, yogurt, cars, a new way of banking management, ...). The product has five potential and distinct features, yet to be decided, namely:

1. Color Co,

2. Shape Sh,

3. Weight Wei,

4. Material Ma, and

5. Price Pri.

Each of these features can take on only two possible values, so a full factorial would require $2^5 = 32$ experiments. But 5 factors are not enough to study the cellphone's quality; how about adding 6 extra factors?


Table 6.16: An orthogonal array with 11 binary factors

Co Sh Wei Ma Pri OS Cam. Wifi Anti. Ant. Place  Run

0 0 0 0 0 0 0 0 0 0 0 1

1 1 1 0 1 1 0 1 0 0 0 2

0 1 1 1 0 1 1 0 1 0 0 3

0 0 1 1 1 0 1 1 0 1 0 4

0 0 0 1 1 1 0 1 1 0 1 5

1 0 0 0 1 1 1 0 1 1 0 6

0 1 0 0 0 1 1 1 0 1 1 7

1 0 1 0 0 0 1 1 1 0 1 8

1 1 0 1 0 0 0 1 1 1 0 9

0 1 1 0 1 0 0 0 1 1 1 10

1 0 1 1 0 1 0 0 0 1 1 11

1 1 0 1 1 0 1 0 0 0 1 12


A specific subset $F \subseteq D = C \times S \times W \times M \times P \times OS \times Cam \ldots \times Place$, given in Table 6.16, is called a balanced factorial design, denoted OA(12; $2^{11}$; 2). It has strength 2, i.e. all defining generators have length at least 3 (such as $CSW = 1$), which allows us to separate all main effects on the response

$$Y = \beta_0 + \beta_1\, Color + \beta_2\, Shape + \ldots + \beta_{10}\, Antishock + \beta_{11}\, Place.$$

Design Resolution and strength [see Montgomery [14] and Wu [73]]

• The design resolution $R$ of a design is the minimum length of all defining generators. $R$ is used to catalog binary fractional factorial designs according to the alias patterns they produce. Regular designs are designs that can be fully defined by generator words.

• For a regular design, its strength is defined as $t = R - 1$.

Specifically,

1. Resolution III Designs: ones in which no main effect is aliased with any other main effect, but main effects are aliased with 2-factor interactions, and some 2-factor interactions may be aliased with each other.


2. Resolution IV Designs: designs in which no main effect is aliased with any other main effect or
two-factor interactions, but two-factor interactions are aliased with each other.

3. Resolution V Designs: in which no main effect or 2-factor interaction is aliased with any other main
effect or 2-factor interaction, but 2-factor interactions are aliased with 3-factor interactions.

Designs of resolution 𝑅 = 𝐼𝐼𝐼, 𝐼𝑉 are useful in factor screening experiments.

For irregular designs, we consider $d$ factors, where $2 \le d \in \mathbb{N}$, described by finite sets $Q_1, Q_2, \ldots, Q_d$; denote $r_i = |Q_i|$.

A multi-subset $F \subset D = Q_1 \times Q_2 \times \ldots \times Q_d$ is said to be a strength $t$ orthogonal array (OA), or $t$-balanced, if for each choice of $t$ coordinates (columns) of $F$, every combination of coordinate values from those $t$ columns occurs equally often; here $t$ is a natural number. For a mixed orthogonal array $F$ with $N$ rows, factors $Q_i$, and strength $t$, we denote $F = \mathrm{OA}(N; r_1, r_2, \cdots, r_d; t)$.

We can rewrite $F = \mathrm{OA}(N; s_1^{a_1} \cdot s_2^{a_2} \cdots s_m^{a_m}; t)$, meaning that $F$ has

• strength $t$, with $N$ runs,

• $d = a_1 + a_2 + \ldots + a_m$ total factors,

• $a_i$ factors with the same $s_i$ levels, where $s_i \ne s_j$ if $i \ne j$, $i, j = 1, \ldots, m$.

However, orthogonal arrays include both regular designs and irregular designs!
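The defining property of strength (every choice of $t$ columns carries each level combination equally often) is easy to verify computationally. Here is a minimal sketch for $t = 2$, where the object `OA` is assumed to hold an array such as the 12 × 11 matrix of Table 6.16:

# Check 2-balance (strength 2): for every pair of columns, each
# combination of levels must occur equally often.
check.strength2 <- function(OA) {
  idx <- combn(ncol(OA), 2)
  all(apply(idx, 2, function(p) {
    counts <- table(OA[, p[1]], OA[, p[2]])
    length(unique(as.vector(counts))) == 1
  }))
}
# check.strength2(OA)   # should return TRUE for the array of Table 6.16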

A case study for more than two factors with distinct levels

To illustrate the useful role of fractional factorial designs (FFD), we present here an application of FFDs with more than two factors in software manufacturing. Suppose that you are a quality engineer in a software firm.

• Your responsibility is to use statistical techniques for lowering the cost of design and production while
maintaining customer satisfaction.

What would you do if confronted with the following challenge?

• A competitor improves its product while simultaneously reducing the price. Your job is to identify components in your company's software production process which can be changed to reduce the production time and lower the price, while making the product more robust [52, 6].

• You are required to carry out a series of experiments, in which a range of parameters, called factors,
can be varied. The outcome of these experiments will be used to decide which strategy should be


followed in the future. To be precise, you will perform experiments and measure some quantitative
outcomes, called responses, when values of the factors are varied. Each experiment is also called
an experimental run, or just a run. In each run, the factors are set to specific values from a certain
finite set of settings or levels, and the responses are recorded.

Identifying important factors and the number of levels

The board wants to study as many parameters as possible within a limited budget. They have identified
8 factors, coded as 𝐴, 𝐵, 𝐶, 𝐷, 𝐸, 𝐹, 𝐺, 𝐻, that could affect the outcome. The factors and their levels
are described in Table 6.17, where # stands for the number of levels of each factor.

• An initial investigation indicates that employees should have at least one year of experience, and that
there is a great difference between an employee with three years experience and one with five years.

• We choose 5 levels for years of experience, which we call factor 𝐴. Factor 𝐵 is the programming
language that our software is written in. Of the many languages used in the market nowadays, we
choose 4 which are appropriate for large applications.

• Although there are many different applications of software (factor 𝐶), we can classify them into two
major categories: scientific applications and business applications (such as finance, accountancy,
and tax).
For the former, the software developers require a fair knowledge of exact sciences like mathematics
or physics, but relatively little knowledge of the particular customers. On the other hand, for the latter,
the clients have specific requirements, which we need to know before designing, implementing and
testing the software.

• We use two popular operating systems, Windows and Linux, for factor 𝐷. Whether we interview the
customers is factor 𝐸 – as mentioned, we expect this to interact with factor 𝐶. The factors 𝐹, 𝐺, 𝐻
are self-explanatory, and each clearly has two levels.

Conflicting demands need a trade-off approach

Let $N$ be the number of experimental runs in a possible fractional experiment; each run will be assigned to a particular combination of factor levels. In the worst-case scenario, the total number of possible level combinations of the factors $A, B, C, D, E, F, G$ and $H$ is $\max N = 5 \cdot 4 \cdot 2^6 = 1280$.
Selecting the right combinations of levels in these factors is obviously crucial, from the cost-benefit
view. The experiments are indeed costly and the board has decided that the budget allows for only 100
experiments. Is there any design with at most 100 experiments available for our task?

1. We restrict ourselves to studying only one response, 𝑌 , the number of failures (errors or crashes)
occurring in a week. To minimize the average number of failures in new products, we study the
combined influence of the factors using linear regression models.


Table 6.17: Eight factors, the number of levels and the level meanings

Level

Factor Description # 0 1 2 3 4

𝐴 years of experience 5 1 3 5 7 9

𝐵 programming languages 4 C++ Java Perl XML

𝐶 applications 2 scientific business

𝐷 operating systems 2 Windows Linux

𝐸 interviewing customers 2 no yes

𝐹 weekly bonus policy 2 no yes

𝐺 teamwork training 2 no yes

𝐻 overwork policy 2 no yes

2. In these models, we make a distinction between main effects, two-factor interactions, and higher-
order interactions.

3. The main effect of a factor models the average change in the response when the setting of that
factor is changed. A model containing just the main effects takes the form

$$Y = \theta_0 + \sum_{i=1}^{4}\theta_{A_i}\, a_i + \sum_{j=1}^{3}\theta_{B_j}\, b_j + \theta_C\, c + \ldots + \theta_H\, h + \epsilon, \qquad (6.44)$$

where $\epsilon$ is a random error term, $a = 0, 1, 2, 3, 4$, $b = 0, 1, 2, 3$, the possible values of $c, d, e, f, g, h$ are 0 or 1, and the parameters $\theta_*$ are the regression coefficients.

4. In particular, 𝜃0 , the number of failures when all factors are set to the values 0, is called the intercept
of the model. These coefficients are estimated by taking linear combinations of the responses.

5. Two-factor interactions, or two-interactions, model changes in the main effects of a factor due to a
change in the setting of another factor.

To study the activity of all two-interactions simultaneously, we may want to augment Model (6.44) by
adding

$$\sum_{i=1}^{4}\sum_{j=1}^{3}\theta_{A_iB_j}\, a_i b_j + \sum_{i=1}^{4}\theta_{A_iC}\, a_i c + \ldots + \sum_{i=1}^{4}\theta_{A_iH}\, a_i h + \sum_{j=1}^{3}\theta_{B_jC}\, b_j c + \ldots + \sum_{j=1}^{3}\theta_{B_jH}\, b_j h + \theta_{CD}\, cd + \ldots + \theta_{GH}\, gh. \qquad (6.45)$$


6. We can also define higher-order interactions, but these are usually considered unimportant. The total number of intercept, main-effect and two-interaction parameters is

$$1 + \sum_{i=1}^{8}(s_i - 1) + \sum_{\substack{i,j=1 \\ i<j}}^{8}(s_i - 1)(s_j - 1),$$

where $s_i$ is the number of levels of factor $i$.

7. This formula shows that we need 83 parameters up to two-factor interactions to model the combined influences of the factors. In fact, only some of the two-factor interactions turn out to be important, so we need even fewer than 83 parameters. This is in contrast with a full model including all interactions up to order 8, which needs 1280 parameters. Our trade-off solution will be to use a design with a run size of around 83 runs; the count of 83 is verified in the sketch below.
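A quick sketch of the parameter count in R:

# Intercept + main effects + two-interactions for levels s = (5,4,2,2,2,2,2,2).
s <- c(5, 4, 2, 2, 2, 2, 2, 2)
d <- s - 1
1 + sum(d) + (sum(d)^2 - sum(d^2)) / 2   # = 83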

A suggested strength 3 fractional factorial design

The full factorial design of the eight factors described above is the Cartesian product {0, 1, . . . , 4} ×
{0, 1, . . . , 3} × {0, 1}6 .

• Using this design, we are able to estimate all interactions, but performing all 1280 runs exceeds the
firm’s budget. Instead we use a fractional factorial design, that is, a subset of elements in the full
factorial design.

• We want to choose a fractional design that still allows us to estimate the main effects and some of the
two-interactions. If we want to measure simultaneously all effects up to 2-interactions of the above 8
factors, an 83 run fractional design would be needed.

• Constructing an 83-run design is possible, and one could be found with trial-and-error algorithms. But such a design lacks some attractive features, such as balance, which is discussed next.

DISCUSSION.

1. An algebraic approach can also be used to construct such a design, but it is infeasible for large run
size designs; for more details see Nguyen [36]. A workable solution is the 80 run experimental design
presented in Table 6.18. This allows us to estimate the main effect of each factor and some of their
pairwise interactions. The construction of this design is presented in [34]. Note that the responses 𝑌
have been computed by simulation, not by conducting actual experiments.

2. A notable property of the array in Table 6.18 is that it has strength 3. That is, if we choose any 3
columns in the table and go down we find that every triple of symbols in those columns appears the
same number of times.
This property is also called 3-balance or 3-orthogonality; and the array (fractional design) itself is
called a strength 3 orthogonal array or a 3-balanced fractional design. By [51, Theorem 11.3], a
strength 3 design allows us to measure all the main effects and some of the two-interactions.


Table 6.18: A mixed orthogonal design with 3 distinct sections

run 𝐴 𝐵 𝐶 𝐷 𝐸 𝐹 𝐺 𝐻 𝑌 run 𝐴 𝐵 𝐶 𝐷 𝐸 𝐹 𝐺 𝐻 𝑌

5 4 2 2 2 2 2 2 5 4 2 2 2 2 2 2

1 0 0 0 0 0 0 0 0 25 41 2 2 0 0 1 1 0 0 12

2 0 0 1 0 1 0 1 0 15 42 2 2 1 0 0 0 1 1 6

3 0 0 1 1 0 1 0 1 15 43 2 2 1 1 1 1 1 0 12

4 0 0 0 1 1 1 1 1 15 44 2 2 0 1 0 0 0 1 15

5 0 1 0 0 0 0 0 1 5 45 2 3 0 0 1 0 0 0 12

6 0 1 1 0 1 1 1 0 15 46 2 3 1 0 0 0 1 0 15

7 0 1 1 1 0 1 0 0 10 47 2 3 1 1 1 1 1 1 0

8 0 1 0 1 1 0 1 1 15 48 2 3 0 1 0 1 0 1 21

9 0 2 0 0 0 1 1 1 25 49 3 0 0 0 0 1 1 0 4

10 0 2 1 0 1 1 0 1 30 50 3 0 1 0 1 1 1 1 4

11 0 2 1 1 0 0 1 0 20 51 3 0 1 1 1 0 0 0 4

12 0 2 0 1 1 0 0 0 10 52 3 0 0 1 0 0 0 1 8

13 0 3 0 0 0 1 1 0 15 53 3 1 0 0 1 1 0 0 8

14 0 3 1 0 1 0 0 1 30 54 3 1 1 0 0 0 1 0 8

15 0 3 1 1 0 0 1 1 30 55 3 1 1 1 0 0 1 1 0

16 0 3 0 1 1 1 0 0 10 56 3 1 0 1 1 1 0 1 2

17 1 0 0 0 0 0 0 1 20 57 3 2 0 0 0 0 0 0 4

18 1 0 1 0 1 1 0 0 4 58 3 2 1 0 1 0 0 1 6

19 1 0 1 1 0 1 1 1 4 59 3 2 1 1 1 1 1 0 14

20 1 0 0 1 1 0 1 0 8 60 3 2 0 1 0 1 1 1 6

21 1 1 0 0 1 0 1 0 0 61 3 3 0 0 1 0 1 1 14

22 1 1 1 0 0 1 1 1 16 62 3 3 1 0 0 1 0 1 8

23 1 1 1 1 1 0 0 1 4 63 3 3 1 1 0 1 0 0 4

24 1 1 0 1 0 1 0 0 20 64 3 3 0 1 1 0 1 0 0

25 1 2 0 0 0 1 1 0 24 65 4 0 0 0 1 1 0 1 2
⋮ (runs 26-38 and 66-78 omitted) ⋮

39 2 1 1 1 1 0 0 0 15 79 4 3 1 1 1 0 0 1 7

40 2 1 0 1 0 0 1 0 6 80 4 3 0 1 0 0 0 0 8

3. We could, in fact, investigate all main effects and all two-interactions of the abovementioned eight factors by using a 160-run strength 3 orthogonal array; see [51, Section 11.4] for a detailed explanation. But the board would have to increase the current budget by at least 60 percent if we use a 160-run orthogonal array.

6.7 Summary of Terms- Problems

The main concepts and tools introduced in this chapter include:

• Response Variable
• Controllable Factor
• Factor Level
• Experimental Array
• Randomization
• Main Effects
• Interactions
• Factorial Designs

Problem 6.1.

Describe a production process familiar to you, like baking of cakes or manufacturing concrete. List the pertinent variables. What is (are) the response variable(s)? Classify the variables which affect the response into noise variables and control variables. How many levels would you consider for each variable?

Problem 6.2.

Different types of adhesive are used in a lamination process, in manufacturing a computer card.
The card is tested for bond strength. In addition to the type of adhesive, a factor which might influence
the bond strength is the curing pressure (currently at 200psi).
Follow the basic steps of experimental design to set a possible experiment for testing the effects of
adhesives and curing pressure on the bond strength.

Problem 6.3.

Three factors A, B, C are tested in a given experiment, designed to assess their effects on the
response variable. Each factor is tested at 3 levels. List all the main effects and interactions.

Chapter 7

Experimental Designs II
Analysis with Random Effects Model

[Source [56]]

Introduction

An experimenter is frequently interested in a factor that has a large number of possible levels. If the experimenter randomly selects $a$ of these levels from the population of factor levels, then we say that the factor is random. Because the levels of the factor actually used in the experiment were chosen randomly, inferences are made about the entire population of factor levels.
We assume that the population of factor levels is either of infinite size or is large enough to be
considered infinite. Situations in which the population of factor levels is small enough to employ a finite
population approach are not encountered frequently.

7.1 Random effects model of a single-factor experiment

7.1.1 The linear model of a single-factor

The linear random effects model of a factor 𝐴 (with 𝑎 levels) is

𝑌𝑖𝑗 = 𝜇 + 𝜏𝑖𝐴 + 𝜀𝑖𝑗 , 𝑖 = 1, 2, · · · , 𝑎, 𝑗 = 1, 2, · · · , 𝑛, (7.1)

where both the treatment effects 𝜏𝑖 := 𝜏𝑖𝐴 and 𝜀𝑖𝑗 are random variables. We will assume
that the treatment effects 𝜏𝑖 are iid N (0, 𝜎𝜏2 ) random variables and
that the errors 𝜀𝑖𝑗 are iid N (0, 𝜎 2 ) random variables, and
that the 𝜏𝑖 and 𝜀𝑖𝑗 are independent.

• Since 𝜏𝑖 is independent of 𝜀𝑖𝑗 , the variance of any observation is

V[𝑌𝑖𝑗 ] = 𝜎𝜏2 + 𝜎 2 (7.2)

Definition 7.1.

The variances $\sigma_\tau^2$ and $\sigma^2$ are called variance components, and the model (Equation 7.1) is called the components of variance or random effects model.

The observations in the random effects model are normally distributed because they are linear combinations of the two normally and independently distributed random variables $\tau_i$ and $\varepsilon_{ij}$. But in Model (7.1) the observations $Y_{ij}$ are independent only if they come from different factor levels. Specifically, we can show that the covariance of any two observations is

$$\mathrm{Cov}(Y_{ij}, Y_{ik}) = \sigma_\tau^2, \;\; j \ne k; \qquad \mathrm{Cov}(Y_{ij}, Y_{i_1 k}) = 0, \;\; i \ne i_1. \qquad (7.3)$$


Hence, the observations within a specific factor level all have the same covariance, because before
the experiment is conducted, we expect the observations at that factor level to be similar since they all
have the same random component.
Once the experiment has been conducted, the observations can be assumed to be independent, because the parameter $\tau_i$ has been determined and the observations in that treatment differ only because of random error.
The covariance structure of the observations as in (7.3) [of the single-factor random effects model]
can be written through the covariance matrix of the observations with size 𝑁 ×𝑁 , 𝑁 = 𝑎 𝑛. To illustrate,
suppose that we have 𝑎 = 3 treatments and 𝑛 = 2 replicates. There are 𝑁 = 6 observations, which we
can write as a vector

$$y = [y_{11}, y_{12}, y_{21}, y_{22}, y_{31}, y_{32}]^T,$$

and the $6 \times 6$ covariance matrix of these observations is

$$\Sigma = \mathrm{Cov}(y) = \begin{bmatrix}
\sigma_\tau^2 + \sigma^2 & \sigma_\tau^2 & 0 & 0 & 0 & 0 \\
\sigma_\tau^2 & \sigma_\tau^2 + \sigma^2 & 0 & 0 & 0 & 0 \\
0 & 0 & \sigma_\tau^2 + \sigma^2 & \sigma_\tau^2 & 0 & 0 \\
0 & 0 & \sigma_\tau^2 & \sigma_\tau^2 + \sigma^2 & 0 & 0 \\
0 & 0 & 0 & 0 & \sigma_\tau^2 + \sigma^2 & \sigma_\tau^2 \\
0 & 0 & 0 & 0 & \sigma_\tau^2 & \sigma_\tau^2 + \sigma^2
\end{bmatrix} \qquad (7.4)$$

The main diagonal elements of this matrix are the variances of each individual observation, and every off-diagonal element is the covariance of a pair of observations.
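The block structure of (7.4) can be generated compactly as a Kronecker product; a minimal sketch with assumed values for the two variance components:

# Covariance matrix (7.4) for a = 3 treatments, n = 2 replicates:
# block-diagonal with blocks sigma_tau^2 * J_n + sigma^2 * I_n.
a <- 3; n <- 2
sigma2.tau <- 2; sigma2 <- 1                      # assumed values
Sigma <- kronecker(diag(a), sigma2.tau * matrix(1, n, n)) + sigma2 * diag(a * n)
Sigma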

7.1.2 Estimating the Model Parameters

How do we estimate the variance components $\sigma_\tau^2$ and $\sigma^2$ in the model (7.1)? Two methods are used: I) the method of moments and II) maximum likelihood.

I) The method of moments relies on equating the expected mean squares to their observed values in the ANOVA table and solving for the variance components,


$$MS_{Treatments} = \sigma^2 + n\,\sigma_\tau^2, \qquad MS_E = \sigma^2. \qquad (7.5)$$

Therefore, the estimators of the variance components respectively are

$$\hat\sigma^2 = MS_E \quad \text{and} \quad \hat\sigma_\tau^2 = \frac{MS_{Treatments} - MS_E}{n}. \qquad (7.6)$$

$MS_{Treatments}$ and $MS_E$ are found from the basic ANOVA, with

$$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar y)^2 = SS_{Treatments} + SS_E, \qquad (7.7)$$

where $SS_{Treatments} = n\sum_{i=1}^{a}(\bar y_i - \bar y)^2$ and $SS_E = SS_T - SS_{Treatments}$.

Table 7.1: ANOVA of the factor $A$

Source of Variation      D.F.       S.S.                M.S.                                        $F$ statistic
Factor $A$ treatments    $a - 1$    $SS_{Treatments}$   $MS_{Treatments} = SS_{Treatments}/(a-1)$   $f_0$
Error                    $N - a$    $SS_E$              $MS_E = SS_E/(N - a)$
Total                    $N - 1$    $SS_T$              -

Summary: The method of moments does not require the normality assumption. The estimators $\hat\sigma^2$, $\hat\sigma_\tau^2$ are best quadratic unbiased (i.e., of all unbiased quadratic functions of the observations, these estimators have minimum variance).
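A minimal sketch of the method of moments in R, on placeholder data (the factor and response names are ours):

# ANOVA (method of moments) estimates of the variance components.
set.seed(1)
a <- 4; n <- 6
dat <- data.frame(A = gl(a, n), y = rnorm(a * n))   # placeholder data
ms <- anova(aov(y ~ A, data = dat))[["Mean Sq"]]
sigma2.hat     <- ms[2]                             # MS_E
sigma2.tau.hat <- (ms[1] - ms[2]) / n               # (MS_Treatments - MS_E)/n
c(sigma2 = sigma2.hat, sigma2.tau = sigma2.tau.hat)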

II) Method of maximum likelihood estimation (MLE)- The REML

Recall that the likelihood of observing the data vector $y = (y_1, y_2, \cdots, y_N)$ with respect to $\theta$ is

$$L(\theta) = f(y; \theta) = \prod_{i=1}^{N} f(y_i; \theta).$$

To illustrate how the MLE applies to an experimental design model, let $y = (y_1, y_2, \cdots, y_{an})$ be the $an \times 1$ vector of observations for a single-factor random effects model (7.1), with $a$ treatments and $n$ replicates (so $N = an$), and let $\Sigma$ be the $an \times an$ covariance matrix of the observations.

The likelihood function, with parameters $\theta = (\mu, \sigma_\tau^2, \sigma^2)$, has the form

$$L(y; \theta) = f(y_1, y_2, \cdots, y_{an}; \mu, \sigma_\tau^2, \sigma^2) = \frac{1}{C}\exp\Big\{-\frac{1}{2}\big(y - j_N\,\mu\big)'\,\Sigma^{-1}\,\big(y - j_N\,\mu\big)\Big\} \qquad (7.8)$$

where $N = an$ is the total number of observations, $j_N$ is an $N \times 1$ vector of 1s, $\mu$ is the overall mean in the model, and the constant $C = (2\pi)^{N/2}\,|\Sigma|^{1/2}$.

The maximum likelihood estimates of the parameters 𝜃 = (𝜇, 𝜎𝜏2 , 𝜎 2 ) are the values of these quantities
(with two variance components) that maximize the likelihood function (7.8).

• The classical MLE of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\sum_i (X_i - \bar X)^2}{n}.$$

This estimator is biased.

• The standard variant of maximum likelihood estimation that is used for estimating variance com-
ponents is known as the residual maximum likelihood (REML) method. The basic characteristic of
REML is that it takes the location parameters in the model into account when estimating the random
effects.

The REML estimator is unbiased, given by

$$S^2 = \frac{\sum_i (X_i - \bar X)^2}{n - 1}. \qquad (7.9)$$
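In R, REML estimates of the two variance components are available through mixed-model software. A sketch with the lme4 package (an assumption: lme4 is installed), reusing the placeholder data `dat` from the sketch above:

# REML estimation of sigma_tau^2 and sigma^2 for the single-factor
# random effects model y_ij = mu + tau_i + eps_ij.
library(lme4)
fit <- lmer(y ~ 1 + (1 | A), data = dat, REML = TRUE)
VarCorr(fit)   # variance of the A intercepts (sigma_tau^2) and residual sigma^2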

7.2 Random effects model of two-factor experiment

In this section, we focus on methods for the design and analysis of factorial experiments with two
random factors. This paves the way for studying, in next chapters, nested and split-plot designs, two
situations where random factors are frequently encountered in practice.

7.2.1 The linear model of two factors

Suppose that we have two factors, 𝐴 and 𝐵 and that both factors have a large number of levels that
are of interest. We will choose at random 𝑎 levels of factor 𝐴 and 𝑏 levels of factor 𝐵 and arrange these


factor levels in a factorial experimental design. If the experiment is replicated $n$ times, the linear random effects model of factors $A, B$ is

$$Y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}, \quad i = 1, 2, \cdots, a, \; j = 1, 2, \cdots, b, \; k = 1, 2, \cdots, n, \qquad (7.10)$$

where the model parameters (treatment effects) $\tau_i$, $\beta_j$, $(\tau\beta)_{ij}$ and the errors $\varepsilon_{ijk}$ are all independent random variables. We will assume that the random variables $\tau_i$, $\beta_j$, $(\tau\beta)_{ij}$ and $\varepsilon_{ijk}$ are normally distributed with mean 0 and variances given by

• $\mathrm{V}[\tau_i] = \sigma_\tau^2$, $\mathrm{V}[\beta_j] = \sigma_\beta^2$,

• $\mathrm{V}[(\tau\beta)_{ij}] = \sigma_{\tau\beta}^2$, and $\mathrm{V}[\varepsilon_{ijk}] = \sigma^2$.

• Since the model parameters are mutually independent, the variance of any observation is

$$\mathrm{V}[Y_{ijk}] = \sigma_\tau^2 + \sigma_\beta^2 + \sigma_{\tau\beta}^2 + \sigma^2 \qquad (7.11)$$

and $\sigma_\tau^2$, $\sigma_\beta^2$, $\sigma_{\tau\beta}^2$ and $\sigma^2$ are the variance components.

7.2.2 Testing hypotheses with Fisher statistic

The hypotheses that we are interested in testing are $H_0: \sigma_\tau^2 = 0$ vs. $H_1: \sigma_\tau^2 > 0$; $H_0: \sigma_\beta^2 = 0$ vs. $H_1: \sigma_\beta^2 > 0$; and $H_0: \sigma_{\tau\beta}^2 = 0$ vs. $H_1: \sigma_{\tau\beta}^2 > 0$. Each is tested with the appropriate ratio of mean squares, which follows a Fisher $F$ distribution under the null hypothesis.

EXAMPLE 7.1 ([Industry- Manufacturing.] - A Measurement Systems Capability Study).

A common industrial application is to use a designed experiment to study the components of vari-
ability in a measurement system. These studies are often called gauge capability studies or gauge
repeatability and reproducibility (R & R) studies because these are the components of variability
that are of interest.


Next we introduce an important type of experimental designs, the nested designs. These designs
find reasonably widespread application in the industrial use of designed experiments.

7.3 The Two-Stage Nested Design

In multifactor experiments, the levels of one factor (e.g., factor 𝐵) are similar but not identical for different
levels of another factor (e.g., 𝐴). Such an arrangement is called a nested, or hierarchical, design,
with the levels of factor 𝐵 nested under the levels of factor 𝐴. For example, consider a company that
purchases its raw material from three different suppliers. The company wishes to determine whether


the purity of the raw material is the same from each supplier. There are four batches of raw material
available from each supplier, and three determinations of purity are to be taken from each batch.
This is a two-stage nested design, with batches nested under suppliers.

At first glance, you may ask why this is not a factorial experiment. If this were a factorial, then batch
1 would always refer to the same batch, batch 2 would always refer to the same batch, and so on. This
is clearly not the case because the batches from each supplier are unique for that particular supplier.
That is, batch 1 from supplier 1 has no connection with batch 1 from any other supplier, batch 2 from
supplier 1 has no connection with batch 2 from any other supplier, and so forth.

7.3.1 The Statistical Model

The linear statistical model for the two-stage nested design is

$$Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{(ij)k}, \qquad \begin{cases} i = 1, 2, \cdots, a, \\ j = 1, 2, \cdots, b, \\ k = 1, 2, \cdots, n. \end{cases} \qquad (7.12)$$

That is, there are $a$ levels of factor $A$ and $b$ levels of factor $B$, nested under each level of $A$, and $n$ replicates. We also note:

• 𝜇 = 𝜇... is a constant (overall mean effect),

• the subscript 𝑗(𝑖) indicates that the 𝑗th level of factor 𝐵 is nested under the 𝑖th level of factor 𝐴.

• It is convenient to think of the replicates as being nested within the combination of levels of 𝐴 and 𝐵;
thus, the subscript (𝑖𝑗) 𝑘 is used for the error term.

• 𝜀(𝑖𝑗) 𝑘 are independent N (0, 𝜎 2 ) =: NID(0, 𝜎 2 ), 𝑖 = 1, . . . , 𝑎, 𝑗 = 1, . . . , 𝑏, 𝑘 = 1, . . . , 𝑛.

• This is a balanced nested design because there are an equal number of levels of 𝐵 within each
level of 𝐴 and an equal number of replicates.

7.3.2 The ANOVA Analysis

We may write the total corrected sum of squares as

$$SS_T = \sum_i\sum_j\sum_k (y_{ijk} - \bar y)^2 = SS_A + SS_{B(A)} + SS_E. \qquad (7.13)$$

There are 𝑎𝑏𝑛 − 1 degrees of freedom for 𝑆𝑆𝑇 , 𝑎 − 1 degrees of freedom for 𝑆𝑆𝐴 , 𝑎(𝑏 − 1) degrees of
freedom for 𝑆𝑆𝐵(𝐴) , and 𝑎𝑏(𝑛 − 1) degrees of freedom for error.


Table 7.2: ANOVA for the Two-Stage Nested Design

Source of Variation   D.F. (Degrees of Freedom)   S.S. (Sum of Squares)   M.S. (Mean Square)
Factor $A$            $a - 1$                     $SS_A$                  $MS_A = SS_A/(a - 1)$
$B$ within $A$        $a(b - 1)$                  $SS_{B(A)}$             $MS_{B(A)} = SS_{B(A)}/(a(b - 1))$
Error                 $ab(n - 1)$                 $SS_E$                  $MS_E = SS_E/(ab(n - 1))$
Total                 $N - 1 = abn - 1$           $SS_T$

Knowledge box 3.

1. The ANOVA identity means that if the errors are NID(0, $\sigma^2$), we may divide each sum of squares on the right of Equation 7.13 by its degrees of freedom to obtain independently distributed mean squares, such that the ratio of any two mean squares is distributed as the Fisher distribution $F$.

2. Fixed or random model?

The appropriate statistics for testing the effects of factors $A$ and $B$ depend on whether $A$ and $B$ are fixed or random.

• If factors $A$ and $B$ are fixed, we assume that $\sum_{i=1}^{a}\alpha_i = 0$, and $\sum_{j=1}^{b}\beta_{j(i)} = 0$ for each $i = 1, 2, \cdots, a$. That is, the $A$ treatment effects sum to zero, and the $B$ treatment effects sum to zero within each level of $A$.

• Alternatively, if $A$ and $B$ are random, we assume that $\alpha_i$ is NID(0, $\sigma_\alpha^2$) and $\beta_{j(i)}$ is NID(0, $\sigma_\beta^2$).

EXAMPLE 7.2 ([Industry- Manufacturing.]).

Consider a company that buys raw material in batches from three different suppliers. The purity of
this raw material varies considerably, which causes problems in manufacturing the finished product. We
wish to determine whether the variability in purity is attributable to differences between the suppliers.
Four batches of raw material are selected at random from each supplier, and three determinations of
purity are made on each batch. This is, of course, a two-stage nested design. The data, after coding
by subtracting 93, are shown in Table 7.3.


Table 7.3: Coded Purity Data for Example (Code: $y_{ijk} = Purity - 93$)

                      Supplier 1            Supplier 2            Supplier 3
Batches               1    2    3    4      1    2    3    4      1    2    3    4
                      1   −2   −2    1      1    0   −1    0      2   −2    1    3
                     −1   −3    0    4     −2    4    0    3      4    0   −1    2
                      0   −4    1    0     −3    2   −2    2      0    2    2    1
Batch totals $y_{ij·}$  0   −9   −1    5     −4    6   −3    5      6    0    2    6
Supplier totals $y_{i··}$     −5                   4                   14
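A minimal sketch of this nested ANOVA in R; the data vector encodes Table 7.3 batch by batch within each supplier, and the nesting is expressed with the `/` operator in the model formula:

# Two-stage nested ANOVA: batches nested within suppliers.
purity <- c( 1, -1,  0,  -2, -3, -4,  -2,  0,  1,   1,  4,  0,   # supplier 1
             1, -2, -3,   0,  4,  2,  -1,  0, -2,   0,  3,  2,   # supplier 2
             2,  4,  0,  -2,  0,  2,   1, -1,  2,   3,  2,  1)   # supplier 3
supplier <- gl(3, 12)
batch    <- gl(4, 3, 36)                   # batch labels repeat within suppliers
summary(aov(purity ~ supplier / batch))    # supplier, and batch within supplier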

♣ OBSERVATION 1.

1. The practical implications of this experiment and the analysis are very important.

The objective of the experimenter is to find the source of the variability in raw material purity. If
it results from differences among suppliers, we may be able to solve the problem by selecting
the ‘best’ supplier. However, that solution is not applicable here because the major source of
variability is the batch-to-batch purity variation within suppliers.

Therefore, we must attack the problem by working with the suppliers to reduce their batch-to-
batch variability. This may involve modifications to the suppliers’ production processes or their
internal quality assurance system.

2. This analysis indicates that batches differ significantly and that there is a significant interaction
between batches and suppliers. However, it is difficult to give a practical interpretation of the
batches × suppliers interaction. 

Table 7.4: Expected Mean Squares in the Two-Stage Nested Design

                A Fixed      A Fixed      A Random
E[MS]           B Fixed      B Random     B Random

E[$MS_A$]       $\sigma^2 + n\,\sigma_\beta^2 + b\,n\,\sigma_\tau^2$

Chapter 8

Multivariate Probability Distributions


Simultaneously study random variables

[Source [9]]

8.1 Random vector

Motivation

In previous statistical courses, we have discussed probability models and computation of probability for
events involving only one random variable. These are called univariate models.

• In this chapter, we discuss probability models that involve more than one random variable-naturally
enough, called multivariate models.

• Methods and probabilistic models in engineering and science often involve several random variables. For example,

- in medical diagnosis, the results of some tests may be meaningful, or

- in the context of a computer network, the workload of several portals may be of interest.

All these random variables are associated with the same experiment, sample space, and probability law, and their values may be interrelated.

Recall that a random variable $X$ is a map from a sample space $\Omega$ to the reals $\mathbb{R}$; that is, for $w \in \Omega$, $X(w) \in \mathbb{R}$.

• The domain of a random variable is the sample space Ω.

• The range of a random variable is the set of all observations

$$S_X = \mathrm{Range}(X) = \{X(w) : w \in \Omega\}.$$

$S_X$ can be the set of all real numbers $\mathbb{R}$, or the integers $\mathbb{Z}$, ..., depending on what possible values the random variable can take.

Hence, a random variable was defined to be a function from a sample space Ω into the real numbers.
A random vector, consisting of several random variables, is defined similarly.

Definition 8.1.

An 𝑛-dimensional random vector is a function from a sample space 𝑆 into 𝑅𝑛 , 𝑛-dimensional


Euclidean space.

Random Vector when 𝑛 = 2: Given a random experiment with a sample space C, consider two ran-
dom variables 𝑋1 and 𝑋2 , which assign to each element 𝑐 of C one and only one ordered pair of
numbers 𝑋1 (𝑐) = 𝑥1 , 𝑋2 (𝑐) = 𝑥2 .

• Then we say that (𝑋1 , 𝑋2 ) is a random vector.


• The space of $(X_1, X_2)$ is the set of ordered pairs $D = \{(x_1, x_2) : x_1 = X_1(c), x_2 = X_2(c);\; c \in C\}$. We also denote $\mathrm{Range}(X_1, X_2) := D$.

CONVENTION:

1. Use notation Ω to indicate the sample space of a single random variable, and notation C for the
sample space of multiple random variables or random vectors;

2. use notation Range(.) or 𝑆. to indicate the range (value space) of a single random variable, and use
notation D for the value space of multiple random variables or random vectors.

[Figure: a random vector X = (X, Y) maps the sample space C to the observed value set Range(X) = D = S_X × S_Y; component X records the favored singer, component Y the welfare classification.]

For example, if C is the whole population of Bangkok, the vector X = (X, Y) provides two aspects of the dwellers of Bangkok, shown in the range set D = {(x, y) : x ∈ S_X, y ∈ S_Y}. It helps the government to know Bangkok's economic and social structure, and to plan suitable policies.

For instance, let C be the population in Bangkok, we want to measure two key indexes of people
there: (1) Hobby and (2) Living standard. From previous examples,

• let 𝑋 denote the ‘favored singer’ variable with the value space Range(𝑋) = {𝐴𝑅𝐼𝑁, 𝑏, 𝑐} = 𝑆𝑋 [in
Entertainment Industry], and

• let 𝑌 denote the ‘social class’ (welfare classification) random variable with
Range(𝑌 ) = {𝑃 𝑜𝑜𝑟, 𝑅𝑖𝑐ℎ, Extremely rich} = {𝑝, 𝑟, er} = 𝑆𝑌 [in Economics].

Our interest is represented by the pair of variables (𝑋, 𝑌 ) =: X, we call it a random vector.

Example 8.1.


A coin is tossed three times and our interest is in the ordered number pair (number of H’s on first
two tosses, number of H’s on all three tosses), where H and T represent, respectively, heads and
tails. Let 𝑋1 denote the number of H’s on the first two tosses and 𝑋2 denote the number of H’s on
all three flips. Then our interest can be represented by the pair of random variables (𝑋1 , 𝑋2 ).

For example, (𝑋1 (𝐻𝑇 𝐻), 𝑋2 (𝐻𝑇 𝐻)) represents the outcome (1, 2).
The sample space C = {𝑇 𝑇 𝑇, 𝑇 𝑇 𝐻, 𝑇 𝐻𝑇, 𝐻𝑇 𝑇, 𝑇 𝐻𝐻, 𝐻𝑇 𝐻, 𝐻𝐻𝑇, 𝐻𝐻𝐻}. Continuing in this
way, 𝑋1 and 𝑋2 are real-valued functions defined on the sample space C, which take us from C to the
space of ordered number pairs

D = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)}.

Thus 𝑋1 and 𝑋2 are two random variables defined on the space C, and, in this example, the space
(range) of these random variables is the two-dimensional set 𝐷, which is a subset of two-dimensional
Euclidean space 𝑅2 . Hence (𝑋1 , 𝑋2 ) is a vector function from C to D.

8.2 Joint and marginal distributions

We often denote 2-dim random vectors using vector notation X = (𝑋1 , 𝑋2 )′ , where the ′ denotes the
transpose of the row vector (𝑋1 , 𝑋2 ).

8.2.1 Joint and marginal distributions- the discrete case

We focus firstly on two discrete random variables here.


Let 𝑋1 , 𝑋2 be random variables which are jointly observed at the same experiments. The random
vector X = (𝑋1 , 𝑋2 )′ is a discrete random vector if its space D is finite or countable (as in the above
example). Hence, 𝑋1 and 𝑋2 are both discrete also. For convenience, we also write X = (𝑋1 , 𝑋2 ) and
call it a discrete bivariate random vector.

Definition 8.2. Let X = (𝑋1 , 𝑋2 ) be a discrete bivariate random vector.

• The joint probability mass function (jpmf) of (𝑋1 , 𝑋2 ) is defined by

𝑝(𝑥1 , 𝑥2 ) = 𝑝𝑋1 ,𝑋2 (𝑥1 , 𝑥2 ) = P[𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ], (8.1)

for all (𝑥1 , 𝑥2 ) ∈ D. If the range of 𝑋 is 𝑆𝑋 , of 𝑌 is 𝑆𝑌 then the value space of (𝑋1 , 𝑋2 ) is
D = 𝑆𝑋 × 𝑆𝑌 .


• The joint probability mass function is characterized by the two properties

$$(i)\;\; 0 \le p(x_1, x_2) \le 1, \quad \text{and} \quad (ii)\;\; \sum_{(x_1,x_2)\in D} p(x_1, x_2) = 1. \qquad (8.2)$$

To stress the fact that 𝑝() is the joint pmf of the vector (𝑋1 , 𝑋2 ) rather than some other vector,
the notation 𝑝𝑋1 ,𝑋2 (𝑥1 , 𝑥2 ) will be used.

Let D be the value space associated with the random vector (𝑋1 , 𝑋2 ). As in the case of one random
variable, we speak of the event 𝐴 ⊂ D.

a/ The probability of the event $A$: the jpmf uniquely defines the probability of $A$, defined in terms of $(X_1, X_2)$, by

$$P[(X_1, X_2) \in A] = \sum_{(x_1,x_2)\in A} p(x_1, x_2). \qquad (8.3)$$

b/ The joint distribution, or joint cdf, is given by

$$F(x_1, x_2) = F_{X_1,X_2}(x_1, x_2) = P[X_1 \le x_1, X_2 \le x_2] = P[(X_1 \le x_1) \cap (X_2 \le x_2)], \qquad (8.4)$$

for every $(x_1, x_2) \in \mathbb{R}^2$ (the Euclidean space of 2 dimensions).

c/ The marginal pmf of $X$, $Y$ can be obtained from the joint pmf $p(x_1, x_2)$ by summing the joint p.m.f. with respect to all $x_j$, $j \ne i$. E.g., the marginal p.m.f.'s of $X$, $Y$ are

$$p_X(x) = \sum_{y \in S_Y} p(x, y), \quad \text{and} \quad p_Y(y) = \sum_{x \in S_X} p(x, y).$$

d/ The marginal distribution, or marginal c.d.f., of $X$ is

$$F_X(x) = P[X \le x] = \sum_{u \le x}\sum_{v \in S_Y} p(u, v). \qquad (8.5)$$

Example 8.2.

Consider the experiment of tossing two fair dice (blue and red say). The sample space for this
experiment has 36 equally likely points. With each of these 36 points associate two numbers,
𝑋 := 𝑋1 and 𝑌 := 𝑋2 . Let
𝑋 = sum of the two dice and 𝑌 = | Difference of the two dice |.


Here the sample space

C = {(1, 1), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 5), (6, 6)}.

The range of 𝑋 is 𝑆𝑋 = {2, 3, . . . , 12} and of 𝑌 is 𝑆𝑌 = {0, 1, 2, . . . , 5}. Therefore, the value space of
(𝑋1 , 𝑋2 ) is
D = 𝑆𝑋 × 𝑆𝑌 = {(2, 0), . . . , (2, 5), · · · , (12, 0), . . . , (12, 5)}.

* For the sample point $w = (3, 3)$: $X(w) = 3 + 3 = 6$, $Y(w) = |3 - 3| = 0$.

* For the value $(u, v) = (1, 5)$: is it in $D$? No, since $1 \notin S_X$; the preimage is $\mathbf{X}^{-1}(u, v) = \{w \in C : X(w) = 1, Y(w) = 5\} = \emptyset$.

* If $(u, v) = (12, 5) \in D$, then the preimage $\mathbf{X}^{-1}(u, v) = \{w \in C : X(w) = 12, Y(w) = 5\} = \emptyset$, since $X(w) = 12$ forces $w = (6, 6)$, hence $Y(w) = 0$.

1. What is $P[X = 5 \text{ and } Y = 3]$? The only two sample points in the event $A = \{(i, j) \in C : X((i, j)) = 5, Y((i, j)) = 3\}$ that yield $X = 5$ and $Y = 3$ are $(4, 1)$ and $(1, 4)$. So the joint pmf is

$$p_{X,Y}(5, 3) = P[X = 5 \text{ and } Y = 3] = P[A] = P[\{(4, 1), (1, 4)\}] = 2/36.$$

2. The joint distribution (joint cumulative distribution function, joint cdf):

$$F(3, 3) = F_{X,Y}(3, 3) = P[X \le 3, Y \le 3] = P[(X \le 3) \cap (Y \le 3)] = \sum_{i \le 3,\; j \le 3} p_{X,Y}(i, j) = ?$$

Since $S_X = \{2, 3, \ldots, 12\}$, we have $p_{X,Y}(1, j) = P[X = 1 \cap Y = j] = P[\emptyset] = 0$ for all $j$.

$X = 2$: Only the point $(1, 1)$ yields $X = 2$ and $Y = 0$, and no point yields $X = 2$ and $Y = j$ when $j \ge 1$, so

$$p_{X,Y}(2, 0) = P[X = 2 \text{ and } Y = 0] = P[\{(1, 1)\}] = 1/36, \qquad p_{X,Y}(2, j) = P[\emptyset] = 0, \;\forall j > 0.$$

$X = 3$: Similarly, no point yields $X = 3$ and $Y = 0$, and you can find the sample points that yield $X = 3$ and $Y = j$ when $j > 0$. Summing all the found values of $p_{X,Y}(i, j)$ gives the result; D.I.Y.
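The whole joint pmf of this example can be enumerated in a few lines of R; a quick sketch:

# Joint pmf of X = sum and Y = |difference| for two fair dice.
dice <- expand.grid(d1 = 1:6, d2 = 1:6)   # the 36 equally likely points
X <- dice$d1 + dice$d2
Y <- abs(dice$d1 - dice$d2)
table(X, Y) / 36    # e.g. the entry at X = 5, Y = 3 equals 2/36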

Summarized various distributions- the discrete case


Given a random vector $\mathbf{X} = (X_1, X_2) = (X, Y)$, we employ the following.

1. $C$: the sample space on which both $X$ and $Y$ are defined.

2. $S_X = X(C) = \{X(w) : w \in C\}$: the range (value space) of $X$; similarly $S_Y = Y(C)$.

3. $D = \mathbf{X}(C) = S_X \times S_Y$: the range of the random vector $\mathbf{X}$.

4. $p(x_1, x_2)$: the joint probability mass function (jpmf) of $(X_1, X_2)$.

5. For an event $A \subset D$, the probability of $A$ is $P[(X_1, X_2) \in A] = \sum_{(x_1,x_2)\in A} p(x_1, x_2)$.

6. $F(x_1, x_2) = P[(X_1 \le x_1) \cap (X_2 \le x_2)]$: the joint cdf.

7. $p_X(x) = \sum_{y \in S_Y} p(x, y)$: the marginal pmf of $X$.

8. $F_X(x) = P[X \le x] = \sum_{u \le x}\sum_{v \in S_Y} p(u, v)$: the marginal distribution (marginal c.d.f.) of $X$.

8.2.2 Joint and marginal distributions- the continuous case

Now consider the case of 𝑛 ≥ 2 continuous random variables. Let 𝑋1 , 𝑋2 , ..., 𝑋𝑛 be random variables
which are jointly observed at the same experiments.

a/ Joint distribution:

A function $F(x_1, x_2, \ldots, x_n)$ is called the joint distribution, or joint c.d.f., of $X_1, X_2, \ldots, X_n$ if

$$F(x_1, x_2, \ldots, x_n) = P[X_1 \le x_1, \ldots, X_n \le x_n] \qquad (8.6)$$

for every $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ (the Euclidean space of $n$ dimensions).

b/ Joint probability density function (joint p.d.f.):

Similarly to Condition (8.2), a function $f(x_1, x_2, \ldots, x_n) \ge 0$ is called the joint p.d.f. of $X_1, \ldots, X_n$ if

(i) Non-negativity: $f(x_1, x_2, \ldots, x_n) \ge 0$ for every $(x_1, x_2, \ldots, x_n)$, with $-\infty < x_i < \infty$ ($i = 1, \cdots, n$);

(ii) The whole probability is unity:

$$\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n = 1; \qquad (8.7)$$

and

(iii) Relationship of pdf and cdf:

$$F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_n} f(y_1, y_2, \ldots, y_n)\, dy_1\, dy_2 \ldots dy_n.$$


Note that if the joint probability density function is a constant $c$ over a bounded region $R$ (and zero elsewhere), then Condition (ii) in Equation 8.7 becomes, in this special case,

$$\int\cdots\int_R f(x_1, x_2, \ldots, x_n)\, dx_1 \ldots dx_n = c \times (\text{volume of region } R) = 1. \qquad (8.8)$$

Hence $c = 1/(\text{volume of region } R)$.

Geometric meaning of the joint pdf when 𝑛 = 2:

Condition b(iii)/ above can be used as:

$f(x, y)$ is the joint probability density function for $X$ and $Y$ if, for any two-dimensional set $A$,

$$P[(X, Y) \in A] = \iint_A f(x, y)\, dx\, dy. \qquad (8.9)$$

Meaning of the joint pdf in 3D-space [two random variables]

Figure 8.1: $P[(X, Y) \in A]$ = volume under the density surface above $A$.

Particularly, in the logistics industry, risk management (or military science and similar applications), if $A$ is the two-dimensional rectangle $\{(x, y) : a \le x \le b,\; c \le y \le d\}$, then

$$P[(X, Y) \in A] = P[a \le X \le b,\; c \le Y \le d] = \int_a^b \int_c^d f(x, y)\, dy\, dx.$$


c/ Marginal distributions (marginal c.d.f.): By letting one or more variables tend to infinity, we obtain the joint c.d.f. of the remaining variables. For example,

$$F(x, \infty) = P[X_1 \le x, X_2 \le \infty] = P[X_1 \le x] = F_1(x).$$

The c.d.f.'s of the individual variables are called the marginal distributions. E.g., $F_1(x)$ is the marginal c.d.f. of $X_1$. Hence, the marginal distribution (marginal c.d.f.) of $X$ is

$$F_1(x) = P[X \le x] = \int_{-\infty}^{x}\Big[\int_{-\infty}^{+\infty} f(u, v)\, dv\Big]\, du. \qquad (8.10)$$

d/ Marginal p.d.f.:

The marginal p.d.f. of $X_i$ ($i = 1, \cdots, n$) can be obtained from the joint p.d.f. $f(x_1, x_2, \ldots, x_n)$ by integrating the joint p.d.f. with respect to all $x_j$, $j \ne i$.

E.g., if $n = 2$ and $f(x_1, x_2)$ is the joint p.d.f. of $X = X_1$, $Y = X_2$, then the marginal p.d.f.'s of $X$, $Y$ are

$$f_1(x) = \int_{-\infty}^{+\infty} f(x, y)\, dy, \qquad f_2(y) = \int_{-\infty}^{+\infty} f(x, y)\, dx. \qquad (8.11)$$

Indeed, the marginal c.d.f. of $X$ is

$$F(x) = \int_{-\infty}^{x}\int_{-\infty}^{+\infty} f(u, v)\, dv\, du.$$

Differentiating $F(x)$ in $x$ we obtain the marginal p.d.f. of $X$:

$$f(x) = \frac{d}{dx}\int_{-\infty}^{x}\int_{-\infty}^{+\infty} f(u, v)\, dv\, du = \int_{-\infty}^{+\infty} f(x, v)\, dv.$$

Example 8.3.

The present example is theoretical and is designed to illustrate the above concepts. Let $(X, Y)$ be a pair of random variables having a joint uniform distribution on the region

$$T = \{(x, y) : 0 \le x,\; 0 \le y,\; x + y \le 1\}.$$

SOLUTION:
$T$ is a triangle in the $(x, y)$-plane with vertices at $(0, 0)$, $(1, 0)$ and $(0, 1)$. The joint p.d.f. of $X, Y$, $f_{X,Y}(x, y) = f(x, y)$, must fulfill the conditions:

$$(i)\;\; f(x, y) \ge 0, \;\forall x, y, \quad \text{and} \quad (ii)\;\; \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\, dx\, dy = 1. \qquad (8.12)$$


According to the assumption of uniform distribution, the joint p.d.f. $f(x, y)$ is a constant $c$ on $T$, so

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\, dx\, dy = \int_0^1\int_0^{1-x} c\; dy\, dx = \frac{c}{2} = 1,$$

hence

$$f(x, y) = \begin{cases} 2, & \text{if } (x, y) \in T, \\ 0, & \text{otherwise.} \end{cases}$$

The marginal p.d.f. of $X$ is obtained as

$$f_1(x) = \int_0^{1-x} f(x, y)\, dy = \int_0^{1-x} 2\, dy = 2(1 - x), \quad 0 \le x \le 1.$$

Obviously, $f_1(x) = 0$ if $x \notin [0, 1]$. Similarly, the marginal p.d.f. of $Y$ is

$$f_2(y) = \begin{cases} 2(1 - y), & \text{if } 0 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
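A Monte Carlo sketch that checks the marginal density $f_1(x) = 2(1 - x)$, by rejection sampling on the triangle $T$:

# Sample uniformly on T by rejection and compare the histogram of the
# first coordinate with the marginal density 2(1 - x).
set.seed(1)
u1 <- runif(2e5); u2 <- runif(2e5)
keep <- (u1 + u2) <= 1                      # points falling inside T
hist(u1[keep], freq = FALSE, breaks = 50)
curve(2 * (1 - x), add = TRUE, lwd = 2)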

Remark 3. As with univariate random variables, we often drop the subscript (𝑋1 , 𝑋2 ) from joint cdfs,
pdfs, and pmfs, when it is clear from the context.
We also use notation such as 𝑓12 instead of 𝑓𝑋1 ,𝑋2 .
Besides (𝑋1 , 𝑋2 ), we often use (𝑋, 𝑌 ) to express random vectors.

8.2.3 Distributions of a function of random variables

When considering multiple random variables, we can generate a new random variable as a function that takes the associated random variables as arguments. In particular, a function $Z = g(X, Y)$ of the variables $X$ and $Y$ is a random variable, and its density can be calculated from the joint probability mass function $p_{X,Y}$ (discrete case) or density $f_{X,Y}$ (continuous case).

The density function of $Z = g(X, Y)$, in the discrete case of $X$ and $Y$, is given by

$$p_Z(z) = \sum_{(x,y)\,:\,g(x,y)=z} p_{X,Y}(x, y). \qquad (8.13)$$

Expectation of the function $Z = g(X, Y)$:

$$\mathrm{E}[g(X, Y)] = \sum_{(x,y)\in\mathbb{R}^2} g(x, y)\, p_{X,Y}(x, y). \qquad (8.14)$$


8.3 Covariance and correlation of variables

Given any two random variables $X$ and $Y$ having a joint distribution with p.m.f. $p(x, y)$, as defined in Equation (8.2), or p.d.f. $f(x, y)$, as in (8.7), we define the following.

The covariance $\mathrm{Cov}(X, Y)$ between two rv's $X$ and $Y$ is $\mathrm{Cov}(X, Y) = \mathrm{E}[(X - \mu_X)(Y - \mu_Y)]$, where $\mu_X = \mathrm{E}[X]$ and $\mu_Y = \mathrm{E}[Y]$ are the two means.

When $X$ and $Y$ are discrete,

$$\mathrm{Cov}(X, Y) = \sum_x\sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y). \qquad (8.15)$$

When $X$ and $Y$ are continuous,

$$\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy. \qquad (8.16)$$

We could equivalently use $\mathrm{Cov}(X, Y) = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y] = \mathrm{E}[XY] - \mu_X\mu_Y$.

The correlation $\rho_{XY}$ of $X, Y$ measures the dependence of $X$ and $Y$:

$$\mathrm{Corr}(X, Y) = \rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y}, \qquad (8.17)$$

where $\sigma_X$, $\sigma_Y$ are the standard deviations of $X$, $Y$.

Example 8.4. [Actuarial Science.]

A large insurance agency services a number of customers who have purchased both a homeowner’s
policy and an automobile policy from the agency. For each type of policy, a deductible amount must be
specified. For an automobile policy, the choices are $100 and $250, whereas for a homeowner’s policy,
the choices are 0, $100, and $200.
Suppose an individual with both types of policy is selected at random from the agency’s files.

Table 8.1: Joint pmf $p(x, y)$ for the insurance data

                        $y$
$p(x, y)$        0       100     200
$x$    100       .20     .10     .20
       250       .05     .15     .30

Let variables


$X$ = the deductible amount on the auto policy, and
$Y$ = the deductible amount on the homeowner's policy.

a/ Find the possible $(X, Y)$ pairs. b/ Compute the marginal pmf's $p_X(x)$ and $p_Y(y)$.
c/ Compute the means $\mu_X = \mathrm{E}[X]$, $\mu_Y = \mathrm{E}[Y]$.
d/ Assume the joint pmf is given in Table 8.1. Find $\mathrm{Cov}(X, Y)$. [Answer: 1875]
e/ Find the correlation $\rho_{XY}$. [Answer: 0.301]
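A sketch of parts d/ and e/ in R, working directly from the joint pmf of Table 8.1:

# Covariance and correlation from the joint pmf.
x <- c(100, 250); y <- c(0, 100, 200)
p <- rbind(c(.20, .10, .20),
           c(.05, .15, .30))
px <- rowSums(p); py <- colSums(p)            # marginal pmf's
mx <- sum(x * px); my <- sum(y * py)          # means
cov.xy <- sum(outer(x, y) * p) - mx * my      # 1875
sx <- sqrt(sum(x^2 * px) - mx^2)
sy <- sqrt(sum(y^2 * py) - my^2)
c(cov = cov.xy, rho = cov.xy / (sx * sy))     # rho = 0.301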

8.3.1 Independence of random variables

CONCEPT 6. Given a pair of random variables $(X, Y)$, we say $X$ and $Y$ are independent if:

Discrete case: for all $x, y \in \mathbb{R}$,

$$P(X = x, Y = y) = P(X = x)\,P(Y = y) \iff p_{X,Y}(x, y) = p_X(x)\, p_Y(y);$$

Continuous case: for all $x, y \in \mathbb{R}$,

$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$

For any $k \ge 2$, variables $X_1, X_2, \ldots, X_k$ are said to be mutually independent if,

1/ in the case of discrete variables: for every $(x_1, x_2, \ldots, x_k)$ we have

$$P[X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k] = P[X_1 = x_1]\, P[X_2 = x_2] \cdots P[X_k = x_k];$$

2/ in the case of continuous variables: for every $(x_1, x_2, \ldots, x_k)$ we have

$$f(x_1, x_2, \ldots, x_k) = f_1(x_1) f_2(x_2) \cdots f_k(x_k) = \prod_{i=1}^{k} f_i(x_i), \qquad (8.18)$$

where $f_i(x_i)$ is the marginal p.d.f. of $X_i$ (see Equation 8.11).

Knowledge box 4.

a/ If random variables are independent, then their correlation is zero.

b/ The converse is not true. Zero correlation does not imply independence.

c/ Expectation of a product of mutually independent random variables:

If $X_1, X_2, \ldots, X_k$ are mutually independent, then, for any integrable functions $g_1(X_1), g_2(X_2), \ldots, g_k(X_k)$,

$$\mathrm{E}\Big[\prod_{i=1}^{k} g_i(X_i)\Big] = \prod_{i=1}^{k} \mathrm{E}[g_i(X_i)]. \qquad (8.19)$$

8.3.2 IID sequence of random variables

A sequence of $n$ random variables $X_i$ are identically distributed if they follow the same distribution as a common random variable $X$; more precisely, they have the same range $\mathrm{Range}(X)$ and the same p.d.f. $f_X(\cdot)$. We write $X_i \sim X$.

A sequence of random variables $X_i$ are independently and identically distributed (written I.I.D. or i.i.d.) if they are both mutually independent and identically distributed. We write $X_i \sim_{i.i.d.} X$.

8.4 Conditional distributions

8.4.1 Conditional probability density function (p.d.f.)

If 𝑋 and 𝑌 are two random variables having a joint p.d.f. 𝑓 (𝑥, 𝑦) = 𝑓𝑋,𝑌 (𝑥, 𝑦), and marginal ones,
𝑓𝑋 (𝑥), 𝑓𝑌 (𝑦), respectively, then the conditional p.d.f. of 𝑌 given 𝑋 = 𝑥, where 𝑓𝑋 (𝑥) > 0, is defined to
be
𝑓𝑌 (𝑦|𝑥) = 𝑓𝑌 |𝑋 (𝑦|𝑋 = 𝑥) = 𝑓 (𝑥, 𝑦) / 𝑓𝑋 (𝑥). (8.20)

Similarly, the conditional p.d.f. of 𝑋 given 𝑌 = 𝑦 is

𝑓𝑋 (𝑥|𝑦) = 𝑓𝑋|𝑌 (𝑥|𝑌 = 𝑦) = 𝑓 (𝑥, 𝑦) / 𝑓𝑌 (𝑦).

Conditional distribution of 𝑋 known 𝑌 , with any 𝑦 ∈ S𝑌 is


𝐹𝑋 (𝑥|𝑦) = P[𝑋 ≤ 𝑥, 𝑌 = 𝑦] / P[𝑌 = 𝑦] = [ ∫_{−∞}^{𝑥} 𝑓𝑋,𝑌 (𝑠, 𝑦) 𝑑𝑠 ] / 𝑓𝑌 (𝑦), (8.21)

such that 𝑓𝑌 (𝑦) > 0. The relation between the conditional pdf and 𝐹𝑋 (𝑥|𝑦) is

𝑓𝑋 (𝑥|𝑦) = 𝑑𝐹𝑋 (𝑥|𝑦) / 𝑑𝑥.
• We also write
𝑝𝑋 (𝑥|𝑦) = 𝑝𝑋 (𝑥|𝑌 = 𝑦)

and 𝑝𝑌 (𝑦|𝑥) = 𝑝𝑌 (𝑦|𝑋 = 𝑥) for discrete random case.

• Write
𝑓𝑋 (𝑥|𝑦) = 𝑓𝑋 (𝑥|𝑌 = 𝑦)

and 𝑓𝑌 (𝑦|𝑥) = 𝑓𝑌 (𝑦|𝑋 = 𝑥) for continuous case.


When 𝑋 and 𝑌 are independent we replace 𝑓 (𝑥, 𝑦) = 𝑓𝑋 (𝑥) 𝑓𝑌 (𝑦) and get

𝑓𝑌 (𝑦|𝑥) = 𝑓𝑌 |𝑋 (𝑦|𝑋 = 𝑥) = 𝑓𝑌 (𝑦)


𝑓𝑋 (𝑥|𝑦) = 𝑓𝑋|𝑌 (𝑥|𝑌 = 𝑦) = 𝑓𝑋 (𝑥).
Example 8.5.

From Ex. 8.3, we compute the conditional pdf of 𝑌 , given 𝑋 = 𝑥, when 0 < 𝑥 < 1. By (8.20),

𝑓𝑌 |𝑋 (𝑦|𝑋 = 𝑥) = 𝑓𝑌 (𝑦|𝑥) = 1/(1 − 𝑥) when 0 < 𝑦 < 1 − 𝑥, and 0 otherwise.

This is a uniform distribution over (0, 1 − 𝑥), 0 < 𝑥 < 1. If 𝑥 ̸∈ (0, 1), the conditional p.d.f. does not exist.


8.4.2 Conditional expectation

The conditional expectation of 𝑌 given 𝑋 = 𝑥, denoted by

E(𝑌 | 𝑥) := E(𝑌 | 𝑋 = 𝑥)

is the expected value of 𝑌 with respect to the conditional p.d.f. 𝑓𝑌 (𝑦|𝑥), that is,

E[𝑌 | 𝑥] = E[𝑌 |𝑋 = 𝑥] = ∑_{𝑗} 𝑦𝑗 𝑝𝑌 (𝑦𝑗 |𝑥) if (𝑋, 𝑌 ) is discrete,
= ∫_{−∞}^{∞} 𝑦 𝑓𝑌 (𝑦|𝑥) 𝑑𝑦 if (𝑋, 𝑌 ) is continuous. (8.22)

Similarly, we can define the conditional variance of 𝑌 , given 𝑋 = 𝑥, as the variance of 𝑌 , with
respect to the conditional p.d.f. 𝑓𝑌 |𝑋 (𝑦 | 𝑥).

♣ OBSERVATION 2.

Notice that for a pair of random variables (𝑋, 𝑌 ), the conditional expectation E[𝑌 | 𝑋 = 𝑥] changes
with 𝑥, if 𝑋 and 𝑌 are dependent. Thus, we can consider E[𝑌 | 𝑋] to be a random variable, which is
a function of 𝑋. This is the foundation for defining regression models in Chapter 9 about Simple Linear
Regression.
Moreover, if we assume linearity between non-random predictor 𝑋 and 𝑌 in a model

𝑌 = 𝑓 (𝑋) = 𝛼 + 𝛽𝑋 + 𝜀 (8.23)

then the response variance V[𝑌 ] = 0 + V[𝜀] = V[𝜀] = 𝜎 2 . This condition turns out to be a key fact for
linear regression analysis in the subsequent parts.

8.5 Chapter’s Problem

Problem 8.1. (Joint distributions and marginal distributions)


The joint p.d.f. of two random variables (𝑋, 𝑌 ) is



𝑓 (𝑥, 𝑦) = 1/2, if (𝑥, 𝑦) ∈ 𝑆; and 𝑓 (𝑥, 𝑦) = 0, otherwise,

where 𝑆 is a square of area 2, whose vertices are (1, 0), (0, 1), (−1, 0), (0, −1).

1. Find the marginal p.d.f. of 𝑋 and of 𝑌 .

2. Find E(𝑋), E(𝑌 ), V(𝑋), V(𝑌 ).

Problem 8.2. (Component Lifetimes)

In an electronic assembly, let the random variables 𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 denote the lifetimes of four com-
ponents, respectively, in hours. Suppose that the joint probability density function of these variables
is
𝑓𝑋1 ,𝑋2 ,𝑋3 ,𝑋4 (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 ) = 9 × 10⁻¹² 𝑒^{−0.001𝑥1 − 0.002𝑥2 − 0.0015𝑥3 − 0.003𝑥4}

for 𝑥1 ≥ 0, 𝑥2 ≥ 0, 𝑥3 ≥ 0, 𝑥4 ≥ 0. What is the probability that the device operates for more than
1000 hours without any failures?

GUIDANCE for solving.

The requested probability is P[𝑋1 > 1000, 𝑋2 > 1000, 𝑋3 > 1000, 𝑋4 > 1000], which equals the
multiple integral of 𝑓𝑋1 ,𝑋2 ,𝑋3 ,𝑋4 (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 ) over the region 𝑥1 > 1000, 𝑥2 > 1000, 𝑥3 > 1000, 𝑥4 >
1000. The joint probability density function can be written as a product of exponential functions, and
each integral is the simple integral of an exponential function. Therefore,

P[𝑋1 > 1000, 𝑋2 > 1000, 𝑋3 > 1000, 𝑋4 > 1000] = 𝑒^{−1−2−1.5−3} = 𝑒^{−7.5} ≈ 0.00055.
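
Because the joint density factors into four exponential densities with rates 0.001, 0.002, 0.0015 and 0.003, this answer can be verified in R as a product of exponential tail probabilities (a minimal sketch):

rates <- c(0.001, 0.002, 0.0015, 0.003)   # the four exponential rates
prod(pexp(1000, rate = rates, lower.tail = FALSE))  # exp(-7.5), about 0.00055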

Chapter 9

Regression Analysis I
Simple Linear Regression

[Source [9]]

9.1 Introduction and Overview

In previous chapters, we were concerned with the distribution of one random variable, and also of a random vector: their parameters, expectation, variance, median, etc. In this chapter, we study relations among variables. This chapter covers the following.

1. Correlation and Regression Analysis in Section 9.2

2. Regression and Linear Regression in Section 9.3

3. Analysis of variance for regression in Section 9.4

4. Chapter’s Problems in Section 9.5

Learning Objectives

After careful study of this chapter, you should be able to

1. Understand concepts of co-variance and the coefficient of correlation of data samples

2. Calculate co-variance and the coefficient of correlation of data samples on software

3. Explain the model fitting process and key assumptions of regression

4. Formulate statistical linear model and regression analysis

5. Use the method of least square to fit regression lines to data

6. Check the adequacy of the model (goodness of fit) by analysis of variance for regression

When conducting a survey or experiment, measurements can be made on many characteristics of the elements observed in the sample. In this case, we obtain multivariate observations, and the statistical methods used to analyze the relationships between observed values on different variables are called multivariate methods.

• Linear regression models show the linear relationship between a variable of interest (the response variable or dependent variable) and a set of other variables, called observation variables or predictor variables. Accordingly, we wish to predict the values of the variable of interest.

• Chapter 11 on multivariate regression analysis will lay the foundations for more advanced data
analysis, being essential for experimental sciences such as chemical engineering, bio-medical sci-
ences or industrial production.

• Techniques for comparing the means (or other parameters) of many populations are grouped into a family of methods named Analysis of variance (ANOVA).


9.2 Correlation and Regression Analysis

Specific datasets
In the present chapter we start with numerical analysis of multivariate data. In order to illustrate the
ideas and enrich practical applications, we introduce here an industrial data set, called ALMPIN.csv.
The ALMPIN.csv set consists of 70 records on 6 variables measured on aluminum pins used in
airplanes. The aluminum pins are inserted with air-guns in pre-drilled holes in order to combine critical
airplane parts such as wings, engine supports and doors.

9.2.1 Co-variance and correlation of two samples

We now introduce a statistic which summarizes the simultaneous variability of samples obtained from two variables 𝑋 and 𝑌 . The statistic is called the sample covariance. It is a generalization of the sample variance statistic 𝑠𝑥² of one variable 𝑋.
Let 𝑥 = 𝑥1 , 𝑥2 , · · · , 𝑥𝑛 and 𝑦 = 𝑦1 , 𝑦2 , · · · , 𝑦𝑛 be two samples of the same size 𝑛, observed on variables 𝑋 and 𝑌 respectively; 𝑥 * 𝑦 denotes the inner product of the two vectors 𝑥, 𝑦.

• Denote the sample covariance of the two samples 𝑥 and 𝑦 by

𝑠𝑥𝑦 = [ ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) ] / (𝑛 − 1) = (𝑥 * 𝑦 − 𝑛 x̄ ȳ) / (𝑛 − 1). (9.1)

Note that 𝑠𝑥𝑥 is the sample variance 𝑠𝑥², and 𝑠𝑦𝑦 = 𝑠𝑦².

• The sample covariance 𝑠𝑥𝑦 can assume positive or negative values. If one of the variables, say 𝑋, assumes a constant value 𝑐, i.e. 𝑥𝑖 = 𝑐, ∀𝑖, then 𝑠𝑥𝑦 = 0. This can be immediately verified, since x̄ = 𝑐 and 𝑥𝑖 − x̄ = 0 for all 𝑖 = 1, · · · , 𝑛.

We also have the Schwarz inequality: it can be proven that, for any variables 𝑋 and 𝑌 ,

𝑠𝑥𝑦² ≤ 𝑠𝑥² 𝑠𝑦². (9.2)

Now, if we observe only two variables 𝑋 and 𝑌 , we can merge 𝑥 = 𝑥1 , 𝑥2 , · · · , 𝑥𝑛 and 𝑦 = 𝑦1 , 𝑦2 , · · · , 𝑦𝑛 into a single data sample

D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )}.

 CONCEPT 7. (Sample correlation)


By dividing 𝑠𝑥𝑦 by 𝑠𝑥 · 𝑠𝑦 we obtain a standardized index of dependence, which is called the sample correlation (Pearson's sample correlation), namely

𝑟𝑥𝑦 = 𝑠𝑥𝑦 / (𝑠𝑥 · 𝑠𝑦 ). (9.3)

Figure 9.1: A sample correlation from a small dataset

Correlations- two specific cases:

We always have −1 ≤ 𝑟 ≤ 1 for the sample correlation 𝑟 = 𝑟𝑥𝑦 . The two limit values, 𝑟 = 1 and 𝑟 = −1, are theoretical and occur when all data points of a scatter plot (diagram) lie on a straight line of the form

𝑌𝑖 = 𝛼 + 𝛽𝑥𝑖 ,

where 𝛼 and 𝛽 are constants.

* 𝛽 is positive for every correlation coefficient 𝑟 > 0, including 𝑟 = 1.
* Similarly, 𝛽 < 0 for every correlation coefficient 𝑟 < 0, including 𝑟 = −1.
From the Schwarz inequality, the sample correlation always assumes values between −1 and +1.

Example 9.1. [Industry- Manufacturing.] [?]

The measurements of data ALMPIN.csv were taken in a computerized numerically controlled (CNC)
metal cutting operation. The six variables are
Diameter 1, Diameter 2, Diameter 3,
Cap Diameter,


Lengthnocp and
Lengthwcp.
All the measurements are in millimeters.
The first three variables give the pin diameter at three specified locations. Cap Diameter is the
diameter of the cap on top of the pin. The last two variables are the length of the pin, without and with
the cap, respectively.

Computation on R .

To see the first five rows, we write R code in command line environment:
> library(mistat);
> data(ALMPIN)
> ALMPIN[1:5, ];

Table 9.1: Sample covariances of aluminum pins variables

                 Diam. 1    Diam. 2    Diam. 3    Cap Diam.   Lengthnocp   Lengthwcp
Diameter 1        0.0270
Diameter 2        0.0285     0.0329
Diameter 3        0.0255     0.0286     0.0276
Cap Diameter      0.0290     0.0314     0.0285     0.0358
Lengthnocp       −0.0139    −0.0177    −0.0120    −0.0110      0.1962
Lengthwcp        −0.0326    −0.0418    −0.0333    −0.0319      0.1503       0.2307

In Table 9.1 we present the sample covariances of the six variables measured on the aluminum
pins. Since
𝑆𝑥𝑦 = 𝑆𝑦𝑥

(covariances and correlations are symmetric statistics), it is sufficient to present the values at the
bottom half of Table 9.1 (on and below the diagonal).
In Table 9.2 we present the sample correlations in the data file ALMPIN.csv. We see that the
sample correlations between Diameter 1, Diameter 2 and Diameter 3 and Cap Diameter are all
greater than 0.9.

As we see in Figure 9.2 (the multivariate scatter plots) the points of these variables are scattered
close to straight lines.


Figure 9.2: Multiple scatterplots of the aluminum pins measurements.

Table 9.2: Sample correlations of aluminum pins variables

                 Diam. 1    Diam. 2    Diam. 3    Cap Diam.   Lengthnocp   Lengthwcp
Diameter 1        1
Diameter 2        0.958      1
Diameter 3        0.935      0.949      1
Cap Diam.         0.933      0.914      0.908      1
Lengthnocp       −0.191     −0.220     −0.163     −0.132        1
Lengthwcp        −0.413     −0.480     −0.417     −0.351        0.707        1

On the other hand, no clear relationship is evident between the first four variables and the length of the pin (with or without the cap). Negative correlations usually indicate that the points are scattered around a straight line having a negative slope. In the present case it seems that the magnitude of these negative correlations is due to one outlier (pin # 66).
To get Figure 9.2, write R code:

> data(ALMPIN)
> plot(ALMPIN)

PRACTICE 2.


If we use the data below (obtained by the R code ALMPIN[1:5, ]):


diam1 diam2 diam3 capDiam lenNocp lenWcp
1 9.99 9.97 9.96 14.97 49.89 60.02
2 9.96 9.96 9.95 14.94 49.84 60.02
3 9.97 9.96 9.95 14.95 49.85 60.00
4 10.00 9.99 9.99 14.99 49.89 60.06
5 10.00 9.99 9.99 14.99 49.91 60.09
can you compute sample correlations between variables
a/ diam1 and diam2? b/ lenWcp and capDiam?
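
A minimal R sketch for these two questions, using only the five rows above (the values computed from 5 records will differ from the full-data correlations of Table 9.2):

diam1   <- c(9.99, 9.96, 9.97, 10.00, 10.00)
diam2   <- c(9.97, 9.96, 9.96, 9.99, 9.99)
capDiam <- c(14.97, 14.94, 14.95, 14.99, 14.99)
lenWcp  <- c(60.02, 60.02, 60.00, 60.06, 60.09)
cor(diam1, diam2)     # a/ sample correlation of diam1 and diam2
cor(lenWcp, capDiam)  # b/ sample correlation of lenWcp and capDiam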

9.2.2 Statistical prediction with field-trip observation based data

Forecasting is to comprehend the rules and trends of future research objectives, based on the analysis of information flows or actual data from the past and present. The forecast consists of 4 steps, named DPSF:

1. Data collection: The data must be accurate and complete.

2. Preliminary treatment:

- Remove unnecessary or inaccurate data; add missing data.

- Divide the dataset into two groups, one for parameter estimation and one for checking the model's goodness of fit (see step 3).

3. Select the method and build the regression (forecast) model:

- The method is chosen to match the data.

- Build the regression model so that the random error is smallest and the coefficient of determination 𝑅² is close to 1.

- The forecast error is verified on the test data set aside in step 2.

4. Forecast and/ or optimize :

- From the regression model we determine the predicted value.

- Analyze the results received.

Ways of modeling complex processes

Linear regression is the most common method for forecasting and optimizing in the process of Statistical Optimization above. We need a way to express our objective, which we call a model.

Mathematical model. A model is termed mathematical if

it is derived from theoretical or mechanistic considerations


that represent exact, error-free assumed relationships among the variables.

This definition has the common trait that the response and predictor variables are assumed to be
free of specification error and measurement uncertainty.

Statistical model. A model is termed statistical if it is derived from data that are subject to various
types of specification, observation, experimental, and/or measurement errors.

Statistical Models are approximations to actual physical systems, and are subject to specification
and measurement errors.

Statistical regression model. Many variables observed in real life are related. The type of their re-
lation can often be expressed in a mathematical form called statistical regression model or just re-
gression.

Establishing and testing such a relation (a model) enables us:


– to understand interactions, causes, and effects among variables;
– to predict unobserved variables based on the observed ones;
– to determine which variables significantly affect the variable of interest.

Example 9.2.

Consumption Theory in Economics tells us that generally people increase their consumption expen-
diture 𝐶 when their after-tax (disposable) income 𝑌𝑑 increases, but not by as much as the increase in
their disposable income.
This can be stated in explicit linear equation form, mathematically as:

𝐶 = 𝑏0 + 𝑏1 𝑌𝑑

where 𝑏0 , 𝑏1 are unknown constants called parameters. But different people having the same dispos-
able income are likely to have somewhat different consumption expenditures. As a result, the above
deterministic relationship must be modified to include a random disturbance or error term 𝑢, making
it stochastic or statistical model:
𝐶 = 𝑏0 + 𝑏1 𝑌𝑑 + 𝑢.

9.2.3 Statistical model building and Model fitting

Statistical model building is an iterative process. We entertain a tentative model but we are ready
to revise it if necessary. Only when we are happy with our model should we stop. We can then use
our model, sometimes to understand our current set of data, sometimes to help us predict what may
happen in the future. We must be ready to translate what the model is telling us statistically to the client
with the real life problem. But how to build a model from data?


Example 9.3. In Ecology, how do brown creepers 𝑌 increase in relative abundance with increasing extent 𝑋 of late-successional forest in high-elevation areas of North America?

In a statistical model relating the abundance of brown creepers 𝑌 to the increasing extent 𝑋,

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀

we see 2 components. We want to determine the deterministic part while minimizing the stochastic part! The deterministic component must be found by estimating parameters from data.

Knowledge box 5. The process of building a model from data is called model fitting.

The model fitting process in this text generally involves four steps.


1/ Model specification- (see 2.2) a model is specified in two parts:

* an equation linking the response and explanatory variables, called a regression model;

* the probability distribution of the response variable.

2/ Parameter estimation- Estimation of the parameters of linear models, see Chapter 10

3/ Checking the adequacy of the model- how well it fits or summarizes the data, see Chapter 10.

4/ Inference- classically this involves calculating confidence intervals, testing hypotheses about the
parameters in the model and interpreting the results, see Chapter 4, 5, 6 and Appendix.

9.3 Regression and Linear Regression

What is Regression Analysis?

Regression analysis is the study of relationships between variables. It is one of the most useful tools for a business/industry/management analyst. Regression studies are classified by the number of explanatory variables involved in the analysis. In every regression study there is a single variable that we are trying to explain or predict, called the dependent variable or response variable.
To explain or predict the response, we use one or more explanatory variables (also called independent variables, regressor or predictor variables); that is, the explanatory variables are used to explain the response variable. For example, we can not only understand how a company's sales are affected by its advertising, but we can also use the company's records of current and past advertising levels to predict future sales.

The branch of statistics that studies such relationships is called regression analysis. Some poten-
tial uses of regression analysis in business include the following:

• How do wages of employees depend on years of experience, years of education, and gender? How
does the current price of a stock depend on its own past values, as well as the current and past
values of a market index?

• How does a company’s current sales level depend on its current and past advertising levels, the
advertising levels of its competitors, the company’s own past sales levels, and the general level of the
market?

• How does the selling price of a house depend on such factors as the appraised value of the house,
the square footage of the house, the number of bedrooms in the house, and others?

Each of these questions asks how a single variable, such as selling price or employee wages, de-
pends on other relevant variables. If we can estimate this relationship, then we can not only better


understand how the world operates, but we can also do a better job of predicting the variable in
question.

Briefly speaking, regression models relate a response variable to one or several predictors. Having observed the predictors, we can forecast the response by computing its conditional expectation (see Equation (8.22) of Section 8.4.2), given all the available predictors.

Definition 9.1. Basic terms and Standard assumptions of regression

• Response or dependent variable 𝑌 is a variable of interest that we predict based on one or several predictors.

• Predictors or independent variables 𝑋1 , 𝑋2 , · · · , 𝑋𝑘 are used to predict the values and be-
havior of the response variable 𝑌 .

• Regression of 𝑌 on 𝑋1 , 𝑋2 , · · · , 𝑋𝑘 is the conditional expectation

𝐺(𝑋) = 𝑌̂ := E[𝑌 |𝑋 = 𝑥] if 𝑘 = 1 (only one predictor 𝑋), or

𝐺(𝑋) = 𝑌̂ := E[𝑌 |𝑋1 = 𝑥1 , · · · , 𝑋𝑘 = 𝑥𝑘 ] if 𝑘 > 1.

𝑌̂ is a function, whose form can be estimated from data.

First, we define the statistical model for linear regression.

9.3.1 Statistical Model for Linear Regression Analysis

A simple regression analysis includes one explanatory variable, whereas multiple regression [see Chapter 11] includes any number of explanatory variables. We have seen examples before in which the relationship between two variables 𝑋 and 𝑌 is close to linear. This is the case when the (𝑥, 𝑦) points scatter along a straight line.

Definition 9.2. Suppose that we are given D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )}.

If the 𝑌 observations are related to those on 𝑋, according to the linear model

𝑦𝑖 = 𝑓 (𝑥𝑖 ) = 𝛼 + 𝛽𝑥𝑖 + 𝜀𝑖 , 𝑖 = 1, ..., 𝑛, (9.4)

where 𝛼 and 𝛽 are constant coefficients, and


𝜀𝑖 are random components, with zero mean and constant variance,
we say that 𝑌 relates to 𝑋 according to a statistical linear model.

• By Definition 9.1, linear regression model assumes that the conditional expectation

E[𝑌 |𝑋 = 𝑥] = 𝐺(𝑥) + E[𝜀] = 𝐺(𝑥) + 0 = 𝛼 + 𝛽 𝑥

is a linear function of 𝑥. As any linear function, it has an intercept 𝛼 and a slope 𝛽.


The coefficients 𝛼 and 𝛽 are called the regression coefficients. 𝛼 is the intercept and 𝛽 is the slope
coefficient.

• Generally, the coefficients 𝛼 and 𝛽 are unknown. We fit to the data points a straight line, which is
called the estimated regression line, or prediction line.

• The intercept
𝛼 = 𝐺(0)

equals the value of the regression function for 𝑥 = 0. Sometimes it has no physical meaning. For
example, nobody will try to predict the value of a computer with 0 random access memory (RAM), and
nobody will consider the national reserve rate in year 0. In other cases, intercept is quite important.

• The slope
𝛽 = 𝐺(𝑥 + 1) − 𝐺(𝑥)

is the predicted change in the response variable when predictor changes by 1. This is a very impor-
tant parameter that shows how fast we can change the expected response by varying the predictor.

A zero slope means absence of a linear relationship between 𝑋 and 𝑌 . In this case, the mean response stays at the constant 𝛼 as 𝑋 changes.

Example 9.4.

Model the effect of the quality 𝑥 of produced computers on the customer satisfaction 𝑦 of the public. We have a simple linear regression model, described by (renaming 𝛽0 = 𝛼, 𝛽1 = 𝛽)

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀.

The model 𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀 has two important parts:

• The mean 𝜇𝑦 = E[𝑌 |𝑋 = 𝑥] of the response variable 𝑌 changes as 𝑥 changes. The means all lie
on a straight line when plotted against 𝑥, that is

𝜇𝑦 = 𝛽0 + 𝛽1 𝑥.

• Individual responses of 𝑦 with the same 𝑥 vary according to a normal distribution. These normal
distributions all have the same standard deviation.

See Section 10.4 for concrete computation and analysis of the these two parts.

Example 9.5.

For a relationship which is influenced by two independent variables or regressor variables 𝑥1 , 𝑥2 .


Suppose we wish to develop an empirical model relating the viscosity 𝑦 of a polymer to the temperature 𝑥1 and the catalyst feed rate 𝑥2 .


The statistical model for linear regression: the mean response is a straight-line function of the explanatory variable.

A model that might describe this relationship is

𝑦 = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝜀 (9.5)

Taking the mean we get 𝜇𝑦 = E[𝑌 |𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ] = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 .

• This is a multiple linear regression model with independent variables, or regressor variables 𝑥1 , 𝑥2 .
This model describes a plane in the two-dimensional space of the regressor variables 𝑥1 , 𝑥2 .

• Termed linear since 𝑦 is a linear function of the unknown parameters or regression coefficients
𝛼, 𝛽1 , 𝛽2 .

Multiple linear regression models [see details in Chapter 11] become more complex when we add an interaction term between 𝑥1 , 𝑥2 to the 1st-order model (9.5), and get

𝑦 = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽12 𝑥1 𝑥2 + 𝜀 (9.6)

The full second-order response surface model in two regressors 𝑥1 , 𝑥2 is given by

𝑦 = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽11 𝑥1² + 𝛽22 𝑥2² + 𝛽12 𝑥1 𝑥2 + 𝜀 (9.7)

Taking the mean we get

𝜇𝑦 = E[𝑌 |𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ] = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽11 𝑥1² + 𝛽22 𝑥2² + 𝛽12 𝑥1 𝑥2 .
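
As a sketch, such models can be fitted in R with lm(); the data vectors x1, x2, y below are hypothetical, simulated only to illustrate the model formulas:

set.seed(1)                # hypothetical illustration data, not from this text's datasets
x1 <- runif(30, 20, 80)    # e.g. temperature
x2 <- runif(30, 1, 5)      # e.g. catalyst feed rate
y  <- 10 + 0.5*x1 + 2*x2 + 0.1*x1*x2 + rnorm(30)
fit1 <- lm(y ~ x1 + x2)                               # first-order model (9.5)
fit2 <- lm(y ~ x1 + x2 + x1:x2)                       # with interaction, model (9.6)
fit3 <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2)   # full second-order model (9.7)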

9.3.2 Fitting regression lines to data- least squares method

Consider the model


𝜇𝑦 = 𝛼 + 𝛽 𝑥 (9.8)


with respect to a single explanatory variable 𝑥, with intercept 𝛼 and slope 𝛽.


The data for a linear regression are observed values of 𝑦 and 𝑥, given in

D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )}

The model takes each 𝑥 to be a fixed known quantity. In practice, 𝑥 may not be exactly known. If the
error in measuring 𝑥 is large, more advanced inference methods are needed.

Generally, we have

Fundamental Equation for Regression

Observed Value = Fitted Value + Residual ⇔ 𝑦 = 𝑦̂︀ + 𝜀.

The response 𝑌 to a given 𝑥 is a random variable. The linear regression model describes the
conditional mean
𝑌̂︀ = E[𝑌 |𝑋 = 𝑥]

and standard deviation of this random variable 𝑌 . These unknown parameters 𝛽𝑗 must be estimated
from the data.

Least square estimation (Ordinary Least Square- OLS)

We now introduce the method of least squares by looking at the least squares geometry and dis-
cussing some of its algebraic properties.
Suppose that 𝑦̂︀ = 𝑎 + 𝑏𝑥 is the straight line fitted to the data. The principle of least squares requires
one to determine 𝑎 and 𝑏, the estimates of 𝛼 and 𝛽, which minimize the sum of squares of residuals

𝜀𝑖 = 𝑦𝑖 − 𝑦̂︀𝑖 = 𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖

around the line, that is minimizing


𝑆𝑆𝐸 := ∑_{𝑖=1}^{𝑛} 𝜀𝑖² = 𝜀1² + 𝜀2² + · · · + 𝜀𝑛².

We can now explain how to choose the best-fitting line through the points in the scatter-plot. It is the
line with the smallest sum of squared residuals. The resulting line is called the least squares line.
QUIZ: Why do we use the sum of squared residuals? Why not minimize some other measure
of the residuals?
The definitions of the following terms for the sum of squares and cross-products are useful for com-
putations in regression:


Figure 9.3: Residuals and the estimated regression line

𝑆𝑥𝑦 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) = (𝑛 − 1) 𝑠𝑥𝑦

and

𝑆𝑥² = 𝑆𝑥𝑥 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)² = (𝑛 − 1) 𝑠𝑥²,

both estimated from a data set D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )} of size 𝑛.

Computing 𝑎, 𝑏

First, it is not appropriate to simply minimize the sum of the residuals. This is because the positive
residuals would cancel the negative residuals.

• 𝜀𝑖 = 𝑦𝑖 − 𝑦̂︀𝑖 is viewed as a residual (deviation, random error), between observed responses 𝑦𝑖 and
their fitted values 𝑦̂︀𝑖 ,

• the method of least squares (OLS) looks for a line such that the sum

𝑆𝑆𝐸 = ∑_{𝑖=1}^{𝑛} 𝜀𝑖² = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑦̂𝑖 )² = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖 )² (9.9)

is the smallest. Dividing the two sides of the above equation by 𝑛 − 1 gives

𝑆𝑆𝐸 / (𝑛 − 1) = 𝑠𝑦² (1 − 𝑅𝑥𝑦²) + 𝑠𝑥² ( 𝑏 − 𝑆𝑥𝑦 /𝑆𝑥𝑥 )². (9.10)

Here

𝑆𝑥𝑦 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ),

𝑆𝑥² = 𝑆𝑥𝑥 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)², 𝑆𝑦² = 𝑆𝑦𝑦 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)²,

and 𝑠𝑥² = 𝑆𝑥𝑥 /(𝑛 − 1), 𝑠𝑦² = 𝑆𝑦𝑦 /(𝑛 − 1) are the sample variances;

all are estimated from data D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )}.

• If we write

𝑆𝑆𝑇 := ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)² (9.11)

for the sum of squared deviations between the actual values and the estimated mean, and

𝑆𝑆𝑅 := ∑_{𝑖=1}^{𝑛} (𝑦̂𝑖 − ȳ)² (9.12)

for the sum of squared regression errors, we get 𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑏 𝑆𝑥𝑦 , and also

𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸. (9.13)

The OLS requires us to find the estimates (estimated values) 𝑎 = 𝛼̂ of 𝛼 and 𝑏 = 𝛽̂ of 𝛽 which minimize the sum 𝑆𝑆𝐸 of the squared residuals (deviations) around the regression line. The expression 𝑄 = 𝑆𝑆𝐸 = ℎ(𝑎, 𝑏) depends on two variables 𝑎 and 𝑏, and we find its extremum by taking the partial derivatives

𝜕𝑄/𝜕𝑎 = −2 ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖 ),

𝜕𝑄/𝜕𝑏 = −2 ∑_{𝑖=1}^{𝑛} 𝑥𝑖 (𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖 )

of 𝑄 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖 )², equating them to 0, and solving the resulting equations for 𝑎 and 𝑏.


Figure 9.4: The sum 𝑆𝑆𝐸; the orange fitted line gives the minimum value of 𝑆𝑆𝐸

Knowledge box 6. The fitted linear model


We get estimates of 𝛼 and 𝛽 as

𝑎 = ȳ − 𝑏 x̄, (9.14)

the estimated intercept, and the estimated regression slope

𝑏 = 𝑆𝑥𝑦 / 𝑆𝑥² = 𝑟𝑥𝑦 (𝑆𝑦 / 𝑆𝑥 ), (9.15)

where 𝑆𝑥𝑦 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ), and (with 𝑠𝑥² the sample variance of 𝑋)

𝑆𝑥² = 𝑆𝑥𝑥 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)² = (𝑛 − 1) 𝑠𝑥². (9.16)

The fitted linear model is 𝐺(𝑋) = 𝑌̂ = 𝑎 + 𝑏𝑋.
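
A minimal R sketch verifying the formulas of this Knowledge box against lm(), on the built-in cars dataset (any paired sample would do):

x <- cars$speed; y <- cars$dist   # cars is a standard built-in R dataset
b <- cov(x, y) / var(x)           # slope, Equation (9.15): b = Sxy/Sxx
a <- mean(y) - b * mean(x)        # intercept, Equation (9.14)
c(a = a, b = b)
coef(lm(y ~ x))                   # lm() returns the same two values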

9.3.3 Computation and Practice

Computation on R :

The basic function for fitting ordinary linear models is lm(), and a streamlined version of the call in general is as follows:
fitted.model = lm( model, data = data.frame )
or, if we name our data frame D,
fitted.model = lm( model, data = D )

Details

Models for lm are specified symbolically. A typical model has the form
response ∼ terms
where response is the (numeric) response vector and
terms is a series of terms which specifies a linear predictor for response.

There are two cases:

1. Single predictor regression: model = 𝑦 ∼ 𝑥, so

fm1 <- lm(y ~ x, data = D)

2. If the dataset has two predictors 𝑥1 , 𝑥2 : model = 𝑦 ∼ 𝑥1 + 𝑥2 , so

fm2 <- lm(y ~ x1 + x2, data = D)


would fit a multiple regression model of 𝑦 on 𝑥1 and 𝑥2 (with implicit intercept).

Example 9.6. (World population)

We observed the world population from 1950 to 2010, with a survey carried out every 5 years, and want to fit the data by a linear regression model.
Figure 9.5: Dot diagram of the world population

> x = c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010);
# 13 years observing the world population from 1950 to 2010

> y = c(2558,2782,3043,3350,3712,4089,4451,4855,5287,5700,6090,6474,6864);
# the world population
> D = data.frame(x,y); # a table with 13 rows and 2 columns
> M1 <- lm(y ~ x, D); # fitted model
> summary(M1)

Call:
lm(formula = y ~ x, data = D)

Residuals:
Min 1Q Median 3Q Max
-107.08 -96.26 -12.29 62.90 223.55


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.422e+05 3.041e+03 -46.76 5.24e-14 ***
x 7.412e+01 1.536e+00 48.26 3.71e-14 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 103.6 on 11 degrees of freedom

Multiple R-squared: 0.9953,Adjusted R-squared: 0.9949


F-statistic: 2329 on 1 and 11 DF, p-value: 3.707e-14

Figure 9.6: Linear model of the world population in 1950–2010 and its regression forecast for 2015 and 2020

Hence the linear model is 𝐺(𝑥) = 𝑌̂︀ = −142201 + 74.1 𝑥.

We conclude that the world population grows at the average rate of 74.1 million every year. Note that the Multiple R-squared = 0.9953 is exactly 𝑟𝑥𝑦², the square of Pearson's correlation given in Equation (9.3).

The straight line- the regression line in Figure 9.6- that fits the observed data for years 1950–2010 predicts a population of 7.15 billion in 2015 and 7.52 billion in 2020. But what is the Adjusted R-squared? See Section 9.4 for discussion.
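
The forecasts quoted above can be reproduced with predict() on the fitted model M1, continuing the R session of Example 9.6 (a sketch):

predict(M1, newdata = data.frame(x = c(2015, 2020)))
# about 7152 and 7523 (millions), i.e. 7.15 and 7.52 billion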

Practice 1. (Understanding a linear regression)


Consider a linear regression model with 𝜇𝑦 = 40.5 − 2.5𝑥, and a standard deviation 𝜎 = 2.

1. What is the slope of the population regression line? Explain clearly what this slope says about the
change in the mean of 𝑦 for a change in 𝑥.

2. What is the subpopulation mean when 𝑥 = 10?

3. Between what 2 values would approximately 95% of the observed responses, 𝑦, fall when 𝑥 = 10?

9.4 Analysis of variance for regression

From this section, we primarily discuss Step 3 of Box 5. We will

• evaluate the goodness of fit of the chosen linear model to the observed data

D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )};

• estimate the response variance.

Motivation

Example 9.7. (Practical modeling house prices).

Seventy house sale prices in a certain county are depicted in Figure 9.7 along with the house area.
First, we see a clear relation between these two variables, and in general, bigger houses are more
expensive. However, the trend no longer seems linear.
Second, there is a large amount of variability around this trend. Indeed, area is not the only factor
determining the house price. Houses with the same area may still be priced differently. Then, how can
we estimate the price of a 3200-square-foot house?
We can estimate the general trend (the dotted line in Figure 9.7) and plug 3200 into the resulting
formula, but due to obviously high variability, our estimation will not be as accurate as in Example 9.6.
For example, there exists some variation among the house sale prices on Figure 9.7.
Why are the houses priced differently?

Well, the price depends on the house area, and bigger houses tend to be more expensive. So, to
some extent, variation among prices is explained by variation among house areas. However, two
houses with the same area may still have different prices. These differences cannot be explained by
the area. 


Figure 9.7: House sale prices and their footage. Does house price depend only on area?

9.4.1 ANOVA and the correlation coefficient R

Analysis of variance (ANOVA) - mostly based on probability distribution arguments - explores variation among the observed responses.
• A portion of this variation can be explained by predictors.
• The rest is attributed to “error.”
ANOVA technically uses the sums of squares 𝑆𝑆𝑇 , 𝑆𝑆𝑅 and 𝑆𝑆𝐸.

Three sums of squares and their relationship 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸

Total sum SST Recall that

𝑆𝑆𝑇 := ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)² = (𝑛 − 1) 𝑠𝑦² (9.17)

is the sum of squared deviations between the actual values and the estimated mean. 𝑆𝑆𝑇 is the variation of the 𝑦𝑖 about their sample mean, regardless of our regression model.

Regression sum SSR A portion of this total variation is attributed to predictor 𝑋 and the regression
model connecting predictor and response. This portion is measured by the sum of squared regres-


sion errors,

𝑆𝑆𝑅 = ∑_{𝑖=1}^{𝑛} (𝑦̂𝑖 − ȳ)² = 𝑏² 𝑆𝑥𝑥 = 𝑏² (𝑛 − 1) 𝑠𝑥² (9.18)

with 𝑆𝑥𝑥 given in Equation (9.16). This is the portion of total variation explained by the model.

Error sum SSE The rest of total variation is attributed to “random errors”

𝑑𝑖 = observed value − fitted value = 𝑦𝑖 − 𝑦̂︀𝑖 .

The error portion of total variation is then measured by the error sum of squares

𝑆𝑆𝐸 = ∑_{𝑖=1}^{𝑛} 𝑑𝑖² = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑦̂𝑖 )².

This is the portion of total variation not explained by the model.

Definition 9.3. [Unbiased estimator of the response variance]

1. The value 𝑆𝑆𝐸/(𝑛 − 1) is called the sample variance of the residuals around the least squares regression line, denoted by 𝑆𝑦|𝑥².

2. Moreover, the expectation of 𝑆𝑆𝐸 is

E[𝑆𝑆𝐸] = (𝑛 − 2) 𝜎²,

so an unbiased estimator of 𝜎² - the regression variance of the response 𝑌 - is

𝜎̂² = 𝑆𝑆𝐸 / (𝑛 − 2). (9.19)

See a proof in Section 10.3, Chapter 10.

The correlation coefficient 𝑅𝑥𝑦

The regression line - determined from Knowledge Box 6 - passes through the point (x̄, ȳ), where x̄, ȳ are the sample means of 𝑋 and 𝑌 . Due to Equation (9.10), at the least squares estimate (where 𝑏 − 𝑆𝑥𝑦 /𝑆𝑥𝑥 = 0) the value

𝑆𝑆𝐸 / (𝑛 − 1) = 𝑆𝑦|𝑥² = 𝑠𝑦² (1 − 𝑅𝑥𝑦²). (9.20)

• Here 𝑠𝑦² = 𝑆𝑦𝑦 /(𝑛 − 1) with 𝑆𝑦𝑦 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)², and 𝑆𝑦|𝑥² is the sample variance of the residuals around the least squares regression line.
By definition, 𝑆𝑦|𝑥² ≥ 0, hence 𝑅𝑥𝑦² ≤ 1, or −1 ≤ 𝑅𝑥𝑦 ≤ 1.


• 𝑅𝑥𝑦 = ±1 when 𝑆𝑦|𝑥² = 0. This is the case when all the points (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, · · · , 𝑛, lie on a straight line.

• If 𝑅𝑥𝑦 = 0, then the slope of the regression line is 𝑏 = 0 and 𝑆𝑦|𝑥² = 𝑠𝑦².

• 𝑅𝑥𝑦² is called the coefficient of determination.

9.4.2 Goodness of fit with the coefficient of determination

Definition 9.4 (Fisher).

The goodness of fit, meaning appropriateness of the predictor and the chosen regression model,
can be judged by the proportion 𝑅² of 𝑆𝑆𝑅 over 𝑆𝑆𝑇 . The coefficient of determination 𝑅² is the proportion of the total variation explained by the regression model,

𝑅² = 𝑆𝑆𝑅 / 𝑆𝑆𝑇 .

It is always between 0 and 1, with high values generally suggesting a good fit. In univariate regression,
R-square also equals the squared sample correlation coefficient, derived from Concept 7.

Besides, due to Equation (9.20) we see that

𝑅𝑥𝑦² = 1 − 𝑆𝑦|𝑥² / 𝑠𝑦², (9.21)

where

• 𝑠𝑦² = 𝑆𝑦𝑦 /(𝑛 − 1), and 𝑠𝑦 is the sample standard deviation of 𝑌 ;

• 𝑅𝑥𝑦² = 𝑅² (the coefficient of determination above) is the proportion of the variation in 𝑌 explained by the linear relationship 𝑦̂ = 𝑎 + 𝑏𝑥. Thus, 𝑅𝑥𝑦 (the correlation coefficient) measures the degree of linear relationship in the data.

• Linear regression line (or predictive line) can be used to predict the values of 𝑌 .

An “adjusted” 𝑅², defined as

𝑅𝑥𝑦²(adjusted) = 1 − (1 − 𝑅𝑥𝑦²) [ (𝑛 − 1)/(𝑛 − 2) ], (9.22)

will be more useful for explaining the determinants, especially when working with multiple regression models.

Example 9.8. (World population- continued)


Continuing Example 9.6, from Equations (9.17) and (9.18) we find

𝑆𝑆𝑇 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)² = (𝑛 − 1) 𝑠𝑦² = (12)(2.093 · 10⁶) = 2.512 · 10⁷,

𝑆𝑆𝑅 = ∑_{𝑖=1}^{𝑛} (𝑦̂𝑖 − ȳ)² = 𝑏² 𝑆𝑥𝑥 = (74.1)² (4550) = 2.500 · 10⁷,

𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑆𝑆𝑅 = 1.2 · 10⁵.

A linear model for the growth of the world population has a very high 𝑅-square of

𝑅² = 𝑆𝑆𝑅 / 𝑆𝑆𝑇 = 0.995 or 99.5%.
This is a very good fit although some portion of the remaining 0.5% of total variation can still be
explained by adding non-linear terms into the model.
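
These quantities can also be read off the R session of Example 9.6 (a sketch):

anova(M1)              # the rows give SSR (line "x") and SSE (line "Residuals")
summary(M1)$r.squared  # R^2 = SSR/SST, about 0.995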

• We could use the term Model adequacy when talking about the goodness of fit of a model.

• The standard deviation of the residuals (also called the residual standard error) around the regression line is 𝑆𝑒 , with

𝑆𝑒² = 𝑆𝑦𝑦 (1 − 𝑅𝑥𝑦²) / (𝑛 − 2). (9.23)

Here we see that 𝑆𝑒² = [ (𝑛 − 1)/(𝑛 − 2) ] 𝑆𝑦|𝑥².

Regression-based prediction is not perfect!

Figure 9.8: Overfitted models and linear model


Overfitting a model

Among all possible straight lines, the method of least squares chooses one line that is closest to the
observed data. Still, as we see in Figure 9.8.b, we had some residuals 𝑑𝑖 = (𝑦𝑖 − 𝑦̂︀𝑖 ) and some positive
sum of squared residuals. The straight line has not accounted for all 100% of variation among 𝑦𝑖 . Why,
one might ask, have we considered only linear models?
As long as all 𝑥𝑖 are different, can we always find a regression function 𝑌̂ (𝑥) that passes through all the observed points without any error? Then the sum ∑_{𝑖} 𝑑𝑖² = 0 will truly be minimized!

Trying to fit the data perfectly is a rather dangerous habit. Although we can achieve an excellent fit
to the observed data, it never guarantees a good prediction. The model will be overfitted, too much
attached to the given data. Using it to predict unobserved responses is very questionable.

Table 9.3: Data of catalyst 𝑋 and response 𝑌

TT   𝑋     𝑌      𝑋𝑌       𝑋 − X̄   𝑌 − Ȳ    (𝑋 − X̄)(𝑌 − Ȳ)   (𝑋 − X̄)²   (𝑌 − Ȳ)²
1    1.0   30.5    30.50   −2.25    −20.93        47.09          5.06      438.06
2    1.5   38.0    57.00   −1.75    −13.43        23.50          3.06      180.36
3    2.0   34.0    68.00   −1.25    −17.43        21.79          1.56      303.81
4    2.5   42.5   106.25   −0.75     −8.93         6.70          0.56       79.74
5    3.0   46.0   138.00   −0.25     −5.43         1.36          0.06       29.50
6    3.5   58.2   203.70    0.25      6.77         1.70          0.06       45.83
7    4.0   55.4   221.60    0.75      3.97         2.98          0.56       15.76
8    4.5   72.1   324.45    1.25     20.67        25.84          1.56      427.25
9    5.0   75.0   375.00    1.75     23.57        41.25          3.06      555.54
10   5.5   62.6   344.30    2.25     11.17        25.13          5.06      124.77

Sums (𝑛 = 10): ∑𝑥 = 32.5, ∑𝑦 = 514.3, ∑𝑥𝑦 = 1868.8, ∑(𝑋 − X̄)(𝑌 − Ȳ) = 197.34, ∑(𝑋 − X̄)² = 20.6, ∑(𝑌 − Ȳ)² = 2200.63; hence 𝑆𝑥𝑦 = 197.34/(𝑛 − 1) and 𝑆𝑥 = √(20.6/(𝑛 − 1)).

9.5 Chapter’s Problem

Problem 9.1. [Engineering.] Catalyst in chemical processes


Table 9.3 shows the concentration 𝑋 of a catalyst (the horizontal axis) and the response 𝑌 (the vertical axis), the amount of chemical detergent produced.


We get X̄ = 3.25, Ȳ = 51.43, and the correlation value 𝑟 = 𝑆𝑥𝑦 /(𝑆𝑥 · 𝑆𝑦 ) = ? Thus, are the catalyst concentration 𝑋 and the detergent output 𝑌 closely, weakly, or inversely correlated?
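
A minimal R sketch for this problem, with the catalyst data transcribed from Table 9.3:

x <- seq(1.0, 5.5, by = 0.5)   # catalyst concentration X
y <- c(30.5, 38.0, 34.0, 42.5, 46.0, 58.2, 55.4, 72.1, 75.0, 62.6)
cor(x, y)                      # sample correlation r = Sxy/(Sx*Sy)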

Problem 9.2. [Environmental Science.] Water quality management


Another example of positive or negative correlation is the association between variables measuring
water quality. A case study is taken from Chaopraya River in central Thailand, which is constantly
monitored for the control of pollution. The variables that are measured, among others, are the amounts
of dissolved oxygen, DO, and the biochemical oxygen demand, BOD, in the water. Dissolved oxygen¹ is required for the respiration of aerobic life forms such as fish.

The BOD is usually determined in a laboratory after a 5-day incubation of samples taken from the water. Sampling at 38 stations along the river gives the data set shown in the table of Figure 9.9, in mg/L (milligrams per liter). Compute the coefficient of correlation of DO and BOD. What is your comment?
HINT:
The coefficient of correlation from Eq. (9.3) is 𝑟 = 𝑆𝑥𝑦 /(𝑆𝑥 · 𝑆𝑦 ) = −0.90?

• As expected, the scatter diagram below of Figure 9.9 strongly indicates a negative type of correlation
(inverse correlation) with high values of DO associated with low values of BOD and vice versa.

• It suggests that the value of BOD can be estimated from a measurement of the DO. The scatter in
the diagram may be partly attributed to some inadequacies of the BOD test and partly to factors such
as temperature and rate of flow, which affect the DO. 

¹ The BOD denotes the amount of oxygen used in meeting the metabolic needs of aerobic micro-organisms in water, whether naturally occurring or resulting from sewage outflows and other discharges; thus, high values of BOD generally indicate high levels of pollution.


Dissolved oxygen (DO) and biochemical oxygen demand (BOD)


(milligrams per liter)
DO 8.15 5.45 6.05 6.49 6.11 6.46 6.22 6.05
BOD 2.27 4.41 4.03 3.75 3.37 3.23 3.18 4.08
DO 6.3 6.53 6.74 6.9 7.05 7.19 7.55 6.92
BOD 4 3.92 3.83 3.74 3.66 3.58 3.16 3.43
DO 7.11 7.28 7.44 7.6 7.28 7.44 7.59 7.73
BOD 3.36 3.3 3.24 3.19 3.22 3.17 3.13 3.08
DO 7.85 7.97 8.09 8.19 8.29 8.38 8.46 8.54
BOD 3.04 3 2.96 2.93 2.89 2.86 2.82 2.79
DO 8.62 8.69 8.76 9.26 9.31 9.35
BOD 2.76 2.73 2.7 2.51 2.49 2.46
Source: Table E.1.3, Nathabandu, Blackwell Publishing 2008

Figure 9.9: Scatter diagram of water quality data
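
A minimal R sketch for this problem, with the 38 DO/BOD pairs transcribed from the table above:

DO  <- c(8.15, 5.45, 6.05, 6.49, 6.11, 6.46, 6.22, 6.05,
         6.30, 6.53, 6.74, 6.90, 7.05, 7.19, 7.55, 6.92,
         7.11, 7.28, 7.44, 7.60, 7.28, 7.44, 7.59, 7.73,
         7.85, 7.97, 8.09, 8.19, 8.29, 8.38, 8.46, 8.54,
         8.62, 8.69, 8.76, 9.26, 9.31, 9.35)
BOD <- c(2.27, 4.41, 4.03, 3.75, 3.37, 3.23, 3.18, 4.08,
         4.00, 3.92, 3.83, 3.74, 3.66, 3.58, 3.16, 3.43,
         3.36, 3.30, 3.24, 3.19, 3.22, 3.17, 3.13, 3.08,
         3.04, 3.00, 2.96, 2.93, 2.89, 2.86, 2.82, 2.79,
         2.76, 2.73, 2.70, 2.51, 2.49, 2.46)
cor(DO, BOD)   # about -0.90, a strong inverse correlation
plot(DO, BOD)  # scatter diagram as in Figure 9.9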



Chapter 10

Regression Analysis II:


Inference for Regression

[Source [56]]

10.1 Introduction and Overview

We studied the simple linear regression model in the previous chapter. In this chapter, we study some more advanced techniques in regression analysis. Many problems in engineering and the sciences involve a study or analysis of the relationship between two or more variables. In many situations, the relationship between variables is not deterministic.

• For example, a) the electrical energy consumption of a house (𝑦) is related to the house’s size (𝑥),
but it is unlikely to be a deterministic relationship. Similarly, b) the fuel usage of an automobile (𝑦) is
related to the vehicle weight 𝑥, but the relationship is not a deterministic one.

• In both of these examples, the value of the response of interest 𝑦 (energy consumption or fuel
usage as mentioned above) can not be predicted perfectly from knowledge of the corresponding 𝑥.
It is possible in b) for different automobiles to have different fuel usage even if they weigh the same,
and it is possible in a) for different houses to use different amounts of electricity even if they are the
same size.

This chapter covers the following:

1. Empirical Models and Their Linearity in Section 10.2

2. Confidence intervals of regression coefficients in Section 10.3

3. Estimation of responses in Section 10.4

4. Adequacy of the Regression Model in Section 10.5

After careful study of this chapter, you should be able to

1. Test statistical hypotheses and construct confidence intervals on regression model parameters

2. Test linearity of predictors and response

3. Use the regression model

* to construct a confidence interval of the response mean

* to construct an appropriate prediction interval of a response

4. Analyze residuals to determine

whether the model is an adequate fit to the data

or whether any underlying assumptions are violated


10.2 Empirical Models and Their Linearity

Models for non-deterministic relationships

The models capturing the non-deterministic relationship of 𝑦 with 𝑥 are briefly named empirical models if we can find them based on observed datasets.
The collection of statistical tools that are used to model and explore relationships between many
variables that are related in a non-deterministic manner is called regression analysis.

Problems of this type occur so frequently in many branches of engineering and science that regression analysis is one of the most widely used statistical tools.

10.2.1 Essential role of realistic data in regression analysis

We introduce the use of regression analysis in specific domains below.


Consider two arbitrary populations 𝑋 and 𝑌 , where 𝑌 would depend on 𝑋. We will find out their
relationship, and discuss some relevant inferential matters.

Example 10.1. [Aviation Engineering.]

• Telecommunication satellites are powered while in orbit by solar cells. Tadicell, a solar cells producer
that supplies several satellite manufacturers, was requested to provide data on the degradation of its
solar cells over time.

• Tadicell engineers performed a simulated experiment in which solar cells were subjected to temper-
ature and illumination changes similar to those in orbit and measured the short circuit current ISC
(amperes) of solar cells at three different time periods, in order to determine their rate of degradation.

• In Table 10.1 we present the ISC values of 𝑛 = 16 solar cells, measured at three time epochs, one
month apart. The data is given in file SOCELL.csv.

The response 𝑌 of that empirical model measures the short circuit current ISC (amperes) of solar
cells at time point t2; while

the predictor 𝑋 measures the short circuit current ISC at time point t1.

• With concrete data, this study is about degradation of solar cells in telecommunication satellites,
the key aim was establishing an empirical model.

In Figure 10.1 we see the scatter of the ISC values at time point 𝑡1 and time point 𝑡2 .

The R computation for the empirical model follows.


Table 10.1: ISC values of solar cells at three time epochs

Cell   𝑋 = 𝑡1   𝑌 = 𝑡2   𝑡3
1       4.18     4.42    4.55
2       3.48     3.70    3.86
3       4.08     4.39    4.45
4       4.03     4.19    4.28
5       3.77     4.15    4.22
6       4.01     4.12    4.16
8       3.70     3.89    3.99
9       5.11     5.37    5.44
10      3.51     3.81    3.76
11      3.92     4.23    4.14
12      3.33     3.62    3.66
13      4.06     4.36    4.43
14      4.22     4.47    4.45
15      3.91     4.17    4.14
16      3.49     3.90    3.81

> data(SOCELL); X = SOCELL$t1; Y= SOCELL$t2; D = data.frame(X, Y);


> plot(D, xlab="X", ylab="Y");

> LmISC <- lm(Y ~ 1 + X, data=SOCELL)


> summary(LmISC)

Call:
lm(formula = Y ~ X, data = SOCELL)
Residuals:
Min 1Q Median 3Q Max


-0.145649 -0.071240 0.008737 0.056562 0.123051

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.53578 0.20314 2.638 0.0195 *
X 0.92870 0.05106 18.189 3.88e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08709 on 14 degrees of freedom


Multiple R-squared: 0.9594,Adjusted R-squared: 0.9565
F-statistic: 330.8 on 1 and 14 DF, p-value: 3.877e-11

Then our empirical model is 𝑌̂ = 𝐿𝑚𝐼𝑆𝐶 = 0.5357 + 0.9287 𝑋. 

Figure 10.1: Relationship of ISC values 𝑡1 and 𝑡2 .

♣ QUESTION 10.1. But some concerns can possibly be raised as:

1. What are Std.Error and t value? Why do they appear next to the estimates?

2. What are Multiple R-squared, Adjusted R-squared?

3. What are their roles in building models?


We answer these questions in this chapter and in Chapter 11.

10.2.2 Testing for Linearity

Key Arguments

• If the value of 𝑦 does not change linearly with the value of 𝑥, then using the mean value of 𝑦 is the best predictor for the actual value of 𝑦.

This implies the prediction 𝑦̂ = ȳ [the sample response mean] is preferable.

• If the value of 𝑦 does change linearly with the value of 𝑥, then using the regression model gives a better prediction for the value of 𝑦 than using the mean of 𝑦.

This implies the fitted model prediction 𝑦̂ is preferable.

Three Tests for Linearity

1. Testing the Coefficient of Correlation

𝐻0 : 𝜌 = 0: There is no linear relationship between 𝑋 and 𝑌 .

vs

𝐻1 : 𝜌 ̸= 0: There is a linear relationship between 𝑋 and 𝑌 .

2. Testing the Slope of the Regression Line (see Section 10.3.1)

𝐻0 : 𝛽1 = 0 : There is no linear relationship between 𝑋 and 𝑌 .

vs

𝐻1 : 𝛽1 ̸= 0: There is a linear relationship between 𝑋 and 𝑌 .

3. The Global F-test (see Section 10.3.4)

𝐻0 : There is no linear relationship between 𝑋 and 𝑌 .

vs 𝐻1 : There is a linear relationship between 𝑋 and 𝑌

10.3 Tests and estimations of regression coefficients

Methods of estimating a regression line and partitioning the total variation do not rely on any distribu-
tion; thus, we can apply them to virtually any data.
For further analysis, we introduce standard regression assumptions.


Assumption 1: Linearity between predictor and response, 𝑋 and 𝑌 .

(𝑋 could be family size, interest rate or a project input, number of drunk men per day in BKK, and
𝑌 could be electricity consumption, project return in investment, or number of traffic accidents in
Bangkok)

At the 𝑖-th observation, predictor 𝑋𝑖 is considered non-random, and we assume a linear relationship
between the two 𝑌𝑖 and 𝑋𝑖 of the form:

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖 , for each 𝑖 = 1, 2, . . . , 𝑛 (10.1)

where

• 𝑌𝑖 denotes the 𝑖-th observation of the response 𝑌 ,

• 𝑋𝑖 denotes the 𝑖-th observation on the predictor 𝑋, with E[𝑋𝑖 ] = 𝑥𝑖 , and

• 𝑒𝑖 is the error with E[𝑒𝑖 ] = 0, ∀𝑖 = 1, 2, . . . , 𝑛; here 𝑛 is the sample size.

Assumption 2: Normality of the responses: We therefore assume that observed responses 𝑌𝑖 are
independent normal random variables with mean

E[𝑌𝑖 ] = E[𝑌 |𝑋 = 𝑥𝑖 ] = 𝐺(𝑥𝑖 ) = 𝛽0 + 𝛽1 𝑥𝑖

and with constant variance 𝜎 2 = V[𝑌𝑖 ] = V[𝑌 ].

Assumption 3: Normality of regression coefficients:

The linear regression coefficients 𝛽0 , 𝛽1 have normal distributions, and their estimates are denoted by 𝑏0 , 𝑏1 , found in Knowledge Box 6.

After we estimate the variance 𝜎 2 , they can be studied by T-tests and T-intervals.

Degrees of freedom and variance estimation

According to Assumption 2, responses (𝑌1 , 𝑌2 , · · · , 𝑌𝑛 ) have different means but the same variance 𝜎 2 ,
that is V[𝑌𝑖 ] = 𝜎 2 . Let us estimate it.

1. First, we estimate each expectation E[𝑌𝑖 ] = 𝐺(𝑥𝑖 ) by

𝐺̂(𝑥𝑖 ) = 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 .

We consider the errors 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 and obtain the error sum of squares 𝑆𝑆𝐸 as below:

𝑆𝑆𝐸 = ∑_{𝑖=1}^{𝑛} 𝑒𝑖² = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑦̂𝑖 )² = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑏0 − 𝑏1 𝑥𝑖 )² (10.2)


2. Since 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸, the degrees of freedom of 𝑆𝑆𝑇 satisfy df𝑇 = df𝑅 + df𝐸 , where

• the regression sum of squares 𝑆𝑆𝑅 has df𝑅 = 1 degree of freedom (the dimension of the corresponding space (𝑋, 𝑌 ) is 1);

• the total sum of squares

𝑆𝑆𝑇 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)² = (𝑛 − 1) 𝑠𝑦²

has 𝑛 − 1 degrees of freedom, because it is computed directly from the sample variance 𝑠𝑦². So 𝑆𝑆𝐸 has df𝐸 = df𝑇 − df𝑅 = 𝑛 − 2 degrees of freedom. Obviously,

df𝐸 = sample size − number of estimated location parameters = 𝑛 − 2,

with 2 degrees of freedom deducted for the 2 estimated parameters 𝛽0 and 𝛽1 .

3. With this information, we now unbiasedly estimate the response variance 𝜎² = V[𝑌 ] by the sample regression variance

𝑠² = 𝜎̂² = 𝑆𝑆𝐸 / (𝑛 − 2). (10.3)

Notice that the usual sample variance

𝑠𝑦² = 𝑆𝑆𝑇 / (𝑛 − 1) = [ ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)² ] / (𝑛 − 1)

is biased because ȳ no longer estimates the expectation E[𝑌𝑖 ] of 𝑌𝑖 .

A standard way to present analysis of variance of the response is the ANOVA table.

Definition 10.1.

• RMSE: We see that the sample (estimated) regression variance is

𝑠² = 𝑀𝑆𝐸 = 𝑆𝑆𝐸 / (𝑛 − 2).

The estimated standard deviation 𝑠 is called the root mean squared error or RMSE.

• F-ratio: The F-ratio

𝐹 = 𝑀𝑆𝑅 / 𝑀𝑆𝐸

is used to test the significance of the entire regression model.

Example 10.2. [Transportation Science.]

A chairman of a city management council needs to know how much the happiness of citizens in his city depends on the number of traffic jams 𝑋 per month. Happiness will be measured by the number of extra hours 𝑌 that the city dwellers can spend on creative and relaxing activities (sport, cinema, ...) after work.

Table 10.2: Simple ANOVA table

Source   D.F.     S.S. (Sum of squares)                       M.S. (Mean squares)           F
Model    1        𝑆𝑆𝑅 = 𝑏² 𝑆𝑥𝑥 = ∑_{𝑖=1}^{𝑛} (𝑦̂𝑖 − ȳ)²      𝑀𝑆𝑅 = 𝑆𝑆𝑅                    𝑀𝑆𝑅/𝑀𝑆𝐸
Error    𝑛 − 2    𝑆𝑆𝐸 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑦̂𝑖 )²             𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 2) = 𝑠²
Total    𝑛 − 1    𝑆𝑆𝑇 = 𝑆𝑦𝑦 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − ȳ)²         𝑠𝑦² = 𝑆𝑆𝑇 /(𝑛 − 1)

In general, more traffic jams reduce the time for creative activities after work, and therefore lower the happiness index of urban citizens.

DATA: He conducted a survey to different classes of people, and got a sample

No. traffic jams (per month), 𝑥 6 7 7 8 10 10 15

No. extra hours (per month), 𝑦 40 55 50 41 17 26 16

Aim: The response variable here is the number of extra hours (𝑌 ) the dwellers can enjoy, and we attempt to predict it from the number of traffic jams (𝑋). More precisely, we will answer the questions:

1. Estimation of the regression line

2. Find ANOVA table and variance estimation

3. Test Goodness of fit via the coefficient of determination

4. Conduct Inference about the slope.

5. Compute ANOVA F-test.

1. Estimation of the regression line means building a linear model of 𝑌 from 𝑋.

We can start by computing 𝑛 = 7, x̄ = 9, ȳ = 35, 𝑆𝑥𝑥 = 56, 𝑆𝑥𝑦 = −232, 𝑆𝑦𝑦 = 1452.


Estimate the regression slope and intercept by

𝑏1 = 𝑆𝑥𝑦 / 𝑆𝑥² = −4.14 and 𝑏0 = ȳ − 𝑏1 x̄ = 72.3.

Then the estimated regression line has the equation 𝑦̂ = 72.3 − 4.14 𝑥.

Notice the negative slope. It means that when traffic jams increase by 1 case per month, we expect a reduction of 4.14 hours for creation after work.

2. ANOVA table and variance estimation. Let us compute all components of the ANOVA. We have 𝑆𝑆𝑇 = 𝑆𝑦𝑦 = 1452, partitioned into

𝑆𝑆𝑅 = 𝑏1² 𝑆𝑥𝑥 = 961 and 𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑆𝑆𝑅 = 491.

Simultaneously, the 𝑛 − 1 = 6 degrees of freedom of 𝑆𝑆𝑇 are partitioned into df𝑅 = 1 and df𝐸 = 5 degrees of freedom. Filling in the rest of the ANOVA table:

Source   D.F.   S.S. (Sum of squares)         M.S. (Mean squares)               F
Model    1      𝑆𝑆𝑅 = 𝑏1² 𝑆𝑥𝑥 = 961          𝑀𝑆𝑅 = 𝑆𝑆𝑅 = 961                  𝑀𝑆𝑅/𝑀𝑆𝐸 = 9.79
Error    5      𝑆𝑆𝐸 = 491                     𝑠² = 𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 2) = 98.2
Total    6      𝑆𝑆𝑇 = 𝑆𝑦𝑦 = 1452

See the solutions of questions 2-5 in Example 10.3. 
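
The hand computations of this example can be verified in R (a minimal sketch):

x <- c(6, 7, 7, 8, 10, 10, 15)       # traffic jams per month
y <- c(40, 55, 50, 41, 17, 26, 16)   # extra hours per month
fit <- lm(y ~ x)
coef(fit)    # intercept 72.3, slope -4.14
anova(fit)   # SSR = 961, SSE = 491, MSE = 98.2, F = 9.79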


Recall from Knowledge Box 6 the following.


Point Estimate of the intercept and the regression slope

We estimate 𝛽0 by the estimated intercept 𝑏0 = ȳ − 𝑏1 x̄, and estimate 𝛽1 by the estimated regression slope

𝑏1 = 𝑆𝑥𝑦 / 𝑆𝑥² = 𝑟𝑥𝑦 (𝑆𝑦 / 𝑆𝑥 ), (10.4)

where 𝑆𝑥𝑦 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ), and (with 𝑠𝑥² the sample variance of 𝑋)

𝑆𝑥² = 𝑆𝑥𝑥 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)² = (𝑛 − 1) 𝑠𝑥². (10.5)

The fitted linear model is

𝐺(𝑥) = 𝑌̂ = E[𝑌 |𝑋 = 𝑥] = 𝑏0 + 𝑏1 𝑥. (10.6)

10.3.1 T-statistic for testing significance of a model

Having obtained the unbiased estimate 𝑠² = 𝑆𝑆𝐸/(𝑛 − 2) of the response variance 𝜎² in Equation (10.3), we can proceed with tests and confidence intervals for the regression slope 𝛽1 . As usual, we start with the estimator 𝑏1 of 𝛽1 and its sampling distribution.

Sampling distribution and confidence interval of the regression slope

The slope statistic is estimated by

𝑏1 = 𝑏 = 𝑆𝑥𝑦 / 𝑆𝑥² = [ ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) ] / 𝑆𝑥𝑥 = [ ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄) 𝑦𝑖 ] / 𝑆𝑥𝑥

[we can drop ȳ because it is multiplied by ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄) = 0].

According to the standard regression assumptions above, the y_i are normal random variables and the x_i are non-random. The estimated slope b_1, as a linear function of the y_i, is also normal, N(\mu_b, \sigma_b^2), with expectation and variance, respectively,

\mu_b = E[b_1] = \cdots = \beta_1,   (10.7)

\sigma_b^2 = V[b_1] = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{S_{xx}^2} \, V[y_i] = \frac{\sigma^2}{S_{xx}}.   (10.8)

Sampling distribution of the regression slope statistic

b_1 \sim N(\mu_b, \sigma_b^2), where

• the mean is \mu_b = E[b_1] = \beta_1 and the variance is \sigma_b^2 = \sigma^2 / S_{xx};

• the standard error of b_1 is estimated by S.E.(b_1) = s / \sqrt{S_{xx}}

with given data; therefore, we can use T-intervals and T-tests, with n - 2 degrees of freedom, due to Equation 10.3.

Confidence interval for slope

• The confidence interval CI with (1 - \alpha)100% confidence level for the slope \beta_1 is

CI(\beta_1) = b_1 \pm t_{n-2, \alpha/2} \, S.E.(b_1) = b_1 \pm t_{n-2, \alpha/2} \, \frac{s}{\sqrt{S_{xx}}}.   (10.9)

The estimator 𝑏1 can be shown to have minimum variance, and, because it is linear, it is called the
BLUE of 𝛽1 , that is, the best (signifying minimum variance) linear unbiased estimator.

Sampling distribution and confidence interval of the intercept

By similar arguments one can show that b_0 is the BLUE of \beta_0, as follows. Firstly, the Y_i are independent and have a constant variance \sigma^2; then

V[\bar{Y}] = \frac{\sigma^2}{n} \implies \hat{V}[\bar{y}] = \frac{s^2}{n}.

We find the CI with (1 − 𝛼)100% (confidence level) for the intercept 𝛽0 as

CI (𝛽0 ) = Point estimate 𝑏0 ± 𝑡𝑛−2, 𝛼/2 S.E.(𝑏0 ), (10.10)


here S.E.(b_0) = \sqrt{\hat{V}[b_0]}, where the estimated variance of \beta_0 is

\hat{V}[b_0] = \hat{V}[\bar{y} - b_1 \bar{x}] = \frac{s^2}{n} + \bar{x}^2 \hat{V}[b_1] = \frac{s^2}{n} + \bar{x}^2 \frac{s^2}{S_{xx}} = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right).   (10.11)

SUMMARY 1. On the distribution of the slope and intercept parameters

1/ Under the model assumptions of the prediction \hat{Y} = E[Y | X = x] = b_0 + b_1 x, we have

• the estimated slope b_1 \sim N(\beta_1, V[b_1]), where \hat{V}[b_1] = s^2 / S_{xx};

• the estimated intercept b_0 \sim N(\beta_0, V[b_0]), where \hat{V}[b_0] is given in (10.11):

\hat{V}[b_0] = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right), so S.E.(b_0) = \sqrt{\hat{V}[b_0]}.

2/ Interval estimate of the slope:

CI(\beta_1) = b_1 \pm t_{n-2, \alpha/2} \, \frac{s}{\sqrt{S_{xx}}}.

Interval estimate of the intercept: CI(\beta_0) = b_0 \pm t_{n-2, \alpha/2} \, S.E.(b_0).

• On testing for linearity of 𝑋 and 𝑌 [that also means testing significance of the model (10.6)] we
check whether 0 ∈ CI (𝛽1 ) or not.

If yes, accept 𝐻0 : 𝛽1 = 0 : there is no linearity between 𝑋 and 𝑌 .

If no, 0 ̸∈ CI (𝛽1 ) we accept 𝐻1 : 𝛽1 ̸= 0: there is a linear relationship between 𝑋 and 𝑌 .

• The test of 𝐻0 : \beta_1 = 0 is quite useful. When we accept 𝐻0, we substitute \beta_1 = 0 in the model; the 𝑋 term drops out and we are left with E(𝑌) = \beta_0.

An equivalent way of testing for linearity is testing the null hypothesis

𝐻0 : 𝛽1 = 0 𝑣𝑠 𝐻𝑎

using t-test, as in the next part.

10.3.2 Testing hypotheses on the regression slope

Theory: Continuing from the CI given in Equation 10.9, we can test the hypothesis 𝐻0 : \beta_1 = B about the regression slope, using the T-statistic

T = \frac{b_1 - B}{S.E.(b_1)} = \frac{b_1 - B}{s / \sqrt{S_{xx}}} \sim t[n-2]   (10.12)


where the t-distribution has n - 2 degrees of freedom, the degrees of freedom used in the estimation of \sigma^2. As always, the form of the alternative hypothesis determines whether it is a two-sided, right-tail, or left-tail test.

Argument: A non-zero slope 𝑏1 indicates significance of the model, relevance of predictor 𝑋 in the
inference about response 𝑌 , and existence of a linear relation among them. It means that a change
in 𝑋 causes changes in 𝑌 .

In the absence of such relation, E(𝑌 ) = 𝛽0 remains constant.

Solution: T-test and the method using P-value.

To see if 𝑋 is significant for the prediction of 𝑌 , test the null hypothesis

𝐻0 : 𝛽1 = 0 𝑣𝑠 𝐻𝑎

using the T-statistic

t_0 = \frac{b_1}{S.E.(b_1)} = \frac{b_1}{s / \sqrt{S_{xx}}} \sim t[n-2].   (10.13)

The P-value for a test of 𝐻0 against

• 𝐻𝑎 : 𝛽1 > 0 is P[𝑇 > 𝑡0 ],

• 𝐻𝑎 : 𝛽1 < 0 is P[𝑇 < 𝑡0 ],

• 𝐻𝑎 : 𝛽1 ̸= 0 is 2 P[𝑇 > |𝑡0 |].

When we accept 𝐻𝑎 : \beta_1 \neq 0, we conclude that the empirical model of 𝑌 and 𝑋 is significant.
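For instance, with the traffic-jam data above, the two-sided P-value of this test can be computed in R from the summary statistics already found (a minimal sketch):

b1 <- -232/56                  # slope estimate, -4.14
t0 <- b1 / sqrt(98.2/56)       # t0 = b1 / (s/sqrt(Sxx)) = -3.13
2 * pt(-abs(t0), df = 7 - 2)   # two-sided P-value, about 0.026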

10.3.3 Goodness of fit and Coefficient of determination

Definition 10.2.

The goodness of fit means the appropriateness of the predictor and the chosen regression model. This appropriateness can be judged by the coefficient of determination 𝑅2, defined to be the proportion of the total variation explained by the regression model,

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.

We always have 0 \le R^2 \le 1, with high values generally suggesting a good fit.

This concept is extremely useful in the next example on Transportation Science, and also for multiple regression in the next chapters.

Knowledge box 7. Properties of Coefficient of determination


• We often refer loosely to 𝑅2 as


* the proportion of the total variation explained by the model, or
* the amount of variability in the data explained or accounted for by the regression model.

• In univariate regression, R-square also equals the squared sample correlation coefficient, r_{xy} = \frac{s_{xy}}{s_x \, s_y}.
• The statistic 𝑅2 should be used with caution because it is always possible to make 𝑅2 unity by
simply adding enough terms to the model. In general, 𝑅2 will increase if we add a variable to the
model, but this does not necessarily imply that the new model is superior to the old one.

• There are several misconceptions about 𝑅2 . In general, 𝑅2 does not measure the magnitude of
the slope of the regression line. A large value of 𝑅2 does not imply a steep slope. Furthermore, 𝑅2
does not measure the appropriateness of the model because it can be artificially inflated by adding
higher order polynomial terms in 𝑥 to the model.

Example 10.3. [Transportation Science.]

Continued from Example 10.2 with data

No. traffic jams (per month), 𝑥 6 7 7 8 10 10 15

No. extra hours (per month), 𝑦 40 55 50 41 17 26 16

In general, more traffic jams reduce the time for creative activities after work and, therefore, lower the happiness index of urban citizens.
Aim: The response variable here is the number of extra hours (𝑌) the dwellers can enjoy, and we attempt to predict it from the number of traffic jams (𝑋).

1. Estimation of the regression line: Done.

2. ANOVA table and variance estimation: Done

3. Goodness of fit via the coefficient of determination

4. Inference about the slope.

5. ANOVA F-test.

GUIDANCE for solving.

1. Estimation of the regression line. The estimated regression line, from Example 10.2, has the equation y = 72.3 - 4.14 x.
Notice the negative slope: each additional traffic jam per month is expected to reduce the time for creative activities after work by 4.14 hours.


2. ANOVA table and variance estimation. The completed ANOVA table is:

Source   D.F.   S.S. (sum of squares)         M.S. (mean squares)            F
Model    1      SSR = b_1^2 S_{xx} = 961      MSR = SSR = 961                MSR/MSE = 9.79
Error    5      SSE = 491                     s^2 = MSE = SSE/(n-2) = 98.2
Total    6      SST = S_{yy} = 1452

3. Goodness of fit via the coefficient of determination:

The regression variance \sigma^2 and R-square (by Definition 10.2) are estimated by

s^2 = MSE = 98.2  and  R^2 = \frac{SSR}{SST} = \frac{961}{1452} = 0.662.
That is, only 66.2% of the total variation of the number of extra relaxing hours 𝑌 is explained by the
number of traffic jams (per month) 𝑋.

We solve questions 4. and 5 later.

10.3.4 ANOVA F-statistic for testing significance of a model

Recall that the single linear model

E[𝑌 ] = 𝜇𝑦 = 𝛽0 + 𝛽1 𝑥 (10.14)

is significant as long as its slope 𝛽1 is not zero, i.e. the null 𝐻0 : 𝛽1 = 0 is rejected.

A more universal, and therefore, more popular method of testing significance of a model is the
ANOVA F-test. It compares
the portion of variation explained by regression with
the portion that remains unexplained.

• Significant models explain a relatively large portion.

• Each portion of the total variation is measured by the corresponding sum of squares,

𝑆𝑆𝑅 for the explained portion (here R = regression) and

𝑆𝑆𝐸 for the unexplained portion (error).

• Dividing each SS by its number of degrees of freedom, we obtain the mean squares, as given in Table 10.2:

MSR = SSR,   MSE = \frac{SSE}{n-2}.
Knowledge box 8. F-test statistic

Hence, under the null hypothesis

𝐻0 : \beta_1 = 0,

MSR and MSE are independent, and their ratio

F = \frac{MSR}{MSE} = \frac{SSR}{s^2}

has an F-distribution with df_R = 1 and df_E = n - 2 degrees of freedom (d.f.).

SUMMARY 2.

ANOVA F-test is always one-sided and right-tail because only large values of the F-statistic show
a large portion of explained variation and the overall significance of the model. See Chapter ?? for
background of Fisher distribution.

10.3.5 Relationship of F-test and T-test

We now have two tests for the model significance, a T-test for the regression slope and the ANOVA
F-test. For the univariate regression, they are absolutely equivalent. In fact, the F-statistic equals the
squared T-statistic for testing 𝐻0 : 𝛽1 = 0, due to Equation 10.13

T^2 = \frac{b_1^2}{s^2 / S_{xx}} = \cdots = \frac{r^2 \, SST}{s^2} = \frac{SSR}{s^2} = F.   (10.15)

Hence, both tests give us the same result. Note that the T-statistic itself is

T = \frac{b_1}{\sqrt{s^2 / S_{xx}}}.   (10.16)

Example 10.4. [Transportation Science.] Continued from Example 10.3

No. traffic jams (per month), 𝑥 6 7 7 8 10 10 15

No. extra hours (per month), 𝑦 40 55 50 41 17 26 16


The response is the number of extra hours (𝑌) that city dwellers can enjoy, and we attempt to predict it from the number of traffic jams (𝑋).

1. Estimation of the regression line.


2. ANOVA table and variance estimation

3. Goodness of fit via the coefficient of determination

4. Inference about the slope?

5. ANOVA F-test?

GUIDANCE for solving.

1. Estimation of the regression line.

2. ANOVA table and variance estimation.

3. Goodness of fit via the coefficient of determination.

4. Inference about the slope. Is the slope statistically significant? That is, does the data favor 𝐻𝑎 : \beta_1 \neq 0?

Solution: find 2 P[T > |t|], where the t-score is

t = \frac{b_1}{S.E.(b_1)} = \frac{b_1}{s / \sqrt{S_{xx}}},

and compute the P-value P = 2 P[T > |t|] = P[T < -|t|] + P[T > |t|].

In the Transportation Science example, does the number of extra relaxing hours 𝑌 really depend on the number of traffic jams 𝑋? We test the null hypothesis 𝐻0 : \beta_1 = 0 by computing the T-statistic as in Equation 10.16:

t = \frac{b_1}{\sqrt{s^2 / S_{xx}}} = -3.13.

Checking the T-distribution table with 5 d.f., we find that the P-value
𝑝 = 2 P[𝑇 > 3.13] for the two-sided test is between 0.02 and 0.04.

We conclude that the slope is moderately significant. Precisely, it is significant (𝛽1 ̸= 0) at any level
𝛼 ≥ 0.04 and not significant at any 𝛼 ≤ 0.02.

5. ANOVA F-test. We know from Equation 10.15 that a similar result can be found by the F-test. From Equation (10.15), the F-statistic F = 9.79 = t^2 is not significant at the 0.025 level, but significant at the 0.05 level. 
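The F-test above can be reproduced numerically in R; the critical values confirm both significance statements (a minimal sketch):

F0 <- 961/98.2                                # F = MSR/MSE = 9.79
pf(F0, df1 = 1, df2 = 5, lower.tail = FALSE)  # P-value, about 0.026
qf(0.95, 1, 5)     # 6.61  < F0, so significant at level 0.05
qf(0.975, 1, 5)    # 10.01 > F0, so not significant at level 0.025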

10.4 Estimation of responses using regression

A major application of regression analysis is making forecasts, predictions of the response variable 𝑌
based on the known or controlled predictors 𝑋.


Figure 10.2: Panel (c) gives the two-sided P-value of the T distribution

Confidence Interval for the Mean of 𝑌 : places an upper and lower bound around the point estimate
for the mean (average value) of 𝑌 given 𝑋 = 𝑥.

Prediction Interval for an Individual 𝑌 : places an upper and lower bound around the point estimate
for an individual value of 𝑌 given 𝑋 = 𝑥.

Let x_* be the value of the predictor 𝑋. The corresponding value of the response 𝑌 is

\hat{y}_* = \hat{G}(x_*) = b_0 + b_1 x_*.

In Example 9.6, we got the linear model

G(x) = \hat{Y} = -142201 + 74.1 x,

so the predicted population in 2015 is \hat{G}(2015) = -142201 + 74.1 \times 2015 = 7152 million people. As happens with any forecast, our predicted values are understood as the most intelligent guesses, and not as guaranteed exact sizes of the population during these years.

Question 2.
a/ How reliable are regression predictions, and
b/ how close are they to the real true values?

As a good answer, we can


1. construct a (1 − 𝛼) 100% confidence interval for the expectation

𝜇* = E[𝑌 |𝑋 = 𝑥* ]

2. and compute a (1 − 𝛼) 100% prediction interval for the actual value of 𝑌 = 𝑦* when 𝑋 = 𝑥* .

10.4.1 Confidence interval for the mean of responses

The expectation
𝜇* = E[𝑌 |𝑋 = 𝑥* ] = 𝐺(𝑥* ) = 𝛽0 + 𝛽1 𝑥* (10.17)

is a population parameter. This is the mean response for the entire sub-population of units
where the independent variable 𝑋 equals 𝑥* .

For example, it corresponds to the average price of all houses with the area 𝑥* = 2500 square feet.

HOW TO DO IT? Given data D = {(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)}, we follow the steps below. First, we estimate \mu_* by

\hat{y}_* = b_0 + b_1 x_* = \bar{y} - b_1 \bar{x} + b_1 x_* = \bar{y} + b_1 (x_* - \bar{x}) = \sum_{i=1}^{n} m_i \, y_i,

with m_i = \frac{1}{n} + \frac{(x_i - \bar{x})(x_* - \bar{x})}{S_{xx}}, due to b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{S_{xx}} and \bar{y} = \frac{\sum_i y_i}{n}.

We see again that the estimator \hat{y}_* is a linear function of the responses y_i. Then, under the standard regression assumptions, \hat{y}_* is normal, with expectation

E[\hat{y}_*] = E[b_0 + b_1 x_*] = \beta_0 + \beta_1 x_* = \mu_*

(it is unbiased), and variance

V[\hat{y}_*] = \left[ \frac{1}{n} + \frac{(x_* - \bar{x})^2}{S_{xx}} \right] \sigma^2.   (10.18)

Then we estimate the regression variance \sigma^2 by s^2 and obtain the following confidence interval for the mean of responses.

Confidence interval for the mean of responses

(1 - \alpha)100% CI of \mu_* = b_0 + b_1 x_* \pm t_{n-2, \alpha/2} \, s \sqrt{h},

where

h = \frac{1}{n} + \frac{(x_* - \bar{x})^2}{S_{xx}}.


10.4.2 Prediction interval for the individual response

Often we are more interested in predicting the actual response rather than the mean of all possible
responses. For example, we may be interested in the price of one particular house that we are planning
to buy, not in the average price of all similar houses. Instead of estimating a population parameter, we
are now predicting the actual value of a random variable.

Definition 10.3.

An interval [𝑎, 𝑏] is a (1 − 𝛼) 100% prediction interval for the individual response 𝑌 corresponding
to predictor 𝑋 = 𝑥* if it contains the value of 𝑌 with probability (1 − 𝛼),

P[𝑎 ≤ 𝑌 ≤ 𝑏 | 𝑋 = 𝑥* ] = 1 − 𝛼 (10.19)

This time, all three quantities, 𝑌, a, and b, are random variables. We predict 𝑌 by \hat{y}_* = b_0 + b_1 x_*, and estimate the standard deviation

Std(Y - \hat{y}_*) = \sqrt{V[Y] + V[\hat{y}_*]} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_* - \bar{x})^2}{S_{xx}}}   (10.20)

by the standard error

S.E.(Y - \hat{y}_*) = s \sqrt{1 + \frac{1}{n} + \frac{(x_* - \bar{x})^2}{S_{xx}}}.
We get

Prediction interval for an individual response

A (1 - \alpha)100% prediction interval for the individual response 𝑌 when X = x_* is

(1 - \alpha)100% PI = b_0 + b_1 x_* \pm t_{n-2, \alpha/2} \, s \sqrt{k},   (10.21)

where k = 1 + \frac{1}{n} + \frac{(x_* - \bar{x})^2}{S_{xx}}.
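In R, both intervals are returned by predict() applied to a fitted lm object; a minimal sketch with the traffic-jam data of Example 10.2 (the choice x_* = 9 is ours, for illustration):

fit <- lm(y ~ x, data = data.frame(x = c(6, 7, 7, 8, 10, 10, 15),
                                   y = c(40, 55, 50, 41, 17, 26, 16)))
new <- data.frame(x = 9)
predict(fit, new, interval = "confidence")   # CI for the mean response at x* = 9
predict(fit, new, interval = "prediction")   # wider PI for an individual response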

Elucidation: Several conclusions are apparent from this.

1. First, compare the variance in Equation (10.18) with the standard deviation in (10.20).

The response 𝑌 that we are predicting contributes its own variance. This is the difference between a confidence interval for the mean of all responses and a prediction interval for an individual response.
Predicting an individual value is a more difficult task, so the prediction interval is always wider than the confidence interval for the mean response. More uncertainty is involved, and as a result, the margin of a prediction interval is larger than the margin of a confidence interval.


2. Second, we get more accurate estimates and more accurate predictions from large samples. When
the sample size 𝑛 (and therefore, typically, 𝑆𝑥𝑥 ), tends to ∞, the margin of the confidence interval
converges to 0. Besides, the margin of a prediction interval converges to (𝑡𝛼/2 𝜎). As we collect
more observations, our estimates of 𝑏0 and 𝑏1 become more accurate; however, uncertainty about
the individual response 𝑌 will never vanish.

Figure 10.3: Illustration for prediction interval for the individual response

3. Prediction bands: For all possible values of a predictor 𝑥* , we can prepare a graph of (1 − 𝛼)
prediction bands given by Equation 10.21, see Figure 10.3. Then, for each value of 𝑥* , one can
draw a vertical line and obtain (1 − 𝛼) 100% prediction interval between these bands.

10.5 Adequacy of the Regression Model

Fitting a regression model requires making several assumptions.

• Estimating the model parameters requires assuming that the errors are uncorrelated random vari-
ables with mean zero and constant variance.

• Tests of hypotheses and interval estimation require that the errors be normally distributed.

• In addition, we assume that the order of the model is correct; that is, if we fit a simple linear regression model, we are assuming that the phenomenon actually behaves in a linear or first-order manner.

10.5.1 Major Assumptions for regression

We take a look back to the standard linear model (10.1)

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖 , 𝑖 = 1, 2, . . . , 𝑛. (10.22)


The residuals from a regression model are 𝑒ˆ𝑖 = 𝑦𝑖 − 𝑦ˆ𝑖 where 𝑦𝑖 is an actual observation and 𝑦ˆ𝑖 is
the corresponding fitted value from the regression model. Analysis of the residuals is helpful

• in checking the assumption that the errors are approximately normally distributed with constant
variance and

• in determining whether additional terms in the model would be useful.

Assumption 1: The disturbances (random errors) have zero mean, i.e., E[𝑒𝑖 ] = 0 for every 𝑖 =
1, 2, . . . , 𝑛. This assumption is needed to insure that on the average we are on the true line.

Assumption 2: The disturbances have a constant variance, i.e., V[𝑒𝑖 ] = 𝜎 2 for every 𝑖 = 1, 2, . . . , 𝑛.
This insures that every observation is equally reliable.

If V[𝑒𝑖 ] = 𝜎𝑖2 , each observation has a different variance. An observation with a large variance is less
reliable than one with a smaller variance.

Assumption 3: The disturbances are not correlated, i.e.,


E(𝑒𝑖 𝑒𝑗 ) = 0 for 𝑖 ̸= 𝑗; 𝑖, 𝑗 = 1, 2, . . . , 𝑛. Knowing the 𝑖-th disturbance does not tell us anything about
the 𝑗-th disturbance, for 𝑖 ̸= 𝑗.

Assumption 4: The explanatory variable (predictor) 𝑋 is nonstochastic, i.e., fixed in repeated sam-
ples, and hence, not correlated with the disturbances.
∑︀𝑛
Also, ( 𝑖=1 𝑥2𝑖 )/𝑛 ̸= 0 and has a finite limit as 𝑛 −→ ∞.

10.5.2 Residual analysis

• The analyst should always consider the validity of these assumptions to be doubtful and conduct
analyses to examine the adequacy of the model that has been tentatively entertained. In this section,
we discuss methods useful in this respect.

• As an approximate check of normality, the experimenter can construct a frequency histogram of the
residuals or a normal probability plot of residuals.

• Many computer programs will produce a normal probability plot of residuals, and because the sample
sizes in regression are often too small for a histogram to be meaningful, the normal probability plotting
method is preferred. It requires judgment to assess the abnormality of such plots.

• It is frequently helpful to plot the residuals

(1) in time sequence (if known), (2) against the 𝑦ˆ𝑖 , and

(3) against the independent variable 𝑥.

These graphs will usually look like one of the four general patterns shown in Figure 10.4.


Figure 10.4: Patterns of residuals


Patterns for residual plots are:
(a) Satisfactory. (b) Funnel. (c) Double bow. (d) Nonlinear.

Pattern (a) represents the ideal situation, and patterns (b), (c), and (d) represent anomalies. If the
residuals appear as in (b), the variance of the observations may be increasing with time or with the
magnitude of 𝑦𝑖 or 𝑥𝑖 .
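In R, the residual plots described above can be drawn directly from any fitted model object (a minimal sketch; fit stands for a fitted model such as lm(y ~ x)):

plot(fitted(fit), resid(fit), xlab = "fitted value", ylab = "residual")
abline(h = 0, lty = 2)                    # look for the patterns of Figure 10.4
qqnorm(resid(fit)); qqline(resid(fit))    # approximate check of normality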

Example 10.5. Look back to the data SOCELL.csv in the case study on Aviation Engineering of
Example 10.1 with sample info in Table 10.1.

In Table 10.3 we present the values of ISC at time 𝑡2 , 𝑦, and their predicted values, according to
those at time 𝑡1 and 𝑦ˆ. We present also a graph (Figure 10.5) of the residuals, 𝑒ˆ = 𝑦 − 𝑦ˆ, versus the
predicted values 𝑦ˆ. 

DISCUSSION.

1. The analysis from Example 10.1 has shown that the least squares regression (prediction) line is \hat{y} = 0.536 + 0.929 x. We read also that the coefficient of determination is R_{xy}^2 = 0.959. This means that only 4% of the variability in the ISC values at time period t_2 is not explained by the linear regression on the ISC values at time t_1.

2. Observation #9 is an “unusual observation.” It has relatively a lot of influence on the regression line, as can be seen in Figure 10.1. Here we see that

S_e^2 = \frac{n-1}{n-2} \, S_{y|x}^2.


The value of S_e^2 in the above analysis is 0.0076. The standard deviation of the residuals around the regression line is S_e = 0.08709. This explains the high value of R_{xy}^2.

Table 10.3: Observed and predicted values of ISC at time 𝑡2

𝑖 𝑦𝑖 𝑦̂︀𝑖 𝑒̂︀𝑖

1 4.42 4.419 0.0008

2 3.70 3.769 −0.0689

3 4.39 4.326 0.0637

4 4.19 4.280 −0.0899

5 4.15 4.038 0.1117

6 4.12 4.261 −0.1413

7 4.56 4.707 −0.1472

8 3.89 3.973 −0.0833

9 5.37 5.283 0.0868

10 3.81 3.797 0.0132

11 4.23 4.178 0.0523

12 3.62 3.630 −0.0096

13 4.36 4.308 0.0523

14 4.47 4.456 0.0136

15 4.17 4.168 0.0016

16 3.90 3.778 0.1218

3. Table 10.3 lists the observed and predicted ISC values at time t_2 together with the residuals \hat{e} = y - \hat{y}; Figure 10.5 plots these residuals against the predicted values \hat{y}.

4. If the simple linear regression explains the variability adequately, the residuals should be randomly distributed around zero, without any additional relationship to the regressor x.


Figure 10.5: Residual vs. predicted ISC values

10.6 Chapter’s Problem

Problem 10.1. (Tests in Simple Linear Regression)

Consider the following computer output.


The regression equation is Y = 26.8 + 1.48 x
Predictor Coef SE Coef T P
Constant 26.753 2.373 ? ?
X 1.4756 0.1063 ? ?
S = 2.70040 R-sq = 93.7% R-sq (adj) = 93.2%
Analysis of Variance
Source DF SS MS F P
Regression 1 ? ? ? ?
Residual error ? 94.8 7.3
Total 15 1500.0

(a) Fill in the missing information.


(b) Can you conclude that the model defines a useful linear relationship?
(c) What is your estimate of 𝜎 2 ?



Chapter 11

Regression Analysis III


Multiple Regression Analysis

[Source [56]]

11.1 Introduction

After careful study of this chapter, you should be able to do the following:

Learning Objectives

1. Use multiple regression methods to build empirical models

2. Employ the method of least squares in fitting multiple data

3. Build multiple regression models with polynomial terms

* Use stepwise regression in practice

Multiple Regression Analysis- Major topics

This chapter covers the following:

1. Setting for Multiple linear regression

2. Regression on few predictor variables

3. Aspects of Multiple Linear Regression

How can we build regression when there are many predictors?


In the present chapter we generalize the regression to cases where the variability of a variable 𝑌 of
interest can be explained, to a large extent, by the linear relationship between 𝑌 and (𝑝−1) predicting
or explaining variables 𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 .
Multiple regression analysis (MRA) is an important statistical tool for exploring the relationship between the response 𝑌 and the set of predictors 𝑋𝑖.
Applications of multiple regression analysis can be found in all areas of science and engineering.
This method of MRA, for instance, plays a meaningful role in the statistical planning and control of
 agriculture, and climate change science,
 industrial processes and manufacturing; economics, actuarial science;
 environment, ecology, transportation science, and public health.
In this chapter, practically we show how to

• fit multiple linear regression models,

• perform the statistical tests and confidence procedures that are analogous to those for simple linear
regression, and check for model adequacy.


11.1.1 Setting

• Each 𝑋𝑖 is called a predictor, there are 𝑝 − 1 ≥ 2 predictors.

• All predictors 𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 are continuous random variables.

• 𝑌 (the dependent variable) is called the response.

• The regression of 𝑌 on several predictors is called multiple regression.

• For multiple linear regression models, under uncertainty the response

Y = f(X_1, X_2, \cdots, X_{p-1}) + \varepsilon = \beta_0 + \sum_{j=1}^{p-1} \beta_j X_j + \varepsilon

is a linear function of the predictors X_1, X_2, \cdots, X_{p-1}. Generally we need n \ge p observations to fit the model with these p - 1 predictor variables.

Our Aim: to specify the relationship between 𝑌 and the 𝑋𝑗 by a linear function.

Definition 11.1 (Single and Multiple regression).

Regression of 𝑌 on 𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 is the conditional expectation,


𝑌̂︀ := E[𝑌 |𝑋 = 𝑥] if 𝑝 − 1 = 1 (single regression) or
𝑌̂︀ := E[𝑌 |𝑋1 = 𝑥1 , · · · , 𝑋𝑝−1 = 𝑥𝑝−1 ] if 𝑝 − 1 > 1 (multiple regression).

𝑌̂︀ is a function, whose form can be estimated from data.

Example 11.1. (Influence of advertising time)

Model the influence of advertising time 𝑥 on the number of positive reactions 𝑦 from the public. We
have a single linear regression model, described by

𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀.

Taking mean with condition 𝑋 = 𝑥 we get the linear model

𝑌ˆ = 𝜇𝑌 = E[𝑌 |𝑋 = 𝑥] = 𝛽0 + 𝛽1 𝑥.

Here p - 1 = 1, 𝑌 holds the number of positive reactions caused by the amount of advertising time x, and the number of observations must satisfy n \ge 2. 

Example 11.2.


Suppose we develop an empirical model linearly relating the viscosity 𝑌 of a polymer to two independent (regressor) variables X_1, X_2:
the temperature X_1 and the catalyst feed rate X_2.
Here p - 1 = 2, and a model that might describe this linear relationship is

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀 (11.1)

where 𝜀 is a random error term.

This model represents that the viscosity 𝑌 is influenced by the temperature 𝑋1 and the catalyst feed
rate 𝑋2 , plus a stochastic component 𝜀 [expressing uncertainty or uncontrolled noises of real world].

Taking mean with conditions 𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 we get the linear regression

𝑌ˆ = 𝜇𝑌 = E[𝑌 |𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ] = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 , (11.2)

(since regression of 𝑌 on 𝑋1 , 𝑋2 is a conditional expectation, by the above standard assumptions of


regression).

The regression coefficients. There are p = 3 coefficients in total.

• The parameter 𝛽0 is the intercept of the plane.

• We sometimes call 𝛽1 and 𝛽2 partial regression coefficients because


\beta_1 measures the expected change in 𝑌 per unit change in x_1 when x_2 is held constant, and \beta_2 measures the expected change in 𝑌 per unit change in x_2 when x_1 is held constant.

The term linear is used because Equation 11.2 is a linear function of 𝑝 = 3 unknown parameters
𝛽0 , 𝛽1 , 𝛽2 . The regression model (11.2) describes a plane in the three-dimensional space of 𝑌ˆ , 𝑥1
and 𝑥2 .

Figure 11.1 shows this plane for the regression model

𝑌ˆ = 50 + 10𝑥1 + 7𝑥2

where we have assumed that the expected value of the errors E(𝜀) = 0.

Interaction effects can appear in, and be analyzed via, a multiple linear regression model, e.g. by adding a cross-product term into Equation 11.2 to get

\hat{Y} = \mu_Y = E[Y | X_1 = x_1, X_2 = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2.

11.1.2 Computing regression coefficients

Similarly to the method of least squares (OLS) for a single predictor, here we look for a plane

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2


Geometric view of a linear regression model- a plane in 3D

Figure 11.1: The regression plane for the model 𝑌^ = 50 + 10𝑥1 + 7𝑥2

such that the sum of squared residuals \varepsilon_i,

SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2,   (11.3)

is the smallest. Here we use the index i = 1, 2, \ldots, n for numbering observations.

• The expression Q = SSE = h(\beta_0, \beta_1, \beta_2) depends on three unknowns \beta_0, \beta_1, \beta_2, and we find its extremum by taking partial derivatives of

Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2,

then equating them to 0: \partial Q / \partial \beta_0 = 0, \partial Q / \partial \beta_1 = 0, \partial Q / \partial \beta_2 = 0.

• These equations are called the least squares normal equations.

• The least squares estimates [shown explicitly in Equations (11.14), (11.15) and (11.16)]

b_0 = \hat{\beta}_0, b_1 = \hat{\beta}_1, b_2 = \hat{\beta}_2

of \beta_0, \beta_1, \beta_2 are the solution of these normal equations.

Note that there are p normal equations, one for each of the unknown regression coefficients. The normal equations can be solved by any method appropriate for solving a system of linear equations.


11.2 Multiple linear regression (MLR)

NOTATION

Suppose we are interested in p - 1 predictors X_1, X_2, \cdots, X_{p-1}. For any predictor X_j, when we have data comprised of n observations we write

x_{(j)} = (x_{1j}, x_{2j}, \ldots, x_{nj})'

for its column vector of n observations (i = 1, 2, \ldots, n).

Hence, x'_{(j)} = [x_{1j}, x_{2j}, \ldots, x_{nj}] is a row vector (for j = 1, 2, \ldots, p-1).

Matrix form of regression

A response variable 𝑌 may generally be related to p - 1 predictors (or independent variables) X_1, X_2, \cdots, X_{p-1}. Let us determine a possible relationship between the response 𝑌 and the predictors X_1, X_2, \cdots, X_{p-1}. After n \ge p observations, denote the observed responses y_i by an n \times 1 response vector

y = (y_1, y_2, \ldots, y_n)'

and an n \times p predictor matrix by X; our observed dataset is then

D = (y, X) = (y \;\; 1 \;\; x_{(1)} \cdots x_{(p-1)})

of n observations on p - 1 predictors, where 1 is the constant vector of all 1's.


In the matrix form we explicitly write
⎡ ⎤
⎢ 𝑦1 1 𝑥11 𝑥12 ··· 𝑥1𝑝−1 ⎥
⎢ ⎥
⎢ ⎥
⎢ 𝑦
⎢ 2 1 𝑥21 𝑥22 ··· 𝑥2𝑝−1 ⎥

(𝑦, X) = [𝑦 1 𝑥(1) · · · 𝑥(𝑝−1) ] = ⎢
⎢ .
⎥. (11.4)
⎢ . .. .. .. .. .. ⎥
⎢ . . . . . .


⎢ ⎥
⎣ ⎦
𝑦𝑛 1 𝑥𝑛1 𝑥𝑛2 ··· 𝑥𝑛𝑝−1


• 𝑥𝑖𝑗 is the value of the 𝑗-th predictor 𝑋𝑗 at the 𝑖-th observation (observation 𝑖 = 1, 2, . . . , 𝑛 and predictor
𝑗 = 1, 2, . . . , 𝑝 − 1);

• (X) is called the observed matrix of predictors (or predictor matrix).

• 𝑌 relates to the p - 1 predictors X_1, X_2, \cdots, X_{p-1} via a regression function

f: (X_1, X_2, \cdots, X_{p-1}) \longmapsto Y = f(X_1, X_2, \cdots, X_{p-1}),

and due to uncertainty, we get a generic regression model

𝑦 = 𝑓 (X) + 𝑒 (11.5)

We intend to obtain a good overall fit of the model and easy mathematical tractability. The most
mathematically tractable model 𝑓 is a linear one:

𝑌 = 𝑓 (X) + 𝑒 = X 𝛽 + 𝑒 = 𝛽0 + 𝑋1 𝛽1 + · · · + 𝑋𝑝−1 𝛽𝑝−1 + 𝑒

Why should multiple regression be linear?

1. A linear model, firstly, may serve as a suitable approximation to several nonlinear functional relationships.

2. Secondly, linearity makes our computations mathematically feasible.

3. Thirdly, simplicity is best: many relationships in reality just need a linear function of the predictors X_1, X_2, \cdots, X_{p-1} to describe the context. Linear models would guarantee the inclusion of important variables and the exclusion of unimportant variables.

11.2.1 Standard assumptions of multiple linear regression

Recall that a multiple linear model

𝑌 = 𝑓 (X) + 𝑒 = X 𝛽 + 𝑒 = 𝛽0 + 𝑋1 𝛽1 + · · · + 𝑋𝑝−1 𝛽𝑝−1 + 𝑒.

The 𝑖-th observation upon uncertainty is

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + · · · + 𝛽𝑝−1 𝑥𝑖(𝑝−1) + 𝑒𝑖 , 𝑖 = 1, 2, . . . , 𝑛


Assumption 1: The deviations (random errors) have zero mean, i.e., E[𝑒𝑖 ] = 0 for every
𝑖 = 1, 2, . . . , 𝑛. This assumption is needed to insure that on the average we are on the true
line.

Assumption 2: The deviations have a constant variance, i.e.,


V[𝑒𝑖 ] = 𝜎 2 . This insures that every observation is equally reliable.

If V[𝑒𝑖 ] = 𝜎𝑖2 , each observation has a different variance.

Assumption 3: The deviations are not correlated, i.e.,


E(𝑒𝑖 𝑒𝑗 ) = 0 for 𝑖 ̸= 𝑗; 𝑖, 𝑗 = 1, 2, . . . , 𝑛. Knowing the 𝑖-th disturbance does not tell us anything
about the 𝑗-th disturbance, for 𝑖 ̸= 𝑗.

Briefly, Assumptions 1-3 mean the deviations 𝑒𝑖 are an SRS from the N(0, 𝜎 2 ) distribution.

Assumption 4: The predictors X_j are nonstochastic, i.e., fixed in repeated samples, and hence not correlated with the random errors.

Example 11.3 (Describing a multiple regression).

As part of a recent study titled “Predicting Success for Actuarial Students in Undergraduate Mathe-
matics Courses,” data from 106 Mahidol Uni. actuarial graduates were obtained. The researchers
were interested in describing how students’ overall math grade point averages (GPA) are ex-
plained by SAT Math and SAT Verbal scores, class rank, and faculty of science’s mathematics
placement score.

• What is the response variable? What is 𝑛, the number of cases?

• What is 𝑝 − 1, the number of explanatory variables, i.e. predictors?

• What are the predictors (explanatory variables)?

HINTS: The response variable 𝑌 is student’s overall math GPA; 𝑝 − 1 = 4. 

11.2.2 Statistical multiple linear model

The statistical multiple linear model 𝑓 should be

𝑌 = 𝑓 (X) = X 𝛽 + 𝑒 = 𝛽0 + 𝑋1 𝛽1 + · · · + 𝑋𝑝−1 𝛽𝑝−1 + 𝑒 (11.6)

or if we get data at 𝑛 observations, we can write

y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij} + e_i, \quad i = 1, 2, \ldots, n,


where 𝛽0 , 𝛽1 , 𝛽2 , · · · , 𝛽𝑝−1 are the linear regression coefficients, and 𝑒𝑖 are random errors, 𝑒𝑖 ∼
N(0, 𝜎 2 ), for 𝑖 = 1, 2, . . . , 𝑛.

With x_{(i)} denoting particular values of the predictors X_i, Model 11.6 implies that the conditional expectation E[Y | X_1 = x_{(1)}, X_2 = x_{(2)}, \ldots, X_{p-1} = x_{(p-1)}] of the response 𝑌 [see Definition 9.1] is generated by a linear combination of the predictor values: y = E[Y] = X\beta, or explicitly

E[𝑌 |𝑋1 = 𝑥(1) , 𝑋2 = 𝑥(2) , . . . , 𝑋𝑝−1 = 𝑥(𝑝−1) ] = 𝛽0 + 𝛽1 𝑥(1) + . . . + 𝛽𝑝−1 𝑥(𝑝−1) . (11.7)

More precisely, this true regression linear function is

y = X\beta =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{bmatrix}
\cdot
\begin{bmatrix}
\beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1}
\end{bmatrix}.   (11.8)

The vector \beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})' in Equation 11.7 collects unknown (scalar-valued) coefficients explaining

a) the direction (the slopes) and b) the magnitude of their influence on 𝑌,

where the magnitude of the \beta's indicates their importance in explaining 𝑌. The parameters of the model are \beta_0, \beta_1, \beta_2, \cdots, \beta_{p-1} and \sigma.

How to estimate 𝜎 2 ?

The parameter \sigma^2 measures the variability of the responses about the population regression equation. As in the case of simple linear regression, we estimate \sigma^2 by an average of the squared residuals. The estimator is

s^2 = \frac{SSE}{n-p} = \frac{\sum e_i^2}{n-p}.   (11.9)

• The quantity 𝑛 − 𝑝 is the degrees of freedom associated with 𝑠2 .

• The degrees of freedom equal the sample size, 𝑛, minus 𝑝, the number of regression coefficients 𝛽𝑗 ’s
we must estimate to fit the model.


How to estimate slope parameters?

We find values for the p parameters \beta_0, \beta_1, \ldots, \beta_{p-1} which minimize the sum of squared differences

\sum_{i=1}^{n} e_i^2 = e'e = (y - X\beta)'(y - X\beta),

where e = (e_1, e_2, \cdots, e_n)' \sim N(0, \sigma^2), with

e_i = y_i - x_i'\beta = y_i - [\beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}] for i = 1, \ldots, n.

Here, for each experimental observation i = 1, \ldots, n,

• x_i' \cdot \beta = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{i,p-1}\beta_{p-1}, where x_i' = [1, x_{i1}, \cdots, x_{i,p-1}] is the i-th row of the matrix X (recording the values of the predictors X_1, \cdots, X_{p-1});

• y_i is the observed value of the i-th experiment;

• e_i is called the residual or random error at the i-th experiment.

A successful choice of the regression parameter vector \beta is indicated by small values of all e_i. There are quite a few conceivable principles by which the quality of an actual choice of \beta may be evaluated. Among others, the following measures of the residual sum S(\beta) have been proposed:

* S(\beta) = \sum_{i=1}^{n} |e_i|; S(\beta) = \max_{i=1..n} |e_i|; or, using Euclidean distance,

S(\beta) = \sum_{i=1}^{n} e_i^2 = e'e.   (11.10)

The first two proposals are subject to either complicated mathematics or poor statistical properties; the last principle has become widely accepted, providing the basis for the famous method of least squares in Chapter 9.

OLS for Multiple RA

We can do least squares [similarly as in Section 9.3.2] to find estimates b_j = \hat{\beta}_j of the regression coefficients \beta_j that minimize (11.10), the sum of squared residuals,

S(\beta) = \sum_{i=1}^{n} e_i^2 = e'e = (y - X\beta)'(y - X\beta) = SSE,   (11.11)

given y and X in the data matrix (11.4). We rewrite S(\beta) as

S(\beta) = y'y + \beta'X'X\beta - 2\beta'X'y.

A minimum will always exist, as S(\beta) is a real-valued convex differentiable function.


 CONCEPT 8. A generalized inverse (g-inverse) of a matrix A, written A^-, is a matrix that satisfies A A^- A = A.

Computing the optimal value:

The first multivariate derivative with respect to the vector \beta gives

\frac{\partial S(\beta)}{\partial \beta} = 2X'X\beta - 2X'y.

The normal equations are

\frac{\partial S(\beta)}{\partial \beta} = 0 \iff X'X\beta = X'y.

This has the linear form A u = v, to be solved using a generalized inverse of A.

Remark that X is an n \times p matrix, so X'X is a p \times p matrix, and its rank is at most p.

When the rank m of X'X fulfills m = p, the matrix X'X is non-singular; then the estimated (best-fitting) coefficients b = \hat{\beta} are found by

b = \hat{\beta} = (X'X)^{-1} X'y.   (11.12)

Example 11.4. (Quadratic model)

Let us fit a second-order (quadratic) model in one regressor (p - 1 = 1),

y = \beta_2 x^2 + \beta_1 x + \beta_3,

to the observed data (n = 5):

x   0  1  2  3  4
y   1  0  3  5  8

Let us rewrite the quadratic model equation as follows:

Y = \beta_2 X_2 + \beta_1 X_1 + \beta_3 + \varepsilon, where X_1 = X, X_2 = X^2.

The fitted model is \hat{y} = \hat{\beta}_2 x^2 + \hat{\beta}_1 x + \hat{\beta}_3. Now this looks exactly like a multiple regression equation with two predictors. The design (observed data) matrix X (where the last column matches the coefficient \beta_3 and the second column matches the coefficient \beta_1), and its transpose, now are

DATA ANALYTICS- FOUNDATION


CHAPTER 11. REGRESSION ANALYSIS III
308 MULTIPLE REGRESSION ANALYSIS

X = [x_{(2)} \;\; x_{(1)} \;\; 1] =
\begin{bmatrix}
0 & 0 & 1 \\
1 & 1 & 1 \\
4 & 2 & 1 \\
9 & 3 & 1 \\
16 & 4 & 1
\end{bmatrix},
\qquad
X' =
\begin{bmatrix}
0 & 1 & 4 & 9 & 16 \\
0 & 1 & 2 & 3 & 4 \\
1 & 1 & 1 & 1 & 1
\end{bmatrix},

then the non-singular matrix X'X and its (generalized) inverse (X'X)^{-1} = ?


The least squares estimate \hat{\beta} = (\hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_3)' is therefore unique, computed by

\hat{\beta} = (X'X)^{-1} X'y = (0.5, -0.1, 0.6)'.

The quadratic model in one regressor 𝑋 is

Y = 0.5X^2 - 0.1X + 0.6 = 0.6 - 0.1X + 0.5X^2,

and the vector of fitted values \hat{y} = X\hat{\beta} = (0.6, 1, 2.4, 4.8, 8.2)', as compared with the vector of observations y = (1, 0, 3, 5, 8)'. 
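The computation in this example can be reproduced in R with elementary matrix operations (a minimal sketch):

x <- 0:4; y <- c(1, 0, 3, 5, 8)
X <- cbind(x^2, x, 1)                 # design columns x_(2), x_(1), 1 as above
b <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y = (0.5, -0.1, 0.6)'
X %*% b                               # fitted values (0.6, 1, 2.4, 4.8, 8.2)'
# equivalently: lm(y ~ I(x^2) + x)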

11.2.3 Principle of Least Squares- the multivariate case


Let b := \hat{\beta} as in Equation 11.12 and Example 11.4; then

b = (X'X)^{-1} X'y \iff (X'X) b = X'y.

An interesting algebraic property can be seen from the following.

Theorem 11.1. Vector 𝑏 minimizes the sum of squared errors if and only if it is a solution of (X′ X) 𝛽 =
X′ 𝑦. All solutions are located on the hyperplane X𝑏.

ˆ = (X′ X)−1 X′ y of the normal equations


• The solutions 𝑏 = 𝛽

X′ X 𝛽 = X′ 𝑦

are called empirical regression coefficients, or


empirical least squares estimates of 𝛽,

• and 𝑦ˆ = X𝑏 is called the empirical regression hyper-plane.


Fact 11.1. The following notes are useful later on.

• An important property of the sum of squared errors S(b) = \hat{e}'\hat{e} = SSE is

y'y = \hat{y}'\hat{y} + \hat{e}'\hat{e},   (11.13)

where \hat{e} denotes the residuals y - \hat{y} = y - Xb. This means that the sum of squared observations SST = y'y may be decomposed additively into the sum of squared fitted values SSR = \hat{y}'\hat{y}, explained by regression, and the unexplained part SSE := \hat{e}'\hat{e}.

• The slope parameter 𝛽𝑗 in a linear model

𝑌̂︀ = 𝑓 (X) = X 𝛽 + 𝑒 = 𝛽0 + 𝑋1 𝛽1 + · · · + 𝑋𝑝−1 𝛽𝑝−1 ,

for each 𝑗, represents the expected change in response 𝑌 per unit change in 𝑋𝑗 when all the re-
maining predictors 𝑋𝑖 (𝑖 ̸= 𝑗) are held constant.

Knowledge box 9.

Least squares estimates- Key properties

All the estimated coefficients b of multivariate regression given in Equation 11.12,

b = \hat{\beta} = (X'X)^{-1} X'y,

are

are

• linear functions of observed responses 𝑦1 , 𝑦2 , · · · , 𝑦𝑛 ,

• unbiased for the regression slopes because

E[b] = (X′ X)−1 X′ E[y] = (X′ X)−1 X′ X 𝛽 = 𝛽,

• normal if the response variable 𝑌 is normal.

11.3 Regression on few predictor variables

Multiple covariance and correlation coefficients

Recall that the sample correlation (Pearson's sample correlation) of two samples [or two variables] x and y is given as

r_{xy} = \frac{s_{xy}}{s_x \cdot s_y},

where

s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}

is their sample covariance.


11.3.1 Theory for the case of two predictors

The multiple regression linear model, in the case of two predictors, assumes the form y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i, i = 1, 2, \ldots, n, where the e_i are independent r.v.'s with E(e_i) = 0 and V(e_i) = \sigma^2.

The principle of least-squares - as described in Knowledge Box 9 - calls for the minimization of

𝑆𝑆𝐸 = 𝑒′ 𝑒 = (𝑦 − X 𝛽)′ (𝑦 − X 𝛽) = 𝑆(𝛽).

We differentiate SSE w.r.t. the unknown parameters 𝛽 = (𝛽0 , 𝛽1 , 𝛽2 ) to get

𝜕𝑆(𝛽)
= 2X′ X 𝛽 − 2X′ y.
𝜕𝛽

This yields the least squares estimators b_0, b_1 and b_2 of the regression coefficients \beta_0, \beta_1, \beta_2. We specifically have

b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2;   (11.14)

b_1 = \frac{s_{x_2}^2 s_{x_1 y} - s_{x_1 x_2} s_{x_2 y}}{s_{x_1}^2 s_{x_2}^2 - s_{x_1 x_2}^2},   (11.15)

b_2 = \frac{s_{x_1}^2 s_{x_2 y} - s_{x_1 x_2} s_{x_1 y}}{s_{x_1}^2 s_{x_2}^2 - s_{x_1 x_2}^2},   (11.16)

by solving the set of linear equations

s_{x_1}^2 b_1 + s_{x_1 x_2} b_2 = s_{x_1 y}
s_{x_1 x_2} b_1 + s_{x_2}^2 b_2 = s_{x_2 y},   (11.17)

where s_{x_1}^2, s_{x_2}^2 denote the sample variances, and s_{x_1 x_2}, s_{x_1 y} and s_{x_2 y} the covariances of x_1, x_2 and y.
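Equations (11.14)-(11.17) can be applied directly in R through the sample variances and covariances; a minimal sketch with generic data vectors x1, x2, y:

S  <- cov(cbind(x1, x2))           # 2x2 matrix of s^2_{x1}, s_{x1x2}, s^2_{x2}
sy <- c(cov(x1, y), cov(x2, y))    # right-hand sides s_{x1y}, s_{x2y}
b  <- solve(S, sy)                 # solves the system (11.17) for b1, b2
b0 <- mean(y) - b[1]*mean(x1) - b[2]*mean(x2)   # Equation (11.14)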

 CONCEPT 9.

• The values 𝑦̂︀𝑖 = 𝑏0 + 𝑏1 𝑥𝑖1 + 𝑏2 𝑥𝑖2 , 𝑖 = 1, 2, . . . , 𝑛 are called the predicted response values of the
regression, and

• the residuals around the regression plane (for 𝑖 = 1, 2, . . . , 𝑛) are

𝑒̂︀𝑖 = 𝑦𝑖 − 𝑦̂︀𝑖 = 𝑦𝑖 − (𝑏0 + 𝑏1 𝑥𝑖1 + 𝑏2 𝑥𝑖2 )

Definition 11.2. The coefficients explain the variation of 𝑌 by predictors


• The multiple squared-correlation (multiple-R^2) is

R_{y|(x_1,x_2)}^2 = \frac{1}{s_y^2} \left( b_1 s_{x_1 y} + b_2 s_{x_2 y} \right),   (11.18)

where s_y^2 is the sample variance of y.

The interpretation of the multiple-R^2 is as before, i.e.

the proportion of the variability of y which is explainable by the predictors x_1 and x_2.

• The sum of squares of the residuals around the regression plane is

S_{y|(x_1,x_2)}^2 = \sum_i \hat{e}_i^2 = s_y^2 \left( 1 - R_{y|(x_1,x_2)}^2 \right).   (11.19)

• The residual standard error is s = \sqrt{ \frac{S_{y|(x_1,x_2)}^2}{n-p} }.

Example 11.5 (Regression on two predictor variables).

We illustrate the building of a multiple regression model when p - 1 = 2 and interpret the meaning of the variables involved.
Define a response variable Y = IQ to be the human intelligence index, assuming that it depends on two variables:
X_1 = EDU (education level, in years of schooling), and
X_2 = h (height, in meters).

𝑋1 = 𝐸𝐷𝑈 6 7 7 8 10 12 15 16 18

𝑋2 = ℎ (ℎ𝑒𝑖𝑔ℎ𝑡) 1.6 1.65 1.72 1.59 1.68 1.61 1.77 1.7 1.69

The HI index 𝑌 140 155 150 141 147 166 176 183 199

We have 𝑝 − 1 = 2 predictors for describing IQ:

1. Education level via the number of schooling years (𝑋1 ),

2. Height index (𝑋2 , in meter).

The IQ predictive model is 𝑦 = E[𝑌 |𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ] = 𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑏0 .


With observed data, we use formulas 11.14, 11.15, 11.16 to obtain the following regression model:
𝑦 = IQ = 4.229 𝑥1 + 18.09 𝑥2 + 85.1. 


Computation in R. How do we build the model with the software R?

We observe n = 9 persons and record the data:

x1 = c(6, 7, 7, 8, 10, 12, 15, 16, 18);


x2 = c(1.6, 1.65, 1.72, 1.59, 1.68, 1.61, 1.77, 1.7, 1.69);
y = c(140, 155, 150, 141, 147, 166, 176, 183, 199);

D0 = data.frame(x1, x2);
M0=lm(y ~ x1+ x2, D0); anova(M0); summary(M0);

Call: lm(formula = y ~ x1 + x2, data = D0)

Residuals:
Min 1Q Median 3Q Max
-10.8804 -4.6568 0.4855 4.0847 10.3512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.1888 84.3722 1.010 0.35162
x1 4.2296 0.7173 5.897 0.00106 **
x2 18.0925 52.8039 0.343 0.74356
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.767 on 6 degrees of freedom Why?

Multiple R-squared: 0.8924,Adjusted R-squared: 0.8566


F-statistic: 24.89 on 2 and 6 DF, p-value: 0.001245
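The Why? above: by Equation 11.9 the residual degrees of freedom are n - p = 9 - 3 = 6, since p = 3 coefficients (the intercept and two slopes) were estimated. A quick check in R, continuing the session above:

s2 <- sum(resid(M0)^2) / df.residual(M0)   # s^2 = SSE/(n - p), Equation 11.9
sqrt(s2)                                   # 7.767, the residual standard error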

11.3.2 Regression on three and four predictors

See Section 11.5 for practical usages of regression on three and four predictors. For many predictors,
see Section 11.6 for a realistic case study about Climate Change’s impacts on agriculture in Mekong
Delta Region, Vietnam.


11.4 Aspects of Multiple Regression Modeling

In this section, we briefly discuss several other aspects of building multiple regression models. The linear model

Y = X\beta + \varepsilon = \beta_0 + X_1\beta_1 + \cdots + X_{p-1}\beta_{p-1} + \varepsilon

is a general model that can be used to fit any relationship that is linear in the unknown parameters \beta. This includes two important classes: models with interactions and polynomial regression models.

11.4.1 Models with Interactions

If the change in the mean 𝑦 value associated with a 1-unit increase in one independent variable de-
pends on the value of a second independent variable, there is interaction between these two variables.
Denoting the two independent variables by 𝑋1 , 𝑋2 , we can model this interaction by including as an
additional predictor 𝑋3 = 𝑋1 𝑋2 , the product of the two independent variables.

• When 𝑋1 and 𝑋2 do interact, this model will usually give a much better fit to resulting data than would
the no-interaction model.

• Failure to consider a model with interaction too often leads an investigator to conclude incorrectly
that the relationship between 𝑌 and a set of independent variables is not very substantial.

• In applied work, quadratic predictors X_1^2 and X_2^2 are often included to model a curved relationship. This leads to the full quadratic or complete second-order model (see the R sketch below)

y = E[Y | X_1 = x_1, X_2 = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2.   (11.20)
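In R's model formula language, these interaction and second-order models are written as follows (a sketch; x1, x2, y are generic data vectors):

lm(y ~ x1 + x2 + x1:x2)                       # interaction model; same as y ~ x1*x2
lm(y ~ x1 + x2 + x1:x2 + I(x1^2) + I(x2^2))   # complete second-order model (11.20)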

Example 11.6. [Industry- Manufacturing.]

Suppose that an industrial chemist is interested in the product yield (𝑌) of a polymer being influenced by two independent (predictor) variables X_1, X_2, and possibly their interaction. Here
X_1 = reaction temperature and
X_2 = pressure at which the reaction is carried out.

A model that might describe this relationship is

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2 + \varepsilon.   (11.21)

Taking the mean, we get

y = E[Y | X_1 = x_1, X_2 = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2.


Figure 11.2: Linear regression models with different shapes


Figure 11.2(a) shows the three-dimensional plot of the regression model

𝑦 = E[𝑌 ] = 50 + 10𝑥1 + 7𝑥2 + 5 𝑥1 𝑥2 .

Figure 11.2(a) provides a nice graphical interpretation of an interaction.


Generally, interaction implies that the effect produced by changing one variable (𝑥1 , say) depends on
the level of the other variable (𝑥2 ). This figure shows that changing 𝑥1 from 2 to 8 produces a much
smaller change in E[𝑌 ] when 𝑥2 = 2 than when 𝑥2 = 10.
Figure 11.2(b) shows the three-dimensional plot of the regression model

𝑦 = E[𝑌 ] = 800 + 10𝑥1 + 7𝑥2 − 8.5𝑥21 − 5𝑥22 + 4 𝑥1 𝑥2 .

This is the full quadratic model of this regression.


Notice further that, although these models are all linear regression models, the shape of the surface
that is generated by the model is not linear.

11.4.2 Polynomial Regression Models

Concept: In general, any regression model that is linear in the parameters (the \beta's) is called a linear regression model, regardless of the shape of the surface that it generates.

Let’s return for a moment to the case of bivariate data D consisting of 𝑛 pairs of (𝑥, 𝑦). Suppose that
a scatter plot shows a parabolic (in Figure 11.3.b) rather than linear shape. Then it is natural to specify
a quadratic regression model, via a second-degree polynomial in one variable

𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝛽2 𝑋 2 + 𝜀, (11.22)

So what does this have to do with multiple regression?

Quadratic regression models

Let’s rewrite the quadratic model equation as follows:

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀, where 𝑋1 = 𝑋, 𝑋2 = 𝑋 2

Now this looks exactly like a multiple regression equation with two predictors.

• The message at the moment is that quadratic regression is a special case of multiple regression.
Thus any software package capable of carrying out a multiple regression analysis can fit the quadratic
regression model.

• The same is true of cubic regression and even higher-order polynomial models, although in practice
very rarely are such higher-order predictors needed.

• Polynomial regression models are widely used when the response is curvilinear (see Figure
11.3.b) because the general principles of multiple regression can be applied.


U.S. population in 1790–2010 (million people)

Figure 11.3: Linear and nonlinear model of the U.S. population regression

• However, the interpretation of 𝛽𝑖 given previously for the general multiple regression model is not
legitimate in quadratic regression, and polynomial regression in general. This is because 𝑋2 = 𝑋 2
, so the value of 𝑋2 cannot be increased while 𝑋1 = 𝑋 is held fixed.

See Example 11.7 for more info.

Example 11.7. (U.S. population and nonlinear terms)

One can often reduce variability around the trend and do more accurate analysis by adding non-
linear terms into the regression model.

In Example 9.6, with sample data

𝑐(1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010);

we got the linear model

G(x) = \hat{Y} = -142201 + 74.1 x,

that allows us to predict the world population for years from 2015, based on the linear model

E[ population 𝑌 ] = 𝛽0 + 𝛽1 ( year 𝑥).

• This model 𝐺(𝑥) has a pretty good fit. However, a linear model does a poor prediction of the U.S.
population between 1790 and 2010 (see Figure 11.3.a).


• The population growth over a longer period of time is clearly nonlinear. On the other hand, a
quadratic model in Figure 11.3.b gives an amazingly excellent fit!

• For this model, we assumed

E[population Y] = \beta_0 + \beta_1 x + \beta_2 x^2 = \beta_0 + \beta_1 (\text{year } x) + \beta_2 (\text{year } x)^2.

By changing 𝑥1 = 𝑥, 𝑥2 = 𝑥2 we get a multiple linear model

E[ population 𝑌 ] = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 ,

but 𝑥2 now depends on 𝑥1 ! 

11.5 Chapter’s Problem

11.5.1 Computational problem

Problem 11.1. (Tests in Simple Linear Regression)

Consider the following computer output.


The regression equation is
Y = 26.8 + 1.48 x
Predictor Coef SE Coef T P
Constant 26.753 2.373 ? ?
X 1.4756 0.1063 ? ?
S = 2.70040 R-sq = 93.7% R-sq (adj) = 93.2%
Analysis of Variance
Source DF SS MS F P
Regression 1 ? ? ? ?
Residual error ? 94.8 7.3
Total 15 1500.0

(a) Fill in the missing information.


(b) Can you conclude that the model defines a useful linear relationship?
(c) What is your estimate of 𝜎 2 ?

11.5.2 Theoretic problems

In order to develop hypothesis tests and confidence intervals for the regression coefficients in the
subsequent Chapter ??, the standard deviations of the estimated coefficients are needed. These


can be obtained from a certain covariance matrix, a matrix with the variances on the diagonal and the
covariances in the off-diagonal elements.

Problem 11.2. (Covariance Matrices - concept)

If B is a column vector of random variables (B_1, B_2, \cdots, B_k)' with means \mu_1 = E[B_1], \ldots, \mu_k = E[B_k], let \mu be the column vector of the k means. Define the variance-covariance matrix (or just covariance matrix) of the vector B as

Cov[B] =
\begin{bmatrix}
V[B_1] & Cov(B_1, B_2) & \cdots & Cov(B_1, B_k) \\
Cov(B_2, B_1) & V[B_2] & \cdots & Cov(B_2, B_k) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(B_k, B_1) & Cov(B_k, B_2) & \cdots & V[B_k]
\end{bmatrix}.   (11.23)

Prove that Cov[B] = E[[B - \mu][B - \mu]'].

Problem 11.3. (Covariance Matrices- property)

If A is a matrix with constant entries and V = A 𝐵, then

Cov(V) = A Cov(𝐵) A ′ .

HINT: employ the linearity of the expectation, E(V) = E(AB) = A E(𝐵).

Problem 11.4.

Consider the linear regression model y = X\beta + e. Here e is a vector of random variables such that E[e_i] = 0 and, for all i = 1, \ldots, n and j = 1, \ldots, n,

Cov(e_i, e_j) = \sigma^2 when i = j, and 0 when i \neq j.

Show that the variance-covariance matrix of the LSE b = \hat{\beta} = (X'X)^{-1} X'y is

Cov[b] = \sigma^2 (X'X)^{-1}.   (11.24)

11.5.3 Model fitting on three predictor variables with interaction


We illustrate the building of a multiple regression model when 𝑘 = 3 and interpret the meaning of
the variables involved. Define a response variable 𝑌 = 𝐼𝑄 to be the human intelligence index,
assuming that it depends on three variables 𝑋1 = 𝐸𝐷𝑈 (education level, in years of schooling),
𝑋2 = ℎ (ℎ𝑒𝑖𝑔ℎ𝑡) and 𝑋3 = 𝑔 (𝑠𝑒𝑥).

𝑋1 = 𝐸𝐷𝑈 6 7 7 8 10 12 15 16 18

𝑋2 = ℎ (ℎ𝑒𝑖𝑔ℎ𝑡) 1.6 1.65 1.72 1.59 1.68 1.61 1.77 1.7 1.69

𝑋3 = 𝑔 (𝑠𝑒𝑥) F F M F M F M F M

The HI index 𝑌 140 155 150 141 117 126 176 183 199

The IQ predictive model, including an EDU \times sex interaction, is y = b_1 x_1 + b_2 x_2 + b_3 x_3 + b_{13} x_1 x_3 + a.

With the (assumed) data, we obtain the following regression model:

y = IQ = 0.2 EDU + 341 h - 83 g + 5.22 EDU * g - 409. 

We have 𝑝 − 1 = 3 predictors for describing IQ:

1. Education level via the number of schooling years (𝑥1 ),

2. Gender (𝑥3 , categorical)

3. Height index (𝑥2 )

Computation in R. How do we build the model with R?

We observe n = 9 persons and get the responses y = c(140, 155, 150, 141, 117, 126, ...):
x1 = c(6, 7, 7, 8, 10, 12, 15, 16, 18);
x2 = c(1.6, 1.65, 1.72, 1.59, 1.68, 1.61, 1.77, 1.7, 1.69);
x3 = c(’F’, ’F’, ’M’, ’F’, ’M’, ’F’,’M’,’F’,’M’);
fx3= factor(x3);
y = c(140,155,150,141,117,126,176, 183, 199);
M=lm(y~ x1+ x2+ fx3+ x1:fx3); anova(M); summary(M);
Call: lm(formula = y ~ x1 + x2 + fx3 + x1:fx3)
Residuals:
1 2 3 4 5 6 7 8 9
2.041 -0.249 17.665 6.041 -17.963 -16.625 -16.87 8.79 17.17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -409.9486 327.6905 -1.251 0.279
x1 0.2084 2.8920 0.072 0.946
x2 341.6604 209.9772 1.627 0.179
fx3M -83.3929 50.7856 -1.642 0.176


x1:fx3M 5.2231 3.6543 1.429 0.226

11.5.4 Model fitting in oil industry

We illustrate the empirical model fitting of a multiple regression on a data set GASOL.csv with four predictors from the oil industry (see Kenett [?]). The data set consists of 32 measurements of distillation properties of crude oils. There are five variables, x_1, x_2, x_3, x_4 and y, given as

𝑥1 : crude oil gravity, API;

𝑥2 : crude oil vapour pressure, psi;

𝑥3 : crude oil ASTM 10% point ∘ F;

𝑥4 : gasoline ASTM endpoint, ∘ F;

𝑦: yield of gasoline (in percentage of crude oil).

The measurements of crude oil, and gasoline volatility indicate the temperatures at which a given
amount of liquid has been evaporized. The sample correlations between these five variables are

        𝑥2       𝑥3       𝑥4       𝑦
𝑥1    0.621   −0.700   −0.322    0.246
𝑥2             −0.906   −0.298    0.384
𝑥3                       0.412   −0.315
𝑥4                                0.712

We see that the yield 𝑦 is highly correlated with 𝑥4 and with 𝑥2 (or 𝑥3 ).

Computation on R .

The following is an R output of the regression of 𝑦 on 𝑥3 and 𝑥4 .

> data(GASOL)
> LmYield <- lm(yield ~ 1 + astm + endPt, data=GASOL)
> summary(LmYield)
Call:
lm(formula = yield ~ 1 + astm + endPt, data = GASOL)
Residuals:
Min 1Q Median 3Q Max
-3.9593 -1.9063 -0.3711 1.6242 4.3802


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.467633 3.009009 6.137 1.09e-06 ***
astm -0.209329 0.012737 -16.435 3.11e-16 ***
endPt 0.155813 0.006855 22.731 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.426 on 29 degrees of freedom
Multiple R-squared: 0.9521, Adjusted R-squared: 0.9488
F-statistic: 288.4 on 2 and 29 DF, p-value: < 2.2e-16

We compute now these estimates of the regression coefficients using the formulae in Section
11.3.1. The variances and covariances of 𝑥3 , 𝑥4 and 𝑦 are

              ASTM (𝑥3)   End pt (𝑥4)   Yield (𝑦)
ASTM (𝑥3)      1409.355
End pt (𝑥4)    1079.565     4865.894
Yield (𝑦)      −126.808      532.188     114.970

The means of these variables are $\bar{X}_3 = 241.500$, $\bar{X}_4 = 332.094$, $\bar{Y} = 19.6594$. Thus, the least-squares
estimates 𝑏1 and 𝑏2 of 𝛽1 and 𝛽2 are obtained by solving the equations

$$
\begin{cases}
1409.355\, b_1 + 1079.565\, b_2 = -126.808 \\
1079.565\, b_1 + 4865.894\, b_2 = 532.188.
\end{cases}
$$

The solution is 𝑏1 = −0.20933, 𝑏2 = 0.15581. Finally, the estimate of 𝛽0 is

$$
b_0 = 19.6594 + 0.20933 \times 241.5 - 0.15581 \times 332.094 = 18.469.
$$
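The same estimates can be reproduced numerically in R (a minimal sketch using the variance-covariance summaries quoted above):

# Solve the normal equations for b1, b2, then recover b0
S   <- matrix(c(1409.355, 1079.565,
                1079.565, 4865.894), nrow = 2, byrow = TRUE)
rhs <- c(-126.808, 532.188)
b   <- solve(S, rhs)                          # b1 = -0.20933, b2 = 0.15581
b0  <- 19.6594 - b[1]*241.500 - b[2]*332.094  # 18.469
b; b0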

The multiple 𝑅2 given in Eqn. 11.18 is

$$
R^2_{y|(x_3,x_4)} = \frac{1}{114.970}\,\big[0.20933 \times 126.808 + 0.15581 \times 532.188\big] = 0.9530.
$$

In addition,

$$
S^2_{y|(x_3,x_4)} = s_y^2\left(1 - R^2_{y|(x_3,x_4)}\right) = 114.97\,(1 - 0.9530) = 5.4036.
$$

WHAT IF we extend our model 𝑦𝑖𝑒𝑙𝑑 ∼ 1 + 𝑎𝑠𝑡𝑚 + 𝑒𝑛𝑑𝑃 𝑡 to include an interaction of 𝑥3 , 𝑥4 :

𝑦𝑖𝑒𝑙𝑑 ∼ 1 + 𝑎𝑠𝑡𝑚 + 𝑒𝑛𝑑𝑃 𝑡 + 𝑎𝑠𝑡𝑚 * 𝑒𝑛𝑑𝑃 𝑡?
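A sketch of how this extended model could be fitted, assuming the same GASOL data frame as above; in R's formula language astm*endPt expands to astm + endPt + astm:endPt:

# Fit the interaction model and compare with the additive one (output omitted)
LmYield2 <- lm(yield ~ astm * endPt, data = GASOL)
summary(LmYield2)
anova(LmYield, LmYield2)   # partial F-test for the interaction term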

11.6 Step-wise regression for a Climate Change study

Realistic data set from Mekong Delta Region (MDR)- Vietnam


The factor of interest 𝑌 is the number of BPHs.

Figure 11.4: Concrete dataset: BPH = Brown Plant-hoppers

Predictors that could affect BPH growth, and their measured values, are shown in the table below. But which
factors have the most significant impacts on BPH growth?

1. Longitude (𝑥1 ), Latitude (𝑥2 )

2. Leaf color (𝑥3 , categorical)

3. Number of leaves (ind/m2) (𝑥4 )

4. Seeding density (kg/ha) (𝑥5 )

5. Temperature (C) (𝑥6 )

6. Humidity (%) (𝑥7 )

7. Water level (cm) (𝑥8 )

8. Rice species (𝑥9 , categorical)

9. Grass density (ind/m2) (𝑥10 )

10. Number of buds (ind/m2) (𝑥11 )


Long.  Lat.  Rice species  Seed. den.  Temp  Humi.  Water level  Leaf color  Grass den.  No. buds  No. leaves  No. BPH

5562 11260 JASMIN 85 15 24 90 6 4 24 962 4329 0

5557 11260 JASMIN 85 21 25 90 0 5 0 1058 4768 0

5563 11259 OM2000 21 26 90 7 5 0 1046 4707 0

5561 11257 OM1490 15 25 90 0 4 0 1070 4815 0

5559 11256 OMCS1490 21 29 75 0 5 48 1050 4725 0

5559 11254 VD20 18 28 82 6 5 0 966 4347 0

5565 11256 VD20 19 25 90 3 5 0 982 4419 0


...

5640 10780 MT1240 18 29 90 6 5 2 528 2112 106

5641 10779 OM1490 16 29 90 6 4 0 608 2432 109

5642 10778 OM1490 16 29 90 5 4 0 604 2416 326

5644 10779 OM1490 12 29 90 5 3 0 512 2048 72

5640 10777 OM3240 16 29 90 8 4 12 576 2304 184

No Interaction Terms
> DataBPH = read.csv("BPH.csv",sep=’;’)
> ncol(DataBPH); nrow(DataBPH);
> attach(DataBPH)   # make the columns x2, x6, x7, x8, BPH directly accessible
> out = lm(BPH ~ x2 + x6 + x7 + x8)
> anova(out)
> summary(out)

Call:
lm(formula = BPH ~ x2 + x6 + x7 + x8)

Residuals:
Min 1Q Median 3Q Max
-556.7 -223.4 -120.2 55.7 12531.5

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.592e+03 1.180e+03 3.045 0.00240 **


x2 -2.134e-03 1.101e-03 -1.938 0.05293 .


x6 -6.930e+00 2.666e+00 -2.600 0.00949 **


x7 -2.113e+01 5.369e+00 -3.936 8.97e-05 ***
x8 -8.829e+01 3.577e+01 -2.468 0.01378 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 587.2 on 835 degrees of freedom


Multiple R-squared: 0.03925,Adjusted R-squared: 0.03465
F-statistic: 8.528 on 4 and 835 DF, p-value: 9.555e-07

With Interaction Terms, having only numeric predictors


To add an interaction term to a model, use a colon (:). For example, x1:x2 denotes the interaction
between the two variables x1 and x2. This command builds a model with terms for two variables and
their interaction.
out1=lm(BPH~ x2+ x6+ x7+ x8+ x2:x6+ x6:x7+ x6:x8+ x7:x8)
anova(out1); summary(out1)
Call:
lm(formula = BPH ~ x2 + x6 + x7 + x8 + x2:x6 + x6:x7 + x6:x8 + x7:x8)

Residuals:
Min 1Q Median 3Q Max
-595.3 -212.5 -119.5 47.5 12486.8

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.293e+04 1.221e+04 1.059 0.2899
x2 -1.213e-02 1.121e-02 -1.082 0.2797
x6 -1.169e+02 1.473e+02 -0.794 0.4276
x7 -1.062e+02 5.240e+01 -2.026 0.0431 *
x8 3.674e+02 3.855e+02 0.953 0.3408
x2:x6 1.214e-04 1.353e-04 0.897 0.3697
x6:x7 1.984e-01 6.394e-01 0.310 0.7564
x6:x8 -6.422e+00 4.616e+00 -1.391 0.1645
x7:x8 1.797e+01 9.103e+00 1.974 0.0487 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 586.1 on 831 degrees of freedom


Multiple R-squared: 0.04756,Adjusted R-squared: 0.03839


F-statistic: 5.187 on 8 and 831 DF, p-value: 2.433e-06

Interaction Terms, with a few categorical predictors

> fx3=factor(x3)
# x3 (Leaf color) is categorical;
# fx3 is the dummy variable of x3,
# now eligible for using in linear model
# need the theory of Analysis of Covariance (ANCOVA) to explain
> out3=lm(Y~ fx3+ x6+ x7+ x8+ fx3:x6+ x6:x7+ x6:x8+ x7:x8)
> anova(out3)

Analysis of Variance Table


Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
fx3 67 17664907 263655 0.7608 0.9202764
x6 1 2527302 2527302 7.2924 0.0070843 **
x7 1 3651910 3651910 10.5375 0.0012230 **
x8 1 4092488 4092488 11.8087 0.0006228 ***
fx3:x6 33 16766076 508063 1.4660 0.0458599 *
x6:x7 1 95067 95067 0.2743 0.6006118
x6:x8 1 298724 298724 0.8620 0.3534967
x7:x8 1 551672 551672 1.5918 0.2074662
Residuals 733 254031696 346564

>summary(out3)
Residual standard error: 588.7 on 733 degrees of freedom
Multiple R-squared: 0.1523,Adjusted R-squared: 0.02974
F-statistic: 1.243 on 106 and 733 DF, p-value: 0.06013

Higher-order Interaction Terms


As you can see, this notation becomes lengthy for models with many variables. R has a useful
shorthand notation for interaction terms: the asterisk (*). A term like fx3*x6 expands to fx3 + x6 + fx3:x6, i.e. all main effects together with their interaction.

out4=lm(Y~ fx3+ x6+ x7+ x8+ fx3*x6*x7*x8)


anova(out4)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
fx3 67 17664907 263655 0.6945 0.968346
x6 1 2527302 2527302 6.6570 0.010110 *


x7 1 3651910 3651910 9.6192 0.002015 **


x8 1 4092488 4092488 10.7797 0.001085 **
fx3:x6 33 16766076 508063 1.3382 0.100679
fx3:x7 33 2889708 87567 0.2307 0.999998
x6:x7 1 1264150 1264150 3.3298 0.068526 .
fx3:x8 27 6730902 249293 0.6566 0.908548
x6:x8 1 24241 24241 0.0639 0.800593
x7:x8 1 854526 854526 2.2508 0.134061
fx3:x6:x7 22 2582661 117394 0.3092 0.999115
fx3:x6:x8 15 1953772 130251 0.3431 0.990497
fx3:x7:x8 16 3551246 221953 0.5846 0.896520
x6:x7:x8 1 621910 621910 1.6381 0.201071
fx3:x6:x7:x8 11 3678137 334376 0.8808 0.559122
Residuals 608 230825906 379648
summary(out4)
Residual standard error: 616.2 on 608 degrees of freedom
Multiple R-squared: 0.2298,Adjusted R-squared: -0.06288
F-statistic: 0.7851 on 231 and 608 DF, p-value: 0.9844

11.7 Chapter’s Summary

PROPERTIES OF THE LEAST SQUARES ESTIMATORS


Consider a statistical linear model

$$
Y = X\beta + \varepsilon = \beta_0 + X_1\beta_1 + \cdots + X_{p-1}\beta_{p-1} + \varepsilon.
$$

The statistical properties of the least squares estimator

$$
b = \hat{\beta} = (X'X)^{-1}X'y
$$

are easily found under certain assumptions on the error terms 𝜀𝑖; namely, the estimators are:

1. linear functions of observed responses 𝑦1 , 𝑦2 , · · · , 𝑦𝑛 ,

2. unbiased for the regression slopes because

E[b] = (X′ X)−1 X′ E[y] = (X′ X)−1 X′ X 𝛽 = 𝛽,

3. normal if the response variable 𝑌 is normal.

Furthermore,


• If we have 𝑛 > 1 observations measured simultaneously at predictors, the true linear regression
model (at the 𝑖-th observation) is

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + · · · + 𝛽𝑝−1 𝑥𝑖(𝑝−1) + 𝑒𝑖 , 𝑖 = 1, 2, . . . , 𝑛

where 𝛽0, 𝛽1, · · · , 𝛽𝑝−1 are the regression coefficients, and 𝑒𝑖 ∼iid N(0, 𝜎²).

• The parameters of the model are 𝛽0, 𝛽1, · · · , 𝛽𝑝−1 and 𝜎.

• Regression of 𝑌 on 𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 is the conditional expectation,

𝑌̂︀ := E[𝑌 |𝑋1 = 𝑥1 , · · · , 𝑋𝑝−1 = 𝑥𝑝−1 ] = 𝛽0 + 𝑋1 𝛽1 + · · · + 𝑋𝑝−1 𝛽𝑝−1 ,

𝑌̂︀ is a function, whose form can be estimated from data.



Part D
Stochastic Process Based Analysis

Chapter 12: Stochastic Process

Chapter 13: Statistical Simulation

Chapter 14: Poisson Process

Chapter 15: Branching Processes

Chapter 16: Stochastic Analysis in Engineering

• Stochastic Process uses probabilistic models to study uncertainty. Probabilistic models often
involve several popular random variables of interest, such as Poisson, exponential or Gaussian ones. For
example,

– in a medical diagnosis context, the results of several tests may be significant, or

– in a networking context, the workloads of several gateways may be of interest.

All of these random variables are associated with the same experiment, sample space, and
probability law, and their values may relate in interesting ways.

• Stochastic Analysis uses probabilistic models, simulation theory and computation to study un-
certainty.
Chapter 12

Stochastic Process
Characterizing systems with random spatio-temporal fluctuations

[Source [9]]

12.1 What and Why Stochastic Processes?

Briefs of Stochastic Processes include

• Introductory Stochastic Processes

• Markov chains (discrete process)

• Stationary distributions and Limiting distributions

• Classification of states of a Markov chain

• Theory of stochastic matrix for Markov chains

• Markov processes

Typical applications include:

A Service systems: mathematical model of queuing systems.

B Evolutionary Dynamics: measuring extinction probabilities of species.

C Brand loyalty in Business: compute market shares.

A. Service sharing and high performance PC


Over the last few years the Processor Sharing scheme has attracted renewed attention as a conve-
nient and efficient approach for studying bandwidth sharing mechanisms such as TCP, or any process
requiring resource sharing. Understanding and computing such processes, in order to produce a high
performance system with limited resources, is a very difficult task.
Few typical aspects of the resource allocation are:

1. the fact that many classes of jobs (clients) come in a system with distinct rates demands a wise policy
to get them through efficiently,

2. measuring the performance of a system through many different parameters (metrics) is hard and requires
complex mathematical models.

B. Evolutionary Dynamics
Keywords: critical lineages, virus mutant, mutation, reproductive ratio, invasion, ecology, vaccine.
Introductory Biomatics- Invasion and Escape. Some realistic biological phenomena occur in na-
ture such as: (a) a parasite infecting a new host, (b) a species trying to invade a new ecological niche,
(c) cancer cells escaping from chemotherapy.

Typical problems. Imagine a virus of one host species that is transferred to another host species
(HIV, SARS). In the new host, the virus has a basic reproductive ratio 𝑅 < 1.


Main components of a queuing system

Figure 12.1: An example of mixed flow of arrivals ( jobs, clients, cars ...)- Source [58]

1. How to calculate the probability that such an attempt succeeds?

2. How to calculate the probability that a virus quasi-species contains an escape mutant that establishes
an infection and thereby causes vaccine failure?

We need a theory to calculate the probability of non-extinction/escape for lineages; it starts from studying
the evolutionary dynamics of single individuals.
C. Brand loyalty in Buz: Consider a Markov chain M describing the loyalty of customers to three
retailers Coop, BigC and Walmart, coded by states 0, 1, and 2 respectively. The transition probability
matrix 𝑃 is given by
$$
P = \begin{pmatrix} 0.4 & 0.2 & 0.4 \\ 0.6 & 0 & 0.4 \\ 0.2 & 0.5 & 0.3 \end{pmatrix},
$$

with rows and columns indexed by the states 𝐶, 𝐵, 𝑊 (in that order).

What is the probability in the long run that the chain M is in state 1; that is, how much chance is there
that the customers will go with BigC? We need Markov chain theory to solve this.
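Although the supporting theory only comes later in this chapter, the long-run distribution can already be computed numerically (a minimal sketch): it is the left eigenvector of 𝑃 for eigenvalue 1, normalized to sum to one.

# Long-run (stationary) distribution of the brand-loyalty chain
P <- matrix(c(0.4, 0.2, 0.4,
              0.6, 0.0, 0.4,
              0.2, 0.5, 0.3), nrow = 3, byrow = TRUE)
ev <- eigen(t(P))              # left eigenvectors of P
p  <- Re(ev$vectors[, 1])      # eigenvector for the eigenvalue 1
p  <- p / sum(p)               # normalize to a probability vector
p                              # p[2] is the long-run share of BigC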


12.2 Introductory Stochastic Processes

We first take a look at few problems that can be solved by the chapter’s methods.

1. A certain product is made by two companies, A and B, that control the entire market.

Currently, A and B have 60 percent and 40 percent, respectively, of the total market.

Each year, A loses 5 percent of its market share to B, while B loses 3 percent of its share to A.

Find the relative proportion of the market that each hold after 2 years.

2. Let two gamblers, A and B, initially have 𝑘 dollars and 𝑚 dollars, respectively.

Suppose that at each round of their game, A wins one dollar from B with probability 𝑝 and loses one
dollar to B with probability 𝑞 = 1 − 𝑝.

Assume that A and B play until one of them has no money left. (This is known as the Gambler’s
Ruin problem.)

Let 𝑋𝑛 be A’s capital after round 𝑛, where 𝑛 = 0, 1, 2, · · · and 𝑋0 = 𝑘.

(a) Show that 𝑋(𝑛) = {𝑋𝑛 , 𝑛 ≥ 0} is a Markov chain with absorbing states.

(b) Find its transition probability matrix 𝑃. Realize 𝑃 when 𝑝 = 𝑞 = 1/2 and 𝑁 = 4 (𝑁 = 𝑘 + 𝑚 being the total capital in play).

(c*) What is the probability of A’s losing all his money?

Basic concepts

A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and
generates a sequence of numerical values.

Definition 12.1 (Short version).

A stochastic process is just a collection (usually infinite) of random variables, denoted 𝑋𝑡 or 𝑋(𝑡);
where parameter 𝑡 often represents time.
State space S of a stochastic process consists of all realizations 𝑥 of 𝑋𝑡 , i.e. 𝑋𝑡 = 𝑥 says the
random process is in state 𝑥 at time 𝑡.

Stochastic processes can be generally subdivided into four distinct categories depending on whether
𝑡 ∈ 𝑇 or 𝑋𝑡 ∈ S are discrete or continuous.


                                              State space S
 Time index set 𝑇    Discrete                                      Continuous
 Discrete            1. Discrete process                           4. Discrete-time continuous-state process
 Continuous          2. Continuous-time discrete-state process     3. Continuous process

12.2.1 Classification of stochastic processes

1. Discrete processes: both S and 𝑇 are discrete, as Bernoulli process or Discrete Time Markov
chains.

2. Continuous time discrete state processes: the state space S of 𝑋𝑡 is discrete and the index set
(time set) 𝑇 of 𝑡 is continuous, as the reals R = (−∞, ∞) or its intervals.

• Poisson process– the number of clients 𝑋(𝑡) who have entered a bank from the opening time
until 𝑡. 𝑋(𝑡) follows a Poisson distribution with mean E[𝑋(𝑡)] = 𝜆𝑡 (𝜆 being the arrival rate).

• Continuous time Markov chain: the state space S is finite

3. Continuous processes: both state space S and time index set 𝑇 are continuous, such as diffusion
process (Brownian motion).

4. Discrete time continuous state processes: the time index 𝑇 is discrete, and the state space S is
continuous– the so-called TIME SERIES such as

• monthly fluctuations of the inflation rate of Vietnam, Thailand or India

• daily fluctuations of a stock market.

Examples

1. Discrete processes: the random walk model consisting of positions 𝑋𝑡 of an object (a drunkard) at hourly
time points 𝑡 during 24 hours, whose directional distance from a particular point 0 is measured in
integer units. Here 𝑇 = {0, 1, 2, . . . , 24}. See details of the random walk model in Section 13.6.

2. Continuous time discrete state processes: 𝑋𝑡 is the number of infant births in a given population
during time period [0, 𝑡]. Here the time index 𝑇 = R+ = [0, ∞) and

the state space is {0, 1, 2, . . .}

3. Continuous processes: 𝑋𝑡 is population density at time 𝑡 ∈ 𝑇 = R+ , the state space of 𝑋𝑡 is R+ .


4. Discrete time continuous state processes:

TIME SERIES of daily fluctuations of a stock market

Realistic data of a financial time series model: from http://cafef.vn, shown in Table 12.1 has
58 records corresponding with 58 trading days in Quarter 1, 2013. VNM and DPM are two giant firms
in Vietnam.
Table 12.1: Data of VNM and DPM stock prices in Quarter 1, 2013.

                          VNM                                                         DPM
Seq 𝑡   Date        Price 𝑠(𝑡)   Return 𝑟𝑡 = 𝑠(𝑡)/𝑠(𝑡−1)   Log return 𝑥𝑡 = ln[𝑠(𝑡)/𝑠(𝑡−1)]   Price 𝑠(𝑡)   Return 𝑟𝑡   Log return 𝑥𝑡

0 2012/12/28 88 0.98864 -0.011 35.8 1.01955 0.019

1 2013/01/02 87 0.99425 -0.006 36.5 1.0274 0.027

2 2013/01/03 86.5 1.01734 0.017 37.5 1.02667 0.026

3 2013/01/04 88 1 0 38.5 1.04675 0.046

4 2013/01/07 88 1.04545 0.044 40.3 1.01737 0.017

5 2013/01/08 92 1.03804 0.037 41 1.03415 0.034

··· ··· ··· ··· ··· ··· ··· ···

51 2013/03/20 107 1.03738 0.037 45.2 1.01106 0.011

52 2013/03/21 111 1.01802 0.018 45.7 0.96937 -0.031

53 2013/03/22 113 1.0177 0.018 44.3 1.02709 0.027

54 2013/03/25 115 0.98261 -0.018 45.5 0.99121 -0.009

55 2013/03/26 113 1 0 45.1 0.98891 -0.011

56 2013/03/27 113 1.0177 0.018 44.6 1.00897 0.009

57 2013/03/28 115 1.0087 0.009 45 0.99778 -0.002

Let {𝑆(𝑡) : 𝑡 ≥ 0} be a stock price process, with the stock price change over the period (𝑘, 𝑘 + 1) being
$R_k = \dfrac{S(k+1)}{S(k)}$, named the return of the stock, see Nguyen (2013) [26].
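Given a price vector, the returns and log returns of Table 12.1 are one line each in R (a sketch; the prices below are the first six VNM values of the table):

s <- c(88, 87, 86.5, 88, 88, 92)   # first VNM prices of Table 12.1
r <- s[-1] / s[-length(s)]         # returns R_k = S(k+1)/S(k)
x <- log(r)                        # log returns
round(r, 5)   # 0.98864 0.99425 1.01734 1.00000 1.04545, as in the table
round(x, 3)   # -0.011 -0.006  0.017  0.000  0.044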


Our stock prices given in Table 12.1 follow a geometric Brownian motion.

Figure 12.2: The log returns of (a) VNM and (b) DPM look like a Brownian motion.

• The next figures, 12.3 and 12.4, separately show time-series graphs of the actual stock prices (in blue)
of the two data sets, VNM and DPM.

• The other two curves (in red and green) show some approximated statistical models (the
log-normal and auto-regressive ones) that are estimated from the same data above.


Figure 12.3: The VNM stock price process: actual price, lognormal/binomial/trinomial expected price, and AR(1) expected price.

Figure 12.4: The DPM stock price process: actual price, lognormal/binomial expected price, and AR(1) expected price.


Definition 12.2 (Full version).

A stochastic process is a series of random variables depending on time 𝑡 (or another index). It is
therefore a function of two arguments, 𝑋(𝑡, 𝑤), where

• parameter 𝑡 ∈ 𝑇 is time, with 𝑇 being a set of possible times, usually [0, ∞), (−∞, ∞),

N = {0, 1, 2, . . .}, or

Z = {. . . , −3, −2, −1, 0, 1, 2, . . .};

• 𝑤 ∈ Ω, an outcome of an experiment, in sample space Ω.

A stochastic process can be visualized by a map

𝑋 : 𝑇 × Ω −→ S

(𝑡, 𝑤) ↦→ 𝑋(𝑡, 𝑤) = 𝑥

Values of 𝑋(𝑡, 𝑤) are called states.

Figure 12.5: Visualization of a stochastic process 𝑋 as a map 𝑋 : 𝑇 × Ω → S, (𝑡, 𝑤) ↦→ 𝑋(𝑡, 𝑤) = 𝑠. The time index set 𝑇 and the state space S could each be discrete or continuous; Ω is the sample space.

State space: The state space

S = {𝑠 : 𝑠 = 𝑋(𝑡, 𝑤) for certain 𝑡 ∈ 𝑇 and 𝑤 ∈ Ω}

of a stochastic process consists of all states/realizations 𝑠 of 𝑋(𝑡, 𝑤); i.e.,

𝑋(𝑡, 𝑤) = 𝑠

says the process is in state 𝑠 at time 𝑡, measured at outcome 𝑤.


12.2.2 Key characteristics of stochastic process

Knowledge box 10. We have known that a stochastic process is a mathematical model of a proba-
bilistic experiment that evolves in time and generates a sequence of numerical values.

Three interesting aspects of SP that we want to know:

(a) The dependencies in a series of values generated by the process. For example, how do
future prices of a stock depend on past values?

(b) Long-term averages, involving the entire sequence of generated values. E.g., what is the
fraction of time that a machine is idle?

(c) The likelihood or frequency of certain boundary events.

E.g., a) what is the probability that within a given hour all circuits of some telephone system become
simultaneously busy, or
b) what is the frequency with which some buffer in a PC network overflows with data?

There are two fundamental properties of stochastic processes.

1. STATIONARY property:

A process is stationary when all the 𝑋(𝑡) have the same distribution. That means,

• for any 𝜏 , the distribution of a stationary process will be unaffected by a shift in the time
origin, and

• 𝑋(𝑡) and 𝑋(𝑡 + 𝜏 ) will have the same distributions.


For the first-order distribution, we have

𝐹𝑋 (𝑥; 𝑡) = 𝐹𝑋 (𝑥; 𝑡 + 𝜏 ) = 𝐹𝑋 (𝑥); and 𝑓𝑋 (𝑥; 𝑡) = 𝑓𝑋 (𝑥).

These processes are found in Arrival-Type Processes, see Chapter 14.

Applications: Arrival-Type Processes is essentially based on Poisson process, in which, we are


interested in occurrences that have the characteristic of an “arrival” such as
* message receptions at a receiver, [in telecommunication]
* job completions in a manufacturing cell, [in industry]
* customer purchases at a store, etc. [in business]

Mathematical treatment: We will focus on models in which the inter-arrival times (the times between
successive arrivals) are independent random variables.


♢ The case where arrivals occur in discrete time and the interarrival times are geometrically
distributed – is the Bernoulli process.

♢ The case where arrivals occur in continuous time and the interarrival times are exponentially
distributed – is the Poisson process, see Chapter 14.

2. MARKOVIAN (memory-less) property:

Many processes with the memory-less property arise from experiments that evolve in time and in which
the future evolution exhibits a probabilistic dependence on the past.

As an example, the future daily prices of a stock are typically dependent on past prices. However, in
a Markov process, we assume a very special type of dependence:

the next value depends on past values only through the current value, that is 𝑋𝑖+1 depends only on 𝑋𝑖 ,
and not on any previous values.

Remark 4. We have seen that a stochastic process 𝑋 : Ω × 𝑇 −→ S depends on both the outcomes/units of an
experiment Ω and the time points 𝑇.

• At any fixed time 𝑡, we see a random variable 𝑋𝑡(𝑤), a function of a random outcome 𝑤.

• On the other hand, if we fix 𝑤 ∈ Ω, we obtain a function of time 𝑋𝑤(𝑡). This function is called a
sample path, or a trajectory, of the process 𝑋(𝑡, 𝑤).

Example 12.1. (CPU usage).

Looking at the past usage of the central processing unit (CPU), we see a realization of this process
until the current time (Figure 12.6). However, the future behavior of the process is unclear.

Depending on which outcome 𝑤 will actually take place, the process can develop differently. For
example, see two different trajectories for 𝑤 = 𝑤1 and 𝑤 = 𝑤2, two elements of the sample space Ω,
in Figure 12.7.

PRACTICE 3.
a/ The CPU usage process (in percent) of the above example belongs to what class of SP?
b/ In a printer shop, now let

1. 𝑋(𝑛, 𝑤) = the amount of time required to print the 𝑛-th job.

2. 𝑌 (𝑛, 𝑤) be the number of pages of the 𝑛-th printing job.

1. Describe the components of the process X = {𝑋(𝑛, 𝑤)}. Is it a discrete-time, continuous-state stochastic
process? Why?

2. Describe components of process Y = {𝑌 (𝑛, 𝑤)}. What class of SP does it belong to?


Figure 12.6: A sample path of CPU usage up to time 𝑡 (the observed sample path).

Figure 12.7: Sample paths of CPU usage after time 𝑡 (possible developments, determined by the outcome 𝜔 ∈ Ω).

12.3 Markov chains

From now on, we shall not write 𝑤 as an argument of 𝑋(𝑡, 𝑤). Just keep in mind that behavior of a
stochastic process depends on chance, just as we did with random variables and random vectors.


Informally, a stochastic process is called a Markov process if only its present state is important for
knowing the future development; it does not matter how the process arrived at this state (the memory-less
property):
P[ future | past, present ] = P[ future | present ]

A bit of history: the idea of Markov dependence was developed by Andrei Markov (1856-1922),
who was a student of P. Chebyshev.

12.3.1 Markov processes and Markov chains

Definition 12.3 (Markov process).

Stochastic process 𝑋(𝑡) is Markov if for any


𝑡1 < 𝑡2 < . . . < 𝑡𝑛 < 𝑡, and any sets 𝐴; 𝐴1 , 𝐴2 , . . . , 𝐴𝑛

P[𝑋(𝑡) ∈ 𝐴|𝑋(𝑡1 ) ∈ 𝐴1 , · · · , 𝑋(𝑡𝑛 ) ∈ 𝐴𝑛 ] = P[𝑋(𝑡) ∈ 𝐴|𝑋(𝑡𝑛 ) ∈ 𝐴𝑛 ].

Example 12.2. [Computing.] (Internet connections)

Let 𝑋(𝑡) be the total number of internet connections registered by some internet service provider
by the time 𝑡. Typically, people connect to the internet at random times, regardless of how many
connections have already been made.
Therefore, the number of connections in the next minute will only depend on the current number.
For example, if 999 connections have been registered by 10 o'clock, then the chance that the total
number exceeds 1000 during the next minute is the same regardless of when and how these 999
connections were made in the past. This process is Markov.

Markov chains (MC)

A Markov chain is a discrete-time, discrete-state Markov stochastic process.

• Suppose we have a sequence 𝑀 of consecutive trials, numbered 𝑛 = 0, 1, 2, · · · .

• The outcome of the 𝑛th trial is represented by the random variable 𝑋𝑛 , which we assume

to be discrete and

to take one of the values 𝑗 in a finite set 𝑄 of discrete outcomes/states {𝑒1 , 𝑒2 , 𝑒3 , . . . , 𝑒𝑠 }.

• Sequence 𝑀 is called a (discrete time) Markov chain if, while occupying 𝑄 states at each of the unit
time points 1, 2, 3, . . . , 𝑛 − 1, 𝑛, 𝑛 + 1, . . ., 𝑀 satisfies the following property, called Markov property
or Memoryless property.


(Diagram: the chain occupies states 𝑒𝑖, 𝑒𝑗 ∈ 𝑄 at time points 1, 2, . . . , 𝑛 − 1, 𝑛; the time set is 𝑇 = N, the natural numbers.)

P[Xn = j|Xn−1 = i, 𝑋𝑛−2 = 𝑘, · · · , 𝑋0 = 𝑎] = P[Xn = j|Xn−1 = i],

for all 𝑛 = 1, 2, · · · .

(In each time step 𝑛 − 1 to 𝑛, the process can either stay at the same state 𝑒𝑖 (at both 𝑛 − 1 and 𝑛)
or move to another state 𝑒𝑗 (at 𝑛), with respect to the memoryless rule, which says the future behavior
of the system depends only on the present and not on its past history.)

Definition 12.4 (One-step transition probability).

Denote the absolute probability of outcome 𝑗 at the 𝑛th trial by

𝑝𝑗 (𝑛) = P[𝑋𝑛 = 𝑗] (12.1)

The one-step transition probability, denoted

𝑝𝑖𝑗 (𝑛) = P[𝑋𝑛 = 𝑗 |𝑋𝑛−1 = 𝑖],

defined as the conditional probability that the process is in state 𝑗 at time 𝑛 given the fact that
the process was in state 𝑖 at the previous time 𝑛 − 1, for all 𝑖, 𝑗 ∈ 𝑄.


12.3.2 Homogeneous Markov chains

If the state transition probabilities 𝑝𝑖𝑗(𝑛) in a Markov chain 𝑀 are independent of the time 𝑛, written 𝑝𝑖𝑗(𝑛) =
𝑝𝑖𝑗, they are said to be stationary, time-homogeneous or just homogeneous.

The state transition probability in homogeneous chain then can be written without mention
time point 𝑛:
𝑝𝑖𝑗 = P[𝑋𝑛 = 𝑗|𝑋𝑛−1 = 𝑖]. (12.2)

The Markov property, quantitatively described through transition probabilities, is represented in the
state transition matrix 𝑃 = [𝑝𝑖𝑗 ]:

$$
P = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & \cdots & p_{1s} \\
p_{21} & p_{22} & p_{23} & \cdots & p_{2s} \\
p_{31} & p_{32} & p_{33} & \cdots & p_{3s} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{s1} & p_{s2} & p_{s3} & \cdots & p_{ss}
\end{bmatrix} \tag{12.3}
$$

Unless stated otherwise, we assume and will work with homogeneous Markov chains 𝑀. The one-step
transition probabilities given by (12.2) of these Markov chains satisfy (each row total equals 1):

$$
\sum_{j=1}^{s} p_{ij} = p_{i1} + p_{i2} + p_{i3} + \ldots + p_{is} = 1, \quad \text{for each } i = 1, 2, \cdots, s, \ \text{and } p_{ij} \ge 0
$$

(since the states-destinations are disjoint and exhaustive events).


Transition Probability Matrix: We are practically given a/ the initial distribution- the probability distri-
bution of the starting position of the concerned object at time point 0, and b/ the transition probabilities;
and we want to determine the probability distribution of the position 𝑋𝑛 for any time point 𝑛 > 0.

• In practice, the initial probabilities 𝑝(0) are obtained at the current time (the beginning of a research),

• and the transition probability matrix 𝑃 is found from empirical observations.

In most cases, the major concern is using 𝑃 and 𝑝(0) to predict the future. In summary we have

Definition 12.5 (Homogeneous Markov chain).


A (homogeneous) Markov chain 𝑀 is a triple (𝑄, 𝑝(0), 𝑃) in which:

• 𝑄 is a finite list of states (identified with [𝑒1, 𝑒2, 𝑒3, . . . , 𝑒𝑠]),

• 𝑝(0) are the initial probabilities (at the initial time point 𝑛 = 0),

• 𝑃 are the state transition probabilities, denoted by a matrix 𝑃 = [𝑝𝑖𝑗] in which

𝑝𝑖𝑗 = P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖].

• And such that the memoryless property is satisfied, i.e.,

P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖, · · · , 𝑋0 = 𝑎] = P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖], for all 𝑛 ≥ 1.

Example 12.3.

The Coopmart chain (denoted 𝐶) in SG currently controls 60% of the daily processed-food market;
their rivals, Maximart and other brands (denoted 𝑀), take the other share. Data from the previous
years (2016 and 2017) show that 88% of 𝐶's customers remained loyal to 𝐶, while 12% switched to
years (2016 and 2017) show that 88% of 𝐶’s customers remained loyal to 𝐶, while 12% switched to
rival brands. In addition, 85% of 𝑀 ’s customers remained loyal to 𝑀 , while other 15% switched to 𝐶.
Assuming that these trends continue, determine 𝐶’s share of the market
(a) in 5 years and (b) over the long run.

GUIDANCE for solving.

• Suppose that the brand attraction is time homogeneous, for a sample of large enough size 𝑛, we
denote the customer’s attention in the year 𝑛 by a random variable 𝑋𝑛 .

• The market share probability of the whole population can then be approximated by using the sample
statistics, e.g.

$$
P(X_n = C) = \frac{|\{x : X_n(x) = C\}|}{n}, \quad \text{and} \quad P(X_n = M) = 1 - P(X_n = C).
$$

• Set 𝑛 = 0 for the current time, the initial probabilities then is

𝑝(0) = [0.6, 0.4] = [P(𝑋0 = 𝐶), P(𝑋0 = 𝑀 )].

Obviously we want to know the market share probabilities

𝑝(𝑛) = [P(𝑋𝑛 = 𝐶), P(𝑋𝑛 = 𝑀 )]

at any year 𝑛 > 0.


• We build a transition probability matrix, with rows and columns labelled by 𝐶 and 𝑀:

$$
P = \begin{bmatrix} 0.88 & 0.12 \\ 0.15 & 0.85 \end{bmatrix} = \begin{bmatrix} 1-a & a \\ b & 1-b \end{bmatrix}, \tag{12.4}
$$

where 𝑎 = 𝑝𝐶𝑀 = P[𝑋𝑛+1 = 𝑀 | 𝑋𝑛 = 𝐶] = 0.12 and 𝑏 = 𝑝𝑀𝐶 = P[𝑋𝑛+1 = 𝐶 | 𝑋𝑛 = 𝑀] = 0.15.

12.3.3 Chapman-Kolmogorov equations

We can find the absolute probabilities at any stage 𝑛 using

$$
p_{ij}^{(h)} = P[X_{t+h} = j \mid X_t = i], \quad \text{with } p_{ij}^{(1)} = p_{ij}. \tag{12.5}
$$

This is called the ℎ-step transition probability, being independent of 𝑡 if the chain is homoge-
neous, see Equation 12.2. The ℎ-step transition matrix is denoted as $P^{(h)} = (p_{ij}^{(h)})$.
We have a recursive way of obtaining 𝑃(ℎ) as follows.

The Chapman-Kolmogorov equations relate the ℎ-step transition probabilities $p_{ij}^{(h)}$ to the 𝑘-step and (ℎ − 𝑘)-step transition probabilities:

$$
p_{ij}^{(h)} = \sum_{l=1}^{s} p_{il}^{(h-k)} \, p_{lj}^{(k)}, \quad 0 < k < h.
$$

This results in the matrix notation

$$
P^{(h)} = P^{(h-k)} \, P^{(k)}.
$$

For the case ℎ = 0, we have

$$
p_{ij}^{(0)} = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j. \end{cases}
$$

Since 𝑃(1) = 𝑃, we get 𝑃(2) = 𝑃², and generally 𝑃(ℎ) = 𝑃ʰ.

12.3.4 Compute the probability distribution at stage 𝑛

Now from each 𝑝𝑖 (𝑛) being defined as in Equation (12.1), that is the density P[𝑋𝑛 = 𝑖] of the chain 𝑋𝑛
at time 𝑛 receiving state 𝑖 ∈ 𝑄, we set 𝑝(𝑛) to be the vector form of probability mass distribution (pmf
or absolute probability distribution) associated with all possible 𝑋𝑛 of the Markov process, i.e.

𝑝(𝑛) = [𝑝1 (𝑛), 𝑝2 (𝑛), 𝑝3 (𝑛), . . . , 𝑝𝑠 (𝑛)].


Proposition 12.1.

The absolute probability distribution 𝑝(𝑛) at any stage 𝑛 of a Markov chain is given in the form

𝑝(𝑛) = 𝑝(0) 𝑃 𝑛 , (12.6)

where 𝑝(0) = 𝑝 is the initial probability vector.

Proof. Obviously 𝑝(𝑛) = 𝑝(0) 𝑃(𝑛). We then employ two facts:

* 𝑃(𝑛) = 𝑃ⁿ, and
* the absolute probability distribution 𝑝(𝑛 + 1) at any stage 𝑛 + 1 (associated with 𝑋𝑛+1) can be found
from the 1-step transition matrix 𝑃 = [𝑝𝑖𝑗] and the distribution vector 𝑝(𝑛). Indeed, we have

$$
p_j(n+1) = \sum_{i=1}^{s} p_{ij} \, p_i(n),
$$

or, in matrix notation,

𝑝(𝑛 + 1) = 𝑝(𝑛) 𝑃.

Then just do the induction 𝑝(𝑛 + 1) = 𝑝(𝑛) 𝑃 = 𝑝(𝑛−1) 𝑃 𝑃 = · · · = 𝑝(0) 𝑃ⁿ⁺¹.

Example 12.4 (The Coopmart chain: cont. ).


(a/) 𝐶's share of the market in 5 years can be computed by

$$
p(5) = [p_C(5), p_M(5)] = p(0)\, P^5.
$$
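A minimal numerical sketch of this computation, with 𝑃 as in Eqn. 12.4:

# p(5) = p(0) P^5 for the Coopmart chain
P <- matrix(c(0.88, 0.12,
              0.15, 0.85), nrow = 2, byrow = TRUE)
p <- c(0.6, 0.4)               # p(0)
for (n in 1:5) p <- p %*% P    # multiply the row vector by P five times
p                              # [p_C(5), p_M(5)], approximately (0.565, 0.435)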

Practical Problem 2. A state transition diagram of a finite-state Markov chain is a line diagram
- with a vertex corresponding to each state and
- a directed line between two vertices 𝑖 and 𝑗 if 𝑝𝑖𝑗 > 0.

In such a diagram, if one can move from 𝑖 and 𝑗 by a path following the arrows, then 𝑖 → 𝑗.
The diagram is useful to determine whether a finite-state Markov chain is irreducible or not, or to
check for periodicity.
Draw the state transition diagrams and classify the states of the MCs with the following transition
probability matrices:
$$
P_1 = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0.5 \\ 0.5 & 0.5 & 0 \end{bmatrix}; \qquad
P_2 = \begin{bmatrix} 0 & 0 & 0.5 & 0.5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.
$$


12.3.5 Describing and using a Markov chain

For convenience, from now on we will use the following notations in parallel:

• 𝑋(0), 𝑋0 both mean the distribution of initial states of the chain.

• 𝑋(𝑛), 𝑋𝑛 are the distribution of states of the chain at any stage (time point) 𝑛 ∈ N,

or equivalently 𝑋(ℎ), 𝑋ℎ the distribution of states at time point ℎ ∈ N.

♣ QUESTION 12.1. What do we need to know to describe a Markov chain?

By the Markov property, each next state should be predicted from the previous state only. Therefore,
it is sufficient to know

1. 𝑝(0) - the distribution of its initial state 𝑋(0),

𝑝(0) = [𝑝1 (0), 𝑝2 (0), 𝑝3 (0), . . . , 𝑝𝑠 (0)] = P[𝑋(0) = [𝑒1 , 𝑒2 , · · · , 𝑒𝑠 ]]

here 𝑝𝑖 (0) = P[𝑋(0) = 𝑖] is the pmf of 𝑋(0) receiving state 𝑖 ∈ S.

2. The mechanism of transitions from one state to another, i.e. all one-step transition probabilities 𝑝𝑖,𝑗 ,
or the transition probability matrix P = [𝑝𝑖,𝑗 ].

Knowledge box 11 (Long-term forecast probability distribution).

Based on the data 𝑝(0) and P, we:

(ℎ)
• compute ℎ-step transition probabilities 𝑝𝑖,𝑗 , using P(ℎ) = Pℎ ;

• then find the vector


𝑝(𝑛) = [𝑝1 (𝑛), 𝑝2 (𝑛), 𝑝3 (𝑛), . . . , 𝑝𝑠 (𝑛)],

called the probability distribution of states at time point 𝑛, which is our forecast for 𝑋(𝑛);
given by
𝑝(𝑛) = 𝑝(0) P𝑛 , (12.7)

where 𝑝(0) = 𝑝 = P[𝑋(0) = [𝑒1 , 𝑒2 , · · · , 𝑒𝑠 ]] is the initial probability vector;

• to know our long-term forecast: take the limit of 𝑝(𝑛) as 𝑛 → ∞. It will be more efficient to take
the limit of matrix P𝑛 .


12.4 Limiting distributions and Classification of states

NOTATIONS

• 𝑝𝑖𝑗 = P(𝑋𝑛+1 = 𝑗 | 𝑋𝑛 = 𝑖): the (one-step) transition probability;

• $p_{ij}^{(h)}$ = P(𝑋𝑚+ℎ = 𝑗 | 𝑋𝑚 = 𝑖): the ℎ-step transition probability;

• 𝑝𝑖(𝑡) = P[𝑋(𝑡) = 𝑖]: the distribution of 𝑋(𝑡), i.e. the distribution over states 𝑖 ∈ 𝑄 at time 𝑡;

• 𝑝(𝑛) = [𝑝1(𝑛), 𝑝2(𝑛), 𝑝3(𝑛), . . . , 𝑝𝑠(𝑛)]: the distribution vector at time point 𝑛;

• 𝑝(0) = [𝑝1(0), 𝑝2(0), 𝑝3(0), . . . , 𝑝𝑠(0)]: the initial distribution vector.

Definition 12.6 (Stationary distribution).

Vector 𝑝* = (𝑝*1 , 𝑝*2 , · · · , 𝑝*𝑠 ) is called the stationary distribution of a Markov chain {𝑋𝑛 , 𝑛 ≥ 0}
with the state transition matrix 𝑃 if:

𝑝* P = 𝑝* . (12.8)

This equation indicates that a stationary distribution 𝑝* is a left eigenvector of P with eigenvalue 1.
Note that any nonzero multiple of 𝑝* is also an eigenvector of P. But the stationary distribution 𝑝* is
fixed by being a probability vector; that is, its components sum to 1.
The absolute probability distribution 𝑝(𝑛) is

𝑝(𝑛) = 𝑝(0) P𝑛 = 𝑝(1) P𝑛−1 = 𝑝(2) P𝑛−2 = . . . , (12.9)

where 𝑝(0) = 𝜋 is the initial probability vector.


In general, taking 𝑛 → ∞ in Equation (12.9) we may find the limiting probability 𝑝(∞) as

𝑝(∞) = 𝑝(0) P∞ .

We need some general results to determine the stationary distribution 𝑝* and the limiting probability
𝑝(∞) of a Markov chain. For the specific class of MCs below, a stationary distribution exists.

Definition 12.7 (Regular matrix).

Let 𝑀 = {𝑋𝑛 : 𝑛 ≥ 0} = (S, 𝜋, P) be a Markov chain.


𝑀 is called regular if there is a finite positive integer 𝑚 such that after 𝑚 time-steps, every state
has a nonzero chance of being occupied, no matter what the initial state.
This is equivalent to saying there exists 𝑚 ∈ N such that

P(𝑚) = P𝑚 > 0

(i.e. every matrix entry is positive; the matrix inequality 𝐴 = [𝑎𝑖,𝑗] > 0 means that all entries 𝑎𝑖,𝑗 > 0).

12.4.1 Limiting distribution at states

Lemma 12.2.

If 𝑀 = {𝑋𝑛 : 𝑛 ≥ 0} = (S, 𝜋, P) is a regular homogeneous Markov chain then

lim P𝑛 = P∞ (12.10)
𝑛→∞

where P∞ is a matrix whose rows are all equal to the stationary distribution 𝑝* .

Then the limiting probability 𝑝(∞) is found as 𝑝(∞) = 𝑝(0) P∞ .


We discuss here two particular cases when 𝑠 = 2 and 𝑠 > 2.

I) Markov chains that have two states, 𝑠 = 2.


At first we investigate the case of Markov chains that have two states, say S = {𝑒1, 𝑒2}. Let 𝑎 = 𝑝𝑒1𝑒2
and 𝑏 = 𝑝𝑒2𝑒1 be the state transition probabilities between the distinct states in a two-state Markov chain; its
state transition matrix is

$$
P = \begin{bmatrix} p_{1,1} & p_{1,2} \\ p_{2,1} & p_{2,2} \end{bmatrix} = \begin{bmatrix} 1-a & a \\ b & 1-b \end{bmatrix}, \quad \text{where } 0 < a < 1,\ 0 < b < 1. \tag{12.11}
$$

Proposition 12.3.
a) The 𝑛-step transition probability matrix is given by


$$
P^{(n)} = P^n = \frac{1}{a+b} \left\{ \begin{bmatrix} b & a \\ b & a \end{bmatrix} + (1-a-b)^n \begin{bmatrix} a & -a \\ -b & b \end{bmatrix} \right\}
$$

b) Find the limit matrix when 𝑛 −→ ∞.

Proof. To prove this (computing transition probability matrix of a 2-state Markov chain), we use a fun-
damental result of Linear Algebra, recalled in Section 12.5.2.
The eigenvalues of the state transition matrix 𝑃 found by solving equation

𝑐(𝜆) = |𝜆𝐼 − P| = 0

are 𝜆1 = 1 and 𝜆2 = 1 − 𝑎 − 𝑏. The spectral decomposition of square matrix (Section 12.5.2) says P
can be decomposed into two constituent matrices 𝐸1 , 𝐸2 (since only two eigenvalues was found):

$$
E_1 = \frac{1}{\lambda_1 - \lambda_2}\,[P - \lambda_2 I], \qquad E_2 = \frac{1}{\lambda_2 - \lambda_1}\,[P - \lambda_1 I].
$$

These 𝐸1, 𝐸2 are mutually orthogonal idempotents, i.e. 𝐸1 · 𝐸2 = 0 = 𝐸2 · 𝐸1, and

$$
P = \lambda_1 E_1 + \lambda_2 E_2; \qquad E_1^2 = E_1, \quad E_2^2 = E_2.
$$

Hence,

$$
P^n = \lambda_1^n E_1 + \lambda_2^n E_2 = E_1 + (1-a-b)^n E_2,
$$

or

$$
P^{(n)} = P^n = \frac{1}{a+b} \left\{ \begin{bmatrix} b & a \\ b & a \end{bmatrix} + (1-a-b)^n \begin{bmatrix} a & -a \\ -b & b \end{bmatrix} \right\}.
$$

b) The limit matrix when 𝑛 −→ ∞ (note |1 − 𝑎 − 𝑏| < 1):

$$
\lim_{n \to \infty} P^n = \frac{1}{a+b} \begin{bmatrix} b & a \\ b & a \end{bmatrix}
$$
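A quick numerical check of Proposition 12.3 (a sketch, using the values 𝑎 = 0.12, 𝑏 = 0.15 of Example 12.3):

# Compare the closed form for P^n with direct matrix multiplication
a <- 0.12; b <- 0.15; n <- 6
P  <- matrix(c(1-a, a, b, 1-b), nrow = 2, byrow = TRUE)
Pn <- diag(2); for (i in 1:n) Pn <- Pn %*% P          # direct power P^n
E1 <- matrix(c(b, a, b, a),   nrow = 2, byrow = TRUE) / (a + b)
E2 <- matrix(c(a, -a, -b, b), nrow = 2, byrow = TRUE) / (a + b)
all.equal(Pn, E1 + (1 - a - b)^n * E2)                # should return TRUE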

II) Markov chains that have more than two states, 𝑠 > 2.
For 𝑠 > 2 it is cumbersome to compute the constituent matrices 𝐸𝑖 of P; we could employ the so-called
regular property (see Theorem 12.9), or use linear algebra directly via Equation 12.16.

In the next sections we provide a classification of states in a Markov chain. Four types are considered:
accessible, recurrent (persistent), periodic, and absorbing.


12.4.2 Accessible states


(𝑁 )
• State 𝑗 is said to be accessible from state 𝑖 if for some 𝑁 ≥ 0, 𝑝𝑖,𝑗 > 0, and we write 𝑖 → 𝑗.

(𝑁 )
• Two states 𝑖 and 𝑗 accessible to each other are said to communicate, write 𝑖 ↔ 𝑗, if ∃𝑁 ≥ 0, 𝑝𝑖,𝑗 > 0
(𝑀 )
and ∃𝑀 ≥ 0, 𝑝𝑗,𝑖 > 0.

Two states that communicate are said to be in the same class. All members of one class communi-
cate with one another.

If a class is not accessible from any state outside the class, we define the class to be a closed
communicating class.

Definition 12.8 (Irreducible Markov chain).

If all states communicate with each other, then we say that the Markov chain 𝑀 = (S, 𝜋, P) is
irreducible. Formally, irreducibility means either of the following 3 conditions is satisfied.

1. 𝑀 is irreducible if and only if for all 𝑖, 𝑗 ∈ S: $\exists N \ge 0 \ [\,p_{i,j}^{(N)} > 0\,]$.

2. 𝑀 is irreducible if and only if 𝑖 → 𝑗 or 𝑗 → 𝑖 for all pairs of 𝑖, 𝑗 ∈ S.

3. The chain is irreducible if and only if there exists a path, whose probability is strictly positive,
which starts from any state 𝑖 and returns to 𝑖 after having visited at least once all other states of
the chain.

We can say that there exists a cycle with strictly positive probability.

The states of a Markov chain can be classified into two broad groups: those that the process enters
infinitely often and those that it enters finitely often. In the long run, the process will be found to be in
only those states that it enters infinitely often.

12.4.3 Recurrent states and Transient states

Let 𝐴(𝑗) be the set of states that are accessible from state 𝑗. We say that

1. State 𝑗 is recurrent if from any future state, there is always some probability of returning to 𝑗 and,
given enough time, this is certain to happen.

By repeating this argument, if a recurrent state is visited once, it will be revisited an infinite number
of times.

2. State 𝑗 is transient if it is not recurrent.

Thus, a transient state will only be visited a finite number of times.

We now formalize concepts of recurrent (persistence) state and transient state.


The first return time: The first return time 𝑇𝑗 to state 𝑗 ∈ S is the number of steps until the chain is first
at state 𝑗 again, after leaving 𝑗 at time 0.

𝑇𝑗 is a discrete r.v., taking values in Range(𝑇𝑗 ) = {1, 2, 3, ...} ∪ {∞}

(if 𝑗 is never reached let 𝑇𝑗 = ∞).

Probability of the first passage or visit: For any two states 𝑖 ≠ 𝑗 and 𝑛 > 0,

let 𝑓𝑖,𝑗(𝑛) be the conditional probability that, given that the process is presently in state 𝑖, the first time
it enters state 𝑗 occurs in exactly 𝑛 transitions (or steps):

$$
f_{i,j}(n) = P[T_j = n \mid X_0 = i] = P[X_n = j,\ X_k \neq j,\ k = 1, 2, \ldots, n-1 \mid X_0 = i], \tag{12.12}
$$

with 𝑓𝑖,𝑗(0) = 0 since 𝑇𝑗 ≥ 1, and 𝑓𝑖,𝑗(1) = P[𝑋1 = 𝑗 | 𝑋0 = 𝑖] = 𝑝𝑖,𝑗.

• We call 𝑓𝑖,𝑗(𝑛) the probability of first passage from state 𝑖 to state 𝑗 in 𝑛 steps. By the addition rule and
Bellman's optimality principle:

$$
f_{i,j}(n) = \sum_{k \neq j} p_{i,k} \, f_{k,j}(n-1), \quad \text{for } n = 2, 3, \ldots
$$

• Note that this probability is still valid when 𝑗 = 𝑖:

$$
f_{j,j}(n) = \sum_{k \neq j} p_{j,k} \, f_{k,j}(n-1), \quad \text{for } n = 2, 3, \ldots
$$

Definition 12.9 (Recurrent and transient states).

The probability of visiting 𝑗, starting from 𝑖, in a finite number of steps is

$$
f_{i,j} = P[T_j < \infty \mid X_0 = i] = \sum_{n=0}^{\infty} f_{i,j}(n).
$$

1. State 𝑗 is recurrent if
𝑓𝑗,𝑗 = P[𝑇𝑗 < ∞|𝑋0 = 𝑗] = 1,

i.e., starting from the state, the process is guaranteed to return to the state again and again, in
fact, infinitely many times.

2. State 𝑗 is said to be transient (or non-recurrent) if

𝑓𝑗,𝑗 = P[𝑇𝑗 < ∞|𝑋0 = 𝑗] < 1. (12.13)

In this case there is positive probability 1 − 𝑓𝑗,𝑗 of never returning to state 𝑗.


So, a transient state, as the name suggests, is a state to which we may not come back. Recurrent
state is one to which we will come back with probability 1. Next, we will characterize the recurrent
states further.

Problem 12.1. Prove the following key facts of a Markov chain .

1. Show that in a finite-state Markov chain, not all states can be transient, in other words at least one of
the states must be recurrent.

2. Show that if P is a Markov matrix, then P𝑛 is also a Markov matrix for any positive integer 𝑛.

3. Verify the transitivity property of Markov chains, that is, if 𝑖 → 𝑗 and 𝑗 → 𝑘, then 𝑖 → 𝑘. (Hint: use
the Chapman-Kolmogorov equations).

GUIDANCE for solving.

1. Use contradiction.
2. Employ the fact that the row sums of a stochastic matrix all are equal to 1.

Theorem 12.4.

1. Let 𝑁𝑗 be the number of times that state 𝑗 is visited, given that 𝑋0 = 𝑗 (and let 𝑇𝑗 be the first
return time, as above). The state 𝑗 is recurrent if and only if E[𝑁𝑗] = ∞.

Hint: Argue that 𝑁𝑗 ∼ Geom(𝑝 := 1 − 𝑓𝑗,𝑗).

2. Recurrence is a class property: if 𝑖 and 𝑗 communicate, then they are either both recurrent or
both transient.

Hint: It is sufficient to show that if 𝑖 is recurrent, then 𝑗 too is recurrent. (The other result being
simply the contrapositive of this assertion).

3. Hence, all states of a finite and irreducible Markov chain are recurrent.

Recurrent states: positive recurrent and null recurrent


Assume that 𝑗 is a recurrent state. The mean time to return to 𝑗, starting at 𝑗, is denoted by

$$
\mu_{jj} = \mathbb{E}[T_j \mid X_0 = j] = \sum_{n=0}^{\infty} n \, f_{j,j}(n).
$$

1. State 𝑗 is said to be positive recurrent if, starting from the state, the expected number of transitions
until the chain return to the state is finite:

𝜇𝑗𝑗 = E[𝑇𝑗 |𝑋0 = 𝑗] < ∞.


2. State 𝑗 is said to be null recurrent if 𝜇𝑗𝑗 = ∞.

Problem 12.2.

Consider a Markov chain with state space {0, 1} and transition probability matrix

$$
P = \begin{bmatrix} 1 & 0 \\ 1/2 & 1/2 \end{bmatrix}
$$

i/ Show that state 0 is recurrent. ii/ Show that state 1 is transient.

SOLUTION: Use the definition; compute

$$
f_{i,j} = P[T_j < \infty \mid X_0 = i] = \sum_{n=0}^{\infty} f_{i,j}(n).
$$

12.4.4 Periodic states

In a finite Markov chain 𝑀 = (𝑄, 𝜋, P) (i.e. one having a finite number of states),

• Define the period of a state 𝑖 ∈ S to be

$$
d_i = \gcd\{n > 0 : p_{i,i}^{(n)} > 0\},
$$

where gcd stands for the greatest common divisor.

• A periodic state 𝑖 is a state for which 𝑑𝑖 > 1; it means $p_{i,i}^{(n)} > 0$ for 𝑛 divisible by 𝑑𝑖, and $p_{i,i}^{(n)} = 0$ for any 𝑛 that
is not divisible by 𝑑𝑖.

• State 𝑖 is called aperiodic if 𝑑𝑖 = 1.

• A Markov Chain 𝑀 is aperiodic if the period of each state 𝑖 ∈ S is 1; in other words, there is no
such periodic state in 𝑀 .

Proposition 12.5.

1. If $p_{i,i}^{(n)} = 0$ for all 𝑛, we consider 𝑖 as an aperiodic state. That is, the chain may start from state 𝑖,
but it leaves 𝑖 on its first transition and never returns to 𝑖.

2. It can be shown that periodicity is a class property: if 𝑖 and 𝑗 communicate and if 𝑖 is periodic
with period 𝑑, then 𝑗 too is periodic with period 𝑑.

3. If $p_{i,i}^{(1)} > 0$, then 𝑖 is evidently aperiodic.

4. If $p_{i,i}^{(2)} > 0$ and $p_{i,i}^{(3)} > 0$, then 𝑖 is aperiodic. Prove this.

Example 12.5.


Consider a Markov chain with state space {0, 1, 2} with transition probability matrix

$$
P = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{bmatrix}.
$$

Prove that:
1/ the Markov chain is irreducible, 2/ the Markov chain is aperiodic.

HINT:
1/ Find $p_{i,i}^{(n)} > 0$ for each 𝑖. 2/ Use Property 4 above.
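These claims can also be checked numerically (a sketch): all entries of 𝑃² are positive, so the chain is regular and hence irreducible; the diagonal entries of 𝑃² and 𝑃³ are positive, which gives aperiodicity by Property 4 above.

P  <- matrix(c(0, 1/2, 1/2,
               1/2, 0, 1/2,
               1/2, 1/2, 0), nrow = 3, byrow = TRUE)
P2 <- P %*% P; P3 <- P2 %*% P
all(P2 > 0)            # TRUE: the chain is regular, hence irreducible
diag(P2); diag(P3)     # all positive: every state is aperiodic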

Example 12.6.

We can check that if a MC has the transition matrix

$$
P = \begin{bmatrix} 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 0.3 & 0.7 \\ 0.5 & 0.5 & 0 & 0 \\ 0.2 & 0.8 & 0 & 0 \end{bmatrix},
$$

then it is periodic.

HINT: Indeed, if the Markovian random variable (agent)
- starts at time 0 in state 𝐸1, then
- at time 1 it must be in state 𝐸3 or 𝐸4, and at time 2 it must be in state 𝐸1 or 𝐸2.
Therefore, it generally can visit 𝐸1 only at times 2, 4, 6, ...

12.4.5 Absorption problems: states and probabilities

Absorbing state: State 𝑗 is said to be an absorbing state if 𝑝𝑗𝑗 = 1; that is, once state 𝑗 is reached,
it is never left. If there are multiple absorbing states, the probability that one of them will be eventually
reached is still 1, but the identity of the absorbing state to be entered is random and the associated
probabilities may depend on the starting state.

Example 12.7. The Gambler’s Ruin problem

A player, at each play of a game, wins one unit (for example, 1 USD) with probability 𝑝 and loses one
unit with probability 𝑞 := 1 − 𝑝. Assume that he initially possesses 𝑖 units and that he plays independent
repetitions of the game until his fortune reaches 𝑘 units or he goes broke.


Let 𝑋𝑛 be the fortune of the player at time 𝑛 (after 𝑛 plays).

• Then {𝑋𝑛 : 𝑛 = 0, 1, . . .} is a Markov chain whose state space is

S = {0, 1, . . . , 𝑘}.

• The states 0 and 𝑘 are absorbing. We have that 𝑝0,0 = 𝑝𝑘,𝑘 = 1.

• The chain thus has three classes: 𝐶1 = {0}, 𝐶2 = {𝑘}, and 𝐶3 = {1, 2, . . . , 𝑘 − 1}. The first two
are recurrent, because 0 and 𝑘 are absorbing, whereas the third one is transient. For example,
by Condition (12.13), at state 1 ∈ 𝐶3 we have that

P[𝑋1 = 0|𝑋0 = 1] = 𝑞 > 0 =⇒ 𝑓1,1 < 1,

there is positive probability 1 − 𝑓1,1 of never returning to state 1; from which we can conclude
that the player’s fortune will reach 0 or 𝑘 units after a finite number of repetitions.
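The absorption probabilities can also be estimated by simulation; a small Monte Carlo sketch (the function name and the parameter values are ours, for illustration):

# Estimate P(fortune reaches k before 0 | X_0 = i)
ruin_sim <- function(i, k, p, nrep = 10000) {
  wins <- 0
  for (r in 1:nrep) {
    x <- i
    while (x > 0 && x < k)
      x <- x + ifelse(runif(1) < p, 1, -1)   # one play: +1 w.p. p, -1 w.p. q
    if (x == k) wins <- wins + 1
  }
  wins / nrep
}
ruin_sim(i = 3, k = 10, p = 0.5)   # for p = 1/2, theory gives i/k = 0.3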

Summary

A finite Markov chain 𝑀 = (S, 𝜋, P) is

1. irreducible iff it has only one single recurrent class, i.e. any state is accessible from all
states.

2. aperiodic iff the period of each state 𝑖 ∈ S is 1.

3. ergodic if it is positive recurrent and aperiodic.

Fact 12.1. In a DTMC 𝑀 that have more than two states, we have 4 cases:

1. 𝑀 has irreducible, positive recurrent, but periodic states. The component 𝜋𝑖 of the stationary distri-
bution vector 𝜋 is understood as the long-run proportion of time that the process is in state 𝑖.

2. 𝑀 has several closed, positive recurrent classes. In this case, the transition matrix of the DTMC
takes the block form. In contrast to the irreducible ergodic DTMC, where the limiting distribution is
independent of the initial state, the DTMC with several closed, positive recurrent classes has the
limiting distribution that is dependent on the initial state.

3. 𝑀 has both recurrent and transient classes. In this situation, we often seek the probabilities that
the chain is eventually absorbed by different recurrent classes. See the well-known gambler’s ruin
problem.

4. 𝑀 is an irreducible DTMC with null recurrent or transient states. This case is only possible when
the state space is infinite, since any finite-state, irreducible DTMC must be positive recurrent. In this
case, neither the limiting distribution nor the stationary distribution exists. A well-known example of
this case is the random walk model, see Chapter 13.


12.5 Theory of stochastic matrix for Markov chains

12.5.1 On eigenspace of a square matrix (or linear operator)

Let 𝐴 ∈ Mat𝑛 (R) be a square matrix of order 𝑛 with real entries. Let 𝑥 be an indeterminate, then the
polynomial
𝑝𝐴 (𝑥) = 𝑝(𝑥) = det([𝐴 − 𝑥 I𝑛 ]) = 𝑥𝑛 + 𝑎𝑛−1 𝑥𝑛−1 + . . . + 𝑎1 𝑥 + 𝑎0 (12.14)

is called the characteristic polynomial of 𝐴, here I𝑛 is the identity matrix of size 𝑛.

• A value 𝜆 is an eigenvalue of 𝐴 if det([𝐴 − 𝜆 I𝑛]) = 0 ⇐⇒ 𝑝(𝜆) = 0. A nonzero vector 𝑣 ∈ Rⁿ is an
eigenvector associated with 𝜆 if the equation (𝐴 − 𝜆 I𝑛)𝑣 = 0 is satisfied.

• For an eigenvalue 𝜆, the pair (𝜆, 𝑣) is named an eigenpair. All eigenvectors associated with the
eigenvalue 𝜆 form a subspace of Rⁿ, denoted by

𝑉𝜆 := {𝑣 ∈ R𝑛 : 𝐴𝑣 = 𝜆𝑣},

and called the eigenspace of 𝜆.

Example 12.8. Find all eigenvalues of the matrix

$$
P = \begin{bmatrix} 1 & -4 & -4 \\ 8 & -11 & -8 \\ -8 & 8 & 5 \end{bmatrix}.
$$

The characteristic equation is 𝑝(𝜆) = det(P − 𝜆 I₃) = (𝜆 − 1)(𝜆 + 3)² = 0 (why?).

So 𝜆 = 1 is a simple eigenvalue, and 𝜆 = −3 is an eigenvalue with algebraic multiplicity 2.

Definition 12.10.

• Non-negative matrix 𝐴 = [𝑎𝑖𝑗 ] ≥ 0 means all 𝑎𝑖𝑗 ≥ 0.

• A stochastic matrix 𝐴 is non-negative and each row sum equals one. E.g., the transition probability
matrix P = [𝑝𝑖,𝑗 ] of a Markov chain is a stochastic matrix.

• If the column sums also equal one, the matrix is called doubly stochastic.

Fact 12.2. Consider a square matrix P of order 𝑠 with eigenvalues 𝜆𝑖 .

• The trace $tr(\mathrm{P}) = \sum_{i=1}^{s} \lambda_i$, and the determinant $|\mathrm{P}| = \prod_{i=1}^{s} \lambda_i$.

• The spectrum 𝜎(P) of P consists of its eigenvalues, 𝜎(P) = {𝜆1 , 𝜆2 , · · · , 𝜆𝑘 }.


• If P has eigenpairs (i.e. eigenvalues are all distinct, 𝜆𝑖 ̸= 𝜆𝑗 )

{(𝜆1 , 𝑥1 ), (𝜆2 , 𝑥2 ), · · · , (𝜆𝑘 , 𝑥𝑘 )},

then 𝑆 = {𝑥1 , · · · , 𝑥𝑘 } is a linearly independent set in the space R𝑠 .

• If 𝐵𝑖 is a basis for the null space 𝑁 (P − 𝜆𝑖 𝐼), then B = 𝐵1 ∪ 𝐵2 · · · ∪ 𝐵𝑘 is a linearly independent set.

12.5.2 Characterization for Diagonalizable Matrices

Definition 12.11.

Any square matrix that can be transformed into a diagonal matrix through the postmultiplication by a
nonsingular matrix and premultiplication by its inverse is said to be diagonalizable.
Precisely, a square matrix P is diagonalizable if and only if there exists a nonsingular matrix 𝐻 (i.e.
det(𝐻) ̸= 0) such that 𝐻 −1 P𝐻 is a diagonal matrix.

We can test a matrix to be diagonalizable by using eigenpairs (see Theorem 12.15).

Square matrix P of order 𝑠 is diagonalizable if and only if P possesses a complete set of eigen-
vectors (i.e. a set of 𝑠 linearly independent vectors corresponding with distinct eigenvalues 𝜆𝑗 ).
Moreover, the nonsingular matrix 𝐻 is built by a complete set of eigenvectors 𝑣𝑗 as the columns,
where each (𝜆𝑗 , 𝑣𝑗 ) is an eigenpair of P; and we have

𝐻 −1 P𝐻 = 𝐷 = Diag(𝜆1 , 𝜆2 , · · · , 𝜆𝑠 ).

Thus, a square matrix with distinct eigenvalues is diagonalizable.

Lemma 12.6 (Spectral decomposition of Diagonalizable Matrices by orthogonal components).

A square matrix P of order 𝑠 with spectrum 𝜎(P) = {𝜆1 , 𝜆2 , · · · , 𝜆𝑘 }, where 𝑘 ≤ 𝑠, is diagonal-


izable if and only if there exist constituent matrices {𝐸1 , 𝐸2 , · · · , 𝐸𝑘 } such that

P = 𝜆 1 𝐸1 + 𝜆 2 𝐸2 + · · · + 𝜆 𝑘 𝐸𝑘 , (12.15)

where the 𝐸𝑖 ’s have the following properties:

• 𝐸𝑖 · 𝐸𝑗 = 0 whenever 𝑖 ̸= 𝑗, and 𝐸𝑖2 = 𝐸𝑖 for all 𝑖 = 1..𝑘

• 𝐸1 + 𝐸2 + · · · + 𝐸𝑘 = 𝐼

12.5.3 Properties of stochastic matrices

Proposition 12.7.


1. Every stochastic matrix 𝐾 has

a/ 1 as an eigenvalue (possibly with multiplicity greater than one), and

b/ none of the eigenvalues exceeds 1 in absolute value, that is all eigenvalues 𝜆𝑖 satisfy |𝜆𝑖 | ≤ 1.

2. If 𝐾 is a stochastic matrix then 𝐾 𝑚 is a stochastic matrix.

Proof. Item 1. The spectral radius 𝜌(𝐾) of any square matrix 𝐾 is defined as

$$
\rho(K) = \max_i |\lambda_i|,
$$

the largest modulus of an eigenvalue.

a/ When 𝐾 is stochastic, 𝜌(𝐾) = 1, because 𝜆 = 1 is an eigenvalue with eigenvector 𝑒 = [1, 1, . . . , 1]ᵗ:
since every row of 𝐾 sums to one, 𝐾𝑒 = 𝑒, i.e. (𝐾 − 1 · I𝑛)𝑒 = 0, so det(𝐾 − 1 · I𝑛) = 0.

Item 2. Use the fact that 𝐾𝑒 = 𝑒 to prove that 𝐾ᵐ𝑒 = 𝑒.

In practice we employ Proposition 12.7, Item 2 in two ways:


I: if we know the decomposition (12.15) explicitly, we can compute powers

P^m = 𝜆_1^m 𝐸_1 + 𝜆_2^m 𝐸_2 + · · · + 𝜆_k^m 𝐸_k, for any integer m > 0.     (12.16)

II: if we know P is diagonalizable, we find the constituent matrices 𝐸_i by:

* finding the nonsingular matrix 𝐻 = (𝑥_1|𝑥_2| · · · |𝑥_k), where each 𝑥_i is a basis (right) eigenvector from the null space

𝑁(P − 𝜆_i 𝐼) = {𝑣 : (P − 𝜆_i 𝐼)𝑣 = 0 ⇔ P𝑣 = 𝜆_i 𝑣};

** then, P = 𝐻𝐷𝐻^{−1} = (𝑥_1|𝑥_2| · · · |𝑥_k) · 𝐷 · 𝐻^{−1}, where 𝐷 = Diag(𝜆_1, · · · , 𝜆_k) is the diagonal matrix, and 𝐻^{−1} = 𝐾′ has the rows 𝑦_1^t, 𝑦_2^t, . . . , 𝑦_k^t (i.e. 𝐾 = (𝑦_1|𝑦_2| · · · |𝑦_k)).

Here each 𝑦_i is a basis left eigenvector of P for 𝜆_i, i.e. an element of {𝑣 : 𝑣^t P = 𝜆_i 𝑣^t}.


The constituent matrices are 𝐸_i = 𝑥_i · 𝑦_i^t. Equation (12.15) is called the spectral decomposition of P.


In the case of only two eigenvalues 𝜆_1, 𝜆_2, the two constituent matrices 𝐸_1, 𝐸_2 are

𝐸_1 = (1/(𝜆_1 − 𝜆_2)) [𝑃 − 𝜆_2 I],    𝐸_2 = (1/(𝜆_2 − 𝜆_1)) [𝑃 − 𝜆_1 I].
See Example 12.12 for a case of three eigenvalues.

12.5.4 Ergodicity and regularity of stochastic matrices

Definition 12.12.

• A stochastic matrix P = [𝑝_{i,j}] is ergodic if lim_{m→∞} P^m = 𝐿 (say) exists, that is, each 𝑝_{i,j}^{(m)} has a limit when m → ∞.

• A stochastic matrix P is regular if there exists an 𝑚 ∈ N such that P𝑚 > 0.

In our context, a Markov chain, with transition probability matrix P, is called regular if there exists an
𝑚 > 0 such that P𝑚 > 0, i.e. there is a finite positive integer 𝑚 such that after 𝑚 time-steps, every
state has a nonzero chance of being occupied, no matter what the initial state is.

Example 12.9. Is the matrix

P = [ 0.88  0.12 ]
    [ 0.15  0.85 ]

regular? ergodic? Calculate the limit matrix 𝐿 = lim_{m→∞} P^m.

GUIDANCE for solving. DIY, with the n-step transition probability matrix

P(n) = P^n = (1/(a + b)) { [ b  a ]  +  (1 − a − b)^n [  a  −a ] }
                          [ b  a ]                    [ −b   b ]

when

P = [ 𝑝_{1,1}  𝑝_{2,1} ] = [ 1 − a    a   ] , where 0 < a < 1, 0 < b < 1.     (12.17)
    [ 𝑝_{1,2}  𝑝_{2,2} ]   [   b    1 − b ]
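A minimal R sketch of this guidance follows, taking a = 0.12 and b = 0.15 as read off the matrix of Example 12.9, and comparing the closed form with brute-force matrix powers:

a = 0.12; b = 0.15
P = matrix(c(1 - a, a, b, 1 - b), nrow = 2, byrow = TRUE)
Pn = function(n) {               # closed form (12.17) for P^n
  (matrix(c(b, a, b, a), 2, 2, byrow = TRUE) +
     (1 - a - b)^n * matrix(c(a, -a, -b, b), 2, 2, byrow = TRUE)) / (a + b)
}
pow = function(M, n) Reduce(`%*%`, replicate(n, M, simplify = FALSE))
all.equal(Pn(20), pow(P, 20))    # TRUE, up to rounding
Pn(1000)                         # rows approach [b, a]/(a+b) = [5/9, 4/9]

Since all entries of P are positive, P is regular, hence ergodic, and both rows of the limit 𝐿 equal [b, a]/(a + b).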

♣ QUESTION 12.2.

The limit matrix 𝐿 = lim_{m→∞} P^m shows, in practice, the long-term behaviour (distribution, properties) of the process. How do we know that 𝐿 exists (i.e. that the transition matrix P = [𝑝_{i,j}] is ergodic)?

Theorem 12.8. [Ergodicity of stochastic matrices]


A stochastic matrix P = [𝑝𝑖,𝑗 ] is ergodic if and only if


* the only eigenvalue 𝜆 of modulus (magnitude) 1 is 1 itself, and
* if 𝜆 = 1 has multiplicity 𝑘, there exist 𝑘 independent eigenvectors associated with this 1.

Theorem 12.9. [Regularity of stochastic matrices]

If a stochastic matrix P = [𝑝𝑖,𝑗 ] is regular then

1. 1 is an eigenvalue of multiplicity one, and all other eigenvalues 𝜆𝑖 satisfy |𝜆𝑖 | < 1;

2. P is ergodic, that is lim𝑚→∞ P𝑚 = 𝐿 exists. Furthermore, 𝐿’s rows are identical and equal to
the stationary distribution 𝑝* .

Proof.
Item (2). If (1) is proved then, by Theorem 12.8, P = [𝑝𝑖,𝑗 ] is ergodic. Hence, when P = [𝑝𝑖,𝑗 ] is
regular, the limit matrix 𝐿 = lim𝑚→∞ 𝑃 𝑚 does exist. By the decomposition (12.15),

P = 𝐸1 + 𝜆2 𝐸2 + · · · + 𝜆𝑘 𝐸𝑘 , where all |𝜆𝑖 | < 1, 𝑖 = 2, · · · , 𝑘.

Then, by Equation (12.16)

𝐿 = lim_{m→∞} P^m = lim_{m→∞} (𝐸_1 + 𝜆_2^m 𝐸_2 + · · · + 𝜆_k^m 𝐸_k) = 𝐸_1.

• Let vector 𝑝* be the unique left eigenvector associated with the largest eigenvalue 𝜆_1 = 1 (a simple eigenvalue, since it has multiplicity one), that is

𝑝* P = 𝑝* ⇔ 𝑝* (P − 1 𝐼) = 0.

• We now prove that 𝐿’s rows are identical and equal to the stationary distribution 𝑝* : 𝐿 = [𝑝* , · · · , 𝑝* ]′ .

Theorem 12.10. [Equilibrium distribution]

Given a finite, aperiodic and irreducible Markov chain 𝑀 = (S, 𝜋, P), where S consists of 𝑠 states.
Then there exist stationary probabilities

𝑝*_i := lim_{t→∞} 𝑝_i(t),

where the 𝑝*𝑖 form a unique solution to the conditions:


C1: Σ_{i=1}^{s} 𝑝*_i = 1, where each 𝑝*_i ≥ 0;

C2: 𝑝*_j = Σ_{i=1}^{s} 𝑝*_i 𝑝_{i,j}.


See the proof in Theorem 12.9, because C2 means that the stationary vector 𝑝* = [𝑝*1 , 𝑝*2 , . . . , 𝑝*𝑠 ]𝑇
satisfies equation 𝑝* P = 𝑝* .

Corollary 12.11. A few important remarks:

• (a) for regular MC, stationary distribution 𝑝* does not depend on the initial state distribution probabil-
ities 𝑝(0); by Theorem 12.9 [Item 2];

• (b) but, in general, the long-term behavior expressed by the limiting distribution 𝑝(∞) is influenced by the initial distribution 𝑝(0), via 𝑝(∞) = 𝑝(0) 𝐿, whenever the stochastic matrix P = [𝑝_{i,j}] is ergodic but not regular.

Example 12.10. Consider a Markov chain with two states and transition probability matrix

P = [ 0  1 ]
    [ 1  0 ]

(a) Find the stationary distribution 𝑝* of the chain. (b) Find lim𝑛→∞ P𝑛 .
SOLUTION:
a) Use Definition 12.8:

𝑝* P = 𝑝* ⇐⇒ [𝑝_1, 𝑝_2] [ 0  1 ] = [𝑝_1, 𝑝_2]
                         [ 1  0 ]

gives us 𝑝_1 = 𝑝_2 = 1/2. And so 𝑝* = ?

Example 12.11. Consider a Markov chain with two states and transition probability matrix

P = [ 3/4  1/4 ]
    [ 1/2  1/2 ]

(a) Find the stationary distribution 𝑝* of the chain.


(b) Find lim𝑛→∞ P𝑛 . (c) Find lim𝑛→∞ P𝑛 by evaluating P𝑛 first.
SOLUTION:
a) Similarly, solve

𝑝* P = 𝑝* ⇐⇒ [𝑝_1, 𝑝_2] [ 3/4  1/4 ] = [𝑝_1, 𝑝_2] = ?
                         [ 1/2  1/2 ]


b) The Markov chain is regular, hence by Theorem 12.9,

lim_{m→∞} P^m = 𝐿 = [𝑝*, 𝑝*]′ = [ 𝑝_1  𝑝_2 ] .
                                 [ 𝑝_1  𝑝_2 ]
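A minimal R sketch for this example: the stationary distribution is the normalized eigenvector of t(P) for the eigenvalue 1 (a left eigenvector of P), and the rows of P^n converge to it.

P = matrix(c(3/4, 1/4, 1/2, 1/2), nrow = 2, byrow = TRUE)
v = eigen(t(P))$vectors[, 1]   # eigenvector of t(P) for lambda = 1
p.star = v / sum(v)            # normalize; gives (2/3, 1/3)
pow = function(M, n) Reduce(`%*%`, replicate(n, M, simplify = FALSE))
pow(P, 50)                     # both rows are close to p.star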

Example 12.12. Diagonalize the following matrix and provide its spectral decomposition.

P = [  1   -4   -4 ]
    [  8  -11   -8 ] .
    [ -8    8    5 ]

The characteristic equation is 𝑝(𝜆) = det(𝜆𝐼 − P) = 𝜆^3 + 5𝜆^2 + 3𝜆 − 9 = (𝜆 − 1)(𝜆 + 3)^2 = 0.


So 𝜆 = 1 is a simple eigenvalue, and 𝜆 = −3 is repeated twice (its algebraic multiplicity is 2). Any maximal linearly independent set of vectors 𝑥 satisfying

𝑥 ∈ 𝑁(P − 𝜆𝐼) ⇔ (𝑃 − 𝜆𝐼)𝑥 = 0

can be taken as a basis of the eigenspace (or null space) 𝑁(P − 𝜆𝐼). Bases of the eigenspaces are:

𝑁(P − 1𝐼) = span( [1, 2, −2]′ );  and  𝑁(P + 3𝐼) = span( [1, 1, 0]′, [1, 0, 1]′ ).

It is easy to check that these three eigenvectors 𝑥_i form a linearly independent set, so P is diagonalizable. The nonsingular matrix (also called the similarity transformation matrix)

𝐻 = (𝑥_1|𝑥_2|𝑥_3) = [  1  1  1 ]
                    [  2  1  0 ] ;
                    [ -2  0  1 ]

will diagonalize 𝑃, and since 𝑃 = 𝐻𝐷𝐻^{−1} we have

𝐻^{−1} P 𝐻 = 𝐷 = Diag(𝜆_1, 𝜆_2, 𝜆_2) = Diag(1, −3, −3) = [ 1   0   0 ]
                                                          [ 0  -3   0 ]
                                                          [ 0   0  -3 ]

Here,

𝐻^{−1} = [  1  -1  -1 ]
         [ -2   3   2 ]   implies that
         [  2  -2  -1 ]


𝑦_1^t = [1, −1, −1], 𝑦_2^t = [−2, 3, 2], 𝑦_3^t = [2, −2, −1]. The constituent matrices are

𝐸_1 = 𝑥_1 · 𝑦_1^t = [  1  -1  -1 ]
                    [  2  -2  -2 ] ;
                    [ -2   2   2 ]

𝐸_2 = 𝑥_2 · 𝑦_2^t = [ -2   3   2 ]          𝐸_3 = 𝑥_3 · 𝑦_3^t = [ 2  -2  -1 ]
                    [ -2   3   2 ] ;                            [ 0   0   0 ] .
                    [  0   0   0 ]                              [ 2  -2  -1 ]

Obviously,

𝑃 = 𝜆_1 𝐸_1 + 𝜆_2 𝐸_2 + 𝜆_3 𝐸_3 = [  1   -4   -4 ]
                                   [  8  -11   -8 ] .
                                   [ -8    8    5 ]
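A minimal R sketch checking the computations of Example 12.12 numerically:

P  = matrix(c(1, -4, -4, 8, -11, -8, -8, 8, 5), nrow = 3, byrow = TRUE)
H  = matrix(c(1, 1, 1, 2, 1, 0, -2, 0, 1),      nrow = 3, byrow = TRUE)
Hi = solve(H)                    # equals the H^{-1} displayed above
E1 = H[, 1] %*% t(Hi[1, ])       # outer product x1 . y1^t
E2 = H[, 2] %*% t(Hi[2, ])       # x2 . y2^t
E3 = H[, 3] %*% t(Hi[3, ])       # x3 . y3^t
all.equal(E1 - 3 * E2 - 3 * E3, P)   # TRUE: spectral decomposition (12.15)
m = 4
pow = function(M, n) Reduce(`%*%`, replicate(n, M, simplify = FALSE))
all.equal(E1 + (-3)^m * (E2 + E3), pow(P, m))   # TRUE: Equation (12.16)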

12.6 Markov Process’s Theory

Recall from Definition 12.3 that a stochastic process 𝑋(𝑡) is said to be a Markov process if for any sequence of real numbers
𝑡_1 < 𝑡_2 < . . . < 𝑡_n < 𝑡, and any sets 𝐴, 𝐴_1, 𝐴_2, . . . , 𝐴_n (𝐴_i ⊂ R),

P[𝑋(𝑡) ∈ 𝐴|𝑋(𝑡1 ) ∈ 𝐴1 , · · · , 𝑋(𝑡𝑛 ) ∈ 𝐴𝑛 ] = P[𝑋(𝑡) ∈ 𝐴|𝑋(𝑡𝑛 ) ∈ 𝐴𝑛 ].

12.6.1 Stationary or time homogeneous Markov processes

A Markov process (by Definition 12.3) 𝑀 = {𝑋(𝑡)}𝑡≥0 = (S, P(𝑡)) is called


either a Markov jump process or continuous-time Markov chain (CTMC)
if the state space S is a finite or countable set.

• The Markov jump process has the Markov property :

P[𝑋(𝑠 + 𝑡) = 𝑗|𝑋(𝑠) = 𝑖 and 𝑋(𝑢) for 𝑢 < 𝑠] = P[X(s + t) = j|X(s) = i], (12.18)

for all 𝑖, 𝑗 ∈ S and 𝑠, 𝑡 > 0.

𝑡𝑖𝑚𝑒 : − − − − − − 𝑢 − − − − − s − − − − − s + t − − − − − − >

𝑠𝑡𝑎𝑡𝑒 : − − −𝑋(𝑢) − − − − X(s) = i − − − −X(s + t) = j − − − −


That means given the evolution of the process up to any current time 𝑠, the future value and its
probabilistic description depend only on the current state at time 𝑠.

• The state transition probability is the conditional probability

𝑝𝑖,𝑗 (𝑠, 𝑡) = P[𝑋(𝑠 + 𝑡) = 𝑗 |𝑋(𝑠) = 𝑖],

that the process is in state 𝑗 at time 𝑠 + 𝑡 given that the process was in state 𝑖 at the previous time 𝑠,
for all 𝑖, 𝑗 ∈ S, [compare with Definition 12.2].

Definition 12.13 (Homogeneous Markov jump process).

The Markov jump process is stationary or time homogeneous if the 𝑝_{i,j}(s, t) are unaffected by a shift in the time origin (see Definition 12.2). It means the state transition probability is
given by
𝑝𝑖,𝑗 (𝑠, 𝑡) = P[𝑋(𝑠 + 𝑡) = 𝑗 |𝑋(𝑠) = 𝑖] = 𝑝𝑖,𝑗 (𝑡) (12.19)

now depends only on the length 𝑡 of the time interval [𝑠, 𝑠 + 𝑡].

With the Markov property quantitatively described in (12.18), these transition probabilities are sum-
marized in the transition matrix P(𝑡) = [𝑝𝑖,𝑗 (𝑡)].

12.6.2 The transition matrix P(𝑡)

Analogous to the discrete time setting, the matrix P(𝑡) whose elements are 𝑝𝑖,𝑗 (𝑡) is called the state
transition matrix of the process.
P(t) = [ 𝑝_{1,1}(t)  𝑝_{1,2}(t)  𝑝_{1,3}(t)  . . .  𝑝_{1,s}(t) ]
       [ 𝑝_{2,1}(t)  𝑝_{2,2}(t)  𝑝_{2,3}(t)  . . .  𝑝_{2,s}(t) ]
       [ 𝑝_{3,1}(t)  𝑝_{3,2}(t)  𝑝_{3,3}(t)  . . .  𝑝_{3,s}(t) ]     (12.20)
       [     .           .           .         .        .      ]
       [ 𝑝_{s,1}(t)  𝑝_{s,2}(t)  𝑝_{s,3}(t)  . . .  𝑝_{s,s}(t) ]

It is a stochastic matrix, i.e. its entries are non-negative and each row sum equals 1:

Σ_{j∈S} 𝑝_{i,j}(t) = 1, for each i ∈ S;

and 𝑝𝑖,𝑗 (𝑡) ≥ 0, for all states 𝑖, 𝑗 ∈ S.

What do we need to know to describe a Markov jump process?


The dynamics of the Markov jump process (CTMC) (S, 𝑝(0), P(t)) can be determined by the transition
probability matrix P(𝑡) = [𝑝𝑖,𝑗 (𝑡)] and its initial probability distribution (at 𝑋(0))

𝑝(0) = [𝑝𝑖 (0)]𝑖∈S where 𝑎𝑖 = 𝑝𝑖 (0) = P[𝑋(0) = 𝑖] (12.21)

(here 𝑎𝑖 = 𝑝𝑖 (0) = P[𝑋(0) = 𝑖] is the pmf of the event 𝑋(0) = 𝑖 ∈ S).

Definition 12.14 (State probability distribution).

The state probability distribution vector at any 𝑡 > 0 is

𝑝(t) = [𝑝_j(t)]_{j∈S}     (12.22)

where

𝑝𝑗 (𝑡) = P[𝑋(𝑡) = 𝑗] (12.23)

is the (marginal) probability that the process will be in state j at time t. We may express this marginal
probability via the initial distribution and the transitions
𝑝_j(t) = Σ_{i∈S} 𝑎_i 𝑝_{i,j}(t).     (12.24)

The main problem when studying continuous-time Markov chains is finding these probabilities 𝑝𝑖,𝑗 (𝑡).
They are continuous functions of 𝑡, for every pair (𝑖, 𝑗). In general, it is not easy to calculate the
functions 𝑝𝑖,𝑗 (𝑡) explicitly.

Example 12.13.

For any sequence of real numbers 0 < 𝑡_1 < 𝑡_2 < . . . < 𝑡_n, a few typical probabilities involving the process 𝑋(t) can be determined via 𝑝_{i,j}(t).

• If we know the state at t = 0:

P[𝑋(0) = 𝑎, 𝑋(𝑡_1) = 𝑖_1, 𝑋(𝑡_2) = 𝑖_2, · · · , 𝑋(𝑡_n) = 𝑖_n]

= 𝑝_𝑎(0) 𝑝_{𝑎 𝑖_1}(𝑡_1) 𝑝_{𝑖_1 𝑖_2}(𝑡_2 − 𝑡_1) · · · 𝑝_{𝑖_{n−1} 𝑖_n}(𝑡_n − 𝑡_{n−1})

• If we do not know the state at t = 0:

P[𝑋(𝑡_1) = 𝑖_1, 𝑋(𝑡_2) = 𝑖_2, · · · , 𝑋(𝑡_n) = 𝑖_n]

= Σ_{i∈S} 𝑝_i(0) 𝑝_{i 𝑖_1}(𝑡_1) 𝑝_{𝑖_1 𝑖_2}(𝑡_2 − 𝑡_1) · · · 𝑝_{𝑖_{n−1} 𝑖_n}(𝑡_n − 𝑡_{n−1}). 


12.6.3 The Chapman-Kolmogorov equations

The transition probabilities of the Markov jump process at origin time 𝑡 = 0 satisfy

𝑝_{i,j}(0) = 𝛿_{ij}, where 𝛿_{ij} = 1 if i = j, and 𝛿_{ij} = 0 if i ≠ j.

In addition, they also satisfy the Chapman-Kolmogorov equations:


𝑝_{i,j}(s + t) = Σ_{k∈S} 𝑝_{ik}(s) 𝑝_{kj}(t), for all i, j ∈ S; ∀s, t ≥ 0.     (12.25)

• This equation is called the Chapman-Kolmogorov equation for the continuous time Markov
chain.

• The Chapman-Kolmogorov equation in matrix form becomes

P(𝑠 + 𝑡) = P(𝑠) P(𝑡).

Example 12.14. For a Poisson process 𝑋(𝑡) ∼ Pois(𝜆𝑡) with set of states S = N = {0, 1, 2, ...} we can
determine 𝑝𝑖,𝑗 (𝑡).

Concept of Poisson process: We can write 𝑋(𝑡) to count the number of events randomly occurring
in the time interval [0, 𝑡), with pdf
𝑝_{X(t)}(i) = 𝑝(i; 𝜆t) = P[𝑋(t) = i] = e^{−𝜆t} (𝜆t)^i / i!,   i = 0, 1, 2, ...     (12.26)
with a constant value 𝜆 > 0 defined as the average number of events occurring in one unit of time, or
the speed of rare events or just the process rate.
The transition probabilities 𝑝𝑖,𝑗 (𝑡) of 𝑋(𝑡) ∼ Pois(𝜆𝑡) are given by

𝑝_{i,j}(t) = e^{−𝜆t} (𝜆t)^{j−i} / (j − i)!   for j − i ≥ 0;
𝑝_{i,j}(t) = 0                               for j − i < 0.     (12.27)
Then we can check explicitly that, with j ≥ i,

𝑝_{i,j}(s + t) = Σ_{k∈S} 𝑝_{ik}(s) 𝑝_{kj}(t).

Hint: Write explicitly 𝑝𝑖𝑘 (𝑠) and 𝑝𝑘𝑗 (𝑡) then use the Newton binomial theorem. 
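A minimal numerical check of this identity in R, using dpois() for (12.27), with arbitrarily chosen 𝜆, s, t, i, j:

lambda = 2; s = 0.5; t = 1.3; i = 1; j = 6
lhs = dpois(j - i, lambda * (s + t))   # p_{i,j}(s + t)
k = i:j                                # only states i <= k <= j contribute
rhs = sum(dpois(k - i, lambda * s) * dpois(j - k, lambda * t))
all.equal(lhs, rhs)                    # TRUE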

Whenever a stochastic process enters a state 𝑖, it spends an amount of time called the dwell time (or
holding time) in that state.


Definition 12.15 (Characterizing homogeneous CTMC).

1. If the holding time has a non-parametric or unspecified distribution, we call the process a semi-Markov process.

2. A homogeneous CTMC or Markov jump process {𝑋(𝑡)} = (S, 𝑝(0), P(𝑡)) is one which satisfies
two conditions, for any state 𝑖 ∈ S:

C1: The holding time 𝐻 in state 𝑖

- is exponentially distributed with rate constant 𝑣_i (or with mean 1/𝑣_i; the cdf is 𝐹_𝐻(t) = 1 − e^{−𝑣_i t}),
[𝑣𝑖 represents the transition rate at which the process leaves state 𝑖]
- and 𝐻 does not depend on the next state 𝑗;
C2: After that time, it jumps to some other state 𝑗 with probability 𝑝𝑖,𝑗 .

12.6.4 Transition rates (forces of transition)- Transition rate matrix

Transition rates play an important roles in Markov jump processes. They are defined as the instanta-
neous rate of change of the transition probability.

 CONCEPT 10 (For homogeneous (CTMC) Markov jump process).

• For all 𝑖 ̸= 𝑗 ∈ S, the transition rate of the process when the process makes a transition from state 𝑖 to state 𝑗,
denoted by 𝑞𝑖,𝑗 , is defined by
𝑞𝑖,𝑗 = 𝑣𝑖 𝑝𝑖,𝑗 (12.28)

The transition rates 𝑞𝑖,𝑗 are also known as instantaneous transition rates, transition intensities,
or forces of transition.

Lemma 12.12. Note that the following limits hold:

𝑣_i = lim_{h→0} (1 − 𝑝_{ii}(h)) / h,
𝑞_{i,j} = 𝑝′_{i,j}(0) = lim_{h→0} 𝑝_{i,j}(h) / h   for i ≠ j.     (12.29)
Definition 12.16.

The matrix

Q = [𝑞_{i,j}] = [ 𝑞_1      𝑞_{1,2}  𝑞_{1,3}  . . .  𝑞_{1,s} ]
                [ 𝑞_{2,1}  𝑞_2      𝑞_{2,3}  . . .  𝑞_{2,s} ]
                [ 𝑞_{3,1}  𝑞_{3,2}  𝑞_3      . . .  𝑞_{3,s} ]     (12.30)
                [    .        .        .       .       .    ]
                [ 𝑞_{s,1}  𝑞_{s,2}  𝑞_{s,3}  . . .  𝑞_s     ]


is called the transition rate matrix or generator matrix of the process. We will see that in practice,
the distribution of the process can be determined by the matrix Q and its initial distribution. We see
that:

1. 𝑞𝑖,𝑗 ≥ 0 for all 𝑖 ̸= 𝑗 ∈ S.

2. Σ_{j∈S} 𝑞_{i,j} = 0, for each i ∈ S, obtained by differentiating the identity

Σ_{j∈S} 𝑝_{i,j}(t) = 1, for each i ∈ S,

with respect to t at t = 0.

3. Therefore, 𝑞_i := 𝑞_{i,i} = − Σ_{i≠j∈S} 𝑞_{i,j} ≤ 0, for each i ∈ S.

For small h, the transition probability is expressed as

𝑝_{i,j}(h) = 1 + h 𝑞_{ii} + o(h)   if i = j;
𝑝_{i,j}(h) = h 𝑞_{i,j} + o(h)      if i ≠ j,     (12.31)

or can be approximated by

𝑝_{i,j}(h) ≈ 1 + h 𝑞_{ii}   if i = j;
𝑝_{i,j}(h) ≈ h 𝑞_{i,j}      if i ≠ j,     (12.32)

i.e. the probability of a transition from i to j during any short time interval [s, s + h] is proportional to h.

Example 12.15. Consider a Markov jump process with two states, namely states 1 and 2. Denote
the transition rates of the process from state 1 to 2 and from 2 to 1 respectively by 𝜆 and 𝜇 for some
𝜆, 𝜇 > 0.

• Draw the transition diagram.

• Write down the transition rate matrix Q.

HINT:
1. In this case the state space S = {1, 2} is finite.
2. The transition rate matrix Q is given by

Q = [ -𝜆   𝜆 ]
    [  𝜇  -𝜇 ] .


12.7 Kolmogorov’s differential equations

NOTATION: We recall and further define the following vectors and matrix:

𝑝(t) = [𝑝_1(t), 𝑝_2(t), 𝑝_3(t), . . .],   𝑝_i(t) = P[𝑋(t) = i], ∀i ∈ S

d𝑝(t)/dt = [ d𝑝_1(t)/dt, d𝑝_2(t)/dt, d𝑝_3(t)/dt, . . . ]

𝑞_{i,j} = 𝑝′_{i,j}(0) = d𝑝_{i,j}(t)/dt |_{t=0}

𝑞_i = 𝑞_{i,i} = − Σ_{i≠j∈S} 𝑞_{i,j} = −𝑣_i     (12.33)

Q = [ 𝑞_1      𝑞_{1,2}  𝑞_{1,3}  . . .  𝑞_{1,s} ]
    [ 𝑞_{2,1}  𝑞_2      𝑞_{2,3}  . . .  𝑞_{2,s} ]
    [ 𝑞_{3,1}  𝑞_{3,2}  𝑞_3      . . .  𝑞_{3,s} ]
    [    .        .        .       .       .    ]
    [ 𝑞_{s,1}  𝑞_{s,2}  𝑞_{s,3}  . . .  𝑞_s     ]

In the steady state, 𝑝_i(t) → 𝑝_i and lim_{t→∞} d𝑝_i(t)/dt = 0.
Kolmogorov’s (backward and forward) differential equations provide a relationship between the tran-
sition probabilities and transition rates. By solving the differential equations, we can express
the transition probabilities in terms of the transition rates. Consequently, statistical properties of
the Markov jump process can be completely determined by the transition rates, given in the matrix
Q = [𝑞𝑖,𝑗 = 𝑝′𝑖,𝑗 (0)], satisfying two properties
(1) each row sum is zero and
(2) the off-diagonal elements are nonnegative.

Kolmogorov’s backward differential equations

The goal of this section is to give a general methodology of finding the transition probabilities 𝑝𝑖,𝑗 (𝑡)
which would completely characterize a CTMC. These probabilities are functions of the time 𝑡 elapsed
between the two states. They will be expressed as solutions of a system of differential equations in 𝑡.

Theorem 12.13 ( Kolmogorov’s backward differential equations).

Inference, Linear Regression and Stochastic Processes


12.7. Kolmogorov’s differential equations 371

For all i, j ∈ S, and t ≥ 0,

dP(t)/dt = Q P(t),   P(0) = I_n,     (12.34)

where the matrix Q = [𝑞_{i,j}] satisfies the properties stated above. Explicitly, this equation means

d𝑝_{i,j}(t)/dt = Σ_{k∈S} 𝑞_{i,k} 𝑝_{k,j}(t) = Σ_{k≠i} 𝑞_{i,k} 𝑝_{k,j}(t) − 𝑣_i 𝑝_{i,j}(t).     (12.35)

Theorem 12.14. Kolmogorov’s forward differential equations (FDE)

For all i, j ∈ S, and t ≥ 0,

dP(t)/dt = P(t) Q,   P(0) = I_n.     (12.36)

Explicitly, this equation means

d𝑝_{i,j}(t)/dt = Σ_{k∈S} 𝑝_{i,k}(t) 𝑞_{k,j} = Σ_{k≠j} 𝑝_{i,k}(t) 𝑞_{k,j} − 𝑣_j 𝑝_{i,j}(t).     (12.37)

From properties of Q in Definition 12.16 we see that the row sums of 𝑄 equal 0, the nondiagonal
entries of 𝑄 are nonnegative, and the diagonal entries 𝑞𝑖𝑖 ≤ 0.

Fact 12.3. The matrix P(t) = [𝑝_{i,j}(t)] is then given by

P(t) = e^{Qt} = I_s + Σ_{k=1}^{∞} Q^k t^k / k!.     (12.38)
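A minimal R sketch of (12.38) by truncating the series; the truncation order K = 50 is an assumption, adequate for moderate Qt, and a dedicated routine such as expm() from the CRAN package expm is preferable in practice:

mat_exp = function(Q, t, K = 50) {
  P = diag(nrow(Q)); term = diag(nrow(Q))
  for (k in 1:K) {
    term = term %*% (Q * t) / k   # accumulates Q^k t^k / k!
    P = P + term
  }
  P
}
Q = matrix(c(-1, 1, 2, -2), 2, 2, byrow = TRUE)  # Q of Example 12.15, lambda = 1, mu = 2
rowSums(mat_exp(Q, 0.7))   # each row sums to 1: P(t) is a stochastic matrix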

Example 12.16 (The Birth–death Process).

Let the state space S = {0, 1, 2, . . .}. We define transition rates

𝑞_{i,i+1} = 𝜆_i   ("a birth"),
𝑞_{i,i−1} = 𝜇_i   ("a death"),     (12.39)
𝑣_i = 𝜆_i + 𝜇_i   (rate of moving out of state i).

In addition, we define 𝑣_0 = 𝜆_0, 𝑞_{0,−1} = 𝜇_0 = 0. Note that in this case the birth and death rates depend on the population size i. This is very realistic, but complicates the differential equations presented in Section 14.7.


Figure 12.8: Diagram for birth and death process.

Problems for stochastic processes

Problem 12.3.

Consider a transition probability matrix of a Markov chain 𝑀 as follows:


𝑃 = [ 0.88  0.12 ] .     (12.40)
    [ 0.15  0.85 ]

𝑀 represents a system with 2 states, on and off, of an email server, where on = acceptable operation, and
off = overload (when the server cannot receive or deliver email). The time shift unit is 1 minute. Which of the following is true:

• 𝑃 has 3 eigenvalues

• 𝑀 is reducible

• 𝑃 is not a stochastic matrix; 𝑃 is a regular matrix.

Problem 12.4.

Let 𝑀 be a two state Markov chain, with its state transition matrix is
𝑃 = [ 𝑝_{11}  𝑝_{21} ] = [ 1 − c    c   ] , where 0 < c < 1, 0 < d < 1.     (12.41)
    [ 𝑝_{12}  𝑝_{22} ]   [   d    1 − d ]

𝑀 represents a traffic system with 2 states, on and off, of a road at SG, where on = acceptable vehicle flow, and
off = traffic jam (when the road cannot fulfill its functionality).
When 𝑐 = 𝑑, compute the limit matrix lim𝑛→∞ 𝑃 𝑛 .


Problem 12.5.

Consider a Markov chain with two states and transition probability matrix
𝑃 = [ 3/4  1/4 ]
    [ 1/2  1/2 ]

Find the stationary distribution 𝑝* of the chain.

Problem 12.6. [Climate Science.] ([58], p. 136)

In some town, each day is either sunny or rainy. A sunny day is followed by another sunny day with
probability 0.7, whereas a rainy day is followed by a sunny day with probability 0.4.
It rains on Monday. Make forecasts for Tuesday, Wednesday, and Thursday.

Solution.

Weather conditions in this problem represent a homogeneous Markov chain with 2 states:
state 1 = “sunny” and
state 2 = “rainy.”
Transition probabilities are: 𝑝11 = 0.7, 𝑝12 = 0.3, 𝑝21 = 0.4, 𝑝22 = 0.6,
where 𝑝12 and 𝑝22 were computed by the complement rule.
If it rains on Monday, then
- Tuesday is sunny with probability 𝑝21 = 0.4 (making a transition from a rainy to a sunny day), and
- Tuesday is rainy with probability 𝑝22 = 0.6; can predict a 60% chance of rain.
Wednesday forecast requires 2-step transition probabilities,

• making one transition from Monday to Tuesday, 𝑋(0) to 𝑋(1), and

• another one, from Tuesday to Wednesday, 𝑋(1) to 𝑋(2).

Conditioning on the weather situation of Tuesday and using the Law of Total Probability,

𝑝_{21}^{(2)} = P(𝑋_{m+2} = 1 | 𝑋_m = 2) = P[ Wednesday is sunny | Monday is rainy ]

= 𝑝_{21} 𝑝_{11} + 𝑝_{22} 𝑝_{21} = (0.4)(0.7) + (0.6)(0.4) = 0.52.


By the Complement Rule, 𝑝_{22}^{(2)} = 0.48, and thus we predict a 52% chance of sun and a 48% chance of rain on Wednesday.
For the Thursday forecast, we compute 3-step transition probabilities,

𝑝_{21}^{(3)} = Σ_{i=1}^{2} Σ_{j=1}^{2} ... ,


because it takes 3 transitions to move from Monday to Thursday. We have to use the Law of Total
Probability conditioning on both Tuesday and Wednesday. Explain and DIY based on:

• 1-step transition probabilities found by the following sequence of states

2 → 𝑖 → 𝑗 → 1;

• or using the already computed 2-step transition probabilities 𝑝_{21}^{(2)} and 𝑝_{22}^{(2)}, describing the transition from Monday to Wednesday, and the 1-step transition probabilities in 𝑃 from Wednesday to Thursday; ...

We predict a 55.6% chance of sun and a 44.4% of rain on Thursday. ♦
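A minimal R sketch reproducing these forecasts by powers of 𝑃:

P = matrix(c(0.7, 0.3, 0.4, 0.6), 2, 2, byrow = TRUE)  # states: 1 = sunny, 2 = rainy
p0 = c(0, 1)            # it rains on Monday
p0 %*% P                # Tuesday:   0.40 sun, 0.60 rain
p0 %*% P %*% P          # Wednesday: 0.52 sun, 0.48 rain
p0 %*% P %*% P %*% P    # Thursday:  0.556 sun, 0.444 rain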

Problem 12.7.

A certain product is made by two companies, A and B, that control the entire market. Currently, A and B have 60 percent and 40 percent, respectively, of the total market. Each year, A loses 5 percent of its market share to B, while B loses 3 percent of its share to A.
Find the relative proportion of the market that each holds after 2 years; after 20 years.

Problem 12.8.

Bernoulli process: Consider a Bernoulli random variable (a trial or r.v.) 𝑋 that can take only two
possible values, success as 1 and failure as 0, i.e. S𝑋 = {0, 1}. The probability of success is 𝑝 and the
probability of failure is 1 − 𝑝,

P[𝑋 = 1] = 𝑝, P[𝑋 = 0] = 1 − 𝑝 for some 𝑝 ∈ [0, 1].

a/ Write its probability mass function


b/ Now let 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 , . . . be independent Bernoulli r.v.’s, and assume that the PMF of 𝑋𝑛 is as
follows:
P[𝑋𝑛 = 1] = 𝑝, and P[𝑋𝑛 = 0] = 1 − 𝑝, for all 𝑛.

Describe the Bernoulli process and construct a typical sample sequence of this process.

Problem 12.9. Prove the following.

Theorem 12.15.

If every eigenvalue of a matrix 𝑃 yields linearly independent left eigenvectors in number equal to its multiplicity, then
* there exists a nonsingular matrix 𝑀 whose rows are left eigenvectors of 𝑃, such that
* 𝐷 = 𝑀𝑃𝑀^{−1} is a diagonal matrix whose diagonal elements are the eigenvalues of 𝑃, repeated according to multiplicity.


Apply this to a practical problem in Business Intelligence through a case study in the mobile phone industry in Thailand.

According to a recent survey, there are four big mobile producers/sellers 𝑁, 𝑆, 𝑀 and 𝐿, and their market distribution in 2017 is given by the stochastic matrix:
          𝑁     𝑀     𝐿     𝑆
𝑃 =  𝑁 [  1     0     0     0   ]
     𝑀 [ 0.4    0    0.6    0   ]
     𝐿 [ 0.2    0    0.1   0.7  ]
     𝑆 [  0     0     0     1   ]

Is 𝑃 regular? ergodic? Find the long-term distribution matrix 𝐿 = lim_{m→∞} 𝑃^m.

What is your conclusion? (Remark that the states 𝑁 and 𝑆 are called absorbing states.)

Problem 12.10. [Computing.] ([58], p. 139)

A computer is shared by 2 users who send tasks to a computer remotely and work independently. At
any minute,

• any connected user may disconnect with probability 0.5, and

• any disconnected user may connect with a new task with probability 0.2.

Let 𝑋(𝑡) be the number of concurrent users at time t (minutes). This is a Markov chain with 3 states:
0, 1, and 2.
Compute transition probabilities in matrix 𝑃 .

Solution.

Row 1 of 𝑃 :
Suppose 𝑋(0) = 0, i.e., there are no users at time 𝑡 = 0. Then 𝑋(1) is the number of new connections
within the next minute.
It has binomial distribution Bin(2, 0.2), therefore, 𝑝00 =?, 𝑝01 =?, 𝑝02 =?
Draw the transition diagram for the Markov chain in this case.
Row 2 of 𝑃 :
Try to go further using binomial distributions Bin(1, 0.2), Bin(1, 0.5).
Row 3 of 𝑃 : Use Bin(2, 0.5).


DIY to obtain the following transition probability matrix


𝑃 = [ 0.64  0.32  0.04 ]
    [ 0.40  0.50  0.10 ] .
    [ 0.25  0.50  0.25 ]

a) Explain 𝑃 .
b) Check that both matrices 𝑃 and 𝑃 2 are stochastic. ♦
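A minimal R check for part b):

P = matrix(c(0.64, 0.32, 0.04, 0.40, 0.50, 0.10, 0.25, 0.50, 0.25),
           nrow = 3, byrow = TRUE)
is_stochastic = function(M) all(M >= 0) && all(abs(rowSums(M) - 1) < 1e-12)
is_stochastic(P)        # TRUE
is_stochastic(P %*% P)  # TRUE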

Problem 12.11. A Markov jump process has two states, namely states 0 and 1 (for example, state 0 for healthy and state 1 for sick, respectively). In this case, we ignore the transitions from either the healthy or the sick state to a death state.
It is known that the holding time in state 0 is exponential with rate 𝑣_0 = 𝜆, and the time spent in state 1 is also exponential with rate 𝑣_1 = 𝜇.

1. Draw a transition diagram.

2. Write down the forward differential equations (FDE).

3. Find the transition probabilities, i.e. compute 𝑝00 (𝑡), 𝑝01 (𝑡).

4. Find the limit of P(𝑡) as 𝑡 → ∞, and explain the results.

SOLUTION: Let us calculate the transition probabilities by solving the Kolmogorov equations. For
example, we use the forward equations. Note we can do this since in this case the state space S =
{0, 1} is finite.
1. The transition rate matrix Q is given by
Q = [ -𝜆   𝜆 ]
    [  𝜇  -𝜇 ]

2. Use the forward Kolmogorov equations d𝑝_{i,j}(t)/dt = Σ_{k≠j} 𝑝_{i,k}(t) 𝑞_{k,j} − 𝑣_j 𝑝_{i,j}(t). So

d𝑝_{1,1}(t)/dt = 𝑝_{1,0}(t) 𝑞_{0,1} − 𝑣_1 𝑝_{1,1}(t),

with

𝑞_{0,1} = 𝑣_0 𝑝_{0,1} = 𝜆 𝑝_{0,1} = 𝜆 · 1 = 𝜆,

hence

d𝑝_{1,1}(t)/dt = 𝜆 𝑝_{1,0}(t) − 𝜇 𝑝_{1,1}(t).
Now, since 𝑝1,0 (𝑡) + 𝑝1,1 (𝑡) = 1, we get

𝑝′1,1 (𝑡) = 𝜆[1 − 𝑝1,1 (𝑡)] − 𝜇𝑝1,1 (𝑡) = 𝜆 − (𝜆 + 𝜇) 𝑝1,1 (𝑡).


Denote 𝑦(t) = 𝑝_{1,1}(t); multiply both sides by e^{(𝜆+𝜇)t}, ...

𝑦(t) = 𝜆/(𝜆 + 𝜇) + c e^{−(𝜆+𝜇)t}.

Find c by using

1 = 𝑝_{1,1}(0) = 𝑦(0) = 𝜆/(𝜆 + 𝜇) + c ⇒ c = 𝜇/(𝜆 + 𝜇),

which gives the solution

𝑝_{1,1}(t) = 𝜆/(𝜆 + 𝜇) + (𝜇/(𝜆 + 𝜇)) e^{−(𝜆+𝜇)t}.

We can find 𝑝_{1,0}(t) next using the fact that 𝑝_{1,0}(t) + 𝑝_{1,1}(t) = 1.

3. Use the backward Kolmogorov equations for 𝑝_{0,1}(t), 𝑝_{0,0}(t); or see 4.

4. The limit of P(t)? The k-th power of Q is shown (by induction) to be

Q^k = [−(𝜆 + 𝜇)]^{k−1} Q, for any k ≥ 1.

With this we obtain

P(t) = I_2 + Σ_{k=1}^{∞} Q^k t^k / k! = I_2 − (1/(𝜆 + 𝜇)) Q Σ_{k=1}^{∞} [−(𝜆 + 𝜇)t]^k / k!,

so

P(t) = [ 1  0 ] − (1/(𝜆 + 𝜇)) [ e^{−(𝜆+𝜇)t} − 1 ] Q .
       [ 0  1 ]

Therefore,

P(t) = [ 1  0 ] + (1/(𝜆 + 𝜇)) Q − (1/(𝜆 + 𝜇)) Q e^{−(𝜆+𝜇)t},
       [ 0  1 ]

and finally,

P(t) = [ 𝜇/(𝜆+𝜇)  𝜆/(𝜆+𝜇) ]  +  [  𝜆/(𝜆+𝜇)  −𝜆/(𝜆+𝜇) ] e^{−(𝜆+𝜇)t} .
       [ 𝜇/(𝜆+𝜇)  𝜆/(𝜆+𝜇) ]     [ −𝜇/(𝜆+𝜇)   𝜇/(𝜆+𝜇) ]

Now we can write down 𝑝_{0,1}(t), 𝑝_{0,0}(t).
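A minimal R sketch evaluating this closed form, with illustrative rates 𝜆 = 1, 𝜇 = 2 assumed:

lambda = 1; mu = 2; v = lambda + mu
Pt = function(t) {
  matrix(c(mu, lambda, mu, lambda), 2, 2, byrow = TRUE) / v +
    matrix(c(lambda, -lambda, -mu, mu), 2, 2, byrow = TRUE) / v * exp(-v * t)
}
Pt(0)             # the identity, as required by P(0) = I
rowSums(Pt(0.8))  # each row sums to 1
Pt(50)            # rows approach the stationary distribution [mu, lambda]/(lambda + mu)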

Chapter 13

Statistical Simulation
Describing systems with algorithms

[Source [56]]

13.1 Introduction

The main purpose of simulation is to estimate quantities whose direct computation is complicated, risky, time- or money-consuming, expensive, or impossible.

1. For example, suppose a complex device or machine is to be built and launched. Before it happens,
its performance is simulated, and this allows experts to evaluate its adequacy and associated risks
carefully and safely.

2. For example, one surely prefers to evaluate reliability and safety of a new module of a space station
by means of computer simulations rather than during the actual mission.

Recall that probability can be defined as a long-run proportion. With the help of random number generators, computers can actually simulate a long run. Then, probability can be estimated by the associated observed frequency. The longer the simulated run, the more accurate the obtained result. Similarly, one can estimate expectations, variances, and other distribution characteristics from a long run of simulated random variables. In brief, we present

• Engineering problems possibly worked out by simulation

• How to generate random numbers with R commands?

• Generation of random numbers

• Transforming random numbers into input data

• Synchronous and asynchronous simulation

• Solving problems by Monte Carlo method

13.1.1 Engineering problems worked out by simulation

Problems in engineering and technology can be solved by the theory of simulation and Poisson processes, including:

1. Generate a random matrix in which columns represent different variables, see Section 13.2.2

2. Generate sample paths of a homogeneous Discrete Time Markov Chain (DTMC) by synchronous
simulation and asynchronous simulation, in Section 13.5

Example 13.1. (Percolation in LAN).

Consider a network of nodes. Some nodes are connected, say, with transmission lines, others are
not (mathematicians would call such a network a graph). A signal is sent from a certain node. Once
a node 𝑘 receives a signal, it sends it along each of its output lines with some probability 𝑝𝑘 . After


a certain period of time, one desires to estimate the proportion of nodes that received a signal, the
probability for a certain node to receive it, etc.
Technically, simulation of such a network reduces to generating Bernoulli random variables [in
Chapter 4] with parameters 𝑝𝑖 . Line 𝑖 transmits if the corresponding generated variable 𝑋𝑖 = 1. In
the end, we simply count the number of nodes that got the signal, or verify whether the given node
received it.

Example 13.2. (Queuing) A queuing system [discussed in detail in Chapter 14.] is described by
a number of random variables. It involves spontaneous arrivals of jobs, their random waiting time,
assignment to servers, and finally, their random service time and departure.

When designing a queuing system or a server facility, it is important to evaluate its vital performance
characteristics. This will include

• the job’s average waiting time,

• the average length of a queue,

• the proportion of customers who had to wait,

• the expected usage of each server,

• the average number of available (idle) servers at the time when a job arrives, and so on.

Example 13.3. (Industry and Management)

When an organization realizes that a system is not operating as desired, it will look for ways to improve its performance. To do so, sometimes it is possible to experiment with the real system and, through observation and the aid of Statistics, reach valid conclusions towards future system improvement.

• However, experiments with a real system may entail ethical and/or economical problems, which may
be avoided dealing with a prototype, a physical model.

• Sometimes, it is not feasible or possible, to build a prototype, yet we may obtain a mathematical
model describing, through equations and constraints, the essential behaviour of the system.

• This analysis may be done, sometimes, through analytical or numerical methods, but the model may
be too complex to be dealt with.

13.1.2 Basic concepts and topics

Statistically, in the design phase of a system, there is no system available yet, so we cannot rely on measurements for generating a pdf.

In such extreme cases, we may use simulation. Large complex system simulation has become
common practice in many industrial areas. Essentially, simulation consists of


(i) building a computer model that describes the behaviour of a system;

(ii) experimenting with this model to reach conclusions.

Once we have a computer simulation model of the actual system, we need to generate values for the
random quantities that are part of the system input (to the model).
In this chapter, from the Statistical point of view, we introduce key concepts, methods and tools from
simulation with the Industrial Statistics orientation in mind. The major parts of this section are from
[58, Chapter 5]. We mainly consider the problem within Step (ii) only.
To conduct Step (i) correctly and meaningfully, a close collaboration with experts in specific areas is vital. Topics discussing Step (ii) are shown in the other chapters.

We learn

1. How to generate random numbers, using software R?

2. How to transform random numbers into input data?

3. How to measure/record output data?

4. How to analyze and interpret output data and make meaningful inferences?

13.2 How to generate random numbers with R commands?

13.2.1 Generating random samples

SYNTAX: # rxyzts(parameters) = generates a random sample from the distribution named xyzts

• Gaussian: b = rnorm(k, mean, sd);

b = rnorm(37, 1.65, 0.5);

# get a sample of 37 numbers following the Gaussian N(1.65, 0.5)

• Binomial: x = rbinom(k, n, p);

x = rbinom(18, 9, 0.5); # get a sample of 18 numbers following the Bin(9, 0.5)

• Poisson: y = rpois(n, lambda);

y = rpois(8, 12);

# get a sample of 8 numbers following the Pois(𝜆), assuming the average rate 𝜆 = 12


13.2.2 Computing probability distributions

SYNTAX:
# dxyzts(parameters) = computes the probability mass function / p.d.f. with name xyzts
# pxyzts(parameters) = finds the c.d.f. - cumulative distribution
# qxyzts(parameters) = gives the quantile function, the inverse of the cdf

𝑥* is the p-th quantile of a distribution if

P[𝑋 < 𝑥*] = p ⇔ 𝐹(𝑥*) = p ⇔ 𝑥* = 𝐹^{−1}(p)

• Gaussian: pnorm(t, m, s);

E.g., Probability of male height less than or equal to 180 cm, given that the Gauss distribution has
mean=175 and sd= 5

t = 180; m = 175; s = 5;

height_prob = pnorm(t, m, s); height_prob;

• Binomial: x = dbinom(k, n, p);

# gives the p.d.f. at k = 4 for a sum of n = 7 Bernoulli(p = 0.5) variables

dbinom(4, size = 7, prob = 0.5);

• Fisher: What is the upper 𝑎 = 5% = 0.05 critical value for the Fisher distribution with degrees of freedom n1 = 16, n2 = 21?

a = 0.05; n1 = 16; n2 = 21; qf(1-a, n1, n2); 1 - pf(2.156, n1, n2);

Probability-based Simulation with software R

We will illustrate a simulation of data using popular probability distributions in software R, via a practical
problem of insurance premium determination in Actuarial Science with 𝑝 − 1 = 8 predictors.

Example 13.4. Specific data matrix in [Actuarial Science.]

For premium determination in Actuarial Science we assume 𝑝 − 1 = 8 predictors for describing customers' payments of the AIA firm, as follows:

1. House’s location with longitude (𝑥1 ), and latitude (𝑥2 )

2. Gender (𝑥3 , categorical)

3. Number of family members (𝑥4 , ordinal)

DATA ANALYTICS- FOUNDATION


CHAPTER 13. STATISTICAL SIMULATION
384 DESCRIBING SYSTEMS WITH ALGORITHMS

4. Number of schooling years (𝑥5 )

5. Pollution index at customer’s house (PM10, PM2.5, SO...) (𝑥6 )

6. Job (profession) types (𝑥7 , categorical)

7. Yearly income (in $ 1000 US) (𝑥8 )

We observe 𝑛 = 18 customers together with their actuarial premium amounts 𝑦 = [𝑦_1, 𝑦_2, . . . , 𝑦_18], annually bought from AIA to protect against risk in their lives. Hence, our data matrix X has size 18 × 9.
How to design the data matrix before collecting real sample data?

• In Actuarial Industry we also call each customer a ‘policy’. We observed 𝑛 = 18 customers so we


have 18 policies.

• How to know specific values of 18 policies/customers, to build up an empirical model?

• Fitting empirical model needs data matrix X and response sample 𝑦.

Our data matrix X and response sample 𝑦 must reflect realistic conditions of real life; e.g., the predictors 𝑋_j and the response 𝑌 [at least] have to follow certain probability distributions.

HOW WOULD WE DO IT?


Data matrix with R - employ various distributions
R SYNTAX:
rxyzts(parameters) = generates randomly sample with distribution name xyzts

Sample size n=18; we use many distributions to obtain simulated data.

1. House’s location - Longitude (𝑥1 ), and latitude (𝑥2 )

x1 = rnorm(n, 1000, 20); x2 = rnorm(n, 7500, 30);

2. Gender: x3 = rbinom(n, 1, 0.5);

# 0 is a male, 1 is a female customer

3. Number of family members (𝑥4 , ordinal) x4 = rbinom(n, 4, 0.5)+ 1; # the number of family members
of customer

4. Number of schooling years (𝑥5 )

x5 = rpois(n, 12);

# assuming the average number of schooling years is 12


5. Pollution index at customer’s house (𝑥6 )

x6 = rnorm(n, 60, 20);

# the higher the index, the worse the pollution; mean = 60

6. Job (profession) types (𝑥7 , categorical, assume 7 types)

x7 = c(’d’, ’t’, ’b’, ’e’, ’d’, ’d’,’p’, ’n’, ’l’, ’t’, ’b’, ’p’, ’n’, ’l’, ’t’,’e’, ’p’, ’n’);

7. Yearly income (in $ 1000 US) (𝑥8 )

x8 = rpois(n, 13); # assume the annual income is 13,000 USD/year

Generate data matrix and fit linear model with R

Y = rpois(n, 860);
# actuarial premium cost per year of customer,
# average = 860 USD
dataX= data.frame(x1, x2, x3, x4, x5, x6,x7, x8); nrow(dataX);
# No INTERACTION
M1=lm(Y ~ x4+ x5+ x6+ x8)
anova(M1); summary(M1)
# No INTERACTION
# but transform nominal variable to numeric
fx3= factor(x3); fx7= factor(x7);
M0=lm(Y~ x1+ x2+ fx3+ x4+ x5+ x6+ fx7+ x8)
anova(M0); summary(M0)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 4.3 4.3 0.1030 0.7643802
x2 1 742.0 742.0 17.8248 0.0134566 *
fx3 1 264.6 264.6 6.3567 0.0652693 .
x4 1 156.9 156.9 3.7694 0.1241658
x5 1 1122.7 1122.7 26.9700 0.0065463 **
x6 1 1529.3 1529.3 36.7356 0.0037411 **
fx7 6 6905.2 1150.9 27.6459 0.0032232 **
x8 1 7771.5 7771.5 186.6839 0.0001662 ***
Residuals 4 166.5 41.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.452 on 4 degrees of freedom
Multiple R-squared: 0.9911,Adjusted R-squared: 0.9621
F-statistic: 34.18 on 13 and 4 DF, p-value: 0.001878


13.3 Generation of random numbers

General concepts.
The most basic computational component in simulation involves the generation of random variables
distributed uniformly between 0 and 1.

These then can be used to generate other random variables, both discrete and continuous de-
pending on practical contexts. Key requirements for meaningfully reasonable/reliable simulation:

• the simulation is run long enough to obtain an estimate of the operating characteristics of the system

• the number of runs also should be large enough to obtain reliable estimates

• the result of each run is a random sample, which implies that a simulation is a statistical experiment that must be conducted using statistical tools such as:

i) point estimation,

ii) confidence intervals, and

iii) hypothesis testing.

A scheme to mathematically simulate a system.


If a system 𝑆 is described by a discrete random variable 𝑋, a fundamental diagram to simulate 𝑆 is:

A random number generator 𝐺 → uniform random variable 𝑈

→ pdf or cdf of 𝑋.

13.3.1 Transformation random numbers into input data

Advanced simulation techniques


We use 𝐺 to randomly compute specific values of 𝑋 in the last two phases of this diagram.
The discrete inverse transform: write the cdf of 𝑋 by

𝐹(k) = Σ_{i=0}^{k} 𝑝(i) ∈ [0, 1],

- then generate a uniform random number 𝑈 ∈ [0, 1] by 𝐺,


- find the value of 𝑋 = k by determining the interval [𝐹(k − 1), 𝐹(k)] containing 𝑈; mathematically this means finding the preimage 𝐹^{−1}(𝑈).


The Transformation Method

Generally, we need an algorithm, in two steps:

Step 1: use an algorithm 𝐴 to generate variates 𝑉_n, n = 1, 2, ..., of a r.v. 𝑉 (𝑉 = 𝑈 in the above example) with specified cdf 𝐹_𝑉(v) for the continuous case or pdf 𝑓_𝑉(v) for the discrete case. Then

Step 2: employ an appropriate transformation 𝑔(.) to generate a variate of 𝑋, namely 𝑋_n = 𝑔(𝑉_n).

Theorem 13.1 (Relationship of 𝑉 and 𝑋).

Consider a r.v. 𝑉 with pdf 𝑓_𝑉(v) and a given transformation 𝑋 = 𝑔(𝑉). Denote by 𝑣_1, 𝑣_2, · · · , 𝑣_n the real roots of the equation

𝑥 − 𝑔(𝑣) = 0;     (13.1)

then the pdf of the r.v. 𝑋 is given by

𝑓_𝑋(x) = Σ_{l=1}^{n} 𝑓_𝑉(𝑣_l) · 1 / |d𝑔(𝑣_l)/d𝑣_l|.

Given x, if Equation (13.1) has no real solutions then the pdf 𝑓_𝑋(x) = 0.

Proof. DIY

13.3.2 Two most important usages

• A) Linear (affine when b ≠ 0) case: 𝑋 = 𝑔(𝑉) = a𝑉 + b where a, b ∈ R. Then

𝑓_𝑋(x) = (1/|a|) 𝑓_𝑉((x − b)/a).

• B) Inverse case: given the cdf 𝐹_𝑋(x) of 𝑋, then 𝑋 = 𝑔(𝑉) = 𝐹_𝑋^{−1}(𝑉).

Theorem 13.2 (Inverse case).

Consider a r.v. 𝑉 with uniform cdf 𝐹_𝑉(v) = v, v ∈ [0, 1]. Then the transformation 𝑋 = 𝑔(𝑉) = 𝐹_𝑋^{−1}(𝑉) gives variates x of 𝑋 with cdf 𝐹_𝑋(x).

Proof. For any real number a, and due to the monotonicity of the cdf function 𝐹_𝑋,

P(𝑋 ≤ a) = P[𝐹_𝑋^{−1}(𝑉) ≤ a] = P[𝑉 ≤ 𝐹_𝑋(a)] = 𝐹_𝑉(𝐹_𝑋(a)) = 𝐹_𝑋(a).

Using this, an algorithm is formulated for generating variates of a r.v. 𝑋.


Generating variates of a random variable

1. Invert the given cdf 𝐹_𝑋(x) to find its inverse 𝐹_𝑋^{−1}.

2. Generate a uniform variate 𝑉 ∈ [0, 1].

3. Generate variates x via the transformation 𝑋 = 𝐹_𝑋^{−1}(𝑉).

We next discuss how to generate values of arbitrary discrete and continuous distributions.

Generate discrete random variables

Specifically, to simulate a discrete 𝑋 with mass probabilities 𝑝_k,

a) we simply split the interval [0, 1] into n + 1 sub-intervals 𝐴_k, with the length of the k-th subinterval equal to 𝑝_k = P[𝑋 = 𝑥_k], k ∈ {0, 1, ..., n};

b) then explicitly we employ the fact that 𝑉 ∼ Uni([0, 1]), and use the cdf

𝐹(𝑥_k) = P[𝑋 ≤ 𝑥_k] = 𝑉 ⇐⇒ 𝑥_k = 𝐹^{−1}(𝑉),

in which the parameters of the step function 𝑔(𝑉) are given by:

• 𝑋 = 𝑥_0 if 𝑉 ≤ 𝑝_0,

• else

𝑋 = 𝐹^{−1}(𝑉) = 𝑥_k ⇐⇒ Σ_{i=0}^{k−1} 𝑝_i < 𝑉 ≤ Σ_{i=0}^{k} 𝑝_i,   k ∈ {1, ..., n};     (13.2)

This is fully expanded as

𝑋 = 𝑥_0 if 𝑉 ≤ 𝑝_0;
    𝑥_1 if 𝑝_0 < 𝑉 ≤ 𝑝_0 + 𝑝_1;
    𝑥_2 if 𝑝_0 + 𝑝_1 < 𝑉 ≤ 𝑝_0 + 𝑝_1 + 𝑝_2;
    ...
    𝑥_k if Σ_{i=0}^{k−1} 𝑝_i < 𝑉 ≤ Σ_{i=0}^{k} 𝑝_i;
    ...     (13.3)

Clearly,

P[𝑋 = 𝑥_k] = P[ Σ_{i=0}^{k−1} 𝑝_i < 𝑉 ≤ Σ_{i=0}^{k} 𝑝_i ] = 𝑝_k.

Hence we summarize steps in Algorithm I as follows.


Generating discrete random variables.

Figure 13.1: Generating values of a random variable. The value of 𝑋 is determined by the region where the generated value of 𝑈 belongs.

Algorithm 1 Generating discrete variables


Discrete-randomize(𝑝𝑘 )

Input: an arbitrary discrete random variable 𝑋 with pmf 𝑝_k, k ∈ N

Output Values 𝑥𝑘 of 𝑋 such that P[𝑋 = 𝑥𝑘 ] = 𝑝𝑘 .

1. Divide the interval [0, 1] into subintervals as in Fig. 13.1(b).

𝐴0 = [0, 𝑝0 )
𝐴1 = [𝑝0 , 𝑝0 + 𝑝1 )
(13.4)
𝐴2 = [𝑝0 + 𝑝1 , 𝑝0 + 𝑝1 + 𝑝2 )
···

Subinterval 𝐴𝑘 will have length 𝑝𝑘 ; there may be a finite or infinite number of them, according to
possible values of 𝑋.

2. Obtain a standard uniform random variable 𝑉 from a random number generator or a table of random
numbers.

3. If 𝑉 belongs to 𝐴𝑘 , let 𝑋 = 𝑥𝑘 .

Example 13.5. Consider a Bernoulli r.v. 𝑋 ∼ B(𝑝), here 𝑝 = P(𝑋 = 1).


The cdf 𝐹(x) = P[𝑋 ≤ x] is a step (stair-case) function 𝑢(.). That is, 𝑢(t) = 𝑏_i if 𝑎_i ≤ t < 𝑎_{i+1}, where (𝑎_i)_i is an ascending sequence. Here
𝐹(x) = 0 if x < 0,
𝐹(x) = 1 − p if 0 ≤ x < 1, and
𝐹(x) = (1 − p) + p = 1 if 1 ≤ x, see Figure 13.1(a).
How to generate 𝑋? We employ 𝑉 ∼ Uni([0, 1]), and the fact that the inverse

𝐹^{−1}(𝑉) = 𝑢(𝑉 − (1 − p)).

Example 13.6. Consider a binomial 𝑋 ∼ Bin(𝑛, 𝑝) where 𝑝 = P[𝑋 = 1] = P[“𝑠𝑢𝑐𝑐𝑒𝑠𝑠′′ ].

𝑋 ∼ Bin(n, p) takes values in Range(𝑋) = {0, 1, ..., n} and is given by the pmf

𝑝_k = P[𝑋 = k] = C(n, k) p^k (1 − p)^{n−k}.

We employ the fact that 𝑉 ∼ Uni([0, 1]), and use Equation (13.2),

Σ_{i=0}^{k−1} 𝑝_i < 𝑉 ≤ Σ_{i=0}^{k} 𝑝_i,   k ∈ {1, ..., n};

now explicitly expanded as

𝑋 = 0 if 𝑉 ≤ 𝑝_0;
    1 if 𝑝_0 < 𝑉 ≤ 𝑝_0 + 𝑝_1;
    2 if 𝑝_0 + 𝑝_1 < 𝑉 ≤ 𝑝_0 + 𝑝_1 + 𝑝_2;
    ...
    k if Σ_{i=0}^{k−1} 𝑝_i < 𝑉 ≤ Σ_{i=0}^{k} 𝑝_i;
    ...
    n if Σ_{i=0}^{n−1} 𝑝_i < 𝑉 ≤ 1.     (13.5)
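A minimal R sketch of this discrete inverse transform for Bin(7, 0.5), using cumsum() and findInterval():

n = 7; p = 0.5
pk  = dbinom(0:n, n, p)     # p_0, ..., p_n
cdf = cumsum(pk)            # F(0), ..., F(n) = 1
V = runif(10000)
X = findInterval(V, cdf)    # X = k iff F(k-1) <= V < F(k)
table(X) / length(X)        # empirical pmf, close to pk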


Generate continuous random variables

Algorithm 2 Generating continuous variables

Continuous-randomize(𝐹 )

Input A continuous random variable 𝑋 that has cdf 𝐹

Output Values 𝑥 of 𝑋.

1. Obtain a standard uniform random variable 𝑉 ∈ [0, 1] from a random number generator.

2. Generate values 𝑥 via the transformation 𝑋 = 𝐹 −1 (𝑉 ). In other words, solve the equation 𝐹 (𝑋) = 𝑉
for 𝑋.

Example 13.7. Generate an exponential 𝑋 ∼ E(𝜆) with parameter 𝜆.

Recall that the exponential cdf is 𝐹(x) = 1 − e^{−𝜆x}; we solve the equation

𝐹(𝑋) = 𝑉 ⇐⇒ 𝑋 = −(1/𝜆) ln(1 − 𝑉) = −(1/𝜆) ln 𝑈,   𝑈 ∼ Uni([0, 1]).
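In R, this one-line inverse transform can be checked against the theoretical mean:

lambda = 2
X = -log(runif(10000)) / lambda   # inverse-cdf transform
c(mean(X), 1 / lambda)            # sample mean vs. theoretical mean 1/lambda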
Example 13.8. Generate a gamma 𝑋 ∼ Gamma(𝛼, 𝛽).

The shape parameter 𝛼 and the frequency parameter 𝛽 completely determine Gamma(𝛼, 𝛽). However, the probability density function of 𝑋 ∼ Gamma(𝛼, 𝛽),

𝑔(x; 𝛼, 𝛽) = (1/(𝛽^𝛼 Γ(𝛼))) x^{𝛼−1} e^{−x/𝛽}   if x ≥ 0;
𝑔(x; 𝛼, 𝛽) = 0                                 if x < 0,     (13.6)

gives the complicated cdf 𝐹, given by

𝐹_𝐺(t) = (1/(𝛽^𝛼 Γ(𝛼))) ∫_0^t x^{𝛼−1} e^{−x/𝛽} dx.     (13.7)

When 𝛼 = 1 then gamma becomes exponential, so Gamma(1, 𝛽) = E(𝛽).


A Gamma variable generally can be generated as a sum of 𝛼 independent exponential E(𝛽):
𝑛
∑︁
𝑋𝑖 ∼ 𝐺(𝑛, 𝛽).
𝑖=1

In Matlab we compute this sum 𝑋 ∼ Gamma(𝛼, 𝛽) by

X = sum(-1/lambda * log(rand(alpha, 1)));
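An equivalent minimal sketch in R, using the scale parameterization of (13.6) so that each exponential term has mean 𝛽 (the values alpha = 5, beta = 2 are illustrative assumptions):

alpha = 5; beta = 2
X = replicate(10000, sum(-beta * log(runif(alpha))))
c(mean(X), alpha * beta)   # sample mean vs. theoretical mean
Y = rgamma(10000, shape = alpha, scale = beta)   # R's built-in generator
c(var(X), var(Y))          # the two samples agree in distribution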

13.4 How to measure/record output data?

The third step in a simulation process consists of passing the inputs through the simulation model to
obtain outputs to be analyzed later. We shall consider three main application areas:


A/ discrete event simulation,

B/ Markov chain Monte Carlo (MCMC) simulation, and

C/ Discrete Time Markov chains (DTMC).

A/ Discrete event simulation models

Discrete event simulation (DES) deals with systems whose state changes at discrete times, not con-
tinuously. These methods were initiated in the late 50’s; for example, the first DES-specific language,
GSP, was developed at GE by Tocher and Owen to study manufacturing problems.
To study such systems, we build a discrete event model. Its evolution in time implies changes in
the attributes of one of its entities, or model components, and it takes place in a given instant. Such
change is called event. The time between two events/instants is an interval. A process describes the
sequence of states of an entity throughout its life in the system.

There are several strategies to describe such evolution, which depend on the mechanism that regulates
time evolution within the system.

 CONCEPT 11.

• When such evolution is based on time increments of the same duration, we talk about

synchronous simulation.

• When the evolution is based on intervals, we talk about asynchronous simulation.

B/ Markov chain Monte Carlo (MCMC)

• MCMC: a modern technique of generating random variables from rather complex, often intractable
distributions, as long as conditional distributions have a reasonably simple form.

• According to the Markov chain Monte Carlo (MCMC) methodology, a long sequence of random vari-
ables is generated from conditional distributions.
A wisely designed MCMC will then produce random variables that have the desired unconditional
distribution, no matter how complex it is.

Example 13.9 (In semiconductor industry).

• The joint distribution of good and defective chips on a produced wafer has a rather complicated
correlation structure.

As a result, it can only be written explicitly for rather simplified artificial models.

• On the other hand, the quality of each chip is predictable based on the quality of the surrounding,
neighboring chips.


• Given its neighborhood, conditional probability for a chip to fail can be written, and thus, its
quality can be simulated by generating a corresponding Bernoulli random variable with 𝑋𝑖 = 1
indicating a failure. 

We discuss this approach in detail in Section 13.7.

C/ Discrete Time Markov Chain (DTMC)

Generation of values of a Markov Chain is discussed now. We consider a homogeneous Discrete Time
Markov Chain (DTMC) described by a transition matrix 𝑃 .

Question 3. How do we generate sample paths of states {𝑋𝑛 }?

Briefly, we have learned that

Definition 13.1.

A (homogeneous) Markov chain 𝑀 is a triple (𝑄, 𝑝, 𝐴) in which:

• 𝑄 is a finite set of states (be identified with an alphabet Σ),

• 𝑝 are initial probabilities, (at time point 𝑡 = 0)

• 𝑃 = [𝑝𝑖𝑗 ]- the state transition matrix, with 𝑝𝑖𝑗 = P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖).

• And such that the memoryless property is satisfied,ie.,

P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖, · · · , 𝑋0 = 𝑎)

= P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖), for all 𝑛.

Independent of time property- Homogeneous Markov chains

If the state transition probabilities 𝑝_{ij}(n + 1) in a Markov chain 𝑀 are independent of time n, they are said to be stationary, time homogeneous or just homogeneous. The state transition probability in a homogeneous chain can then be written without mentioning the time point n:

𝑝_{ij} = P(𝑋_{n+1} = j|𝑋_n = i).     (13.8)

Unless stated otherwise, we assume and will work with homogeneous Markov chains 𝑀. The one-step transition probabilities given by (12.2) of these Markov chains must satisfy:

Σ_{j=1}^{s} 𝑝_{ij} = 1, for each i = 1, 2, · · · , s, and 𝑝_{ij} ≥ 0.

Transition Probability Matrix In practical applications, we are likely given


• the initial distribution (i.e. the probability distribution of starting position of the concerned object
at time point 0), and

• the transition probabilities.

We want to determine the probability distribution of the position 𝑋_n for any time point n > 0. The Markov property, quantitatively described through transition probabilities, can be represented conveniently in the so-called state transition matrix 𝑃 = [𝑝_{ij}]:

𝑃 = [ 𝑝_{11}  𝑝_{12}  𝑝_{13}  . . .  𝑝_{1s} ]
    [ 𝑝_{21}  𝑝_{22}  𝑝_{23}  . . .  𝑝_{2s} ]
    [ 𝑝_{31}  𝑝_{32}  𝑝_{33}  . . .  𝑝_{3s} ]     (13.9)
    [    .       .       .      .       .   ]

Definition 13.2.

Vector 𝑝* is called a stationary distribution of a Markov chain {𝑋𝑛 , 𝑛 ≥ 0} with the state transi-
tion matrix 𝑃 if:

𝑝* 𝑃 = 𝑝* .

Question 4. How to find a stationary distribution of a Markov chain?

Consider a homogeneous DTMC {𝑋_n} described by the transition matrix 𝑃 = [𝑝_{ij}]. How do we generate sample paths of {𝑋_n}? Two issues are involved here:

a) Only steady state results are of interest

b) Transient results are of interest as well.

In a), we want to generate values for a single stationary random variable 𝑝* that describes the steady-state behavior of the MC. Since 𝑝* is a one-dimensional pdf, the algorithm after Theorem 13.2 suffices.

13.5 Synchronous and asynchronous simulation

We illustrate both strategies describing how to sample from a Markov chain with state space 𝑆 and
transition matrix
𝑃 = (𝑝𝑖𝑗 ), with 𝑝𝑖𝑗 = P(𝑋(𝑛 + 1) = 𝑗|𝑋(𝑛) = 𝑖).


The obvious way to simulate the (𝑛 + 1)-th transition, given 𝑋(𝑛), is

𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑒 𝑋(𝑛 + 1) ∼ {𝑝𝑥(𝑛)𝑗 : 𝑗 ∈ 𝑆}.

This synchronous approach has the potential shortcoming that 𝑋(𝑛) = 𝑋(𝑛 + 1), with the corre-
sponding computational effort lost.

Alternatively, we may

• simulate 𝑇𝑛 , the time until the next change of state,

• then sample the new state 𝑋(n + 𝑇_n). If 𝑋(n) = s, 𝑇_n follows a geometric distribution Geom(𝑝_{ss}) of parameter 𝑝_{ss}, and

• 𝑋(n + 𝑇_n) will have a discrete distribution with mass function

{ 𝑝_{sj} / (1 − 𝑝_{ss}) : j ∈ 𝑆 ∖ {s} }.

Should we wish to sample 𝑁 transitions of the chain, assuming 𝑋(0) = 𝑖0 , we could run the following
algorithm.

Algorithm 3 Sampling 𝑁 transitions of a Markov chain

Sampling-transitions(𝑁, S, P)

Do t = 0, 𝑋(0) = 𝑖_0
While t < 𝑁:
    Sample h ∼ Geom(𝑝_{x(t) x(t)})
    Sample 𝑋(t + h) ∼ { 𝑝_{x(t) j} / (1 − 𝑝_{x(t) x(t)}) : j ∈ 𝑆 ∖ {x(t)} }
    Do t = t + h
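A minimal R implementation of this algorithm follows; states are labelled 1..s, and note that rgeom() in R counts failures before the first success, hence the + 1:

sample_chain = function(P, i0, N) {
  t = 0; x = i0; s = nrow(P)
  path = data.frame(time = 0, state = i0)
  while (t < N) {
    h = rgeom(1, 1 - P[x, x]) + 1        # holding time in state x
    probs = P[x, -x] / (1 - P[x, x])     # jump distribution over j != x
    x = ((1:s)[-x])[sample.int(s - 1, 1, prob = probs)]
    t = t + h
    path = rbind(path, data.frame(time = t, state = x))
  }
  path
}
P = matrix(c(0.88, 0.12, 0.15, 0.85), 2, 2, byrow = TRUE)
sample_chain(P, i0 = 1, N = 20)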

Two key strategies for asynchronous simulation

1. Event scheduling

The simulation time advances until the next event and the corresponding activities are executed.

If we have 𝑘 types of events (1, 2, . . . , 𝑘) , we maintain a list of events, ordered according to their
execution times (𝑡1 , 𝑡2 , . . . , 𝑡𝑘 ) .

A routine 𝑅𝑖 associated with the 𝑖-th type of event is started at time 𝜏𝑖 = min(𝑡1 , 𝑡2 , . . . , 𝑡𝑘 ).


2. Process interaction

• A process represents an entity and the set of actions that experiments throughout its life within
the model. The system behavior may be described as a set of processes that interact, for
example, competing for limited resources.
A list of processes is maintained, ordered according to the occurrence of the next event. Pro-
cesses may be interrupted, having their routines multiple entry points, designated reactivation
points.

• Each execution of the program will correspond to a replication, which corresponds to simulating
the system behavior for a long enough period of time, providing average performance measures,
say 𝑋(𝑛), after 𝑛 customers have been processed.
If the system is stable,
𝑋(𝑛) −→ 𝑋.

If, e.g., processing 1000 jobs is considered long enough, we associate with each replication 𝑗
of the experiment the output 𝑋 𝑗 (1000).

13.6 Basic Random Walks

Random walks are special cases of Markov chains, and thus can be studied by Markov chain methods. We use random walks to supply the mathematical base for BLAST. BLAST is a procedure often employed in Bioinformatics that

• searches for high-scoring local alignments between two sequences, then

• tests for significance of the scores found via P-values.

Example 13.10. Consider a simple case of the two aligned DNA sequences

ggagactgtagacagctaatgctata

gaacgccctagccacgagcccttatc

Suppose we give
- a score +1 if the two nucleotides in corresponding positions are the same and
- a score -1 if they are different.

When we compare two sequences from left to right, the accumulated score performs a random walk,
or better a simple random walk in one dimension. The following theory although mentions the generic
case, but we will use this example and BLAST as running example.


Random Walk- a mathematical realization

Let 𝑍1 , 𝑍2 , · · · be independent identically distributed r.v.’s with

P(𝑍𝑛 = 1) = 𝑝 and P(𝑍𝑛 = −1) = 𝑞 = 1 − 𝑝

for all 𝑛. Let

𝑛
∑︁
𝑋𝑛 = 𝑍𝑖 , 𝑛 = 1, 2, · · · and 𝑋0 = 0.
𝑖=1

The collection of r.v.’s {𝑋𝑛 , 𝑛 ≥ 0} is a random process, and it is called the simple random walk in
one dimension.

(a) Describe the simple random walk 𝑋(𝑛).

(b) Construct a typical sample sequence (or realization) of 𝑋(𝑛).

(c) Find the probability that 𝑋(𝑛) = −2 after four steps.

(d) Verify the result of part (c) by enumerating all possible sample sequences that lead to the value 𝑋(n) = −2 after four steps.

(e) Find the mean and variance of the simple random walk 𝑋(𝑛). Find the autocorrelation function
𝑅𝑋 (𝑛, 𝑚) of the simple random walk 𝑋(𝑛).

(f) Show that the simple random walk 𝑋(𝑛) is a Markov chain.

(g) Find its one-step transition probabilities.

(h) Derive the first-order probability distribution of the random walk 𝑋(𝑛).

Solution.
(a) Describe the simple random walk. 𝑋(𝑛) is a discrete-parameter (or time), discrete-state random
process. The state space is 𝐸 = {..., −2, −1, 0, 1, 2, ...}, and the index parameter set is 𝑇 = {0, 1, 2, ...}.
(b) Typical sample sequence.
A sample sequence 𝑥(𝑛) of a simple random walk 𝑋(𝑛) can be produced by tossing a coin every
second and letting 𝑥(𝑛) increase by unity if a head H appears and decrease by unity if a tail T appears.
Thus, for instance, we have a small realization of 𝑋(𝑛) in Table 13.1:


Table 13.1: Simple random walk from Coin tossing

n              0   1   2    3   4   5   6   7   8   9   10   · · ·
Coin tossing       H   T    T   H   H   H   T   H   H   T    · · ·
x_n            0   1   0   −1   0   1   2   1   2   3   2    · · ·

The sample sequence 𝑥(𝑛) obtained above is plotted in (𝑛, 𝑥(𝑛))-plane.
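Such a realization is also easy to generate by simulation. A minimal sketch (assuming Python with NumPy is available; the seed and the number of steps are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed, for reproducibility only

n_steps = 10
# Each step Z_i = +1 (head H) with p = 0.5 and -1 (tail T) with q = 0.5
z = rng.choice([1, -1], size=n_steps, p=[0.5, 0.5])
# X_0 = 0 and X_n = Z_1 + ... + Z_n, so the path is the cumulative sum
x = np.concatenate(([0], np.cumsum(z)))
print(x)   # one sample sequence x(n), analogous to the row x_n of Table 13.1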


The simple random walk 𝑋(𝑛) specified in this problem is said to be unrestricted because there are
no bounds on the possible values of 𝑋.

Remark 5. The simple random walk process is often used in game theory or bioinformatics.

We define the ladder points to be the points in the walk lower than any previously reached point. An
excursion in a walk is the part of the walk from a ladder point to the highest point attained before the
next ladder point.

(c) The probability that 𝑋(𝑛) = −2 after four steps.


We compute the first-order probability distribution of the random walk 𝑋(𝑛):

𝑝𝑛 (𝑘) = P(𝑋𝑛 = 𝑘),

with boundary conditions 𝑝0 (0) = 1, and 𝑝𝑛 (𝑘) = 0 if 𝑛 < |𝑘|.

Thus 𝑛 ≥ |𝑘|. We find that


p_n(k) = \binom{n}{(n+k)/2} p^{(n+k)/2} q^{(n-k)/2}, \quad where q = 1 − p;   (13.10)

by letting A, B be the r.v.'s counting the numbers of +1 and −1 steps, so that

A + B = n, \qquad A − B = X_n.

When X(n) = k, we see that A = (n + k)/2, which is a binomial r.v. with parameters (n, p). We
conclude that the probability distribution of X(n) is given by (13.10), in which n ≥ |k|, and n, k must
both be even or both be odd.
Set k = −2 and n = 4 in (13.10) to get the concerned probability p_4(−2) = \binom{4}{1} p q^3; for a
symmetric walk (p = q = 1/2) this equals 4/16 = 1/4.

(d) Verify the result of part (c) by enumerating all possible sample sequences that lead to the value
X(n) = −2 after four steps. DIY!
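For part (d), a brute-force check is immediate in code. A short sketch (Python, assuming the symmetric case p = q = 1/2) that enumerates all 2^4 step sequences:

from itertools import product

count = 0
for steps in product([1, -1], repeat=4):   # all 2^4 = 16 four-step sequences
    if sum(steps) == -2:                   # the walk ends at X(4) = -2
        count += 1
print(count, count / 2**4)   # 4 sequences, so p_4(-2) = 4/16 = 1/4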
(e) The mean and variance of the simple random walk 𝑋(𝑛). Use the fact

P(𝑍𝑛 = +1) = 𝑝 and P(𝑍𝑛 = −1) = 1 − 𝑝.


13.7 Solving problems by Monte Carlo methods

The term Monte Carlo originally referred to simulations that involved random walks [see Section 13.6]
and was first used by John von Neumann and S. M. Ulam in the 1940s. Today, the Monte Carlo method
refers to any simulation that involves the use of random numbers.
In the following parts, we show that Monte Carlo simulations (or experiments) are a feasible and in-
expensive way to understand the phenomena of interest. To conduct a simulation, you need a model
that represents your population or phenomenon of interest and a way to generate random numbers
(according to your model) using a computer. The data that are generated from your model can then be
studied as if they were observations.
As we will see, one can use statistics based on the simulated data (means, medians, modes, vari-
ance, skewness, etc.) to gain understanding about the population.

13.7.1 Basic Monte Carlo Procedure- Methodology

The fundamental idea behind Monte Carlo simulation for inferential statistics is that insights regarding
the characteristics of a statistic can be gained by repeatedly drawing random samples from the same
population of interest and observing the behavior of the statistic over the samples.
W. Martinez [?] suggests steps of a basic Monte Carlo procedure as follows:

1. Determine the pseudo-population or model that represents the true population of interest.

2. Use a sampling procedure to sample from the pseudo-population.

3. Calculate a value for the statistic of interest and store it.

4. Repeat steps 2 and 3 for 𝑀 trials.

5. Use the 𝑀 values found in step 4 to study the distribution of the statistic.
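The five steps translate directly into code. A minimal sketch (Python with NumPy; a standard normal pseudo-population and the sample median as the statistic of interest are both arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
M, n = 5000, 25                      # M Monte Carlo trials, samples of size n

medians = np.empty(M)
for j in range(M):                   # step 4: repeat for M trials
    sample = rng.standard_normal(n)  # step 2: sample the pseudo-population
    medians[j] = np.median(sample)   # step 3: compute and store the statistic

# step 5: study the Monte Carlo distribution of the statistic
print(medians.mean(), medians.std())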

13.7.2 Estimating Probabilities by Monte Carlo simulation

This section discusses the most basic and typical application of Monte Carlo methods. Keeping in
mind that probabilities are long-run proportions, we generate a long run of experiments and compute
the proportion of times when our event occurred.

For a variable X, the probability p = P[X ∈ A] is estimated by the estimator

p̂ = P̂[X ∈ A] = \frac{number of X_1, X_2, . . . , X_N ∈ A}{N} = \frac{S}{N}   (13.11)

where N is the size of a Monte Carlo experiment, X_1, X_2, · · · , X_N are generated random variables with
the same distribution as X, and a “hat” means the estimator. The latter is a very common and standard
notation:


θ̂ = estimator of an unknown quantity θ.

The number S of X_1, X_2, · · · , X_N that fall within the set A is binomial Bin(N, p), with mean E[S] =
Np and variance V[S] = Np(1 − p). The accuracy of a Monte Carlo study depends on the expectation
and variance of the estimator p̂:

E[p̂] = \frac{Np}{N} = p,

V[p̂] = \frac{Np(1 − p)}{N^2}   (13.12)

\implies Std[p̂] = \sqrt{\frac{p(1 − p)}{N}}.

♣ OBSERVATION 3.

1. Accuracy of a Monte Carlo study:

The first result, E[p̂] = p, shows that our Monte Carlo estimator of p is unbiased, so that over a long
run it will, on the average, return the desired quantity p. The last result, the standard deviation

Std[p̂] = \sqrt{\frac{p(1 − p)}{N}},

indicates that the standard error of our estimator p̂ decreases with N at the rate of 1/\sqrt{N}.

Larger Monte Carlo experiments produce more accurate results. A 100-fold increase in the number of
generated variables reduces the standard deviation (therefore, enhancing accuracy) by a factor of 10.

2. Size of a Monte Carlo study:

The size 𝑁 should fulfill


N ≥ 0.25 \left( \frac{z_{α/2}}{ε} \right)^2.

Why? Because we want to design a Monte Carlo study that attains desired accuracy. That is,

* we can choose some small 𝜀 and 𝛼 and

** conduct a Monte Carlo study of such a size 𝑁 that will guarantee

an error |p̂ − p| not exceeding ε with high probability (1 − α), i.e.

P[ |p̂ − p| > ε ] ≤ α.   (13.13)

Recall that z_α = Φ^{−1}(1 − α) is the critical value of a standard normal variable Z, i.e. the value that
is exceeded with probability α.
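A small sketch of this sizing rule (Python; the standard-library NormalDist supplies Φ^{−1}, and ε, α are arbitrary illustrative choices):

from math import ceil
from statistics import NormalDist

eps, alpha = 0.01, 0.05                    # desired accuracy and error probability
z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2} = Phi^{-1}(1 - alpha/2)
N = ceil(0.25 * (z / eps) ** 2)
print(N)   # 9604: this size guarantees |p_hat - p| <= 0.01 with probability 0.95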


13.8 Chapter 13’s Problems

Computational problems

Problem 13.1.

Consider a transition probability matrix of a Markov chain 𝑀 as follows:


P = \begin{bmatrix} 0.88 & 0.12 \\ 0.15 & 0.85 \end{bmatrix}.   (13.14)

M represents a system with 2 states, on and off, of an email server, where on = acceptable operation
and off = overload (when the server cannot receive or deliver email). The time-step unit is 1 minute.
Which of the following is true:

• 𝑃 has 3 eigenvalues

• 𝑀 is reducible

• 𝑃 is not a stochastic matrix; 𝑃 is a regular matrix.

Problem 13.2.

Let 𝑀 be a two state Markov chain, with its state transition matrix is
P = \begin{bmatrix} p_{11} & p_{21} \\ p_{12} & p_{22} \end{bmatrix} = \begin{bmatrix} 1 − c & c \\ d & 1 − d \end{bmatrix}, \quad where 0 < c < 1, 0 < d < 1.   (13.15)

𝑀 represents a traffic system with 2 states on and off of a road at SG, where on = acceptable vehicle
flow, and off = traffic jam (when the road can not fulfill its functionality).
When 𝑐 = 𝑑, compute the limit matrix lim𝑛→∞ 𝑃 𝑛 .

Problem 13.3.

Consider a Markov chain with two states and transition probability matrix
P = \begin{bmatrix} 3/4 & 1/4 \\ 1/2 & 1/2 \end{bmatrix}

Find the stationary distribution 𝑝* of the chain.
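One possible numeric approach for this problem (a sketch in Python with NumPy): the stationary row vector satisfies p* P = p* together with the normalization ∑ p*_i = 1, so we may solve the corresponding over-determined linear system:

import numpy as np

P = np.array([[0.75, 0.25],
              [0.50, 0.50]])

# Stack the equations p* (P - I) = 0 with the constraint sum(p*) = 1
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
p_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(p_star)   # [2/3, 1/3]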

Problem 13.4.


Toyota (denoted 𝑇 ) currently takes over 60% of the yearly car market in Vietnam, its rival Ford and
other brands (denoted 𝐹 ) takes the other share. Historical data shows that the state transition matrix
𝑃 of the market fluctuation is found as

            T      F
P =  T  [ 0.88   0.12 ]                                   (13.16)
     F  [ 0.15   0.85 ]

The car market share of Toyota in the next 4 years is given by the vector 𝑝(4) as:
a) 𝑝(4) = [𝑝𝑇 (4), 𝑝𝐹 (4)] = 𝑃 3 𝑝(1)?
b) 𝑝(4) = [𝑝𝑇 (4), 𝑝𝐹 (4)] = 𝑃 3 𝑝(0)?
c) 𝑝(5) = [𝑝𝑇 (5), 𝑝𝐹 (5)] = 𝑃 5 𝑝(0)?
d) 𝑝(4) = [2/3, 1/3]?
—————————————————–

Theoretic problems

A/ Concepts

1. Show that if 𝑃 is a Markov matrix, then 𝑃 𝑛 is also a Markov matrix for any positive integer 𝑛.

2. A state transition diagram of a finite-state Markov chain is a line diagram with a vertex corresponding
to each state and a directed line between two vertices 𝑖 and 𝑗 if 𝑝𝑖𝑗 > 0.

In such a diagram, if one can move from i to j by a path following the arrows, then i → j. The
diagram is useful to determine whether a finite-state Markov chain is irreducible or not, or to check
for periodicities.

Draw the state transition diagrams and classify the states of the MCs with the following transition
probability matrices:
P_1 = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0.5 \\ 0.5 & 0.5 & 0 \end{bmatrix}; \quad
P_2 = \begin{bmatrix} 0 & 0 & 0.5 & 0.5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix};


P_3 = \begin{bmatrix} 0.3 & 0.4 & 0 & 0 & 0.3 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}

3. Verify the transitivity property of the Markov chain; that is, if i → j and j → k, then i → k. (Hint: use
   the Chapman-Kolmogorov equations).

4. Show that in a finite-state Markov chain, not all states can be transient.

B/ Markov Chains and Modeling

1. A certain product is made by two companies, A and B, that control the entire market. Currently, A
   and B have 60 percent and 40 percent, respectively, of the total market. Each year, A loses 5 percent
   of its market share to B, while B loses 3 percent of its share to A.

Find the relative proportion of the market that each hold after 2 years.

2. Let two gamblers, A and B, initially have 𝑘 dollars and 𝑚 dollars, respectively. Suppose that at
each round of their game, A wins one dollar from B with probability 𝑝 and loses one dollar to B with
probability 𝑞 = 1 − 𝑝. Assume that A and B play until one of them has no money left. (This is known
as the Gambler’s Ruin problem.)

Let 𝑋𝑛 be A’s capital after round 𝑛, where 𝑛 = 0, 1, 2, · · · and 𝑋0 = 𝑘.

(a) Is 𝑋(𝑛) = {𝑋𝑛 , 𝑛 ≥ 0} a Markov chain with absorbing states?

(b) Find its transition matrix 𝑃 . Realize 𝑃 when 𝑝 = 𝑞 = 1/2 and 𝑁 = 4

(c*) What is the probability of A’s losing all his money?

GUIDANCE for solving.

Different rounds are assumed independent.

• The gambler 𝐴, say, plays continuously until he either accumulates a target amount of 𝑚, or
loses all his money.
We introduce the Markov chain shown whose state 𝑖 represents the gambler’s wealth at the
beginning of a round.

• The states 𝑖 = 0 and 𝑖 = 𝑚 correspond to losing and winning, respectively.

• All states are transient, except for the winning and losing states which are absorbing. Thus, the
problem amounts to finding the probabilities of absorption at each one of these two absorbing
states. Note, these absorption probabilities depend on the initial state 𝑖.
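A simulation sketch of the absorption probability (Python with NumPy; k, m, p are arbitrary illustrative values; for p = 1/2 the classical answer for A's ruin is m/(k + m)):

import numpy as np

rng = np.random.default_rng(2)
k, m, p = 2, 2, 0.5               # A starts with k dollars; absorption at 0 or k+m
trials, ruined = 20000, 0
for _ in range(trials):
    x = k
    while 0 < x < k + m:          # play rounds until an absorbing state is hit
        x += 1 if rng.random() < p else -1
    ruined += (x == 0)
print(ruined / trials)            # approximately m/(k+m) = 0.5 when p = 1/2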

Chapter 14

Poisson Process and Variations


Systems changed by random arrivals in time


14.1 Introduction and Overview

Engineering and service problems being solved by the theory of Poisson processes include:

1. Model dynamical systems:

a/ a flow of patients arriving at the doctor's office (equivalently, emails arriving at a server...)

b/ a bus system with many lines in metropolis, in Section 14.8.

2. Queuing systems and management of flows in industry and economics.

REMINDER: Time homogeneous Markov processes 𝑀 = {𝑋(𝑡)}𝑡≥0 = (S, 𝑝, 𝑃 (𝑡)) have stationary
(or homogeneous) transition probabilities. For such processes we have known that

𝑝𝑖,𝑗 (𝑡) = P[𝑋(𝑠 + 𝑡) = 𝑗 |𝑋(𝑠) = 𝑖]


(14.1)
𝑝(𝑢) = [𝑝𝑗 (𝑢)]𝑗∈S where 𝑝𝑗 (𝑢) = P[𝑋(𝑢) = 𝑗].

Here

• 𝑝𝑖,𝑗 (𝑡) is the transition probability from state 𝑖 to state 𝑗 after a time duration 𝑡, whatever the time
point 𝑠 is, and

• 𝑝𝑗 (𝑢) is the ‘state’ probability that the process is in state 𝑗 at time point 𝑢.

We will cover the major topics:

• Poisson process in Section 14.2

• Arrival-Type Processes: Few variations in Section 14.3

• Conditional Distribution in Section 14.4

• Compound Poisson process in Section 14.5

• Introductory Birth and Death processes in Section 14.7

14.2 The Poisson process

A Poisson process is a sequence of Poisson random variables.


A Poisson random variable X counts the number of (rare) events randomly occurring in one unit of
time; denoted by X ∼ Pois(λ), it is determined by 5 components:


• The observed values 𝑆𝑋 = Range(𝑋) = {0, 1, 2, 3, 4, . . . , 𝑚, 𝑚 + 1, . . .}.

• Probability density function of 𝑋 ∼ Pois(𝜆) is


p_X(i) = p(i; λ) = P[X = i] = \frac{e^{−λ} λ^i}{i!}, \quad i = 0, 1, 2, ...   (14.2)

Here the constant λ > 0 is the average number of events occurring in one unit of time, or the
rate or speed of events. Then we can write X(t) to count the number of events randomly occurring
in the time interval [0, t), now with pdf

p_{X(t)}(i) = p(i; λt) = P[X(t) = i] = \frac{e^{−λt} (λt)^i}{i!}, \quad i = 0, 1, 2, ...   (14.3)

• Probability cumulative function of 𝑋 ∼ Pois(𝜆) is

F(x; λ) = P(X ≤ x) = \sum_{i=0}^{x} P[X = i] = \sum_{i=0}^{x} p(i; λ), \quad x = 0, 1, 2, . . .   (14.4)

• The expectation and variance of 𝑋 ∼ Pois(𝜆) respectively are


E[X] = μ = \sum_{all i} i \, p_X(i) = λ,
                                                     (14.5)
V[X] = E[(X − μ)^2] = λ.

• Hence, if the sample variance V[x] of a data set x is much greater than its sample mean E(x), the
  Poisson distribution would not be a good model for fitting the data.

Example 14.1.

In the Poisson Pois(𝜆), the constant 𝜆 > 0 is the rate or speed of events, we see that if customers
come to a SCB branch in Bangkok and follow a Poisson distribution with rate 𝜆 = 10 then


• the mean 𝜇 = 𝜆 = 10 customers per hour, the variance is also V(𝑋) = 𝜎 2 = 𝜆 = 10, and

• the standard deviation of the arrivals is σ = \sqrt{10} ≈ 3.16 customers per hour. We see that the proba-
  bility of 20 arrivals is not negligible.

Example 14.2. [Environmental Science.] (Atmospheric pollution)

• Atmospheric dust particles (PM10 or PM2.5) at a particular location cause a serious environ-
mental problem for inhabitants.

• The number of particles within a unit volume is observed by focusing a powerful microscope on
the particles and making counts. The results of tests on 100 such volumes are shown in Table
14.1.

Table 14.1: Poisson distribution of dust particles in the atmosphere

Particles in unit volume    0    1    2    3    4   >4

Observed frequency         13   24   30   18    7    8

Poisson frequency          12   26   27   19   10    6

Compute the theoretical Poisson frequency 𝑝𝑋 (𝑥) by exploiting the data.

HINT: Using formula (14.5), the estimated mean of the number of dust particles within each volume is
calculated as follows:

x̄ = \sum_i f_i x_i = \frac{13}{100}·0 + \frac{24}{100}·1 + \frac{30}{100}·2 + \frac{18}{100}·3 + \frac{7}{100}·4 + \frac{8}{100}·6 = 2.14 = λ.

The theoretical Poisson frequencies of occurrence shown in Table 14.1 are obtained from

P(X = x | λ) = \frac{λ^x e^{−λ}}{x!} = 2.14^x e^{−2.14} / x! \quad for x = 0, 1, 2, 3, 4, 6. 
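A sketch reproducing this fit (Python with NumPy; as in the author's calculation, the '>4' class is scored as x = 6):

import numpy as np
from math import exp, factorial

x_vals = np.array([0, 1, 2, 3, 4, 6])    # '>4' class scored as x = 6
obs    = np.array([13, 24, 30, 18, 7, 8])

lam = (obs * x_vals).sum() / obs.sum()   # estimated mean, 2.14
fitted = [100 * lam**x * exp(-lam) / factorial(x) for x in x_vals]
print(lam, np.round(fitted, 1))   # compare with the Poisson row of Table 14.1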

14.2.1 Counting Process

Counting processes are often found in Arrival-Type Processes. For such a process, we are interested
in occurrences that have the character of an ‘arrival’, such as
- message receptions at a receiver,
- job completions in a manufacturing cell, and customers purchase at a store, etc.


Definition 14.1 (Counting process).

Suppose we count the total number of events 𝑁 (𝑡) that have occurred up to time 𝑡.

𝑁 (𝑡) is said to be a counting process if it satisfies:

a/ it is nonnegative and integer valued, 𝑁 (𝑡) ∈ N,

b/ if 𝑠 < 𝑡 then 𝑁 (𝑠) ≤ 𝑁 (𝑡), and

c/ for 𝑠 < 𝑡, the quantity 𝑁 (𝑡) − 𝑁 (𝑠) equals the number of events that occurred in the interval
[𝑠, 𝑡].

𝑡𝑖𝑚𝑒 𝑇 : 0 − − − −s − − − −t − − − − − −𝑢 − − − − − 𝑣 − − − −− >

𝑠𝑡𝑎𝑡𝑒 N : 0 − − N(s) = i − − − N(t) = j − − − −𝑁 (𝑢) − − − 𝑁 (𝑣) − − − −

[E.g. events = customer arrivals at an HSBC bank from 8 am till 12 pm, equivalent to interval [0, 4], 𝑡 =
4 hours; events = storms attacking coastal area of Vietnam during June to December, equivalent to
interval [0, 7], 𝑡 = 7 months...)

A counting process N(t) may satisfy the following two properties:

P1. Independent increment property: says that the (random) number of events (arrivals) 𝑁 (𝑡)−𝑁 (𝑠)
and 𝑁 (𝑣) − 𝑁 (𝑢) in two disjoint intervals [say [𝑠, 𝑡] ∩ [𝑢, 𝑣] = ∅] are independent.

P2. Stationary increment property: says that the distribution of the number of events 𝑁 (𝑠, 𝑡) :=
𝑁 (𝑡)−𝑁 (𝑠) occurring in interval [𝑠, 𝑡] depends only on the length ℎ = 𝑡−𝑠 of the interval, not on the position
of the interval.

P2 is formulated based on the generic concept of a stationary process, as follows.

 CONCEPT 12.

A stochastic process {X(t), t ≥ 0} is stationary when all the X(t) have the same distribution.
That means, mathematically, that for any τ the distribution of a stationary process is unaffected by a
shift in the time origin, i.e. X(t) and X(t + τ) have the same distribution. For the first-order distribution,
this means

F_X(x; t) = F_X(x; t + τ) = F_X(x); and f_X(x; t) = f_X(x).


14.2.2 Poisson process and its properties

Definition 14.2 (Poisson process in words).

A Poisson process is a counting process 𝑁 (𝑡) that satisfies the followings.

The Orderliness says two or more arrivals cannot simultaneously occur.

P1- Independent increment: the (random) number of events- arrivals in two disjoint intervals are
independent.

P2- Stationary increment: the distribution of the number of events depends only on the length of
the interval, not on the position of the interval. It mathematically means: 𝑁 (𝑡 + ℎ) − 𝑁 (𝑠 + ℎ)
(the number of events in the interval [𝑠 + ℎ, 𝑡 + ℎ]) has the same distribution as 𝑁 (𝑡) − 𝑁 (𝑠) (the
number of events in the interval [𝑠, 𝑡]), for all 𝑠 < 𝑡 and ℎ > 0.

Combining Definition 14.1 and 14.2 we have the following working description.

Definition 14.3. 𝑁 (𝑡) is Poisson process with rate 𝜆, denoted 𝑁 (𝑡) ∼ Pois(𝜆), if

1. 𝑁 : R+ −→ N is a map, values 𝑁 (𝑡) ∈ N, for any 𝑡 ∈ R+ , and 𝑁 (0) = 0,

2. P[𝑁 (ℎ) ≥ 2] = 𝑜(ℎ) for small ℎ, [the Orderliness]

3. P[𝑁 (ℎ) = 1] = 𝜆ℎ + 𝑜(ℎ) for small ℎ, [the rate of events]

4. if 0 ≤ 𝑎 < 𝑏 then 𝑁 (𝑎) ≤ 𝑁 (𝑏), [non-decreasing function], and the quantity 𝑁 (𝑏) − 𝑁 (𝑎) equals
the number of events that occurred in the interval [𝑎, 𝑏],

5. 𝑁 (𝑡)−𝑁 (𝑠) and 𝑁 (𝑣)−𝑁 (𝑢) are independent random variables, for [𝑠, 𝑡]∩[𝑢, 𝑣] = ∅, [independent
increments],

6. 𝑁 (𝑡 + ℎ) − 𝑁 (𝑠 + ℎ) has the same distribution as 𝑁 (𝑡) − 𝑁 (𝑠), for all 𝑠 < 𝑡 and ℎ > 0 [stationary
increments,].

Combining 4. and 6. we now can say 𝑁 (𝑡 − 𝑠), the number of events that occurred in the time interval
[0, 𝑡 − 𝑠], has the same distribution as that of 𝑁 (𝑡) − 𝑁 (𝑠), i.e. P[𝑁 (𝑡 − 𝑠) = 𝑘] = P[𝑁 (𝑡) − 𝑁 (𝑠) = 𝑘],
for all 0 ≤ 𝑠 < 𝑡.
See postulates of Poisson processes in details in Section 14.2.3.

Example 14.3. Under the assumption of purely random events, here are typical Poisson processes:

the number of students entering the school before time t,

the number of goals a footballer scores by time t.

However, the number of students who are in school at time t is a stochastic process, but not a counting
process (it can decrease), and so not a Poisson process. 


Knowledge box 12 (Poisson-type events).

• Therefore, we can think of the following one-way scheme:

Stochastic process ⇐ Counting process ⇐ Poisson process.

But the other way round is not true.

• The Poisson process is often applied to occurrences of events in time; like requests for service,
breakdowns of equipment, or arrivals of vehicles at a road intersection in cities...

• Hereafter we will refer to Poisson-type events that depend on temporal scale (in time) as arrivals,
such as customers arriving at a queue.

• The concept of Poisson process can be extended to spatial applications (in space) to model,
e.g, the locations of demands for service.

Theorem 14.1 (Computing Poisson pdf).

If 𝑁 (𝑡) is a Poisson process, then its pmf is given by

P[N(t) = n] = \frac{(λt)^n}{n!} e^{−λt}, \quad n = 0, 1, 2, ...   (14.6)

• This means 𝑁 (𝑡) ∼ Pois(𝜆𝑡), it is distributed as a Poisson random variable with mean 𝜆𝑡.

• We have P[𝑁 (𝑠 + 𝑡) − 𝑁 (𝑠) = 𝑛] = P[𝑁 (𝑡) = 𝑛], by the stationarity of the increments, so this
distribution completely characterizes the entire process.
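A simulation sketch of this theorem (Python with NumPy), using the standard exponential-interarrival construction of a Poisson process and comparing the empirical frequencies of N(1) with the pmf (14.6); the rate λ and the number of runs are arbitrary choices:

import numpy as np
from math import exp, factorial

rng = np.random.default_rng(3)
lam, t, runs = 4.0, 1.0, 50000

counts = np.empty(runs, dtype=int)
for r in range(runs):
    s, n = 0.0, 0
    while True:
        s += rng.exponential(1 / lam)   # interarrival times with rate lam
        if s > t:
            break
        n += 1
    counts[r] = n

for n in range(4):
    emp = (counts == n).mean()
    theo = (lam * t) ** n / factorial(n) * exp(-lam * t)
    print(n, round(emp, 4), round(theo, 4))   # empirical vs. formula (14.6)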

14.2.3 Postulates of Poisson Processes

We consider three postulates being associated with the Poisson process of rate 𝜆.

1. The Orderliness: given that one Poisson arrival occurs at a particular time, the conditional probability
that another occurs at exactly the same time is zero.

Fact 14.1. Thus, two or more arrivals cannot occur simultaneously, i.e. the probability that at least
two Poisson arrivals occur in a time interval of length 𝜏 is 𝑜(𝜏 ):

P[𝑁 (𝑡 + 𝜏 ) − 𝑁 (𝑡) > 1] = 𝑜(𝜏 ), as 𝜏 → 0.


2. The Independent increment: the numbers of arrivals happening in disjoint time intervals are mutu-
ally independent random variables. Define

𝑁 (𝑠, 𝑡) = 𝑁 (𝑡) − 𝑁 (𝑠)

be the number of customers arriving in the interval [s, t].

Fact 14.2. If the {[𝑎𝑖 , 𝑏𝑖 ] : 1 ≤ 𝑖 ≤ 𝑛} are non-overlapping, then random variables {𝑁 (𝑎𝑖 , 𝑏𝑖 ) : 1 ≤
𝑖 ≤ 𝑛} are independent.

3. The Stationary increment: the number of Poisson-type arrivals happening in any prespecified time
interval of fixed length is

• neither dependent on the ”starting time” of the interval,


• nor on the total number of Poisson arrivals recorded prior to the interval.

14.3 Arrival-Type Processes: Few variations

14.3.1 Multiple Independent Poisson Processes

Proposition 14.2 (Sum of independent Poisson processes).

• Suppose that {𝑁1 (𝑡)} and {𝑁2 (𝑡)} are two independent Poisson processes with rates 𝜆1 and 𝜆2
respectively. [They are the respective cumulative numbers of arrivals in time interval [0, 𝑡).]

• Let 𝑁 (𝑡) = 𝑁1 (𝑡) + 𝑁2 (𝑡), the combined process of a cumulative number of arrivals until time 𝑡.
Then {𝑁 (𝑡)} is a Poisson process with arrival-rate parameter 𝜆1 + 𝜆2 ( the sum of the individual
arrival rates).
N(t) ∼ Pois((λ_1 + λ_2) t).

This result extends in the obvious way to more than two independent Poisson processes. There
are many ways to prove this result, but the simplest is just to observe that the pooled process satisfies
the postulates of the Poisson process in Section 14.2.3.

Fact 14.3 (Min of exponential random variables).

If T_1, T_2, · · · , T_n are independent exponential variables with rates β_i, that is T_i ∼ E(β_i) for all
i = 1, 2, . . . , n, then the minimum T = min{T_1, T_2, · · · , T_n} ∼ E(\sum_{i=1}^{n} β_i); it is exponentially
distributed with parameter \sum_{i=1}^{n} β_i.

For n = 2, an interesting problem regarding the competition between two exponential random vari-
ables is the probability that one of the two variables, say X, is less than the other, say Y; for instance,
in service provision, the provider completing its service earlier than the other provider is considered
the winner.


Example 14.4. [Transportation Science.]

Given two independently operating Poisson processes with rate parameters 𝜆1 and 𝜆2 respectively,
what is the probability that an arrival from process 1 (a ”type 1” arrival as blue buses) occurs before an
arrival from process 2 (a ”type 2” arrival as red buses)?

GUIDANCE for solving. To solve this problem, let the two independent inter-arrival times of interest
be denoted by A and B for processes 1 and 2, respectively. We want to compute P[A < B].

Work it out yourself, invoking our knowledge of Poisson processes and using the joint pdf f_{A,B}(a, b) =
f_A(a) f_B(b), where the pdf's for A, B are negative exponential with means 1/λ_1 and 1/λ_2 respectively, to
get

P[A < B] = \frac{λ_1}{λ_1 + λ_2}

(integrate over the part of the positive quadrant for which a < b).

PROOF: Denote T_1 = A, T_2 = B, with means β_1 = 1/λ_1 and β_2 = 1/λ_2; then

P[T_1 < T_2] = \int_0^{+∞} dv \int_0^{v} f_{T_1,T_2}(u, v) \, du = \frac{β_2}{β_1 + β_2}.   (14.7)

Indeed, since A, B are independent, the joint p.d.f. of T_1, T_2 is

f_{T_1,T_2}(u, v) = f_{T_1}(u) f_{T_2}(v) = \frac{1}{β_1} \frac{1}{β_2} e^{−u/β_1} e^{−v/β_2}.

The probability that T_1 < T_2 is obtained by the Total Probability law:

P[T_1 < T_2] = \int_0^{∞} dv \int_0^{v} f_{T_1,T_2}(u, v) \, du
             = \frac{1}{β_1 β_2} \int_0^{∞} dv \int_0^{v} e^{−u/β_1} e^{−v/β_2} \, du
             = \frac{β_2}{β_1 + β_2} = \frac{λ_1}{λ_1 + λ_2}.
Yet a second way of deriving the result involves consideration of a long time period of length 𝑇 . Dur-
ing that period the expected total number of arrivals is

E[𝑋(𝑇 )] = (𝜆1 + 𝜆2 )𝑇,

the expected number of type 1 arrivals is

E[𝑋1 (𝑇 )] = 𝜆1 𝑇...

This result makes sense intuitively: the probability that a type 1 arrival occurs before a type 2 arrival
is equal to the fraction of the pooled arrival rate comprising type 1 arrivals.
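A quick simulation sketch of this competition result (Python with NumPy; the rates λ_1 = 2, λ_2 = 3 are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
lam1, lam2, n = 2.0, 3.0, 100_000
A = rng.exponential(1 / lam1, n)   # inter-arrival times of process 1 (mean 1/lam1)
B = rng.exponential(1 / lam2, n)   # inter-arrival times of process 2 (mean 1/lam2)
print((A < B).mean(), lam1 / (lam1 + lam2))   # both close to 0.4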


14.3.2 Thinning of a Poisson process

Proposition 14.3.

Let {𝑁 (𝑡), 𝑡 ≥ 0} represent Poisson arrivals with rate 𝜆. Moreover, each arrival can be of
type 1 with probability 𝑝 and be of
type 2 with probability 1 − 𝑝 independently of all other arrivals.
Let 𝑁1 (𝑡) be the number of type 1 arrivals up to time 𝑡 and 𝑁2 (𝑡) be the number of type 2 arrivals
up to time 𝑡. Then {𝑁1 (𝑡), 𝑡 ≥ 0} and {𝑁2 (𝑡), 𝑡 ≥ 0} are independent Poisson processes with
rates 𝜆𝑝 and 𝜆(1 − 𝑝) respectively.

Example 14.5. [Actuarial Science.]

Suppose the number of claims {N(t)} to an insurance company like AIA is formed from smokers
and non-smokers following independent Poisson processes. Intuitively one thinks that 1/4 of the claims
are from non-smokers and the rest from smokers.

• Suppose that in the process{𝑁 (𝑡)} we classify each event as

Type I (e.g., non-smoker) and

Type II (e.g., smoker) with probability 𝑝 and 1 − 𝑝 respectively.

• Then the number of events, 𝑁1 (𝑡) of Type I and the number of events, 𝑁2 (𝑡) of Type II also give rise
to Poisson processes.

If the rate of {𝑁 (𝑡)} is 𝜆, then the rate of {𝑁1 (𝑡)} is 𝜆𝑝 and the rate of {𝑁2 (𝑡)} is 𝜆(1 − 𝑝). 

14.4 Conditional Distribution

Recall conditional distributions from Section 8.4, in which the concepts of conditional expectation and
conditional variance are particularly useful for Section 14.5.

14.4.1 Conditional expectation

Definition 14.4 (Conditional expectation).

The conditional expectation of 𝑌 given 𝑋 = 𝑥 is defined as

E[𝑌 | 𝑥] := E[𝑌 | 𝑋 = 𝑥].

It is the expected value of 𝑌 with respect to the conditional p.m.f. 𝑝𝑌 (𝑦|𝑥) (or the conditional p.d.f.
𝑓𝑌 (𝑦|𝑥)).


Two cases include:

E[Y | x] = E[Y | X = x] =
\begin{cases}
\sum_{y_j ∈ Range(Y)} y_j \, p_Y(y_j | x) & if (X, Y) discrete, \\
\int_{−∞}^{∞} y \, f_Y(y|x) \, dy & if (X, Y) continuous.
\end{cases}   (14.8)

♣ OBSERVATION 4.

Note that for a pair of variables (𝑋, 𝑌 ), the conditional expectation E[𝑌 | 𝑋 = 𝑥] changes with 𝑥,
if 𝑋 and 𝑌 are dependent. Thus, we can consider E[𝑌 | 𝑋] to be a random variable, which is a
function of 𝑋.

Proposition 14.4 (Change expectation via conditioning).

E[Y | X] is a random variable, so E[E[Y | X]] does exist, computed over the range of X. We have a
very powerful identity

E[Y] = E[E[Y | X]].   (14.9)

Proof. We check this when Y, X are discrete. We must show that

E[Y] = \sum_x E[Y | X = x] P[X = x].   (14.10)

We start from the RHS:

\sum_x E[Y | X = x] P[X = x] = \sum_x \Big[ \sum_y y \, p_Y(y|x) \Big] P[X = x]
  = \sum_x \Big[ \sum_y y \, \frac{P[Y = y, X = x]}{P[X = x]} \Big] P[X = x]   (14.11)
  = \sum_x \sum_y y \, P[Y = y, X = x] = \sum_y y \sum_x P[Y = y, X = x].

Hence, E[E[Y | X]] = \sum_x E[Y | X = x] P[X = x] = \sum_y y \, P[Y = y] = E[Y].

Expectation of a function of a random variable: Just as conditional probabilities satisfy all the
properties of ordinary probabilities, so do the conditional expectations satisfy all the properties of
ordinary expectations.
Let 𝑔(𝑌 ) be a function of a r.v. 𝑌 . We consider the conditional expectation of 𝑔(𝑌 ), as an extension
of (14.8):

E[g(Y) | x] = E[g(Y) | X = x] =
\begin{cases}
\sum_y g(y) \, p_Y(y|x) & if (X, Y) discrete, \\
\int_{−∞}^{∞} g(y) \, f_Y(y|x) \, dy & if (X, Y) continuous.
\end{cases}   (14.12)

E[𝑔(𝑌 )|𝑋] is a function of 𝑋 and takes the value E[𝑔(𝑌 )|𝑋 = 𝑥] when 𝑋 = 𝑥.

DATA ANALYTICS- FOUNDATION


CHAPTER 14. POISSON PROCESS AND VARIATIONS
416 SYSTEMS CHANGED BY RANDOM ARRIVALS IN TIME

Proposition 14.5 (Expectation of a function of a random variable).

Consequently, E[g(Y)|X] is a random variable, whose mean can be calculated. We have a gener-
alization of Equation (14.9) as follows:

E[g(Y)] = E[E[g(Y)|X]].   (14.13)

14.4.2 Conditional variance

Similarly as Definition 14.4, we define the conditional variance of 𝑌 , given 𝑋 = 𝑥, as the variance of
𝑌 , with respect to the conditional p.d.f. 𝑓𝑌 (𝑦 | 𝑥) = 𝑓𝑌 |𝑋 (𝑦 | 𝑥).

Definition 14.5 (Conditional variance).

The conditional variance of a r.v 𝑌 , given a r.v 𝑋, is defined by


V[Y | X] = E\big[ (Y − E[Y|X])^2 \,\big|\, X \big]   (14.14)

That is, V[𝑌 | 𝑋] is equal to the (conditional) expected square of the difference between 𝑌 and its
(conditional) mean E[𝑌 |𝑋] when the value of 𝑋 is given. In other words, V[𝑌 | 𝑋] is exactly analogous
to the usual definition of variance, but now all expectations are conditional on the fact that 𝑋 is known.

We find that, for any pair of random variables 𝑋, 𝑌 then:

• V[𝑌 | 𝑋] = E[𝑌 2 |𝑋] − (E[𝑌 |𝑋])2 .

• And furthermore, the variance of 𝑌 itself is computed via the conditional variance V[𝑌 | 𝑋]
and the conditional expectation E[𝑌 |𝑋]:

V[𝑌 ] = E[ V[𝑌 |𝑋]] + V[ E[𝑌 |𝑋]]. (14.15)

Why V[Y] = E[V[Y|X]] + V[E[Y|X]]?

From V[Y | X] = E[Y^2|X] − (E[Y|X])^2, taking the expectation of both sides w.r.t. X gives

E[ V[Y | X] ] = E[ E[Y^2|X] ] − E[ (E[Y|X])^2 ] = E[Y^2] − E[ (E[Y|X])^2 ],

due to (14.13). Also, E[ E[Y|X] ] = E[Y] \implies V[ E[Y|X] ] = E[ (E[Y|X])^2 ] − (E[Y])^2; then summing
the two shows E[ V[Y|X] ] + V[ E[Y|X] ] = E[Y^2] − (E[Y])^2 = V[Y].


14.4.3 Hypergeometric random variable and its mean

Definition 14.6 (Hypergeometric distribution).

Suppose that in a population 𝑃 of size 𝑁 , there are 𝑀 units that have a certain property 𝑇 , and
so 𝑁 − 𝑀 units that do not have the property. Let 𝐽𝑛 denote the number of units having the certain
property 𝑇 , randomly sampled without replacement (RSWOR) from 𝑃 .
J_n is a random variable with Range(J_n) = {0, 1, 2, . . . , n}.
The distribution of 𝐽𝑛 is called the hypergeometric distribution, denoted by 𝐻(𝑁, 𝑀, 𝑛).
The condition for 𝑛 is 𝑛 ≤ min{𝑀, 𝑁 − 𝑀 }.

Figure 14.1: The curve of ℎ(𝑗; 500, 350, 100).

1. Its pmf is denoted by h(j; N, M, n), for j ∈ {0, 1, . . . , n} [Fig. 14.1 shows h(j; 500, 350, 100)],
given as

p_j = P[J_n = j] = h(j; N, M, n) = \frac{C(M, j) \, C(N − M, n − j)}{C(N, n)};   (14.16)

where C(A, a) = \binom{A}{a} is the binomial coefficient of choosing a units from A units. The pmf table
is

J_n                  0     1    · · ·   n−1      n
p_j := P[J_n = j]   p_0   p_1   · · ·   p_{n−1}  p_n


2. Its cdf is denoted by H(k; N, M, n) = \sum_{j=0}^{k} p_j.

3. The expected value of H(N, M, n) is:

E(J_n) = n \frac{M}{N}.   (14.17)

14.4.4 Examples for conditional expectation and variance

Example 14.6 (Conditional expectation with hypergeometric distribution).

Assume that 𝑋 and 𝑌 are independent binomial random variables with identical parameters 𝑛 and
𝑝; having the joint p.m.f. 𝑝(𝑥, 𝑦) = 𝑝𝑋,𝑌 (𝑥, 𝑦).
Calculate the conditional expected value of 𝑌 given that 𝑌 + 𝑋 = 𝑚.

GUIDANCE for solving.

• 𝑋 ∼ Bin(𝑛, 𝑝) and 𝑌 ∼ Bin(𝑛, 𝑝) are independent, so the joint pmf

𝑝𝑌,𝑋 (𝑦, 𝑥) = P[𝑌 = 𝑦 ∩ 𝑋 = 𝑥] = 𝑝𝑌 (𝑦) 𝑝𝑋 (𝑥).

FACT: Furthermore, the sum S = X + Y ∼ Bin(2n, p); that is, X + Y is also a binomial random
variable with parameters 2n and p.

• First, we compute the conditional p.m.f. p_Y(k | ·) [of Y] given that Y + X = m:

p_Y(k | X + Y = m) = P[Y = k | X + Y = m] = \frac{P[Y = k ∩ X = m − k]}{P[X + Y = m]}

  = \frac{P[Y = k] \, P[X = m − k]}{P[X + Y = m]}
                                                     (14.18)
  = \frac{ \binom{n}{k} p^k (1−p)^{n−k} \; \binom{n}{m−k} p^{m−k} (1−p)^{n−m+k} }{ \binom{2n}{m} p^m (1−p)^{2n−m} }

  = \frac{ \binom{n}{k} \binom{n}{m−k} }{ \binom{2n}{m} } = h(k; 2n, n, m).

Hence, the conditional p.m.f. p_Y(Y | X + Y = m) of Y, given that Y + X = m, is the hypergeometric
distribution with parameters N = 2n, M = n, N − M = n, and sample size m.

• Second, we find the conditional expectation E[Y | X = m − Y] = E[Y | X + Y = m] by using
Equation (14.17):

E[Y | X + Y = m] = m \frac{n}{2n} = \frac{m}{2}.


Example 14.7 (Conditional variance for Poisson modeling).

Suppose that by any time 𝑡 the number of people that have arrived at a train depot is a Poisson
random variable with mean 𝜆 𝑡.
If the initial train arrives at the depot at a time (independent of when the passengers arrive) that
is uniformly distributed over (0, 𝑇 ), what are the mean and variance of the number of passengers
who enter the train? Is the load of train high?

GUIDANCE for solving.

Modeling:

For each 𝑡 ≥ 0, let 𝑁 (𝑡) denote the number of arrivals by 𝑡, and let 𝑌 denote the time at which the
train arrives. Obviously 𝑌 ∼ Uniform((0, 𝑇 )).

The random variable of interest is then 𝑁 (𝑌 ).

Finding solution:

Should we find E[𝑁 (𝑌 )] (and V[𝑁 (𝑌 )]) or the conditional E[𝑁 (𝑌 )|𝑌 ]?

Conditioning on 𝑌 gives

E[𝑁 (𝑌 )| 𝑌 = 𝑡] = E[𝑁 (𝑡)| 𝑌 = 𝑡]


= E[𝑁 (𝑡)] by the independence of 𝑌 and 𝑁 (𝑡) (14.19)

=𝜆𝑡 since 𝑁 (𝑡) is Poisson


Hence, E[N(Y)|Y] = λ Y, so taking expectations and applying (14.13) gives

E[N(Y)] = E[ E[N(Y)|Y] ] = λ E[Y] = λ \frac{T}{2}.

Making decision

The load of train is measured by the variance V[𝑁 (𝑌 )].

To obtain V[𝑁 (𝑌 )], we use the conditional variance formula:

V[𝑁 (𝑌 )| 𝑌 = 𝑡] = V[𝑁 (𝑡)| 𝑌 = 𝑡]


= V[𝑁 (𝑡)] by independence (14.20)

= 𝜆 𝑡.
As argued above, V[N(Y)|Y] = λ Y. Therefore,

V[N(Y)] = E[ V[N(Y)| Y] ] + V[ E[N(Y)| Y] ]
        = E[λ Y] + V[λ Y]   (14.21)
        = λ E[Y] + λ^2 V[Y] = λ \frac{T}{2} + λ^2 \frac{T^2}{12},

since Y ∼ Uniform(0, T) has E[Y] = T/2 and V[Y] = T^2/12.


14.5 Compound and Nonhomogeneous Poisson process

14.5.1 Expectation of a sum of random number of random variables

Definition 14.7 (Compound random variable).

Assumption: Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 be a random sample (iid) from a certain distribution, where 𝑁 itself
is a natural-valued random variable (having its own distribution).
The compound random variable of 𝑋𝑖 and 𝑁 is given by 𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁 . In practice, 𝑁
may be the number of people stopping at a service station in a day, and the 𝑋𝑖 are the amounts of gas
they purchased.
One can find the mean and variance of 𝑆𝑁 if observations are random.

Proposition 14.6 (Mean and variance of 𝑆𝑁 ).

When X_1, X_2, · · · , X_N \overset{i.i.d.}{∼} X and each X_i is independent of N, then

(1) E[S_N] = E[N] E[X]
                                                     (14.22)
(2) V[S_N] = E[N] V[X] + V[N] (E[X])^2.

Proof. (1) First, E[X_1 + X_2 + · · · + X_N | N = n] = E[X_1 + X_2 + · · · + X_n] = n E[X],

and so E[S_N | N] = E[X_1 + X_2 + · · · + X_N | N] = N E[X]. Now if we view g(Y) = Y = S_N and condition
on N as in Equation (14.13), we can write

E[S_N] = E[ E[S_N | N] ] = E[ E[X_1 + X_2 + · · · + X_N | N] ] = E[ N E[X] ] = E[N] E[X]

since X is independent of N.


(2) Accept it.

14.5.2 Compound Poisson process

We will use the above results in the next chapters. In the particular case when N = N(t) is a regular
Poisson process and the X_i's are independent of the times of the events, we form the sum

S_{N(t)} := \sum_{i=1}^{N(t)} X_i

and call it the (simple) compound Poisson process. Formally, we have

Definition 14.8.


Let {𝑁 (𝑡), 𝑡 > 0} be a Poisson process with rate 𝜆, and let 𝑋1 , 𝑋2 , · · · be random variables that are
i.i.d. and independent of the process {𝑁 (𝑡)}.

The stochastic process


Y(t) = S_{N(t)} :=
\begin{cases}
X_1 + X_2 + · · · + X_{N(t)} = \sum_{i=1}^{N(t)} X_i, & ∀ t ≥ 0 with N(t) > 0 \\
0, & if N(t) = 0
\end{cases}   (14.23)
is called a compound Poisson process.

Theorem 14.7 (Mean and variance of a compound Poisson process).

If 𝜆 is the rate for the regular Poisson process 𝑁 (𝑡) and the iid variables 𝑋𝑖 have mean 𝜇 and variance
𝜎 2 , then we can calculate the mean and standard deviation of the compound Poisson process 𝑆𝑁 (𝑡)
as

E[S_{N(t)}] = λ μ t,
                                                     (14.24)
V[S_{N(t)}] = λ (μ^2 + σ^2) t = λ E[X^2] t.
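A simulation sketch checking (14.24) (Python with NumPy; N(t) ∼ Pois(λt), and as an arbitrary choice X_i ∼ Uniform(0, 2), so μ = 1 and σ^2 = 1/3):

import numpy as np

rng = np.random.default_rng(5)
lam, t, runs = 3.0, 2.0, 100_000
mu, var = 1.0, 1.0 / 3.0          # mean and variance of Uniform(0, 2)

totals = np.empty(runs)
for r in range(runs):
    n = rng.poisson(lam * t)      # N(t) ~ Pois(lam * t)
    totals[r] = rng.uniform(0.0, 2.0, n).sum()

print(totals.mean(), lam * mu * t)              # both near 6
print(totals.var(), lam * (mu**2 + var) * t)    # both near 8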

14.5.3 Nonhomogeneous Poisson process- NHPP

Definition 14.9. {𝑁 (𝑡), 𝑡≥0} is a nonhomogeneous (or non-stationary) Poisson process with intensity
(rate) function 𝜆(𝑡) if

1. 𝑁 (0) = 0
2. {𝑁 (𝑡), 𝑡≥0} has independent increments
3. P[ 2 or more events in (𝑡, 𝑡 + ℎ)] = 𝑜(ℎ)
4. P[ exactly 1 event in (𝑡, 𝑡 + ℎ)] = 𝜆(𝑡) ℎ + 𝑜(ℎ) or equivalently
4*. limℎ→0 (1/ℎ)P[𝑁 (𝑡 + ℎ) − 𝑁 (𝑡) = 1] = 𝜆(𝑡).

NOTATION. If we let

m(t) = \int_0^{t} λ(s) \, ds,

then it can be shown that

p_k(t) = P[N(t) = k] = e^{−m(t)} \frac{[m(t)]^k}{k!}, \quad k ≥ 0.

N(t) has a Poisson distribution with mean m(t); m(t) is called the mean value function, or also the
principal function, of the process.


Key features of NHPP

• In the non-homogeneous case, the rate parameter λ(t) now depends on t. That is,

P{N(τ + t) − N(τ) = 1} ≈ λ(τ) t, as t → 0.

• When 𝜆(𝑡) = 𝜆 constant, then it reduces to the homogeneous case

• The mean value function of a non-homogeneous Poisson process is defined by


m(t) = \int_0^{t} λ(s) \, ds   (14.25)

• 𝑁 (𝑟 + 𝑡) − 𝑁 (𝑟) has a Poisson distribution with mean 𝑚(𝑟 + 𝑡) − 𝑚(𝑟).

Fact 14.4. N(t) is a Poisson random variable with mean m(t). If {N(t), t ≥ 0} is non-homogeneous
with mean value function m(t), then {N(m^{−1}(t)), t ≥ 0} is homogeneous with intensity λ = 1.

This result follows because N(t) is a Poisson random variable with mean m(t), and if we let X(t) =
N(m^{−1}(t)), then X(t) is Poisson with mean m(m^{−1}(t)) = t.

14.6 Summary and Problems on Poisson processes

A Poisson process, {𝑁 (𝑡), 𝑡 > 0}, only counts the number of events that occurred in the interval [0, 𝑡],
while the process {𝑌 (𝑡), 𝑡 > 0} gives, for example,

• the sum of the lengths of telephone calls that happened in [0, 𝑡], or

• the total number of persons who were involved in car accidents in this interval [0, 𝑡], etc.

Note that we must assume that the lengths of the calls or the numbers of persons involved in distinct
accidents are independent and identically distributed random variables. We could consider the two-
dimensional process {[𝑁 (𝑡), 𝑌 (𝑡)], 𝑡 > 0} to retain all the information of interest.

Problem 14.1 (Application of Equation (14.40) in Rescue Analysis).

A miner is trapped in a mine containing 3 doors.

• The first door leads to a tunnel that will take him to safety after 3 hours of travel.

• The second door leads to a tunnel that will return him to the mine after 5 hours of travel.

• The third door leads to a tunnel that will return him to the mine after 7 hours.

If we assume that the miner is at all times equally likely to choose any one of the doors, what is the
expected length of time until he reaches safety?


GUIDANCE for solving.

Modeling: Let 𝑌 denote the amount of time (in hours) until the miner reaches safety, and let 𝑋 denote
the door he initially chooses. Now,

E[Y] = E[E[Y|X]] = \sum_x E[Y|X = x] P[X = x];

then E[Y] = \sum_{i=1}^{3} E[Y|X = i] P[X = i] = \frac{1}{3} \big[ E[Y|X = 1] + E[Y|X = 2] + E[Y|X = 3] \big].

Finding solution We see that E[Y|X = 1] = 3, and need to find E[Y|X = 2] and E[Y|X = 3]. They
are recursive: E[Y|X = 2] = 5 + E[Y] and E[Y|X = 3] = 7 + E[Y], since those tunnels return the miner to his starting point.

Making decision Solving u = (1/3)[3 + (5 + u) + (7 + u)] gives u = E[Y] = 15. 

14.7 The Birth and Death processes

Birth and death (BD) processes informally are obtained when we generalize the Poisson process in
two ways:

1. By letting the value of the arrival rate 𝜆 depend on the current state 𝑛;

2. By including departures, which allows the process to instantaneously decrease its value by one unit
(at a rate that will also be a function of 𝑛).

Birth and death processes formally are a special type of continuous-time Markov chains (CTMC or
Markov jump model).
Consider a continuous-time Markov chain with states 0, 1, 2, . . .

Definition 14.10.

If p_{ij} = 0 whenever j ∉ {i − 1, i + 1}, then the Markov chain is called a birth and death process.
Thus, a birth and death process is a CTMC in which transitions from state 𝑖 can only go to either
state 𝑖 − 1 or 𝑖 + 1.

• That is, a transition either causes an increase in state by one or a decrease in state by one.

• A birth is said to occur when the state increases by one, and a death is said to occur when the state
decreases by one.

Reminder :


For all i ≠ j ∈ S, the transition rate of the process when it makes a transition
from state i to state j, denoted by q_{i,j}, is defined by

q_{i,j} = p'_{i,j}(0) = \lim_{h→0} \frac{p_{i,j}(h)}{h} \quad for i ≠ j.   (14.26)
The transition rates 𝑞𝑖,𝑗 are also known as instantaneous transition rates, transition intensi-
ties, or forces of transition.

Definition 14.11.

If we employ the (instantaneous) transition rate q_{i,j} given in Equation (14.26) of the continuous-time
Markov chain, then the process is said to be a birth and death process when

q_{i,j} = 0 \quad if |j − i| > 1.

14.7.1 Description of a birth and death process

To describe the process we define the birth and death rates from each state 𝑖 ∈ S:

𝜆𝑖 = 𝑞𝑖,𝑖+1 = 𝑣𝑖 𝑝𝑖,𝑖+1
(14.27)
𝜇𝑖 = 𝑞𝑖,𝑖−1 = 𝑣𝑖 𝑝𝑖,𝑖−1

• Thus, 𝜆𝑖 is the rate at which a birth occurs, and 𝜇𝑖 is the rate at which a death occurs, both when the
process is in state 𝑖.

• The rate of transition out of state 𝑖 is the sum of these two rates 𝜆𝑖 + 𝜇𝑖 = 𝑣𝑖 .

• The state-transition-rate diagram of a B & D process is shown in Figure 14.2.

Figure 14.2: State-transition-rate diagram for birth and death process.

Figure 14.2 shows a state-transition-rate diagram, as opposed to a state-transition diagram, be-
cause it shows the rate at which the process moves from state to state, and not the probability of
moving from one state to another.

Note that 𝜇0 = 0, because there can be no death when the process is in empty state.


14.7.2 Transient Analysis of Birth and Death Processes

The generic transition rate matrix, with q_i := q_{i,i} = −\sum_{j ≠ i, \, j ∈ S} q_{i,j} < 0 for i ∈ S, is

Q = \begin{bmatrix}
q_1 & q_{1,2} & q_{1,3} & . . . & q_{1,s} \\
q_{2,1} & q_2 & q_{2,3} & . . . & q_{2,s} \\
q_{3,1} & q_{3,2} & q_3 & . . . & q_{3,s} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
q_{s,1} & q_{s,2} & q_{s,3} & . . . & q_s
\end{bmatrix}.   (14.28)

For a BD process with parameters given by diagram 14.2, Q takes the special form

Q = \begin{bmatrix}
−λ_0 & λ_0 & 0 & 0 & . . . \\
μ_1 & −(λ_1 + μ_1) & λ_1 & 0 & . . . \\
0 & μ_2 & −(λ_2 + μ_2) & λ_2 & . . . \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.   (14.29)

Recall the Kolmogorov forward differential equations in Theorem 12.14:

\frac{d}{dt} p(t) = p(t) Q,   (14.30)

where p(t) = [p_0(t), p_1(t), p_2(t), · · · , p_i(t), p_{i+1}(t), · · · ] is the vector of the state distribution.

Transient analysis of this birth and death (B & D) process is done by studying the following system
of differential equations:

\frac{d}{dt} p_0(t) = −λ_0 p_0(t) + μ_1 p_1(t)
\frac{d}{dt} p_1(t) = λ_0 p_0(t) − (λ_1 + μ_1) p_1(t) + μ_2 p_2(t)
  \vdots                                                   (14.31)
\frac{d}{dt} p_i(t) = λ_{i−1} p_{i−1}(t) − (λ_i + μ_i) p_i(t) + μ_{i+1} p_{i+1}(t)

14.7.3 Local Balance Equations

Thus, a B & D process can be described by a system of differential equations:

DATA ANALYTICS- FOUNDATION


CHAPTER 14. POISSON PROCESS AND VARIATIONS
426 SYSTEMS CHANGED BY RANDOM ARRIVALS IN TIME

\frac{dp_0(t)}{dt} = −λ_0 p_0(t) + μ_1 p_1(t)
                                                           (14.32)
\frac{dp_i(t)}{dt} = −(λ_i + μ_i) p_i(t) + μ_{i+1} p_{i+1}(t) + λ_{i−1} p_{i−1}(t), \quad for i > 0

where the left-hand side \frac{dp_i(t)}{dt} is the total rate of change of the state-i probability; the + terms
give the rate of flow into the state and the − terms the rate of flow out of it.

Now in the steady state, we have

• \lim_{t→∞} \frac{dp_i(t)}{dt} = 0, and

• \lim_{t→∞} p_i(t) = p_i exists, for every i = 0, 1, 2, . . .;

then the system (14.32) becomes the ‘local balance conditions’

λ_0 p_0 = μ_1 p_1 \implies p_1 = \frac{λ_0}{μ_1} p_0
(λ_i + μ_i) p_i = μ_{i+1} p_{i+1} + λ_{i−1} p_{i−1} \quad for i = 1, 2, . . .   (14.33)
\sum_i p_i = 1

Recursively using this system yields the general result

𝜆𝑖 𝑝𝑖 = 𝜇𝑖+1 𝑝𝑖+1 ∀𝑖 = 0, 1, . . .

This result states that when the process is in the steady state, the rate at which it makes a transition
from state i to state i + 1, which we refer to as the rate of flow from state i to state i + 1, is equal to the
rate of flow from state i + 1 to state i. This property (14.33) is referred to as the local balance equation
or condition because it balances (or equates) the rate at which the process enters state i with the rate
at which it leaves state i.
Direct application of the property allows us to solve for the steady-state probabilities of the birth and
death process recursively as follows:

p_{i+1} = \frac{λ_i}{μ_{i+1}} p_i = . . . = \frac{\prod_{j=0}^{i} λ_j}{\prod_{j=1}^{i+1} μ_j} p_0

When 𝜆𝑖 = 𝜆 and 𝜇𝑖 = 𝜇 for all 𝑖, [under what distribution?] we get

p_0 = \Bigg[ 1 + \sum_{i=1}^{∞} \Big( \frac{λ}{μ} \Big)^i \Bigg]^{−1}   (14.34)


The sum converges if and only if \frac{λ}{μ} < 1, equivalent to λ < μ. Under this condition we obtain

p_0 = 1 − \frac{λ}{μ},
                                                           (14.35)
p_i = \Big( 1 − \frac{λ}{μ} \Big) \Big( \frac{λ}{μ} \Big)^i, \quad for i = 1, 2, . . .

Example 14.8.

A machine is operational for an exponentially distributed time with mean 1/𝜆 before breaking down.
When it breaks down, it takes a time that is exponentially distributed with mean 1/𝜇 to repair it.
What is the fraction of time that the machine is operational (or available)?

Solution: This is a two-state birth and death process. Let 𝑈 denote the up state and 𝐷 the down
state. Let 𝑝𝑈 denote the steady-state probability that the process is in the operational state, and
let 𝑝𝐷 denote the steady-state probability that the process is in the down state. Then the balance
equations become

𝜆𝑝𝑈 = 𝜇 𝑝𝐷
(14.36)
𝑝𝑈 + 𝑝𝐷 = 1 ⇒ 𝑝𝐷 = 1 − 𝑝𝑈

Hence, the fraction of time that the machine is operational is just 𝑝𝑈 = 𝜇/(𝜆 + 𝜇). 

14.8 Chapter’s Problems

Problem 14.2 (Basic Poisson modeling).

a) Let 𝑁 (𝑡) be the number of failures of a computer system in the time interval [0, 𝑡]. We suppose that
{𝑁 (𝑡), 𝑡 ≥ 0} is a Poisson process with rate 𝜆 = 1 per week.

Find the probability that the system operates without failure during two consecutive weeks.

b) Let 𝑁 (𝑡) be the number of telephone calls received at an exchange in the time interval [0, 𝑡]. We
suppose that {𝑁 (𝑡), 𝑡 ≥ 0} is a Poisson process with rate 𝜆 = 10 per hour. Calculate the probability
that no calls will be received during each of two consecutive 15-minute periods.

GUIDANCE for solving.

The pmf of a Poisson 𝑁 (𝑡) counting the number of events randomly occurring in the time interval
[0, 𝑡) is given by
p_{N(t)}(i) = p(i; λt) = P[N(t) = i] = \frac{e^{−λt} (λt)^i}{i!}, \quad i = 0, 1, 2, ...


a) The system operates without failure during two consecutive weeks, so we need the event N(2) = 0, with
N(t) ∼ Pois(λ = 1); therefore the probability is

P[N(2) = 0] = \frac{e^{−2} (2)^0}{0!} = e^{−2}.

b) The probability that no calls will be received during each of two consecutive 15-minute periods.

The counting process N(t) ∼ Pois(λ = 10), with a time unit of one hour; two consecutive 15-minute
periods together make 1/2 unit. The Poisson process satisfies the stationary increment property, so we
can find the probability of interest for the interval [0, 1/2):

P[N(1/2) = 0] = \frac{e^{−10(1/2)} (5)^0}{0!} = e^{−5}.

Problem 14.3 (Practical Poisson modeling).

a) Suppose that there are 𝑚 terrorists in a group of 𝑁 visitors arriving per day in all airports of the
U.S., with 𝑚 ≪ 𝑁 . If you choose randomly 𝑛 visitors from that group, 𝑛 < 𝑁 , compute the expected
number of terrorists.

b) Use the moment generating function to prove that both the mean E[𝑋] and variance V[𝑋] of a
Poisson random variable 𝑋 with parameter 𝜆 are

E[𝑋] = 𝜆; V[𝑋] = 𝜆.

GUIDANCE for solving.

a) There are 𝑚 terrorists in a group of 𝑁 visitors. You chose randomly 𝑛 visitors from that group, 𝑛 < 𝑁 .

Denote by X the number of terrorists in that random sample of n visitors; then X ∼ Bin(n, p), a
binomial, since

X = B_1 + B_2 + · · · + B_n,

where each B_i ∼ B(p). The probability p = \frac{m}{N} is the same for each B_i. The linearity of expectation
says

E[X] = n p = \frac{n m}{N}.

b) Prove that both the mean E[𝑋] and variance V[𝑋] of a Poisson random variable 𝑋 with parameter
𝜆 are
E[𝑋] = 𝜆; V[𝑋] = 𝜆.

The moment generating function of 𝑋 is

Inference, Linear Regression and Stochastic Processes


14.8. Chapter’s Problems 429


M(t) = E[e^{tX}] = e^{−λ} \sum_{j=0}^{∞} e^{tj} \frac{λ^j}{j!}
                                                           (14.37)
     = e^{−λ} · e^{λ e^t} = e^{−λ(1 − e^t)}, \quad −∞ < t < ∞.

Therefore,

M'(t) = \frac{dM}{dt} = λ M(t) e^{t},
M''(t) = (λ^2 e^{2t} + λ e^{t}) \, M(t).

Using

M^{(n)}(t) \big|_{t=0} = μ_n = E[X^n] = M^{(n)}(0),   (14.38)

the mean and variance of the Poisson X are

E[X] = μ = M'(0) = λ,
                                                           (14.39)
V[X] = E[X^2] − E[X]^2 = M''(0) − M'(0)^2 = (λ^2 + λ) − λ^2 = λ.

Problem 14.4 (Application of Proposition (14.6)).

Suppose that the number of people entering a department store on a given day is a random variable
with mean 50. Suppose further that the amounts of money spent by these customers are independent
random variables having a common mean of $8. Finally, suppose also that the amount of money spent
by a customer is also independent of the total number of customers who enter the store. What is the
expected amount of money spent in the store on a given day?

Problem 14.5 (Using compound Poisson process).

Consider an insurance company which receives claims according to a Poisson process with rate
𝜆 = 400/𝑦𝑒𝑎𝑟. Suppose the size of claims are random variables 𝑅𝑛 ∼ 𝑅 which are distributed with an
exponential distribution with mean E[𝑅] = $1000.

1. Calculate the expected total amount of the claims during a year.

2. Assuming that the insurance company has 𝑛 clients, how much should the monthly insurance pre-
mium be to make sure that the company has a yearly profit.

Provide numerical values if the company has 𝑛 = 1, 000, 10, 000 and 100,000 clients.

3. Assume that the company has 10 people on the staff with a total of $500,000 salary budget and it
has to produce profit of $500,000 at the end of the year. How much should the monthly premium be?

Problem 14.6.

1. Prove that a Poisson process 𝑋(𝑡) with positive rate 𝜆 has stationary increments, and

E[𝑋(𝑡)] = 𝜆𝑡, Var[𝑋(𝑡)] = 𝜆𝑡.


2. Patients arrive at the doctor's office according to a Poisson process with rate λ = 1/10 per minute. The
doctor will not see a patient until at least three patients are in the waiting room.

a/ Find the expected waiting time until the first patient is admitted to see the doctor.

b/ What is the probability that nobody is admitted to see the doctor in the first hour?

Problem 14.7 (Conditional expectation).

A miner is trapped in a mine containing 4 doors.

• The first door leads to a tunnel that will take him to safety after 3 hours of travel.

• The second door leads to a tunnel that will return him to the mine after 5 hours of travel.

• The third door leads to a tunnel that will return him to the mine after 6 hours.

• The fourth door leads to a tunnel that will return him to the mine after 7 hours.

If we assume that the miner is at all times equally likely to choose any one of the doors, what is the
expected length of time until he reaches safety?

GUIDANCE for solving.

We employ the fact: if X, Y are random variables, then E[Y | X] is a random variable, so E[E[Y|X]]
does exist, computed over the range of X. We have

E[Y] = E[E[Y|X]].   (14.40)

The mine contains 4 doors; let X be the chosen door, having pmf p(x) = 1/4 for every x ∈ {1, 2, 3, 4}. Let
g(X) = E[Y|X]; then

E[Y] = \sum_x E[Y|X = x] \, p(x) = \sum_{x=1}^{4} g(x) \, p(x);

therefore, putting u = E[Y], we get the equation

u = E[Y] = \frac{1}{4} [g(1) + g(2) + g(3) + g(4)] = \frac{1}{4} [3 + (5 + u) + (6 + u) + (7 + u)],   (14.41)

hence u = E[Y] = 21 hours.

Problem 14.8.

Consider an insurance company that has two types of policy: Policy A and Policy B. Total claims
to the company arrive according to a Poisson process at the rate of 9 per day. A randomly selected
claim has a 1/3 chance of being of Policy A. Calculate, on a given day,


• a/ the probability that claims from policy A will be fewer than 2,

• b/ the probability that claims policy B will be fewer than 2,

• c/ the probability that total claims from the company will be fewer than 2.

GUIDANCE for solving. Apply the thinning technique of Section 14.3.2 above. Brief inputs are:

• N_A(t) = number of claims of Policy A ∼ Poisson process at rate λ p = 9 · (1/3) = 3 per day

• N_B(t) = number of claims of Policy B ∼ Poisson process at rate λ (1 − p) = 9 · (2/3) = 6 per day

• N(t) = total number of claims ∼ Poisson process at rate λ = 9 per day.

a) P(N_A(1) < 2) = P(N_A(1) = 0) + P(N_A(1) = 1) = e^{−3} \frac{3^0}{0!} + e^{−3} \frac{3^1}{1!} = 4 e^{−3} ≈ 0.1991

b) P(N_B(1) < 2) = P(N_B(1) = 0) + P(N_B(1) = 1) = e^{−6} \frac{6^0}{0!} + e^{−6} \frac{6^1}{1!} = 7 e^{−6} ≈ 0.0174

c) P(N(1) < 2) = P(N(1) = 0) + P(N(1) = 1) = e^{−9} \frac{9^0}{0!} + e^{−9} \frac{9^1}{1!} = 10 e^{−9} ≈ 0.00123

Problem 14.9.

Suppose that a security expert arrives at a server room at 6:15 AM. Until 7:00 AM, emails arrive
at a Poisson rate of 1 email per 30 minutes. Starting from 7:00 AM, they arrive at a Poisson rate of 2
emails per 30 minutes.

Calculate your expected waiting time until an email arrives.

GUIDANCE for solving.

Use the memoryless property of the exponential interarrival times. Measuring time in minutes after
6:15 AM, with T_b the first-arrival time at the slow rate and T_a a fresh waiting time at the fast rate:

T_w = wait time =
\begin{cases}
T_b ∼ Exp(rate 1/30), & if T_b ≤ 45 (the email arrives before 7:00 AM) \\
45 + T_a, \; T_a ∼ Exp(rate 1/15), & if T_b > 45.
\end{cases}

Can we realize that

E(T_w) = average wait time = E[\min(45, T_b)] + E[T_a] · P[N(45) = 0]?

N(45) ∼ Poisson process at rate 1/30, hence

P[N(45) = 0] = e^{−45/30} = e^{−1.5}; and E(T_a) = 15, so

E(T_w) = E[\min(45, T_b)] + E[T_a] · P[N(45) = 0]
       = \int_0^{45} t \frac{1}{30} e^{−t/30} \, dt + 45 \int_{45}^{∞} \frac{1}{30} e^{−t/30} \, dt + 15 e^{−1.5}
       = 30(1 − e^{−1.5}) + 15 e^{−1.5} = 30 − 15 e^{−1.5} ≈ 26.65 minutes.
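A simulation sketch of this waiting time (Python with NumPy; time is measured in minutes after 6:15 AM, and memorylessness justifies restarting the clock at 7:00 AM):

import numpy as np

rng = np.random.default_rng(6)
runs = 200_000
waits = np.empty(runs)
for r in range(runs):
    t_b = rng.exponential(30)      # first arrival at rate 1/30 per minute
    if t_b <= 45:
        waits[r] = t_b             # an email arrives before 7:00 AM
    else:
        # no arrival by 7:00 AM; restart at rate 1/15 per minute (memorylessness)
        waits[r] = 45 + rng.exponential(15)
print(waits.mean())                # approx 30 - 15 e^{-1.5} = 26.65 minutes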

Chapter 15

Branching Processes
And Renewal Processes


Introduction

A branching process is a special case of a Markov process with infinitely many states. The states
are nonnegative integers that usually represent the number of members of a population. It arises in
situations where one individual produces a random number of offspring (possibly zero, according to
a specific probability distribution), who in turn keep on reproducing themselves in the same manner.
This is repeated by the offspring themselves, from generation to generation, leading to either a pop-
ulation explosion or its ultimate extinction. Examples include:

1. Nuclear chain reaction (neutrons are the “offspring” of each atomic fission).

2. Survival of family names (carried by males) or of a new (mutated) gene.

3. In one-server queueing theory, customers arriving (and lining up) during the service time of a given
customer can be, in this sense, considered that customer’s “offspring” - this simplifies dealing with
some tricky issues of queueing theory.

Francis Galton (1822-1911) formulated the problem of population extinction (e.g. certain family
names would disappear, for lack of male descendants) mathematically in the Educational Times in
1873. The corresponding stochastic processes are sometimes called branching processes.
Henry William Watson (1827-1903) replied with a solution in the same venue. Together, they then
wrote a paper entitled “On the probability of extinction of families” in 1874. Nowadays, branching
processes are often called Galton-Watson processes.
We will study

• Key concepts

• The variance V[𝑋𝑛 ] of a branching process in Section 15.2

• Ultimate Extinction in Section 15.3

• Generations of Offsprings, in Section 15.4.

15.1 Key concepts

Definition 15.1.

Let {𝑍𝑛,𝑗 , 𝑛 = 0, 1, . . . ; 𝑗 = 1, 2, . . .} be a set of i.i.d. random variables whose possible values are
nonnegative integers, S𝑍𝑛,𝑗 = S = N = {0, 1, 2, 3, . . .}. 𝑍𝑛,𝑗 is the number of descendants of the 𝑗-th
member of the 𝑛-th generation.


A branching process is a Markov chain {𝑋𝑛 , 𝑛 = 1, 2, . . .} defined by


X_{n+1} =
\begin{cases}
\sum_{j=1}^{X_n} Z_{n,j} = Z_{n,1} + Z_{n,2} + . . . + Z_{n,X_n} & if X_n > 0 \\
0 & if X_n = 0
\end{cases}   (15.1)

assuming that 𝑋0 > 0.

• 𝑋0 is the number of members of the initial generation, that is,

the number of ancestors of the population. The process {𝑋𝑛+1 , 𝑛 = 0, 1, 2, . . .} is said to be lineage
if 𝑋0 = 1.

• 𝑋𝑛 is the total numbers of individuals/members of only the 𝑛-th generation.

Definition 15.2. Write the probability that at the death of each such individual we obtain exactly 𝑖
offspring, 𝑖 = 0, 1, 2, . . . as

𝑝𝑖 = P[𝑍𝑛,𝑗 = 𝑖] for all 𝑛, 𝑗.


Let m = \sum_{i=0}^∞ i p_i; this is the expected number of offspring, or basic reproductive rate.

[To avoid trivial cases, we assume that 𝑝𝑖 is strictly smaller than 1, for all 𝑖 ≥ 0.]

Properties of branching process ( Galton-Watson process)

The process {𝑋𝑛 , 𝑛 = 1, 2, . . .} given in (15.1) is also called a Galton-Watson process.

1. Its state space is S𝑋 = S𝑋𝑛 = N.

2. The transition probability p_{i,k} = P[X_{n+1} = k | X_n = i] is just the probability that

\sum_{j=1}^{i} Z_{n,j} = Z_{n,1} + Z_{n,2} + ... + Z_{n,i} = k.

As state 0 is absorbing, a trapping state [𝑝0,0 = 1, no future offspring can arise in this case], we can
decompose S𝑋 into two sets:
S𝑋 = 𝐷 ∪ {0} (15.2)

where 𝐷 = {1, 2, . . .} is the set of transient states.

3. Limiting population size: Given that a transient state is visited only a finite number of times, we can
assert that the process cannot remain indefinitely in the set 𝐷𝑘 = {1, 2, . . . , 𝑘}, for any finite 𝑘. Thus,
we conclude that the population will either disappear (ultimate extinction), or that its size will tend
to infinity.


4. The average number of individuals in the n-th generation - the lineage case.
Suppose that X_0 = 1. Let us now calculate the average number μ_n = E[X_n] of individuals in
the n-th generation, for n = 1, 2, ....

μ_1 ≡ E[X_1] (just the average number of descendants of an individual, in general), and

μ_n ≡ E[X_n] = \sum_{j=0}^∞ E[X_n | X_{n-1} = j] P[X_{n-1} = j]     (15.3)

= \sum_{j=0}^∞ j μ_1 P[X_{n-1} = j] = μ_1 \sum_{j=0}^∞ j P[X_{n-1} = j] = μ_1 E[X_{n-1}] = · · · = μ_1^n E[X_0] = μ_1^n.
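Formula (15.3) is easy to confirm by simulation. A minimal R sketch, assuming (hypothetically, for illustration) Poisson offspring with mean μ_1 = 1.2 and a lineage X_0 = 1:

    set.seed(2)
    mu1 <- 1.2; n.gen <- 5; n.rep <- 1e4
    Xn <- replicate(n.rep, {
      x <- 1                                   # X0 = 1 (lineage case)
      for (g in 1:n.gen) x <- if (x > 0) sum(rpois(x, mu1)) else 0
      x                                        # one realization of X_n
    })
    mean(Xn)       # sample average of X_5
    mu1^n.gen      # theoretical E[X_5] = mu1^5 = 2.488...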

15.2 The variance V[𝑋𝑛 ] of a branching process

Reminder: we defined the conditional variance of Y, given X = x, as the variance of Y with respect
to the conditional p.d.f. f_Y(y | x) = f_{Y|X}(y | x).

The conditional variance of a r.v. Y, given a r.v. X, is defined by

V[Y | X] = E[ (Y - E[Y|X])^2 | X ].     (15.4)

Reminder

• V[𝑌 | 𝑋] = E[𝑌 2 |𝑋] − (E[𝑌 |𝑋])2 .

• And furthermore, the variance of 𝑌 itself is computed via the conditional variance V[𝑌 | 𝑋]
and the conditional expectation E[𝑌 |𝑋]:

V[𝑌 ] = E[V[𝑌 |𝑋]] + V[E[𝑌 |𝑋]]. (15.5)

We compute the variance V[X_n] of X_n, depending on the two cases μ_1 = 1 and μ_1 ≠ 1. Write

σ_1² = V[X_1].

1. When μ_1 = 1: use Y = X_n and X = X_{n-1}; by Equation (15.5) we have

V[X_n] = E[V[X_n | X_{n-1}]] + V[E[X_n | X_{n-1}]]
       = E[X_{n-1} σ_1²] + V[X_{n-1} μ_1]     (by the i.i.d. offspring)
       = σ_1² × 1 + V[X_{n-1}] = σ_1² × 2 + V[X_{n-2}]
       = 2σ_1² + V[X_{n-2}] = ... = nσ_1² + V[X_0] = nσ_1²,     (15.6)

since X_0 is a constant.


2. When μ_1 ≠ 1, we can prove

V[X_n] = σ_1² μ_1^{n-1} · (μ_1^n - 1)/(μ_1 - 1).     (15.7)

15.3 Ultimate Extinction

 CONCEPT 13.

The probability of eventual/ultimate extinction of the population is defined by

q_{0,i} = lim_{n→∞} P[X_n = 0 | X_0 = i], for i > 0.     (15.8)

We wish to determine q_{0,i}. By independence among the individuals of generation 0:

q_{0,i} = (q_{0,1})^i = q_0^i.     (15.9)

Write q_0 := q_{0,1} = lim_{n→∞} P[X_n = 0 | X_0 = 1]. For each fixed n,

P[X_n = 0 | X_0 = 1] = 1 - P[X_n ≥ 1 | X_0 = 1] = 1 - \sum_{k=1}^∞ P[X_n = k | X_0 = 1].

We know that (using Equation 14.8 with X_n = Y, X_0 = X, and Range(X_n) = N):

E[X_n | X_0 = 1] = \sum_{k=1}^∞ k P[X_n = k | X_0 = 1] ≥ \sum_{k=1}^∞ P[X_n = k | X_0 = 1],

so P[X_n = 0 | X_0 = 1] ≥ 1 - E[X_n | X_0 = 1]. In the lineage case,

P[X_n = 0 | X_0 = 1] ≥ 1 - E[X_n] = 1 - μ_1^n, by Equation 15.3.

1. When μ_1 ∈ (0, 1): letting n → ∞ we obtain

q_0 ≥ lim_{n→∞} (1 - μ_1^n) = 1, hence q_0 = 1:

if each individual has less than one descendant, on average, we indeed expect the population
to disappear.

2. When μ_1 ≥ 1, this bound becomes uninformative:

μ_1 = 1 ⟹ q_0 ≥ 0, and μ_1 > 1 ⟹ q_0 ≥ -∞,

so a finer analysis, via the probability-generating function of Section 15.4, is needed.

Theorem 15.1.

The probability of eventual extinction of the population is q_0 = 1 if μ_1 ≤ 1,

while q_0 < 1 if μ_1 > 1.
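A standard fact from the PGF theory of Section 15.4, stated here without proof, is that q_0 is the smallest solution of q = P(q) in [0, 1], where P is the offspring PGF; the fixed-point iteration q ← P(q) started at q = 0 converges to it. A hedged R sketch for (assumed) Poisson(λ) offspring, whose PGF is P(t) = e^{λ(t-1)}:

    extinction.prob <- function(lambda, iter = 1000) {
      q <- 0                                         # start the iteration at q = 0
      for (i in 1:iter) q <- exp(lambda * (q - 1))   # q <- P(q)
      q
    }
    extinction.prob(0.9)   # mu1 = 0.9 <= 1: returns 1, as Theorem 15.1 predicts
    extinction.prob(2)     # mu1 = 2 > 1: about 0.2032 < 1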


15.4 Generations of Offspring

Recall that the (point) probability distribution of a discrete r.v. X, with Range(X) = N and pmf
P[X = j] = p_j, gives rise to the probability-generating function of X, defined by

P(t) = P_X(t) = \sum_{j=0}^∞ p_j t^j = E[t^X],     (15.10)

here E is the expectation operator, t is called a dummy variable. Obviously

P(1) = \sum_{j=0}^∞ p_j = 1; and P′(1) = \sum_{j=1}^∞ j p_j = μ_X.     (15.11)

15.4.1 Probability-generating function of a compound variable

Let X_1, X_2, ..., X_N ∼ X be a random sample (i.i.d.) from a certain distribution, where N itself is a
random variable (having its own distribution on the non-negative integers). Put the sum variable S_N :=
X_1 + X_2 + · · · + X_N.
We now assume the distribution of X is of a discrete (integer-valued) type and that its probability-
generating function (PGF) is P_X, given as

P_X(s) = \sum_{j=0}^∞ p_j s^j.

Similarly, the PGF of the distribution of N is P_N, given as

P_N(s) = \sum_{i=0}^∞ P[N = i] s^i.     (15.12)

We would like to find the PGF H(s) of the sum

S_N := X_1 + X_2 + · · · + X_N.

We can prove

H(s) = \sum_{k=0}^∞ P[S_N = k] s^k = ... = P_N(P_X(s)).     (15.13)

We find P[S_N = k] by conditioning on N = j:

P[S_N = k] = \sum_j P[S_N = k | N = j] P[N = j].

So

H(t) = \sum_{k=0}^∞ \sum_j P[S_N = k | N = j] P[N = j] t^k = \sum_j P[N = j] \sum_{k=0}^∞ P[S_j = k] t^k = P_N(P_X(t)),

since \sum_{k=0}^∞ P[S_j = k] t^k = P_{S_j}(t) = P_X(t)^j. The PGF of S_N is thus a composition of the PGF
P_N(·) of N and P_X(·) of the X_i.
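The composition formula (15.13) can be probed by a small Monte Carlo experiment. In the sketch below the choices N ∼ Poisson(2) and X ∼ Bernoulli(0.3) are assumptions made for illustration; then P_N(t) = e^{2(t-1)} and P_X(t) = 0.7 + 0.3t:

    set.seed(3)
    s  <- 0.7
    SN <- replicate(1e5, sum(rbinom(rpois(1, 2), size = 1, prob = 0.3)))  # draws of S_N
    mean(s^SN)                           # empirical H(s) = E[s^(S_N)]
    exp(2 * ((0.7 + 0.3 * s) - 1))       # P_N(P_X(s)), about 0.8353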

15.4.2 Probability-generating function (PGF) for Galton-Watson process

We assume a branching process (15.1) starts with a single individual (Generation 0); that is, 𝑋0 = 1
(the corresponding PGF is thus equal to 𝑡).

1. He (and ultimately all of his descendants) produces a random number of offspring, each according
to a distribution whose PGF is 𝑃 (𝑡). This is thus the PGF of the number of members of the first
generation (denoted by 𝑋1 ).

2. X_1 = N becomes the number of ancestors producing the next generation, with

X_2 = Z_1 + Z_2 + · · · + Z_{X_1}     (15.14)

members, where the individual counts Z_i are independent of each other.

To get the PGF of X_2, we must compound P(t) (the PGF of N = X_1) with the same P(t) (the PGF
of the Z_i), getting P(P(t)).

3. The number of members of the 𝑛-th generation (𝑛 ≥ 2) is

𝑋𝑛 = 𝑍1 + 𝑍2 + · · · + 𝑍𝑋𝑛−1 .

The PGF of X_n in general is the n-fold composition of P(t) with itself, given as

P_(n)(t) = P(P(. . . (P(t)))), n times P.     (15.15)

15.4.3 Compute the expected value 𝜇𝑛 = E[𝑋𝑛 ]

Based on the recurrence formula (15.15) for computing P_(n)(t), namely

P_(n)(t) = P(P_(n-1)(t)),     (15.16)

we can easily derive the corresponding formula for the expected value of X_n by a simple differentiation
and the chain rule, to get

P′_(n)(t) = P′(P_(n-1)(t)) · P′_(n-1)(t).

Let t = 1; then

P_(n-1)(1) = 1, P′_(n-1)(1) = μ_{n-1},

so again we get (15.3):

μ_n = E[X_n] = μ_1 · μ_{n-1} = · · · = μ_1^n.


Example 15.1.

Let us assume the offspring distribution is Poisson, with mean λ = 1.

What is the distribution of X_10 (the number of members of Generation 10) and the corresponding
mean and variance?
ANSWER:
The PGF of X_10 is the 10-fold composition H(t) = P_(10)(t) = P(P_(9)(t)), with P(t) = e^{λ(t-1)};
this composition has no simple closed form, but it fully determines the distribution of X_10.
Since μ_1 = λ = 1 and σ_1² = λ = 1, we get μ_10 = H′(1) = μ_1^10 = 1, and by Formula (15.6)
(the case μ_1 = 1),

V[X_10] = 10 σ_1² = 10.
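The n-fold composition (15.15) is straightforward to evaluate numerically. A sketch in R (Poisson offspring assumed, so P(t) = e^{λ(t-1)}), reading the mean off a one-sided finite difference at t = 1:

    P  <- function(t, lambda) exp(lambda * (t - 1))                       # Poisson offspring PGF
    Pn <- function(t, lambda, n) { for (i in 1:n) t <- P(t, lambda); t }  # n-fold composition
    lambda <- 1; h <- 1e-6
    (Pn(1, lambda, 10) - Pn(1 - h, lambda, 10)) / h   # ~ H'(1) = mu1^10 = 1 when lambda = 1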

15.5 Introduction to Renewal Processes

Recall the counting process from Section 14.2.1.

𝑁 (𝑡) is said to be a counting process if it satisfies:

a/ it is nonnegative and integer valued, 𝑁 (𝑡) ∈ N,


b/ if 𝑠 < 𝑡 then 𝑁 (𝑠) ≤ 𝑁 (𝑡), and
c/ for 𝑠 < 𝑡, the quantity 𝑁 (𝑡) − 𝑁 (𝑠) equals the number of events that occurred in the interval
(𝑠, 𝑡].

A counting process N(t) may satisfy the following two properties:

P1. Independent increment property: says that the (random) number of events (arrivals) 𝑁 (𝑡)−𝑁 (𝑠)
and 𝑁 (𝑣) − 𝑁 (𝑢) in two disjoint intervals [say (𝑠, 𝑡] ∩ (𝑢, 𝑣] = ∅] are independent.

P2. Stationary increment property: says that the distribution of the number of events 𝑁 (𝑠, 𝑡) :=
𝑁 (𝑡) − 𝑁 (𝑠) occurring in interval (𝑠, 𝑡] depends only on the length ℎ = 𝑡 − 𝑠 of the interval, not
on the position of the interval.

A Poisson process with rate λ is, formally, a counting process N(t) satisfying P1 and P2, together with
the following additional feature:
Orderliness: two or more arrivals cannot occur simultaneously, i.e.

P[N(t + τ) − N(t) > 1] = o(τ), as τ → 0.

We study Renewal Processes with the following parts.


• Transform Methods

• The Renewal Equations

• Markov Renewal Process

Definition 15.3.

Let {𝑁 (𝑡), 𝑡 ≥ 0} be a counting process, and


𝑇𝑖 be the random variable denoting the time point that the process hits state 𝑖 (or event 𝐸𝑖 ), for
𝑖 = 1, . . . , 𝑛, . . . ,

• The process {N(t), t ≥ 0} is called a renewal process if the nonnegative interarrival times
X_1, X_2, ... (defined just below) are independent and identically distributed.

• The n-th renewal is determined by the time point T_n.

• The lifetime of the event E_i is the duration X_i = T_i − T_{i−1}.

The Poisson process is a special case of a renewal process, obtained by taking the X_i
exponentially distributed with a common constant rate λ.

Interarrival times of a renewal process

Example 15.2.

Consider an experiment that involves a set of identical lightbulbs whose lifetimes are independent.
The experiment consists of using one lightbulb at a time, and when it fails it is immediately replaced by
another lightbulb from the set.


Each time a failed lightbulb is replaced constitutes a renewal event.


Let 𝑋𝑖 denote the lifetime of the 𝑖th lightbulb, 𝑖 = 1, 2, . . ., where 𝑋0 = 0. Because the light-
bulbs are assumed to be identical, the 𝑋𝑖 are independent and identically distributed with a pdf
𝑓𝑋 (𝑥), 𝑥 ≥ 0, and mean E[𝑋].
Let 𝑁 (𝑡) denote the number of renewal events up to and including the time 𝑡, where it is
assumed that the first lightbulb was turned on at time 𝑡 = 0.

• The time to failure T_n of the first n lightbulbs, which is also the time of the n-th renewal (replacing
the bulb), is given by T_n = \sum_{i=1}^n X_i.

In this lightbulb example, the X_i are the lifetimes of the successive replaced bulbs. It is easy to see
that the renewal process is more general than the Poisson process. 

15.5.1 The distribution of 𝑁 (𝑡)

• We have that

𝑁 (𝑡) = max{𝑛 : 𝑇𝑛 ≤ 𝑡}.

• The process {𝑁 (𝑡), 𝑡 ≥ 0} is a counting process, known as a renewal process,

and 𝑁 (𝑡) denotes the number of renewals up to time 𝑡.

• Observe that the event that

"the number of renewals up to and including the time t is less than n"

is equivalent to the event that

"the n-th renewal, happening at time point T_n = \sum_{i=1}^n X_i, is later than t"; hence:

{N(t) < n} = {T_n > t} = {X_1 + · · · + X_n > t}.     (15.17)

Therefore, P[N(t) < n] = P[T_n > t], and

P[N(t) ≥ n] = P[T_n ≤ t] = F_{T_n}(t) = P[X_1 + · · · + X_n ≤ t].     (15.18)

Property. Denoting the cdf of the failure time T_n by F_n(t), we have F_n(t) = P[N(t) ≥ n].


• By the sum rule, we conclude that

P[𝑁 (𝑡) = 𝑛] = P[𝑁 (𝑡) ≥ 𝑛] − P[𝑁 (𝑡) ≥ 𝑛 + 1].

• The pmf (probability mass function) of the random variable 𝑁 (𝑡) can be obtained from the
formula
P[𝑁 (𝑡) = 𝑛] = P[𝑇𝑛 ≤ 𝑡] − P[𝑇𝑛+1 ≤ 𝑡] = 𝐹𝑛 (𝑡) − 𝐹𝑛+1 (𝑡). (15.19)

15.5.2 Find the distribution of the 𝑛-th renewal 𝑇𝑛

Equation 15.19 interestingly links the distribution of the renewal process N(t) with the distribution of the
times of renewals T_n. This relationship, however, does not reduce to a simple exponential/Poisson
relationship as in the case of the Poisson process of Chapter 14.
In some cases, we know the exact distribution of the random variable T_n.
* If X_i ∼ Pois(λ), by Proposition 14.2, then T_n ∼ Pois(nλ).
In general, it is difficult to find the exact distribution function F_n(t) of T_n.

15.6 Transform Methods

We recall the key transforms: the probability-generating function, the moment generating function, and
the Laplace transform.

15.6.1 Probability-generating function (p.g.f. or PGF) of a discrete variable

Consider the (point) probability distribution of a discrete r.v. X, with the observed values in Range(X) =
N and pmf P[X = j] = p_j.

The probability-generating function of X is defined by

P(t) = P_X(t) = \sum_{j=0}^∞ p_j t^j = E[t^X],     (15.20)

here E is the expectation operator, t is called a dummy variable. Obviously

P(1) = \sum_{j=0}^∞ p_j = 1.     (15.21)

Besides, the moment-generating capability of the PGF-transform lies in the results obtained from
evaluating the derivatives of the transform at t = 1. We have, for a discrete r.v. X, with values
j ∈ Range(X) = N and pmf p_j, that

P_X(t) = E[t^X] = \sum_{j=0}^∞ t^j p_j,

dP(t)/dt = \sum_{j=1}^∞ j t^{j-1} p_j,

dP(t)/dt |_{t=1} = P′(1) = \sum_{j=1}^∞ j p_j = E[X] = μ_X,     (15.22)

d²P(t)/dt² |_{t=1} = P″(1) = \sum_{j=2}^∞ j(j − 1) p_j = E[X²] − E[X].

Now let 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 ∼ 𝑋 be a random sample (iid) from a certain distribution, where 𝑁 itself is a
random variable (having its own distribution on N).
We would like to find the PGF 𝐻(𝑠) of the sum

𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁 .

We knew from Section 15.4.1 that the PGF 𝐻(𝑠) of 𝑆𝑁 is thus a composition of the PGF 𝑃𝑁 (.) of 𝑁
and 𝑃𝑋 (𝑠) of 𝑋𝑖 , because

∑︁
𝐻(𝑠) = P[𝑆𝑁 = 𝑘] 𝑠𝑘 = 𝑃𝑁 (𝑃𝑋 (𝑠)). (15.23)
𝑘=0

15.6.2 Moment generating function of a continuous variable

The moment generating function (m.g.f.) of a continuous random variable X is defined as a
function of a real variable t,

M(t) = M_X(t) = E[e^{tX}] = \int_{-∞}^∞ e^{tx} f(x) dx.     (15.24)

M(0) = 1 for all distributions. But M(t) may not exist for some t ≠ 0. To be useful, it is sufficient
that M(t) exists in some interval containing t = 0.

• For example, if X has a continuous (uniform) distribution with p.d.f.

f(x) = 1/(b − a) for a ≤ x ≤ b (a < b), and f(x) = 0 otherwise,

then

M(t) = (1/(b − a)) \int_a^b e^{tx} dx = (e^{tb} − e^{ta}) / (t(b − a)).

This is a differentiable function of t, for all t, −∞ < t < ∞.


♣ OBSERVATION 5.

Another useful property of the m.g.f. M(t) is that often we can obtain the moments of F(x) by differen-
tiating M(t). More specifically, consider the n-th order derivative of M(t). Assuming that this derivative
exists, and that differentiation can be interchanged with integration (or summation), then

M^(n)(t) = (d^n/dt^n) \int e^{tx} f(x) dx = \int (d^n/dt^n) e^{tx} f(x) dx = \int x^n e^{tx} f(x) dx.

Thus, if these operations are justified, then

M^(n)(0) = \int x^n f(x) dx = μ_n = E[X^n].     (15.25)

In the following section we will illustrate the usefulness of the m.g.f.

Example 15.3.

The m.g.f. of the Poisson distribution X ∼ P(λ), with pmf

p(j; λ) = P(X = j) = λ^j e^{−λ} / j!, j = 0, 1, 2, ...,

is given by

M(t) = E[e^{tX}] = e^{−λ} \sum_{j=0}^∞ e^{tj} λ^j / j! = e^{−λ} · e^{λ e^t} = e^{−λ(1 − e^t)}, −∞ < t < ∞.     (15.26)

Therefore,

M′(t) = dM/dt = λ M(t) e^t,
M″(t) = (λ² e^{2t} + λ e^t) M(t).

The mean and variance of the Poisson distribution are

E[X] = μ = M′(0) = λ,
V[X] = E[X²] − E[X]² = M″(0) − M′(0)² = λ.     (15.27)

15.6.3 Laplace transform and Fourier transform

Besides the PGF, we will make extensive use of the Laplace and Fourier integral transforms. Here
we provide a basic introduction to such methods.


Laplace transform (or 𝑠 transform) for continuous variable

Let 𝑓 (𝑥) = 𝑓𝑋 (𝑥) be the PDF of a continuous random variable 𝑋 that takes only non-negative values;
that is, 𝑓𝑋 (𝑥) = 0 for 𝑥 < 0.

The Laplace transform of 𝑓𝑋 (𝑥), denoted by 𝐿𝑋 (𝑠), is defined by


∫︁ ∞
𝐿𝑋 (𝑠) = E[𝑒−𝑠𝑋 ] = 𝑒−𝑠𝑥 𝑓 (𝑥) 𝑑𝑥. (15.28)
0

PROPERTIES:

1. When the Laplace transform L_X(s) is evaluated at s = 0, its value is L_X(0) = 1:

L_X(s)|_{s=0} = \int_0^∞ f(x) dx = 1.

2. One of the primary reasons for studying L_X(s) is to derive the moments of the different probability
distributions. Taking successive derivatives of the Laplace transform L_X(s) and evaluating
them at s = 0, we obtain the following:

dL(s)/ds = −\int_0^∞ x e^{−sx} f(x) dx  ⟹  dL(s)/ds |_{s=0} = L′(0) = −E[X],

d²L(s)/ds² = \int_0^∞ x² e^{−sx} f(x) dx  ⟹  d²L(s)/ds² |_{s=0} = L″(0) = E[X²],
...
d^n L(s)/ds^n |_{s=0} = (−1)^n E[X^n].     (15.29)

3. The Laplace transform operator 𝐿 is linear, since for constants 𝑎, 𝑏:

𝐿(𝑎𝑓𝑋 + 𝑏𝑓𝑌 ) = 𝑎𝐿𝑋 + 𝑏𝐿𝑌

4. For two pdfs f(t), g(t) of continuous random variables X and Y that take only non-negative values,
we have:

a) the convolution f * g of f and g is the density of the sum variable X + Y, as defined in the
next part;

b) the Laplace transform of the convolution 𝑓 * 𝑔 then is given by

𝐿(𝑓 * 𝑔)(𝑠) = 𝐿(𝑓 )𝐿(𝑔). (15.30)


15.6.4 Sums of Random Variables- Convolutions

 CONCEPT 14.

1. When X and Y are two independent discrete variables, having respectively pmf f and g, the general
formula for the distribution of the sum Z = X + Y is

P[Z = z] = \sum_{k=−∞}^∞ P[X = k] P[Y = z − k].     (15.31)

The convolution of f and g is the pmf of the sum Z = X + Y, given by

h(z) = f_Z(z) = (f * g)(z) = \sum_{k=−∞}^∞ f(k) g(z − k).     (15.32)

2. When X and Y are two independent continuous variables, we have the convolution of f, g (i.e. the
pdf of the sum Z = X + Y) as

h(z) = f_Z(z) = (f * g)(z) = \int_{−∞}^∞ f(z − t) g(t) dt = \int_{−∞}^∞ f(t) g(z − t) dt.     (15.33)
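A short R sketch of the discrete convolution (15.32), for two pmfs stored as vectors (the convention f[k+1] = P[X = k], g[k+1] = P[Y = k] with finite supports starting at 0 is an assumption of the sketch):

    pmf.sum <- function(f, g) {
      h <- rep(0, length(f) + length(g) - 1)
      for (k in seq_along(f))                       # shift g by k-1 and accumulate
        h[k:(k + length(g) - 1)] <- h[k:(k + length(g) - 1)] + f[k] * g
      h                                             # h[z+1] = P[X + Y = z]
    }
    pmf.sum(dbinom(0:3, 3, 0.5), dbinom(0:2, 2, 0.5))  # matches dbinom(0:5, 5, 0.5); cf. Problem 15.3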

15.7 The Renewal Equation

The renewal function H(t) is the expected number of renewals E[N(t)] by time t.

We can find H(t) as follows:

H(t) = E[N(t)] = \sum_{n=1}^∞ F_n(t), where F_n(t) = F_{T_n}(t) = P[T_n ≤ t].     (15.34)

Proof.

H(t) = E[N(t)] = \sum_{n=0}^∞ n P[N(t) = n] = \sum_{n=0}^∞ n [F_n(t) − F_{n+1}(t)] = ... = \sum_{n=1}^∞ F_n(t).     (15.35)

Now, if the X_i are continuous random variables, taking the derivative of each side we obtain

h(t) = dH(t)/dt = \sum_{n=1}^∞ dF_n(t)/dt = \sum_{n=1}^∞ f_{T_n}(t).


The renewal density is ℎ(𝑡). Using the Laplace transform we obtain an explicit integral equation

∫︁ 𝑡
ℎ(𝑡) = 𝑓𝑋 (𝑡) + ℎ(𝑡 − 𝑢) 𝑓𝑋 (𝑢)𝑑𝑢. (15.36)
𝑢=0

The renewal equation of the process 𝑁 (𝑡) is given by integrating both sides of the last equation

∫︁ 𝑡 ∫︁ 𝑡
𝐻(𝑡) = ℎ(𝑢)𝑑𝑢 = 𝐹𝑋 (𝑡) + 𝐻(𝑡 − 𝑢) 𝑓𝑋 (𝑢) 𝑑𝑢 for 𝑡 ≥ 0. (15.37)
𝑢=0 𝑢=0

This equation is called the fundamental equation of renewal theory.

Proposition 15.2 (Relationship).

Recall that 𝑋𝑖 = 𝑇𝑖 − 𝑇𝑖−1 is the lifetime of event 𝐸𝑖 .


By Equation (15.37), there is a one-to-one correspondence between the distribution function of the
random variables 𝑋𝑖 and the renewal function 𝐻(𝑡).

Theorem 15.3.

Suppose that H(t) is the renewal function of a renewal process N(t), where the lifetime of event i is
X_i = T_i − T_{i−1} and T_i is the i-th renewal time point.

The Elementary Renewal Theorem says that

lim_{t→∞} H(t)/t = 1/E[X].     (15.38)

Example 15.4.

Assume that the lifetime X is exponentially distributed with mean E[X] = 1/λ. Then

f_X(t) = λ e^{−λt}; F_X(t) = 1 − e^{−λt};
h(t) = λ;     (15.39)
H(t) = \int_{u=0}^t h(u) du = λt.

Clearly,

H(t)/t = λt/t = λ = 1/E[X].
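Theorem 15.3 is also easy to probe by simulation for a non-exponential lifetime. A sketch assuming Uniform(0, 2) lifetimes, so that E[X] = 1 and H(t)/t should approach 1:

    set.seed(4)
    t.max <- 1000
    sim.Nt <- function() {                 # one realization of N(t.max)
      total <- 0; n <- 0
      while (total <= t.max) { total <- total + runif(1, 0, 2); n <- n + 1 }
      n - 1                                # renewals completed within [0, t.max]
    }
    mean(replicate(200, sim.Nt())) / t.max # ~ 1/E[X] = 1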

Corollary 15.4. The Poisson process is the only renewal process having a linear renewal function,
namely H(t) = λt.
The Poisson process is also the only Markovian renewal process.


Proof: This is because only the exponential distribution has the memory-less property.

Proposition 15.5 (Sum of independent renewal processes).

Let {N_1(t), t ≥ 0} and {N_2(t), t ≥ 0} be independent renewal processes.


The sum process
𝑁 (𝑡) = 𝑁1 (𝑡) + 𝑁2 (𝑡), for 𝑡 ≥ 0 (15.40)

is a renewal process if and only if {𝑁𝑖 (𝑡), 𝑡 ≥ 0} are Poisson processes, for 𝑖 = 1, 2.

Problem 15.1.

Suppose that the renewal process {N(t), t ≥ 0} ∼ Pois(λt) is a Poisson process with rate λ. Compute
E[N²(t)] and confirm that

∫︁ 𝑡
E[𝑁 2 (𝑡)] = 𝐻(𝑡) + 2 𝐻(𝑡 − 𝑢) 𝑑𝐻(𝑢) for 𝑡 ≥ 0.
0

HINT: Use E[𝐾 2 ] = V[𝐾] + E[𝐾]2 

Problem 15.2.

Consider a machine that is subject to failure and repair.

• The time to repair the machine when it breaks down is exponentially distributed with mean 1/𝜇.

• The time the machine runs before breaking down is also exponentially distributed with mean 1/𝜆.

• When repaired the machine is considered to be as good as new. The repair time and the running
time are assumed to be independent.

If the machine is in good condition at time 0, what is the expected number of failures up to time 𝑡?

15.8 Markov Renewal Process

We have earlier defined the renewal process {𝑁 (𝑡), 𝑡 ≥ 0} as a counting process that denotes the
number of renewals up to time 𝑡.

The Markov renewal process is a generalization of the renewal process in which the
times between renewals are chosen according to a Markov chain.

Consider a random variable 𝑋𝑛 that takes values in a countable set S, and a random variable 𝑇𝑛
(renewal time) that takes values in the interval [0, ∞) such that

0 = 𝑇0 ≤ 𝑇1 ≤ 𝑇2 ≤ . . .

There are two ways to define Markov renewal process.


1. The stochastic process {(X_n, T_n) | n ∈ N} is defined to be a Markov renewal process with state
space S if

P[X_{n+1} = j, T_{n+1} − T_n ≤ t | X_n, X_{n−1}, ..., X_0; T_n, T_{n−1}, ..., T_0]
= P[X_{n+1} = j, T_{n+1} − T_n ≤ t | X_n],

for all n = 0, 1, 2, ...; j ∈ S; and t ∈ [0, ∞).

The interval 𝐻𝑛 = 𝑇𝑛+1 − 𝑇𝑛 , 𝑛 ≥ 0, is called the holding/waiting time in state 𝑋𝑛 .

2. Write N+ = {1, 2, 3, ...}; with k ∈ S and t ≥ 0 we define the following function:

V_k(n, t) = 1 if X_n = k and H_0 + H_1 + H_2 + · · · + H_{n−1} ≤ t (n ∈ N+), and V_k(n, t) = 0 otherwise.     (15.41)

The number of times the process {(X_n, T_n)} visits state X_n = k in the interval (0, t] is

N_k(t) = \sum_{n=0}^∞ V_k(n, t), k ∈ S, t ≥ 0.     (15.42)

Then {𝑁𝑘 (𝑡), 𝑘 ∈ S, 𝑡 ≥ 0} is a Markov renewal process.

15.8.1 The Markov Renewal Function

The function

M_{i,k}(t) = E[N_k(t) | X_0 = i], i, k ∈ S, t ≥ 0     (15.43)

is called the Markov renewal function.

Using (15.42) and denoting J_n = H_0 + H_1 + H_2 + · · · + H_{n−1} (the time from the beginning until the
process enters state X_n), we can explicitly write

M_{i,k}(t) = E[\sum_{n=0}^∞ V_k(n, t) | X_0 = i] = \sum_{n=0}^∞ E[V_k(n, t) | X_0 = i]
           = \sum_{n=1}^∞ P[X_n = k, J_n ≤ t | X_0 = i].     (15.44)

J_n is also called the epoch of the n-th transition of the process {(X_n, T_n) | n ∈ N}.


15.8.2 The one-step transition probability

The one-step transition probability Q_{i,j}(t) of the above Markov renewal process is defined by

Q_{i,j}(t) = P[X_{n+1} = j, H_n ≤ t | X_n = i], t ≥ 0,     (15.45)

independent of n. Thus, Q_{i,j}(t) is the conditional probability that

the process will be in state j next, given that
it is currently in state i and the waiting time in the current state i is no more than t.

The family of probabilities Q = {Q_{i,j}(t), i, j ∈ S, t ≥ 0} is called the semi-Markov kernel over S.

In particular, when 𝑗 = 𝑘

𝑄𝑖,𝑘 (𝑡) = P[𝑋1 = 𝑘, 𝐻0 ≤ 𝑡| 𝑋0 = 𝑖], 𝑡 ≥ 0.

We then obtain

M_{i,k}(t) = \sum_{n=1}^∞ P[X_n = k, J_n ≤ t | X_0 = i] = Q_{i,k}(t) + \sum_{n=2}^∞ P[X_n = k, J_n ≤ t | X_0 = i].     (15.46)

If we define

Q^(n)_{i,k}(t) = P[X_n = k, J_n ≤ t | X_0 = i]     (15.47)

and

Q^(0)_{i,k}(t) = 0 if i ≠ k; 1 if i = k,     (15.48)

then we can obtain the following recursive relationship:

Q^(n+1)_{i,k}(t) = \sum_{j∈S} \int_0^t Q^(n)_{i,j}(t − u) dQ_{j,k}(u).     (15.49)

If we define the matrix Q = [Q_{i,k}], then the above expression is the convolution of Q^(n) and Q. That
is,
Q^(n+1)(t) = Q^(n)(t) * Q(t).

15.8.3 Computing the Markov renewal function 𝑀𝑖,𝑘 (𝑡)

Thus, if we define the matrix M(t) = [M_{i,k}(t)], we obtain

M(t) = \sum_{n=1}^∞ Q^(n)(t) = Q(t) + \sum_{n=2}^∞ Q^(n)(t), t ≥ 0,     (15.50)

or

M(t) = Q(t) + \sum_{n=1}^∞ Q^(n)(t) * Q(t) = ... = Q(t) + Q(t) * M(t).     (15.51)

Writing M*(s) for the Laplace transform of M(t) and Q*(s) for the Laplace transform of Q(t), we get

M*(s) = [I − Q*(s)]^{−1} − I.

15.9 Chapter’s problems

Problem 15.3. Prove that the convolution of two binomial distributions X ∼ Bin(m, p) and Y ∼
Bin(n, p), one with parameters m and p and the other with parameters n and p, is a binomial distri-
bution with parameters (m + n) and p, i.e.

Z = X + Y ∼ Bin(m + n, p).

Problem 15.4.

The price change X of a stock on a given trading day has the distribution

x:     −1     0     1     2
p_x:   1/4   1/2   1/8   1/8

Describe the random variable Z = X_1 + X_2, the sum of two independent copies of X. That is, find
the distribution of the change in the stock price after two (independent) trading days.

Problem 15.5. We consider a branching process {X_n, n = 0, 1, 2, ...}, with

X_n = \sum_{j=1}^{X_{n−1}} Z_{n−1,j}, Z_{k,j} ∼ N = Pois(λ),

for which X_0 = 1 = Z_0 and

p_i = P[N = i] = e^{−λ} λ^i / i!, i = 0, 1, 2, ...

That is, the number of descendants of an arbitrary individual has a Poisson distribution with parameter
λ. Determine the probability q_0 of eventual extinction of the population if
(a) λ = ln 2 and (b) λ = ln 4.

Problem 15.6 (System reliability). Let 𝑁 (𝑡) be the number of failures of a computer system in the
interval [0, 𝑡]. We suppose that {𝑁 (𝑡), 𝑡 > 0} is a Poisson process with rate 𝜆 = 1 per week. Calculate
the probability that


i) the system operates without failure during two consecutive weeks,

ii) the system will have exactly two failures during a given week, knowing that it operated without failure
during the previous two weeks,

iii) less than two weeks elapse before the third failure occurs.

Problem 15.7.

Let 𝑁 (𝑡) be the number of accidents at a specific intersection in the interval [0, 𝑡]. We suppose that
{𝑁 (𝑡), 𝑡 > 0} is a Poisson process with rate 𝜆1 = 1 per week. Moreover, the number 𝑌𝑘 of persons
injured in the 𝑘th accident has (approximately) a Poisson distribution with parameter 𝜆2 = 1/2, for all 𝑘.
Finally, the random variables 𝑌1 , 𝑌2 , ... are independent among themselves and are also independent
of the stochastic process {𝑁 (𝑡), 𝑡 > 0}.

a) Calculate the probability that the total number of persons injured in the interval [0, 𝑡] is greater than
or equal to 2, given that 𝑁 (𝑡) = 2.

b) Calculate V[𝑁 (𝑡) 𝑌1 ].

c) Let 𝑆𝑘 be the time instant when the 𝑘th person was injured, for 𝑘 = 1, 2, .... We set 𝑇 = 𝑆2 − 𝑆1 .
Calculate P[𝑇 > 0].

Chapter 16

Statistical Data Analytics in Practice


Uncertainty-accepted decision making

[Source [56]]

16.1 Data Analytics 1 with Emergency Medical Services Data

16.1.1 The problem and related study

Emergency medical services (EMS), also known as ambulance services or paramedic services, are
emergency services which treat illnesses and injuries that require an urgent medical response, pro-
viding out-of-hospital treatment and transport to definitive care. They may also be known as a first aid
squad, FAST squad, emergency squad, rescue squad, ambulance squad, ambulance corps, life squad
or by other initialisms such as EMAS or EMARS [?].
EMS assistance follows steps and procedures according to the operating system. However, this
systematic work must compete with time, which decides the fate of the patient. If the rescue team
reaches the patient and gives basic treatment quickly, the patient's life is more likely to be saved;
the shorter the travel time, the better. In remote areas far from a hospital or treatment center, it is
difficult to provide assistance in a short time. Therefore, optimally locating the stations improves the
quality of service.
A study related to this project was published by Zhaoxiang He et al. [?]. Their research concerns
EMS work in the South Dakota area, using case-by-case data from the U.S. National Emergency Medical
Services Information System (NEMSIS) in 2012. The related factors that affect the performance of the
services are divided into

• case-specific variables: caller's complaint, light and siren, dispatch time, location type, and weather
condition,

• service-specific variables: EMS station location, staffing, weather, highway and, last but not least,
traffic conditions.

Their study used 3 regression methods, namely linear regression, the spatial econometric
model, and the geographically weighted regression (GWR) model, to find the best fitted model.
In this case study, our experimental data set was obtained in Thailand. Generally, the population in
the rural areas of Thailand lives in groups, sporadically scattered throughout the area, and some of
them are far from hospitals or public health centers. In the event of an accident or emergency illness,
it is difficult to reach immediate treatment. Therefore, the researcher chose the northeastern region,
the largest region in Thailand, and selected Ubon Ratchathani province, which ranks 77th in the survey
of the Office of the National Economic and Social Development Council in the year 2015 (measured by
GPP and GPP per capita) [?] and is the 10th poorest province based on the ranking of the Ministry of
Science and Technology in 2018 [?].
We will employ the powerful GLM approach of Appendix C to analyze this specific data, in which the
response is not clearly defined at first glance.


16.1.2 The observed data and its structure

Our realistic data is provided by the National Institute for Emergency Medical Services (NIEMS), observed
during 1st October 2018 to 31st September 2019 at 7 districts of Amnat Charoen province, which are
Chanuman, Pathum Ratchawong, Phana, Lue Amnat, Hua Taphan, Senang kha Nikhom, and Mueang
Amnat Charoen.
The data table of each station with its contributing factors is partially shown in Figure 16.1. Some
columns of the raw data belong to the same group and can be combined into other single factors.
For example, the columns of colors (red, yellow, green, white, and black) can be used to divide the
damage levels of patients, and their summation is the total number of observations.
From all 11619 cases (observations) of our realistic data, we generate data according
to the relevant factors in the operation of EMS as follows.

1. Location factor or location of each station by dividing each station according to the district where
the station is located as a binary factor:

1 = station located in the city area, and 0 = station located in the rural area.

The criteria for dividing are the number of educational institutions, department stores, hospitals,
subdistrict public health centers, and the provincial barn. It can be seen that stations located in
Mueang Amnat Charoen district are organized in the city area and the others are in the rural
area.

2. Vehicle factor is the number of emergency vehicles used in the operation of each station.

3. Staff factor is the number of operators in the station. The levels are divided according to education
and training duration as follows

• Professional Staffs are


Doctor, Nurse, Emergency medical technician Paramedic(EMT-P),
Advance Emergency medical technician(A-EMT),
Emergency medical technician Intermediate(EMT-I),
Emergency medical dispatcher(EMD) and superior

• Volunteer Staffs are


Emergency medical technician basic(EMT-B),
Emergency medical technician(EMT),
Emergency medical responder(EMR), and First responder(FR)

4. Injury factor, or the number of patients sorted by telephone (telephone triage) in the operation of each
station, according to the severity of the patient.

The triage criteria follow the principles of The Emergency Severity Index (ESI) - a five-level emer-
gency department triage algorithm - which can be divided into


Figure 16.1: The table of data

Resuscitation (red), Emergent (yellow), Urgent (green),

Less urgent (white), and Nonurgent (black).

Red and yellow are cases that should be delivered to the hospital emergency room (ER) immediately
and within 10 minutes, respectively, while green and white are cases that are delivered to the
outpatient department (OPD) under ESI principles.

- Trauma cases, or patients with serious emergency conditions requiring urgent assistance, include
red and yellow.

- Non-trauma cases, or mild emergency patients, consist of green, white, and black; see [7].

5. Light and Siren factor is the number of cases in which the rescue team turns on the emergency
lights and siren to clear space while traveling to the scene, for each station.

The information is separated according to the operation level of each station, which is linked to the
telephone triage factor. The operating instructions of each level are divided according to the severity
of the patient as follows:

• Red: indicating for the operation levels of


Intermediate life support(ILS),
Emergency medical responder(EMR),
Advanced life support(ALS).
The operational response must use both Light and siren.

• Yellow: Basic life support(BLS) operation level.


The operational response must use both light and siren

• Green: First responder (FR) operation level response.

Do not use light and siren.

• White: Respond with Telephone Referral / Recommendation program

• Black: Not responding because there is no patient [7].

Therefore, the rescue teams at the ILS / EMR / ALS and BLS levels are the groups with emergency
light and siren enabled, called the light and siren factor. The non-light and siren factor is the number
of cases in which the rescue team at the FR level does not turn on the emergency lights while
traveling to the scene, for each station.

6. Response time in 8 minutes factor is the number of cases in which the rescue team travels from
the station to the scene within 8 minutes.

7. Response time out of 8 minutes factor is the number of cases in which the rescue team travels
from the station to the scene in more than 8 minutes.

After screening, the new factors are shown in the table below.

16.1.3 Method for Solving Problem

We will employ linear modeling and its extension, generalized linear modeling (GLM), to analyze the
observed data and draw conclusions from analyses computed in the R software. But why use general-
ized linear models?
The essential reason is a key principle in statistical modeling and analytics: the data
structure decides the method of analysis. Explicitly, our observed data does not provide time-related in-
formation; hence time-series models in statistics, or continuous models in mathematics such as ordinary
differential equations, are not appropriate to apply.

Figure 16.2: New factors obtained after screening data

Furthermore, we knew from the previous chapters that the linear model is suitable when certain
assumptions hold and the modeled response variable Y receives continuous values. Now
we will see that our response variable Y, to be defined in subsequent sections, actually consists of counts,
i.e. its range is the natural numbers.

Therefore, the only applicable tools for analyzing this specific dataset are the GLMs,
and ANCOVA if the locations are of interest.


A Brief of Generalized Linear Models

Linear regression models have the form

E[𝑌 ] = 𝜇 = E[𝑌 |X = 𝑥] = 𝑥𝑇 𝛽, 𝑌 ∼ N(𝜇, 𝜎 2 ) (16.1)

or explicitly for many observations

E[Y_i] = μ_i = E[Y_i | X = x_i] = x_i^T β, Y_i ∼ N(μ_i, σ²),

where the random variables 𝑌𝑖 are independent. Note that the 𝑌𝑖 for different subjects, indexed by
the subscript 𝑖, may have different expected values 𝜇𝑖 = E[𝑌𝑖 ].

• Generalized linear models (GLM), extended from linear regression models, are important in the anal-
ysis of insurance data, for which, the assumptions of the normal model are frequently not applica-
ble.

For example, in actuarial science, claim sizes, claim frequencies and the occurrence of a claim on a
single policy are all outcomes which are not normal. Also, the relationship between outcomes and
drivers of risk is often multiplicative rather than additive.

• HOW? Generalized linear modeling is used to assess and quantify the relationship between a
response variable 𝑌 and explanatory variables x = 𝑥1 , 𝑥2 , · · · , 𝑥𝑝 .

The generalized linear modeling differs from ordinary regression modeling (Equation 16.1) in two
important respects:

1. The distribution of the response is chosen from the exponential family, which includes the Poisson,
Gaussian and binomial families. Thus the distribution of the response need not be normal or close to
normal and may be explicitly nonnormal.

2. A transformation of the mean E[𝑌 ] is linearly related to the variables 𝑥𝑖 .

General steps in generalized linear modeling

Given a response variable 𝑌 , with E[𝑌 ] = 𝜇, constructing a GLM consists of the following steps.

1. Choose a response distribution 𝑓 (𝑦).

2. Choose a link 𝑔(𝜇) = 𝑥𝑇 𝛽.

3. Choose explanatory variables 𝑥 = 𝑥1 , 𝑥2 , · · · , 𝑥𝑝 , in terms of which 𝑔(𝜇) is to be modeled. Similar


considerations apply as in ordinary regression modeling.


4. Collect observations y_1, y_2, ..., y_n on the response y from n individuals in a sample, and corre-
sponding values x_i of the explanatory variables x, written as X^T = [x_1 x_2 · · · x_i · · · x_n], or

          [ x_11  x_12  · · ·  x_1j  · · ·  x_1p ]
          [ x_21  x_22  · · ·  x_2j  · · ·  x_2p ]
    X  =  [  ...   ...  · · ·   ...  · · ·   ... ]     (16.2)
          [ x_n1  x_n2  · · ·  x_nj  · · ·  x_np ]

with rows x_i^T = (x_i1, x_i2, ..., x_ip), i = 1, ..., n.

Successive observations are assumed to be independent, i.e. the sample will be regarded as a ran-
dom sample from the background population.

5. Fit the model by estimating β and, if unknown, φ. In R, we use

    model = glm(Y ~ x1 + x2 + ... + xp, family = poisson, data = yourdata)

(with family = binomial for binary responses, etc.).

6. Given the estimate of 𝛽 = (𝛽0 , 𝛽1 , 𝛽2 , · · · , 𝛽𝑝 )𝑇 ,

we generate predictions (or fitted values) of 𝑦 for different settings of 𝑥; and

examine how well the model fits by examining the departure of the fitted values from actual values,
as well as other model diagnostics.
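Schematically, steps 5-6 in R look as follows (a sketch only; the data frame dataX and the predictors x1, x2 are placeholders, not names from this case study):

    model  <- glm(Y ~ x1 + x2, family = poisson, data = dataX)
    summary(model)                              # estimated beta, deviances, AIC
    mu.hat <- predict(model, type = "response") # fitted values on the scale of E[Y]
    plot(model)                                 # standard residual diagnostics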

What would we do next? The analysis procedure includes the following steps.

1. Summarizing the data, obtained from governmental offices.

Describing the structure of data (to list independent variables 𝑋𝑖 ).

2. Coding variables and defining the response 𝑌 from the relevant factors 𝑋𝑖 .

3. Choosing suitable statistical models (selecting key predictors 𝑋𝑖 which corresponding to 𝑌𝑖 into the
model).

4. Fitting the model in the R software, improving, and selecting the best model.

5. Finally, we discuss in details from outcomes computed from above steps.

16.1.4 Coding all variables and Defining the response variable

We encode the above variables as follows.

• 𝑋1 = the station’s name,


• 𝑋2 = the station’s location,

• 𝑋3 = the number of vehicles of a station,

• 𝑋4 = the number of professional staffs of a station,

• 𝑋5 = the number of volunteer staffs of a station,

• 𝑋6 = the number of the trauma cases of a station,

• 𝑋7 = the number of non-trauma cases of a station,

• X8 = the number of cases in which the rescue team uses light and siren while driving to the scene,

• X9 = the number of cases in which the rescue team does not use light and siren while driving to the scene,

• 𝐼 = the number of cases which response time less than 8 minutes,

• 𝑂 = the number of cases which response time more than 8 minutes.

Y is not known at first glance from the data. So Y must generally be determined by exploiting possible
relationships among the existing independent variables. How do we determine Y?

♣ OBSERVATION 6.

This step is interesting since there are a few relevant factors in the data showing the capacity of the
entire EMS system, like the rescue time of the medical teams under requests, in terms of the variables I
and O, or the rescue-task related factor X4. In other words,

• Y cannot be directly defined as the response time (it does not exist in the data); Y can be discretely
defined via

I - the number of cases in which the response (rescue) time of the medical team arriving on site is at
most 8 minutes, Range(I) = N+;

and O - the number of cases in which the response (rescue) time of the medical team arriving on site
is more than 8 minutes, Range(O) = N+.

• Y apparently has natural values, so it can be assumed to be Poisson distributed, Y ∼ Poisson.

I and O are good candidates to choose. Indeed, if a station got a large I, that means it has a great
value of Y. On the other hand, if a station got a large O, that means on average the medical team took a
long time to travel to the scene in order to save lives. In other words,
* I definitely has a positive impact on Y, and O possibly has a negative impact on Y.


Y, the response factor, should represent the 'goodness-of-service'. Then the formula of Y is pro-
posed to be

Y = I − O.     (16.3)

In addition, with this definition of Y, the family of Y should be Poisson, Y ∼ Poisson. Since we
assume Y ∼ Poisson, I − O cannot be negative. Then we redefine

Y = I − O when I ≥ O, and Y = 0 when I < O.

Moreover, we can improve Y further by using the number of professionals X4, because
when X4 is high the performance of the station might be higher. Then we redefine Y as

Y = X4 + I − O.     (16.4)

Remark 6.

1. The last formula is interesting, since 𝑌 now combines the professional level of the rescue team (𝑋4)
on one side, with the on-time level of their travels to the scenes, on the other side. Both 𝑋4 and the
pair of 𝐼 and 𝑂 obviously depend on locations of the local clinics.

2. Note that, in these definitions, the response 𝑌 is unit-less, and has natural values.
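Concretely, definition (16.4) is a one-line construction in R (a sketch; the column names I, O, X4 are assumed to follow the coding above):

    dataX$newY <- with(dataX, X4 + I - O)              # the response of (16.4)
    c(mean = mean(dataX$newY), var = var(dataX$newY))  # compare E[Y] with V[Y], as done below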

16.1.5 Choosing suitable statistical models via predictors and response

After screening the factors (see Figure 16.2), we decide to choose only the few factors most meaningfully
impacting Y, which are X2, X3, X5, and X8. The reasons for this choice are as follows.

• For the location factor (X2): if the station is located in the city area, it might be difficult to travel from
place to place;

• for the vehicle factor (X3): if there is an accident with many seriously injured patients, more vehicles
might handle it better;

• for the volunteer factor (X5): since some stations have no professional staff, the number of volunteer
staff is important in this case;

• for the light and siren factor (X8): if the rescue team turns on the light and siren, other drivers will
know to let them pass.

The reason why we do not choose trauma (X6) or non-trauma (X7) cases as predictors is that, no
matter what the patients' status is, it is better if the rescue team gets to the scene in a short time. We
are specifically interested in the two-factor interactions X2 * X5 and X2 * X8 as well.


1. We put X2 * X5 in the model because, for stations in the city area that do not have professional
staff, the volunteer rescue team will be taken as an alternative, which is helpful only if the location is
not far from hospitals, so this interaction is worth considering;

2. and we put X2 * X8 because the city areas might have many more cars than rural areas, so the
other drivers may not cooperate to let the rescue team go first.

As a result, the multiple linear model, involving the 4 predictors X2, X3, X5, X8, is

Y ∼ X3 + X2 * X5 + X2 * X8.

Note that by writing A * B we mean that both the singletons A, B (main effects) and the combination
A:B (interaction effect) are included in the model.

Choosing suitable response’s distribution


As in the previous section it is assumed that Y ∼ Poisson; see Section C.3 for more info.

• We will check whether E[Y] = V[Y]; if the observed mean and variance are (roughly) the same, then
the model's response should really follow the family of Poisson distributions;

else if E[Y] < V[Y] then the response would be quasipoisson, because Y is an overdispersed Poisson.

• The dataset shows that Y is indeed overdispersed, hence we could improve the model from Poisson
to negative binomial, another standard choice for overdispersed counts.

The R code for fitting the overdispersed (quasipoisson) GLM is

    model = glm(newY ~ X3 + newX2 + X5 + newX2 * X8, family = quasipoisson, data = dataX)

We implemented the segments of R code for this data analysis shown in Figures 16.3 and 16.4.
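The negative binomial fit reported in Figures 16.9-16.10 is typically obtained with glm.nb from the MASS package (a sketch; the variable names follow the quasipoisson call above):

    library(MASS)
    model.nb <- glm.nb(newY ~ X3 + newX2 + X5 + newX2 * X8, data = dataX)
    summary(model.nb)   # reports the AIC that family = quasipoisson leaves undefined
    anova(model.nb)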


Figure 16.3: R code - the first part


Figure 16.4: R code continue

16.1.6 Discussion from computational outcome of R

After defining every independent variable, we define the dependent variable newY and check the
mean and variance of newY to identify its family. The output gives the mean of newY as 180.25
and the variance as 14328.91, meaning that E[newY] < V[newY]. Then the probability distribution
family of Y is quasipoisson.

Figure 16.5 shows the ANOVA outcome of the GLM with quasipoisson, but there is no significant fac-
tor. Figure 16.6 shows a summary of the quasipoisson model. The intercept of the model is significant,
X8 has p-value = 0.0384, and the interaction of X2 and X8 has p-value = 0.0497.
The dispersion parameter for the quasipoisson family is taken to be 65.7762; this parameter tells us how
many times larger the variance is than the mean.

Since our dispersion is more than one, the conditional variance is actually larger than the conditional
mean: we have over-dispersion. The null deviance is 3316.5 and the residual deviance
is 2201.7; the AIC value is not available.
Figure 16.7 presents the Residuals vs Fitted plot, which shows curvilinear trends, but the fit of such
a GLM is curvilinear by itself. Figure 16.8 presents the Normal Q-Q plot, which checks whether the
residuals are normally distributed; here the deviance residuals are not normally distributed.

• Since the AIC value is not displayed and the model is overdispersed, we try the negative
binomial model to improve the fit. Figure 16.9 shows the ANOVA outcome of the negative binomial
model. The deviance residuals all decrease from the quasipoisson model. The significant factors are
X2 (p-value = 0.03330), X2:X5 (p-value = 0.06730), and X2:X8 (p-value = 0.03003).


Figure 16.5: ANOVA result of the GLM

Figure 16.6: Summary of GLM


Figure 16.7: Residuals vs Fitted plot of GLM

Figure 16.8: Normal Q-Q plot of GLM

• Figure 16.10 shows a summary of the negative binomial model. The deviance residuals all decrease
from the quasipoisson model. The same factors remain significant: the constant (intercept), X8, and
X2:X8.

The dispersion parameter for the negative binomial model is 1.8836, which is close to 1, meaning
that E[newY] ≈ V[newY] is approximately restored under the fitted model. The null deviance is 63.380
and the residual deviance is 48.197; both decrease from the quasipoisson model. The AIC value is
497.09. Hence, the negative binomial model is chosen as the best fitting model.


Figure 16.9: ANOVA result of GLM.NB

Figure 16.10: Summary of GLM.NB


16.2 Data Analytics Project 2: Bridge Health Monitoring

We analyzed a data set collected from major bridges in Saigon, Vietnam, aimed for monitoring and/or
evaluating bridge health. This problem is known as bridge health monitoring (BHM) in literature.
Kindly see full details in Nguyen et. al. (2013) [25].

16.2.1 Overview

Damage prognosis by Statistical Methodology - is it a right trend?


Advances in data collection and storage capabilities during the past decades have led to an infor-
mation overload in most sciences. High-dimensional data sets present many mathematical challenges
as well as some opportunities, and are bound to give rise to new theoretical developments. One of
the problems with high-dimensional data sets is that not all the measured variables are “important” for
understanding the underlying phenomena of interest, for several reasons:

• Variables having a variation smaller than the measurement noise and thus will be irrelevant.

• Many of the variables will be correlated with each other (e.g., through linear combinations or other
functional dependence); a new set of uncorrelated variables should be found.

Structural Health Monitoring, specifically Bridge Health Monitoring (BHM), is an important problem
in many countries, including developing countries like Viet Nam. Many mechanical, mathematical,
statistical, etc. methods have been proposed to solve this problem. In the BHM process, one important step
is to reduce and extract important information from realistic datasets obtained from bridge monitoring.
Our first contribution in this study is the reduction of variables measured on the bridge using
Principal Component Analysis (PCA), coupled with some additional methods. Specifically, after
achieving a new dataset having new uncorrelated variables using PCA, this study uses the idea of
cross-validation to point out that the first few components suffice to reconstruct the original
data with appropriate information (variance). To this end, for the purpose of variable reduction,
Canonical Correlation Analysis is used to decide which subset of the original dataset keeps the most
information.
Due to the time-series nature of the bridge database, on the other hand, we consider a probabilistic
approach to detect potential fatalities of the monitored bridge caused by external forces
and conditions. Specifically, the second contribution is based on a combination of auto-regressive (AR)
modelling, its variations, and the sequential probability ratio test, aimed at evaluating how severe
damages can be and at identifying where they can possibly be located on the bridge.

A multiphase scheme for evaluating structure reliability

In this case study we propose a multiphase scheme for evaluating reliability/health of a structure or
system 𝑆 when on-line measurement of that structure is possible. Suppose you obtained a huge


dataset 𝐷 after continuously on-line measuring 𝑆 by using many sensors distributed in a certain way
on the structure. Exploiting the spatial and temporal characteristics of 𝐷, your aim is answering two
key questions:

1. which sensors could provide most information about the structure status/ health?

2. from the most informative sensors/locations determined by the first answer, how certain can we
be in concluding that they are potentially fatal places of the observed structure?

Having a mathematical answer to the 1st question helps us to optimize the resources for investigating
the structure's lifetime or usage; and to some extent, knowing the fatally dangerous places of the struc-
ture/system obviously guides the engineers and managers to make the right decision at the right time.

16.2.2 Principal Component Analysis

PCA, which is one of the most widely used multivariate techniques, is described in most standard
multivariate texts, e.g. [81][85]. One of its most popular uses is that of dimensionality reduction in
which there are a large number of interrelated variables, while retaining as much variation present in
the data set as possible. This reduction is achieved by linear transformation to a new set of variables,
the PCs, which are uncorrelated, and ordered so that the first few retain most of the variation present
in all of the original variables.
Let us consider a sample x = (x_1, x_2, ..., x_p)^⊤, a vector of p random variables. In essence, PCA
seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs)
of the original variables with the largest variance. The first PC, Y_1, is the linear combination with the
largest variance. We have Y_1 = δ_1^⊤ x, where the p-dimensional coefficient vector δ_1 = (δ_{1,1}, ..., δ_{1,p})^⊤
solves

δ_1 = arg max_{‖δ‖=1} Var(δ^⊤ x).     (16.5)

The second PC is the linear combination with the second largest variance and orthogonal to the first
PC, and so on. There are as many PCs as the number the original variables. For many datasets, the
first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of
information.
Since the variance depends on the scale of the variables, it is customary to first standardize each
variable to have mean zero and standard deviation one. After the standardization, the original variables,
with possibly different units of measurement, are all in comparable units. Assuming standardized
data with the empirical covariance matrix Σ_{p×p} = (1/n) X^⊤X, where X is an (n × p) matrix consisting of n
observations on the p variables in x, we can use the spectral decomposition theorem to write Σ as

Σ = V L V^⊤,     (16.6)

where L = diag(l_1, l_2, ..., l_p) is the diagonal matrix of the ordered eigenvalues l_1 ≥ · · · ≥ l_p, and V is

a 𝑝 × 𝑝 orthogonal matrix containing the eigenvectors. It can be shown that the PCs are given by the 𝑝


rows of the 𝑝 × 𝑛 matrix 𝑌 , where


𝑌 = 𝑉 ⊤𝑋 ⊤. (16.7)

Performing PCA using (16.7) (i.e., by initially finding the eigenvalues of the sample covariance and
then finding the corresponding eigenvectors) is already simple and computationally fast. However, ease
of computation can be further enhanced by utilizing the connection between PCA and the singular value
decomposition (SVD) of the mean-centered data matrix X, which takes the form:

𝑋 = 𝑈 𝑆𝑉 ⊤ , (16.8)

where 𝑈 ⊤ 𝑈 = I𝑝 , 𝑉 𝑉 ⊤ = 𝑉 ⊤ 𝑉 = I𝑝 and 𝑆 is diagonal with diagonal elements 𝑠1 , 𝑠2 , . . . , 𝑠𝑝 . Here,


𝑠1 ≥ 𝑠2 ≥ · · · ≥ 𝑠𝑝 are the non-negative square-roots of the eigenvalues of 𝑋 ⊤ 𝑋 or 𝑋𝑋 ⊤ , the columns
of 𝑈 are the 𝑝 orthogonal eigenvectors of 𝑋𝑋 ⊤ and the rows of 𝑉 ⊤ are the orthogonal eigenvectors of
𝑋 ⊤ 𝑋. It can be verified easily that matrix 𝑉 in equations (16.7) and (16.8) are the same, and the PC
scores are given by 𝑈 𝑆.
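The equivalence of (16.7) and (16.8) is easy to verify in R on a small synthetic matrix (a sketch; the 40 × 5 Gaussian X below is an assumption made only for illustration):

    set.seed(5)
    X  <- scale(matrix(rnorm(200), nrow = 40, ncol = 5))  # mean-centered, unit-scaled data
    sv <- svd(X)                                          # X = U S V^T
    scores <- sv$u %*% diag(sv$d)                         # PC scores U S
    pc <- prcomp(X, center = FALSE)                       # PCA on the already-centered X
    max(abs(abs(scores) - abs(pc$x)))                     # ~ 0: agreement up to column signs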

16.2.3 Selecting the Components

Frequently, just the first few of the PCs are sufficient to represent the original data adequately. The
precise number of components to be retained, however, is often not clear. An approach using cross-
validation will be used here.
The standard cross-validation procedure is to subdivide the data matrix into a number of groups.
Each subgroup is deleted from the data in turn, the parameters of the predictor are evaluated from the
remainder of the data for each competing model, and the deleted values are then predicted for each
model. Some suitable function relating actual and predicted values, summed over all group deletions, is
used as the objective function and the model that optimizes this function is selected. We can describe
the technique as follows. Consider a (𝑛 × 𝑝) data matrix 𝑋 obtained by observing 𝑛 objects on 𝑝
variables, mean-centered and appropriately scaled. Associated with a given value 𝑘 is the predictor
𝑋̂^{(𝑘)}, an estimate of 𝑋 that arises from fitting only the first 𝑘 PCs. Thus the prediction model is given by

                𝑋 = 𝑋̂^{(𝑘)} + 𝐸^{(𝑘)},    (16.9)

where 𝐸 (𝑘) is the (𝑛 × 𝑝) matrix of error scores and 𝑘 = 0, 1, 2, . . . Each row of 𝐸 (𝑘) has a multivariate
normal distribution under the usual distributional assumptions. The errors in any row of 𝐸 (𝑘) are sta-
tistically independent of the errors in any other row since the rows of a data matrix generally represent
randomly sampled subjects.
To compute the discrepancy between actual and predicted values, we use

                PRESS(𝑘) = (1/(𝑛𝑝)) trace{ (𝐸^{(𝑘)})⊤ 𝐸^{(𝑘)} },

and some suitable function of these PRESS values is considered in order to choose the optimum value
of 𝑘. The notation PRESS stands for PREdiction Sum of Squares, taken in a similar sense

as in linear regression. These PRESS(𝑘) values are a measure of how well the model in (16.9) predicts
the data for each 𝑘. As noted before, the singular value decomposition of the data matrix enables us
to represent 𝑥𝑖𝑗, the (𝑖, 𝑗)-th element of the data matrix, in terms of the elements 𝑢𝑖𝑡 of 𝑈, the singular
values 𝑠𝑡, and the elements 𝑣𝑡𝑗 of 𝑉⊤:

                𝑥𝑖𝑗 = ∑_{𝑡=1}^{𝑝} 𝑢𝑖𝑡 𝑠𝑡 𝑣𝑡𝑗.    (16.10)

In the present situation we seek the optimal value of 𝑘 in the model

                𝑥𝑖𝑗 = ∑_{𝑡=1}^{𝑘} 𝑢𝑖𝑡 𝑠𝑡 𝑣𝑡𝑗 + 𝜀𝑖𝑗,    (16.11)

where 𝜀𝑖𝑗 is a residual term; this is equivalent to estimating the data using only the first 𝑘 PCs. Cross-
validation ensures that each data point is not used in both the prediction and assessment stages, while
nevertheless using as much of the original data as possible in predicting each 𝑥𝑖𝑗. This suggests that
𝑥𝑖𝑗 should be predicted from all the data except the 𝑖th row and 𝑗th column of 𝑋.

We now write 𝑋 = 𝑈𝑆𝑉⊤. Denote the updated values of 𝑈, 𝑉, and 𝑆 by 𝑈̂, 𝑉̂, and 𝑆̂ when the
𝑗th column of 𝑋 is deleted, and by 𝑈̄, 𝑉̄, and 𝑆̄ when the 𝑖th row of 𝑋 is deleted. Corresponding
notation is attached to the elements of all these matrices in the obvious way. In the prediction model
(16.11), the prediction of the part arising from 𝑈 requires information on the 𝑖th row, so 𝑈̂ (computed
with the 𝑗th column deleted) must be used. Similarly, prediction of the part arising from 𝑉 requires
information on the 𝑗th column, so 𝑉̄ (computed with the 𝑖th row deleted) must be used. Prediction of
the central part, 𝑆, can come from either 𝑆̂ or 𝑆̄, so it seems reasonable to use a composite of the two.
Hence we predict 𝑥𝑖𝑗 by

                𝑥̂𝑖𝑗^{(𝑘)} = ∑_{𝑡=1}^{𝑘} ( 𝑢̂𝑖𝑡 √𝑠̂𝑡 ) ( √𝑠̄𝑡 𝑣̄𝑡𝑗 ).    (16.12)

To choose the optimum value of 𝑘, we finally consider a suitable function of PRESS(𝑘). Analogy with
regression analysis suggests some function of the difference between successive PRESS values. One
such possibility is the statistic

                𝑊(𝑘) = [ (PRESS(𝑘 − 1) − PRESS(𝑘)) / 𝐷𝑘 ] ÷ [ PRESS(𝑘) / 𝐷𝑟 ],    (16.13)

where 𝐷𝑘 is the number of degrees of freedom required to fit the 𝑘th component and 𝐷𝑟 is the number
of degrees of freedom remaining after fitting the 𝑘th component. Consideration of the number of param-
eters to be estimated, together with all the constraints on the eigenvectors at each stage, shows that
𝐷𝑘 = 𝑛 + 𝑝 − 2𝑘. Also, since there are 𝑛𝑝 − 𝑝 degrees of freedom at the outset (each column of 𝑋
being mean-centered), 𝐷𝑟 can be found easily at each stage. 𝑊(𝑘) represents the increase in predictive
information supplied by the 𝑘th component, divided by the average information in each of the remaining
components. We therefore suggest that the optimum value for 𝑘 is the last value of 𝑘 at which 𝑊 is
greater than unity.
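
As an illustration, the following R sketch computes the 𝑊 statistic of Eq. (16.13) from a vector of
PRESS values and applies the retention rule above. The helper name W.statistic and the PRESS values
are hypothetical; in practice the PRESS values come from the leave-out scheme of Eq. (16.12).

W.statistic <- function(press, n, p) {
  # press[1] = PRESS(0), press[k+1] = PRESS(k)
  K <- length(press) - 1
  Dr <- n * p - p                  # degrees of freedom at the outset
  W <- numeric(K)
  for (k in 1:K) {
    Dk <- n + p - 2 * k            # df required to fit the k-th component
    Dr <- Dr - Dk                  # df remaining after fitting it
    W[k] <- ((press[k] - press[k + 1]) / Dk) / (press[k + 1] / Dr)
  }
  W
}

press <- c(1.00, 0.52, 0.31, 0.27, 0.26)   # hypothetical PRESS(0), ..., PRESS(4)
W.statistic(press, n = 50, p = 10)
# Retain k = the last component whose W value exceeds unity.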


16.2.4 Selecting the Variables

The first major aim here is to explore the choice of subsets of the original variables that retain the
overall features, or multivariate structure, present in the entire set of variables. The technique is
motivated by the long-standing and well-established technique of Canonical Correlation Analysis [83]:
we use measures of multivariate association based on canonical correlations as criteria for selecting
variables in PCA. The idea is to maximize the similarity, or overlap, between the spaces spanned by the
two sets of PCs, one arising from the full data set and the other from the reduced data set.

Let 𝑋 be the (𝑛 × 𝑝) data matrix, consisting of 𝑝 variables measured on each of 𝑛 individuals in the
sample, and let 𝑌 be the (𝑛 × 𝑘) transformed data matrix of PC scores yielding the best 𝑘-dimensional
approximation to the original data, with 𝑘 determined using 𝑊 in the previous section. Similarly, let 𝑋̃
denote the (𝑛 × 𝑞) reduced data matrix which retains only 𝑞 selected variables, and 𝑌̃ the corresponding
(𝑛 × 𝑘) matrix of PC scores. Note that since 𝑘 components may be sufficient to model the ‘signal’ in the
data, the remaining 𝑝 − 𝑘 dimensions are a reflection of the ‘noise’. Hence it seems reasonable to set 𝑞,
the number of variables to retain, equal to 𝑘.
Now let 𝑍̃ = (𝑌 | 𝑌̃) be the (𝑛 × 2𝑘) partitioned matrix arising from the horizontal concatenation of
the matrices of PC scores 𝑌 and 𝑌̃. Then the corresponding (2𝑘 × 2𝑘) partitioned correlation matrix
between the PCs is given by

                𝑅 = [ 𝑅_{𝑌𝑌}    𝑅_{𝑌𝑌̃}
                      𝑅_{𝑌̃𝑌}    𝑅_{𝑌̃𝑌̃} ].    (16.14)

Here 𝑅_{𝑌𝑌̃} is the (𝑘 × 𝑘) matrix of correlations between the PCs in 𝑌 and those in 𝑌̃. The correlation
matrix 𝑅 is symmetric; moreover, the PCs in 𝑌 are orthogonal to each other, and similarly the PCs in 𝑌̃
are orthogonal to each other, hence uncorrelated, so that 𝑅_{𝑌𝑌} = 𝑅_{𝑌̃𝑌̃} = I. Therefore the squared
canonical correlations between the two sets of PCs are given by the 𝑘 eigenvalues of

                𝑅_{𝑌𝑌̃} 𝑅_{𝑌𝑌̃}⊤,    (16.15)

arranged in descending order. The canonical correlations can also be interpreted as the simple correla-
tions between linear combinations of the PCs of the set 𝑌 and those of the set 𝑌̃, computed in a specific
manner. Such linear combinations are usually referred to as canonical variates. For a reasonable index
of the total association between the two sets (of variables), it seems appropriate to combine the
successive canonical correlations in some way. For our purpose, we use an index recommended in [93]:

                𝛾̂ = (1/𝑘) ∑_{𝑗=1}^{𝑘} 𝜌̂𝑗²,    (16.16)

where 𝜌ˆ1 , 𝜌ˆ2 , . . . , 𝜌ˆ𝑘 are the canonical correlations between the sets of PCs in 𝑌 and 𝑌̃︀ arranged in
descending order.
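
A small R sketch of this index follows; it assumes that Y and Y.tilde hold the (𝑛 × 𝑘) matrices of PC
scores from the full and the reduced data, and simulates them here purely for illustration.

set.seed(2)
n <- 60; k <- 3
Y <- matrix(rnorm(n * k), n, k)                  # PC scores of the full data (simulated)
Y.tilde <- Y + 0.3 * matrix(rnorm(n * k), n, k)  # PC scores of the reduced data (a noisy copy)

cc <- cancor(Y, Y.tilde)      # canonical correlations rho.hat_1 >= ... >= rho.hat_k
gamma.hat <- mean(cc$cor^2)   # the index of Eq. (16.16)
gamma.hat                     # close to 1 when the subset retains the structure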


16.2.5 Time Series Modeling for BHM

Recall that the damage identification process for a bridge, a major topic of Damage Prognosis, is complex
and is referred to as bridge health monitoring (BHM). This process can be addressed at many levels. The
damage state of the bridge can be described as a five-step process, as discussed in [90], answering the
following questions.
1) Existence: Is there damage in the system?
2) Location: Where is the damage in the system?
3) Type: What kind of damage is present?
4) Extent: How severe is the damage?
5) Prognosis: How much useful life of the structure remains?

After the dimensionality reduction step discussed in the previous parts, which allowed us to select
meaningful predictor sensor variables, our strategy is now to answer the final question. To do so, we
work backwards: we must find solutions to questions 3 and 4, and before that to questions 1 and 2. To
start with, we combine the approaches suggested in [90, 96, 97, 107] to answer questions 1 and 2:
whether the bridge is damaged or not and, if it is, where the damage is located. We employ time
series analysis, data clustering and hypothesis testing to make decisions about the state of the bridge.
Many sensors were mounted on the bridge at various locations to measure signals online under varying
operational and environmental conditions. The Sai Gon bridge has 32 spans, each 24 meters long. There
are two measurement modes: dynamic measurement, which records the distortion or displacement of the
bridge under moving vehicles, and static measurement, taken when no vehicles are on the bridge.

The data measured initially, when the bridge was in its undamaged state, serve as the reference
database. New data, recorded while the state of the bridge is unknown, form the new database; a data
clustering step then selects, from the reference database, a data sample obtained under operational and
environmental conditions similar to those of each new sample. Our data clustering step in Section 16.2.6
is a combination of auto-regressive (AR) modelling and a data clustering technique: first, data samples
from the two databases are fitted with AR models, and the AR coefficients are then used for clustering.
After the clustering step, damage-sensitive features are extracted from the two data samples; an
auto-regressive model with eXogenous input (ARX) is used in Section 16.2.7 to extract the features. In
Section 16.2.8, hypothesis testing is deployed to make a decision about the state of the bridge. An
application of this three-step paradigm to numerical data obtained from the Sai Gon bridge is presented
in Section ??.

16.2.6 Data clustering

In this study, time series signals measured in the field are stored in two databases. The reference
database contains signals measured when the bridge is undamaged, covering various environmental and
operational conditions (climate, traffic loading, etc.). The second database contains new signals
measured when the bridge is in an unknown condition. The data clustering procedure selects, from the
reference database, the previously recorded signal measured under environmental and operational
conditions closest to those of the newly obtained signal.

Step 1. First, all signals are standardized:

                𝑥̂ = (𝑥 − 𝜇𝑥)/𝜎𝑥,    (16.17)

where 𝑥̂ is the standardized signal, and 𝜇𝑥 and 𝜎𝑥 are the mean and standard deviation of 𝑥, respectively.
This standardization is applied to all signals in both databases but, for simplicity, 𝑥 is used to denote 𝑥̂
hereafter.

Step 2. After standardization, each time series signal 𝑥(𝑡) in the reference database is fitted with an
AR model of order 𝑟:

                𝑥(𝑡) = ∑_{𝑗=1}^{𝑟} 𝜃𝑥(𝑗) 𝑥(𝑡 − 𝑗) + 𝑒𝑥(𝑡),    (16.18)

where 𝜃𝑥(𝑗), 𝑗 = 1, . . . , 𝑟, are the AR coefficients and 𝑒𝑥(𝑡) is a white-noise input with variance 𝜎𝑥²;
the pair {𝜃𝑥(𝑗), 𝜎𝑥²} can be regarded as the feature of the signal.
For each time series signal 𝑦(𝑡) in the new database, recorded under an unknown condition of the bridge,
we repeat the two steps above. The AR model of order 𝑟 for 𝑦(𝑡) can be written as

                𝑦(𝑡) = ∑_{𝑗=1}^{𝑟} 𝜃𝑦(𝑗) 𝑦(𝑡 − 𝑗) + 𝑒𝑦(𝑡),    (16.19)

where 𝜃𝑦(𝑗), 𝑗 = 1, . . . , 𝑟, are the AR coefficients and 𝑒𝑦(𝑡) is a white-noise input with variance 𝜎𝑦²;
the pair {𝜃𝑦(𝑗), 𝜎𝑦²} is the feature of the signal.

When all time series signals in the reference database and the new database have been fitted with AR
models, two feature spaces, Ω^R and Ω^N, corresponding to the reference and the new database
respectively, are obtained. The data clustering procedure is then implemented by searching the space
Ω^R for a point {𝜃𝑥(𝑗), 𝜎𝑥²} similar to the target point {𝜃𝑦(𝑗), 𝜎𝑦²} in Ω^N. First, the following
condition is applied, for a given target point {𝜃𝑦(𝑗), 𝜎𝑦²}, to select a subspace of feature points in Ω^R:

                |𝜎𝑥² − 𝜎𝑦²| ≤ 𝜀1,    (16.20)

the rationale being that when the variances of the residual errors are close, the distributions of these
variables are similar. This step aims at choosing two data samples that have similar environmental and
operational conditions. After this step, a subspace Ω_1^R containing the feature points satisfying Eq.
(16.20) is obtained. Then, the cosine between the two coefficient vectors is calculated, to search
further within Ω_1^R for the feature points satisfying

                ∑_{𝑗=1}^{𝑟} 𝜃𝑥(𝑗)𝜃𝑦(𝑗) / ( √(∑_{𝑗=1}^{𝑟} 𝜃𝑥²(𝑗)) · √(∑_{𝑗=1}^{𝑟} 𝜃𝑦²(𝑗)) ) ≥ 𝜀2.    (16.21)


After this second filtering, only some feature points in Ω_1^R pass the test, and a smaller subspace
Ω_2^R is obtained. In the final step, the Euclidean distance between the target point and each feature
point in the subspace Ω_2^R is calculated, and the feature point in Ω_2^R with the minimum Euclidean
distance,

                min √( ∑_{𝑗=1}^{𝑟} [𝜃𝑦(𝑗) − 𝜃𝑥(𝑗)]² ),    (16.22)

is selected.

After the data clustering procedure, for each time series signal in the new database the reference signal
measured under the most similar environmental and operational conditions has been chosen; these signal
pairs are used in the further steps, as sketched below.
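
The following R sketch outlines this clustering filter, combining Eqs. (16.17)–(16.22). The helper names
ar.feature and cluster.match, the AR order r, and the thresholds eps1 and eps2 are illustrative
assumptions, not part of the procedure's original specification; ref.signals is assumed to be a list of
reference time series and y a single new signal.

ar.feature <- function(x, r = 5) {
  x <- (x - mean(x)) / sd(x)                 # Step 1: standardize, Eq. (16.17)
  fit <- ar(x, aic = FALSE, order.max = r)   # Step 2: fit AR(r), Eq. (16.18)
  list(theta = fit$ar, s2 = fit$var.pred)    # feature {theta(j), sigma^2}
}

cluster.match <- function(ref.signals, y, r = 5, eps1 = 0.1, eps2 = 0.9) {
  fy <- ar.feature(y, r)
  feats <- lapply(ref.signals, ar.feature, r = r)
  keep <- sapply(feats, function(f) {
    cosv <- sum(f$theta * fy$theta) /
      (sqrt(sum(f$theta^2)) * sqrt(sum(fy$theta^2)))
    abs(f$s2 - fy$s2) <= eps1 && cosv >= eps2   # Eqs. (16.20) and (16.21)
  })
  cand <- which(keep)
  if (length(cand) == 0) return(NA)
  d <- sapply(feats[cand], function(f) sqrt(sum((fy$theta - f$theta)^2)))
  cand[which.min(d)]    # Eq. (16.22): index of the nearest reference signal
}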

16.2.7 Damage extraction

The prediction capability of an AR model can be assessed by the standard deviation of the prediction
error 𝑒𝑥(𝑡) in Eq. (16.18). If the AR model represents the signal 𝑥(𝑡) well, the standard deviation of
the prediction error 𝑒𝑥(𝑡) should remain smaller than 10% of the standard deviation of the original
signal 𝑥(𝑡). In the practical calculations presented later, the standard deviation of the prediction
error is around 30–40% of that of the original signal 𝑥(𝑡), indicating that the AR model does not
predict the time series well.
To solve this problem, an ARX model is used to model the time series signals. To construct the ARX
model, it is assumed that the error between the measurement and the AR prediction is mainly caused by
an unknown external input. Based on this assumption, the ARX model represents the input/output
relationship between 𝑒𝑥(𝑡) and 𝑥(𝑡):

                𝑥(𝑡) = ∑_{𝑖=1}^{𝑝} 𝛼𝑖 𝑥(𝑡 − 𝑖) + ∑_{𝑗=1}^{𝑞} 𝛽𝑗 𝑒𝑥(𝑡 − 𝑗) + 𝑧𝑥(𝑡),    (16.23)

where 𝑧𝑥(𝑡) is the residual error after fitting the ARX(𝑝, 𝑞) model to 𝑒𝑥(𝑡) and 𝑥(𝑡). This ARX(𝑝, 𝑞)
model is then used to reproduce the input/output relationship between 𝑒𝑦(𝑡) and 𝑦(𝑡):

                𝑧𝑦(𝑡) = 𝑦(𝑡) − ∑_{𝑖=1}^{𝑝} 𝛼𝑖 𝑦(𝑡 − 𝑖) − ∑_{𝑗=1}^{𝑞} 𝛽𝑗 𝑒𝑦(𝑡 − 𝑗),    (16.24)

where 𝛼𝑖 and 𝛽𝑗 are the coefficients from Eq. (16.23). If the ARX model of the reference pair 𝑥(𝑡),
𝑒𝑥(𝑡) does not represent the new pair 𝑦(𝑡), 𝑒𝑦(𝑡) well, there will be a significant change in the
standard deviation of the residual error 𝑧𝑦(𝑡) compared with that of 𝑧𝑥(𝑡). Consequently, the standard
deviation of the residual error can be taken as the damage-sensitive feature.
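
A hedged R sketch of this step follows: fit.arx estimates the ARX(𝑝, 𝑞) coefficients of Eq. (16.23) by
least squares on lagged terms, and arx.residual.sd reuses them on a new signal pair as in Eq. (16.24).
The function names and the default orders are illustrative assumptions.

fit.arx <- function(x, e, p = 3, q = 3) {
  n <- length(x); m <- max(p, q); idx <- (m + 1):n
  Xlags <- sapply(1:p, function(i) x[idx - i])   # x(t-1), ..., x(t-p)
  Elags <- sapply(1:q, function(j) e[idx - j])   # e(t-1), ..., e(t-q)
  fit <- lm(x[idx] ~ cbind(Xlags, Elags) - 1)    # least squares, no intercept
  list(alpha = coef(fit)[1:p],                   # the alpha_i of Eq. (16.23)
       beta  = coef(fit)[(p + 1):(p + q)],       # the beta_j of Eq. (16.23)
       sd.z  = sd(residuals(fit)))               # sd of z_x(t)
}

arx.residual.sd <- function(y, e.y, alpha, beta) {
  p <- length(alpha); q <- length(beta)
  m <- max(p, q); n <- length(y); idx <- (m + 1):n
  pred <- rep(0, length(idx))
  for (i in 1:p) pred <- pred + alpha[i] * y[idx - i]
  for (j in 1:q) pred <- pred + beta[j] * e.y[idx - j]
  sd(y[idx] - pred)   # sd of the residual z_y(t): the damage-sensitive feature
}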

16.2.8 Sequential Probability Ratio Test for Damage Identification

When data are collected sequentially, the Sequential Probability Ratio Test (SPRT) is an appropriate
analysis technique. It allows decisions to be made faster, with less data, than a classical test. In a
classical hypothesis test, the number of samples is fixed at the beginning and the data are not analysed
while being collected; after all data are collected, the analysis is done and conclusions are drawn. In
a sequential test, by contrast, the data are analysed during collection, and the current value of the
test statistic is compared with threshold values to make the decision. A decision can therefore be made
while the data are being collected, often with fewer samples than the classical test would require.

Sequential Test

A sequential test is a method of statistical inference in which the number of observations required by
the procedure is not determined in advance [102]. The procedure accumulates a sequence of random
variables {𝑧𝑖} (𝑖 = 1, 2, . . .); the accumulated data set at stage 𝑛 is denoted

                𝑍𝑛 = (𝑧1, 𝑧2, . . . , 𝑧𝑛).

The data set in this study is the collection of the residual errors computed from the ARX model in the
previous section.

Sequential Probability Ratio Test (SPRT)

The sequential probability ratio test frequently results in a saving of about 50% in the number of
observations compared with the most efficient test based on a fixed number of observations. As in a
classical hypothesis test, the SPRT starts with a pair of hypotheses, say 𝐻0 and 𝐻1, the null and
alternative hypotheses respectively. At each step we update the cumulative sum of the log-likelihood
ratio

                𝑇𝑛 = 𝑇𝑛−1 + ln [𝑓(𝑧𝑛; 𝜎1)/𝑓(𝑧𝑛; 𝜎0)],

where 𝑓(𝑧𝑛; 𝜎1) is the probability density function of 𝑧𝑛 at 𝜎 = 𝜎1. The stopping rule is a simple
threshold scheme:

1. 𝑎 < 𝑇𝑛 < 𝑏: continue monitoring (critical inequality);

2. 𝑇𝑛 ≤ 𝑎: accept 𝐻0;

3. 𝑇𝑛 ≥ 𝑏: reject 𝐻0 (accept 𝐻1);

where the thresholds 𝑎 and 𝑏 (𝑎 < 0 < 𝑏) depend on the desired type I and type II error probabilities
𝛼 and 𝛽. They may be chosen as

                𝑎 ≅ log [𝛽/(1 − 𝛼)],    𝑏 ≅ log [(1 − 𝛽)/𝛼].

SPRT in Normal Distribution Assumption

In the damage detection problem, the standard deviation of the residual errors is the parameter of the
hypothesis test:

                𝐻0: 𝜎 ≤ 𝜎0    versus    𝐻1: 𝜎 ≥ 𝜎1,    where 0 < 𝜎0 < 𝜎1.

In the hypotheses above, if the standard deviation 𝜎 of the residual errors is equal to or less than the
user-specified value 𝜎0, that bridge location (where the sensor is mounted) is considered undamaged. If
instead 𝜎 is equal to or greater than the other user-specified value 𝜎1, the location is concluded to be
potentially damaged. There are several ways to choose the two user-specified standard deviations: one
can take them from a training database obtained from the structure, or initialize them from experiments
and then adjust them as more data become available.

If the modified observations {𝑡𝑖} (𝑖 = 1, 2, . . .) are defined by

                𝑡𝑖 = ln [𝑓(𝑧𝑖; 𝜎1)/𝑓(𝑧𝑖; 𝜎0)],

then the cumulative sum of the log-likelihood ratio 𝑇𝑛 is calculated by

                𝑇𝑛 = 𝑇𝑛−1 + ln [𝑓(𝑧𝑛; 𝜎1)/𝑓(𝑧𝑛; 𝜎0)],

where 𝑓(𝑧𝑛; 𝜎1) is the probability density function of 𝑧𝑛 at 𝜎 = 𝜎1. We may rewrite 𝑇𝑛 as

    𝑇𝑛 = ∑_{𝑖=1}^{𝑛} 𝑡𝑖 = ∑_{𝑖=1}^{𝑛} ln [𝑓(𝑧𝑖; 𝜎1)/𝑓(𝑧𝑖; 𝜎0)] = ∑_{𝑖=1}^{𝑛} [ln 𝑓(𝑧𝑖; 𝜎1) − ln 𝑓(𝑧𝑖; 𝜎0)]

        = ln ∏_{𝑖=1}^{𝑛} 𝑓(𝑧𝑖; 𝜎1) − ln ∏_{𝑖=1}^{𝑛} 𝑓(𝑧𝑖; 𝜎0)

        = ln 𝑓(𝑧1, 𝑧2, . . . , 𝑧𝑛; 𝜎1) − ln 𝑓(𝑧1, 𝑧2, . . . , 𝑧𝑛; 𝜎0)    (by independence of the 𝑧𝑖)

        = ln [𝑓(𝑧1, 𝑧2, . . . , 𝑧𝑛; 𝜎1)/𝑓(𝑧1, 𝑧2, . . . , 𝑧𝑛; 𝜎0)].

If 𝑇𝑛 ≥ 𝑏, then 𝑓(𝑧1, 𝑧2, . . . , 𝑧𝑛; 𝜎1) ≥ 𝑒^𝑏 · 𝑓(𝑧1, 𝑧2, . . . , 𝑧𝑛; 𝜎0): the joint probability density
of (𝑧1, 𝑧2, . . . , 𝑧𝑛) under standard deviation 𝜎1 is at least 𝑒^𝑏 times its value under 𝜎0, so we accept
𝐻1. The same argument applies to the two remaining stopping rules. Assuming that each residual 𝑧𝑖 is
normally distributed with mean 𝜇 and standard deviation 𝜎, 𝑡𝑖 can be calculated from 𝑧𝑖 as follows:

    𝑡𝑖 = ln [𝑓(𝑧𝑖; 𝜎1)/𝑓(𝑧𝑖; 𝜎0)]

        = ln [ (1/(𝜎1√(2𝜋))) 𝑒^{−(𝑧𝑖−𝜇)²/(2𝜎1²)} ] − ln [ (1/(𝜎0√(2𝜋))) 𝑒^{−(𝑧𝑖−𝜇)²/(2𝜎0²)} ]

        = ln [ (𝜎0/𝜎1) 𝑒^{ (𝑧𝑖−𝜇)²/(2𝜎0²) − (𝑧𝑖−𝜇)²/(2𝜎1²) } ]    (16.25)

        = ½ (𝑧𝑖 − 𝜇)² (𝜎0⁻² − 𝜎1⁻²) − ln(𝜎1/𝜎0).

A concrete data analysis with a realistic dataset is fully explored in Nguyen et al. [25].
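
For concreteness, a minimal R sketch of this SPRT under the normal assumption is given below, using the
closed form for 𝑡𝑖 in Eq. (16.25). The function name sprt.normal and the default error rates 𝛼 = 𝛽 =
0.05 are illustrative choices; z is the stream of ARX residuals.

sprt.normal <- function(z, sigma0, sigma1, alpha = 0.05, beta = 0.05, mu = 0) {
  a <- log(beta / (1 - alpha))    # lower threshold (negative)
  b <- log((1 - beta) / alpha)    # upper threshold (positive)
  Tn <- 0
  for (i in seq_along(z)) {
    ti <- 0.5 * (z[i] - mu)^2 * (sigma0^-2 - sigma1^-2) - log(sigma1 / sigma0)
    Tn <- Tn + ti                 # cumulative log-likelihood ratio, Eq. (16.25)
    if (Tn <= a) return(list(decision = "accept H0 (undamaged)", n = i))
    if (Tn >= b) return(list(decision = "accept H1 (damaged)",   n = i))
  }
  list(decision = "continue monitoring", n = length(z))
}

set.seed(3)
sprt.normal(rnorm(200, sd = 1.5), sigma0 = 1, sigma1 = 1.5)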



16.2.9 Closing remarks and open problems

Nevertheless, to reach a more definite conclusion, both theoretically and practically, the research team
would need to collaborate with bridge engineers to compare the results with mechanical approaches.
Regarding Questions 3, 4 and 5 raised at the beginning of Section 16.2.5, many tough problems remain. A
few interesting problems that could be investigated in the near future are the following.

1. Constant selection. How should the constants (𝜀1, 𝜀2, 𝜎0, 𝜎1, 𝑑𝑚𝑖𝑛, . . .) be chosen, for a given
   bridge, to obtain a highly reliable prediction level? A feasible empirical approach is to apply the
   proposed solution to other bridges, to gain more experience in choosing these values.

2. Efficient data-mining. When damage is discovered at certain positions on a bridge, besides repair or
   upgrading measures, what more efficient data-mining methods can be applied, and/or which stabilizers
   can be designed to automatically offset extreme frequencies or deviations caused by vehicle
   movements?

3. Sensor clustering and cost optimization. What happens if the impacts or outcomes of some sensors
   depend on other sensors? If that is the case, detecting possible dependence between sensors and then
   reducing the number of sensors used in the investigation may be worthwhile.

4. Finally, although we tried to use a better (mathematical and statistical) approach than the
   eigenvalue-based one to choose the number 𝑘 of PCs, the cross-validation method of Section 16.2.3 is
   not the best possible, and research in this direction still needs more attention.
List of Figures

1.1 Would we reveal information and knowledge behind this beautiful picture? . . . . . . . . 3

1.2 Complement and sub event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 General union rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 Visualization of a random variable 𝑋 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.5 The probability density function of Bin(50, 𝑝) with 𝑝 = .25, .50, .75 . . . . . . . . . . . . . 19

1.6 Histogram of Poisson (p.d.f bar plot)- Case 𝜆 = 4. . . . . . . . . . . . . . . . . . . . . . . 20

1.7 Geometric view of cumulative probability and expectation . . . . . . . . . . . . . . . . . 22

1.8 The probability density function 𝑓 of Uniform(𝑎, 𝑏) . . . . . . . . . . . . . . . . . . . . . . 24

1.9 The probability density function of N(𝜇, 𝜎) with 𝜇 = 10 and 𝜎 = 1, 2, 3 . . . . . . . . . . . 25

1.10 Areas with radius 1, 2 and 3 𝜎 - when data approximate Gauss distribution . . . . . . . 26

1.11 Symmetry of the standard Gauss distribution . . . . . . . . . . . . . . . . . . . . . . . . 30

1.12 The pdf with 𝛽 = 1 and 𝜈 = 0.5, 1, 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

1.13 The pdf of Weibull with 𝛼 = 1.5; 2 and 𝛽 = 1. . . . . . . . . . . . . . . . . . . . . . . . . . 35

1.14 The pdf 𝑓 (𝑥; 𝜈1 , 𝜈2 ) of Beta(𝜈1 , 𝜈2 ) when 𝜈1 = 2.5, 𝜈2 = 2.5; 𝜈1 = 2.5, 𝜈2 = 5.00. . . . . . 36

3.1 A popular graphical type distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.2 Pie-chart and Time series graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.3 Symmetric and asymmetric distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.4 Normal, steep, and flat distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3.5 Quantile plot of log yarn-strength data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3.6 Box whiskers plot of log yarn-strength data. . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.7 Histogram of log yarn strength. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3.8 Areas with radius 1, 2 and 3 𝜎 - when data approximate Gauss distribution . . . . . . . 92

4.1 Would we infer useful knowledge behind this beautiful picture? . . . . . . . . . . . . . . 99


4.2 Data collection and statistical inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


4.3 The curves of likelihood function and its log. . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4 Critical point of the upper tail probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.5 A CI of 𝜇 with Margin of Error 𝑅(𝛼) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.6 Different samples give distinct CI of 𝜇 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.7 Two-side testing with Z statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.8 Standard normal quantiles split the area under the Gauss density curve . . . . . . . . . 126
4.9 Probability density function of 𝑡[𝜈] with 𝜈 = 5, 50 . . . . . . . . . . . . . . . . . . . . . . . 129
4.10 Visualization of 𝑡-distribution with 2-tailed probability . . . . . . . . . . . . . . . . . . . . 129

5.1 Acceptance region 𝐴 and its complement 𝐴𝑐 . . . . . . . . . . . . . . . . . . . . . . . . 140


5.2 The learning process produces data and knowledge, that allows humans to solve problems. Source [61] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.3 The probability density function of N(𝜇, 𝜎) with 𝜇 = 10 and 𝜎 = 1, 2, 3 . . . . . . . . . . . 143
5.4 Testing hypotheses in quality control. [Source [?]] . . . . . . . . . . . . . . . . . . . . . . 148
5.5 Critical region for two side test with 𝑍 statistic . . . . . . . . . . . . . . . . . . . . . . . . 152
5.6 Critical region for two side test with 𝑍 statistic . . . . . . . . . . . . . . . . . . . . . . . . 154
5.7 Acceprtance area and rejection area in one side test . . . . . . . . . . . . . . . . . . . 157
5.8 Test a hypothesis using critical value approach . . . . . . . . . . . . . . . . . . . . . . . 158
5.9 Test a hypothesis using 𝑝 value approach . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.1 Sir R.A. Fisher, the inventor of Experimental Designs . . . . . . . . . . . . . . . . . . . 180


6.2 The pdf of 𝐹 (5, 5), 𝐹 (5, 15) and percentiles . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.3 A factorial design with 3 binary factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.4 Effect of spring coefficient on cycle time. (The 𝑦 axis corresponds to cycle time in minutes.) . 206
6.5 Interaction plot of piston weight with spring coefficient. . . . . . . . . . . . . . . . . . . . 207

8.1 P[(𝑋, 𝑌 ) ∈ 𝐴] = volume under density surface above 𝐴 . . . . . . . . . . . . . . . . 234

9.1 A sample correlation from a small data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246


9.2 Multiple scatterplots of the aluminum pins measurements. . . . . . . . . . . . . . . . . . 248
9.3 Residuals and the estimated regression line . . . . . . . . . . . . . . . . . . . . . . . . . 257
9.4 The sum 𝑆𝑆𝐸, the orange fitted line gives value minimum 𝑆𝑆𝐸 . . . . . . . . . . . . . 259
9.5 Dot diagram of the world population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
9.6 Linear model of world population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
9.7 Does house price depend only on area? . . . . . . . . . . . . . . . . . . . . . . . . . . . 264


9.8 Overfitted models and linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

9.9 Scatter diagram of water quality data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

10.1 Relationship of ISC values 𝑡1 and 𝑡2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

10.2 Figure c) gives the two-sided P value of the T distribution . . . . . . . . . . . . . . 289

10.3 Illustration for prediction interval for the individual response . . . . . . . . . . . . . . . . 292

10.4 Patterns of residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

10.5 Residual vs. predicted ISC values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296

11.1 The regression plane for the model 𝑌ˆ = 50 + 10𝑥1 + 7𝑥2 . . . . . . . . . . . . . . . . . . 301

11.2 Linear regression models with different shapes . . . . . . . . . . . . . . . . . . . . . . . 314

11.3 Linear and nonlinear model of the U.S. population regression . . . . . . . . . . . . . . . 316

11.4 Concrete dataset: BPH = Brown Plant-hoppers . . . . . . . . . . . . . . . . . . . . . . 322

12.1 An example of mixed flow of arrivals ( jobs, clients, cars ...)- Source [58] . . . . . . . . . 331

12.2 The log returns look like a Brownian motion. . . . . . . . . . . . . . . . . . . . . . . . . . 335

12.3 The VNM stock price process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336

12.4 The DPM stock price process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336

12.5 Visualization of a stochastic process 𝑋 . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

12.6 A sample path of CPU usage up to time 𝑡 . . . . . . . . . . . . . . . . . . . . . . . . . . 340

12.7 Sample paths of CPU usage after time 𝑡 . . . . . . . . . . . . . . . . . . . . . . . . . . . 340

12.8 Diagram for birth and death process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

13.1 Generating values of a random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

14.1 The curve of ℎ(𝑗; 500, 350, 100). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

14.2 State-transition-rate diagram for birth and death process. . . . . . . . . . . . . . . . . . 424

16.1 The table of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458

16.2 New factors obtained after screening data . . . . . . . . . . . . . . . . . . . . . . . . . . 460

16.3 R code - the first part . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466

16.4 R code continue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

16.5 ANOVA result of the GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468

16.6 Summary of GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468

16.7 Residuals vs Fitted plot of GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469

16.8 Normal Q-Q plot of GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469



16.9 ANOVA result of GLM.NB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
16.10 Summary of GLM.NB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
List of Tables

1.1 Tabulated values of Φ(𝑥) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1 GDP in USA, UK, Mexico and India via monthly income . . . . . . . . . . . . . . . . . . 69

3.2 Market share of car producers in Thailand . . . . . . . . . . . . . . . . . . . . . . . . 71

3.3 Annual incomes per capita of four countries . . . . . . . . . . . . . . . . . . . . . . . 72

3.4 Frequency distribution of 𝑥 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.5 Frequency distributions of the days needed to deliver products . . . . . . . . . . . . . . 82

3.6 Sample data for the sound equipment store . . . . . . . . . . . . . . . . . . . . . . . . . 93

3.7 Raw grades of a class of 40 students . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.1 Tabulated values of Laplace function Φ(𝑧) . . . . . . . . . . . . . . . . . . . . . . . . . . 125

4.2 Crtical value for one-sided test, 𝛼 is significant level . . . . . . . . . . . . . . . . . . . . 127

4.3 The percentile 𝛼 of probability density function of 𝑇 . . . . . . . . . . . . . . . . . . . . . 130

5.1 Tabulated values of Laplace function Φ(𝑧) . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5.2 The percentile 𝛼 of probability density function of 𝑇 . . . . . . . . . . . . . . . . . . . . . 172

5.3 Data produced from binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

5.4 Data produced from Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

6.1 CRD design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

6.2 CRD design - experimental data [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

6.3 ANOVA of the factor of hardwood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

6.4 ANOVA table for RCBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

6.5 ANOVA for a BIBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

6.6 Block sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

6.7 Values of 𝑌𝑖𝑙 , 𝑖 ∈ 𝐵𝑖 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193


6.8 The set 𝑇𝑗 and the statistics 𝑊𝑗 , 𝑊𝑗* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194


6.9 ANOVA for BIBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.10 Mean Effects and their S.E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.11 Table of ANOVA for a 2-factor factorial experiment . . . . . . . . . . . . . . . . . . . . . 198
6.12 Treatment combinations of a 23 experiment . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.13 The labels in standard order for a 25 factorial design . . . . . . . . . . . . . . . . . . . . 201
6.14 Treatment means in a 22 design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.15 Two-way ANOVA for cycle time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.16 An orthogonal array with 11 binary factors . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.17 Eight factors, the number of levels and the level meanings . . . . . . . . . . . . . . . . 212
6.18 A mixed orthogonal design with 3 distinct sections . . . . . . . . . . . . . . . . . . . . 214

7.1 ANOVA of the factor 𝐴 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220


7.2 ANOVA for the Two-Stage Nested Design . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.3 Coded Purity Data for Example (Code: 𝑦𝑖𝑗𝑘 = Purity − 93) . . . . . . . . . . . . . . 225
7.4 Expected Mean Squares in the Two-Stage Nested Design . . . . . . . . . . . . . . . . . 225

8.1 Data of insurance agency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

9.1 Sample covariances of aluminum pins variables . . . . . . . . . . . . . . . . . . . . . . . 247


9.2 Sample correlations of aluminum pins variables . . . . . . . . . . . . . . . . . . . . . . . 248
9.3 Data of catalyst 𝑋 and response 𝑌 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

10.1 ISC Values of solar cells at three time epochs . . . . . . . . . . . . . . . . . . . . . . . . 274


10.2 Simple ANOVA table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
10.3 Observed and predicted values of ISC at time 𝑡2 . . . . . . . . . . . . . . . . . . . . . . 295

12.1 Data of VNM and DPM stock price in Quater 1, 2013. . . . . . . . . . . . . . . . . . . . 334

13.1 Simple random walk from Coin tossing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398

14.1 Poisson distribution of dust particles in the atmosphere . . . . . . . . . . . . . . . . . . 408

B.1 The cycle of piston with control factors are set to minimum values . . . . . . . . . . . . . 494

C.1 Commonly used links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501



Appendix A

Transform Methods

We recall the probability-generating function and the Laplace transform.

A.1 Probability-generating function of a discrete variable

Consider the (point) probability distribution of a discrete r.v. 𝑋, with values 𝑗 ∈ Range(𝑋) = ℕ and
pmf P[𝑋 = 𝑗] = 𝑝𝑗.

Probability-generating function

The probability-generating function (p.g.f. or PGF) of 𝑋 is defined by

                𝐺(𝑧) = ∑_{𝑗=0}^{∞} 𝑝𝑗 𝑧^𝑗 = E[𝑧^𝑋],    (A.1)

where E is the expectation operator and 𝑧 is called a dummy variable. Obviously

                𝐺(1) = ∑_{𝑗=0}^{∞} 𝑝𝑗 = 1.    (A.2)

Theorem A.1. (Usages of the PGF-transform)

For a discrete r.v. 𝑋, with pmf 𝑝𝑗 = P[𝑋 = 𝑗] and pgf 𝐺(𝑧) that is differentiable at the points
𝑧 = 0 and 𝑧 = 1, we have the following.

1. Usage 1: recovering moments of the pmf via differentiation of 𝐺 at 𝑧 = 1, as shown below:


                𝑑𝐺(𝑧)/𝑑𝑧 = ∑_{𝑗=1}^{∞} 𝑗 𝑧^{𝑗−1} 𝑝𝑗  ⟹  𝐺′(1) = 𝑑𝐺(𝑧)/𝑑𝑧 |_{𝑧=1} = ∑_{𝑗=0}^{∞} 𝑗 𝑝𝑗 = E[𝑋] = 𝜇𝑋,    (A.3)

                𝑑²𝐺(𝑧)/𝑑𝑧² |_{𝑧=1} = 𝐺″(1) = ∑_{𝑗=2}^{∞} 𝑗(𝑗 − 1) 𝑝𝑗 = E[𝑋²] − E[𝑋].

2. Usage 2: computing the pmf via differentiation of 𝐺 at 𝑧 = 0:


                𝑝𝑛 = (1/𝑛!) 𝑑^𝑛𝐺(𝑧)/𝑑𝑧^𝑛 |_{𝑧=0},    𝑛 = 0, 1, 2, . . .    (A.4)

Example A.1. (Poisson Distribution)

Find the probability generating function of variable 𝑋 ∼ Pois(𝜆).

Its probability generating function is given by

                𝐺(𝑧) = ∑_{𝑗=0}^{∞} 𝑝𝑗 𝑧^𝑗 = ∑_{𝑗=0}^{∞} (𝑒^{−𝜆} 𝜆^𝑗/𝑗!) 𝑧^𝑗 = 𝑒^{−𝜆} ∑_{𝑗=0}^{∞} (𝜆𝑧)^𝑗/𝑗! = 𝑒^{𝜆(𝑧−1)}.

Use Usage 1 above to check that

                E[𝑋] = V[𝑋] = 𝜆. 
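
As a quick check of Usage 1: since 𝐺(𝑧) = 𝑒^{𝜆(𝑧−1)},

                𝐺′(𝑧) = 𝜆 𝑒^{𝜆(𝑧−1)}  ⟹  𝐺′(1) = 𝜆 = E[𝑋],
                𝐺″(𝑧) = 𝜆² 𝑒^{𝜆(𝑧−1)}  ⟹  𝐺″(1) = 𝜆² = E[𝑋²] − E[𝑋],

so V[𝑋] = E[𝑋²] − (E[𝑋])² = (𝜆² + 𝜆) − 𝜆² = 𝜆.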

A binomial variable describes a random variable 𝑋 that counts the number of successes in 𝑛 independent
and identical Bernoulli trials with probability of success 𝑝. In other words, 𝑋 is a sum of 𝑛 i.i.d.
Bernoulli random variables:

                𝑋 = 𝐵1 + 𝐵2 + · · · + 𝐵𝑛, where each 𝐵𝑖 ∼ 𝐵 ∼ B(𝑝).

Therefore 𝑋 takes values in the range

                Range(𝑋) = {0, 1, 2, . . . , 𝑛},

and its distribution is given by the probability mass function

                𝑝𝑘 = P[𝑋 = 𝑘] = (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘}.    (A.5)
It is easy to check that the mean and variance are

E(𝑋) = 𝑛𝑝, V(𝑋) = 𝑛𝑝(1 − 𝑝).

Example A.2. (Binomial Distribution)

Find the probability generating function of the variable 𝑋 ∼ Bin(𝑛, 𝑝).

You can check that the probability generating function of 𝑋 is

                𝐺(𝑧) = ∑_{𝑘=0}^{𝑛} (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘} 𝑧^𝑘 = (1 − 𝑝 + 𝑝𝑧)^𝑛. 

Example A.3. (Geometric Distribution)

Find the probability generating function of the geometric variable 𝑋 ∼ Geom(𝑐).

The geometric variable 𝑋 ∼ Geom(𝑐) describes the number of independent and identical Bernoulli trials
until the first success occurs. With 𝑐 = P[‘success’] and 𝑞 = 1 − 𝑐, its probability mass function is

                𝑝𝑘 = P[𝑋 = 𝑘] = 𝑐 (1 − 𝑐)^{𝑘−1} = (1 − 𝑞) 𝑞^{𝑘−1},    𝑘 ≥ 1.    (A.6)

Its probability generating function is given by

                𝐺(𝑧) = ∑_{𝑘=1}^{∞} 𝑝𝑘 𝑧^𝑘 = (1 − 𝑞) 𝑧 ∑_{𝑘=0}^{∞} (𝑞𝑧)^𝑘 = ?

A.2 Laplace transform for continuous variable

Besides the PGF, we will make extensive use of the Laplace–Stieltjes transform, or simply the Laplace
transform. Let 𝑓(𝑡) = 𝑓𝐴(𝑡) be the pdf of a continuous random variable 𝐴 that takes only non-negative
values; that is, 𝑓𝐴(𝑡) = 0 for 𝑡 < 0.
The Laplace transform of 𝐴, or of 𝑓(𝑡), denoted by 𝐴(𝑠), is defined by

                𝐿𝑇(𝐴) = 𝐴(𝑠) = E[𝑒^{−𝑠𝐴}] = ∫_0^∞ 𝑒^{−𝑠𝑡} 𝑓(𝑡) 𝑑𝑡 = ∫_0^∞ 𝑒^{−𝑠𝑡} 𝑑𝐹(𝑡),    (A.7)
where 𝐹 = 𝐹𝐴 is the cdf of 𝐴.

1. When the Laplace transform 𝐴(𝑠) is evaluated at 𝑠 = 0, its value is 𝐴(0) = 1:

                𝐴(𝑠)|_{𝑠=0} = ∫_0^∞ 𝑓(𝑡) 𝑑𝑡 = 1.

2. The Laplace transform operator 𝐿𝑇 is linear, since for constants 𝑎, 𝑏:

𝐿𝑇 (𝑎𝐴 + 𝑏𝐵) = 𝑎𝐿𝑇 (𝐴) + 𝑏𝐿𝑇 (𝐵)

One of the primary usages of LT 𝐴(𝑠) is to derive the moments of distributions.


Theorem A.2. (Laplace transform gives moments)

Let E[𝐴^𝑛] (𝑛 ≥ 1) be the 𝑛-th moment of a continuous random variable 𝐴. Taking successive
derivatives of the Laplace transform 𝐿𝑇(𝐴) = 𝐴(𝑠) and evaluating them at 𝑠 = 0, we obtain the
following:

                𝑑𝐴(𝑠)/𝑑𝑠 = −∫_0^∞ 𝑡 𝑒^{−𝑠𝑡} 𝑓(𝑡) 𝑑𝑡  ⟹  𝑑𝐴(𝑠)/𝑑𝑠 |_{𝑠=0} = 𝐴′(0) = −E[𝐴],

                𝑑²𝐴(𝑠)/𝑑𝑠² = ∫_0^∞ 𝑡² 𝑒^{−𝑠𝑡} 𝑓(𝑡) 𝑑𝑡  ⟹  𝑑²𝐴(𝑠)/𝑑𝑠² |_{𝑠=0} = 𝐴″(0) = E[𝐴²],    (A.8)
                ⋮
                𝑑^𝑛𝐴(𝑠)/𝑑𝑠^𝑛 |_{𝑠=0} = (−1)^𝑛 E[𝐴^𝑛].

Example A.4. (Laplace transform of Exponential Distribution)

Let 𝐴 ∼ E(𝜇) denote an exponentially distributed random variable with rate 𝜇. Its pdf is
𝑓(𝑡) = 𝜇 𝑒^{−𝜇𝑡}, 𝑡 ≥ 0.
The Laplace transform of 𝐴, or of 𝑓(𝑡), is

                𝐴(𝑠) = E[𝑒^{−𝑠𝐴}] = ∫_0^∞ 𝑒^{−𝑠𝑡} 𝑓(𝑡) 𝑑𝑡 = ∫_0^∞ 𝑒^{−𝑠𝑡} 𝜇 𝑒^{−𝜇𝑡} 𝑑𝑡 = 𝜇/(𝜇 + 𝑠).    (A.9)
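
As a quick illustration of Theorem A.2: with 𝐴(𝑠) = 𝜇/(𝜇 + 𝑠),

                𝐴′(𝑠) = −𝜇/(𝜇 + 𝑠)²  ⟹  E[𝐴] = −𝐴′(0) = 1/𝜇,
                𝐴″(𝑠) = 2𝜇/(𝜇 + 𝑠)³  ⟹  E[𝐴²] = 𝐴″(0) = 2/𝜇²,

so V[𝐴] = E[𝐴²] − (E[𝐴])² = 2/𝜇² − 1/𝜇² = 1/𝜇².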

The Laplace transforms of the pdfs of random variables will be particularly important for the analysis
of M/G/1 queuing systems.



Appendix B

Statistical Software and Computation


Powerful tools for dealing with large data

B.1 Introductory R- a popular Statistical Software

Why do we use the statistical software R?

1. Common commercial statistical software packages: SAS, SPSS, Stata, Statistica, Gauss, S-Plus (S+).
All are very costly.

2. R is a newer program, and FREE:

– a free counterpart of S+, http://cran.R-project.org

3. R is a statistical language:
– it can perform all common statistical functions
– it has an interactive interface
All R applications mentioned in this book are contained in a package called mistat, available for
download from CRAN (https://cran.r-project.org/).
Remark 7. Functions in R are called with parentheses. The function library() loads an extra package
to extend the functionality of R, and CYCLT is an object that contains a vector of many numeric
values.
Practical Problem 3. A piston is a mechanical device that is present in most types of engines. One
measure of the performance of a piston is the time it takes to complete a cycle; we call this measure
the cycle time.
We provide R code here to simulate operations of the piston.

> # This is a comment


> install.packages("mistat", # Install mistat package
dependencies=TRUE) # and its dependencies
> library(mistat) # A command to make our datasets
> # and functions available
> data(CYCLT) # Load specified data set

> # CYCLT is a vector of values


> help(CYCLT) # Read the help page about CYCLT
> CYCLT # Print CYCLT to Console

Table B.1: The cycle of piston with control factors are set to minimum values

1.008 1.117 1.141 0.449 0.215


1.098 1.080 0.662 1.057 1.107
1.120 0.206 0.531 0.437 0.348
0.423 0.330 0.280 0.175 0.213
1.021 0.314 0.489 0.482 0.200
1.069 1.132 1.080 0.275 0.187
0.271 0.586 0.628 1.084 0.339
0.431 1.118 0.302 0.287 0.224
1.095 0.319 0.179 1.068 1.009
1.088 0.664 1.056 1.069 0.560

R screenshot and Environments

1. Prompt: >

2. Current working directory: getwd()

3. Change working directory: setwd("c:/stats")

4. Getting help: text after the symbol # is a comment for the human reader; R does not interpret it.

> ?sd; # or equivalently we use


> help(sd);
> ?quantile; # used in Chapter 2
> ?t.test # will be used in Chapter 5
> ?lm; help(lm) # will be used in Chapter 8


R Practicality

R Grammar- Operations

1. x = 5    # variable x receives value 5

2. x == 5   # x equals 5

3. x != 5   # x is not equal to 5

4. y < x    # y is less than x

5. p <= 1   # p is less than or equal to 1

6. A & B    # A and B

R as a number generator. Sequences: use seq(from, to, by=)

Generate a variable with numbers ranging from 1 to 12:

> x <- (1:12);


[1] 1 2 3 4 5 6 7 8 9 10 11 12
> seq(12)
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

R- Data types and Input data


Two popular Data types: data vector (column) and data frame

Two most Input data methods: entry by c() and by edit(data.frame())

First data vector:


> natri = c(5,6,6,4,8,7,7,7,8,6,10);
> # Data entry by c(); c() combines values into a vector
> mean(natri); sd(natri); # the sample mean and sample standard deviation
> x = natri; x; # copy the data to another variable

R- Dataframe

• Dataset −→ data.frame

• columns −→ variables

• rows −→ observations

• age = c(23, 43, 17, 52, 28, 31, 15, 31) # unit: years

• insulin <- c(10, 12, 19, 23, 21, 20, 17, 10) # unit: mg/litre

• insudata <- data.frame(age, insulin);

• insudata;

Data frame: insudata


Variables: age, insulin
Number of observations: 8
R- Input data: c() and edit(data.frame())
Input data methods: by column vector c() and edit(data.frame())

age = c(23, 43, 17, 52, 28, 31, 15, 31)


• insulin <- c(10, 12, 19, 23, 21, 20, 17, 10)

• ins <- edit(data.frame()); # create an empty frame


• ins;
age insulin Name
1 43 12 Min
2 17 19 Aren
3 52 23 Air
4 28 21 Aim
5 31 20 Tan
• ins.new <- edit(data.frame(insudata));

# modify an existing data frame (note: hyphens are not allowed in R object names)

R - Basic plots


• hist(age); # draws a histogram of a single variable

• hist(insulin);

• hist(insudata); # Error! hist() handles a single numeric variable, not a data frame. What to do?

• plot(insudata); # a two-column data frame is plotted with plot()

We discuss how to encode or group continuous values into discrete or ordinal values in the next parts.
R - Advanced commands Transform continuous values to discrete values: how to?

gdpi <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11);
# GDP growth index
# negative values: a contracting economy
# positive values: a growing economy

• diagnosis <- gdpi;

• diagnosis[gdpi <= 0] = 0

• diagnosis[gdpi > 0 & gdpi <= 2.0] = 1

• diagnosis[gdpi > 2.0] = 2

• healthEconomies <- data.frame(gdpi, diagnosis);

• healthEconomies;
# the larger the index, the healthier the economy

B.2 Basic R commands for Statistical Analysis

The full description of probability distributions is presented in Chapter ??. Here we only show how to
obtain numerical samples using key R commands.

Generating random samples

SYNTAX:
# rxyzts(parameters) generates a random sample from the distribution named xyzts

• Gaussian: b = rnorm(k, mean, sd);

b = rnorm(37, 1.65, 0.5); # a sample of 37 numbers following the Gaussian N(1.65, 0.5)

• Binomial: x = rbinom(k, n, p);

x = rbinom(18, 9, 0.5); # a sample of 18 numbers following Bin(9, 0.5)

• Poisson: y = rpois(n, lambda);

y = rpois(8, 12); # a sample of 8 numbers following Pois(𝜆) with rate 𝜆 = 12

• Fisher: z = rf(n, n1, n2);

z = rf(6, n1, n2); # a sample of 6 numbers following the Fisher 𝐹[𝑛1, 𝑛2]

Computing distributions

SYNTAX:
# dxyzts(parameters) computes the probability mass function / p.d.f. of the distribution xyzts
# pxyzts(parameters) computes the c.d.f. (cumulative distribution function)
# qxyzts(parameters) gives the quantile function, the inverse of the cdf

𝑥* is the 𝑝-th quantile of a distribution if

                P[𝑋 < 𝑥*] = 𝑝 ⇔ 𝐹(𝑥*) = 𝑝 ⇔ 𝑥* = 𝐹^{−1}(𝑝)

• Binomial: dbinom(k, n, p); # gives the p.m.f. at k = 4 for a sum of n = 7 Bernoulli(p = 0.5) variables

dbinom(4, size = 7, prob = 0.5);

• Gaussian: pnorm(t, m, s);

E.g., the probability that a male height is less than or equal to 180 cm, given that the Gauss
distribution has mean = 175 and sd = 5:

t = 180; m = 175; s = 5; height.prob = pnorm(t, m, s); height.prob;

• Fisher: what is the upper a = 5% = 0.05 critical value for the Fisher distribution with degrees of
freedom n1 and n2?

a = 0.05; n1 = 16; n2 = 21; qf(1-a, n1, n2); 1 - pf(2.156, n1, n2);



Appendix C

Generalized Linear Model

The linear model describes a continuous response but cannot handle discrete or skewed responses such as
binary or count data. Generalized linear models (GLMs) extend the linear modeling ideas to address this
problem. Examples of GLM-based techniques include regression analysis, analysis of variance (ANOVA),
analysis of covariance (ANCOVA), etc.

C.1 Model components

A GLM has three components: a random component, a systematic component, and a link function.
I) Random component: GLMs assume the responses come from a distribution that belongs to an
exponential family of distributions, also called the exponential dispersion model family (EDMs).
Continuous EDMs include the normal and gamma distributions. Discrete EDMs include the Poisson,
binomial and negative binomial distributions. The EDM enables GLMs to be fitted to a wide range of
data types, including binary data, proportions, counts, and positive continuous data.
Definition C.1.

Consider a random variable 𝑌 whose pdf 𝑓 depends on parameters 𝜃 and 𝜑. The distribution
belongs to the EDM family if it can be written as

                𝑓(𝑦) = 𝑓(𝑦; 𝜃, 𝜑) = 𝑎(𝑦, 𝜑) exp{ [𝑦𝜃 − 𝑏(𝜃)] / 𝜑 },    (C.1)

or equivalently

                𝑓(𝑦) = 𝑓(𝑦; 𝜃, 𝜑) = exp{ [𝑦𝜃 − 𝑏(𝜃)] / 𝜑 + 𝑐(𝑦, 𝜑) },    (C.2)

where
where

• 𝜃 is called the canonical (natural) parameter, specific to observations 𝑌𝑖 , which will carry infor-
mation from the explanatory variables,

• 𝜑 > 0 is the scale or dispersion parameter, common to all 𝑦,


• 𝑏(𝜃) is a known function, and is called the cumulant function,

• 𝑎(𝑦, 𝜑) is a normalizing function ensuring that (C.1) is a probability function.

The choice of 𝑏(𝜃) determines the response distribution. Given 𝜃, 𝑦 is a draw from the
exponential-family density specified by 𝜃.

NOTATIONS:

1. The notation 𝑌 ∼ EDM(𝜇, 𝜑) indicates that the responses come from the EDM family (C.1), with
mean 𝜇 and dispersion parameter 𝜑. The corresponding domain of 𝜇 is denoted Ω.

2. The support of 𝑌 is denoted by S = Range(𝑌 ) (the set of its possible values), where S does not
depend on the parameters 𝜃 and 𝜑.

3. The domain of 𝜃, denoted Θ ⊂ R, is an open interval satisfying 𝑏(𝜃) < ∞ that includes zero.

II) The systematic component specifies how the explanatory variables enter the model; their linear
combination is called the linear predictor:

                𝜂 = X𝛽 = 𝛼 + 𝛽1𝑥1 + · · · + 𝛽𝑝𝑥𝑝    (C.3)

III) The link function 𝑔(·) describes the relationship between the mean of the response variable and
the systematic component:

                𝑔(𝜇) = 𝑔(E[𝑌]) = 𝜂 = 𝛼 + 𝛽1𝑥1 + · · · + 𝛽𝑝𝑥𝑝    (C.4)

• Often 𝑔 is monotonic and differentiable, such as the identity function, the logarithm log 𝜇, or the
square root √𝜇.

• This systematic component 𝑔(𝜇) = 𝜂 shows that GLMs are regression models linear in the parame-
ters 𝛽 = (𝛽0 , 𝛽1 , 𝛽2 , · · · , 𝛽𝑝 )𝑇 .

• The canonical link function is a special link function, the function 𝑔(𝜇) such that 𝜂 = 𝜃 = 𝑔(𝜇).

Example C.1 (In linear models, the Gaussian response is an EDM ).

The classic linear model has three parts, viewed as a GLM:


1. The Random component: the components of Y have independent Normal distributions with
E[𝑌 ] = 𝜇 and constant variance 𝜎 2 ;

2. The Systematic component: covariates 𝑥1, 𝑥2, · · · , 𝑥𝑝 produce a linear predictor 𝜂 given by

                𝜂 = 𝑥⊤𝛽 = 𝛽0 + ∑_{𝑗=1}^{𝑝} 𝛽𝑗𝑥𝑗;

3. The link between the random component and systematic component

𝜇 = 𝜂.

Hence, for classic linear models, the response 𝑌 is Normal (Gaussian) distributed in Item 1, and
the link in Item 3 is the identity function (the mean and the linear predictor are identical). Specifically,
𝑌 ∼ N(𝜇, 𝜎²), with link function 𝑔(𝜇) = 𝜇 = 𝜂.
Commonly used link functions 𝑔(𝜇) are given in Table C.1. 

Table C.1: Commonly used links

    Link        𝑔(𝜇)                   Canonical link for distribution
    identity    𝜇                      Gaussian (normal)
    log         log 𝜇                  Poisson
    logit       log[𝜋/(1 − 𝜋)]         binomial, 𝜋 = P[𝑌 = 1]

Brief distributions of the exponential family

Key distributions of the exponential family include the Binomial, the Poisson, and the Gaussian (Normal).

1. The Poisson variable 𝑌 ∼ Pois(𝜆) with parameter 𝜆 has pmf

                𝑓(𝑦; 𝜆) = P[𝑌 = 𝑦] = 𝑒^{−𝜆} 𝜆^𝑦/𝑦!,    𝑦 = 0, 1, 2, . . .    (C.5)
Expectation and variance of Poisson variable Pois(𝜆) respectively are

E[𝑌 ] = 𝜇 = 𝜆 > 0, V[𝑌 ] = E[(𝑌 − E[𝑌 ])2 ] = 𝜆. (C.6)

In the form of (C.1),

                𝑓(𝑦; 𝜆) = 𝑓(𝑦; 𝜇) = (1/𝑦!) exp{ 𝑦 log 𝜇 − 𝜇 },    (C.7)

showing that 𝜃 = log 𝜇 = 𝜂 is the canonical parameter, 𝑏(𝜃) = 𝜇 = 𝑒^𝜃, and 𝜑 = 1.

This logarithmic link ensures that 𝜂 (which may take any real value) always maps to a positive value
of 𝜇.


The normalizing function is 𝑎(𝑦, 𝜑) = 1/𝑦!. The Poisson distribution 𝑌 ∼ Pois(𝜆) is thus an EDM.
𝑌 ∼ Pois(𝜆) is used to model count data. Typically these are the number of occurrences of some
event in a defined time period or space, when the probability of an event occurring in a very small
time (or space) interval is low and the events occur independently.

Over-dispersion: Real data that might be plausibly modeled by the Poisson distribution often have
a larger variance and are said to be overdispersed, and the model may have to be adapted to reflect
this feature.

2. The Bernoulli distribution belongs to the exponential family of distributions. The Bernoulli and
Poisson distributions have no dispersion parameters, so for these distributions we take 𝜑 ≡ 1.

C.2 Model’s Formula

A generalized linear model is made up of a linear predictor

                𝜂𝑖 = 𝛽0 + 𝛽1𝑥1𝑖 + · · · + 𝛽𝑝𝑥𝑝𝑖    (C.8)

and two functions:

• a link function that describes how the mean 𝜇𝑖 = E[𝑌𝑖] depends on the linear predictor,

                𝑔(𝜇𝑖) = 𝜂𝑖;    (C.9)

• a variance function that describes how the variance var[𝑌𝑖] depends on the mean,

                var[𝑌𝑖] = 𝜑 𝑉(𝜇𝑖),    (C.10)

where the dispersion parameter 𝜑 is a constant.

Most of the commonly used statistical distributions, e.g. Gaussian, binomial, and Poisson, are members
of the exponential family of distributions, whose densities can be written in the form

                𝑓(𝑦; 𝜃, 𝜑) = exp{ [𝑦𝜃 − 𝑏(𝜃)] / 𝜑 + 𝑐(𝑦, 𝜑) },    (C.11)

where 𝜑 is the dispersion parameter and 𝜃 is the canonical parameter. It can be shown that

                E[𝑌] = 𝑏′(𝜃) = 𝜇    (C.12)

and

                var[𝑌] = 𝜑 𝑏″(𝜃) = 𝜑 𝑉(𝜇).    (C.13)
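
For instance, for the Poisson distribution 𝜃 = log 𝜇, so 𝑏(𝜃) = 𝑒^𝜃; then (C.12) gives E[𝑌] = 𝑏′(𝜃) =
𝑒^𝜃 = 𝜇 and, with 𝜑 = 1, (C.13) gives var[𝑌] = 𝑏″(𝜃) = 𝑒^𝜃 = 𝜇, i.e. the Poisson variance function is
𝑉(𝜇) = 𝜇.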


C.3 Poisson Generalized Linear Models

We study counts when the individual events being counted are independent, or nearly so, and where
there is no clear upper limit for the number of events that can occur, or where the upper limit is very
much greater than any of the actual counts. The Poisson distribution with expected count 𝜇 = E[𝑌] > 0
has the pmf

                𝑝(𝑦; 𝜇) = P[𝑌 = 𝑦] = 𝜇^𝑦 𝑒^{−𝜇}/𝑦!,    𝑦 = 0, 1, 2, . . .    (C.14)
It has already been established as an EDM.
The most common link function used for Poisson GLMs is the logarithmic link function, which ensures
𝜇 > 0 and enables the regression parameters to be interpreted as having multiplicative effects. Using
the logarithmic link, the general form of a Poisson GLM is

                𝑦 ∼ Pois(𝜇),    𝑔(𝜇) = log 𝜇 = 𝛽0 + ∑_{𝑗=1}^{𝑝} 𝛽𝑗𝑥𝑗 = 𝑥⊤𝛽.    (C.15)

The systematic component can be written as

                𝜇 = exp( 𝛽0 + ∑_{𝑗=1}^{𝑝} 𝛽𝑗𝑥𝑗 ) = (exp 𝛽0) · (exp 𝛽1)^{𝑥1} · · · (exp 𝛽𝑝)^{𝑥𝑝}.
This shows that the impact of each explanatory variable is multiplicative.

• Increasing 𝑥𝑗 by one multiplies 𝜇 by a factor of exp(𝛽𝑗).

• If 𝛽𝑗 = 0 then 𝑒𝑥𝑝(𝛽𝑗 ) = 1 and 𝜇 is not related to 𝑥𝑗 .


Knowledge box 13.

1. A Poisson GLM is denoted GLM (Pois; link), and is specified in R

using family=poisson() in the glm() call.

2. When the explanatory variables 𝑥𝑗 are all qualitative (that is, factors), the data can be summa-
rized as a contingency table and the model is often called a log-linear model.

3. If any of the explanatory variables are quantitative (that is, covariates), the model is often called
a Poisson regression model.
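
A minimal R illustration of these points, with simulated data (the coefficients 0.5 and 1.2 are
arbitrary choices):

set.seed(4)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))   # counts from a true log-link model

fit <- glm(y ~ x, family = poisson())          # the log link is the default
summary(fit)
exp(coef(fit))   # multiplicative effects: exp(beta_j) per unit increase in x_j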

C.4 Measuring the goodness of fit and Comparing models

Having seen how to fit a given parametric model, a major decision remains: which parametric model
should we fit? There are many metrics for comparing models, so not only must we choose a model, we
must also choose the metric by which to choose it. In this section we present various statistics and
techniques useful for assessing the fit of a GLM.

C.4.1 What is model fitting?

The process of fitting a model to data may be regarded as a way of replacing a set of data values 𝑦 by
a set of fitted values 𝜇̂ derived from a model involving, usually, a relatively small number of parameters.
Measures of discrepancy or goodness of fit may be formed in various ways, but we shall be primarily
concerned with the one formed from the logarithm of a ratio of likelihoods, called the deviance.
Given 𝑛 observations, we can fit models containing up to 𝑛 parameters.

• The full model 𝑀0 has 𝑛 parameters, one per observation, and the 𝜇̂𝑗 derived from it match the data
exactly. The full model gives us a baseline for measuring the discrepancy of an intermediate model
with 𝑝 < 𝑛 parameters.

• The fitted model, 𝑀1, under consideration has 𝑝 < 𝑛 parameters; in the simplest case a single
parameter, representing a common 𝜇 for all the 𝑦s.

C.4.2 What is Deviance?

In general, the discrepancy of a fit is proportional to

                Δ := 2[𝑙0 − 𝑙1],

twice the difference between the maximum log-likelihood achievable and that achieved by the model
under investigation 𝑀1; see Equation (C.16). The deviance is in essence the logarithm of a ratio of
two likelihoods, i.e. a log-likelihood ratio.
In estimation based on maximum likelihood (ML), a standard assessment is to compare the fitted
model with a fully specified model (a model with as many independent parameters as observations).
The scaled deviance Δ is given in terms of the likelihoods 𝐿_{𝑀0}, 𝐿_{𝑀1} of the full model 𝑀0 and
the fitted model 𝑀1 by

                Δ = −2 log(𝐿_{𝑀1}/𝐿_{𝑀0}) = 2[log 𝐿_{𝑀0} − log 𝐿_{𝑀1}] = 2[𝑙0 − 𝑙1].    (C.16)

This quantity is the deviance of these models. In brief, the deviance is a statistic measuring the
current fit against the best possible fit: it is twice the difference between

• the log-likelihood of the best fit model, named full or saturated model, and

• the log-likelihood of the fitted model under consideration.

Deviance for GLMs. Consider the model (or null hypothesis)

𝑀0 : 𝑔(𝜇) = X0 𝛽 0

against
𝑀1 : 𝑔(𝜇) = X1 𝛽 1


Our goal in assessing deviance is to determine the utility of the parameters added to the null model.
When working with GLMs in practice, it is useful to have a quantity that can be interpreted in a similar
way to the residual sum of squares in ordinary linear modeling.
The deviance, Dev, is defined as:

Dev = ∆ = 2(𝑙0 − 𝑙1 ) = 2[L(𝑦; 𝑦) − L(𝑦; 𝜇)] (C.17)

where

• 𝑙0 = L(𝑦; 𝑦) is the log-likelihood of the full model or saturated model, denoted 𝑀0 (with a single
parameter for each observation); and

• 𝑙1 = L(𝑦; 𝜇) is the log-likelihood for the fitted model, denoted 𝑀1 under consideration.

The notation L(𝑦; 𝑦) for the full model is a reflection that the saturated model perfectly captures the
outcome variable. Thus, the model's predicted values are \(\hat{\mu} = y\). The difference in deviance statistics
between the saturated model and the fitted model captures the distance between the predicted values
and the outcomes.
The full model 𝑀0 is, by definition, the best possible fit to the data, so the deviance ∆ measures
how far the model 𝑀1 under consideration is from a perfect fit.

C.4.3 Deviance for key response distributions of exponential family

Fix a vector of independent observations 𝑦 = (𝑦1, 𝑦2, . . . , 𝑦𝑛).

Normal Deviance

For normal distributions \(N(\mu_i, \sigma^2)\), with estimated means \(\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_n)\), we can find that
\[
\Delta = \mathrm{Dev} = \sum_{i=1}^{n} \Delta_i = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2 .
\]
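
This identity can be checked in R; a sketch, assuming a hypothetical data frame dat with response y and covariate x. Note that R's deviance() for the Gaussian family returns the unscaled sum of squares, that is, the sum above without the 1/𝜎² factor.

fit <- glm(y ~ x, family = gaussian(), data = dat)

deviance(fit)                  # residual deviance ...
sum((dat$y - fitted(fit))^2)   # ... equals the residual sum of squares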

Poisson Deviance

• The unit deviance for the Poisson distribution is
\[
\Delta_y = d(y, \mu) = y \log\frac{y}{\mu} - (y - \mu) \tag{C.18}
\]

• Explicitly, with \(\mathcal{L}(y; \mu) = \sum_i (y_i \log \mu_i - \mu_i)\), the residual deviance is
\[
\Delta = 2\bigl[\mathcal{L}(y; y) - \mathcal{L}(y; \mu)\bigr] = 2 \sum_{i=1}^{n} \Delta_i = 2 \sum_{i=1}^{n} \left\{ y_i \log\frac{y_i}{\hat{\mu}_i} - \bigl(y_i - \hat{\mu}_i\bigr) \right\} \tag{C.19}
\]
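
To make Equation (C.19) concrete, the residual deviance reported by glm() can be reproduced by hand; a sketch, again assuming the hypothetical data frame dat with count response y.

fit <- glm(y ~ x1 + x2, family = poisson(), data = dat)
mu.hat <- fitted(fit)

# Terms of Equation (C.19); observations with y = 0 contribute
# y * log(y/mu) = 0 by the usual convention.
d <- ifelse(dat$y == 0, 0, dat$y * log(dat$y / mu.hat)) - (dat$y - mu.hat)
2 * sum(d)       # agrees with deviance(fit)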


C.4.4 Information Criteria - Akaike’s Information Criterion

An information criterion balances the goodness-of-fit of a model against its complexity. The aim is to
provide a single statistic which allows model comparison and selection. We provide the formulas for
two criteria useful for model comparison: the Akaike Information Criterion and the Bayesian
Information Criterion.
Both are based on the log likelihood along with a penalty term based on the number of parameters
in the model. In this way, the criterion measures seek to balance our competing desires for finding the
best model (in terms of maximizing the likelihood) with model parsimony (including only those terms
that significantly contribute to the model).
We use the following definitions:
\[
\begin{aligned}
p &= \text{number of predictors} \\
n &= \text{number of observations} \\
L(M_k) &= \text{likelihood for model } k \\
\mathcal{L}(M_k) &= \text{log-likelihood for model } k \\
\mathrm{Dev}_k = D(M_k) &= \text{deviance of model } k.
\end{aligned} \tag{C.20}
\]

AIC- Akaike’s Information Criterion

Akaike (1987) proposed a simple information criterion based on the log-likelihood 𝑙 = L(𝑀𝑘) = log 𝐿(𝑀𝑘)
and the number of parameters 𝑝. The (scaled) formula, now named Akaike's Information Criterion, is given by

AIC = −2L(𝑀𝑘 ) + 2𝑝 = −2𝑙 + 2𝑝 (C.21)

The information criterion measures the information lost in using the associated model, and our
goal is to find the model with the lowest loss of information, i.e., lower values of the criterion
indicate a preferable model.
QUESTION: What sort of difference in AIC should be regarded as significant?

• A rule of thumb is that a difference of 4 or more AIC units would be regarded as significant. This
does require a degree of judgement: if there were two models with AICs within 4 units of each other,
we would pick the more parsimonious (economical) one, i.e., the one with fewer parameters.

• The number of parameters 𝑝 is a penalty against larger covariate lists. The criterion measure is
especially amenable to comparing GLMs of the same link and variance function but different covariate
lists.



• There are several other versions of the AIC statistic used in research, such as the AIC\(_c\),
which is a corrected or finite-sample AIC:
\[
\mathrm{AIC}_c = -2\mathcal{L}(M_k) + 2p + \frac{2p(p+1)}{n - p - 1} \tag{C.22}
\]

Generally, the model having the lower AIC statistic is preferred over another model, but there is no
specific statistical test from which a 𝑝-value may be computed.
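
In R, candidate GLMs fitted to the same data can be compared directly by their AIC values; a sketch, assuming the hypothetical fits below.

fit1 <- glm(y ~ x1,      family = poisson(), data = dat)
fit2 <- glm(y ~ x1 + x2, family = poisson(), data = dat)

AIC(fit1, fit2)   # the lower value is preferred; a gap of 4+ units is notable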

Summary

The GLM is briefly denoted as GLM(EDM, link function), and explicitly
\[
\begin{cases}
y_i \sim \mathrm{EDM}(\mu_i, \phi) \\[4pt]
g(\mu_i) = o_i + \beta_0 + \displaystyle\sum_{j=1}^{p} \beta_j x_{ij}.
\end{cases} \tag{C.23}
\]

The core structure of a GLM is specified by the choice of distribution from the EDM class and
the choice of link function.

EDM = Random component: The observations 𝑦𝑖 come independently from a specified EDM such
that 𝑦𝑖 ∼ EDM(𝜇𝑖 , 𝜑) for 𝑖 = 1, 2, . . . , 𝑛.
Link function defines systematic component: A linear predictor
\[
\eta_i = g(\mu_i) = o_i + \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\]
where the \(o_i\) are offsets, often equal to zero, and \(g(\mu) = \eta\) is a known, monotonic, differentiable link function.
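
This specification maps directly onto R's glm() interface; a sketch with hypothetical variables, where offset() supplies the \(o_i\) on the scale of the linear predictor (here a log-exposure, as is typical for Poisson rate models).

# family selects the EDM and the link; offset() carries the o_i
fit <- glm(y ~ x1 + x2 + offset(log(exposure)),
           family = poisson(link = "log"), data = dat)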
Bibliography

[1] Annette J. Dobson and Adrian G. Barnett, An Introduction to Generalized Linear Models, Third Edition, CRC (2008)
[2] Antal Kozak, Robert A. Kozak, Christina L. Staudhammer, Susan B. Watts, Introductory Probability and Statistics: Applications for Forestry and Natural Sciences, CABI (2008)
[3] Ali Jahan & Md Yusof Ismail and Rasool Noorossana Multi Response optimization in DOE
considering capability index in bounded objectives method, Journal of Scientific and Industrial
Research - India, 69 (2010), 11–16.

[4] Friesland Campina Thailand, https://www.frieslandcampina.com/en/organisation/organogram/frieslandcampina-thailand/

[5] David S. Moore, George P. Mccabe and Bruce A. Craig, Introduction to the Practice of Statistics,
6th edition, (2009), W. H. Freeman and Company New York

[6] Madhav, S. P., Quality Engineering using robust design, Prentice Hall, 1989.

[7] Chitavorn Jirajan, Triage in Emergency Department.


[8] Montgomery D.C. Introduction to Statistical Quality Control, 7th ed., (2009), Wiley.
[9] Canvas paintings by Australian artists, Australian National Museum
[10] Arijit Chaudhuri, Tasos C. Christofides and C.R. Rao, Handbook of Statistics Volume 34, North-
Holland publications, Elsevier B.V., 2016.
[11] C.R. Rao. Factorial experiments derivable from combinatorial arrangements of arrays, Suppl. J.
Roy. Statistics Soc., vol 9, pp. 128- 139, 1947

[12] David S. Moore, George P. McCabe and Bruce A. Craig, Introduction to the Practice of Statistics, 6th edition, W. H. Freeman and Company, New York (2009)
[13] S.R. Dalal et al., Factor-Covering Designs for Testing Software, Technometrics 40(3), 234-243,
American Statistical Association and the American Society for Quality, 1998.

[14] Douglas C. Montgomery, George C. Runger, Applied Statistics and Probability for Engineers, Sixth Edition, (2014) Wiley


[15] Brian Bergstein, AI still gets confused about how the world works, pp 62-65, MIT Technology
Review, The predictions issues, Vol 123 (2), 2020

[16] Judea Pearl, Causality: Models, Reasoning, and Inference, 2nd Edition, Cambridge University Press (2009)

[17] Man Van Minh Nguyen. DATA ANALYTICS- STATISTICAL FOUNDATION: Inference, Linear Regression and Stochastic Processes, ISBN: 978-620-2-79791-7, LAP LAMBERT Academic Publishing (2020)

[18] Man Van Minh Nguyen. PROBABILITY and STATISTICS: Inference, Causal Analysis and Stochastic Analysis, ISBN: 978-620-0-08656-3, LAP LAMBERT Academic Publishing (2019)

[19] Nhut Cong Nguyen, Man Van Minh Nguyen, and Phu Le Vo.
Co-kriging Method for Air Pollution Prediction - A Case Study in Hochiminh City, accepted to
appear in Thailand Statistician, Journal of the Thai Statistical Association, 2020

[20] Uyen Huynh, Nabendu Pal, Buu-Chau Truong and Man Nguyen.
A Statistical Profile of Arsenic Prevalence in the Mekong Delta Region, accepted to appear in
Thailand Statistician Journal 2020

[21] Man Van Minh Nguyen. Quality Engineering with Balanced Factorial Experimental Designs, Southeast Asian Bulletin of Mathematics, Vol 44 (6), pp. 819-844 (2020)

[22] Man VM. Nguyen and Nhut C. Nguyen. Analyzing Incomplete Spatial Data For Air Pollution Prediction, Southeast-Asian J. of Sciences, Vol. 6, No 2, pp. 111-133, (2018)

[23] Nguyen V. Minh Man. A Survey on Computational Algebraic Statistics and Its Applications, East-West Journal of Mathematics, Vol. 19, No 2, pp. 1-44 (2017)

[24] Nguyen V. Minh Man. Permutation Groups and Integer Linear Algebra for Enumeration of Orthogonal Arrays, East-West Journal of Mathematics, Vol. 15, No 2 (2013)

[25] Man Nguyen, Tran Vinh Tan and Phan Phuc Doan,
Statistical Clustering and Time Series Analysis for Bridge Monitoring Data, Recent Progress in
Data Engineering and Internet Technology,
Lecture Notes in Electrical Engineering 156, (2013) pp. 61 - 72, Springer-Verlag


[26] Man Nguyen and Le Ba Trong Khang. Maximum Likelihood For Some Stock Price Models, Journal of Science and Technology, Vol. 51, no. 4B, (2013) pp. 70-81, VAST, Vietnam

[27] Nguyen Van Minh Man and Scott H. Murray. Algebraic Methods for Construction of Mixed Orthogonal Arrays, Southeast Asian Journal of Sciences, Vol 1, No. 2 (2012) pp. 155-168;
(science.utcc.ac.th/sajs/wp-content/uploads/2013/06/3-MinhMan.pdf)
[28] Nguyen Van Minh Man and Scott H. Murray.
Mixed Orthogonal Arrays: Constructions and Applications, talk in International Conference on
Applied Probability and Statistics, December 28-31, 2011, The Chinese Univ. of Hong Kong,
Hong Kong

[29] Hien Phan, Ben Soh and Man VM. Nguyen, A Parallelism Extended Approach for the Enumeration
of Orthogonal Arrays, ICA3PP 2011, Part I, Lecture Notes in Computer Science, Vol. 7016, Y.
Xiang et al. eds., Springer- Verlag Berlin Heidelberg, pp. 482–494, 2011.
[30] Hien Phan, Ben Soh and Man VM. Nguyen, A Step-by-Step Extending Parallelism Approach for
Enumeration of Combinatorial Objects, ICA3PP 2010, Part I, Lecture Notes in Computer Science,
Vol. 6081, C.-H. Hsu et al. eds., Springer- Verlag Berlin Heidelberg, pp. 463-475, 2010.

[31] Eric D. Schoen, Pieter T. Eendebak, and Man Nguyen, Complete enumeration of pure-level and mixed-level orthogonal arrays, Journal of Combinatorial Designs 18(2) (2010) 123-140.

[32] Man Nguyen and Tran Vinh Tan.


Selecting Meaningful Predictor Variables: A Case Study with Bridge Monitoring Data,
Proceeding of the First Regional Conference on Applied and Engineering Mathematics (RCAEM
I) (2010), University of Perlis, Malaysia.

[33] Man Nguyen and Phan Phuc Doan.


A Combined Approach to Damage Identification for Bridge,
Proceeding of the 5th Asian Mathematical Conference, pp 629- 636, (2009),
Universiti Sains Malaysia in collaboration with UNESCO, Malaysia.

[34] Nguyen, Man V. M. Some New Constructions of strength 3 Orthogonal Arrays, the Memphis 2005 Design Conference Special Issue of the Journal of Statistical Planning and Inference, Vol 138, Issue 1 (Jan 2008) pp. 220-233.

[35] Brouwer, A. E., Cohen, A. M. and Nguyen, M. V. M. (2006), Orthogonal arrays of strength 3 and
small run sizes, Journal of Statistical Planning and Inference, 136, 3268-3280.

[36] Nguyen Van Minh Man, Computer-Algebraic Methods for the Construction of Designs of Experiments, Ph.D. thesis, Eindhoven University of Technology (TU/e), Netherlands (2005)

[37] Marie-Pierre De Bellefon, Jean-Michel Floch. Handbook of Spatial Analysis. Chapter 9, 231-254.
2018.


[38] Nelder, J. and R. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370–384.
[39] McCullagh, Peter and Nelder, John Ashworth, Generalized Linear Models, 2nd ed., Chapman and Hall, 1989.
[40] David Ardia, Financial Risk Management with Bayesian Estimation of GARCH Models, Springer
(2008)
[41] Peter K. Dunn and Gordon K. Smyth, Generalized Linear Models With Examples in R (2018), Springer Nature.
[42] Philippe Jorion , Value at Risk- The New Benchmark for Managing Financial Risk, 3rd Edition
McGraw Hill (2007)
[43] Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning Data
Mining, Inference, and Prediction, 2nd Ed. Springer (2017)
[44] Simon Hubbert, Essential Mathematics for Market Risk Management, Wiley (2012)
[45] Soren Asmussen and Peter W. Glynn, Stochastic Simulation- Algorithms and Analysis, Springer
(2007)
[46] A. Stewart Fotheringham, Chris Brunsdon, Martin Charlton. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, England. 2002.
[47] Scheffe, H. (1959) The Analysis of Variance, John Wiley & Sons, Inc., New York.
[48] Sung H. Park, Six-Sigma for Quality and Productivity Promotion, Asian Productivity Organization,
1-2-10 Hirakawacho, Chiyoda-ku, Tokyo, Japan, 2003.
[49] M. F. Fecko and al., Combinatorial designs in Multiple faults localization for Battlefield networks,
IEEE Military Communications Conf., Vienna, 2001.
[50] Glonek G.F.V. and Solomon P.J. Factorial and time course designs for cDNA microarray experi-
ments, Biostatistics 5, 89-111, 2004.
[51] Hedayat, A. S., Sloane, N. J. A. and Stufken, J. Orthogonal Arrays, Springer, 1999.
[52] Joel Cutcher-Gershenfeld – ESD.60 Lean/Six Sigma Systems, LFM, MIT
[53] John J. Borkowski's Home Page, www.math.montana.edu/~jobo/courses.html/
[54] Joseph A. De Feo, Juran's Quality Management and Analysis, McGraw-Hill, 2015.
[55] Jay L. Devore and Kenneth N. Berk,
Modern Mathematical Statistics with Applications, 2nd Edition, Springer (2012)
[56] Google Earth, Digital Globe, 2014- 2019
[57] Robert V. Hogg, Joseph W. McKean, Allen T. Craig Introduction to Mathematical Statistics, Sev-
enth Edition Pearson, 2013.
[58] Michael Baron, Probability and Statistics for Computer Scientists, 2nd Edition (2014), CRC Press,
Taylor & Francis Group
[59] R. H. Myers, Douglas C. Montgomery and Christine M. Anderson-Cook
Response Surface Methodology : Process and Product Optimization Using Designed Experi-
ments, Wiley, 2009.
[60] Nathabandu T. Kottegoda, Renzo Rosso.
Applied Statistics for Civil and Environmental Engineers, 2nd edition (2008), Blackwell Publishing
Ltd and The McGraw-Hill Inc


[61] Paul Mac Berthouex. L. C. Brown. Statistics for Environmental Engineers; 2nd edition (2002), CRC
Press

[62] Ron S. Kenett, Shelemyahu Zacks. Modern Industrial Statistics with applications in R, MINITAB,
2nd edition, (2014), Wiley

[63] Online: https://news.samsung.com/global/samsung-announces-new-and-enhanced-quality-assurance-measures-to-improve-product-safety

[64] Online www.samsungengineering.com/sustainability/quality/common/suView

[65] Sudhir Gupta, Balanced Factorial Designs for cDNA Microarray Experiments, Communications in
Statistics: Theory and Methods, Volume 35, Number 8 , p. 1469-1476 (2006)

[66] Sung H. Park, Six-Sigma for Quality and Productivity Promotion, Asian Productivity Organization,
1-2-10 Hirakawacho, Chiyoda-ku, Tokyo, Japan, 2003.

[67] Sloane N.J.A., http://neilsloane.com/hadamard/index.html/

[68] Online: www.toyota-global.com/company/history-of-toyota/75years/data/company-information/management-and-finances/management/tqm/change.html

[69] Vo Ngoc Thien An, Design of Experiment for Statistical Quality Control, Master thesis, LHU, Viet-
nam (2011)

[70] Wang, J.C. and Wu, C. F. J. (1991), An approach to the construction of asymmetrical orthogonal
arrays, Journal of the American Statistical Association, 86, 450–456.

[71] Larry Wasserman, All of Statistics- A Concise Course in Statistical Inference, Springer, (2003)

[72] William J. Stevenson, Operations Management, 12th ed., McGraw-Hill

[73] C.F. Jeff Wu, Michael Hamada Experiments: Planning, Analysis and Parameter Design Optimiza-
tion, Wiley, 2000.

[74] Doebling, S. W., Farrar, C. R., Prime, M. B., and Shevitz, D. W.. ”Damage Identification and Health
Monitoring of Structural and Mechanical Systems From Changes in Their Vibration Characteris-
tics: A Literature Review,” Los Alamos National Laboratory Report LA-13070-MS, 1996.

[75] Donoho, David. High-dimensional data analysis: The curses and blessings of dimensionality, 2000.

[76] Farrar, Charles R, and Keith Worden. ”An introduction to structural health monitoring.” Philo-
sophical transactions. Series A, Mathematical, physical, and engineering sciences 365, no. 1851
(2007): 303-15.

[77] Fodor, Imola. A Survey of Dimension Reduction Techniques. Center for Applied Scientific Com-
puting, Lawrence Livermore National Laboratory, 2002.

[78] Eastment, H. T., and W. J. Krzanowski. ”Cross-Validatory Choice of the Number of Components
from a Principal Component Analysis.” Technometrics 24, no. 1 (1982): 73 - 77.

[79] Garcia, Gabriel V., and Roberto A. Osegueda. ”Combining damage detection methods to improve
probability of detection.” In Smart Structures and Materials 2000: Smart Systems for Bridges,
Structures, and Highways, Shih-Chi Liu, 135-142. SPIE, 2000.

[80] Halfpenny, Angela. ”A Frequency Domain Approach for Fatigue Life Estimation from Finite Ele-
ment Analysis.” Key Engineering Materials 167-168: D (1999): 401-410.

[81] Härdle, Wolfgang, and Léopold Simar. Applied multivariate statistical analysis. 2nd. Springer,
2007.


[82] Haywood, Jonathan, Wieslaw J. Staszewski, and Keith Worden. ”Impact Location in Composite
Structures Using Smart Sensor Technology and Neural Networks.” In The 3rd International Work-
shop on Structural Health Monitoring, 1466-1475. Stanford, California, 2001.

[83] Hotelling, Harold. ”Relations Between Two Sets of Variates.” Biometrika 28, no. 3-4 (1936): 321-
377.

[84] Inada, T., Shimamura, Y., Todoroki, A., Kobayashi, H., and Nakamura, H., Damage Identification
Method for Smart Composite Cantilever Beams with Piezoelectric Materials, Structural Health
Monitoring 2000, Stanford University, Palo Alto, California, 1999,pp. 986-994.

[85] Jolliffe, I. T. Principal component analysis. 2nd. Springer, 2002.

[86] Lapin, L.L. , Probability and Statistics for Modern Engineering, PWS-Kent Publishing, 2nd Edition,
Boston, Massachusetts,1990.

[87] Ljung, L. System identification: theory for the user, Prentice Hall, Englewood Cliffs, NJ, 1987

[88] Masri, S.F., Smyth, A.W., Chassiakos, A.G., Caughey, T.K., and Hunter, N.F.,Application of Neural
Networks for Detection of Changes in Nonlinear Systems, Journal of Engineering Mechanics,
July,2000, pp. 666-676.

[89] Papadimitriou, C. ”Optimal sensor placement methodology for parametric identification of struc-
tural systems.” Journal of Sound and Vibration 278, no. 4-5 (2004): 923-947.

[90] Rytter, A., Vibration based inspection of civil engineering structures. Ph.D Dissert., Department of
Building Technology and Structural Engineering, Aalborg University, Denmark, 1993.

[91] Rytter, A., and Kirkegaard, P. ,Vibration Based Inspection Using Neural Networks,” Structural Dam-
age Assessment Using Advanced Signal Processing Procedures, Proceedings of DAMAS ‘97,
University of Sheffield, UK, 1997,pp. 97-108.

[92] Silverman, B.W. , Density Estimation for Statistics and Data Analysis, Chapman and Hall, New
York, New York,1986.

[93] Sithole, M.M., and S. Ganeshanandam. ”Variable selection in principal component analysis to pre-
serve the underlying multivariate data structure.” In ASC XII – 12th Australian Stats Conference.
Monash University, Melbourne, Australia, 1994.

[94] Sohn, Hoon. ”Effects of environmental and operational variability on structural health monitoring..”
Philosophical transactions. Series A, Mathematical, physical, and engineering sciences 365, no.
1851 (2007): 539-60.

[95] Sohn, Hoon, and Charles R. Farrar. Damage diagnosis using time series analysis of vibration
signals. Smart Materials and Structures. Vol. 10, 2001.

[96] Sohn, Hoon, David W.Allen, Keith Worden and Charles R. Farrar, Statistical damage classification
using sequential probability ratio test, Structural Health Monitoring, 2003.p.57-74

[97] Sohn, Hoon, Keith Worden, Charles R. Farrar, Statistical Damage Classification under Changing
Environmental and Operational Conditions, Journal of Intelligent Materials Systems and Struc-
tures, 2007

[98] Sohn, Hoon, Charles R. Farrar, Francois M. Hemez, Devin D. Shunk, Daniel W. Stinemates,
Brett R. Nadler, and Jerry J. Czarnecki. A Review of Structural Health Monitoring Literature:
1996–2001. Structural Health Monitoring. Los Alamos National Laboratory Report, 2004.

[99] Todd, M.D., and Nichols, J.M., Structural Damage Assessment Using Chaotic Dynamic Interroga-
tion, Proceedings of 2002 ASME International Mechanical Engineering Conference and Exposi-
tion, New Orleans, Louisiana, 2002.



[100] Vanik, M. W., Beck, J. L., and Au, S. K. , Bayesian Probabilistic Approach to Structural Health
Monitoring, Journal of Engineering Mechanics, Vol. 126, No. 7, 2000,pp. 738-745.
[101] Vapnik, V., Statistical Learning Theory, John Wiley & Sons, Inc., New York,1998
[102] Wald, A. Sequential Analysis, John Wiley and Sons, New York, 1947
[103] Worden, K., and Lane, A.J. , Damage Identification Using Support Vector Machines, Smart Ma-
terials and Structures, Vol. 10,2001, pp. 540-547.
[104] Worden, K., Pierce, S.G., Manson, G., Philp, W.R., Staszewski, W.J., and Culshaw, B. , Detection
of Defects in Composite Plates Using Lamb Waves and Novelty Detection, International Journal
of Systems Science, Vol. 31,2000, pp. 1,397-1,409

[105] Worden, K., and Fieller, N.R.J., Damage Detection Using Outlier Analysis, Journal of Sound and
Vibration, Vol. 229, No. 3,1999, pp. 647-667.
[106] Yang, Lingyun, Jennifer M. Schopf, Catalin L. Dumitrescu, and Ian Foster. ”Statistical Data Re-
duction for Efficient Application Performance Monitoring.” CCGRID (2006).
[107] Q.W.Zhang, Statistical damage identification for bridges using ambient vibration data, Elsevier,
2006. p.476-485.
Index

3-balanced fractional design, 213
abnormal, exotic, 85
ANOVA, 191, 193
balance, 213
Bayes' Theorem, 11
central tendency, 61
combined influence of the factors, 211
commutation, 6
compound Poisson process, 420
compound random variable, 420
conditional distributions, 239
conditional expectation, 418
conditional variance, 416
continuous probability distribution, 20
correlation, 246, 309
cumulative hazard rate, 31
dependency, 10
distribution
    gamma, 33
experimental run, 211
experiments, 210
factorial design, 204
factors, 210
fraction, 204
fractional factorial design, 213
full factorial design, 213
higher-order interactions, 212
hypergeometric distribution, 417
integrable function, 238
intercept, 212
joint distribution - joint c.d.f., 233
joint p.d.f., 233
level combinations, 211
levels, 211
linear regression models, 211
main effect, 184
main effects, 212
marginal distributions, 235
marginal p.d.f., 231, 233, 235
mixed design, 204
MTTF
    mean time till failure, 31
multivariate methods, 244
mutually independent, 238
observe/explain variables, 244
ordered statistic, 85
Poisson distribution, 18
population, 69
quality engineer, 210
regression coefficients, 212
residual maximum likelihood, 221
response, 211
run, 211
Schwarz inequality, 245
settings, 211
spreading tendency, 61
statistic, 85
statistical population, 69
strength 3 orthogonal array, 213
symmetric, 204
two-factor interactions, 212
