STAT 251 Course Text

The document outlines statistical concepts across multiple chapters, including summarizing univariate and multivariate data, probability, random variables and distributions, the normal distribution, probability models, and statistical inference. Statistical formulas and examples are provided.

Contents

1 Summary and Display of Univariate Data
  1.1 Frequency Table and Histogram
  1.2 Sample Mean
  1.3 Sample Standard Deviation, Variance and Covariance
  1.4 Sample Quantiles, Median and Interquartile Range
  1.5 Box Plot
  1.6 Exercises

2 Summary and Display of Multivariate Data
  2.1 Scatter Plot
  2.2 Covariance and Correlation Coefficient
  2.3 The Least Squares Regression Line
  2.4 Multiple Linear Regression
  2.5 Exercises

3 Probability
  3.1 Sets and Probability
  3.2 Conditional Probability and Independence
  3.3 Exercises

4 Random Variables and Distributions
  4.1 Definition and Notation
  4.2 Discrete Random Variables
  4.3 Continuous Random Variables
  4.4 Summarizing the Main Features of f(x)
  4.5 Sum and Average of Independent Random Variables
  4.6 Max and Min of Independent Random Variables
    4.6.1 The Maximum
    4.6.2 The Minimum
  4.7 Exercises
    4.7.1 Exercise Set A
    4.7.2 Exercise Set B

5 Normal Distribution
  5.1 Definition and Properties
  5.2 Checking Normality
  5.3 Exercises
    5.3.1 Exercise Set A
    5.3.2 Exercise Set B

6 Some Probability Models
  6.1 Bernoulli Experiments
  6.2 Bernoulli and Binomial Random Variables
  6.3 Geometric Distribution and Return Period
  6.4 Poisson Process and Associated Random Variables
  6.5 Poisson Approximation to the Binomial
  6.6 Heuristic Derivation of the Poisson and Exponential Distributions
  6.7 Exercises
    6.7.1 Exercise Set A
    6.7.2 Exercise Set B

7 Normal Probability Approximations
  7.1 Central Limit Theorem
  7.2 Normal Approximation to the Binomial Distribution
  7.3 Normal Approximation to the Poisson Distribution
  7.4 Exercises
    7.4.1 Exercise Set A
    7.4.2 Exercise Set B

8 Statistical Modeling and Inference
  8.1 Introduction
  8.2 One Sample Problems
    8.2.1 Point Estimates for µ and σ
    8.2.2 Confidence Interval for µ
    8.2.3 Testing of Hypotheses about µ
  8.3 Two Sample Problems
  8.4 Exercises
    8.4.1 Exercise Set A
    8.4.2 Exercise Set B

9 Simulation Studies
  9.1 Monte Carlo Simulation
  9.2 Exercises

10 Comparison of Several Means
  10.1 An Example
  10.2 Exercises
    10.2.1 Exercise Set A
    10.2.2 Exercise Set B

11 The Simple Linear Regression Model
  11.1 An Example
  11.2 Exercises

12 Appendix
  12.1 Appendix A: Tables
Chapter 1

Summary and Display of Univariate Data

1.1 Frequency Table and Histogram


Engineers and applied scientists are often involved with the generation and collection of data and the
retrieval of information contained in data sets. They must also communicate to different audiences
the results of complex numerical studies including one or more data sets.
Experience shows that data sets are often messy, difficult to grasp and hard to analyze. In this
chapter we introduce some statistical techniques and ideas which can be used to summarize and
display data.

Table 1.1: Live Load Data


Bay 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
A 44.4 130.4 127.6 127.7 108.4 184.0 139.1 120.6 174.1 187.9
B 138.4 236.4 202.5 128.7 154.3 117.0 125.9 127.2 175.6 114.1
D 164.7 110.4 185.7 185.0 150.0 198.7 144.5 121.5 93.2 202.2
E 98.3 154.5 171.9 104.8 230.1 102.8 156.6 136.1 93.8 197.8
F 178.0 108.1 197.9 112.0 66.6 160.9 106.8 123.2 162.5 118.3
G 123.7 185.4 130.3 169.2 91.8 134.5 153.5 131.4 254.0 194.6
H 157.5 62.3 65.2 94.4 156.1 133.6 101.9 117.6 87.6 142.4
I 119.4 74.1 118.2 144.4 212.0 132.3 136.1 184.3 177.2 151.8
J 150.4 137.8 105.5 55.2 122.9 127.8 180.6 53.0 150.1 138.4
K 92.2 54.0 139.2 116.7 32.1 184.8 127.1 171.8 159.6 123.8
L 169.8 168.4 169.9 159.6 179.6 33.5 193.3 99.5 124.3 208.6
M 181.5 147.5 104.1 167.4 172.4 128.8 138.6 110.1 141.1 189.3
N 105.4 133.1 62.0 144.9 129.1 94.9 147.6 167.9 136.7 173.2
O 157.6 164.6 195.0 136.3 136.6 223.7 134.0 179.1 85.7 122.3
P 168.4 173.5 150.4 116.4 143.7 179.5 84.5 161.5 140.5 94.1
Q 161.0 132.8 161.0 147.1 199.8 141.4 178.1 145.7 124.8 179.8
R 156.3 128.6 111.8 157.6 129.3 115.2 73.3 94.3 161.9 154.7
S 152.3 169.5 162.1 106.6 112.0 141.0 110.7 145.8 206.1 88.8
T 138.9 101.1 127.9 178.3 127.5 145.1 53.5 182.4 147.9 138.0
U 112.3 135.1 123.9 258.9 192.1 155.0 122.3 86.1 147.0 118.0

Frequency Table


Consider the 200 measurements of the live load distribution (pounds per square foot) on ten
floors and twenty bays of a large warehouse (Table 1.1). The live load is the load supported by
the structure excluding the weight of the structure itself. Notice how hard it is to understand data
presented in this raw form. They must clearly be organized and summarized in some fashion before
their analysis can be attempted.
One way to summarize a large data set is to condense it into a frequency table (see Table 1.2).
The first step to construct a frequency table is to determine an appropriate data range, that is,
an interval that contains all the observations and that has end points close (but not necessarily
equal) to the smallest and largest data values. The second step is to determine the number k of
bins. The data range is divided into k smaller subintervals, the bins, usually taken of the same size.
Normally, the number of bins k is chosen between 7 and 15, depending on the size of the data set
with fewer bins producing simpler but less detailed tables. For example, in the case of the live load
data, the smallest and largest observations are 32.1 and 258.9, the data range is [20, 260] and there
are 12 bins of size 20. The third step is to calculate the bin mark, ci , which represents that bin.
The bin mark is the center of the bin interval (that is, one half of the sum of the bin’s end points).
For example, 30 = (20 + 40)/2 for the first bin in Table 1.2. The fourth step is to calculate the
bin frequencies, ni. The bin frequency is equal to the number of data points lying in that bin. Each
data point must be counted once; if a data point is equal to the end point shared by two successive
bins, then it is included (only) in the second. For example, a live load of 60 is included in the third
bin (see Table 1.2). The fifth step is to calculate the relative frequencies

    fi = ni / (n1 + n2 + ... + nk)

and the cumulative relative frequencies

    Fi = (n1 + ... + ni) / (n1 + n2 + ... + nk).
Notice that fi × 100% gives the percentage of observations in the ith bin and Fi × 100% gives the
percentage of observations below the upper end point of the ith bin. For example, from Table 1.2,
18% of the live loads are between 140 and 160 psf, and 95% of the live loads are below 200 psf.

Table 1.2: Frequency Table

Class ci ni fi Fi
20–40 30 2 0.010 0.010
40–60 50 5 0.025 0.035
60–80 70 6 0.030 0.065
80–100 90 15 0.075 0.140
100–120 110 28 0.14 0.280
120–140 130 47 0.235 0.515
140–160 150 36 0.180 0.695
160–180 170 32 0.160 0.855
180–200 190 19 0.095 0.950
200–220 210 5 0.025 0.975
220–240 230 3 0.015 0.990
240–260 250 2 0.010 1.000
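The steps above translate directly into code. The following is a minimal Python sketch (not part of the course materials) using a small made-up sample rather than the full 200 live loads; bins are taken half-open on the right so that a point equal to a shared end point (such as 60) falls in the later bin, as the text requires.

```python
# Minimal sketch of the frequency-table steps on a small made-up sample.
# Data range [20, 260), 12 bins of size 20, as for the live load data.
data = [32.1, 55.0, 60.0, 88.8, 112.0, 123.8, 139.2, 150.4, 171.8, 258.9]

lo, width, k = 20, 20, 12
marks = [lo + width * i + width / 2 for i in range(k)]        # bin marks c_i
counts = [sum(1 for x in data
              if lo + width * i <= x < lo + width * (i + 1))  # [left, right)
          for i in range(k)]

n = sum(counts)
rel = [c / n for c in counts]                # relative frequencies f_i
cum = [sum(rel[:i + 1]) for i in range(k)]   # cumulative frequencies F_i

print(marks[2], counts[2])   # the value 60.0 lands in the third bin, 60-80
```

The half-open convention `left <= x < right` is what makes a boundary value like 60 count toward the third bin rather than the second.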

At this point it is worth comparing Table 1.1 and Table 1.2. We can quickly learn, for instance,
from Table 1.2 that only 2 live loads lie between 20 and 40, but we cannot say which they are. On
the other hand, with considerable effort, we can find out from Table 1.1 that these live loads are
32.1 and 33.5. Table 1.2 loses some information in exchange for clarity. The fewer the bins, the
greater both the loss of information and the gain in clarity.
Histogram

The information contained in a frequency table can be graphically displayed in a picture called a
histogram (see Figure 1.1). Bars with areas proportional to the bin frequencies are drawn over each
bin. Notice that in the case of bins of equal size the bar areas are proportional to the bar heights.
The histogram shows the shape or distribution of the data and permits a direct visualization of
its general characteristics including typical values, spread, shape, etc. The histogram also helps
to detect unusual observations called outliers. From Figure 1.1 we notice that the distribution of
the live load is approximately symmetric: the central bin 120–140 is the most frequent and the
frequencies of the other bins decrease as we move away from this central bin.

Figure 1.1: Histogram of the Live Load

Many data sets encountered in practice are not symmetric. For example, the histogram of
Tobin’s Q-ratios (market value to replacement cost, out of 250) for 50 firms in Figure 1.2 (a) shows
high positive skewness. There are a few firms which are highly overrated. The ages of officers
attaining the rank of colonel in the Royal Netherlands Air Force (Figure 1.2 (b)) exhibit a pattern
of negative skewness. There appear to be more “whizzes” than “laggards” in the Netherlands Air
Force. Figure 1.2 (c) displays Simon Newcomb’s measurements of the speed of light. Newcomb
measured the time required for light to travel from his laboratory on the Potomac River to a mirror
at the base of the Washington Monument and back, a total distance of about 7400 meters. These
measurements were used to estimate the speed of light. The histogram of Newcomb’s data (Figure
1.2 (c)) shows a symmetric distribution except for two outliers. Deleting these outliers gives the
symmetric histogram in Figure 1.2 (d).
Data sets can be further summarized in terms of just two numbers, one giving their location and
the other their dispersion. These summaries are very convenient and perhaps unavoidable when we

Figure 1.2: Some Non-Symmetric Histograms: (a) Tobin’s Q ratio, (b) Age of officers, (c) Speed of light, (d) Outliers deleted

must compare several data sets (e.g. the production figures from several plants and shifts). The
loss of information is not severe in the case of data sets with approximately symmetric histograms,
but may be very severe in other cases.
Two commonly used measures of location and dispersion are the sample mean and the sample
standard deviation. They are studied in the next two sections.

1.2 Sample Mean


Quantitative variables such as the live load are usually denoted by upper case letters X, Y , etc. The
particular measurements for these variables are denoted by the corresponding lower case letters, xi ,
yi , etc. The subscripts give the order in which the measurements have been taken. For example,
the variable “live load” can be represented by X and, if the measurements were made floor by floor
from the first to the tenth, from bay A to bay U, then

x1 = 44.4, x2 = 138.4, ... x10 = 92.2, ... x200 = 118.0.

The sample mean x̄ (also called sample average) of a data set or “sample” is defined as

    x̄ = (x1 + x2 + · · · + xn)/n = (Σ xi)/n,

where the sum runs over i = 1, ..., n and n represents the number of data points (observations).
For the live load data (see Table 1.1)

    x̄ = 140.156 pounds per ft².



The sample average can also be approximately calculated from a frequency table using the formula

    x̄ ≈ (Σ ci ni)/(Σ ni) = Σ ci fi,

where the sums run over the k bins. The approximation is better when the measurements are
symmetrically distributed over each bin. For the live load data (see Table 1.2) we have

    x̄ ≈ [(30 × 2) + (50 × 5) + ... + (250 × 2)] / (2 + 5 + ... + 2)
      = (30 × 0.010) + (50 × 0.025) + ... + (250 × 0.010) = 139.8 pounds per ft²,

which is close to the exact value, 140.156.
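This approximation can be reproduced directly from the bin marks and counts of Table 1.2; a short Python sketch:

```python
# Approximate mean computed from Table 1.2 (bin marks c_i and counts n_i).
marks  = [30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250]
counts = [2, 5, 6, 15, 28, 47, 36, 32, 19, 5, 3, 2]

n = sum(counts)                                        # 200 observations
mean_approx = sum(m * c for m, c in zip(marks, counts)) / n
print(round(mean_approx, 1))                           # 139.8
```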

Properties of the Sample Mean

Linear Transformations: If the original measurements, xi, are linearly transformed to obtain new
measurements

    yi = a + bxi,

for some constants a and b, then

    ȳ = a + bx̄.

In fact,

    ȳ = (Σ yi)/n = (Σ (a + bxi))/n = (na + b Σ xi)/n = a + b (Σ xi)/n = a + bx̄.
Example 1.1 Suppose that each live load from Table 1.1 is increased by 5 kilograms per square foot
and converted to kilograms per square foot. Since one pound equals 0.4535 kilograms, the revised
measurements are yi = 5 + 0.4535xi and ȳ = 5 + 0.4535x̄ = 5 + 0.4535 × 140.2 = 68.58 kilograms
per square foot.

Sum of Variables: If new measurements zi are obtained by adding old measurements xi and yi,
then

    z̄ = x̄ + ȳ.

In fact,

    z̄ = (Σ zi)/n = (Σ (xi + yi))/n = (Σ xi + Σ yi)/n = x̄ + ȳ.
Example 1.2 Let ui and vi (i = 1, . . . , 10) represent the live loads on bays A and B. The mean
loads across floors for these two bays are (see Table 1.1)

    ū = (44.4 + 130.4 + . . . + 187.9)/10 = 134.42 (Bay A)

    v̄ = (138.4 + 236.4 + . . . + 114.1)/10 = 152.01 (Bay B).

If wi represents the combined live loads on bays A and B (i.e. wi = ui + vi) then the combined mean
load across floors for these two bays is

    w̄ = ū + v̄ = 134.42 + 152.01 = 286.43.



Least Squares: The sample mean has a nice geometric interpretation. If we represent each obser-
vation xi as a point on the real line, then the sample mean is the point which is “closest” to the
entire collection of measurements. More precisely, let S(t) be the sum of the squared distances from
each observation xi to the point t:

    S(t) = Σ (xi − t)².

Then S(t) ≥ S(x̄) for all t. To prove this write

    S(t) = Σ [(xi − x̄) + (x̄ − t)]²
         = Σ [(xi − x̄)² + (x̄ − t)² + 2(xi − x̄)(x̄ − t)]
         = Σ (xi − x̄)² + n(x̄ − t)² + 2(x̄ − t) Σ (xi − x̄)
         = S(x̄) + n(x̄ − t)²,   since Σ (xi − x̄) = nx̄ − nx̄ = 0
         ≥ S(x̄),               since n(x̄ − t)² ≥ 0 for all t.

Moreover, equality holds only when t = x̄.
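A quick numerical illustration of this property (a sketch, using five live loads from the first column of Table 1.1 as an arbitrary small sample):

```python
# S(t), the sum of squared distances to t, is smallest at the sample mean.
xs = [44.4, 138.4, 164.7, 98.3, 178.0]      # five live loads from Table 1.1
mean = sum(xs) / len(xs)

def S(t):
    return sum((x - t) ** 2 for x in xs)

# S(t) >= S(mean) for every candidate t, with equality only at t = mean:
for t in [0.0, 50.0, 100.0, mean, 150.0, 250.0]:
    assert S(t) >= S(mean)
print(round(mean, 2))                        # 124.76
```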

Center of Gravity: The sample mean also has a nice physical interpretation. If we think of
the observations xi as points on a uniform beam where equal vertical forces, Fi, are applied (see
Figure 1.3), then the sample mean is the center of gravity of this system. To see this, consider the
magnitude and the placement of the opposite force F needed to achieve static equilibrium. Since all
the forces are vertical, the horizontal component of F must be equal to zero. To achieve translational
equilibrium the sum of the vertical components of all the forces must also be equal to zero. If we
denote the vertical components of Fi by Fi, and the vertical component of F by F, then

    F + (F1 + F2 + . . . + Fn) = 0 (Static Equilibrium).

Since the Fi's are all equal (Fi = −w, say) we have F − nw = 0 and so F = nw. To achieve torque
equilibrium, the placement d of F must satisfy

    dF + (x1 F1) + (x2 F2) + . . . + (xn Fn) = 0 (Torque Equilibrium).

Replacing Fi by −w and F by nw we have

    dnw − w(x1 + x2 + . . . + xn) = 0.

Therefore,

    d = (x1 + x2 + . . . + xn)/n = x̄.

Figure 1.3: The Sample Mean As Center of Gravity

1.3 Sample Standard Deviation, Variance and Covariance


Given the measurements (or sample) x1, x2, . . . , xn, their sample standard deviation SD(x) is defined
as

    SD(x) = +√[ Σ (xi − x̄)² / (n − 1) ].

The expression inside the square root is called the sample variance, and denoted Var(x). In the
case of the live load data (Table 1.1)

    Var(x) = 1583.892 square pounds per ft⁴ and SD(x) = 39.798 pounds per ft².

The standard deviation can be approximately calculated from a frequency table using the formula

    SD(x) ≈ +√[ Σ (ci − x̄)² ni / (n − 1) ].

The approximation is better when the observations are symmetrically distributed on each bin. For
the live load (Table 1.2) we have

    SD(x) ≈ √{[(30 − 139.8)² × 2 + (50 − 139.8)² × 5 + · · · + (250 − 139.8)² × 2] / 199}
          = 39.75 pounds per ft²,

which is close to the exact value, 39.798.
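As with the mean, the frequency-table approximation to the standard deviation can be checked against Table 1.2 with a short Python sketch:

```python
# Approximate SD from the bin marks and counts of Table 1.2.
marks  = [30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250]
counts = [2, 5, 6, 15, 28, 47, 36, 32, 19, 5, 3, 2]

n = sum(counts)
mean_approx = sum(m * c for m, c in zip(marks, counts)) / n      # 139.8
var_approx = sum((m - mean_approx) ** 2 * c
                 for m, c in zip(marks, counts)) / (n - 1)
sd_approx = var_approx ** 0.5
print(round(sd_approx, 2))       # close to the exact value 39.798
```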

Properties of the Sample Variance

Linear Transformations: If the original measurements, xi, are linearly transformed to obtain new
measurements

    yi = a + bxi,

for some constants a and b, then

    Var(y) = b² Var(x).

In fact, since ȳ = a + bx̄,

    Var(y) = Σ (yi − ȳ)²/(n − 1) = Σ (a + bxi − a − bx̄)²/(n − 1)
           = Σ [b(xi − x̄)]²/(n − 1) = b² Σ (xi − x̄)²/(n − 1) = b² Var(x).

Example 1.3 As in Example 1.1, each live load in Table 1.1 is increased by 5 kilograms per square
foot and converted to kilograms per square foot. Since one pound equals 0.4535 kilograms, the revised
measurements are yi = 5 + 0.4535xi kilograms per square foot and so Var(y) = 0.4535² × Var(x) =
0.2056623 × 1583.892 = 325.747 square kilograms per ft⁴. The corresponding standard deviation
is SD(y) = √325.747 = 18.048 kilograms per square foot.

Sum of Variables: If new measurements zi are obtained by adding old measurements xi and yi,
then

    Var(z) = Var(x) + Var(y) + 2Cov(x, y),                    (1.1)

where

    Cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

is the covariance between xi and yi. The covariance will be further discussed in the next chapter.
The important point here is to notice that the variances of xi and yi cannot simply be added to
obtain the variance of zi.
To prove (1.1) write

    Var(z) = Σ (zi − z̄)²/(n − 1) = Σ (xi + yi − x̄ − ȳ)²/(n − 1) = Σ [(xi − x̄) + (yi − ȳ)]²/(n − 1)
           = Σ [(xi − x̄)² + (yi − ȳ)² + 2(xi − x̄)(yi − ȳ)]/(n − 1)
           = [Σ (xi − x̄)² + Σ (yi − ȳ)² + 2 Σ (xi − x̄)(yi − ȳ)]/(n − 1)
           = Var(x) + Var(y) + 2Cov(x, y).
Example 1.4 As in Example 1.2 let ui and vi be the live loads on bays A and B. The variances
and covariance for these loads are (see Table 1.1 and Example 1.2)

    Var(u) = [(44.4 − 134.42)² + (130.4 − 134.42)² + · · · + (187.9 − 134.42)²]/9 = 1777.128 (Bay A)

    Var(v) = [(138.4 − 152.01)² + (236.4 − 152.01)² + · · · + (114.1 − 152.01)²]/9 = 1657.93 (Bay B)

    Cov(u, v) = [(44.4 − 134.42)(138.4 − 152.01) + · · · + (187.9 − 134.42)(114.1 − 152.01)]/9 = −218.650.

If wi represents the combined live loads on bays A and B (i.e. wi = ui + vi) then

    Var(w) = Var(u) + Var(v) + 2Cov(u, v) = 1777.128 + 1657.93 + 2 × (−218.650) = 2997.758.

Two Simple Identities: The following identities are very useful for handling calculations of vari-
ances and covariances:

    Σ (xi − x̄)² = Σ xi² − nx̄² = Σ xi² − (Σ xi)²/n                              (1.2)

and

    Σ (xi − x̄)(yi − ȳ) = Σ xi yi − nx̄ȳ = Σ xi yi − (Σ xi)(Σ yi)/n.            (1.3)

To prove (1.2) write

    Σ (xi − x̄)² = Σ (xi² + x̄² − 2xi x̄) = Σ xi² + nx̄² − 2x̄ Σ xi.

The identities in (1.2) follow now because Σ xi = nx̄ and so

    nx̄² − 2x̄ Σ xi = nx̄² − 2nx̄² = −nx̄² = −(Σ xi)²/n.

The proof of (1.3) is similar and is left as an exercise.

Table 1.3: Variance and Covariance Calculations


Floor (i) Bay A (ui) Bay B (vi) ui² vi² ui·vi
1 44.4 138.4 1971.36 19154.56 6144.96
2 130.4 236.4 17004.16 55884.96 30826.56
3 127.6 202.5 16281.76 41006.25 25839.00
4 127.7 128.7 16307.29 16563.69 16434.99
5 108.4 154.3 11750.56 23808.49 16726.12
6 184.0 117.0 33856.00 13689.00 21528.00
7 139.1 125.9 19348.81 15850.81 17512.69
8 120.6 127.2 14544.36 16179.84 15340.32
9 174.1 175.6 30310.81 30835.36 30571.96
10 187.9 114.1 35306.41 13018.81 21439.39
Total 1344.2 1520.1 196681.5 245991.8 202364.0

Example 1.5 To illustrate the use of (1.2) and (1.3), let’s calculate again Var(u), Var(v) and
Cov(u, v), where ui and vi are as in Example 1.4. Using (1.2) and the totals from Table 1.3 we have

    Var(u) = [196681.5 − (1344.2)²/10]/9 = 1777.128 and Var(v) = [245991.8 − (1520.1)²/10]/9 = 1657.93.

Using (1.3) and the totals from Table 1.3 we have

    Cov(u, v) = [202364.0 − (1344.2)(1520.1)/10]/9 = −218.650.
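The calculations in Examples 1.4 and 1.5, and identity (1.1), can be verified with a few lines of Python (a sketch; the bay A and B loads are transcribed from Table 1.1):

```python
u = [44.4, 130.4, 127.6, 127.7, 108.4, 184.0, 139.1, 120.6, 174.1, 187.9]  # Bay A
v = [138.4, 236.4, 202.5, 128.7, 154.3, 117.0, 125.9, 127.2, 175.6, 114.1]  # Bay B
n = len(u)

def var(x):                      # shortcut formula (1.2)
    return (sum(xi * xi for xi in x) - sum(x) ** 2 / n) / (n - 1)

def cov(x, y):                   # shortcut formula (1.3)
    return (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)

w = [a + b for a, b in zip(u, v)]           # combined loads, w_i = u_i + v_i

# identity (1.1): Var(w) = Var(u) + Var(v) + 2 Cov(u, v)
assert abs(var(w) - (var(u) + var(v) + 2 * cov(u, v))) < 1e-6
print(round(var(u), 3), round(var(v), 2), round(cov(u, v), 2))
```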

1.4 Sample Quantiles, Median and Interquartile Range


The location of non-symmetric data sets may be poorly represented by the sample mean because
the sample mean is very sensitive to the presence of outliers in the data. Notice that observations
far from the center have high “torque” or “leverage” and attract the sample mean (center of gravity)
toward them. The dispersion of non-symmetric data sets may also be poorly represented by the
sample standard deviation.

Example 1.6 A student with an average of 94.7% (SD=2.8%) on the first 10 assignments had a
personal problem and did very poorly on the eleventh where he got zero. Calculate his current
average and standard deviation.

Solution The mean drops from 94.7 to

    x̄ = [(10 × 94.7) + 0]/11 = 86.09.

To calculate the new standard deviation notice that Σ (xi − 94.7)² = 9 × 2.8² = 70.56 over the first
ten assignments, and by (1.2)

    Σ xi² = Σ (xi − 94.7)² + 10 × 94.7² = 70.56 + 89680.9 = 89751.46.

Therefore,

    Var(x) = [89751.46 + 0² − (11 × 86.09²)]/10 = 822.51,

and the standard deviation, then, increases from 2.8 to √822.51 = 28.68.
We will see that data sets which are asymmetric or include outliers may be better summarized
using the sample quantiles defined below.

Sample Quantiles

Let 0 < p < 1 be fixed. The sample quantile of order p, Q(p), is a number with the property
that approximately 100p% of the data points are smaller than it. For example, if the 0.95 quantile
for the class final grades is Q(0.95) = 85 then 95% of the students got 85 or less. If your grade is
87 then you are in the top 5% of the class. On the other hand, if your mark were smaller than
Q(0.10) then you would be in the lowest 10% of the class.
To compute Q(p) we follow these steps:

1 Sort the data from the smallest data point, x(1), to the largest data point, x(n), to obtain

    x(1) ≤ x(2) ≤ . . . ≤ x(n).

The ith smallest data point is denoted x(i).

2 Compute the number np + 0.5. If this number is an integer, m, then

    Q(p) = x(m).

If np + 0.5 is not an integer and m < np + 0.5 < m + 1 for some integer m, then

    Q(p) = (x(m) + x(m+1))/2.
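The two steps translate directly into code. Below is a minimal Python sketch (it assumes np + 0.5 is computed exactly, which holds for the quartile orders 0.25, 0.50 and 0.75 used here):

```python
def quantile(xs, p):
    """Sample quantile Q(p) following the two steps above."""
    s = sorted(xs)                      # step 1: x_(1) <= ... <= x_(n)
    n = len(s)
    h = n * p + 0.5                     # step 2
    if h == int(h):                     # np + 0.5 is an integer m
        return s[int(h) - 1]            # Q(p) = x_(m); lists are 0-indexed
    m = int(h)                          # m < np + 0.5 < m + 1
    return (s[m - 1] + s[m]) / 2        # Q(p) = (x_(m) + x_(m+1)) / 2

sample = [94, 93, 95, 91, 96, 91, 98, 93, 99, 97]
print(quantile(sample, 0.25), quantile(sample, 0.50), quantile(sample, 0.75))
# → 93 94.5 97
```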

Example 1.7 Let ui and vi be the live loads on the first two floors (see Table 1.4). Calculate the
quantiles of order 0.25, 0.50 and 0.75 for the live load on floors 1 and 2 and for the differences
wi = ui − vi between the live loads on these two floors.
Solution
To calculate the quantile of order 0.25 for the live load on floor 1, Qu(0.25), observe that n = 20,
p = .25 and so np + .5 = 20 × .25 + .5 = 5.5 is between 5 and 6. Using the column u(i) from Table
1.4 we obtain

    Qu(0.25) = (u(5) + u(6))/2 = (112.3 + 119.4)/2 = 115.85.

Similar calculations give Qv(0.25) = 109.25 and Qw(0.25) = −25.25. To calculate Qu(0.50) notice
that np + .5 = 20 × .50 + .5 = 10.5 is between 10 and 11. Again, using the column u(i) from Table
1.4 we obtain

    Qu(0.50) = (u(10) + u(11))/2 = (150.4 + 152.3)/2 = 151.35.

The reader can check using similar calculations that Qv(0.50) = 134.1, Qw(0.50) = 7, Qu(0.75) =
162.85, Qv(0.75) = 166.5 and Qw(0.75) = 38.
Unfortunately, the sample quantiles do not have the same nice properties as the sample mean
in relation to sums and differences of variables. For example

    Qu(0.50) − Qv(0.50) = 151.35 − 134.1 = 17.25

is quite different from Qu−v(0.50) = Qw(0.50) = 7. Also

    Qu(0.25) − Qv(0.25) = 115.85 − 109.25 = 6.6 ≠ −25.25 = Qu−v(0.25)

and

    Qu(0.75) − Qv(0.75) = 162.85 − 166.5 = −3.65 ≠ 38 = Qu−v(0.75).

Median and Interquartile Range

The quantiles Q(0.25), Q(0.5) and Q(0.75) are particularly useful and are given special names: lower
quartile, median and upper quartile. Notice that the lowest 25% of the data is below Q(0.25) and
the lowest 75% of the data is below Q(0.75). Because of that, Q(0.25) and Q(0.75) are also called
the first and third quartiles.
The lowest 50% of the data is below Q(0.5) and the other half is above it. Therefore the median
divides the data into two equal pieces, regardless of the shape of the histogram. Because of this
property and the fact that the median is not much affected by outliers, it is often used as a measure
of location (instead of the mean).
The mean and the median are equal in the case of perfectly symmetric data sets. They are also
close in the presence of mild asymmetry. But very asymmetric data sets can produce very different
means and medians. When the mean and the median roughly agree we will normally prefer the
mean because of its nicer numerical properties (see the comments at the end of Problem 1.7). When
they do not, however, we will normally prefer the median because of its resistance to outliers. A
large difference between the mean and the median is a strong indication of the presence of outliers
in the data which are severe enough to upset the sample mean.

Table 1.4: Live Load on the First and Second Floors


i ui u(i) vi v(i) wi w(i)
1 44.4 44.4 130.4 54.0 -86.0 -98.0
2 138.4 92.2 236.4 62.3 -98.0 -86.0
3 164.7 98.3 110.4 74.1 54.3 -61.7
4 98.3 105.4 154.5 101.1 -56.2 -56.2
5 178.0 112.3 108.1 108.1 69.9 -27.7
6 123.7 119.4 185.4 110.4 -61.7 -22.8
7 157.5 123.7 62.3 128.6 95.2 -17.2
8 119.4 138.4 74.1 130.4 45.3 -7.0
9 150.4 138.9 137.8 132.8 12.6 -5.1
10 92.2 150.4 54.0 133.1 38.2 1.4
11 169.8 152.3 168.4 135.1 1.4 12.6
12 181.5 156.3 147.5 137.8 34.0 27.7
13 105.4 157.5 133.1 147.5 -27.7 28.2
14 157.6 157.6 164.6 154.5 -7.0 34.0
15 168.4 161.0 173.5 164.6 -5.1 37.8
16 161.0 164.7 132.8 168.4 28.2 38.2
17 156.3 168.4 128.6 169.5 27.7 45.3
18 152.3 169.8 169.5 173.5 -17.2 54.3
19 138.9 178.0 101.1 185.4 37.8 69.9
20 112.3 181.5 135.1 236.4 -22.8 95.2
Mean 138.53 135.38 3.145
SD 34.66 43.61 51.37

As a rule of thumb we will calculate both the mean and the median and use the mean if they
are similar. Otherwise we will use the median. To guide our choice we can calculate the discrepancy
index

    d = √n |Mean − Median| / (2 IQR)

and choose the mean when d is smaller than 1. The interquartile range (IQR), used in the denom-
inator of d above, is defined as

    IQR = Q(0.75) − Q(0.25).

The IQR is recommended as a measure of dispersion in the presence of outliers and lack of symmetry.
Notice that the IQR is proportional to the length of the central half of the data, regardless of the
shape of the histogram, and it is not much affected by outliers.

Example 1.8 Refer to Example 1.6. Calculate the median, the interquatile range and the discrep-
ancy index d for the student’s marks before and after the eleventh assignment (The marks are 94,
93, 95, 91, 96, 91, 98, 93, 99, 97 and 0). just one

Solution Since the sorted marks (before the eleventh assignment) are 91, 91, 93, 93, 94, 95, 96, 97,
98, 99, Q(0.25) = x(3) = 93, Q(0.5) = (x(5) + x(6) )/2 = (94 + 95)/2 = 94.5 and Q(0.75) = x(8) = 97.

Therefore, Median(x) = 94.5, IQR(x) = 97 − 93 = 4 and d = √10 (94.7 − 94.5)/(2 × 4) = 0.07905694.
Including the eleventh assignment we have Q(0.25) = (x(3) + x(4) )/2 = (91 + 93)/2 = 92,
Q(0.5) = x(6) = 94 and Q(0.75) = (x(8) + x(9) )/2 = (96 + 97)/2 = 96.5. Therefore, the new median
and IQR are: Median(x) = 94 and IQR(x) = 96.5 − 92 = 4.5. Unlike the mean, the median is very
little affected by the single poor performance. This is also reflected by the large discrepancy index
d = √11 |86.09 − 94|/(2 × 4.5) = 2.915.
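As a sketch (not part of the text), the quartile convention used in this solution — quartiles as medians of the lower and upper halves, with the middle observation shared between halves when n is odd — and the discrepancy index can be coded in Python. The function names here are ours, and note that other software often uses different quantile conventions:

```python
import statistics

def quartiles(data):
    """Q(0.25), Q(0.5), Q(0.75) as medians of the lower half, the whole
    sample, and the upper half; when n is odd the middle observation is
    included in both halves (the convention used in Example 1.8)."""
    xs = sorted(data)
    half = (len(xs) + 1) // 2            # size of each half
    return (statistics.median(xs[:half]),
            statistics.median(xs),
            statistics.median(xs[-half:]))

def discrepancy_index(data):
    """d = sqrt(n) * |mean - median| / (2 * IQR)."""
    q1, med, q3 = quartiles(data)
    return len(data) ** 0.5 * abs(statistics.mean(data) - med) / (2 * (q3 - q1))

marks = [94, 93, 95, 91, 96, 91, 98, 93, 99, 97]
print(quartiles(marks))                  # (93, 94.5, 97)
print(discrepancy_index(marks))          # ~0.079: mean and median agree
print(discrepancy_index(marks + [0]))    # ~2.915: use the median instead
```

The two printed values of d reproduce the hand calculations above.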

Example 1.9 Table 1.5 gives the mean, median, standard deviation and IQR for the data sets on
Figure 1.2. The mean and median of Tobin’s Q ratios show appreciable differences (d = 2.98). In
addition, their standard deviation is more than twice their IQR. Clearly, the mean and standard
deviation are upset by a few heavily over–rated firms. Tobin’s Q ratios are then better represented
by their median and IQR. The effect of outliers and lack of symmetry is moderate in the case of the
“Age of Officers” data. Although d = 1.07, the mean and standard deviation still summarize these
data well. Finally, for the “Speed of Light” data the two clear (lower) outliers do not seem to have
much effect on the sample mean (d = 0.64).

Table 1.5: Summary figures for the data sets displayed on Figure 1.2

Data Set Mean Median Discrepancy S. Deviation IQR


Tobin’s Q ratio 158.6 118.5 2.98 97.749 47.593
Age of officers 51.494 52 1.07 1.739 2.222
Speed of light 24.826 24.827 0.64 0.011 0.005

1.5 Box Plot


The box plot is a powerful tool to display and compare data sets. It is just a box with “whiskers”
which helps to visualize the main quantiles (Q(0.25), Q(0.50) and Q(0.75)) and the extreme data
points (maximum and minimum).
For the following discussion refer to Figure 1.4 (b) and (d). The lower and upper ends of
the box are determined by the lower and upper quartiles (Q(0.25) and Q(0.75)); a line sectioning
the box displays the sample median and its relative position within the interquartile range. The
median then divides the main box into two smaller sub–boxes which represent the lower and upper
central quarters of the data. Symmetric data sets have upper and lower sub–boxes of equal size.
Asymmetric data sets have sub–boxes of different sizes, the larger one indicating the direction of
the asymmetry. The data on Figure 1.4 (b) is mildly asymmetric with a longer lower tail: the lower
sub–box is larger than the upper one and the lower whisker is longer than the upper one. The data
on Figure 1.4 (d) is symmetric. The location and dispersion of a data set are also clearly conveyed
by the box plot: the position of the box (and the median line) give the location; the size (length)
of the box (proportional to the IQR) gives the dispersion. Larger boxes indicate larger dispersion.
Finally, the whiskers at either end extend to the extreme values (maximum and minimum).
Points which are above Q(0.75) + 1.5IQR or below Q(0.25) − 1.5IQR are considered outliers.
The following rule is used to help visualize outliers in the data: the length of the whiskers should
not exceed 1.5 IQR, and points outside this range are displayed as unconnected horizontal lines. This
is illustrated by Figure 1.4 (a) and (c) where the presence of outliers is flagged by the existence of
unconnected horizontal lines above the upper whisker (Figure 1.4 (a)) or below the lower whisker
(Figure 1.4 (c)).
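As a quick sketch of the 1.5·IQR outlier rule (the helper function name is ours, not from the text):

```python
def box_plot_fences(q1, q3):
    """Fences for flagging outliers: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles from Example 1.8 with the failed eleventh assignment included
lo, hi = box_plot_fences(92, 96.5)
print(lo, hi)      # 85.25 103.25 -- the mark of 0 lies below the lower fence
```

The mark of 0 falls below the lower fence, so a box plot of these marks would display it as an unconnected point below the lower whisker.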

[Figure 1.4: Box plots for the data sets displayed on Figure 1.2 — (a) Tobin’s Q ratio, (b) Age of officers, (c) Speed of light, (d) outliers deleted.]

Example 1.10 Table 2.3 gives the monthly average flow (cubic meters per second) for the Fraser
River at Hope, BC, for the period 1971–1990. Figure 1.5 gives the box plots for each month, from
January to December (from left to right). The year to year distributions of the monthly flows are
mildly asymmetric, with longer upper tails, and there are some outliers. However, the location and
dispersion summaries (see Table 1.6) are roughly consistent for most months and point to the same
conclusion: the river flow, and its variability as well, are much larger in the summer.

Table 1.6: Fraser River Monthly Flow (cms)

Statistic Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Mean 957.4 894.8 993.1 1941.0 4994.5 6973.0 5505.0 3548.0 2340.0 1816.0 1588.9 1092.4
Median 868.0 849.5 926.5 2010.0 5000.0 6365.0 5120.0 3380.0 2245.0 1910.0 1525.0 1005.0
SD 274.4 202.8 233.5 477.8 976.4 1434.2 1212.2 886.4 685.6 401.7 366.1 282.2
IQR 174.6 163.0 257.0 427.8 613.0 1325.9 1277.8 505.6 446.3 424.1 377.8 181.1

[Figure 1.5: Fraser River monthly flow (cms) from January (left) to December (right) — side-by-side box plots.]

1.6 Exercises
Problem 1.1 From a department store’s records for a particular month, the total monthly finance
charges (in dollars) were obtained from 240 customer accounts that included finance charges (see
Table 1.7).
(a) Complete the frequency table. What percentage of customers were charged less than $20?

Table 1.7: Finance Charges from 240 Accounts


Class Limits Numbers of Customers
0–5 65
5 – 10 88
10 – 15 42
15 – 20 27
20 – 25 18

(b) Construct a histogram using the five classes given above.


(c) Calculate the mean, variance and standard deviation.
Problem 1.2 Before microwave ovens are sold, the manufacturer must check to ensure that the
radiation coming through the door is below a specified safe limit. The amounts of radiation leakage
(mW/cm²) from 25 ovens, with the door closed, are:
15 9 18 10 5
12 8 5 8 10
7 2 1 5 3
5 15 10 15 9
8 18 1 2 11

(a) Calculate the mean, variance and standard deviation.


(b) What are the median, quartiles and interquartile range?
(c) Compare the results of (a) and (b).
(d) Draw the box plot.

Problem 1.3 The following data are the waiting times (in minutes) between eruptions of Old
Faithful geyser between August 6 and 10, 1985.

816 611 796 573 809


778 599 774 748 723
796 1051 820 748
682 781 772 797
711 578 696 851

(a) Calculate the mean, variance and standard deviation.


(b) What are the median, quartiles and interquartile range?
(c) Compare the results of (a) and (b).
(d) Draw the box plot.

Problem 1.4 The following numbers are the final marks of 16 students in a previous STAT 251
class.

64 86 77 68 95 91 58 91 83 97 96 14 32 68 89 75

(a) Calculate the mean, variance and standard deviation.


(b) What are the median, quartiles and interquartile range?
(c) Compare the results of (a) and (b).
(d) Draw the box plot.

Problem 1.5 In 1798, Henry Cavendish estimated the density of the earth (as a multiple of the
density of water) by using a torsion balance. The dataset below contains his 29 measurements.

Table 1.8: Cavendish Measurements of the Density of the Earth


5.50 5.47 5.29 5.55 5.75 5.27
5.57 4.88 5.34 5.34 5.29 5.85
5.42 5.62 5.26 5.30 5.10 5.65
5.61 5.63 5.44 5.36 5.86 5.39
5.53 4.07 5.46 5.79 5.58

(a) Calculate the mean, variance and standard deviation.


(b) What are the median, quartiles and interquartile range?
(c) Compare the results of (a) and (b). In particular, calculate the discrepancy index between the
mean and median.
(d) Briefly state your conclusions.

Problem 1.6 The mean size of twenty five recent projects at a construction company (in square
meters) is 25,689 m2 . The standard deviation is 2,542 m2 .
(a) Calculate the mean, variance and standard deviation in square feet [Hint: 1 foot = 0.3048 m].
(b) A new project of 226,050 ft² has just been completed. Update the mean, variance and standard
deviation.

Problem 1.7 The daily sales in April, 1994 for two departments of a large department store (in
thousands of USA dollars) are summarized below.

Table 1.9: Daily Sales, April 1994


Department A Department B
Mean 24.3 32.4
Standard Deviation 12.4 10.3
Covariance 96.1

(a) Convert the figures above to hundreds of Canadian dollars (CN $1 = US $0.7)
(b) Calculate the mean and standard deviation for the total daily sales for the two departments.
Why do you think the combined daily sales are more variable than the individual ones?
(c) Calculate the mean and standard deviation for the difference in daily sales between the two
departments. Comment on your results.
(d) Under what conditions would the variance of the sums be smaller than the variance of the
differences?

Problem 1.8 A manufacturer of automotive accessories provides bolts to fasten the accessory to
the car. Bolts are counted and packaged automatically by a machine. There are several adjustments
that affect the machine operation. An experiment to find out how several variables affect the speed
of the packaging process was carried out. In particular, the total number of bolts to be counted (10
and 30) and the sensitivity of the electronic eye (6 and 10) have been considered. The observed
times (in seconds per bolt) are given in Table 1.10.
(a) Summarize and describe the data.
(b) What adjustments have the greatest effect?
(c) How would you adjust the machine to shorten the packaging time?

Problem 1.9 Find the average, variance and standard deviation for the following sets of numbers.
a) 1, 2, 3, 4, 5, . . . , 300
b) 4, 8, 12, 16, 20, . . . , 1200
c) 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, . . . , 9, 9, 9, 9, 9, 9, 9, 9, 9
Hint: Σ_{i=1}^n i = n(n + 1)/2, Σ_{i=1}^n i² = n(n + 1)(2n + 1)/6, Σ_{i=1}^n i³ = n²(n + 1)²/4 and
Σ_{i=1}^n i⁴ = n(n + 1)(6n³ + 9n² + n − 1)/30.
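The closed-form power sums in the hint can be spot-checked numerically; a throwaway sketch:

```python
# Verify each closed-form power sum against a brute-force sum for small n.
for n in (5, 20, 100):
    ints = range(1, n + 1)
    assert sum(ints) == n * (n + 1) // 2
    assert sum(i**2 for i in ints) == n * (n + 1) * (2 * n + 1) // 6
    assert sum(i**3 for i in ints) == n**2 * (n + 1)**2 // 4
    assert sum(i**4 for i in ints) == n * (n + 1) * (6 * n**3 + 9 * n**2 + n - 1) // 30
print("power-sum identities check out")
```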

Table 1.10: Time for Counting and Packaging Bolts

10 Bolts 30 Bolts Low Sens (6) High Sens (10)


0.57 0.90 0.57 1.76
1.76 0.65 1.13 0.84
1.13 0.62 1.67 1.20
0.84 0.86 0.92 0.39
1.67 0.63 0.90 0.65
1.20 0.75 0.62 0.86
0.92 0.80 0.63 0.75
0.39 1.00 0.80 1.00
1.34 4.31 1.34 3.43
3.43 3.58 3.97 1.06
3.97 3.72 2.89 3.56
1.06 3.64 1.72 0.60
2.89 3.35 4.31 3.58
3.56 3.64 3.72 3.64
1.72 3.55 3.35 3.64
0.60 4.47 3.55 4.47

Table 1.11: Earthquakes in 1993


Magnitude Frequency
0.1–1.0 9
1.0–2.0 1177
2.0–3.0 5390
3.0–4.0 4263
4.0–5.0 5034
5.0–6.0 1449
6.0–7.0 141
7.0–8.0 15
8.0–9.0 1

Problem 1.10 The numbers of worldwide earthquakes in 1993 are shown in Table 1.11.
(a) Complete the frequency table. What percentage of earthquakes were below 5.0? Above 6.0?
(b) Draw a histogram and comment on it.
(c) Calculate the mean and standard deviation for the earthquake magnitude in 1993.

Problem 1.11 The daily number of customers served by a fast food restaurant were recorded for
30 days including 9 weekends and 21 weekdays. The average and standard deviations are as follows:
Weekends: x̄1 = 389.56, SD1 = 27.4
Weekdays: x̄2 = 402.19, SD2 = 26.2
Calculate the average and standard deviation for the 30 days.

Problem 1.12 The average and the standard deviation for the weights of 200 small concrete–mix
bags (nominal weight = 50 pounds) are 51.2 pounds and 1.5 pounds, respectively. A new sample
of 200 large concrete–mix bags (nominal weight = 100 pounds) have just been weighed. Do you
expect that the standard deviation for the last sample will be closer to 1.5 pounds or to 3.0 pounds?
Justify your answer.

Problem 1.13 Given the data set x1 = 1, x2 = 3, x3 = 8, x4 = 12, x5 = 20, calculate the function

D(t) = Σ_{i=1}^5 |x_i − t|,

for several values of t between 1 and 20, and plot D(t) versus t. Where is the minimum achieved?
Do the same “experiment” for the data set x1 = 1, x2 = 3, x3 = 8, x4 = 12. Do you notice
any pattern? If so, repeat this experiment for several additional sets of numbers, to investigate the
persistence of this pattern. What is your conclusion? Can you prove it mathematically?

Problem 1.14 Each pair (xi , wi ), i = 1, · · · , n, represents the placement and magnitude of a ver-
tical force acting on a uniform beam. Find the center of gravity of this system. [Hint: see the
discussion under “The Sample Mean as Center of Gravity” and notice that in the present case the
vertical forces are not equal].

Problem 1.15 Calculate the center of gravity of the system when the placements (xi ) and weights
(wi ) are given by Table 1.12.

Table 1.12: Placements of Vertical Forces on a Uniform Beam


xi wi xi wi
1.8 2.1 1.2 1.5
1.4 1.6 1.3 4.7
1.3 1.4 1.2 2.3
3.8 6.4 1.2 2.3
1.2 1.3 1.4 3.1
1.9 1.2 1.3 1.9
1.2 1.2 1.6 2.4
1.1 3.1 1.1 3.7
1.1 1.1 1.2 1.2

Problem 1.16 Each pair (xi , wi ), i = 1, · · · , n, represents the placement and magnitude of a ver-
tical force acting on a uniform beam. What values of wi would make the sample median the center
of gravity? Consider the cases when n is even and n odd separately.

Problem 1.17 The maximum annual flood flows for a certain river, for the period 1941–1990, are
given in Table 1.13.
(i) Summarize and display these data.
(ii) Compute the mean, median, standard deviation and interquartile range.
(iii) If a one–year construction project is being planned and a flow of 150000 cfs or greater will halt
construction, what is the “probability” (based on past relative frequencies) that the construction
will be halted before the end of the project? What if it is a two-year construction project?

Problem 1.18 The planned and the actual times (in days) needed for the completion of 20 job
orders are given in Table 1.14.
(a) Calculate the average and the median planned time per order. Same for the actual time.
(b) Calculate the corresponding standard deviations and interquartile ranges.

Table 1.13: Maximum annual flood flows


Year Flood, cfs Year Flood, cfs
1941 153000 1966 159000
1942 184000 1967 75000
1943 66000 1968 102000
1944 103000 1969 55000
1945 123000 1970 86000
1946 143000 1971 39000
1947 131000 1972 131000
1948 99000 1973 111000
1949 137000 1974 108000
1950 81000 1975 49000
1951 144000 1976 198000
1952 116000 1977 101000
1953 11000 1978 253000
1954 262000 1979 239000
1955 44000 1980 217000
1956 8000 1981 103000
1957 199000 1982 86000
1958 6000 1983 187000
1959 166000 1984 57000
1960 115000 1985 102000
1961 88000 1986 82000
1962 29000 1987 58000
1963 66000 1988 34000
1964 72000 1989 183000
1965 37000 1990 22000

(c) If there is a delay penalty of $5000 per day and a before–schedule bonus of $2500 per day, what
is the average net loss (negative loss = gain) due to differences between planned and actual times?
What is the standard deviation?
(d) Study the relationship between the planned and actual times.
(e) What would be your advice to the company based on the analysis of these data?
Problem 1.19 Show that
(a) Cov(x, y) = [(Σ_{i=1}^n x_i y_i) − n x̄ ȳ]/(n − 1).
(b) If u_i = a + b x_i and v_i = c + d y_i, then Cov(u, v) = bd Cov(x, y).

Table 1.14: The planned and the actual times

Order Planned Time Actual Time Order Planned Time Actual Time
1 22 22 11 17 18
2 11 8 12 27 34
3 11 8 13 16 14
4 16 14 14 30 35
5 21 20 15 22 18
6 12 16 16 17 16
7 25 29 17 13 12
8 20 20 18 18 14
9 13 10 19 21 19
10 34 39 20 18 17

Problem 1.20 The total paved area, X (in km2 ), and the time, Y (in days), needed to complete
the project was recorded for 25 different jobs. The data is summarized as follows:

x̄ = 12.5 km², SD(x) = 1.2 km²
ȳ = 30.8 days, SD(y) = 3.7 days
Cov(x, y) = 3.4
Give the corresponding summaries when the area is measured in ft² and the time is measured in
hours.
Hint: 1 foot = 0.3048 m, and 1 km = 1000 m.
Chapter 2

Summary and Display of Multivariate Data

In practice, we usually consider several variables simultaneously. In addition to describing each


variable as in Chapter 1, we may wish to investigate their possible relationships. Some examples
are provided by the first–crack and failure load data on Table 2.1, the Fraser River flow data in
Table 2.3 and the yield data in Table 2.2. Are the first–crack and failure load of concrete beams
related? Is it possible to use the first–crack load to predict the failure load? Are the Fraser River
mean monthly flows related? Is it possible to use the average flows from previous months to predict
the current and future months’ flows? How does the temperature affect the yield of the chemical
process? Is there a simple equation relating the yield response to changes in temperature?
As explained in the previous chapter, raw data must be summarized and/or graphically dis-
played to facilitate their analysis. We will now learn some simple techniques which can be used to
summarize multivariate data and describe their relationships. In the next sections we will introduce
scatter plots, correlation coefficients, multiple correlation coefficients, simple linear regression and
multiple linear regression.

2.1 Scatter Plot


Simultaneous observations on a pair of variables (xi , yi ), i = 1, . . . , n, can be graphically displayed
on a scatter plot. Each observation is represented as a point with x–coordinate xi and y–coordinate
yi . Scatter plots help in visualizing statistical relationships between variables (or the lack of them).

Linear Association and Causality


Some examples of scatter plots are presented on Figure 2.1. The dotted lines represent the
means for the “x” and “y” variables. For example the mean flows for January, February and June
are 957.4, 894.8 and 6973, respectively. Figure 2.1 (a) shows a positive linear association between
January and February flows: years with higher than average flows in January tend to have also
higher than average flows in February and vice versa for lower than average flows. Figure 2.1 (b),
on the other hand, shows a lack of linear association: years with higher than average flows in January
come together with higher than average and lower than average flows in June with approximately

[Figure 2.1: Some Examples of Scatter Plots — (a) Jan–Feb Fraser Flow, (b) Jan–Jun Fraser Flow, (c) House Age and Price, (d) Mean Monthly Flow. Dotted lines mark the sample means of the x and y variables.]

the same frequency. Similarly for lower than average January flows. Figure 2.1 (c) shows a negative
linear association between the age and price of twenty randomly selected houses: older than average
houses tend to have lower than average prices and vice versa for newer houses; Figure 2.1 (d) shows
a non–linear association between time of the year and river flow: the monthly mean flows first
increase (until June) and then decrease.
A common mistake is to confuse the concepts of linear association and causality. If we find a
positive linear association between two variables we can say that they tend to take values above and
below their means simultaneously. The observed linear association may be the result of a causal
relation between the variables – an increase in one of them causes an increase in the other. On many
occasions, however, observed linear associations are the result of the action of a third variable (called
a lurking variable) which drives the other two. For instance, the linear association between January
and February Fraser flows might be due to the effect of a lurking variable, namely the weather. If in
a given year we artificially increase the Fraser January flow we cannot expect a naturally occurring
higher flow in February.

Several Pairs of Variables

We often wish to investigate the pairwise relations between several variables. This can
be accomplished in several ways. One way is to use different symbols (dots, stars, letters, numbers,
etc.) to represent the points and overlay the scatter plots on a single picture, facilitating their
comparison. For instance, the weights and heights of men and women could be plotted on a single
scatter plot using the letter “w” for women and “m” for men.
Another technique for dealing with several variables is to display the scatter plots in a “matrix”
layout. Scatter plot matrices are useful for uncovering possible patterns in the pairwise association
structure. An example is given by Figure 2.2. Notice that the strength of association decreases as
months get further apart. Moreover, while January, February and March show some association,
April and May seem to have less (if any) association with other months.
[Figure 2.2: Fraser River Monthly Average Flow (1914–1990) — scatter plot matrix for the January through May flows.]

2.2 Covariance and Correlation Coefficient


The covariance and the correlation coefficient are used to quantify the degree of linear association
between pairs of variables. If two variables, x_i and y_i, are positively associated then when one of
them is above (below) its mean the other will also tend to be above (below) its mean. Therefore,
the products (x_i − x̄)(y_i − ȳ) will be mostly positive and the sample covariance,

Cov(x, y) = [1/(n − 1)] Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)      (2.1)

will be large and positive. On the other hand, if the variables are negatively associated, when one
of them is above (below) its mean the other will tend to be below (above) its mean and so the
products (x_i − x̄)(y_i − ȳ) will be mostly negative. In this case the sample covariance (2.1) will be
large and negative. Finally, if the variables are neither positively nor negatively associated, the products
(x_i − x̄)(y_i − ȳ) will be positive and negative with approximately the same frequency (there will be
a fair degree of cancellation) and the sample covariance will be small.
The following formula provides a simple procedure for the hand calculation of the covariance:

Cov(x, y) = [1/(n − 1)] Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = [n/(n − 1)] [\overline{xy} − x̄ ȳ], where \overline{xy} = (1/n) Σ_{i=1}^n x_i y_i      (2.2)

Some problems with the interpretation of the covariance and its direct use as a measure of linear
association are illustrated in Example 2.1.
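To make the equivalence of (2.1) and (2.2) concrete, here is a small Python check (a sketch; both function names are ours):

```python
def covariance(xs, ys):
    """Sample covariance via the definition (2.1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def covariance_shortcut(xs, ys):
    """Hand-calculation form (2.2): [n/(n-1)] * (mean of x*y - mean(x)*mean(y))."""
    n = len(xs)
    xybar = sum(x * y for x, y in zip(xs, ys)) / n
    return n / (n - 1) * (xybar - (sum(xs) / n) * (sum(ys) / n))

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]
print(covariance(xs, ys), covariance_shortcut(xs, ys))   # both give 1.0
```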

Example 2.1 Consider the measurements (x_i, y_i) of the first–crack and failure load (in pounds
per square foot) on Table 2.1. Figure 2.3 suggests that there is little association between these measurements.
Since x̄ = 8396.6 pounds per square foot, ȳ = 16,064.4 pounds per square foot, and

\overline{xy} = 134,875,645 square pounds per ft⁴, from (2.2)

Cov(x, y) = (20/19) [134,875,645 − (8396.6)(16,064.4)] = −11,258.99 square pounds per ft⁴.

If the loads are given in thousands of pounds per square foot instead of pounds per square foot, then
u_i = x_i/1000, v_i = y_i/1000 and, from Problem 1.19,

Cov(u, v) = Cov(x, y)/(1000 × 1000) = −0.011259 million square pounds per ft⁴.

Table 2.1: Strength of concrete beams

Unit First–Crack Load (X) Failure Load (Y)


1 7610 18103
2 9528 15283
3 7071 19171
4 7463 16014
5 4440 12840
6 10929 19606
7 12385 14570
8 5734 16755
9 6342 15713
10 6772 17094
11 7519 13808
12 8511 16480
13 9087 16131
14 9072 15315
15 12157 12683
16 6504 14625
17 6654 16615
18 8700 15643
19 11613 15480
20 9841 19359

Correlation Coefficient
Problem 2.1 illustrates the strong dependence of Cov(x, y) on the scale of the variables. A
measure of linear association which is independent of the scale of the variables (see 2.5) is provided by the
sample correlation coefficient,

r(x, y) = Cov(x, y)/√(Var(x) Var(y)) = Cov(x, y)/(SD(x) SD(y)).

More precisely, if u_i = a + b x_i and v_i = c + d y_i then r(u, v) = sign(bd) r(x, y).


Another advantage of r(x, y) is that it takes values between −1 and 1 (see 2.7). Therefore,
values of r(x, y) close to 1 indicate positive linear association, values of r(x, y) close to −1 indicate
negative linear association. Values of r(x, y) close to 0 indicate lack of linear association.
For the data in Example 2.1, Cov(x, y) = −11258.99, SD(x) = 2193.17, SD(y) = 1949.36 and

r(x, y) = −11258.99/[(2193.17)(1949.36)] = −0.0026.
[Figure 2.3: First–Crack Load vs Failure Load — scatter plot of failure load versus first-crack load.]

The small value of r(x, y) confirms the qualitative impression from Figure 2.3 that the first crack
and the failure loads (in the case of these concrete beams) are not related. The main implication
from a practical point of view is that the first crack of a given beam cannot be used to predict its
ultimate failure load.
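The arithmetic above is easy to reproduce; a minimal Python sketch using the summary figures from Example 2.1 (the function name is ours):

```python
def correlation(cov_xy, sd_x, sd_y):
    """Sample correlation coefficient r = Cov(x, y) / (SD(x) * SD(y))."""
    return cov_xy / (sd_x * sd_y)

r = correlation(-11258.99, 2193.17, 1949.36)
print(round(r, 4))                               # -0.0026

# Rescaling both variables (e.g. to thousands of pounds per square foot)
# rescales the covariance and the SDs but leaves r unchanged.
r_scaled = correlation(-11258.99 / 1e6, 2193.17 / 1e3, 1949.36 / 1e3)
print(abs(r - r_scaled) < 1e-12)                 # True
```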

Example 2.2 Table 2.2 gives the results of an experiment to study the relation between temperature
(in units of 10° Fahrenheit) and yield of a certain chemical process (percentage). The
reader can verify that in this case x̄ = 34.5, ȳ = 43.07, Var(x) = 77.50, Var(y) = 128.06 and
Cov(x, y) = 96.2759. Therefore, the correlation coefficient,

r(x, y) = 96.2759/√(77.50 × 128.06) = 0.9664 ≈ 0.97,

indicates a strong positive linear association between temperature and yield. This is also clearly
suggested by the scatter plot in Figure 2.4. Notice that the relation between yield and temperature
is likely to be causal, that is, the increase in yield may actually be caused by the increase in
temperature.

Several Pairs of Variables

When we have several variables their covariances and correlation coefficients can be arranged in
matrix layouts called covariance matrix and correlation matrix. Although the covariance matrix is
difficult to interpret due to its dependence on the scale of the variables, it is nevertheless routinely
computed for future usage.
The correlation matrix is the numerical counterpart of the scatter plot matrix discussed before.
For the Fraser River data (see Figure 2.2) we have

Table 2.2: Yield of a chemical process

Unit Temp. (X) Yield (Y) Unit Temp. (X) Yield (Y)
1 20 28 16 35 41
2 21 26 17 36 45
3 22 22 18 37 53
4 23 25 19 38 46
5 24 27 20 39 44
6 25 32 21 40 49
7 26 31 22 41 53
8 27 33 23 42 49
9 28 38 24 43 51
10 29 41 25 44 55
11 30 41 26 45 56
12 31 38 27 46 58
13 32 41 28 47 58
14 33 46 29 48 58
15 34 44 30 49 63

Jan Feb Mar Apr May


Jan 1.00 0.78 0.65 0.40 0.18
Feb 0.78 1.00 0.75 0.34 0.15
Mar 0.65 0.75 1.00 0.50 0.19
Apr 0.40 0.34 0.50 1.00 0.29
May 0.18 0.15 0.19 0.29 1.00

As already observed from Figure 2.2, February flows are somewhat correlated with January and
March flows (with correlation coefficients 0.78 and 0.75, respectively). January and March flows
are also marginally correlated (correlation coefficient equal to 0.65). The correlation coefficients
between all the other pairs of months are below 0.50.
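A correlation matrix like the one above can be computed column by column; a pure-Python sketch (the function name and the tiny data set are ours — these are not the Fraser data):

```python
def corr_matrix(columns):
    """Correlation matrix for a list of equal-length numeric columns."""
    def corr(xs, ys):
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        sxx = sum((x - xbar) ** 2 for x in xs)
        syy = sum((y - ybar) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5          # the 1/(n-1) factors cancel
    return [[corr(a, b) for b in columns] for a in columns]

cols = [[1.0, 2.0, 3.0, 4.0],                    # three made-up "months"
        [1.1, 1.9, 3.2, 3.8],
        [4.0, 1.0, 3.0, 2.0]]
m = corr_matrix(cols)
print([round(v, 2) for v in m[0]])               # [1.0, 0.99, -0.4]
```

As in the matrix above, the result has ones on the diagonal and is symmetric.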

2.3 The Least Squares Regression Line


The scatter plot of linearly associated variables approximately follows a linear function

f̂(x) = β̂0 + β̂1 x

called the regression line. The hats indicate that β̂0, β̂1 and f̂(x) are calculated from the data. In this
context X and Y play different roles and are given special names. The independent variable X is
called the explanatory variable and the dependent variable Y is called the response variable.
Least Squares
The solid line on Figure 2.4 (see Example 2.2) was obtained by the method of least squares (LS).
According to this method, the regression coefficients (the intercept β̂0 and the slope β̂1 ) minimize
(in b0 and b1 ) the sum of squares
S(b0, b1) = Σ_{i=1}^n (y_i − b0 − b1 x_i)².
[Figure 2.4: Yield vs Temperature — scatter plot of yield versus temperature with the least squares line.]

The LS coefficients are the solution to the linear equations

Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i) = 0
Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i) x_i = 0      (Gauss Equations)

which are obtained by differentiating S(b0, b1) with respect to b0 and b1. Carrying out the summations
and dividing by n we obtain,

ȳ − β̂0 − β̂1 x̄ = 0      (2.3)

\overline{xy} − β̂0 x̄ − β̂1 \overline{xx} = 0      (2.4)

where

\overline{xy} = (1/n) Σ_{i=1}^n x_i y_i  and  \overline{xx} = (1/n) Σ_{i=1}^n x_i²      (2.5)

From (2.3), β̂0 = ȳ − β̂1 x̄. Substituting this into (2.4) and solving for β̂1 gives

β̂1 = (\overline{xy} − x̄ ȳ) / (\overline{xx} − x̄²).
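Equations (2.3)–(2.5) translate directly into code; a Python sketch (the function name is ours):

```python
def ls_line(xs, ys):
    """Intercept and slope from the averaged Gauss equations (2.3)-(2.4)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    xybar = sum(x * y for x, y in zip(xs, ys)) / n     # mean of x*y
    xxbar = sum(x * x for x in xs) / n                 # mean of x^2
    beta1 = (xybar - xbar * ybar) / (xxbar - xbar * xbar)
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Points lying exactly on y = 1 + 2x are recovered exactly
print(ls_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))   # (1.0, 2.0)
```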

Fitted Values and Residuals



The regression line fˆ(x) and the regression coefficients β̂0 and β̂1 are good summaries for linearly
associated data. In this case the fitted value
ŷi = fˆ(xi ) = β̂0 + β̂1 xi (Fitted Value)
will be “close” to the observed value of yi . How close depends on the strength of the linear associ-
ation. The differences between the observed values yi and the fitted values ŷi ,
ei = yi − ŷi (Residual),
are called regression residuals.
Residual Plot
The regression residuals ei are usually plotted against the fitted values ŷi to determine the
appropriateness of the linear regression fit. If the data are well summarized by the regression line
(see Figure 2.5 (a)) the corresponding scatter plot of (ŷi , ei ) has no systematic pattern (see Figure
2.5 (c)). Examples of “bad” residual plots – that is, plots that indicate that the regression line is a
poor summary for the data – are given on Figure 2.5 (d) and (e). The corresponding scatter plots
and linear fits are given on Figure 2.5 (b) and (c). In the case of Figure 2.5 (d), the residuals go
from positive to negative and back to positive, suggesting that the relation between X and Y may
not be linear. In the case of Figure 2.5 (e) larger fitted values have larger residuals (in absolute
value).

2.4 Multiple Linear Regression


In practice we often use several explanatory variables to “predict” or “interpolate” the values of a
single response variable. The explanatory variables may all be distinct or may include functions
(powers) of the observed explanatory variables.
If for example, we have p explanatory variables (X1 , X2 , · · ·, Xp ) and n observations or “cases”,
it is convenient to use double subscript notation. The first subscript (i) indicates the case and the
second subscript (j) indicates the variable.
Case (i)   Response Variable (yi)   Explanatory Variables (xij)
   1              y1                 x11  x12  ...  x1p
   2              y2                 x21  x22  ...  x2p
   3              y3                 x31  x32  ...  x3p
   .              .                   .    .         .
   .              .                   .    .         .
   n              yn                 xn1  xn2  ...  xnp
The linear regression function is now given by
fˆ(x) = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂p xp ,
and the regression coefficients (β̂0 , β̂1 , . . ., β̂p ) minimize (in b0 , b1 , . . ., bp ) the sum of squares

$$S(b_0, b_1, \ldots, b_p) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_p x_{ip})^2.$$

The least squares coefficients are the solution to the linear equations

$$\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip}) &= 0 \\
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip})\,x_{i1} &= 0 \\
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip})\,x_{i2} &= 0 \\
&\;\,\vdots \\
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip})\,x_{ip} &= 0
\end{aligned} \qquad \text{(Gauss Equations)}$$

which are obtained by differentiating S(b0 , b1 , · · · , bp ) with respect to b0 , b1 , . . ., bp .


Carrying out the sums and dividing by n we obtain,

$$\begin{aligned}
\bar y - \hat\beta_0 - \hat\beta_1 \bar x_1 - \hat\beta_2 \bar x_2 - \cdots - \hat\beta_p \bar x_p &= 0 \\
\overline{x_1 y} - \hat\beta_0\,\bar x_1 - \hat\beta_1\,\overline{x_1 x_1} - \hat\beta_2\,\overline{x_2 x_1} - \cdots - \hat\beta_p\,\overline{x_p x_1} &= 0 \\
\overline{x_2 y} - \hat\beta_0\,\bar x_2 - \hat\beta_1\,\overline{x_1 x_2} - \hat\beta_2\,\overline{x_2 x_2} - \cdots - \hat\beta_p\,\overline{x_p x_2} &= 0 \\
&\;\,\vdots \\
\overline{x_p y} - \hat\beta_0\,\bar x_p - \hat\beta_1\,\overline{x_1 x_p} - \hat\beta_2\,\overline{x_2 x_p} - \cdots - \hat\beta_p\,\overline{x_p x_p} &= 0
\end{aligned}$$

where

$$\overline{x_j y} = (1/n)\sum_{i=1}^{n} x_{ij}\,y_i \quad\text{and}\quad \overline{x_j x_k} = (1/n)\sum_{i=1}^{n} x_{ij}\,x_{ik}. \tag{2.6}$$
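A sketch of setting up and solving these equations numerically with NumPy; the data are synthetic and the variable names are illustrative. In matrix form the Gauss (normal) equations are an equivalent restatement, (XᵀX)β̂ = Xᵀy, where X includes a leading column of ones for the intercept:

```python
# Solve the normal equations for a two-predictor regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X1 = rng.normal(size=(n, p))               # explanatory variables x_ij
beta_true = np.array([1.0, 2.0, -0.5])     # coefficients used to generate y
y = beta_true[0] + X1 @ beta_true[1:] + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), X1])      # prepend the intercept column
# Gauss (normal) equations: (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat.round(2))
```

The residual vector y − Xβ̂ is orthogonal to every column of X, which is exactly the system of equations displayed above.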

2.5 Exercises
Problem 2.1

Problem 2.2 The following data give the logarithm (base 10) of the volume occupied by algal
cells on successive days, taken over a period over which the relative growth rate was approximately
constant.
Figure 2.5: Examples of linear regression fits (above) and their residual plots (below). Panels: (a) Linear Relation, (b) Nonlinear Relation, (c) Increasing Variability; (d) Patternless Residuals, (e) Quadratic Pattern, (f) Megaphone Pattern.

Day (x) log Volume (log(y))


1 3.592
2 3.823
3 4.174
4 4.534
5 4.956
6 5.163
7 5.495
8 5.602
9 6.087
(1) Plot log y against x. Do you think using the logarithmic scale is appropriate? Why?
(2) Calculate and interpret the sample correlation coefficient.
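A possible computation for part (2), using the day and log-volume values from the table above:

```python
# Correlation of day vs. log-volume for the algal growth data.
import math

x = list(range(1, 10))                     # days 1..9
logy = [3.592, 3.823, 4.174, 4.534, 4.956,
        5.163, 5.495, 5.602, 6.087]        # log10(volume), from the table
n = len(x)
xbar, ybar = sum(x) / n, sum(logy) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, logy))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in logy)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))   # very close to 1: growth is nearly linear on the log scale
```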

Problem 2.3 The maximum annual flood flows of a river, for the period 1941–1990, are given in
Table 2.4.
(i) Summarize and display these data.
(ii) Compute the mean, median, standard deviation and interquartile range.
(iii) If a one–year construction project is being planned and a flow of 150000 cfs or greater will halt
construction, what is the relative frequency (based on past relative frequencies) that the construction
will be halted before the end of the project? What if it is a two-year construction project?
Figure 2.6: Polishing Times. Rows show Linear, Quadratic and Cubic fits of polishing Time against Diameter; within each row the panels show the fitted curve (Time vs Diameter), the residuals against the fitted values, and the residuals against Diameter.

Problem 2.4 The planned and the actual times (in days) needed for the completion of 20 job
orders are given in Table 2.5.
(a) Calculate the average and the median planned time per order. Same for the actual time.
(b) Calculate the corresponding standard deviations and interquartile ranges.
(c) If there is a delay penalty of $5000 per day and a before–schedule bonus of $2500 per day, what
is the average net loss (negative loss = gain) due to differences between planned and actual times?
What is the standard deviation?
(d) Study the relationship between the planned and actual times.
(e) What would be your advice to the company based on the analysis of these data?

Problem 2.5 (a) Show that

$$\mathrm{Cov}(x, y) = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - n\,\bar x\,\bar y}{n - 1} \quad\text{and}\quad \beta = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)},$$

(b) Show that if ui = a + b xi and vi = c + d yi , then

(i) ū = a + b x̄
(ii) Var(u) = b² Var(x)
(iii) r(u, v) = r(x, y)
(iv) β(u, v) = (d/b) β(x, y)
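The properties in part (b) can be checked numerically; the data and the constants a, b, c, d below are made up for illustration (with b, d > 0):

```python
# Numerical check of the linear-transformation rules of Problem 2.5(b).
import statistics as st

x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 10.0]
a, b, c, d = 3.0, 2.0, -1.0, 0.5
u = [a + b * xi for xi in x]
v = [c + d * yi for yi in y]

def cov(s, t):
    ms, mt = st.mean(s), st.mean(t)
    return sum((si - ms) * (ti - mt) for si, ti in zip(s, t)) / (len(s) - 1)

def corr(s, t):
    return cov(s, t) / (st.stdev(s) * st.stdev(t))

def beta(s, t):                         # regression slope: Cov / Var
    return cov(s, t) / st.variance(s)

print(abs(st.mean(u) - (a + b * st.mean(x))) < 1e-9)         # (i)   True
print(abs(st.variance(u) - b ** 2 * st.variance(x)) < 1e-9)  # (ii)  True
print(abs(corr(u, v) - corr(x, y)) < 1e-9)                   # (iii) True
print(abs(beta(u, v) - (d / b) * beta(x, y)) < 1e-9)         # (iv)  True
```

Note that `statistics.variance` and `statistics.stdev` are the sample (n − 1) versions, matching the covariance definition used here.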

Table 2.3: Fraser River Monthly Flow (cms)

Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1971 855 1030 841 1550 6120 7590 5590 3570 2360 1890 1550 908
1972 774 857 1500 2100 6450 10800 7330 4120 2280 1940 1500 1000
1973 984 842 850 1550 4910 6180 5000 2930 1680 2080 1620 1130
1974 987 929 927 2320 5890 8430 7470 4360 2440 1930 1290 978
1975 797 780 736 1100 3940 6830 6070 3420 2300 1950 2360 1480
1976 1140 1030 924 2300 7070 7250 7670 6440 4460 2510 1800 1480
1977 1240 1230 1130 2350 4710 5670 4830 3620 2340 1650 1260 1030
1978 881 791 952 1960 3950 5730 4540 2970 2600 2090 1590 1010
1979 801 721 957 1290 4910 6360 4860 2610 1830 1420 918 952
1980 684 649 703 1760 5120 4900 4010 2720 2600 2080 1630 1900
1981 1860 1480 1300 1880 4950 6260 4890 3620 2130 1530 1950 1140
1982 821 927 844 1010 5360 8690 7230 4850 3620 2310 1470 1110
1983 972 977 1240 1990 4090 6060 5240 3460 2210 1470 2050 878
1984 1160 1010 1160 2030 2870 6370 6580 3780 2920 2560 1370 861
1985 740 706 801 2070 5300 7390 4650 2770 1940 1980 1230 746
1986 813 809 1280 2090 3770 8390 5380 3220 1890 1470 1340 908
1987 1000 944 1300 2280 5120 5840 4070 2980 1680 1020 1210 811
1988 629 657 809 2410 5450 5940 4430 3010 1890 1540 1470 926
1989 800 685 682 1780 4860 6020 3990 3170 1840 1380 2060 1410
1990 1210 841 926 3000 5050 8760 6270 3340 1790 1520 2110 1190

Problem 2.6 The total paved area, X (in km2 ), and the time, Y (in days), needed to complete
the project was recorded for 25 different jobs. The data is summarized as follows:

x = 12.5 km2 , SD(x) = 1.2 km2


y = 30.8 days , SD(y) = 3.7 days
Cov(x, y) = 3.4 , r(x, y) = 0.766 , β = 2.36
Give the corresponding summaries when the area is measured in feet² and the time is measured in
hours.
Hint: 1 foot = 0.305 m, and 1 km = 1000 m.

Problem 2.7 Show that −1 ≤ r(x, y) ≤ 1.


Hint: One can assume without loss of generality that

x̄ = ȳ = 0 and SD(x) = SD(y) = 1 (why?)

Then use the fact that

$$0 \le \sum_{i=1}^{n} (y_i - b\,x_i)^2$$

for all b, and in particular for b = Cov(x, y)/SD(x)².



Table 2.4: The records of maximum annual flood flows


Year Flood, cfs Year Flood, cfs
1941 153000 1966 159000
1942 184000 1967 75000
1943 66000 1968 102000
1944 103000 1969 55000
1945 123000 1970 86000
1946 143000 1971 39000
1947 131000 1972 131000
1948 99000 1973 111000
1949 137000 1974 108000
1950 81000 1975 49000
1951 144000 1976 198000
1952 116000 1977 101000
1953 11000 1978 253000
1954 262000 1979 239000
1955 44000 1980 217000
1956 8000 1981 103000
1957 199000 1982 86000
1958 6000 1983 187000
1959 166000 1984 57000
1960 115000 1985 102000
1961 88000 1986 82000
1962 29000 1987 58000
1963 66000 1988 34000
1964 72000 1989 183000
1965 37000 1990 22000

Table 2.5: The planned and the actual times

Order Planned Time Actual Time Order Planned Time Actual Time
1 22 22 11 17 18
2 11 8 12 27 34
3 11 8 13 16 14
4 16 14 14 30 35
5 21 20 15 22 18
6 12 16 16 17 16
7 25 29 17 13 12
8 20 20 18 18 14
9 13 10 19 21 19
10 34 39 20 18 17
Chapter 3

Probability

3.1 Sets and Probability


The theory of probability, which is briefly discussed below, is needed for the better understanding
of some important statistical techniques. This theory is, roughly speaking, concerned with the
assessment of the chances (or likelihood) that certain events will or will not occur. In order to
give a more precise (and useful) definition of probability, we first need to introduce some technical
concepts and definitions.

Random Experiment: The defining feature of a random experiment is that its outcome
cannot be determined beforehand. That is, the outcome of the random experiment will
only be known after the experiment has been completed. The next time the experiment is
performed (seemingly under the exact same conditions) the outcome may be different. Some
examples of random experiments are:

– asking a randomly selected person if she smokes,


– counting the number of defective items found in a lot,
– measuring the time elapsed between two consecutive breakdowns of a computer network,
– counting the yearly number of work-related accidents in a production plant,
– measuring the yield of a chemical reaction.

Sample Space (S): Although we may not be able to say beforehand what the outcome of
the random experiment will be, we should, at least in principle, be able to make a complete
“list” of all the possible outcomes. This list (set) of all the possible outcomes is called “the
sample space” and is denoted by S. A generic outcome (that is, an element of S) is denoted by
w. The sample spaces for the random experiments listed above are:

– S = {Yes, No},
– S = {0, 1, 2, . . . , n} where n is the lot size,
– S = [0, ∞), the time (in hours) between breakdowns can be any non-negative real number.


– S = {0, 1, 2, . . .}, the number of accidents can be any non-negative integer number.
– S = [0, 100], the percentage yield can be any real number between zero and one hundred.

Event: The events, usually denoted by the first upper case letters of the alphabet (A, B, C,
etc), are simply subsets of S. Most events encountered in practice are meaningful and can
be expressed either in words or using mathematical notation. Some examples (related to the
list of random experiments given above) are:

–A = { less than four defectives} = {0, 1, 2, 3}.


–B = { more than 200 hours} = (200, ∞).
–C = {2, 3, 5, 9}
–D = {between ten and twenty percent} = [10, 20].

An important feature of the events is that they can or cannot occur, depending on the
actual outcome of the random experiment. For instance, if after completing the inspection
of the lot we find two defectives, the event A has occurred. On the other hand, if the actual
number of defectives turned out to be five, the event A did not occur.
Two rather special events are the “impossible” event – which can never occur – denoted
by the empty set ∅ and the “sure” event – which always occurs – consisting of the entire
sample space, S.
Some related mathematical notations are:

w ∈ A ⟺ w belongs to A ⟺ A occurs

and

w ∉ A ⟺ w doesn’t belong to A ⟺ A doesn’t occur

Probability Function (P ): Evidently, not all the events are equally likely. For instance,
the event
A = {more than three million accidents}
would appear to be quite unlikely, while the event

B = {more than three hours before the next crash}

would appear to be quite likely.


A probability function P is a function which assigns to each event a number representing
the likelihood that this event will actually occur.
For self-consistency reasons, any probability function P must satisfy the following prop-
erties:

(1) P (∅) = 0 and P (S) = 1.



(2) 0 ≤ P (A) ≤ 1 for all A.


(3) P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Properties (4)-(6) below can be derived from (1)-(3).

(4) P (A ∪ B) = P (A) + P (B) if A and B are disjoint (mutually exclusive) events.


In fact, if A and B are disjoint then A ∩ B = ∅ and P (A ∩ B) = 0.

(5) P (Ac ) = 1 − P (A), where Ac denotes the “complement of A”.


In fact, since A ∪ Ac = S and A ∩ Ac = ∅, 1 = P (A ∪ Ac ) = P (A) + P (Ac ) and (5) follows.

(6) If A ⊆ B then P (A) ≤ P (B).

In fact, since A ⊆ B,

B = (B ∩ A) ∪ (B ∩ Ac ) = A ∪ (B ∩ Ac ).

Since A and (B ∩ Ac ) are disjoint, P [A ∩ (B ∩ Ac )] = 0 and so

P (B) = P (A) + P (B ∩ Ac ) ≥ P (A).

Example 3.1 It is known from previous experience that the probabilities of finding zero, one,
two, etc. defectives in lots of 100 items shipped by a certain supplier are as given in Table
3.1 below.
Let A, B and C be the events “less than two defectives”, “more than one defective” and
“one or two defectives”, respectively. (a) Calculate P (A), P (B) and P (C). (b) What is the
meaning (in words) of the event Ac ? Calculate P (Ac ) directly and using Property (5). (c)
What is the meaning (in words) of the event A ∪ C? Calculate P (A ∪ C) directly and using
Property (3).

Table 3.1:
Defectives Probability
0 0.50
1 0.20
2 0.15
3 0.10
4 0.03
5 0.02
6 or more 0.00

Solution
(a) From Table 3.1, P (A) = 0.70, P (B) = 0.30, and P (C) = 0.35.

(b) Ac = {two or more defectives} = {more than one defective} = B. From Table 3.1, P (Ac ) =
P (B) = 0.30. This is consistent with the result we obtain using Property (5):
P (Ac ) = 1 − P (A) = 1 − 0.70 = 0.30.
(c) A ∪ C = {less than three defectives}. Therefore, directly from Table 3.1, P (A ∪ C) = 0.85.
To make the calculation using Property (3), we must first find P (A ∩ C). Since A ∩ C =
{exactly one defective}, it follows from Table 3.1 that P (A ∩ C) = 0.20. Now,
P (A ∪ C) = 0.70 + 0.35 − 0.20 = 0.85.
2
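As a sketch, these calculations can be done by representing each event as a set of outcomes and the probability function as a dictionary holding Table 3.1's values:

```python
# Example 3.1: outcomes are numbers of defectives; probabilities from Table 3.1.
p = {0: 0.50, 1: 0.20, 2: 0.15, 3: 0.10, 4: 0.03, 5: 0.02}

def prob(event):
    """P(event), where the event is a set of outcomes."""
    return sum(p[k] for k in event)

A = {0, 1}         # less than two defectives
B = {2, 3, 4, 5}   # more than one defective
C = {1, 2}         # one or two defectives

print(round(prob(A), 2), round(prob(B), 2), round(prob(C), 2))  # 0.7 0.3 0.35
print(round(1 - prob(A), 2))      # Property (5): P(A^c) = 0.3
print(round(prob(A | C), 2))      # P(A ∪ C) = 0.85
```

Python's set union `A | C` mirrors the event union A ∪ C directly.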

3.2 Conditional Probability and Independence


There are instances when, after obtaining some partial information regarding the outcome
of a random experiment, one would like to update the probabilities of certain events, taking
into account the newly acquired information.
The updated probability of the event A, when it is known that the event B has occurred,
is in general denoted by P (A|B) and called “the conditional probability of A given B”. This
conditional probability can be calculated by the formula

$$P(A|B) = \frac{P(A \cap B)}{P(B)} \tag{3.1}$$

provided that P (B) > 0. A simple, but nevertheless important, consequence of (3.1) is that

$$P(A \cap B) = P(A|B)\,P(B), \tag{3.2}$$

which is sometimes called “the multiplication law”.

Example 3.1 (continued): Suppose that we know that the lot contains two defectives or
more. What is the probability that it contains three or more defectives?

Solution Let
B = { two or more defectives } = { more than one defective }
and
D = { three or more defectives } = { more than two defectives }.
Since P (B) = 0.30 and P (D ∩ B) = P ({3, 4, 5}) = 0.15, the desired conditional probability
is
P (D|B) = P (D ∩ B)/P (B) = 0.15/0.30 = 0.50.
2
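As a sketch, this conditional probability can be computed directly from the probabilities of Table 3.1:

```python
# P(D | B) = P(D ∩ B) / P(B), with probabilities from Table 3.1.
p = {0: 0.50, 1: 0.20, 2: 0.15, 3: 0.10, 4: 0.03, 5: 0.02}

def prob(event):
    return sum(p[k] for k in event)

B = {2, 3, 4, 5}   # two or more defectives
D = {3, 4, 5}      # three or more defectives

p_d_given_b = prob(D & B) / prob(B)   # here D ⊆ B, so D ∩ B = D
print(round(p_d_given_b, 2))          # 0.5
```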

Posterior Probability and Bayes Formula

Suppose that we wish to investigate the occurrence of a certain event B. For example,
consider the collapse of a large industrial building or the crash of a computer network.
The event B may have been caused by one of several possible causes or states of nature,
denoted A1 , A2 , . . . , Am . For example, the collapse of the industrial building may have been
caused by one (and only one) of the following:

A1 Poor design
- underestimated live load
- underestimated maximum wind speed
- etc.
A2 Poor construction
- Low grade material
- Insufficient supervision and control
- Gross human error
- etc.
A3 A combination of A1 and A2 .
A4 Other (non-assignable) causes.

Suppose that, from previous experience or some other source (for example some expert’s
opinion), the conditional probabilities of B given Ai are known. That is, the probabilities
that the event B will occur when the cause Ai is present are known and represented by
p1 , p2 , . . . , pm .
We will call these conditional probabilities risk factors. Suppose also that the probabilities
of each possible cause Ai are known. These probabilities are called prior probabilities and
denoted
π1 , π 2 , . . . , π m .
In the case of our example, the prior probabilities may represent the actual fractions of
industrial buildings in the country which have some design or construction problems. Or they
may represent the subjective beliefs (educated guesses) of some expert consultant (perhaps
the engineer hired by the insurance company to investigate the causes of the accident). In
summary, we suppose that
pi = P (B|Ai ), and πi = P (Ai ),
are known for all i = 1, . . . , m. Notice that
π1 + π2 + . . . + πm = 1.
The prior probabilities and the risks factors for the collapsed building example are given in
columns 2 and 3 of Table 3.2

Table 3.2:
Cause (i) Prior Probability (πi ) Risk Factor (pi ) Posterior Probability
1 0.00050 0.10 0.29
2 0.00010 0.20 0.12
3 0.00001 0.40 0.02
4 0.99939 0.0001 0.57

The engineer hired by the insurance company to investigate the accident would certainly
wish to know where she can first start looking to find an assignable cause. More precisely, she
would wish to know what is the most likely assignable cause for the collapse of the building.
The conditional probability of each possible cause, given the fact that the event has
occurred, is called the posterior probability for this cause and can be calculated by the
famous Bayes’ formula

$$P(A_k|B) = \frac{P(B|A_k)P(A_k)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_m)P(A_m)} = \frac{p_k \pi_k}{p_1 \pi_1 + p_2 \pi_2 + \cdots + p_m \pi_m}.$$
In the case of our example the posterior probability of the cause “poor design” (A1 ), for
instance, is equal to

$$P(A_1|B) = \frac{(0.00050)(0.10)}{(0.00050)(0.10) + (0.00010)(0.20) + (0.00001)(0.40) + (0.99939)(0.0001)} = 0.29.$$

The other posterior probabilities are calculated analogously and the results are displayed in
the fourth column of Table 3.2.
What did the engineer learn from the results of these (posterior probability) calculations?
In the first place she learned that the chance of finding an assignable cause is approximately
43%. Furthermore, she learned that it is best to begin looking for flaws in the design of the
building, as this cause is almost three times more likely to have caused the accident than the
other assignable causes. Finally she learned that it is highly unlikely that the collapse of the
building has been caused by more than one assignable cause.
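A sketch of the posterior computation, using the priors and risk factors of Table 3.2 (the values agree with the fourth column of the table up to rounding):

```python
# Bayes' formula applied to the priors and risk factors of Table 3.2.
priors = [0.00050, 0.00010, 0.00001, 0.99939]   # pi_i = P(A_i)
risks = [0.10, 0.20, 0.40, 0.0001]              # p_i  = P(B | A_i)

p_b = sum(p * q for p, q in zip(risks, priors))            # denominator, P(B)
posteriors = [p * q / p_b for p, q in zip(risks, priors)]  # P(A_i | B)

print([round(v, 2) for v in posteriors])  # compare with column 4 of Table 3.2
print(round(sum(posteriors[:3]), 2))      # chance of an assignable cause: 0.43
```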

Derivation of Bayes’ Formula

By the definition of conditional probability,

$$P(A_k|B) = \frac{P(B \cap A_k)}{P(B)}.$$

Since in addition S can be expressed as the disjoint union

$$S = A_1 \cup A_2 \cup \ldots \cup A_m,$$

it follows that

$$B = B \cap S = B \cap (A_1 \cup A_2 \cup \ldots \cup A_m) = (B \cap A_1) \cup (B \cap A_2) \cup \ldots \cup (B \cap A_m)$$

and so,

$$P(B) = P(B \cap A_1) + \ldots + P(B \cap A_m) = P(B|A_1)P(A_1) + \ldots + P(B|A_m)P(A_m) = \pi_1 p_1 + \pi_2 p_2 + \cdots + \pi_m p_m. \tag{3.3}$$

Therefore,

$$P(A_k|B) = \frac{P(B|A_k)P(A_k)}{P(B|A_1)P(A_1) + \cdots + P(B|A_m)P(A_m)} = \frac{\pi_k p_k}{\pi_1 p_1 + \pi_2 p_2 + \cdots + \pi_m p_m}.$$

Example 3.2 A certain disease is known to affect 1% of the population. A test for the
disease has the following features: if the person is contaminated the test is positive with
probability 0.98. On the other hand, if the person is healthy, the test is negative with
probability 0.95. (a) What is the probability of a positive test when applied to a randomly
chosen subject? (b) What is the probability that an individual is affected by the disease after
testing positive? (c) Explain the connections between this problem and Bayes’ formula.

Solution Let C be the event “the person has the disease” and B the event “the test is positive”.
(a) Since B is clearly equal to the disjoint union of the events B ∩ C and B ∩ C c ,
P (B) = P (B ∩ C) + P (B ∩ C c )
= P (C)P (B|C) + P (C c )P (B|C c )
= (0.01 × 0.98) + (0.99 × 0.05)
= 0.0593

(b)

$$P(C|B) = \frac{P(B \cap C)}{P(B)} = \frac{P(B|C)P(C)}{P(B)} = \frac{0.98 \times 0.01}{0.0593} = 0.1653.$$
Notice that the probability of having the disease, even after testing positive, is surprisingly
low (less than 0.17). Why do you think this is so?
(c) The calculation in part (a) produced the “unconditional” probability of the event
“testing positive”. This unconditional probability constitutes the denominator of Bayes’
formula. If a person has tested positive, given the characteristics of the test, this can
be due to two possible causes: being healthy or being contaminated. The posterior
probability of the second cause is the result of part (b). 2
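The two calculations of this example as a short sketch, using the standard names sensitivity and specificity (my labels, not the text's) for the test characteristics:

```python
# Example 3.2: C = "has the disease", B = "tests positive".
prevalence = 0.01        # P(C)
sensitivity = 0.98       # P(B | C)
specificity = 0.95       # P(B^c | C^c), so P(B | C^c) = 0.05

p_b = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
p_c_given_b = prevalence * sensitivity / p_b     # Bayes' formula

print(round(p_b, 4))           # 0.0593
print(round(p_c_given_b, 4))   # 0.1653
```

The posterior is low because the healthy group is so much larger than the diseased group that its few false positives outnumber the true positives.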

Independence
Roughly speaking, two events A and B are independent when the probability of either one
of them is not modified after knowing the result for the other (occurrence or non-occurrence).
In other words, knowing about the occurrence or non-occurrence of one of these events
does not alter the amount of information (or uncertainty) that we initially had regarding the
other event. Quite simply then, we can say that two events are independent if they do not
carry any information regarding each other.
The formal definition of independence is somewhat surprising at first because it doesn’t
make any direct reference to the events’ conditional probabilities. But see also the remarks
following the definition. Probabilists prefer this formal definition, because it is easy to check
and to generalize for the case of m events (m ≥ 2).
Definition: The events A and B are independent if

P (A ∩ B) = P (A)P (B).

Suppose that the events A and B are such that


P (A|B) = P (A).
In this case,
P (A ∩ B) = P (A|B)P (B) = P (A)P (B),
and the events A and B are independent according to the given definition.
On the other hand, if P (B) > 0 and A and B satisfy the given definition of independence,
then

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$$

Example 3.3 The results of the STAT 251 midterm exam can be classified as follows:

Table 3.3:
Male Female
High 0.05 0.15 0.20
Medium 0.30 0.15 0.45
Low 0.30 0.05 0.35
0.65 0.35 1.00

What is the meaning of the statement “gender and performance are independent”? Are they?
Why?

Solution
Gender and performance are (intuitively) independent if for example, knowing the score
of a randomly chosen test doesn’t affect the probability that this test corresponds to a male
(0.65, from the table) or to a female (0.35). Or vice versa, knowing the gender of the student
who wrote the test doesn’t modify our ability to predict its score.
Let A and B be the events “a randomly chosen student is male” and “a randomly chosen
student has a high score”, respectively. Is it true that P (A|B) = P (A)? The answer, of
course, is no because

P (A|B) = 0.05/0.20 = 0.25 and P (A) = 0.65.

Before knowing that the score is high, the chances are almost two out of three that the
student is a male. However, after we know that the score is high, the chances are one out of
four that the student is a male. The lack of independence in this case derives from the fact
that male students are “under-represented” in the high score category and “over-represented”
in the low score category. 2
If Table 3.3 above is replaced by Table 3.4

Table 3.4:
Male Female
High 0.13 0.07 0.20
Medium 0.29 0.16 0.45
Low 0.23 0.12 0.35
0.65 0.35 1.00

then gender and performance are independent. (Why?).
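One way to see why: under independence every cell probability equals the product of its row and column totals. A sketch that checks this for both tables; since the entries are rounded to two decimals in the text, the comparison uses a matching tolerance:

```python
# Check independence of a joint probability table: cell ≈ row total × column total.
def is_independent(table, tol=0.005):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return all(abs(table[i][j] - rows[i] * cols[j]) <= tol
               for i in range(len(rows)) for j in range(len(cols)))

table_33 = [[0.05, 0.15], [0.30, 0.15], [0.30, 0.05]]  # Table 3.3
table_34 = [[0.13, 0.07], [0.29, 0.16], [0.23, 0.12]]  # Table 3.4

print(is_independent(table_33))  # False: gender and performance are dependent
print(is_independent(table_34))  # True (up to rounding)
```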

The concept of independence also applies to three or more events and we shall now give the
formal definition of independence of m events. At the same time we want to point out that,
in most practical applications, the independence of certain events is often simply assumed or

derived from external information regarding the physical make up of the random experiment,
as illustrated in Example 3.4 below.
Fortunately then, we will have few occasions to check this definition throughout this
course.

Definition: The events Ai (i = 1, . . . , m) are independent if


P (Ai ∩ Aj ) = P (Ai )P (Aj ) for all i ≠ j, and

P (Ai ∩ Aj ∩ Ak ) = P (Ai )P (Aj )P (Ak ) for all distinct i, j, k, and

. . .

P (A1 ∩ A2 ∩ . . . ∩ Am ) = P (A1 )P (A2 ) . . . P (Am ).

Example 3.4 A certain system has four independent components {a1 , a2 , a3 , a4 }. The pairs
of components a1 , a2 and a3 , a4 are in line. This means that, for instance, the subsystem
{a1 , a2 } fails if either of its two components does; similarly for the subsystem {a3 , a4 }. The
subsystems {a1 , a2 } and {a3 , a4 } are in parallel. This means that the system works if at least
one of the two subsystems does. Calculate the probability that the system fails, assuming
that the four components are independent and that each one of them can break down with
probability 0.10. How many parallel subsystems would be needed if the probability of failure
for the entire system cannot exceed 0.001?

[Diagram: components a1 and a2 in series form one branch; components a3 and a4 in series form a second, parallel branch.]

Figure 3.1: A four-component system

Solution Let Ai be the event “component ai works” (i = 1, . . . , 4), and let C be the event
“the system works”.

$$P(C) = P[(A_1 \cap A_2) \cup (A_3 \cap A_4)] = P(A_1 \cap A_2) + P(A_3 \cap A_4) - P[(A_1 \cap A_2) \cap (A_3 \cap A_4)]$$
$$= P(A_1)P(A_2) + P(A_3)P(A_4) - P(A_1)P(A_2)P(A_3)P(A_4) = 0.9^2 + 0.9^2 - 0.9^4 = 0.9639.$$

To answer the second question, just notice that the probability of working for each independent
subsystem is 0.9² = 0.81. Now, if Bi (i = 1, . . . , m) is the event “the ith subsystem
works”, it follows that

$$0.001 \ge 1 - P(B_1 \cup B_2 \cup \ldots \cup B_m) = P(B_1^c \cap B_2^c \cap \ldots \cap B_m^c) = P(B_1^c)P(B_2^c)\cdots P(B_m^c) = [1 - P(B_1)]^m = (1 - 0.81)^m.$$

Therefore,

$$\log(0.001) \ge m \log(0.19) \implies m \ge \frac{\log(0.001)}{\log(0.19)} \approx 4.16 \implies m = 5.$$
2
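A numerical sketch of both answers, using the component reliabilities from the example:

```python
# Example 3.4: two series pairs (a1,a2) and (a3,a4) connected in parallel.
import math

p_work = 0.9                 # each component works with probability 1 - 0.10
p_sub = p_work ** 2          # a series pair works only if both components do

# Parallel combination: the system works if at least one subsystem works.
p_system = 1 - (1 - p_sub) ** 2
print(round(p_system, 4))    # 0.9639, matching the example

# Smallest m with overall failure probability (1 - 0.81)^m at most 0.001.
m = math.ceil(math.log(0.001) / math.log(1 - p_sub))
print(m)                     # 5
```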

3.3 Exercises
Problem 3.1 If A and B are independent events with P (A) = 0.2 and P (B) = 0.5, find the
following probabilities. (a) P (A ∪ B); (b) P (A ∩ B); and (c) P (Ac ∩ B c )

Problem 3.2 In a certain class, 5 students obtained an A, 10 students obtained a B, 17


students obtained a C, and 6 students obtained a D. What is the probability that a randomly
chosen student receives a B? If a student receives $10 for an A, $5 for a B, $2 for a C, and $0
for a D, what is the average gain that a student will make from this course?

Problem 3.3 Consider the problem of screening for cervical cancer. The probability that a
woman has the cancer is 0.0001. The screening test correctly identifies 90% of all the women
who do have the disease, but the test is false positive with probability 0.001.
(a) Find the probability that a woman actually does have cervical cancer given the test says
she does.
(b) List the four possible outcomes in the sample space.

Problem 3.4 An automobile insurance company classifies each driver as a good risk, a
medium risk, or a poor risk. Of those currently insured, 30% are good risks, 50% are medium
risks, and 20% are poor risks. In any given year the probability that a driver will have at
least one accident is 0.1 for a good risk, 0.3 for a medium risk, and 0.5 for a poor risk.
(a) What is the probability that the next customer randomly selected will have at least one
accident next year?
(b) If a randomly selected driver insured by this company had an accident this year, what is
the probability that this driver was actually a good risk?

Problem 3.5 A truth serum given to a suspect is known to be 90% reliable when the person
is guilty and 99% reliable when the person is innocent. In other words, 10% of the guilty are
judged innocent by the serum and 1% of the innocent are judged guilty. If the suspect was
selected from a group of suspects of which only 5% have ever committed a crime, and the
serum indicates that he is guilty, what is the probability that he is innocent?

Problem 3.6 70% of the light aircraft that disappear while in flight in a certain country
are subsequently discovered. Of the aircraft that are discovered, 60% have an emergency
locator, whereas 80% of the aircraft not discovered do not have an emergency locator.
(a) What percentage of the aircraft have an emergency locator?
(b) What percentage of the aircraft with an emergency locator are discovered after they disap-
pear?

Problem 3.7 Two methods, A and B, are available for teaching a certain industrial skill.
The failure rate is 20% for A and 10% for B. However, B is more expensive and hence is
only used 30% of the time (A is used the other 70%). A worker is taught the skill by one
of the methods, but fails to learn it correctly. What is the probability that the worker was
taught by Method A?

Problem 3.8 Suppose that the numbers 1 through 10 form the sample space of a random
experiment, and assume that each number is equally likely. Define the following events: A1 ,
the number is even; A2 , the number is between 4 and 7, inclusive.
(a) Are A1 and A2 mutually exclusive events? Why?
(b) Calculate P (A1 ), P (A2 ), P (A1 ∩ A2 ), and P (A1 ∪ A2 ).
(c) Are A1 and A2 independent events? Why?

Problem 3.9 A coin is biased so that a head is twice as likely to occur as a tail. If the coin
is tossed three times,
(a) what is the sample space of the random experiment?
(b) what is the probability of getting exactly two tails?

Problem 3.10 Items in your inventory are produced at three different plants: 50 percent
from plant A1 , 30 percent from plant A2 , and 20 percent from plant A3 . You are aware
that your plants produce at different levels of quality: A1 produces 5 percent defectives, A2
produces 7 percent defectives, and A3 yields 8 percent defectives. You select an item from
your inventory and it turns out to be defective. Which plant is the item most likely to have
come from? Why does knowing the item is defective decrease the probability that it has
come from plant A1 , and increase the probability that it has come from either of the other
two plants?

Problem 3.11 Calculate the reliability of the system described in the following figure. The
numbers beside each component represent the probabilities of failure for this component.
Note that the components work independently of one another.
[Diagram: components 1 and 2 (failure probability .05 each) in series, followed by components 3 and 4 (failure probability .1 each) connected in parallel, followed by component 5 (failure probability .05).]

Problem 3.12 A system consists of two subsystems connected in series. Subsystem 1 has
two components connected in parallel. Subsystem 2 has only one component. Suppose the
three components work independently and each has probability of failure equal to 0.2. What
is the probability that the system works?

Problem 3.13 A proficiency examination for a certain skill was given to 100 employees of
a firm. Forty of the employees were male. Sixty of the employees passed the exam, in that
they scored above a preset level for satisfactory performance. The breakdown among males
and females was as follows:
          Male (M)   Female (F)   Total
Pass (P)     24          36         60
Fail         16          24         40
Total        40          60        100

Suppose an employee is randomly selected from the 100 who took the examination.
(a) Find the probability that the employee passed, given that he was male.
(b) Find the probability that the employee was male, given that he passed.
(c) Are the events P and M independent?
(d) Are the events P and F independent?

Problem 3.14 Propose appropriate sample spaces for the following random experiments.
Give also two examples of events for each case.
Counting/measuring:
1 - the number of employees attending work in a certain plant
2 - the number of days with wind speed above 50 km/hour, per year, in Vancouver
3 - the number of earthquakes in BC during any given period of two years
4 - the time between two consecutive breakdowns of a computer network
5 - the number of people leaving BC per year
6 - the percentage of STAT 241/51 students obtaining final marks above 80% in any given
term
7 - the number of engineers working in BC per year
8 - the percentage of computer scientists in BC who will make more than $65,000 in 1996
9 - the number of employees still working in a certain production plant after 4:30 PM on
Fridays.

Problem 3.15 Let A and B be the events “construction flaw due to some human
error” and “construction flaw due to some mechanical problem”.
1) What is the meaning (in words) of each of the following events: (a) A ∪ B, (b) A ∩ B, (c)
A ∩ B c , (d) Ac ∩ B c , (e) (A ∪ B)c , (f) Ac ∪ B c , (g) (A ∩ B)c . Draw also the corresponding
diagrams.
2) Show that in general (A ∩ B)c = Ac ∪ B c and that (A ∪ B)c = Ac ∩ B c (so the results of
(f) and (g) and of (d) and (e) above were not mere coincidences).
3) Suppose that P (A) = 0.02, P (B) = 0.01 and P (A ∪ B) = 0.023. Calculate (a) P (A ∩ B),
(b) P (Ac ∩ B c ), (c) P (A ∩ B c ), (d) P (A|B c ), (e) P (A|B).

Problem 3.16 A large company hires most of its employees on the basis of two tests. The
two tests have scores ranging from one to five. The following table summarizes the perfor-
mance of 16,839 applicants during the last six years. From this table we learn, for example,
that 3% of the applicants got a score of 2 on Test 1 and 2 on Test 2; and that 15% of the
applicants got a score of 3 on Test 1 and 2 on Test 2. We also learn that, for example, 20%
of the applicants got a score of 2 on Test 1 and that 25% of the applicants got a score of 2
on Test 2.
A group of 1500 new applicants have been selected to take the tests.
(a) What should the cutting scores be if between 140 and 180 applicants will be short–listed
for a job interview? Assume that the company wishes to short–list people with the highest
possible performances on the two tests.
3.3. EXERCISES 55

Table 3.5:
Test 1 \ Test 2      1     2     3     4     5   Total
      1            0.07  0.03  0.00  0.00  0.00   0.10
      2            0.15  0.03  0.02  0.00  0.00   0.20
      3            0.08  0.15  0.09  0.02  0.01   0.35
      4            0.10  0.04  0.08  0.01  0.02   0.25
      5            0.00  0.00  0.06  0.02  0.02   0.10
Total              0.40  0.25  0.25  0.05  0.05   1.00

Table 3.6:
Score Test 1 Test 2
1 0.10 0.40
2 0.20 0.25
3 0.35 0.25
4 0.25 0.05
5 0.10 0.05

(b) Same as (a) but assuming now that the company wishes to hire people with the highest
possible performances on at least one of the two tests.
(c) (Continued from (a)) A manager suggests that only applicants who obtain marks above
a certain bottom line on one of the tests be given the other test. Noticing that giving
and marking each test costs the company $55, recommend which test should be given first.
Approximately how much will be saved on the basis of your advice?
(d) Repeat (a)–(c) if the two tests' performances are independent and the probabilities are
given by Table 3.6.

Problem 3.17 A computer company manufactures PC compatible computers in two plants,
called Plant A and Plant B in this exercise. These plants account for 35% and 65% of the
production, respectively. The company records show that 3% of the computers manufactured
by Plant A must be repaired under the warranty. The corresponding percentage for Plant B
is 2.5%.
(a) What is the percentage of computers that are repaired under the warranty and come from
Plant A?
(b) What percentage of computers repaired under the warranty come from Plant A? From
Plant B?

Problem 3.18 Twenty per cent of the days in a certain area are rainy (there is some mea-
surable precipitation during the day), one third of the days are sunny (no measurable pre-
cipitation, more than 4 hours of sunshine) and fifteen per cent of the days are cold (daily
average temperature for the day below 5°C).
1 - Would you use the above information as an aid in
(i) Planning your next weekend activities (assuming that you live in this area)?
(ii) Deciding whether you want to move to this area?
(iii) Choosing the type of roofing for a large building in this area?
56 CHAPTER 3. PROBABILITY

Justify your answers.


2 - Given that five per cent of the days are sunny and cold, and five per cent of the days are
rainy and cold, calculate the probability that a given day will be either sunny, rainy or cold.
3 - Are sunny and cold days independent? What about rainy and cold days?

Problem 3.19 A company sells a (cheap) recording tape under a limited “lifetime war-
ranty”. From the company records one learns that
5% of the tapes sold by the company are defective and could be replaced under the warranty.
50% of the customers who get one of these defective tapes will claim it under the warranty
and have it replaced.
90% of the tapes which are claimed to be defective are actually so. These tapes are replaced
under the warranty.
(a) Which of the above are conditional probabilities?
(b) Using the above information, calculate the probability that a customer will claim the
warranty.
(c) What is the maximum allowable fraction of defective tapes if the company wants to have
at most 1% of the tapes returned?

Problem 3.20 Show that P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B).

Problem 3.21 On average, 20% of the students fail the first midterm. Of those, 60% fail
the second midterm. Moreover, 80% of the students who failed both midterms also fail
the final exam.
(a) What is the probability that a randomly chosen student fails the two midterms?
(b) What is the probability that a randomly chosen student fails the two midterms and the
final exam?

Problem 3.22 The probability that a system survives 300 hours is 0.8. The probability that
a 300-hour-old system survives another 300 hours is 0.6. The probability that a 600-hour-old
system survives another 300 hours is 0.5.
(a) What is the probability that the system survives 600 hours?
(b) What is the probability that the system survives 900 hours?

Problem 3.23 Recall the situation in Example 3.2 presented in class: the probability of
infection for an individual in the general population is π = .01 and a test for the disease
is such that it will be correctly positive 98% of the time and correctly negative 95% of the
time. Some individuals, however, may belong to some “high risk” groups and therefore have
a larger prior probability π of being infected.
1) Calculate the posterior probability of infection as a function of the corresponding prior
probability, π, given that the test is positive (denote this probability by g(π)), and make a
plot of g(π) versus π.
2) What is the value of π for which the posterior probability given a positive test is twice as
large as the prior probability?
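For part 1), Bayes' rule gives g(π) = 0.98π / (0.98π + 0.05(1 − π)). A short sketch (the function name g and variable names are ours) evaluates this posterior and solves part 2) in closed form:

```python
def g(prior, sensitivity=0.98, specificity=0.95):
    # Posterior probability of infection given a positive test (Bayes' rule).
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# With the general-population prior of Example 3.2:
g_small = g(0.01)  # about 0.165

# Part 2): g(pi) = 2*pi reduces (dividing by pi) to 0.98 = 2*(0.93*pi + 0.05), so
pi_double = (0.98 - 0.10) / (2 * 0.93)  # about 0.473
```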
3.3. EXERCISES 57

Problem 3.24 Suppose that we wish to determine whether an uncommon but fairly costly
construction flaw is present. Suppose that in fact this flaw has only probability 0.005 of
being present. A fairly simple test procedure is proposed to detect this flaw. Suppose that
the probabilities of being correctly positive and negative for this test are 0.98 and 0.94,
respectively.
1) Calculate the probability that the test will indicate the presence of a flaw.
2) Calculate the posterior probability that there is no flaw given that the test has indicated
that there is one. Comment on the implications of this result.

Problem 3.25 One method that can be used to distinguish between granite (G) and basalt
(B) rocks is to examine a portion of the infrared spectrum of the sun's energy reflected
from the rock surface. Let R1 , R2 and R3 denote measured spectrum intensities at three
different wavelengths. Normally, R1 < R2 < R3 would be consistent with granite and
R3 < R1 < R2 would be consistent with basalt. However, when the measurements are made
remotely (e.g. using aircraft) several orderings of the Ri 's can arise. Flights over regions of
known composition have shown that granite rocks produce

(R1 < R2 < R3 ) 60% of the time,
(R1 < R3 < R2 ) 25% of the time, and
(R3 < R1 < R2 ) 15% of the time.

On the other hand, basalt rocks produce these orderings of the spectrum intensities with
probabilities 0.10, 0.20 and 0.70, respectively. Suppose that for a randomly selected rock
from a certain region we have P (G) = 0.25 and P (B) = 0.75.
1) Calculate P (G|R1 < R2 < R3 ) and P (B|R1 < R2 < R3 ). If the measurements for a given
rock produce the ordering R1 < R2 < R3 , how would you classify this rock?
2) Same as 1) for the case R1 < R3 < R2
3) Same as 1) for the case R3 < R1 < R2
4) If one uses the classification rule determined in 1), 2) and 3), what is the probability of
a classification error (that a G rock is classified as a B rock or a B rock is classified as a G
rock)?

Problem 3.26 Messages are transmitted as a sequence of zeros and ones. Transmission er-
rors occur independently, with probability 0.001. A message of 3500 bits will be transmitted.
(a) What is the probability that there will be no errors? What is the probability that there
will be more than one error?
(b) If the same message will be transmitted twice and those bits that do not agree will be
revised (and therefore these “detected” transmission errors will be corrected), what is the
probability that there will be no reception errors?

Problem 3.27 Suppose that the events A, B and C are independent. Show that,
(a) Ac and B c are independent.
58 CHAPTER 3. PROBABILITY

(b) A ∪ B and C are independent.


(c) Ac ∩ B c and C are independent.
Problem 3.28 A test has been designed to indicate the presence of a flaw in an electronic
component. The components which test positive are sent back to the production department.
It is known, however, that 1% of the time the test gives either a false positive or a false
negative result.
(a) What is the proportion of faulty components being produced if 2% of them are sent back
to production on the basis of the test?
(b) The company produces twenty thousand components each year. The loss associated
with the rejection of a sound component is $5, that associated with the rejection of a faulty
component is $50 and that associated with the selling of a defective component is $150. What
is the total loss? How much of this loss is due to “defective” testing?
Problem 3.29 Consider the probabilities given in Table 3.7 and the events
B1 = {Having a low GPA}, B2 = {Having a medium GPA}, B3 = {Having a high GPA}
C1 = {Having a low salary}, C2 = {Having a medium salary}, C3 = {Having a high salary}

Table 3.7:
              Low Salary  Medium Salary  High Salary  Total
Low GPA          0.10         0.08          0.02       0.20
Medium GPA       0.07         0.46          0.07       0.60
High GPA         0.03         0.06          0.11       0.20
Total            0.20         0.60          0.20       1.00

1) Calculate P (Bi ∪ Cj ), i = 1, 2, 3 and j = 1, 2, 3


2) What is the meaning (in words), and the probability, of the event
A = (B1 ∩ C1 ) ∪ (B2 ∩ C2 ) ∪ (B3 ∩ C3 )
3) Are “salary” and “GPA” independent? Why?
4) Construct a table with the same marginals (same probabilities for the six categories) but
with salary and GPA being independent.

Problem 3.30 Consider the system of components connected as follows. There are two
subsystems connected in parallel. Components 1 and 2 constitute the first subsystem and are
connected in parallel (so that this subsystem works if either component works). Components
3 and 4 constitute the second subsystem and are connected in series (so that this subsystem
works if and only if both components do). If the components work independently of one
another and each component works with probability 0.85, (a) calculate the probability that
the system works. (b) calculate this probability if the two subsystems are connected in series.
3.3. EXERCISES 59

Problem 3.31 Calculate the reliability of the system described in the following figure. The
numbers beside each component represent the probabilities of failure for this component.
[Figure: reliability block diagram with seven components; failure probabilities: components 1, 2 and 7: .05; components 3, 4, 5 and 6: .01.]

Chapter 4

Random Variables and Distributions

4.1 Definition and Notation


Mathematically, a random variable X is a function defined on the sample space S, assigning
a number, x = X(w), to each outcome w in the sample space. Notice that the upper case
letter X represents the random variable and the lower case letter x represents one of its
possible values.

Example 4.1 Let S be the sample space associated with the inspection of four items. That is,

S = {w = (w1 , w2 , w3 , w4 )}

where wi , i = 1, . . . , 4, is equal to D (for defective) or N (for non–defective). The random


variable X is defined as “the number of D’s in w” and the random variable Y is defined
as “the indicator of two or more D’s in w” (that is, Y (w) = 1 if w contains two or more
defectives and Y (w) = 0, otherwise). For instance, X(N, N, N, N ) = 0, X(N, D, N, N ) = 1,
X(D, N, D, D) = 3, and Y (N, N, N, N ) = 0, Y (N, D, N, N ) = 0, Y (D, N, D, D) = 1.

Random variables are often used to summarize the most relevant information contained
in the sample space. For example, one may be interested in the total number of defectives
(number of D's in w) and may not care about the order in which they have been found. In this
case the random variable X(w) defined above would capture the most relevant information
contained in w. If we reject lots with two or more defectives (among the four inspected
items), the random variable Y would be of most interest.

Notation: The notations {X = x}, {X ≤ x}, etc. will be used very often in this course.
Their exact meaning is explained below. In general,

{X ∈ A} = {w : X(w) ∈ A}, where A is a set of numbers.

This takes on different forms for different sets A. For example,

{X = x} = {w : X(w) = x},


where the set A = {x} and


{X ≤ x} = {w : X(w) ≤ x},
where the set A = (−∞, x]. Additional examples (related to Example 4.1 above) are

{X = 0} = {(N, N, N, N )}

and

{X ≤ 1} = {(N, N, N, N ), (D, N, N, N ), (N, D, N, N ), (N, N, D, N ), (N, N, N, D)}.
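These definitions can be made concrete with a short sketch: below, the 16 outcomes of S are listed explicitly and X and Y are treated as ordinary functions on them (Python, with the labels 'D' and 'N' as in Example 4.1):

```python
from itertools import product

# Sample space for inspecting four items: each outcome w is a 4-tuple of 'D'/'N'.
S = list(product("DN", repeat=4))

def X(w):
    # Number of D's in w.
    return w.count("D")

def Y(w):
    # Indicator of two or more D's in w.
    return 1 if X(w) >= 2 else 0

# The event {X <= 1} is literally the set of outcomes w with X(w) <= 1;
# it contains (N,N,N,N) and the four outcomes with a single D.
event_X_le_1 = [w for w in S if X(w) <= 1]
```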

4.2 Discrete Random Variables


Discrete random variables are mainly used in relation to “counting” situations; for example,

• Counting the number of defective items in a lot

• Counting the number of yearly failures of an electrical network

• Counting the weekly number of customers arriving at a service outlet

• Counting the hourly number of cars crossing a bridge

• Counting the number of job interviews before finding a job

The defining feature of a discrete random variable is that its range (the set of all its
possible values) is finite or countable. The values in the range are often integer numbers, but
they don’t need to be so. For instance, a random variable taking the values zero, one half
and one with probabilities 0.5, 0.25 and 0.25 respectively is considered discrete.

The probability density function (or in short, the density), f (x), of a discrete random
variable X is defined as

f (x) = P (X = x), for each possible value x of X

That is, f (x) gives the probability of each possible value x of X. It obviously has the following
properties:

(1) f (x) ≥ 0 for all x in the range R of X

(2) Σ_{x∈R} f (x) = 1

(3) Σ_{x∈A} f (x) = P (X ∈ A) for all subsets A of R.

The distribution function of X (in short the distribution), F (x), is defined as

F (x) = P (X ≤ x) = Σ_{k≤x} f (k), for all real x.

In many engineering applications one works with 1 − F (x) instead of F (x). Notice that
1 − F (x) = P (X > x) and therefore gives the probability that X will exceed the value x.
Example 4.1 (continued): Suppose that the items are independent and each one can be
defective with probability p. The density and distribution of the random variable (r.v.) X =
“number of defectives” can then be derived as follows:
f (0) = P (X = 0) = P ({N, N, N, N }) = (1 − p)(1 − p)(1 − p)(1 − p) = (1 − p)^4
f (1) = P (X = 1) = P ({D, N, N, N }, {N, D, N, N }, {N, N, D, N }, {N, N, N, D})
      = p(1 − p)(1 − p)(1 − p) + (1 − p)p(1 − p)(1 − p) + (1 − p)(1 − p)p(1 − p)
        + (1 − p)(1 − p)(1 − p)p = 4(1 − p)^3 p

In a similar way we can find that


f (2) = 6(1 − p)^2 p^2 ,  f (3) = 4(1 − p)p^3  and  f (4) = p^4 .
The values of the density and distribution functions of X, for the cases p = 0.40 and p = 0.80,
are given in Table 4.1. A comparison of the density functions shows that smaller values of
X (0, 1 and 2) are more likely when p = 0.4 (why?) and that higher values (3 and 4) are
more likely when p = 0.8. Also notice that the distribution function for the case p = 0.8 is
uniformly smaller. This is so because getting smaller values of X is always more likely when
p = 0.4.

Table 4.1:
p = 0.40 p = 0.80
x f (x) F (x) f (x) F (x)
0 0.1296 0.1296 0.0016 0.0016
1 0.3456 0.4752 0.0256 0.0272
2 0.3456 0.8208 0.1536 0.1808
3 0.1536 0.9744 0.4096 0.5904
4 0.0256 1.0000 0.4096 1.0000
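The entries of Table 4.1 can be reproduced by coding f and F directly (a sketch; comb(4, x) counts the arrangements of x defectives among the four items):

```python
from math import comb

def density(x, p, n=4):
    # f(x) = C(n, x) p^x (1-p)^(n-x): probability of exactly x defectives.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def distribution(x, p, n=4):
    # F(x) = P(X <= x) = sum of f(k) for k <= x.
    return sum(density(k, p, n) for k in range(x + 1))

# Reproduce the p = 0.40 columns of Table 4.1.
f_040 = [round(density(x, 0.40), 4) for x in range(5)]
F_040 = [round(distribution(x, 0.40), 4) for x in range(5)]
# f_040 -> [0.1296, 0.3456, 0.3456, 0.1536, 0.0256]
```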

4.3 Continuous Random Variables


Continuous random variables are used in relation to “continuous” types of outcomes,
for example,

• the lifetime of a system or component



• the yield of a chemical process

• the weight of a randomly chosen item

• the difference between the specified and actual diameter of a part

• the measurement error when measuring the distance between the North and South
shores of a river.

The typical events in these cases are bounded or unbounded intervals with probabilities
specified in terms of the integral of a continuous density function, f (x), over the desired
interval. See property (3) below.
Since the probability of all intervals must be non-negative and the probability of the entire
line should be one, it is clear that f (x) must have the two following properties:

(1) Non-negative:

f (x) ≥ 0 for all x.

(2) Total mass equal to one:

∫_{−∞}^{+∞} f (x)dx = 1.

(3) Probability calculation:

P {a < X ≤ b} = ∫_a^b f (x)dx.

Notice that, unlike in the discrete case, the inclusion or exclusion of the end points a and
b doesn’t affect the probability that the continuous variable X is in the interval. In fact,
the event that X will take any single value, x, can be represented by the degenerate interval
x ≤ X ≤ x and so,
P (X = x) = P (x ≤ X ≤ x) = ∫_x^x f (t)dt = 0.

Therefore, unlike in the discrete case, f (x) doesn’t represent the probability of the event
X = x. What is then the meaning of f (x)? It represents the relative probability that X will
be near x: if d > 0 is small,

P (x − (d/2) < X < x + (d/2)) / d = (1/d) ∫_{x−(d/2)}^{x+(d/2)} f (t)dt ≈ f (x).

Another important function related to a continuous random variable is its cumulative
distribution function, defined as

F (x) = P (X ≤ x) = ∫_{−∞}^x f (t)dt, for all x. (4.1)
Notice that, in particular,
P (a < X < b) = F (b) − F (a).

[Figure 3.1: Probability on (a, b) under density function f (x), equal to F (b) − F (a).]
By the Fundamental Theorem of Calculus,
f (x) = F ′(x), for all x. (4.2)
Therefore, we can go back and forth from the density to the distribution function and vice
versa using formulas (4.1) and (4.2).
Example 4.2 Suppose that the maximum annual flood level of a river, X (in meters), has
density
f (x) = 0.125(x − 5), if 5 < x < 9
= 0 otherwise
Calculate F (x), P (5 < X < 6), P (6 ≤ X < 7), and P (8 ≤ X ≤ 9).
[Figure 3.2: Distribution and density functions for Example 4.2 (left panel: density f ; right panel: distribution F ).]



Solution

F (x) = 0, if x ≤ 5
      = ∫_5^x 0.125(t − 5)dt = (0.0625)(x − 5)^2 , if 5 < x < 9
      = 1, if x ≥ 9.
Furthermore,
P (5 < X < 6) = F (6) − F (5) = 0.0625[(6 − 5)^2 − (5 − 5)^2 ] = 0.0625.
Analogously,
P (6 ≤ X < 7) = F (7) − F (6) = 0.25 − 0.0625 = 0.1875,
and
P (8 ≤ X < 9) = F (9) − F (8) = 1.0 − 0.5625 = 0.4375.
Notice that, since P (X = x) = 0, the inclusion or exclusion of the interval’s boundary points
doesn’t affect the probability of the corresponding interval. In other words,
P (6 ≤ X ≤ 7) = P (6 < X ≤ 7) = P (6 ≤ X < 7) = P (6 < X < 7) = F (7) − F (6) = 0.1875.
Also notice that, since f (x) is increasing on (5, 9), P (5 < X < 6), for instance, is much
smaller than P (8 < X < 9), despite the length of the two intervals being equal. 2
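The piecewise F (x) found in the solution is easy to check numerically (a sketch; the function name F follows the text):

```python
def F(x):
    # Distribution function for the flood-level density f(x) = 0.125(x - 5) on (5, 9).
    if x <= 5:
        return 0.0
    if x >= 9:
        return 1.0
    return 0.0625 * (x - 5) ** 2

p_5_6 = F(6) - F(5)  # 0.0625
p_6_7 = F(7) - F(6)  # 0.1875
p_8_9 = F(9) - F(8)  # 0.4375
```

Note how the probability of the unit interval grows as it moves to the right, mirroring the increasing density.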

Example 4.3 (Rounding-off Error and Uniform Random Variables): Due to the resolution
limitations of a measuring device, the measurements are rounded-off to the second decimal
place. If the third decimal place is 5 or more, the second place is increased by one unit; if the
third decimal place is 4 or less, the second place is left unchanged. For example, 3.2462 would
be reported as 3.25 and 3.2428 would be reported as 3.24. Let X represent the difference
between the (unknown) true measurement, y, and the corresponding rounded–off reading, r.
That is
X = y − r.
Clearly, X can take any value in the interval −0.005 < X < 0.005. It would appear reasonable
in this case to assume that all the possible values are equally likely. Therefore, the relative
probability f (x) that X will fall near any number x0 between −0.005 and 0.005 should then
be the same. That is,
f (x) = c, − 0.005 ≤ x ≤ 0.005,
= 0, otherwise.
The random variable X is said to be uniformly distributed between −0.005 and +0.005. By
property (2),

∫_{−∞}^{+∞} f (x)dx = ∫_{−0.005}^{0.005} c dx = 0.01c = 1,

Therefore, c must be equal to 1/0.01 = 100 and

f (x) = 100, − 0.005 ≤ x ≤ 0.005,


= 0, otherwise.

The corresponding distribution function is

F (x) = 0, x ≤ −0.005,
= 100(x + 0.005), − 0.005 ≤ x ≤ 0.005,
= 1, x ≥ 0.005

[Figure 3.4: Distribution and density of the uniform random variable (left panel: density; right panel: distribution).]
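The constant c and the straight-line distribution function can be checked in a few lines (a sketch, with the endpoints a and b as parameters so the same code covers any uniform density):

```python
a, b = -0.005, 0.005

# Total mass c * (b - a) must equal 1, so c = 1 / (b - a) = 100.
c = 1 / (b - a)

def F(x):
    # Distribution function of the uniform density on (a, b).
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return c * (x - a)
```

For instance, F(0.0) returns 0.5, as it should by symmetry.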

4.4 Summarizing the Main Features of f (x)


All the information concerning the random variable X is contained in its density function,
f (x), and this information can be used and displayed in the form of a picture (a graph of
f (x) versus x), a formula, or a table.
There are situations, however, when one would prefer to concentrate on a summary of the
more complete and complex information contained in f (x). This is the case, for example,
if we are working with several random variables that need to be compared in order to draw
some conclusions.
The summary of f (x), as any other summary, should be simple and informative. The
reader of such a summary should get a good idea of what are the most likely values of X and
what is the degree of uncertainty regarding the prediction of future values of X.
Typical densities found in practice are approximately symmetric and unimodal. These
densities can be summarized in terms of their central location and their dispersion. Therefore,

an approximately symmetric and unimodal density can be fairly well described by giving just
two numbers: a measure of its central location and a measure of its dispersion.
The median and the mean are two popular measures of (central) location and the
interquartile range and the standard deviation are two popular measures of dispersion.
These summary measures are defined and briefly discussed below.

The Median and the Inter–Quartile Range

Given a number α between zero and one, the quantile of order α of the distribution F (or
the r.v. X), denoted Q(α), is implicitly defined by the equation

P (X ≤ Q(α)) = α.

Therefore Q(α) has the property

Q(α) = F^{−1}(α)

and can be found by solving (for x) the equation

F (x) = α.

To find the quantile of order 0.25, for example, we must solve the equation

F (x) = 0.25.

The “special” quantiles Q(0.25) and Q(0.75) are often called the first quartile and the third
quartile, respectively.
The median of X, Med(X), is defined as the corresponding quantile of order 0.5, that is,

Med(X) = Q(0.5).

Evidently, Med(X) divides the range of X into two sets of equal probability. Therefore, it
can be used as a measure for the central location of f (x).
A simple sketch showing the locations of Q(0.25), Med(X) and Q(0.75) constitutes a
good summary of f (x), even if it is not symmetric. Notice that if Q(0.75) − Med(X) is
significantly larger (or smaller) than Med(X) − Q(0.25), then f (x) is fairly asymmetric.
There are situations when there is no solution or too many solutions to the defining
equations above. This is typically the case for discrete random variables. In these cases the
quantiles (including the median) are calculated using some “common–sense” criterion. For
instance if the distribution function F (x) is constant and equal to 0.5 on the interval (x1 , x2 ),
then the median is taken equal to (x1 + x2 )/2 (see Figure 3.5 (a)). To give another example,
if the distribution function F (x) has a jump and doesn’t take the value 0.5, the median is
defined as the location of the jump (see Figure 3.5 (b))
The dispersion about the median is usually measured in terms of the interquartile
range, denoted IQR(X) and defined as:

IQR(X) = Q(0.75) − Q(0.25)



[Figure 3.5: Calculation of the median — (a) F (x) constant and equal to 0.5 on (x1 , x2 ); (b) F (x) jumping past the value 0.5 at x1 .]


When the density f (x) is fairly concentrated (around some central value) IQR(X) tends
to be smaller. Roughly speaking, the size of IQR(X) is directly proportional to the degree
of uncertainty that one faces in trying to predict the future values of X.

Example 4.4 (Waiting Time and Exponential Random Variables) The waiting time X (in
hours) between the arrival of two consecutive customers at a service outlet is a random
variable with exponential density
f (x) = λe^{−λx} , if x ≥ 0,
      = 0, otherwise.
where λ is a positive parameter representing the rate at which customers arrive. For this
example, take λ = 2 customers per hour. (a) Find the distribution function F (x). (b)
Calculate Med(X), Q(0.25) and Q(0.75). (c) Is f (x) symmetric? (d) Calculate IQR(X).

Solution
(a)

F (x) = ∫_{−∞}^x f (t)dt = 2 ∫_0^x exp {−2t}dt = 2 · (exp {−0} − exp {−2x})/2 = 1 − exp {−2x}.

(b) To calculate the median,

1 − exp {−2x} = 1/2 ⇒ exp {−2x} = 1/2 ⇒ − 2x = − log(2).

Therefore, Med(X) = log(2)/2 = 0.347.


To calculate Q(0.25),

1 − exp {−2x} = 1/4 ⇒ exp {−2x} = 3/4 ⇒ − 2x = log(3) − log(4).

Therefore,

Q(0.25) = (log(4) − log(3))/2 = 0.144.
Analogously, to calculate Q(0.75),

1 − exp {−2x} = 3/4 ⇒ exp {−2x} = 1/4 ⇒ − 2x = − log(4).

Q(0.75) = log(4)/2 = 0.693.
(c) Since
Q(0.75) − Med(X) = 0.693 − 0.347 = 0.346
and
Med(X) − Q(0.25) = 0.347 − 0.144 = 0.203,
the distribution is fairly asymmetric.
(d)
IQR = Q(0.75) − Q(0.25) = 0.693 − 0.144 = 0.549.
2
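Since F (x) = 1 − exp {−λx} inverts in closed form, every quantile of an exponential random variable is Q(α) = − log(1 − α)/λ. A sketch that reproduces the numbers above:

```python
from math import log

lam = 2.0  # rate: customers per hour

def Q(alpha, lam=lam):
    # Quantile function: solve 1 - exp(-lam * x) = alpha for x.
    return -log(1 - alpha) / lam

median = Q(0.5)          # log(2)/2, about 0.347
iqr = Q(0.75) - Q(0.25)  # log(3)/2, about 0.549
```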

The Mean, the Variance and the Standard Deviation

Let X be a random variable with density f (x), and let g(X) be a function of X. For
example, g(X) = √X or g(X) = (X − t)^2 , where t is some fixed number. The notation E[g(X)],
read “expected value of g(X)”, will be used very often in this course. The expected value of
g(X) is defined as the weighted average of the function g(x), with weights proportional to
the density function f (x). More precisely:
E[g(X)] = ∫_{−∞}^{+∞} g(x)f (x)dx in the continuous case, and (4.3)

E[g(X)] = Σ_{x∈R} g(x)f (x) in the discrete case, (4.4)

where R is the range of X.



Example 4.5 Refer to the random variables of Example 3.1 (number of defectives) and
Example 4.3 (rounding–off error). Calculate E(X) and E(X 2 ).

Solution Since the random variable X of Example 3.1 is discrete, we must use formula (4.4)
to obtain:

E(X) = (0)(0.5) + (1)(0.2) + (2)(0.15) + (3)(0.10) + (4)(0.03) + (5)(0.02) = 1.02,

and

E(X 2 ) = (0)(0.5) + (1)(0.2) + (4)(0.15) + (9)(0.10) + (16)(0.03) + (25)(0.02) = 2.68.

In the case of the continuous random variable X of Example 4.3 we must use formula (4.3):

E(X) = ∫_{−∞}^{+∞} xf (x)dx = 100 ∫_{−0.005}^{0.005} x dx = 0,

E(X^2 ) = ∫_{−∞}^{+∞} x^2 f (x)dx = 100 ∫_{−0.005}^{0.005} x^2 dx
        = 100[(0.005)^3 − (−0.005)^3 ]/3 = (200)(0.005)^3 /3 = 0.00000833.
2
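Formula (4.4) is just a weighted sum, which is one line of code. The sketch below redoes the discrete part of Example 4.5 (the density is stored as a value → probability dictionary; names are ours):

```python
# Density of the "number of defectives" variable used in Example 4.5.
f = {0: 0.50, 1: 0.20, 2: 0.15, 3: 0.10, 4: 0.03, 5: 0.02}

def expect(g, f):
    # E[g(X)] = sum of g(x) f(x) over the range of X  (formula (4.4)).
    return sum(g(x) * p for x, p in f.items())

mean = expect(lambda x: x, f)             # E(X)   = 1.02
mean_square = expect(lambda x: x * x, f)  # E(X^2) = 2.68
```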

The mean of X as a Measure of Central Location

Suppose that it is proposed that a certain number t is used as the measure of central
location of X. How could we decide if this proposed value is appropriate? One way to think
about this question is as follows. If t is a good measure of central location then, in principle,
one would expect that the squared residuals (x − t)^2 will be fairly small for those values x
of X which are highly likely (those for which f (x) is large). If this is so, then one would also
expect that the average of these squared residuals,

D(t) = E{(X − t)^2 },

will also be fairly small. Notice that


D(t) = ∫_{−∞}^{+∞} (x − t)^2 f (x)dx in the continuous case
     = Σ_{x∈R} (x − t)^2 f (x) in the discrete case.

But we could begin this reasoning from the end and say that a good measure of central
location must minimize D(t). This “optimal” value of t, called the mean of X, is denoted
by the Greek letter µ.

To find µ we differentiate D(t) and set the derivative equal to zero. In the continuous
case,

D′(t) = −2 ∫_{−∞}^{+∞} (x − t)f (x)dx = −2[E(X) − t] = 0 ⇒ t = E(X),

and the discrete case can be treated similarly. Since D′′(t) = 2 > 0 for all t, the critical point
t = E(X) minimizes D(t). Therefore,

µ = E(X)

This procedure of defining the desired summary measure by the property of minimizing the
average of the squared residuals is a very important technique in applied statistics called the
method of minimum mean squared residuals. We will come across several applications
of this technique throughout this course.
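The minimizing property of µ can also be seen numerically: compute D(t) on a grid of candidate centres and observe that the smallest value occurs at t = E(X). A sketch, reusing the discrete density of Example 4.5:

```python
f = {0: 0.50, 1: 0.20, 2: 0.15, 3: 0.10, 4: 0.03, 5: 0.02}
mu = sum(x * p for x, p in f.items())  # E(X) = 1.02

def D(t):
    # Mean squared residual E[(X - t)^2] for a candidate centre t.
    return sum((x - t) ** 2 * p for x, p in f.items())

# Scan t = 0.00, 0.01, ..., 5.00; the minimizer is mu itself.
grid = [i / 100 for i in range(501)]
t_best = min(grid, key=D)  # 1.02
```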

The Standard Deviation of X as a Measure of Dispersion

It is clear from the above discussion that

D(t) ≥ D(µ) = E{(X − µ)^2 }

for all values of t. The quantity D(µ) is usually denoted by the Greek symbol σ 2 (read
“Sigma squared”) and called the variance of X. An alternative notation for the variance
of X, also often used in this course, is Var(X).
It is evident that Var(X) will tend to be smaller when the density of X is more concen-
trated around µ, since the smaller squared residuals will receive larger weights. Therefore,
Var(X) could be taken as a measure of the dispersion of f (x). A problem with Var(X) is
that it is expressed in a unit which is the square of the original unit of X. This problem is
easily solved by taking the (positive) square root of Var(X). This is called the standard
deviation of X and denoted by either σ or SD(X).
σ = SD(X) = +√Var(X) = +√(σ^2 ).

Example 4.4 (continued): (a) Calculate the mean and the standard deviation for the waiting
time between two consecutive customers, X. (b) How do they compare with the corresponding
median waiting time and interquartile range calculated before?

Solution

(a) Using integration by parts,

E(X) = 2 ∫_0^{+∞} x exp {−2x}dx = 2[−x exp {−2x}/2]_0^{+∞} + ∫_0^{+∞} exp {−2x}dx
     = [− exp {−2x}/2]_0^{+∞} = exp {0}/2 = 0.5.
More generally, if X is an exponential random variable with parameter (rate) λ, then
E(X) = 1/λ. (4.5)
Using integration by parts again, we get

E(X^2 ) = 2 ∫_0^{+∞} x^2 exp {−2x}dx = 2[−x^2 exp {−2x}/2]_0^{+∞} + 2 ∫_0^{+∞} x exp {−2x}dx
        = 2 ∫_0^{+∞} x exp {−2x}dx = 0.5.

Therefore,

SD(X) = +√(E(X^2 ) − [E(X)]^2 ) = +√(0.5 − (0.5)^2 ) = 0.5.
More generally, if X is an exponential random variable with parameter λ, then
Var(X) = 1/λ^2 and SD(X) = 1/λ. (4.6)
(b) Since the density of X is asymmetric, the median and the mean are expected to be
different (as they are). Since the density is skewed to the right (longer right-hand tail),
the mean waiting time (0.5) is larger than the median waiting time (0.347).
The two measures of dispersion (IQR = 0.549 and SD = 0.5) are quite consistent. 2

Properties of the Mean and the Variance

Property 1: E(aX + b) = aE(X) + b for all constants a and b.


Proof

E(aX + b) = Σ_i (axi + b)f (xi ) = a[Σ_i xi f (xi )] + b = aE(X) + b.

The proof for the continuous case is identical. 2

Property 2: E(X + Y ) = E(X) + E(Y ) for all pairs of random variables X and Y .

Property 3: E(XY ) = E(X)E(Y ) for all pairs of independent random variables X and
Y.

Property 4: Var(aX + b) = a2 Var(X) for all constants a and b.

Proof
Var(aX + b) = E{[(aX + b) − (aµ + b)]²} = E{[a(X − µ)]²} = a²E[(X − µ)²] = a²Var(X). 2

Property 5: Var(X ± Y) = Var(X) + Var(Y) for all pairs of independent random variables X and Y.

All these properties will be used very often in this course. The proofs of properties 2, 3
and 5 are beyond the scope of this course, and therefore these properties must be accepted
as facts and used throughout the course.

The formula
Var(X) = E(X 2 ) − [E(X)]2 = E(X 2 ) − µ2 ,
is often used for calculations. The derivation of this formula is very simple, using the prop-
erties of the mean listed above. In fact,
Var(X) = E{(X − µ)2 } = E(X 2 + µ2 − 2µX) = E(X 2 ) + µ2 − 2µE(X)
= E(X 2 ) + µ2 − 2µ2 = E(X 2 ) − µ2 .
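The shortcut formula can be verified directly on any discrete density. The sketch below (an illustrative check; the values and probabilities are an arbitrary small density, not taken from a specific example) computes the variance both from the definition and from the shortcut.

```python
# Direct check of the shortcut Var(X) = E(X^2) - [E(X)]^2 against the
# definition Var(X) = E[(X - mu)^2], on a small illustrative discrete density.
xs = [-1, 0, 1, 2]
ps = [0.2, 0.5, 0.2, 0.1]          # probabilities sum to one

mu = sum(x * p for x, p in zip(xs, ps))                      # E(X)
ex2 = sum(x * x * p for x, p in zip(xs, ps))                 # E(X^2)
var_shortcut = ex2 - mu ** 2
var_direct = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))  # E[(X - mu)^2]
print(mu, var_shortcut, var_direct)
```

Both routes give the same number, as the derivation above guarantees.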

4.5 Sum and Average of Independent Random Variables


Random experiments are often independently repeated many times generating a sequence
X1 , X2 , . . . , Xn of n independent random variables. We will consider linear combinations of
these variables,
Y = a1 X1 + a2 X2 + · · · + an Xn ,
where the coefficients a1 , a2 , . . . , an are some given constants. For example, ai = 1, for all i,
produces the total
T = X1 + X2 + · · · + Xn ,
and ai = 1/n, for all i, produces the average
X = (X1 + X2 + · · · + Xn )/n.
Using the properties of the expected value and variance we have
E(Y ) = a1 E(X1 ) + a2 E(X2 ) + · · · + an E(Xn )
and
Var(Y ) = a21 Var(X1 ) + a22 Var(X2 ) + · · · + a2n Var(Xn ).
Typically, the n random variables Xi will have a common mean µ and a common variance
σ². In this case the sequence {X1, X2, . . . , Xn} is said to be a random sample, and then
E(Y ) = (a1 + a2 + · · · + an )µ
and
Var(Y ) = (a21 + a22 + · · · + a2n )σ 2 .
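These two formulas can be checked by simulation. In the sketch below (an illustrative check; the coefficients, µ = 3, σ = 2 and the sample sizes are arbitrary choices, and normal draws are used only for convenience since the formulas hold for any distribution), the simulated mean and variance of Y should be close to (Σ aᵢ)µ = 10.5 and (Σ aᵢ²)σ² = 21.

```python
import random

# Simulation sketch: for a random sample with common mean mu = 3 and
# variance sigma^2 = 4, and Y = a1*X1 + a2*X2 + a3*X3,
# E(Y) = (a1 + a2 + a3) * mu and Var(Y) = (a1^2 + a2^2 + a3^2) * sigma^2.
random.seed(1)
a = [0.5, 1.0, 2.0]
mu, sigma, reps = 3.0, 2.0, 200_000

ys = [sum(ai * random.gauss(mu, sigma) for ai in a) for _ in range(reps)]
mean_y = sum(ys) / reps
var_y = sum((y - mean_y) ** 2 for y in ys) / reps
print(round(mean_y, 2), round(var_y, 1))
```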

Example 4.6 Twenty randomly selected students will be asked the question “do you reg-
ularly smoke?”. (a) Calculate the expected number of smokers in the sample if 10% of the
students smoke; (b) what is your “estimate” of the proportion, p, of smokers if six students
answered “Yes”?; (c) What are the expected value and the variance of your estimate?

Solution

(a) Let Xi be equal to one if the ith student answers “Yes” and equal to zero otherwise.
Let p be equal to the proportion of smokers in the student population. Then the Xi are
independent discrete random variables with density f (0) = 1 − p and f (1) = p. Therefore,

E(Xi) = E(Xi²) = 0·f(0) + 1·f(1) = f(1) = p = 0.1

and
Var(Xi ) = E(Xi2 ) − [E(Xi )]2 = p − p2 = p(1 − p) = 0.09.
Hence, the expected number of smokers in a sample of 20 students is

E(X1 + X2 + · · · + X20 ) = 20p = 2.

The corresponding variance is

Var(X1 + X2 + · · · + X20 ) = 20p(1 − p) = 1.8

(b) A reasonable estimate for the fraction, p, of smokers in the population is given by the
corresponding fraction of smokers in the sample, X. In the case of our sample, the observed
value, x, of X is x = 6/20 = 0.3.
(c) The expected value of the estimate in (b) is p and its variance is p(1 − p)/20. Why? 2
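The answer to the "Why?" can be checked by simulation. The sketch below (an illustrative check; the seed and number of repetitions are arbitrary) repeatedly draws samples of 20 students with p = 0.1 and confirms that the estimate has mean p = 0.1 and variance p(1 − p)/20 = 0.0045.

```python
import random

# Simulation sketch of Example 4.6(c): with p = 0.1 and n = 20, the sample
# fraction of smokers should have mean p = 0.1 and variance
# p(1 - p)/20 = 0.0045.
random.seed(2)
p, n, reps = 0.1, 20, 100_000

estimates = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]
mean_est = sum(estimates) / reps
var_est = sum((e - mean_est) ** 2 for e in estimates) / reps
print(round(mean_est, 3), round(var_est, 4))
```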

Example 4.7 The independent random variables X, Y and Z represent the monthly sales
of a large company in the provinces of BC, Ontario and Quebec, respectively. The mean and
standard deviations of these variables are as follows (in hundreds of dollars):

E(X) = 1,435, E(Y) = 2,300, E(Z) = 1,500,

SD(X) = 120, SD(Y) = 150, SD(Z) = 150.

(a) What are the expected value and the standard deviation of the total monthly sales?
(b) Sales manager J. Smith is responsible for the sales in BC and 2/3 of the sales in Ontario.
Sales manager R. Campbell is responsible for the sales in Quebec and the remaining 1/3 of
the sales in Ontario. What are the expected values and standard deviations of Mr. Smith’s
and Mrs. Campbell’s monthly sales?
(c) What are the expected values and standard deviations of the annual sales for each
province? Assume for simplicity that the monthly sales are independent.

Solution

(a) The total monthly sales are


S = X + Y + Z.
By Property 2

E(S) = E(X) + E(Y ) + E(Z) = 1, 435 + 2, 300 + 1, 500 = 5, 235.

By Property 5

Var(S) = Var(X) + Var(Y ) + Var(Z) = 1202 + 1502 + 1502 = 59, 400.

Therefore,
SD(S) = √59,400 = 243.72.
(b) First, notice that

S1 = X + (2/3)Y and S2 = Z + (1/3)Y,

are Mr. Smith’s and Mrs. Campbell’s monthly sales. By Property 2

E(S1) = E(X) + (2/3)E(Y) = 1,435 + (2/3)(2,300) = 2,968.33.

Analogously
E(S2 ) = 2266.67
By Property 5

Var(S1 ) = Var(X) + (2/3)2 Var(Y ) = 1202 + (2/3)2 1502 = 24, 400,

and so
SD(S1) = √24,400 = 156.20.
Analogously
SD(S2 ) = 158.11.
(c) If Xi (i = 1, . . . , 12) represent BC’s monthly sales, the annual sales for BC are

T = Σ_{i=1}^{12} Xi.

Therefore,

E(T) = E[Σ_{i=1}^{12} Xi] = Σ_{i=1}^{12} E(Xi) = (12)(1,435) = 17,220.

The variance and the standard deviation of the annual sales in BC (assuming independence)
are:

Var(T) = Var[Σ_{i=1}^{12} Xi] = Σ_{i=1}^{12} Var(Xi) = (12)(120²) = 172,800,

SD(T) = √172,800 = 415.69.
The student can now calculate the expected values and the standard deviations for the annual
sales in Ontario and Quebec. 2
Question: The total monthly sales can be obtained as the sum of Mr. Smith’s (S1 =
X + (2/3)Y) and Mrs. Campbell’s (S2 = Z + (1/3)Y) monthly sales, with variances (calculated
in part (b)) equal to 24,400 and 25,000, respectively. Why is it then true that the total sales
variance (Var(X + Y + Z)), calculated in part (a), is not equal to the sum 24,400 + 25,000 =
49,400?

4.6 Max and Min of Independent Random Variables


The maximum, V , and the minimum, Λ, of a sequence of n independent random variables
are of practical interest. They can be used to represent (or model) a number of random
quantities which naturally appear in practice. For example, the maximum,

V = max{X1, X2, . . . , Xn},

can be used to model

1. The lifetime of a system of n components connected in parallel. In this case

Xi = Lifetime of the ith component.

2. The completion time of a project made up of n subprojects which can be pursued
simultaneously. In this case

Xi = Completion time for the ith subproject.

3. The maximum flood level of a river in the next n years. In this case

Xi = Maximum flood level in the ith year .

On the other hand, the minimum,

Λ = min{X1, X2, . . . , Xn},

can be used to model

1. The lifetime of a system of n components connected in series. In this case

Xi = Lifetime of the ith component.



2. The completion time of a project independently pursued by n competing teams. In this
case

Xi = Completion time by the ith team.

3. The minimum flood level of a river in the next n years. In this case
Xi = Minimum flood level in the ith year .

4.6.1 The Maximum


Suppose that Fi (x) and fi (x) are the distribution and density functions of the random variable
Xi , and let FV (v) and fV (v) be the distribution and density functions of the maximum V .
Since the maximum, V , is less than a given value, v, if and only if each random variable Xi
is less than v we have
FV (v) = P {V ≤ v} = P {X1 ≤ v, X2 ≤ v, . . . , Xn ≤ v}

= P {X1 ≤ v}P {X2 ≤ v} · · · P {Xn ≤ v} [ since the variables Xi are independent]

= F1 (v)F2 (v) · · · Fn (v) [since P {Xi ≤ v} = Fi (v), i = 1, . . . , n]

This formula is greatly simplified when the Xi ’s are identically distributed, that is, when
F1 (x) = F2 (x) = · · · = Fn (x) = F (x)
for all values of x. In this case,
FV(v) = [F(v)]^n    (4.7)
and
fV(v) = F′V(v) = n[F(v)]^{n−1} f(v).    (4.8)
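Formula (4.7) is easy to verify by simulation. The sketch below (an illustrative check; Uniform(0,1) variables are used because their distribution function is simply F(x) = x, and the seed and sample size are arbitrary) compares the empirical probability that the maximum of five uniforms is at most 0.8 with the theoretical value 0.8⁵.

```python
import random

# Sketch checking formula (4.7): for i.i.d. Uniform(0,1) variables F(x) = x
# on (0,1), so the maximum of n = 5 of them has F_V(v) = v**5.
random.seed(3)
n, reps, v = 5, 100_000, 0.8

hits = sum(max(random.random() for _ in range(n)) <= v for _ in range(reps))
empirical = hits / reps
theoretical = v ** n          # 0.8**5
print(round(empirical, 3), round(theoretical, 5))
```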

Example 4.8 A system consists of five components connected in parallel. The lifetime
(in thousands of hours) of each component is an exponential random variable with mean
µ = 3. See Example 4.4 and Example 4.4 (continued) for the definition of exponential
random variables and formulas for their mean and variance.
(a) Calculate the median life (often called “half–life”) and standard deviation for each com-
ponent.
(b) Calculate the probability that a component fails before 3500 hours.
(c) Calculate the probability that the system will fail before 3500 hours. Compare this with
the probability that a component fails before 3500 hours.
(d) Calculate the half–life (median life), mean life and standard deviation for the system.

Solution

(a) Using equation (4.5) and the fact that the lifetime X of each component is exponentially
distributed with mean µ = 3 we obtain that λ = 1/3 and that the density and distribution
functions of X are
f (x) = (1/3) exp{−x/3} and F (x) = 1 − exp{−x/3}, x ≥ 0,
respectively. The half-life of each component can be obtained as follows
1 − exp{−x/3} = 0.5 ⇒ exp{−x/3} = 0.5 ⇒ x0 = −3 log(0.5) = 2.08.
Therefore, the half-life of each component is equal to 2, 080 hours. To obtain the standard
deviation, recall that from equation (4.6) the standard deviation of an exponential random
variable is equal to its mean, that is,
SD(X) = E(X) = µ.
Therefore, the standard deviation of the lifetime of each component is equal to 3.
(b) The probability that a component will fail before 3500 hours is
P {X ≤ 3.5} = F (3.5) = [1 − exp{−3.5/3}] = 0.6886.
(c) Using formula (4.7)
FV (v) = [1 − exp{−v/3}]5
and so the probability that the system will fail before 3, 500 hours is
P {V ≤ 3.5} = FV (3.5) = [1 − exp{−3.5/3}]5 = (0.6886)5 = 0.1548.
The probability that a single component fails (calculated in part (b)) is more than four times
larger.
(d) To calculate the median life of the system we must use formula (4.7) once again:

FV(v) = 0.5 ⇒ [1 − exp{−v/3}]^5 = 0.5 ⇒ exp{−v/3} = 1 − (0.5)^{1/5} = 0.12945
⇒ v0 = −3 log(0.12945) = 6.133.
Therefore, the median life of the system is equal to 6, 133 hours.
To calculate the mean life we must first obtain the density function of V. Using formula (4.8)
above we obtain

fV(v) = (5)[1 − exp{−v/3}]^4 (1/3) exp{−v/3}

      = (5/3)[exp{−v/3} − 4 exp{−2v/3} + 6 exp{−v} − 4 exp{−4v/3} + exp{−5v/3}].


Since, for any β > 0,

∫_0^∞ v exp{−βv} dv = (1/β) ∫_0^∞ v β exp{−βv} dv = (1/β)(1/β) = 1/β²

(the second integral being the mean, 1/β, of an exponential random variable with rate β),

the mean life, E(V ), is equal to


E(V) = ∫_0^∞ v fV(v) dv = (5/3)[ ∫_0^∞ v exp{−v/3} dv − 4 ∫_0^∞ v exp{−2v/3} dv
       + 6 ∫_0^∞ v exp{−v} dv − 4 ∫_0^∞ v exp{−4v/3} dv + ∫_0^∞ v exp{−5v/3} dv ]

     = (5/3)[ (9) − (4)(9/4) + (6)(1) − (4)(9/16) + (9/25) ] = 6.85.

To calculate SD(V ) we must first find

Var(V ) = E(V 2 ) − [E(V )]2 = E(V 2 ) − (6.85)2 .

Since, for any β > 0,

∫_0^∞ v² exp{−βv} dv = 2/β³, [why?]

we have that
E(V²) = ∫_0^∞ v² fV(v) dv = (5/3)[ ∫_0^∞ v² exp{−v/3} dv − 4 ∫_0^∞ v² exp{−2v/3} dv
        + 6 ∫_0^∞ v² exp{−v} dv − 4 ∫_0^∞ v² exp{−4v/3} dv + ∫_0^∞ v² exp{−5v/3} dv ]

      = (2)(5/3)[ (27) − (4)(27/8) + (6)(1) − (4)(27/64) + (27/125) ] = 60.095.

Therefore,
SD(V) = +√(60.095 − (6.85)²) = √13.1725 = 3.63.
2
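The calculations of Example 4.8 can be cross-checked by simulation. The sketch below (an illustrative Monte Carlo check; the seed and number of repetitions are arbitrary) simulates the maximum of five Exponential(1/3) lifetimes and compares it with the median 6.133, mean 6.85 and SD 3.63 obtained above.

```python
import random

# Simulation sketch of Example 4.8: V = max of 5 independent Exponential(1/3)
# lifetimes; the text finds median ~ 6.133, mean ~ 6.85 and SD ~ 3.63
# (in thousands of hours).
random.seed(4)
reps = 200_000
vs = sorted(max(random.expovariate(1 / 3) for _ in range(5)) for _ in range(reps))

mean_v = sum(vs) / reps
sd_v = (sum((v - mean_v) ** 2 for v in vs) / reps) ** 0.5
median_v = vs[reps // 2]
print(round(mean_v, 2), round(sd_v, 2), round(median_v, 2))
```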

4.6.2 The Minimum


Now we turn our attention to the distribution of the minimum, Λ. Let FΛ (u) and fΛ (u)
denote the distribution and density functions of Λ. Since the minimum, Λ, is greater than a
given value, u, if and only if each random variable Xi is greater than u we have

FΛ (u) = P {Λ ≤ u} = 1 − P {Λ > u} = 1 − P {X1 > u, X2 > u, . . . , Xn > u}

= 1 − P {X1 > u}P {X2 > u} · · · P {Xn > u} [ since the variables Xi are independent]

= 1 − [1 − F1 (u)][1 − F2 (u)] · · · [1 − Fn (u)] [since P {Xi > u} = 1 − Fi (u), i = 1, . . . , n]



As before, this formula can be greatly simplified when the Xi’s are identically distributed, that
is, when
F1 (x) = F2 (x) = · · · = Fn (x) = F (x)
for all values of x. In this case,
FΛ(u) = 1 − [1 − F(u)]^n    (4.9)
and
fΛ(u) = F′Λ(u) = n[1 − F(u)]^{n−1} f(u).    (4.10)

Example 4.9 A system consists of five components connected in series. The lifetime (in
thousands of hours) of each component is an exponential random variable with mean µ = 3.
(a) Calculate the probability that the system will fail before 3500 hours. Compare this with
the probability that a component fails before 3500 hours.
(b) Calculate the median life, the mean life and the standard deviation for the system.

Solution

(a) Using formula (4.9) above we obtain


FΛ (u) = 1 − [exp{−u/3}]5 = 1 − exp{−(5/3)u}
and so Λ is also exponentially distributed with parameter 5 × (1/3) = 5/3. In general,
the minimum of n exponential random variables with parameter λ is also exponential with
parameter nλ. Finally,
P{Λ ≤ 3.5} = FΛ(3.5) = 1 − exp{−(5/3)(3.5)} = 0.9971.
The probability that a component will fail before 3500 has been found (in Example 4.8) to
be 0.6886. Therefore, the probability that the system will fail before 3, 500 hours is almost
45% larger.
(b) Since Λ is exponentially distributed, its mean and standard deviation can be obtained
directly from the distribution function found in (a), using equations (4.5) and (4.6). That is,
E(Λ) = SD(Λ) = (3/5) = 0.6.
Therefore, the mean life of the system, 600 hours, is 5 times smaller than that of the individual
components. Finally, the median life of the system can be found as follows:
1 − exp{−(5u)/3} = 0.5 ⇒ exp{−(5u)/3} = 0.5 ⇒ u0 = −3 log(0.5)/5 = 0.416.
Therefore, the median life of the system is equal to 416 hours. 2
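The fact used in part (a), that the minimum of n independent exponentials is again exponential with rate nλ, is easy to confirm by simulation. The sketch below (an illustrative check; the seed and number of repetitions are arbitrary) verifies both the mean 0.6 and the failure probability 0.9971 from Example 4.9.

```python
import math
import random

# Sketch of the fact used in Example 4.9: the minimum of n independent
# Exponential(lambda) variables is Exponential(n * lambda), so with
# lambda = 1/3 and n = 5 the mean is 3/5 = 0.6 and
# P(min <= 3.5) = 1 - exp(-(5/3) * 3.5) ~ 0.9971.
random.seed(5)
lam, n, reps = 1 / 3, 5, 200_000

mins = [min(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
mean_min = sum(mins) / reps
p_fail = sum(m <= 3.5 for m in mins) / reps
theoretical = 1 - math.exp(-(n * lam) * 3.5)
print(round(mean_min, 3), round(p_fail, 4), round(theoretical, 4))
```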

4.7 Exercises
4.7.1 Exercise Set A
Problem 4.1 A system consists of five identical components all connected in series. Suppose
each component has a lifetime (in hours) that is exponentially distributed with the rate
λ = 0.01, and all the five components work independently of one another.
Define T to be the time at which the system fails. Consider the following questions:
(a) Obtain the distribution of T . Can you tell what type of distribution it is?
(b) Compute the IQR (interquartile range) for the distribution obtained in part (a).
(c) What is the probability that the system will last at least 15 hours?

Problem 4.2 Are the following functions density functions? Why?


(a) f1 (x) = 1, 1 ≤ x ≤ 3; 0, otherwise.
(b) f2 (x) = x, −1 ≤ x ≤ 1; 0, otherwise.
(c) f3 (x) = exp(−x), x ≥ 0; 0, otherwise.

Problem 4.3 Suppose that the response time X at a certain on-line computer terminal (the
elapsed time between the end of a user’s inquiry and the beginning of the system’s response
to that inquiry) has an exponential distribution with expected response time equal to 5
seconds (i.e. the exponential rate is λ = 0.2).
(a) Calculate the median response time.
(b) What is the probability that the next three response times exceed 5 seconds? (Assume
that all the response times are independent).

Problem 4.4 The hourly volume of traffic, X, for a proposed highway has density propor-
tional to g(x), where

g(x) = x(100 − x)   if 0 < x < 100
     = 0            otherwise.
(a) Derive the density and the distribution functions of X.
(b) The traffic engineer may design the highway capacity equal to the mean of X. Determine
the design capacity of the highway and the corresponding probability of exceedence
(i.e. traffic volume is greater than the capacity).

Problem 4.5 A discrete random variable X has the density function given below.

x      −1    0     1     2
f(x)   0.2   c    0.2   0.1
(a) Determine c;
(b) Find the distribution function F (x);
(c) Show that the random variable Y = X 2 has the density function g(y) given by
y      0     1     4
g(y)  0.5   0.4   0.1

(d) Calculate expectation E(X), variance Var(X) and the mode of X (the value x with the
highest density).

Problem 4.6 A continuous random variable X has the density function f(x) = cx on the
interval 0 ≤ x ≤ 1, and 0 otherwise.
(a) Determine the constant c;
(b) Find the distribution function F (x) of X;
(c) Calculate E(X), Var(X) and the median, Q(0.5);
(d) Find P (|X| ≥ 0.5).

Problem 4.7 Show that


(a) Any distribution function F (x) is non-decreasing, i.e. for any real values x1 < x2 ,
F (x1 ) ≤ F (x2 ).
(b) Suppose X is a random variable with finite variance. Then, Var(X) ≤ E(X 2 ).
(c) If a density function f (x) is symmetric around 0, i.e. f (−x) = f (x) for all x ∈ R, then
F (0) = P (X ≤ 0) = 0.5.

Problem 4.8 If the probability density of a random variable is given by



f(x) = kx          for 0 < x < 2
     = 2k(3 − x)   for 2 ≤ x < 3
     = 0           elsewhere
(a) Find the value of k such that f (x) is a probability density function.
(b) Find the corresponding distribution function.
(c) Find the mean and median.

Problem 4.9 Suppose a random variable X has a probability density function given by
f(x) = kx(1 − x)   for 0 ≤ x ≤ 1
     = 0           elsewhere.

(a) Find the value of k such that f (x) is a probability density function.
(b) Find P (0.4 ≤ X ≤ 1).
(c) Find P (X ≤ 0.4|X ≤ 0.8).
(d) Find F (b) = P (X ≤ b), and sketch the graph of this function.

Problem 4.10 Suppose that random variables X and Y are independent and have the same
mean 3 and standard deviation 2. Calculate the mean and variance of X − Y .

Problem 4.11 Suppose X has an exponential distribution with an unknown parameter λ,
i.e. its density is

f(x) = λ exp(−λx)   if x ≥ 0
     = 0            otherwise.
If P (X ≥ 1) = 0.25, determine λ.

Problem 4.12 Suppose an enemy aircraft flies directly over the Alaska pipeline and fires
a single air-to-surface missile. If the missile hits anywhere within 10 feet of the pipeline, a
major structural damage will occur and the oil flow will be disrupted. Let X be the distance
from the pipeline to the point of impact. Note that X is a continuous random variable. The
probability function describing the missile’s point of impact is given by

f(x) = (60 + x)/3600   for −60 < x < 0
     = (60 − x)/3600   for 0 ≤ x < 60
     = 0               otherwise.
(a) Find the distribution function, F (x).
(b) Let A be the event “flow is disrupted.” Find P (A).
(c) Find the mean and the standard deviation of X.
(d) Find the median and the interquartile range of X.
Problem 4.13 Consider a random variable X which follows the uniform distribution on the
interval (0, 1).
(a) Give the density function f(x) and obtain the cumulative distribution function F(x) of X;
(b) Calculate the mean (expectation) E(X) and variance Var(X);
(c) Let Y = √X. Find E(Y) and Var(Y);
(d) Obtain the distribution function G(y) and furthermore the density function g(y) of the
random variable Y.
Problem 4.14 The reaction time (in seconds) to a certain stimulus is a continuous random
variable with density given below
f(x) = 3/(2x²)   for 1 ≤ x ≤ 3
     = 0         otherwise
(a) Obtain the distribution function.
(b) Take next two observations X1 and X2 (we can assume they are i.i.d). Then consider
V = max{X1 , X2 }. What is the density and distribution functions of V ?
(c) Compute the expectation E(V ) and the standard deviation SD(V ).
(d) Compute the difference between the expectation and the median for the distribution of
V.

4.7.2 Exercise Set B


Problem 4.15 The continuous random variable X takes values between −2 and 2 and its
density function is proportional to
(a) 4 − x2
(b) x2
(c) 2 + x
(d) exp {−|x|}
Find, in each case, the density function, the distribution function, the mean, the standard
deviation, the median and the interquartile range of X.

Problem 4.16 Find the density functions corresponding to the pictures in Figure 3.7. For
each case also calculate the distribution function, the mean, the median, the interquartile
range and the standard deviation.

Figure 3.7: Pictures of densities [six sketched density curves, (a)–(f)]

Problem 4.17 The density function for the lifetime of a part, X, decays exponentially fast.
If the half–life of X is equal to fifty weeks, find the mean and standard deviation of X.

Problem 4.18 The density function for the measurement error, X, is uniform on the interval
(−0.5, 0.8). What is the distribution function of X 2 ? What is the density of X 2 ?

Problem 4.19 The hourly volume of traffic, X, for a proposed highway has density propor-
tional to d(x), where

d(x) = x                if 0 < x < 300
     = (3/2)(500 − x)   if 300 ≤ x < 500
     = 0                otherwise.

(a) Derive the density and the distribution functions of X.


(b) The traffic engineer may design the highway capacity equal to one of the following:
(i) the mode of X (defined as the value x with highest density)
(ii) the mean of X
(iii) the median of X
(iv) the quantile of order 0.90 of X (Q(0.90)).
Determine the design capacity of the highway and the corresponding probability of ex-
ceedance (that is, capacity is less than traffic volume) for each of the four cases.

Problem 4.20 A company has 20 welders with the following “performances” (each row gives
the probability that a welded item has 0, 1, 2, 3 or 4 cracks):

Welder    0      1      2      3      4
1 0.10 0.20 0.40 0.20 0.10
2 0.20 0.20 0.20 0.20 0.20
3 0.50 0.30 0.10 0.05 0.05
4 0.05 0.05 0.10 0.30 0.50
5 0.50 0.00 0.00 0.00 0.50
6 0.85 0.00 0.00 0.00 0.15
7 0.30 0.25 0.20 0.10 0.15
8 0.20 0.30 0.20 0.10 0.20
9 0.10 0.10 0.50 0.20 0.10
10 0.20 0.50 0.10 0.20 0.00
11 0.30 0.30 0.40 0.00 0.00
12 0.10 0.10 0.50 0.15 0.15
13 0.35 0.25 0.20 0.15 0.05
14 0.40 0.30 0.10 0.10 0.10
15 0.20 0.30 0.50 0.00 0.00
16 0.60 0.30 0.10 0.00 0.00
17 0.70 0.10 0.10 0.10 0.00
18 0.10 0.80 0.10 0.00 0.00
19 0.40 0.40 0.10 0.10 0.00
20 0.15 0.60 0.15 0.10 0.00

1) How would you rank these twenty welders (e.g. for promotion) on the basis of this
information alone?

2) Would you change the “ranking” if you knew that items with one, two, three and four
cracks must be sold for $6, $15, $40, and $60 less, respectively? What if the associated losses
were $6, $15, $40, and $80? Suggestion: Use the computer.

Problem 4.21 Suppose that the maximum annual wind velocity near a construction site,
X, has exponential density
f (x) = λ exp {−λx}, x > 0.
(a) If the records of maximum wind speed show that the probability of maximum annual
wind velocities less than 72 mph is approximately 0.90, suggest an appropriate estimate for
λ.
(b) If the annual maximum wind speeds for different years are statistically independent,
calculate the probability that the maximum wind speed in the next three years will exceed
75 mph. What about the next 15 years?
(c) Plot the distribution function of the maximum wind speed for the next year, for the next
3 years and for the next 15 years. Briefly report your conclusions.
(d) Let Qm(p) (m = 1, 2, . . .) denote the quantile of order p for the maximum wind speed
over the next m years. Show that

Qm(p) = Q1(p^{1/m}), for all m = 1, 2, . . .

Use this formula to plot Qm (0.90) versus m. Same for Qm (0.95). Briefly report your conclu-
sions. Suggestion: Use the computer.

Problem 4.22 A system has two independent components A and B connected in parallel.
If the operational life (in thousand of hours) of each component is a random variable with
density
f(x) = (1/36)(x − 4)(10 − x)   for 4 < x < 10
     = 0                       otherwise

(a) Find the median and the mean life of each component. Find also the standard deviation
and IQR.
(b) Calculate the distribution and density functions for the lifetime of the system. What is
the expected lifetime of the system?
(c) Same as (b) but assuming that the components are connected “in series” instead of “in
parallel”.

Problem 4.23 A large construction project consists of building a bridge and two roads
linking it to two cities (see the picture below). The contractual time for the entire project is
18 months.
The construction of each road will require between 15 and 20 months and that of the
bridge will require between 12 and 19 months. The three parts of the projects can be done
simultaneously and independently. Let X1 , X2 and Y represent the construction times for the
two roads and the bridge, respectively and suppose that these random variables are uniformly
distributed on their respective ranges.
(a) What is the expected time for completion of each part of the project? What are the
corresponding standard deviations?
(b) What is the expected time for the completion of the entire project? What is the corre-
sponding standard deviation?
(c) What is the probability that the project will be completed within the contractual time?

Problem 4.24 Same as Problem 4.23, but assuming that the variables X1, X2 and Y have
triangular distributions over their ranges.

[Picture for Problem 4.23: Road 1 and Road 2 each connect a city to the bridge across the river.]
Chapter 5

Normal Distribution

5.1 Definition and Properties

Normal Distribution N (µ, σ 2 )

The Normal distribution is, for reasons that will be evident as we progress in this course,
the most popular distribution among engineers and other scientists. It is a continuous dis-
tribution with density,
f(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}

where µ and σ are “parameters” which control the central location and the dispersion of the
density, respectively. The normal density is perfectly symmetric about the center, µ, and
the bell-shaped curve becomes “shorter” and “fatter” as σ increases.

Figure 4.1: Normal density functions for σ = 1, 1.5, 2 and 3


The density steadily decreases as we move away from its highest value,

f(µ) = 1/(σ√(2π)).
Therefore, the relative (and also the absolute) probability that X will take a value near µ is
the highest. Since f (x) → 0 as x → ∞, exponentially fast,
g(k) = P {|X − µ| ≤ kσ} → 1, as k → ∞,
very fast. In fact, it can be shown that g(1) = 0.6827, g(2) = 0.9544, g(3) = 0.9973 and
g(4) = 0.9999. For practical purposes g(k) = 1 for k ≥ 4.

Some Important Facts about the Normal Distribution

Fact 1: If X ∼ N(µ, σ²) and Y = aX + b, where a and b are two constants with a ≠ 0, then
Y ∼ N(aµ + b, a²σ²).
For example, if X ∼ N (2, 9) and Y = 5X + 1, then E(Y ) = (5)(2) + 1 = 11, Var(Y ) =
(52 )(9) = 225 and Y ∼ N (11, 225).

Proof We will consider the case a > 0. The proof for the a < 0 case is left as an exercise.
The distribution function of Y, denoted here by G, is given by

G(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = F((y − b)/a),
where F is the distribution function of X. The density function g(y) of Y can now be found
by differentiating G(y). That is,
g(y) = G′(y) = (d/dy) F((y − b)/a) = (1/a) f((y − b)/a)

     = (1/(aσ√(2π))) exp{−[y − (aµ + b)]²/(2a²σ²)},

which is the N(aµ + b, a²σ²) density. 2

Standardized Normal

An important particular case emerges when a = (1/σ) and b = −(µ/σ). In this case the
transformed variable is denoted by Z and called “standard normal”. Since
Z = (1/σ)X − (µ/σ) = (X − µ)/σ,

by Fact 1, the parameters of the new normal variable, Z, can be obtained from those of
the given normal variable, X (µ and σ²), as follows:
µ −→ aµ + b = (1/σ)µ − (µ/σ) = 0

and
σ² −→ a²σ² = (1/σ)²σ² = 1.
That is, any given normal random variable X ∼ N (µ, σ 2 ) can be transformed into a standard
normal Z ∼ N (0, 1) by the equation
normal Z ∼ N(0, 1) by the equation

Z = (X − µ)/σ.    (5.1)

Symmetry of the Normal Distribution


Figure 4.2: Symmetry of the normal distribution: P(Z < −1) = 1 − P(Z < 1)


Fact 2: The standard normal density is denoted by the Greek letter φ (pronounced Phi) and
the standard normal distribution function is denoted by the corresponding upper case Greek
letter Φ. In symbols,

φ(z) = (1/√(2π)) exp{−z²/2}

and

Φ(z) = ∫_{−∞}^{z} (1/√(2π)) exp{−t²/2} dt.
Since φ(z) is symmetric about zero [φ(−z) = φ(z) for all z] we have the important identity
Φ(−z) = 1 − Φ(z)
for all z. See Figure 4.2.
For example,
Φ(−1) = 1 − Φ(1) = 1 − 0.8413447 = 0.1586553,
and
P (−1.5 < Z < 1.2) = Φ(1.2) − Φ(−1.5) = Φ(1.2) − [1 − Φ(1.5)]
= Φ(1.2) + Φ(1.5) − 1 = 0.8181231.

Fact 3: The normal density cannot be integrated in closed form. That is, there are no simple
formulas for calculating expressions like
F(x) = ∫_{−∞}^{x} f(t) dt

or

P(a < X < b) = ∫_{a}^{b} f(t) dt = F(b) − F(a).

These expressions can only be calculated by numerical methods (numerical integration or


quadrature). Fortunately, however, we can use Fact 1 to reduce calculations involving any
normal random variable to the standard normal case (see Table 1 in the Appendix). The
basic formula for these calculations is

F(x) = P(X ≤ x) = P[(X − µ)/σ ≤ (x − µ)/σ] = P[Z ≤ (x − µ)/σ] = Φ[(x − µ)/σ].

The application of this “reduction method” is illustrated in the following example.

Example 5.1 Let X ∼ N(2, 9). Calculate (a) P(X < 5); (b) P(−3 < X < 5); (c) P(X > 5);
(d) P(|X − 2| < 3); (e) the value of c such that P(X < c) = 0.95; (f) the value of c such
that P(|X − 2| > c) = 0.10.

Solution

(a) P (X < 5) = F (5) = Φ[(5 − 2)/3)] = Φ(1) = 0.8413447, from Table 1 in the Appendix.

(b) P (−3 < X < 5) = F (5) − F (−3) = Φ[(5 − 2)/3] − Φ[(−3 − 2)/3] = Φ(1) − Φ(−5/3) =
0.8413447 − 0.04779035 = 0.7935544, from Table 1 in the Appendix.

(c) P (X > 5) = 1 − P (X ≤ 5) = 1 − F (5) = 1 − Φ(1) = 0.1586553.

(d) To solve this question we must first remember that a number has absolute value smaller
than 3 if and only if this number is between −3 and 3. In other words, to say that |X −2| < 3
is equivalent to saying that −3 < X − 2 < 3. Therefore,

P [|X − 2| < 3] = P [−3 < X − 2 < 3] = P [−1 < (X − 2)/3 < 1] = P [−1 < Z < 1]
= Φ(1) − Φ(−1) = Φ(1) − [1 − Φ(1)] = 2Φ(1) − 1
= 0.6826895.

One useful result to point out here is

P[|Z| ≤ z] = 2Φ(z) − 1.



(e) To solve this question we first notice that


P (X < c) = P [Z < (c − 2)/3] = Φ[(c − 2)/3].
Second, we see from the Normal Table that Φ(d) = 0.95 if d ≈ 1.64. Therefore
(c − 2)/3 = 1.64 ⇒ c = (3)(1.64) + 2 = 6.92.

(f) The value of c such that P (|X − 2| > c) = 0.10 is calculated as follows,
P(|X − 2| > c) = P[|Z| > c/3] = 1 − P[|Z| ≤ c/3] = 1 − {2Φ(c/3) − 1} = 2[1 − Φ(c/3)] = 0.10.
Therefore,
Φ(c/3) = 0.95 ⇒ c/3 = 1.64 ⇒ c = (3)(1.64) = 4.92.
2
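The table lookups in Example 5.1 can be reproduced without a printed table, since Φ(z) = (1 + erf(z/√2))/2 and Python's standard library provides erf. The sketch below (an illustrative check of parts (a), (b) and (d); the helper name phi is our own) carries out the same "reduction method".

```python
import math

# Reproducing the normal-table lookups of Example 5.1 with math.erf, using
# Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def phi(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 2.0, 3.0                      # X ~ N(2, 9)
p_a = phi((5 - mu) / sigma)               # (a) P(X < 5) = Phi(1)
p_b = phi(1.0) - phi(-5.0 / 3.0)          # (b) P(-3 < X < 5)
p_d = 2.0 * phi(1.0) - 1.0                # (d) P(|X - 2| < 3)
print(round(p_a, 4), round(p_b, 4), round(p_d, 4))
```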

Fact 4: If X ∼ N (µ, σ 2 ), then


E(X) = µ and Var(X) = σ 2 .

Proof It suffices to prove that E(Z) = 0 and Var(Z) = 1, because from (5.1),

X = σZ + µ,

and then we would have E(X) = E(σZ + µ) = σE(Z) + µ = µ and Var(σZ + µ) = σ²Var(Z) =
σ². By symmetry, we must have E(Z) = 0. In fact, since φ′(z) = (−z/√(2π)) exp{−z²/2} =
−zφ(z), it follows that

∫_{−∞}^{∞} z φ(z) dz = −∫_{−∞}^{∞} φ′(z) dz = −[φ(z)]_{−∞}^{∞} = 0.

Finally, using integration by parts [u = z and dv = φ′(z) dz] we obtain

∫_{−∞}^{∞} z² φ(z) dz = −∫_{−∞}^{∞} z φ′(z) dz = −{ [z φ(z)]_{−∞}^{∞} − ∫_{−∞}^{∞} φ(z) dz } = 1. 2

Fact 5: Suppose that X1 , X2 , . . . , Xn are independent normal random variables with mean
E(Xi ) = µi and variance Var(Xi ) = σi2 . Let Y be a linear combination of the Xi , that is,
Y = a1 X1 + a2 X2 + . . . + an Xn ,
where ai ( i = 1, · · · , n ) are some given constant coefficients. Then,
Y ∼ N (a1 µ1 + a2 µ2 + . . . + an µn , a21 σ12 + a22 σ22 + . . . + a2n σn2 )

Proof The proof that Y is normal is beyond the scope of this course. On the other hand,
to show that
E(Y ) = a1 µ1 + a2 µ2 + . . . + an µn ,
and
Var(Y ) = a21 σ12 + a22 σ22 + . . . + a2n σn2 ,
is very easy, using Properties 2 and 5 for the mean and the variance of sums of random
variables. 2

Example 5.2 Suppose that X1 and X2 are independent, X1 ∼ N (2, 4), X2 ∼ N (5, 3) and
Y = 0.5X1 + 2.5X2 .
Find the probability that Y is larger than 15.

Solution By Fact 5, Y is a normal random variable, with mean

µ = (0.5 × 2) + (2.5 × 5) = 13.5,


and variance
σ 2 = (0.52 × 4) + (2.52 × 3) = 19.75.
Therefore,
P(Y > 15) = 1 − Φ((15 − 13.5)/√19.75) = 1 − Φ(0.34) = 1 − 0.6331 = 0.3669.
2
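Example 5.2 can also be done numerically. The sketch below (illustrative; the small difference from 0.3669 arises because the text rounds z to 0.34 before using the table, while math.erf gives roughly 0.368) recomputes the mean, variance and tail probability.

```python
import math

# Example 5.2 redone numerically: Y = 0.5*X1 + 2.5*X2 with independent
# X1 ~ N(2, 4) and X2 ~ N(5, 3), so Y ~ N(13.5, 19.75).
mu = 0.5 * 2 + 2.5 * 5
var = 0.5 ** 2 * 4 + 2.5 ** 2 * 3
z = (15 - mu) / math.sqrt(var)
p = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(Y > 15)
print(round(mu, 2), round(var, 2), round(p, 4))
```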

An important particular case arises when X1 , . . . , Xn is a normal sample, that is, when
the variables X1 , . . . , Xn are independent, identically distributed, normal random variables,
with mean µ and variance σ². One can think of the Xi’s as a sequence of n independent
measurements of the normal random variable, X ∼ N(µ, σ²). µ is usually called the population
mean and σ² is usually called the population variance.
If the coefficients, ai, are all equal to 1/n, then Y is equal to the sample average:

Y = Σ_{i=1}^{n} (1/n) Xi = (1/n) Σ_{i=1}^{n} Xi = X.

By Fact 5, then, the normal sample average is also a normal random variable, with mean

Σ_{i=1}^{n} ai µ = Σ_{i=1}^{n} (1/n) µ = nµ/n = µ,

and variance

Σ_{i=1}^{n} ai² σ² = Σ_{i=1}^{n} (1/n²) σ² = nσ²/n² = σ²/n.

Example 5.3 Suppose that X1 , X2 , . . . , X16 are independent N (µ, 4) and X is their average.
(a) Calculate P (|X1 − µ| < 1) and P (|X − µ| < 1). (b) Calculate P (|X − µ| < 1) when the
sample size is 25 instead of 16. (c) Comment on the result of your calculations.

Solution

(a) Since X1 ∼ N (µ, 4), X1 − µ ∼ N (0, 4) and so,

P (|X1 − µ| < 1) = 2Φ(1/2) − 1 = 2Φ(0.5) − 1 = 0.383.

Moreover, since X ∼ N (µ, 4/16), X − µ ∼ N (0, 1/4) and


P(|X − µ| < 1) = 2Φ(1/√(1/4)) − 1 = 2Φ(2) − 1 = 0.954.

(b) Since X ∼ N (µ, 4/25), X − µ ∼ N (0, 4/25) and

P (|X − µ| < 1) = 2Φ(5/2) − 1 = 2Φ(2.5) − 1 = 0.9876.

(c) The probability that the sample mean, X, is close to the population mean, µ, (0.954
when n = 16, and 0.9876 when n = 25) is much larger than the probability that any single
observation, Xi , is close to µ (0.383). The probability that the sample mean is close to the
population mean depends on the sample size, n, and gets larger as n gets larger. 2
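The effect described in part (c) is easy to tabulate. In the sketch below (`prob_mean_within` is an illustrative helper name, not course notation), P(|X − µ| < 1) is evaluated for several sample sizes with σ = 2, as in Example 5.3:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_mean_within(eps, sigma, n):
    # X-bar ~ N(mu, sigma^2/n), so P(|X-bar - mu| < eps) = 2*Phi(eps*sqrt(n)/sigma) - 1
    return 2.0 * phi(eps * math.sqrt(n) / sigma) - 1.0

for n in (1, 16, 25, 100):
    print(n, round(prob_mean_within(1.0, 2.0, n), 4))
```

With n = 1 this reproduces 0.383 for a single observation, and with n = 16 and n = 25 it reproduces 0.954 and 0.9876.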

5.2 Checking Normality


A data set, x1, x2, . . . , xn, is a sample if the xi's are a sequence of independent observations
of a random variable, X. The sample is called normal if X ∼ N(µ, σ²). The statistical
analysis of many data sets is based on the assumption that the data set is a normal sample. The
validity of this assumption must be carefully examined because the conclusions of the analysis
may be seriously distorted in the absence of the assumed normality. The most common types
of departures from normality are asymmetry, heavy tails and the presence of outliers.
One simple method for checking the normality of the sample x1, x2, . . . , xn is the so called
normal Q–Q plot. A normal Q–Q plot is a plot of the theoretical standard normal quantiles
of order (i − 0.5)/n, denoted di, versus the corresponding empirical sample quantiles, q̂i = x(i). If
the sample is normal, then the points (di, q̂i) must be close to a straight line. Therefore,
departures from a straight–line pattern in the Q–Q plot indicate lack of normality.
Several normal Q–Q plots are displayed on Figure 4.3. The sample for case (a) is normal.
The samples for the other five cases depart from normality in different ways.
The Q–Q plot technique is based on the following rationale. The theoretical quantile of
order (i − 0.5)/n for the random variable X, denoted qi , is defined by the equation

P (X ≤ qi ) = F (qi ) = (i − 0.5)/n.

That is,

qi = F⁻¹[(i − 0.5)/n],

where F⁻¹ denotes the inverse of F. In the special case of the standard normal the theoretical
quantiles will be denoted by di. They are given by the formula

di = Φ⁻¹[(i − 0.5)/n],

where, as usual, Φ denotes the standard normal distribution function. In the case of a normal
random variable, X, with mean µ and variance σ 2 , we have

P (X ≤ qi ) = Φ[(qi − µ)/σ] = (i − 0.5)/n.

and therefore,

(qi − µ)/σ = Φ⁻¹[(i − 0.5)/n] = di =⇒ qi = µ + σdi.

Given the sample


x1 , x 2 , . . . , x n ,
the corresponding empirical quantiles, q̂i are simply given by the sorted sample,
x(1) , x(2) , . . . , x(n) , that is,

q̂1 = x(1) , q̂2 = x(2) , . . . , q̂n = x(n) .

If this sample comes from a N (µ, σ 2 ) distribution then one would expect that

q̂i ≈ qi = µ + σdi ,

and therefore the plot of x(i) versus di will be close to a straight line, with slope σ and
intercept µ.
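The construction just described can be sketched in a few lines of Python using the standard library's `statistics.NormalDist` (available in Python 3.8+; `qq_points` is an illustrative name). For an exactly normal "sample" the computed points fall on the line with intercept µ and slope σ:

```python
from statistics import NormalDist

def qq_points(sample):
    """Return (d_i, q_hat_i) pairs: theoretical standard normal quantiles
    of order (i - 0.5)/n versus the sorted sample values."""
    n = len(sample)
    xs = sorted(sample)
    d = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(d, xs))

# A synthetic sample that is exactly normal: q_hat_i = mu + sigma * d_i,
# so every point lies on the line with intercept mu and slope sigma.
mu, sigma = 10.0, 2.0
fake = [mu + sigma * NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
pts = qq_points(fake)
slopes = [(q - mu) / d for d, q in pts if abs(d) > 1e-9]
print(min(slopes), max(slopes))  # all ~2.0, i.e. the slope sigma
```

For real data the points are plotted and inspected for straight-line behaviour, as in Figure 4.3.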

[Figure 4.3, a 2×3 grid of normal Q–Q plots (sorted sample values versus quantiles of the standard normal), is not reproduced here. Its six panels are: (a) Normal Sample; (b) Mixture of two Normal Samples; (c) 3 Outliers in a Normal Sample; (d) 5 Inliers in a Normal Sample; (e) Distribution with Heavy Tails; (f) Distribution with Thin Tails.]

Figure 4.3: Q-Q plots for checking normality



5.3 Exercises
5.3.1 Exercise Set A
Problem 5.1 A machine operation produces steel shafts having diameters that are normally
distributed with a mean of 1.005 inches and a standard deviation of 0.01 inch. Specifications
call for diameters to fall within the interval 1.00±0.02 inches. What percentage of the output
of this operation will fail to meet specifications? What should be the mean diameter of the
shafts produced in order to minimize the fraction not meeting specifications?

Problem 5.2 Extruded plastic rods are automatically cut into nominal lengths of 6 inches.
Actual lengths are normally distributed about a mean of 6 inches and their standard deviation
is 0.06 inch.
(a) What proportion of the rods exceeds the tolerance limits of 5.9 inches to 6.1 inches?
(b) To what value does the standard deviation need to be reduced if 99% of the rods must
be within tolerance?

Problem 5.3 Suppose X1 and X2 are independent and identically distributed N (0, 4), and
define Y = max(X1 , X2 ). Find the density and the distribution functions of Y .

Problem 5.4 Assume that the height of UBC students is a normal random variable with
mean 5.65 feet and standard deviation 0.3 feet.
(a) Calculate the probability that a randomly selected student has height between 5.45 and
5.85 feet.
(b) What is the proportion of students above 6 feet?

Problem 5.5 The raw scores in a national aptitude test are normally distributed with mean
506 and standard deviation 81.
(a) What proportion of the candidates scored below 574?
(b) Find the 30th percentile of the scores.

Problem 5.6 Scores on a certain nationwide college entrance examination follow a normal
distribution with a mean of 500 and a standard deviation of 100.
(a) If a school admits only students who score over 670, what proportion of the student pool
will be eligible for admission?
(b) What admission requirement would you set if only the top 15% are to be eligible?

Problem 5.7 A machine is designed to cut boards at a desired length of 8 feet. However,
the actual length of the boards is a normal random variable with standard deviation 0.2 feet.
The mean can be set by the machine operator. At what mean length should the machine be
set so that only 5 per cent of the boards are undercut (that is, under 8 feet)?

Problem 5.8 The temperature reading X from a thermocouple placed in a constant-temperature
medium is normally distributed with mean µ, the actual temperature of the
medium, and standard deviation σ.
(a) What would the value of σ have to be to ensure that 95% of all readings are within 0.1◦

of µ?
(b) Consider the difference between two observations X1 and X2 (here we could assume that
X1 and X2 are i.i.d.), what is the probability that the absolute value of this difference is at
most 0.075◦ ?

Problem 5.9 Suppose the random variable X follows a normal distribution with mean µ =
50 and standard deviation σ = 5.
(a) Calculate the probability P (|X| > 60).
(b) Calculate E(X²) and the interquartile range of X.

5.3.2 Exercise Set B


Problem 5.10 Let Z be a standard normal random variable. Find:

(a) P (Z < 1.3)


(b) P (0.8 < Z < 1.3)
(c) P (−0.8 < Z < 1.3)
(d) P (−1.3 < Z < −0.8)
(e) c such that P (Z < c) = 0.9032
(f) c such that P (Z < c) = 0.0968
(g) c such that P (−c < Z < c) = 0.90
(h) c such that P (|Z| < c) = 0.95
(i) c such that P (|Z| > c) = 0.80

Problem 5.11 Let X be a normal random variable with mean 10 and variance 25. Find:

(a) P (X < 13)


(b) P (11 < X < 13)
(c) P (8 < X < 13)
(d) P (6 < X < 8)
(e) c such that P (X < c) = 0.9032
(f) c such that P (X < c) = 0.0968
(g) c such that P (−c < X − 10 < c) = 0.90
(h) c such that P (−c < X < c) = 0.95
(i) c such that P (|X − 10| > c) = 0.80
(j) c such that P (|X| > c) = 0.80

Problem 5.12 A scholarship is offered to students who graduate in the top 5% of their
class. Rank in the class is based on GPA (4.00 being perfect). A professor tells you the
marks are distributed normally with mean 2.64 and variance 0.5831. What GPA must you
get to qualify for the scholarship?

Problem 5.13 The test scores of 40 students are normally distributed with a mean of 65
and a standard deviation of 10.
(a) Calculate the probability that a randomly selected student scored between 50 and 80;
(b) If two students are randomly selected, calculate the probability that the difference between
their scores is less than 10.

Problem 5.14 The length of trout in a lake is normally distributed with mean µ = 0.93
feet and standard deviation σ = 0.5 feet.
(a) What is the probability that a randomly chosen trout in the lake has a length of at least
0.5 feet;
(b) Suppose now that σ is unknown. What is the value of σ if we know that 85% of the
trout in the lake are less than 1.5 feet long? Use the same mean, 0.93.

Problem 5.15 The life of a certain type of electron tube is normally distributed with mean
95 hours and standard deviation 6 hours. Four tubes are used in an electronic system. Assume
that these tubes alone determine the operating life of the system and that, if any one fails,
the system is inoperative.
(a) What is the probability of a tube living at least 100 hours?
(b) What is the probability that the system will operate for more than 90 hours?

Problem 5.16 A product consists of an assembly of three components. The overall weight
of the product, Z, is equal to the sum of the weights X1 , X2 and X3 of its components.
Because of variability in production, they are independent random variables, each normally
distributed as N (2, 0.02), N (1, 0.010) and N (3, 0.03), respectively. What is the probability
that Z will meet the overall specification 6.00 ± 0.30 inches?

Problem 5.17 Due to variability in raw materials and production conditions, the weight
(in hundreds of pounds) of a concrete beam is a normal random variable with mean 31 and
standard deviation 0.50.

(a) Calculate the probability that a randomly selected beam weighs between 3000 and 3200
pounds.

(b) Calculate the probability that 25 randomly selected beams will together weigh more than
79,500 pounds.

Problem 5.18 A machine fills 250-pound bags of dry concrete mix. The actual weight of
the mix that is put in the bag is a normal random variable with standard deviation σ = 0.40
pound. The mean can be set by the machine operator. At what mean weight should the
machine be set so that only 10 per cent of the bags are underweight? What about the larger
500-pound bags?

Problem 5.19 Check if the following samples are normal. Describe the type of departure
from normality when appropriate.

(a) 2.52 3.06 2.41 3.98 2.63 4.11 4.66 5.83 4.80 6.17 4.44 5.38 5.02 1.09 3.31 2.72 1.75 3.81
4.45 2.93

(b) 2.15 -3.46 1.12 0.25 -1.42 0.06 -1.16 -2.24 -1.50 0.37 0.66 -0.76 6.24 0.36 -0.40 0.52 -0.97
0.36 1.74 -0.65

(c) 1.79 -0.65 1.16 1.23 2.80 0.92 -2.62 -5.48 0.75 -2.64 -6.41 0.92 1.14 0.18 0.06 -1.49 -3.99
-10.36 7.12 -1.86

(d) -0.53 0.71 1.40 0.28 -0.65 1.02 -0.71 0.70 1.55 -0.52 -0.73 -1.04 -2.39 0.39 5.71 6.39 4.28
6.70 6.05 5.62

(e) -1.61 -1.29 0.59 -0.33 0.14 1.16 2.02 -0.52 0.69 -0.30 -0.56 0.43 -1.01 0.83 -0.95 0.24 0.01
0.10 0.12 0.07
Chapter 6

Some Probability Models

6.1 Bernoulli Experiments


Some random experiments can be viewed as a sequence of identical and independent trials,
on each of which one of two possible outcomes occurs. Some examples of random experiments
of this kind are

- Recording the number of times the maximum annual wind speed exceeds a certain level v0
(during a fixed number of years).
- Counting the number of years until v0 is exceeded for the first time.
- Testing (pass–no–pass) a number of randomly chosen items.
- Polling some randomly (and independently) chosen individuals regarding some yes–no ques-
tion, for instance, “did you vote in the last provincial election?”

Each trial is called a Bernoulli trial and a set of independent Bernoulli trials is called a
Bernoulli process or Bernoulli experiment. The defining features of a Bernoulli experiment
are:

• The experiment consists of a sequence of trials

• The trials are independent

• Each trial has only two possible outcomes.

These outcomes refer to the occurrence or not of a certain event, A. They are arbitrarily
called success (when A occurs) and failure (when Ac occurs) and denoted by
S (for success) and F (for failure)

• The probability of S is the same for all trials.

This constant probability is denoted by p, that is


P (S) = p


and so
P (F ) = 1 − P (S) = 1 − p = q.
The number of trials in a Bernoulli experiment can either be fixed or random. For example,
if we are considering the number of maximum annual wind speed exceedances of v0 in the
next fifteen years, the number of trials is fixed and equal to 15. On the other hand, if we are
considering the number of years until v0 is first exceeded, the number of trials is random.

6.2 Bernoulli and Binomial Random Variables


Given a Bernoulli experiment of size n (n independent Bernoulli trials), there are n Bernoulli
random variables Y1 , Y2 , . . . , Yn associated with it. The random variable Yi (i = 1, . . . , n)
depends only on the outcome of the ith trial and is defined as follows

Yi = 1, if the ith trial ends in S


= 0, if the ith trial ends in F .

That is, each Yi records whether the ith trial ends in S, and their sum counts the number of S’s in the outcome of the random experiment.
The variables Yi are very simple. By definition, they are independent and their common
density function is

f(y) = p^y (1 − p)^(1−y),  y = 0, 1.
The mean and the variance of Yi (they are, of course, the same for all i = 1, . . . , n,) are given
by,
E(Yi ) = (0)f (0) + (1)f (1) = p,
and

Var(Yi) = (0 − p)²f(0) + (1 − p)²f(1) = (−p)²q + q²p = pq, where q = 1 − p.

The student can check that the variance is maximized when p = q = 0.5. This result is hardly
surprising as the uncertainty is clearly maximized when S and F are equally likely. On the
other hand, the uncertainty is clearly smaller for smaller or larger values of p. For example,
if p = 0.01 we can feel very confident that most of the trials will result in failures. Similarly,
if p = 0.99 we can confidently predict that most of the trials will result in successes.

Binomial Random Variable (B(n, p)).

Given a Bernoulli experiment of fixed size n, the corresponding Binomial random variable
X is defined as the total number of S’s in the sequence of F’s and S’s that constitutes the
outcome of the experiment. That is,
X = Σ_{i=1}^{n} Yi.

Using properties (2) and (4) of the mean and variance of random variables,

E(X) = E(Σ_{i=1}^{n} Yi) = Σ_{i=1}^{n} E(Yi) = Σ_{i=1}^{n} p = np,

and

Var(X) = Var(Σ_{i=1}^{n} Yi) = Σ_{i=1}^{n} Var(Yi) = Σ_{i=1}^{n} pq = npq, where q = 1 − p.

The probability density function of X is

f(x) = (n choose x) p^x q^(n−x), for all x = 0, 1, . . . , n,    (6.1)

where

(n choose x) = n!/(x!(n − x)!) = [n(n − 1) · · · (2)(1)] / {[x(x − 1) · · · (2)(1)][(n − x)(n − x − 1) · · · (2)(1)]}.
For example, if n = 5 and x = 3 we have

(5 choose 3) = 5!/(3!2!) = [(5)(4)(3)(2)(1)] / {[(3)(2)(1)][(2)(1)]} = 10.

To derive the density (6.1), first notice that X takes the value x only if x of the Yi are equal
to one and the remainder are equal to zero. The probability of this event is p^x q^(n−x). In
addition, the n variables Yi can be divided into two groups of x and n − x variables in (n choose x)
many different ways.
The distribution function of X doesn’t have a simple closed form and can be obtained
from Table A5 for a limited set of values of n and p.
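In place of a table, the binomial density and distribution function are easy to evaluate directly. A minimal sketch using `math.comb` (the helper names `binom_pmf` and `binom_cdf` are ours):

```python
import math

def binom_pmf(x, n, p):
    """Binomial density f(x) = C(n, x) p^x (1-p)^(n-x), x = 0, ..., n."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """Distribution function F(x) = P(X <= x), summed term by term."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

print(math.comb(5, 3))   # 10, matching the worked example above

n, p = 5, 0.5
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
print(mean)              # equals np = 2.5
```

The summed mean agrees with the formula E(X) = np derived above.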

Example 6.1 Suppose that the operational life of a machine, T (in hours), is such that
log(T) has a normal distribution with mean 15 and standard deviation 7. If a plant has 20 of these
machines working independently, (a) what is the probability that more than one machine
will breakdown before 1500 hours of operation? (b) how many more machines are needed if
the expected number of machines that will not break down before 1500 hours of operation
must be larger than 18?

Solution The number of machines breaking down before 1500 hours of operation, X, is a
binomial random variable with n = 20 and

p = P(T < 1500) = P(log(T) < log(1500)) = Φ[(log(1500) − 15)/7] = Φ(−1.1) = 1 − Φ(1.1) = 1 − 0.8643 = 0.14.

(a) First we notice that

P(X > 1) = 1 − P[X ≤ 1].

Since

P[X ≤ 1] = P(X = 0) + P(X = 1) = (20 choose 0)(0.86)^20 + (20 choose 1)(0.14)(0.86)^19 = 0.04897 + 0.15945 = 0.21,

P (X > 1) = 1 − 0.21 = 0.79


(b) The expected number of machines in operation (out of n machines) after 1500 hours
of operation is n(1 − 0.14). For this expected value to be larger than 18, n must be larger
than 18/0.86 = 20.93. Therefore, the company needs to acquire one additional machine.
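Example 6.1 can be re-checked numerically. In the sketch below (`phi` is our own helper for Φ; the small discrepancy with the text's 0.79 comes from rounding p to 0.14 before the binomial calculation):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# p = P(T < 1500) = Phi((log 1500 - 15)/7); the text rounds this to 0.14.
p = phi((math.log(1500) - 15) / 7)
n, q = 20, 1 - p

# P(X > 1) = 1 - P(X = 0) - P(X = 1) for X ~ B(20, p)
p_more_than_one = 1 - (q**n + n * p * q**(n - 1))
print(round(p, 3), round(p_more_than_one, 2))  # ~0.136 and ~0.78 (0.79 with p = 0.14)
```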

6.3 Geometric Distribution and Return Period


The expected value, τ , of the number of trials before the first occurrence of a certain event,
A, is called the return period of that event. For example, the return period of the event
“maximum annual wind speed exceeding v0 ” is equal to the expected number of years before
v0 is exceeded for the first time.
The number of trials itself is a discrete random variable, X, with Geometric density,

f(x) = p(1 − p)^(x−1),  x = 1, 2, . . .    (6.2)

where p is the probability of A and q = (1 − p) is the probability of the complementary event,
Ac. The distribution function of X has a simple closed form (see Problem 6.6)

F(x) = 1 − (1 − p)^x,  x = 1, 2, . . .
The derivation of (6.2) is fairly straightforward: first of all, it is clear that the range of X is
equal to {1, 2, . . .}. Furthermore, we can have X = x only if the event Ac occurs during the
first x − 1 trials and A occurs in the xth trial. In other words, we must have a sequence of
x − 1 failures followed by a success. Because of the independence of the trials in a Bernoulli
experiment, it is clear that the probability of such a sequence is equal to p(1 − p)^(x−1).
To check that f(x) = pq^(x−1) is actually a probability density function we must verify that

Σ_{x=1}^{∞} f(x) = 1.

In fact, using the well known formula for the sum of a geometric series with rate 0 < q < 1,

1 + q + q² + · · · = 1/(1 − q),

we obtain

Σ_{x=1}^{∞} f(x) = Σ_{x=1}^{∞} pq^(x−1) = p[1 + q + q² + · · ·] = p/(1 − q) = 1.

Finally, the return period, τ, of the event A is given by

τ = E(X) = p Σ_{x=1}^{∞} x(1 − p)^(x−1) = p Σ_{x=1}^{∞} {−(d/dp)[(1 − p)^x]}

= −p (d/dp)[Σ_{x=1}^{∞} (1 − p)^x] = −p (d/dp){(1 − p)[1 + (1 − p) + (1 − p)² + · · ·]}

= −p (d/dp)[(1 − p)/p] = p (1/p²) = 1/p.

The return period of A is then inversely proportional to p = P (A). If p = P (A) is small then
we must wait, on average, a large number of periods τ until the first occurrence of A. On
the other hand, if p is large then we must wait, on average, a small number τ of periods for
the first occurrence of A.
The student will be asked to show (see Problem 6.6) that the variance of X is given by

Var(X) = (1 − p)/p² = τ(τ − 1).

One may well ask the question: why is τ called “return period”? The reason for this becomes
clear after we notice that, because of the assumed independence, the expected number of trials
before the first occurrence of A is the same as the expected number of trials between any two
consecutive occurrences of A.
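The identity τ = 1/p can be checked by simulation. The sketch below (illustrative names, seeded for reproducibility) draws geometric waiting times for an event with p = 0.04 and compares the sample mean with the 25-period return period:

```python
import random

random.seed(1)

def geometric_trial(p):
    """Number of Bernoulli trials until the first success (support 1, 2, ...)."""
    x = 1
    while random.random() >= p:
        x += 1
    return x

p = 0.04                          # e.g. a "25-year" event
n = 200_000
draws = [geometric_trial(p) for _ in range(n)]
avg = sum(draws) / n
print(round(avg, 1))              # close to the return period 1/p = 25
```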

Example 6.2 Suppose that a structure has been designed for a “25–year rain” (that is, a
rain that occurs on average every 25 years).
(a) What is the probability that the design annual rainfall will be exceeded for the first time
on the sixth year after completion of the structure?
(b) If the annual rainfall Y (in inches) is normal with mean 55 and variance 16, what is the
corresponding design rainfall?

Solution
(a) To say that a certain structure has been designed for a “25–year rain” means that it has
been designed for an annual rainfall with return period of 25 years.
The return period, τ , is equal to 25, and therefore the probability of exceeding the design
annual rainfall is
p = 1/τ = 1/25 = 0.04.
If X represents the number of years until the first time the design annual rainfall is exceeded,
then

P(X = 6) = (0.04)(0.96)^(6−1) = (0.04)(0.96)^5 = 0.033

is the required probability.
(b) The design rainfall, v0 , must satisfy the equation

P (Y > v0 ) = 0.04

or equivalently,

Φ[(v0 − 55)/4] = 0.96.

From the Standard Normal Table we find that Φ(1.75) = 0.96. Therefore,

(v0 − 55)/4 = 1.75 ⇒ v0 = (4)(1.75) + 55 = 62.
2

6.4 Poisson process and associated random variables


Many physical problems of interest to engineers and other applied scientists involve the
possible occurrences of an event A at some points in time and/or space. For example:

– earthquakes can occur at any time over a seismically active region

– traffic accidents can occur at any time along a busy highway

– fatigue cracks can occur at any point along a continuous weld

– flaws can occur at any point over a wood panel

– phone calls can arrive at any time at a telephone switchboard

– crashes can occur at any time on a computer network

An important feature of these processes is the expected number of occurrences of the


event A per unit of time (or space). This average number of occurrences is represented by λ
and called the rate of the process. We will see that this parameter determines the main
features of the entire process.
To fix ideas, suppose that we are studying the sequence of crashes of a computer network
and that we are using a week as the unit of time. In this case λ is the average number of
crashes per week.
There are at least two main features of the sequence of occurrences (e.g. crashes) which
are of interest: the number of occurrences in an interval of length ∆ and the time between
consecutive occurrences of A. These features are represented by the random variables X and
T below.

• X is the number of occurrences in an interval of length ∆,

and

• T is the time between two consecutive occurrences



The process is called a Poisson Process if A is a “rare” event, that is, if the process has the
following properties:

1) The number of occurrences of A on non–overlapping intervals are independent.

2) The probability of exactly one occurrence of A on any interval of length ∆ is approximately
equal to λ∆ when ∆ is small.

3) The probability of more than one occurrence of A on any interval of length ∆ is
approximately equal to λ∆² when ∆ is small (that is, A is a “rare” event).

The discrete random variable X described above (number of occurrences on an interval
of fixed length ∆) has the so called Poisson density function

f(x) = exp{−λ∆}(λ∆)^x / x!,  x = 0, 1, 2, . . . ,

and the continuous random variable T (time between consecutive occurrences of A, or inter-arrival
time) has the so called exponential density

f(t) = λ exp{−λt},  t > 0.

The derivation of these densities from assumptions 1) 2) and 3) is not very difficult. The
interested student can read the heuristic derivation given at the end of this chapter.
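Both densities are easy to evaluate directly. A minimal sketch (the function names are ours, and the rate λ = 1 per year is borrowed from Example 6.3 below):

```python
import math

def poisson_pmf(x, lam, delta=1.0):
    """P(X = x) for the number of occurrences in an interval of length delta."""
    m = lam * delta
    return math.exp(-m) * m**x / math.factorial(x)

def exp_cdf(t, lam):
    """P(T <= t) for the inter-arrival time, with density lam*exp(-lam*t)."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0

lam = 1.0  # one big earthquake per year, as in Example 6.3
p_ge_3_in_5yrs = 1 - sum(poisson_pmf(x, lam, 5.0) for x in range(3))
print(round(p_ge_3_in_5yrs, 3))          # ~0.875, matching part (a) below
print(round(1 - exp_cdf(1.25, lam), 3))  # ~0.287, matching part (c) below
```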

Example 6.3 In Southern California there is on average one earthquake per year with
Richter magnitude 6.1 or greater (big earthquakes).
(a) What is the probability of having three or more big earthquakes in the next five years?
(b) What is the most likely number of big earthquakes in the next 15 months?
(c) What is the probability of having a period of 15 months without a big earthquake?
(d) What is the probability of having to wait more than three and a half years until the
occurrence of the next four big earthquakes?

Solution We assume that the sequence of big earthquakes follows a Poisson process with
(average) rate λ = 1 per year.
(a) The number X of big earthquakes in the next five years is a Poisson random variable
with average rate 5 and so, using the Poisson Table we get

P (X ≥ 3) = 1 − P (X < 3) = 1 − F (2) = 1 − 0.125 = 0.875.



(b) In general, a Poisson density f(x) with parameter δ is increasing at x (x ≥ 1) if and only
if the ratio f(x)/f(x − 1) > 1. Since

f(x)/f(x − 1) = [exp{−δ}δ^x / x!] ÷ [exp{−δ}δ^(x−1) / (x − 1)!] = δ/x,

it follows that

f(x) > f(x − 1) when x < δ,
f(x) = f(x − 1) when x = δ,
f(x) < f(x − 1) when x > δ.

Therefore, the largest value of f(x) is achieved when x = [δ], where

[δ] = integer part of δ.

So, the most likely number of big earthquakes is [1.25] = 1 (notice that 15 months = 1.25
years).
(c) The waiting time T to the next big earthquake is an exponential random variable with
rate λ = 1 year, with distribution function

F (t) = 1 − exp {−t}.

Therefore,
P {T > 1.25} = 1 − F (1.25) = 1 − [1 − exp {−1.25}] = 0.287.

(d) Let Y represent the number of big earthquakes in the next three and a half years and let
W represent the waiting time (in years) until the occurrence of the next four big earthquakes.
We notice that Y is a Poisson random variable with rate 3.5 and that W is larger than
3.5 years if and only if Y is less than 4. So,

P [W > 3.5] = P [Y < 4] = F (3) = 0.5366.

Means and Variances The means of X and T are of practical interest, as they represent

the expected number of occurrences on a period of length ∆ and the expected waiting time
between consecutive occurrences, respectively. We will see that, not surprisingly,

E(X) = λ∆ and E(T) = 1/λ.

We will also see that

Var(X) = λ∆ and Var(T) = 1/λ².

First, let’s calculate E(X):

E(X) = Σ_{x=0}^{∞} x f(x) = Σ_{x=1}^{∞} x exp{−λ∆}(λ∆)^x / x!

= exp{−λ∆}(λ∆) Σ_{x=1}^{∞} (λ∆)^(x−1) / (x − 1)! = exp{−λ∆}(λ∆) exp{λ∆} = λ∆.

Analogously, it follows that

E[X(X − 1)] = Σ_{x=0}^{∞} x(x − 1) f(x) = Σ_{x=2}^{∞} x(x − 1) exp{−λ∆}(λ∆)^x / x!

= exp{−λ∆}(λ∆)² Σ_{x=2}^{∞} (λ∆)^(x−2) / (x − 2)! = exp{−λ∆}(λ∆)² exp{λ∆} = (λ∆)².

Therefore,

E(X²) = E[X(X − 1)] + E(X) = (λ∆)² + λ∆,

and

Var(X) = E(X²) − [E(X)]² = (λ∆)² + λ∆ − (λ∆)² = λ∆.

To calculate E(T), we use integration by parts with

u = t and dv = λ exp{−λt}dt,

to get

E(T) = ∫₀^∞ t f(t)dt = ∫₀^∞ tλ exp{−λt}dt = ∫₀^∞ exp{−λt}dt = 1/λ.

To calculate E(T²), we use integration by parts with

u = t² and dv = λ exp{−λt}dt,

to get

E(T²) = ∫₀^∞ t² f(t)dt = ∫₀^∞ t²λ exp{−λt}dt = 2∫₀^∞ t exp{−λt}dt

= (2/λ)[∫₀^∞ tλ exp{−λt}dt] = (2/λ)(1/λ) = 2/λ².

Finally,

Var(T) = E(T²) − [E(T)]² = (2/λ²) − (1/λ²) = 1/λ².
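The formulas E(T) = 1/λ and Var(T) = 1/λ² can be checked by Monte Carlo using the standard library's `random.expovariate`; the rate λ = 1/3 below is an arbitrary choice for illustration:

```python
import random

random.seed(2)
lam = 1.0 / 3.0                        # assumed rate, so E(T) = 3 and Var(T) = 9
n = 200_000
draws = [random.expovariate(lam) for _ in range(n)]

mean = sum(draws) / n
var = sum((t - mean) ** 2 for t in draws) / n
print(round(mean, 2), round(var, 1))   # near 1/lam = 3 and 1/lam^2 = 9
```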

Example 6.3 (continued):


(e) What is the expected number of big earthquakes in the next five years? Fifteen months?
What are the corresponding standard deviations?
(f) What is the expected waiting time (in years) between two consecutive big earthquakes?
(g) What is the expected waiting time (in years) until the 25th big earthquake? The standard
deviation?
(h) What is the approximate probability that the waiting time until the 25th big earthquake
will exceed 27 years? This question will be answered in the next chapter.

Solution
(e) Since X = “number of big earthquakes in the next five years” is Poisson(5), we have
that E(X) = 5 and SD(X) = √5 = 2.24. In the case of fifteen months (1.25 years) the mean
is 1.25 and the standard deviation is √1.25 = 1.12.
(f) Since T = “waiting time (in years) between two consecutive big earthquakes” is an ex-
ponential random variable with rate λ = 1, its expected value (E(T) = 1/λ) and standard
deviation (SD(T) = 1/λ) are both equal to one.
(g) Let
W = “waiting time (in years) until the 25th big earthquake”
and let

Ti = “waiting time (in years) between the (i − 1)th and the ith big earthquakes”, i = 1, 2, . . . , 25.

Notice that

W = Σ_{i=1}^{25} Ti,

where, because of the Poisson process assumptions,

T1, T2, . . . , T25 are iid Exp(1),

and Exp(λ) means “exponential distribution with parameter (rate) λ”. Therefore,

E(W) = E(Σ_{i=1}^{25} Ti) = Σ_{i=1}^{25} E(Ti) = Σ_{i=1}^{25} 1 = 25,

Var(W) = Var(Σ_{i=1}^{25} Ti) = Σ_{i=1}^{25} Var(Ti) = Σ_{i=1}^{25} 1 = 25,

and

SD(W) = 5.
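Part (g) can also be checked by simulation: W is a sum of 25 independent Exp(1) inter-arrival times, so its sample mean and variance should both be near 25 (illustrative sketch, seeded for reproducibility):

```python
import random

random.seed(3)

def waiting_time_to_kth(lam, k):
    """W = T1 + ... + Tk, with Ti iid exponential with rate lam."""
    return sum(random.expovariate(lam) for _ in range(k))

n = 50_000
ws = [waiting_time_to_kth(1.0, 25) for _ in range(n)]
mean = sum(ws) / n
var = sum((w - mean) ** 2 for w in ws) / n
print(round(mean, 1), round(var, 1))  # near E(W) = 25 and Var(W) = 25, so SD(W) ~ 5
```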

6.5 Poisson Approximation to the Binomial


Let X ∼ B(n, p) be a binomial random variable with parameters n and p. If n is large
(n ≥ 20) and p is small (np < 5) then we can use a Poisson random variable with rate
λ = np, Y ∼ P(np), to approximate the probabilistic behavior of X. In other words, we can
use the approximation

P[B(n, p) = x] ≈ P[P(np) = x] = exp{−np}(np)^x / x!, for all x = 0, . . . , n.

Example 6.4 On average, one per cent of the 50-kg dry concrete bags are underfilled below
49.5 kg. What is the probability of finding 4 or more of these underfilled bags in a lot of 200?

Solution: Since n = 200 and p = 0.01,

min{np, n(1 − p)} = min{2, 198} = 2 < 5.

Since n is large and np = 2 is small, we can use the Poisson approximation

P[B(200, 0.01) ≥ 4] ≈ P[P(2) ≥ 4] = 1 − P[P(2) < 4] = 1 − F(3) = 1 − 0.857 = 0.143, from the Poisson table.
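The quality of the approximation in this example can be checked by computing both probabilities directly (a sketch; the helper names are ours):

```python
import math

def binom_pmf(x, n, p):
    """Exact binomial density."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """Poisson density with rate lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

n, p = 200, 0.01
exact = 1 - sum(binom_pmf(x, n, p) for x in range(4))       # P[B(200, 0.01) >= 4]
approx = 1 - sum(poisson_pmf(x, n * p) for x in range(4))   # P[P(2) >= 4]
print(round(exact, 4), round(approx, 4))  # exact ~0.142, approximation ~0.143
```

The two tail probabilities agree to about three decimal places, which is typical when n is large and np is small.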



6.6 Heuristic Derivation of the Poisson and Exponential Distribu-


tions

Let m be some fixed integer number. If Yi is the number of occurrences of the event A in
the interval ((i − 1)/m, i/m], then the total number of occurrences, X, in the interval (0, 1] (we are
taking ∆ = 1 for simplicity) can be written as

X = Y1 + Y2 + . . . + Ym.

Because of the independence of the number of occurrences on non-overlapping intervals,


the variables Y1 , Y2 , . . . , Ym are independent. Moreover, because of the assumption that A is
a rare event, the probability that the variables Yi will take values other than zero and one is
nearly zero,
P (Yi > 1) ≈ 0, when m is large,
and so the variables Yi are approximately Bernoulli random variables when m is large.
By the above remarks, the random variable X is approximately Binomial, B(m, λ/m),
when m is large. Of course, the larger m, the better the approximation, and in the limit
(when m → ∞) the approximation becomes exact. Therefore, the probability that X will
take any fixed value x can be obtained from the limit, as m → ∞, of the binomial expression

P(X = x) = (m choose x)[λ/m]^x [1 − λ/m]^(m−x).

Since, as m → ∞, we have

(m choose x) x!/m^x = (m/m)((m − 1)/m)((m − 2)/m) · · · ((m − x + 1)/m) → 1,

[1 − λ/m]^(−x) → 1,

and

[1 − λ/m]^m → exp{−λ},

we obtain that, as m → ∞,

(m choose x)[λ/m]^x [1 − λ/m]^(m−x) = (m/m)((m − 1)/m) · · · ((m − x + 1)/m) [1 − λ/m]^(−x) [1 − λ/m]^m λ^x/x!

→ exp{−λ}λ^x/x!, the Poisson density function.



In particular, this justifies the P(np) approximation to the B(n, p), when n is large and p
is small. The requirement that n is large corresponds to m being large and the requirement
that p is small corresponds to λ/m being small.

To derive the Exponential density of T, we reason as follows. The waiting time T until
the first occurrence of A will be larger than t if and only if the number of occurrences X in
the period (0, t) is equal to zero. Since X ∼ P(λt),

P(T ≤ t) = 1 − P(T > t) = 1 − P(X = 0) = 1 − [exp{−λt}(λt)⁰/0!] = 1 − exp{−λt},

the exponential distribution function with parameter λ.



6.7 Exercises
6.7.1 Exercise Set A
Problem 6.1 A weighted coin is flipped 200 times. Assume that the probability of a head
is 0.3 and the probability of a tail is 0.7. Each flip is independent from the other flips. Let
X be the total number of heads in the 200 flips.
(a) What is the distribution of X?
(b) What is the expected value of X and variance of X?
(c) What is the probability that X equals 35?
(d) What is the approximate probability that X is less than 45?
Note: Come back to this question after you have learned about normal approximations in
the next chapter.

Problem 6.2 Suppose it is known that a treatment is successful in curing a muscular pain
in 50% of the cases. If it is tried on 15 patients, find the probabilities that:
(a) At most 6 will be cured.
(b) The number cured will be no fewer than 6 and no more than 10.
(c) Twelve or more will be cured.
(d) Calculate the mean and the standard deviation.

Problem 6.3 The office of a particular U.S. Senator has on average five incoming calls per
minute. Use the Poisson distribution to find the probabilities that there will be:
(a) exactly two incoming calls during any given minute;
(b) three or more incoming calls during any given minute;
(c) no incoming calls during any given minute.
(d) What is the expected number of calls during any given period of five minutes?

Problem 6.4 A die is colored blue on 5 of its sides and green on the remaining side. This die
is rolled 8 times. Assume each roll of the die is independent of the other rolls. Let X be
the number of times blue comes up in the 8 rolls of the die.
(a) What is the expected value of X and the variance of X?
(b) What is the probability that X equals 6?
(c) What is the probability that X is greater than 6?

Problem 6.5 A factory produced 10,000 light bulbs in February, of which 500 are
defective. Suppose 20 bulbs are randomly inspected. Let X denote the number of defectives
in the sample.
(a) Calculate P (X = 2).
(b) If the sample size, i.e., the number of the inspected bulbs, is large, how would you
calculate P (X ≥ 2) approximately? For n = 200, calculate this probability approximately.

6.7.2 Exercise Set B


Problem 6.6 Let X be a random variable with geometric density (6.2). Show that

(a) F(x) = 1 − P(X > x) = 1 − (1 − p)^x.

(b) E[X(X − 1)] = 2(1 − p)/p², and therefore E(X²) = (2 − p)/p².
(c) Var(X) = (1 − p)/p² = τ(τ − 1).

Problem 6.7 The Statistical Tutorial Center has been designed to handle a maximum of
25 students per day. Suppose that the number X of students visiting this center each day is
a normal random variable with mean 15 and variance 16.
(a) What is the return period τ for this center?
(b) What is the probability that the "design" number of visits will not be exceeded before
the 10th day?

Problem 6.8 A transmission tower has been designed for a 30–year wind.
(a) What is the probability that the design maximum annual wind velocity will be exceeded
for the first time on the 7th year after completion of the project?
(b) What is the probability that the design maximum annual wind velocity will be exceeded
during the first 7 years after completion of the project?
(c) If the maximum annual wind velocity (in miles per hour) is an exponential random variable
with mean 35, what is the design maximum annual wind velocity?
(d) What is the return period if the design maximum annual wind velocity is decreased by
15%?

Problem 6.9 (a) Let X1 and X2 be two independent Binomial random variables with n = 14
and p = 0.30. Calculate
(i) P (X1 = 4), P (X1 < 6) and P (2 < X1 < 6) (use the Binomial table)
(ii) E(X1 ), SD(X1 ), E(X1 + X2 ), SD(X1 + X2 ), E(X1 X2 ) and SD(X1 X2 )
(iii) P (X1 + X2 = 8), P (X1 + X2 < 12) and P (4 < X1 + X2 < 12).

Problem 6.10 The arrival of customers to a service station is well approximated by a Pois-
son Process with rate λ = 5 per hour.
(a) What is the expected number of customers per day? (the service station is open eight
hours per day)
(b) What is the most likely number of customers in any given hour?
(c) What is the probability that more than seven customers will arrive in the next hour?
(d) What is the probability that the waiting time between two consecutive arrivals will be
25 minutes or more?
(e) What is the expected time until the arrival of the next 25 customers? The standard
deviation?

Problem 6.11 A bag contains 4 red balls and 6 white balls. One ball is drawn at random
and replaced in the bag before the next draw is made. Let X be the number of red
balls in 100 draws from the bag.
(a) Give a general expression for P (X = k), k = 0, 1, ..., 100;
(b) Calculate the mean and variance of X;
(c) Calculate the probability P (X ≤ 38).

Problem 6.12 The number of killer whales arriving at the Pacific Rim Observatory Station
follows a Poisson Process with rate λ = 4 per hour.
(a) What is the expected number and variance during the next hour?
(b) What is the probability that the waiting time T between two consecutive arrivals will be
30 minutes or more?
(c) What is the expected time and variance until the next 20 killer whales arrive at the
Observatory Station?

Problem 6.13 Car accidents are random and can be said to follow a Poisson distribution.
At a certain intersection in East Vancouver there are, on average, 4 accidents a week. Answer
the following questions:
(a) What is the probability of there being no accidents at this intersection next week?
(b) The record for accidents in one month at a single intersection is 20. Find the probability
that this record will be broken, at this intersection, next month. (Assume 30 days in one
month)
(c) What is the expected waiting time for 20 accidents to occur?

Problem 6.14 A test consists of ten multiple-choice questions with five possible answers.
For each question, there is only one correct answer out of five possible answers. If a student
randomly chooses one answer for each question, calculate the probability that:
(a) at most three questions are answered correctly;
(b) five questions are answered correctly;
(c) all questions are answered correctly.
(d) Also calculate the mean and the standard deviation of the number of correct answers.

Problem 6.15 The number of meteorites hitting Mars follows a Poisson process with pa-
rameter λ = 6 per month.
(a) What is the probability that at least 2 meteorites hit Mars in any given month?
(b) Find the probability that exactly 10 meteorites hit Mars in the next 6 months.
(c) What is the expected number of meteorites hitting Mars in the next year?

Problem 6.16 A biased coin is flipped 10 times independently. The probability of tails is
0.4. Let X be the total number of heads in the 10 flips.
(a) Use a computer to find P (X = 4);
(b) Use the Binomial table to find P (1 < X < 5);
(c) What is the probability that one has to flip at least 5 times to get the first head?

Problem 6.17 Three identical fair coins are tossed simultaneously until all three show the
same face.
(a) What is the probability that they are tossed more than three times?
(b) Find the mean for the number of tosses.
Chapter 7

Normal Probability Approximations

7.1 Central Limit Theorem

By Fact 5 in Chapter 4, if X1, X2, . . . , Xn is a sample from a normal population with mean
µ and standard deviation σ, then

X̄ = (X1 + X2 + · · · + Xn)/n ∼ N(µ, σ²/n).
Often, however, one has to deal with non–normal samples. For example, the population
variable, X, may represent the lifetime of a part and Xi may represent the lifetime of the ith
randomly chosen part. Since the lifetime of a part cannot be negative, X cannot be normal.
A more reasonable assumption may be that X is exponentially distributed (X ∼ E(λ)) with
unknown parameter λ. Or more generally, one may simply assume that X is positive with
mean µ = 1/λ and variance 1/λ². The sample, X1, X2, . . . , Xn, would then be typically
obtained in order to estimate the expected life, µ, for the parts.
Let X1, X2, . . . , Xn be a sample from an arbitrary population with mean µ and variance
σ². A very important result, called the Central Limit Theorem (CLT), states that, when
n is large, X̄ is approximately normal, with mean µ and variance σ²/n, regardless of the
actual shape of the population distribution. This remarkable result will be extensively used
throughout this course.
The CLT is a limit (asymptotic) result, and the distribution of the average is not exactly
normal for any finite value of n. An obvious question at this point is: when should n be
considered large enough for practical applications? Unfortunately, the size of n for which
the normal approximation is good depends on the distribution of the variables Xi being
averaged. If this distribution is symmetric and has light tails, then the CLT approximation
may be quite good for small values of n (n equal to five or six). If the distribution of the
Xi's is very asymmetric, then it will take longer for the CLT to provide a
reasonable approximation. In many practical situations, the CLT normal approximation can
be used when n ≥ 20.
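The effect is easy to see by simulation. The sketch below (standard library only; the exponential population and the cut-off 2.3 are arbitrary choices of ours) draws many samples of size n = 30 from an E(0.5) population and compares an empirical tail probability of X̄ with the CLT normal approximation:

```python
import random
import statistics

random.seed(0)
n, reps, lam = 30, 2_000, 0.5      # exponential population: mean = SD = 1/lam = 2
mu = sigma = 1 / lam
means = [statistics.fmean(random.expovariate(lam) for _ in range(n))
         for _ in range(reps)]
# CLT: X-bar is approximately N(mu, sigma^2 / n)
approx = statistics.NormalDist(mu, sigma / n ** 0.5)
emp = sum(xbar > 2.3 for xbar in means) / reps
print(emp, 1 - approx.cdf(2.3))    # both close to 0.21
```

Even though the exponential distribution is quite skewed, at n = 30 the normal approximation for the average is already reasonable.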


Example 7.1 A system consists of 25 independent parts connected in such a way that the ith
part automatically turns–on when the (i − 1)th part burns out. The expected lifetime of each
part is 10 weeks and the standard deviation is equal to 4 weeks. (a) Calculate the expected
lifetime and standard deviation for the system. (b) Calculate the probability that the
system will last more than its expected life. (c) Calculate the probability that the system
will last more than 1.1 times its expected life. (d) What are the (approximate) median life
and interquartile range for the system?

Solution

(a) Let Xi denote the lifetime of the ith component and let

T = X1 + X2 + · · · + X25

denote the lifetime of the system. Then,

E(T) = E(X1) + · · · + E(X25) = 25 × 10 = 250 weeks,

and, using the assumption of independence,

Var(T) = Var(X1) + · · · + Var(X25) = 25 × 16 = 400.

Therefore,

SD(T) = √400 = 20 weeks.

Notice that the mean of T is 25 times larger than that of each Xi, while the standard deviation
of T is only √25 = 5 times larger.

(b) First observe that

T/25 = X̄ ≈ N(10, 16/25),

where the symbol ≈ here means "approximately distributed as", and so

T = 25X̄ ≈ N(250, 400) = N(E(T), Var(T)).

Therefore,

P(T > E(T)) = P(T > 250) ≈ 1 − Φ[(250 − 250)/20] = 1 − Φ(0) = 0.5.

(c) First of all notice that 1.1 × E(T) = 1.1 × 250 = 275. Now, by the discussion in (b),

P(T > 275) ≈ 1 − Φ[(275 − 250)/20] = 1 − Φ(1.25) = 0.1056.

(d) Let Z denote the standard normal random variable. Using that T ≈ N(250, 400), it
follows that

Q1(T) ≈ Q1(N(250, 400)) = 250 + (20 × Q1(Z)) = 250 − (20 × 0.675) = 236.5.

Analogously,

Q2(T) = Median(T) ≈ 250 + (20 × Q2(Z)) = 250,

and

Q3(T) ≈ Q3(N(250, 400)) = 250 + (20 × Q3(Z)) = 250 + (20 × 0.675) = 263.5.

Therefore,

IQR(T) ≈ Q3(T) − Q1(T) = 263.5 − 236.5 = 27.0. □
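The calculations in parts (b)–(d) can be reproduced with `statistics.NormalDist`, which plays the role of the standard normal table (a sketch; N(250, 400) is the CLT approximation derived above):

```python
from statistics import NormalDist

T = NormalDist(mu=250, sigma=20)    # CLT approximation for the system lifetime
p_b = 1 - T.cdf(250)                # part (b): P(T > 250)
p_c = 1 - T.cdf(275)                # part (c): P(T > 275)
q1, med, q3 = (T.inv_cdf(p) for p in (0.25, 0.50, 0.75))
print(round(p_b, 4), round(p_c, 4))                     # 0.5 0.1056
print(round(q1, 1), round(med, 1), round(q3 - q1, 1))   # 236.5 250.0 27.0
```

`inv_cdf` returns the exact normal quantiles, so the quartiles agree with the table-based values above.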

Table 7.1

Rainfall Intensity (in.) Midpoint Frequency


38–42 40 15
42–46 44 34
46–50 48 26
50–54 52 23
54–58 56 17
58–62 60 16
62–66 64 4
66–70 68 10
Total 145

Example 7.2 Consider Table 7.1 with data on the annual (cumulative) rainfall intensity (X)
on a certain watershed area. The average annual rainfall intensity can be calculated from

Table 7.1 as:


X̄ = [(40)(15) + (44)(34) + · · · + (68)(10)] / (15 + 34 + · · · + 10) = 7388/145 = 50.952.
Since the average has been calculated from a frequency table, using the midpoint of each class
to represent all the points in each class, there is an approximation error to be considered.
How likely is it that this approximation error is (a) larger than 0.05? (b) larger than 0.10? (c)
larger than 0.5?

Solution
To make the required probability calculations we will assume that the rainfall intensities
are uniformly distributed on each interval. This is a reasonable assumption given that we do
not have any additional information on the distribution of values on each class.
Let ri represent the actual annual rainfall intensity (i = 1, 2, . . . , 145) and let mi be the
midpoint of the corresponding class. For instance, if r5 = 50.35 (a value in the class 50–54),
then m5 = 52.0. Let
Ui = ri − mi , i = 1, 2, . . . , 145.
Given our "uniformity" assumption, the Ui's are uniform random variables on the interval
(−2, 2).
To proceed with our calculation, we will assume that the variables Ui (which represent
the approximation errors) are independent.
Let

r̄ = (r1 + r2 + · · · + r145)/145.

The approximation error, D, in the calculation of X̄ can now be written as

D = r̄ − X̄ = (r1 + · · · + r145)/145 − (m1 + · · · + m145)/145 = (U1 + · · · + U145)/145.

Since D is the average of 145 independent, identically distributed random variables with zero
mean and variance equal to

σ² = (1/4) ∫_{−2}^{2} t² dt = 4/3,

we can use the (CLT) normal approximation. That is, we can use a normal distribution
with zero mean and variance equal to (4/3)/145 to approximate the distribution of D. The
corresponding standard deviation is √(4/435) = 0.095893.
(a)

P(|D| > 0.05) = P(|D|/0.095893 > 0.05/0.095893) ≈ 2[1 − Φ(0.52)] = 0.6030.

(b)

P(|D| > 0.1) = P(|D|/0.095893 > 0.1/0.095893) ≈ 2[1 − Φ(1.04)] = 0.2984.

(c)

P(|D| > 0.5) = P(|D|/0.095893 > 0.5/0.095893) ≈ 2[1 − Φ(5.21)] ≈ 0.
□
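These tail probabilities are quick to check with `statistics.NormalDist` (a sketch; using exact z-values rather than the two-decimal table entries shifts the answers slightly in the third decimal):

```python
from statistics import NormalDist

n = 145
var_u = 4 / 3                  # variance of a Uniform(-2, 2) rounding error
sd_d = (var_u / n) ** 0.5      # SD of the average error D, about 0.095893
D = NormalDist(0.0, sd_d)      # CLT approximation to the distribution of D
for c in (0.05, 0.10, 0.5):
    print(c, round(2 * (1 - D.cdf(c)), 4))   # ≈ 0.602, 0.297, 0.0
```

So an approximation error beyond 0.05 is quite likely, beyond 0.10 less so, and beyond 0.5 essentially impossible.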

Example 6.3 (continued from Chapter 6):


Recall part (h) of Example 6.3 from the previous chapter, which was left unanswered:
(h) What is the approximate probability that the waiting time until the 25th big earthquake
will exceed 27 years?

Solution
(h) Since W is a sum of iid random variables, we can use the Central Limit Theorem to
approximate P(W > 27). Since E(W) = 25 and SD(W) = 5 we have

P(W > 27) = 1 − P(W ≤ 27) ≈ 1 − Φ[(27 − 25)/5] = 1 − Φ(0.40) = 1 − 0.6554 = 0.3446. □

7.2 Normal Approximation to the Binomial Distribution


Let X be a binomial random variable with parameters n and p. When n is large, so that

min{np, n(1 − p)} ≥ 5,

we can use the following approximation:

P(X = k) = P[k − 0.5 < X < k + 0.5] ≈ Φ[(k − np + 0.5)/√(npq)] − Φ[(k − np − 0.5)/√(npq)].   (7.1)

The justification for the approximation above is given by the Central Limit Theorem. In
fact, we have seen before that

X = Y1 + Y2 + · · · + Yn,

where Y1, Y2, . . . , Yn are independent Bernoulli random variables with parameter p. Therefore,

X/n = (Y1 + Y2 + · · · + Yn)/n = Ȳ,

which is approximately N(p, pq/n) when n is large. Therefore,

X = nȲ

is approximately distributed as N(p, pq/n) multiplied by n, that is, N(np, npq), where q = 1 − p.
The continuity correction 0.5, which is added to and subtracted from k, is needed because we
are approximating a discrete random variable with a continuous one.

For example, if n = 15 and p = 0.4, then

min{np, n(1 − p)} = min{6, 9} = 6 ≥ 5,

np = 6, √(npq) = √3.6 ≈ 1.9,

and

P(X = 8) ≈ Φ[(8 − 6 + 0.5)/1.9] − Φ[(8 − 6 − 0.5)/1.9]

= Φ(1.32) − Φ(0.79) = 0.9065825 − 0.7852361 = 0.1213.



Using the Binomial Table in the Appendix we have that the exact probability is equal to

P (X = 8) = F (8) − F (7) = 0.9050 − 0.7869 = 0.1181.

Therefore, the approximation error is equal to 0.0032.


The student can verify, as an exercise, the entries in Table 7.2, where P(X = k) is
approximated using formula (7.1).

Table 7.2

k Approximated Exact Error


0 0.0016 0.0005 0.0011
1 0.0070 0.0047 0.0023
2 0.0240 0.0219 0.0021
3 0.0605 0.0634 -0.0029
4 0.1213 0.1268 -0.0055
5 0.1827 0.1859 -0.0032
6 0.2051 0.2066 -0.0015
7 0.1827 0.1771 0.0056
8 0.1213 0.1181 0.0032
9 0.0605 0.0612 -0.0007
10 0.0240 0.0245 -0.0005
11 0.0070 0.0074 -0.0004
12 0.0016 0.0016 0.0000
13 0.0003 0.0003 0.0000
14 0.0000 0.0000 0.0000
15 0.0000 0.0000 0.0000
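The entries of Table 7.2 can be reproduced in a few lines (a sketch using the Python standard library; `NormalDist().cdf` stands in for the Φ table, so the last decimals can differ slightly from the table, which was built from rounded z-values):

```python
import math
from statistics import NormalDist

n, p = 15, 0.4
sd = math.sqrt(n * p * (1 - p))     # sqrt(npq), about 1.897
Z = NormalDist()                    # standard normal, plays the role of Phi

def exact_pmf(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def normal_approx(k):
    # formula (7.1) with the continuity correction
    return Z.cdf((k - n * p + 0.5) / sd) - Z.cdf((k - n * p - 0.5) / sd)

for k in range(n + 1):
    print(k, round(normal_approx(k), 4), round(exact_pmf(k), 4))
```

The largest errors occur in the middle of the distribution, and the approximation improves as k moves into the tails, just as Table 7.2 shows.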

7.3 Normal Approximation to the Poisson Distribution


The Central Limit Theorem can also be used to approximate Poisson probabilities when the
expected number of counts, α, is large. As a rule of thumb, we will use this approximation
when α ≥ 20.
The Poisson random variable, X ∼ P(α), is approximated by the normal random variable,
N(α, α), with the same mean and variance. In other words,

P(X = x) ≈ Φ[(x + 0.5 − α)/√α] − Φ[(x − 0.5 − α)/√α],

provided that α ≥ 20. The continuity correction 0.5, added to and subtracted from x, is
needed because we are approximating a discrete random variable with a continuous one.
This approximation is justified by the following argument: consider a Poisson process
with rate λ = 1, and suppose that X represents the number of occurrences in a period of
length α. We can divide α into n subintervals of length α/n and denote by Yi the number of
occurrences in the ith subinterval. It is clear that Y1, . . . , Yn are independent Poisson random
variables with mean α/n and that

X = Y1 + Y2 + · · · + Yn = nȲ.

Therefore, by the CLT,

X = nȲ ≈ n N(α/n, α/n²) = N(α, α).

Intuitively, the requirement that α is large is necessary because one needs to represent X
as the sum of a large number, n, of independent random variables, Yi, and the common
distribution of these random variables becomes very asymmetric when α/n is very small.
As an example, let X ∼ P(25) and calculate (a) P(X = 27), (b) P(X > 27) and (c)
P(24 ≤ X < 27). In the case of (a),

P(X = 27) ≈ Φ[(27.5 − 25)/5] − Φ[(26.5 − 25)/5] = Φ(0.5) − Φ(0.3) = 0.6915 − 0.6179 = 0.0736.

The exact probability is exp{−25} × 25^27/27! = 0.0708. In the case of (b),

P(X > 27) = 1 − P(X ≤ 27) ≈ 1 − Φ[(27.5 − 25)/5] = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085.

The exact probability in this case is 0.2998. Finally, in the case of (c),

P(24 ≤ X < 27) ≈ Φ[(26.5 − 25)/5] − Φ[(23.5 − 25)/5] = Φ(0.3) − Φ(−0.3) = 2Φ(0.3) − 1 = 0.2358.

The corresponding exact probability is 0.2355.
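The three calculations for X ∼ P(25) can be verified directly (a sketch, standard library only; the exact probabilities come from summing the Poisson density):

```python
import math
from statistics import NormalDist

alpha = 25
sd = math.sqrt(alpha)
Z = NormalDist()

def poisson_pmf(x):
    return math.exp(-alpha) * alpha ** x / math.factorial(x)

approx_a = Z.cdf((27.5 - alpha) / sd) - Z.cdf((26.5 - alpha) / sd)   # P(X = 27)
exact_a = poisson_pmf(27)
approx_b = 1 - Z.cdf((27.5 - alpha) / sd)                            # P(X > 27)
exact_b = 1 - sum(poisson_pmf(x) for x in range(28))
approx_c = Z.cdf((26.5 - alpha) / sd) - Z.cdf((23.5 - alpha) / sd)   # P(24 <= X < 27)
exact_c = sum(poisson_pmf(x) for x in range(24, 27))
print(round(approx_a, 4), round(exact_a, 4))
print(round(approx_b, 4), round(exact_b, 4))
print(round(approx_c, 4), round(exact_c, 4))
```

All three normal approximations land within about 0.01 of the exact Poisson probabilities, consistent with the rule of thumb α ≥ 20.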

7.4 Exercises
7.4.1 Exercise Set A
Problem 7.1 Two types of wood (Elm and Pine) are tested for breaking strength. Elm
wood has an expected breaking strength of 56 and a standard deviation of 4. Pine wood has
an expected breaking strength of 72 and a standard deviation of 8. Let X̄ be the sample
average breaking strength of an Elm sample of size 30, and Ȳ be the sample average breaking
strength of a Pine sample of size 40.
(a) What is the approximate distribution of X̄?
(b) What is the approximate distribution of Ȳ ?
(c) Calculate (approximately) P (X̄ + Ȳ < 110).

Problem 7.2 Consider a population with mean 82 and standard deviation 12.
(a) If a random sample of size 64 is selected, what is the probability that the sample mean
will lie between 80.8 and 83.2?
(b) With a random sample of size 100, what is the probability that the sample mean will lie
between 80.8 and 83.2?
(c) What assumption(s) have you used in (a) and (b)?

Problem 7.3 Suppose that the population distribution of the gripping strengths of indus-
trial workers is known to have a mean of 110 and a standard deviation of 10. For a random
sample of 75 workers, what is the probability that the sample mean gripping strength will

be:
(a) between 109 and 112?
(b) greater than 111?
(c) What assumption(s) have you made?

Problem 7.4 The expected amount of sulfur in the daily emission from a power plant is 134
pounds with a standard deviation of 22 pounds. For a random sample of 40 days, find the
approximate probability that the total amount of sulfur emissions will exceed 5,600 pounds.

Problem 7.5 Suppose we draw two independent samples of equal size n from a population
with unknown mean µ but a known standard deviation of 3.5. Let X̄ and Ȳ be the corresponding
sample averages. How large would the sample size n be required to be to ensure that P (−1 ≤
X̄ − Ȳ ≤ 1) = 0.90?

Problem 7.6 Suppose X1 , . . . , X30 are independent and identically distributed random vari-
ables with mean EX1 = 10 and variance Var(X1 ) = 5.
(a) Calculate the mean of X̄ = (1/30) Σ_{i=1}^{30} Xi and the standard deviation of X1 − X2.
(b) Calculate the interquartile range of X̄ approximately.

7.4.2 Exercise Set B


Problem 7.7 Show that if U has uniform distribution on the interval (0, 1) and F is any
given continuous distribution function, then X = F^{−1}(U) has distribution F. This result can
be used to generate random variables with any given distribution.

Problem 7.8 (a) Generate m = 100 samples of size n = 10, of independent random variables
with uniform distribution on the interval (0, 1). Let Xij denote the j th element of the ith
sample (i = 1, 2, . . . , m and j = 1, 2, . . . , n).
Construct the histogram and Q–Q plot for the sample means

X̄i = (1/n) Σ_{j=1}^{n} Xij.

(b) Same as (a) but with n = 20 and n = 40. What are your conclusions?

(c) Repeat (a) and (b) but with the Xij having density

f(x) = (1/18)(x − 4) for 4 < x < 10, and f(x) = 0 otherwise.

What are your conclusions?

Hint: See Problem 7.7.



Problem 7.9 Solve part (a) of Problem 7.8 but with p = 0.7, instead of 0.3.

Problem 7.10 Referring to Problem 6.10, find the probability that more than 800 customers
will come during the next 20 business days.

Problem 7.11 The expected tensile strength of two types of steel (types A and B, say) are
106 ksi and 104 ksi. The respective standard deviations are 8 ksi and 6 ksi. Let X̄ and Ȳ
be the sample average tensile strengths of two samples of 40 specimens of type A and 35
specimens of type B, respectively.

(a) What is the approximate distribution of X̄? Of Ȳ?

(b) What is the approximate distribution of X̄ − Ȳ? Why?

(c) Calculate (approximately) P[|X̄ − Ȳ| < 1].

(d) Suppose that after completing all the sample measurements you find x̄ − ȳ = 6. What
do you think now of the "population" assumptions made at the beginning of this problem?
Why?

Problem 7.12 (a) There are 75 defectives in a lot of 1500. Twenty five items are randomly
inspected (the inspection is non-destructive and the items are returned to the lot immediately
after inspection). If two or more items are defective the lot is returned to the supplier (at
the supplier’s expense). Otherwise, the lot is accepted. What is the probability that the lot
will be rejected?
(b) Suppose that the actual number of defectives is unknown and that five out of twenty
five independently inspected items turned out to be defectives. Estimate the total number
of defectives in the lot (of 1500 items). What is the expected value and standard deviation
of your estimate? What is the (approximated) probability that your estimate is within a
distance of 10 from the actual total number of defectives?

Problem 7.13 A sequence of n independent pH determinations of a chemical compound


will be made. Each determination can be viewed as a random variable, Xi , with mean µ
(the unknown “true” pH of the compound) and standard deviation σ = 0.15. How many
independent determinations are required if we wish that the sample average X̄ is within 0.01
of the true pH with probability 0.95? What is the necessary n if σ = 0.30?

Problem 7.14 Bits are independently received in a digital communication channel. The
probability that a received bit is in error is 0.00001.
(a) If 16 million bits are transmitted, calculate the (approximate) probability that more than
150 errors occur.
(b) If 160,000 bits are transmitted, calculate the (approximate) probability that more than
1 error occurs.
Chapter 8

Statistical Modeling and Inference

8.1 Introduction
One is often interested in random quantities (variables Y , T , N , etc.) such as the strength
Y of a concrete block, the time T of a chemical reaction, the number N of visits to a
website, etc. Engineers and applied scientists use statistical models to represent these
random quantities. Statistical models are a set of mathematical equations involving random
variables and other unknown quantities called parameters.
For example, the compressive strength of a concrete block can be modeled as

Y = µ + σε (8.1)

where µ is a parameter that represents the “true” average compressive strength of the concrete
block, ε is a random variable with zero mean and unit variance that accounts for the “block-
to-block” variability and σ is a parameter that determines the average size of the “block-to-
block” variability. Notice that according to this model the compressive strength of a concrete
block is a random variable that results from the sum of two components: a systematic
component or signal (µ) and a random component or noise (σε).
Independent measurements are often taken to “adjust the model”, that is, to estimate
the unknown parameters that appear in the model equations. For example, the compressive
strength of several concrete blocks can be measured to get information about µ and σ.
Before the measurements are actually performed they can be thought of as independent
replicates of the random quantity of interest. For example, the future measurements of the
compressive strengths can be represented as

Yi = µ + σεi , i = 1, . . . , n, (8.2)

where n is the number of measurements.


Population and Sample: The complete set of items or individuals in which we are interested,
and on which we could, in principle, measure the variable(s) of interest, is called the
population. Some examples of populations are a lot of concrete blocks, the websites on a
certain topic, the most recent 300 days of operation of a retail store, etc. It is often impossible
(or impractical) to measure the quantity of interest on all the units that comprise the


population under study. In practice some units are randomly chosen and the measurements
are performed only on them. The set of selected units is called the sample. The corresponding
set of measurements is also called a sample.
Given a statistical model and a set of measurements (sample) one can carry out statistical
procedures, called statistical inference, which are aimed at extrapolating from
the sample to the population. The most typical statistical procedures are:
• Point estimation of the model parameters.
• Confidence intervals for the model parameters.
• Testing of hypotheses about the model parameters.
These procedures will be described and further discussed in the context of the simple situa-
tions considered below.

8.2 One Sample Problems


Sometimes it can be assumed that the quantity of interest is homogeneous for all the units in
the population and that the measurements are the sum of a systematic and a random part
(signal plus noise). In these cases we normally assume that the sample is a set of homogeneous
measurements
Yi = µ + σεi , i = 1, . . . , n. (8.3)
where µ and σ are as described in the Introduction above and n is the number of measurements
or sample size. It is often assumed that the measurements are independent and therefore
that the random variables εi , i = 1, . . . , n are independent. Finally, we assume that the
random variables εi are normal with mean zero and variance one.
Note: Multiplicative models, where the measurements are the product of a systematic factor
and a random factor,

Xi = γUi,

can be transformed into additive models like (8.3) by taking the log of the measurements:

Yi = ln(Xi) = ln(γ) + ln(Ui).

8.2.1 Point Estimates for µ and σ


A point estimate is a certain combination of the sample measurements (a function of the
sample) which is expected to take values “reasonably close” to the parameter it is supposed
to estimate. The point estimate is usually denoted by the same letter as the parameter
but with an added hat to indicate that it is an estimate (e.g. θ̂ is a point estimate for the
parameter θ).
Of course, there are in principle many ways of combining the data to obtain a point
estimate. The particular combination is chosen in order to minimize some function of the
estimation error

θ̂ − θ,

for example the expected squared estimation error or the expected absolute estimation error.

Estimation of µ: A good point estimate for µ, the main parameter of model (8.3), can be
obtained by the method of least squares, which consists of minimizing (in m) the sum of
squares

S(m) = Σ_{j=1}^{n} (Yj − m)².

Differentiating with respect to m and setting the derivative equal to zero gives the equation

S′(m) = −2 Σ_{j=1}^{n} (Yj − m) = 0,  or  m = (1/n) Σ_{j=1}^{n} Yj = Ȳ = µ̂.

Estimation Error: Being functions of the random variables, the point estimate Ȳ and
the estimation error Ȳ − µ are also random variables. Obviously, we would like the
estimation error to be small. To have some idea of the behavior of the estimation error we
can calculate its expected value (mean) and its variance:

E[Ȳ − µ] = E(Ȳ) − µ = (1/n) Σ_{j=1}^{n} E(Yj) − µ = µ − µ = 0   (Ȳ is unbiased),

and

Var(Ȳ − µ) = Var(Ȳ) = (1/n²) Σ_{j=1}^{n} Var(Yj) = (1/n²) nσ² = σ²/n.

In this case, the estimation error has a distribution centered at zero and a variance
inversely proportional to n. In other words, if n is sufficiently large, likely values of Ȳ will
all be close to µ.
Estimation of σ²: The point estimate for σ² is based on the minimized sum of squares,
S(Ȳ), divided by a quantity d so that E[S(Ȳ)/d] = σ². The simple derivation outlined
in Problem 8.9 shows that d = n − 1, and so

σ̂² = S² = Σ_{j=1}^{n} (Yj − Ȳ)² / (n − 1).
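A small simulation illustrates why the divisor is n − 1 rather than n (a sketch; the normal population with σ² = 4 and the sample size n = 8 are arbitrary choices of ours):

```python
import random
import statistics

random.seed(2)
mu, sigma, n, reps = 5.0, 2.0, 8, 4_000
total = 0.0
for _ in range(reps):
    y = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = statistics.fmean(y)
    total += sum((v - ybar) ** 2 for v in y)
# dividing the summed squares by n - 1 (not n) makes the estimate unbiased
print(total / reps / (n - 1), total / reps / n)   # ≈ sigma^2 = 4  vs  ≈ 3.5
```

Dividing by n systematically underestimates σ², because the deviations are measured from Ȳ rather than from the true mean µ.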

The Standard Error of ȳ: The precision of ȳ as an estimate of µ can be measured in terms
of its estimated standard deviation,

SE(ȳ) = s/√n,

called the standard error of ȳ.
Example 8.1 A scientist wishes to detect small amounts of contamination in the environment.
To test her measurement procedure, she spiked 12 specimens with a known concentration
(2.5 µg/l of lead). The readings for the 12 specimens are

1.9 2.4 2.2 2.1 2.4 1.5 2.3 1.7 1.9 1.9 1.5 2.0

The sample mean and variance are ȳ = 1.9833 and s² = 0.09787879, respectively. The
standard error of ȳ is then SE(ȳ) = √(0.09787879/12) = 0.09031371. It would appear that the
scientist's measurement procedure is biased, giving values below the true concentration. The
bias can be estimated as 1.9833 − 2.5 = −0.5166667, give or take 0.181 (0.181 = 2 × SE(ȳ)).
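The numbers in this example are easy to reproduce with the `statistics` module (note that `statistics.variance` already uses the n − 1 divisor of the previous subsection):

```python
import statistics

readings = [1.9, 2.4, 2.2, 2.1, 2.4, 1.5, 2.3, 1.7, 1.9, 1.9, 1.5, 2.0]
n = len(readings)
ybar = statistics.fmean(readings)        # sample mean
s2 = statistics.variance(readings)       # sample variance (n - 1 divisor)
se = (s2 / n) ** 0.5                     # standard error of the mean
bias = ybar - 2.5                        # estimated bias of the procedure
print(round(ybar, 4), round(s2, 8), round(se, 8), round(bias, 7))
```

The printed values match ȳ = 1.9833, s² = 0.09787879 and SE(ȳ) = 0.09031371 above.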

8.2.2 Confidence Interval for µ


Consider the absolute estimation error |Ȳ − µ|. We wish to find a value d such that there is
a large probability (0.95 or 0.99) that the absolute estimation error is below d. That is, we
wish to find d such that, for some small value of α (typically α = 0.05 or 0.01), we have

P[ |Ȳ − µ| < d ] = 1 − α.

The resulting d can then be added to and subtracted from the observed average ȳ to obtain
the upper and lower limits of an interval called the (1 − α)100% confidence interval:

(ȳ − d, ȳ + d).
Typical values of α are α = 0.05 and α = 0.01 yielding 95% and 99% confidence intervals,
respectively. To fix ideas we will take α = 0.05 in what follows.
Assuming that the model (8.3) is correct, the probability that µ and Ȳ differ by more than
d is only 0.05. In other words, if we repeatedly obtain samples of size n and construct the
corresponding 95% confidence intervals for µ, on average, 95% of these intervals will include
the (unknown) value of µ.
Using that Ȳ ∼ N(µ, σ²/n) we have

0.95 = P[ |Ȳ − µ| < d ] = P[ |(Ȳ − µ)/(σ/√n)| < d√n/σ ] = 2Φ[d√n/σ] − 1.

That is,

Φ[d√n/σ] = 0.975.

Using the standard normal table we get

d√n/σ = 1.96,

from which we have

d = 1.96 σ/√n.

Unfortunately, in most practical applications, the value of σ is unknown and must be
estimated from the data. To estimate σ we can use, for instance, the sample standard
deviation s. The corresponding estimate for d is now

d̂ = 1.96 s/√n = 1.96 × SE(ȳ).

The precision of s as an estimate of σ increases with the sample size. Therefore, replacing σ
by s has little effect when the sample size is large (n ≥ 20, say). However, when n is small
the added level of uncertainty is somewhat increased and an adjustment is needed. To adjust
for the increased level of uncertainty the value from the normal table (1.96 when α = 0.05)
must be replaced by a slightly larger value, tdf (α), obtained from the Student’s t table. The
precise Student’s t value, tdf (α), depends on two parameters: the significance level, α, and
the degrees of freedom, df .
The significance level, α, is equal to one minus the desired confidence level. In our case,
the confidence level (desired precision) is 0.95 and so α = 0.05. In this simple case the degrees
of freedom parameter, df is simply equal to the sample size minus one, that is df = n − 1.
More generally (for future applications), the degrees of freedom are given by the formula

df = n − k,

where

n = number of squared terms appearing in the variance estimate

and

k = number of additional estimated parameters appearing in the variance estimate.

Table A.2 in the Appendix gives the values of t(df ) (α) for several values of α and df .
In summary, the estimated value of d is

d̂ = tdf (α) s/√n = tdf (α) × SE(y).
Notice that for most values of n that appear in practice, tn−1 (0.05) ≈ 2, justifying the
common practice of adding and subtracting 2 × SE(y) from the observed average y.
Example 8.2 Refer to the data in example 8.1. A 95% confidence interval for the actual
mean of the scientist's measurements is

1.9833 ± t(11) (0.05) × SE(y)

or

1.9833 ± 2.20 × 0.09031371 = (1.785, 2.182).

That is, the systematic part of the scientist's measurement is likely to lie between 1.8 and
2.2.
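The interval construction can be reproduced in a few lines of Python. This is a sketch (the function name is ours); the critical value is read from Table A.2 rather than computed:

```python
import math

def t_confidence_interval(ybar, s, n, t_crit):
    """(1 - alpha)100% confidence interval: ybar +/- t_crit * s / sqrt(n).

    t_crit is the Student's t value t_{n-1}(alpha) read from Table A.2."""
    d_hat = t_crit * s / math.sqrt(n)
    return ybar - d_hat, ybar + d_hat

# Example 8.2: n = 12, ybar = 1.9833 and SE(ybar) = s/sqrt(12) = 0.09031371,
# so s = 0.09031371 * sqrt(12); the table gives t_11(0.05) = 2.20.
lo, hi = t_confidence_interval(1.9833, 0.09031371 * math.sqrt(12), 12, 2.20)
print(round(lo, 3), round(hi, 3))  # 1.785 2.182
```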

8.2.3 Testing of Hypotheses about µ


There are situations when one wishes to determine if a certain statement or hypothesis about
a model parameter is consistent with the given data. That is, one wishes to confront the
statement against the empirical evidence (data). For example, the scientist of Examples 8.1
and 8.2 may wish to test the hypothesis that the given measurement method is unbiased,
using her collected data.
The procedure for rejecting a hypothesis about a certain unknown population parameter,
on the basis of statistical evidence, is called testing of hypothesis. The hypothesis to be
tested is denoted by H0 .
Typical hypotheses, H0 , about µ are
(i) H0 : µ = µ0 or (ii) H0 : µ ≤ µ0 or (iii) H0 : µ ≥ µ0 ,
where µ0 is some specified value. In the case of the scientist of examples 8.1 and 8.2, the
statement “the measurement method is unbiased” corresponds to (i) with µ0 = 2.5. On the
other hand, the statement “the measurement method does not consistently under-estimate
the true concentration” corresponds to (iii) with µ0 = 2.5. What statement would correspond
to (ii) with µ0 = 2.5?

Significance Level of a Test: When testing a hypothesis one can incur two possible
errors: rejecting a hypothesis that is true (error of type I) or failing to reject a hypothesis that
is false (error of type II). Errors of type I are considered more important and kept under
tight control. Therefore, usual testing procedures ensure that the probability of rejecting a
true hypothesis is rather small (0.01 or 0.05). The probability of an error of type I is usually
denoted by α and called the significance level of the test.
Taking that into consideration, the hypothesis H0 is constructed in such a way that its
incorrect rejection has a small probability. H0 states, then, the most conservative statement:
a statement that one would like to reject only in the presence of strong empirical evidence.
Because of that, H0 is called the “null hypothesis”.
The Testing Procedure: The testing procedures learned in this course are simply derived
from confidence intervals. Suppose we wish to test H0 at level α. Then we distinguish two
cases:
Two sided tests: Hypotheses of the form H0 : µ = µ0 give rise to two sided tests because
in this case we reject H0 if we have evidence indicating that µ is smaller or larger than µ0 .
The two–sided level α testing procedure consists of the following two steps:
Step 1. Construct a (1 − α)100% confidence interval for µ.
Step 2. Reject H0 : µ = µ0 if µ0 lies outside that interval.
One sided tests: Hypotheses of the form H0 : µ ≥ µ0 (H0 : µ ≤ µ0 ) are called directional
hypotheses and give rise to one–sided tests. Notice that in this case we reject H0 only if we
suspect that µ < µ0 (µ > µ0 ).
The one–sided level α testing procedure consists of the following two steps:
Step 1. Construct a [1 − (2 × α)]100% confidence interval for µ.
Step 2. Reject H0 : µ ≥ µ0 (H0 : µ ≤ µ0 ) if µ0 is larger (smaller) than the upper (lower)
end of that interval.
That is, we reject H0 if the confidence interval is completely contained in the complement of
the interval assumed under H0 .
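These two-step procedures translate into small helper functions. The sketch below is ours (the function names are hypothetical); the caller supplies the tabled critical value, t_{n−1}(α) for the two-sided test and t_{n−1}(2α) for the one-sided test, in keeping with the table convention used here:

```python
import math

def reject_two_sided(ybar, s, n, mu0, t_crit):
    """Two-sided level-alpha test of H0: mu = mu0; t_crit = t_{n-1}(alpha).
    Reject when mu0 falls outside the (1 - alpha)100% confidence interval."""
    d_hat = t_crit * s / math.sqrt(n)
    return mu0 < ybar - d_hat or mu0 > ybar + d_hat

def reject_one_sided_ge(ybar, s, n, mu0, t_crit):
    """One-sided level-alpha test of H0: mu >= mu0; t_crit = t_{n-1}(2*alpha).
    Reject when mu0 is larger than the upper end of the [1 - 2*alpha]100% interval."""
    d_hat = t_crit * s / math.sqrt(n)
    return mu0 > ybar + d_hat

# Data of example 8.1: ybar = 1.9833, SE(ybar) = 0.09031371, n = 12.
s = 0.09031371 * math.sqrt(12)
print(reject_two_sided(1.9833, s, 12, 2.5, 2.20))     # True: reject H0: mu = 2.5
print(reject_one_sided_ge(1.9833, s, 12, 2.3, 1.80))  # True: reject H0: mu >= 2.3
```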

Example 8.3 Refer to the data in example 8.1. Test at level α = 0.05 the following hy-
potheses: (a) H0 : µ = 2.5; and (b) H0 : µ ≥ 2.3.
(a) Since the 95% confidence interval (1.785, 2.182) (see example 8.2) does not include 2.5,
we reject H0 . There is statistical evidence indicating that the measurement procedure is not
unbiased.
(b) We must first construct a 90% confidence interval for µ. From example 8.1 we have
that y = 1.9833 and SE(y) = 0.09031371. Moreover, from the Student-t Table we have
t(11) (0.10) = 1.80. Therefore, the 90% confidence interval for µ is

(1.9833 − 1.80 × 0.09031371, 1.9833 + 1.80 × 0.09031371) = (1.82, 2.15).

Since 2.15 < 2.3 we reject H0 . There is statistical evidence indicating that the
measurement procedure systematically underestimates the true lead concentration by at least 0.2
µg/l.

Example 8.4 A shipyard must order a large shipment of lacquer from a supplier. Besides
other design requirements, the lacquer must be durable and dry quickly. The average drying
time must not exceed 25 minutes. Supplier A claims that, on average, its product dries in
20.5 minutes. A sample of 30 20-liter cans from supplier A yields an average drying time of
22.3 minutes and standard deviation of 2.9 minutes.

(a) Is there statistical evidence to distrust supplier A’s claim that its product has an average
drying time of 20.5 minutes?
(b) Can we say that, on average, supplier A’s lacquer dries before 24 minutes?

Solution to Example 8.4:

(a) To answer this question we must assess the precision of y as an estimate of µ. Evidently,
y = 22.3 is different from the claimed value 20.5 for µ. However, we still need to determine
if the observed difference of 1.8 is within the normal range of variability of Y .
To answer the question we can test the hypothesis

H0 : µ = 20.5.

at level α = 0.05, say. Since it is a non-directional hypothesis (two–sided test) we must
construct a 95% confidence interval for µ and check if it contains the value 20.5. In the
present case α = 0.05 and df = 30 − 1 = 29. Hence, from Table A.2, t(29) (0.05) = 2.05. Moreover,
SE(y) = 2.9/√30 = 0.529465. Therefore,

dˆ = 2.05 × 0.529465 = 1.085,

and the 95% confidence interval for µ is


(y ± d̂) = (22.3 ± 1.085) = (21.21 , 23.39).

Since this interval doesn’t include the value µ = 20.5, we reject supplier A’s claim that
µ = 20.5. That is, we reject the hypothesis µ = 20.5 on the basis of the given data and
statistical model.

(b) One way to answer this question is to test the hypothesis

H0 : µ ≥ 24.0

at some (small) level α. To take advantage of the calculations already made we may choose
α = 0.025. Since the upper limit of the 95% confidence interval for µ is smaller than 24.0, we reject
H0 and answer question (b) in a positive way. 2

8.3 Two Sample Problems


There are practical situations where we are interested in comparing several populations. In
this section we will consider the simplest case of two populations. In Chapter 10 we will
consider the general case of two or more populations.

Example 8.5 Refer to the situation described in Problem 8.4. Another supplier, called
Supplier B, could also supply the lacquer. A sample of 10 20-liter cans from supplier B yields
an average drying time of 20.7 minutes and standard deviation of 2.5 minutes. Does the data
support supplier B’s claim that, on average, its product dries faster than A’s? What if the
sample size from supplier B were 100 instead of 10?

This example illustrates a fairly common situation: one must take or recommend an im-
portant decision involving a large number of items (or individuals) on the basis of a relatively
small number of measurements performed on some of these items. Recall that the set of all
the items under study is called the population and the subset of items used to obtain the
measurements (and often the measurements themselves) is called the sample.
Example 8.5 includes two populations, namely the 3,000 20-liter cans of lacquer that can
be acquired from either supplier A or B. In the following these two populations will be called
population A and population B, respectively.
Although we are concerned with the entire populations, we will only be able to test the
items in the samples. Therefore, we must try to investigate and exploit the mathematical
connections between the samples and the populations from which they came. This can be
done with the help of a statistical model, that is, a set of probability assumptions regarding
the sample measurements. The two sample measurements can be modeled as

Yij = µi + σi εij , i = 1, 2 and j = 1, . . . , ni . (8.4)


where the first subscript (i) indicates the population and the second subscript (j) indicates
the observation. Thus, µi and σi are the population means and standard deviations, respectively,
and n1 and n2 are sample sizes. In the case of Example 8.5, n1 = 30 and n2 = 10. It is often
assumed that the measurements are independent and therefore that the random variables
εij , i = 1, 2 and j = 1, . . . , ni are independent. Finally, as in the case of one sample, we
assume that the random variables εij are normal with mean zero and variance one.
Similarly to the one-sample case, the population means µ1 and µ2 can be estimated by
the corresponding sample means:
Ȳ1 = (1/n1) Σ_{j=1}^{n1} Y1j   and   Ȳ2 = (1/n2) Σ_{j=1}^{n2} Y2j ,

Notice that Ȳ1 and Ȳ2 are normal random variables with means µ1 and µ2 and variances
σ1²/n1 and σ2²/n2 , respectively. Furthermore, the population variances σ1² and σ2² can be
estimated by the sample variances

S1² = (1/(n1 − 1)) Σ_{j=1}^{n1} [Y1j − Ȳ1]²   and   S2² = (1/(n2 − 1)) Σ_{j=1}^{n2} [Y2j − Ȳ2]²

Notice that E(Si2 ) = σi2 (see Problem 8.9).

The Pooled Variance Estimate If the variances of the two populations are approximately
equal it then makes sense to compare their means. On the other hand, if the variances
are very different, comparing the population means may be a gross oversimplification. A
practical solution in these cases is to apply a transformation (e.g. use log(Yij ) instead of Yij )
that stabilizes (equalizes) the variances.
In this course we will only consider the simple situation where

σ1² = σ2² = σ².

An unbiased estimate for the common variance σ 2 , based on the individual unbiased estimates
S12 and S22 , is given by the pooled variance estimate

S² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2) = Σ_{i=1}^{2} Σ_{j=1}^{ni} [Yij − Ȳi]² / (n1 + n2 − 2).
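The pooled estimate is a one-line computation; a sketch (the function name is ours), checked against the summary statistics of Example 8.5:

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled estimate of the common variance sigma^2 from two sample variances."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Example 8.5: supplier A has n1 = 30, s1 = 2.9; supplier B has n2 = 10, s2 = 2.5.
print(round(pooled_variance(2.9**2, 30, 2.5**2, 10), 2))  # 7.9
```

Note that the pooled estimate weights each sample variance by its degrees of freedom, so the larger sample dominates the result.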

Linear Combinations of the Population Means: In practice one often wishes to esti-
mate linear combinations of the population means and to test hypotheses about them. In
such cases we say that the parameter of interest is a linear combination of µ1 and µ2 .
The most common linear combination of µ1 and µ2 is the simple difference:

θ = µ1 − µ2 .

Other examples are

θ = µ1 − 2µ2 , θ = 3µ1 − µ2 , θ = 1.2µ1 + 0.5µ2 ,

etc. In general, θ can be written as

θ = aµ1 + bµ2

where a and b are given constants.


The parameter of interest, θ can be unbiasedly estimated by

θ̂ = aY 1 + bY 2

In fact,
E(θ̂) = E(aY 1 + bY 2 ) = aE(Y 1 ) + bE(Y 2 ) = aµ1 + bµ2 = θ.
The variance of θ̂ is equal to

Var(θ̂) = Var(aȲ1 + bȲ2) = a² Var(Ȳ1) + b² Var(Ȳ2) = a² σ²/n1 + b² σ²/n2 = σ² [a²/n1 + b²/n2].

Therefore, the standard error of θ̂ is

SE(θ̂) = σ̂ √(a²/n1 + b²/n2).
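The standard error formula translates directly into code. The sketch below (names are ours) reproduces the value 1.026 obtained for Example 8.5:

```python
import math

def se_linear_combination(a, b, s_pooled, n1, n2):
    """Standard error of theta_hat = a*Ybar1 + b*Ybar2, with the common
    variance estimated by the pooled standard deviation s_pooled."""
    return s_pooled * math.sqrt(a**2 / n1 + b**2 / n2)

# Example 8.5: theta = mu1 - mu2, i.e. a = 1, b = -1, with pooled s = 2.8106.
print(round(se_linear_combination(1, -1, 2.8106, 30, 10), 3))  # 1.026
```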
In the case of Example 8.5 the parameter of interest is θ = µ1 − µ2 , estimated as

θ̂ = y 1 − y 2 = 22.3 − 20.7 = 1.6,

The pooled variance estimate is

s² = [(29)(2.9²) + (9)(2.5²)] / (30 + 10 − 2) = 7.90
and so

SE(θ̂) = s √(1/n1 + 1/n2) = 2.8106 × √(1/30 + 1/10) = 1.026.
Degrees of Freedom: Notice that we are using n1 + n2 observations to calculate s², and
that we estimated two unknown parameters (µ1 and µ2 ), therefore

df = n1 + n2 − 2.

Confidence Interval for θ: A (1 − α)100% confidence interval for θ is given by

θ̂ ± t(n1 +n2 −2) (α) × SE(θ̂).

In the case of Example 8.5, a 95% confidence interval for µ1 − µ2 is given by

(ȳ1 − ȳ2) ± t(38) (0.05) × s √((n1 + n2)/(n1 n2)) = 1.6 ± 2.02 × 1.026 = (−0.47 , 3.67),

We have used the approximation t(38) (0.05) ≈ t(40) (0.05) = 2.02, because t(38) (0.05) is not
included in the table.
Solution to Example 8.5: The statement of Supplier B is consistent with the hypothesis

H 0 : µ1 ≥ µ2

or equivalently
H0 : µ1 − µ2 ≥ 0.
We may answer the question by testing this (directional) hypothesis at some (small) level α.
For example, we may take α = 0.05. The 90% confidence interval for θ = µ1 − µ2 is

(ȳ1 − ȳ2) ± t(40) (0.10) × s √((n1 + n2)/(n1 n2)) = 1.6 ± 1.68 × 1.026 = 1.6 ± 1.724 = (−0.124 , 3.324),

Since the value µ1 − µ2 = 0 falls in the interval, we do not reject H0 : there is no statistically
significant difference between the two means. The data, then, provide no statistical evidence
supporting Supplier B’s claim of having a superior product. 2

Example 8.6 Either 20 large machines or 30 small ones can be acquired for approximately
the same cost. One machine of each type has been experimentally run for 20 days
with the following results:

y large = ȳ1 = 31.0, slarge = s1 = 2.1

y small = ȳ2 = 22.7, ssmall = s2 = 1.9

Is there statistical evidence in favor of either type of machine? Use α = 0.05.



Solution: Since the total cost of 20 large machines equals the cost of 30 small machines, it is
reasonable to compare the total outputs:

Total output of 20 large machines = 20µ1


Total output of 30 small machines = 30µ2

where µ1 and µ2 are the average daily outputs for each type of machine.
Therefore, the parameter of interest is the linear combination

θ = 20µ1 − 30µ2 .

From the information given we have n1 = n2 = 20 and θ can be estimated by

θ̂ = 20 y 1 − 30 y 2 = 20 × 31 − 30 × 22.7 = −61.0.

The pooled estimate of σ² is

s² = (19 × 2.1² + 19 × 1.9²) / (20 + 20 − 2) = 4.01
and so s = 2.0. Since df = 20 + 20 − 2 = 38, from the Student’s t table we have

t(38) (0.05) ≈ t(40) (0.05) = 2.02.

Therefore, the 95% confidence interval for θ is

−61.0 ± 2.02 × 2.0 × √(20²/20 + 30²/20) = −61.0 ± 32.57 = (−93.57, −28.43).
Therefore we reject (at level α = 0.05) the hypothesis that both alternatives are equally
convenient. It appears that it would be more convenient to acquire 30 small machines.
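As a check on the arithmetic, Example 8.6 can be recomputed; the sketch below (the function name is ours) reuses the formulas for θ̂ and SE(θ̂) with the tabled value t(40)(0.05) = 2.02:

```python
import math

def ci_linear_combination(a, b, ybar1, ybar2, s_pooled, n1, n2, t_crit):
    """(1 - alpha)100% confidence interval for theta = a*mu1 + b*mu2."""
    theta_hat = a * ybar1 + b * ybar2
    d_hat = t_crit * s_pooled * math.sqrt(a**2 / n1 + b**2 / n2)
    return theta_hat - d_hat, theta_hat + d_hat

# Example 8.6: theta = 20*mu1 - 30*mu2, n1 = n2 = 20, pooled s = 2.0,
# and t_38(0.05) approximated by t_40(0.05) = 2.02.
lo, hi = ci_linear_combination(20, -30, 31.0, 22.7, 2.0, 20, 20, 2.02)
print(round(lo, 2), round(hi, 2))  # -93.57 -28.43
```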

8.4 Exercises
8.4.1 Exercise Set A
Problem 8.1 Given that n1 = 15, x̄ = 20, Σ(xi − x̄)² = 28, and n2 = 12, ȳ = 17, Σ(yi − ȳ)² = 22.
(a) Calculate the pooled variance s2 .
(b) Determine a 95% confidence interval for µ1 − µ2 .
(c) Test H0 : µ1 = µ2 with α = .05.

Problem 8.2 The time for a worker to repair an electrical instrument is a normally dis-
tributed N (µ, σ 2 ) random variable measured in hours, where both µ and σ 2 are unknown.
The repair times for 10 such instruments chosen at random are as follows:
212, 234, 222, 140, 280, 260, 180, 168, 330, 250
(1) Calculate the sample mean and the sample variance of the 10 observations.
(2) Construct a 95% confidence interval for µ.
(3) Suppose the worker claims that his average repair time for the instrument is no more
than 200 hours. Test if his claim conforms with the data.

Problem 8.3 (Hypothetical) The effectiveness of two STAT251/241 labs which were con-
ducted by two TAs is compared. A group of 24 students with rather similar backgrounds
was randomly divided into two labs and each group was taught by a different TA. Their test
scores at the end of the semester show the following characteristics:

n1 = 13, x̄ = 74.5, s2x = 82.6

and
n2 = 11, ȳ = 71.8, s2y = 112.6.
Assuming underlying normal distributions with σ12 = σ22 , find a 95 percent confidence interval
for µ1 −µ2 . Are the two labs different? Summarize the assumptions you used for your analysis.

Problem 8.4 Two machines (called A and B in this problem) are compared. Machine A
cost $ 3000 and machine B cost $ 4500. One machine of each type was operated during 30
days and the daily outputs were recorded. The results are summarized below:
Machine A: x̄A = 200 kg, sA = 5.1 kg.
Machine B: x̄B = 270 kg, sB = 4.9 kg.
Is there statistical evidence indicating that any one of these machines has better output/cost
performance than the other? Use α = 0.05.

Problem 8.5 The average biological oxygen demand (BOD) at a certain experimental sta-
tion has to be estimated. From measurements at other similar stations we know that the
variance of BOD samples is about 8.0 (mg/liter)2 . How many observations should we sample

if we want to be 90 percent confident that the true mean is within 1 mg/liter of our sample
average? (Hint: Using CLT, we may assume the sample average has approximately normal
distribution).

Problem 8.6 An automobile manufacturer recommends that any purchaser of one of its new
cars bring it in to a dealer for a 3000-mile checkup. The company wishes to know whether
the true average mileage for initial servicing differs from 3000. A random sample of 50 recent
purchasers resulted in a sample average mileage of 3208 and a sample standard deviation
of 273 miles. Does the data strongly suggest that true average mileage for this checkup is
something other than the recommended value?

Problem 8.7 The following data were obtained on mercury residues on birds’ breast mus-
cles:
Mallard ducks: m = 16, x̄ = 6.13, s1 = 2.40
Blue-winged teals: n = 17, ȳ = 6.46, s2 = 1.73
Construct a 95% confidence interval for the difference between true average mercury residues
µ1 , µ2 in these two types of birds in the region of interest. Does your confidence interval
indicate that µ1 = µ2 at a 95% confidence level?

Problem 8.8 A manufacturer of a certain type of glue claims that his glue can withstand
230 units of pressure. To test this claim, a sample of size 24 is taken. The sample mean is
191.2 units and the sample standard deviation is 21.3 units.
(a) Propose a statistical model to test this claim and test the manufacturer’s claim.
(b) What is the highest claim that the manufacturer can make without rejection of this
claim?

8.4.2 Exercise Set B


Problem 8.9 Suppose that Y1 , . . . , Yn are a sample, that is, they are independent, identically
distributed, with common mean µ and common variance σ 2 . Recall that the sample variance
is equal to

S² = Σ_{i=1}^{n} (Yi − Ȳ)² / (n − 1).
(a) Show that

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Yi − µ)² − n(Ȳ − µ)².

(b) Show that S 2 is an unbiased estimate of σ 2 , that is

E(S 2 ) = σ 2

Problem 8.10 (a) The president of a cable company claims that its 0.3–inch cable will
support an average load of 4200 pounds. Twenty four of these cables are tested to failure,
yielding the following data:
4201.3 4262.4 3983.0 3943.0 4141.3 4168.5 4050.0 4142.7

4270.0 4002.9 4393.9 3868.0 4123.5 4192.5 3986.6 4276.7


4253.9 4303.4 4099.2 4136.1 4492.7 4292.7 3820.9 3621.4
Propose a statistical model for the given data and test the president’s claim. Check that
your model’s assumptions are consistent with the data.
(b) A different supplier has provided a sample of thirty six 0.3–inch cables which, after tested
to failure, yielded the following data:
4047.3 4302.6 4069.4 3914.8 4133.2 3658.6 4221.9 3913.1 4129.9 4068.7
4389.9 3943.9 4446.6 3796.3 4117.4 3816.9 4353.4 4009.5 4432.9 4072.1
3862.0 3939.3 3875.2 3989.0 4203.2 4334.9 4358.6 4189.9 4219.7 4238.0
4033.2 4005.2 4428.8 3938.0 4171.6 3974.7
Propose a statistical model for (all) the given data and test the hypothesis that the
cable from the two companies have the same average strength. Check that your model’s
assumptions are consistent with the data.

Problem 8.11 A politician must decide whether or not to run in the next local election.
He would be inclined to do so if at least 30% of the voters would favor his candidacy. The
results of a poll of 20 local citizens yielded the following results:
30% favor the politician, 35% favor other candidates, and 35% are still undecided.
Should the candidate decide to run based on the results of this survey? Do you think that
the sample size is appropriate? If not, suggest an appropriate sample size.

Problem 8.12 The number of hours needed by twenty employees to complete a certain task
has been measured before and after they participated in a special training program. The
data are displayed in Table 7.1.
How would you model these data in order to answer the question: Was the training
program successful? Was it? Also check that your model’s assumptions are consistent with
the data.
Table 7.1:

Problem 8.13 In order to process a certain chemical product, a company is considering the
convenience of acquiring (for approximately the same price) either 100 large machines or 200
small ones. One important consideration is the average daily processing capacity (in hundred
of pounds).
One machine of each type was tested for a period of 10 days, yielding the following results:
Large Machine: x1 = 120 s1 = 1.5
Small Machine: x2 = 65 s2 = 1.6
Model the data and identify the parameter of main interest. Construct a 95% confidence
interval for this parameter. What is your recommendation to management?

Problem 8.14 A study is made to see if increasing the substrate concentration has appre-
ciable effect on the velocity of a chemical reaction. With the substrate concentration of 1.5
moles per liter, the reaction was run 15 times with an average velocity of 7.5 micromoles per
30 minutes and a standard deviation of 1.5. With a substrate concentration of 2.0 moles per

Employee Before Training After Training Difference


1 14.6 10.6 4.0
2 17.5 15.4 2.1
3 13.5 13.2 0.3
4 13.9 12.2 1.7
5 15.0 11.7 3.3
6 20.5 18.6 1.9
7 14.4 10.3 4.1
8 14.6 10.3 4.3
9 17.9 10.4 7.5
10 16.7 16.8 -0.1
11 14.7 14.6 0.1
12 17.3 14.6 2.7
13 11.7 10.5 1.2
14 13.7 10.9 2.8
15 16.8 11.8 5.0
16 15.7 13.4 2.3
17 15.7 13.6 2.1
18 16.7 16.7 0.0
19 15.5 16.7 -1.2
20 17.2 13.8 3.4

liter, 12 runs were made yielding an average velocity of 8.8 micromoles per 30 minutes and a
sample standard deviation of 1.2. Would you say that the increase in substrate concentration
increases the mean velocity by as much as 0.5 micromoles per 30 minutes? Use a 0.01 level
of significance and assume the populations to be approximately normally distributed with
equal variances.
Problem 8.15 (Hypothetical) A study was made to estimate the difference in annual salaries
of professors in University of British Columbia (UBC) and University of Toronto (UT). A
random sample of 100 professors in UBC showed an average salary of $46,000 with a standard
deviation $12,000. A random sample of 200 professors in UT showed an average salary of
$51,000 with a standard deviation of $14,000. Test the hypothesis that the average salary
for professors teaching in UBC differs from the average salary for professors teaching in UT
by $5,000.
Problem 8.16 A UBC student will spend, on the average, $8.00 for a Saturday evening
gathering in pub. A random sample of 12 students attending a homecoming party showed an
average expenditure of $8.9 with standard deviation of $1.75. Could you say that attending
a homecoming party costs students more than gathering in pub?
Problem 8.17 The following data represent the running times of films produced by two
different motion-picture companies.
Times (minutes)
Company I 103 94 110 87 98
Company II 97 82 123 92 175 88 118
Compute a 90% confidence interval for the difference between the average running times of
films produced by the two companies. Do the films produced by Company II run longer than
those by Company I?

Problem 8.18 It is required to compare the effect of two dyes on cotton fibers. A random
sample of 10 pieces of yarn were chosen; 5 pieces were treated with dye A, and 5 with dye B.
The results were
Dye A 4 5 8 8 10
Dye B 6 2 9 4 5
(a) Test the significance of the difference between the two dyes. (Assume normality, common
variance, and significance level α = 0.05.)
(b) How big a sample do you estimate would be needed to detect a difference equal to 0.5
with probability 99%.
Chapter 9

Simulation Studies

9.1 Monte Carlo Simulation


Consider the integral

I = ∫₀¹ g(t) dt.

Suppose that g is such that this integral cannot be easily evaluated and we need to
approximate it by numerical means. For simplicity suppose that 0 ≤ g(t) ≤ 1 for all
0 ≤ t ≤ 1.
If we are dealing with a function h(t) which is not between 0 and 1 but we know that

a ≤ h(t) ≤ b, for all 0 ≤ t ≤ 1,

then the function

g(t) = (h(t) − a)/(b − a)

does take values between 0 and 1 and

∫₀¹ h(t) dt = (b − a) ∫₀¹ g(t) dt + a.

Suppose that we want to estimate I with an error smaller than δ = 0.01, with probability
equal to 0.99. In other words, if Î is the estimate of I, we require that

P{|Î − I| < 0.01} = 0.99.


First of all, we notice that

I = ∫₀¹ g(t) dt = E{g(U)},

where U is a random variable with uniform distribution on the interval (0, 1). If we generate
n independent random variables

U1 , U2 , . . . , Un

with uniform distribution on (0, 1), then by the Central Limit Theorem

Î = (1/n) Σ_{i=1}^{n} g(Ui)

is approximately normal with mean I = E{g(U )} and variance σ 2 /n, where

σ² = ∫₀¹ g²(t) dt − I² ≤ ∫₀¹ g(t) dt − I² = I(1 − I),

since g²(t) ≤ g(t) when 0 ≤ g(t) ≤ 1.

Now,

P{|Î − I| < 0.01} = P{√n |Î − I|/σ < √n (0.01)/σ}
                 ≈ P{|Z| < √n (0.01)/σ}
                 = 2Φ[√n (0.01)/σ] − 1.

But,

2Φ[√n (0.01)/σ] − 1 = 0.99  ⇒  Φ[√n (0.01)/σ] = 0.995
                            ⇒  √n (0.01)/σ = 2.58
                            ⇒  n = σ²(2.58)²/(0.01)².

Finally, since I(1 − I) reaches its maximum at I = 0.5 it follows that I(1 − I) ≤ 0.25 for all
I, and so, a conservative estimate for n is

n = σ²(2.58)²/(0.01)² ≤ (0.25)(2.58)²/(0.01)² = 16,641.

Therefore, an estimate of I based on n = 16,641 independent uniform random variables, Ui ,
will have an error smaller than 0.01 with probability 0.99.
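As a sanity check, this prescription can be run directly in Python. The sketch below is ours (the function name, the seed, and the choice g(t) = exp(−t²), which stays between 0 and 1 on (0, 1), are illustrative assumptions):

```python
import math
import random

def monte_carlo_unit(g, n, seed=0):
    """Estimate the integral of g over (0, 1) by averaging g at n uniforms."""
    rng = random.Random(seed)
    return sum(g(rng.random()) for _ in range(n)) / n

# Conservative sample size for error < 0.01 with probability 0.99:
# n = sigma^2 (2.58)^2 / (0.01)^2 with sigma^2 <= 0.25.
n = round(0.25 * 2.58**2 / 0.01**2)  # 16641
estimate = monte_carlo_unit(lambda t: math.exp(-t * t), n)
print(n, round(estimate, 2))  # the integral of exp(-t^2) over (0, 1) is 0.7468...
```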

The Monte Carlo method can also be used to estimate an integral of the form

J = ∫ₐᵇ f(t) dt,     (9.1)

where f (t) takes values between c and d. That is, the domain of integration can be any given
bounded interval, [a, b], and the function can take values on any given bounded interval [c, d].
For example, we may wish to estimate the integral

J = ∫₁³ exp{t²} dt.

In this case the domain of integration is [1, 3] and the function ranges over the interval
[e, e⁹] ≈ [2.7183, 8103.1].
In order to estimate J, first we must make the change of variables

u = (t − a)/(b − a),

to obtain

J = (b − a) ∫₀¹ f[(b − a)u + a] du = ∫₀¹ g(u) du,

where

g(u) = (b − a)f [(b − a)u + a].



In the case of our numerical example we have

J = (3 − 1) ∫₀¹ exp{[(3 − 1)u + 1]²} du = 2 ∫₀¹ exp{[2u + 1]²} du,

and

g(u) = 2 exp{[2u + 1]²}.

The second step is to linearly modify the function g(u) so that the resulting function, h(u),
takes values between 0 and 1. That is,

h(u) = [g(u) − (b − a)c] / [(b − a)(d − c)]  ⇒  g(u) = (b − a)(d − c)h(u) + (b − a)c.

Notice that, since

(b − a)c ≤ g(u) ≤ (b − a)d,

then

0 ≤ h(u) ≤ 1.

In the case of our numerical example, with c = 2.7183 and d = 8103.1,

h(u) = [2 exp{[2u + 1]²} − 2 × 2.7183] / [2 × (8103.1 − 2.7183)] = [2 exp{[2u + 1]²} − 5.4366] / 16200.8.

Finally,

J = ∫₀¹ g(u) du = (b − a)(d − c) ∫₀¹ h(u) du + (b − a)c
  = (b − a)(d − c) I + (b − a)c,

where I is of the desired form (that is, the integral between 0 and 1 of a function that takes
values between 0 and 1).
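For the estimate itself the two rescaling steps need not be carried out by hand: averaging (b − a)f(a + (b − a)U) over uniform U's estimates J directly, since its expectation equals J; the rescaling through h matters only for the conservative sample-size bound. A sketch (names are ours; the test integrand f(t) = t² on [1, 3], whose exact value 26/3 lets us check the answer, is our choice):

```python
import random

def monte_carlo_interval(f, a, b, n, seed=0):
    """Estimate the integral of f over (a, b) by averaging
    (b - a) * f(a + (b - a) * U) over n independent uniforms U on (0, 1)."""
    rng = random.Random(seed)
    return sum((b - a) * f(a + (b - a) * rng.random()) for _ in range(n)) / n

# Sanity check on an integral we can do exactly: f(t) = t^2 on [1, 3],
# whose true value is (3^3 - 1^3)/3 = 26/3.
estimate = monte_carlo_interval(lambda t: t * t, 1.0, 3.0, 20000)
print(round(estimate, 1))  # should be close to 26/3 = 8.667
```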

9.2 Exercises
Problem 9.1 Use the Monte Carlo integration method with n = 1500 to approximate the
following integrals.

(a) I = ∫₀¹ exp{−x²} dx.

What is the (approximated) probability that the approximation error is less than d = 0.05?
Less than d = 0.01?

(b) I = ∫₋₁² exp{x²} dx.

Problem 9.2 Let

I = ∫₀^{π/2} exp{− cos²(x)} cos(x) sin(x) dx

(a) Use the Monte Carlo method, with n = 100, to estimate I.


(b) Construct a 95% confidence interval for I based on the Monte Carlo data.
(c) Is the true value of I included in your confidence interval?
Hint: use the change of variables y = cos2 (x) to exactly evaluate the integral.
(d) Repeat (a)–(c) with n = 500 and n = 1000.
(e) What is the needed sample size if the 95% confidence interval must have total length
equal to 0.02?

Problem 9.3 (a) Generate 100 samples of size n = 10 from the following distributions:
(1) Uniform on the interval (0, 1); (2) exponential with mean 1; (3) discrete with f (1) =
1/3, f (2) = 1/3 and f (9) = 1/3; (4) discrete with f (1) = 1/8, f (3) = 1/8 and f (9) = 3/4
and (5) f (1) = 1/3, f (5) = 1/3 and f (9) = 1/3.
(b) For each distribution calculate the corresponding sample means and discuss the merits
of the CLT approximation to the distribution of the sample mean in each case. You can use
histograms, Q-Q plots, box plots, etc. for your analysis.
(c) Repeat (a) and (b) with n = 20 and n = 50.
(d) Concisely state your conclusions.
Chapter 10

Comparison of several means

10.1 An example
The main ideas will be illustrated by the following example.

Example 10.1 A construction company wants to compare several different methods of
drying concrete block cylinders. To that effect, the engineer in charge of acquisition and testing
of materials sets up an experiment to compare five different drying methods referred to as
drying methods A, B, C, D and E. One important feature of the concrete block cylinders
is their compressive strength (in hundreds of kilograms per square centimeter), which can
be determined by means of a destructive strength test. After selecting a carefully designed
experiment (we will discuss this important step later on) the engineer collected the data
displayed in Table 10.1.

Table 10.1: Concrete Blocks Compressive Strength

Type Compressive Strength (100 pounds per square inch) Mean SD


A 47.90 47.95 49.39 48.80 53.15 49.06 50.62 46.80 34.66 47.48 45.05 7.5
44.56 50.41 35.99 45.15 57.53 50.05 40.79 30.38 29.13 41.21
B 37.69 37.79 62.75 51.62 39.73 65.68 64.62 46.64 52.01 61.38 52.29 8.70
52.58 40.47 53.85 55.06 49.14 49.71 57.68 50.54 62.18 54.60
C 61.93 63.39 52.87 47.26 50.97 58.45 48.87 66.48 57.79 48.51 54.29 6.32
58.91 42.85 48.40 53.28 55.00 49.97 49.47 54.21 51.37 60.26
D 82.31 51.82 64.11 61.06 47.72 53.08 56.99 55.49 52.72 64.97 56.83 10.40
44.81 46.36 44.76 68.43 76.49 48.58 61.41 55.97 46.83 52.76
E 39.72 40.98 44.74 29.94 47.18 32.84 35.39 43.54 50.21 42.12 41.15 8.16
26.72 44.68 34.48 46.54 54.80 56.89 34.46 42.88 44.30 30.64

Propose a statistical model and answer the following questions:

(a) Are the model’s assumptions consistent with the data?

(b) Propose unbiased estimates for the unknown parameters in the model.
(c) Are the (population) mean compressive strengths for the five methods different?
(d) If the answer to question (c) is positive, what method is the best? The worst?


Solution to Example 10.1:

We propose the following model. Each measurement will be represented as the sum of
two terms, an unknown constant, µi , and a random variable, εij .

Yij = µi + εij , i = 1, . . . , k and j = 1, . . . , ni .

The first subscript, i, ranges from 1 to k, where k is the number of populations being
compared, usually called treatments. In our example, we are comparing five types of drying
methods, therefore k = 5. The second subscript, j, ranges from 1 to ni , where ni is the number
of measurements for each treatment. In our example, we have n1 = n2 = . . . = n5 = 20.
The unknown parameters µi represent the treatment averages. Differences among the µi ’s
account for the part of the variability observed in the data that is due to differences among
the treatments being compared in the experiment.
The random variables εij account for the additional variability that is caused by other
factors not explicitly considered in the experiment (different batches of raw material, different
mixing times, measurement errors, etc.). The best we can hope regarding the global effect
of these uncontrolled factors is that it will average out. In this way these factors will not
unduly enhance or worsen the performance of any treatment.
An important technique that can be used to achieve this (averaging out) is called randomization. The experimental units available for the experiment (the 100 concrete cylinders in the case of our example) must be randomly assigned to the different treatments,
so that each experimental unit has, in principle, the same chance of being assigned to any
treatment. One practical way for doing this in the case of our example is to number the
blocks from 1 to 100 and then to draw (without replacement) groups of 20 numbers. The
units with numbers in the first group are assigned to treatment A, the units with numbers
in the second group are assigned to treatment B, and so on. The actual labeling of the
treatments as A, B, etc. can also be randomly decided.
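This numbering-and-drawing scheme is equivalent to shuffling the unit labels and cutting the shuffled list into consecutive groups of 20. A minimal sketch of the idea (the seed and labels are illustrative, not part of the original experiment):

```python
import random

random.seed(0)  # illustrative seed, for a reproducible assignment

units = list(range(1, 101))   # concrete cylinders numbered 1..100
random.shuffle(units)         # a random permutation of the labels

treatments = ["A", "B", "C", "D", "E"]
assignment = {t: sorted(units[20 * i: 20 * (i + 1)])
              for i, t in enumerate(treatments)}

# each treatment receives exactly 20 units, and every unit is used once
```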
The model assumptions are:

(1) Independence. The random variables εij are independent.


(2) Constant Treatment Means. E(εij ) = 0 for all i and j.
(3) Constant Variance. Var(εij ) = σ 2 for all i and j.
(4) Normality. The variables εij are normal.

These assumptions can be summarized by saying that the variables εij ’s are iid N (0, σ 2 ).

(a) The Q–Q plots of Figure 10.1 (a)-(e) suggest that assumption (4) is consistent with the
data. Figure 10.1 (f) displays the box–plots for the combined data (first from the left) and
for each drying method. The variability within the samples seems roughly constant (the boxes
are of approximately equal size). This suggests that assumption (3) is also consistent with
the data.
10.1. AN EXAMPLE

[Figure 10.1: Normal Q-Q plots (empirical quantiles versus normal quantiles) for drying methods A, B, C, D and E, together with boxplots of the combined data and of each drying method.]



(b) The unknown parameters in the model are µi (i = 1, . . . , k) and σ 2 .


In what follows we will use the following notation:

$$n = n_1 + \ldots + n_k, \quad \text{the total sample size},$$

$$y_{i\cdot} = y_{i1} + \ldots + y_{in_i} = \sum_{j=1}^{n_i} y_{ij}, \quad \text{the } i\text{th treatment's total},$$

$$\bar y_{i\cdot} = \frac{y_{i1} + \ldots + y_{in_i}}{n_i} = \frac{\sum_{j=1}^{n_i} y_{ij}}{n_i} = \frac{y_{i\cdot}}{n_i}, \quad \text{the } i\text{th treatment's mean},$$

$$s_i = \sqrt{\frac{(y_{i1} - \bar y_{i\cdot})^2 + \ldots + (y_{in_i} - \bar y_{i\cdot})^2}{n_i - 1}} = \sqrt{\frac{\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2}{n_i - 1}}, \quad \text{the } i\text{th treatment's standard deviation},$$

$$y_{\cdot\cdot} = y_{1\cdot} + \ldots + y_{k\cdot} = \sum_{i=1}^{k} y_{i\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}, \quad \text{the overall total},$$

and

$$\bar y_{\cdot\cdot} = \frac{\sum_{i=1}^{k} y_{i\cdot}}{n} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}}{n}, \quad \text{the overall mean}.$$
In the case of our example

$$\bar y_{1\cdot} = 45.05, \quad \bar y_{2\cdot} = 52.29, \quad \bar y_{3\cdot} = 54.29, \quad \bar y_{4\cdot} = 56.83, \quad \bar y_{5\cdot} = 41.15,$$


and

$$s_1 = 7.50, \quad s_2 = 8.70, \quad s_3 = 6.32, \quad s_4 = 10.40, \quad s_5 = 8.16.$$


(See columns 3 and 4 of Table 10.1.) In addition,

$$y_{\cdot\cdot} = \sum_{i=1}^{k} n_i \,\bar y_{i\cdot} = 20\,[45.05 + 52.29 + 54.29 + 56.83 + 41.15] = 4992.2$$

and

$$\bar y_{\cdot\cdot} = 4992.2/100 = 49.92.$$

It is not difficult to show that the y i. are unbiased estimates for the unknown parameters
µi . In fact, the reader can easily verify that
$$E(\bar Y_{i\cdot}) = \mu_i \quad\text{and}\quad \operatorname{Var}(\bar Y_{i\cdot}) = \frac{\sigma^2}{n_i}, \qquad i = 1, \ldots, k.$$
Analogously, it is not difficult to verify (see Problem 10.7) that S1², S2², . . . , Sk² are k different unbiased estimates for the common variance σ²:

$$E(S_i^2) = \sigma^2, \qquad i = 1, \ldots, k.$$
These k estimates can be combined to obtain an unbiased estimate for σ 2 . The reader is
encouraged to verify that the combined estimate
$$S^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\,S_i^2}{n - k}$$
is also unbiased and has a variance smaller than that of the individual Si2 ’s.
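With the group sizes and standard deviations of the example, the combined estimate is easy to compute directly; the value agrees with the MSe entry of the ANOVA table below up to rounding of the reported standard deviations. A sketch:

```python
# treatment standard deviations s_i and group sizes n_i from the example
s = [7.50, 8.70, 6.32, 10.40, 8.16]
n_i = [20] * 5
n, k = sum(n_i), len(n_i)

# combined (pooled) unbiased estimate of the common variance sigma^2
s2 = sum((m - 1) * si ** 2 for m, si in zip(n_i, s)) / (n - k)
# s2 is about 69.3 (the MSe of the ANOVA table, up to rounding)
```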

(c) Roughly speaking one can answer this question positively if there is evidence that a
substantial part of the variability in the data is due to differences among the treatments.
The total variability observed in the data is represented by the total sum of squares,

$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} [y_{ij} - \bar y_{\cdot\cdot}]^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{\left[\sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}\right]^2}{\sum_{i=1}^{k} n_i} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{\cdot\cdot}^2}{n}.$$

We will now show that the total sum of squares, SST , can be expressed as the sum of two
terms, the error sum of squares, SSe, and the treatment sum of squares, SSt. That
is,

SST = SSe + SSt, (10.1)


where

$$SSe = \sum_{i=1}^{k} \sum_{j=1}^{n_i} [y_{ij} - \bar y_{i\cdot}]^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \sum_{i=1}^{k} \frac{y_{i\cdot}^2}{n_i}$$

and

$$SSt = \sum_{i=1}^{k} n_i\,[\bar y_{i\cdot} - \bar y_{\cdot\cdot}]^2.$$

The first term on the right–hand side of equation (10.1), SSe, represents the differences
between items in the same treatment or within–treatment variability (this source of
variability is also called intra–group–variability). The second term, SSt, represents the
differences between items from different treatments or between–groups variability (this
source of variability is also called inter–group–variability).
To prove equation (10.1) we add and subtract ȳi· and expand the square to obtain

$$\begin{aligned}
\sum_{i=1}^{k} \sum_{j=1}^{n_i} [y_{ij} - \bar y_{\cdot\cdot}]^2
&= \sum_{i=1}^{k} \sum_{j=1}^{n_i} [(y_{ij} - \bar y_{i\cdot}) + (\bar y_{i\cdot} - \bar y_{\cdot\cdot})]^2 \\
&= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + 2 \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})(\bar y_{i\cdot} - \bar y_{\cdot\cdot}) \\
&= SSe + SSt + 2 \sum_{i=1}^{k} (\bar y_{i\cdot} - \bar y_{\cdot\cdot}) \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot}) \\
&= SSe + SSt + 2 \sum_{i=1}^{k} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})\,[\,n_i \bar y_{i\cdot} - n_i \bar y_{i\cdot}\,] \\
&= SSe + SSt.
\end{aligned}$$

In the case of our example we have

$$SST = 259273.7 - \frac{(4992.2)^2}{100} = 10049.11,$$

$$SSe = 259273.7 - \frac{(901.01)^2 + (1045.72)^2 + (1085.79)^2 + (1136.67)^2 + (823.05)^2}{20} = 6587.75$$

and

$$SSt = SST - SSe = 3461.36.$$

Degrees of Freedom

The sums of squares cannot be compared directly. They must first be divided by their respective degrees of freedom.
Since we use n squares and only one estimated parameter in the calculation of SST , we
conclude that
df (SST ) = n − 1.
Since there are n squares and k estimated parameters (the k treatment means) in the
calculation of SSe, we conclude that

df (SSe) = n − k.

The degrees of freedom for SSt are obtained by the difference

df (SSt) = df (SST ) − df (SSe) = (n − 1) − (n − k) = k − 1.

ANALYSIS OF VARIANCE

All the calculations made so far can be summarized on a table called the analysis of
variance (ANOVA) table.

Table 10.2: ANOVA TABLE


Source Sum of Squares df Mean Squares F
Drying Methods 3461.36 4 865.25 12.45
Error 6587.75 95 69.34
Total 10049.11 99
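Because the group sizes are equal, the ANOVA table can be reproduced from the treatment means and standard deviations alone. A sketch (small rounding differences from the tabled values are expected, since the summary statistics are themselves rounded):

```python
ybar = [45.05, 52.29, 54.29, 56.83, 41.15]  # treatment means
s = [7.50, 8.70, 6.32, 10.40, 8.16]         # treatment standard deviations
m, k = 20, 5                                # common group size, number of groups
n = m * k

grand = sum(ybar) / k                       # overall mean (equal group sizes)
SSt = m * sum((yb - grand) ** 2 for yb in ybar)  # between-groups sum of squares
SSe = sum((m - 1) * si ** 2 for si in s)         # within-groups sum of squares

MSt, MSe = SSt / (k - 1), SSe / (n - k)
F = MSt / MSe                               # about 12.5, in line with the table
```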

(c) To answer question (c) we must compare the variability due to the treatments with the
variability due to other sources. In other words, we must find out if the “treatment effect”
is strong enough to stand out above the “noise” caused by other sources of variability.
To do so, the ratio

$$F = \frac{MSt}{MSe}$$

is compared with the value F[df(MSt), df(MSe)] from the F–Table, attached at the end of these notes. In our case

$$F = \frac{865.25}{69.34} = 12.45$$

and

$$F(4, 95) \approx F(4, 60) = 2.53.$$

Since F > F (4, 95) we conclude that there are statistically significant differences among the
drying methods.

(d) To answer question (d) we must perform multiple comparisons of the treatment means.
It is intuitively clear that if the number of treatments is large and therefore the total number
of comparisons of pairs of means

$$K = \binom{k}{2} = \frac{k(k-1)}{2}$$


is very large, there will be a greater chance that some of the 95% confidence intervals will fail
to include the value zero, even if all the µi were the same. For example, K = 3 when k = 3,
K = 6 when k = 4 and K = 10 when k = 5.
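The quoted counts are just binomial coefficients and can be checked directly:

```python
from math import comb

# number of pairwise comparisons K = C(k, 2) for k = 3, 4, 5 treatments
counts = [comb(k, 2) for k in (3, 4, 5)]
# counts == [3, 6, 10]
```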
To compensate for the fact that the probability of declaring two means different when
they are not is larger than the significance level α = 0.05 used for each comparison, we must
use the smaller significance level, δ, given by

$$\delta = \frac{0.05}{K}.$$
Each individual confidence interval is constructed so that it has probability 1 − δ of
including the true treatment mean difference. It can be shown that this procedure (called
Bonferroni multiple comparisons) is conservative: If all the treatment means are equal,

µ1 = µ2 = . . . = µk ,
then the probability that one or more of these intervals do not include the true difference, 0,
is at most α.

The procedure to compute the simultaneous confidence intervals is as follows. In the first place, we must find the appropriate value, t(n−k)(δ) = t(n−k)(α/K), from the Student's t table (see Table 7.1). As before, the number of degrees of freedom corresponds to those of the MSe, that is, df = n − k.
The second step is to determine the standard deviation of the difference of treatment means, Ȳi· − Ȳm·. It is easy to see that

$$\operatorname{Var}(\bar Y_{i\cdot} - \bar Y_{m\cdot}) = \sigma^2 \left[\frac{1}{n_i} + \frac{1}{n_m}\right].$$

Therefore,

$$\text{estimated } SD(\bar Y_{i\cdot} - \bar Y_{m\cdot}) \approx \sqrt{MSe \left[\frac{1}{n_i} + \frac{1}{n_m}\right]}.$$
In the case of our example k = 5 and therefore K = 10. The observed differences between the 10 pairs of treatment (sample) means are given in Table 10.3.

Table 10.3: MULTIPLE COMPARISONS


Treatments Observed Difference dˆi,m Significance
A–B -7.24 7.56
A–C -9.24 7.56 *
A–D -11.78 7.56 *
A–E 3.9 7.56
B–C -2.0 7.56
B–D -4.54 7.56
B–E 11.14 7.56 *
C–D -2.54 7.56
C–E 13.14 7.56 *
D–E 15.68 7.56 *

As explained before, the (precision) number d̂i,m is calculated by the formula

$$\hat d_{i,m} = t_{(n-k)}(\delta)\,\sqrt{MSe}\,\sqrt{\frac{1}{n_i} + \frac{1}{n_m}}.$$

In the case of our example, since

$$n_1 = n_2 = n_3 = n_4 = n_5 = 20,$$

all the d̂i,m are equal to

$$\hat d = t_{(95)}(0.05/10)\,\sqrt{69.34}\,\sqrt{\frac{2}{20}} = 7.56.$$
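The common half-width can be reproduced numerically. The Student's t value below is an assumption read off a t table (t with 95 degrees of freedom at level 0.05/10 is roughly 2.87); the rest follows the formula above:

```python
import math

MSe = 69.34
n_i = n_m = 20
t_crit = 2.87  # assumed value of t_(95)(0.05/10), taken from a t table

d_hat = t_crit * math.sqrt(MSe) * math.sqrt(1 / n_i + 1 / n_m)
# d_hat is about 7.56, the half-width used in Table 10.3
```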
The differences marked with a star, *, in Table 10.3 are statistically significant. For example, the * on the line A–C, together with the fact that the sign of the difference is negative, is interpreted as evidence that method A is worse (less strong) than method C. The conclusions from Table 10.3 are: methods A and E are not significantly different from each other and appear to be significantly worse than the others. Observe that, although method A is not significantly worse than method B (at the current level α = 0.05), their difference, 7.24, is almost significant (fairly close to 7.56). □

10.2 Exercises
10.2.1 Exercise Set A
Problem 10.1 Three different methods are used to transport milk from a farm to a dairy
plant. Their daily costs (in $100) are given in the following:
Method 1: 8.10 4.40 6.00 7.00
Method 2: 6.60 8.60 7.35
Method 3: 12.00 11.20 13.30 10.55 11.50
(1) Calculate the sample mean and sample variance for the cost of each method.
(2) Calculate the grand mean and the pooled variance for the costs of the three methods.
(3) Test the difference of the costs of the three methods.

Problem 10.2 Six samples of each of four types of cereal grain grown in a certain region were
analyzed to determine thiamin content, resulting in the following data (micrograms/grams):
Wheat: 5.2 4.5 6.0 6.1 6.7 5.8
Barley: 6.5 8.0 6.1 7.5 5.9 5.6
Maize: 5.8 4.7 6.4 4.9 6.0 5.2
Oats: 8.3 6.1 7.8 7.0 5.5 7.2
Carry out the analysis of variance for the given data. Do the data suggest that at least two
of the four different grains differ with respect to true average thiamin content? Use α = 0.05.

Problem 10.3 A psychologist is studying the effectiveness of three methods of reducing


smoking. He wants to determine whether the mean reduction in the number of cigarettes
smoked daily differs from one method to another among male patients. Twelve men are
included in the experiment. Each smoked 60 cigarettes per day before the treatment. Four
randomly chosen members of the group pursue method I; four pursue method II; and so on.
The results are given in Table 10.4:

Table 10.4:
Method I Method II Method III
52 41 49
51 40 47
51 39 45
52 40 47

(a) Use a one-way analysis of variance to test whether the mean reduction in the number of cigarettes smoked daily is equal for the three methods. (Let the significance level equal 0.05.)
(b) Use confidence intervals to determine which method results in the largest reduction in smoking.

10.2.2 Exercise Set B


Problem 10.4 For best production of certain molds, the furnaces need to heat quickly up to a temperature of 1500°F. Four furnaces were tested several times to determine the times (in minutes) they took to reach 1500°F, starting from room temperature, yielding the results in Table 10.5. Are the furnaces' average heating times different? If so, which is the fastest? The slowest?

Table 10.5:
Furnace ni x̄i si
1 15 14.21 0.52
2 15 13.11 0.47
3 10 15.17 0.60
4 10 12.42 0.43

Problem 10.5 Three specific brands of alkaline batteries are tested under heavy loading
conditions. Given here are the times, in hours, that 10 batteries of each brand functioned
before running out of power. Use analysis of variance to determine whether the battery
brands take significantly different times to completely discharge. If the discharge times are
significantly different (at the 0.05 level of significance), determine which battery brands differ
from one another. Specify and check the model assumptions.

Table 10.6:
Battery Type
1 2 3
5.60 5.38 6.40
5.43 6.63 5.91
4.83 4.60 6.56
4.22 2.31 6.64
5.78 4.55 5.59
5.22 2.93 4.93
4.35 3.90 6.30
3.63 3.47 6.77
5.02 4.25 5.29
5.17 7.35 5.18

Problem 10.6 Five different copper-silver alloys are being considered for the conducting
material in large coaxial cables, for which conductivity is a very important material char-
acteristic. Because of differing availabilities of the five kinds, it was impossible to make as
many samples from alloys 2 and 3 as from other alloys. Given next are the coded conduc-
tivity measurements from samples of wire made from each of the alloys. Determine whether
the alloys have significantly different conductivities. If the conductivities are significantly
different (at α = 0.05), determine which alloys differ from one another. Specify and check
the model assumptions.

Problem 10.7 Show that

$$E(\bar Y_{i\cdot}) = \mu_i, \quad i = 1, \ldots, k,$$
$$E(S_i^2) = \sigma^2, \quad i = 1, \ldots, k,$$

Table 10.7:
Alloy
1 2 3 4 5
60.60 58.88 62.90 60.72 57.93
58.93 59.43 63.63 60.41 59.85
58.40 59.30 62.33 59.60 61.06
58.63 56.97 63.27 59.27 57.31
60.64 59.02 61.25 59.79 61.28
59.05 58.59 62.67 62.35 59.68
59.93 60.19 61.29 60.26 57.82
60.82 57.99 60.77 60.53 59.29
58.77 59.24 58.91 58.65
59.11 57.38 58.55 61.96
61.40 61.20 57.96
59.00 59.73 59.42
60.12 59.40
60.49 60.30
60.15

and
E(MSe) = σ 2 .
Is the variance of MSe smaller than the variance of Si2 ? Why?

Problem 10.8 To study the correlation between solar insolation and wind speed in the United States, 26 National Weather Service stations used three different types of solar collectors (2D Tracking, NS Tracking and EW Tracking) to collect solar insolation and wind speed data. An engineer wishes to compare whether these three collectors give significantly different measurements of wind speed. The values of wind speed corresponding to attainment of 95% integrated insolation are reported in Table 10.8.
Are there statistically significant differences in measurement among the three different apertures? Specify and check the model assumptions.

Table 10.8:
Station No. Site Latitude 2D Tracking NS Tracking EW Tracking
1 Brownsville, Texas 25.900 11.0 11.0 11.0
2 Apalachicola, Fla. 29.733 7.9 7.9 8.0
3 Miami, Fla. 25.800 8.7 8.6 8.7
4 Santa Maria, Calif. 34.900 9.6 9.7 9.5
5 Ft. Worth, Texas 32.833 10.8 10.7 10.9
6 Lake Charles, La. 30.217 8.5 8.4 8.6
7 Phoenix, Ariz. 33.433 6.6 6.6 6.5
8 El Paso, Texas 31.800 10.3 10.3 10.3
9 Charleston, S.C. 32.900 9.2 9.1 9.2
10 Fresno, Calif. 36.767 6.2 6.3 6.1
11 Albuquerque, N.M. 35.050 9.0 9.0 8.9
12 Nashville, Tenn. 36.117 7.7 7.6 7.7
13 Cape Hatteras, N.C 35.267 9.2 9.2 9.3
14 Ely, Nev. 39.283 10.0 10.1 10.1
15 Dodge City, Kan. 37.767 12.0 11.9 12.0
16 Columbia, Mo. 38.967 9.0 8.9 9.1
17 Washington, D.C. 38.833 9.3 9.1 9.5
18 Medford, Ore. 42.367 6.8 6.9 6.5
19 Omaha, Neb. 41.367 10.4 10.3 10.5
20 Madison, Wis. 43.133 9.5 9.5 9.6
21 New York, N.Y. 40.783 10.4 10.3 10.4
22 Boston, Mass. 42.350 11.4 11.2 11.4
23 Seattle, Wash. 47.450 9.0 9.0 9.1
24 Great Falls, Mont. 47.483 12.9 12.6 13.0
25 Bismarck, N.D. 46.767 10.8 10.7 10.8
26 Caribou, Me. 46.867 11.4 11.3 11.5
Chapter 11

The Simple Linear Regression Model

11.1 An example
Consider the following example:

Example 11.1 Due to differences in the cooling rates when rolled, the average elastic limit and the ultimate strength of reinforcing metal bars are determined by the bar size. The measurements in Table 11.1 (in hundreds of pounds per square inch) were obtained from a sample of bars.

The experimental units (metal bars) are numbered from 1 to 35. Notice that each exper-
imental unit, i, gave rise to three different measurements:

The diameter of the ith metal bar xi


The elastic limit of the ith metal bar yi
The ultimate strength of the ith metal bar zi

We will investigate the relationship between the variables xi and yi . Likewise, the relationship between the variables xi and zi can be investigated in an analogous way (see Problem 11.2).
First of all we notice that the roles of yi and xi are different. Reasonably, one must assume
that the elastic limit, yi , of the ith metal bar is somehow determined (or influenced) by the
diameter, xi , of the bar. Consequently, the variable yi can be considered as a dependent or
response variable and the variable xi can be considered as an independent or explanatory
variable.
A quick look at Figure 11.1 (a) will show that there is not an exact (deterministic)
relationship between xi and yi . For example, bars with the same diameter (3, say) have
different elastic limits (436.82, 449.40 and 412.63). However, the plot of yi versus xi shows
that in general, larger values of xi are associated with smaller values of yi .
In cases like this we say that the variables are statistically related, in the sense that the
average elastic limit is a decreasing function, f (xi ), of the diameter.


Table 11.1: Elastic Limit and Ultimate Strength of Metal Bars

Bar Elastic Ultimate Bar Elastic Ultimate
Unit Diameter Limit Strength Unit Diameter Limit Strength
(1/8 inch) (100 psi) (100 psi) (1/8 inch) (100 psi) (100 psi)
1 3 436.82 683.65 19 7 361.14 605.12
2 3 449.40 678.48 20 7 356.06 604.17
3 3 412.63 681.41 21 8 328.59 568.11
4 4 425.00 672.29 22 8 321.64 576.69
5 4 419.71 673.26 23 8 321.14 570.47
6 4 415.74 671.31 24 9 297.28 538.99
7 4 422.94 674.42 25 9 286.04 537.11
8 5 407.76 646.44 26 9 291.99 537.44
9 5 416.84 654.32 27 10 231.15 502.76
10 5 388.39 649.31 28 10 249.13 498.88
11 5 416.25 654.24 29 10 249.81 495.17
12 5 384.35 644.20 30 10 251.22 499.21
13 5 412.91 640.15 31 11 200.76 455.28
14 6 379.64 627.52 32 11 216.99 460.75
15 6 371.11 621.45 33 11 210.26 460.96
16 6 369.34 626.11 34 12 162.30 411.13
17 6 384.91 632.73 35 12 167.63 410.74
18 7 362.89 601.73

Each elastic limit measurement, yi , can be viewed as a particular value of the random
variable, Yi , which in turn can be expressed as the sum of two terms, f (xi ) and εi . That is,

Yi = f (xi ) + εi , i = 1, . . . , 35. (11.1)

It is usually assumed that the random variables εi satisfy the following assumptions:

(1) Independence. The random variables εi are independent.


(2) Constant Mean. E(εi ) = 0 for all i.
(3) Constant Variance. Var(εi ) = σ 2 for all i.
(4) Normality. The variables εi are normal.

These assumptions can be summarized by simply saying that the variables Yi ’s are indepen-
dent normal random variables with

E(Yi ) = f (xi ) and Var(Yi ) = σ 2 .

The model (11.1) above is called linear if the function f (xi ) can be expressed in the form

f (xi ) = β0 + β1 g(xi ),

where the function g(x) is completely specified, and β0 and β1 are (usually unknown) param-
eters.

[Figure 11.1: (a) elastic limit versus diameter with the first fitted line; (b) residuals versus diameter for the first fit; (c) elastic limit versus diameter with the second fitted curve; (d) residuals versus diameter for the second fit.]
The linear model,

Yi = β0 + β1 g(xi ) + εi , i = 1, . . . , 35, (11.2)



is very flexible as many possible mean response functions, f (xi ), satisfy the linear form
given above. For example, the functions

f (xi ) = 5.0 + 4.2xi and f (xi ) = β0 + 3 sin(2xi ),


are linear, in the sense explained above, with

g(xi ) = xi , β0 = 5 and β1 = 4.2


in the first case, and

g(xi ) = sin(2xi ), β0 = unspecified and β1 = 3,


in the second case.
On the other hand, there are some functions that cannot be expressed in this linear form. One example is the function

$$f(x_i) = \frac{\exp\{\beta_1 x_i\}}{1 + \exp\{\beta_1 x_i\}}.$$
The shape assumed for f (xi ) is sometimes suggested by scientific or physical considera-
tions. In other cases, as in the present example, the shape of f (xi ) is suggested by the data
itself. The plot of yi versus xi (see Figure 11.1) indicates that, at least in principle, the simple linear mean response function

$$f(x_i) = \beta_0 + \beta_1 x_i, \quad\text{that is,}\quad g(x_i) = x_i,$$

may be appropriate. In other words, to begin our investigation we will use the tentative working assumption that, on the average, the elastic limit of the metal bars is a linear function (β0 + β1 xi) of their diameters.
Of course, the values of β0 and β1 are unknown and must be empirically determined, that
is, estimated from the data. One popular method for estimating these parameters is the
method of least squares. Given the tentative values b0 and b1 for β0 and β1 , respectively,
the regression residuals

ri (b0 , b1 ) = yi − b0 − b1 xi , i = 1, . . . , n,
measure the vertical distances between the observed value, yi , and the tentatively estimated
mean response function, b0 + b1 xi .
The method of least squares consists of finding the values βˆ0 and βˆ1 of b0 and b1 , respectively,
which minimize the sum of the squares of the residuals. It is expected that, because of this
minimization property, the corresponding mean response function,

fˆ(xi ) = βˆ0 + βˆ1 xi


will be “close to” or will fit the data points. In other words, the least squares estimates β̂0
and βˆ1 are the solution to the minimization problem:

$$\min_{b_0, b_1} \sum_{i=1}^{n} r_i^2(b_0, b_1) = \min_{b_0, b_1} \sum_{i=1}^{n} [y_i - b_0 - b_1 x_i]^2.$$

To find the actual values of β̂0 and β̂1 , we differentiate the function

$$S(b_0, b_1) = \sum_{i=1}^{n} r_i^2(b_0, b_1)$$

with respect to b0 and b1 , and set these derivatives equal to zero to obtain the so called LS equations:

$$\frac{\partial}{\partial b_0} S(b_0, b_1) = -2 \sum_{i=1}^{n} [y_i - b_0 - b_1 x_i] = 0$$

$$\frac{\partial}{\partial b_1} S(b_0, b_1) = -2 \sum_{i=1}^{n} [y_i - b_0 - b_1 x_i]\,x_i = 0.$$

The LS equations can be rewritten as

$$\sum_{i=1}^{n} y_i - n b_0 - b_1 \sum_{i=1}^{n} x_i = 0$$

$$\sum_{i=1}^{n} y_i x_i - b_0 \sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 = 0,$$

or equivalently,

$$\bar y - b_0 - b_1 \bar x = 0 \tag{11.3}$$

$$\overline{xy} - b_0 \bar x - b_1 \overline{xx} = 0, \tag{11.4}$$

where

$$\overline{xy} = \frac{1}{n} \sum_{i=1}^{n} y_i x_i \quad\text{and}\quad \overline{xx} = \frac{1}{n} \sum_{i=1}^{n} x_i^2.$$

From (11.3) we have

$$b_0 = \bar y - b_1 \bar x. \tag{11.5}$$

From this and (11.4) we have

$$b_1 \overline{xx} = \overline{xy} - b_0 \bar x = \overline{xy} - [\bar y - b_1 \bar x]\,\bar x = \overline{xy} - \bar y\,\bar x + b_1 \bar x\,\bar x \;\Rightarrow\; b_1 = \frac{\overline{xy} - \bar x\,\bar y}{\overline{xx} - \bar x\,\bar x}.$$

Therefore (see (11.5)),

$$\hat\beta_1 = \frac{\overline{xy} - \bar y\,\bar x}{\overline{xx} - \bar x\,\bar x}, \quad\text{and}\quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$
In the case of our numerical example we have

$$\bar x = 7.086, \quad \bar y = 336.565, \quad \overline{xy} = 2162.353 \quad\text{and}\quad \overline{xx} = 57.657.$$

Therefore,

$$\hat\beta_1 = \frac{2162.353 - (7.086)(336.565)}{57.657 - (7.086)^2} = -29.86,$$

$$\hat\beta_0 = 336.565 - (-29.86)(7.086) = 548.16,$$

and

$$\hat f(x) = 548.16 - (29.86)\,x.$$
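The least squares formulas above use only four averages, so the fit can be reproduced in a few lines. A sketch using the summary values quoted in the text (tiny rounding differences from the printed coefficients are expected):

```python
# averages over the n = 35 bars, as reported in the text
x_bar, y_bar = 7.086, 336.565        # mean diameter, mean elastic limit
xy_bar, xx_bar = 2162.353, 57.657    # mean of x*y, mean of x^2

b1 = (xy_bar - x_bar * y_bar) / (xx_bar - x_bar * x_bar)  # slope, about -29.9
b0 = y_bar - b1 * x_bar                                   # intercept, about 548.4
```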


The plot of f̂(x) versus x (solid line in Figure 11.1 (a)) and the plot of the regression residuals,

$$e_i = y_i - \hat f(x_i) = y_i - [\,548.16 - (29.86)\,x_i\,],$$

versus xi (Figure 11.1 (b)) show that the current fit may not be appropriate. The residuals show a clear negative–positive–negative pattern. Since the residuals, ei , estimate the unobservable model errors,

$$\varepsilon_i = y_i - \beta_0 - \beta_1 x_i,$$

one would expect the plot of the ei versus the xi to show no particular pattern. In other words, if the specified mean response function is correct, the estimated mean response function f̂(xi) should “extract” most of the signal (systematic behavior) contained in the data, and the residuals, ei , should behave as patternless random noise.
Now that the tentatively specified simple transformation

g(x) = x
for the explanatory variable, x, is considered to be incorrect, the next step in the analysis is
to specify a new transformation. We will try the mean response function

f (x) = β0 + β1 x2 , that is g(x) = x2 .


Notice that if the mean elastic limit of the bars is a function of the bar's cross-sectional area,

$$\frac{(\text{diameter})^2\,\pi}{4}\,\left(\frac{1}{8}\,\text{inch}\right)^2 = x^2\,(\pi/4)\left(\frac{1}{8}\,\text{inch}\right)^2,$$

then the newly proposed mean response function will be appropriate. To simplify the notation we will write

$$w_i = x_i^2$$

to represent the squared diameter of the ith metal bar.
The new estimates for β0 , β1 and f(x) are

$$\hat\beta_1 = \frac{\overline{wy} - \bar y\,\bar w}{\overline{ww} - \bar w\,\bar w} = -2.022,$$

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar w = 336.56 + (2.022)(57.65) = 453.128,$$

and

$$\hat f(x) = 453.128 - (2.022)\,x^2.$$


The plot for this new fit (solid line on Figure 11.1 (c)) and the residuals plot (Figure 11.1
(d)) indicate that this second fit is appropriate.
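The second fit can also be reproduced from the raw data of Table 11.1 by regressing the elastic limit on w = x². A plain Python sketch; the coefficients should agree with β̂1 ≈ −2.02 and β̂0 ≈ 453 up to rounding:

```python
# diameters x_i (1/8 inch) and elastic limits y_i (100 psi) from Table 11.1
x = [3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7,
     8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 12, 12]
y = [436.82, 449.40, 412.63, 425.00, 419.71, 415.74, 422.94, 407.76, 416.84,
     388.39, 416.25, 384.35, 412.91, 379.64, 371.11, 369.34, 384.91, 362.89,
     361.14, 356.06, 328.59, 321.64, 321.14, 297.28, 286.04, 291.99, 231.15,
     249.13, 249.81, 251.22, 200.76, 216.99, 210.26, 162.30, 167.63]

n = len(x)
w = [xi ** 2 for xi in x]                    # transformed regressor w = x^2

w_bar = sum(w) / n
y_bar = sum(y) / n
wy_bar = sum(wi * yi for wi, yi in zip(w, y)) / n
ww_bar = sum(wi * wi for wi in w) / n

b1 = (wy_bar - w_bar * y_bar) / (ww_bar - w_bar * w_bar)  # slope, about -2.02
b0 = y_bar - b1 * w_bar                                   # intercept, about 453
```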

Inference in the Linear Regression Model

It is not difficult to show that, if the model is correct, the estimates βˆ0 and βˆ1 are unbiased:

E(β̂0 ) = β0 and E(β̂1 ) = β1 .

Also, it can be shown that


$$\operatorname{Var}(\hat\beta_0) = \sigma^2 \left[\frac{1}{n} + \frac{\bar w^2}{\sum_{i=1}^{n} [w_i - \bar w]^2}\right]$$

and

$$\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} [w_i - \bar w]^2}.$$

Finally, it can be shown that under the model,


à n !
X
2
E [Yi − β̂0 − β̂1 wi ] = (n − 2)σ 2
i=1

a model–based unbiased estimate for σ 2 is given by

Pn
2 i=1 [Yi − β̂0 − β̂1 wi ]2
s =
n−2

n[(yy − y y) − β̂12 (ww − w w)]


= .
n−2

In the case of our example,

$$s^2 = \frac{35\,[(120178.7 - 336.5646^2) - (2.022^2)(4993.429 - 57.657^2)]}{35 - 2} = 86.53.$$

In summary, the empirically estimated standard deviations of β̂0 and β̂1 are

$$SD(\hat\beta_0) = s\,\sqrt{\frac{1}{n} + \frac{\bar w^2}{\sum_{i=1}^{n} [w_i - \bar w]^2}} = s\,\sqrt{\frac{1}{n} + \frac{\bar w^2}{n\,[\overline{ww} - \bar w\,\bar w]}}$$

and

$$SD(\hat\beta_1) = \frac{s}{\sqrt{\sum_{i=1}^{n} [w_i - \bar w]^2}} = \frac{s}{\sqrt{n\,[\overline{ww} - \bar w\,\bar w]}}.$$

In the case of our example,

$$SD(\hat\beta_0) = \sqrt{86.53}\,\sqrt{\frac{1}{35} + \frac{57.657^2}{35\,[4993.429 - 57.657^2]}} = 1.60$$

and

$$SD(\hat\beta_1) = \sqrt{\frac{86.53}{35\,(4993.429 - 57.657^2)}} = 0.0385.$$
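The standard deviation of the slope estimate can be checked from the summary values alone. A sketch:

```python
import math

n, s2 = 35, 86.53                 # sample size and the estimate of sigma^2
w_bar, ww_bar = 57.657, 4993.429  # mean of w = x^2 and mean of w^2

Sww = n * (ww_bar - w_bar ** 2)   # equals the sum of squared deviations of w
sd_b1 = math.sqrt(s2 / Sww)       # about 0.0385, as quoted above
```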

Confidence Intervals

95% confidence intervals for the model parameters, β0 and β1 , and also for the mean
response, f (x), can now be easily obtained. First we derive the 95% confidence intervals for
β0 and β1 . As before, the intervals are of the form

β̂0 ± dˆβ0 and β̂1 ± dˆβ1

where

dˆβ0 = t(n−2) (α) SD(β̂0 ) and dˆβ1 = t(n−2) (α) SD(β̂1 ).

In the case of our example, n − 2 = 35 − 2 = 33, t(33) (0.05) ≈ t(30) (0.05) = 2.04 and so

dˆβ0 = (2.04)(1.60) = 3.26 and dˆβ1 = (2.04)(0.0385) = 0.0785

Therefore, the 95% confidence intervals for β0 and β1 are



$$453.125 \pm 3.26 \quad\text{and}\quad -2.022 \pm 0.0785,$$

respectively.
Notice that, since the confidence interval for β1 does not include the value zero, we conclude that there is a decreasing linear relationship between the square of the bar diameter and its elastic limit. When the bar surface increases by one unit (1/64 inch²) the average elastic limit decreases by about two hundred psi.
Finally, we can also construct a 95% confidence interval for the average response, f (x), at
any given value of x. It can be shown that the variance of f̂(x) is

$$\operatorname{Var}(\hat f(x)) = \sigma^2 \left[\frac{1}{n} + \frac{(w - \bar w)^2}{n\,[\overline{ww} - \bar w\,\bar w]}\right],$$

where w = x². Therefore, the empirically estimated standard deviation of f̂(x) is

$$SD(\hat f(x)) = s\,\sqrt{\frac{1}{n} + \frac{(w - \bar w)^2}{n\,[\overline{ww} - \bar w\,\bar w]}}.$$

In the case of our example we have

$$SD(\hat f(x)) = \sqrt{86.53}\,\sqrt{\frac{1}{35} + \frac{(w - 57.657)^2}{35\,[4993.429 - 57.657^2]}}.$$

For instance, if the value of interest is x = 8.0, then w = 8.0² = 64.0 and

$$SD(\hat f(8.0)) = \sqrt{86.53}\,\sqrt{\frac{1}{35} + \frac{(64.0 - 57.657)^2}{35\,[4993.429 - 57.657^2]}} = 1.59.$$

The corresponding 95% confidence interval for f(8.0) is then

$$\hat f(8.0) \pm \hat d,$$

where

$$\hat d = (2.04)(1.59) = 3.24.$$

Since

$$\hat f(8.0) = 453.125 - (2.022)(8.0^2) = 323.72,$$

the 95% confidence interval for f(8.0) is equal to

$$323.72 \pm 3.24.$$
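The whole interval computation for x = 8.0 can be assembled from the quantities above; the t value 2.04 is the tabled t(30)(0.05) used in the text as an approximation to t(33)(0.05). A sketch:

```python
import math

n, s2 = 35, 86.53
w_bar, ww_bar = 57.657, 4993.429
b0, b1 = 453.125, -2.022
t_crit = 2.04                     # t_(33)(0.05), approximated by t_(30)(0.05)

w = 8.0 ** 2                      # point of interest on the w = x^2 scale
f_hat = b0 + b1 * w               # estimated mean response, about 323.7

sd_f = math.sqrt(s2) * math.sqrt(1 / n + (w - w_bar) ** 2 / (n * (ww_bar - w_bar ** 2)))
half = t_crit * sd_f              # half-width, about 3.2
ci = (f_hat - half, f_hat + half)
```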

11.2 Exercises

Problem 11.1 The numbers of hours needed by twenty employees to complete a certain task were measured before and after they participated in a special training program. The data are displayed in Table 7.2. Notice that these data have already been partially studied in Problem 7.12. Investigate the relationship between the before-training and after-training times using linear regression. State your conclusions.

Problem 11.2 Investigate the relationship between the bar diameter and the ultimate strength shown in Table 11.1. State your conclusions.

Problem 11.3 Table 11.2 reports the yearly worldwide frequency of earthquakes with magnitude 6 or greater from January 1953 to December 1965.
(a) Make scatter-plots of the frequencies against magnitudes and the log-frequencies against
the magnitudes.
(b) Propose your regression model and estimate the coefficients of your model.
(c) Test the null hypothesis that the slope is equal to zero.

Table 11.2:
Magnitude Frequency Magnitude Frequency
6.0 2750 7.4 57
6.1 1929 7.5 45
6.2 1755 7.6 31
6.3 1405 7.7 23
6.4 1154 7.8 18
6.5 920 7.9 13
6.6 634 8.0 9
6.7 487 8.1 7
6.8 376 8.2 7
6.9 276 8.3 4
7.0 213 8.4 2
7.1 141 8.5 2
7.2 110 8.6 1
7.3 85 8.7 1

Problem 11.4 In a certain type of test specimen, the normal stress on a specimen is known
to be functionally related to the shear resistance. The following is a set of experimental data
on the variables.

x, normal stress y, shear resistance


26.8 26.5
25.4 27.3
28.9 24.2
23.6 27.1
27.7 23.6
23.9 25.9
24.7 26.3
28.1 22.5
26.9 21.7
27.4 21.4
22.6 25.8
25.6 24.9
(a) Write the regression equation.
(b) Estimate the shear resistance for a normal stress of 24.5 pounds per square inch.
(c) Construct 95% confidence intervals for regression coefficients β0 and β1 .
(d) Check the normality assumption through the residuals.

Problem 11.5 The amounts of a chemical compound y, which dissolved in 100 grams of
water at various temperatures, x, were recorded as follows:
x (°C) y (grams)
0 8 6 8
15 12 10 14
30 25 21 24
45 31 33 28
60 44 39 42
75 48 51 44
(a) Find the equation of the regression line.
(b) Estimate the amount of chemical that will dissolve in 100 grams of water at 50°C.
(c) Test the hypothesis that β0 = 6, using a 0.01 level of significance, against the alternative that β0 ≠ 6.
(d) Is the linear model adequate?
Chapter 12

Appendix

12.1 Appendix A: tables


This appendix includes five tables: normal table, t-distribution table, F -distribution table,
cumulative Poisson distribution table and cumulative binomial distribution table.
