Linear Regression With Python
A Tutorial Introduction to the Mathematics of Regression Analysis

Linear regression is often the first step, and often the only step, in fitting a simple model to
data. This brief book explains the essential mathematics
required to understand and apply regression analysis. The
tutorial style of writing, accompanied by over 30 diagrams,
offers a visually intuitive account of linear regression,
including a brief overview of nonlinear and Bayesian
regression. Hands-on experience is provided in the form of
numerical examples, included as Python code at the end of
each chapter, and implemented online as Python and Matlab
code. Supported by a comprehensive glossary and tutorial
appendices, this book provides an ideal introduction to
regression analysis.
Books by James V Stone
Linear Regression With Matlab: A Tutorial Introduction to the Mathematics of Regression Analysis
Title: Linear Regression With Python
Author: James V Stone
ISBN 9781916279186
These are the tears un-cried for you.
So let the oceans
Weep themselves dry.
Preface
Who Should Read This Book? The material in this book should be
accessible to anyone with knowledge of basic mathematics. The tutorial
style adopted ensures that readers who are prepared to put in the effort
will be rewarded with a solid grasp of regression analysis.
Online Computer Code. Python and Matlab computer code for the
numerical example at the end of each chapter can be downloaded from
https://github.com/jgvfwstone/Regression.
James V Stone.
Sheffield, England.
One of the first things taught in introductory statistics
textbooks is that correlation is not causation. It is also
one of the first things forgotten.
1.1. Introduction
ŷi = b1 x i + b0 , (1.1)
where the dependent variable ŷi is the height of the ith person in the
sample, as predicted by the independent variable xi , which is the salary
of the ith person in the sample. Already, we can see that a line is
1
1 What is Linear Regression?
ηi = yi − ŷi.   (1.2)
yi = ŷi + ηi.   (1.3)
For example, the line in Figure 1.2 has a slope of b1 = 0.764 and an
intercept of b0 = 3.22, so the predicted value of yi at xi is
ŷi = 0.764 xi + 3.22.   (1.4)
i 1 2 3 4 5 6 7 8 9 10 11 12 13
xi 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
yi 3.34 4.97 4.15 5.40 5.21 4.56 3.69 5.86 4.58 6.94 5.57 5.62 6.87
Table 1.1: Values of salary xi (groats) and measured height yi (feet) for a
fictitious sample of 13 people.
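As a quick illustration of Equations 1.1–1.4, the following short sketch (an illustration using the values quoted above, not the book's own end-of-chapter code) computes the predicted heights ŷi and the noise terms ηi for the data in Table 1.1:

import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

b1, b0 = 0.764, 3.22     # slope and intercept quoted in the text.
yhat = b1*x + b0         # predicted heights (Equation 1.1).
eta = y - yhat           # noise terms (Equation 1.2).
print(yhat[:3])          # predictions for the first three people.
print(eta[:3])           # the corresponding noise values.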
1.2. The Equation of a Line
Δx = x2 − x1.   (1.5)
ŷ1 = b1 x1 + b0,   (1.6)
ŷ2 = b1 x2 + b0,   (1.7)
so the change in ŷ is
Δŷ = ŷ2 − ŷ1,   (1.8)
Δŷ = b1(x2 − x1)   (1.10)
   = b1 Δx.   (1.11)
Figure 1.1: (a) Scatter plot of the salary and height data in Table 1.1.
(b) Three plausible lines that seem to fit the data. A groat is an obsolete unit
of currency, which was worth four pennies in England.
Δx = x2 − x1   (1.13)
   = 4 − 1   (1.14)
   = 3 groats.   (1.15)
From Equation 1.4 the values of ŷ at x1 and x2 are ŷ1 = 3.984 and
ŷ2 = 6.276 (respectively), so the change in ŷ is Δŷ = 6.276 − 3.984 =
2.29 feet, and the slope is
b1 = Δŷ/Δx   (1.19)
   = 2.29/3   (1.20)
   = 0.764 feet/groat.   (1.21)
Thus, for a salary increase of one groat, height increases by 0.764 feet.
Figure 1.2: A line drawn through the data points. The line’s slope is b1 = 0.764
feet/groat, and the intercept at a salary of zero groats is b0 = 3.22 feet, so
the equation of the line is ŷ = 0.764 x + 3.22.
Intercept. The intercept b0 specifies the value of ŷ where the line meets
the ordinate (vertical) axis, that is, where the abscissa (horizontal) axis
is zero, x = 0. The value b0 of the y-intercept can be obtained from
Equation 1.1 with b1 = 0.764 by setting ŷ1 = 3.984 and x1 = 1:
b0 = ŷ1 − b1 x1   (1.22)
   = 3.984 − 0.764 × 1   (1.23)
   = 3.22 feet.   (1.24)
For our data, this gives the rather odd prediction that someone with a
salary of zero groats would have a height of 3.22 feet.
1.3. The Best Fitting Line
Given the data set of n = 13 pairs of xi and yi values in Table 1.1, how
should we go about finding the best fitting line for those data?
One possibility is to find the line that makes the sum of all squared
vertical differences between each value of yi and the line as small as
possible, as in Figure 1.3. It may seem arbitrary to look at the sum of
squared differences, but this has a sound statistical justification, as we
shall see in Chapter 6.
So, if we wish to find the line that minimises the sum of all squared
vertical differences then we had better write an equation that describes
this quantity. Recall that the vertical distance between an observed
value yi and the value ŷi predicted by a straight line equation is ηi
(Equation 1.2). Using Equations 1.1 and 1.2, if we sum ηi² over all n
data points then we have the sum of squared errors
E = Σ_{i=1}^n [yi − (b1 xi + b0)]².   (1.25)
The values of b1 and b0 that minimise this sum of squared errors E are
the least squares estimates (LSE) of the slope and intercept, respectively.
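To make Equation 1.25 concrete, here is a minimal sketch (an illustration, not the book's own code) of E computed as a function of candidate values of b1 and b0 for the data in Table 1.1:

import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

def E(b1, b0):
    # Sum of squared errors (Equation 1.25).
    return np.sum((y - (b1*x + b0))**2)

print(E(0.764, 3.22))   # small E for the quoted best fitting line.
print(E(0.0, 5.0))      # larger E for a horizontal line near the mean height.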
The method described so far is called simple linear regression, to
distinguish it from weighted linear regression (see Chapter 8).
Figure 1.3: The difference ηi between each data point yi and the corresponding
point ŷi on the best fitting line is assumed to be noise or measurement error.
Four examples of ηi = yi − ŷi for the data in Table 1.1 are shown here.
1.4. Regression and Causation
Key point: Fitting a line to a set of data may suggest that one
variable y increases with increasing values of another variable x,
but it does not imply that increasing x causes y to increase.
Now that we have set the scene, the next four chapters provide details of
simple linear regression, summarised here. Given n values of a measured
variable y (e.g. height) corresponding to n values of a known quantity
x (e.g. salary), find the best fitting line that passes close to this set of
n data points. Crucially, the values of y are assumed to contain noise
(e.g. measurement error), whereas values of x are assumed to be known
exactly. The noise means that the observed values of y do not lie exactly
on any single line.
Given that values of y vary, some of this variation can be ‘soaked
up’ (accounted for) by the best fitting line. Precisely how well the best
fitting line matches the data is measured as the proportion of the total
variation in y that is soaked up by the best fitting line. This proportion
can then be translated into a p-value, which is the probability that the
slope of the best fitting line is really due to the noise in the measured
values of y and that the true slope of the underlying relationship between
x and y is actually zero (i.e. a horizontal line).
Later chapters explain regression in relation to maximum likelihood
estimation, multivariate regression, weighted linear regression, nonlinear
regression, and Bayesian regression.
Chapter 2
Finding the Best Fitting Line
2.1. Introduction
We can estimate the slope b1 and intercept b0 of the best fitting line
using two different strategies. The first one is exhaustive search, which
involves trying many different values of b1 and b0 in Equation 1.25 to
see which values make E as small as possible. Even though exhaustive
search is impractical, it gives us an intuitive understanding of how
calculus can be used to estimate parameters more efficiently. The
second approach is to use calculus to find analytical expressions for the
values of b1 and b0 that minimise E.
As mentioned above, one method for finding values of the slope b1 and
intercept b0 that minimise the sum of squared errors E is to substitute
plausible values of b1 and b0 into Equation 1.25 (repeated here),
E = Σ_{i=1}^n [yi − (b1 xi + b0)]²,   (2.1)
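The exhaustive search strategy can be sketched in a few lines of Python (an illustration only; the parameter grid below is an arbitrary choice, loosely based on the axis ranges in Figure 2.1, and is not taken from the book):

import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

best = (np.inf, None, None)
for b1 in np.arange(-2, 3, 0.01):        # candidate slopes.
    for b0 in np.arange(0, 5, 0.01):     # candidate intercepts.
        Ecand = np.sum((y - (b1*x + b0))**2)   # Equation 2.1.
        if Ecand < best[0]:
            best = (Ecand, b1, b0)
print('smallest E = %.3f at b1 = %.2f, b0 = %.2f' % best)   # close to 0.764 and 3.22.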
Figure 2.1: Contour map of the values of E for different pairs of values of the
slope b1 and intercept b0 . The point (b1 , b0 ) = (0.764, 3.22) where the dashed
lines intersect gives the smallest value of E.
2.4. The Normal Equations
[Figure: the error E plotted as a function of (a) the slope b1 and (b) the intercept b0.]
η̄ = (1/n) Σ_{i=1}^n [yi − (b1 xi + b0)] = 0,   (2.4)
b1 x̄ + b0 = ȳ.   (2.7)
b1 (Σ_{i=1}^n xi² − x̄ Σ_{i=1}^n xi) = Σ_{i=1}^n xi yi − ȳ Σ_{i=1}^n xi.   (2.13)
Solving for b1 gives
b1 = [(1/n) Σ_{i=1}^n xi yi − x̄ ȳ] / [(1/n) Σ_{i=1}^n xi² − x̄²].   (2.15)
b0 = ȳ − b1 x̄.   (2.17)
[Figure 2.3: The best fitting line obtained by regressing y on x, and the different line obtained by regressing x on y, for the data in Table 1.1 (height in feet versus salary in groats).]
2.5. Numerical Example
The reason that the two best fitting lines in Figure 2.3 are different is
that whereas regressing y on x finds the parameter values that minimise
the sum of squared (vertical) differences between each observed value of
y and the fitting line, regressing x on y finds the parameter values that
minimise the sum of squared (horizontal) differences between each value
of x and the fitting line. This is discussed in more detail in Section 3.3.
Now we know how to find the least squares estimates of the slope b1
and intercept b0 of the best fitting line. Next, we consider this crucial
question: How well does the best fitting line fit the data?
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

# Least squares estimates via the normal equations (Equations 2.15 and 2.17).
b1 = (np.mean(x*y) - np.mean(x)*np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)  # 0.764
b0 = np.mean(y) - b1*np.mean(x)                                                # 3.225
print('slope b1 = %6.3f, intercept b0 = %6.3f' % (b1, b0))
###############################
# END OF FILE.
###############################
Chapter 3
How Good is the Best Fitting Line?
3.1. Introduction
Having found the LSE for the slope and intercept of the best fitting line,
we naturally wish to get some idea of exactly how well this line actually
fits the data. The informal idea of how well a line fits a set of data boils
down to how well that line accounts for the variability in the data. In
essence, we seek an answer to the following question: for data with a
given amount of variability, what proportion of that variability can be
explained in terms of the best fitting line? In other words, we wish to
know how much of the overall variability in the data can be ‘soaked up’
by the best fitting line. This requires a formal definition of variability.
3.3. Covariance and Correlation
The 1/n terms cancel, but have been left in the expression so that we can
recognise it as the covariance (Equation 3.3) divided by the standard
deviations of x and y (Equation 3.2):
r = cov(x, y)/(sx sy).   (3.5)
The correlation can vary between r = ±1. Examples of data sets with
di↵erent correlations are shown in Figure 3.1.
b1 = [(1/n) Σ_{i=1}^n (xi − x̄)(yi − ȳ)] / [(1/n) Σ_{i=1}^n (xi − x̄)²],   (3.6)
so that the slope of the best fitting line is given by the ratio
b1 = cov(x, y)/var(x).   (3.8)
Key point: The slope of the best fitting line when regressing y
on x is the covariance between x and y expressed in units of the
variance of x.
If we divide all x values by the standard deviation sx , we obtain a
normalised variable x′ = x/sx, which has a standard deviation of 1,
i.e. sx′ = 1, and hence var(x′) = (sx′)² = 1 as well; similarly, y can be
normalised to have standard deviation and variance equal to 1. If both
x and y are normalised then the regression coefficient in Equation 3.8
becomes equal to the correlation coefficient in Equation 3.5. In other
words, for normalised variables, the slope of the best fitting line equals
the correlation, i.e. r = b1 .
From Equation 3.5, we have cov(x, y) = rsx sy , and substituting this
into Equation 3.8 yields b1 = r(sy /sx ), so the slope is the correlation
scaled by the ratio of standard deviations. Regressing y on x (as above)
yields b1 = r(sy/sx), whereas regressing x on y yields b′1 = r(sx/sy)
(see Figure 2.3).
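The relationships b1 = cov(x, y)/var(x) and b1 = r(sy/sx) are easy to check numerically; the sketch below (an illustration, not the book's own code) uses the data of Table 1.1:

import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

covxy = np.mean((x - x.mean())*(y - y.mean()))   # covariance of x and y.
sx, sy = np.std(x), np.std(y)                    # standard deviations.
r = covxy/(sx*sy)                                # correlation (Equation 3.5).
print(covxy/np.var(x))    # slope of y regressed on x, 0.764 (Equation 3.8).
print(r*sy/sx)            # the same slope, expressed as r*(sy/sx).
print(r*sx/sy)            # slope obtained by regressing x on y.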
Figure 3.1: Comparison of the correlations, covariances and best fitting lines
for two data sets.
(a) Correlation r = 0.90; cov(x, y) = 186; the best fitting line has a slope of
b1 = 2.09 and an intercept of b0 = 6.60.
(b) Correlation r = 0.75; cov(x, y) = 193; the best fitting line has a slope of
b1 = 2.16 and an intercept of b0 = 10.4.
3.4. Partitioning the Variance
The total error, or difference between each observed value yi and the
mean ȳ of y, can be split or partitioned into two parts, which we refer
to as signal and noise, as shown in Figure 3.2. The signal part of the
error is
δi = ŷi − ȳ,   (3.9)
which is explained by the model, and the noise part of the error is
ηi = yi − ŷi,   (3.10)
which is the part of the error that cannot be explained by the model.
The total error of each data point is the sum of the signal and noise
parts,
yi − ȳ = (ŷi − ȳ) + (yi − ŷi).   (3.11)
Figure 3.2: The total error (difference between an observed value yi and
the mean ȳ) can be partitioned into two parts: a signal part, ŷi − ȳ, that
is explained by the best fitting line, and a noise part, yi − ŷi, which is not
explained by the line.
which expands to
var(y) = (1/n) Σ_{i=1}^n (δi² + ηi² + 2 δi ηi),   (3.14)
where the final sum equals zero (see Appendix D, Equation D.21);
because (as we should expect) the correlation between signal and noise
is zero. Accordingly, the variance can be partitioned into two (signal
and noise) parts:
var(y) = (1/n) Σ_{i=1}^n δi² + (1/n) Σ_{i=1}^n ηi².   (3.16)
Multiplying Equation 3.16 by n gives
Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n δi² + Σ_{i=1}^n ηi².   (3.17)
For brevity, we define these three sums of squares (reading from left to
right in Equation 3.17) as follows.
a) The total sum of squared errors
SST = Σ_{i=1}^n (yi − ȳ)².   (3.18)
b) The explained sum of squares
SSExp = Σ_{i=1}^n δi² = Σ_{i=1}^n (ŷi − ȳ)²,   (3.19)
which is the part of the sum of squared errors SST that is accounted
for, or explained, by the regression model.
c) The noise sum of squares
SSNoise = Σ_{i=1}^n ηi² = Σ_{i=1}^n (yi − ŷi)²,   (3.20)
which is the part of the sum of squared errors SST that is not explained
by the regression model.
Now Equation 3.17 can be written succinctly as
SST = SSExp + SSNoise.   (3.21)
It will be shown in the next section that the proportion of the sum of
squared errors that is explained by the regression model is
r² = SSExp/SST,   (3.22)
or, equivalently (using Equation 3.21),
r² = 1 − SSNoise/SST.   (3.23)
(The explained sum of squares is also called the regression sum of squares,
and the noise sum of squares is also called the residual sum of squares.)
The intercept b0 is not only unimportant in most cases but also makes
for unnecessarily complicated algebra. If we can set b0 to zero then we
can ignore it. Accordingly, in Appendix D we prove that b0 can be set to
zero without affecting the slope of the best fitting line, by transforming
the means of x and y to zero. This also sets the mean of ŷ (for points
on the best fitting line) to zero, so we have x̄ = ȳ = 0, b0 = 0, and the mean of ŷ equals zero.
For the present, the geometric representation of the proof in Figure
3.3 should suffice. This shows that setting the mean of x to zero is
equivalent to simply translating the data along the x-axis, and setting
the mean of y to zero amounts to translating the data along the y-axis.
Crucially, this process of centring the data has no effect on the rate at
which y varies with respect to x, so the slope of the best fitting line
remains unaltered.
For example, after centring, x̄ = ȳ = 0, so Equation 3.6 simplifies to
b1 = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi²,   (3.26)
Figure 3.3: Setting the means to zero. The upper right dots represent the
original data, with x̄ = 2.50 and ȳ = 5.13, marked with a diamond. The lower
left circles represent the translated data, with zero means. For the original
data (upper right) the best fitting line has equation ŷi = b1 xi + b0 , but for
the translated data (lower left) b0 = 0 so that ŷi = b1 xi . Crucially, the best
fitting line has the same slope b1 for the original and the translated data.
3.5. The Coefficient of Determination
r² = var(ŷ)/var(y).   (3.27)
r² = cov(x, y)²/[var(x) var(y)].   (3.28)
var(ŷ) = (1/n) Σ_{i=1}^n b1² xi²   (3.30)
       = b1² var(x).   (3.31)
b1² = cov(x, y)²/var(x)².   (3.32)
var(ŷ) = cov(x, y)²/var(x)   (3.33)
and therefore
r² = [(1/n) Σ_{i=1}^n (ŷi − ȳ)²] / [(1/n) Σ_{i=1}^n (yi − ȳ)²]   (3.35)
   = SSExp/SST,   (3.37)
Hence, the proportion of the overall variance accounted for by the best
fitting line is (using Equation 3.27)
r2 = var(ŷ)/var(y) (3.39)
= 0.511/1.095 = 0.466. (3.40)
3.7. Python Code
x = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]
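The chapter's code listing is abridged here; a minimal completion along the lines described above (a sketch, not necessarily identical to the downloadable file) computes the partitioned sums of squares and the coefficient of determination:

import numpy as np

x = np.array(x); y = np.array(y)
b1 = (np.mean(x*y) - x.mean()*y.mean())/np.var(x)   # 0.764 (Equation 3.8).
b0 = y.mean() - b1*x.mean()                         # 3.22.
yhat = b1*x + b0

SST = np.sum((y - y.mean())**2)        # total sum of squared errors.
SSExp = np.sum((yhat - y.mean())**2)   # explained sum of squares.
SSNoise = np.sum((y - yhat)**2)        # noise sum of squares.
r2 = SSExp/SST                         # 0.466 (Equations 3.22 and 3.39).
print('r squared = %6.3f' % r2)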
Chapter 4
Statistical Significance: Means
Having estimated the slope of the best fitting line, how can we find the
statistical significance associated with that slope?
Just as the best fitting slope is an estimate of the true slope of the
relationship between two variables, so the mean of n values in a sample
is an estimate of the mean of the population from which the sample was
drawn. It turns out that the estimated slope is a weighted mean (as
will be shown in Section 5.2), which is a generalisation of a conventional
mean. Therefore, we can employ standard methods for finding the
statistical significance of a mean value to find the statistical significance
of the best fitting slope.
To calculate a conventional mean, we add up all n measured quantities
and divide the sum by n. For a weighted mean, each measured quantity
is boosted or diminished according to the value of its associated weight,
so some measurements contribute more to the weighted mean. The
justification is that some measured quantities are more reliable than
others, and such values should contribute more to the weighted mean.
Accordingly, and just for this chapter, we will treat the estimated slope
as if it were a conventional mean (i.e. an unweighted mean).
The data set in Table 1.1 is just one of many possible sets of data, or
samples, taken from an underlying large collection of salary and height
values. Indeed, we can treat our data set as if it is a single sample of
yi = µ + εi.   (4.1)
To get an idea of how typical our sample is, suppose we could take many
samples. Consider a random sample of n values, V1 = {y1 , y2 , . . . , yn },
where the subscript 1 indicates that this is the first sample; we denote
its mean value by ȳ1. If we repeat this sampling procedure N times, we
get N samples {V1, V2, . . . , VN}, where each sample Vj has mean ȳj, so
we have N means,
{ȳ1, ȳ2, . . . , ȳN}.   (4.2)
The mean of the jth sample is
ȳj = (1/n) Σ_{i=1}^n (µ + εij),   (4.3)
where yij denotes the ith value of y in the jth sample and εij is the
noise in yij. By splitting this into two summations, we have
ȳj = (1/n) Σ_{i=1}^n µ + (1/n) Σ_{i=1}^n εij.   (4.4)
The first term is just µ, and the second term is the mean of the noise
values in the jth sample,
ε̄j = (1/n) Σ_{i=1}^n εij,   (4.5)
ȳj = µ + ε̄j.   (4.6)
4.2. The Distribution of Means
Figure 4.1: Each panel shows a histogram of N = 10,000 means. All the
means in each histogram are based on a single sample size n, and n increases
from (a) to (f), as indicated in each panel. For example, in (b), first the mean
ȳ of n = 4 randomly chosen values of y was found; then this was repeated to
obtain a total of 10,000 means, and a histogram of those means was plotted.
Each value of y was chosen from a parent population with mean µ = 0 and
standard deviation σy = 1, shown in (a). As n increases, the standard error
σȳ shrinks (this is the standard deviation of the mean values, denoted by σ in each
panel). The smooth curve in each panel is a Gaussian function with standard
deviation σȳ = σy/√n.
The law of large numbers says that if the distribution of noise ε in the
parent population has a mean ε̄ of zero then the distribution of noise
means ε̄j has a mean that converges to zero as n increases. Therefore, as
n increases, Equation 4.6 becomes ȳ ≈ µ, with equality as n approaches
infinity. Thus, the sample mean ȳ is essentially a noisy estimate of
the population mean µ. But how much confidence can we place in
our estimate ȳ of the population mean µ? To answer this, we need to
find the standard deviation of the estimated mean ȳ. And, to find the
standard deviation of the mean ȳ, we need the central limit theorem.
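A short simulation (a sketch, not taken from the book) illustrates how the standard deviation of sample means shrinks in proportion to 1/√n, as in Figure 4.1:

import numpy as np

rng = np.random.default_rng(0)
sigma_y, N = 1.0, 10000                  # population sd and number of samples.
for n in [2, 4, 8, 16, 32]:
    means = rng.normal(0, sigma_y, size=(N, n)).mean(axis=1)
    print('n = %2d: sd of means = %.3f, sigma_y/sqrt(n) = %.3f'
          % (n, means.std(), sigma_y/np.sqrt(n)))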
[Figure 4.2: The normalised Gaussian distribution p(z), which has a mean of zero and a standard deviation of one.]
The factor of 1/√n in Equation 4.7 implies that the standard error
shrinks as the sample size n increases. However, there are ‘diminishing
returns’ on increasing n: initially the standard error shrinks rapidly,
but for sample sizes above 20 the rate of decrease slows considerably, as
shown in Figure 4.3. Despite these diminishing returns, Equation 4.7
guarantees that σȳ approaches zero as n tends to infinity.
p(ȳ) = k e^{−(ȳ − µ)²/(2σȳ²)},   (4.9)
where e = 2.718 . . . and k = [1/(2π σȳ²)]^{1/2}, which ensures that the area
under the curve sums to 1.
[Figure 4.3: The standard deviation of sample means plotted against the number n of values in each sample.]
4.3. Degrees of Freedom
Figure 4.4: a) For a sample with n = 2 values y1 and y2 , if their sum S is fixed
then the point y = (y1 , y2 ) must lie on the line shown, which has equation
y1 + y2 = S. (b) For a sample with n = 3 values y1 , y2 and y3 , if their sum S
is fixed then the point y = (y1 , y2 , y3 ) must lie on the plane shown, which has
equation y1 + y2 + y3 = S. One possible value for y is shown in (a) and (b).
4.5. The p-Value
And the central limit theorem guarantees that the distribution of sample
means is approximately Gaussian with a mean of µ and a standard
deviation of σȳ.
z = (ȳ − µ)/σȳ,   (4.12)
where σȳ = σy/√n (Equation 4.7), so that
z = (ȳ − µ)/(σy/√n).   (4.13)
Because all sets of data contain noise, which has a Gaussian distribution,
the central limit theorem tells us that finding the mean of almost any
data set is analogous to choosing a point under a Gaussian distribution
curve at random. As illustrated in Figure 4.5, the probability of choosing
that point is proportional to the height p(z) of the curve above that
point. Consequently, the probability of choosing a point located at z
along the abscissa decreases with distance from the mean µ. In other
words, the Gaussian curve defines the probability of any given value of
z, and this probability is a Gaussian function of z as in Equation 4.9.
If we now consider the normalised Gaussian distribution shown in
Figure 4.2, which has mean 0 and standard deviation 1, then the area
under the curve between ±1.96 accounts for 95% of the total area, so
the probability of choosing a point z that lies between ±1.96 is 0.95. It
follows that the total area of the two ‘tails’ of the distribution outside of
this central region is p = 1 0.95 = 0.05 (see Figure 4.2). Specifically,
the tail of the distribution to the right of z = +1.96 has an area of
0.025, and the tail of the distribution to the left of z = 1.96 also has
an area of 0.025. This means that the probability that z is more than
1.96 is 0.025, and the probability that z is less than 1.96 is 0.025. The
probability that the absolute value |z| of z is larger than 1.96 is the
p-value, p = 1 0.95 = 0.05.
Figure 4.5: A normalised histogram as an approximation to a Gaussian
distribution (smooth curve). If dots fall onto the page at random positions,
there will usually be more dots in the taller columns (only dots that fall under
the curve are shown). The proportion of dots in each column is proportional
to the height p(z) of that column. Therefore, the probability of choosing a
specific value of z is proportional to p(z).
4.6. The Null Hypothesis
The answer is: any value of ȳ that lies under the shaded area of
the Gaussian distribution in Figure 4.6a. Because the total area under
the distribution curve is 1, the area of any region under the curve
corresponds to a probability. Imagine increasing the value of ȳ in
Figure 4.6a and calculating the area under the curve to the right of the
current value; it turns out that when the area under the curve to the
right of ȳ equals 0.05, the location of ȳ corresponds to a critical value
of ȳcrit = 1.645 × σȳ, where σȳ is the standard error. In other words,
the value of ȳ that is 1.645 standard errors above the mean cuts off a
shaded region with an area of 0.05.
Using Equation 4.11, the estimated standard error is σ̂ȳ = 0.302,
so ȳcrit = 1.645 × 0.302 = 0.497. Therefore, values of ȳ larger than
0.497 are considered to be statistically significantly different from zero.
Clearly, the observed mean of 5.135 is larger than 0.497, so we can reject
the null hypothesis that µ = 0. To put it another way, the observed
mean of ȳ = 5.135 is 5.135/0.302 = 17.003 standard errors above zero,
which is very significant. Because we only consider the area in the tail
on one side of the distribution, this is called a one-tailed test.
[Figure 4.6: The distribution p(ȳ) of the sample mean ȳ, panels (a) and (b).]
4.7. The z-Test
that is,
−1.96 ≤ (ȳ − µ)/σȳ ≤ +1.96,   (4.15)
which means there is a 95% chance that the sample mean ȳ is within
1.96 standard errors of the population mean µ. More importantly,
Equation 4.15 can be rewritten as
ȳ − 1.96 σȳ ≤ µ ≤ ȳ + 1.96 σȳ,
which means there is a 95% chance that the population mean µ lies
between ±1.96 standard errors of the sample mean ȳ.
Specifically, there is a 95% chance that the population mean µ lies in
the confidence interval (CI) between the confidence limits ȳ − 1.96 σȳ
and ȳ + 1.96 σȳ, which is written as
CI = [ȳ − 1.96 σȳ, ȳ + 1.96 σȳ].   (4.18)
For a Gaussian distribution, 95% of the area under the curve is within
±1.96 standard deviations of the mean. This implies that each of the
two tails of the distribution contains 2.5% of the area, making a total of
5%. Accordingly, z0.05 = 1.96 is the critical value of z for a statistical
significance level or p-value of 0.05.
4.8. The t-Test
of the family has a different number of degrees of freedom (see Section 4.3),
as shown in Figure 4.7. Analogous to a z-score, we have the variable
t = (ȳ − µ)/σ̂ȳ,   (4.19)
where σ̂ȳ = σ̂y/√n and σ̂y is calculated from Equation 4.10. Just as
with a Gaussian distribution, for a t-distribution, 95% of the area under
the curve is between t = ±t(0.05) standard deviations of the mean,
where t(0.05) denotes the critical value of t for a statistical significance
level of p = 0.05. In other words, (by analogy with Equation 4.15) there
is a 95% chance that
−t(0.05) ≤ (ȳ − µ)/σ̂ȳ ≤ +t(0.05).   (4.20)
With ν degrees of freedom, this becomes
−t(0.05, ν) ≤ (ȳ − µ)/σ̂ȳ ≤ +t(0.05, ν),   (4.21)
[Figure 4.7: t-distributions for sample sizes n = 2, 4, 8 and n = ∞. As n increases, the t-distribution approaches the normalised Gaussian distribution.]
This means there is a 95% chance that the population mean µ lies
between ±t(0.05, ν) standard errors of the sample mean ȳ.
By analogy with Equation 4.18, there is a 95% chance that the
population mean µ lies in the confidence interval between the confidence
limits ȳ − t(0.05, ν) σ̂ȳ and ȳ + t(0.05, ν) σ̂ȳ, which is written as
CI = [ȳ − t(0.05, ν) σ̂ȳ, ȳ + t(0.05, ν) σ̂ȳ].
To find the p-value associated with the mean ȳ, look up the critical
value t(0.05, ν) corresponding to ν = n − 1 degrees of freedom and a
significance level of p = 0.05 (see Table 4.1). If the absolute value of t
from Equation 4.19 is larger than t(0.05, ν) (i.e. |t| > t(0.05, ν)) then
the p-value is p(t, ν) < 0.05.
Table 4.1: Critical values of t for a two-tailed test for different degrees of
freedom ν and p-values 0.05 and 0.01, where each p-value corresponds to the
total area under the two tails of the t-distribution. When ν = ∞, the values
of t equal the z-scores of a normalised Gaussian distribution.
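Critical values of t such as those in Table 4.1 can be obtained from scipy and used to form a confidence interval for the mean; the brief sketch below (not the book's own code) uses the sample size, mean and standard error quoted in this chapter:

import numpy as np
from scipy import stats

n, ymean, sem = 13, 5.135, 0.302          # sample size, mean and standard error.
tcrit = stats.t.ppf(1 - 0.05/2, df=n-1)   # two-tailed critical value, 2.179.
print('t(0.05, %d) = %.3f' % (n-1, tcrit))
print('95%% CI = [%.3f, %.3f]' % (ymean - tcrit*sem, ymean + tcrit*sem))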
4.9. Numerical Example
Therefore the estimated population standard deviation is σ̂y = √1.187 =
1.089 feet. From Equation 4.11, the unbiased estimate of the standard
error (i.e. the estimated standard deviation of the sample means) is
σ̂ȳ = σ̂y/√n = 1.089/√13 = 0.302 feet.   (4.28)
If we test the null hypothesis that the data were obtained from a
population with mean µ = 0 then the value of t is (Equation 4.19)
tȳ = (ȳ − µ)/σ̂ȳ = (5.135 − 0)/0.302 = 16.997.   (4.29)
Reference.
Walker HM (1940). Degrees of Freedom. Journal of Educational
Psychology, 31(4), 253.
4.10. Python Code
import numpy as np
from scipy import stats

y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]
n = len(y)
ymean = np.mean(y)                       # 5.135
semymean = np.std(y, ddof=1)/np.sqrt(n)  # estimated standard error, 0.302
# Find t value.
tval = ymean/semymean # 16.997
pval = 2*stats.t.sf(abs(tval), df=n-1)   # two-tailed p-value.
Chapter 5
Statistical Significance: Regression
The quality of the fit between a line and the data may look impressive,
or it may look downright shoddy. In either case, how are we to assess
the statistical significance of the best fitting line? In fact, this question
involves two distinct subsidiary questions.
1. What is the statistical significance associated with the overall fit
of the line to the data?
2. What is the statistical significance associated with the slope of
the best fitting line?
As stated in the previous chapter, the slope of the best fitting line is
a weighted mean. Because a conventional mean is a special case of
a weighted mean in which every weight equals 1, we can adapt the
standard methods from the previous chapter for finding the statistical
significance of means to find the statistical significance of the slope.
wi = xi / Σ_{i=1}^n xi².   (5.3)
yi = b1 xi + b0 + ηi,   (5.4)
5.3. Statistical Significance: Slope
tb1 = (b1 − b1⁰)/σ̂b1,   (5.6)
where b1⁰ is the hypothesised value of the slope. If the null hypothesis is that the true slope is b1⁰ = 0 then this becomes
tb1 = b1/σ̂b1.   (5.7)
σb1² = var(Σ_{i=1}^n wi yi).   (5.8)
var(Σ_{i=1}^n wi yi) = Σ_{i=1}^n var(wi yi).   (5.9)
σb1² = Σ_{i=1}^n wi² var(yi).   (5.10)
Recall that we are assuming that the variances of all the data points
are the same, var(y1 ) = var(y2 ) = · · · = var(yn ), with all being equal to
the variance ση² of the noise ηi in yi; therefore
σb1² = ση² Σ_{i=1}^n wi².   (5.11)
Using Equation 5.3, we find that Σ_{i=1}^n wi² = 1/Σ_{i=1}^n (xi − x̄)², so
σb1² = ση² / Σ_{i=1}^n (xi − x̄)².   (5.12)
Of course, we do not know the value of ση², but an unbiased estimate is
σ̂η² = Σ_{i=1}^n (yi − ŷi)²/(n − 2)   (5.13)
    = E/(n − 2),   (5.14)
σ̂b1 ≈ sη/(sx √n).   (5.16)
σ̂η̄ = σ̂η/√n,   (5.17)
sη̄ ≈ sη/√n.   (5.18)
σ̂b1 ≈ sη̄/sx.   (5.19)
tb1 = r(n − 2)^{1/2} / (1 − r²)^{1/2},   (5.20)
r² = var(ŷ)/var(y).   (5.21)
p-Values: Slope
The data contain noise, so it is possible that they actually have a slope
of zero, even though the best fitting line has a nonzero slope b1 . The
p-value is the probability that the slope b1 of the best fitting line is due
to noise in the data. More precisely, the p-value is the probability that
the slope of the best fitting line is equal to or more extreme than the b1
observed, given that the true slope is zero.
To find the p-value associated with the slope b1, look up the critical
value t(0.05, ν) that corresponds to ν = n − 2 degrees of freedom and
a significance value of p = 0.05 (see Table 4.1). If the absolute value
of tb1 from Equation 5.7 is larger than t(0.05, ν) (i.e. |tb1| > t(0.05, ν))
then the p-value is p(tb1, ν) < 0.05. Notice that, in principle, the slope
could have been negative or positive, so we use a two-tailed test.
5.4. Statistical Significance: Intercept
tb0 = (b0 − b0⁰)/σ̂b0,   (5.23)
The lower the p-value, the more statistical significance is associated with
the data. However, as discussed in Section 4.8, statistical significance
is not necessarily associated with anything important. For example,
suppose that the slope of the best fitting line y = b1 x + b0 is very small
(e.g. b1 = 0.001). The fact that σ̂b1 ∝ 1/√n (Equation 5.16) implies
that for a sufficiently large sample size n, the value of tb1 = b1/σ̂b1
(Equation 5.7) will be large, so the slope b1 will be found to be highly
significant (e.g. p < 0.00001). Even so, because the slope is so small, it
is probably unimportant.
Key point: That the slope b1 of the best fitting line y = b1 x+b0
is statistically significant does not imply that x is an important
factor in accounting for y.
5.6. Assessing the Overall Fit
The F -test. In the case of the simple linear regression considered here,
the t-statistic provides all the information necessary to evaluate the fit.
However, when we come to consider more general models (e.g. weighted
linear regression in Chapter 7), we will find that the t-test is really a
special case of a more general test, called the F -test. Thus, this section
paves the way for later chapters and may be skipped on first reading.
Roughly speaking, the F -test relies on the ratio F′ of two proportions,
[the proportion of variance in y explained by the model] to [the
proportion of variance in y not explained by the model]:
Clearly, larger values of F′ imply a better fit of the model to the data.
From Equation 5.21, the proportion of variance in y explained by the
regression model is r², so the proportion of unexplained variance is
1 − r²; then Equation 5.26 becomes
F′ = r²/(1 − r²).   (5.27)
For a model with p parameters fitted to n data points, the F -ratio is
F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)],   (5.28)
or, equivalently,
F(p − 1, n − p) = [SSExp/(p − 1)] / [SSNoise/(n − p)].   (5.29)
The F -Test and the t-Test. As mentioned earlier, the t-test is really
a special case of the F -test. With one independent variable, the number
of parameters is p = 2, so Equation 5.28 becomes
F(1, n − 2) = r² / [(1 − r²)/(n − 2)].   (5.30)
then we can see that F(1, n − 2) = [tb1(n − 2)]². In general, the value
of F with 1 and ν = n − p degrees of freedom equals the square of tb1
with ν degrees of freedom.
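This equivalence is easy to verify numerically with scipy; the sketch below (not the book's code) uses the values r² = 0.466 and n = 13 from the running example:

import numpy as np
from scipy import stats

r2, n = 0.466, 13
t = np.sqrt(r2*(n-2)/(1-r2))     # Equation 5.20, approx. 3.10.
F = (r2/1)/((1-r2)/(n-2))        # Equation 5.30, approx. 9.6.
print(F, t**2)                   # identical (up to rounding).
print(stats.f.sf(F, 1, n-2))     # p-value, approx. 0.01.
print(2*stats.t.sf(t, n-2))      # the same p-value from the t-distribution.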
So far we have been considering simple regression with only one
regressor, and in this case the t-test and the F -test give exactly the
same result. If we consider more than one regressor, as in multivariate
regression (Chapter 7), then the t-test and the F -test are no longer
equivalent and the F -test must be used.
Not Using the χ²-Test. It is common to assess the overall fit between
the data and the best fitting line using either the F -test or the χ²-test
(chi-squared test). In fact, these tests are equivalent as n → ∞.
However, we can usually disregard the χ²-test, because it is the least
accurate of these tests when n is not very large, which is often the case
in practice.
5.7. Numerical Example
To test the idea that the true slope is b1⁰ = 0 we calculate (using
Equation 5.7)
Because our value of tb1 is larger than 2.201, its associated p-value is
less than 0.05. By convention, this is reported as
p = 0.0101. (5.38)
r       r²      F       νNum    νDen    p
0.683   0.466   9.617   1       11      0.0101
Because the value of tb0 is larger than 2.718, its associated p-value is
less than 0.01. By convention, this is reported as
F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)],   (5.51)
where
r² = var(ŷ)/var(y) = 0.511/1.095 = 0.466.   (5.52)
With p = 2 parameters and n = 13 data points, this gives
F(1, 11) = (0.466/1) / [(1 − 0.466)/11] = 9.617.   (5.53)
p = 0.0101. (5.54)
This agrees with the p-value from the t-test for the slope in Equation 5.38,
as it should do. This is because (in the case of simple regression)
the overall model fit is determined by a single quantity r2 , which is
determined by the slope b1 .
5.8. Python Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
ch5Python.py Statistical significance of regression.
"""
import numpy as np
from scipy import stats
import statsmodels.api as sm

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])
n = len(y)
numparams = 2                                        # slope and intercept.
xmean = np.mean(x)
xzm = x - xmean                                      # zero-mean (centred) x values.
# Least squares fit (Equations 2.15 and 2.17).
slope = np.sum(xzm*(y-np.mean(y))) / np.sum(xzm**2)  # 0.764
intercept = np.mean(y) - slope*xmean                 # 3.225
yhat = slope*x + intercept
r2 = np.var(yhat)/np.var(y)                          # 0.466

# SLOPE
# Find sem of slope.
num = ( (1/(n-numparams)) * sum((y-yhat)**2) )**0.5 # 0.831.
den = sum((x-xmean)**2)**0.5 # 3.373
semslope = num/den
print('semslope = %6.3f.' % (semslope))
tslope = slope/semslope                              # 3.101
pslope = 2*stats.t.sf(abs(tslope), df=n-numparams)   # 0.0101
# INTERCEPT
# Find sem of intercept.
a = ( (1/(n-numparams)) * sum((y-yhat)**2) )**0.5 # 0.831
b = ( (1/n) + xmean**2 / sum( xzm**2 ))**0.5 # 0.791
semintercept = a * b # 0.658
print('\na = %6.3f\nb = %6.3f\nsemintercept = a*b = %6.3f.'
      % (a, b, semintercept))

# Find F ratio.
A = r2 / (numparams-1)
B = (1-r2) / (n-numparams)
F = A/B # 9.617
print("F ratio = %0.4f." % F)
# END OF FILE.
Chapter 6
Maximum Likelihood Estimation
6.1. Introduction
ŷi = b1 x i + b0 , (6.1)
(this is the same as Equation 1.1). As shown in Figure 6.1 (see also
Figure 1.3), the vertical distance between the measured value yi and
the value ŷi predicted by the ‘straight line’ model is represented as
ηi = yi − ŷi,   (6.2)
p(ηi) = ki e^{−ηi²/(2σi²)},   (6.3)
where ki = 1/√(2π σi²) is a normalising constant which ensures that the
area under the Gaussian distribution curve sums to 1. More accurately,
p(ηi) is a probability density, but we need not worry about such subtle
distinctions here. The shape of a Gaussian distribution with σ = 1 is
shown in Figure 4.2.
If we substitute Equations 6.1 and 6.2 into Equation 6.3, we obtain
p(ηi) = ki e^{−[yi − (b1 xi + b0)]²/(2σi²)}.   (6.4)
But because the values of b1 and b0 are determined by the model, while
the values of xi are fixed, the probability p(ηi) is also a function of yi,
which represents the probability of observing yi, so we can replace p(ηi)
with p(yi ) and write
p(yi) = ki e^{−[yi − (b1 xi + b0)]²/(2σi²)}.   (6.5)
Figure 6.1: The vertical distance between each measured value yi and the
value ŷi predicted by the straight line model is ηi, shown for four data points
here. All measured values contain noise, and the noise of each data point is
assumed to have a Gaussian distribution; each distribution has a different
standard deviation, indicated by the width of the vertical Gaussian curves.
6.2. The Likelihood Function
p(η1, η2, . . . , ηn|b1, b0) = p(η1|b1, b0) × p(η2|b1, b0) × · · · × p(ηn|b1, b0).   (6.10)
But since p(ηi|b1, b0) = p(yi|b1, b0), Equation 6.10 can be expressed as
y = (y1 , y2 , . . . , yn ), (6.12)
where (as a reminder) ki = 1/(σi √(2π)). In words, p(y|b1, b0) is
interpreted as ‘the conditional probability that the variable y adopts the
set of values y, given the parameter values b1 and b0 ’. Equation 6.15 is
called the likelihood function because its value varies as a function of the
parameters b1 and b0 . The parameter values that make the observed
data most probable are the maximum likelihood estimate (MLE).
6.3. Likelihood and Least Squares Estimation
For reasons that will become clear, it is customary to take the logarithm
of quantities like those in Equation 6.15. As a reminder, given two
positive numbers m and n, log(m ⇥ n) = log m + log n. Accordingly,
the log likelihood of b1 and b0 is
log p(y|b1, b0) = Σ_{i=1}^n log[1/(σi √(2π))] − Σ_{i=1}^n [yi − (b1 xi + b0)]²/(2σi²).   (6.16)
Key point: If all data points are equally reliable (i.e. if all
i values are the same) then the maximum likelihood estimate
(MLE) of model parameter values is identical to the least squares
estimate (LSE) of those parameter values.
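This key point can be checked directly: minimising the negative log likelihood with equal σi values gives the same parameter estimates as least squares. The sketch below (an illustration, not the book's code) uses scipy.optimize.minimize with an assumed common noise standard deviation of 1:

import numpy as np
from scipy.optimize import minimize

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])
sigma = 1.0                                   # same noise sd for every data point (assumed).

def negloglik(b):
    b1, b0 = b
    resid = y - (b1*x + b0)
    return np.sum(0.5*resid**2/sigma**2 + np.log(sigma*np.sqrt(2*np.pi)))

mle = minimize(negloglik, x0=[0.0, 0.0]).x
print(mle)                                    # approx. [0.764, 3.225], the LSE values.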
Chapter 7
Multivariate Regression
7.1. Introduction
b1 = Δŷ/Δx1   (7.3)
b2 = Δŷ/Δx2   (7.4)
Figure 7.1: Given a data set consisting of x1 , x2 and y values, where the
variable y is thought to depend on the variables x1 and x2 , we can fit a plane
to the data, defined by ŷi = b1 xi1 + b2 xi2 + b0 , where b1 is the gradient with
respect to x1 (i.e. the slope of the plane along the wall formed by the x1 - and
y-axes), b2 is the gradient with respect to x2 (the slope of the plane along
the wall formed by the x2 - and y-axes), and b0 is the height of the plane at
(x1 , x2 ) = (0, 0) on the ground. Each vertical line joins a measured data point
yi to the point ŷi on the best fitting plane above the same location (xi1 , xi2 )
on the ground. Using the data in Table 7.1, the least squares estimates are
b1 = 0.966, b2 = 0.138 and b0 = 2.148.
7.3. Vector–Matrix Formulation
i    1    2    3    4    5    6    7    8    9    10   11   12   13
yi   3.34 4.97 4.15 5.40 5.21 4.56 3.69 5.86 4.58 6.94 5.57 5.62 6.87
xi1  1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
xi2  7.47 9.24 3.78 1.23 5.57 4.48 4.05 4.19 0.05 7.20 2.48 1.73 2.37
Table 7.1: Values of the dependent variable y and the two regressors x1 and x2.
And, just as simple regression finds the best fitting line by minimising
the sum of squared vertical distances between the line and the y data,
so multivariate regression finds the best fitting plane by minimising the
sum of squared vertical distances between the plane and the y data,
E = Σ_{i=1}^n (yi − ŷi)²   (7.5)
  = Σ_{i=1}^n [yi − (b1 xi1 + b2 xi2 + b0)]²,   (7.6)
Key point: Just as simple regression finds the best fitting line
by minimising the sum of squared vertical distances between
the line and the data, so multivariate regression finds the best
fitting plane by minimising the sum of squared vertical distances
between the plane and the data.
Each data point can be represented as a row vector xi = (xi1, xi2, 1),
where the first two elements of xi define a single location on the ground
plane. So Equation 7.2 can be written in vector–matrix form as
ŷi = (xi1, xi2, 1)(b1, b2, b0)ᵀ,   (7.9)
or, defining b = (b1, b2, b0)ᵀ as the column vector of regression coefficients,
ŷi = xi b.   (7.10)
and the term in brackets on the right-hand side of Equation 7.11 can
be represented as the n × 3 matrix
X = [x1; x2; . . . ; xn],   (7.13)
where the ith row of X is the vector xi.
ŷ = Xb. (7.16)
7.4. Finding the Best Fitting Plane
E = yᵀy + bᵀXᵀXb − 2bᵀXᵀy,   (7.20)
where the nabla symbol ∇ is standard notation for the gradient operator
on a vector, i.e. ∇E = dE/db. Each element of the vector ∇E is
the derivative of E with respect to one regression coefficient, and at a
minimum of E every element of ∇E equals zero, so that
∇E = (0, 0, 0)ᵀ.   (7.22)
At a minimum this equals the zero vector in Equation 7.22, which yields
(XᵀX)b = Xᵀy.   (7.24)
b = (XᵀX)⁻¹Xᵀy,   (7.25)
where (XᵀX)⁻¹ is the 3 × 3 inverse of the matrix XᵀX.
From Equation 7.16 (ŷ = Xb), we obtain ŷ as
ŷ = X(XᵀX)⁻¹Xᵀy,   (7.26)
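Equation 7.25 translates directly into a few lines of numpy; the sketch below (an illustration using the data in Table 7.1; the chapter's full listing appears at the end of the chapter) computes the least squares estimates of the plane's coefficients:

import numpy as np

x1 = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
x2 = [7.47,9.24,3.78,1.23,5.57,4.48,4.05,4.19,0.05,7.20,2.48,1.73,2.37]
y  = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

X = np.column_stack([x1, x2, np.ones(len(y))])   # n x 3 design matrix.
b = np.linalg.inv(X.T @ X) @ (X.T @ y)           # Equation 7.25.
print(b)                                         # approx. [0.966, 0.138, 2.148].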
Multicollinearity
Degrees of Freedom
7.5. Statistical Significance
r² = var(ŷ)/var(y).   (7.28)
F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)].   (7.31)
tb1(ν) = b1/σ̂b1,   (7.32)
σ̂X² = σ̂η² (XᵀX)⁻¹   (7.33)
    = diag(σ̂b1², σ̂b2², σ̂b0²),   (7.34)
σ̂η² = E/(n − p),   (7.35)
7.6. How Many Regressors?
In practice, the quality of the fit is assessed using the extra sum-of-
squares method, also known as the partial F -test. In essence, this
consists of assessing the extra sum-of-squares SSExp (see Equation 3.19)
accounted for by the regressors when the number of regressors is
increased. As the number of regressors is increased, the proportion
of the total sum of squared errors that is explained by the regressors
increases. For example, if we consider only one regressor, i.e. p = 2
parameters (slope and intercept), the predicted value of yi obtained
From Equation 3.19, the sum of squared errors explained by the regressor
xi1 is
SSExp(bred) = Σ_{i=1}^n [ŷi(bred) − ȳ]²,   (7.39)
where the two slopes b1 and b2 and the intercept b0 of the full model are
represented as b = (b1, b2, b0)ᵀ, such that SSExp(b) is the corresponding
explained sum of squared errors for the full model.
To test the hypothesis that b2 equals zero (i.e. that removing x2 from
the full model does not change the explained sum of squared errors),
we calculate the F -ratio
7.7. Numerical Example
Finding the Best Fitting Plane. The best fitting plane for the data
in Table 7.1 is shown in Figure 7.1. The least squares estimates of the
regression coefficients are
This implies that the regressor x1 is only 0.864/0.338 = 2.56 times more
influential than x2 on the value of y.
r² = var(ŷ)/var(y) = 0.600/1.095 = 0.548,   (7.48)
so the multiple correlation coefficient is r = √0.548 = 0.740.
The statistical significance of the multiple correlation coefficient
is assessed using the F -ratio (see Section 5.6). From Equation 7.31
(repeated below), the F -ratio of the coefficient of determination with
numerator degrees of freedom p − 1 = 3 − 1 = 2 and denominator degrees
of freedom n − p = 13 − 3 = 10 is
F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)]   (7.49)
             = (0.548/2) / [(1 − 0.548)/10]   (7.50)
             = 6.063,   (7.51)
p = 0.019. (7.52)
As this is less than 0.05, the coefficient of determination (i.e. the overall
fit) is statistically significant.
tb1 = b1/σ̂b1 = 0.966/0.281 = 3.433,   (7.53)
tb2 = b2/σ̂b2 = 0.138/0.103 = 1.344,   (7.54)
tb0 = b0/σ̂b0 = 2.148/1.022 = 2.102,   (7.55)
Table 7.2: Results of multivariate regression analysis of the data in Table 7.1.
so the F -ratio for testing the hypothesis that b2 = 0 becomes
F(1, 10) = (1.162/1) / (6.436/10)   (7.60)
        = 1.805,   (7.61)
Reference.
Gujarati DN (2019) The Linear Regression Model, Sage Publications.
7.8. Python Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
ch7Python.py. Multivariate regression.
This is demonstration code, so it is transparent but inefficient.
"""
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from scipy import stats
import warnings # python complains about small n, so turn off warnings.
warnings.filterwarnings('ignore')
x1 = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
x2 = [7.47,9.24,3.78,1.23,5.57,4.48,4.05,4.19,0.05,7.20,2.48,1.73,2.37]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]
###############################
# FULL model. Repeat twice:
# 1) by hand (vector-matrix) then 2) check using standard library.
###############################
ymean = np.mean(y)
n = len(y)
ones = np.ones(n)                      # 1 x n vector
Xtr = np.array([x1, x2, ones])         # 3 x n matrix
X = np.transpose(Xtr)                  # n x 3 matrix
y = np.transpose(y)                    # 1 x n vector
# Least squares estimates b = (X'X)^-1 X'y (Equation 7.25).
params = np.linalg.inv(Xtr @ X) @ (Xtr @ y)
yhat = X @ params                      # predicted values.
b0 = params[2] # 2.148
b1 = params[0] # 0.966
b2 = params[1] # 0.138
# PLOT DATA.
fig = plt.figure(1)
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(X[:,0], X[:,1], y, marker=’.’, color=’red’)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
SSNoiseFULL = sum((yhat-y)**2)
SSExplainedFULL = sum((yhat-ymean)**2)  # explained sum of squares, full model.
###############################
# REDUCED model. Repeat twice:
# 1) by hand (vector-matrix) then 2) check using standard library.
###############################
XREDtr = np.array([x1, ones])
XRED = np.transpose(XREDtr)
paramsRED = np.linalg.inv(XREDtr @ XRED) @ (XREDtr @ y)
yhatRED = XRED @ paramsRED
SSExplainedRED = sum((yhatRED-ymean)**2)  # explained sum of squares, reduced model.
###############################
# Extra sum of squares method (partial F-test). Repeat twice:
# 1) by hand (vector-matrix) then 2) check using standard library.
###############################
# 1) Vector-matrix: Results of extra sum of squares method.
print(’\nVector-matrix: Results of extra sum of squares method:’)
dofDiff = 1 # Difference in dof between full and partial model.
numparamsFULL = 3 # params in full model.
num = (SSExplainedFULL - SSExplainedRED) / dofDiff
den = SSNoiseFULL / (n-numparamsFULL)
Fpartial = num / den
print("Fpartial = %0.3f" % Fpartial)
p_valuepartial = stats.f.sf(Fpartial, dofDiff, n-numparamsFULL)
print("p_valuepartial (vector-matrix) = %0.3f" % p_valuepartial) # 0.209
Chapter 8
Weighted Linear Regression
log p(y|b1, b0) = Σ_{i=1}^n log[1/(σi √(2π))] − (1/2) Σ_{i=1}^n [yi − (b1 xi + b0)]²/σi².   (8.1)
This log likelihood is maximised by the parameter values that minimise the weighted sum of squared errors
E = Σ_{i=1}^n [yi − (b1 xi + b0)]²/σi²   (8.2)
  = 2 Σ_{i=1}^n log[1/(σi √(2π))] − 2 log p(y|b1, b0).   (8.3)
i    1    2    3    4    5    6    7    8    9    10   11   12   13
xi   1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
yi   3.34 4.97 4.15 5.40 5.21 4.56 3.69 5.86 4.58 6.94 5.57 5.62 6.87
σi   0.09 0.15 0.24 0.36 0.50 0.67 0.87 1.11 1.38 1.68 2.03 2.41 2.83
Table 8.1: Values of salary xi and measured height yi as in Table 1.1, where
each value of yi has a different standard deviation σi.
now ‘discounted’ by the noise variance σi² associated with the measured
value yi . The values of b1 and b0 obtained by minimising E are called
the weighted least squares estimates (WLSE) of the slope and intercept.
Figure 8.1: Weighted versus simple regression using the data from Table 8.1.
Weighted regression yields a slope of b1 = 1.511 and an intercept of b0 = 2.122
(solid line); the length of each vertical line is twice the standard deviation of
the respective data point. For comparison, simple regression assumes that all
data points have the same standard deviation, which yields b1 = 0.764 and
b0 = 3.22 (dashed line).
8.3. Vector–Matrix Formulation
(XᵀWX)b = XᵀWy,   (8.10)
b = (XᵀWX)⁻¹XᵀWy,   (8.11)
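Equation 8.11 can be evaluated directly in numpy; the sketch below (an illustration using the data in Table 8.1, not the book's own listing) reproduces the weighted least squares estimates quoted in Figure 8.1:

import numpy as np

x   = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y   = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])
sds = np.array([0.09,0.15,0.24,0.36,0.50,0.67,0.87,1.11,1.38,1.68,2.03,2.41,2.83])

W = np.diag(1/sds**2)                            # weight matrix.
X = np.column_stack([x, np.ones(len(y))])        # design matrix.
b = np.linalg.inv(X.T @ W @ X) @ (X.T @ W @ y)   # Equation 8.11.
print(b)                                         # approx. [1.511, 2.122].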
Assessing the Overall Model Fit. The fit of the model to the
data can be assessed using the F -ratio, which is defined as
F(p − 1, n − p) = [rw²/(p − 1)] / [(1 − rw²)/(n − p)],   (8.12)
8.4. Statistical Significance
The normalised weights sum to 1, Σ_{i=1}^n vi = 1. As in Equation 3.11,
the total error is the sum of two subsidiary error terms,
yi − ȳw = (yi − ŷi) + (ŷi − ȳw),   (8.15)
where (yi − ŷi) is the part of the total error not explained by the model
and (ŷi − ȳw) is the part of the total error that is explained by the
model. The total sum of squared errors is therefore also the sum of two
subsidiary sums of squared errors,
Σ_{i=1}^n (yi − ȳw)² = Σ_{i=1}^n (yi − ŷi)² + Σ_{i=1}^n (ŷi − ȳw)²,   (8.16)
i.e. the sum of squared errors not explained by the model plus the sum
of squared errors that is explained by the model.
However, if we take account of the fact that different data points have
different variances then these become weighted sums,
Σ_{i=1}^n (yi − ȳw)²/σi² = Σ_{i=1}^n (yi − ŷi)²/σi² + Σ_{i=1}^n (ŷi − ȳw)²/σi²,   (8.17)
i.e. the weighted total sum of squared errors is the weighted sum of
squared errors not explained by the model plus the weighted sum of
squared errors explained by the model. Recalling that 1/σi² is the
ith diagonal element Wii of the matrix W (Equation 8.5), this can be
written as
Σ_{i=1}^n (yi − ȳw)² Wii = Σ_{i=1}^n (yi − ŷi)² Wii + Σ_{i=1}^n (ŷi − ȳw)² Wii.   (8.18)
where y and ŷ are the vectors defined in Equations 7.17 and 7.12,
and ȳw is the column vector whose elements are all equal to ȳw in
Equation 8.13. As in Chapter 3, we define the second term in Equation
8.20, the weighted sum of squares explained by the model, as
and we define the first term in Equation 8.20, the noise or residual sum
of squares not explained by the model, as
rw² = SSExp/SST,   (8.24)
rw² = (SST − SSNoise)/SST   (8.26)
    = 1 − SSNoise/SST,   (8.27)
rw² = 1 − [(y − ŷ)ᵀW(y − ŷ)] / [(y − ȳw)ᵀW(y − ȳw)],   (8.28)
and the adjusted coefficient of determination is
rw,Adj² = 1 − [SSNoise/(n − p)] / [SST/(n − 1)].   (8.29)
Each of the standard deviations σ̂b1 and σ̂b0 is the square root of one
diagonal element of the covariance matrix
σ̂X² = σ̂η² (XᵀWX)⁻¹   (8.32)
    = diag(σ̂b1², σ̂b0²),   (8.33)
where
σ̂η² = E/(n − p).   (8.34)
F(p − 1, n − p) = [rw²/(p − 1)] / [(1 − rw²)/(n − p)].   (8.36)
8.5. Numerical Example
rw² = SSExp/SST = 56.390/124.746 = 0.452.   (8.37)
Substituting rw² = 0.452, p − 1 = 2 − 1 = 1 and n − p = 13 − 2 = 11
into Equation 8.36, we get
F(1, 11) = (0.452/1) / [(1 − 0.452)/11] = 9.075.   (8.40)
tb1(ν) = b1/σ̂b1   (8.41)
       = 1.511/0.502 = 3.012,   (8.42)
tb0(ν) = b0/σ̂b0   (8.43)
       = 2.122/0.623 = 3.405,   (8.44)
8.6. Python Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ch8Python.py. Weighted linear regression.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import sympy as sy
x = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]
sds = [0.09,0.15,0.24,0.36,0.50,0.67,0.87,1.11,1.38,1.68,2.03,2.41,2.83]
# Alternative data sets used during development (commented out so that
# the results match the values quoted in this chapter).
#x = [1.00,2.0,3.0]
#y = [0.008457848, 0.009728746, 0.010398164]
#sds = [0.001803007, 0.002398036, 0.003621527]
# TSP
#x = [1.00,2.0,3.0]
#y = [0.01374, 0.01717,0.01744]
#sds = [0.00082, 0.00069, 0.00181]
# rubisco
#y = [0.008455885, 0.009734122, 0.010397007]
#sds = [0.000571801, 0.000758832, 0.001143985]
# test
#y = [2,4,6]
#sds = [1,1,1]
###############################
# Weighted least squares model (WLS) using vector-matrix notation.
###############################
x, y, sds = np.array(x), np.array(y), np.array(sds)
# Convert vector w into diagonal matrix W.
w = 1 / (sds**2)
W = np.diag(w)
ones = np.ones(len(y))
Xtr = np.array([x, ones]) # 2 rows by 13 cols.
X = np.transpose(Xtr)     # 13 rows by 2 cols.
# Weighted least squares estimates b = (X'WX)^-1 X'Wy (Equation 8.11).
params = np.linalg.inv(Xtr @ W @ X) @ (Xtr @ W @ y)
b1 = params[0] # 1.511
b0 = params[1] # 2.122
print('slope b1 = %6.3f' % b1)
print('intercept b0 = %6.3f' % b0)
##############################################
# Compare to standard WLS library output.
##############################################
mod_wls = sm.WLS(y, X, weights=w )
res_wls = mod_wls.fit()
print('\n\nWeighted Least Squares LIBRARY MODEL SUMMARY')
print(res_wls.summary())
##############################################
# Estimate OLS model for comparison:
##############################################
res_ols = sm.OLS(y, X).fit()
print('\n\nOrdinary Least Squares LIBRARY MODEL SUMMARY')
print(res_ols.params)
print(res_wls.params)
print(res_ols.summary())
##############################################
# PLOT Ordinary LS and Weighted LS best fitting lines.
##############################################
fig = plt.figure(1)
fig.clear()
plt.plot(x, y, "o", label="Data")
plt.xlabel('salary')
plt.ylabel('height')
plt.show()
##############################################
# END OF FILE.
##############################################
Chapter 9
Nonlinear Regression
9.1. Introduction
ŷ = b0 e^{b1 x}.   (9.1)
where the observed value yi is a noisy version of the value ŷi given by
the model,
yi = ŷi + ηi.   (9.3)
There are two broad classes of models used to fit nonlinear functions to
data, as described in the next two sections.
9.2. Polynomial Regression
yi = b0 + b1 xi + b2 xi² + ηi,   (9.4)
where the model's prediction is
ŷi = b0 + b1 xi + b2 xi²,   (9.5)
and the parameters are collected in the vector
b = (b0 , b1 , b2 ). (9.6)
Notice that Equation 9.5 has the same form as Equation 7.2 used in
multivariate regression, the only difference being that each regressor (xi1
and xi2) in Equation 7.2 has been replaced by xi raised to a particular
power (xi¹ and xi²) here. Consequently, we can treat the polynomial
regression problem of Equation 9.5 as if it were a multivariate regression
problem with two regressors, xi and xi², as shown in Table 9.1.
i    1    2    3    4    5    6    7    8    9     10    11    12    13
yi   3.34 4.97 4.15 5.40 5.21 4.56 3.69 5.86 4.58  6.94  5.57  5.62  6.87
xi   1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00  3.25  3.50  3.75  4.00
xi²  1.00 1.56 2.25 3.06 4.00 5.06 6.25 7.56 9.00 10.56 12.25 14.06 16.00
Table 9.1: Values of the regressors x and x² and the dependent variable y.
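Because the quadratic model is linear in its parameters, it can be fitted with the same machinery as Chapter 7, simply by using xi and xi² as the two regressors; a minimal sketch (not the book's own listing) is:

import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

X = np.column_stack([np.ones(len(x)), x, x**2])  # regressors 1, x and x squared.
b = np.linalg.lstsq(X, y, rcond=None)[0]         # least squares estimates.
print(b)                                         # approx. [3.819, 0.212, 0.111].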
ŷi = b1 x i + b0 , (9.8)
the sum of squared errors is almost certainly greater than zero. At the
other extreme, if the regression function is a polynomial of high enough
order k,
then the fitted polynomial will pass through every data point exactly,
so that ŷi = yi for all i = 1, . . . , n and hence the sum of squared
errors equals zero. However, this does not necessarily mean that the
polynomial model provides an overall good fit to the data; we cannot
rely on only the sum of squared errors to assess the fit of the model.
In practice, the quality of the fit is assessed using the extra
sum-of-squares method introduced in Section 7.6. This consists of
Figure 9.1: The dashed line is the best fitting linear function shown previously
(b1 = 0.764, b0 = 3.22; rL² = 0.466). The solid curve is the best fitting
quadratic function (Equation 9.5 with b1 = 0.212, b2 = 0.111 and b0 = 3.819;
rNL² = 0.473). The improvement from rL² to rNL² is assessed using the extra
sum-of-squares method (Section 7.6), which yields p = 0.730, so the quadratic
model does not provide a significantly better fit.
Multicollinearity
9.3. Nonlinear Regression
ŷ = e^{b1 x}   (9.10)
log yi = b1 xi + ηi.   (9.12)
e^{b1 x} ≈ 1 + b1 x + b1²x²/2! + b1³x³/3! + · · · ,   (9.14)
9.4. Numerical Example
Finding the Best Fitting Plane. Using the data in Table 9.1, the
best fitting quadratic curve (Equation 9.5) is shown in Figure 9.1. The
least squares estimates of the regression coefficients are
b1 = 0.212,
b2 = 0.111,
b0 = 3.819.
r² = var(ŷ)/var(y) = 0.518/1.095 = 0.473,   (9.16)
so the correlation coefficient is r = √0.473 = 0.688.
The statistical significance of the multiple correlation coefficient is
assessed using the F -ratio (see Section 5.6). Using Equation 7.31
(repeated below), the F -ratio of the multiple correlation coefficient with
numerator degrees of freedom p − 1 = 3 − 1 = 2 and denominator degrees
Table 9.2: Results of polynomial regression analysis of the data in Table 9.1.
of freedom n − p = 13 − 3 = 10 is
F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)]   (9.17)
             = (0.473/2) / [(1 − 0.473)/10]   (9.18)
             = 4.49,   (9.19)
p = 0.041. (9.20)
so the F -ratio for testing the hypothesis that the coefficient b2 of the quadratic term is zero becomes
F(1, 10) = (0.095/1) / (7.502/10)   (9.25)
        = 0.127,   (9.26)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ch9Python.py. Nonlinear regression.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

x1 = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])
x2 = x1 ** 2                      # second regressor: salary squared
###############################
# Quadratic model. Repeated twice: 1) Vector-matrix, 2) Standard library.
###############################
ymean = np.mean(y)
ones = np.ones(len(y))            # n-vector of ones (intercept column)
Xtr = np.array([ones, x1, x2])    # 3 x n matrix
X = Xtr.T                         # n x 3 design matrix
# 1) Vector-matrix least squares estimate, b = (X'X)^(-1) X'y.
print('Vector-matrix estimates (b0, b1, b2):', np.linalg.solve(Xtr @ X, Xtr @ y))
# 2) Standard library (statsmodels OLS).
quadraticModel = sm.OLS(y, X).fit()
params = quadraticModel.params
b0Quadratic = params[0]           # 3.819
b1Quadratic = params[1]           # 0.212
b2Quadratic = params[2]           # 0.111
print('\nQUADRATIC MODEL SUMMARY')
print(quadraticModel.params)
print(quadraticModel.summary())
###############################
# Linear model (using standard library).
###############################
Xtr = [ones, x1]
X = np.transpose(Xtr)
linearModel = sm.OLS(y, X).fit()
print('\n\nLINEAR MODEL SUMMARY')
print(linearModel.params)
print(linearModel.summary())
params = linearModel.params
b0LINEAR = params[0] # 3.225
b1LINEAR = params[1] # 0.764
yhatLINEAR = b1LINEAR * x1 + b0LINEAR
###############################
# PLOT DATA.
###############################
fig = plt.figure(1)
fig.clear()
yhatQuadratic = b1Quadratic * x1 + b2Quadratic * x2 + b0Quadratic
plt.plot(x1, y, 'k*')                  # data (Table 9.1)
plt.plot(x1, yhatLINEAR, 'k--')        # best fitting line
plt.plot(x1, yhatQuadratic, 'k-')      # best fitting quadratic (Figure 9.1)
plt.xlabel('Salary, x (groats)')
plt.ylabel('Height, y (feet)')
###############################
# STANDARD LIBRARY: Results of extra sum of squares method.
###############################
# test hypothesis that the coefficient of x2 is zero
hypothesis = '(x2 = 0)'
f_test = quadraticModel.f_test(hypothesis)
print('\nResults of extra sum of squares method:')
print('F df_num = %.3f df_denom = %.3f'
      % (f_test.df_num, f_test.df_denom)) # 1, 10
print('F partial = %.3f' % f_test.fvalue) # 0.127
print('p-value (that x2=0) = %.3f' % f_test.pvalue) # 0.729
plt.show()
###############################
# END OF FILE.
###############################
Chapter 10

Bayesian Regression: A Summary
the most probable ones given the data we have. This brings us to a
vital, fundamental distinction between two frameworks: the frequentist
framework and the Bayesian framework.
In essence, whereas frequentist methods (like those used in the
previous chapters) answer questions regarding the probability of the
data, Bayesian methods answer questions regarding the probability
of a particular hypothesis. In the context of regression, frequentist
methods estimate the probability of the data if the slope were zero (null
hypothesis), whereas Bayesian methods estimate the probability of any
given slope based on the data, and can therefore estimate the most
probable slope. This apparently insignificant difference represents a
fundamental shift in perspective.
Subjective Priors. A common criticism of the Bayesian framework
is that it relies on prior distributions, which are often called subjective
priors. However, there is no reason in principle why priors should not
be objective. Indeed, the objective nature of Bayesian priors can be
guaranteed mathematically, via the use of reference priors.
A Guarantee. Before we continue, we should reassure ourselves
about the status of Bayes’ theorem: Bayes’ theorem is not a matter of
conjecture. By definition, a theorem is a mathematical statement that
has been proved to be true. A thorough treatment of Bayesian analysis
requires mathematical techniques beyond the scope of this introductory
text. Accordingly, the following pages present only a brief, qualitative
summary of the Bayesian framework.
Bayes' rule states that p(x|y) = p(y|x)p(x)/p(y). Each term in Bayes' rule has its own name: p(x|y) is the probability
of x given y, or the posterior probability; p(y|x) is the probability of y
given x, or the likelihood of x; p(x) is the prior probability of x; and
p(y), the probability of y, is the evidence or marginal likelihood. In
practice, the results of Bayesian and frequentist methods can be similar
because the influence of the prior becomes negligible for large data sets.
Bayesian Regression

In the context of regression, Bayes' rule is applied to the regression parameters b and the observed data y:

p(b|y) = \frac{p(y|b)\, p(b)}{p(y)},                         (10.2)

where p(b|y) is the posterior probability of the parameters given the data, p(y|b) is the likelihood, p(b) is the prior, and p(y) is the marginal likelihood.
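To make Equation 10.2 concrete, here is a small illustrative sketch (an assumption of this summary, not the book's code). It evaluates the posterior over a grid of candidate slopes b1 for the data of Table 1.1, assuming Gaussian noise with an assumed standard deviation, the intercept fixed at its least squares value, and a broad Gaussian prior on the slope.

import numpy as np

# Data from Table 1.1 (salary x, height y).
x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

b0 = 3.22                             # intercept fixed at its least squares value
sigma = 0.9                           # assumed noise standard deviation
b1grid = np.linspace(0.0, 1.5, 301)   # candidate slope values

# Log likelihood p(y|b1) under a Gaussian noise model.
resid = y[None, :] - (b1grid[:, None] * x[None, :] + b0)
loglike = -0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2

# Broad Gaussian prior p(b1) with mean 0 and standard deviation 10.
logprior = -0.5 * (b1grid / 10.0) ** 2

# Posterior p(b1|y) from Equation 10.2, normalised over the grid.
logpost = loglike + logprior
post = np.exp(logpost - logpost.max())
post = post / post.sum()
print('most probable slope:', b1grid[np.argmax(post)])   # close to 0.764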
Appendix A
Glossary
alternative hypothesis The working hypothesis, for example that the
slope of the best fitting line is not equal to zero. See null hypothesis.
average Usually understood to be the average of a sample taken from
a parent population, while the word mean is reserved for the average of
the parent population.
Bayesian analysis Statistical analysis based on Bayes’ theorem.
Bayes’ theorem The posterior probability of x given y is p(x|y) =
p(y|x)p(x)/p(y), where p(y|x) is the likelihood and p(x) is the prior
probability of x.
chi-squared test (or χ²-test) Commonly known as a goodness-of-fit test, this is less accurate than the F-test for regression analysis.
coefficient of determination The proportion of variance in a variable y that is accounted for by a regression model, r² = var(ŷ)/var(y).
confidence limit The 95% confidence limits of a sample mean ȳ are µ ± 1.96σ̂_ȳ, where µ is the population mean and σ̂_ȳ is the standard error.
correlation A normalised measure of the linear inter-dependence of two variables, which ranges between r = −1 and r = +1.
covariance An unnormalised measure of the linear inter-dependence of
two variables x and y, which varies with the magnitudes of x and y.
degrees of freedom The number of ways in which a set of values is free
to vary, given certain constraints (imposed by the mean, for example).
frequentist statistics The conventional framework of statistical analysis
used in this book. Compare with Bayesian analysis.
Gaussian distribution (or normal distribution) A bell-shaped curve defined by two parameters, the mean µ and the variance σ². A shorthand way of writing that a variable y has a Gaussian distribution is y ∼ N(µ, σ_y²).
heteroscedasticity The assumption that noise variances may not all be
the same.
homoscedasticity The assumption that all noise variances are the same.
inference Using data to infer the value, or distribution, of a parameter.
likelihood The conditional probability p(y|b1 ) of observing the data
value y given a parameter value b1 is called the likelihood of b1 .
Appendix B
Mathematical Symbols
σ² population variance.
σ̂² unbiased estimate of the population variance based on ν degrees of freedom.
b0 intercept of a line (i.e. the value of y at x = 0).
b1 slope of a line (i.e. the amount of change in y per unit increase in x).
cov(x, y) covariance of x and y, based on n pairs of values.
E sum of squared differences between the model and the data.
Ē mean squared error; the sum of squared differences divided by the number of data points n (Ē = E/n).
k number of regressors in a regression model, which excludes the
intercept b0 .
n number of observations in a sample (data set).
p number of parameters in a regression model, which includes k
regressors plus the intercept b0 (so p = k + 1). Also p-value.
r(x, y) correlation between x and y based on n pairs of values.
r² proportion of variance in data y accounted for by a regression model.
r_w² proportion of variance in data y accounted for by a regression model when each data point has its own variance.
sx standard deviation of x based on a sample of n values.
sy standard deviation of y based on a sample of n values.
SSExp (explained) sum of squared differences between the model-predicted values ŷ and the mean ȳ.
SSNoise (noise, or unexplained) sum of squared differences between the model-predicted values ŷ and the data y (the same as E).
SST (total) sum of squared differences between the data y and the mean ȳ; SST = SSExp + SSNoise.
var variance based on a sample of n values.
V n × n covariance matrix in which the ith diagonal element σ_i² is the variance of the ith data point y_i.
W weight matrix, W = V⁻¹, in which the ith diagonal element is 1/σ_i².
x vector of n observed values of x: x = (x1 , x2 , . . . , xn ).
x position along the x-axis.
y position along the y-axis.
y vector of n observed values of y: y = (y1 , y2 , . . . , yn ).
z position along the z-axis. Also z-score.
Appendix C

Vector and Matrix Tutorial

y = b_1 x_1 + b_2 x_2.                                       (C.6)
The reason for having row and column vectors is that it is often necessary
to combine several vectors into a single matrix, which is then used to
multiply a single column vector x, defined here as
x = (x_1, x_2)^T.                                            (C.9)
In such cases, we need to keep track of which vectors are row vectors
and which are column vectors. If we redefine b as a column vector,
b = (b_1, b_2)^T, then the inner product b · x can be written as
y = b^T x                                                    (C.10)
  = (b_1, b_2) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}      (C.11)
  = b_1 x_1 + b_2 x_2.                                       (C.12)
If several such column vectors are combined into a single matrix X, then all of the corresponding inner products are obtained with one multiplication,

y = b^T X.
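A hypothetical numpy sketch of Equations C.10 to C.12 and of the matrix form above (not part of the book's own code):

import numpy as np

b = np.array([2.0, 3.0])                 # column vector b = (b1, b2)^T
x = np.array([4.0, 5.0])                 # column vector x = (x1, x2)^T

# Inner product y = b^T x = b1*x1 + b2*x2 (Equations C.10 to C.12).
print(b @ x)                             # 2*4 + 3*5 = 23.0

# Combining several column vectors into a matrix X, y = b^T X gives one
# inner product per column of X.
X = np.array([[4.0, 1.0, 0.0],
              [5.0, 2.0, 6.0]])          # each column is one vector x
print(b @ X)                             # [23.  8. 18.]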
Appendix D

Setting Means to Zero
Consider the effect of subtracting the mean from each variable, which defines two new variables

x'_i = x_i − x̄,                                              (D.1)
y'_i = y_i − ȳ,                                              (D.2)

so that the regression model for the transformed variables is

y'_i = b_1 x'_i + b'_0 + η'_i.                               (D.4)
Since y'_i − (b_1 x'_i + b'_0) = η'_i, we have \sum_{i=1}^{n} η'_i = 0, so that

\bar{η}' = \frac{1}{n} \sum_{i=1}^{n} η'_i = 0.              (D.10)

This was also stated in Equation 2.4 for the original variables.
Proving b'_0 Equals Zero. Taking the means of all terms in Equation D.4, we have

\bar{y}' = b_1 \bar{x}' + b'_0 + \bar{η}'.                   (D.11)

Since \bar{y}' = \bar{x}' = \bar{η}' = 0, it follows that b'_0 = 0.
Figure D.1: Setting the means to zero. The upper right dots represent the original data, for which the mean of x is x̄ = 2.50 and the mean of y is ȳ = 5.13 (the point (x̄, ȳ) is marked with a diamond). The lower left circles represent the transformed data, with zero mean. Two new variables x' and y' are obtained by translating x by x̄ and translating y by ȳ, so that x' = x − x̄ and y' = y − ȳ; both x' and y' have mean zero, as indicated by the axes shown as dashed lines. The slope of the best fitting line for regressing y on x is the same as the slope of the best fitting line for regressing y' on x', and for the transformed variables the y'-intercept is at b'_0 = 0.
Proving cov(ŷ, η) = 0. Note that translating the data has no effect on the covariance, so that cov(ŷ, η) = cov(ŷ', η'), which is

cov(ŷ', η') = \frac{1}{n} \sum_{i=1}^{n} (ŷ'_i − \bar{ŷ}')(η'_i − \bar{η}'),   (D.13)

where \bar{ŷ}' = \bar{η}' = 0, so that

cov(ŷ', η') = \frac{1}{n} \sum_{i=1}^{n} ŷ'_i η'_i.                            (D.14)
From Section 2.4, the best fitting line minimises the mean squared error
E = \frac{1}{n} \sum_{i=1}^{n} (y'_i − ŷ'_i)^2,              (D.15)
and we know that ŷ'_i = b_1 x'_i (because b'_0 = 0). At a minimum, the derivative with respect to b_1 is zero:

\frac{\partial E}{\partial b_1} = −\frac{2}{n} \sum_{i=1}^{n} x'_i (y'_i − ŷ'_i) = 0,   (D.16)

where

(y'_i − ŷ'_i) = η'_i.                                        (D.17)

Substituting Equation D.17 into Equation D.16 gives \sum_{i=1}^{n} x'_i η'_i = 0, and because ŷ'_i = b_1 x'_i it follows that \sum_{i=1}^{n} ŷ'_i η'_i = 0. From Equation D.14 this means that cov(ŷ', η') = 0, and therefore

cov(ŷ_i, η_i) = 0.                                           (D.20)
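As a numerical check (an illustrative sketch, not from the book), the data of Table 1.1 can be used to confirm that centring x and y leaves the fitted slope unchanged, sets the fitted intercept to zero, and gives cov(ŷ, η) = 0 up to rounding error.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

# Fit to the original variables.
b1, b0 = np.polyfit(x, y, 1)
print(b1, b0)                            # approximately 0.764 and 3.22

# Fit to the zero-mean (centred) variables.
xp, yp = x - x.mean(), y - y.mean()
print(np.polyfit(xp, yp, 1))             # same slope, intercept essentially zero

# cov(yhat, eta) is zero apart from rounding error (Equation D.20).
yhat = b1 * x + b0
eta = y - yhat
print(np.mean((yhat - yhat.mean()) * (eta - eta.mean())))   # ~0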
Appendix E
Key Equations
Standard deviation.

s_y = \left( \frac{1}{n} \sum_{i=1}^{n} (y_i − ȳ)^2 \right)^{1/2}.   (E.2)
Covariance.

cov(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i − x̄)(y_i − ȳ),   (E.3)

or, equivalently,
The value of t for the difference between b_1 and the null hypothesis value b_1^0 = 0 is t = (b_1 − b_1^0)/σ̂_{b_1} = b_1/σ̂_{b_1}, where σ̂_{b_1} is the standard error of the slope. The intercept of the best fitting line is

b_0 = ȳ − b_1 x̄.                                             (E.8)
The standard error of the intercept b_0 is

σ̂_{b_0} = σ̂_η × \left( \frac{1}{n} + \frac{x̄^2}{\sum_{i=1}^{n} (x_i − x̄)^2} \right)^{1/2}.   (E.10)
The F-ratio for the multiple correlation coefficient is

F(p − 1, n − p) = \frac{r^2/(p − 1)}{(1 − r^2)/(n − p)}.     (E.11)
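As a brief worked example (an assumed sketch, not the book's code), these key quantities can be evaluated for the data of Table 1.1:

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

sy = np.sqrt(np.mean((y - y.mean()) ** 2))           # standard deviation (E.2)
sx = np.sqrt(np.mean((x - x.mean()) ** 2))
covxy = np.mean((x - x.mean()) * (y - y.mean()))     # covariance (E.3)

b1 = covxy / sx ** 2                                 # slope of best fitting line
b0 = y.mean() - b1 * x.mean()                        # intercept (Equation E.8)
print(b1, b0)                                        # approximately 0.764 and 3.22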
Index
F-ratio, 57, 79, 82, 92, 126
F-test, 49, 57
F-test (in relation to t-test), 57
frequentist statistics, 112, 115
noise, 2, 8, 18, 21, 23, 30, 37, 51, 54, 65, 66, 73, 102, 105, 116
nonlinear regression, 101
normal distribution, 33, 65, 66, 116
normalised, 20, 78, 93, 115
normalised Gaussian, 32, 37
null hypothesis, 39, 46, 50, 104, 112, 116
one-tailed test, 39, 116
p-value, 8, 37, 42, 44, 54, 57, 104, 116
parameter, 2, 9, 49, 52, 57, 67, 78, 80, 95, 112, 116
parent population, 18, 30, 32, 36, 45, 116
partial F-test, 80
partitioning variance, 21, 93
population mean, 30, 32, 39, 42, 44, 117
population variance, 18, 36, 118
probability density, 66
t-test, 42, 45
t-test (in relation to F-test), 49, 57
theorem, 32, 112, 116
transpose, 76, 120
two-tailed test, 40, 44, 54, 116
variance, 17, 20, 22, 25, 36, 51, 56, 66, 69, 79, 91, 93, 116
vector, 68, 91, 102, 113, 116, 119
weighted least squares estimate, 90
weighted linear multivariate regression, 92
weighted linear regression, 7, 89
weighted mean, 29, 37, 50, 93
z-score, 37, 41, 116
z-test, 41