
Linear Regression With Python
A Tutorial Introduction to the Mathematics of Regression Analysis

Linear regression is the workhorse of data analysis. It is the first step, and
often the only step, in fitting a simple model to data. This brief book explains
the essential mathematics required to understand and apply regression analysis.
The tutorial style of writing, accompanied by over 30 diagrams, offers a visually
intuitive account of linear regression, including a brief overview of nonlinear
and Bayesian regression. Hands-on experience is provided in the form of
numerical examples, included as Python code at the end of each chapter, and
implemented online as Python and Matlab code. Supported by a comprehensive
glossary and tutorial appendices, this book provides an ideal introduction to
regression analysis.

Features
✓ Informal writing style
✓ Key points summarised in text boxes
✓ Intuitive diagrams illustrate key concepts
✓ Comprehensive Glossary explains technical terms
✓ Python code is included at the end of each chapter
✓ Online demonstration code in Python and Matlab

James V Stone is an Honorary Associate Professor at the
University of Sheffield, England.

ISBN 9781916279186
Sebtel Press
Books by James V Stone
Linear Regression With Matlab: A Tutorial Introduction

Linear Regression With Python: A Tutorial Introduction

Information Theory: A Tutorial Introduction (2nd Edition)

Principles of Neural Information Theory

The Fourier Transform: A Tutorial Introduction

The Quantum Menagerie: A Tutorial Introduction

Artificial Intelligence Engines: A Tutorial Introduction

A Brief Guide to Artificial Intelligence

Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis

Bayes’ Rule With Matlab: A Tutorial Introduction

Bayes’ Rule With Python: A Tutorial Introduction

Bayes’ Rule With R: A Tutorial Introduction

Vision and Brain

Sample chapters and computer code:


https://jamesstone.sites.sheffield.ac.uk/books

James Stone, Honorary Associate Professor, University of Sheffield, UK.


Linear Regression
With Python

A Tutorial Introduction to the Mathematics

of Regression Analysis

James V Stone
Title: Linear Regression With Python
Author: James V Stone

©2022 Sebtel Press

All rights reserved. No part of this book may be reproduced or


transmitted in any form without written permission from the author.
The author asserts his moral right to be identified as the author of this
work in accordance with the Copyright, Designs and Patents Act 1988.

First Edition, 2022.


Typeset in LaTeX 2ε.
First printing.

ISBN 9781916279186
These are the tears un-cried for you.
So let the oceans
Weep themselves dry.

For Bob, my brother.


Contents
Preface i
1. What is Linear Regression? 1
1.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. The Equation of a Line . . . . . . . . . . . . . . . . . . 1
1.3. The Best Fitting Line . . . . . . . . . . . . . . . . . . . 5
1.4. Regression and Causation . . . . . . . . . . . . . . . . . 7
1.5. Regression: A Summary . . . . . . . . . . . . . . . . . . 8
2. Finding the Best Fitting Line 9
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2. Exhaustive Search . . . . . . . . . . . . . . . . . . . . . 9
2.3. Onwards and Downwards . . . . . . . . . . . . . . . . . 10
2.4. The Normal Equations . . . . . . . . . . . . . . . . . . . 11
2.5. Numerical Example . . . . . . . . . . . . . . . . . . . . . 15
2.6. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 16
3. How Good is the Best Fitting Line? 17
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2. Variance and Standard Deviation . . . . . . . . . . . . . 17
3.3. Covariance and Correlation . . . . . . . . . . . . . . . . 18
3.4. Partitioning the Variance . . . . . . . . . . . . . . . . . 21
3.5. The Coefficient of Determination . . . . . . . . . . . . . 25
3.6. Numerical Example . . . . . . . . . . . . . . . . . . . . . 26
3.7. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 27
4. Statistical Significance: Means 29
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2. The Distribution of Means . . . . . . . . . . . . . . . . . 29
4.3. Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . 34
4.4. Estimating Variance . . . . . . . . . . . . . . . . . . . . 36
4.5. The p-Value . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6. The Null Hypothesis . . . . . . . . . . . . . . . . . . . . 39
4.7. The z-Test . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.8. The t-Test . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.9. Numerical Example . . . . . . . . . . . . . . . . . . . . . 45
4.10. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 47
5. Statistical Significance: Regression 49
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2. Statistical Significance . . . . . . . . . . . . . . . . . . . 49
5.3. Statistical Significance: Slope . . . . . . . . . . . . . . . 50
5.4. Statistical Significance: Intercept . . . . . . . . . . . . . 55
5.5. Significance Versus Importance . . . . . . . . . . . . . . 55
5.6. Assessing the Overall Fit . . . . . . . . . . . . . . . . . . 56
5.7. Numerical Example . . . . . . . . . . . . . . . . . . . . . 58
5.8. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 62
6. Maximum Likelihood Estimation 65
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2. The Likelihood Function . . . . . . . . . . . . . . . . . . 67
6.3. Likelihood and Least Squares Estimation . . . . . . . . 69
7. Multivariate Regression 71
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.2. The Best Fitting Plane . . . . . . . . . . . . . . . . . . . 71
7.3. Vector–Matrix Formulation . . . . . . . . . . . . . . . . 73
7.4. Finding the Best Fitting Plane . . . . . . . . . . . . . . 76
7.5. Statistical Significance . . . . . . . . . . . . . . . . . . . 78
7.6. How Many Regressors? . . . . . . . . . . . . . . . . . . . 80
7.7. Numerical Example . . . . . . . . . . . . . . . . . . . . . 82
7.8. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 86
8. Weighted Linear Regression 89
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2. Weighted Sum of Squared Errors . . . . . . . . . . . . . 89
8.3. Vector–Matrix Formulation . . . . . . . . . . . . . . . . 91
8.4. Statistical Significance . . . . . . . . . . . . . . . . . . . 92
8.5. Numerical Example . . . . . . . . . . . . . . . . . . . . . 96
8.6. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 98
9. Nonlinear Regression 101
9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.2. Polynomial Regression . . . . . . . . . . . . . . . . . . . 102
9.3. Nonlinear Regression . . . . . . . . . . . . . . . . . . . . 104
9.4. Numerical Example . . . . . . . . . . . . . . . . . . . . . 106
9.5. Python Code . . . . . . . . . . . . . . . . . . . . . . . . 108
10. Bayesian Regression: A Summary 111
10.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 111
10.2. Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . 112

A. Glossary 115

B. Mathematical Symbols 117

C. A Vector and Matrix Tutorial 119

D. Setting Means to Zero 121

E. Key Equations 125

Index 127
Preface

This book is intended to provide an account of linear regression that is


both informal and mathematically rigorous. A large number of diagrams
have been included to help readers gain an intuitive understanding of
regression, expressed in terms of geometry.

Who Should Read This Book? The material in this book should be
accessible to anyone with knowledge of basic mathematics. The tutorial
style adopted ensures that readers who are prepared to put in the effort
will be rewarded with a solid grasp of regression analysis.

Online Computer Code. Python and Matlab computer code for the
numerical example at the end of each chapter can be downloaded from
https://github.com/jgvfwstone/Regression.

Corrections. Please send any corrections to [email protected].


A complete list of corrections is available on the book website at
https://jim-stone.staff.shef.ac.uk/Regression.

Acknowledgements. Thanks to William Gruner for permission to use


his Matlab code for the F and t distributions. Thanks to Nikki Hunkin
for valuable feedback, sound advice, and tea. Finally, thanks to Alice
Yew for meticulous copyediting and proofreading.

James V Stone.
Sheffield, England.
One of the first things taught in introductory statistics
textbooks is that correlation is not causation. It is also
one of the first things forgotten.

Thomas Sowell (1930–).


Chapter 1

What is Linear Regression?

1.1. Introduction

Linear regression is the workhorse of data analysis. It is the first step,


and often the only step, in fitting a simple model to data. In essence,
linear regression is a method for fitting a straight line to a set of data.
For example, suppose we have a vague suspicion that tall people have
higher salaries than short people. How can we test this hypothesis?
Well, we could obtain the salaries and heights of (for example) 13 people,
called a sample, and plot a graph with salary on one axis and height
on the other axis, as in Figure 1.1a. Clearly, the plotted points seem
to lie roughly on a straight line, and three plausible lines are shown
in Figure 1.1b. But how do we know which (if any) of these lines is
the best fitting line? Before exploring this question further, we will
summarise the mathematical definition of a line.

1.2. The Equation of a Line

If the relationship between height ŷ and salary x is essentially linear


then it can be described with the equation of a straight line:

ŷi = b1 x i + b0 , (1.1)

where the dependent variable ŷi is the height of the ith person in the
sample, as predicted by the independent variable xi , which is the salary
of the ith person in the sample. Already, we can see that a line is
defined by two parameters, its slope b1 and intercept b0 , as shown in
Figure 1.2 (the letters b0 and b1 are standard notation).
Human height, like most physical quantities, is affected by many
factors, so the measured or observed value of a person’s height is not
perfectly predicted by their salary. By convention, the observed height
of the ith individual is denoted by yi , whereas the height predicted
by fitting a straight line to the data is represented as ŷi , as shown in
Figure 1.3. The difference between the height predicted by a straight line
equation and the observed height is considered to be noise, represented
by the Greek letter eta:

ηi = yi − ŷi . (1.2)

Accordingly, the observed height is the predicted height plus noise,

yi = ŷi + ηi . (1.3)

For example, the line in Figure 1.2 has a slope of b1 = 0.764 and an
intercept of b0 = 3.22, so the predicted value of yi at xi is

ŷi = 0.764xi + 3.22. (1.4)

Deciding which physical variable (salary or height) should be the


dependent variable is discussed in Section 1.4.

Slope. The magnitude of the slope b1 specifies the steepness of the


line, and the sign of the slope indicates whether ŷ increases or decreases
with x. A positive slope means that ŷ increases from left to right (as in
Figure 1.2), whereas a negative slope means that ŷ decreases from left
to right.

i 1 2 3 4 5 6 7 8 9 10 11 12 13
xi 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
yi 3.34 4.97 4.15 5.40 5.21 4.56 3.69 5.86 4.58 6.94 5.57 5.62 6.87

Table 1.1: Values of salary xi (groats) and measured height yi (feet) for a
fictitious sample of 13 people.

2
1.2. The Equation of a Line

An intuitive understanding of the slope can be gained by considering


how a given change in x corresponds to a concomitant change in ŷ. If
we change x from x1 to x2 then the change in x is

Δx = x2 − x1 . (1.5)

The values of ŷ that correspond to x1 and x2 are

ŷ1 = b1 x1 + b0 , (1.6)
ŷ2 = b1 x2 + b0 , (1.7)

so the change in ŷ is
Δŷ = ŷ2 − ŷ1 , (1.8)

which can be rewritten as

Δŷ = (b1 x2 + b0 ) − (b1 x1 + b0 ). (1.9)

The intercept terms b0 cancel, so

Δŷ = b1 (x2 − x1 ) (1.10)
= b1 Δx. (1.11)

Dividing both sides by Δx yields the slope of the line,

b1 = Δŷ/Δx. (1.12)

Figure 1.1: (a) Scatter plot of the salary and height data in Table 1.1.
(b) Three plausible lines that seem to fit the data. A groat is an obsolete unit
of currency, which was worth four pennies in England.

In words, the slope b1 is the amount of change Δŷ in the predicted
value ŷ of y that corresponds to a change of Δx in x.
This is shown graphically in Figure 1.2, where the triangle has a
horizontal length of Δx and a vertical length of Δŷ. For example, if we
choose salary values x1 = 1 and x2 = 4 then the change in salary is

Δx = x2 − x1 (1.13)
= 4 − 1 (1.14)
= 3 groats. (1.15)

From Equation 1.4 the values of ŷ at x1 and x2 are ŷ1 = 3.984 and
ŷ2 = 6.276 (respectively), so the change in ŷ is

Δŷ = ŷ2 − ŷ1 (1.16)
= 6.276 − 3.984 (1.17)
= 2.29 feet, (1.18)

and therefore the slope of the line is

b1 = Δŷ/Δx (1.19)
= 2.29/3 (1.20)
= 0.764 feet/groat. (1.21)

Thus, for a salary increase of one groat, height increases by 0.764 feet.

Figure 1.2: A line drawn through the data points. The line’s slope is b1 = 0.764
feet/groat, and the intercept at a salary of zero groats is b0 = 3.22 feet, so
the equation of the line is ŷ = 0.764x + 3.22.

Intercept. The intercept b0 specifies the value of ŷ where the line meets
the ordinate (vertical) axis, that is, where the abscissa (horizontal) axis
is zero, x = 0. The value b0 of the y-intercept can be obtained from
Equation 1.1 with b1 = 0.764 by setting ŷ1 = 3.984 and x1 = 1:

b0 = ŷ1 − b1 x1 (1.22)
= 3.984 − 0.764 × 1 (1.23)
= 3.22 feet. (1.24)

For our data, this gives the rather odd prediction that someone with a
salary of zero groats would have a height of 3.22 feet.

Key point: Given the equation of a straight line, ŷ = b1 x + b0 ,


the slope b1 is the amount of change in ŷ induced by a change
of one unit in x, and the intercept b0 is the value of ŷ at x = 0
(i.e. where the line meets the ordinate axis).
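As a quick check on the arithmetic above, the short Python sketch below (it is not part of the book's own code listings, and the function name yhat is just an illustrative choice) evaluates Equation 1.1 at two salary values and recovers the slope and intercept via Equations 1.12 and 1.22.

# Recover the slope and intercept of Equation 1.1 from two points on the line.
b1, b0 = 0.764, 3.22                  # slope (feet/groat) and intercept (feet)

def yhat(x):
    # Predicted height (Equation 1.1).
    return b1 * x + b0

x1, x2 = 1.0, 4.0                           # two salary values (groats)
slope = (yhat(x2) - yhat(x1)) / (x2 - x1)   # Equation 1.12: recovers 0.764
intercept = yhat(x1) - slope * x1           # Equation 1.22: recovers 3.22
print(slope, intercept)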

1.3. The Best Fitting Line

Given the data set of n = 13 pairs of xi and yi values in Table 1.1, how
should we go about finding the best fitting line for those data?
One possibility is to find the line that makes the sum of all squared
vertical differences between each value of yi and the line as small as
possible, as in Figure 1.3. It may seem arbitrary to look at the sum of
squared differences, but this has a sound statistical justification, as we
shall see in Chapter 6.
So, if we wish to find the line that minimises the sum of all squared
vertical differences then we had better write an equation that describes
this quantity. Recall that the vertical distance between an observed
value yi and the value ŷi predicted by a straight line equation is ηi
(Equation 1.2). Using Equations 1.1 and 1.2, if we sum ηi² over all n
data points then we have the sum of squared errors
E = Σ_{i=1}^{n} (yi − (b1 xi + b0 ))². (1.25)

The values of b1 and b0 that minimise this sum of squared errors E are
the least squares estimates (LSE) of the slope and intercept, respectively.
The method described so far is called simple linear regression, to
distinguish it from weighted linear regression (see next page).

Key point: The values of b1 and b0 that minimise the sum of


squared errors E are the least squares estimates (LSE) of the
slope and intercept.

Mean Squared Error. A commonly used measure of error is the mean


squared error (MSE), which is the average squared difference between
observed and predicted y values per data point:
Ē = (1/n) Σ_{i=1}^{n} (yi − ŷi )². (1.26)

Because Ē is just E divided by the number of data points n, the values
of b1 and b0 that minimise E also minimise Ē. Be aware that the
mean squared error is sometimes defined with n − 1 rather than n as
the denominator in Equation 1.26, so it is important to check which
definition is being used in any given context. The philosophy adopted
in this book is that the word mean should indicate just that (i.e. the
mean of n values is the sum divided by n).
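As a concrete illustration of Equations 1.25 and 1.26, the sketch below (a minimal example assuming the data of Table 1.1 and one candidate line; it is not taken from the book's chapter code) computes the sum of squared errors E and the mean squared error.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

b1, b0 = 0.764, 3.22            # candidate slope and intercept
eta = y - (b1 * x + b0)         # noise terms, eta_i = y_i - yhat_i (Equation 1.2)
E = np.sum(eta ** 2)            # sum of squared errors (Equation 1.25)
mse = E / len(y)                # mean squared error (Equation 1.26)
print(E, mse)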
Figure 1.3: The difference ηi between each data point yi and the corresponding
point ŷi on the best fitting line is assumed to be noise or measurement error.
Four examples of ηi = yi − ŷi for the data in Table 1.1 are shown here.

A Line Suspended on Springs. One way to think about how a
line can be adjusted to minimise E is to imagine each difference ηi as
a spring that pulls vertically on a free-floating line, so that the line’s
final position represents the net effect of all those pulling forces. For
example, in Figure 1.3, each data point pulls the line along a vertical
arrow connecting that point to the line.
However, in practice it is often the case that some observed values
are more reliable than others, and the more reliable values should exert
greater forces so that they have a greater influence on the final position
of the line. Expressed formally, this general strategy yields a method
called weighted linear regression, which is described in Chapter 8. For
now, we assume that all data points are equally reliable.

1.4. Regression and Causation

Before continuing, we should address a question often asked by novices:


in our example, why have we decided to define height as the dependent
variable and salary as the independent variable, rather than vice versa?
After all, to test if taller people have higher salaries, it might seem more
natural to think that salary depends on height.
In practice, regression is typically used in experiments to examine the
effect of increasing an independent variable, or a regressor, on the value
of a dependent variable. In such experiments, the independent variable
is controlled by the experimenter and so its value is known exactly. In
contrast, values of the dependent variable are the results of physical
measurements, which inevitably contain some intrinsic variability or
noise. In the data from Table 1.1, the salaries are taken to be values of
the independent variable, whereas the heights are taken to be values
of the dependent variable. As another example, the variable x could
represent the financial rewards for solving different numerical problems
in an experiment, and y could represent the amount of time spent trying
to solve each problem.
The result of our example regression problem could be interpreted to
mean that increasing salary x causes an increase in height y, which is
clearly silly. However, it does highlight a common misunderstanding
about regression: the fact that we can fit a line to the data in Table 1.1
correctly suggests that height increases with increasing salary, but it
does not imply that increasing salary causes height to increase.
More generally, for any two variables x and y that seem to be linearly
related (i.e. related by a straight line), it is possible that changes in both
x and y are caused by changes in a third variable z. For example, a
graph of daily ice cream sales x versus the daily power consumption
of air conditioning units y would (plausibly) suggest a straight line
relationship. In this case, it is fairly obvious that increasing ice cream
sales are not the cause of greater power consumption (or vice versa).
Instead, a rise in temperature z is the root cause of increases in both
ice cream sales x and power consumption y.

Key point: Fitting a line to a set of data may suggest that one
variable y increases with increasing values of another variable x,
but it does not imply that increasing x causes y to increase.

1.5. Regression: A Summary

Now that we have set the scene, the next four chapters provide details of
simple linear regression, summarised here. Given n values of a measured
variable y (e.g. height) corresponding to n values of a known quantity
x (e.g. salary), find the best fitting line that passes close to this set of
n data points. Crucially, the values of y are assumed to contain noise
(e.g. measurement error), whereas values of x are assumed to be known
exactly. The noise means that the observed values of y do not lie exactly
on any single line.
Given that values of y vary, some of this variation can be ‘soaked
up’ (accounted for) by the best fitting line. Precisely how well the best
fitting line matches the data is measured as the proportion of the total
variation in y that is soaked up by the best fitting line. This proportion
can then be translated into a p-value, which is the probability that the
slope of the best fitting line is really due to the noise in the measured
values of y and that the true slope of the underlying relationship between
x and y is actually zero (i.e. a horizontal line).
Later chapters explain regression in relation to maximum likelihood
estimation, multivariate regression, weighted linear regression, nonlinear
regression, and Bayesian regression.

Chapter 2

Finding the Best Fitting Line

2.1. Introduction

We can estimate the slope b1 and intercept b0 of the best fitting line
using two different strategies. The first one is exhaustive search, which
involves trying many different values of b1 and b0 in Equation 1.25 to
see which values make E as small as possible. Even though exhaustive
search is impractical, it gives us an intuitive understanding of how
calculus can be used to estimate parameters more efficiently. The
second approach is to use calculus to find analytical expressions for the
values of b1 and b0 that minimise E.

2.2. Exhaustive Search

As mentioned above, one method for finding values of the slope b1 and
intercept b0 that minimise the sum of squared errors E is to substitute
plausible values of b1 and b0 into Equation 1.25 (repeated here),

E = Σ_{i=1}^{n} (yi − (b1 xi + b0 ))², (2.1)

and plot the corresponding value of E, as shown in Figure 2.1. The


value of E is smallest at b1 = 0.764 and b0 = 3.22. These values are the
least squares estimates (LSE) of the slope and intercept.
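A minimal exhaustive-search sketch is given below; the grid limits and step size are illustrative choices (they are not specified in the text), and the loop is deliberately transparent rather than efficient.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

best_E, best_b1, best_b0 = np.inf, None, None
for b1 in np.arange(-2.0, 3.0, 0.01):          # candidate slopes
    for b0 in np.arange(0.0, 5.0, 0.01):       # candidate intercepts
        E = np.sum((y - (b1 * x + b0)) ** 2)   # Equation 2.1
        if E < best_E:
            best_E, best_b1, best_b0 = E, b1, b0

print(best_b1, best_b0)   # close to the LSE values b1 = 0.764, b0 = 3.22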


2.3. Onwards and Downwards

Even though the exhaustive search method is effective, it is very labour


intensive because it involves evaluating E many times. Fortunately,
there is a more efficient method, based on the idea of gradient descent.
Gradient descent relies on the intuition that if one wants to get to the
bottom of a valley then a simple strategy is to keep moving downhill
until there is no more downhill left, at which point the bottom of the
valley should have been reached. For our purposes, height in the ‘valley’
is represented by the sum of squared errors E, and each location on the
horizontal ground plane corresponds to a different pair of values of b1
and b0 , as shown in Figure 2.1.
This is all very well in theory, but if we were standing in this valley,
how would we know whether to increase or decrease b1 and b0 so as
to reduce E? Note that increasing b1 and b0 by the same amount
corresponds to moving in a north-east direction on the ground plane,
whereas decreasing both b1 and b0 corresponds to moving in a south-west
direction; in fact, every possible combination of changes in b1 and b0
corresponds to a particular direction on the ground plane. With this
in mind, a simple tactic is to take small steps in different directions
on the ground plane, and look for the direction in which the gradient
of E points downhill most steeply. Then, having found the direction
of steepest descent, a small step in that direction will decrease the

Figure 2.1: Contour map of the values of E for different pairs of values of the
slope b1 and intercept b0 . The point (b1 , b0 ) = (0.764, 3.22) where the dashed
lines intersect gives the smallest value of E.

height the most, which corresponds to decreasing E the most. Thus, by


incrementally changing the values of b1 and b0 to decrease E, the least
squares estimates of b1 and b0 are (eventually) obtained. This is why
the method is called gradient descent.
As shown in Figure 2.2a, taking a horizontal cross-section of the
sum-of-squares function E in Figure 2.1 yields a curve with a minimum
value at b1 = 0.764, where the gradient of E with respect to b1 is zero.
Similarly, Figure 2.2b shows a vertical cross-section of the function E
in Figure 2.1, which is a curve with minimum value at b0 = 3.22, where
the gradient of E with respect to b0 is zero.
However, calculus provides a more efficient way of estimating the
gradient or derivative of E with respect to b1 and b0 . In fact, for
quadratic error functions (such as those based on squared di↵erences),
we don’t have to take a series of small steps at all. Instead, we can
jump directly to a point on the ground plane where the gradient is zero,
which corresponds to the bottom of the valley.
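The gradient descent idea can be sketched in a few lines of Python; the starting point, step size, and number of iterations below are illustrative choices, not values from the book.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])
n = len(y)

b1, b0 = 0.0, 0.0     # arbitrary starting point on the ground plane
rate = 0.01           # step size; small enough for a stable descent here
for step in range(20000):
    resid = y - (b1 * x + b0)          # current errors eta_i
    dE_db1 = -2 * np.sum(resid * x)    # gradient of E with respect to b1
    dE_db0 = -2 * np.sum(resid)        # gradient of E with respect to b0
    b1 -= rate * dE_db1 / n            # small step in the downhill direction
    b0 -= rate * dE_db0 / n

print(b1, b0)   # approaches the least squares estimates 0.764 and 3.22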

2.4. The Normal Equations

The fact that the gradient of E with respect to b1 is zero at b1 = 0.764


provides a vital clue as to how we can estimate the slope b1 and intercept
b0 without using exhaustive search.

Figure 2.2: (a) Horizontal cross-section of the function E in Figure 2.1 at
b0 = 3.22. The value of b1 that minimises E is b1 = 0.764 (dashed line).
(b) Vertical cross-section of the function E in Figure 2.1 at b1 = 0.764. The
value of b0 that minimises E is b0 = 3.22 (dashed line).

Specifically, taking the derivative of E with respect to b1 and with


respect to b0 yields a pair of simultaneous equations, the normal
equations. The solution to the normal equations yields the least squares
estimate of b1 and b0 .
The derivative of E with respect to b0 is
∂E/∂b0 = −2 Σ_{i=1}^{n} (yi − (b1 xi + b0 )). (2.2)

At a minimum of E this equals zero, as shown in Figure 2.2b, so that

Σ_{i=1}^{n} (yi − (b1 xi + b0 )) = 0. (2.3)

Given that ηi = yi − (b1 xi + b0 ), if we divide both sides by n we get

η̄ = (1/n) Σ_{i=1}^{n} (yi − (b1 xi + b0 )) = 0, (2.4)

which will prove useful later. Equation 2.3 can be written as

Σ_{i=1}^{n} yi − b1 Σ_{i=1}^{n} xi − n b0 = 0, (2.5)

which yields the first normal equation,

b1 Σ_{i=1}^{n} xi + n b0 = Σ_{i=1}^{n} yi . (2.6)

It will prove useful to note that dividing both sides by n yields

b1 x̄ + b0 = ȳ. (2.7)

The derivative of E with respect to b1 is

∂E/∂b1 = −2 Σ_{i=1}^{n} (yi − (b1 xi + b0 )) xi . (2.8)

As shown in Figure 2.2a, at a minimum of E this is equal to zero, which
can be written as

Σ_{i=1}^{n} xi yi − b1 Σ_{i=1}^{n} xi² − b0 Σ_{i=1}^{n} xi = 0, (2.9)

giving the second normal equation,

b1 Σ_{i=1}^{n} xi² + b0 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} xi yi . (2.10)

Thus far, we have two simultaneous equations with two unknowns,
b1 and b0 . Equation 2.10 can be used to solve for b1 , as follows. From
Equation 2.7 we have b0 = ȳ − b1 x̄, and substituting this into Equation
2.10 gives

b1 Σ_{i=1}^{n} xi² + (ȳ − b1 x̄) Σ_{i=1}^{n} xi = Σ_{i=1}^{n} xi yi . (2.11)

Expanding the middle term, we have

b1 Σ_{i=1}^{n} xi² + ȳ Σ_{i=1}^{n} xi − b1 x̄ Σ_{i=1}^{n} xi = Σ_{i=1}^{n} xi yi . (2.12)

Collecting the terms containing b1 , we get

b1 (Σ_{i=1}^{n} xi² − x̄ Σ_{i=1}^{n} xi) = Σ_{i=1}^{n} xi yi − ȳ Σ_{i=1}^{n} xi . (2.13)

Then, solving for b1 yields

b1 = (Σ_{i=1}^{n} xi yi − ȳ Σ_{i=1}^{n} xi) / (Σ_{i=1}^{n} xi² − x̄ Σ_{i=1}^{n} xi). (2.14)

Dividing numerator and denominator by n, we obtain

b1 = [(1/n) Σ_{i=1}^{n} xi yi − ȳ x̄] / [(1/n) Σ_{i=1}^{n} xi² − x̄²]. (2.15)


Finally, it can be shown that Equation 2.15 can be expressed as

b1 = [(1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)] / [(1/n) Σ_{i=1}^{n} (xi − x̄)²]. (2.16)

Having found b1 , now the value of b0 can be obtained from

b0 = ȳ − b1 x̄. (2.17)

Key point: Taking the derivatives of E with respect to b1 and


b0 yields a pair of simultaneous equations, the normal equations.
The solution to the normal equations yields the LSE of b1 and b0 .
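One way to confirm the normal equations numerically is to compute the least squares estimates from Equations 2.16 and 2.17 and then check that both derivatives are (essentially) zero; the sketch below is illustrative, assuming the data of Table 1.1.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

xzm, yzm = x - x.mean(), y - y.mean()
b1 = np.sum(xzm * yzm) / np.sum(xzm * xzm)   # Equation 2.16
b0 = y.mean() - b1 * x.mean()                # Equation 2.17

resid = y - (b1 * x + b0)
print(np.sum(resid))       # first normal equation (Equation 2.3): ~0
print(np.sum(resid * x))   # second normal equation (Equation 2.9): ~0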

Regressing x On y Versus Regressing y On x

What we have done so far is regressing y on x, i.e. treating y as the


dependent variable and x as the independent variable, and seeking a
line of the form ŷ = b1 x + b0 . If we regress x on y, seeking a line of the
form x̂ = b1 y + b0 as shown in Figure 2.3, we will get different values
for the slope and intercept parameters.
Regressing y on x yields a slope and an intercept of

b1 = 0.764 feet/groat and b0 = 3.22 feet, (2.18)

Figure 2.3: Comparison of regressing x on y and regressing y on x. The dashed
(best fitting) line for regressing y on x has a slope of b1 = 0.764 feet/groat
and minimises the sum of squared vertical distances between each data point
and the dashed line. The solid (best fitting) line for regressing x on y has a
slope of b′1 = 0.610 groats/foot and minimises the sum of squared horizontal
distances between each data point and the solid line.

where b1 = Δŷ/Δx feet/groat and b0 is the value of ŷ at x = 0.


In contrast, regressing x on y could be achieved by swapping x and y
in Equation 1.1 (ŷi = b1 xi + b0 ) to get

x̂i = b′1 yi + b′0 , (2.19)

where the slope b′1 = Δx̂/Δy and intercept b′0 are

b′1 = 0.610 groats/foot and b′0 = 0.635 groats. (2.20)

The reason that the two best fitting lines in Figure 2.3 are different is
that whereas regressing y on x finds the parameter values that minimise
the sum of squared (vertical) differences between each observed value of
y and the fitting line, regressing x on y finds the parameter values that
minimise the sum of squared (horizontal) differences between each value
of x and the fitting line. This is discussed in more detail in Section 3.3.
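The two regressions are easy to compare numerically; the helper function below is an illustrative sketch (its name is not from the book) that applies Equation 2.16 with the roles of the variables swapped.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

def lse_slope(u, v):
    # Slope of the best fitting line when regressing v on u (Equation 2.16).
    return np.sum((u - u.mean()) * (v - v.mean())) / np.sum((u - u.mean()) ** 2)

print(lse_slope(x, y))   # regressing y on x: 0.764 feet/groat
print(lse_slope(y, x))   # regressing x on y: 0.610 groats/foot
# Note that the second slope is not simply 1/0.764; the two lines coincide
# only when the data lie exactly on a single straight line.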

2.5. Numerical Example

Slope. From Equation 2.16,

b1 = 0.6687/0.875 = 0.764 feet/groat. (2.21)

Intercept. From Equation 2.17 with

x̄ = 2.50 and ȳ = 5.13, (2.22)

we find

b0 = 5.13 − 0.764 × 2.50 (2.23)
= 3.225 feet. (2.24)

Now we know how to find the least squares estimates of the slope b1
and intercept b0 of the best fitting line. Next, we consider this crucial
question: How well does the best fitting line fit the data?


2.6. Python Code


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ch2Python.py. Finding the best fitting line.
# This is demonstration code, so it is transparent but inefficient.

import numpy as np
import matplotlib.pyplot as plt

x = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

# convert data to vectors.


x = np.array(x)
y = np.array(y)
xmean = np.mean(x)
ymean = np.mean(y)
# Find zero mean versions of x and y.
xzm = x - xmean
yzm = y - ymean

numerator = np.sum(xzm * yzm)


denominator = np.sum(xzm * xzm)
print("numerator = %0.3f" %numerator) # 8.693
print("denominator = %0.3f" %denominator) # 11.375

# Find slope b1.


# b1 = numerator/denominator # 0.764
b1 = np.sum(xzm * yzm) / ( np.sum(xzm * xzm) ) # 0.764

# Find intercept b0.


b0 = ymean - b1*xmean # 3.225
print('slope b1 = %6.3f\nintercept b0 = %6.3f' % (b1, b0))

# Draw best fitting line.


xx = np.array([-0.3, 3, 5])
yline = b1 * xx + b0
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, "ko", label="Data")
ax.plot(xx, yline, "k-",label="Best fitting line (LSE)")
ax.legend(loc="best")
plt.grid()

###############################
# END OF FILE.
###############################

Chapter 3

How Good is the Best Fitting Line?

3.1. Introduction

Having found the LSE for the slope and intercept of the best fitting line,
we naturally wish to get some idea of exactly how well this line actually
fits the data. The informal idea of how well a line fits a set of data boils
down to how well that line accounts for the variability in the data. In
essence, we seek an answer to the following question: for data with a
given amount of variability, what proportion of that variability can be
explained in terms of the best fitting line? In other words, we wish to
know how much of the overall variability in the data can be ‘soaked up’
by the best fitting line. This requires a formal definition of variability.

3.2. Variance and Standard Deviation

A measure of the variability of n observed values of y is the variance,

var(y) = (1/n) Σ_{i=1}^{n} (yi − ȳ)². (3.1)

Here are some important properties of the variance.

1. The variance is the mean value of the squared difference between
each observed value of y and the mean value ȳ of y.
2. If all values of y are multiplied by a constant factor k then the
variance increases by a factor of k²; that is, var(ky) = k² var(y).

3. Adding any constant (e.g. −ȳ) to all values of y does not change
the variance; for example, var(y − ȳ) = var(y).
4. Variance is related to the sum of squared errors by a factor of n;
that is, Σ_{i=1}^{n} (yi − ȳ)² = n var(y).
5. In the context of linear regression, the variance of y can be split
into two parts: a) the variance in y that can be predicted from x,
and b) the residual or noise variance in y that cannot be predicted
from x.

A related measure of variability is the standard deviation, which is the
square root of the variance,

sy = √var(y), (3.2)

where sy is the conventional symbol for the sample standard deviation.


If all values of y are multiplied by a constant factor k then the standard
deviation increases by a factor of k; that is, sky = ksy . The standard
deviation has the same units as y; for example, if y is measured in units
of feet then the standard deviation also has units of feet.
The variance var(y) is calculated from a finite sample of n values
{y1 , . . . , yn }, which is assumed to have been drawn from an infinite
parent population of y values. Thus, var(y) is an estimate of the
population variance σy², where by convention the Greek letter σ (sigma)
is used to represent the parent population standard deviation.
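The properties of the variance listed above are easy to verify numerically. The sketch below (illustrative only, with an arbitrary constant k) uses the height data of Table 1.1; note that np.var divides by n by default, matching Equation 3.1.

import numpy as np

y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])
k = 3.0                                         # arbitrary scale factor

var_y = np.mean((y - y.mean()) ** 2)            # Equation 3.1
print(var_y, np.var(y))                         # the same value
print(np.var(k * y), k**2 * var_y)              # property 2: var(ky) = k^2 var(y)
print(np.var(y - y.mean()), var_y)              # property 3: shifting y changes nothing
print(np.sum((y - y.mean())**2), len(y)*var_y)  # property 4
print(np.sqrt(var_y))                           # standard deviation s_y (Equation 3.2)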

3.3. Covariance and Correlation

Covariance. A useful measure of the degree of association between


two variables x and y is their covariance,

cov(x, y) = (1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ). (3.3)

It is worth noting a few important properties of the covariance.


1. A positive covariance indicates that y increases with increasing


values of x, and a negative covariance indicates that y decreases
with increasing values of x.
2. Adding a constant to x or to y does not change their covariance;
for example, cov(x − x̄, y − ȳ) = cov(x, y).
3. The covariance is sensitive to the magnitudes of x and y.
Specifically, if x and y are both multiplied by a factor of k then the
covariance increases by a factor of k²: cov(kx, ky) = k² cov(x, y).

Like the variance, the covariance cov(x, y) is based on a finite sample
of n values, assumed to be drawn from an infinite (parent) population
of values, so cov(x, y) is an estimate of the population covariance σx,y .

Correlation. Correlation is a measure of association that is independent


of the magnitudes of x and y. A commonly used correlation measure is
the Pearson product-moment correlation coefficient,
r = [(1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)] / ([(1/n) Σ_{i=1}^{n} (xi − x̄)²]^{1/2} [(1/n) Σ_{i=1}^{n} (yi − ȳ)²]^{1/2}). (3.4)

The 1/n terms cancel, but have been left in the expression so that we can
recognise it as the covariance (Equation 3.3) divided by the standard
deviations of x and y (Equation 3.2):

r = cov(x, y)/(sx sy ). (3.5)

The correlation can vary between r = ±1. Examples of data sets with
different correlations are shown in Figure 3.1.

Back to the Slope. In Equation 2.16 (repeated here),

b1 = [(1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)] / [(1/n) Σ_{i=1}^{n} (xi − x̄)²], (3.6)

we can now recognise the numerator as the covariance (Equation 3.3)


and the denominator as the variance of x,
var(x) = (1/n) Σ_{i=1}^{n} (xi − x̄)², (3.7)


so that the slope of the best fitting line is given by the ratio

b1 = cov(x, y)/var(x). (3.8)

Therefore, the slope of the best fitting line when regressing y on x is


the covariance between x and y expressed in units of the variance of x.

Key point: The slope of the best fitting line when regressing y
on x is the covariance between x and y expressed in units of the
variance of x.
If we divide all x values by the standard deviation sx , we obtain a
normalised variable x′ = x/sx , which has a standard deviation of 1,
i.e. sx′ = 1, and hence var(x′) = (sx′)² = 1 as well; similarly, y can be
normalised to have standard deviation and variance equal to 1. If both
x and y are normalised then the regression coefficient in Equation 3.8
becomes equal to the correlation coefficient in Equation 3.5. In other
words, for normalised variables, the slope of the best fitting line equals
the correlation, i.e. r = b1 .
From Equation 3.5, we have cov(x, y) = rsx sy , and substituting this
into Equation 3.8 yields b1 = r(sy /sx ), so the slope is the correlation
scaled by the ratio of standard deviations. Regressing y on x (as above)
yields b1 = r(sy /sx ), whereas regressing x on y yields b′1 = r(sx /sy )
(see Figure 2.3).
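The relationships between covariance, correlation, and slope can be checked directly; the sketch below assumes the data of Table 1.1 and uses divide-by-n variances and standard deviations, as in the text.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

covxy = np.mean((x - x.mean()) * (y - y.mean()))   # Equation 3.3
sx, sy = x.std(), y.std()                          # divide-by-n standard deviations
r = covxy / (sx * sy)                              # Equation 3.5

b1 = covxy / x.var()                               # Equation 3.8
print(b1, r * sy / sx)                             # b1 = r (sy/sx), as claimed

# Normalising both variables makes the slope equal to the correlation.
xn, yn = x / sx, y / sy
print(np.mean((xn - xn.mean()) * (yn - yn.mean())) / xn.var(), r)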

Figure 3.1: Comparison of the correlations, covariances and best fitting lines
for two data sets.
(a) Correlation r = 0.90; cov(x, y) = 186; the best fitting line has a slope of
b1 = 2.09 and an intercept of b0 = 6.60.
(b) Correlation r = 0.75; cov(x, y) = 193; the best fitting line has a slope of
b1 = 2.16 and an intercept of b0 = 10.4.

3.4. Partitioning the Variance

The total error, or difference between each observed value yi and the
mean ȳ of y, can be split or partitioned into two parts, which we refer
to as signal and noise, as shown in Figure 3.2. The signal part of the
error is

δi = ŷi − ȳ, (3.9)

where ŷi = b1 xi + b0 is the y value corresponding to xi as predicted by


the best fitting line from the regression model. The noise part of the
error is

ηi = yi − ŷi , (3.10)

which is the part of the error that cannot be explained by the model.
The total error of each data point is the sum of the signal and noise

yi − ȳ = (ŷi − ȳ) + (yi − ŷi ) (3.11)
= δi + ηi . (3.12)

Figure 3.2: The total error (difference between an observed value yi and
the mean ȳ) can be partitioned into two parts, a signal part, ŷi − ȳ, that
is explained by the best fitting line and a noise part, yi − ŷi , which is not
explained by the line.

Substituting this into Equation 3.1, the variance can be written as


var(y) = (1/n) Σ_{i=1}^{n} (δi + ηi )², (3.13)

which expands to
var(y) = (1/n) Σ_{i=1}^{n} (δi² + ηi² + 2 δi ηi ). (3.14)

Collecting the terms using separate summation symbols gives


var(y) = (1/n) Σ_{i=1}^{n} δi² + (1/n) Σ_{i=1}^{n} ηi² + (2/n) Σ_{i=1}^{n} δi ηi , (3.15)

where the final sum equals zero (see Appendix D, Equation D.21);
because (as we should expect) the correlation between signal and noise
is zero. Accordingly, the variance can be partitioned into two (signal
and noise) parts:

var(y) = (1/n) Σ_{i=1}^{n} δi² + (1/n) Σ_{i=1}^{n} ηi². (3.16)

We can multiply both sides by n to obtain three sums of squares,


Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} δi² + Σ_{i=1}^{n} ηi². (3.17)

For brevity, we define these three sums of squares (reading from left to
right in Equation 3.17) as follows.
a) The total sum of squared errors

SST = Σ_{i=1}^{n} (yi − ȳ)². (3.18)

b) The signal sum of squares

SSExp = Σ_{i=1}^{n} δi² = Σ_{i=1}^{n} (ŷi − ȳ)², (3.19)

which is the part of the sum of squared errors SST that is accounted
for, or explained, by the regression model.
c) The noise sum of squares

SSNoise = Σ_{i=1}^{n} ηi² = Σ_{i=1}^{n} (yi − ŷi )², (3.20)

which is the part of the sum of squared errors SST that is not explained
by the regression model.
Now Equation 3.17 can be written succinctly as

SST = SSExp + SSNoise . (3.21)

It will be shown in the next section that the proportion of the sum of
squared errors that is explained by the regression model is

r² = SSExp / SST, (3.22)

where r² is the square of the correlation coefficient defined in Equation
3.4. Finally, given that SSExp = SST − SSNoise ,

r² = 1 − SSNoise / SST. (3.23)

Notation. Depending on the software used, the total sum of squares is


often represented as SST = SST, the explained sum of squares as

SSExp = SSR, (3.24)

which stands for regression sum of squares, and the noise sum of squares

SSNoise = SSE or RSS, (3.25)

meaning the error sum of squares or residual sum of squares.


Key point: The total sum of squared errors SST consists of two
parts: SSExp , the sum of squared errors explained by the best
fitting line, and SSNoise , the sum of squared errors that remains
unexplained by the best fitting line.
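The partition in Equation 3.21 can be verified numerically for the best fitting line; the sketch below is illustrative, assuming the data of Table 1.1.

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69,
              5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

xzm = x - x.mean()
b1 = np.sum(xzm * (y - y.mean())) / np.sum(xzm * xzm)
b0 = y.mean() - b1 * x.mean()
yhat = b1 * x + b0

SST = np.sum((y - y.mean()) ** 2)        # total sum of squares (Equation 3.18)
SSExp = np.sum((yhat - y.mean()) ** 2)   # explained sum of squares (Equation 3.19)
SSNoise = np.sum((y - yhat) ** 2)        # noise sum of squares (Equation 3.20)

print(SST, SSExp + SSNoise)              # Equation 3.21: the two values agree
print(SSExp / SST)                       # r^2 (Equation 3.22)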


Disregarding the Intercept

The intercept b0 is not only unimportant in most cases but also makes
for unnecessarily complicated algebra. If we can set b0 to zero then we
can ignore it. Accordingly, in Appendix D we prove that b0 can be set to
zero without affecting the slope of the best fitting line, by transforming
the means of x and y to zero. This also sets the mean of ŷ (for points
on the best fitting line) to zero, so we have mean(ŷ) = x̄ = ȳ = b0 = 0.
For the present, the geometric representation of the proof in Figure
3.3 should suffice. This shows that setting the mean of x to zero is
equivalent to simply translating the data along the x-axis, and setting
the mean of y to zero amounts to translating the data along the y-axis.
Crucially, this process of centring the data has no effect on the rate at
which y varies with respect to x, so the slope of the best fitting line
remains unaltered.
For example, after centring, x̄ = ȳ = 0, so Equation 3.6 simplifies to

b1 = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi², (3.26)

where (strictly speaking) x and y should really be written as the zero-
mean variables x′ and y′.

Figure 3.3: Setting the means to zero. The upper right dots represent the
original data, with x̄ = 2.50 and ȳ = 5.13, marked with a diamond. The lower
left circles represent the translated data, with zero means. For the original
data (upper right) the best fitting line has equation ŷi = b1 xi + b0 , but for
the translated data (lower left) b0 = 0 so that ŷi = b1 xi . Crucially, the best
fitting line has the same slope b1 for the original and the translated data.

3.5. The Coefficient of Determination

The coefficient of determination is the proportion of the variance in y


that can be attributed to the best fitting line, i.e. var(ŷ)/var(y). As
shown next, this is equal to the square of the correlation coefficient:

r² = var(ŷ)/var(y). (3.27)

By squaring Equation 3.5, we have

r² = cov(x, y)² / (var(x) var(y)). (3.28)

Given that ŷi = b1 xi + b0 , the variance of ŷ is

var(ŷ) = (1/n) Σ_{i=1}^{n} (b1 xi + b0 − mean(ŷ))². (3.29)

By setting ȳ = 0 we also set mean(ŷ) = b0 = 0 (see page 24), and so

var(ŷ) = (1/n) Σ_{i=1}^{n} b1² xi² (3.30)
= b1² var(x). (3.31)

From Equation 3.8,

b1² = cov(x, y)² / var(x)². (3.32)

Substituting this into Equation 3.31 gives

var(ŷ) = cov(x, y)² / var(x) (3.33)

and therefore

cov(x, y)² = var(ŷ) var(x). (3.34)

Substituting this in Equation 3.28 yields Equation 3.27, as promised.


Similarly, Equation 3.22 is obtained by first writing Equation 3.27 as

r² = [(1/n) Σ_{i=1}^{n} (ŷi − mean(ŷ))²] / [(1/n) Σ_{i=1}^{n} (yi − ȳ)²]. (3.35)

Then, by setting ȳ = 0 so that mean(ŷ) = b0 = 0 (see page 24), we obtain

r² = Σ_{i=1}^{n} ŷi² / Σ_{i=1}^{n} yi² (3.36)
= SSExp / SST, (3.37)

which proves Equation 3.22.

Key point: The coefficient of determination, the proportion of


variance that can be attributed to the best fitting line, is equal
to the square of the correlation coefficient, r2 .

3.6. Numerical Example

From the data in Table 1.1, we have

var(y) = 1.095 and var(ŷ) = 0.511 (3.38)

Hence, the proportion of the overall variance accounted for by the best
fitting line is (using Equation 3.27)

r2 = var(ŷ)/var(y) (3.39)
= 0.511/1.095 = 0.466. (3.40)

So just under half of the total variance in y can be attributed to the


best fitting line, and just over half is simply noise. The correlation
between x and y is r = √0.466 = 0.683.
Two further examples can be seen in Figure 3.1, where it is apparent
that data which lie close to the best fitting line allow a better fit to
that line. In the next chapter, we consider how to assess the statistical
significance of mean values, which paves the way for assessing the
statistical significance associated with the coefficient of determination r2 .


3.7. Python Code


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
ch3Python.py. Evaluating the fit with the coefficient of determination.
This is demonstration code, so it is transparent but inefficient.
"""
import numpy as np

x = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

# Convert data to vectors.


x = np.array(x)
y = np.array(y)
n = len(y)

# Find zero mean versions of x and y.


xzm = x-np.mean(x)
yzm = y-np.mean(y)
# Covariance between x and y.
covxy = np.sum(xzm * yzm)/n
# Variance of x.
varx = np.sum(xzm * xzm)/n
# Variance of y.
vary = np.sum(yzm * yzm)/n

# Find slope b1=0.764


b1 = covxy/varx

# Find intercept b0=3.225


b0 = np.mean(y) - b1*np.mean(x)

yhat = b1*x + b0
varyhat = np.var(yhat) # 0.511

# Find coefficient of determination r2.


r2 = covxy * covxy / (varx * vary)
# OR equivalently ...
r2 = varyhat / vary # 0.466

print("Variance of x = %0.3f" %varx) # 0.875


print("Variance of y = %0.3f" %vary) # 1.095
print("Covariance of x and y = %0.3f" %covxy) # 0.669
print("Variance of yhat = %0.3f" %varyhat) # 0.511
print("Coefficient of variation = %0.3f" %r2) # 0.466
# END OF FILE.

Chapter 4

Statistical Significance: Means


4.1. Introduction

Having estimated the slope of the best fitting line, how can we find the
statistical significance associated with that slope?
Just as the best fitting slope is an estimate of the true slope of the
relationship between two variables, so the mean of n values in a sample
is an estimate of the mean of the population from which the sample was
drawn. It turns out that the estimated slope is a weighted mean (as
will be shown in Section 5.2), which is a generalisation of a conventional
mean. Therefore, we can employ standard methods for finding the
statistical significance of a mean value to find the statistical significance
of the best fitting slope.
To calculate a conventional mean, we add up all n measured quantities
and divide the sum by n. For a weighted mean, each measured quantity
is boosted or diminished according to the value of its associated weight,
so some measurements contribute more to the weighted mean. The
justification is that some measured quantities are more reliable than
others, and such values should contribute more to the weighted mean.
Accordingly, and just for this chapter, we will treat the estimated slope
as if it were a conventional mean (i.e. an unweighted mean).
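The distinction between a conventional mean and a weighted mean is easy to see in code; the measurements and weights below are purely illustrative.

import numpy as np

y = np.array([3.34, 4.97, 4.15, 5.40])   # four measurements
w = np.array([1.0, 1.0, 1.0, 3.0])       # weights: the last value is trusted most

print(np.mean(y))                 # conventional mean: every value counts equally
print(np.sum(w * y) / np.sum(w))  # weighted mean: reliable values count for more
print(np.average(y, weights=w))   # the same weighted mean via numpy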

4.2. The Distribution of Means

The data set in Table 1.1 is just one of many possible sets of data, or
samples, taken from an underlying large collection of salary and height
values. Indeed, we can treat our data set as if it is a single sample of

values drawn from an infinitely large parent population of values. If the
population mean is µ then we can treat each observed value of y as if it
consists of two components, µ plus random noise ε:

yi = µ + εi . (4.1)

To get an idea of how typical our sample is, suppose we could take many
samples. Consider a random sample of n values, V1 = {y1 , y2 , . . . , yn },
where the subscript 1 indicates that this is the first sample; we denote
its mean value by ȳ1 . If we repeat this sampling procedure N times, we
get N samples {V1 , V2 , . . . , VN }, where each sample Vj has mean ȳj , so
we have N means,

{ȳ1 , ȳ2 , . . . , ȳN }. (4.2)

If we make a histogram of these N mean values then we will find that


they cluster around the mean µ of the parent population, as in Figure 4.1.
The mean value of the jth sample of n values is

ȳj = (1/n) Σ_{i=1}^{n} yij = (1/n) Σ_{i=1}^{n} (µ + εij ), (4.3)

where yij denotes the ith value of y in the jth sample and εij is the
noise in yij . By splitting this into two summations, we have

ȳj = (1/n) Σ_{i=1}^{n} µ + (1/n) Σ_{i=1}^{n} εij . (4.4)

The first term is just µ, and the second term is the mean of the noise
values in the jth sample,
ε̄j = (1/n) Σ_{i=1}^{n} εij , (4.5)

so Equation 4.4 becomes

ȳj = µ + ε̄j . (4.6)

Figure 4.1: Each panel shows a histogram of N = 10,000 means. All the
means in each histogram are based on a single sample size n, and n increases
from (a) to (f), as indicated in each panel. For example, in (b), first the mean
ȳ of n = 4 randomly chosen values of y was found; then this was repeated to
obtain a total of 10,000 means, and a histogram of those means was plotted.
Each value of y was chosen from a parent population with mean µ = 0 and
standard deviation σy = 1, shown in (a). As n increases, the standard error
σȳ shrinks (the standard deviation of the N mean values, denoted by σ in
each panel). The smooth curve in each panel is a Gaussian function with
standard deviation σȳ = σy /√n.


The law of large numbers says that if the distribution of noise ε in the
parent population has a mean ε̄ of zero then the distribution of noise
means ε̄j has a mean that converges to zero as n increases. Therefore, as
n increases, Equation 4.6 becomes ȳ ≈ µ, with equality as n approaches
infinity. Thus, the sample mean ȳ is essentially a noisy estimate of
the population mean µ. But how much confidence can we place in
our estimate ȳ of the population mean µ? To answer this, we need to
find the standard deviation of the estimated mean ȳ. And, to find the
standard deviation of the mean ȳ, we need the central limit theorem.

The Central Limit Theorem

The central limit theorem states that a histogram of N sample means


as in Equation 4.2 has a bell-shaped distribution, known as a Gaussian
distribution; see Figure 4.2. This theorem is important because,
remarkably, the distribution of means is approximately Gaussian
almost irrespective of the shape of the parent distribution of y values.
Incidentally, a theorem is simply a mathematical statement that has
been proved to be true.
Given a parent population with standard deviation σy , suppose we
take a large number of samples that each contain n values. The central
limit theorem states that the standard deviation of the sample means is

σȳ = σy /√n. (4.7)

Figure 4.2: A normalised Gaussian distribution has mean µ = 0 and standard
deviation σ = 1. The total area under the curve is 1.0 and the area between
z = ±1.96 is 0.95. For any Gaussian distribution, the area under the curve
between ±1.96 standard deviations is 95% of the total area.

This standard deviation of sample means has a special name, the


standard error. Given N sample means ȳj as in Equation 4.2, the
estimated value of σȳ is

sȳ = [(1/N) Σ_{j=1}^{N} (ȳj − µ)²]^{1/2}. (4.8)

The factor of 1/√n in Equation 4.7 implies that the standard error
shrinks as the sample size n increases. However, there are ‘diminishing
returns’ on increasing n: initially the standard error shrinks rapidly,
but for sample sizes above 20 the rate of decrease slows considerably, as
shown in Figure 4.3. Despite these diminishing returns, Equation 4.7
guarantees that σȳ approaches zero as n tends to infinity.

The Gaussian (Normal) Distribution. The equation for the
probability of a variable y with a Gaussian distribution that has mean
µ and standard deviation σy is

p(y) = k e^{−(y − µ)²/(2 σy²)}, (4.9)

where e = 2.718 . . . and k = [1/(2π σy²)]^{1/2}, which ensures that the area
under the curve sums to 1.
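Equation 4.9 can be checked numerically; the sketch below evaluates p(y) on a grid and approximates the areas quoted in Figure 4.2 with a simple sum (the grid limits are an illustrative choice).

import numpy as np

mu, sigma = 0.0, 1.0
yv = np.linspace(-6, 6, 2001)                         # grid of y values
k = 1.0 / np.sqrt(2 * np.pi * sigma ** 2)
p = k * np.exp(-(yv - mu) ** 2 / (2 * sigma ** 2))    # Equation 4.9

dy = yv[1] - yv[0]
print(np.sum(p) * dy)                                 # total area: ~1.0
inside = np.abs(yv - mu) <= 1.96 * sigma
print(np.sum(p[inside]) * dy)                         # area within ±1.96 sigma: ~0.95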
Figure 4.3: Consider a population with a standard deviation of σy = 1. Given
N samples from this population, each of which contains n values, the standard
deviation of the sample means (standard error) is σȳ = σy /√n (Equation 4.7).

Key point: Given N samples, each of which consists of n values
chosen from a parent population with mean µ and standard
deviation σy , the distribution of the sample means {ȳ1 , . . . , ȳN }
is Gaussian with mean µ and standard deviation σȳ = σy /√n.

4.3. Degrees of Freedom

In essence, the number of degrees of freedom is the number of ways


in which a sample of n values are free to vary, after one or more
constraints have been taken into account. By convention, the Greek
letter ν (pronounced new) is standard notation for degrees of freedom.
For example, if there are n values in a sample then the number of
degrees of freedom is ν = n, because each value is free to vary. But
if we wish to calculate the variance, for example, then its associated
number of degrees of freedom is ν = n − 1, as explained next.
Variance is calculated from n squared differences (yi − ȳ)², as in
Equation 3.1. For a sample of n = 2 values, where y1 = 6 and y2 = 4,
the sum of both values is S = 10 and the mean is ȳ = 5. If we make a
two-dimensional graph with coordinate axes y1 and y2 then the values in
this sample can be represented as a single point located at (6, 4) on the
(y1 , y2 ) plane, as shown in Figure 4.4a. When there are no constraints
(e.g. if we allow the mean to take any value), y1 and y2 can adopt any
values, so the point (y1 , y2 ) can be anywhere on the plane. But when
we are calculating the variance based on a fixed value of the mean ȳ,
the value of the sum S = nȳ is effectively fixed, at 10 in this example.
In terms of geometry, fixing the value of the sum S means that the
point (y1 , y2 ) must lie on a line given by the equation y1 + y2 = S. In
other words, for a sample with a fixed mean (or equivalently a fixed
sum) the only combinations of y1 and y2 allowed must correspond to
points that lie on the line defined by y1 + y2 = S. Note that a line has
one dimension less than the plane defined by the axes y1 and y2 .
What has this to do with variance? Well (as mentioned above),
variance depends on n squared differences from the mean, which is fixed
in the calculation. So once the value of y1 is known, the value of y2 is
also known (it is y2 = S − y1 ); similarly, once the value of y2 is known,
the value of y1 is also known. Therefore, for a sample of n = 2 values


with a fixed mean ȳ, only 1 = n − 1 of the values is free to vary. In effect,
the variance confines the sample values (y1, y2) to a one-dimensional
space (e.g. a line y1 + y2 = S), so its associated number of degrees of
freedom is ν = 1. (If we choose to take account of the fact that the
differences are squared then the one-dimensional space is actually a
circle, but this has the same number of dimensions (one) as a line, hence
the underlying logic of the argument presented here holds good).
By analogy, if the sample comprises n = 3 values {y1, y2, y3} then
this defines a point in three-dimensional space with coordinate axes y1,
y2 and y3, as shown in Figure 4.4b. When the mean, and hence the
sum S, is fixed, if y1 and y2 are known then y3 can be calculated as
y3 = S − (y1 + y2). In other words, once the mean is fixed, the point
(y1, y2, y3) has components such that y1 + y2 + y3 = S, which defines
a plane in three-dimensional space. Because the variance effectively
confines the sample values (y1, y2, y3) within this plane, its associated
number of degrees of freedom is ν = 2.
Generalising, if the sample comprises n values {y1, . . . , yn} then this
defines a point in an n-dimensional space with coordinate axes y1, . . . , yn.
Given a fixed mean and hence a fixed sum S, if y1, . . . , yn−1 are known

Figure 4.4: (a) For a sample with n = 2 values y1 and y2, if their sum S is fixed
then the point y = (y1, y2) must lie on the line shown, which has equation
y1 + y2 = S. (b) For a sample with n = 3 values y1, y2 and y3, if their sum S
is fixed then the point y = (y1, y2, y3) must lie on the plane shown, which has
equation y1 + y2 + y3 = S. One possible value for y is shown in (a) and (b).


then yn can be calculated as yn = S − (y1 + · · · + yn−1). So once
the mean is fixed, the point (y1, . . . , yn) has components such that
y1 + · · · + yn = S, which defines an (n − 1)-dimensional hyperplane
in n-dimensional space. Because the variance effectively confines the
sample values (y1, . . . , yn) to this (n − 1)-dimensional hyperplane, its
associated number of degrees of freedom is ν = n − 1.
As a general rule, if a model has p parameters then each parameter acts
as a constraint on the degrees of freedom in that model. Consequently,
given n data points, a model with p parameters has ν = n − p degrees
of freedom. In the case of simple linear regression, there are p = 2
parameters (slope and intercept), so the number of degrees of freedom
in the model is ν = n − 2. See Walker (1940) for more details.

4.4. Estimating Variance

It is tempting to assume that the variance var(y) of a sample of n
values provides a reasonable estimate of the variance σ_y² of the parent
population. However, var(y) is a biased estimate; specifically, var(y)
systematically under-estimates the variance of the parent population.
An unbiased estimate of the population variance σ_y² is obtained by
dividing the sum of squared differences by the degrees of freedom
ν = n − 1 rather than by the sample size n:

    σ̂_y² = [1/(n − 1)] Σ_{i=1}^n (yi − ȳ)².    (4.10)

By implication, an unbiased estimate of the standard deviation σ_y of the
parent population is σ̂_y. As a reminder, a variable with a hat (such as
σ̂_y) is an estimate of the hatless quantity (σ_y). Note that the difference
between the sum of squared differences divided by n and divided by
n − 1 becomes negligible as n increases. By analogy with Equation 4.7,
the unbiased estimate σ̂_ȳ of the standard error σ_ȳ (i.e. the standard
deviation of the sample mean ȳ) is

    σ̂_ȳ = σ̂_y / √n.    (4.11)


And the central limit theorem guarantees that the distribution of sample
means is approximately Gaussian with a mean of µ and a standard
deviation of σ_ȳ.
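
As a brief aside (this sketch is not from the book's listings), the reason for
dividing by n − 1 in Equation 4.10 can be seen by simulation: averaging the two
variance estimates over many small samples from a population with variance 1
shows that dividing by n underestimates the population variance, whereas
dividing by n − 1 does not. The sample size and number of samples below are
arbitrary choices.

# Sketch: biased (1/n) versus unbiased (1/(n-1)) variance estimates.
# Illustrative simulation only; parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=(100000, 5))    # many samples of n = 5 values
print("Mean variance using 1/n     = %0.3f." % np.var(samples, axis=1, ddof=0).mean())
print("Mean variance using 1/(n-1) = %0.3f." % np.var(samples, axis=1, ddof=1).mean())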

The z-score. If for each sample j we subtract the population mean µ
from the sample mean ȳj and divide by the standard deviation of the
mean (standard error) σ_ȳ, the result is a z-score,

    z = (ȳ − µ)/σ_ȳ,    (4.12)

where σ_ȳ = σ_y/√n (Equation 4.7), so that

    z = (ȳ − µ) / (σ_y/√n).    (4.13)

The Normalised Gaussian Distribution. A z-score has a normalised


Gaussian distribution, that is, a Gaussian distribution with a mean of 0
and a standard deviation of 1. As the total area under the curve of a
Gaussian distribution is equal to 1 (see page 33), we can interpret area
under a portion of the curve as probability, as shown in Figure 4.5. The
normalisation makes calculations much simpler.

Why is Noise Gaussian? The justification for assuming that noise
is Gaussian is that it usually consists of a mixture of many different
variables. Each of these variables can have a unique amplitude, so noise is
effectively a weighted sum. Because every weighted sum is proportional
to its weighted mean, weighted sums and weighted means have the same
distribution. Given that the central limit theorem guarantees that all
weighted means are approximately Gaussian, it follows that noise has a
Gaussian distribution. For the sake of brevity, in the following we will
assume that the distribution of means is Gaussian.

4.5. The p-Value

Because all sets of data contain noise, which has a Gaussian distribution,
the central limit theorem tells us that finding the mean of almost any
data set is analogous to choosing a point under a Gaussian distribution
curve at random. As illustrated in Figure 4.5, the probability of choosing


that point is proportional to the height p(z) of the curve above that
point. Consequently, the probability of choosing a point located at z
along the abscissa decreases with distance from the mean µ. In other
words, the Gaussian curve defines the probability of any given value of
z, and this probability is a Gaussian function of z as in Equation 4.9.
If we now consider the normalised Gaussian distribution shown in
Figure 4.2, which has mean 0 and standard deviation 1, then the area
under the curve between ±1.96 accounts for 95% of the total area, so
the probability of choosing a point z that lies between ±1.96 is 0.95. It
follows that the total area of the two ‘tails’ of the distribution outside of
this central region is p = 1 − 0.95 = 0.05 (see Figure 4.2). Specifically,
the tail of the distribution to the right of z = +1.96 has an area of
0.025, and the tail of the distribution to the left of z = −1.96 also has
an area of 0.025. This means that the probability that z is more than
+1.96 is 0.025, and the probability that z is less than −1.96 is 0.025. The
probability that the absolute value |z| of z is larger than 1.96 is the
p-value, p = 1 − 0.95 = 0.05.

Key point: The area under the normalised Gaussian curve
between ±1.96 occupies 95% of the total area. Therefore, the
combined probability of the two tails of the distribution beyond
this central region is 1 − 0.95 = 0.05.
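
These figures can be confirmed directly from the normalised Gaussian
distribution; the short sketch below (an illustrative check, not one of the
book's listings) uses scipy's normal distribution functions.

# Sketch: area under the normalised Gaussian between -1.96 and +1.96.
from scipy import stats

area = stats.norm.cdf(1.96) - stats.norm.cdf(-1.96)
print("Area between -1.96 and +1.96 = %0.3f." % area)                    # ~0.950
print("Two-tailed p for |z| > 1.96  = %0.3f." % (2 * stats.norm.sf(1.96)))  # ~0.050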
Figure 4.5: A normalised histogram as an approximation to a Gaussian
distribution (smooth curve), plotted as p(z) against z. If dots fall onto the
page at random positions, there will usually be more dots in the taller columns
(only dots that fall under the curve are shown). The proportion of dots in each
column is proportional to the height p(z) of that column. Therefore, the
probability of choosing a specific value of z is proportional to p(z).


4.6. The Null Hypothesis

Conventional statistical analysis is based on the idea that data can
be used to reject a null hypothesis, which usually assumes there is no
relationship between variables or no effect of one variable on another.
If the data provide sufficient evidence to reject the null hypothesis then
an alternative hypothesis can be accepted. We will not discuss this in
detail, except to say that there may be many alternative hypotheses
which could account for the observed data.
As a simple example, consider the sample of 13 values of y in Table 1.1,
which has a mean of ȳ = 5.135. Is this mean significantly different from
zero? In other words, what is the probability of obtaining a sample
with this mean if the population mean µ is zero?
As will be shown below, it is improbable that the y data in Table 1.1
could have been obtained from a population of values for which the
population mean µ equals zero. Precisely how improbable is indicated
by the infamous p-value. In this case, we take the null hypothesis
to be that the population mean is zero, and we want to see if this
hypothesis can be rejected in favour of the alternative hypothesis that
the population mean is not zero.

Key point: Given a set of data with nonzero mean, we test


whether it is improbable that the mean could have been obtained
by pure chance from a population with a mean of zero. Precisely
how improbable is indicated by a p-value.

One-Tailed Tests. If we assume that, like the original scenario in


Table 1.1, our 13 values of y are measurements of height, there is no
point in considering negative values of y. Consequently, we do not
consider the possibility of µ < 0, so the alternative hypothesis is that
µ > 0 and the null hypothesis is that µ = 0. By convention, any
value that occurs with probability less than 0.05 is deemed statistically
significant (although p-values of 0.01 or less are used in some research
fields). Therefore, the question is: given a distribution with mean
zero, which values of the sample mean y are big enough to occur with
probability less than (or equal to) 0.05?


The answer is: any value of ȳ that lies under the shaded area of
the Gaussian distribution in Figure 4.6a. Because the total area under
the distribution curve is 1, the area of any region under the curve
corresponds to a probability. Imagine increasing the value of ȳ in
Figure 4.6a and calculating the area under the curve to the right of the
current value; it turns out that when the area under the curve to the
right of ȳ equals 0.05, the location of ȳ corresponds to a critical value
of ȳ_crit = 1.645 × σ_ȳ, where σ_ȳ is the standard error. In other words,
the value of ȳ that is 1.645 standard errors above the mean cuts off a
shaded region with an area of 0.05.
Using Equation 4.11, the estimated standard error is σ̂_ȳ = 0.302,
so ȳ_crit = 1.645 × 0.302 = 0.497. Therefore, values of ȳ larger than
0.497 are considered to be statistically significantly different from zero.
Clearly, the observed mean of 5.135 is larger than 0.497, so we can reject
the null hypothesis that µ = 0. To put it another way, the observed
mean of ȳ = 5.135 is 5.135/0.302 = 17.003 standard errors above zero,
which is very significant. Because we only consider the area in the tail
on one side of the distribution, this is called a one-tailed test.
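
For reference, the one-tailed critical values quoted above can be reproduced
with scipy; the sketch below (illustrative only, not book code) recovers
z = 1.645 and, using the estimated standard error 0.302 from the text, the
critical mean of about 0.497.

# Sketch: one-tailed critical value at the 0.05 significance level.
from scipy import stats

sem = 0.302                       # estimated standard error (Equation 4.28)
zcrit = stats.norm.ppf(0.95)      # ~1.645
print("One-tailed critical z   = %0.3f." % zcrit)
print("Critical mean ybar_crit = %0.3f." % (zcrit * sem))   # ~0.497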

Two-Tailed Tests. Now suppose that the 13 values of y in Table 1.1


are temperature measurements, so the mean temperature of this sample

[Figure 4.6 has two panels, (a) and (b), each plotting p(ȳ) against ȳ; the
central region under each curve has area 0.95.]

Figure 4.6: Two views of a Gaussian distribution with mean µ = 0 and
standard deviation σ_ȳ = 0.302. (a) The shaded region in the right tail has an
area of 0.05; the probability that a sample of n values has a mean at least as
large as 0.497 is equal to the area of the shaded region (0.05). (b) Each of the
two shaded regions has an area of 0.025, so the total shaded area is 0.05; the
probability that a sample of n values has a mean of 0.592 or greater is 0.025,
and the probability that the mean is −0.592 or more negative is also 0.025.


is ȳ = 5.135 degrees. In this case, since temperatures can be negative,
we must consider the possibility that ȳ and µ could be less than zero.
As in the one-tailed test, the null hypothesis is that µ = 0, but now the
alternative hypothesis is not just that µ > 0; instead, it is a composite
hypothesis: either µ > 0 or µ < 0. Equivalently, the alternative
hypothesis is that µ ≠ 0.
We calculated that the estimated standard deviation of the sample
mean is σ̂_ȳ = 0.302, so we can work out that 2.5% of the total area
under the curve lies to the right of the critical value ȳ⁺_crit = 1.96 × σ̂_ȳ =
1.96 × 0.302 = 0.592; similarly, 2.5% of the total area under the curve lies
to the left of the critical value ȳ⁻_crit = −0.592. In other words, the area to
the right of 0.592 is 0.025, and the area to the left of −0.592 is also 0.025,
as shown in Figure 4.6b. Therefore, the probability that |ȳ| > 0.592
is less than 0.05. So if ȳ > 0.592 then the p-value is p < 0.05 (i.e. a
statistically significant p-value). In fact, our sample mean is ȳ = 5.135,
which is larger than ȳ⁺_crit, so we can reject the null hypothesis (i.e. we
can conclude that µ is significantly different from zero). Because we
consider the areas in both the left and the right tails of the distribution,
this is called a two-tailed test.

4.7. The z-Test

If the population standard deviation σ_y is known then we can test
whether a sample mean ȳ is significantly different from zero by using
the z-score in Equation 4.12; this is called the z-test. In practice, σ_y is
never known, but assuming it to be known allows us to explain some
basic principles that will be used in the next section.
The standard deviation of the mean (standard error) σ_ȳ can be
obtained from the population standard deviation σ_y using σ_ȳ = σ_y/√n
(Equation 4.7). As z = (ȳ − µ)/σ_ȳ in Equation 4.12 has a Gaussian
distribution with mean 0 and standard deviation 1 (see page 37 and
Figure 4.5), there is a 95% chance that z lies between ±1.96,

    −1.96 ≤ z ≤ +1.96,    (4.14)


that is,

    −1.96 ≤ (ȳ − µ)/σ_ȳ ≤ +1.96.    (4.15)

Equivalently, there is a 95% chance that

    µ − 1.96σ_ȳ ≤ ȳ ≤ µ + 1.96σ_ȳ,    (4.16)

which means there is a 95% chance that the sample mean ȳ is within
1.96 standard errors of the population mean µ. More importantly,
Equation 4.15 can be rewritten as

    ȳ − 1.96σ_ȳ ≤ µ ≤ ȳ + 1.96σ_ȳ,    (4.17)

which means there is a 95% chance that the population mean µ lies
between ±1.96 standard errors of the sample mean ȳ.
Specifically, there is a 95% chance that the population mean µ lies in
the confidence interval (CI) between the confidence limits ȳ − 1.96σ_ȳ
and ȳ + 1.96σ_ȳ, which is written as

    CI(µ) = ȳ ± 1.96σ_ȳ.    (4.18)

For a Gaussian distribution, 95% of the area under the curve is within
±1.96 standard deviations of the mean. This implies that each of the
two tails of the distribution contains 2.5% of the area, making a total of
5%. Accordingly, z0.05 = 1.96 is the critical value of z for a statistical
significance level or p-value of 0.05.
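
As a worked illustration of Equation 4.18 (not part of the book's own
listings), the sketch below treats the estimated standard error 0.302 from the
numerical example as if it were the known value σ_ȳ and computes the resulting
95% confidence interval.

# Sketch: 95% confidence interval for the mean, assuming sigma_ybar is known.
ybar, sem = 5.135, 0.302
print("CI(mu) = %0.3f +/- %0.3f." % (ybar, 1.96 * sem))   # 5.135 +/- 0.592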

4.8. The t-Test

Given that the population standard deviation σ_y, and hence the
standard error σ_ȳ, is not known, testing whether an observed mean ȳ is
significantly different from zero requires a t-test.
If we replace the unknown standard error σ_ȳ in Equation 4.12 with its
unbiased estimate σ̂_ȳ from Equations 4.11 and 4.10 then the distribution
of the resulting variable will not quite be Gaussian. In fact, the
distribution belongs to a family of t-distributions, where each member

of the family has a different number of degrees of freedom (see page 34),
as shown in Figure 4.7. Analogous to a z-score, we have the variable

    t = (ȳ − µ)/σ̂_ȳ,    (4.19)

where σ̂_ȳ = σ̂_y/√n and σ̂_y is calculated from Equation 4.10. Just as
with a Gaussian distribution, for a t-distribution, 95% of the area under
the curve is between t = ±t(0.05) standard deviations of the mean,
where t(0.05) denotes the critical value of t for a statistical significance
level of p = 0.05. In other words, (by analogy with Equation 4.15) there
is a 95% chance that

    −t(0.05) ≤ (ȳ − µ)/σ̂_ȳ ≤ +t(0.05).    (4.20)

However, the estimated population standard error σ̂_ȳ, like the variance
discussed on page 34, is based on n differences from a fixed sample
mean and is therefore associated with a particular number of degrees of
freedom, n − 1. This means that the critical value t(0.05) also depends
on the degrees of freedom, so we write it as t(0.05, ν). Accordingly,
Equation 4.20 becomes

    −t(0.05, ν) ≤ (ȳ − µ)/σ̂_ȳ ≤ +t(0.05, ν),    (4.21)

Figure 4.7: The t-distribution curves with degrees of freedom ν = 2, 4 and 8.
For comparison, the dashed curve is a Gaussian distribution with standard
deviation σ = 1, which is close to the t-distribution with ν = 8.


and (analogously to Equation 4.17) we have

    ȳ − t(0.05, ν) σ̂_ȳ ≤ µ ≤ ȳ + t(0.05, ν) σ̂_ȳ.    (4.22)

This means there is a 95% chance that the population mean µ lies
between ±t(0.05, ν) standard errors of the sample mean ȳ.
By analogy with Equation 4.18, there is a 95% chance that the
population mean µ lies in the confidence interval between the confidence
limits ȳ − t(0.05, ν)σ̂_ȳ and ȳ + t(0.05, ν)σ̂_ȳ, which is written as

    CI(µ) = ȳ ± t(0.05, ν) σ̂_ȳ.    (4.23)

The t-distribution becomes indistinguishable from a Gaussian
distribution as n (and hence ν) approaches infinity, which implies that
the critical value t(0.05, ν) approaches 1.96 as n approaches infinity.

p-Values for the Mean

To find the p-value associated with the mean ȳ, look up the critical
value t(0.05, ν) corresponding to ν = n − 1 degrees of freedom and a
significance level of p = 0.05 (see Table 4.1). If the absolute value of t
from Equation 4.19 is larger than t(0.05, ν) (i.e. |t| > t(0.05, ν)) then
the p-value is p(t, ν) < 0.05.

Degrees of freedom, ν    t at p = 0.05    t at p = 0.01
          2                  4.303            6.965
          4                  2.776            3.747
          8                  2.306            2.896
         11                  2.201            2.718
         12                  2.179            2.681
         20                  2.086            2.528
          ∞                  1.960            2.326

Table 4.1: Critical values of t for a two-tailed test for different degrees of
freedom ν and p-values 0.05 and 0.01, where each p-value corresponds to the
total area under the two tails of the t-distribution. When ν = ∞, the values
of t equal the z-scores of a normalised Gaussian distribution.
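
The p = 0.05 column of Table 4.1 can be reproduced with scipy's
t-distribution, using an upper-tail area of 0.025 for a two-tailed test; the
sketch below is illustrative only and is not one of the book's listings.

# Sketch: two-tailed critical t values at p = 0.05 (upper-tail area 0.025).
from scipy import stats

for df in [2, 4, 8, 11, 12, 20]:
    print("df = %2d: t(0.05, df) = %5.3f" % (df, stats.t.ppf(0.975, df)))
print("Gaussian limit (df -> infinity): %5.3f" % stats.norm.ppf(0.975))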


Significance Versus Importance

Clearly, the lower the p-value, the more statistical significance is
associated with the data. However, statistical significance does
not necessarily mean anything important. For example, consider a
parent population of temperature readings for which the mean is
µ = 0.001 degrees. If we take a sufficiently large sample of size n
from this population then the difference between the sample mean ȳ and
zero will be found to be highly significant (e.g. p < 0.00001). Briefly, this
is because the value of t in a t-test depends on the standard error of the
sample mean, and this shrinks in proportion to 1/√n (Equation 4.11).
Consequently, even though µ is tiny, if the sample size n is sufficiently
large then t will be large (Equation 4.19), so the sample mean ȳ will be
highly significant, but it is also unimportant in this example.
This effect becomes more compelling when a t-test is used to assess
the difference between two means. For example, suppose there is a
difference of 1 millisecond between the mean reaction times of men and
women. This would be highly significant given sufficiently large sample
sizes n (e.g. if everyone on Earth was tested), but it is also unimportant.

4.9. Numerical Example

Statistical Significance of the Mean. The mean of the n = 13
values of y in Table 1.1 is ȳ = 5.135 feet. The variance of these values is

    var(y) = (1/n) Σ_{i=1}^n (yi − ȳ)²    (4.24)
           = 14.240/13 = 1.095 feet².    (4.25)

In contrast, the unbiased estimate of the parent population variance
(Equation 4.10) involves division by the number of degrees of freedom,
ν = n − 1 = 12:

    σ̂_y² = [1/(n − 1)] Σ_{i=1}^n (yi − ȳ)²    (4.26)
         = 14.240/12 = 1.187 feet².    (4.27)

Therefore the estimated population standard deviation is σ̂_y = √1.187 =
1.089 feet. From Equation 4.11, the unbiased estimate of the standard
error (i.e. the estimated standard deviation of the sample means) is

    σ̂_ȳ = σ̂_y/√n = 1.089/√13 = 0.302 feet.    (4.28)

If we test the null hypothesis that the data were obtained from a
population with mean µ = 0 then the value of t is (Equation 4.19)

    t_ȳ = (ȳ − µ)/σ̂_ȳ = (5.135 − 0)/0.302 = 16.997.    (4.29)

For ν = n − 1 = 13 − 1 = 12, the critical value of t at significance level
p = 0.05 for a one-tailed test is

    t(0.05, 12) = 1.782,    (4.30)

where we can justify using a one-tailed test because height cannot be
negative. Because t_ȳ = 16.997 is larger than the critical value t(0.05, 12),
the probability that the observed values of y could have occurred by
chance given that the population mean is µ = 0 is less than 5%. In fact,
the exact p-value is p = 4.617 × 10⁻¹⁰; few data sets are large enough
to justify this level of precision (especially a data set with only n = 13),
so we would usually report this simply as p < 0.01.

Confidence Intervals of the Mean. From Equation 4.23, there is a
95% chance that the population mean µ lies in the confidence interval

    CI(µ) = ȳ ± t(0.05, 12) σ̂_ȳ    (4.31)
          = 5.135 ± (1.782 × 0.302)    (4.32)
          = 5.135 ± 0.538.    (4.33)

Thus, there is a 95% chance that the population mean µ is between
5.135 − 0.538 = 4.597 and 5.135 + 0.538 = 5.673.

Reference.
Walker HM (1940). Degrees of Freedom. Journal of Educational
Psychology, 31(4), 253.


4.10. Python Code


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ch4Python.py Statistical significance of mean.

import numpy as np
from scipy import stats

y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

# Convert data to vector.


y = np.array(y)
n = len(y)

# Find zero mean version of y.


ymean = np.mean(y)
yzm = y-ymean
varsampley = np.sum(yzm * yzm)/n
print("Sample variance of y = %0.3f." %varsampley) # 1.095

# Unbiased estimate of parent population variance (divide by n-1).


varpopy = np.sum(yzm * yzm)/(n-1)
print("Est of pop variance of y = %0.3f." %varpopy) # 1.187

# Standard error of ymean.


semymean = np.sqrt(varpopy/float(n)) # 0.302
print('Standard error of ymean = %.3f.\n' % semymean)

# Find t value.
tval = ymean/semymean # 16.997

# One-tailed p-value = Prob(t>tval).


pval = stats.t.sf(np.abs(tval), n-1) # 9.2343e-10.

print('RESULT CALCULATED BY HAND: ')


print('t-statistic (by hand) = %6.3f.' % tval)
print('p-value = %6.4e.\n' % pval)

# Compare the standard library version of t-test.


m = 0 # mean
t, p = stats.ttest_1samp(y, m)
p = p/2 # for one-tailed p-value.
print('\nLIBRARY RESULT FOR COMPARISON: ')
print('t-statistic (library) = %6.3f.' % t)
print('p-value = %6.4e.\n' % p)
# END OF FILE.

Chapter 5

Statistical Significance: Regression


5.1. Introduction

The quality of the fit between a line and the data may look impressive,
or it may look downright shoddy. In either case, how are we to assess
the statistical significance of the best fitting line? In fact, this question
involves two distinct subsidiary questions.
1. What is the statistical significance associated with the overall fit
of the line to the data?

2. What is the statistical significance associated with each individual


parameter, such as the slope and the intercept?
For reasons that will become apparent, we address these questions in
reverse order. Most books on regression analysis make use of three
statistical tests: the t-test, the F -test and the chi-squared test. As we
shall see, the chi-squared test is not needed. And, because the t-test
is a special case of the F -test, regression analysis really requires only
one statistical test, the F -test. However, when assessing the statistical
significance associated with individual parameters, it is convenient to
use the t-test.

5.2. Statistical Significance

As stated in the previous chapter, the slope of the best fitting line is
a weighted mean. Because a conventional mean is a special case of
a weighted mean in which every weight equals 1, we can adapt the
standard methods from the previous chapter for finding the statistical
significance of means to find the statistical significance of the slope.


Slope is a Weighted Mean

We can see that the slope b1 is a weighted mean of y values from
Equation 3.26 (repeated here),

    b1 = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi²,    (5.1)

which can be re-written as

    b1 = Σ_{i=1}^n wi yi,    (5.2)

where the ith weight is

    wi = xi / Σ_{i=1}^n xi².    (5.3)
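
The claim that the slope is a weighted mean can be checked numerically. The
sketch below (illustrative, not one of the book's listings) uses the x and y
values of Table 1.1 as listed in the Python code of Section 5.8, and assumes,
as in Equation 3.26, that the x values in Equations 5.1–5.3 are the zero-mean
versions of the data.

# Sketch: the slope b1 as a weighted mean of the y values (Equations 5.2, 5.3).
import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])
xzm = x - x.mean()                      # zero-mean version of x

w = xzm / np.sum(xzm**2)                # weights w_i (Equation 5.3)
b1_weighted = np.sum(w * y)             # slope as a weighted mean (Equation 5.2)
b1_direct = np.sum(xzm * (y - y.mean())) / np.sum(xzm**2)
print("b1 as weighted mean = %0.3f." % b1_weighted)   # 0.764
print("b1 directly         = %0.3f." % b1_direct)     # 0.764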

5.3. Statistical Significance: Slope

Now we return to the second question posed at the beginning of
this chapter: what is the statistical significance associated with each
individual parameter, such as the slope and the intercept? Specifically,
how can we assess the statistical significance of the slope of the best
fitting line?
Basically, the answer comes down to whether it is statistically plausible
that the data could have arisen by chance alone. In the context of linear
regression, this can be rephrased as the following question: what is the
probability that observed data with a best fitting line of slope b1 are
just noisy versions of values that lie on a horizontal line? In this case
the null hypothesis is that the relationship between the variables is a
horizontal line, i.e. there is no dependence between x and y.
One way to answer this question is to estimate how different the best
fitting line with a slope of b1 is from a line with slope zero. Let us
denote the null hypothesis slope by b1⁰, so that b1⁰ = 0. Suppose the
data in Table 1.1 represent a sample of n values

    yi = b1 xi + b0 + ηi,    (5.4)


where b1 is the slope of the linear relationship between y and x values,
b0 is the intercept, and ηi is random noise. The presence of this noise
means that any estimate of b1 based on values of y also contains noise.
Suppose also that each sample is the result of a single run of an
experiment in which we measure the values yi corresponding to a fixed
set of values {x1, . . . , xn}. If we run the experiment N times then we
get N different data sets where the underlying relationship between x
and y remains constant, but the noise in the ith observed value yi is
different in each run of the experiment. Using these N data sets, we
can obtain a set of N different estimates of the slope,

    {b11, b12, . . . , b1N}.    (5.5)

If we plot a histogram of these N estimates then we would find that they
approximate a Gaussian distribution (because each estimated value b1j
is a mean, which therefore obeys the central limit theorem; see page 32).
Crucially, when measured in units of standard deviations, the distance
between the slope b1 estimated from a data set and the slope b1⁰ is

    t_b1 = (b1 − b1⁰)/σ̂_b1,    (5.6)

where we have defined b1⁰ = 0, so that

    t_b1 = b1/σ̂_b1.    (5.7)

Finding the Estimate σ̂_b1 of σ_b1. The central limit theorem (Section
4.2) implies that if each member of a set of variables {y1, y2, . . . , yn} has
a Gaussian distribution then any linear combination of those variables
also has a Gaussian distribution. From Equation 5.2 we know that
the slope parameter b1 is a linear combination of y values, and we also
know that each of these values has a Gaussian distribution; therefore,
b1 has a Gaussian distribution. We can obtain the variance σ_b1² of this
distribution as follows. From Equation 5.2, we have

    σ_b1² = var( Σ_{i=1}^n wi yi ).    (5.8)


If the values of yi are uncorrelated with each other then

    var( Σ_{i=1}^n wi yi ) = Σ_{i=1}^n var(wi yi).    (5.9)

Also, var(wi yi) = wi² var(yi) (see page 17), so

    σ_b1² = Σ_{i=1}^n wi² var(yi).    (5.10)

Recall that we are assuming that the variances of all the data points
are the same, var(y1) = var(y2) = · · · = var(yn), with all being equal to
the variance σ_η² of the noise ηi in yi; therefore

    σ_b1² = σ_η² Σ_{i=1}^n wi².    (5.11)

Using Equation 5.3, we find that Σ_{i=1}^n wi² = 1/Σ_{i=1}^n (xi − x̄)², so

    σ_b1² = σ_η² / Σ_{i=1}^n (xi − x̄)².    (5.12)

Of course, we do not know the value of σ_η², but an unbiased estimate is

    σ̂_η² = Σ_{i=1}^n (yi − ŷi)²/(n − 2)    (5.13)
         = E/(n − 2),    (5.14)

where n − 2 is the number of degrees of freedom (see Section 4.3). When
assessing the number of degrees of freedom, we start with n, and then we
lose one degree of freedom per parameter in the regression model. For
simple regression there are p = 2 parameters, the slope and intercept;
so the number of degrees of freedom is ν = n − p = n − 2. Substituting
Equation 5.13 into Equation 5.12 and taking square roots yields the
unbiased estimated standard deviation of the slope b1,

    σ̂_b1 = [ (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)² ]^{1/2} / [ Σ_{i=1}^n (xi − x̄)² ]^{1/2}.    (5.15)


This can also be derived from a vector–matrix formulation (Section 7.3),
which leads to the general solution in Equation 7.34.
For purely explanatory purposes, this can be simplified as follows. If
n is large then 1/(n − 2) ≈ 1/n, so the numerator is approximately equal
to s_η, the standard deviation of the ηi = yi − ŷi values. Similarly, (from
Equation 3.7) we recognise the denominator as √(n var(x)) = s_x √n.
Thus, for large n Equation 5.15 can be written as

    σ̂_b1 ≈ s_η / (s_x √n).    (5.16)

Note that the standard deviation of the estimated slope is inversely
proportional to the square root of the number n of data points in each
sample, so it has the same form as in Figure 4.3. By analogy with
Equation 4.11, the unbiased estimate of the standard deviation in the
mean value of the noise is

    σ̂_η̄ = σ̂_η / √n,    (5.17)

where σ̂_η ≈ s_η and σ̂_η̄ ≈ s_η̄ for large n, so that

    s_η̄ ≈ s_η / √n.    (5.18)

Substituting this into Equation 5.16 yields

    σ̂_b1 ≈ s_η̄ / s_x.    (5.19)

As s_x is constant because the xi values are fixed, the standard deviation
of the estimated slope is proportional to the standard deviation in the
mean value of the noise. This makes intuitive sense, because one would
expect the standard deviation of the slope of the best fitting line to
depend on the amount of noise in the observed values of y.
It can be shown that Equation 5.7 can be evaluated in terms of the
correlation r between x and y:

    t_b1 = r (n − 2)^{1/2} / (1 − r²)^{1/2},    (5.20)


where r is obtained from Equation 3.27 (repeated here),

    r² = var(ŷ) / var(y).    (5.21)
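
Equation 5.20 can be evaluated directly from the values in the numerical
example of Section 5.7 (r² = 0.466, n = 13); the sketch below is purely
illustrative and the small difference from the quoted value 3.101 is due to
rounding r².

# Sketch: t_b1 computed from the correlation r (Equation 5.20).
import numpy as np

r2, n = 0.466, 13
t_b1 = np.sqrt(r2) * np.sqrt(n - 2) / np.sqrt(1 - r2)
print("t_b1 from r = %0.2f." % t_b1)    # ~3.10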

Confidence Interval: Slope

By analogy with Equation 4.23, if b1 is the slope of the best fitting
line then there is a 95% chance that the population mean µ_b1 lies in
the confidence interval between b1 − t(0.05, ν)σ̂_b1 and b1 + t(0.05, ν)σ̂_b1,
where ν = n − 2; that is,

    CI(µ_b1) = b1 ± t(0.05, ν) σ̂_b1.    (5.22)
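
A 95% confidence interval for the slope can be computed with scipy's
t-distribution; the sketch below (illustrative only, not book code) uses
b1 = 0.764 and σ̂_b1 = 0.246 from the numerical example in Section 5.7.

# Sketch: 95% confidence interval for the slope (Equation 5.22).
from scipy import stats

b1, sem_b1, n = 0.764, 0.246, 13
tcrit = stats.t.ppf(1 - 0.05/2, n - 2)       # t(0.05, 11), ~2.201
print("CI(b1) = %0.3f +/- %0.3f." % (b1, tcrit * sem_b1))   # 0.764 +/- 0.542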

p-Values: Slope

The data contain noise, so it is possible that they actually have a slope
of zero, even though the best fitting line has a nonzero slope b1. The
p-value is the probability that the slope b1 of the best fitting line is due
to noise in the data. More precisely, the p-value is the probability that
the slope of the best fitting line is equal to or more extreme than the b1
observed, given that the true slope is zero.
To find the p-value associated with the slope b1, look up the critical
value t(0.05, ν) that corresponds to ν = n − 2 degrees of freedom and
a significance value of p = 0.05 (see Table 4.1). If the absolute value
of t_b1 from Equation 5.7 is larger than t(0.05, ν) (i.e. |t_b1| > t(0.05, ν))
then the p-value is p(t_b1, ν) < 0.05. Notice that, in principle, the slope
could have been negative or positive, so we use a two-tailed test.

Key point: The data contain noise, so it is possible that they


actually have a slope of zero, even though the best fitting line
has nonzero slope. The p-value is the probability that the slope
of the best fitting line is due to noise in the data.


5.4. Statistical Significance: Intercept

The difference between the best fitting value of b0 and a hypothetical
value b0⁰, in units of standard deviations, is

    t_b0 = (b0 − b0⁰)/σ̂_b0,    (5.23)

where σ̂_b0 is the unbiased estimate of the standard deviation σ_b0 of b0.
We state without proof that (also see Section 7.5)

    σ̂_b0 = σ̂_η × [ 1/n + x̄² / Σ_{i=1}^n (xi − x̄)² ]^{1/2},    (5.24)

where σ̂_η is defined by Equation 5.13. As with the slope, the number of
degrees of freedom is ν = n − 2. For example, to test the null hypothesis
that the intercept is b0⁰ = 0, we use t_b0 = b0/σ̂_b0 with ν = n − 2 degrees
of freedom.

Confidence Interval: Intercept

By analogy with Equation 5.22, if b0 is the intercept of the best fitting
line then there is a 95% chance that the population mean µ_b0 lies in
the confidence interval between the confidence limits b0 − t(0.05, ν)σ̂_b0
and b0 + t(0.05, ν)σ̂_b0, where ν = n − 2, which is written as

    CI(µ_b0) = b0 ± t(0.05, ν) σ̂_b0.    (5.25)

5.5. Significance Versus Importance

The lower the p-value, the more statistical significance is associated with
the data. However, as discussed in Section 4.8, statistical significance
is not necessarily associated with anything important. For example,
suppose that the slope of the best fitting line y = b1 x + b0 is very small
(e.g. b1 = 0.001). The fact that σ̂_b1 ∝ 1/√n (Equation 5.16) implies
that for a sufficiently large sample size n, the value of t_b1 = b1/σ̂_b1
(Equation 5.7) will be large, so the slope b1 will be found to be highly
significant (e.g. p < 0.00001). Even so, because the slope is so small, it
is probably unimportant.


In contrast, the informal notion of importance is related to the


proportion of variance in y that is explained by x. Despite a significant
(i.e. small) p-value, if x accounts for only a tiny proportion of variance
in y then x is unimportant. In such cases, there are probably other
variables that account for a large proportion of variance in y, which
may be discovered using multivariate regression (see Chapter 7).

Key point: That the slope b1 of the best fitting line y = b1 x+b0
is statistically significant does not imply that x is an important
factor in accounting for y.

5.6. Assessing the Overall Fit

The F-test. In the case of the simple linear regression considered here,
the t-statistic provides all the information necessary to evaluate the fit.
However, when we come to consider more general models (e.g. weighted
linear regression in Chapter 7), we will find that the t-test is really a
special case of a more general test, called the F-test. Thus, this section
paves the way for later chapters and may be skipped on first reading.
Roughly speaking, the F-test relies on the ratio F′ of two proportions,
[the proportion of variance in y explained by the model] to [the
proportion of variance in y not explained by the model]:

    F′ = proportion of explained variance / proportion of unexplained variance.    (5.26)

Clearly, larger values of F′ imply a better fit of the model to the data.
From Equation 5.21, the proportion of variance in y explained by the
regression model is r², so the proportion of unexplained variance is
1 − r²; then Equation 5.26 becomes

    F′ = r² / (1 − r²).    (5.27)

However, we need to take account of the degrees of freedom associated
with the explained and unexplained variances. Accordingly, the F-ratio is


    F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)],    (5.28)

where p is the number of parameters, which consist of k slopes (for k
regressors, or independent variables) plus an intercept, so p = k + 1.
This can also be expressed in terms of the sums of squares defined
in Section 3.4:

    F(p − 1, n − p) = [SS_Exp/(p − 1)] / [SS_Noise/(n − p)].    (5.29)

A table of F values lists two different degrees of freedom, the
numerator degrees of freedom ν1 = p − 1 and the denominator degrees of
freedom ν2 = n − p. These names matter when using an F-table to look
up a p-value. In general, the expected or mean value of F is 1. From
Equation 5.28, large values of r² (which indicate strong correlation)
correspond to large values of F.
If the value of F from Equation 5.28 is larger than the value
F(p − 1, n − p) listed in a look-up table (at a specified p-value, such
as 0.05) then the probability that the measured correlation could have
occurred by chance is less than 0.05. Modern software packages usually
report the exact p-value implied by a particular value of F. In other
words, if F ≫ 1 then it is improbable that there is no underlying
correlation between x and y. Precisely how improbable is given by the
p-value calculated using an F-test.

The F-Test and the t-Test. As mentioned earlier, the t-test is really
a special case of the F-test. With one independent variable, the number
of parameters is p = 2, so Equation 5.28 becomes

    F(1, n − 2) = r² / [(1 − r²)/(n − 2)].    (5.30)

If we rearrange Equation 5.20 to

    t_b1(n − 2) = [ r² / ((1 − r²)/(n − 2)) ]^{1/2}    (5.31)


then we can see that F(1, n − 2) = [t_b1(n − 2)]². In general, the value
of F with 1 and ν = n − p degrees of freedom equals the square of t_b1
with ν degrees of freedom.
So far we have been considering simple regression with only one
regressor, and in this case the t-test and the F-test give exactly the
same result. If we consider more than one regressor, as in multivariate
regression (Chapter 7), then the t-test and the F-test are no longer
equivalent and the F-test must be used.
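
The relation F(1, n − 2) = t_b1² can be verified numerically using r² = 0.466
and n = 13 from the numerical example of Section 5.7; the following sketch is
illustrative only, and the small differences from the quoted values 9.617 and
3.101 are due to rounding r².

# Sketch: F(1, n-2) equals the square of t_b1 in simple regression.
r2, n = 0.466, 13
F = (r2 / 1) / ((1 - r2) / (n - 2))
t_b1 = (r2 * (n - 2) / (1 - r2)) ** 0.5
print("F = %0.2f, t_b1 squared = %0.2f." % (F, t_b1**2))   # both ~9.6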
Not Using the χ²-Test. It is common to assess the overall fit between
the data and the best fitting line using either the F-test or the χ²-test
(chi-squared test). In fact, these tests are equivalent as n → ∞.
However, we can usually disregard the χ²-test, because it is the least
accurate of these tests when n is not very large, which is often the case
in practice.

5.7. Numerical Example

Statistical Significance of the Slope. From Equation 5.15, the
unbiased estimate of the standard deviation σ_b1 of b1 values is

    σ̂_b1 = [ (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)² ]^{1/2} / [ Σ_{i=1}^n (xi − x̄)² ]^{1/2}    (5.32)
         = 0.831/3.373 = 0.246.    (5.33)

To test the idea that the true slope is b1⁰ = 0 we calculate (using
Equation 5.7)

    t_b1 = b1/σ̂_b1    (5.34)
         = 0.764/0.246 = 3.101.    (5.35)

In a look-up table (e.g. Table 4.1), we locate the row for ν = n − p =
13 − 2 = 11 degrees of freedom and find that if t = 2.201 then the
p-value is p = 0.05, which is written as

    t(0.05, 11) = 2.201.    (5.36)


Because our value of t_b1 is larger than 2.201, its associated p-value is
less than 0.05. By convention, this is reported as

    p(3.101, 11) < 0.05,    (5.37)

which is read as ‘the t value of 3.101 with 11 degrees of freedom implies
a p-value of less than 0.05’.
In other words, if the true slope is zero then the probability of
observing data with a best fitting slope of |b1| ≥ 0.764 is less than
5%. To put this yet another way, if we were to rerun the experiment
N = 100 times then we expect to obtain a best fitting line whose slope
has magnitude greater than or equal to 0.764 about five times.
In fact, using most modern computer software, we can do better than
just state that the p-value is less than some critical level. For example,
given t_b1 = 3.101 with ν = 11, a software package output yields

    p = 0.0101.    (5.38)

In words, if the true slope is zero then the probability of obtaining by
chance data with a best fitting slope of |b1| ≥ 0.764 is p = 0.0101.

Confidence Interval of the Slope. From Equation 5.22, if the slope
of the best fitting line is b1, there is a 95% chance that the population
mean µ_b1 lies in the confidence interval between the confidence limits
b1 − t(0.05, ν)σ̂_b1 and b1 + t(0.05, ν)σ̂_b1, where ν = n − 2 = 11:

    CI(µ_b1) = b1 ± t(0.05, ν)σ̂_b1    (5.39)
             = 0.764 ± (2.201 × 0.246)    (5.40)
             = 0.764 ± 0.542.    (5.41)

Statistical Significance of the Intercept. Using Equation 5.24,

    σ̂_b0 = [ (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)² ]^{1/2} × [ 1/n + x̄² / Σ_{i=1}^n (xi − x̄)² ]^{1/2}    (5.42)
         = 0.831 × 0.791    (5.43)
         = 0.658.    (5.44)


parameter       value    standard error σ̂    t       ν    p
slope b1        0.764    0.246               3.101   11   0.0101
intercept b0    3.22     0.658               4.903   11   < 0.01

r        r²       F        ν_Num    ν_Den    p
0.683    0.466    9.617    1        11       0.0101

Table 5.1: Simple regression analysis of data in Table 1.1.

Given that the best fitting intercept is b0 = 3.22, we have

    t_b0 = b0/σ̂_b0 = 3.22/0.658 = 4.903.    (5.45)

From Table 4.1, the critical value of t for ν = 11 degrees of freedom at
significance level 0.01 is

    t(0.01, 11) = 2.718.    (5.46)

Because the value of t_b0 is larger than 2.718, its associated p-value is
less than 0.01. By convention, this is reported as

    p(4.903, 11) < 0.01,    (5.47)

which is read as ‘the t value of 4.903 with ν = 11 degrees of freedom
implies a p-value of less than 0.01’.

Confidence Intervals of the Intercept. From Equation 5.25, the
confidence interval for the intercept with ν = n − 2 = 11 is

    CI(µ_b0) = b0 ± t(0.05, ν)σ̂_b0    (5.48)
             = 3.22 ± (2.201 × 0.658)    (5.49)
             = 3.22 ± 1.448.    (5.50)

Assessing the Overall Model Fit

The overall fit is represented by the correlation coefficient, and the


statistical significance of the correlation coefficient is assessed using the


F-statistic in Equation 5.28 (repeated here),

    F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)],    (5.51)

where r² is equal to the coefficient of determination. From Equation 3.27,
the coefficient of determination is

    r² = var(ŷ)/var(y) = 0.511/1.095 = 0.466.    (5.52)

Substituting r² = 0.466, p − 1 = 2 − 1 = 1 and n − p = 13 − 2 = 11 into
Equation 5.51, we get

    F(1, 11) = (0.466/1) / ((1 − 0.466)/11) = 9.617.    (5.53)

Using the numerator degrees of freedom (p − 1 = 1) and the denominator
degrees of freedom (n − p = 11), we find that the p-value is

    p = 0.0101.    (5.54)

This agrees with the p-value from the t-test for the slope in Equation 5.38,
as it should do. This is because (in the case of simple regression)
the overall model fit is determined by a single quantity r², which is
determined by the slope b1.


5.8. Python Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
ch5Python.py Statistical significance of regression.
"""
import numpy as np
from scipy import stats
import statsmodels.api as sm

x = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

# Convert data to vectors.


x = np.array(x)
y = np.array(y)
n = len(y)
xmean = np.mean(x)
ymean = np.mean(y)
# Find zero mean versions of x and y.
xzm = x - xmean
yzm = y - ymean
covxy = np.sum(xzm * yzm)/n
varx = np.var(x)
vary = np.var(y)
print("Variance of x = %0.3f." %varx) # 0.875
print("Variance of y = %0.3f." %vary) # 1.095
print("Covariance of x and y = %0.3f." %covxy) # 0.669

# Find slope b1.


b1 = covxy/varx # 0.764
# Find intercept b0.
b0 = ymean - b1*xmean # 3.225
print('\nslope b1 = %6.3f\nintercept b0 = %6.3f.' % (b1, b0))

# Find vertical projection of y onto best fitting line.


yhat = b1*x + b0

numparams = 2 # number of parameters=2 (slope and intercept).

# SLOPE
# Find sem of slope.
num = ( (1/(n-numparams)) * sum((y-yhat)**2) )**0.5 # 0.831.
den = sum((x-xmean)**2)**0.5 # 3.373
semslope = num/den
print('semslope = %6.3f.' % (semslope))


# Find t-value of slope.


tslope = b1/semslope # 3.101

# Find p-value of slope.


# two-tailed pvalue = Prob(abs(t)>tt).
pvalue = stats.t.sf(np.abs(tslope), n-numparams)*2 # 0.0101
print('\nSLOPE:\nt-statistic = %6.3f.' % tslope)
print('pvalue = %6.4f.\n' % pvalue)

# INTERCEPT
# Find sem of intercept.
a = ( (1/(n-numparams)) * sum((y-yhat)**2) )**0.5 # 0.831
b = ( (1/n) + xmean**2 / sum( xzm**2 ))**0.5 # 0.791
semintercept = a * b # 0.658
print('\na = %6.3f\nb = %6.3f\nsemintercept = a*b = %6.3f.'
      % (a, b, semintercept))

# Find t-value of intercept.


tintercept = b0 / semintercept
pintercept = stats.t.sf(np.abs(tintercept), n-numparams)*2
print('\nINTERCEPT:\nt-statistic = %6.3f.' % tintercept)
print('pvalue = %6.4f.\n' % pintercept)

# Overall model fit to data.


# Find coefficient of determination r2.
r2 = covxy * covxy / (varx * vary)
print("coefficient of determination = %0.3f." %r2) # 0.466.

# Find F ratio.
A = r2 / (numparams-1)
B = (1-r2) / (n-numparams)
F = A/B # 9.617
print("F ratio = %0.4f." % F)

pfit = stats.f.sf(F, numparams-1, n-numparams)


print("p overall fit = %0.4f." % pfit) # 0.0101.

# Run standard library regression method for comparison.


ones = np.ones(len(x))
X = [ones, x]
X = np.transpose(X)
y = np.transpose(y)
res_ols = sm.OLS(y, X).fit()
print(res_ols.summary()) # Print table of results.

# END OF FILE.

Chapter 6

Maximum Likelihood Estimation

6.1. Introduction

In this chapter we show how the least squares procedure described in


previous chapters is equivalent to maximum likelihood estimation (MLE).
In essence, the equation of a straight line is a model (with parameters
b1 and b0 ) of how one variable y (e.g. height) changes with another
variable x (e.g. salary). If we model a set of n pairs of data values xi
and yi with a line of slope b1 and intercept b0 , the model’s estimate of
the value of y at xi is

ŷi = b1 x i + b0 , (6.1)

(this is the same as Equation 1.1). As shown in Figure 6.1 (see also
Figure 1.3), the vertical distance between the measured value yi and
the value ŷi predicted by the ‘straight line’ model is represented as

    ηi = yi − ŷi,    (6.2)

where ηi is the amount of error or noise in the value yi. If we assume that
η has a Gaussian or normal distribution with mean zero and standard
deviation σ_i then the probability of a value ηi occurring is

    p(ηi) = ki e^{−ηi²/(2σ_i²)},    (6.3)

where ki = 1/√(2πσ_i²) is a normalising constant which ensures that the
area under the Gaussian distribution curve sums to 1. More accurately,


p(ηi) is a probability density, but we need not worry about such subtle
distinctions here. The shape of a Gaussian distribution with σ = 1 is
shown in Figure 4.2 (p32).
If we substitute Equations 6.1 and 6.2 into Equation 6.3, we obtain

    p(ηi) = ki e^{−[yi − (b1 xi + b0)]²/(2σ_i²)}.    (6.4)

But because the values of b1 and b0 are determined by the model, while
the values of xi are fixed, the probability p(ηi) is also a function of yi,
which represents the probability of observing yi, so we can replace p(ηi)
with p(yi) and write

    p(yi) = ki e^{−[yi − (b1 xi + b0)]²/(2σ_i²)}.    (6.5)

In words, the probability of observing a value yi that contains noise ηi
varies as a Gaussian function of ηi, where this Gaussian function has
mean zero and standard deviation σ_i. A short way of writing that a
variable η has a Gaussian or normal distribution function with mean
µ and standard deviation σ is η ∼ N(µ, σ²), where σ² is the variance
of η. So, given that noise is assumed to have a mean of zero, we have
ηi ∼ N(0, σ_i²). Notice that each value ηi can have its own unique
variance σ_i², which indicates the reliability of the corresponding value yi.

[Figure 6.1 plots Height, y (feet) against Salary, x (groats).]

Figure 6.1: The vertical distance between each measured value yi and the
value ŷi predicted by the straight line model is ηi, shown for four data points
here. All measured values contain noise, and the noise of each data point is
assumed to have a Gaussian distribution; each distribution has a different
standard deviation, indicated by the width of the vertical Gaussian curves.


6.2. The Likelihood Function

Because the probability of observing yi depends on the parameters b1 and
b0, the distribution of yi values is a conditional probability distribution.
The conditional nature of this distribution is made explicit by using the
vertical bar notation, so Equation 6.5 becomes

    p(yi|b1, b0) = ki e^{−[yi − (b1 xi + b0)]²/(2σ_i²)}.    (6.6)

In words, p(yi|b1, b0) is interpreted as ‘the conditional probability that
the variable y has the value yi given the parameter values b1 and b0’.
Because p(yi) = p(ηi), we could have written this as p(ηi|b1, b0). The
conditional probability p(yi|b1, b0) that the variable y equals yi is also
interpreted as the likelihood of the parameter values b1 and b0.

Combining Probabilities: Coin Flipping. We usually have more
than one data point, so we need to know how to combine their
probabilities. To get a feel for how to do this, consider the simple
example of flipping a coin. A typical coin is equally likely to yield a
head or a tail, but let’s imagine that this is a biased coin for which the
probability of a head is given by the parameter θ. Since the probability
of a head is p(h|θ) = θ, the probability of a tail is p(t|θ) = 1 − θ; the
vertical bar notation makes the dependence on θ explicit.
For example, consider a coin for which the probability of a head is
θ = 0.9, so that the probability of a tail is 1 − θ = 0.1. If n = 3 coin
flips produce a head followed by two tails, we write this result as the
sequence (h, t, t). Because none of the individual flip outcomes depends
on the other two flip outcomes, all outcomes are independent of each
other, so the joint probability p(h, t, t) of all three outcomes is obtained
by multiplying the probabilities of the individual outcomes:

    p(h, t, t|θ) = p(h|θ) × p(t|θ) × p(t|θ)    (6.7)
                = 0.9 × 0.1 × 0.1    (6.8)
                = 0.009.    (6.9)

This is interpreted as the likelihood that θ = 0.9.
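
To see how the likelihood behaves as a function of θ, the sketch below (not in
the book's listings) evaluates p(h, t, t|θ) over a grid of θ values; the
likelihood of θ = 0.9 is 0.009, whereas the most likely value of θ for this
particular sequence turns out to be about 1/3.

# Sketch: likelihood of theta for the observed sequence (h, t, t).
import numpy as np

theta = np.linspace(0.01, 0.99, 99)
likelihood = theta * (1 - theta)**2          # p(h,t,t | theta)
print("Likelihood at theta = 0.9: %0.4f." % (0.9 * 0.1 * 0.1))             # 0.009
print("Most likely theta        : %0.2f." % theta[np.argmax(likelihood)])  # ~0.33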


Combining Probabilities: Regression. Analogously, when we fit a
line to data, it is assumed that the noise ηi in each observed value yi is
independent of the noise in all other values of y, so the probability of
obtaining a sample of n noise values (η1, η2, . . . , ηn) is

    p(η1, η2, . . . , ηn|b1, b0)
        = p(η1|b1, b0) × p(η2|b1, b0) × · · · × p(ηn|b1, b0).    (6.10)

But since p(ηi|b1, b0) = p(yi|b1, b0), Equation 6.10 can be expressed as

    p(y1, y2, . . . , yn|b1, b0) = p(y1|b1, b0) × · · · × p(yn|b1, b0).    (6.11)

If we write the sample of y values as a vector (in bold),

    y = (y1, y2, . . . , yn),    (6.12)

then Equation 6.11 becomes

    p(y|b1, b0) = p(y1|b1, b0) × · · · × p(yn|b1, b0),    (6.13)

which can be written more succinctly as the likelihood function

    p(y|b1, b0) = Π_{i=1}^n p(yi|b1, b0),    (6.14)

where Π, the capital Greek letter pi, is an instruction to multiply together
all the terms to its right (see Appendix B). Substituting Equation 6.6
into Equation 6.14 gives

    p(y|b1, b0) = Π_{i=1}^n ki e^{−[yi − (b1 xi + b0)]²/(2σ_i²)},    (6.15)

where (as a reminder) ki = 1/(σ_i √(2π)). In words, p(y|b1, b0) is
interpreted as ‘the conditional probability that the variable y adopts the
set of values y, given the parameter values b1 and b0’. Equation 6.15 is
called the likelihood function because its value varies as a function of the
parameters b1 and b0. The parameter values that make the observed
data most probable are the maximum likelihood estimate (MLE).


The Probability of the Data? At first, this way of thinking about


data seems odd. It just sounds wrong to speak of the probability of the
data, which are values we have already observed, so why would we care
how probable they are? In fact, we do not care about the probability
of the data per se, but we do care how probable those data are in the
context of the parameters we wish to estimate — that is, in the context
of our regression model (Equation 6.1). Specifically, in our model, we
want to find the values of the parameters b1 and b0 that would make
the observed data most probable (i.e. the values of b1 and b0 that are
most consistent with the data).

6.3. Likelihood and Least Squares Estimation

For reasons that will become clear, it is customary to take the logarithm
of quantities like those in Equation 6.15. As a reminder, given two
positive numbers m and n, log(m × n) = log m + log n. Accordingly,
the log likelihood of b1 and b0 is

    log p(y|b1, b0) = Σ_{i=1}^n log[1/(σ_i √(2π))] − Σ_{i=1}^n [yi − (b1 xi + b0)]²/(2σ_i²).    (6.16)

If we plot the value of log p(y|b1, b0) for different putative values of b1
and b0 then we obtain the log likelihood function, which is qualitatively
similar to the bowl-shaped function plotted in Figure 2.1.
The standard deviation σ_i of each measured value yi effectively
‘discounts’ less reliable measurements, so that noisier measurements
have less influence on the parameter values (b1, b0) of the fitted line.
Notice that, except for the 1/σ_i factor, this is starting to resemble the
sum of squared errors in Equation 1.25. The assumption that noise
variances may not all be the same is called heteroscedasticity, whereas
homoscedasticity is the assumption that all variances are the same.
If we do not know the σ_i values then we may as well assume that all
of them are the same. To simplify calculations, we set σ_i = 1/√2, so
that Equation 6.16 becomes

    log p(y|b1, b0) = n log(1/√π) − Σ_{i=1}^n [yi − (b1 xi + b0)]².    (6.17)

69
6 Maximum Likelihood Estimation

Rearranging this yields


n
X
p 2
n log(1/ ⇡) log p(y|b1 , b0 ) = yi (b1 xi + b0 ) (6.18)
i=1
= E, (6.19)

where the right-hand side is identical to the sum of squared errors in


Equation 1.25.
Notice that the values of b1 and b0 that make the log likelihood
log p(y|b1 , b0 ) (Equation 6.17) as large as possible (but always negative)
also make E (Equation 6.19) as small as possible (but always positive).
Thus, the values of b1 and b0 that maximise the log likelihood
log p(y|b1 , b0 ) also minimise the sum of squared errors E. Therefore,
if the reliabilities of all data points are the same then the maximum
likelihood estimate equals the least squares estimate.
This bears repeating more fully: if all data points are equally reliable
(i.e. if all i values are the same) then the MLE of b1 and b0 is identical
to the LSE. That is why the apparently arbitrary decision to estimate
b1 and b0 by minimising the sum of squared errors turns out to be a
good idea (see Section 1.3, page 5).

Key point: If all data points are equally reliable (i.e. if all
i values are the same) then the maximum likelihood estimate
(MLE) of model parameter values is identical to the least squares
estimate (LSE) of those parameter values.
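
The equivalence between the MLE and the LSE can be illustrated numerically.
The sketch below (not one of the book's listings) fits the Table 1.1 data by
least squares with np.polyfit and then locates the maximum of the log
likelihood of Equation 6.17 by a brute-force grid search; both procedures
arrive at essentially the same slope and intercept. The grid limits and
resolution are arbitrary choices.

# Sketch: maximum likelihood estimate versus least squares estimate.
import numpy as np

x = np.array([1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00])
y = np.array([3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87])

b1_ls, b0_ls = np.polyfit(x, y, 1)           # least squares fit

# Up to a constant, the log likelihood of Equation 6.17 is minus the sum of
# squared errors, so we search a grid of (b1, b0) values for its maximum.
b1_grid = np.linspace(0.0, 2.0, 401)
b0_grid = np.linspace(0.0, 6.0, 601)
B1, B0 = np.meshgrid(b1_grid, b0_grid)
loglik = -np.sum((y - (B1[..., None]*x + B0[..., None]))**2, axis=-1)
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
print("Least squares      : b1 = %0.3f, b0 = %0.3f." % (b1_ls, b0_ls))
print("Maximum likelihood : b1 = %0.3f, b0 = %0.3f." % (B1[i, j], B0[i, j]))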

Chapter 7

Multivariate Regression

7.1. Introduction

Most measured quantities depend on several other quantities. For


example, a person’s score on a new computer game zog depends on
a combination of the number x1 of hours they spent playing zog and
the number x2 of years they spent playing computer games in general.
For a sample of n individuals, the score of the ith individual can be
modelled as

    yi = b1 xi1 + b2 xi2 + b0 + ηi.    (7.1)

The parameter b1 specifies the extent to which the score yi depends


on the number of hours spent playing zog, b2 specifies the extent to
which the score depends on previous experience with computer games,
b0 represents the average score of naive individuals (i.e. individuals who
have not spent any time playing zog and who have no experience with
computer games), and ηi is the noise in the score measurements.

7.2. The Best Fitting Plane

Just as simple regression has a geometric interpretation as fitting a


straight line to data, so multivariate regression can be interpreted as
fitting a plane to data. In the case of two independent variables x1 and
x2 , each location (xi1 , xi2 ) on the ‘ground’ has a corresponding point

71
7 Multivariate Regression

at height ŷi on the best fitting plane with equation

ŷi = b1 xi1 + b2 xi2 + b0 , (7.2)

as shown in Figure 7.1.


Note that the value of ŷ depends on p = 3 parameters, comprising
k = 2 regression coefficients

    b1 = ∂ŷ/∂x1,    (7.3)
    b2 = ∂ŷ/∂x2,    (7.4)

plus one intercept parameter b0. The vertical distance between each
measured data point yi and the corresponding point ŷi on the best
fitting plane is considered to be measurement error or noise, ηi = yi − ŷi.

[Figure 7.1 shows the data points and the best fitting plane plotted in the
three-dimensional space with axes x1, x2 and y.]

Figure 7.1: Given a data set consisting of x1, x2 and y values, where the
variable y is thought to depend on the variables x1 and x2, we can fit a plane
to the data, defined by ŷi = b1 xi1 + b2 xi2 + b0, where b1 is the gradient with
respect to x1 (i.e. the slope of the plane along the wall formed by the x1- and
y-axes), b2 is the gradient with respect to x2 (the slope of the plane along
the wall formed by the x2- and y-axes), and b0 is the height of the plane at
(x1, x2) = (0, 0) on the ground. Each vertical line joins a measured data point
yi to the point ŷi on the best fitting plane above the same location (xi1, xi2)
on the ground. Using the data in Table 7.1, the least squares estimates are
b1 = 0.966, b2 = 0.138 and b0 = 2.148.


i     1    2    3    4    5    6    7    8    9    10   11   12   13
yi    3.34 4.97 4.15 5.40 5.21 4.56 3.69 5.86 4.58 6.94 5.57 5.62 6.87
xi1   1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
xi2   7.47 9.24 3.78 1.23 5.57 4.48 4.05 4.19 0.05 7.20 2.48 1.73 2.37

Table 7.1: Values of two independent variables (regressors) x1 and x2 and a
dependent variable y.

And, just as simple regression finds the best fitting line by minimising
the sum of squared vertical distances between the line and the y data,
so multivariate regression finds the best fitting plane by minimising the
sum of squared vertical distances between the plane and the y data,
    E = Σ_{i=1}^n (yi − ŷi)²    (7.5)
      = Σ_{i=1}^n [yi − (b1 xi1 + b2 xi2 + b0)]².    (7.6)

In summary, for k = 2 regressors (independent variables), the best


fitting two-dimensional model is a plane embedded in three-dimensional
space, as in Figure 7.1. More generally, if there are k > 2 regressors
then the best fitting k-dimensional model is a hyper-plane embedded in
(k + 1)-dimensional space.

Key point: Just as simple regression finds the best fitting line
by minimising the sum of squared vertical distances between
the line and the data, so multivariate regression finds the best
fitting plane by minimising the sum of squared vertical distances
between the plane and the data.

7.3. Vector–Matrix Formulation

Expressing regression in terms of vectors provides a notational uniformity


which ensures that extension beyond two regressors is reasonably self-
evident (see Appendix C). The right-hand side of Equation 7.2 can be


expressed as the inner product of the column vector


    b = (b1, b2, b0)ᵀ                                    (7.7)

and the row vector

xi = (xi1 , xi2 , 1), (7.8)

where the first two elements of xi define a single location on the ground
plane. So Equation 7.2 can be written in vector–matrix form as
    ŷi = (xi1, xi2, 1) (b1, b2, b0)ᵀ                     (7.9)

or, more succinctly,

ŷi = xi b. (7.10)

When considered over all n data points, we have


    [ ŷ1 ]   [ x1 ]
    [  ⋮ ] = [  ⋮ ] b.                                   (7.11)
    [ ŷn ]   [ xn ]

The n values of ŷ can be represented as a column vector


    ŷ = (ŷ1, …, ŷn)ᵀ,                                    (7.12)


and the term in brackets on the right-hand side of Equation 7.11 can
be represented as the n × 3 matrix

        [ x1 ]
    X = [  ⋮ ],                                          (7.13)
        [ xn ]

where each row can be expanded to yield

        [ x11  x12  1 ]
    X = [  ⋮    ⋮   ⋮ ].                                 (7.14)
        [ xn1  xn2  1 ]

When Equation 7.11 is written out in full,

    [ ŷ1 ]   [ x11  x12  1 ] [ b1 ]
    [  ⋮ ] = [  ⋮    ⋮   ⋮ ] [ b2 ],                     (7.15)
    [ ŷn ]   [ xn1  xn2  1 ] [ b0 ]

it becomes apparent that we have n simultaneous equations and three


unknowns (b1 , b2 and b0 ), which can now be written succinctly as

ŷ = Xb. (7.16)

This can be used to express Equation 7.5 in vector–matrix format, as
follows. If we write the n measured values of y as a column vector

    y = (y1, …, yn)ᵀ,                                    (7.17)


then Equation 7.5 can be written as

    E = (y − ŷ)ᵀ(y − ŷ),                                 (7.18)

where y and ŷ are vectors of n elements and the transpose operator ᵀ
converts column vectors to row vectors (and vice versa). From Equation
7.16, we have

    E = (y − Xb)ᵀ(y − Xb),                               (7.19)

and expanding this yields

    E = yᵀy + bᵀXᵀXb − 2bᵀXᵀy,                           (7.20)

where the final (cross) term results from multiplying Xb by y twice
and using the transpose property (Xb)ᵀ = bᵀXᵀ (see Appendix C).

7.4. Finding the Best Fitting Plane

The usual method for finding the regression coefficients b consists of


taking the derivative of E with respect to each regression coefficient
and then using these derivatives to find the values of the regression
coefficients that make the slope zero.
The intuition behind this method is that at a minimum of the function
E with respect to b, the slope must be zero. Turning this around, if
we find a value of b which makes the slope zero then we automatically
have a value for b that may give a minimum of E (of course, it could
also correspond to a maximum, but there are ways of checking whether
it is a minimum as we want).
The derivative with respect to the 3-element column vector b =
(b1, b2, b0)ᵀ is also a 3-element column vector:

    ∇E = (∂E/∂b1, ∂E/∂b2, ∂E/∂b0)ᵀ,                      (7.21)

where the nabla symbol ∇ is standard notation for the gradient operator
on a vector, i.e. ∇E = dE/db. Each element of the vector ∇E is
the derivative of E with respect to one regression coefficient, and at a


minimum of E all these derivatives must equal zero:

    ∇E = (0, 0, 0)ᵀ.                                     (7.22)

The derivative of Equation 7.20 is

    ∇E = 2[(XᵀX)b − Xᵀy].                                (7.23)

At a minimum this equals the zero vector in Equation 7.22, which yields

    (XᵀX)b = Xᵀy.                                        (7.24)

Here XᵀX is a 3 × 3 covariance matrix, and Xᵀy is a 3 × 1 column
vector. As in Section 2.4, Equation 7.24 represents three simultaneous
equations in three unknowns, b = (b1, b2, b0)ᵀ.
Multiplying both sides of Equation 7.24 by (XᵀX)⁻¹ yields the least
squares estimate of the regression coefficients,

    b = (XᵀX)⁻¹ Xᵀy,                                     (7.25)

where (XᵀX)⁻¹ is the 3 × 3 inverse of the matrix XᵀX.
From Equation 7.16 (ŷ = Xb), we obtain ŷ as

    ŷ = X(XᵀX)⁻¹ Xᵀy,                                    (7.26)

where the n × n matrix H = X(XᵀX)⁻¹Xᵀ is called the hat matrix.
Note that this expression for ŷ involves only the x and y values from
the data.
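As a minimal sketch of Equations 7.24–7.26 (separate from the full listing in Section 7.8), the least squares estimate can be computed directly with NumPy. The data are those of Table 7.1; variable names such as X, b and yhat are choices made here rather than fixed conventions.

import numpy as np

x1 = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
x2 = np.array([7.47, 9.24, 3.78, 1.23, 5.57, 4.48, 4.05, 4.19, 0.05, 7.20, 2.48, 1.73, 2.37])
y  = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69, 5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

X = np.column_stack([x1, x2, np.ones(len(y))])  # n x 3 design matrix of Equation 7.14
b = np.linalg.solve(X.T @ X, X.T @ y)           # Equation 7.25 via the normal equations (7.24)
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix of Equation 7.26
yhat = X @ b                                    # equivalently H @ y
print(b)                                        # roughly [0.966, 0.138, 2.148], as in Section 7.7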

Standardised Regression Coefficients

Suppose we find that the regression coefficient b1 is larger than the


coefficient b2 . Does that mean that the regressor x1 is more important
than the regressor x2 ? The answer depends on the magnitudes of x1
and x2 . For example, if we were to halve all values of x1 and repeat
the regression analysis, we would find that the value of b1 doubles but
the value of b2 remains unchanged. Clearly, halving all values of the
regressor x1 did not suddenly make it more important in relation to

77
7 Multivariate Regression

the regressor x2 . Consequently, we cannot simply compare the raw


coefficients b1 and b2 without taking into account the magnitudes of
the corresponding regressors x1 and x2 .
In order to compare different regression coefficients, we first need
to ensure that the corresponding regressors have the same standard
deviation. This is achieved by dividing each variable by its standard
deviation, which defines new standardised or normalised variables

    x1′ = x1/sx1,   x2′ = x2/sx2,   y′ = y/sy,           (7.27)

where s represents standard deviation. Thus each of the standardised
variables x1′, x2′ and y′ has a standard deviation of 1. Now the normalised
regressors x1′ and x2′ have the same intrinsic scale, so their coefficients
can be compared on an equal footing.
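A corresponding sketch for standardised coefficients, assuming x1, x2, y and NumPy from the snippet above are still in scope; dividing by np.std implements Equation 7.27.

x1s, x2s, ys = x1 / np.std(x1), x2 / np.std(x2), y / np.std(y)   # Equation 7.27
Xs = np.column_stack([x1s, x2s, np.ones(len(ys))])
bs = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
print(bs)   # cf. the standardised values 0.864 and 0.338 quoted in Section 7.7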

Key point: The relative importance of different regressors
can only be assessed by comparing the magnitudes of their
standardised regression coefficients.

Multicollinearity

From the equation of multivariate regression, it is apparent that the
contribution of each regressor to y can be added to the contributions of
all other regressors. But if two different regressors are correlated then
the equation effectively ‘double counts’ their contributions to y. For
this reason, multivariate regression is based on the assumption that the
regressors are mutually uncorrelated. Multicollinearity is a measure of
the extent to which regressors are correlated with each other.
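A quick multicollinearity check is sketched below, assuming x1, x2 and X from the earlier snippets; the variance inflation factor from statsmodels is one common diagnostic.

from statsmodels.stats.outliers_influence import variance_inflation_factor

r12 = np.corrcoef(x1, x2)[0, 1]            # correlation between the two regressors
vif1 = variance_inflation_factor(X, 0)     # VIF for x1 (values near 1 suggest little collinearity)
vif2 = variance_inflation_factor(X, 1)     # VIF for x2
print('r(x1, x2) = %.3f, VIF(x1) = %.2f, VIF(x2) = %.2f' % (r12, vif1, vif2))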

7.5. Statistical Significance

Degrees of Freedom

For multivariate regression the parameters include k regressors and the
intercept b0, so the number p of parameters is p = k + 1. When assessing
the number of degrees of freedom ν, we lose one degree of freedom per
regressor, and we also lose one degree of freedom for the intercept; so
the number of degrees of freedom for p parameters is ν = n − p.


Assessing the Overall Model Fit

The proportion of the variance in y that can be attributed to the best
fitting plane is the coefficient of determination, which equals the square
of the correlation coefficient (Equation 3.27, repeated here):

    r² = var(ŷ)/var(y).                                  (7.28)

In the context of multivariate regression, r is the multiple correlation
coefficient, and r² is a measure of how well the best fitting plane ‘soaks
up’ the variance in y, i.e. the proportion of variance in y explained by
the regression model. From Equations 3.17–3.23, we can also express
this as

    r² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)².   (7.29)

Adding more regressors usually increases r2 , sometimes by pure


chance. As an extreme example, suppose we added 1,000 regressors,
where each regressor is a random set of n values. Almost inevitably,
some of these regressors are correlated with y, and will therefore increase
r2 . Accordingly, we can take account of the number of regressors by
using the adjusted r2 statistic
    r²Adj = 1 − [Σ_{i=1}^{n} (yi − ŷi)²/(n − p)] / [Σ_{i=1}^{n} (yi − ȳ)²/(n − 1)],   (7.30)

where p is the number of parameters (k regressors plus the intercept).


The statistical significance of the coefficient of determination is
assessed using the F -ratio (Equation 5.28, repeated here) with numerator
degrees of freedom p 1 and denominator degrees of freedom n p:

r2 /(p 1)
F (p 1, n p) = . (7.31)
(1 r2 )/(n p)

As discussed in Section 5.6, this F -ratio implies a particular p-value.


Of course, we could equivalently have used Equation 5.29 here.
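These quantities can be computed in a few lines; the sketch below assumes y and yhat from the earlier snippets and uses scipy's F distribution for the p-value.

from scipy import stats

n, p = len(y), 3
SSNoise = np.sum((y - yhat)**2)
SST = np.sum((y - np.mean(y))**2)
r2 = 1 - SSNoise / SST                                     # Equation 7.29
r2adj = 1 - (SSNoise / (n - p)) / (SST / (n - 1))          # Equation 7.30
F = (r2 / (p - 1)) / ((1 - r2) / (n - p))                  # Equation 7.31
pval = stats.f.sf(F, p - 1, n - p)                         # upper-tail p-value of F(p-1, n-p)
print('r2 = %.3f  adjusted r2 = %.3f  F = %.3f  p = %.4f' % (r2, r2adj, F, pval))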


Statistical Significance of Individual Parameters

To test the significance of individual parameters, we need to estimate
the underlying standard deviation of each parameter. For example, the
regressor parameter b1 has an associated t-value (see Equation 5.7) of

    t_b1(ν) = b1/σ̂_b1,                                   (7.32)

where ν = n − p with p = 3 (the number of parameters). Each of the
standard deviations σ̂_b1, σ̂_b2 and σ̂_b0 is the square root of one diagonal
element of the covariance matrix (from Equation 2.8 in Gujarati, 2019)

    σ̂²_X = σ̂²_η (XᵀX)⁻¹                                  (7.33)

           [ σ̂²_b1    0       0    ]
         = [   0     σ̂²_b2    0    ],                    (7.34)
           [   0      0      σ̂²_b0 ]

where σ̂²_η is the estimated value of the noise variance σ²_η (Equation 5.14),

    σ̂²_η = E/(n − p),                                    (7.35)

so that Equation 7.34 can be computed as

    σ̂²_X = (XᵀX)⁻¹ E/(n − p).                            (7.36)
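A sketch of Equations 7.32–7.36, again assuming X, y, b, yhat, n and p from the snippets above:

from scipy import stats

E = np.sum((y - yhat)**2)                  # residual sum of squared errors
var_eta = E / (n - p)                      # Equation 7.35
cov_b = var_eta * np.linalg.inv(X.T @ X)   # Equation 7.36
se = np.sqrt(np.diag(cov_b))               # standard errors of b1, b2, b0
t = b / se                                 # Equation 7.32
pvals = 2 * stats.t.sf(np.abs(t), n - p)   # two-tailed p-values; cf. Table 7.2
print(t, pvals)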

7.6. How Many Regressors?

In practice, the quality of the fit is assessed using the extra sum-of-
squares method, also known as the partial F -test. In essence, this
consists of assessing the extra sum-of-squares SSExp (see Equation 3.19)
accounted for by the regressors when the number of regressors is
increased. As the number of regressors is increased, the proportion
of the total sum of squared errors that is explained by the regressors
increases. For example, if we consider only one regressor, i.e. p = 2
parameters (slope and intercept), the predicted value of yi obtained


from the reduced model is

ŷi (bred ) = b1 xi1 + b0 , (7.37)

where the slope b1 and intercept b0 are represented by the vector

bred = (b1 , b0 ). (7.38)

From Equation 3.19, the sum of squared errors explained by the regressor
xi1 is

    SSExp(bred) = Σ_{i=1}^{n} (ŷi(bred) − ȳ)²,           (7.39)

where the dependence on bred is made explicit.


If we now consider k = 2 regressors, xi1 and xi2 , then the predicted
value of yi from the full model becomes

ŷi (bfull ) = b1 xi1 + b2 xi2 + b0 , (7.40)

where the two slopes b1 and b2 and the intercept b0 are represented as

bfull = (b1 , b2 , b0 ). (7.41)

Incidentally, the values of the coefficients b1 and b0 usually change when


a new regressor xi2 is introduced. The important point is that the sum
of squared errors explained by the regressors increases to
    SSExp(bfull) = Σ_{i=1}^{n} (ŷi(bfull) − ȳ)²,         (7.42)

such that

    SSExp(bfull) ≥ SSExp(bred).                          (7.43)

The Extra Sum-of-Squares Method. Suppose we begin with the


full model containing p = 3 parameters, so the sum of squared errors
explained by the model is SSExp (bfull ) (Equation 7.42). The sum


of squared errors explained by the reduced model with only p = 2


parameters is SSExp (bred ) (Equation 7.39). So upon removing one of
the regressors x2 from the full model, the extra sum-of-squares that had
been explained by b2 (strictly speaking, x2 ) in the full model is

    SSExp(b2|bred) = SSExp(bfull) − SSExp(bred).         (7.44)

To test the hypothesis that b2 equals zero (i.e. that removing x2 from
the full model does not change the explained sum of squared errors),
we calculate the F -ratio

    F(νDiff, n − p) = {[SSExp(bfull) − SSExp(bred)]/νDiff} / [SSNoise/(n − p)]
                    = [SSExp(b2|bred)/νDiff] / [SSNoise/(n − p)],   (7.45)

where SSNoise = E (Equations 7.5 and 7.18), νDiff is the number of
parameters in the full model minus the number of parameters in the
reduced model (νDiff = 3 − 2 = 1 here), and p = k + 1 is the number of
parameters in the full model (so p = 3 here).
As usual, if the observed value F(νDiff, n − p) is larger than the
corresponding value of F at a significance level of p = 0.05 then we reject
the hypothesis that b2 = 0, i.e. we conclude that the regressor associated
with b2 contributes significantly to the model. More generally, we may
wish to test the effect of simultaneously removing several regressors
from the full model, in which case νDiff > 1.

7.7. Numerical Example

Finding the Best Fitting Plane. The best fitting plane for the data
in Table 7.1 is shown in Figure 7.1. The least squares estimates of the
regression coefficients are

    b1 = 0.966   (σ̂_b1 = 0.281),
    b2 = 0.138   (σ̂_b2 = 0.103),
    b0 = 2.148   (σ̂_b0 = 1.022).


Substituting these values into Equation 7.2 gives

    ŷi = 0.966 × xi1 + 0.138 × xi2 + 2.148.              (7.46)

This seems to imply that the regressor x1 is 0.966/0.138 = 7.00 times


more influential than x2 on the value of y (but see below).

Standardised Regression Coefficients. Using the standardised
variables x1′ = x1/sx1 and x2′ = x2/sx2 introduced in Equation 7.27, we
obtain the parameter values b1 = 0.864, b2 = 0.338 and b0 = 2.047, so
Equation 7.2 becomes

    ŷi = 0.864 × x′i1 + 0.338 × x′i2 + 2.047.            (7.47)

This implies that the regressor x1 is only 0.864/0.338 = 2.56 times more
influential than x2 on the value of y.

Assessing the Overall Model Fit. Evaluating the coefficient of


determination (i.e. the square of the multiple correlation coefficient,
Equation 7.28), we obtain

    r² = var(ŷ)/var(y) = 0.600/1.095 = 0.548,            (7.48)

so the multiple correlation coefficient is r = √0.548 = 0.740.
The statistical significance of the multiple correlation coefficient
is assessed using the F-ratio (see Section 5.6). From Equation 7.31
(repeated below), the F-ratio of the coefficient of determination with
numerator degrees of freedom p − 1 = 3 − 1 = 2 and denominator degrees
of freedom n − p = 13 − 3 = 10 is

    F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)]  (7.49)
                    = (0.548/2) / ((1 − 0.548)/10)       (7.50)
                    = 6.063,                             (7.51)


which corresponds to a p-value of

p = 0.019. (7.52)

As this is less than 0.05, the coefficient of determination (i.e. the overall
fit) is statistically significant.

Statistical Significance of the Individual Parameters. Using
Equation 7.32 with ν = n − p = 13 − 3 = 10 degrees of freedom, the
value of t associated with the regression coefficient b1 is

    t_b1 = b1/σ̂_b1 = 0.966/0.281 = 3.433,                (7.53)

which corresponds to a p-value of p = 0.0064, so the slope b1 = 0.966 is
statistically significant.
Similarly, with ν = n − p = 13 − 3 = 10 degrees of freedom, the value
of t associated with the regression coefficient b2 is

    t_b2 = b2/σ̂_b2 = 0.138/0.103 = 1.344,                (7.54)

which corresponds to a p-value of p = 0.209, so the slope b2 = 0.138 is
not statistically significant.
Finally, the value of t associated with the intercept b0 is

    t_b0 = b0/σ̂_b0 = 2.148/1.022 = 2.102,                (7.55)

parameter        value    standard error σ̂    t      ν     p
coefficient b1   0.966    0.281               3.43   10    0.0064
coefficient b2   0.138    0.103               1.34   10    0.209
intercept b0     2.148    1.022               2.10   10    0.062

r²       adjusted r²    F        νNum    νDen    p
0.548    0.458          6.062    2       10      0.0189

Table 7.2: Results of multivariate regression analysis of the data in Table 7.1.


which corresponds to a p-value of p = 0.062, so the intercept b0 = 2.148
is not statistically significant (i.e. not significantly different from zero).

Using the extra sum-of-squares method to assess if b2 = 0. To


assess whether the regressor x2 contributes significantly to the model,
we need to test the null hypothesis that the coefficient b2 = 0. This
involves calculating the two sums of squared errors in Equation 7.44:

    SSExp(b2|bred) = SSExp(bfull) − SSExp(bred)          (7.56)
                   = 7.805 − 6.643                       (7.57)
                   = 1.162.                              (7.58)

Then Equation 7.45 (repeated here),

    F(νDiff, n − p) = [SSExp(b2|bred)/νDiff] / [SSNoise/(n − p)],   (7.59)

becomes

    F(1, 10) = (1.162/1) / (6.436/10)                    (7.60)
             = 1.805,                                    (7.61)

which corresponds to a p-value of 0.209 (not significant), so we cannot


reject the hypothesis that b2 = 0. In other words, the regressor x2 with
coefficient b2 does not contribute significantly to the model.
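The same arithmetic takes only a few lines; the sums of squares below are the values quoted above, treated as given.

from scipy import stats

SSExp_full, SSExp_red, SSNoise_full = 7.805, 6.643, 6.436   # values from Equations 7.57 and 7.60
F = ((SSExp_full - SSExp_red) / 1) / (SSNoise_full / 10)    # Equation 7.45 with nuDiff = 1, n - p = 10
p = stats.f.sf(F, 1, 10)
print('F = %.3f, p = %.3f' % (F, p))                        # roughly 1.805 and 0.209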

Reference.
Gujarati DN (2019) The Linear Regression Model, Sage Publications.


7.8. Python Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
ch7Python.py. Multivariate regression.
This is demonstration code, so it is transparent but inefficient.
"""
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from scipy import stats
import warnings # python complains about small n, so turn off warnings.
warnings.filterwarnings('ignore')
x1 = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
x2 = [7.47,9.24,3.78,1.23,5.57,4.48,4.05,4.19,0.05,7.20,2.48,1.73,2.37]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

# Set donorm=1 to use standardized regressors to compare coefficients.


donorm = 0
if donorm==1:
    x1 = x1 / np.std(x1)
    x2 = x2 / np.std(x2)
    y = y / np.std(y)
    print('Using standardized regressors ...')

# convert data to vectors.


x1 = np.array(x1)
x2 = np.array(x2)
y = np.array(y)
n = len(y)

###############################
# FULL model. Repeat twice:
# 1) by hand (vector-matrix) then 2) check using standard library.
###############################
ymean = np.mean(y)
ones = np.ones(len(y)) # 1 x n vector
Xtr = [x1, x2, ones] # 3 x n matrix
X = np.transpose(Xtr) # n x 3 matrix
y = np.transpose(y) # 1 x n vector

# find slopes and intercept using vector-matrix notation.


Xdot = np.dot(Xtr,X)
Xdotinv = np.linalg.pinv(Xdot)
XdotinvA = np.dot(Xdotinv,Xtr)
params = np.dot(XdotinvA,y)


b0 = params[2] # 2.148
b1 = params[0] # 0.966
b2 = params[1] # 0.138

print('slope b1 = %6.3f' % b1)
print('slope b2 = %6.3f' % b2)
print('intercept b0 = %6.3f' % b0)

# PLOT DATA.
fig = plt.figure(1)
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], y, marker='.', color='red')
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")

# PLOT BEST FITTING PLANE.


x1s = np.tile(np.arange(10), (10,1))
x2s = np.tile(np.arange(10), (10,1)).T
yhats = x1s*b1 + x2s*b2 + b0
ax.plot_surface(x1s,x2s,yhats, alpha=0.4)
plt.show()

# find vertical projection of data onto plane.


yhat = np.dot(X,params)
SSExplainedFULL1 = sum((ymean-yhat)**2)
ax.scatter(X[:,0], X[:,1], yhat, marker='.', color='blue')

SSExpFULL = sum((yhat-ymean)**2) # 7.804


print('SSExplainedFULL (vector-matrix) = %6.4f' % SSExpFULL )

SSNoiseFULL = sum((yhat-y)**2)

# find coefficient of determination r2.

r2 = np.var(yhat) / np.var(y) # 0.548
print('coefficient of determination r2 = %6.3f' % r2)

# Compare to STANDARD LIBRARY OUTPUT.


modelFULL = sm.OLS(y, X).fit()

SSExplainedFULL = modelFULL.ess # 7.804


print('SSExplainedFULL (vector-matrix) = %6.4f' % SSExpFULL )
print('SSExplainedFULL (standard library) = %6.4f' % SSExplainedFULL )

print('\n\nFULL MODEL SUMMARY')


print(modelFULL.summary())


###############################
# REDUCED model. Repeat twice:
# 1) by hand (vector-matrix) then 2) check using standard library.
###############################
XREDtr = [x1, ones]
XRED = np.transpose(XREDtr)

# 1) Find slopes and intercept of best fitting plane.


Xdot = np.dot(XREDtr,XRED)
Xdotinv = np.linalg.pinv(Xdot)
XdotinvA = np.dot(Xdotinv,XREDtr)
paramsRED = np.dot(XdotinvA,y)
yhatRED = np.dot(XRED,paramsRED) # projection of data onto plane.
SSExplainedRED1 = sum((ymean-yhatRED)**2) # 6.643
print('SSExplainedRED (vector-matrix) = %6.3f' % SSExplainedRED1)

# 2) STANDARD LIBRARY OUTPUT.


modelRED = sm.OLS(y, XRED).fit()
SSExplainedRED = modelRED.ess # 6.643
print('SSExplainedRED (standard library) = %6.3f' % SSExplainedRED)
print('\n\nREDUCED MODEL SUMMARY')
print(modelRED.summary())

###############################
# Extra sum of squares method (partial F-test). Repeat twice:
# 1) by hand (vector-matrix) then 2) check using standard library.
###############################
# 1) Vector-matrix: Results of extra sum of squares method.
print('\nVector-matrix: Results of extra sum of squares method:')
dofDiff = 1 # Difference in dof between full and partial model.
numparamsFULL = 3 # params in full model.
num = (SSExplainedFULL - SSExplainedRED) / dofDiff
den = SSNoiseFULL / (n-numparamsFULL)
Fpartial = num / den
print("Fpartial = %0.3f" % Fpartial)
p_valuepartial = stats.f.sf(Fpartial, dofDiff, n-numparamsFULL)
print("p_valuepartial (vector-matrix) = %0.3f" % p_valuepartial) # 0.209

# 2) STANDARD LIBRARY: Results of extra sum of squares method.


# test hypothesis that x2=0
hypothesis = '(x2 = 0)'
f_test = modelFULL.f_test(hypothesis)
print('\nSTANDARD LIBRARY: Results of extra sum of squares method:')
print('F df_num = %.3f df_denom = %.3f'
      % (f_test.df_num, f_test.df_denom)) # 1, 10
print('F partial = %.3f' % f_test.fvalue) # 1.805
print('p-value (standard library) = %.3f' % f_test.pvalue) # 0.209
# END OF FILE.

Chapter 8

Weighted Linear Regression


8.1. Introduction

The method of weighted linear regression has been delayed until


now because it follows naturally from the methods described in the
previous two chapters. Whereas simple linear regression is based on
the assumption that all data points are equally reliable, weighted linear
regression takes account of the fact that some data points are more
reliable than others, and that these should have a greater influence on
the slope and intercept of the best fitting line (see ‘A Line Suspended
on Springs’ on page 7).

8.2. Weighted Sum of Squared Errors

Consider Equation 6.16 (repeated here),

    log p(y|b1, b0) = Σ_{i=1}^{n} log[1/(σi√(2π))] − (1/2) Σ_{i=1}^{n} [(yi − (b1 xi + b0))/σi]².   (8.1)

The second summation is a weighted sum of squared errors

    E = Σ_{i=1}^{n} [yi − (b1 xi + b0)]²/σi²             (8.2)

      = 2 ( Σ_{i=1}^{n} log[1/(σi√(2π))] − log p(y|b1, b0) ).   (8.3)

Equation 8.2 can be recognised as a generalisation of the sum of squared
errors defined in Equation 1.25, such that the ith squared difference is


i     1     2     3     4     5     6     7     8     9     10    11    12    13
xi    1.00  1.25  1.50  1.75  2.00  2.25  2.50  2.75  3.00  3.25  3.50  3.75  4.00
yi    3.34  4.97  4.15  5.40  5.21  4.56  3.69  5.86  4.58  6.94  5.57  5.62  6.87
σi    0.09  0.15  0.24  0.36  0.50  0.67  0.87  1.11  1.38  1.68  2.03  2.41  2.83

Table 8.1: Values of salary xi and measured height yi as in Table 1.1, where
each value of yi has a different standard deviation σi.

now ‘discounted’ by the noise variance σi² associated with the measured
value yi. The values of b1 and b0 obtained by minimising E are called
the weighted least squares estimates (WLSE) of the slope and intercept.

As an overview of this chapter, Figure 8.1 compares the results


of simple regression and weighted regression, for the data shown in
Table 8.1. Whereas simple regression treats all data points as if they have
the same standard deviation, weighted regression takes account of the
different standard deviations of data points (also shown in Figure 6.1).
The data points on the left of Table 8.1 have small standard deviations,
so in weighted regression they have a relatively large effect on the best
fitting line.


Figure 8.1: Weighted versus simple regression using the data from Table 8.1.
Weighted regression yields a slope of b1 = 1.511 and an intercept of b0 = 2.122
(solid line); the length of each vertical line is twice the standard deviation of
the respective data point. For comparison, simple regression assumes that all
data points have the same standard deviation, which yields b1 = 0.764 and
b0 = 3.22 (dashed line).


8.3. Vector–Matrix Formulation

In Section 7.3 we saw how multivariate linear regression can be expressed
in terms of vectors and matrices. This approach is especially useful when
dealing with data where different data points have different variances.
Equation 8.2 can be expressed in vector–matrix notation using an
n × n matrix that contains the n variances along its diagonal,

        [ σ1²   …    0  ]
    V = [  ⋮    ⋱    ⋮  ].                               (8.4)
        [  0    …   σn² ]

The inverse of V is obtained by simply inverting each diagonal element,

        [ 1/σ1²   …     0   ]
    W = [   ⋮     ⋱     ⋮   ].                           (8.5)
        [   0     …   1/σn² ]

Making use of ŷ and y defined in Equations 7.12 and 7.17, we write
Equation 8.2 (i.e. the unexplained sum of squares) as

    E = (y − ŷ)ᵀ W (y − ŷ).                              (8.6)

From Equation 7.16, ŷ = Xb, so Equation 8.6 becomes

    E = (y − Xb)ᵀ W (y − Xb).                            (8.7)

The gradient of E is a two-element vector,

    ∇E = (∂E/∂b1, ∂E/∂b0)ᵀ,                              (8.8)

where each element is the derivative with respect to one parameter.
Taking the derivative of Equation 8.7, we obtain

    ∇E = 2[(XᵀWX)b − XᵀWy].                              (8.9)


At a minimum, each element of ∇E equals zero, so

    (XᵀWX)b = XᵀWy,                                      (8.10)

where XᵀWX is a 2 × 2 covariance matrix and XᵀWy is a 2 × 1 column
vector. Rearranging yields the regression coefficients

    b = (XᵀWX)⁻¹ XᵀWy,                                   (8.11)

where b is a column vector, b = (b1, b0)ᵀ.
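A minimal NumPy sketch of Equation 8.11, using the data of Table 8.1; the variable names are choices made here rather than fixed conventions.

import numpy as np

x   = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y   = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69, 5.86, 4.58, 6.94, 5.57, 5.62, 6.87])
sds = np.array([0.09, 0.15, 0.24, 0.36, 0.50, 0.67, 0.87, 1.11, 1.38, 1.68, 2.03, 2.41, 2.83])

W = np.diag(1.0 / sds**2)                         # Equation 8.5
X = np.column_stack([x, np.ones(len(y))])         # regressor column plus a column of ones
b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # Equation 8.11
print(b)                                          # roughly [1.511, 2.122], as in Figure 8.1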

Weighted Linear Multivariate Regression. By using vector–matrix


notation, weighted linear multivariate regression is a reasonably simple
extension of the method described above. Specifically, with k regressors
the matrix X would have p = k + 1 columns (one column per regressor
plus a column of ones, Equation 7.14). The rest of the calculations
above would remain the same.

8.4. Statistical Significance

Assessing the Overall Model Fit. The fit of the model to the
data can be assessed using the F-ratio, which is defined as

    F(p − 1, n − p) = [r²w/(p − 1)] / [(1 − r²w)/(n − p)],   (8.12)

where we have replaced r² in Equation 7.31 with r²w (defined below).
These are equivalent only if all values of y have the same variance, which
is not the case in weighted linear regression.

The Coefficient of Determination. In the following, the development
will parallel that in Chapter 3. For each data point, the total error is
the difference between the observed value yi and the mean ȳ. However,
if each value of yi has its own variance then the best estimate of the
mean is no longer the conventional average of the yi values. Instead, it
should take into account the variance of each yi, such that more reliable
values of yi, with smaller variances σi², are given more weight. This is


achieved by defining a weighted mean

    ȳw = Σ_{i=1}^{n} vi yi,                              (8.13)

where each vi is a normalised weight which is inversely proportional to
the variance of that data point,

    vi = (1/σi²) / Σ_{j=1}^{n} (1/σj²).                  (8.14)

The normalised weights sum to 1, Σ_{i=1}^{n} vi = 1. As in Equation 3.11,
the total error is the sum of two subsidiary error terms,

    yi − ȳw = (yi − ŷi) + (ŷi − ȳw),                     (8.15)

where (yi − ŷi) is the part of the total error not explained by the model
and (ŷi − ȳw) is the part of the total error that is explained by the
model. The total sum of squared errors is therefore also the sum of two
subsidiary sums of squared errors,

    Σ_{i=1}^{n} (yi − ȳw)² = Σ_{i=1}^{n} (yi − ŷi)² + Σ_{i=1}^{n} (ŷi − ȳw)²,   (8.16)

i.e. the sum of squared errors not explained by the model plus the sum
of squared errors that is explained by the model.
However, if we take account of the fact that different data points have
different variances then these become weighted sums,

    Σ_{i=1}^{n} (yi − ȳw)²/σi² = Σ_{i=1}^{n} (yi − ŷi)²/σi² + Σ_{i=1}^{n} (ŷi − ȳw)²/σi²,   (8.17)

i.e. the weighted total sum of squared errors is the weighted sum of
squared errors not explained by the model plus the weighted sum of
squared errors explained by the model. Recalling that 1/σi² is the
ith diagonal element Wii of the matrix W (Equation 8.5), this can be
written as

    Σ_{i=1}^{n} (yi − ȳw)² Wii = Σ_{i=1}^{n} (yi − ŷi)² Wii + Σ_{i=1}^{n} (ŷi − ȳw)² Wii.   (8.18)
i=1 i=1 i=1


Hence the total sum of squared errors can be written in vector–matrix
format as

    SST = (y − ȳw)ᵀ W (y − ȳw)                           (8.19)
        = (y − ŷ)ᵀ W (y − ŷ) + (ŷ − ȳw)ᵀ W (ŷ − ȳw),     (8.20)

where y and ŷ are the vectors defined in Equations 7.17 and 7.12,
and ȳw denotes the column vector whose elements are all equal to the
weighted mean ȳw of Equation 8.13. As in Chapter 3, we define the
second term in Equation 8.20, the weighted sum of squares explained by
the model, as

    SSExp = (ŷ − ȳw)ᵀ W (ŷ − ȳw),                        (8.21)

and we define the first term in Equation 8.20, the noise or residual sum
of squares not explained by the model, as

    SSNoise = (y − ŷ)ᵀ W (y − ŷ),                        (8.22)
(this is the same as E in Equation 8.6), so that Equation 8.20 becomes

SST = SSExp + SSNoise (8.23)

(the same as Equation 3.21). Using these definitions, the proportion of
the total sum of squared errors explained by the regression model is

    r²w = SSExp/SST,                                     (8.24)

and substituting Equations 8.21 and 8.19 yields

    r²w = [(ŷ − ȳw)ᵀ W (ŷ − ȳw)] / [(y − ȳw)ᵀ W (y − ȳw)].   (8.25)

Given that SSExp = SST − SSNoise, Equation 8.24 becomes

    r²w = (SST − SSNoise)/SST                            (8.26)
        = 1 − SSNoise/SST,                               (8.27)


and substituting Equations 8.19 and 8.22,

    r²w = 1 − [(y − ŷ)ᵀ W (y − ŷ)] / [(y − ȳw)ᵀ W (y − ȳw)],   (8.28)

which can be substituted into Equation 8.12 to obtain an F-ratio.
Taking account of the degrees of freedom yields the adjusted r²,

    r²w,Adj = 1 − [SSNoise/(n − p)] / [SST/(n − 1)].     (8.29)
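Continuing the sketch above (W, X, b, y and sds assumed in scope), the weighted sums of squares and r²w of Equations 8.19–8.29 can be computed as follows.

w = 1.0 / sds**2
ybar_w = np.sum(w * y) / np.sum(w)                  # weighted mean, Equations 8.13 and 8.14
yhat = X @ b
SST     = (y - ybar_w) @ W @ (y - ybar_w)           # Equation 8.19
SSNoise = (y - yhat) @ W @ (y - yhat)               # Equation 8.22
SSExp   = (yhat - ybar_w) @ W @ (yhat - ybar_w)     # Equation 8.21
r2w = SSExp / SST                                   # Equation 8.24; cf. 0.452 in Section 8.5
r2w_adj = 1 - (SSNoise / (len(y) - 2)) / (SST / (len(y) - 1))   # Equation 8.29; cf. 0.402
print(r2w, r2w_adj)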

Statistical Significance of Individual Parameters. As in previous


chapters, to test the significance of each parameter, we first need to
estimate the underlying standard deviation of that parameter. For
example, from Equation 7.32, the slope parameter b1 is associated with
a t-value of

tb1 (⌫) = b1 /ˆb1 , (8.30)

where ⌫ = n p, with p = k + 1 being the number of parameters for


k regressors. Similarly, the intercept parameter b0 is associated with a
t-value of

tb0 (⌫) = b0 /ˆb0 . (8.31)

Each of the standard deviations ˆb1 and ˆb0 is the square root of one
diagonal element of the covariance matrix

2
ˆX = ˆ⌘2 (X | W X) 1 (8.32)
0 1
2
B ˆb1 0 C
= @ A (8.33)
2
0 ˆb0

(the analogue of Equations 7.33 and 7.34 in the weighted regression case),
where the noise variance σ²_η is estimated as

    σ̂²_η = E/(n − p).                                    (8.34)


Thus Equation 8.32 can be computed as

    σ̂²_X = (XᵀWX)⁻¹ E/(n − p).                           (8.35)
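A sketch of Equations 8.30–8.35, again assuming the variables from the weighted regression sketches above:

from scipy import stats

n, p = len(y), 2
E = (y - X @ b) @ W @ (y - X @ b)                     # weighted residual sum of squares, Equation 8.6
cov_b = (E / (n - p)) * np.linalg.inv(X.T @ W @ X)    # Equation 8.35
se = np.sqrt(np.diag(cov_b))                          # standard errors of b1 and b0
t = b / se                                            # Equations 8.30 and 8.31
pvals = 2 * stats.t.sf(np.abs(t), n - p)              # cf. 0.0118 and 0.0059 in Table 8.2
print(t, pvals)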

8.5. Numerical Example

Assessing the Overall Model Fit. The statistical significance of


the correlation coefficient is assessed using the F -ratio in Equation 8.12
(repeated here),

    F(p − 1, n − p) = [r²w/(p − 1)] / [(1 − r²w)/(n − p)].   (8.36)

From Equations 8.24, 8.21 and 8.19, the coefficient of determination is

    r²w = SSExp/SST = 56.390/124.746 = 0.452.            (8.37)

Taking account of the number of parameters yields the adjusted r²
statistic from Equation 8.29,

    r²w,Adj = 1 − [Σ_{i=1}^{n} (yi − ŷi)²/(n − p)] / [Σ_{i=1}^{n} (yi − ȳw)²/(n − 1)]   (8.38)
            = 0.402.                                     (8.39)

Substituting r²w = 0.452, p − 1 = 2 − 1 = 1 and n − p = 13 − 2 = 11
into Equation 8.36, we get

    F(1, 11) = (0.452/1) / ((1 − 0.452)/11) = 9.075.     (8.40)

parameter      value    standard error σ̂    t       ν     p
slope b1       1.511    0.502               3.012   11    0.0118
intercept b0   2.122    0.623               3.405   11    0.0059

r²       adjusted r²    F        νNum    νDen    p
0.452    0.402          9.075    1       11      0.0118

Table 8.2: Weighted linear regression of the data in Table 8.1.


For the numerator degrees of freedom p − 1 = 1 and the denominator
degrees of freedom n − p = 11, the p-value associated with this F-ratio
is p = 0.0118, which is slightly less significant than the value p = 0.0101
obtained in Equation 5.54 with the same values of y but unweighted.

Statistical Significance of the Individual Parameters. As noted


in Figure 8.1, for the data in Table 8.1, weighted regression yields a
slope of b1 = 1.511 and an intercept of b0 = 2.122. For comparison,
simple regression, which assumes that all data points have the same
standard deviation, yields b1 = 0.764 and b0 = 3.22.
The t-value associated with the slope b1 is

    t_b1(ν) = b1/σ̂_b1                                    (8.41)
            = 1.511/0.502 = 3.012,                       (8.42)

which, with ν = n − p = 13 − 2 degrees of freedom, has a p-value of
p(3.012, 11) = 0.0118.
The t-value associated with the intercept b0 is

    t_b0(ν) = b0/σ̂_b0                                    (8.43)
            = 2.122/0.623 = 3.405,                       (8.44)

which, with ν = n − p = 13 − 2 degrees of freedom, has a p-value of
p(3.405, 11) = 0.0059.


8.6. Python Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ch8Python.py. Weighted linear regression.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import sympy as sy

x = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]
sds = [0.09,0.15,0.24,0.36,0.50,0.67,0.87,1.11,1.38,1.68,2.03,2.41,2.83]

# sds = standard deviations of each value of y.

# Convert data to vectors.


y0 = y
sds = np.array(sds)
x = np.array(x)
y = np.array(y)

###############################
# Weighted least squares model (WLS) using vector-matrix notation.
###############################
# Convert vector w into diagonal matrix W.
w = 1 / (sds**2)
W = np.diag(w)
ones = np.ones(len(y))
Xtr = [x, ones] # 2 rows by 13 cols.
Xtr = sy.Matrix(Xtr)


X = Xtr.T # transpose = 13 rows by 2 cols.


y = sy.Matrix(y)
W = sy.Matrix(W)
ymean = np.mean(y)

# Find weighted least squares solution by hand (ie vector-matrix).


temp = Xtr * W * X
tempinv = temp.inv() # invert matrix.
params = tempinv * Xtr * W * y

b1 = params[0] # 1.511
b0 = params[1] # 2.122
print('slope b1 = %6.3f' % b1)
print('intercept b0 = %6.3f' % b0)

# Convert to arrays for input to library functions.


y = np.array(y0)
X = np.array(X)
w = np.array(w)
Xtr = [ones, x]
X = np.transpose(Xtr)

##############################################
# Compare to standard WLS library output.
##############################################
mod_wls = sm.WLS(y, X, weights=w )
res_wls = mod_wls.fit()
print('\n\nWeighted Least Squares LIBRARY MODEL SUMMARY')
print(res_wls.summary())

##############################################
# Estimate OLS model for comparison:
##############################################
res_ols = sm.OLS(y, X).fit()
print('\n\nOrdinary Least Squares LIBRARY MODEL SUMMARY')
print(res_ols.params)
print(res_wls.params)
print(res_ols.summary())

##############################################
# PLOT Ordinary LS and Weighted LS best fitting lines.
##############################################
fig = plt.figure(1)
fig.clear()
plt.plot(x, y, "o", label="Data")


# Ordinary Least Squares.


plt.plot(x,res_ols.fittedvalues,"r--",label="Ordinary Least Squares")

# Weighted Least Squares.


plt.plot(x,res_wls.fittedvalues,"k-",label="Weighted Least Squares")
plt.legend(loc="best")

plt.xlabel('salary')
plt.ylabel('height')
plt.show()

##############################################
# END OF FILE.
##############################################

Chapter 9

Nonlinear Regression

9.1. Introduction

Even though the title of this book is Linear Regression, a summary of


nonlinear regression is provided for completeness. Nonlinear regression
is required if it is suspected that the data cannot be fitted well with
a straight line (i.e. linear) model. This could be the case when we
know the exact nature of the physical process that determines how
the independent and dependent variables are related. For example, at
the start of an epidemic, the number y of infected people increases as
an exponential function of time x, so the regression model could be
expressed as

    ŷ = b0 e^{b1 x}.                                     (9.1)

In general, we can write the nonlinear function as f (xi , b), so the


regression problem involves finding parameter values b = (b0 , b1 , . . . , bk )
that minimise the sum of squared errors
    E = Σ_{i=1}^{n} (yi − f(xi, b))²,                    (9.2)

where the observed value yi is a noisy version of the value ŷi given by
the model,

    yi = ŷi + ηi.                                        (9.3)

There are two broad classes of models used to fit nonlinear functions to
data, as described in the next two sections.


9.2. Polynomial Regression

Suppose we have reason to believe that the n observed values of y can


be fitted not by a line but by a quadratic function of the form

    yi = b0 + b1 xi + b2 xi² + ηi,                       (9.4)

where ηi represents noise, so the predicted value of y at xi is

    ŷi = b0 + b1 xi + b2 xi².                            (9.5)

For brevity, the regression coefficients can be represented as the vector

    b = (b0, b1, b2).                                    (9.6)

Notice that x² (and, more generally, any x^m) is linearly related to ŷ,
so (strictly speaking) fitting such an equation to the data is actually
a linear regression problem. As in previous chapters, the least squares
estimates of the regression coefficients can be obtained by minimising
the sum of squared errors

    E = Σ_{i=1}^{n} (yi − ŷi)².                          (9.7)

Notice that Equation 9.5 has the same form as Equation 7.2 used in
multivariate regression, the only difference being that each regressor (xi1
and xi2) in Equation 7.2 has been replaced by xi raised to a particular
power (xi¹ and xi²) here. Consequently, we can treat the polynomial
regression problem of Equation 9.5 as if it were a multivariate regression
problem with two regressors, xi and xi², as shown in Table 9.1.

i     1     2     3     4     5     6     7     8     9     10     11     12     13
yi    3.34  4.97  4.15  5.40  5.21  4.56  3.69  5.86  4.58  6.94   5.57   5.62   6.87
xi    1.00  1.25  1.50  1.75  2.00  2.25  2.50  2.75  3.00  3.25   3.50   3.75   4.00
xi²   1.00  1.56  2.25  3.06  4.00  5.06  6.25  7.56  9.00  10.56  12.25  14.06  16.00

Table 9.1: Values of the regressors x and x² and the dependent variable y.
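Treated as a multivariate problem, the quadratic fit needs only a design matrix whose columns are x and x²; a minimal sketch using the data in Table 9.1 (variable names are choices made here):

import numpy as np

x = np.array([1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 3.75, 4.00])
y = np.array([3.34, 4.97, 4.15, 5.40, 5.21, 4.56, 3.69, 5.86, 4.58, 6.94, 5.57, 5.62, 6.87])

X = np.column_stack([x, x**2, np.ones(len(y))])   # regressors x and x^2 plus an intercept column
b1, b2, b0 = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimates, as in Equation 7.25
print(b1, b2, b0)                                 # roughly 0.212, 0.111, 3.819 (Section 9.4)
# np.polyfit(x, y, 2) returns the same coefficients, highest power first.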


Naturally, the fit of a polynomial model to n data points improves as


the order of the regression function is increased. At one extreme, if the
regression function is a first-order polynomial (i.e. a straight line)

    ŷi = b1 xi + b0,                                     (9.8)

the sum of squared errors is almost certainly greater than zero. At the
other extreme, if the regression function is a polynomial of high enough
order k,

    ŷi = b1 xi + b2 xi² + ··· + bk xi^k + b0,            (9.9)

then the fitted polynomial will pass through every data point exactly,
so that ŷi = yi for all i = 1, . . . , n and hence the sum of squared
errors equals zero. However, this does not necessarily mean that the
polynomial model provides an overall good fit to the data; we cannot
rely on only the sum of squared errors to assess the fit of the model.
In practice, the quality of the fit is assessed using the extra
sum-of-squares method introduced in Section 7.6. This consists of


Figure 9.1: The dashed line is the best fitting linear function shown previously
(b1 = 0.764, b0 = 3.22; r²_L = 0.466). The solid curve is the best fitting
quadratic function (Equation 9.5 with b1 = 0.212, b2 = 0.111 and b0 = 3.819;
r²_NL = 0.473). The improvement from r²_L to r²_NL is assessed using the extra
sum-of-squares method (Section 7.6), which yields p = 0.730, so the quadratic
model does not provide a significantly better fit.


calculating the extra sum-of-squares explained by the model when the


polynomial order is incremented by 1, for example from the first-order
polynomial in Equation 9.8 to the second-order (quadratic) polynomial
in Equation 9.5. If the extra sum-of-squares explained yields an F
value that has associated p-value less than 0.05, we reject the null
hypothesis that b2 = 0 and conclude that x2 contributes significantly to
the model. A comparison of linear and nonlinear (polynomial) regression
is summarised in Figure 9.1.

Multicollinearity

Whereas we could reasonably assume that the regressors x1 and x2


in Equation 7.2 are uncorrelated, we know that the regressors x and
x² in Equation 9.5 are definitely correlated (which can be addressed
using orthogonal polynomials). This means that we cannot easily assess
the statistical significance of the individual coefficients in Equation 9.5.
However, we can still assess the statistical significance of the overall fit
of the model to the data.

9.3. Nonlinear Regression

If the regression problem cannot be solved using a polynomial function


(which, as we have seen, is really a type of multivariate linear regression)
then the problem is truly nonlinear. In this case, the regression function
falls into one of two categories, as follows.

Regression Functions That Can Be Linearised

Some nonlinear regression problems can be transformed into linear


regression problems. For example,

    ŷ = e^{b1 x}                                         (9.10)

can be transformed into a linear regression problem by taking logarithms


of both sides,
log ŷ = b1 x. (9.11)


Now we view log y as the dependent variable; standard regression
methods assume that its observed values include Gaussian noise η,

    log yi = b1 xi + ηi.                                 (9.12)

Expressed in terms of the untransformed variable, this is

    y = e^{b1 x + η} = e^{b1 x} × e^{η}.                 (9.13)

In words, if log y includes noise η with a Gaussian distribution (as
assumed by the regression model) then the noise in y is e^{η}, which has a
log-normal distribution. This is typically highly skewed, so it violates
the Gaussian assumptions on which regression is based.
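A sketch of the linearising transformation, fitting a straight line (through the origin) to log y; the data here are synthetic, generated purely to illustrate the idea and the caveat above about the noise distribution.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.5, 4.0, 20)
y = np.exp(0.8 * x) * np.exp(rng.normal(0.0, 0.1, size=x.size))   # noise added on the log scale

logy = np.log(y)
b1 = np.sum(x * logy) / np.sum(x * x)     # least squares slope of log y = b1 x (no intercept)
print('estimated b1 = %.3f' % b1)         # close to the value 0.8 used to generate the data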

Regression Functions That Cannot Be Linearised

If the regression function we want to fit cannot be linearised with a


transformation then we have two options.
First, we can attempt to approximate the regression function with a
polynomial by using its Maclaurin expansion (i.e. a Taylor expansion
around x = 0), so that the method described in Section 9.2 can be
applied. For example, if the underlying regression function is known to
be exponential, it can be approximated as

    e^{b1 x} ≈ 1 + b1 x + b1²x²/2! + b1³x³/3! + ···,     (9.14)

where the polynomial on the right is the Maclaurin expansion of e^{b1 x}.
Note that the approximation becomes exact as the number of terms
approaches infinity.
Second, we can attempt to find, by brute force, parameter values that
minimise the sum of squared errors in Equation 9.2. If the number of
parameters is small then such a brute-force method could be exhaustive
search as in Section 2.2. However, for a large number of parameters,
some form of gradient descent is required, as in Section 2.3.
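When no linearising transformation exists, a general-purpose optimiser can minimise Equation 9.2 directly. The sketch below uses scipy.optimize.curve_fit on synthetic data; the model function, data and starting values are all illustrative assumptions.

import numpy as np
from scipy.optimize import curve_fit

def f(x, b0, b1):
    return b0 * np.exp(b1 * x)            # the exponential model of Equation 9.1

rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 25)
y = f(x, 2.0, 0.5) + rng.normal(0.0, 0.2, size=x.size)   # synthetic data with additive noise

b_est, b_cov = curve_fit(f, x, y, p0=[1.0, 0.3])   # iterative least squares fit of Equation 9.2
print(b_est)                                       # should be close to the true values [2.0, 0.5]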


9.4. Numerical Example

Finding the Best Fitting Curve. Using the data in Table 9.1, the
best fitting quadratic curve (Equation 9.5) is shown in Figure 9.1. The
least squares estimates of the regression coefficients are

b1 = 0.212,
b2 = 0.111,
b0 = 3.819.

Substituting these values into Equation 9.5, we have

    ŷi = 0.212 × xi + 0.111 × xi² + 3.819.               (9.15)

Assessing the Overall Model Fit. The square of the correlation


coefficient (i.e. the coefficient of determination, Equation 7.28) is

    r² = var(ŷ)/var(y) = 0.518/1.095 = 0.473,            (9.16)

so the correlation coefficient is r = √0.473 = 0.688.
The statistical significance of the multiple correlation coefficient is
assessed using the F-ratio (see Section 5.6). Using Equation 7.31
(repeated below), the F-ratio of the multiple correlation coefficient with
numerator degrees of freedom p − 1 = 3 − 1 = 2 and denominator degrees
parameter        value    standard error σ̂    t       ν     p
coefficient b1   0.212    1.570               0.135   10    0.895
coefficient b2   0.111    0.310               0.358   10    0.729
intercept b0     3.819    1.800               2.122   10    0.060

r²       adjusted r²    F       νNum    νDen    p
0.473    0.368          4.49    2       10      0.041

Table 9.2: Results of polynomial regression analysis of the data in Table 9.1.


of freedom n − p = 13 − 3 = 10 is

    F(p − 1, n − p) = [r²/(p − 1)] / [(1 − r²)/(n − p)]  (9.17)
                    = (0.473/2) / ((1 − 0.473)/10)       (9.18)
                    = 4.49,                              (9.19)

which corresponds to a p-value of

    p = 0.041.                                           (9.20)

This is less than 0.05, so the multiple correlation coefficient (which


represents the overall fit) is statistically significant.

Using the extra sum-of-squares method to assess if b2 = 0.
To test the null hypothesis that the regressor x² does not contribute
significantly to the model, we need to assess the probability that the
coefficient b2 = 0. This involves calculating the two sums of squared
errors in Equation 7.44,

    SSExp(b2|bred) = SSExp(bfull) − SSExp(bred)          (9.21)
                   = 6.738 − 6.643                       (9.22)
                   = 0.095.                              (9.23)

So Equation 7.45 (repeated here),

    F(νDiff, n − p) = [SSExp(b2|bred)/νDiff] / [SSNoise/(n − p)],   (9.24)

becomes

    F(1, 10) = (0.095/1) / (7.502/10)                    (9.25)
             = 0.127,                                    (9.26)

which corresponds to a p-value of p = 0.729. This is greater than 0.05,
so we cannot reject the hypothesis that b2 = 0. In other words, the
quadratic term x² with coefficient b2 does not contribute significantly
to the model.


9.5. Python Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ch9Python.py. Nonlinear regression.

import matplotlib.pyplot as plt


import numpy as np
import statsmodels.api as sm

x1 = [1.00,1.25,1.50,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,3.75,4.00]
y = [3.34,4.97,4.15,5.40,5.21,4.56,3.69,5.86,4.58,6.94,5.57,5.62,6.87]

# Convert data to vectors.


x1 = np.array(x1)
y = np.array(y)
n = len(y)
x2 = x1**2 # define x2 as square of x1.

###############################
# Quadratic model. Repeated twice: 1) Vector-matrix, 2) Standard library.
###############################
ymean = np.mean(y)
ones = np.ones(len(y)) # 1 x n vector
Xtr = [ones, x1, x2] # 3 x n matrix
X = np.transpose(Xtr) # n x 3 matrix
y = np.transpose(y) # 1 x n vector

# 1) Find slopes and intercept using vector-matrix notation.


Xdot = np.dot(Xtr,X)
Xdotinv = np.linalg.pinv(Xdot)
XdotinvA = np.dot(Xdotinv,Xtr)
params = np.dot(XdotinvA,y)

b0Quadratic = params[0] # 3.819


b1Quadratic = params[1] # 0.212
b2Quadratic = params[2] # 0.111

print('\nQUADRATIC VECTOR-MATRIX PARAMETERS')
print('slope b1 = %6.3f' % b1Quadratic)
print('slope b2 = %6.3f' % b2Quadratic)
print('intercept b0 = %6.3f' % b0Quadratic)

# 2) STANDARD LIBRARY Quadratic output.


quadraticModel = sm.OLS(y, X).fit()
print('\n\nQUADRATIC MODEL SUMMARY')
print(quadraticModel.params)


print(quadraticModel.summary())

###############################
# Linear model (using standard library).
###############################
Xtr = [ones, x1]
X = np.transpose(Xtr)
linearModel = sm.OLS(y, X).fit()
print('\n\nLINEAR MODEL SUMMARY')
print(linearModel.params)
print(linearModel.summary())

params = linearModel.params
b0LINEAR = params[0] # 3.225
b1LINEAR = params[1] # 0.764
yhatLINEAR = b1LINEAR * x1 + b0LINEAR

###############################
# PLOT DATA.
###############################
fig = plt.figure(1)
fig.clear()
yhatQuadratic = b1Quadratic * x1 + b2Quadratic * x2 + b0Quadratic

plt.plot(x1, y, "o", label="Data")


plt.plot(x1, yhatQuadratic, "b--",label="Quadratic fit")
plt.plot(x1, yhatLINEAR, "r--",label="Linear fit")
plt.legend(loc="best")
plt.show()

###############################
# STANDARD LIBRARY: Results of extra sum of squares method.
###############################
# test hypothesis that x2=0
hypothesis = '(x2 = 0)'
f_test = quadraticModel.f_test(hypothesis)
print('\nResults of extra sum of squares method:')
print('F df_num = %.3f df_denom = %.3f'
      % (f_test.df_num, f_test.df_denom)) # 1, 10
print('F partial = %.3f' % f_test.fvalue) # 0.127
print('p-value (that x2=0) = %.3f' % f_test.pvalue) # 0.729

###############################
# END OF FILE.
###############################

Chapter 10

Bayesian Regression: A Summary


10.1. Introduction

Bayesian analysis is a rigorous framework for interpreting evidence in


the context of previous experience or knowledge. At its core is Bayes’
theorem, also known as Bayes’ rule, which was first formulated by
Thomas Bayes (c. 1701–1761), and also independently by Pierre-Simon
Laplace (1749–1827). After more than two centuries of controversy,
during which Bayesian methods have been both praised and pilloried,
Bayesian analysis has now emerged as a powerful tool with a wide range
of applications, including in artificial intelligence, genetics, linguistics,
image processing, brain imaging, cosmology and epidemiology.
Bayesian inference is not guaranteed to provide the correct answer.
Rather, it provides the probability that each of a number of alternative
answers is true, and these can then be used to find the answer that
is most probably true. In other words, it provides an informed guess.
While this may not sound like much, it is far from random guessing.
Indeed, it can be shown that no other procedure can provide a better
guess, so Bayesian inference can be justifiably interpreted as the output
of a perfect guessing machine, or a perfect inference engine. This perfect
inference engine is fallible, but it is provably less fallible than any other
means of inference.
Up to this point, the best fitting line has been defined as the line
whose slope and intercept parameters are the least squares estimates
(LSE). However, as explained in Chapter 6, the LSE can also be obtained
using maximum likelihood estimation (MLE), which provides parameter
values that maximise the probability of the data. Of course, what
we really want is a method which provides parameter values that are

the most probable ones given the data we have. This brings us to a
vital, fundamental distinction between two frameworks: the frequentist
framework and the Bayesian framework.
In essence, whereas frequentist methods (like those used in the
previous chapters) answer questions regarding the probability of the
data, Bayesian methods answer questions regarding the probability
of a particular hypothesis. In the context of regression, frequentist
methods estimate the probability of the data if the slope were zero (null
hypothesis), whereas Bayesian methods estimate the probability of any
given slope based on the data, and can therefore estimate the most
probable slope. This apparently insignificant difference represents a
fundamental shift in perspective.
Subjective Priors. A common criticism of the Bayesian framework
is that it relies on prior distributions, which are often called subjective
priors. However, there is no reason in principle why priors should not
be objective. Indeed, the objective nature of Bayesian priors can be
guaranteed mathematically, via the use of reference priors.
A Guarantee. Before we continue, we should reassure ourselves
about the status of Bayes’ theorem: Bayes’ theorem is not a matter of
conjecture. By definition, a theorem is a mathematical statement that
has been proved to be true. A thorough treatment of Bayesian analysis
requires mathematical techniques beyond the scope of this introductory
text. Accordingly, the following pages present only a brief, qualitative
summary of the Bayesian framework.

10.2. Bayes’ Theorem


Bayes’ theorem, also known as Bayes’ rule, says that

    p(x|y) = p(y|x) p(x) / p(y).                         (10.1)

Each term in Bayes’ rule has its own name: p(x|y) is the probability
of x given y, or the posterior probability; p(y|x) is the probability of y
given x, or the likelihood of x; p(x) is the prior probability of x; and
p(y), the probability of y, is the evidence or marginal likelihood. In
practice, the results of Bayesian and frequentist methods can be similar
because the influence of the prior becomes negligible for large data sets.


Bayesian Regression

Because Equation 10.1 represents a general truth, we can replace x with


the regression parameters b = (b0 , b1 ) and y with the n observed values
yi collected in a vector y as in Equation 6.12, which yields

    p(b|y) = p(y|b) p(b) / p(y),                         (10.2)

where the likelihood p(y|b) was defined in Section 6.2.


The di↵erence between the likelihood p(y|b) and the posterior p(b|y)
looks almost trivial; after all, it involves simply swapping the order in
which two terms are written. But the implications of this small di↵erence
can be profound. For example, suppose the slope that maximises the
likelihood p(y|b) is bMLE = 0.2. Now suppose a huge of amount of
previous experience with the type of data under consideration informs
us that the slopes have a narrow Gaussian distribution with a mean of
b1 = 0.5. Should this prior knowledge a↵ect the final estimate of the
slope? Of course it should. And Bayesian analysis tells us (amongst
other things) precisely how much to change our estimate of the slope to
take account of such prior knowledge.
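As a purely qualitative illustration (not the full Bayesian machinery), the posterior over a single slope can be computed on a grid by multiplying a Gaussian likelihood by a Gaussian prior and normalising; all of the numbers below are illustrative assumptions echoing the example in the text.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
y = 0.2 * x + rng.normal(0.0, 0.5, size=x.size)    # synthetic data whose MLE slope is near 0.2

slopes = np.linspace(-0.5, 1.0, 301)               # grid of candidate slopes b1
loglik = np.array([-0.5 * np.sum((y - b1 * x)**2) / 0.5**2 for b1 in slopes])
prior = np.exp(-0.5 * (slopes - 0.5)**2 / 0.1**2)  # narrow Gaussian prior centred on 0.5
post = np.exp(loglik - loglik.max()) * prior       # posterior is proportional to likelihood x prior
post /= np.sum(post)                               # normalise over the grid
print('posterior mode = %.3f' % slopes[np.argmax(post)])   # pulled slightly from the MLE towards 0.5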

A Rational Basis For Bias. Naturally, all decisions regarding the


value of an estimated parameter should be based on evidence (data), but
the best decisions should also be based on previous experience. Bayes’
theorem provides a way of taking into account previous experience in
interpreting evidence. For example, the precise way in which prior
experience affects the decision regarding the best fitting slope is through
multiplication of the likelihood for each putative slope by the value
of the prior at that slope. This ensures that values of the slope that
were encountered more often in the past receive a boost. One might
think that this is undesirable because it biases outcomes towards those
that have been obtained in the past. But the particular form of bias
obtained with Bayes’ rule is fundamentally rational. Bayes’ theorem is,
simply put, a rational basis for bias.

Appendix A

Glossary
alternative hypothesis The working hypothesis, for example that the
slope of the best fitting line is not equal to zero. See null hypothesis.
average Usually understood to be the average of a sample taken from
a parent population, while the word mean is reserved for the average of
the parent population.
Bayesian analysis Statistical analysis based on Bayes’ theorem.
Bayes’ theorem The posterior probability of x given y is p(x|y) =
p(y|x)p(x)/p(y), where p(y|x) is the likelihood and p(x) is the prior
probability of x.
chi-squared test (or χ²-test) Commonly known as a goodness-of-fit
test, this is less accurate than the F-test for regression analysis.
coefficient of determination The proportion of variance in a variable
y that is accounted for by a regression model, r2 = var(ŷ)/var(y).
confidence limit The 95% confidence limits of a sample mean ȳ are
µ ± 1.96σ̂_ȳ, where µ is the population mean and σ̂_ȳ is the standard error.
correlation A normalised measure of the linear inter-dependence of two
variables, which ranges between r = −1 and r = +1.
covariance An unnormalised measure of the linear inter-dependence of
two variables x and y, which varies with the magnitudes of x and y.
degrees of freedom The number of ways in which a set of values is free
to vary, given certain constraints (imposed by the mean, for example).
frequentist statistics The conventional framework of statistical analysis
used in this book. Compare with Bayesian analysis.
Gaussian distribution (or normal distribution) A bell-shaped curve
defined by two parameters, the mean µ and variance σ². A shorthand
way of writing that a variable y has a Gaussian distribution is y ∼ N(µ, σ²_y).
heteroscedasticity The assumption that noise variances may not all be
the same.
homoscedasticity The assumption that all noise variances are the same.
inference Using data to infer the value, or distribution, of a parameter.
likelihood The conditional probability p(y|b1 ) of observing the data
value y given a parameter value b1 is called the likelihood of b1 .


maximum likelihood estimate (MLE) Given the data y, the value
b1_MLE of a parameter b1 that maximises the likelihood function p(y|b1)
is the maximum likelihood estimate of the true value of b1.
mean The mean of a set of n values of y is ȳ = (1/n) Σ_{i=1}^{n} yi.
multiple correlation coefficient If the proportion of variance in y that
is explained by a multivariate regression model is r2 then the multiple
correlation coefficient is r.
multivariate regression A model ŷ = b0 + b1 x1 + b2 x2 + · · · + bk xk that
assumes y depends on k regressors, for which the k regression coefficients
and the intercept are to be estimated.
noise The part of a measured quantity that is not predicted by a model.
normal distribution See Gaussian distribution.
null hypothesis The hypothesis that we want to show is improbable
given the data (e.g. that the slope of the regression line is zero); precisely
how improbable is the p-value.
one-tailed test A test which estimates either the probability that a
value y is larger (but not smaller) than a hypothetical level y 0 , or the
probability that y is smaller (but not larger) than y 0 .
p-value The probability that the absolute value of a parameter b is equal
to or greater than the observed value |bobs |, assuming that the true
value of b is zero. It is a measure of how improbable bobs is, assuming
that there is no underlying e↵ect (e.g. slope = 0).
parameter A coefficient in an equation, such as the slope b1 of a line,
which acts as a model for observed data.
parent population An infinitely large set of values of a quantity, from
which each finite sample of n values is assumed to be drawn.
regressor An independent variable in a regression model; for example,
in the model ŷ = b1 x1 + b2 x2 + b0 , the regressors are x1 and x2 , which
account for a proportion of the variance in the dependent variable y.
regression A technique used to fit a parametric model (e.g. a straight
line) to a set of data points.
sample A set of n values assumed to be chosen at random from a parent
population of values.
statistical significance If the probability p that a value arose by chance,
given the null hypothesis, is p < 0.05 then it is statistically significant.
standard deviation A measure of ‘spread’ in the values of a variable;
the square root of the variance.
theorem A mathematical statement that has been proved to be true.
two-tailed test A test which estimates the probability that an observed
value y is larger or smaller than (but not equal to) a hypothetical value.
variance A measure of how ‘spread out’ the values of a variable are.
vector An ordered list of numbers. See Appendix C.
z-score The distance between an observed value y and the mean µ of a
Gaussian distribution measured in units of standard deviations.

Appendix B

Mathematical Symbols

∝ proportional to. For example, y ∝ x means y = cx, where c is a
constant.
Σ (capital Greek letter sigma) shorthand for summation. For example,
if we have n = 3 numbers x1 = 2, x2 = 5 and x3 = 7 then their sum can
be represented as
$$\sum_{i=1}^{n} x_i = 2 + 5 + 7 = 14.$$
The variable i is counted up from 1 to n, and for each i the term xi
adopts a new value, which is added to a running total.
Π (capital Greek letter pi) shorthand for multiplication. For example,
the product of the numbers defined above can be represented as
$$\prod_{i=1}^{n} x_i = 2 \times 5 \times 7 = 70. \tag{B.1}$$
The variable i is counted up from 1 to n, and for each i the term xi
adopts a new value, which is multiplied by a running total.
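As a quick numerical check of these two operations, the sum and product above can be reproduced in Python. This is a minimal sketch using only the standard library; math.prod requires Python 3.8 or later.

```python
import math

x = [2, 5, 7]           # the n = 3 values x1, x2 and x3

total = sum(x)          # summation:  2 + 5 + 7 = 14
product = math.prod(x)  # product:    2 * 5 * 7 = 70

print(total, product)   # prints: 14 70
```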
≈ approximately equal to.
≥ greater than or equal to.
≤ less than or equal to.
ν (Greek letter nu, pronounced 'new') the number of degrees of freedom.
µ (Greek letter mu, pronounced 'mew') the population mean.
η (Greek letter eta, pronounced 'eater') the noise in y.
ε noise, difference between a value yi and the population mean µ.
σ (Greek letter sigma) the population standard deviation.
σ̂ unbiased estimate of the population standard deviation based on ν
degrees of freedom.


σ² population variance.
σ̂² unbiased estimate of the population variance based on ν degrees of
freedom.
b0 intercept of a line (i.e. the value of y at x = 0).
b1 slope of a line (i.e. the amount of change in y per unit increase in x).
cov(x, y) covariance of x and y, based on n pairs of values.
E sum of squared differences between the model and the data.
ℰ mean squared error; the sum of squared differences divided by the
number of data points n (ℰ = E/n).
k number of regressors in a regression model, which excludes the
intercept b0 .
n number of observations in a sample (data set).
p number of parameters in a regression model, which includes k
regressors plus the intercept b0 (so p = k + 1). Also p-value.
r(x, y) correlation between x and y based on n pairs of values.
r² proportion of variance in data y accounted for by a regression model.
r²_w proportion of variance in data y accounted for by a regression model
when each data point has its own variance.
sx standard deviation of x based on a sample of n values.
sy standard deviation of y based on a sample of n values.
SSExp (explained) sum of squared differences between the model-
predicted values ŷ and the mean ȳ.
SSNoise (noise, or unexplained) sum of squared differences between the
model-predicted values ŷ and the data y (the same as E).
SST (total) sum of squared differences between the data y and the mean
ȳ; SST = SSExp + SSNoise.
var variance based on a sample of n values.
V n × n covariance matrix in which the ith diagonal element σᵢ² is the
variance of the ith data point yi.
W weight matrix, W = V⁻¹, in which the ith diagonal element is 1/σᵢ².
x vector of n observed values of x: x = (x1 , x2 , . . . , xn ).
x position along the x-axis.
y position along the y-axis.
y vector of n observed values of y: y = (y1 , y2 , . . . , yn ).
z position along the z-axis. Also z-score.

Appendix C

A Vector and Matrix Tutorial


The single key fact to know about vectors and matrices is that each
vector represents a point located in space, and a matrix moves that
point to a different location. Everything else is just details.
Vectors. A number, such as 1.234, is known as a scalar, and a vector
is an ordered list of scalars. A vector with two components b1 and b2 is
written as b = (b1 , b2 ). Note that vectors are printed in bold type.
Adding Vectors. The vector sum of two vectors is obtained by adding
their corresponding elements. Consider the addition of two pairs of
scalars (x1, x2) and (b1, b2); adding the corresponding elements gives
$$(x_1, x_2) + (b_1, b_2) = \big((x_1 + b_1), (x_2 + b_2)\big). \tag{C.1}$$
In vector notation we can write x = (x1, x2) and b = (b1, b2), so that
$$\begin{aligned}
\mathbf{x} + \mathbf{b} &= (x_1, x_2) + (b_1, b_2) &\text{(C.2)}\\
&= \big((x_1 + b_1), (x_2 + b_2)\big)\\
&= \mathbf{z}. &\text{(C.3)}
\end{aligned}$$

Subtracting Vectors. Subtracting vectors is done similarly by
subtracting corresponding elements, so that
$$\begin{aligned}
\mathbf{x} - \mathbf{b} &= (x_1, x_2) - (b_1, b_2) &\text{(C.4)}\\
&= \big((x_1 - b_1), (x_2 - b_2)\big). &\text{(C.5)}
\end{aligned}$$

Multiplying Vectors. Consider the result of multiplying the
corresponding elements of two pairs of scalars (x1, x2) and (b1, b2) and
then adding the two products together:
$$y = b_1 x_1 + b_2 x_2. \tag{C.6}$$
Writing (x1, x2) and (b1, b2) as vectors, we can express y as
$$y = (x_1, x_2) \cdot (b_1, b_2) = \mathbf{x} \cdot \mathbf{b}, \tag{C.7}$$


where Equation C.7 is to be interpreted as Equation C.6. This operation


of multiplying corresponding vector elements and adding the results is
called the inner, scalar, or dot product, and is often denoted by a dot,
as here.
Row and Column Vectors. Vectors come in two basic flavours, row
vectors and column vectors. The transpose operator ⊤ transforms a row
vector (x1, x2) into a column vector (or vice versa):
$$(x_1, x_2)^{\top} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}. \tag{C.8}$$

The reason for having row and column vectors is that it is often necessary
to combine several vectors into a single matrix, which is then used to
multiply a single column vector x, defined here as
$$\mathbf{x} = (x_1, x_2)^{\top}. \tag{C.9}$$

In such cases, we need to keep track of which vectors are row vectors
and which are column vectors. If we redefine b as a column vector,
b = (b1, b2)⊤, then the inner product b · x can be written as
$$\begin{aligned}
y &= \mathbf{b}^{\top}\mathbf{x} &\text{(C.10)}\\
&= (b_1, b_2)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &\text{(C.11)}\\
&= b_1 x_1 + b_2 x_2. &\text{(C.12)}
\end{aligned}$$

Here, each element of the row vector b⊤ is multiplied by the
corresponding element of the column vector x, and the results are
summed. This allows us to simultaneously specify many pairs of such
products as a vector–matrix product. For example, if the vector variable
x is measured n times then we can represent the measurements as a
2 × n matrix X = (x1, . . . , xn). Then, taking the inner product of each
column xt in X with b⊤ yields n values of the variable y:
$$(y_1, y_2, \ldots, y_n) = (b_1, b_2)\begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1n} \\ x_{21} & x_{22} & \ldots & x_{2n} \end{pmatrix}. \tag{C.13}$$

Here, each (single-element) column yt is given by the inner product of
the corresponding column in X with the row vector b⊤, so that
$$\mathbf{y} = \mathbf{b}^{\top}X.$$
Finally, it is useful to note that y⊤ = (b⊤X)⊤ = X⊤b.
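These operations translate directly into Python. The sketch below is illustrative rather than taken from the book's own listings; it assumes NumPy is available and uses made-up values for b, x and X to reproduce the inner product of Equations C.10–C.12 and the vector–matrix product of Equation C.13.

```python
import numpy as np

b = np.array([2.0, 3.0])   # row vector b = (b1, b2) (made-up values)
x = np.array([4.0, 5.0])   # column vector x = (x1, x2) (made-up values)

# Inner (dot) product y = b1*x1 + b2*x2, as in Equations C.10-C.12
y = b @ x
print(y)                   # 2*4 + 3*5 = 23.0

# A 2 x n matrix X whose n columns are measurements of x (Equation C.13)
X = np.array([[4.0, 1.0, 0.5],
              [5.0, 2.0, 1.5]])

y_all = b @ X              # one y value per column of X
print(y_all)               # [23.   8.   5.5]

# Transposing reverses the order: y^T = (b^T X)^T = X^T b
print(X.T @ b)             # same three values
```

For one-dimensional NumPy arrays the @ operator computes the inner product directly, so no explicit transpose of b is needed here.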

Appendix D

Setting Means to Zero

Setting the means of x and y to zero, or centring the data, involves
defining two new variables
$$\begin{aligned}
x'_i &= x_i - \bar{x}, &\text{(D.1)}\\
y'_i &= y_i - \bar{y}, &\text{(D.2)}
\end{aligned}$$
so that the equation of the regression line becomes
$$\hat{y}'_i = b_1 x'_i + b'_0 \tag{D.3}$$
and
$$y'_i = b_1 x'_i + b'_0 + \eta'_i, \tag{D.4}$$
where x̄' = ȳ' = b₀' = η̄' = 0 and ηᵢ' = ηᵢ, as shown in Figure D.1.


Proving ȳ' Equals Zero. We can check that the new variable y' has
a mean of zero:
$$\begin{aligned}
\bar{y}' = \frac{1}{n}\sum_{i=1}^{n} y'_i &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y}) &\text{(D.5)}\\
&= \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n}\sum_{i=1}^{n}\bar{y} &\text{(D.6)}\\
&= \bar{y} - \bar{y} &\text{(D.7)}\\
&= 0. &\text{(D.8)}
\end{aligned}$$
A similar line of reasoning shows that x' also has a mean of zero.


Proving η̄' Equals Zero. For the best fitting slope, the derivative in
Equation 2.2 (repeated here in terms of the new variables) equals zero:
$$\frac{\partial E}{\partial b'_0} = -2\sum_{i=1}^{n}\big(y'_i - (b_1 x'_i + b'_0)\big) = 0. \tag{D.9}$$
Since $y'_i - (b_1 x'_i + b'_0) = \eta'_i$, we have $\sum_{i=1}^{n}\eta'_i = 0$, so that
$$\bar{\eta}' = \frac{1}{n}\sum_{i=1}^{n}\eta'_i = 0. \tag{D.10}$$
This was also stated in Equation 2.4 for the original variables.
Proving b₀' Equals Zero. Taking the means of all terms in Equation
D.4, we have
$$\bar{y}' = b_1\bar{x}' + b'_0 + \bar{\eta}'. \tag{D.11}$$
We know x̄' = ȳ' = η̄' = 0, so 0 = b₁ × 0 + b₀' + 0 and hence b₀' = 0.


Proving $\bar{\hat{y}}'$ Equals Zero. Taking the means of all terms in Equation
D.3, we have
$$\bar{\hat{y}}' = b_1\bar{x}' + b'_0. \tag{D.12}$$
We know x̄' = b₀' = 0, and hence $\bar{\hat{y}}' = 0$.

[Figure D.1: a scatter plot of Height, y (feet) against Salary, x (groats), showing the original data (upper right) and the centred data (lower left).]
Figure D.1: Setting the means to zero. The upper right dots represent the
original data, for which the mean of x is x̄ = 2.50 and the mean of y is
ȳ = 5.13 (the point (x̄, ȳ) is marked with a diamond). The lower left circles
represent the transformed data, with zero mean. Two new variables x' and y'
are obtained by translating x by x̄ and translating y by ȳ, so that x' = x − x̄
and y' = y − ȳ; both x' and y' have mean zero, as indicated by the axes
shown as dashed lines. The slope of the best fitting line for regressing y on x
is the same as the slope of the best fitting line for regressing y' on x', and for
the transformed variables the y'-intercept is at b₀' = 0.
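The effect of centring can be verified numerically. The following sketch uses made-up data (not the values plotted in Figure D.1) and assumes NumPy; it fits a line before and after subtracting the means, confirming that the slope is unchanged and that the intercept of the centred fit is zero.

```python
import numpy as np

# Made-up data (not the values plotted in Figure D.1)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.1, 4.9, 5.2, 6.3])

# Least squares fit to the original data: b1 = cov(x, y)/var(x), b0 = ybar - b1*xbar
b1 = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Centre the data (Equations D.1 and D.2) and refit
xc, yc = x - x.mean(), y - y.mean()
b1c = np.mean(xc * yc) / np.mean(xc**2)
b0c = yc.mean() - b1c * xc.mean()

print(b1, b1c)   # identical slopes
print(b0, b0c)   # the centred intercept b0c is zero (up to rounding error)
```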

Proving cov(ŷ, η) = 0. Note that translating the data has no effect on
the covariance, so that cov(ŷ, η) = cov(ŷ', η'), which is
$$\mathrm{cov}(\hat{y}', \eta') = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}'_i - \bar{\hat{y}}')(\eta'_i - \bar{\eta}'), \tag{D.13}$$
where $\bar{\hat{y}}' = \bar{\eta}' = 0$, so that
$$\mathrm{cov}(\hat{y}', \eta') = \frac{1}{n}\sum_{i=1}^{n}\hat{y}'_i\,\eta'_i. \tag{D.14}$$
From Section 2.4, the best fitting line minimises the mean squared error
$$\mathcal{E} = \frac{1}{n}\sum_{i=1}^{n}(y'_i - \hat{y}'_i)^2, \tag{D.15}$$
and we know that $\hat{y}'_i = b_1 x'_i$ (because b₀' = 0). At a minimum, the
derivative with respect to b₁ is zero:
$$\frac{\partial \mathcal{E}}{\partial b_1} = -\frac{2}{n}\sum_{i=1}^{n} x'_i\,(y'_i - \hat{y}'_i) = 0, \tag{D.16}$$
where
$$(y'_i - \hat{y}'_i) = \eta'_i. \tag{D.17}$$
So, given that $y'_i = b_1 x'_i + \eta'_i$, we have
$$x'_i = (y'_i - \eta'_i)/b_1 = \hat{y}'_i/b_1. \tag{D.18}$$
Substituting Equations D.18 and D.17 into Equation D.16 yields
$$-\frac{2}{b_1}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{y}'_i\,\eta'_i\right) = 0. \tag{D.19}$$
Using Equation D.14 to rewrite the term in brackets, we get
cov(ŷᵢ', ηᵢ') = 0. Since cov(ŷᵢ, ηᵢ) = cov(ŷᵢ', ηᵢ'), we conclude that
$$\mathrm{cov}(\hat{y}_i, \eta_i) = 0; \tag{D.20}$$
in other words, the correlation r between ŷ and η is zero. From this we
know that cov((ŷᵢ − ȳ), ηᵢ) = 0, from which it follows that (as promised
in Section 3.4)
$$\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - \bar{y})\,\eta_i = 0. \tag{D.21}$$
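Equations D.20 and D.21 can also be checked numerically. The sketch below (again with made-up data and assuming NumPy) fits a line, forms the residuals η, and confirms that both cov(ŷ, η) and the average of (ŷᵢ − ȳ)ηᵢ vanish up to floating-point error.

```python
import numpy as np

# Made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.1, 4.9, 5.2, 6.3])

# Least squares fit
b1 = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x   # model-predicted values
eta = y - y_hat       # noise (residuals)

# Both quantities below are zero apart from floating-point error
print(np.mean((y_hat - y_hat.mean()) * (eta - eta.mean())))  # cov(y_hat, eta), Equation D.20
print(np.mean((y_hat - y.mean()) * eta))                     # Equation D.21
```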

Appendix E

Key Equations

Variance. Given n values of yi, the variance is
$$\mathrm{var}(y) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2, \tag{E.1}$$
where ȳ is the mean of the yi values.


Standard deviation. This is the square root of the variance,
$$s_y = \left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right)^{1/2}. \tag{E.2}$$

Covariance.
$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{E.3}$$

Pearson product-moment correlation coefficient.
$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{1/2}\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{1/2}}$$
or, equivalently,
$$r = \mathrm{cov}(x, y)/(s_x s_y). \tag{E.4}$$

The square of the correlation coefficient, r², is called the coefficient of
determination, which is the proportion of variance in y accounted for
by x; because r(x, y) = r(y, x), r² is also the proportion of variance in
x accounted for by y.
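The four definitions above translate directly into Python. The sketch below uses made-up data and assumes NumPy; note that np.var and np.std divide by n by default, which matches Equations E.1 and E.2.

```python
import numpy as np

# Made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

var_y = np.mean((y - y.mean())**2)                 # Equation E.1
s_y = np.sqrt(var_y)                               # Equation E.2
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Equation E.3
s_x = np.std(x)                                    # divide-by-n standard deviation of x
r = cov_xy / (s_x * s_y)                           # Equation E.4

print(var_y, s_y, cov_xy, r)
print(np.corrcoef(x, y)[0, 1])                     # same r, computed by NumPy
```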


Regression slope. The least squares estimate of the slope is
$$b_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}. \tag{E.5}$$

The value of t for the difference between b₁ and the null hypothesis
value of b₁' = 0 is
$$t_{b_1} = (b_1 - b'_1)/\hat{\sigma}_{b_1}, \tag{E.6}$$
where $\hat{\sigma}_{b_1}$ is the unbiased estimate of the standard deviation $\sigma_{b_1}$ of b₁,
$$\hat{\sigma}_{b_1} = \frac{\left[\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\right]^{1/2}}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{1/2}}. \tag{E.7}$$
Here n − 2 is the number of degrees of freedom. The p-value is associated
with $t_{b_1}$ and ν = n − 2 degrees of freedom.
p-value for the slope. If the absolute value of $t_{b_1}$ is larger than the
critical value t(0.05, ν) then the p-value is $p(t_{b_1}, \nu) < 0.05$.
Regression intercept. The least squares estimate of the intercept is
$$b_0 = \bar{y} - b_1\bar{x}. \tag{E.8}$$
The value of t for the difference between b₀ and a hypothetical value
b₀' is
$$t_{b_0} = (b_0 - b'_0)/\hat{\sigma}_{b_0}, \tag{E.9}$$

where $\hat{\sigma}_{b_0}$ is the unbiased estimate of the standard deviation $\sigma_{b_0}$,
$$\hat{\sigma}_{b_0} = \hat{\sigma}_{\eta}\times\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]^{1/2}, \tag{E.10}$$
where $\hat{\sigma}_{\eta}$ is defined by Equation 5.13.


p-value for the intercept. If $|t_{b_0}| > t(0.05, \nu)$ then the p-value is
$p(t_{b_0}, \nu) < 0.05$, where the number of degrees of freedom is ν = n − 2.
Statistical evaluation of the overall fit. The statistical significance
of the overall fit is assessed using the F-statistic
$$F(p-1,\, n-p) = \frac{r^2/(p-1)}{(1 - r^2)/(n-p)}. \tag{E.11}$$
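As a worked illustration of Equations E.5–E.11 for a simple regression (k = 1 regressor, so p = 2), the sketch below uses made-up data and assumes NumPy and SciPy are available; the p-values come from SciPy's t and F survival functions.

```python
import numpy as np
from scipy import stats

# Made-up data; k = 1 regressor, so p = 2 parameters (intercept plus slope)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 3.1, 4.8, 5.2, 6.9, 7.4])
n, p = len(y), 2

# Slope and intercept (Equations E.5 and E.8)
b1 = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Standard error, t-value and p-value for the slope (Equations E.6 and E.7)
nu = n - 2
sigma_b1 = np.sqrt(np.sum((y - y_hat)**2) / nu) / np.sqrt(np.sum((x - x.mean())**2))
t_b1 = (b1 - 0.0) / sigma_b1
p_b1 = 2 * stats.t.sf(abs(t_b1), nu)   # two-tailed p-value

# Overall fit (Equation E.11), with r^2 = var(y_hat)/var(y)
r2 = np.var(y_hat) / np.var(y)
F = (r2 / (p - 1)) / ((1 - r2) / (n - p))
p_F = stats.f.sf(F, p - 1, n - p)

print(b1, b0, t_b1, p_b1)
print(r2, F, p_F)
```

With a single regressor, the square of the slope's t-value equals the F-ratio and the two p-values coincide, which provides a useful consistency check on the output.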

Index

adjusted r², 79, 95, 96
alternative hypothesis, 39, 41, 115
average, 6, 92, 115
Bayes' theorem, 111, 112, 115
Bayesian analysis, 111, 115
bias, 36, 113
causation, 7
central limit theorem, 32, 37, 51
centring, 24, 121
chi-squared test, 58, 115
coefficient of determination, 25, 26, 61, 92, 115
conditional probability, 67
confidence interval, 41, 42, 44, 54, 55
confidence limit, 42, 44, 115
correlation, 19, 20, 23, 26, 53, 57, 115
covariance, 18, 77, 80, 92, 95, 115
critical value, 40–44, 54
degrees of freedom, 34, 43, 52, 56, 78, 95, 115
dependent variable, 1, 7, 14, 101, 105
exhaustive search, 9, 105
extra sum-of-squares, 80, 81, 103
F-ratio, 57, 79, 82, 92, 126
F-test, 49, 57
F-test (in relation to t-test), 57
frequentist statistics, 112, 115
Gaussian distribution, 32, 37, 40, 51, 65, 105, 115
Gaussian equation, 33
gradient descent, 10, 105
groat, 3
hat matrix, 77
heteroscedasticity, 69, 115
homoscedasticity, 69, 115
hyper-plane, 73
independence, 67
independent variable, 1, 7, 14, 57, 71, 73, 101
inference, 111, 115
least squares estimate, 6, 9, 69, 70, 77, 102, 111
likelihood, 67, 68, 112, 115
likelihood function, 68
linearly related, 8
log likelihood, 69
Maclaurin expansion, 105
maximum likelihood estimate, 68, 116
mean, 6, 17, 21, 24, 29, 116
mean squared error, 6
multicollinearity, 78, 104
multiple correlation coefficient, 79, 83, 116
multivariate regression, 56, 71, 102, 116
noise, 2, 8, 18, 21, 23, 30, 37, 51, 54, 65, 66, 73, 102, 105, 116
nonlinear regression, 101
normal distribution, 33, 65, 66, 116
normalised, 20, 78, 93, 115
normalised Gaussian, 32, 37
null hypothesis, 39, 46, 50, 104, 112, 116
one-tailed test, 39, 116
p-value, 8, 37, 42, 44, 54, 57, 104, 116
parameter, 2, 9, 49, 52, 57, 67, 78, 80, 95, 112, 116
parent population, 18, 30, 32, 36, 45, 116
partial F-test, 80
partitioning variance, 21, 93
population mean, 30, 32, 39, 42, 44, 117
population variance, 18, 36, 118
probability density, 66
regression, 6, 49, 68, 112, 116
regressor, 7, 57, 58, 73, 77, 78, 80, 92, 102, 116
sample, 1, 18, 29, 30, 36, 39, 45, 68, 116
significance versus importance, 45, 55
standard deviation, 18, 20, 32, 36, 41, 51, 53, 66, 69, 78, 90, 95, 116
standard error, 33, 37, 40–42, 45
standardised, 77, 78, 83
statistical significance, 29, 42, 45, 49, 78, 92, 116
sum of squared errors, 5, 9, 18, 22, 24, 69, 70, 80, 81, 89, 93, 101, 102
t-test, 42, 45
t-test (in relation to F-test), 49, 57
theorem, 32, 112, 116
transpose, 76, 120
two-tailed test, 40, 44, 54, 116
unbiased estimate, 52, 53, 55
variance, 17, 20, 22, 25, 36, 51, 56, 66, 69, 79, 91, 93, 116
vector, 68, 91, 102, 113, 116, 119
weighted least squares estimate, 90
weighted linear multivariate regression, 92
weighted linear regression, 7, 89
weighted mean, 29, 37, 50, 93
z-score, 37, 41, 116
z-test, 41

