FDS Lecture Notes, 2024-01-28
Foundations of Data Science
©David Sterratt, Kobi Gal, Hiroshi Shimodaira, Steve Renals and Iain Murray, University of Edinburgh,
2014-2023, CC BY-SA, unless otherwise indicated.
Contents
2 Data
2.1 Data and metadata
2.2 Tabular data and variables
2.3 Working with tabular data
2.4 Data wrangling
2.5 Merging tabular data using database-style joins
2.6 Split-apply-combine
3 Descriptive statistics
3.1 Introduction to descriptive statistics
3.2 Distributions of numeric variables
3.3 Sample and population mean
3.4 Sample and population median
3.5 Variance and standard deviation
3.6 Quantiles
3.7 The mode
11 Multiple regression
11.1 The principle of multiple regression
11.2 Interpreting multiple regression coefficients and metrics
11.3 Interaction terms and nonlinear fits
11.4 Interpreting and refining multiple regressions on many variables
15 Estimation
15.1 Point estimation
15.2 Estimation bias and variance
15.3 Standard error
A Resources
A.1 Example visualisations and projects
About these lecture notes
These lecture notes are written for the University of Edinburgh Course Informatics 2 – Foundations of Data
Science. Ultimately we aim that they will give comprehensive (though perhaps not exhaustive) coverage of the
material covered in the lectures.
Although the notes are designed to fit in with the Foundations of Data Science course, we’re making them an open educational resource available under a Creative Commons licence. The development of this course has benefited from other open educational resources, and now that the notes are fairly well developed (though still not complete or error-free!) it seems like a good time to give back to the world.
Much of data science is about story telling and explanation, so we consider that models such as linear
regression are helpful not only for prediction, but also for explanation. We therefore cover topics such as the
interpretation of linear and logistic regression coefficients.
We first approach statistical inference from the point of view of statistical simulations and sampling theory, inspired
by the computational approach to statistical inference in the Berkeley Data 8 course, and its associated textbook.
Given the ubiquity of using theoretical distributions to generate confidence intervals and undertake hypothesis
testing, we also introduce some common statistical distributions and tests.
6. Project skills
This is a slightly different structure to that envisaged in the FDS course descriptor, but all the topics
indicated there are included. Interleaved with these topics are topics focusing on real-world implications (often
using case studies), critical thinking, working and writing skills.
The order of the chapters represents one ideal ordering of the material. For logistical reasons the order of
the lectures differs from this ideal order: We want to present the material on statistical inference early enough
to be useful for a coursework at the beginning of Semester 2, but late enough so that students will have covered
the relevant concepts of probability in Discrete Maths and Probability.
Influences
Material on supervised learning and clustering mostly comes from the work of Steve Renals, Iain Murray and
Hiroshi Shimodaira on the previous Informatics course Inf2B.
We are indebted to other openly-available data science courses, in particular Harvard’s CS109, from which we take the concept of the data science life cycle, and the Berkeley Data 8 course, with its emphasis on programming statistical simulations for inference.
Gelman and Nolan’s (2017) book Teaching statistics – a bag of tricks has provided the basis for presentation
of some of the concepts and lecture demos.
Resources
There is no one textbook that covers all of this course. However, we will refer to the following books and resources
at points throughout the course:
Modern Mathematical Statistics with Applications (Devore and Berk, 2012): Modern Mathematical Statistics with Applications is an introduction to inferential statistics that follows on from the Discrete Maths
and Probability course. It’s freely available as a PDF from the library - see the Library Resources link.
You can also purchase a print copy for £25 via the page for the book that you reach via the Library.
The Big Book of Dashboards: Visualizing Your Data Using Real-World Business Scenarios (Wexler et al., 2017)
The first chapter of this book provides a concise introduction to principles of visualisation.
An Introduction to Data Ethics (Vallor, 2018) This is an introduction to data ethics from the perspective of
virtue ethics. It presents case studies to help you develop your ethical sensitivity and ethical reasoning.
Computational and Inferential Thinking (Adhikari et al., 2020): The online textbook Computational and Inferential Thinking takes a computational, non-mathematical approach to statistical inference, and the statistical inference sections of FDS have drawn inspiration from their approach. Please note that the Python code in the book uses a bespoke Python library rather than pandas (which is what we’ll be using in this course), so don’t pay too much attention to the details of the code.
Becoming a critical thinker: for your university studies and beyond (Ivory, 2021) This book, available online
in the University Library, is an excellent general introduction to critical thinking. We recommend the whole
book, but if you have little time, read Chapters 4-6 (on arguments, evidence and communication). This
reading will help you to understand the sort of thinking, writing and referencing that we would like to see
in the Critical Evaluation and the Project.
Notation
The following is the default notation used in the course. Note that variations may sometimes be used for ease of readability.
$x$ : a scalar
$\mathbf{x}$ : a column vector, $\mathbf{x} = (x_1, x_2, \ldots, x_D)^T$, where $D$ is the dimension of the vector
$\mathbf{x}^T$ : transpose of vector $\mathbf{x}$, meaning a row vector if $\mathbf{x}$ is a column vector
$x_i$ or $x_d$ : $i$th sample (scalar), or $d$th element of vector $\mathbf{x}$
$\sum_{i=1}^{N} x_i$ : summation, i.e. $x_1 + x_2 + \ldots + x_N$
$\prod_{i=1}^{N} x_i$ : product, i.e. $x_1 x_2 \cdots x_N$
$\mathbf{x}_i$ : $i$th sample (vector): $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iD})^T$
$\|\mathbf{x}\|$ : Euclidean norm or $L^2$ norm, $\|\mathbf{x}\| = \sqrt{\sum_{d=1}^{D} x_d^2} = \sqrt{\mathbf{x}^T \mathbf{x}}$, also known as magnitude
$\|\mathbf{x}\|_1$ : $L^1$ norm, i.e. $\sum_{d=1}^{D} |x_d|$
$\mathbf{u} \cdot \mathbf{v}$ : dot (inner or scalar) product of $\mathbf{u}$ and $\mathbf{v}$, i.e. $\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\|\,\|\mathbf{v}\| \cos\theta = \mathbf{u}^T \mathbf{v} = \sum_{d=1}^{D} u_d v_d$
$A$ : a matrix, $A = (a_{ij})$. If $A$ is an $M$-by-$N$ matrix, $A = (\mathbf{a}_1, \ldots, \mathbf{a}_N)$
$A_{ij}$ : element of matrix $A$ at the $i$th row and $j$th column, i.e. $a_{ij}$
$I$ or $I_d$ : identity matrix of size $d$ by $d$ (ones on the diagonal, zeros off the diagonal)
$A^T$ : transpose of matrix $A$, i.e. $A^T = (a_{ji})$
$A^{-1}$ : inverse of matrix $A$, i.e. $A^{-1} A = A A^{-1} = I$
$\{x_i\}_1^n$ : a set of samples, $\{x_1, \ldots, x_n\}$
$\bar{x}$ : sample mean of $x$
$s^2$, $s_x^2$, or $\mathrm{Var}(x)$ : sample variance of $x$
$s_{xy}$ : sample covariance of $x$ and $y$. NB: $s_{xx} = s_x^2$
$\mu$ : population mean
$\sigma^2$ : population variance
$\Sigma$ : population covariance matrix (cf. the summation operator $\sum$)
$\sigma_{ij}$ : $(i,j)$-element of a covariance matrix, i.e. $\Sigma = (\sigma_{ij})$. NB: $\sigma_{ii} = \sigma_i^2$
$\exp(x)$ : the natural exponential function of $x$, i.e. $e^x$
$\ln(x)$ : the natural logarithm of $x$
$x \propto y$ : $x$ is proportional to $y$
This notation is also used by Modern Mathematical Statistics with Applications (Devore and Berk, 2012).
Part I
Chapter 1
Introduction and logistics
What is a data scientist? Data scientists are able to turn raw data into understanding, insight and knowledge.
Some of the activities that data scientists do on a daily basis include the following:
This is why data scientists need to employ interdisciplinary tools, from statistics, artificial intelligence and
machine learning. We will be addressing all of these different aspects in the course, and have closely aligned
our learning goals with the varied skill set that data scientists need to be successful in the field.
The data science process We can see data science as a process comprising interconnected steps, which are shown in Figure 1.1. We will expand on all the steps of the data science cycle in this course.
The data explosion: volume, variety and velocity Data is always around us, and it always has been. But
the development of technology has made it possible to collect and store vast amounts of data. Just to get some
perspective, each day on Earth we generate over 500 million tweets (how many of these are generated by bots?), 294 billion emails, 4 million gigabytes of Facebook data, 65 billion WhatsApp messages and 720,000 hours of new content on YouTube.
In 2018, the total amount of data created, captured, copied and consumed in the world was 33 zettabytes (ZB)
– the equivalent of 33 trillion gigabytes. This grew to 59ZB in 2020 and is predicted to reach a mind-boggling
175ZB by 2025. One zettabyte is 8,000,000,000,000,000,000,000 bits (Vopson, 2021).
Parallel to this explosion of data generation, there is also consistent improvement in computational methods;
we are now able to mechanise some aspects of data analysis that data scientists care about. So, while the
amount of digital data in the world doubles every two years, so does the computational power that lies at our
disposal to analyse it. It should be noted, however, that the speed of computation, which was commonly considered to double every two years, may have plateaued.
Figure 1.1: The data science process. Credit: Joe Blitzstein and Hanspeter Pfister, created for the Harvard data
science course CS109. This figure is not under the CC BY-SA licence covering the rest of the notes.
How does data science relate to other fields? Data Science, Statistics, Machine Learning. These are some of
the terms used in the scientific literature and popular media. Let’s try to sort through the confusion.
Data science is a multidisciplinary field which uses scientific methods, processes, and systems to extract insight from data in a range of forms. Both statisticians and data scientists care about analysing and explaining data. Indeed, some of the
methods used by data scientists are taken from statistics, and we shall learn them in this course.
Statisticians focus on using structured models and parameter estimation to quantify uncertainty in data.
Data science uses methods from statistics and machine learning, but in addition, data scientists need to do a lot of work on processing the data itself and checking that the data makes sense. Data scientists often deal with huge
databases, which is why they rely heavily on computational methods for their analysis.
Machine learning is all about algorithms that are able to learn from data. Data science also uses some
tools from machine learning, and we will study some of these in the course. Data scientists begin by exploring
the data and formalise questions that can be asked on the data. They care about explaining how (and possibly
why) trends in the data arise. They need to communicate quantitative and qualitative arguments to the general
public. Often, they also care about supporting practitioners in the field (e.g., alerting teachers to struggling
students based on their interactions with educational software). The value of a machine learning algorithm is
measured by its performance on unseen data. But the value of a data science analysis also needs to consider the
human in the loop. We’re (happily) not at the stage when these complex tasks can be replaced with computer
algorithms. There is a lot of room for skill and creativity that cannot be automated.
In all these fields, it’s important to know where the data comes from, how it’s been collected, and to be able
to reason about whether a dataset has been collected ethically, or if the project we are undertaking is ethical.
Examples of data science at work Let’s consider two examples of data science at work. Both of these examples
are meant to highlight the inherent biases that unfortunately exist in society, and that also arise in the data
sets that we generate.
• Example of gender tropes in films. In this example we will see the gender bias reflected in script writing in
movies.
• Case study: COMPAS. In this example we will describe racial bias in the criminal justice system.
Data science career prospects Data science is a lucrative field. Although the study and analysis of data
started off in academia, industry is increasingly taking a leading role in applying and developing data science methods. Demand is high for data professionals: data scientists and mathematical science occupations are growing at a significantly higher rate than other high-tech fields.
Chapter 2
Data
Figure 2.1: Ancient Egyptian grain account from between 2025–1700 BC found in the Lahun Papyri. Acquisition UC32189, Digital Egypt for Universities, University College London, https://www.ucl.ac.uk/museums-static/digitalegypt/lahun/papyri.html. Courtesy of the Petrie Museum of Egyptian and Sudanese Archaeology, UCL.
Structured and unstructured data There is an important distinction between structured data and unstructured data. In structured data, the data is organised according to a clear structure. An example of structured data is a table describing attributes of Star Wars characters, with columns such as “Name”, “Height”, “Weight” and “Eye colour”. Structured data needn’t be a table: consider, for example, a bibliographic database with different types of entries for books and academic articles. It is easy to query structured data: for example, find all characters in Star Wars with blue eyes.
Unstructured data does not have a predefined structure. Text and images are common examples of unstructured
data. It is typically much harder to extract information from unstructured data than from structured data. For
example, in a block of text written about Star Wars characters, we would find it hard to identify all characters with
blue eyes.
Metadata We may have terabytes of data files, but they are useless if we don’t know what they mean. Metadata – literally “about data” – is information describing the data files it accompanies. The metadata may be a description in a text file (often called README) or it may be in a structured format. Whenever a dataset is created, stored and shared, the meaning of the data should be recorded. It is surprising how often the meaning of datasets
is not clearly described. The metadata should describe how the data were collected, including if there were any
legal or ethical considerations, the format of the data files and the meaning of variables in the data files.
Tidy Data Tidy data is a subset of tabular data (Wickham, 2014). In a tidy dataset, each column corresponds
to a variable (or attribute) and each row corresponds to an instance or observation possessing each of the
variables. The terms data matrix or long form data are synonyms for tidy data.
For example, suppose we capture squirrels and measure their weight (in grams) and length (in millimetres),
and note down their sex. The variables represented in the columns would be “Name” (assuming we’ve named
the squirrels we are observing), “Weight (g)”, “Length (mm)” and “Sex”. Each row contains the value of these
variables.
Typically, tidy data is multivariate, i.e. it has multiple variables. In the special case of two variables we refer
to it as bivariate, and when we consider only one variable, we call it univariate.
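A minimal sketch of such a tidy table as a pandas DataFrame; the names and measurements here are the same illustrative squirrels that appear in the join tables later in this chapter:

    import pandas as pd

    # Tidy data: one column per variable, one row per observed squirrel.
    squirrels = pd.DataFrame({
        "Name": ["Jakub", "Fiona", "Cameron"],
        "Weight (g)": [320, 342, 330],
        "Length (mm)": [211.0, 222.0, 215.0],
        "Sex": ["Male", "Female", "Male"],
    })
    print(squirrels)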
Messy data We call data that is not in the tidy format messy. There are a number of ways in which data can
be messy, for example:
• The values of variables could be column headings, as shown in Table 2.2. Here the messy format is easy
for humans to read, but is often more difficult to process. Functions such as melt in Pandas can be used
to rearrange this messy format into a tidy format.
• A single observational unit could be stored in multiple tables. For example, data on the CO2 emissions for countries over time might be stored in per-continent files, e.g. annual_co2_europe.csv, annual_co2_asia.csv. To tidy the data here, we would load in each file, give it a column “Continent” with the value for each row derived from the file name, and then concatenate these files (see the sketch below).
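A hedged sketch of how these two tidying operations might look using pandas; the column names follow Table 2.2, and the per-continent file names are the hypothetical ones mentioned above (the sketch assumes those CSV files exist):

    import pandas as pd

    # Values-as-columns (Table 2.2): melt the age columns into "Age" and "Count".
    messy = pd.DataFrame({
        "Sex": ["Male", "Female"],
        "Under 1 year old": [4, 6],
        "1 to 2 years old": [7, 9],
        "More than 2 years old": [2, 1],
    })
    tidy = messy.melt(id_vars="Sex", var_name="Age", value_name="Count")

    # One observational unit split across files: load each file, add a
    # "Continent" column derived from the file name, then concatenate.
    files = {"Europe": "annual_co2_europe.csv", "Asia": "annual_co2_asia.csv"}
    frames = []
    for continent, filename in files.items():
        df = pd.read_csv(filename)
        df["Continent"] = continent
        frames.append(df)
    annual_co2 = pd.concat(frames, ignore_index=True)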
Table 2.2: Messy data with values as columns. In the messy table (top), the rows correspond to one variable
(“Sex”) and the columns “Under 1 year old”, “1 to 2 years old” and “More than 2 years old” are actually the
values of “Age”, a categorical variable. This format is easy to read, but it is not tidy, as the values of a variable
are in the column names. We can rearrange to a tidy format (bottom) in which there is one column for each of
the variables “Sex” and “Age”, and a column “Count”.
Sex      Under 1 year old   1 to 2 years old   More than 2 years old
Male     4                  7                  2
Female   6                  9                  1
Messy data may actually be easier for the human eye to comprehend, but is not so easy to manipulate using
software tools, so for the moment we will assume we have tidy data.
• Numerical variables are quantities that can be measured (continuous) or counted (discrete). For example,
weight and length are continuous numeric variables. In contrast, the number of babies a squirrel gives
birth to in a year is a discrete variable, since we cannot have a fractional number of babies. Continuous
variables often have a physical dimension (e.g. weight or length) and it is important to quote them with
their unit.
• Categorical variables can take on one of a number of values. For example, the sex of the squirrel is
a categorical variable, since it can be “Male” or “Female”. We can have more than two categories, for
example a fruit might be “Apple”, “Orange”, “Lemon”, “Grapefruit” etc.
• Ordinal variables can take on one of a number of categories, but those categories are ordered. For
example, we might estimate the age of a squirrel as “Under 1 year old”, “1 to 2 years old” or “More than 2
years old”. There are only three categories, but we can order them from youngest to oldest.
• String variables contain string information such as names or a comment from a survey form.
Representing categorical variables The human-readable format of categorical variables is as a string. However,
for many of the algorithms that we deal with later, the string representation is not helpful. We could represent
the category by a number, e.g. 0 for “Apple”, 1 for “Orange”, 2 for “Lemon” etc. But this system implies an
ordering for the categories, which is not the case here.
Instead, what we do is convert the categorical variable into an indicator variable, also known as a dummy
variable. We create one new column for each category (for example, an “Apple” column, an “Orange” column and
a “Lemon” column). We indicate which category the item belongs to by putting a 1 in the corresponding column, and zeros in the other columns. This form of encoding of categories is also known as “one-hot” encoding, since there is always exactly one “hot” bit indicating the category.
In fact, if we have k categories, we can manage with k − 1 columns, by dropping one column, which we
regard as the default. For example, if we dropped the “Apple” column, when there were zeros in all the remaining
columns, we would assume that the item represented was an “Apple”.
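A short sketch of indicator (dummy) variable encoding in pandas, using the fruit example above; pd.get_dummies does the conversion, and drop_first=True drops one category to act as the default:

    import pandas as pd

    fruit = pd.DataFrame({"Fruit": ["Apple", "Orange", "Lemon", "Apple"]})

    # One indicator column per category ("one-hot" encoding).
    one_hot = pd.get_dummies(fruit, columns=["Fruit"])

    # k - 1 columns: drop the first category (here "Apple") and treat it as
    # the default, indicated by zeros in all the remaining columns.
    dummies = pd.get_dummies(fruit, columns=["Fruit"], drop_first=True)
    print(one_hot)
    print(dummies)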
• Text file: common formats are comma-separated values (CSV) and tab-separated values (TSV). There are various standards for the precise formatting of the files, for example whether the contents of cells are enclosed by quotes. Sometimes there are a few lines of metadata at the top of the file. Data science packages’ functions for reading text files have many options, for example allowing you to skip lines at the head of the file, or deal with separators other than tabs or commas (see the sketch below).
• Binary file: common formats are Excel spreadsheets or Open Document Format spreadsheets. The
modern versions of both of these are in fact ZIP files containing an XML file with the spreadsheet
information, and any other files (e.g. embedded images).
• Databases: for example MySQL or SQLite. Here there is a database server, and data is extracted by running SQL (Structured Query Language) queries.1
Text files have the advantage of being human-readable, and also enforce a discipline of encoding information solely in the content of the cells – it is not possible to encode information by colour-coding cells, as it is in spreadsheets. Conversely, spreadsheets have the advantage of being formattable, which can help with readability.
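To illustrate the reading options mentioned in the list above, here is a hedged sketch using pandas; the file name, separator and number of metadata lines are hypothetical:

    import pandas as pd

    # Hypothetical tab-separated file with two lines of metadata at the top.
    squirrels = pd.read_csv(
        "squirrels.tsv",
        sep="\t",          # separator other than a comma
        skiprows=2,        # skip the metadata lines at the head of the file
        na_values=["-1"],  # treat a sentinel value as missing data
    )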
Selecting rows and columns Data science packages have methods for extracting rows and columns from tables.
With tidy data, extracting a row selects all the data connected with one observation. Extracting a column selects
every instance of one variable.
Filtering data We call finding a subset of the data based on the value of some variable filtering. For example,
we may wish to filter out all the squirrels longer than 220 mm.
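A minimal sketch of selecting and filtering with pandas, assuming the small tidy squirrel table from earlier:

    import pandas as pd

    squirrels = pd.DataFrame({
        "Name": ["Jakub", "Fiona", "Cameron"],
        "Weight (g)": [320, 342, 330],
        "Length (mm)": [211.0, 222.0, 215.0],
        "Sex": ["Male", "Female", "Male"],
    })

    weights = squirrels["Weight (g)"]   # select a column: every instance of one variable
    first_obs = squirrels.iloc[0]       # select a row: all the data for one observation

    # Filtering: keep only the squirrels longer than 220 mm.
    long_squirrels = squirrels[squirrels["Length (mm)"] > 220]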
Cleaning data The quality of any analysis of data can only be as good as the data itself – and the data itself
may have many types of problems:
• Data entry: for example, if we have collected data in a free-text survey, respondents may not have entered
a time in a uniform format (e.g. “16:00”, or “4pm”, or “4” or “16” or “17 (CET)”) may all mean the same
thing. Or someone may have typed “08” when they meant “80”.
• Mixing text and numbers: for example, we might have recorded “210g” or “0.21kg” in the weight cell for our squirrels (see the sketch after this list).
• Missing data: perhaps a sensor wasn’t working for a few minutes or hours, so some readings are missing.
Does it record “0” in this situation? Or “−1”. The metadata should tell us, but maybe it doesn’t…
• Mislabelled data: A column may have been mislabelled. For example, we might be recording the temperature
of heating water entering and leaving the Informatics Forum, and the temperature of heating water entering
and leaving Dugald Stewart Building (DSB). To get a measure of how much heat the Forum is using, we
subtract the temperature of the water leaving from the water entering, and the same for DSB. But if we
subtract the temperature of water leaving DSB from water entering the Forum, we will get nonsense. (This
scenario actually happened.)
1 You can learn how to process and analyse data using SQL in the 3rd year course Introduction to Databases.
• Faulty sensor: perhaps a sensor goes wrong, and starts giving values that are implausible.
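As a sketch of one of the cleaning problems above (mixed text and numbers in a weight column), the following converts hypothetical entries such as “210g” and “0.21kg” into a numeric weight in grams:

    import pandas as pd

    raw = pd.Series(["210g", "0.21kg", "342g"])  # hypothetical messy weight entries

    def to_grams(value):
        """Convert a string such as '210g' or '0.21kg' to a number of grams."""
        value = value.strip().lower()
        if value.endswith("kg"):
            return float(value[:-2]) * 1000
        if value.endswith("g"):
            return float(value[:-1])
        return float("nan")  # unrecognised format: mark as missing

    weight_g = raw.apply(to_grams)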
In summary, there are many potential problems with data: the data is out to get you! One of the most
important jobs of a data scientist is to check the data quality, and fix problems. You need to approach the data in
the spirit of critical evaluation: Is this value reasonable? Is that pattern strange? Does the data look too good?
We use the term data cleaning to describe the process of checking and fixing problems in data. You should
start data checking and cleaning after loading the dataset, but problems with data also emerge in the process of
visualisation, which we will come to later.
Data cleaning is very time-consuming: it is commonly said that 80% of the time of a data science project is spent on cleaning (Dasu and Johnson, 2003), though some more recent estimates suggest that time spent cleaning is less than 30% (Anaconda, 2020); the same report also estimates that loading and visualising data take around 20%. In any event, data cleaning can take a long time, needs to be done carefully, and can be quite fiddly and frustrating.
Missing data Modern data science packages have special values – NA or NaN (Not Applicable or Not a Number) – to describe data that is missing. It’s very helpful to ensure that data that you regard as missing shows as NA or NaN rather than as a default value (e.g. 0 or −1), as this helps with filtering out missing data. Pandas is somewhat confusing in its handling of missing data. Strictly speaking, NA applies to missing string, categorical or numeric data, whereas NaN applies only to missing numbers. However, Pandas represents missing data as NaN, while its functions to check for missing data refer to NA, e.g. .isna().
Once missing data is identified, you have to decide what to do with it, which may depend on the analysis
you are undertaking. Sometimes it can be possible to keep observations that contain missing data, but some
analyses only work if every variable in every observation has a value. In this case, you may need to drop any
rows containing NaN.
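A brief sketch of identifying and handling missing data with pandas; isna and dropna are the standard methods, and the sentinel value −1 is just an example:

    import numpy as np
    import pandas as pd

    readings = pd.Series([12.1, np.nan, 11.8, np.nan, 12.4])

    print(readings.isna())        # True wherever a value is missing
    print(readings.isna().sum())  # count of missing values

    complete = readings.dropna()  # drop missing values if the analysis requires it

    # If missing data was recorded as a default value such as -1, convert it first.
    raw = pd.Series([12.1, -1, 11.8, -1, 12.4])
    cleaned = raw.replace(-1, np.nan)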
Inner join In an inner join (Table 2.4), only the squirrels in both datasets will be present in the joined dataset.
Note that the key Fiona in Table 2.1 matches two rows in Table 2.3, so there are two corresponding rows in the
joined table. The data describing Fiona’s characteristics is repeated in these rows, but the unique information
about the date and time of the obstacle course runs is not repeated. Jakub isn’t present in this table, since
Jakub hasn’t had a time recorded on the obstacle course in Table 2.3.
Table 2.4: Results of inner join applied to Tables 2.1 and 2.3.
Name Weight (g) Length (mm) Sex Age Date Time (s)
Fiona 342 222.0 Female 1-2 years 2021-05-06 67.5
Fiona 342 222.0 Female 1-2 years 2021-05-10 50.2
Cameron 330 215.0 Male 2+ years 2021-05-08 55.6
Table 2.5: Results of left join applied to Tables 2.1 and 2.3.
Name Weight (g) Length (mm) Sex Age Date Time (s)
Jakub 320 211.0 Male Under 1 year nan nan
Fiona 342 222.0 Female 1-2 years 2021-05-06 67.5
Fiona 342 222.0 Female 1-2 years 2021-05-10 50.2
Cameron 330 215.0 Male 2+ years 2021-05-08 55.6
Left and right joins In a left join (Table 2.5), all squirrels present in the first (left) table passed to the merge
function will be retained in the merged table. When the key isn’t present in the second table, we fill in the
missing values with NaN. In Table 2.5 Jakub has NaN values for date and time, since Jakub hasn’t had a time
recorded on the obstacle course in Table 2.3.
In a right join all items in the right-hand table are retained, and the resulting table will have NaN values
when a key is present in the right-hand table but not the left-hand table.
Table 2.6: Results of outer join applied to Tables 2.1 and 2.3.
Name Weight (g) Length (mm) Sex Age Date Time (s)
Jakub 320.0 211.0 Male Under 1 year nan nan
Fiona 342.0 222.0 Female 1-2 years 2021-05-06 67.5
Fiona 342.0 222.0 Female 1-2 years 2021-05-10 50.2
Cameron 330.0 215.0 Male 2+ years 2021-05-08 55.6
Lily nan nan nan nan 2022-07-13 45.0
Outer join In an outer join (Table 2.6): All squirrels in both datasets will be present in the joined dataset.
When the key isn’t present in the first or second table, we fill in the missing values with NaN. In Table 2.6 Jakub
and Lily are present because they are present either in Table 2.1 (Jakub) or Table 2.3 (Lily). Jakub has NaN
values for date and time, since Jakub hasn’t had a time recorded on the obstacle course, and Lily’s age, height,
weight etc. are missing, since Lily didn’t appear in Table 2.1.
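A hedged sketch of how these joins might be performed with pandas.merge, using small tables containing the same squirrels as Tables 2.1 and 2.3 (the obstacle-course values are taken from the joined tables above):

    import pandas as pd

    # Squirrel characteristics (in the style of Table 2.1).
    characteristics = pd.DataFrame({
        "Name": ["Jakub", "Fiona", "Cameron"],
        "Weight (g)": [320, 342, 330],
        "Length (mm)": [211.0, 222.0, 215.0],
        "Sex": ["Male", "Female", "Male"],
        "Age": ["Under 1 year", "1-2 years", "2+ years"],
    })

    # Obstacle course runs (in the style of Table 2.3).
    runs = pd.DataFrame({
        "Name": ["Fiona", "Fiona", "Cameron", "Lily"],
        "Date": ["2021-05-06", "2021-05-10", "2021-05-08", "2022-07-13"],
        "Time (s)": [67.5, 50.2, 55.6, 45.0],
    })

    inner = pd.merge(characteristics, runs, on="Name", how="inner")  # Table 2.4
    left = pd.merge(characteristics, runs, on="Name", how="left")    # Table 2.5
    outer = pd.merge(characteristics, runs, on="Name", how="outer")  # Table 2.6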
2.6 Split-apply-combine
Split-apply-combine A commonly used operation in data science is split-apply-combine. It involves:
• Splitting the data into groups, for example according to the values of a categorical variable.
• Applying a function to each group independently. This could include computing a summary statistic for each group, such as a mean or count; performing some group-specific computation such as standardising the data within a group; as well as filtering to discard groups that have only a few members, for example.
• Combining the results into a single data structure.
For example, we might want to find the fastest time for male and female squirrels. In this case, we could split
the joined table (Table 2.4) by Sex, apply the min function to each subtable, and then combine the results in one
table (Table 2.7).
Table 2.7: Results of splitting by Sex, applying min() and then combining, applied to the inner join (Table 2.4).
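A minimal sketch of this split-apply-combine operation using pandas groupby, starting from the columns of the inner-joined table that we need:

    import pandas as pd

    # Relevant columns of the inner join (Table 2.4).
    joined = pd.DataFrame({
        "Name": ["Fiona", "Fiona", "Cameron"],
        "Sex": ["Female", "Female", "Male"],
        "Time (s)": [67.5, 50.2, 55.6],
    })

    # Split by Sex, apply min() to each group's times, combine into one table.
    fastest = joined.groupby("Sex")["Time (s)"].min()
    print(fastest)  # Female: 50.2, Male: 55.6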
Advantages of split-apply-combine Data science programming environments, such as R or the Pandas package
in Python, have functions that can achieve split-apply-combine operations in one line. We could write code to
achieve the same effect, but it would be less clear, at least to data scientists who are familiar with data science
programming idioms.
The split-apply-combine paradigm is also found in big data processing, where it is referred to as MapReduce. Here the
groups created by the split operation can be processed in parallel before being combined.
Chapter 3
Descriptive statistics
Modern Mathematical Statistics with Applications, Sections 1.1, 1.3 and 1.4
Descriptive statistics It’s not humanly possible to make sense of large raw datasets. For example, suppose
we know the salary of every member of staff in the University of Edinburgh - the list would be very long. To
make sense of this data we can try to summarise it or visualise it. Descriptive statistics refers to methods of
summarising data. This topic introduces the notation we’ll use for the sample and population mean and variance
of numeric univariate data. We will also introduce quantiles and skewed distributions.
Figure 3.1: Histograms of the weight in grams of a sample of squirrels recorded in the winters of 1985 and 1986 in coniferous woods in North Belgium (Wauters and Dhondt, 1989). Panel (a) shows the frequency of observations in bins; panel (b) shows the probability density of observations in bins. The bin width in both histograms is 10 g.
and there are not many individuals in this sample, so we should be wary of assuming that this distribution is truly representative of all squirrels.
Definition of sample mean For a numeric variable x with n observations or instances sampled from a population,
$x_1, x_2, \ldots, x_n$, the sample mean is defined as
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (3.1)$$
Sometimes the sample mean is written informally as $\frac{1}{n} \sum x_i$. When reporting the mean of a set of numbers, one convention is to report to one more decimal place than the accuracy of the $x_i$'s. For example, if the age of 6
cats is 3, 4, 5, 6, 6 and 7 years, the mean would be reported as 5.2 years, not 5.1666 years. Note that the units of the mean should be quoted; i.e. “5.2 years” not just “5.2”.
The sample mean is a measure of where the centre of the set of instances is. It is guaranteed to lie between the minimum and maximum $x_i$. It has the same units as the $x_i$.
Figure 3.2: Scottish wildcats, Felis silvestris silvestris, a critically endangered species. Credit: Peter Trimming / CC BY 2.0.
Population mean For a numeric variable $x$ with $N$ instances, $x_1, x_2, \ldots, x_N$, the population mean is defined as
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (3.2)$$
With bivariate data (two variables) or multivariate data (more than one variable), we can distinguish between
the population means of variables x, y, z by using subscripts: e.g. µx , µy , µz .
The population mean is a measure of where the centre of all instances in the population is. It is guaranteed
to lie between the minimum and maximum $x_i$. It has the same units as the $x_i$.
The sample mean is an estimate of the population mean. We will consider how good an estimate it is later in
the course when we learn about statistical inference. For now it is enough to know that it depends on how the
sample is chosen (randomly or by some other method), and the values of n and N.
For example, the median age of 6 cats aged 3, 4, 5, 6, 6 and 7 is 5.5 years. The median age of 5 cats aged
3, 4, 5, 6 and 7 is 5 years. Note that we should quote the units, as for the mean.
By analogy with the population mean, the population median µ̃ of a variable is the median of the entire
population.
Median and mean The mean and median of a sample or population are generally not the same. For example,
the mean age of 6 cats aged 3, 4, 5, 6, 6 and 7 years is 5.2 years, but the median is 5.5 years.
• If a distribution is symmetric, the mean and the median are equal: $\bar{x} = \tilde{x}$.
Suppose the age of the cats had been 3, 4, 5, 6, 6 and 18. The mean age would now be 7 years, but the
median is unchanged at 5.5. An instance that appears to be far away from most of the other numbers is called
an outlier. The example shows that the median is less affected by outliers than the mean. For this reason, the
median can be seen as a better way of measuring a typical value of a variable.
It is often worth checking outliers, to make sure that they are real data. Depending on how the data has been
collected, an outlier might be due to a faulty sensor, or a mistake in data entry or in the logic of an automated
programme collecting data. However, outliers may well be real data, and should not just be removed as a matter
of course.
The sample variance and standard deviation give a measure of how spread out the data is. It’s an average of
a measure of distance of each point from the sample mean (we’ll come to why we divide by n − 1 rather than n
later).
One measure of distance we could use is the magnitude (absolute value) of the deviation from the mean of each observation: $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$. We cannot just use the deviations themselves, since they add up to 0: $\sum_i (x_i - \bar{x}) = \sum_i x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$. The magnitude of each deviation $|x_i - \bar{x}|$ is guaranteed to be positive. However, the average of the magnitudes is not as nicely behaved mathematically as the average of the squared deviations, as defined in Equation 3.4.
It’s important to quote the units of the standard deviation and variance. The standard deviation has the same
units as the quantity in question and the variance has those units squared.
Definition of population variance and population standard deviation By analogy with the population mean, for a numeric variable in a population of $N$ instances, $x_1, x_2, \ldots, x_N$, the population variance is defined as
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \qquad (3.6)$$
Why the divisor n − 1 in the sample variance? In short, we’d like the sample variance to be a statistically
unbiased estimate of the population variance (we define estimator bias formally in the chapter on Estimation).
The squared deviations between the sample mean and the observations, $(x_i - \bar{x})^2$, will tend to be smaller than the squared deviations between the population mean and the observations, $(x_i - \mu)^2$, so if we divided by $n$ we would tend to underestimate the population variance. For the truly interested, we prove later that dividing by $n$ gives a biased estimate (see Estimation bias and variance).
A second way of thinking about this is that we know that the sum of the deviations is 0:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0 \qquad (3.8)$$
So if we know all but one (n − 1) of the deviations, we can use the above equation to deduce the deviation we
don’t know. We say that this means there are n − 1 degrees of freedom, and it turns out that it makes sense to
divide by n − 1.
In practice, when n is large, the difference doesn’t matter much. It’s worth being aware that various Python
packages have different conventions about dividing by n − 1 or n. pandas.Series.std divides by n − 1 whereas
numpy.std divides by n. This behaviour can be changed by specifying the ddof parameter in either function.
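A quick check of these conventions, using the six cat ages from earlier in this chapter; pandas.Series.std divides by n − 1 by default, numpy.std divides by n, and both accept the ddof parameter:

    import numpy as np
    import pandas as pd

    ages = [3, 4, 5, 6, 6, 7]  # cat ages in years, from the example above

    print(pd.Series(ages).std())        # pandas default: divides by n - 1 (ddof=1)
    print(np.std(ages))                 # numpy default: divides by n (ddof=0)
    print(np.std(ages, ddof=1))         # now matches the pandas default
    print(pd.Series(ages).std(ddof=0))  # now matches the numpy default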
Scaled quantities Sometimes quantities can be scaled, for example if the units change. If a variable y = cx,
where c is a scaling constant, then the following relationships hold:
$$\bar{y} = c\bar{x}, \qquad s_y^2 = c^2 s_x^2, \qquad s_y = c s_x \qquad (3.9)$$
Another way of writing the variance It’s sometimes helpful to rearrange the summation over the squared
deviations in the sample or population variance:
$$\begin{aligned}
\sum (x_i - \bar{x})^2 &= \sum (x_i^2 - x_i \bar{x} - \bar{x} x_i + \bar{x}^2) && \text{expanding} \\
&= \sum x_i^2 - \sum x_i \bar{x} - \sum \bar{x} x_i + \sum \bar{x}^2 && \text{splitting up the summation} \\
&= \sum x_i^2 - n\bar{x}\bar{x} - n\bar{x}\bar{x} + n\bar{x}^2 \\
&= \sum x_i^2 - n\bar{x}^2 \qquad (3.10)
\end{aligned}$$
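A small numerical check of this identity (a sketch using numpy and the cat ages from earlier):

    import numpy as np

    x = np.array([3, 4, 5, 6, 6, 7])
    n = len(x)
    xbar = x.mean()

    lhs = np.sum((x - xbar) ** 2)          # sum of squared deviations
    rhs = np.sum(x ** 2) - n * xbar ** 2   # the rearranged form
    print(lhs, rhs)                        # the two values agree (up to rounding)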
3.6 Quantiles
Definition of a percentile The $y$th percentile of a set of numeric observations $x_1, \ldots, x_N$ is the value that is above $y\%$ of the values. For example, a baby that is on the 95th percentile for weight will weigh more than 95% of other babies.
The median as the 50th percentile By definition 50% of observations are less than the median, so we could
also think of the median as the 50th percentile.
Lower and upper quartiles The 25th percentile is called the lower quartile (since it encloses the lower quarter
of the distribution) and the 75th percentile is called the upper quartile. The difference between the upper and
lower quartiles is called the interquartile range. The interquartile range is a measure of the spread of the
distribution of values.
Quantiles Percentiles and quartiles are all examples of the general concept of quantiles. q-Quantiles are
the values that divide the population or sample into q separate nearly equally sized groups. Percentiles
are 100-quantiles and quartiles are 4-quantiles. Other common uses are deciles (10-quantiles) and quintiles
(5-quantiles).
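A short sketch of computing quantiles in Python on some illustrative weights; numpy.percentile and pandas.Series.quantile are the standard functions (note that different interpolation conventions exist, so results can differ slightly from a hand calculation):

    import numpy as np
    import pandas as pd

    weights = pd.Series([280, 300, 310, 320, 330, 342, 350, 360, 380, 400])  # illustrative

    median = weights.quantile(0.5)            # the 50th percentile
    q1, q3 = weights.quantile([0.25, 0.75])   # lower and upper quartiles
    iqr = q3 - q1                             # interquartile range

    print(np.percentile(weights, [25, 50, 75]))  # the same quartiles via numpy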
Example: AI-assisted tax collection In 2022, the French government recovered millions of euros in unpaid taxes from people who had swimming pools but weren’t declaring them and weren’t paying tax on those pools.
The Government applied a very basic AI algorithm to Google Maps images to find the pools much faster than
could be done by humans checking the images. The tax collected was redistributed through society. But were
there any potential downsides of such application of data and algorithms? For example, such algorithms could
be used by governments searching for particular individuals who might be participating in demonstrations.
By agreeing to the terms of service for such technology, we opt in (often unknowingly) to allowing institutions to take control of our personal data. Although this is legal, it’s a big problem because we’re compromising our privacy.
Another example of the social pitfalls of using AI algorithms involves false positives. A well known case
involves a man who took a picture of his sick son to send to the doctor and was labelled as a potential child
pornographer, and had his Google account blocked. The ethical ramifications of algorithms are a significant issue that affects all of us. We will consider these issues from several different angles.
What is ethics? Ethical principles have been around since the age of Greek philosophers. They provide us
with a kind of checklist of what is right and what is wrong in society. Our job as data scientists is to understand
and apply that checklist to some of the stages in the data science lifecycle.
Universal and context-specific ethical guidelines Some ethical guidelines are so universal that they’re naturally
understood, for example, “Thou shalt not kill”. That’s a rule of behaviour that has been around for thousands of
years, and that is a guide to good behaviour that people in society should be able to follow. Other types of
ethical behaviour are contextual: they depend on the culture, the person, and the way in which rules are distributed and applied, because ethical reasoning is subjective. For example, in some countries, no-one stands in a queue, but in the UK not queueing would be considered inappropriate behaviour. So culture and context matter. When building a data science application, it’s very difficult to capture this context.
The ambiguity of moral decisions – the trolley problem People are often ambiguous about what are good
moral decisions, depending on the context. The trolley problem is one of the tools that philosophers use to talk
about the moral dilemmas that come about with ethical reasoning. A runaway train is hurtling towards 5 people
on a railway track. You look around, but there’s no way of warning them. You notice that you’re standing next to
a lever that you can operate. You can divert the train onto another track and save the people. But there is a
problem: there is also a person on the other track. If you hit the lever and divert the train, then they die. If you
do nothing then 5 people die, but if you pull the lever then 5 people are spared but 1 person dies. When people in different places all over the world are asked how they would behave in this hypothetical scenario, about 50% are willing to switch the points, for the net benefit of saving four lives.
Philosophers consider variations of this problem to understand how people think about different contexts.
Now imagine there are no points any more, but there’s a man standing on top of a bridge. Now the only way
to save these five people is by pushing the man on the tracks, thereby stopping the train with his body. The
number of people willing to push the man off the bridge is fewer than those willing to switch the points, even
though the net effect is the same (saving 5 people at the expense of one person).
Relevance of the trolley problem There is much research on understanding how autonomous vehicles should behave in situations where they might kill people. Suppose the brakes have failed in an autonomous vehicle that is approaching a junction that a school bus is crossing. In order to avoid the school bus, the vehicle needs to slam into a tree and kill the driver. If you think purely about utility, that is the rational outcome.1
When surveyed, most people agreed that autonomous vehicles should take moral decisions for the benefit of
society and the common good. But people in the survey weren’t willing to buy such a car. The trolley problem
has a real, strong relevance to how computers should be making decisions. This example demonstrates that data
ethics is a wide, surprisingly complex area of research that requires philosophers and computer scientists and
machine learning people to work together.
Ethical reasoning is relevant to all the stages of the data science life cycle, from when data is collected, how
data is used, how data is distributed, and even how it is controlled. We shall touch on several of these aspects.
GDPR The EU has a very successful set of regulations on data protection and privacy called the General Data Protection Regulation or GDPR (https://gdpr-info.eu/). GDPR lists regulations that concern the rights of data subjects, the duties of data controllers or processors, transfers of personal data to third countries and more. Importantly, it also details liability and penalties for breaches of rights. GDPR has also been widely applied outside the EU and is considered to be a gold standard in data protection.
In particular, GDPR allows individuals the right to access and rectify data that concerns them, and even the right to erase data. This is also known as the ‘right to be forgotten’. This right can be exercised if personal data is no longer necessary for its original purpose, or if it harms an individual’s interests and there is no overriding legitimate interest to keep it. For example, Germany’s highest court has ruled that a German man convicted of murder in 1982 has the right to have his name removed from online search results.
Interventions As data scientists, we should be aware that data is used by algorithms that intervene in people’s lives, nudging them towards behaviour that they would not have engaged in otherwise. For example, convenience
1 You can participate in an experiment presenting similar scenarios at https://www.moralmachine.net/.
chain stores in the USA have been known to track consumer behaviour for the purpose of recommending products
for them. Ride sharing apps have been known to use intervention methods to get drivers to stay longer on the road,
when they are predicted to quit and go home. Video sharing apps use recommendation algorithms to lure people into consuming more and more content, while this content becomes more radical or extreme in nature. This phenomenon has been coined “down the rabbit hole” (https://www.anewseducation.com/post/youtube-rabbit-holes).
In the Cambridge Analytica scandal, personal data belonging to millions of Facebook users was collected without their consent by the British consulting firm Cambridge Analytica, predominantly to be used for political advertising.
We will also discuss possible ways of mitigating the influence of AI algorithms on our behaviour. There are several rules of behaviour that we can apply to respect people’s privacy and autonomy, such as:
• being transparent about whether AI is being used to recommend or to motivate users;
• allowing people to opt out at any stage from receiving AI-generated interventions.
4.3 Bias
The definition of bias is an inclination or prejudice for or against one person or group, especially in a way
considered to be unfair. It turns out that bias can be exhibited not only by people, but also by computers.
Algorithmic bias refers to attributes of an algorithm that create unfair outcomes. When it does this, it unfairly
favours someone or something over another person or thing.
Algorithmic Bias Algorithmic bias describes systematic and repeatable errors in a computer system that create
“unfair” outcomes, such as “privileging” one category over another in ways different from the intended function of
the algorithm.
Bias can emerge from many factors, including but not limited to the design of the algorithm, its unintended or unanticipated use, or decisions relating to the way data is coded, collected, selected or used to train the algorithm.
Machine learning algorithms can introduce bias at different stages in the data science life cycle; they may
provide biased recommendations and decisions. Noteworthy examples of machine bias that have been widely
reported in the media include:
• The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) algorithm used
in US court systems to predict the likelihood that a defendant would become a re-offender. This model
predicted twice as many false positives for black offenders as for white offenders.
• In 2015, Amazon realised that the algorithm it used for hiring employees was biased against women.
• In 2019, Facebook was found to be in violation of US anti-discrimination law, by allowing its advertisers to deliberately target adverts according to gender, race, and religion, all of which are protected classes under the country’s legal system.
How can we mitigate machine bias? Unfortunately, there are no quick answers to this question. Humans are
the ultimate source of bias, as they are the ones generating the data and writing the algorithms that use the
data to learn and make decisions. There is a lot of work (outside the scope of this course) on detecting fairness
and bias in machine learning. It is also clear that regulatory practices should be developed that limit algorithm
discrimination and also provide guarantees such as the right to explanation. We will discuss several of these
issues in the data ethics workshop.
loans. This is a procedure that’s been around for a long time. Banks compute a score for each application based
on different attributes, and then they rank all the applications and choose a cut-off. Those with a score above that
cut-off will receive the loan and those with a score below that cut-off will not receive the loan. Nowadays, banks
use computer algorithms to compute credit scores using attributes like age, time in current address, occupation,
and so forth. The natural question to ask is what attributes (features) we should allow these algorithms to
use in order to avoid bias. Legislation prohibits certain characteristics from being used in credit scoring (race, colour,
national origin, gender, marital status, religion, receipt of public assistance, or exercise of consumer protection
rights). This is called the law of equal opportunity. Protected attributes include disability, gender reassignment,
marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, sexual orientation. Age has
special status and can be used in some cases (credit scoring is one).
The intent of the legislator is that by practising equal opportunity in credit scoring, we will achieve equal
outcome, which simply means that two individuals who differ only on protected attributes should not differ in their likelihood of obtaining a loan.
The study showed this was not the case in practice. Specifically, equal treatment was not fulfilled, in that
the model learned to distinguish applications based on attributes that are proxies for gender, such as salary
(unfortunately, women still earn less than men on average).
The study also showed that gender itself is a statistically significant predictor of credit risk. If we allowed
the model to train on gender, we could get a more accurate prediction of credit risk. This means that women are
“paying” for the fact that their gender is a protected attribute, in that their cut-off is higher than it would be if
the model was allowed to distinguish on gender (and conversely for men). The conclusion is that existing law is
not effective in promoting equality when it comes to algorithms.
• Identify any possible anomalies in the data, e.g. missing data or data points that appear unlikely
In exploratory data analysis, we should try to avoid any preconceived ideas about relationships in the data,
but we should use our pre-existing knowledge to assess whether data make sense. For example, if we come
across a temperature of over 100 °C in a time series of temperatures recorded outside in Edinburgh, we should be suspicious that there has been an error in the sensor, or in the processing of the data we have received.
Why is visualisation important? We define visualisation as the process of conveying information through
graphical representations of data. Visualisation is a crucial part of our own exploratory data analysis, and is
also vitally important for communicating results to others.
The aims of communicating data are to
This may sound obvious, but it is surprising how changing the presentation of a dataset can make features of
the dataset become more apparent.
The influential statistician John Tukey explained the power of visualisation thus:
The simple graph has brought more information to the data analyst’s mind than any other device.
The greatest value of a picture is when it forces us to notice what we never expected to see.
It is this property of forcing “us to notice what we never expected to see” that makes visualisation so powerful. For example, suppose we looked at the series of ambient temperature readings mentioned above as numbers. We would be unlikely to spot the reading of 100 °C. However, if we plotted the series with time on the x-axis and temperature on the y-axis, this data point would stand out to us.
Why does visualisation work? Put simply, the human visual system is very good at identifying what are called
preattentive attributes in scenes. Wexler et al. (2017) list the following preattentive attributes:
• Length
• Width
• Position
• Size
• Shape
• Curvature
• Added marks
• Enclosure
• Colour value
• Colour hue
• Spatial grouping
We use these features to construct visualisations in which aspects of the data “pop out” at us. It is also possible to create visualisations that mislead or confuse. Creating visualisations that truthfully and informatively represent the data is an art that takes practice to acquire.
How we create visualisations Before the computer age, visualisation was a painstaking process of drawing
lines accurately on paper. The earliest works that we would recognise as modern visualisations were created in the late 18th century by the Scottish economist William Playfair (not the architect), who used them to illustrate various economic data. Nowadays visualisations can be created by computer packages such as Excel,
or via cloud-based packages such as Power BI and Tableau.
In this course we will focus (in the labs) on creating visualisations programmatically using Python’s Matplotlib
and Seaborn packages (though we could also have used R). Creating plots using computer code can seem more
time-consuming than doing so using a graphical package. However, an advantage is that we can reproduce
the steps taken to analyse data and produce a visualisation. We also have fine control over the design of our
visualisations.
– Bar charts: good for plotting numeric variables associated with categorical or ordinal variables, for
example the mean weight (numeric variable) of male and female (categorical variable) squirrels.
– Line charts: for showing trends of numerical variables over time (itself a numerical variable).
– Scatterplots: show the relationship between two numeric variables.
– Histograms: are good for showing the distribution of a single numeric variable, for example the weights of male squirrels. The area of each bar in a histogram shows the frequency (i.e. number) of observations within a set of bins (i.e. intervals). The size of the bins is your choice: up to a point, smaller bin sizes show more detail, but a very small bin size will obscure the distribution. Bins can be of different sizes – because we’re plotting the area, this won’t over-represent counts in large bins. It’s also possible to normalise histograms so that the bars show the density, i.e. the frequency in each bin divided by the total number of observations and by the bin width. In this case the total area of the histogram is equal to 1 (see the plotting sketch after this list).
– Density plots: also show the distribution of a numeric variable and can be seen as a smoothed
histogram. The density is an estimate of the probability density function underlying the distribution,
and can be generated using kernel density estimation: the estimated density at each point is
computed by placing normal distributions of a particular width (called the bandwidth) around each
point. Analogous to the bin width in histograms, narrower bandwidths lead to a more bumpy
appearance, and wider ones a smoother appearance. Packages such as Seaborn set the bandwidth
automatically, but it’s possible to adjust it.
– Boxplots: represent the distribution of a numeric variable for multiple categories, e.g. the weights
of male and female squirrels. Boxplots (also known as box and whisker plots) are one-dimensional
representations of a distribution in which the box extends from the lower (first) to the upper (third)
quartile values of the data, while the line across the box represents the median. The ’whiskers’, the
lines extending from the box, can represent different things, as described in the Wikipedia article
on boxplots. By default, Matplotlib defines the end of the upper whisker as the value of the largest
data point that lies within 1.5 times the interquartile range from the upper quartile, and the lower
whisker as the value of the smallest data point that lies within 1.5 times the interquartile range from
the lower quartile. Data points that lie outwith the whiskers are called outliers, and are represented
by dots or circles. Since the whiskers can represent multiple statistics, ideally their meaning should
be indicated in the plot caption.
– The above plots are univariate or bivariate, since they display one or two variables. We can use
shape and colour to create extra dimensions for categorical variables. For example, in a scatterplot of
squirrel weight versus length, we can indicate sex using colour, thus displaying 3 variables. We could
also indicate age categories by changing the size or the shape of the markers (4 variables). However,
we must be careful that adding information using marker properties does not detract from the plot.
– Barcharts can be extended to two categorical variables and one numerical variable by using colour.
– Choose an appropriate colour scale, depending on whether the data is sequential, diverging or categorical.
– Colour can also be used to highlight features in the plot, e.g. the largest two bars in a bar plot.
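The following is a minimal Seaborn/Matplotlib sketch of the distribution plots described above: a histogram normalised to unit area, a kernel density estimate with an adjusted bandwidth, and a boxplot per category. The `squirrels` data frame and its `weight` and `sex` columns are synthetic stand-ins for illustration, not the dataset used in the labs.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the squirrel data: weight in grams, sex as a category.
rng = np.random.default_rng(0)
squirrels = pd.DataFrame({
    "weight": np.concatenate([rng.normal(330, 20, 100), rng.normal(350, 20, 100)]),
    "sex": ["F"] * 100 + ["M"] * 100,
})

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
# Histogram normalised so that the total area of the bars is 1.
sns.histplot(data=squirrels, x="weight", stat="density", bins=15, ax=axes[0])
# Kernel density estimate; bw_adjust scales the automatically chosen bandwidth.
sns.kdeplot(data=squirrels, x="weight", bw_adjust=0.5, ax=axes[1])
# Boxplot of weight for each sex.
sns.boxplot(data=squirrels, x="sex", y="weight", ax=axes[2])
fig.tight_layout()
plt.show()
```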
• Encourage the eye to compare several pieces of data, e.g. by using multiple plots with the same scale.
– Wexler et al. (2017), p. 31, is a nice example of how this can work better than using multiple symbols
on a plot (p. 30).
• Present many numbers in a small space
– A boxplot takes up as much space as a barplot, but conveys more information. For example, a boxplot
of the squirrel’s weight versus sex shows information about the distribution of the weight as well as
the median weight.
• Choose appropriate transforms
– Sometimes it can make sense to transform data so that features of it are clearer. For example, plotting
the value of Bitcoin over time shows very little detail about the early history of the currency, when it
was not valuable. However, plotting the log of the value of Bitcoin on the y-axis allows this detail to
be seen.
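As a rough illustration of this kind of transform, the sketch below plots the same synthetic, exponentially growing series on a linear and on a logarithmic y-axis; the data is made up purely for illustration, not real Bitcoin prices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for a value that grows over several orders of magnitude.
t = np.arange(2010, 2024)
value = 0.01 * 2.0 ** (t - 2010)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(t, value)           # linear y-axis: early values are invisible
ax2.plot(t, value)
ax2.set_yscale("log")        # log y-axis: detail visible at all scales
fig.tight_layout()
plt.show()
```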
Consider the left-hand panel of Figure 1 in Kramer et al. (2014), which shows the percentage of words
written by Facebook users in a week that were classified as emotionally positive for users in a control
condition, or in an experimental condition in which 10% of users’ friends’ posts containing negative words
were omitted from their feeds. Suppose that the quantity measured in the control condition is yC and the
quantity measured in the experimental condition is yE . Then
\[
\text{size of effect in data} = \frac{y_E - y_C}{y_C}
\]
In the "Negativity reduced" condition (left), yC = 5.24% of words and yE = 5.30% of words. Therefore, the
effect size in the data is (5.30 − 5.24)/5.24 ≈ 1.145%. However, if we look at the size of the bars in the graphic,
the control bar is 0.24 high and the experimental bar is 0.30 high. Thus, the effect size in the graphic is
(0.30 − 0.24)/0.24 = 25%. Therefore, Tufte's lie factor is 25/1.145 ≈ 21.8.
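To check the arithmetic, here is a short calculation of the two effect sizes and the lie factor; the baseline value y0 = 5.0 is inferred from the bar heights quoted above, so it is an assumption rather than a value taken from the paper.

```python
# Reproducing the lie-factor arithmetic for the bar chart described above.
y_C, y_E = 5.24, 5.30    # values plotted in the control and experimental conditions
y_0 = 5.0                # assumed baseline of the y-axis in the graphic

effect_in_data = (y_E - y_C) / y_C
effect_in_graphic = ((y_E - y_0) - (y_C - y_0)) / (y_C - y_0)
lie_factor = effect_in_graphic / effect_in_data
print(f"data: {effect_in_data:.3%}, graphic: {effect_in_graphic:.1%}, lie factor: {lie_factor:.1f}")
# data: 1.145%, graphic: 25.0%, lie factor: 21.8
```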
– A very common problem is that the baseline (i.e. the lowest point on the y−axis) in a barchart is not
zero. This can lead to small differences appearing large, and will cause the lie factor to be different
from 1. However, in some cases a non-zero baseline is justified, for example in time series when it is
the absolute changes over time rather than proportional differences that are important to see.
– Although marker area can be useful for indicating categories, humans are not very good at relating
the area to a quantity – we are much better at comparing lengths.
More generally, suppose the baseline of the y-axis in the graphic is some value y0 rather than zero. Then
\[
\text{size of effect in graphic} = \frac{(y_E - y_0) - (y_C - y_0)}{y_C - y_0} = \frac{y_E - y_C}{y_C - y_0}
\]
We substitute the two equations into the equation for the lie factor to give:
\[
\text{Lie factor} = \frac{y_C}{y_C - y_0}
\]
If y0 = 0, i.e. when the baseline of the variable we are measuring is at zero, we can see that the lie factor
is 1.
For positive yC with a baseline above 0, the lie factor will be above 1, i.e. the size of the effect will be
exaggerated. We can make the lie factor less than one (i.e. understating the size of the effect) by having a
negative baseline – but this rarely happens.
• Make sure text is legible, i.e. font size of minimum 8 points in a PDF, or about 20 points in a presentation.
(It is surprising how often talks are given in which it’s impossible to read the labels on plots even from the
front row.)
• Use colours that work for people with colour-vision deficiency. Wexler et al. (2017), Chapter 1 has an
excellent introduction to using colour in visualisations.
Suppose you are writing a report in which the width of the page is 8 inches, with 1 inch left and right
margins. The width of the body text is therefore 6 inches. You generate a figure that has a figsize
of (6, 4), and a fontsize of 8 points, save the figure to file, and include it in your document. Since
figsize=(6, 4) means “6 inches wide and 4 inches tall”, when you include the figure, the font size
remains 8 points. All is well.
Suppose now you are having difficulties fitting all the numbers on the plot in. You decide to set
figsize=(12, 8) (12 inches wide and 8 inches high) and keep the font size at 8 point. Bingo! Your
numbers now fit. You put the image in the document so that it fills the width of the page. But now the
actual font size is 4 points, because you have shrunk the image by a factor of two to fit it into the page.
Your plot will not be accessible to some people. This is a very common mistake.
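A minimal sketch of the first, correct scenario: the figure is created at the width it will occupy in the document, with 8-point text, so that no shrinking is needed. The file name and data are placeholders.

```python
import matplotlib.pyplot as plt

# A figure intended to sit in a 6-inch-wide text column at its natural size,
# so that the text really is 8 pt when the figure is included in the document.
fig, ax = plt.subplots(figsize=(6, 4))          # 6 inches wide, 4 inches tall
ax.plot([0, 1, 2], [3, 1, 2])
ax.set_xlabel("x", fontsize=8)
ax.set_ylabel("y", fontsize=8)
ax.tick_params(labelsize=8)
fig.savefig("figure.pdf", bbox_inches="tight")  # include at 100% width: text stays 8 pt
```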
• Reduce clutter
• Correct spelling
• Ineffective visualisations
– Scale distortions
– Ineffective colours
– Graph junk
• Effective visualisations
• Matplotlib basics
• Overlay charts
• Downloaded as structured files (e.g. tabular data, or data in JSON format) from websites, e.g. scientific,
government or charity data repositories
• Web scraping – automatically extracting data from a web page.
These are listed in order of difficulty. It’s generally better to start with the easier options first – there’s no need
to scrape data if it’s already available in a structured format.
The first rule of web scraping is don’t do it unless you have to. Data is increasingly available in structured
formats such as CSV and JSON. Web scraping seems cool, but it’s also time-consuming and fiddly. It pays to
spend some time searching for files in structured formats, in which the data may well be in a cleaner format than
on a web page. However, if you can’t find the data you need in a structured format, but it is available on a web
page, then web scraping can be helpful.
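If scraping is justified and permitted, a common starting point is the `requests` and `BeautifulSoup` libraries. The sketch below is illustrative only: the URL is hypothetical, real pages need page-specific parsing, and you should first check the site's terms and conditions (and robots.txt).

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL – replace with a page whose terms and conditions permit scraping.
url = "https://example.com/some-table-page"

response = requests.get(url, headers={"User-Agent": "FDS-student-project"})
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Extract the text of every cell in the first table on the page (if there is one).
table = soup.find("table")
if table is not None:
    rows = [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]
    print(rows[:5])
```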
Ethical and legal considerations in online collection It is worth remembering that using online data involves
a relationship between you and the data creator, and possibly the people who funded the data collection, so we
need to think a little about law and ethics.
Licences for downloaded files If you're downloading data files, it's likely they will have been issued with a
data license, which governs what you are allowed to do with the data, and how you can publish any work that
you produce using that data. A few common types of license are:
Creative Commons These general-purpose licenses can come with conditions, such as that the creator must be
credited (BY) or that only non-commercial uses are allowed (NC). The CC0 license is very permissive,
meaning that the owner has no control of the data, not even requiring data users to attribute the data
creator – but it is always good practice to attribute sources.
Open Data Commons Similar to the Creative Commons licences, but designed for data.
You should check the license or copyright statement, if there is one. This section of The Turing Way (The
Turing Way Community, 2022) has more information about data licences.
API conditions Sites with an API (e.g. Twitter) have conditions under which you can take the data – you
should respect those conditions if they are not already enforced by the API.
Figure 6.1: Red squirrel, Sciurus vulgaris. (Credit: Peter Trimming / CC BY 2.0)
Ethical web-scraping As with data files, the owner of the website you wish to scrape may have invested
considerable time and money in creating their site. It’s possible that by web scraping you could have an adverse
effect on either their intellectual property or their web server.
The law around web scraping is not always clear (Davies, 2020) but you should always check the terms
and conditions of the website before starting scraping. For example, the Copyright Policy of the Financial
Times website says you cannot "Frame, harvest or scrape FT content or otherwise access FT content for similar
purposes." In contrast, the Time Out website terms and conditions are more permissive.
Beyond the legal restrictions for a specific website, we should also consider general ethical principles of web
scraping. Densmore (2017) suggests a number of rules for those undertaking web scraping, including:
• respecting content rights, and not passing data off as one's own
• giving back to the data owner when possible, including attributing the data owner in any publication
6.2 Correlation
In the topic on Descriptive statistics, we considered summary statistics of one variable, for example the weight of
a wildcat, or another wild animal, such as the red squirrel (Figure 6.1). In data science and statistics we’re often
interested in statistical relationships between two or more variables. In this section we’ll discuss the most basic
statistical relationships: covariance and correlation.
Covariance Suppose that as well as measuring the weight in grams xi of the ith squirrel, we also measure its
length in millimetres yi (Figure 6.2). The sample covariance is defined:
\[
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (6.1)
\]
We expect that squirrels that are longer than average will also be heavier than average, i.e. both xi − x and
yi − y are positive. Thus, for these squirrels (xi − x)(yi − y) will be positive. We also expect that squirrels that
are shorter than average will be lighter than average. For these squirrels both contributions xi − x and yi − y
will be negative, but the product (xi − x)(yi − y) will be positive. We should thus expect that the covariance of
squirrel length and weight is positive.
If one quantity gets smaller as the other one gets bigger, then, by similar reasoning, the covariance is
negative.
There are a few points to note about the covariance:
1. Its units are the product of the units of the two variables; in this example the units would be g · mm.
2. It depends on the scaling of the variables; suppose we measured the squirrels’ weight in kilograms (instead
of grams) and their length in centimetres (instead of millimetres). The covariance would be 10,000 times
smaller, even though the relationship between the variables remains the same.
We don’t want a measure of how strongly quantities are related to depend on the units they are measured in, so
the second point is a problem.
Correlation coefficient The sample correlation coefficient (also known as Pearson’s correlation coefficient1 )
addresses this problem by dividing the covariance by the product of the standard deviations of the two quantities:
\[
r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (6.2)
\]
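As a quick check of these definitions, the sketch below computes the sample covariance and correlation coefficient by hand and compares them with NumPy's built-in routines; the length and weight values are made up rather than real squirrel measurements.

```python
import numpy as np

# Hypothetical measurements: squirrel length (mm) and weight (g).
length = np.array([195.0, 200.0, 210.0, 220.0, 230.0])
weight = np.array([310.0, 320.0, 340.0, 355.0, 370.0])

n = len(length)
s_xy = np.sum((length - length.mean()) * (weight - weight.mean())) / (n - 1)
r = s_xy / (length.std(ddof=1) * weight.std(ddof=1))

# The same quantities via library routines:
print(s_xy, np.cov(length, weight, ddof=1)[0, 1])
print(r, np.corrcoef(length, weight)[0, 1])
```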
Standardised variables Another way of looking at the correlation coefficient is in terms of standardised
variables. The standardised version of a variable x is, by convention, denoted by z, and is also referred to as a
z-score. The standardised version of the ith instance is defined as:
\[
z_i = \frac{x_i - \bar{x}}{s_x} \qquad (6.3)
\]
The standardised variable zi has several nice properties: it has zero mean, unit standard deviation, and no units,
whatever the units of the original variable x.
What’s hiding in a correlation coefficient? With multivariate numeric data, a very helpful technique can be
to look at the correlation coefficient of every pair of variables. These can be plotted as a heat map, quickly
showing where there might be interesting relationships (i.e. correlation coefficients close to 1 or −1). We’ll use
this technique in the labs.
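A minimal sketch of a correlation heat map using Seaborn; the data frame here is synthetic, with one pair of strongly correlated columns and one unrelated column.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic multivariate data standing in for a real dataset.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": x + 0.3 * rng.normal(size=200),   # strongly correlated with a
    "c": rng.normal(size=200),             # unrelated to the others
})

# Heat map of every pairwise correlation coefficient.
sns.heatmap(df.corr(), vmin=-1, vmax=1, cmap="coolwarm", annot=True)
plt.show()
```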
But before using correlation heat maps, we should take a look at Figure 6.3. Can you guess roughly what
the correlation coefficient is in each plot? Before going to the next the page, try to think about how you would
describe the data in each plot in words.
Figure 6.3: What are the correlation coefficients in these plots? “Anscombe’s quartet” (Anscombe, 1973)
Well, the answer is they all have the same correlation coefficient, r = 0.82. But the visualisation indicates
that something quite different is happening in each one:
1. Basically a linear relation, with some noise around it – quite like the squirrel length and weight
2. It looks like there is a very precise nonlinear (perhaps quadratic) relationship between the variables here
3. It looks like there would be a linear relationship with r = 1 if we took out the outlier point
4. All but one of the points have the same x value; the apparent linear relationship depends entirely on the single point with a different x value
Two variables are causally related when changing the value of one of the variables affects the value of the
other variable. To return to the examples:
1. Multiple lines of evidence show that wearing seat belts causes fewer injuries in car accidents
2. We might ask if it is really age that causes a higher chance of credit approval, or some other factor
associated with age, for example higher income.
3. It is fairly clear that eating ice cream does not cause sunburn, and vice versa.
Spurious correlations Suppose we have a correlation between two variables, and moreover, we're happy that
it looks linear – does this indicate that one variable is causing a change in the other variable? Not necessarily:
for example, the number of films Nicolas Cage appeared in each year is correlated with the number of people
who drowned in swimming pools. Spurious correlations are correlations that arise by chance between two
unrelated variables.
Confounding variables A confounding variable is a variable that influences two or more variables, meaning
that the two variables are correlated, even if there may be no causal relationship between them. In the ice
cream example, high temperatures and sunshine may cause both sunburn and ice cream consumption, and so
temperature is a confounding variable.
Biases in data Selection or sampling bias is when the sample chosen is not representative of the population
under investigation.
Establishing causation
• Observational data
• Experiments
Chapter 7
Supervised Learning: Classification with Nearest Neighbours
7.1 Classification
The classification problem Suppose a bank has data on previous customers it has given loans to, including
variables such as their income, housing status and employment status. Each of these sets of variables – also
referred to as feature vectors – has a label, indicating whether the customer did or didn’t pay back their loan.
The bank might want to predict whether a new customer will be able to pay back a bank loan from their features,
i.e. to predict whether they belong to the class of customers who paid or the class of customers who didn’t pay.
This is an example of classification, which we define as the problem of predicting the correct category label
(“class”) for an unlabelled input feature vector.
Visualising the classification problem To visualise the classification problem, we’ll use a toy example: the
fruit data set, collected by Iain Murray (Murray, 2006). He bought pieces of fruit and measured their height and
width (features) and noted the type of fruit (the label). Figure 7.1 visualises the data. In the context of the fruit,
the classification problem is using this dataset to build a machine to predict the class of a piece of unidentified
fruit automatically just by measuring its width and height. We will refer to the feature vector of this unidentified
fruit as the test point.
Figure 7.1: The supervised learning problem, as applied to fruit. We are given the labels of the fruit with various
widths and heights. We are then presented with an unknown piece of fruit with given width and height (the test
point, represented by the question mark). The task is then to predict what type of fruit it is.
Figure 7.2: Training data for apples and pears. It is not possible to draw a straight line to perfectly separate the
classes in this feature space. Three possible lines are drawn, but each results in a number of misclassifications.
Decision boundaries Consider the problem of determining apples from pears. In this example case we have
two features for each piece of fruit: its circumference (at the widest point) and its height, each measured in cm.
Apples are ‘more spherical’ than pears which tend to be longer than they are wide. But some pears are relatively
short, and some apples are taller than expected. In this case we have an input vector of the form x = (x^(1), x^(2))^T,
where x^(1) is the circumference and x^(2) the height. The class c can take two values, A or P (standing for apples
and pears).
We have a set of training data: height and circumference measurements, along with class labels. In Figure 7.2
we plot this two-dimensional training data for the two classes. We can see that it is not possible to draw a
straight line to separate the two classes.
We now have three new, unlabelled examples which we would like to classify (represented as stars in
Figure 7.3):
• (16, 10): all the training data in the region of this point is classified as P, so we classify this point as P.
• (19, 6): looking at the training data it seems obvious to class this as A.
• (18, 7): it’s not obvious in which class this example should be classified; the feature vector gives us evidence
whether we have an apple or a pear, but does not enable us to make an unambiguous classification.
We can draw a straight line such that one side of it corresponds to one class and the other side to the other
– as in the three possible lines shown in Figure 7.2. Such a line is called a decision boundary; if the data
was three-dimensional, then the decision boundary would be defined by a plane. For one-dimensional data, a
decision boundary can simply be a point on the real line. Intuitively it seems possible to find an optimal decision
boundary, based on minimising the number of misclassifications in the training set.
46 CHAPTER 7. SUPERVISED LEARNING: CLASSIFICATION WITH NEAREST NEIGHBOURS
Height/cm
10
15 20 Circumference/cm
Figure 7.3: The training data for apples (blue circles) and pears (red triangles), together with three test points
(yellow stars).
Constructing classifiers using supervised learning To construct the classifier automatically we need:
1. training data: a set of feature vectors, each with its class label
2. a learning algorithm that uses the training data to construct the classifier
3. hyperparameters, numbers that control how the algorithm learns and predicts
This is referred to as supervised learning, since the label for a training vector acts as supervision for the
classifier when it is learning from training data.
Decision boundaries for nearest neighbour classification What do the decision boundaries look like for
nearest neighbour classification? Each training data point defines a region around it; test points within this
region will be classified to the same class as the training data point. These regions are illustrated for a simple
case in Figure 7.4, where the boundaries of regions are shown as dotted lines. Each boundary is given as the
perpendicular bisector of the line segment between the two corresponding data points. This partitioning formed
by a set of data points is sometimes called a Voronoi diagram or Voronoi tessellation.
Figure 7.4: Decision boundaries produced by 1-nearest-neighbour classification when every data point belongs to a different class.
Now we assume that each data point belongs to one of two classes, say, red or blue. To obtain the
decision boundary we combine those boundaries which are between regions of different classes, as illustrated
in Figure 7.5. The resultant boundary is referred to as being piecewise linear. Figure 7.6 shows the decision
boundary and decision regions in the case of three classes.
Application of 1-nearest neighbour to a real dataset Figure 7.7 shows 1 nearest-neighbour classification
applied to the fruit dataset. Note that we’ve standardised the variables, so that the data spreads out roughly
equally in both directions. In common with other distance-based methods, we would like the results of classification
to be independent of the units we measure the variables in. It can be seen that the decision boundary is quite
complex, with islands of apple amongst the oranges. We’ll explore in the next section if this might be a problem.
7.3 Evaluation
Classification error rate Having constructed a classifier, we would like to evaluate how well it works. One way
to quantify how well a classifier is working is the number of items that it misclassifies – i.e., the number of times
it assigns a class label ĉi different from the true class label ci . The classification error is often expressed as the
percentage of the total number of items that is misclassified, the error rate.
Error rate for one-nearest neighbour classification For one-nearest neighbour classification, the error rate
when we consider members of the training set is 0, since the closest point in the training set to a member of the
training set is itself1 .
Evaluating generalisation to unseen data This sounds very promising, until we remember that the job of the
classifier is to classify data points that we haven’t seen before. It may be that the classifier will not generalise
1 Unless we have two data points with exactly the same features and different labels.
Figure 7.5: Decision boundary and decision regions for a 1-nearest neighbour classifier for a training data set of
two classes, where training samples of one class are shown with ‘*’ in red, those of the other class are shown
with ‘◦’ in blue. The Euclidean distance is used as the distance measure in this example.
Figure 7.6: Decision boundary and decision regions for a 1-nearest neighbour classifier for three classes.
to data that haven’t seen. In order to estimate how well the classifier generalises, we can split our original data
Figure 7.7: Decision regions for one nearest neighbour classification applied to the fruit data set. The variables
have been standardised to make the scales on both axes similar. Some regions are a darker shade of blue or red.
This indicates that there are two points labelled "apple" or "orange" in the dataset with the same features.
There is one region that is purple, amongst the blue and red region. There are two data points corresponding to
this region with identical coordinates, one labelled with orange and one with apple. Colour indicates decision
region for each type of fruit: Apple (blue), Mandarin (orange), Big Orange (green), Orange (red), Lemon (purple).
set into a training set and a testing set. The training and testing sets are mutually exclusive, and a typical split
might be 70% for training and 30% for testing. We train the classifier using the training set, and then evaluate
the performance using the testing set. We are not allowed to use the test set to train the classifier – otherwise
our estimate of its performance on the test set will be too optimistic.
In summary, the training set error rate is the percentage of misclassifications that the classifier makes on the
training set after the learning algorithm has been applied. The test set error rate refers to errors made by the
classifier on the testing set.
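A minimal sketch of this workflow using scikit-learn (one of several possible tools); the data is synthetic, and the 70%/30% split is made with `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the fruit measurements.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Hold back 30% of the data for testing; the classifier never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("training error rate:", 1 - clf.score(X_train, y_train))   # 0 for 1-NN
print("testing error rate: ", 1 - clf.score(X_test, y_test))
```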
Training and testing set notation We'll use the notation x = (x^(1), x^(2), . . . , x^(D))^T to denote a D-dimensional
(input) feature vector, which has class label c. The training set is a set of n feature vectors and their class
labels; the i’th training item consists of a feature vector xi and its class label ci . The j’th element of the i’th
feature vector is written xij .
Chapter 8
K-Nearest Neighbour Classification, Setting Hyperparameters, and Metrics
• 2-nearest: tie (one apple and one pear are nearest neighbours). In this case, we could choose randomly
between the two classes. Another strategy would be to take the 1-nearest neighbour or the 3-nearest
neighbour classification.
k-nearest neighbour algorithm We can write the k-nearest neighbour algorithm precisely as follows, where
d is the distance metric (typically the Euclidean distance); a Python sketch follows the steps below:
– Compute the n distances di = d(x, xi ) between x and the features of each training example xi ,
i = 1, . . . , n.
– Sort the distances from lowest to highest and find the indices i1 , . . . ik of the k lowest values of di
– Find the classes that belong to the closest points, i.e. ci1 , . . . , cik
– Each of these represents a vote for a class. Count the votes for each class and return the one with
the largest number.
– If there is a tie, choose randomly, or look at the k + 1th neighbour to resolve the tie.
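A sketch of these steps in plain NumPy, under the assumption that ties are left to `Counter`'s default ordering rather than being broken randomly; the tiny training set (apples and pears) is made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, c_train, k=3):
    """Classify a single test point x by a majority vote of its k nearest neighbours."""
    # Euclidean distances from x to every training point.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the votes for each class among the nearest neighbours.
    votes = Counter(c_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny hypothetical training set: two features per item, labels 'A' (apple) and 'P' (pear).
X_train = np.array([[18.0, 6.0], [19.0, 6.5], [16.0, 9.5], [15.5, 10.0]])
c_train = np.array(["A", "A", "P", "P"])
print(knn_predict(np.array([18.0, 7.0]), X_train, c_train, k=3))  # prints 'A'
```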
Figure 8.1: The training data for apples (blue circles) and pears (red triangles), together with three test points
(yellow stars).
Decision boundaries produced by k-NN classification k-nearest neighbour classifiers make their decisions
on the basis of local information. Rather than trying to draw a linear decision boundary across the whole space
(as in Figure 7.2), k-nearest neighbour techniques make decisions based on a few local points. As such they can
be quite susceptible to noise, especially if k is small: a small shift in the location of a test point can result in a
different classification since a new point from the training data set becomes the nearest neighbour. As k becomes
larger, the classification decision boundary becomes smoother since several training points contribute to the
classification decision. Figure 8.2 illustrates the decision regions for various values of k for the 5-fruit example
introduced in the previous chapter (raw data in Figure 7.1 and 1-nearest-neighbour classification in Figure 7.7).
Generalisation and regularisation In Figure 8.2, for low values of k the boundary between apples (blue) and
oranges (red) is "noisy": a small shift in the height and width of the fruit can change the classification.
The trained classifier is very flexible and therefore over-sensitive to the data it has been trained on, and we say
that it is exhibiting over-fitting and under-generalisation.
As k increases, the decision boundaries get smoother, and we might expect the results to exhibit better
generalisation to unseen examples. As k increases further, there could also be the problem of over-generalisation
or under-fitting. This problem isn't seen clearly in Figure 8.2. However, if we made k very large, we would end up
classifying everything as the fruit with the largest number of examples. Another example of over-generalisation
might be the linear decision boundaries in Figure 7.2 in Supervised learning: Classification with Nearest
neighbours.
We can see that the decision regions with higher k in Figure 8.2 appear more regular. The process of changing
the behaviour of a classifier so that it produces more regular or smoother output is known as regularisation and
k is sometimes referred to as a regularisation parameter.
Over- and under-fitting (and their counterparts under- and over-generalisation) are issues for other supervised
learning methods, for example when extending multiple regression with extra features (Interaction terms and
nonlinear fits).
Figure 8.2: Decision regions, training error and testing error for various values of k applied to the 5-fruit data
shown in Figure 7.1). Colour indicates decision region for each type of fruit: Apple (blue), Mandarin (orange),
Big Orange (green), Orange (red), Lemon (purple).
In general, supervised machine learning models have hyperparameters that act as regularisation
parameters. We leave more detailed work on regularisation parameters for later courses.
Choosing k The value k is a hyperparameter: a number that we can choose to get the best performance from
the algorithm. Figure 8.3 shows the classification error rate for various values of k on the training set and the
testing set. As k increases the error on the training set initially increases rapidly, as explained in the last
section. The testing error decreases a little and then starts rising around k = 9, indicating that a somewhat
larger k helps generalisation. Both testing and training error then increase.
This graph suggests that we can look at the error on the testing set to set k. But this would break the
rule of using the test data to train the classifier, since our choosing the best hyperparameter k is part of the
training process. We have really been using the test data as validation data, that is, data used to help us
validate our choice of hyperparameter by estimating what the error would be on an independent set of data
drawn from the same population as the data available. This technique of using separate data sets for training
and testing to predict the error is referred to as cross-validation.
Thus, we need to divide our dataset into 3 parts (Figure 8.4):
• Training data (about 50%): used to train the classifier for any particular value of k.
• Validation data (about 25%): used to compare performance of the trained classifier for different values of k.
• Testing data (about 25%): used to report the performance of the trained classifier with the one value of k
that we have chosen.
The precise fractions of data are not crucial. However, it is important that the testing data is only used to report
the performance, not to choose hyperparameters. A poor test score is probably an indication that the classifier
will perform poorly on real world data. Here we have taken all of the non-testing data available, and “held out”
Figure 8.3: Classification error rate on the training set (E_train) and testing set (E_test) for various values of k.
some of the data to form the validation set, leading to the name holdout cross-validation for this method. A
more sophisticated way of undertaking validation is K -fold cross-validation (see later).
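A sketch of holdout validation for choosing k, using scikit-learn and synthetic data; the 50%/25%/25% split and the candidate values of k are illustrative, not prescribed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the fruit features and labels.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Split: 50% training, 25% validation, 25% testing.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)

# Use the validation set (never the test set) to choose k.
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
          for k in [1, 3, 5, 9, 15, 25]}
best_k = max(scores, key=scores.get)

# Report the final performance once, on the held-back test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```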
Computational efficiency of k-nearest neighbour classification k-nearest neighbour is very efficient at training
time, since training simply consists of storing the training set.1 Testing is much slower, since, in the simplest
implementation, it would involve measuring the distance between the test point and every training point, which
can make k-nearest neighbour impractical for large data sets.
Improving the efficiency of k-NN It is sometimes possible to store the training data set in a data structure
that enables the nearest neighbours to be found efficiently, without needing to compare with every training data
point (in the average case). One such data structure is the k-d tree. However, these approaches are usually
only fast in practice with fairly low-dimensional feature vectors, or if approximations are made.
8.2 Metrics
Accuracy In the chapter on Supervised learning: Classification with Nearest neighbours, we introduced the
error rate, the number of items misclassified as a fraction of the total number of items. We define classification
accuracy as one minus the error rate, i.e. one minus the number of items misclassified divided by the number of
items. When the error rate is zero, the classification accuracy is 1 or, equivalently, 100%.
Accuracy and unbalanced classes Accuracy seems to make sense as a metric – the fewer errors, the better the
classifier. However, it can appear misleadingly high when there is a large difference in the number of items in
each class, which we call unbalanced classes. For example, suppose we have a data set containing 95% apples,
3% pears and 2% oranges. When we split into training and test sets, the split will be 95%/3%/2% in both the
training and test sets. We could devise a classifier that classifies any item as an apple, regardless of its height
and width. This classifier would have an accuracy of 95%. This type of dummy classifier is called a baseline
classifier, since its performance sets a baseline against which to compare more “intelligent” classifiers.
Sensitivity and selectivity as alternative metrics in two-class problems Classification problems often have two
classes, for example someone has or has not repaid a loan, or someone does or doesn’t have an illness. In these
two-class problems, we regard one class as the “positive” outcome and the other as the “negative” outcome.
Confusingly the “positive” outcome is usually the case that we are searching for, which may often be a negative
thing – think of testing “positive” for Covid-19.
Two-class problems allow us to introduce other metrics, or sometimes pairs of metrics, that avoid the
misleading impression given by accuracy with unbalanced classes. One common pair of metrics are sensitivity
and selectivity: the sensitivity is the proportion of truly positive items that the classifier labels as positive, and
the selectivity is the proportion of truly negative items that the classifier labels as negative.
Suppose now that instead of classifying fruit, we are classifying whether someone is “positive” for Covid, on
the basis of their symptoms. We will assume that 2% of people truly have Covid. The dummy classifier described
has a high accuracy (how high?). However, it classifies all the positive cases as negative, so the sensitivity is 0%.
Conversely, all the negative cases are classified as negative, so the selectivity is 100%. This gives us a much
clearer picture of the performance of the classifier than accuracy alone. An ideal classifier would have 100%
selectivity and sensitivity.
1 In practice, responsible machine learning practitioners will try out different choices of k, and different distance measures, possibly
optimising free parameters of a distance measure. Then training requires testing different choices, and becomes expensive.
                     Predicted class
                     P       N
    Actual class  P  TP      FN
                  N  FP      TN
Figure 8.5: Confusion matrix. TP – number of true positives; FN – number of false negatives; FP – number of
false positives; and TN – number of true negatives.
Confusion matrix The fullest picture of the performance of a classifier on a two-class problem can be gained by
looking at the confusion matrix, which compares the actual class and the predicted class (Figure 8.5). The cells
of the matrix are the number of items classified as “true positives”, “false positive” etc. Sometimes we normalise
by dividing every cell by the total number of items, so that the sum of all the cells is 100%.
We can define metrics in terms of the cells of the confusion matrix. For example, we have
\[
\text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Selectivity} = \frac{TN}{TN + FP} \qquad (8.1)
\]
Other metrics can be constructed from the numbers in the cells of the confusion matrix; the appropriate metric
to use will depend on the application.
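A sketch of computing these metrics from a confusion matrix using scikit-learn; the true and predicted labels are made up. Note that for labels (0, 1) scikit-learn lays the matrix out with the negative class first, i.e. [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set results: 1 = positive (has Covid), 0 = negative.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])

# For labels (0, 1) the matrix is [[TN, FP], [FN, TP]], so ravel() gives the cells in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
selectivity = tn / (tn + fp)   # also known as specificity
print(accuracy, sensitivity, selectivity)
```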
1. Select algorithm
We’ve used k-nearest neighbours as our algorithm, but many other algorithms are available.
There are a number of issues when using this paradigm in the real world.
Data can change over time. For example typical Covid symptoms changed over the pandemic, so a classifier
that worked well at the start of the pandemic wouldn’t necessarily work well at the end of the pandemic.
Perhaps it’s been a particularly good year for apples in the year we’ve trained the fruit classifier, and they
are larger then in other years.
Selection of training, validation and test sets A related problem is that the training, validation and test sets
should be drawn from the same distribution. The training, validation and test sets should be selected
using random sampling. As discussed in the chapter on Randomness, sampling and simulation, any form of
non-random sample (for example taking the first 50% of data collected over time) could lead to the statistics
of the training and test sets being different.
Limited amount of data Machine learning algorithms generally need a lot of data to perform well. If you have 100 data
points, there will be only 50 data points to use for training, 25 for validation and 25 for testing. As
discussed in the chapter on Randomness, sampling and simulation, these samples may not be representative
of the wider population.
K -fold cross-validation We’ve seen how we can split the data we’re not using to report the final test result
into training and validation data, and that this can be used to find the hyperparameter that gives the best
performance. However, if we’ve a small amount of data, we’re only using a third of the actual data that we’re
Figure 8.6: Overview of K-fold cross-validation. 1. Split into testing data and non-testing data; split the
non-testing data into 3 folds. 2. For each hyperparameter, train and compute the validation error three times,
each time holding out a different fold. 3. Compute the mean validation error across the folds for each
hyperparameter. 4. Pick the hyperparameter with the lowest mean validation error; train on all non-testing
data and report performance on the test data.
allowed to use to test the performance of each hyperparameter. The method of K -fold cross-validation allows us
to use all the non-testing data for both training the classifier with each hyperparameter, and testing it.
Figure 8.6 gives an overview of how K -fold cross-validation works. There is a split into test data, and
non-testing data, as previously. The difference is that the non-testing data is split into a number (here 3) of
equally-sized folds, or blocks. For each value of the hyperparameter, the model is trained 3 times: first on the
data in folds 2 and 3, then on the data in folds 1 and 3, and finally on the data in folds 1 and 2. Each trained
model is tested on the data in the fold that wasn’t used for training. The mean error from all three folds is
then used as an estimate of what the test error will be in the final trained model. On the basis of this K -fold
cross-validation, a hyperparameter is chosen – probably the one giving the lowest mean error. Finally, the
classifier is trained again using the chosen hyperparameter on all the non-testing data.
In general, we don’t have to have 3 folds. We can have a number K of folds, which gives rise to the term
k-fold cross-validation. 5 and 10 are common values for K . It’s unfortunate that the letter K is used for both
cross-validation and nearest neighbours, as K means different things in the two contexts.
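A sketch of K-fold cross-validation for choosing k, using scikit-learn's `cross_val_score` on synthetic data; 3 folds are used to match the description above, and the candidate values of k are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Keep a test set aside; cross-validate only on the non-testing data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3-fold cross-validation of each candidate hyperparameter on the non-testing data.
mean_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_rest, y_rest, cv=3).mean()
               for k in [1, 3, 5, 9, 15]}
best_k = max(mean_scores, key=mean_scores.get)

# Retrain on all non-testing data with the chosen k and report the test performance once.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
print(best_k, final.score(X_test, y_test))
```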
Although the validation error gives an estimate of the expected test error, it is still necessary to compute the
error on a separate test set. The estimate produced by cross-validation has an optimism bias, because the data
in each testing fold has been used to select the model hyperparameters.
However, in one case it is OK to use all the data for training and cross-validation: if someone else is doing
the testing using a part of the data that they have retained and not shared with you – this is common in machine
learning competitions. They will get the independent estimate, and you can use the data that you have to best
train your models.
Chapter 9
Unsupervised Learning: K-means
• MacKay, D. J. C. (2003) Information Theory, Inference and Learning Algorithms pp. 284–288
• Xu, D. and Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science,
2:165. https://doi.org/10.1007/s40745-015-0040-1
Human brains are good at finding regularities in data. One way of expressing regularity is to
put a set of objects into groups that are similar to each other. For example, biologists have found
that most objects in the natural world fall into one of two categories: things that are brown and
run away, and things that are green and don’t run away. The first group they call animals, and the
second, plants. (MacKay, 2003), p. 284
Supervised learning Clustering can be contrasted with classification (Figure 9.1). As outlined in the chapter
on Supervised learning: Classification with Nearest neighbours, classification is a supervised learning process:
there is a training set in which each data item has a label.
Unsupervised learning Clustering, on the other hand, is an unsupervised learning process in which the
training set does not contain any labels, and where the goal is to learn something about the structure of the
data. The aim of a clustering algorithm is to group such a data set into clusters, based on the unlabelled data
alone. In many situations there is no ‘true’ set of clusters. For example consider the twenty data points shown
in Figure 9.2a. It is reasonable to divide this set into two clusters (Figure 9.2b), four clusters (Figure 9.2c) or
five clusters (Figure 9.2d).
(Top) Clustering – input: unlabelled data points, e.g. widths and heights of various unknown fruits; output: each point is assigned to a cluster, which may correspond to the original fruits. (Bottom) Classification – input: labelled data points, e.g. widths and heights of various fruits; output: a classifier that can predict the identity of an unlabelled data point.
Figure 9.1: Unsupervised and supervised learning, exemplified by clustering and classification applied to Iain
Murray’s oranges and lemons dataset of the widths, heights and masses of various types of fruits.
Figure 9.2: (a) Twenty unlabelled data points, and the same points grouped into (b) two, (c) four and (d) five clusters.
Reasons to cluster There are many reasons to perform clustering. Most commonly it is done to better understand
the data (data interpretation), or to efficiently encode the data set (data compression).
Data interpretation: Automatically dividing a set of data items into groups is an important way to analyse
and describe it. Automatic clustering has been used to cluster documents (such as web pages), user preference
data, and many forms of scientific observational data in fields ranging from astronomy to psychology to biology.
Data compression: Clustering may be used to compress data by representing each data item in a cluster
by a single cluster prototype, typically at the centre of the cluster. Consider D-dimensional data which has
been clustered into K clusters. Rather than representing a data item as a D-dimensional vector, we could store
just its cluster index (an integer from 1 to K ). This representation, known as vector quantisation, reduces the
required storage for a large data set at the cost of some information loss. Vector quantisation is used in image,
video and audio compression.
Hierarchical clustering There are two main approaches to hierarchical clustering. In top-down hierarchical
clustering algorithms, all the data points are initially collected in a single top-level cluster. This cluster is
then split into two (or more) sub-clusters, and each of these sub-clusters is further split. The algorithm continues
to build a tree structure in a top-down fashion, until the leaves of the tree contain individual data points. An
alternative approach is agglomerative hierarchical clustering, which acts in a bottom-up way. An agglomerative
clustering algorithm starts with each data point defining a one-element cluster. Such an algorithm operates by
repeatedly merging the two closest clusters until a single cluster is obtained.
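A sketch of agglomerative clustering with SciPy; the data is synthetic, and centroid linkage is chosen to match the description given with Figure 9.3 (other linkage methods are available).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Synthetic 2-D data standing in for (standardised) fruit measurements.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2))])

# Agglomerative (bottom-up) clustering using centroid linkage.
Z = linkage(X, method="centroid")

# Cut the tree at an inter-cluster distance threshold to obtain flat clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)

dendrogram(Z)      # plot the tree of merges and their distances
plt.show()
```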
9.3 K -means
Aim K -means clustering, a form of partitional clustering, aims to divide a set of D-dimensional data points
into K clusters. The number of clusters, K, must be specified; it is not determined by the clustering algorithm,
which will always attempt to find K clusters in the data, whether they really exist or not.
Algorithm Each cluster is defined by its cluster centre, and clustering proceeds by assigning each of the input
data points to the cluster with the closest centre, using a Euclidean distance metric. The centre of each cluster
is then re-estimated as the centroid of the points assigned to it. The process is then iterated. The algorithm,
illustrated in Figure 9.4, is:
1. Initialise the K cluster centres.
2. Assign each data point to the cluster with the closest centre.
3. Re-estimate each cluster centre as the centroid (mean) of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
To be more precise, we can express the assignment of a point i to a cluster k by defining the set of points
in cluster k as Ck. Using the notation |Ck| to indicate the number of points assigned to cluster k, we can then write the centre mk of cluster k as the mean of the points assigned to it (Equation 9.1).
Figure 9.3: Hierarchical clustering of the fruit dataset. The x-axis indicates the distance between clusters,
which in this clustering example is the distance between the centroids of the points belonging to each cluster. For example the two
Mandarins at the very bottom are about 0.2 units apart, and form a cluster. In turn the centre of this cluster
is about 0.45 units from the centre of the next closest cluster (the three other Mandarins). We can regard all
sub-trees lower than a threshold inter-cluster distance of our choosing as clusters. For example, if we chose a
threshold of 1.0, we would see 5 clusters, indicated by the green, red, cyan, magenta and yellow sub-trees.
Figure 9.4: Example of K -means algorithm applied to 14 data points, K = 3. The lines indicate the distances
from each point to the centre of the cluster to which it is assigned. Here only one point (6,6) moves cluster after
updating the means. In general, multiple points can be reassigned after each update of the centre positions.
\[
m_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i \qquad (9.1)
\]
Distance measure The algorithm requires a distance measure to be defined in the data space, and the Euclidean
distance is often used. In this case, if x and y are two points in the data space, the distance function d(x, y) is
given by:
\[
d(x, y) = \| x - y \| = \sqrt{\sum_{j=1}^{D} (x_j - y_j)^2} \qquad (9.2)
\]
Initialisation The initialisation method needs to be further specified. There are several possible ways to
initialise the cluster centres, including:
• Randomly assign data points to K clusters and compute means as initial centres
• Find the mean for the whole data set then perturb into K means
All of these work reasonably, and there is no ‘best’ way. However, as discussed below, the initialisation has an
effect on the final clustering: different initialisations lead to different cluster solutions.
Convergence The algorithm iterates until it converges. Convergence is reached when the assignment of points
to clusters does not change after an iteration. An attractive feature of K -means is that convergence is guaranteed.
However, the number of iterations required to reach convergence is not guaranteed. For large datasets it is often
sensible to specify a maximum number of iterations, especially since a good clustering solution is often reached
after a few iterations. Figure 9.4 illustrates the K -means clustering process. Figure 9.5 illustrates how different
initialisations can lead to different solutions.
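The following is a minimal NumPy sketch of the batch algorithm described above, initialising the centres as K randomly chosen data points (one of the options listed earlier); it is an illustration, not the implementation used to produce the figures.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """A minimal batch K-means: returns cluster assignments and cluster centres."""
    rng = np.random.default_rng(seed)
    # Initialise the centres as K randomly chosen data points.
    centres = X[rng.choice(len(X), size=K, replace=False)].copy()
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centre.
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break                              # converged: no assignment changed
        assignments = new_assignments
        # Update step: move each centre to the centroid of the points assigned to it.
        for k in range(K):
            if np.any(assignments == k):
                centres[k] = X[assignments == k].mean(axis=0)
    return assignments, centres

# Example: two well-separated blobs, K = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centres = kmeans(X, K=2)
print(centres)
```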
The first clustering (Figure 9.5a) has mean squared error
\[
E = \frac{(4 + 4) + (4 + 4)}{4} = 4.
\]
The second solution (Figure 9.5b) has a lower error:
\[
E = \frac{0 + (32/9 + 20/9 + 68/9)}{4} = \frac{10}{3} < 4.
\]
Figures 9.5c and 9.5d show two clusterings of the fruit data set – the first clustering is noticeably worse, with
one cluster centre in-between two distinct groups of points.
To get around the problem of multiple solutions, it is common to run the K-means algorithm from multiple
randomly-chosen starting points, and then take the solution that gives the lowest value of the mean squared error
E, which we define shortly.
(c) Within-cluster sum-squared error E = 20.33 (d) Within-cluster sum-squared error E = 11.00
Figure 9.5: (a) and (b) Two different converged clusterings for the same data set, but starting from different
initialisations. (c) and (d) Two different converged clusterings for fruit data set, but starting from different
initialisations.
Figure 9.6: Scree plot of K -means clustering applied to the fruit dataset. The elbow is around K = 5, implying
that 5 is a reasonable number of clusters.
Mean squared error function To compare two different clusterings each with K clusters we can use the mean
squared error function1 , which can be thought of as measuring the scatter of the data points relative to their
cluster centres. Thus, if we have two sets of K clusters, it makes sense to prefer the one with the smallest mean
squared error: the clustering with the lowest scatter of data points relative to their cluster centres.
Let us define the set of points in the kth cluster as Ck. Then we can write the mean squared error as
\[
E = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - m_k \|^2, \qquad (9.3)
\]
where mk is the centre of cluster k, n is the number of data points in total, and k · k denotes the Euclidean
norm (i.e. L2 -norm) of a vector.
Another way of viewing this error function is in terms of the squared deviation of the data points belonging
to a cluster from the cluster centre, summed over all centres. This may be regarded as a variance measure for
each cluster, and hence K -means is sometimes referred to as minimum variance clustering.
At the start of this chapter we said that the aim of clustering was to discover a structure of clusters such that
points in the same group are close to each other, and far from points in other clusters. The mean squared error
only addresses the first of these: it does not include a between-clusters term.
Failures of K -means Like all variance-based approaches, this criterion depends on the scale of the
data. Thus, if variables are on different scales, nonsensical clusters can arise (Figure 9.7a). The range of the
y-coordinates is around 20, whereas the range on the x-axis is about 8. Thus, a point at the top of the left-hand
cluster is closer to a point at the top of the right-hand cluster than it is to a point at the bottom of the left-hand
cluster. Standardising the variables is often a sensible step to take to solve this problem – here it would reduce the
distances in the y-direction relative to the x-direction.
Another failure mode of K -means is that large clouds can pull small clusters off-centre (Figure 9.7b). The
reason for this is that we’ve not modelled the variance of each cluster – we’ve implicitly assumed they are the
same. The solution is to allow the variance of the clusters to vary. Gaussian mixture models do this, but are
beyond the scope of this course.
1 Commonly in K -means, the sum of the squared errors, also known as the inertia, is used instead of the mean squared error. It is simply
the mean squared error, but without the 1/n factor.
Figure 9.7: (a) Differences in scale on the axes cause nonsensical clusters. (b) A smaller cloud (red points) attracts points from a larger cloud (blue points).
How to decide on K We can use a scree plot (Figure 9.6), plotting the mean squared error E against the
number of clusters K . We are looking for the elbow, where the error stops falling dramatically. Scree plots are
also used to inform decisions about the number of dimensions to retain in dimensionality reduction methods
(Dealing with high dimensions – PCA).
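A sketch of producing a scree plot with scikit-learn's KMeans, whose `inertia_` attribute is the sum of squared distances to the closest centre (n times E, as noted in the footnote above); the data is synthetic, with three underlying groups.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic data with 3 underlying groups.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc, 0.4, (30, 2)) for loc in (0, 3, 6)])

# Inertia (sum of squared distances to the closest centre) for a range of K.
Ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in Ks]

plt.plot(list(Ks), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("inertia (n times E)")
plt.show()   # look for the elbow, expected here around K = 3
```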
Other variants of K -means We have discussed the batch version of K -means. The online (sample-by-sample)
version in which a point is assigned to a cluster and the cluster centre is immediately updated is less vulnerable
to local minima.
There are many variants of K -means clustering that have been designed to improve the efficiency and to find
lower error solutions.
The curse of dimensionality As the number of dimensions D increases, the distances between points increase.
This has the effect of making the intracluster distances on a similar scale to the intercluster distances, and so
clustering becomes less reliable. This is one manifestation of the curse of dimensionality. To counteract the
curse, a dimensionality reduction method such as Principal Components Analysis (PCA) can be applied to the
data before it is clustered (see chapter Dealing with high dimensions – PCA).
Linear models
Chapter 10
Linear Regression
Regression is related to the correlation coefficient, but is subtly different. In regression, we are still dealing with
two numeric variables x and y, but we are trying to predict what value of y will be found at a particular x, not
just give a measure of how they are correlated. Often we have good reason to believe that there is a causal
relation between the variables x and y such that x causes y or y depends on x.
The variable y can be referred to as the response variable (or “response” for short) and the variable x can
be referred to as the predictor variable (or predictor for short). We will use this terminology1 but it is worth
noting that the response variable is also referred to as a dependent variable, target variable, the outcome
or endogenous variable and the predictor variables are also referred to as features, explanatory variables,
independent variables, covariates, regressors, or exogenous variables. The terminology used tends to vary with
the discipline. In particular, in Machine Learning, the pair "target" and "features" (or "covariates") is used; we
have used “feature” when introducing Machine Learning in Supervised learning: Classification with Nearest
neighbours.
A standard dataset used to illustrate regression involves another mammal, Homo sapiens. The data was
collected by the inventor of regression, Francis Galton2 , who surveyed the heights of grown-up children and
their parents. He expected that parents who were taller than average would have children who were taller than
average. Galton’s results are shown in Figure 10.1. The data is simplified. Galton recorded four variables: the
height of the mother, the height of the father, the gender of the child and the height of the child. We have
reduced the number of variables to three by taking the mean height of each child’s parents to give a variable we
call the midparental height.
Suppose we want to predict the height a daughter will grow to given that we know the heights of her parents,
and thus her midparental height. We have a reasonable amount of data, so we could just take the mean height
of all daughters with a similar midparental height. To be more specific, if we knew the midparental height were
x, we could predict the future height of the daughter to be the mean height of all daughters with midparental
heights in the range [x − 0.5, x + 0.5].
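A sketch of this binned-mean prediction in pandas; the data frame and column names are stand-ins for the Galton data, which is not reproduced here, and the synthetic heights are illustrative only.

```python
import numpy as np
import pandas as pd

def predict_height(x, daughters):
    """Predict a daughter's height as the mean height of daughters whose
    midparental height lies within 0.5 inches of x."""
    nearby = daughters[(daughters["midparental"] >= x - 0.5) &
                       (daughters["midparental"] <= x + 0.5)]
    return nearby["height"].mean()     # NaN if there are no observations in the window

# Synthetic stand-in for the Galton data (heights in inches).
rng = np.random.default_rng(7)
mid = rng.normal(69, 2, 500)
daughters = pd.DataFrame({"midparental": mid,
                          "height": 0.5 * mid + 30 + rng.normal(0, 2, 500)})
print(predict_height(68.0, daughters))
```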
Figure 10.2 shows the results of this method of prediction. It’s generally true that the taller the parents,
the taller the daughters will be, though the relationship is not quite monotonic – it fluctuates a bit at lower
midparental heights. It seems reasonable to suppose that, if there were more data, the relationship would be
linear. In the next section we’ll see what happens if we assume the relationship is linear.
Figure 10.1: Height of daughters and sons plotted against the mean height of their two parents.
Figure 10.2: Heights of daughters plotted against their midparental heights (blue dots) and predictions (orange
dots). The method described in the text doesn’t work where there are gaps in the x values of greater than 1 inch.
Figure 10.3: Linear regression of daughter’s height on their midparental height. The linear regression line is
shown in orange and the location of the means of the daughter’s height and midparental height is indicated by
the orange dot.
Figure 10.4: Left: LEGO Expert Builder set 952. (David Sterratt, CC-BY-4.0) Right: Massey Ferguson 135
Tractor (Lyle Buist, reproduced with permission).
In linear regression, we model the relationship between the predictor x and the response y with the equation of a straight line:
\[
y = \beta_0 + \beta_1 x \qquad (10.1)
\]
The variables β0 and β1 are called parameters or regression coefficients and are the intercept and slope of the
line respectively. Think of them as dials that we can turn to produce any straight line we want to. Our aim is to
tune the values of β0 and β1 so that the line lies close to the data, and will therefore be a good predictor. We
refer to Equation 10.1 as the linear regression model.
Models The word “model” recurs throughout data science and statistics, and indeed, science in general. A
general definition is that a model is an abstraction of reality that captures aspects that are important for a given
task and omits the rest.
For example, a LEGO® model of a tractor looks like a tractor but doesn’t include all the details of a real
tractor (Figure 10.4). Similarly, the linear regression model looks like the data, but doesn’t contain all the details.
Figure 10.3 shows the linear regression model of the data that we are working towards in the next few pages.
Note that linear regression is called a parametric model, since it contains explicit parameters (the regression
coefficients). In contrast, the method shown in Figure 10.2 is a non-parametric method, since the prediction does
not depend explicitly on any parameters.
The LEGO® tractor is an example of a physical or mechanical model. If we’re using equations to connect
how gravity and Newton’s laws lead to the motion of a pendulum, we’d be using a mathematical model; if
we were simulating a city in a computer simulation game, we'd be using a computational model. Biologists refer to animal models, i.e. they investigate biological processes such as diseases in an animal, with the aim of understanding these biological processes in humans.
The linear regression model is an example of a statistical model. We use statistical models to investigate
relationships in data and to make predictions. Statistical models can help us to explain the data, but
they do not provide a mechanistic explanation of the data. For example, fitting a sine wave to the motion
of a pendulum would be a statistical model; in a mathematical model of the pendulum the sine wave
would emerge from solving equations describing gravity, Newton’s laws, and the weight and length of the
1 Up until 2022–23, we used the terms “dependent variable” and “independent variable” in FDS. However, the term “independent variable”
is confusing, since in multiple regression, the predictor variables are often not statistically independent.
2 Francis Galton (1822–1911) was a remarkable polymath – and a eugenicist. Adam Rutherford’s book Control: the Dark History and
Troubling Present of Eugenics (Rutherford, 2022) gives a detailed account of Galton’s views and those of his academic offspring and fellow
statisticians and eugenicists Karl Pearson and Ronald Fisher.
pendulum. In these lecture notes the word “model” will generally mean “statistical model”, though when
we are referring specifically to probabilities (e.g. in the chapter on Randomness, sampling and simulation),
we refer to "probabilistic models".
The principle of least squares To tune the parameters, we need to have a measure of how close the line is to
the data. One way of defining “close” is how the mathematicians Gauss3 and Legendre did around 1800: the
sum of the squared deviations between the predicted and actual values of y for every value of x. The closeness
to the line is a function of β0 and β1 :
f(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2    (10.2)
We call this function an error function or loss function. We are searching for the values of β0 and β1 that minimise the error function. We denote these values β̂0 and β̂1, pronounced "beta 0 hat" and "beta 1 hat" respectively. In statistics, the hat denotes the best estimate of a quantity. Minimising this error function is an application of the principle of least squares, which states that the best fit is obtained by minimising the sum of the squared deviations between the predicted and actual values of y.
Applying the principle of least squares We could try to find the values of β̂0 and β̂1 that minimise the error
function by trial and error, or by using a numerical minimisation routine. However, in the case of the linear
regression model, we can find β̂0 and β̂1 analytically.4
We can find values of β̂0 and β̂1 by setting the partial derivatives5 of the error function f with respect to β0
and β1 equal to 0:
\frac{\partial f}{\partial \beta_0} = \sum_{i=1}^{n} (-2)\,(y_i - \beta_0 - \beta_1 x_i) = 0
\frac{\partial f}{\partial \beta_1} = \sum_{i=1}^{n} (-2 x_i)\,(y_i - \beta_0 - \beta_1 x_i) = 0    (10.3)
We can rearrange these formulae so that it is obvious that they are a pair of simultaneous equations in β0 and
β1 :
n\beta_0 + \Bigl(\sum x_i\Bigr)\beta_1 = \sum y_i
\Bigl(\sum x_i\Bigr)\beta_0 + \Bigl(\sum x_i^2\Bigr)\beta_1 = \sum x_i y_i    (10.4)
These equations are called the normal equations. Notice that we've used the abbreviated notation for summation here.
The sums over x_i and y_i are related to their sample means: \sum x_i = n\bar{x} and \sum y_i = n\bar{y}. These relationships allow us to simplify the normal equations:
n\beta_0 + n\bar{x}\beta_1 = n\bar{y}
n\bar{x}\beta_0 + \Bigl(\sum x_i^2\Bigr)\beta_1 = \sum x_i y_i    (10.5)
Substituting the first of these equations into the second and rearranging gives the gradient:

\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (10.6)

The second part of this equation follows from the identity \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 (see Equation 3.10 in the Descriptive statistics topic for proof of the identity). It follows from the first of Equation 10.5 that:

\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}    (10.7)
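To make the formulas concrete, here is a minimal NumPy sketch of Equations 10.6 and 10.7. The data are synthetic stand-ins (the real Galton data are not reproduced here) and the variable names are placeholders.

import numpy as np

# Synthetic stand-in for paired (midparental height, daughter height) data
rng = np.random.default_rng(0)
x = rng.normal(68, 2, size=100)
y = 20 + 0.65 * x + rng.normal(0, 2, size=100)

# Equation 10.6: the slope from sums of squared deviations
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Equation 10.7: the intercept from the sample means
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check against NumPy's built-in least-squares polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)
print(beta1_hat, beta0_hat)
print(slope, intercept)

The two pairs of numbers should agree to within floating-point error.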
Properties of the linear regression line Don’t worry if you don’t follow the derivation above – let’s stand back
and look at some properties of the least squares linear regression line.
1. The line passes through the point (\bar{x}, \bar{y}). We can see this by substituting the value of \hat\beta_0 into Equation 10.1, which leads to:

y = \hat\beta_0 + \hat\beta_1 x = \bar{y} - \hat\beta_1\bar{x} + \hat\beta_1 x = \bar{y} + \hat\beta_1 (x - \bar{x})    (10.8)

When x = \bar{x} it follows that y = \bar{y}. In Figure 10.3 you can indeed see the regression line passing through the mean of both variables.
2. Equation 10.6 for the gradient of the line β̂1 looks similar to the equation for the correlation coefficient r
(Equation 6.2 in the Data collection and statistical relationships topic) but is different in one respect. The
numerator is the same, but the denominator contains only the sum of squared deviations of x rather than
the product of the square roots of the sum of squared deviations of x and y. We can in fact relate β̂1 and r
via the standard deviations of x and y:
\hat\beta_1 = r\,\frac{s_y}{s_x}    (10.9)
3. If we plug this expression for β̂1 into the fitted model (Equation 10.8), we get:
y = \bar{y} + r\,\frac{s_y}{s_x}(x - \bar{x}) = \bar{y} + s_y\, r\,\frac{x - \bar{x}}{s_x}    (10.10)

which rearranges to

\frac{y - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}    (10.11)
Here the fractional terms on the left- and right-hand sides are the standardised versions of the variables x and y, which we also refer to as their z-scores (see the section on Correlation). This shows that we can think of making predictions from regression in four steps: (1) compute the z-score for x (i.e. we standardise x), (2) multiply the z-score by the correlation coefficient, (3) multiply in the scale of y (via s_y), and (4) add in the mean of y. This is a simple example of moving into a "normalised space" (the z-score), doing the prediction there, and then going to the target space by multiplying in the y-scale (i.e. the standard deviation) and adding in the y-location (i.e. the mean).
This view also shows that “learning”, i.e. estimating r, happens in the normalised space, since we correlate
the z-score of x with the z-score of y.
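The four-step view translates directly into code. This is a sketch rather than the course's implementation; x and y are placeholder arrays of paired observations.

import numpy as np

def predict_via_zscores(x, y, x_new):
    """Predict y at x_new: standardise, multiply by r, rescale by s_y, add the mean of y."""
    r = np.corrcoef(x, y)[0, 1]                  # estimate r in the normalised space
    z = (x_new - x.mean()) / x.std(ddof=1)       # (1) z-score of x
    z_pred = r * z                               # (2) multiply by the correlation
    return z_pred * y.std(ddof=1) + y.mean()     # (3) multiply in s_y, (4) add in the mean of y

# This gives the same answer as the fitted line, since beta1_hat = r * s_y / s_x
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 0.5 * x + rng.normal(0, 1, 200)
print(predict_via_zscores(x, y, x_new=1.5))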
Insights from regression with standardised variables We can gain some insights into regression by standardising
x and y (Figure 10.5):
1. Since the mean of standardised variables is zero, the regression line passes through (0, 0).
2. The gradient of the regression line of y on x is β̂1 = r, from Equation 10.9. The intercept of the regression
line β̂0 = 0, from Equation 10.7. As the relationship between the variables gets weaker (r gets smaller),
the gradient of the regression line decreases.
3. The predicted standardised y is always closer to the mean than the standardised x. In other words, parents
who are much taller than average are likely to have children who are taller than average, but not by as
much as the parents: the height of these children is likely to be closer to the mean height of children than
the tall parents’ height was to the mean parental height. The same is true of very short parents and their
children. Galton characterised this observation as “regression to mediocrity”, or regression to the mean.
4. If we try to predict x from y, i.e. we flip the variables, we find the same gradient β̂1 . However, by plotting
on the same set of axes, we see that when −1 < r < 1 the regression lines of y on x (orange) and x on y
(green) are different. If there is perfect correlation (r = 1 or r = −1), the two lines overlap.
Figure 10.5: Linear regressions of: (i) standardised daughter height on standardised midparental height (orange line) and (ii) standardised midparental height on daughter height (green). For both regression lines β̂1 = r = 0.51. Note that children of parents whose height is 3 standard deviations higher than the mean are predicted to be less than 2 standard deviations above the mean – this is what the famous phrase "regression to the mean" is referring to.
Figure 10.6: Data points and regression line (left) and residual plot (right) for regression of daughter's height on midparental height. The residuals for the same three data points are shown in both plots.
The differences yi − ŷi between each pair of predicted and actual values of the response variable are
called residuals. They can be visualised as the vertical deviations of each data point from the regression line
(Figure 10.6, left). Plotting the residuals on their own (Figure 10.6, right) can indicate whether the linear
regression model is an appropriate one for the dataset in question.
It is worth noting that residuals from a linear regression always have:
1. Zero mean
2. Zero correlation with predictor variable x
3. Zero correlation with the predicted values ŷ
These are all true no matter what the data looks like, just like the mean of deviations from the mean is zero.
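These properties are easy to verify numerically. A sketch with synthetic data (the names and numbers are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

print(np.isclose(residuals.mean(), 0))                     # 1. zero mean
print(np.isclose(np.corrcoef(residuals, x)[0, 1], 0))      # 2. zero correlation with x
print(np.isclose(np.corrcoef(residuals, y_hat)[0, 1], 0))  # 3. zero correlation with y_hat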
Nonlinearity The linear regression model is appropriate when the underlying data is linear, but deviations from
linearity might not be very apparent when the regression line is plotted with the data. The residual plot makes
this more obvious. An example which demonstrates this very clearly is a plot of world population versus year, for
the years 1940–2000 (Figure 10.7). The regression line fit looks alright, but when we look at the residuals, we
see more clearly that there seems to be something systematically nonlinear going on: the estimate is too low at
the start and end of the sequence and the residuals fall and rise again smoothly.
Transforming the data Just because it looks as though data is nonlinear doesn’t mean we have to give up on
linear regression. We might suppose that the world population grows exponentially (if we looked at a longer
time series we would definitely get that idea). To turn an exponential curve back into a linear one, we can
transform the data by taking log to the base 10 of the population (Figure 10.8). The residuals now appear to
Figure 10.7: World population 1940–2000 (Klein Goldewijk et al., 2017). Source: HYDE 3.2 database https://dataportaal.pbl.nl/downloads/HYDE
Figure 10.8: Log world population 1940–2000.
Figure 10.9: A synthetic dataset that exhibits heteroscedasticity.
have a less systematic relationship with the predictor variable, and if we turn the residuals of the log population
back into raw numbers of people, we find the largest of these residuals is of the order of 70 million, rather than
300 million for the linear fit (Figure 10.7).
We can also transform the predictor variable or both the predictor and response variables. Whether it is
appropriate to transform depends on our understanding of the data and the potential underlying processes.
For example, we might expect that the weight of a squirrel is proportional to the cube of its length (assuming
that its height and width are proportional to length). This would suggest regressing the cube root of the weight
on the length.
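As a sketch of this workflow, the code below fits a line to synthetic exponential-growth data on the raw scale and on the log10 scale, then compares the residuals in the original units. The data are invented for illustration; the HYDE population figures themselves are not reproduced here.

import numpy as np

rng = np.random.default_rng(3)
year = np.arange(1940, 2001)
# Synthetic population in millions, growing roughly exponentially with a little noise
pop = 2300 * np.exp(0.017 * (year - 1940)) * np.exp(rng.normal(0, 0.005, year.size))

def fit_line(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

# Fit on the raw scale and on the log10 scale
b0_raw, b1_raw = fit_line(year, pop)
b0_log, b1_log = fit_line(year, np.log10(pop))

resid_raw = pop - (b0_raw + b1_raw * year)
resid_log = pop - 10 ** (b0_log + b1_log * year)   # residuals back-transformed to people
print(np.abs(resid_raw).max(), np.abs(resid_log).max())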
Heteroscedasticity A quick look at the variance of the residuals (Figure 10.6) suggests that the variance of
residuals doesn’t change as a function of the predictor variable. But suppose we had some data that looks like
Figure 10.9. The variance here clearly increases as the predictor variable increases, and we say that the residuals
exhibit heteroscedasticity. The word heteroscedasticity comes from the ancient Greek “hetero” (different) and
“scedastic” (spread) – in other words the variance of the residuals changes as we go along the x-axis.
In the chapter on Linear regression and inference, we’ll look at linear regression from a probabilistic
perspective, and see that datasets exhibiting heteroscedasticity violate the assumptions that we’re using in least-
squares regression, namely that the variance of the residuals is independent of x, which we call homoscedasticity.
(The Greek word homos means “same”.)
(Root) Mean Square Error We have already seen the sum of the squared deviations; this is what we minimised
with respect to the parameters in order to fit the regression line. However, this quantity scales with the number
of points, and we would prefer something that gives an indication of the typical size of a residual. The mean
square error (often written MSE) is defined by:
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (10.13)
and the root mean square error (RMSE) is just the square root of this. The RMSE gives an indication of how
far data points deviate from predictions in absolute terms.
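In code, both metrics are one-liners once predictions are available; y and y_hat below are placeholder arrays of actual and predicted values.

import numpy as np

def mse(y, y_hat):
    """Mean square error (Equation 10.13)."""
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    """Root mean square error, in the same units as y."""
    return np.sqrt(mse(y, y_hat))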
Coefficient of determination Another way of looking at the performance of the regression is to consider how
much of the variance in y we can explain with x. If we have a perfectly linear relationship (correlation coefficient
r = 1 or −1), then if we know x, we can predict y precisely. Thus, we have explained all the variance s2y of y.
When the relationship has a correlation of magnitude lower than 1, if we know x, we can predict y better than if
we didn’t know x, but we can also expect that the actual value of y will be some way from our prediction, and
we say that there is unexplained variance.
The coefficient of determination metric quantifies how much variance is explained. It is defined in terms of
two quantities:
1. The total sum of squares (SST), which we define as the sum of squared deviations from the mean of y:
\mathrm{SST} = S_{yy} = \sum_i (y_i - \bar{y})^2 = (n-1)\,s_y^2    (10.14)
Chapter 11
Multiple regression
Dealing with categorical variables The predictor variables may also be categorical. In the example of predicting
a child’s height from the heights of their parents, introduced in the chapter on Linear Regression, there were
three variables in the data: midparental height, height of child and gender of child. Although gender is a
categorical variable, we can convert it to a numeric variable by encoding “Daughter” as 0 and “Son” as 1. As
described in the section on Tabular data and variables, this new numeric variable is called a dummy variable or
indicator variable, and categorical variables with more than two values (for example colours) can also be encoded
using multiple indicator variables. We can treat indicator variables mathematically in the same way that we
would treat variables that are naturally numeric.
The multiple regression model To predict the child’s height given their midparental height and their gender,
we could try to use a non-parametric method, as we did when thinking about regression as prediction. However,
we choose here to focus on extending the linear regression model introduced in the last chapter, i.e. a parametric
model, in which the parameters were the coefficients β0 and β1 . Suppose we have two predictor variables, x (1)
and x (2) and one response variable y. (We're using this rather cumbersome notation for the predictor variables
so that we don’t get confused with the notation for instances of variables.) The multiple regression model is then
expressed:
y = β0 + β1 x (1) + β2 x (2) (11.1)
Geometrically, this is a 2D plane in 3D space, with β1 being the gradient of a cut through the plane with
constant x (2) and β2 being the gradient of a cut through the plane with constant x (1) .
Principle of fitting the multiple regression model The principle of fitting the multiple regression model is
exactly the same as the principle of fitting the linear regression model, just with more variables. We use the
least squares principle, but this time we have to adjust three coefficients, β0 , β1 and β2 , to manoeuvre the plane
around so that we minimise the sum of the squared distances between the predicted and actual values of y over
all the data points.
In maths, we’ll modify the function f (as defined in the chapter on Linear Regression) that we’re minimising
to be:
f(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})\bigr)^2    (11.2)
Table 11.1: MSE, RMSE and coefficient of determination for various simple and multiple regression models
applied to the heights data.
Note that x_{ij} means the ith instance of the jth variable. We could have used the more cumbersome notation x_i^{(j)}, but x_{ij} is simpler, and also makes more sense when we come to minimise the function analytically.
As with linear regression, once we have gone through the maths, we denote the values of the coefficients that
minimise f to be β̂0 , β̂1 and β̂2 , and we can compute the predicted value of the response variable for any values
of predictor variables as:
ŷ = β̂0 + β̂1 x (1) + β̂2 x (2) (11.3)
For now, we will skip the derivation of how to find the regression coefficients – it is sufficient to know that it
can be done using an extension to the derivation shown in the Linear Regression slides. If you’re interested, we
give the derivation in the chapter on Mathematics of multiple regression.
After fitting the regression function (Equation 11.1) we find the coefficients:
• β̂0 = 16.41 inches: the intercept – nominally the height of a child born to parents with zero height!
• β̂1 = 0.69: for every extra inch of midparental height, we expect the child to be 0.687 inches taller.
• β̂2 = 5.21 inches: we expect sons to be 5.21 inches taller than daughters.
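A sketch of how a fit like this might be run with the statsmodels formula interface. The DataFrame heights and its column names are illustrative assumptions (the Galton data are not reproduced here), so the fitted numbers will not match those above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame standing in for the heights data
rng = np.random.default_rng(4)
n = 400
heights = pd.DataFrame({
    "midparent": rng.normal(69, 1.8, n),
    "gender": rng.choice(["Daughter", "Son"], n),
})
heights["height"] = (16 + 0.7 * heights["midparent"]
                     + 5.2 * (heights["gender"] == "Son")
                     + rng.normal(0, 2.2, n))

# Encode "Daughter" as 0 and "Son" as 1, then fit Equation 11.1
heights["son"] = (heights["gender"] == "Son").astype(int)
fit = smf.ols("height ~ midparent + son", data=heights).fit()
print(fit.params)     # intercept, midparent coefficient, son coefficient
print(fit.rsquared)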
(Root) mean squared error Since the MSE and RMSE (see the chapter on Linear Regression) just depend on
yi and ŷi , we can compute them using the same formulae. We find the MSE and RMSE in Table 11.1, in which
we’ve also included the MSE and RMSE from:
• the regression of daughter’s height on midparental height shown in the Linear Regression chapter
• the regression of son’s height on midparental height (equivalent to the regression of the daughter’s height)
• the regression of child’s height on midparental height, i.e. as if we did not know the gender of the child.
The RMSE and MSE for the multiple model are mid-way between the values for the two single regression
models where we know the gender. This is consistent with the picture of what looks like two regression lines
(Figure 11.1). However, if we don’t know the gender (imagine we make the prediction before the child is born and
without the benefit of a prenatal scan), then the MSE and RMSE are much bigger. This is because the regression
line goes midway between the lines that we find for the single-sex regressions or the multiple regressions, so
there is more spread around the line.
Figure 11.1: Multiple and single regression models of the height data. Child on midparent & gender: Multiple regression of child's height on midparent height and child's gender. Child on midparent: Single linear regression of child's height on midparent's height. Gender is not a variable, hence the black and grey colour coding. Daughter on midparent and Son on midparent: Single linear regression of daughter's height on daughter's midparent height (as in last chapter) or son's height on son's midparent height.
Interpretation of coefficient of determination Since the coefficient of determination (see the Linear Regression
chapter) just depends on yi , ŷi and y, we can compute it using the same formula, and here we find that
R 2 = 0.633. This is a lot more than the R 2 = 0.263 we found for regressing daughter’s height on midparental
height. Does this indicate that the multiple regression model is a better fit? That doesn’t seem to make sense,
as we’ve just seen the MSE and RMSE of the multiple regression model is about the same as for the single-sex
regression models.
The reason is that in the multiple regression, the mean height used in the total sum of squares is the mean of all children, making the total sum of squares (SST) term in the coefficient of determination bigger. As we have more-or-less the same sum of squared errors (SSE) after fitting, there's a much bigger improvement SST − SSE, and therefore a higher value for R².
This isn’t the whole picture about the coefficient of determination – we’ll return to it later.
Lurking variables Suppose that in the chapter on Linear Regression, we had ignored the gender, and tried to
predict the height of sons and daughters just based on their midparental height. In this case we’d have had much
higher MSE and RMSE (see table above) and the coefficient of determination would have been 0.10, since there
is more variance measured around the mean height of daughters and sons. If we had not considered gender in
the simple linear regression, gender would have been a lurking variable. In the simple linear regression, we
controlled for the lurking variable of gender by only considering the midparental heights of daughters.
Using multiple regression to control for lurking variables By including gender in the multiple regression
analysis, we have also controlled for it, and quantified the size of the effect of gender – a double win. However,
one side-effect of this model is that the gradients of both regression lines have to be the same, and multiple
regression will find a value of β̂1 that is not quite optimal for either sons or daughters.
In terms of accurate prediction, we could do better by having two linear regression models: one for the
daughters and one for the sons.
Interaction terms We can get different gradients for daughters and sons within a single model by adding an interaction term to the multiple regression model:

y = β0 + β1 x (1) + β2 x (2) + β3 x (1) x (2)    (11.4)

In the fourth term, we've effectively introduced a new variable x (3) = x (1) x (2) , and we can just feed x (3) into the machinery for computing the coefficients. This new variable allows us to have different gradients for the daughters and sons. It will be zero for daughters (since x (2) = 0) but it will be β3 x (1) for sons (since x (2) = 1), so the effective height gradient for sons will be β1 + β3 rather than just β1 for daughters.
It turns out that the fits we get are exactly the same as the fit we obtain from two separate regression
analyses. Why bother then with this interaction term? By using interaction terms, we can avoid having to
create separate datasets for each analysis. Also, had x (2) been a continuous variable, we couldn’t create a linear
regression model for each of its values.
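With the formula interface, the interaction term can be constructed by hand or requested directly. This sketch continues the hypothetical heights DataFrame built in the earlier sketch (an assumption, not the course dataset).

import statsmodels.formula.api as smf

# Explicitly construct x(3) = x(1) * x(2) ...
heights["midparent_son"] = heights["midparent"] * heights["son"]
fit_explicit = smf.ols("height ~ midparent + son + midparent_son", data=heights).fit()

# ... or equivalently let the formula syntax expand the interaction
fit_interaction = smf.ols("height ~ midparent * son", data=heights).fit()

# Effective gradient for daughters is beta1_hat; for sons it is beta1_hat + beta3_hat
params = fit_interaction.params
print(params["midparent"], params["midparent"] + params["midparent:son"])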
Nonlinear fits The same idea allows us to use linear regression to fit nonlinear curves. Let’s go back to our
simple linear regression y = β0 + β1 x, and suppose that the residual plot shows a quadratic pattern. We can
create a new variable x 2 and treat it as a variable in a multiple regression:
y = β0 + β1 x + β2 x²    (11.5)
Adjusted R² We can use the Python statsmodels package to run a multiple regression of Grade on the four predictor variables shown in Figure 11.2. When we do this, we can retrieve a lot of information about the fit (Figure 11.3), but we will just focus on two measures:
• R-squared = 0.289: This is the coefficient of determination R 2 , as defined in the Linear Regression
chapter.
• Adj. R-squared = 0.251. This is the adjusted coefficient of determination, as will be defined below.
If the number of variables is k and the number of instances is n, the adjusted coefficient of determination is given by1:

R_a^2 = 1 - \frac{n-1}{n-(k+1)}\,\frac{\mathrm{SSE}}{\mathrm{SST}}    (11.6)

This is very similar to the coefficient of determination, but is, by definition, lower than the coefficient of determination – the term (n-1)/(n-(k+1)) is bound to be greater than 1, which means that R_a^2 < 1 - \mathrm{SSE}/\mathrm{SST} = R^2.
The reason for adjusting the coefficient of determination is that as we add more and more variables, it
becomes easier to fit the data, and therefore the goodness of fit may increase just because of the increase in
variables. By adjusting the coefficient of determination, we counteract this tendency.
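The adjustment is simple to compute from a model's predictions; here y and y_hat are placeholder arrays and k is the number of predictor variables.

import numpy as np

def adjusted_r_squared(y, y_hat, k):
    """Adjusted coefficient of determination (Equation 11.6)."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)           # sum of squared errors
    sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1 - (n - 1) / (n - (k + 1)) * sse / sst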
Meaning of the coefficients Looking at the second table of output in Figure 11.3 we see that:
• Intercept β̂0 = 36.1215
• Algebra β̂1 = 0.9610
• ACTM β̂2 = 0.2718
• ACTNS β̂3 = 0.2161
• HSRANK β̂4 = 0.1353
The interpretation is that an increase of 1 point in the Algebra test predicts an increase of 0.961 points in the
final grade, whereas an increase of 1 in the HSRANK predicts only an increase of 0.135 in the final grade.
Correlated predictor variables The scatter plots (Figure 11.2) show considerable correlation between the
predictor variables. We may imagine that if we did a single linear regression with the Algebra scores as the
predictor variable, we could explain a considerable amount of the variance in the Grade. We can check this by
re-running the model with just the Algebra as a predictor, in which case we find Ra2 = 0.231, not much less than
the Ra2 for the model with 4 predictor variables. Those extra variables don’t seem to have explained much more.
We can also investigate what would happen if we tried to fit using only one of the other variables. When
we regress Grade on ACTM, we find Ra2 = 0.124, which is about half the size of Ra2 for regressing on Algebra.
Note that the Ra2 = 0.247 when we regress on both Algebra and ACTM is less than the sum of the two adjusted
coefficients of determination (0.231 + 0.124 = 0.355). The reason for this is the correlation between Algebra
and ACTM: knowing Algebra tells us a lot about ACTM, so ACTM doesn’t add very much new information. The
lesson to learn here is that we need to think carefully about which variables to include as predictor variables
1 It would be nice to use the letter D to denote the number of dimensions of the predictor variables, but this is not the convention in the literature on multiple regression. Normally it's either k (as used here and by Devore and Berk (2012)) or p.
Figure 11.2: Paired scatter plot showing grades of 80 students on a calculus course (Grade) and their performance on three earlier tests (Algebra, ACTM and ACTNS) and their high school rank (HSRANK). Data from Edge and Friedberg (1984), downloaded from website accompanying Devore and Berk (2012), where it is Example 12.25 https://extras.springer.com/2012/978-1-4614-0390-6.zip.
Figure 11.3: Output from the statsmodels.formula.api.ols routine applied to fit a multiple regression of Grade on Algebra, ACTM, ACTNS and HSRANK (Figure 11.2). At this point, we will focus on R-squared, Adj. R-squared and the coef column.
in a multiple regression analysis. It may make sense to take out a variable that adds little to the adjusted
coefficient of determination. In fact, some of the output in Figure 11.3 that we haven’t discussed gives us clues
about which variables to drop, but we will discuss that later in the course.
Collinear variables The extreme case of correlation is when a pair of predictor variables are perfectly correlated,
for example: x (1) = cx (2) , with c being a constant. Here the correlation coefficient between x (1) and x (2) is 1 or
−1. In this case a linear regression function in a stats package like statsmodels in Python will complain about
a singular matrix. The full mathematical explanation of why this problem arises is in the section on Interpreting
the equation for the coefficients. Basically, the problem is that there are an infinite number of solutions for the values of β̂1 and β̂2 .
Collinear variables can occur surprisingly frequently, for example, there could be two columns with a height
in centimetres and the same height in inches. To fix the problem of collinear variables, we should take out one of
the correlated variables – it is adding no information anyway.
Highly correlated variables When the magnitude of the correlation between two predictor variables is almost
one, there can still be numerical stability issues that cause numerical routines in stats packages to fail. Also,
small differences in the correlation of x (1) and x (2) with y can lead to very different estimates of the coefficients.
The interpretation of the coefficients therefore needs particular care, or one of the highly correlated predictor
variables should be removed.
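One simple diagnostic before fitting is to inspect the correlation matrix of the predictors. A sketch, where predictors is a placeholder pandas DataFrame whose columns are the predictor variables and the 0.9 threshold is an arbitrary choice:

import pandas as pd

def report_highly_correlated(predictors: pd.DataFrame, threshold: float = 0.9):
    """Print pairs of predictor columns whose correlation magnitude exceeds the threshold."""
    corr = predictors.corr()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                print(f"{cols[i]} and {cols[j]}: r = {corr.iloc[i, j]:.3f}")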
Chapter 12
Mathematics of multiple regression

We could work generally with k variables, but we will stick with 2 variables for now, as we feel it gives a better intuition about what's going on. To find values of β̂0 , β̂1 and β̂2 that minimise:
f(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})\bigr)^2    (12.2)
we will modify the method we used for simple linear regression (Linear regression chapter). We could set the
partial derivatives of the error function (or loss) f with respect to β0 , β1 and β2 equal to 0, and work from the
resulting three equations. However, we get some nice insights by proceeding in two steps. First we’ll set the
partial derivative of f with respect to β0 to be 0:
\frac{\partial f}{\partial \beta_0} = \sum_{i=1}^{n} (-2)\,(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}) = 0    (12.3)

Dividing through by -2n turns the sums into means, giving:

\bar{y} - \beta_0 - \beta_1 \bar{x}^{(1)} - \beta_2 \bar{x}^{(2)} = 0
\quad\Rightarrow\quad
\beta_0 = \bar{y} - \beta_1 \bar{x}^{(1)} - \beta_2 \bar{x}^{(2)}    (12.4)
We now substitute this expression for β0 into Equation (12.1) for the regression line:

y = \bar{y} + \beta_1\bigl(x^{(1)} - \bar{x}^{(1)}\bigr) + \beta_2\bigl(x^{(2)} - \bar{x}^{(2)}\bigr)    (12.5)

This equation shows that the regression plane will pass through the point (\bar{x}^{(1)}, \bar{x}^{(2)}, \bar{y}), which is what happens with the linear regression line in simple linear regression. It is as though the plane is pinned to this location, and there are only two coefficients left to adjust: β1 and β2 .
It will simplify the following analysis to define new versions of the variables in which the mean has been subtracted:

y_i^* = y_i - \bar{y}, \qquad x_{ij}^* = x_{ij} - \bar{x}^{(j)}    (12.6)

Here we've used the stars to denote that this is a version of the variable with the mean subtracted – the star does not mean "complex conjugate"!
Derivation of multiple regression coefficients, part 2 We can now start again from the least squares function
f ∗ that is a function of β1 and β2 , and which contains our mean-subtracted variables:
f^*(\beta_1, \beta_2) = \sum_{i=1}^{n} \bigl(y_i^* - (\beta_1 x_{i1}^* + \beta_2 x_{i2}^*)\bigr)^2    (12.7)
We can now partially differentiate with respect to β1 and β2 to give the normal equations:
\frac{\partial f^*}{\partial \beta_1} = \sum_{i=1}^{n} (-2 x_{i1}^*)\,(y_i^* - \beta_1 x_{i1}^* - \beta_2 x_{i2}^*) = 0
\frac{\partial f^*}{\partial \beta_2} = \sum_{i=1}^{n} (-2 x_{i2}^*)\,(y_i^* - \beta_1 x_{i1}^* - \beta_2 x_{i2}^*) = 0    (12.8)
We can divide both sides of the normal equations by −2 and then rewrite them as one matrix equation:
\begin{pmatrix} \sum_{i=1}^{n} x_{i1}^*\,(y_i^* - \beta_1 x_{i1}^* - \beta_2 x_{i2}^*) \\ \sum_{i=1}^{n} x_{i2}^*\,(y_i^* - \beta_1 x_{i1}^* - \beta_2 x_{i2}^*) \end{pmatrix} = \mathbf{0}
\quad\Rightarrow\quad
\begin{pmatrix} \sum_{i=1}^{n} x_{i1}^*\,(\beta_1 x_{i1}^* + \beta_2 x_{i2}^*) \\ \sum_{i=1}^{n} x_{i2}^*\,(\beta_1 x_{i1}^* + \beta_2 x_{i2}^*) \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_{i1}^* y_i^* \\ \sum_{i=1}^{n} x_{i2}^* y_i^* \end{pmatrix}    (12.9)
Now suppose that we define the matrix X to be an n × 2 matrix in which the first column contains the n values of x_{i1}^* and the second column contains the n values of x_{i2}^*. We call this matrix the design matrix or regressor matrix. We'll define y to be the vector containing the n values of y_i^*. By the definition of matrix multiplication, you should be able to verify that the right-hand side of the equation is equal to the matrix product X^T y:

X^T y = \begin{pmatrix} \sum_{i=1}^{n} x_{i1}^* y_i^* \\ \sum_{i=1}^{n} x_{i2}^* y_i^* \end{pmatrix}    (12.10)
Similarly, the left-hand side of Equation 12.9 can be written as a matrix multiplying the vector of coefficients β = (β1, β2)^T, where the matrix is:

\begin{pmatrix} \sum_{i=1}^{n} x_{i1}^* x_{i1}^* & \sum_{i=1}^{n} x_{i1}^* x_{i2}^* \\ \sum_{i=1}^{n} x_{i2}^* x_{i1}^* & \sum_{i=1}^{n} x_{i2}^* x_{i2}^* \end{pmatrix}    (12.11)

Some points to note about this matrix:

1. If you look back to the definitions of sample variance and sample covariance (Equation 6.1 in the chapter on Data collection and statistical relationships), you will see that its diagonal elements are (n − 1) times the variances of x (1) and x (2) and its off-diagonal elements are (n − 1) times the covariance of x (1) and x (2) .
2. The matrix can be written as X^T X, which can be referred to as the normal matrix.
3. We define the covariance matrix to be S = \frac{1}{n-1} X^T X.

The normal equations can therefore be written compactly as:

X^T X \beta = X^T y    (12.12)

Provided that X^T X is invertible, we can solve this equation for the vector of coefficients:

\hat{\beta} = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = (X^T X)^{-1} X^T y    (12.13)

The vector on the left contains the values of β̂1 and β̂2 , which we can substitute back into Equation 12.4 to get β̂0 .
Notes on our derivation The derivation can be extended to k predictor variables – Equation 12.13 looks the
same, but β̂ is a k × 1 vector, and the design matrix X is an n × k matrix. The normal matrix (and covariance
matrix) end up being k × k matrices.
It’s worth noting that there are other ways of deriving the coefficients – many treatments in textbooks add a
column of 1s to the design matrix and include β0 in the vector of coefficients. We’ve chosen not to do this, as it
makes the connection with the covariance matrix clearer.
It’s also worth noting that although you can program this equation yourself, real-world multiple regression
routines use other matrix formulations for reasons of efficiency and numerical stability.
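A sketch of both routes in NumPy: solving the normal equations (Equation 12.13) on mean-subtracted data, and calling a least-squares routine on the raw data with a column of ones, which is closer to what real packages do. The data are synthetic and the variable names are placeholders.

import numpy as np

rng = np.random.default_rng(5)
n = 200
X_raw = rng.normal(size=(n, 2))                                    # two predictor variables
y_raw = 1.5 + 0.8 * X_raw[:, 0] + 0.4 * X_raw[:, 1] + rng.normal(0, 0.5, n)

# Mean-subtracted design matrix and response (the starred variables)
X = X_raw - X_raw.mean(axis=0)
y = y_raw - y_raw.mean()

# Equation 12.13, but solving the linear system rather than forming the inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)
beta0 = y_raw.mean() - X_raw.mean(axis=0) @ beta                   # Equation 12.4

# The numerically preferred route: a least-squares solver with an explicit intercept column
A = np.column_stack([np.ones(n), X_raw])
coef, *_ = np.linalg.lstsq(A, y_raw, rcond=None)
print(beta0, beta)
print(coef)                                                        # should agree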
Interpretation of the derivation 1: no covariance We'll now suppose that s_1^2, the variance of x (1) , and s_2^2, the variance of x (2) , are non-zero, but the covariance is 0 (Figure 12.1, left), so the covariance matrix is:

S = \frac{1}{n-1} X^T X = \begin{pmatrix} s_1^2 & 0 \\ 0 & s_2^2 \end{pmatrix}

We therefore have:

(X^T X)^{-1} = \frac{1}{n-1} S^{-1} = \frac{1}{n-1} \begin{pmatrix} 1/s_1^2 & 0 \\ 0 & 1/s_2^2 \end{pmatrix}

Multiplying by X^T y = (n-1) \begin{pmatrix} s_{1y} \\ s_{2y} \end{pmatrix} gives the coefficients:

\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = (X^T X)^{-1} X^T y = \begin{pmatrix} s_{1y}/s_1^2 \\ s_{2y}/s_2^2 \end{pmatrix}    (12.14)
Here we’ve used the notation s1y as shorthand for the covariance of x (1) and y.
Now we’ll denote the correlation of x (1) and y by r1y , and remember that the correlation coefficient between two
variables is defined as covariance divided by the product of the standard deviations, in this case: r1y = s1y /(s1 sy ).
Therefore, we can write the covariance of x (1) and y in terms of the correlation s1y = r1y s1 sy . We have similar
definitions for r2y . Substituting in Equation 12.14 gives us:
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} r_{1y}\, s_y/s_1 \\ r_{2y}\, s_y/s_2 \end{pmatrix}    (12.15)
These are exactly the coefficients we found by doing a single linear regression of y on x (1) and x (2) separately.
How much a predictor variable that is not correlated with another predictor variable influences our estimate of y
depends purely on its correlation with y.
Interpretation of the derivation 2: the general case Now we’ll allow the covariances to be non-zero. We
simplify calculations by denoting the correlation between x (1) and x (2) as r12 , so r12 = s12 /s1 s2 . Thus, the
covariance s12 = r12 s1 s2 , and the covariance matrix can be written as:
S = \frac{1}{n-1} X^T X = \begin{pmatrix} s_1^2 & r_{12} s_1 s_2 \\ r_{12} s_1 s_2 & s_2^2 \end{pmatrix}
Figure 12.1: Correlated data, generated for the purposes of demonstration. In both plots the correlations between the predictor variables and the response variable are r1y = 0.8 and r2y = 0.4, and all variables have unit variance. In the left-hand plot the predictor variables are uncorrelated (r12 = 0). The expected regression coefficients are β̂1 = r1y = 0.8 and β̂2 = r2y = 0.4. In the right-hand plot, the predictor variables have a correlation r12 = 0.5; due to the correlation between the predictor variables, the regression coefficients change to β̂1 = 0.8 and β̂2 = 0. See text for details.
When we multiply by

X^T y = (n-1) \begin{pmatrix} r_{1y} s_1 s_y \\ r_{2y} s_2 s_y \end{pmatrix}

we end up with

\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \frac{1}{1 - r_{12}^2} \begin{pmatrix} r_{1y} s_y/s_1 - r_{12} r_{2y} s_y/s_1 \\ r_{2y} s_y/s_2 - r_{12} r_{1y} s_y/s_2 \end{pmatrix} = \frac{1}{1 - r_{12}^2} \begin{pmatrix} (r_{1y} - r_{12} r_{2y})\, s_y/s_1 \\ (r_{2y} - r_{12} r_{1y})\, s_y/s_2 \end{pmatrix}    (12.17)
The coefficient values found in the no-covariance case (Equation 12.15) are still there (when r12 = 0), but
we see that when there is a non-zero correlation between the predictor variables, the coefficients are altered.
The no-covariance estimate for β̂1 (i.e. r1y ) is adjusted by subtracting a fraction r12 of the correlation of the response variable with the other predictor variable, r2y . This makes sense, since if we didn't make this correction, the contribution β̂2 x (2) would "pollute" our estimate of y. Likewise, β̂2 is adjusted by subtracting r12 r1y .
For example, suppose we have unit variance variables (s1 = s2 = 1) and r12 = 0.5, r1y = 0.8, r2y = 0.4
(Figure 12.1, right). Then we have
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \frac{1}{1 - 0.5^2} \begin{pmatrix} 0.8 - 0.5 \times 0.4 \\ 0.4 - 0.5 \times 0.8 \end{pmatrix} = \begin{pmatrix} 0.8 \\ 0 \end{pmatrix}
Essentially the predictor variable with the stronger correlation with the response variable reduces the
influence of the predictor variable with the weaker correlation. The more informative predictor variable “wins”
the competition to explain the response variable.
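We can check Equation 12.17 numerically by generating data with exactly this correlation structure; the sample size and seed below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Unit-variance variables with r12 = 0.5, r1y = 0.8, r2y = 0.4
cov = np.array([[1.0, 0.5, 0.8],
                [0.5, 1.0, 0.4],
                [0.8, 0.4, 1.0]])
x1, x2, y = rng.multivariate_normal(np.zeros(3), cov, size=n).T

X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
beta = np.linalg.solve(X.T @ X, X.T @ (y - y.mean()))
print(beta)   # close to [0.8, 0.0], as predicted by Equation 12.17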
There is a special case when r12 is non-zero, but r1y = r2y . The coefficients become:

\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \frac{1 - r_{12}}{1 - r_{12}^2} \begin{pmatrix} r_{1y}\, s_y/s_1 \\ r_{2y}\, s_y/s_2 \end{pmatrix} = \frac{1}{1 + r_{12}} \begin{pmatrix} r_{1y}\, s_y/s_1 \\ r_{2y}\, s_y/s_2 \end{pmatrix}
Here x (1) and x (2) have equal correlation with y, so are equally informative about y. As the correlation between the predictor variables gets stronger the coefficients are scaled down; when r12 approaches 1, the coefficients are half what they would be in the case of uncorrelated predictor variables. If two people are singing the same song, you can halve the volume of both singers, and still hear the same information.
Collinearity The extreme case of correlation is r12 = 1, when x (1) = cx (2) , with c being a constant. In this case
the denominator of Equation 12.17 is zero. This reflects the fact that the determinant of the covariance matrix S
is zero – the two rows of the matrix XT X in Equation 12.11 are multiples of each other. There is therefore no
solution to Equation 12.13, and a linear regression function in a stats package like statsmodels in Python will
complain about a singular matrix.
When r12 is large there are still problems, since small differences in the correlation of x (1) and x (2) with y
can lead to very different estimates of the coefficients. As stated in the Chapter on Multiple regression, the
interpretation of the coefficients therefore needs particular care.
Chapter 13
Dealing with high dimensions – PCA
Witten et al. (2017) Data mining, 4th Edition, pp 304–307 contains an overview of PCA. Different sources
use different notation, so it may be least confusing just to follow these notes.
The challenge of visualisation We can see a lot in the paired correlation plots. With 4 predictor variables, the
visualisation works, but what about if we had 26 variables? The Scottish Index of Multiple Deprivation
(SIMD, Table 13.1) records 26 variables for each of 6527 data zones in Scotland. A 26×26 grid of scatter
plots is going to be difficult to read.
The challenge of interpretation In the grades example, the test grades (predictor variables) were correlated,
which made the interpretation of the regression coefficients challenging – and this was with only 4 predictor
variables. In the SIMD example, we might expect many of the 26 variables to be correlated, e.g. the time it
takes to drive to the nearest primary school and the time it takes to drive to the nearest secondary school.
There is also another problem with high-dimensional data, called the curse of dimensionality: essentially
a large number of dimensions makes it harder for distance-based methods such as clustering and nearest
neighbours to work effectively – we’ll come back to the curse of dimensionality in the following lectures on
clustering and nearest-neighbour methods.
Table 13.1: Scottish Index of Multiple Deprivation, 2016 edition (Scottish Government, 2016). https://simd.
scot. It has n = 6527 data points (postcode zones), each associated with D = 26 variables.
Figure 13.1: Scatter plot of first and second principal component scores (PC1 and PC2) of 6527 data points in the SIMD dataset (blue dots). Locations of 4 data zones are indicated in orange dots next to an image from that data zone. All photos released under CC licence from geograph.co.uk. Credits: Orkney © Des Colhoun; Possil Park © Stephen Sweeney; Bruntsfield © Leslie Barrie; Fairmilehead © Jim Barton.
In dimensionality reduction methods these challenges are addressed by reducing the number of dimensions in
the data while retaining as much useful information as possible. There are a number of dimensionality reduction
methods which differ in what aspects of the data they preserve.
Principal components analysis We are going to discuss one method of dimensionality reduction called principal
components analysis (PCA).
PCA can be applied to a set of D numeric variables with n datapoints. In contrast to linear regression, all
variables are treated equally: there is no response variable that we are trying to predict, just a set of variables
whose structure we’re trying to understand better. The result of PCA is a set of up to D new variables (with n
datapoints). We can keep k ≤ D of the most informative new variables.
In PCA the objectives are:
1. change the angle we view the data from to see things clearly
2. ignore small details in the data that don’t affect the big picture.
We’ll specify these objectives more precisely and explain how PCA works later. First, we will show the results
when PCA is applied to the SIMD example (Table 13.1).
Example of PCA We can use PCA to reduce the number of variables D in the SIMD data from D = 26 to k = 2,
allowing us to visualise all n = 6527 data points (Figure 13.1). In this plot, the ith datapoint has coordinates
(ti1 , ti2 ), in which each coordinate is a linear combination of the standardised1 data zij shown in Table 13.1:

t_{i1} = p_{11} z_{i1} + p_{21} z_{i2} + \dots + p_{D1} z_{iD}, \qquad t_{i2} = p_{12} z_{i1} + p_{22} z_{i2} + \dots + p_{D2} z_{iD}    (13.1)

Figure 13.2: Scatter plots of first and second principal component scores of SIMD data zones (blue dots). The projection of three original variables related to deprivation are shown as orange arrows emanating from the origin. High unemployment and overcrowded rate are found in areas with higher deprivation, whereas high school attendance is found in areas with low deprivation. These vectors are more closely aligned with the first principal component (PC1), which we therefore interpret as "Deprivation". Red arrows indicate the projections of the time taken to drive to the nearest secondary school or retail outlet. As these are aligned with PC2, we therefore interpret PC2 as being related to distance to services, or "Remoteness".
The weights p11 , p21 , . . . , pD1 are elements of the first principal component and ti1 is the first principal component score of the ith datapoint; we will explain how to find them later. Likewise, p12 , p22 , . . . , pD2 form the second principal component and ti2 is the second principal component score of datapoint i. The weights in the principal component indicate how much influence each original variable has over each principal component score – sometimes they are referred to as loadings. The axes in Figure 13.1 are labelled PC1 (first principal component – "PC" stands for principal component) and PC2 (second principal component).
To see the influence of each original variable on PC1 and PC2 scores, we can project the jth original variable
onto the plot by setting zij to 1 and all the other z’s to 0 in Equation 13.1. In this case, the coordinates we’ll be
plotting are (pj1 , pj2 ). The orange arrows in Figure 13.2 show the projections of the variables for unemployment,
overcrowding (in housing) and school attendance. Unemployment and overcrowding have high PC1 scores. In
contrast, school attendance has a low PC1 score. This all makes sense if we identify the first component score
with “Deprivation”. We can rephrase the previous sentences as “unemployment and overcrowding are found in
areas of high deprivation and high school attendance is found in areas of low deprivation”.
The red arrows in Figure 13.2 show the projections of the time to drive to the nearest retail outlet and time
to drive to the nearest secondary school. These vectors have higher magnitude PC2 scores than PC1 scores. We
therefore identify PC2 as being to do with “remoteness” – low values of PC2 indicate the zone is more remote.
1 Remember from the section on Correlation that we standardise the jth variable x (j) by subtracting its mean x (j) and dividing by its
standard deviation sj , so that zij = (xij − x (j) )/sj . Generally, the data we supply to PCA do not need to be standardised, but we still do
need to subtract the mean in order to compute the component scores.
Note that the correlation between the PC1 and PC2 scores is zero. It is a general property of PCA that there are no correlations between the scores of different principal components.
In this particular example, the visualisation shows a unimodal distribution of data with little obvious structure.
Later on in the course we will see examples where PCA reveals clusters of data – though still with zero correlation.
Even if no structure is apparent, reducing the dimensionality of the data can be useful for further analysis.
For example, if we had data on cancer screening rates in each SIMD data zone, we could then do multiple regression of the cancer screening rate on the new deprivation and remoteness variables. This is probably going
to give us coefficients that are a lot more interpretable than regressing on all 26 variables.
Projecting principal component scores back into the data space Suppose we have identified the first two principal component scores ti1 and ti2 of area i. We might wish to project them back into the data space, to see what the original variables looked like. To do this we can use the following equations to give approximations (indicated by the tilde) to the original standardised variables:

\tilde{z}_{ij} = t_{i1} p_{j1} + t_{i2} p_{j2}    (13.2)

We can include more terms for higher PCs, right up to the Dth PC. In general, if we keep the first k principal components, the jth component of the ith data point is given by:

\tilde{z}_{ij} = \sum_{l=1}^{k} t_{il}\, p_{jl}

Once we've got the standardised variables, we can convert back to the original variables using the formula x_{ij} = z_{ij} s_j + \bar{x}^{(j)} .
Principal component equations in vector notation The equations used so far may make more sense when expressed as vectors. The jth principal component is actually a vector in the original data space:

\mathbf{p}_j = (p_{1j}, p_{2j}, \dots, p_{Dj})^T

All the principal component vectors are orthogonal to each other. With this notation we can write Equation 13.2 as a linear combination of the principal component vectors, weighted by the principal component scores:

\tilde{\mathbf{z}}_i = t_{i1}\mathbf{p}_1 + t_{i2}\mathbf{p}_2
Table 13.2: Imaginary data about Informatics students’ preferences for programming languages and drinks.
Figure 13.3: Informatics students' preferences for drinks and programming languages, as plotted initially (left), and rotated (right).
Figure 13.4: Visualisation of how PCA achieves the two objectives in the text. Top: (1) change the angle we view the data from to see things clearly. Bottom: (2) ignore small details in the data that don't affect the big picture.
Plotting the data (Figure 13.3 left) shows that students’ preferences for drinks and programming languages
are correlated. It seems that we could characterise every Informatics student by one number that is low if they
like Haskell and tea, and high if they like Java and coffee. If we could rotate the axes (Figure 13.3 right), the
new x-axis would give us this number.
Once we’ve done the rotation (changed the angle), we end up with the data plotted against a new set of axes,
which are the principal components (Figure 13.4, top). The new x-axis, which tells us a lot about the students’
preferences for Java and coffee or tea and Haskell, is the first principal component (PC1). The new y axis is the
second principal component (PC2). It is worth noting two points:
• The correlation between the new PC1 and PC2 scores is zero. It is a general property of PCA that the correlation between the scores of different principal components is zero.
• We have not lost any information about the data; we can reconstruct the original data by reversing the
rotation. It is a general property of PCA that it is possible to reconstruct the data if scores of all D
principal components are retained.
The second principal component doesn’t seem so informative, so we could just ignore it altogether (Figure 13.4,
bottom). Thus, we have ignored small details in the data that don’t affect the big picture. We have performed
dimensionality reduction by reducing the number of values describing each data point from two to one.
Objective of rotation There are two questions that we haven’t answered so far:
1. How do we choose how much to rotate the axes?
2. What counts as “informative”?
The answer to both questions is “variance”. In Figure 13.4 (top), the variance of the data in the PC1 direction
is much greater than the variance of the data in the PC2 direction. The high variance PC1 is telling us a lot
about the informatics students, whereas the low variance PC2 tells us little. Therefore, in order to choose how to
rotate the axes, we use the variance as an objective. In fact there are two ways of formulating PCA:
1. Maximum variance formulation: find an axis that maximises the variance of the data projected onto it
2. Minimum error formulation: find an axis that minimises the sum of squared distances from the data points to their projections onto it
It doesn’t matter which formulation we use; the answer is the same either way.
Explained variance The variance in the original x (Programming language) and y (Drink) directions was 9.7
and 7.7. The sum of these two variances is the total variance, i.e. 17.4. It turns out that the sum of the variance
along the principal components is exactly the same. However, the variance of the PC1 scores is 16.5, i.e. 96% of
the total variance. We therefore say the PC1 explains 96% of the variance.
More than 2D In general, we can find D principal components in D dimensions. The principal components are
all orthogonal to each other, and each principal component explains a certain fraction of the variance. We order
the principal components from the one that explains most variance to the one that explains least.
In the SIMD example, the first principal component explains 41.7% of the variance and the second explains a further 15.2%. Thus, the first two principal components together explain 56.9% of the variance. We can visualise
how much each principal component explains in a scree plot or cumulative scree plot (Figure 13.5).
How many components to choose? Obviously if we are visualising data, we can only look straightforwardly at
up to 3 dimensions. The scree plot helps us to choose how many components to include if we are using PCA as
a preprocessing step. A rule of thumb is to use the components to the left of the elbow or knee of the scree
plot, i.e. the point where the gradient changes sharply. In Figure 13.5 this point is indicated in red, and the rule
of thumb would suggest that we use PC1 and PC2. There are more principled ways of choosing, which we won’t
cover at this point, and it may also be that successful application of PCA requires more components.
In the next section, we’ll look at the maths of how to find the directions of the principal components and the
associated variances. However, you should already know enough to skip to the section after, which is about
applying PCA to help with a regression problem.
Figure 13.5: Scree plot for PCA applied to SIMD example (left). The elbow (or knee) is indicated in red. The cumulative scree plot (right).
Table 13.3: Coefficient of determination and adjusted coefficient of determination for regression of grades on
original variables and on 2 or 4 PC scores.
When we regress on all 4 PC scores, we get exactly the same predictions and coefficient of determination as we
do for regressing on all variables (Table 13.3). This makes sense, since by keeping all 4 components we have
not lost any information about the data. It is more surprising that the coefficient of determination when we regress on only the first two PC scores is almost as high. Furthermore, the adjusted coefficient of determination is actually higher when regressing on the first two principal components, due to there being fewer variables.
There is no combination of any two of the original variables that gives as high a coefficient of determination.
This example demonstrates that PCA can be a useful preprocessing step for regression, by decorrelating the
variables.
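A sketch of the whole pipeline in NumPy: standardise, find the principal components from the covariance matrix, compute scores, and regress on the first two scores. Synthetic correlated predictors stand in for the grades data, which is not reproduced here.

import numpy as np

rng = np.random.default_rng(7)
n, D = 80, 4
base = rng.normal(size=(n, 1))                      # shared factor, so the predictors are correlated
X = base + 0.7 * rng.normal(size=(n, D))
y = 60 + X @ np.array([3.0, 1.0, 1.0, 0.5]) + rng.normal(0, 5, n)

# Standardise and compute principal components of the covariance matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
S = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)                # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
P = eigvecs[:, order]
T = Z @ P                                           # principal component scores

# Regress y on the first two PC scores (with an intercept column)
A = np.column_stack([np.ones(n), T[:, :2]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(coef, r2)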
PCA and linear regression lines Thinking back to linear regression, we remember the distinction between the
regression lines of y on x and x on y. In two dimensions there is now a 3rd line: the first principal component.
This goes right between the regression lines, and is probably what you would think the line of best fit to the
data is. In fact, it is a line of best fit. It’s the line that minimises the sum of the squared distances from the data
points to the line, rather than minimising the error in predicting y or x.
Figure 13.6: PCA scores of predictor variables in Grades example (see Multiple regression lecture notes).
Figure 13.7: Regression of y on x, x on y and the first principal component of data with a correlation coefficient r = 0.5.
4. Find direction of the axis that maximises variance of data projected onto it (1st principal component, PC 1)
5. Interpret
This list may seem overwhelming, but it actually boils down to about 4 lines of code (assuming some helper
functions), shown in Listing 13.1.
Step 1: Defining variance along original axes We’ve already met a lot of the mathematical machinery we
need in the multiple regression topic. We’ll assume now that we have D variables x (1) , . . . , x (D) , and that we
have defined zero-mean versions of the variables xij∗ = xij − x (j) . Usually, as in the SIMD example, we start off
with standardised variables anyway. We can arrange these zero-mean variables in an n × D matrix,
X = \begin{pmatrix} x_{11}^* & \cdots & x_{1D}^* \\ \vdots & \ddots & \vdots \\ x_{n1}^* & \cdots & x_{nD}^* \end{pmatrix} = \begin{pmatrix} \mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(D)} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}    (13.8)
which it can be helpful to write in terms of the D n × 1 vectors representing all the data in each dimension, or
as the transposes of the n D × 1 vectors representing each data point.
Listing 13.1: Listing of PCA. We are assuming the existence of helper functions standardize() and sort_eigenvalues().

import numpy as np

def pca(X):
    """Given a data matrix X with n rows and D columns,
    return principal components (P) and eigenvalues (lambdas)."""
    # Standardise the data X
    Z = standardize(X)
    # Compute the D x D covariance matrix S
    # (rowvar=False because the columns, not the rows, are the variables)
    S = np.cov(Z, rowvar=False)
    # Find unsorted eigenvectors (P) and eigenvalues (lambdas)
    lambdas, P = np.linalg.eig(S)
    # Sort the eigenvectors and eigenvalues
    # in order of largest eigenvalues
    lambdas, P = sort_eigenvalues(lambdas, P)
    # The eigenvalues (lambdas) are proportional
    # to the amount of variance explained
    for i in range(len(lambdas)):
        print('PC' + str(i + 1) + ' explains ' +
              str(round((lambdas[i] / np.sum(lambdas)) * 100)) +
              '% of the variance.')
    return (lambdas, P)
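For completeness, here is one possible sketch of the two helper functions that the listing assumes, together with a usage example; these are illustrative implementations rather than the course's own.

import numpy as np

def standardize(X):
    """Subtract each column's mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def sort_eigenvalues(lambdas, P):
    """Sort eigenvalues, and the matching eigenvector columns, from largest to smallest."""
    order = np.argsort(lambdas)[::-1]
    return lambdas[order], P[:, order]

# Usage: the component scores are projections of the standardised data onto the components
X = np.random.default_rng(8).normal(size=(100, 5))
lambdas, P = pca(X)              # pca() as defined in Listing 13.1
T = standardize(X) @ P           # one column of scores per principal component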
Figure 13.8: A data point x from the drinks and programming languages data projected onto a new axis defined by the unit vector p, giving the component score t1.
We’ve also met the covariance matrix, the D × D matrix that’s derived from the data matrix:
s11 · · · s1D
.. .. XT X (13.9)
1
S= .
. =
n−1
sD1 · · · sDD
Step 2: Project data onto a new axis We'll define the new axis by the unit vector p (Figure 13.8). The projection of a data point x_i onto this axis (its component score) is (as per Equation 13.1):

t_i = \mathbf{p}^T \mathbf{x}_i    (13.10)
Step 3: Compute variance in these axes The definition of the variance of the t (1) is:

s_t^2 = \frac{1}{n-1} \sum_{i=1}^{n} t_i^2    (13.11)
Our old friend the covariance matrix has reappeared. Although we’ve demonstrated this in 2 dimensions, the
equation is still valid in D dimension.
Step 4: Find direction of axis that maximises variance of data projected onto it (1st principal component, PC
1) We now have an expression for the variance of the component scores for any direction p; we now want to
find the direction of p that maximises that variance, subject to the constraint that p is of unit length, so |p| = 1.
This is a constrained optimisation problem, which we can solve using Lagrange multipliers and differentiation.
We won’t show the details here, but the result is the following equation:
Sp = λp
Hopefully you recognise this equation. Its solutions are:
1. λ = λ1 , p = e1 , where λ1 is the biggest eigenvalue of S and e1 is the associated eigenvector
2. λ = λ2 , p = e2 , where λ2 is the second biggest eigenvalue of S and e2 is the associated eigenvector
We choose the first principal component p1 to be the eigenvector e1 with the largest eigenvalue λ1 . λ1 is the
variance of the 1st component scores t (1) .
Step 5: Interpret Finding the eigenvalues and eigenvectors for our example, we arrive at the first principal
component being:
$$
\mathbf{p}_1 = \begin{pmatrix} 0.75 \\ 0.66 \end{pmatrix}\begin{matrix}\text{Java}\\\text{Coffee}\end{matrix},
\qquad \lambda_1 = s^2_{t^{(1)}} = \mathbf{p}_1^\top \mathbf{S}\,\mathbf{p}_1 = 16.5
$$
The first component score t⁽¹⁾ is the "Java-coffeeness" of a student.
Step 6: Find the 2nd principal component (PC 2) In 2D our job is already done, since there is only one
direction perpendicular to p1 , and eigenvectors (and therefore principal components) are always orthogonal
to each other. It’s the other eigenvector of S, p2 = e2 , with eigenvalue λ2 , which is the variance of the 2nd
component scores t (2) .
In D dimensions, the principal components are the D eigenvectors of the D × D matrix S. It’s helpful to
introduce more matrix notation here. We arrange the principal components in the rotation matrix:
$$
\mathbf{P} = \begin{pmatrix}\mathbf{p}_1 & \mathbf{p}_2\end{pmatrix}
$$
Step 7: Quantify what is lost by dimensionality reduction We can reverse the transformation from the scores
to the original data: $\mathbf{X} = \mathbf{T}\mathbf{P}^\top$.
If we drop the 2nd PC from P and the 2nd PC scores from T, we can reconstruct a 1-dimensional version of the
original data: $\tilde{\mathbf{X}}^{(1)} = \mathbf{t}^{(1)}\mathbf{p}_1^\top$.
We can see (Figure 13.9) that the first principal component (the “Java-coffeeness”) score of a student tells us
a lot about them – but how much? Consider the total variance, the sum of the variances of the data:
$$
\sum_{i=1}^{D} s_i^2 = \sum_{i=1}^{D} \lambda_i
$$
It is equal to the sum of the eigenvalues of the covariance matrix. Thus the fraction of the total variance
"explained" by the ith principal component is:
$$
\frac{\lambda_i}{\sum_{j=1}^{D} \lambda_j}
$$
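
As a minimal sketch (assuming the pca() function and standardize() helper of Listing 13.1, and a data matrix X with n rows and D columns), the component scores, the rank-1 reconstruction and the explained-variance fractions could be computed as follows:

import numpy as np

lambdas, P = pca(X)          # X is assumed to be an n x D data matrix
Z = standardize(X)

T = Z @ P                                # component scores T, one column per PC
X1_tilde = np.outer(T[:, 0], P[:, 0])    # reconstruction using only PC 1
explained = lambdas / lambdas.sum()      # fraction of total variance per PC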
Statistical inference

Chapter 14

Randomness, sampling and simulation
Inferential statistics tasks Inferential statistics can seem like a toolbox full of tools with confusing names such
as “standard error of the mean”, t-test, χ 2 test, bootstrap, and a confusing set of rules about what to use each
tool for. We’re going to try to give you an idea of what task each tool is useful for, and how it works. There are
three main tasks we will consider:
1. Estimation
2. Hypothesis testing
3. Comparing two samples (A/B testing)
Estimation We’ve already given one example of estimating an unobserved quantity (the population mean).
Another example of unobserved quantities is the coefficients of a linear regression. We already know how to find (point)
estimates of them, using the formulae we covered earlier in the course. But in this part of the course we will learn
how to estimate confidence intervals around the point estimates, which will tell us how much uncertainty
Figure 14.1: Uncertainty in regression line. The best estimate regression line (solid) and the lines at the edge
of the 95% confidence interval (dashed lines). Note this plot is simplified, since the uncertainty in the intercept
is not represented.
there is in our estimates. For example, we’ll be able to say that we estimate the mean weight of squirrels in the
population to be 320 ± 16g, with 95% confidence, i.e. in the confidence interval [304, 336]g. In a linear regression
of the weight of a sample of squirrels on their length (Figure 14.1), we will be able to say that the best estimate
of the slope of the regression line is 3.35 g/mm, but we are 95% confident that the slope lies in the interval [2.32,
4.38] g/mm.
Hypothesis testing In hypothesis testing, we are trying to ascertain which of two or more competing theories
is the best explanation of the data. For example, in 1965 a court case was brought against the state of Alabama
(Swain versus Alabama, 1965) due to there being no Black members of the jury in a trial.1 Part of the case
concerned the fact that at one stage of the jury selection, 8 Black people were chosen for a jury panel of 100
people, but the fraction of Black people in the population was 26%. Our question is “Is the jury selection system
biased against Black people?”.
Comparing two samples (also known as A/B testing) Here we have two samples that have been treated
differently, and we want to either test if the groups are different, or estimate how different they are. For example,
to find out the effectiveness of a vaccine, we select a sample of volunteers from the population randomly, divide
them randomly into two groups, give the vaccine to one group (Treatment group) and give the other group a
placebo (Control Group). In the vaccine group 3 volunteers catch the disease, but in the placebo group 95
volunteers catch the disease. Is the vaccine effective? How much would we expect the vaccine to cut the risk of
catching the disease if we give it to the whole population?
In the context of user testing, often in web applications, this is called A/B testing. A famous example was at
Amazon, where a developer had the idea of presenting recommendations as customers add items to a shopping
cart (Kohavi et al., 2007). Amazon managers forbade the developer from working on the idea, but the developer disobeyed
orders and ran a controlled experiment on customers, by splitting them into two groups (“A” and “B”), one which
had recommendations shown and one which didn’t. The group which had recommendations shown bought more,
and displaying recommendations quickly became a priority for Amazon.
Two approaches to statistical inference We are going to learn a number of techniques for undertaking point
and interval estimation, hypothesis testing and comparing samples. We will also think carefully about the
interpretation of these techniques. There are two main approaches to undertaking statistical inference tasks,
statistical simulations and statistical theory:
1. Statistical simulations: Here we use repeated random sampling to carry out the statistical inference
procedures. The advantages of statistical simulation procedures are they often require fewer assumptions
1We found the example of Swain versus Alabama in Adhikari et al. (2020), and follow their treatment.
Figure 14.2: Histograms of 1,000 samples taken from normal, uniform and exponential distributions.
about the data and/or hypothesis than statistical theory, and they require somewhat less theory to
understand. However, they can be compute-intensive, and care is still needed in their use.
2. Statistical theory: Here we use the properties of various well-known theoretical distributions to draw
inferences about our data. We need to check that the assumptions behind the distribution match the
statistical question we are trying to answer. For example, a distribution of delays to flights is likely to be
highly right-skewed, so we shouldn’t assume a normal distribution when dealing with it. Typically, the
process is not compute-intensive: very often it amounts to arithmetic and then reading of a quantity from a
distribution table. These procedures come as standard in a number of stats packages, including R and
Python’s statsmodels.
A number of fundamental concepts underpin both the statistical theory and statistical simulations.
Plan for this part of the course The plan for the statistical inference topics in this part of the course will be:
• We’ll learn how we can use statistical simulations to generate samples from a model and compute
statistics for each of these samples to give a sampling distribution.
• Learn about the distribution of the mean of repeated samples from a model. This will lead us to the
central limit theorem, which can help us to estimate the uncertainty in our estimate of the mean,
i.e. confidence intervals, and the law of large numbers, which also helps with estimation.
• A/B testing
Random sample In a random sample of size n from either a probability distribution or a finite population of
N items, the random variables X1 , . . . , Xn comprising the sample are all independent and all have the same
probability distribution.
Sampling from probability distributions You should be familiar with random number generation, which is a form of sampling.
A standard random number generator produces numbers within an interval (e.g. [0, 1]) with uniform probability
for each number, i.e. it samples from a uniform distribution. We can demonstrate the distribution of a standard
random number generator by drawing many samples and plotting a histogram (Figure 14.2). We can adapt these
functions to sample from any univariate distribution, e.g. a normal distribution or an exponential distribution
(Figure 14.2).
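
For example, with NumPy's random number generator (a sketch; the distributions and sample sizes are illustrative):

import numpy as np

rng = np.random.default_rng(seed=42)
x_uniform = rng.uniform(0, 1, size=1000)              # uniform on [0, 1]
x_normal = rng.normal(loc=0, scale=1, size=1000)      # normal, mean 0, s.d. 1
x_exponential = rng.exponential(scale=1, size=1000)   # exponential, rate 1

Plotting a histogram of each array (e.g. with matplotlib's hist function) reproduces the shapes shown in Figure 14.2.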
Sampling from a set of discrete items We can also sample from a population of discrete items. We can select
n items from a set of N items at random either without replacement or with replacement. If we sample without
replacement, it is as though we are pulling items of various types (e.g. coloured balls) at random out of a bag,
and not replacing them. We can only sample up to N items, and also, as we remove items from the bag, the
probabilities of drawing a particular type (colour) changes. If we sample with replacement, we put the item back
in the bag, before making our next choice – we can carry on doing this for ever. We could construct an algorithm
for random sampling either with or without replacement from a uniform random number generator, but such
functions are already provided, e.g. numpy.random.choice in Python.
A particular application of sampling from a set of discrete items is creating a sample of a larger data set.
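
A sketch of sampling with and without replacement (the set of coloured balls is illustrative):

import numpy as np

rng = np.random.default_rng(seed=1)
balls = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2

no_replacement = rng.choice(balls, size=4, replace=False)    # at most len(balls) draws
with_replacement = rng.choice(balls, size=20, replace=True)  # can carry on indefinitely

To create a random sample of a larger tabular data set, pandas offers df.sample(n=1000), which samples rows without replacement by default.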
Non-random samples from a population We can also imagine ways of sampling that are not systematically
random. For example, we might have a list of the daily takings in a restaurant. We could take the first n days.
But suppose that the dataset has been sorted in terms of takings? We would then have days with low takings at
the start of the list, so the statistics of the sample would not resemble the statistics of the population. We could
try taking every 7th day in the list – but if the list is in date order we will always be sampling from one day of
the week, e.g. Mondays. Random sampling ensures that we don’t have this type of problem.
Samples of convenience When we are collecting data, it might be tempting to sample from the data that we
can collect conveniently. For example, a polling company may find it easier to contact people who have more
time to answer the phone, who may tend to be retired people. A sample collected in this way is called a sample
of convenience, and if we don't correct for this sort of bias our estimates will be biased. One way of combating
convenience sampling is stratified sampling, in which the sampling is targeted so that the proportions of attributes
in the sample match the proportions in the population.
Definition of a statistic Before going further, it’s helpful to have the definition of statistic: “A statistic is any
quantity whose value can be calculated from sample data.”(Modern Mathematical Statistics with Applications 6).
We probably recognise the mean, variance and median as statistics by this definition. But we’ve also derived
other quantities from sample data, such as the correlation coefficient and regression coefficients – they are
also statistics. We will follow Modern Mathematical Statistics with Applications and denote a statistic using
an uppercase letter, to indicate that it is a random variable, since its value depends on the particular sample
selected. E.g. X̄ represents the mean and S² the variance.
Simulations and sampling Before considering inferential statistics proper, we will focus on running statistical
simulations, i.e. using a computer program to make predictions from probabilistic models of real-world processes.
For example, the probabilistic model of tossing a coin multiple times is that the tosses are independent and
that the probability of a head is 1/2 (or perhaps another value, if we think the coin is loaded). The statistical
simulation generates a sequence of heads and tails.
To do this we need to decide on:
• The population distribution (e.g. normal with particular mean and variance) or set of discrete items
• The sample size n
• The number of repetitions k
• The statistic of interest
Figure 14.3: Results of statistical simulations of the panel size in Swain versus Alabama (1965). The blue
histogram shows how many of 10 000 simulations produced jury panels of 100 with the given number of Black
people on them. The red dot indicates the number of Black jurors in Swain versus Alabama (1965).
1. For i in 1, . . . , k
(a) Sample n items from the population distribution or set of discrete items
(b) Compute and store the statistic of interest for this sample
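
A minimal sketch of this procedure in code (the population distribution, the sample size n, the number of repetitions k and the choice of statistic are all illustrative):

import numpy as np

rng = np.random.default_rng(seed=0)
n, k = 10, 10_000
statistics = np.empty(k)
for i in range(k):
    sample = rng.normal(loc=0, scale=1, size=n)   # (a) sample n items
    statistics[i] = sample.mean()                 # (b) compute and store the statistic
# "statistics" now holds the simulated sampling distribution of the chosen statistic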
Example of hypothesis testing using a simulation experiment To demonstrate the utility of the simulation
procedure we've introduced, let's look again at the example in which 26% of the population is Black and 8 Black
people are selected to be on a jury panel of 100 people. The null hypothesis H0 is “The jury panel was chosen
at random from the population”. We can map the null hypothesis onto the general framework above as follows:
• The statistic of interest is T0 , the number of Black people in a sample of n = 100 panel members
• The population distribution is a Bernoulli distribution with the sample space {Black, Non-Black} in which
P(Black) = 0.26.
We follow the procedure described in the previous section to give the results shown in Figure 14.3. Coding
this up will be an exercise for you in the Labs. We can see that none of the 10 000 simulations of the null
hypothesis produced a jury panel with as few as 8 Black members, suggesting that we should reject the null hypothesis in favour of an
alternative one. This looks like a clear-cut case; in the topic on Hypothesis testing, we’ll consider in more
detail how to interpret the results when the data is less distinct from the simulations.
Deriving the sampling distribution Note that in this example, we didn’t have to go to the trouble of running a
simulation experiment. We might have noticed that the total number of Black people will be distributed according
to a binomial distribution with n = 100 and p = 0.26.
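
As a sketch of the corresponding theoretical calculation, scipy.stats.binom gives the probabilities directly under this null model:

from scipy.stats import binom

n, p = 100, 0.26
print(binom.pmf(26, n, p))   # probability of exactly 26 Black panel members
print(binom.cdf(8, n, p))    # probability of 8 or fewer -- vanishingly small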
As an example of generating sampling distributions by simulation, suppose we repeatedly draw samples with the
following settings and compute the mean, variance and median of each sample:
• Population distribution: Normal distribution with mean 0 and variance 1, Uniform distribution on [0, 1], or
Exponential distribution p(x) = e⁻ˣ
• Sample size n = 10
The results are shown in Figure 14.4. There are a number of points to notice about this plot:
Sample mean (first column), all distributions The distribution of the mean is narrower than the original distri-
bution in every case. This is because some of the variability in the individual samples is averaged out.
The standard deviation of this distribution is called the standard error of the mean.
Sample mean of normal distribution The distribution looks to be normal – it turns out that this is easy to prove.
Sample mean of uniform distribution The distribution is symmetric and looks to be near-normal.
Sample mean of exponential distribution The distribution is clearly skewed, but less so than the original
exponential distribution.
Sample variance (second column) All these distributions are skewed, reflecting the fact that it's very unlikely
to get 10 samples that are all very close together and therefore have low variance. It turns out that there
is a theoretical distribution (the χ² distribution) that describes the shape of the sampling distribution of the
sample variance for samples from a normal distribution.
Median (third column) The main point to draw from this column is that we can use the simulation method to
produce a distribution for any statistic, regardless of how easy it would be to calculate a theoretical
distribution for it.
As we will see later, we could generate the sampling distribution of the mean and the variance analytically
rather than by simulation. However, it is not always possible to compute sampling distributions of the desired
statistics analytically, and we can always run statistical simulations.
Figure 14.4: Sampling distribution generated by 10,000 simulations of the mean x, variance s2 and median x̃
of 10 samples drawn from a normal distribution (top row), uniform distribution (middle row) and exponential
distribution (bottom row).
Figure 14.5: Distributions of means from samples of size n = 1000 (top row) and n = 10000 (bottom row) drawn
from the normal, uniform and exponential distributions shown in Figure 14.2. The blue histograms show the
histograms obtained from k = 2000 simulations. The orange curves are normal distributions with mean equal to
the mean of the original distribution and variance σ²X̄ equal to σ²/n, where σ² is the variance of the original
distribution.
Central Limit Theorem Here is an informal statement of the Central Limit Theorem (CLT):
The distribution of the mean or sum of a random sample of size n drawn from any distribution will
converge on a normal distribution as n tends to infinity.
In the case of the sample mean, its expected value is the same as the mean of the population
distribution, and its expected variance is a factor of n lower than the population variance.
In the case of the sample sum, its expected value is the same as the product of the sample size n
and the expected value of the distribution, and its expected variance is n times the variance of the
population distribution.
We denote the expected variance of the mean σ²X̄ and we denote the standard deviation of the mean σX̄,
called the standard error of the mean, often abbreviated as SEM. It's important to note that the SEM is not the
same as the standard deviation of the original distribution. According to the statement above, an estimate of the
SEM is $\hat{\sigma}_{\bar{X}} = \sigma/\sqrt{n}$.
We can verify that this statement holds in the case of sampling a mean in Figure 14.5 by computing the means
and SEMs from the simulations and comparing them with the expected values of µ (population mean) and $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
The Swain versus Alabama jury selection example demonstrates the CLT applied to a total $T_0 = \sum_{i=1}^{n} X_i$,
where Xi = 1 indicates a Black member of the population was selected, and Xi = 0 indicates non-Black. The
distribution is a Bernoulli distribution with population mean µ = p = 0.26, the probability of picking a Black
person. We can see from Figure 14.3 that the mean of the total is nµ = 100 × 0.26 = 26, and the variance
is approximately σ²T₀ ≈ nσ² = np(1 − p) = 19.24, as expected for a sum of Bernoulli random variables, giving a standard
deviation of 4.38. Furthermore, the distribution is approximately normal.
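
We can check these numbers with a short simulation (a sketch; 10,000 repetitions is an arbitrary but typical choice):

import numpy as np

rng = np.random.default_rng(seed=0)
n, p, k = 100, 0.26, 10_000
totals = rng.binomial(n, p, size=k)   # each draw is the number of Black members on a random panel
print(totals.mean())                  # close to n*p = 26
print(totals.std())                   # close to sqrt(n*p*(1 - p)), about 4.4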
The law of large numbers Here is an informal statement of the law of large numbers:
In the limit of infinite n, the expected value of the sample mean X tends to the population mean µ
and the variance of the sample mean X tends to 0.
Note that sometimes the law of large numbers is referred to as the “law of averages”. This can lead to confusion.
The law of averages is sometimes called the “Gambler’s fallacy”, i.e. the idea that after a run of bad luck,
the chance of good luck increases. If the events that are being gambled on are independent of each other
(e.g. successive tosses of the same coin), the probability of a head will be the same regardless of how many tails
have preceded it.
In the second row of Figure 14.5 we can see that the distribution for n = 10000 is narrower than the
distribution for n = 1000, and that the sample means converge on the population means. The law of large
numbers says that we could, in principle, continue this process by choosing an n as large as we would like to
make the variance as small as desired.
Formal statement of the central limit theorem (Modern Mathematical Statistics with Applications 6.2) Let
X₁, . . . , Xₙ be a random sample from a distribution with mean µ and variance σ². Then, in the limit n → ∞, the
standardised mean $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ and standardised total $(T_0 - n\mu)/(\sqrt{n}\,\sigma)$ have a standard normal distribution. That
is
$$
\lim_{n\to\infty} P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right) = P(Z \le z) = \Phi(z)
$$
and
$$
\lim_{n\to\infty} P\left(\frac{T_0 - n\mu}{\sqrt{n}\,\sigma} \le z\right) = P(Z \le z) = \Phi(z)
$$
where Φ(z) is the cumulative distribution function (cdf) of a normal distribution with mean 0 and s.d. 1. Thus, when
n is sufficiently large, X̄ has an approximately normal distribution with mean µX̄ = µ and variance σ²X̄ = σ²/n,
and the distribution of T₀ is approximately normal with mean µT₀ = nµ and variance σ²T₀ = nσ². We can also
say that the standardised versions of X̄ and T₀ are asymptotically normal.
Formal statement of the (weak) law of large numbers (Modern Mathematical Statistics with Applications 6.2)
Let X1 , . . . , Xn be a random sample from a distribution with mean µ and variance σ 2 . As the number of
observations n increases, the expected value of the sample mean remains E[X ] = µ, but the expected variance
V [X ] = E[(X − µ)2 ] → 0. We say that “X converges in mean square to µ”.
More formally, the probability that the difference between the sample mean and the population mean is greater
than an arbitrary value ε is bounded by
$$
P(|\bar{X} - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}
$$
for any value of ε. Thus, as n → ∞, the probability approaches 0, regardless of the value of ε.
A proof of this statement relies on Chebyshev’s inequality, and can be found in Modern Mathematical
Statistics with Applications 6.2.
Note that this is the statement of the weak law of large numbers. There is also a strong law, which has
somewhat more stringent requirements on convergence. All distributions that obey the strong law also obey
the weak law, but some distributions only obey the weak law and some obey neither law. A discussion of this
topic is beyond the scope of this course; the distributions that do not obey the strong law tend to be "weird",
e.g. having infinite variance.
You may have heard of the difference between Frequentist and Bayesian statistics. The two systems have
different philosophical bases, but, in simpler cases, often end with similar results. Roughly speaking, the
differences between the two are:
Frequentist The population is a fundamental concept. There is just one possible value of the population
mean and variance, i.e. the one that exists in the population. In estimation, we are trying to estimate
these quantities, and in hypothesis testing, we are trying to compare our sample with this population.
Bayesian A fundamental concept is the model of the likelihood of the data given parameters (such as the
mean). The parameters themselves are uncertain. Conceptually, the population itself is generated
from the model, so a number of combinations of parameters and luck may have generated the particular
value of (say) the mean observed in a population. Before we have seen any data, we have an initial
idea about the distribution of the parameters (the prior). The inference process involves using the
data to update this prior distribution to give a distribution of the parameters given the data.
For around a century, there has been controversy about which approach is best. Broadly speaking,
we will be using Frequentist approaches in this course. At the level we are working at here, it will
give very similar results to Bayesian approaches. The important thing is to understand the meaning and
interpretation of our inference.
Chapter 15

Estimation
Parameters In both of these examples we call the population mean and standard deviation parameters. In the
case of the pizza chef, the mean and standard deviation are parameters of the normal distribution, which describe its
centre and width. In some distributions, e.g. an exponential distribution p(x) = exp(−λx) (for x > 0), the mean
and standard deviation are not separate parameters; they are both equal to the inverse of the parameter λ.
Estimation problems The Oxford English Dictionary defines estimation as “the process of forming an approximate
notion of (numbers, quantities, magnitudes, etc.) without actual enumeration or measurement.” In other words,
we would like to get an approximate idea of population parameters without looking at the entire population.
There are two main estimation problems:
1. What is the best way of using the sample to construct a point estimator for each population parameter?
2. How do we construct a confidence interval to indicate how accurate we expect that point estimator to be?
The answers to these questions will depend on the distribution of the data, and this is quite a complex area. We
will give an overview of some of the issues here, but not go into depth.
Generic notation for parameters and estimators To help with making some definitions general, we will refer to
a generic parameter by the Greek letter θ and its estimator by θ̂. We’ll also use the hat notation for the specific
parameters. For example, µ̂ = X means “the point estimator of the population mean µ is the sample mean X ”.
More than one estimator for a parameter In some cases, we can have more than one estimator for a parameter.
For example, both the mean and the median are estimators of the centre µ of a symmetric distribution, so we can
write µ̂ = X̃ as well as µ̂ = X .
Suppose we want to estimate the number of squirrels N in a population. We can do this with a clever
method called capture-recapture:
1. Capture n of the squirrels, tag them so that they can be identified if caught again, then release them.
2. Wait for the squirrels to move around.
3. Recapture K of the squirrels and record the number k of these recaptured squirrels that have tags.
$$
\hat{N} = \frac{nK}{k}
$$
This should work if the capturing and recapturing processes are random. If this is the case, the expected
proportion of tagged squirrels in the whole population n/N is equal to the proportion in the recaptured
sample k/K , hence the estimator.
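
A sketch of a simulation of this estimator (the true population size N and the capture sizes are made-up numbers for illustration):

import numpy as np

rng = np.random.default_rng(seed=3)
N = 500          # true (unknown) number of squirrels -- illustrative value
n, K = 100, 80   # number tagged, number recaptured

tagged = rng.choice(N, size=n, replace=False)       # squirrels that receive tags
recaptured = rng.choice(N, size=K, replace=False)   # second, independent capture
k = np.isin(recaptured, tagged).sum()               # recaptured squirrels with tags
print(n * K / k)                                    # estimate of N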
The bias of an estimator θ̂ of a parameter θ is defined as E[θ̂] − θ. The bias tells us, on average, by how much the
estimate is too high or too low. If, on average, an estimator gets the right result, i.e. E[θ̂] − θ = 0 for any value of
θ, we say that the estimator is unbiased.
Mean squared error of an estimator We would like each estimate of the parameter to be as close to the true
value of the parameter as possible. A measure of how close we can expect estimates to be to the true value is the
mean squared error of the estimator, which is defined as MSE = E[(θ̂ − θ)²].
It turns out that the mean squared error (MSE) can be decomposed into the variance of the estimator and the
squared bias²:
$$
\mathrm{MSE} = E[(\hat\theta - \theta)^2] = V[\hat\theta] + (E[\hat\theta] - \theta)^2 = \text{variance} + (\text{bias})^2 \qquad (15.3)
$$
1 The notation E[X ] means “expected value of the random variable X ”. See Section 4.2 of Modern Mathematical Statistics with Applications
for a definition.
2 The notation V[X] means "(expected) variance of the random variable X". See Section 4.2 of Modern Mathematical Statistics with Applications for a definition.
Figure 15.1: Schematic diagram of bias and variance. In each plot the true value of the parameter is indicated
by the black point θ. The estimator θ̂1 (top left, light blue) is an unbiased estimator, meaning that its expected
value is equal to θ; the distribution is centred on the true value θ. In the top right plot (red) θ̂2 is a biased
estimator; its expected value E[θ̂2 ] is greater than the true value, and the difference between the two is the
bias. In the bottom left plot, the distribution of the estimator θ̂3 (dark blue) is unbiased and has lower variance,
meaning that it is more tightly clustered on the true value. In the bottom right plot, θ̂4 is a biased estimator, but
has lower variance than θ̂1 . It may be preferable, as it has lower mean squared error (MSE).
We might think that we should always prefer unbiased estimators that have minimal MSE (or, equivalently zero
bias and minimal variance). It turns out that for some parameters of some distributions the unbiased estimators
do not have minimal MSE, and that by adding bias we can reduce the variance, thereby reducing the MSE. For
an example, see Example 7.4 in Modern Mathematical Statistics with Applications.
Example: point estimators for the mean of a normal distribution with known variance We'll first consider
the simplest, and rather artificial, case, namely a random sample of n observations X₁, . . . , Xₙ from a normal
distribution in which we know the variance parameter σ² independently of the data. An obvious choice for a
point estimator of the mean parameter µ is the sample mean, i.e. µ̂ = X̄. For a normal distribution, for any value
of n, the standardised variable $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a standard normal distribution (this can be proved quite
easily). From this we can derive that E[X̄] − µ = 0, which means that the bias is zero. We can also see that the
variance of the estimator is
$$
V[\bar{X}] = \frac{\sigma^2}{n} \qquad (15.4)
$$
Since the bias is 0, the variance is equal to the MSE. Here the mean squared error is the square of the standard
error of the mean (SEM), defined in the chapter on Randomness, sampling and simulation. We can see that as
we increase the number of samples n, the MSE decreases, which makes sense.
Example of a senseless biased estimator Note that an estimator does not have to be unbiased or have minimal
variance. For example, we could try to estimate the mean with X + 1. There would then be a bias of 1 and
the MSE would be higher. This particular addition of bias makes no sense, but there are cases (see Modern
Mathematical Statistics with Applications, Section 7.1) where it can make sense.
Example of a biased estimator from Machine Learning As noted in the section on K -fold cross-validation,
cross-validation can be used to estimate the value of a metric (e.g. accuracy) when a classifier is tested on unseen
data. However, if the cross-validation data has been used to choose a hyperparameter, the cross-validated
estimate of the metric is biased. We can regard the value of the metric measured from unseen data as the
quantity being estimated θ, and the cross-validated value of the metric as the estimator θ̂.
We can now understand the reasoning for why the unbiased estimator of the variance σ 2 has an n − 1 in
the divisor (see Why the divisor n − 1 in the sample variance?) The estimator of the mean is:
$$
\hat\mu = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (15.5)
$$
We rearrange this expression to collect terms that comprise two independent random variables (e.g. Xi and
Xj ) or one random variable squared (i.e. Xi2 ):
$$
\begin{aligned}
\hat\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \hat\mu\right)^2
= \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \frac{1}{n}\sum_{j=1}^{n} X_j\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{n-1}{n}X_i - \frac{1}{n}\sum_{j\neq i} X_j\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{(n-1)^2}{n^2}X_i^2 - 2\frac{n-1}{n^2}X_i\sum_{j\neq i} X_j + \frac{1}{n^2}\Big(\sum_{j\neq i} X_j\Big)^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{(n-1)^2}{n^2}X_i^2 - 2\frac{n-1}{n^2}X_i\sum_{j\neq i} X_j + \frac{1}{n^2}\sum_{j\neq i} X_j^2 + \frac{1}{n^2}\sum_{j\neq i}\sum_{k\neq j,i} X_j X_k\right)
\end{aligned} \qquad (15.7)
$$
We then compute the expectation of this estimator and use the properties of expectations to bring it
into a form where we can compare it to the actual population variance σ² = E[X²] − (E[X])²:
$$
\begin{aligned}
E[\hat\sigma^2] &= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{(n-1)^2}{n^2}E[X_i^2] - 2\frac{n-1}{n^2}\sum_{j\neq i} E[X_i X_j] + \frac{1}{n^2}\sum_{j\neq i} E[X_j^2] + \frac{1}{n^2}\sum_{j\neq i}\sum_{k\neq i,j} E[X_j X_k]\right) \\
&= \frac{(n-1)^2}{n^2}E[X_i^2] + \frac{n-1}{n^2}E[X_j^2] - 2\frac{(n-1)^2}{n^2}E[X_i]E[X_j] + \frac{(n-1)(n-2)}{n^2}E[X_j]E[X_k] \\
&= \frac{n(n-1)}{n^2}E[X_i^2] - \frac{n(n-1)}{n^2}E[X_i]E[X_j] \\
&= \frac{n-1}{n}\left(E[X_i^2] - (E[X_i])^2\right) \\
&= \frac{n-1}{n}\,\sigma^2
\end{aligned} \qquad (15.8)
$$
Thus the estimator with divisor n underestimates the population variance by a factor of (n − 1)/n; dividing the
sum of squared deviations by n − 1 instead of n removes this bias, which is why the unbiased sample variance
uses the divisor n − 1.
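
A short simulation sketch of this result (the values of n, the number of repetitions and the normal population are illustrative):

import numpy as np

rng = np.random.default_rng(seed=0)
n, k = 10, 100_000
samples = rng.normal(loc=0, scale=1, size=(k, n))   # true variance sigma^2 = 1
print(samples.var(axis=1, ddof=0).mean())   # divisor n: close to (n - 1)/n = 0.9
print(samples.var(axis=1, ddof=1).mean())   # divisor n - 1: close to 1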
Going back to the example of estimating the mean of a normal distribution with known variance, Equation 15.4
gives the MSE of the estimator. However, the MSE is defined in terms of the population variance σ 2 , which
above we assumed that we know – but in real life we only have the estimate σ̂ 2 from the sample. However, we
can replace any parameters in the formula for the variance with their estimates to give the estimated standard
error of an estimator, which we denote σ̂θ̂ .
Relationship between standard error of the mean and standard deviation of the distribution In the chapter
on Randomness, sampling and simulation, we encountered the standard error of a particular estimator, namely
the standard error of the mean (SEM), denoted σ̂X̄. More generally, we could denote the SEM as σ̂µ̂, since that
notation allows for the possibility of using an estimator other than the sample mean to estimate the mean parameter.
It is important to be clear about the difference between the terms "standard deviation" and "standard error of
the mean". The standard deviation describes the variability of the population, distribution or sample. If the standard
deviation is describing the population or distribution it's a parameter, and we denote it σ; if the standard
deviation is derived from the sample, it's a statistic, and we denote it s.
In the artificial case where we know the standard deviation of the population σ independently of the data,
we know from the Central Limit Theorem that the standard error is related to the standard deviation and the
sample size by:
$$
\sigma_{\hat\mu} = \frac{\sigma}{\sqrt{n}} \qquad (15.11)
$$
This relationship shows that to make the estimator twice as accurate (i.e. halving the SEM) we need to quadruple
the size of the data n. Figure 15.2 uses statistical simulations to demonstrate that the SEM behaves as predicted
for both a normal distribution (top row) and exponential distribution (bottom row).
Estimated standard error of the mean In the much more realistic case where we don't know the standard
deviation of the population, the estimated standard error of the mean is $\hat\sigma_{\hat\mu} = S/\sqrt{n}$, where S is the sample standard
deviation. Since S varies depending on the sample, it's a random variable, and therefore the estimated SEM
itself is a random variable.
Figure 15.3 shows the distribution of the estimated SEM in statistical simulations of drawing n = 10 or n = 100
samples from a normal and an exponential distribution. For n = 10 it can be seen that the estimated SEM varies a lot around
the theoretical value. This means that a particular sample might give us a much higher or lower SEM than the
true value. For n = 100, the estimated SEM is distributed much more tightly around the theoretical value – it's
therefore safer to use the estimated SEM as a substitute for the true SEM.
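
In code, the estimated SEM of a sample x can be computed directly, or with scipy.stats.sem (a sketch; the sample itself is illustrative):

import numpy as np
from scipy.stats import sem

x = np.random.default_rng(seed=0).exponential(scale=1, size=100)
sem_by_hand = np.std(x, ddof=1) / np.sqrt(len(x))
sem_scipy = sem(x)                     # scipy uses ddof=1 by default
print(sem_by_hand, sem_scipy)          # the two values agree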
[Figure 15.2: Distributions of the sample mean for samples of size n = 10 and n = 100 drawn from a normal and an exponential distribution; the SEM measured from the simulations (e.g. 0.314 and 0.313 for n = 10) closely matches the theoretical value of 0.316.]
Figure 15.3: Distribution of the estimated standard error of the mean. Top row: normal probability distribution with
µ = 1 and σ = 1 (left) and exponential probability distribution with λ = 1 (right). Middle row: histograms
showing the distribution of the estimated SEM calculated from each of 10,000 simulations of sampling 10 random numbers
from the normal and exponential distributions. Bottom row: same as middle row, but with sample size n = 100.
Chapter 16
Confidence intervals
Size of confidence interval We have chosen the length of the intervals to ensure that, if we carried on estimating
the mean and the interval, about 95% of intervals would contain the true mean. To determine this length, we use
the z critical values of the standardised normal distribution with zero mean and variance 1 (Figure 16.2), which
we refer to as the z-distribution. We define the z critical value zα as the value of z in a normal distribution
which has the area α under the curve to its right. If we want the intervals to contain the true mean 95% of the
time, we need to make sure that the mean is within the central 95% of the distribution. This implies that we
need 2.5% of the area under the curve to the right of the upper bound, so we look up z0.025 in a statistical table
or a function in a stats package and find that z0.025 = 1.96 – we will show how to do this later. The z critical
value of 1.96 tells us that the length of the lines on the side of each estimate of the mean should all be 1.96
times the standard error of the mean (SEM).
We may want to be more or less certain of whether the mean is contained in a confidence interval. In this
case we can look up the z critical value for our chosen level of confidence. We can also decide to express the
confidence interval in terms of the multiples of the SEM. For example confidence intervals of plus or minus one
SEM correspond to a 68% confidence interval.
Reminder It is worth remembering that these simulations are artificial in the sense that we can repeat many
samples. In real life we only get one sample, which does or does not contain the true value – but we don’t know.
Looking up a z critical value To look up a z critical value, you can use the Python scipy package. For example,
to find z0.2 you would use:
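(The code below is a sketch that mirrors the t-distribution example later in this chapter; isf is scipy's inverse survival function, discussed with Figure 16.3.)

from scipy.stats import norm

alpha = 0.2
# isf (inverse survival function) returns the z value with area alpha to its right
z_cv = norm.isf(alpha)
print(z_cv)   # approximately 0.84 (see Figure 16.3)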
Figure 16.1: Principle of confidence intervals. We repeat a simulation using a sample size of n = 100 to estimate
the sample mean of a normal distribution with mean 0 and standard deviation 1. The black vertical line indicates
the true mean, the blue dots indicate the sample means, and the blue horizontal lines indicate the 95% confidence
intervals obtained in each of the 20 repetitions. It can be seen that 19 of the confidence intervals do contain the
population mean, but one of them does not.
Figure 16.2: Confidence intervals of the z-distribution. The intervals containing various amounts of probability
mass under a standardised normal distribution with mean 0 and variance 1 are shown. The 95% confidence
interval (blue) is [−1.96, 1.96] and has 2.5% of the probability mass in each tail. The 80% confidence interval
is [−1.28, 1.28]. The amount of probability mass contained in one standard deviation is 68%. In general for
a confidence interval of 100(1 − α)%, the upper and lower boundaries are determined by the z critical value
zα/2 . E.g. with the 95% confidence interval α = 0.05 and there is 2.5% of the area of the curve above the upper
boundary of the confidence interval.
Figure 16.3: Concept of the z critical value. Top: zα is the value of z in a normal distribution such that
the area under the curve to the right of zα (green) is equal to α, i.e. $\alpha = \int_{z_\alpha}^{\infty} p(z)\,dz$. Bottom: the blue curve
shows the cumulative distribution function $\Phi(z) = \int_{-\infty}^{z} p(z')\,dz'$. The orange curve shows the "survival function"
sf(z) = 1 − Φ(z). The survival function of z is exactly the area to the right of z under the pdf. Therefore we want
to look up the inverse survival function to determine zα from α, as indicated by the green lines.
In this probability statement, the upper and lower bounds of the interval are random variables, since they are
based on the estimators and the estimated standard error, which are themselves random variables derived from
the sample.
Expression in terms of random variable in fixed interval We can rearrange the definition of the confidence
interval in terms of a standardised variable (θ̂ − θ)/σ̂θ̂ :
$$
P\left(-b < \frac{\hat\theta - \theta}{\hat\sigma_{\hat\theta}} < a\right) = 1 - \alpha \qquad (16.2)
$$
Figure 16.4: Distribution of time from making a reservation to the reservation time (“preparation time” ) in
restaurants using the “air” booking system in Japan in the period January 2016–April 2017.
Because this standardised variable is derived from the sample, it fits our definition of a statistic. Furthermore, it
is composed of two statistics, the estimator θ̂ and the estimated standard error σ̂θ̂ .
1. Under some assumptions about the distribution of the data Xi and the number of samples n we can derive
the distribution of (θ̂ − θ)/σ̂θ̂ , which will then tell us the values of −b and a at the 100α/2th centile and
the 100(1 − α/2)th centile.
2. More generally we can use a type of statistical simulation called a bootstrap estimator to derive the
confidence interval.
We’ll demonstrate the first approach by continuing with our simplified example of a normal distribution with
known parameters. In the following section we’ll then cover the bootstrap estimator.
Example: confidence interval for the mean of a normal distribution with known variance In the example of
sampling from a normal distribution introduced in the last section, we know the population variance σ², and by
definition, the standard error of the mean is σ̂θ̂ = σ/√n. Because the population variance σ² is known, it's
not a random variable, and therefore the SEM σ̂θ̂ isn't a random variable either. The standardised variable in
Equation 16.2 is therefore
$$
\frac{\hat\theta - \theta}{\hat\sigma_{\hat\theta}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (16.3)
$$
and only contains one random variable, X . This makes it quite easy to deal with, since we know that this
distribution is a standard normal distribution, so we can define a 95% confidence interval by setting a and b to
be values at which the cumulative distribution function (cdf) is equal to 2.5%(= α/2) and 97.5%(= 1 − α/2). In
this case the values a = b = 1.96 satisfy these conditions. Generally we set a and b symmetrically, so that
there is equal weight in the “tails” of the distribution (Figure 16.2).
Table 16.2: Summary statistics of population and sample of preparation times, generated by the pandas describe
function.
Population Sample
count 92378.00 1000.00
mean 8.30 8.06
std 25.65 27.72
min 0.00 0.00
25% 0.21 0.17
50% 2.08 1.96
75% 7.88 6.92
max 393.12 364.96
Confidence intervals for the mean of a large sample The central limit theorem states that the distribution
of the sample mean of a "large" sample from any distribution should be approximately normal. How large the sample needs
to be depends on the distribution, but Figure 16.1 demonstrates that sample means of n = 100 samples from
an exponential distribution already appear to be fairly normally distributed, with the SEM, as predicted, equal to the
standard deviation of the exponential distribution divided by √n. This means that we can use the procedure
above to find confidence intervals.
Confidence intervals for the mean of an empirical distribution Up until now, we have considered estimating
parameters from theoretical probability distributions, such as the normal distribution or the exponential distribution.
We’ll now consider how we can estimate the parameters of an empirical distribution, i.e. real-world data, from a
sample of that distribution.
As an example, we will take the population of times between making a reservation and the time of the
reservation itself in Japanese restaurants using the “air” booking system. The full population contains 92378
times (Figure 16.4, left) and we’ve created a random sample of 1000 of these times (Figure 16.4, right). In real
life, if we had the full set of data, there would not be any point in creating this random sample of times, but we
do so here to demonstrate how well we can estimate confidence intervals. From now on imagine that the sample
of 1000 times is all that we have available to us. It’s important to notice that the distribution of the sample
resembles the population distribution, even though it is rougher.
Table 16.2 shows the summary statistics for the population and the sample. We can see that the estimates
for the mean, standard deviation and centiles from the sample are all similar to the true population values. From
the table we can see that the population mean is µ = 8.30 days and the sample mean is x = 8.06 days. The
sample mean would be different if we'd happened to have taken a different random sample.
From the summary statistics of the sample, we have x̄ = 8.06 days and the standard deviation s = 27.72
days. Our estimator for the mean is θ̂ = x̄ = 8.06 days. Our estimator for the standard error of the mean
is σ̂θ̂ = s/√n = 27.72/√1000 = 0.88 days. The 95% confidence interval for the mean in days is therefore
(θ̂ − 1.96σ̂θ̂, θ̂ + 1.96σ̂θ̂) = (6.34, 9.78).
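
In code, this calculation might look as follows (a sketch; sample_times is assumed to be a NumPy array holding the 1000 sampled preparation times in days):

import numpy as np
from scipy.stats import norm

n = len(sample_times)
x_bar = sample_times.mean()
sem_hat = sample_times.std(ddof=1) / np.sqrt(n)
z = norm.isf(0.025)                          # 1.96 for a 95% interval
ci = (x_bar - z * sem_hat, x_bar + z * sem_hat)
print(ci)                                    # approximately (6.34, 9.78) for this sample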
Reporting confidence intervals When reading scientific papers, there are various ways of reporting confidence
intervals:
• M=8.06, CI=6.34–9.78. Here “M” stands for mean and “CI” stands for confidence interval.
• 8.06 ± 0.88 (± 1 SEM). This is a 68% confidence interval, though the confidence interval isn’t specified in
terms of area under the curve.
Figure 16.5: Bootstrapping: Baron Münchhausen pulls himself and his horse out of a swamp by his pigtail.
Public domain image from Wikipedia’s article on bootstrapping.
Bootstrap procedure for finding a confidence interval for the mean We will start with a large sample of size n from
the data, which has a mean x̄. By large, we mean large enough that the sample resembles the population
distribution. Of course, it is not possible to know this exactly, so the larger the better. We decide to take B
bootstrap samples. Common choices are 1,000, 5,000 or 10,000. More samples are generally better, but
bootstrapping can be computationally expensive, and fewer samples can also give reasonable results.
Here is the procedure:
• For j in 1, . . . , B
– Take sample x ∗ of size n from the sample with replacement
– Compute the sample mean of the new sample xj∗
• To compute the bootstrap confidence interval, we find the centiles of the distribution at 100α/2 and
100(1 − α/2). We can do this by arranging the sample means xj∗ in order from lowest to highest, and pick
xj∗ at k = α(B + 1)/2 to be the lower end of the CI and pick xj∗ at k = B − α(B + 1)/2 to be the upper end
of the CI.
Figure 16.6: Demonstration of bootstrap mean applied to restaurant reservation time data (Figure 16.4). The top
row shows the distributions obtained from the first 3 of 10000 bootstrap samples. Although the distributions are
similar to each other, they are not exactly the same, and the sample mean of each is different. The bottom figure
is the distribution of all 10000 of these bootstrap sample means. The mean of the original sample is shown, as are
the 95% and 80% confidence intervals.
• We can also compute the bootstrap estimator of the variance of the mean:
$$
s^2_{\text{boot}} = \frac{\sum_{j=1}^{B} (\bar{x}^*_j - \bar{x})^2}{B - 1}
$$
The advantages of the bootstrap procedure are that we can use it for any estimator, e.g. the median, and that
we do not need to make any assumptions about the distribution of the estimator.
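
A minimal sketch of the bootstrap for the mean is given below (the array sample and the choice of B are assumptions, and the centiles are computed with np.percentile rather than by the explicit sorting described above; the lab exercise asks you to build a fuller version yourself):

import numpy as np

rng = np.random.default_rng(seed=0)
B, alpha = 10_000, 0.05
n = len(sample)                  # sample is assumed to be a 1-D array of observations
boot_means = np.empty(B)
for j in range(B):
    resample = rng.choice(sample, size=n, replace=True)
    boot_means[j] = resample.mean()
ci = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(ci)                        # bootstrap 95% confidence interval for the mean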
Example of bootstrap estimator applied to the mean We’ll now apply the bootstrap estimator to give us a
confidence interval for the mean (Figure 16.6). For each of our 10,000 bootstrap samples, we’ll resample 1000
samples with replacement from our sample of 1000. Each of these samples will be a distribution (top row of
Figure 16.6), from which we can compute the 10,000 bootstrap means. Then we’ll plot the distribution of the
bootstrap means (bottom row of Figure 16.6) and find the 95% and 80% confidence intervals. In this case we can
see both the 95% and 80% confidence intervals contain the population mean (8.30, Table 16.2). However, if we
replicate the experiment with a different initial random sample of 1000, in around 5% of cases we should expect
that the 95% confidence interval does not contain the mean.
We’ll leave this as a lab exercise for you to implement, though you will find that you get different answers
for the confidence intervals, depending on the state of the random number generator.
Comparison of bootstrap confidence intervals with normal approximation The 95% confidence interval obtained
via the bootstrap procedure is (6.46, 9.90) days, which is very similar to the confidence interval obtained by the
normal approximation, (6.34, 9.78) days, suggesting that the normal approximation is quite accurate at a sample
size of n = 1000. The bootstrap interval is slightly shifted to the right, reflecting the right skew of the underlying data.
General formulation of bootstrap estimator A great advantage of the bootstrap is that we can easily apply it
to statistics other than the mean. Here is the general procedure for estimating the confidence interval for a
generic estimator θ̂:
• For j in 1, . . . , B
– Take a sample x∗ of size n from the sample with replacement
– Compute the statistic θ̂∗j from this new sample
• To compute the bootstrap confidence interval, we find the centiles of the distribution at 100α/2 and
100(1 − α/2).
This procedure works well for measures of centrality such as the median, and for the variance. It doesn’t
work so well for statistics of extremes of the distribution, such as the maximum or minimum.
How big should a confidence interval be? Should we choose the 95% confidence interval or the 80% confidence
interval? The answer to this question depends on the problem. For example, suppose we have a machine that
makes tens of thousands of ball bearings for aircraft jet engines every day. Each ball bearing needs to have
a diameter of 2 ± 0.0001mm for the engine to work safely. We measure the diameter of a sample of the ball
bearings every day. Because this is a safety-critical application, we need to have high confidence (say 99.999%)
that the ball bearings are in the range 2 ± 0.0001mm. This might require a large sample size, but it’s worthwhile
because the consequences of getting it wrong could be catastrophic.
On the other hand, suppose we are estimating the number of red squirrels in a population so that we know
how much red-squirrel-friendly food to put out for them over winter. We might want to leave out a bit more than
we expect they need, and we're happy to accept a 10% chance that the true number of squirrels is greater
than the upper end of a confidence interval. We therefore compute the 80% confidence interval and put out enough food
for the number of squirrels at the upper end of the interval. There's a 10% chance that we might not be providing
for enough squirrels, but it's not as catastrophic as in the aircraft situation (depending on how much you value
red squirrels compared to humans).
Figure 16.7: The t-distribution for 3 degrees of freedom and 10 degrees of freedom, with normal distribution for
comparison. 2.5% t critical values and z critical values are shown.
Upper and lower confidence bounds In this case, we’re not worried about our estimate being too low, so we
only need to compute the upper confidence bound – we would quote a mean number of squirrels and an upper
limit.
The t-distribution We could use the bootstrap estimator to estimate confidence intervals. However, in this
special case, there is another option. There’s a theorem that states that when X1 , . . . , Xn is a random sample of
size n from a normal distribution with mean µ, the random variable
$$
T = \frac{\bar{X} - \mu}{\hat\sigma_{\bar{X}}} \qquad (16.4)
$$
is distributed as a t-distribution with n − 1 degrees of freedom, where the t-distribution with ν degrees of
freedom has the probability density function depicted in Figure 16.7, which is given by the equation:
$$
p_\nu(t) = \frac{1}{\sqrt{\pi\nu}}\,\frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\,\frac{1}{(1 + t^2/\nu)^{(\nu+1)/2}} \qquad (16.5)
$$
where Γ(x) is a gamma function. We will not prove this theorem here; in Modern Mathematical Statistics with
Applications Section 6.4 there is the sketch of a proof.
The t-distribution is very similar in shape to the z-distribution: it is bell-shaped, symmetrical, and centred on
0. However, for small numbers of degrees of freedom, the t-distribution has longer tails than the z-distribution.
This means that the tails of a t-distribution contain a greater fraction of the weight of the distribution than
do the tails in a z-distribution. We define the t critical value tα,ν as the value of t in a t-distribution with ν
degrees of freedom which has the area α under the curve to its right.
For small degrees of freedom, the t critical values are considerably bigger than the z critical values of the
normal distribution (Figure 16.7). As the number of degrees of freedom increases, the t-distribution approaches
a z-distribution. The distribution with 40 degrees of freedom (not shown in the figure) looks very similar to a
z-distribution.
Looking up a t critical value To look up a t critical value, you can use the Python scipy package. For example,
to find t0.025,10 you would use:

from scipy.stats import t

alpha = 0.025
nu = 10
t_cv = t(nu).isf(alpha)
print(t_cv)
You can also look up t critical values and z critical values in statistical tables, such as the ones in the
appendices of Modern Mathematical Statistics with Applications. Table 16.1 shows an abbreviated example
of such a table. Each row contains the t critical values for a given number of degrees of freedom at various
levels of α. The final row, with an infinite number of degrees of freedom, gives the z critical values for these
values of α. The full tables include values for more degrees of freedom.
Using the t-distribution to derive a confidence interval The 100(1 − α) percent confidence interval around a
mean x̄ of a sample of n values with estimated SEM σ̂X̄ derived using a t-distribution is:
$$
\left(\bar{x} - t_{\alpha/2,\,n-1}\,\hat\sigma_{\bar{X}},\ \ \bar{x} + t_{\alpha/2,\,n-1}\,\hat\sigma_{\bar{X}}\right) \qquad (16.6)
$$
Note that we have used the t critical value tα/2,ν. Here the number of degrees of freedom is one less than the
sample size (ν = n − 1). Also, we have divided α by 2 because we want upper and lower bounds to the
confidence interval. It might be that we only need an upper bound, as we considered when we were estimating
squirrel numbers earlier. In this case we would just quote x̄ + tα,n−1 σ̂X̄. This is still a 100(1 − α) confidence
interval, since the interval from −∞ to the upper bound contains 100(1 − α)% of the area under the t-distribution.
To continue the squirrel example, suppose we want to find a 95% confidence interval for the weight. The
95% confidence interval implies α = 0.05 and ν = n − 1 = 31. We would then look up t0.025,31 = 2.040 and
substitute it into Equation 16.6 along with the sample mean and estimated SEM, and then use this to generate
the confidence interval, which we could quote as µ̂ = 341.0 ± 8.0g (95% confidence interval, n = 32). This is
a bit wider than the interval we would obtain using the corresponding critical value of a normal distribution
z0.025 = 1.96.
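
A sketch of this calculation in scipy (the sample mean is the made-up squirrel value from the text; the estimated SEM of 3.92 g is chosen here so that the interval matches the quoted ±8.0 g):

from scipy.stats import t

n = 32
x_bar, sem_hat = 341.0, 3.92          # sample mean (g) and estimated SEM (g) -- illustrative
t_cv = t(n - 1).isf(0.025)            # t critical value, approximately 2.040
ci = (x_bar - t_cv * sem_hat, x_bar + t_cv * sem_hat)
print(ci)                             # approximately (333.0, 349.0), i.e. 341 +/- 8 g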
Chapter 17

Hypothesis testing and p-values

Hypothesis testing involves two aspects:
1. Deciding on whether a hypothesis or model is compatible with data from observational studies and
randomised experiments.
2. If the hypothesis is compatible with the data, investigating the mechanisms specific to the data, e.g. the
biological effect of chocolate on the body or the process by which a jury panel was selected.
In the course we are going to focus on the statistical aspects (Aspect 1), but it’s worth remembering that the
question is not answered once we’ve completed this step – it should prompt further investigation of the question
rather than ending the inquiry (Yanai and Lercher, 2020). Furthermore, as Yanai and Lercher (2020) illustrate
rather amusingly, it is important to explain data before undertaking hypothesis testing – a good visualisation
can reveal features of the data that a hypothesis test can’t.
Method of hypothesis testing At the core of hypothesis testing are the null hypothesis and the alternative
hypothesis:
The null hypothesis H0: The claim that we initially assume to be true, formalised as a statistical model, e.g. "The
jury panel was chosen by random selection from the population in a district."
The alternative hypothesis Ha: The claim that is contradictory to H0, typically not formalised as a statistical
model, e.g. "The jury panel was chosen by some other, unspecified, method."
The aim of hypothesis testing is to either reject or not reject the null hypothesis. Note that we do not "accept"
the null hypothesis as true; we are just saying that it has not been shown to be false.
Figure 17.1: Distributions of the number of Black people T0 (test statistic) on a panel of 100 under the null
hypothesis that the jury was randomly selected from a population that is 26% Black and 74% non-Black. Left:
distribution arising from 10 000 statistical simulations. The red line indicates the number of Black jurors in
Swain versus Alabama (1965), the magenta line indicates t0 = 15 and the yellow line indicates t0 = 20. Right:
Binomial distribution (blue dots) for n = 100 and p = 0.26. Normal approximation (orange curve) with µ = np
and σ = √(np(1 − p)).
Test procedure The procedure to carry out a hypothesis test, which we call the test procedure, consists of:
1. Deciding on a test statistic, which is a function of the sample data, e.g. the number of Black people in a
jury panel.
2. Determining what the distribution of the test statistic would be if it arose from the null hypothesis statistical
model.
3. Either:
(a) Deciding on a rejection region, i.e. regions of the distribution of the test statistic under H0 in which
we should reject H0 . Typically, these are the extremities of the distribution. If our test statistic falls
into the rejection region, we reject H0 ; otherwise, we don’t reject it.
(b) Returning a p-value, which tells us how compatible the test statistic is with the distribution predicted
by chance from H0 .
Application of test procedure to example In the topic on Randomness, sampling and simulation, we looked at
the example of Swain versus Alabama (1965), in which the question was “if 8 Black people were chosen for a
jury panel of 100 people, but the fraction of Black people in the population was 26%, does this show bias against
Black people?” We found the distribution of the test statistic under the null hypothesis by simulating the null
hypothesis model of sampling from a Bernoulli distribution with P(Black) = 0.26. In this case probability theory
also tells us that the distribution is a binomial distribution with n = 100 and p = 0.26. We found that there
were no replications in which 8 Black members were chosen (Figure 17.1) – the simulated numbers were always
higher.
We did not consider rejection regions or p-values. Since the observed data (8 Black people on the panel; red
line in Figure 17.1) were inconsistent with the range of predictions produced by the null hypothesis, it seemed
very clear that we should reject the null hypothesis. But what would we have decided if the number of Black
people had sat within the distribution of simulated values, e.g. 15 (magenta line) or 20 (yellow line)?
Rejection regions We might want to specify the rejection region as the bottom 5% of the probability mass,
i.e. the region that seems unusually low (Figure 17.2, left, region to left of orange boundary). If the observed test
statistic falls into that region, we might “reject the hypothesis at the 5% level (one-tailed test)”. We call this a
one-tailed test because the rejection region occupies only one tail of the distribution. This is justified, as the
Figure 17.2: Rejection regions. Lower-tailed (left) and two-tailed (right) rejection regions are shown for the
normal approximation to the distribution of the null hypothesis model in the Swain versus Alabama example. The
observed statistic t0 = 15 is shown with a magenta dot. It lies in the p < 0.01 rejection region for a lower-tailed
test and in the p < 0.05 rejection region for a two-tailed test.
alternative hypothesis was implicitly “the number of Black people selected is below the number we would have
expected by chance”.
If we know the distribution of our null hypothesis model, we can look up statistical tables to determine the
boundaries of rejection regions. E.g. in this case, the number n is large enough that we can approximate the
binomial distribution with a normal distribution with mean µ = np and variance σ 2 = np(1 − p). This means
that the standardised statistic
Z = (T0 − µ)/σ    (17.1)
is normally distributed. At the edge of the rejection region, this statistic is equal to the z critical value z0.95 ,
which has 95% of the probability mass to its right. We can then rearrange Equation 17.1 to find the edge of the
rejection region in terms of the original statistic:
T0 = µ + σ z0.95 (17.2)
If a test statistic in a hypothesis test is distributed according to a normal distribution, the hypothesis test is
sometimes referred to as a “z-test”.
One-tailed and two-tailed tests We could have formulated the alternative hypothesis as “the number of Black
people selected is different from (i.e. above or below) the number we would have expected by chance”. In this
case we would perform a two-tailed test (Figure 17.2, right) by setting the rejection regions to be the bottom
2.5% and the top 2.5% of the probability mass of the distribution. We would “reject the hypothesis at the 5% level
(two-tailed test)”.
17.2 p-values
Principle of p-values The principle of p-values is that we set the boundary of the rejection regions to be where
the data is, and then report the probability mass in the resulting rejection regions as the p-value.
Determining p-values from statistical simulations Had there been 15 Black people on the panel in Swain
versus Alabama (magenta line), a fraction 0.0062 of the 10 000 simulations produced panels with 15 or fewer
Black members. This would therefore give the p-value p = 0.0062, i.e. 0.62%. This certainly calls into question
whether the observed data are compatible with the null hypothesis.
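A minimal sketch of this kind of simulation-based p-value calculation, assuming the null hypothesis model described above (100 panel members drawn at random from a population that is 26% Black); the course's own simulation code may differ in detail:
import numpy as np

rng = np.random.default_rng(0)
n_reps = 10_000
counts = rng.binomial(n=100, p=0.26, size=n_reps)   # simulated numbers of Black panel members under H0

t0 = 15
p_value = np.mean(counts <= t0)   # fraction of simulations at least as extreme (lower tail)
print(p_value)                    # approximately 0.006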
Table 17.1: P-values computed by various methods for various observed values of t0 in Swain versus Alabama
(1965).
Suppose that there had been 20 Black people on the jury panel (yellow line in Figure 17.1). The corresponding
rejection region is 20 or fewer Black people on the jury. A fraction of 0.101 of the simulations are in this region,
so the p-value is p = 0.101. We would tend not to reject the null hypothesis at this size of p-value, but this
would not mean that the null hypothesis was true.
Sometimes the p-value is reported relative to a round figure rejection region, e.g. in the case with 15 Black
people on the jury, p < 0.01, indicating that we could “reject the null hypothesis at the 1% level”. However,
supplying the actual p-value gives more information than just reporting the rejection region.
Determining p-values from probability distributions Sometimes it is straightforward to compute the probability
distribution implied by the null hypothesis. In the Swain versus Alabama example, it is a binomial distribution
with n = 100 and p = 0.26 (Figure 17.1, right). As we are looking at a lower-tailed test, the p-value is the
cumulative distribution function of the binomial distribution, cut off at t0 , the observed number of Black people
on the jury panel:
P(T0 ≤ t0) = B(t0; n, p) = Σ_{t=0}^{t0} b(t; n, p)    (17.3)
where b(t; n, p) is the probability of t successes in a binomial distribution with n trials and success probability p;
B(t; n, p) is the corresponding cumulative distribution function (cdf). Stats packages have functions to compute
the cdf for various distributions, and the values for the binomial are shown in Table 17.1 along with the simulated
values.
Also shown is the normal approximation to the binomial, in which we set µ = np and σ = √(np(1 − p)). The
p-values are the values of the normal cumulative distribution function at the standardised value
z = (t0 − µ)/σ    (17.4)
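A sketch of how these distribution-based p-values might be computed with scipy, assuming the lower-tailed test with n = 100 and p = 0.26:
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.26
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

for t0 in [8, 15, 20]:
    p_binom = binom.cdf(t0, n, p)            # exact binomial p-value, P(T0 <= t0)
    p_norm = norm.cdf((t0 - mu) / sigma)     # normal approximation (Equation 17.4)
    print(t0, round(p_binom, 4), round(p_norm, 4))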
Why use rejection regions? The rejection region method works well with printed statistical tables, in which
critical values of z and other distributions are available only for particular cut-off values, e.g. 0.01, 0.05. With
computer packages it is now possible to define the rejection region relative to the observed data rather than a
pre-set cut-off.
The p-value is the probability, calculated assuming the null hypothesis is true, of obtaining a
value of the test statistic at least as contradictory to H0 as the value calculated from the available
sample. (Modern Mathematical Statistics with Applications, p. 456)
The whole topic of the interpretation and use of p-values is complex and highly contested. In fact, it took 20
statisticians 2 days, and many subsequent days of drafting, to produce the American Statistical Association's
statement on p-values (Wasserstein and Lazar, 2016).
What p-values are We quote 2 of the 6 points in the statement here. Firstly, what p-values are:
P-values can indicate how incompatible the data are with a specified statistical model…
The smaller the p-value, the greater the statistical incompatibility of the data with the null
hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility
can be interpreted as casting doubt on or providing evidence against the null hypothesis or the
underlying assumptions. (ASA Statement on Statistical Significance and P-values)
In the Swain versus Alabama example where we imagined there were 15 Black people on the jury, the small
p-value (p = 0.0062) indicates that the data (here 15 Black people on the panel) are quite incompatible with
the null hypothesis statistical model (here that Black and non-Black people were drawn from the population at
random). The low p-value casts considerable doubt on the hypothesis. Of course the actual data (t0 = 8) has a
vanishingly small p-value (Table 17.1).
P-values do not measure the probability that the studied hypothesis is true, or the probability
that the data were produced by random chance alone.
Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or
about the probability that random chance produced the observed data. The p-value is neither. It is
a statement about data in relation to a specified hypothetical explanation, and is not a statement
about the explanation itself. (ASA Statement on Statistical Significance and P-values)
“Statistical significance” A widespread practice in scientific literature is to take p-values of less than p = 0.05
as indicating statistical significance, i.e. that the null hypothesis should be rejected. Values just below 0.05
provide only relatively weak evidence against the null hypothesis, so sometimes more stringent thresholds are
used, e.g. p = 0.01 and p = 0.001. In scientific papers and the output from stats packages you will sometimes
see these values indicated with asterisks: * for p < 0.05, ** for p < 0.01 and *** for p < 0.001.
There is no “correct” answer about what the right level of significance is. The p < 0.05 value was suggested
in a paper by the statistician Ronald Fisher1 , who invented the hypothesis test, but it simply seemed “convenient”
to him for his purposes. As in the discussion on confidence intervals (How big should a confidence interval be?),
the value we choose to use may depend on the application. For example, we would demand a very low p-value
when testing the null hypothesis that a new drug has no effect on the death rate of patients. We might accept a
slightly higher p-value for the hypothesis that it has no positive effect on symptoms. In less mission-critical
scientific applications, a higher p-value threshold may be acceptable.
17.3 Testing for goodness of fit to a model
The null hypothesis H0 : The jury panels were chosen by random selection from the population in a district.
The alternative hypothesis Ha : The jury panels were chosen by some other, unspecified, method.
1 Fisher studied under Pearson, and developed a huge body of modern statistics. He also edited the Annals of Eugenics and had
controversial views on race.
Table 17.2: Alameda County jury panel data. The top row shows the estimated proportions of 5 ethnic groups
(Caucasian, Black/African American, Hispanic, Asian/Pacific Islander and Other) in Alameda County. The second
row (Observed panel numbers) shows the total number in each group on 11 jury panels from 2009–2010. There
was a total of 1453 on the 11 jury panels (final column). The third row (Expected panel numbers) shows the
numbers we would expect from each group if the panels had been selected randomly from the population. The
final row shows the disparity between the observed and expected numbers, (Observed − Expected)²/Expected,
for each group. The total disparity is in the final column.
With two categories, it’s easy to see that the number of Black people could be a test statistic. But in this case,
there are 4 numbers that describe the outcome of any simulation (we can always compute the number in the 5th
category if we know the total number and the numbers in 4 categories). We can’t have 4 test statistics, so we
need to create a statistic that indicates the disparity between the observed and expected outcomes.
Suppose we call the population proportions of each of k groups pi and the observed numbers in each group
ni. The total number sitting on jury panels is n = Σ_i ni. We can compute the numbers we would expect to be on
jury panels as npi (third row of the table). One measure of disparity would be the sum of the squared differences:
Σ_{i=1}^{k} (ni − npi)²
This looks at the absolute squared differences between the expected and observed values for each category. If
we expected np1 = 100 in one category and observed n1 = 95, this expected-observed pair would contribute 25
to the sum. A difference of np2 = 10 (expected) and n2 = 5 (observed) would also contribute 25 to the sum.
However, in relative terms, the difference in the first expected-observed pair is 5%, whereas in the second
pair it is 50%.
This motivates us to scale the disparity measure by dividing by the expected number in each
category, to create a statistic that we call chi-squared, written using the Greek symbol χ²:
χ² = Σ_{i=1}^{k} (ni − npi)²/(npi)    (17.5)
The components of the χ 2 statistic are seen in the final row of Table 17.2, as is the value of χ 2 = 357.36 for the
observed values (in the “Total” column).
Statistical simulation We can now run a statistical simulation to generate the expected distribution of χ 2 .
For each repetition we simulate the numbers in each category by drawing from a multinomial distribution with
parameters n and pi . We then compute and store χ 2 for that simulation, which gives us the simulated distribution
shown in blue in Figure 17.3. We can immediately see that the observed value of χ 2 = 357.36 is off the scale of
the graph, indicating that it has a much bigger value than is compatible with the null hypothesis, so we reject
the null hypothesis.
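A sketch of such a simulation with numpy, assuming hypothetical population proportions (the actual proportions are in Table 17.2, which is not reproduced here) and the panel total of 1453 given in the caption:
import numpy as np
from scipy.stats import chi2

p = np.array([0.54, 0.18, 0.12, 0.15, 0.01])   # hypothetical proportions, for illustration only
n = 1453
rng = np.random.default_rng(0)

expected = n * p
samples = rng.multinomial(n, p, size=10_000)                   # simulated panel counts under H0
chi2_sim = ((samples - expected) ** 2 / expected).sum(axis=1)  # chi-squared statistic for each repetition

chi2_obs = 357.36
print(np.mean(chi2_sim >= chi2_obs))       # simulated p-value (effectively 0)
print(chi2.sf(chi2_obs, df=len(p) - 1))    # p-value from the chi-squared distribution with k - 1 degrees of freedom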
Chi-squared distribution It turns out that, providing every expected value npi is greater than 5, the χ² statistic
is distributed approximately according to a χ² probability distribution with k − 1 degrees of freedom, shown in
orange in Figure 17.3. The fit between the probability distribution and the simulated distribution is clear.
2 Sometimes you may see the letter X used instead of χ.
Figure 17.3: Distribution of χ 2 for jury panel selection in Alameda County. Simulations shown in blue and
theoretical χ 2 distribution with 4 degrees of freedom shown in orange.
Goodness-of-fit Large values of the χ² statistic (i.e. the upper tails of the distribution) indicate a poor goodness-
of-fit between the model and the data. χ² tests therefore tend to be upper-tailed. However, if the statistic had a
very low value of χ², we might be suspicious that the data had been fiddled with.
The χ² statistic can be used to assess the goodness-of-fit of many types of model and data, not just this
proportion example. If we find a χ² value with a p-value greater than the desired cut-off, this suggests that we
should not reject the model.
Testing for independence with two-way contingency tables We may have multiple populations (e.g. males
and females) and multiple categories (e.g. depressed or not depressed). We can arrange these in a two-way
contingency table (Table 17.3).
We want to test the null hypothesis that being depressed is independent of whether you are male or female. In
other words P(X = x, Y = y) = P(X = x)P(Y = y). Using a notation similar to Table 17.3 (right), we can write
this probability as pij = pi• p•j , where pi• is the marginal probability of an item being in category i and p•j is
the marginal probability of an item being in category j. Our best estimates of the marginal probabilities are
pi• = ni•/n••  and  p•j = n•j/n••    (17.6)
Therefore the best estimates of the numbers in each cell are
êij = n•• pij = ni• n•j / n••    (17.7)
The χ² statistic is computed as
χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − êij)² / êij
Table 17.3: Left: Contingency table of the number of depressed and not depressed people in a population of
females and males; data based on a prospective study Bornioli et al. (2020). Right: General symbolic version of
the two-way contingency table. There are I rows and J columns. The number of items falling into a cell in the
ith row and jth column is denoted nij. The total in the ith row is denoted ni• = Σ_{j=1}^{J} nij, the total in the
jth column is n•j = Σ_{i=1}^{I} nij. The grand total is n•• = Σ_i ni• = Σ_j n•j.
Table 17.4: Expected numbers in contingency table, in example (left) and in symbols from Equation 17.7 (right).
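As an illustration, a sketch of a chi-squared test of independence on a two-way contingency table using scipy; the counts below are hypothetical, not the values from Table 17.3:
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[150, 850],    # e.g. females: depressed, not depressed (hypothetical counts)
                     [100, 900]])   # e.g. males: depressed, not depressed (hypothetical counts)

# correction=False so the statistic matches the plain chi-squared formula above
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, p_value, dof)
print(expected)   # the expected counts ê_ij from Equation 17.7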
17.4 Issues in hypothesis testing
Data snooping and p-value hacking It is very tempting to try out many experiments in order to get a p-value
of less than 0.05. However, the more experiments are run, the more chance there is of Type I errors – i.e. rejecting
the null hypothesis when it is true.
“Data snooping” or “p-value hacking” is the practice of rerunning experiments or selecting subsets of datasets
until a statistically significant result is achieved. It is harder to publish negative results than positive results in
academic journals, so there is an incentive to data snoop. Some statistically significant results in the literature
will be Type I errors, which makes it important to replicate experimental results.
The ASA statement says:
Proper inference requires full reporting and transparency.
P-values and related analyses should not be reported selectively. Conducting multiple analyses
of the data and reporting only those with certain p-values (typically those passing a significance
threshold) renders the reported p-values essentially uninterpretable. Cherry-picking promising
findings, also known by such terms as data dredging, significance chasing, significance questing,
selective inference, and “p-hacking,” leads to a spurious excess of statistically significant results in
the published literature and should be vigorously avoided…(ASA Statement on Statistical Significance
and P-values)
Multiple testing Undertaking multiple tests on the same dataset is also problematic. Suppose we undertake
multiple tests and find that one of them is significant. As we increase the number of tests, the probability of a
Type I error increases (see the XKCD comic in the reading). If we undertake 20 tests, each at the 5% level, there is
a 0.95²⁰ chance of not having a Type I error, and therefore a 1 − 0.95²⁰ ≈ 0.64 chance of at least one Type I error.
There are ways to compute more stringent cut-offs in these cases, for example the Bonferroni correction.
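A small sketch of this calculation, assuming 20 independent tests each at the 5% level:
k = 20
alpha = 0.05
print(1 - (1 - alpha) ** k)   # probability of at least one Type I error, about 0.64
print(alpha / k)              # Bonferroni-corrected per-test threshold, 0.0025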
Chapter 18
A/B testing
Generating confidence intervals for A/B learning using statistical simulations Let’s imagine that we present
the two versions of the page to group A and to group B the same number of times, n. We find that group A
clicks through on 70% of occasions and group B on 72%. We’ll call the underlying proportions of users that
click through that we are trying to estimate pA and pB , and we will define the difference that we are trying to
estimate:
d = pA − pB (18.1)
The difference d is positive when A is better than B. We can address the question of how much better A is
than B by finding a point estimate of d – the larger d is, the better A is relative to B. We can address the question
of whether A is significantly better than B by finding a confidence interval.
The natural point estimators for pA , pB and d are:
p̂A = nA/n ,  p̂B = nB/n  and  d̂ = p̂A − p̂B    (18.2)
Figure 18.1: A/B Testing. Group A is shown the web page on the left; group B the one on the right. Image
credit: Maxime Lorant, Wikimedia, CC SA 4.0.
where nA and nB are the actual numbers clicking through from A and B.
To find the confidence interval, we can use a statistical distribution of d, assuming the underlying proportions
in populations A and B are given by the point estimates p̂A and p̂B . The routine to generate the sampling
distribution of d looks like:
• For j in 1, . . . , k
  – Draw a simulated number of click-throughs n*A,j from a binomial distribution with parameters n and p̂A, and similarly n*B,j with parameters n and p̂B
  – Compute and store the simulated difference d*j = n*A,j/n − n*B,j/n
• Use the 2.5% and 97.5% centiles of the stored d*j as the 95% confidence interval
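A minimal sketch of this routine in Python, assuming the figures used in the text (n = 1000, p̂A = 0.70 and p̂B = 0.72):
import numpy as np

rng = np.random.default_rng(0)
n, p_hat_A, p_hat_B = 1000, 0.70, 0.72
k = 10_000

# Simulated differences in click-through proportions under the estimated proportions
d_star = (rng.binomial(n, p_hat_A, size=k) - rng.binomial(n, p_hat_B, size=k)) / n

print(np.mean(d_star))                       # point estimate, about -0.02
print(np.percentile(d_star, [2.5, 97.5]))    # 95% confidence interval, about (-0.06, 0.02)
print(np.mean(d_star > 0))                   # estimated P(pA > pB), about 0.15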
The result is shown in Figure 18.2. The point estimate d̂ = −0.02, suggesting that B is better than A.
However, the 95% confidence interval is (−0.06, 0.02), which contains the value d = 0, suggesting that A and B
could be equally effective.
Note that the area to the left of d̂ = 0 in the bootstrap distribution is about 85% and the area to the right is
about 15%. We interpret this as meaning that there is an 85% chance that version B is better than version A –
but there is still a 15% chance that it isn’t.
Undertaking a hypothesis test for A/B learning using statistical simulations It’s also possible to approach
this A/B problem as a hypothesis test. We leave it as an exercise to formulate the problem in this way and write
a statistical simulation.
[Histogram of the simulated distribution of pA − pB, annotated with d̂ = −0.02, 95% CI = (−0.06, 0.02) and P(pA − pB > 0) ≈ 0.155.]
Figure 18.2: Bootstrap simulation of A/B test with n = 1000, pA = 0.70 and pB = 0.72.
we can see that the upper end of the 99% confidence interval is now less than 0. The chance of the underlying
proportion pA being higher than pB is around 0.0012. We can therefore say that a 99.999% confidence interval
is (−∞, 0).
When to stop sampling Suppose we had collected our first n = 1000 A and B responses in 2 hours on a
Monday afternoon. We're quite excited by the result, and reckon that we need to keep it running up to n = 10000
in order to be 99.999% certain. This will probably take us to Tuesday afternoon; we'll then write a report for the
boss and be done by Wednesday. What could possibly go wrong?
We’ve made a hidden assumption that every period of the week is like a Monday afternoon. What if people
prefer blue to green in the evening? What if the Monday afternoon demographic is older, but the weekend
demographic is younger? To avoid selection bias (see Data collection and statistical relationships), we may wish
to collect at least a full week of data to check that our result really is robust – a week’s worth of data should
mean that any day- or time-specific effects are eliminated, or at least greatly reduced.
Now we are interested in estimating the difference d = pA − pB between our population proportions. An unbiased
estimator of d is:
d̂ = p̂A − p̂B (18.4)
Supposing the population proportions are pA and pB , we expect the number of successes in n trials to be
binomially distributed, with the standard deviations of nA and nB being:
σnA = √(npA(1 − pA)) ;  σnB = √(npB(1 − pB))    (18.5)
[Three histograms of the simulated distribution of pA − pB. For n = 1000: d̂ = −0.02, 95% CI = (−0.06, 0.02), P(pA − pB > 0) ≈ 0.155. For n = 5000: 95% CI = (−0.04, −0.00), P(pA − pB > 0) ≈ 0.014. For n = 10000: 95% CI = (−0.03, −0.01), P(pA − pB > 0) ≈ 0.0012.]
Figure 18.3: Bootstrap simulation of A/B test with pA = 0.70 and pB = 0.72, and varying sample sizes n.
Dividing through by n and replacing pA and pB by their estimates, we get the estimated standard errors of the
estimators p̂A and p̂B :
σ̂p̂A = √(p̂A(1 − p̂A)/n) ;  σ̂p̂B = √(p̂B(1 − p̂B)/n)    (18.6)
Since the samples from A and B are independent, the variance of the estimator of the difference in proportions d̂
is equal to the sum of the variances of p̂A and p̂B . We take the square root to get the standard error of the
estimator d̂:
σ̂d̂ = √(σ̂²p̂A + σ̂²p̂B) = √(p̂A(1 − p̂A) + p̂B(1 − p̂B)) / √n    (18.7)
We'll assume that n is large, in which case the Central Limit Theorem applies, and we can assume that there is
little variance in the estimated standard error of d̂. We can therefore assume that the statistic
Z = (d̂ − d)/σ̂d̂
is normally distributed. We can then use the z-distribution to calculate confidence intervals.
Worked example We’ll use figures we used for the bootstrap to find the 95% confidence interval theoretically:
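A sketch of the calculation, assuming the same figures as in the bootstrap example (n = 1000, p̂A = 0.70 and p̂B = 0.72):
import numpy as np
from scipy.stats import norm

n, p_hat_A, p_hat_B = 1000, 0.70, 0.72
d_hat = p_hat_A - p_hat_B

se_d = np.sqrt(p_hat_A * (1 - p_hat_A) + p_hat_B * (1 - p_hat_B)) / np.sqrt(n)   # Equation 18.7
z_cv = norm.isf(0.025)                                                            # 1.96

print(d_hat - z_cv * se_d, d_hat + z_cv * se_d)   # about (-0.06, 0.02), matching the bootstrap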
A p-value, or statistical significance, does not measure the size of an effect or the importance
of a result. Statistical significance is not equivalent to scientific, human, or economic significance.
Smaller p-values do not necessarily imply the presence of larger or more important effects, and
larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how
tiny, can produce a small p-value if the sample size or measurement precision is high enough, and
large effects may produce unimpressive p-values if the sample size is small or measurements are
imprecise. Similarly, identical estimated effects will have different p-values if the precision of the
estimates differs. (Wasserstein and Lazar, 2016)
Figure 18.4: Maths scores in Semester 1 (Autumn) and Semester 2 (Spring). Data from Edge and Friedberg
(1984), via Devore and Berk (2012).
Ethical questions in A/B testing There are a number of ethical issues we should consider in A/B testing:
• In an academic situation, informed consent is required – but how can we get this informed consent without
affecting the experiment?
The experiment in which Facebook manipulated news feeds to explore the effect on users’ moods (Kramer
et al., 2014) was an example of A/B testing that was widely seen as problematic (Verma, 2014) because of a lack
of informed consent, no opportunity to opt-out and no institutional review of the experiment, because it was
carried out by a private company. Arguably, many A/B tests do not have such a significant effect on users –
but they may have some effect nonetheless.
It’s therefore important to reflect on an A/B test before setting it up. Questions might include:
Applying the bootstrap We already know how to use a sample to compute a confidence interval for the mean
of one population using the bootstrap (Bootstrap estimation of confidence intervals). Can we adapt the bootstrap
to give us a confidence interval around the difference between two means? To apply the bootstrap, on each
bootstrap step we sample with replacement from both groups:
• For j in 1, . . . , B
  – Sample m values with replacement from the first group (the xi) and n values with replacement from the second group (the yi)
  – Compute and store the difference in the sample means x̄*j − ȳ*j
• We can also compute the bootstrap estimator of the variance of the difference between the means:
  s²boot = Σ_{j=1}^{B} (x̄*j − ȳ*j)² / (B − 1)
The bootstrap distribution is shown in Figure 18.5. We can see that the 95% confidence interval for the
difference in mean grades is 1.11 to 3.66.
We leave it as an exercise to write a bootstrap simulation to test the hypothesis that the means of the grades
differ between the two semesters.
Estimator of the difference The difference between the sample means X̄ − Ȳ is an unbiased estimator of the
difference between the true means µ1 − µ2. This follows from X̄ being an unbiased estimator of µ1 and Ȳ being
an unbiased estimator of µ2. The standard deviation of the estimator is:
σX̄−Ȳ = √(σ1²/m + σ2²/n)    (18.12)
This follows from the two samples being independent, so V(X̄ − Ȳ) = V(X̄) + V(Ȳ) = σ1²/m + σ2²/n.
Theoretical distribution for large samples When both samples are larger than 40 (m > 40 and n > 40) we
can regard the sample as large. As when we estimated confidence intervals for the mean of one population
(Confidence intervals for the mean from small samples), we define a standardised variable, which we expect to be
zero in the case of a null hypothesis that the true difference between the population means is µ1 − µ2:
Z = (X̄ − Ȳ − (µ1 − µ2)) / √(S1²/m + S2²/n)    (18.13)
The denominator is the sample standard deviation of the estimator, and is a random variable. As when estimating
the mean of one population from a small sample (Confidence intervals for the mean from small samples), for small
m and n this will vary considerably between different samples, and so we do have to consider these random
effects. However, for large m and n, it will approximate the true standard deviation of the estimator, and its
variability is low enough to consider it as a fixed parameter, as when estimating the mean of one population
(Method of estimating confidence interval for the mean of a large sample). In the limit of large n and m, the
Central Limit Theorem suggests that the distribution of the statistic should be normal, so we can use a normal
distribution with a mean of zero and standard deviation of √(s1²/m + s2²/n) as the sampling distribution of the
test statistic.
Procedure to calculate a confidence interval for the difference in means from large samples
(x̄ − ȳ − zα/2 σ̂µ̂1−µ̂2 , x̄ − ȳ + zα/2 σ̂µ̂1−µ̂2 ), which can also be written x̄ − ȳ ± zα/2 σ̂µ̂1−µ̂2    (18.15)
In the comparison of grades example (Figure 18.4), we have m = 74, n = 80, x̄ = 28.2500, ȳ = 25.8784,
s1 = 3.2472, and s2 = 4.5929. As an exercise, compute the 95% confidence interval of the difference in the means
µ1 − µ2 and compare this with the estimate from the bootstrap.
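A sketch of this check using the figures quoted above (try the exercise yourself before running it):
import numpy as np
from scipy.stats import norm

m, n = 74, 80
xbar, ybar = 28.2500, 25.8784
s1, s2 = 3.2472, 4.5929

se = np.sqrt(s1**2 / m + s2**2 / n)   # estimated standard error of the difference in means
z_cv = norm.isf(0.025)

print(xbar - ybar - z_cv * se, xbar - ybar + z_cv * se)   # roughly (1.1, 3.6)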
Theoretical distribution for small samples When either group has a sample size of less than 40, the variability
of the sample standard deviation has to be taken into account, and it turns out that the sampling distribution of
the standardised statistic is a t-distribution with a number of degrees of freedom ν that depends on the standard
deviations of both distributions:
ν = (s1²/m + s2²/n)² / [ (s1²/m)²/(m − 1) + (s2²/n)²/(n − 1) ]    (18.16)
[Three pairs of samples (y-axis: Value, x-axis: Group A/B), with p = 0.2104, p = 0.0042 and p = 0.0004 respectively.]
Figure 18.6: Cohen’s d for a number of samples, along with the corresponding t statistics and the p-values for
testing if the groups are different. Group A is identified with the xs and Group B with the ys.
Procedure to calculate a confidence interval for the difference in means from small samples To determine a
confidence interval of 100(1 − α)%, we follow the first two steps of the large sample procedure above. We then
compute ν using Equation 18.16, and find the t critical value tα/2,ν . We can then use the t critical value and the
standard error of the estimator to compute the confidence interval in a similar way to the last step of the large
sample procedure:
x̄ − ȳ ± tα/2,ν σ̂µ̂1−µ̂2    (18.17)
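A sketch of this procedure on two small hypothetical samples (the grades data themselves are not reproduced in these notes):
import numpy as np
from scipy.stats import t

x = np.array([27.0, 30.5, 25.0, 29.0, 31.5, 26.0, 28.5])    # hypothetical group 1
y = np.array([24.0, 26.5, 23.0, 27.5, 22.0, 25.5])          # hypothetical group 2
m, n = len(x), len(y)
s1_sq, s2_sq = x.var(ddof=1), y.var(ddof=1)

se = np.sqrt(s1_sq / m + s2_sq / n)
nu = (s1_sq / m + s2_sq / n) ** 2 / ((s1_sq / m) ** 2 / (m - 1) + (s2_sq / n) ** 2 / (n - 1))   # Equation 18.16
t_cv = t(nu).isf(0.025)

diff = x.mean() - y.mean()
print(diff - t_cv * se, diff + t_cv * se)   # 95% confidence interval for mu1 - mu2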
18.7 Quantifying the effect size of differences between two numeric samples
Cohen’s d The issue of practical significance raised in Issues in A/B testing can be addressed quantitatively
with numerical samples using a statistic called Cohen’s d. Using the same notation for data in two groups
(x1 , . . . , xm and y1 , . . . , yn ), Cohen’s d is defined:
d = (x̄ − ȳ)/s ;  s = √( ((m − 1)s1² + (n − 1)s2²) / (m + n − 2) )    (18.18)
Figure 18.6 shows some examples of Cohen's d for different pairs of samples. We can see that in the leftmost
and rightmost plots, the value of d is around 0.5, meaning that the difference in the means is about half the
size of the pooled standard deviation of the groups – in other words there is a substantial overlap of the two
distributions. The middle plot has a much higher value of d, and it is clear here that the distributions overlap
little. Although the leftmost and rightmost plots have similar values of d, the rightmost plot, with many more
datapoints in each group, has a higher t-statistic and a lower p-value than the leftmost plot.
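A sketch of a Cohen's d calculation following Equation 18.18; the two samples below are hypothetical:
import numpy as np

def cohens_d(x, y):
    m, n = len(x), len(y)
    s1_sq, s2_sq = x.var(ddof=1), y.var(ddof=1)
    s = np.sqrt(((m - 1) * s1_sq + (n - 1) * s2_sq) / (m + n - 2))   # pooled standard deviation
    return (x.mean() - y.mean()) / s

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=30)    # hypothetical group A
y = rng.normal(0.0, 2.0, size=30)    # hypothetical group B
print(cohens_d(x, y))                 # roughly 0.5 in expectation for these parameters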
H0 : The mean performance in semester 1 is the same as the mean performance in semester 2.
Ha : The mean performance in semester 1 is different from the mean performance in semester 2.
We’ve previously simulated the null hypothesis (here, no difference in means) to generate a distribution of
what the test statistic (here, the difference in the sample means) would be under the null hypothesis, and then
compared this distribution with the observed value of our test statistic.
It turns out that there is a duality between confidence intervals and hypothesis testing. Instead of the
sampling distribution of the test statistic generated under the null hypothesis, we have used the observed data to
generate the distribution of the estimator for the parameter corresponding to the test statistic. Instead of the
observed value of the test statistic, we have the value the parameter would take under the null hypothesis.
In this example, we used the bootstrap to estimate the distribution of the estimator of the difference of
the means µx − µy (Figure 18.5), and we could then ask if the null hypothesis value of the difference in the
means (0) lies in either of the rejection regions in the tails of that distribution. If so, we can reject at the level
corresponding to the size of the tails. Alternatively, we could compute a p-value, by finding at what quantile the
null hypothesis value (here 0) lies on the distribution. In this case we would find p = 0, so the null hypothesis
would be rejected.
For more on the duality between confidence intervals and hypothesis testing see this Quora article.
The denominator is the “probability of the data” p(nA) and derives from the definition of conditional probability.
Suppose our model of how the data arises is that users decide, independently of each other, to click through
with a probability pA. To start off with, we don't know what pA is – we assume it is equally likely to be any
number between 0 and 1. In other words, we are assuming a uniform prior: p(pA) = 1 for 0 ≤ pA ≤ 1.
Probability theory tells us that if we have n repeats of a trial in which the probability of “success” on each
trial is pA , then the distribution of the total number of successes is given by a binomial distribution.
p(nA | pA, n) = (n choose nA) pA^nA (1 − pA)^(n−nA)    (18.20)
This is the likelihood of observing nA out of n users clicking through, given a hypothetical click-through probability
of pA .
Using integration by parts and recursion, we can evaluate the integral in the denominator to get 1/(n + 1).
We thus find that the posterior distribution is
p(pA | nA, n) = (n + 1) (n choose nA) pA^nA (1 − pA)^(n−nA)    (18.21)
Note that, although this looks like a binomial distribution, the variable is actually pA, and this is therefore a
beta distribution with a = nA + 1 and b = n − nA + 1.
This distribution is now the distribution of the parameter pA given the data. Using differentiation, we can
prove that the distribution has a maximum at p̂A = nA /n. This makes sense, since we would expect the most
likely value to be the observed proportion. The 2.5% and 97.5% centiles will enclose 95% of the distribution and
are our Bayesian credibility interval – the analogue of a frequentist confidence interval.
We can write a similar expression for pB . Since the A and B groups are independent, the likelihood is
p(nA, nB | pA, pB, n) = (n choose nA) pA^nA (1 − pA)^(n−nA) (n choose nB) pB^nB (1 − pB)^(n−nB)    (18.22)
It turns out that the posterior p(pA , pB |nA , nB , n) is also the product of the two posterior distributions.
p(pA, pB | nA, nB, n) = (n + 1)² (n choose nA) pA^nA (1 − pA)^(n−nA) (n choose nB) pB^nB (1 − pB)^(n−nB)    (18.23)
To compute the posterior distribution of the difference, we substitute pB = pA − d and then integrate over pA
from d to 1 (if d is positive) or from 0 to 1 + d (if d is negative).
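Rather than evaluating this integral analytically, we can approximate the posterior distribution of d by Monte Carlo sampling from the two beta posteriors derived above. A sketch, assuming n = 1000, nA = 700 and nB = 720 as in the earlier example:
import numpy as np

rng = np.random.default_rng(0)
n, nA, nB = 1000, 700, 720

# Sample from the posterior Beta(nA + 1, n - nA + 1) and Beta(nB + 1, n - nB + 1) distributions
pA_samples = rng.beta(nA + 1, n - nA + 1, size=100_000)
pB_samples = rng.beta(nB + 1, n - nB + 1, size=100_000)
d_samples = pA_samples - pB_samples

print(np.percentile(d_samples, [2.5, 97.5]))   # 95% credibility interval for d
print(np.mean(d_samples > 0))                  # posterior probability that A is better than B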
Chapter 19
Logistic regression
Logistic regression can be used in two main ways:
• As a parametric supervised classification algorithm. For example, a bank has data on previous customers
it has considered offering credit cards to, including predictor variables (independent variables) such as
their age, income, housing status and employment status. Each of these sets of variables is labelled with
the response variable of whether the credit card was approved. The task is to determine if the credit card
should be approved for a new customer.
• As a way of investigating the association between predictor variables and a binary (also called
dichotomous) response variable. For example, suppose we have an observational study of patients of
different ages, health levels, ethnicity and gender. Some of the patients have had a dose of vaccine for an
illness, and some haven’t. We’d like to know how the probability of getting the illness depends on if the
vaccine has been administered or not. Like multiple linear regression, we can examine logistic regression
coefficients to isolate the effect of the vaccine, controlling for the other variables.
Similarities and differences to k-NN We’ve already discussed classifiers, when we looked at k-Nearest
Neighbours (Supervised learning: Classification with Nearest neighbours). As a reminder, the problem of
classification is to predict the correct label for an unlabelled input item described by a feature vector of variables.
As well as acting as a classifier, logistic regression can predict a real-valued number, the probability of a data
point belonging to a category, on the basis of the predictors/predictor variables. In fact, we convert the logistic
regression model into a classifier by choosing a threshold level of probability at which we make a decision.
For example, we might only want to approve cards that we think would have a 60% chance of being approved
historically.
Association between continuous predictor and binary outcome We will use the example of the credit card
approval to illustrate how logistic regression is used as a classifier and as a way of exploring associations
between variables. Figure 19.1 visualises the relationships between age and approval and between employment
status and approval in two ways. Because age is a continuous variable, we can plot individual datapoints on a
scatter plot (Figure 19.1a). It looks like older customers were more likely to have their credit approved than
younger ones.
Association between binary predictor and binary outcome: Odds and odds ratios Employment is a binary
variable (“employed” or “not employed”). If we tried plotting it in the same way as age versus approval, we’d end
up with a very uninformative plot, so instead we look at a contingency table (Figure 19.1b), which shows the
empirical probability (relative frequency) of having credit approved or not approved based on employment status.
(a) Age versus approval. Each datapoint represents the age of a customer and whether their credit was approved (1) or not
approved (0). It looks like a greater fraction of older customers had their credit approved than younger ones. A random
subsample of data points is plotted to aid visualisation.
Figure 19.1: Relationship between age, employment status and credit approval in the credit approval dataset.
In logistic regression, we will see that it makes sense to describe these probabilities in terms of odds1 , which
we define as:
Odds(Success) = P(Success)/P(Failure) = P(Success)/(1 − P(Success))    (19.1)
If success and failure are equally likely, the odds are equal to 1.
We call the odds ratio (OR) the ratio between the odds of credit approval if employed versus credit approval
if not employed.
OR(x) = Odds(Success | x = True) / Odds(Success | x = False)    (19.2)
We can find the odds ratio of employment in the credit example by setting “Success” to “Approved” and x to
“Employed”, giving an answer of 7.09 = 2.42/0.34. Thus, the odds of someone who is employed having credit
approved are 7.09 times larger than the odds of someone who is not employed having credit approved. The odds
ratio is sometimes referred to as an effect size and expressed as the percentage change in the odds from x being
False to True; in this case the effect size is 609%, since employment increases the odds of approval by
this amount.
Principle of logistic regression with one predictor variable As its name suggests, logistic regression is related
to linear regression. Suppose that the response variable (or dependent variable) y is a dichotomous variable
(i.e. a categorical variable with two categories). We’ll represent the categories by 0 (failure) and 1 (success).
We’d like to model the probability P(Y = 1|X = x) that the response variable is 1, given the predictor variable
(or predictor) X has a value x. Because we’re predicting a probability, the answer given by logistic regression
has to lie between 0 and 1. Therefore, P(Y = 1|X = x) can’t be a linear function of x.
We get around this problem by allowing P(Y = 1|X = x) to be a nonlinear function of x. A function that
works well in many applications is the logistic function2 (Figure 19.2). Using f to denote the logistic function,
1 Modern Mathematical Statistics with Applications calls the “odds” the “odds ratio”, which is not standard usage.
2 Not-examinable: The logistic function is also known as the sigmoid function, and denoted S(x) or σ (x), due to its S-shaped curve.
However, the term “sigmoid function” can refer to a family of S-shaped functions.
The term logistique was first used in 1845 by Verhulst (1845) to describe the solution of a differential equation describing population
[Left panel: the standard logistic function f(x). Right panel: shifted curves f(x − 4) and f(x/2 − 2).]
Figure 19.2: The logistic function. Left: the standard logistic function: f(u) = exp(u)/(1 + exp(u)), which can
also be written f(u) = 1/(1 + exp(−u)). Right: Examples of shifted logistic curves.
[Scatter of Approved (0/1) against Age with the fitted logistic curve, annotated with β̂0 = −1.176 and β̂1 = 0.031.]
Figure 19.3: Logistic regression of age on credit approval.
P(Y = 1 | X = x) = f(β0 + β1x) = e^(β0+β1x) / (1 + e^(β0+β1x)) = 1 / (1 + e^(−β0−β1x))    (19.3)
Just as with linear regression, we can adjust the values of the coefficients β0 and β1 to fit the data as best as
possible – we will come to how we do this later.
Application to credit example with one variable Figure 19.3 shows the logistic regression of age on credit
approval – we are ignoring all the other variables for now. The curve doesn’t look very much like a logistic
curve, but that’s because it’s got a very shallow slope, since β̂1 = 0.03. We can see that the probability ranges
between about 0.37 for teenagers and 0.8 for 70-year-olds.
growth: dp/dt = p(1 − p). However, Verhulst applied the term logistique to the expression of time in terms of population, i.e. essentially
t = ln(p/(1 − p)). This is in fact the “logit” function: logit(p) = ln(p/(1 − p)), which is the inverse of the logistic function. The name logit
was coined much later – see a later footnote.
In python scipy and some R packages, the logistic function is referred to as “expit”, making “expit” the inverse of “logit”, just as “log” is
the inverse of “exp”.
Interpretation of β̂0 The value f(β̂0) is the modelled probability of the response variable being 1 (“success”)
when the predictor variable is 0. In the credit example it suggests the likelihood of a newborn baby receiving
credit approval is f(−1.176) = 0.236 – which seems rather high!
Log odds Remember the definition of odds (Equation 19.1). To interpret the coefficient β̂1 it helps to rewrite
the logistic regression model (Equation 19.3) in terms of log odds, i.e. the log of the odds:
Log Odds(Success) = ln(P(Success)/P(Failure)) = ln(P(Success)/(1 − P(Success)))    (19.4)
Log odds of 0 mean that success and failure are equally likely: P(Success) = P(Failure) = 0.5. Positive log
odds mean that success is more likely than failure, and vice versa for negative log odds. An increase of 1 unit of
the log odds means that the odds increase by a factor of e. As the probability tends towards 1, the log odds
tend towards infinity; as the probability tends towards 0, the log odds tend towards negative infinity.
When we express probability in terms of log odds, we sometimes say it has units of “logits”, which stands for
logistic units. Going back to the example, we can say that when the predictor variable is 0, the log odds of
approval are β̂0 = −1.176 logits.
The logit function converts the probability of success into the log odds of success to failure3 :
logit(p) = ln(p/(1 − p))    (19.5)
Rewriting the logistic regression model in terms of log odds The probability of a failure is:
P(Y = 0 | X = x) = 1 − f(β0 + β1x) = 1 − e^(β0+β1x)/(1 + e^(β0+β1x)) = 1/(1 + e^(β0+β1x)) = f(−β0 − β1x)    (19.6)
Dividing the probability of success (Equation 19.3) by the probability of failure gives:
P(Y = 1 | X = x) / P(Y = 0 | X = x) = f(β0 + β1x) / f(−β0 − β1x) = e^(β0+β1x)    (19.7)
The ratio on the left is the odds for success. It tells us how many times more likely the “success” (Y = 1) is than
the “failure” (Y = 0) for any value of x (see Equation 19.1). If we take natural logs of both sides of the equation,
we see that the log odds is a linear function of the predictor:
ln( P(Y = 1 | X = x) / P(Y = 0 | X = x) ) = β0 + β1x    (19.8)
We can now see that β̂0 is the log odds when the predictor variable is equal to 0.
Interpretation of β̂1 From Equation 19.8, we can see that the parameter β1 tells us the increase in the log
odds when we increase x by 1 unit.
In other words, when we increase x by 1 the odds multiply by a factor exp(β1 ). We refer to this factor as
the odds ratio (OR) for the variable x. In this example the OR = exp(0.03) = 1.03. Thus, for every extra year of
age, the odds of having a loan approved are multiplied by 1.03, an effect size of 3%.
3 If we have a continuous response variable between 0 and 1 (e.g. the proportion p of organisms killed by a toxin), we could transform
the response variable into logits using logit(p). In fact, logistic regression and the term logit were invented to deal with this sort of data
(Berkson, 1944).
4 This identity should help to see this:
f(u)/f(−u) = [e^u/(1 + e^u)] × [(1 + e^(−u))/e^(−u)] = e^u
Table 19.1: Coefficients expressed in raw form and as odds ratio exp(β).
Figure 19.4: Bootstrap distributions for the baseline odds and odds ratios for age and employment in the credit
scoring example.
P(Y = 1 | X^(1) = x^(1), X^(2) = x^(2), . . . ) = f(β0 + β1x^(1) + β2x^(2) + . . . )    (19.9)
This equation applies regardless of whether the predictor variables are binary (such as employment status) or
continuous (such as age).
Multiple logistic regression applied to credit example If we apply multiple logistic regression to the credit
example, we end up with the coefficients and odds ratios shown in Table 19.1. We can see that the effect of being
employed increases the odds of being awarded a loan by a factor of 6.56, an effect size of 556%. By contrast
each year of age only multiplies the odds by 1.03, an effect size of 3%. To see the effect of increasing age by 10
years, we’d need to raise this OR to the power 10, and would find that the odds are only multiplied by 1.35. The
effect of an increase in age from 20 to 70 is about 4.36 – still less than the effect of being in employment.
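A sketch of how coefficients and odds ratios like those in Table 19.1 might be obtained with statsmodels; the data below are simulated from a hypothetical model loosely based on the table's coefficients, since the credit dataset itself is not included in these notes:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 75, size=n)
employed = rng.binomial(1, 0.6, size=n)
log_odds = -2.2 + 0.03 * age + 1.9 * employed            # hypothetical "true" coefficients
approved = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
credit = pd.DataFrame({'approved': approved, 'age': age, 'employed': employed})

model = smf.logit('approved ~ age + employed', data=credit).fit()
print(model.params)           # coefficients on the log odds scale
print(np.exp(model.params))   # odds ratios exp(beta)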
Bootstrap confidence intervals on coefficients Just as the mean and median are statistics, so are the coefficients
β̂1 and β̂2 in logistic regression. We can therefore use the bootstrap to generate confidence intervals for logistic
regression (Figure 19.4). The central estimate and the 95% confidence intervals computed from the 2.5% and
97.5% centiles are:
Figure 19.5: Logistic regression applied to credit approval dataset. The age and log income of each application
is plotted, along with its approval status. The black line is the decision boundary found by logistic regression
with an odds ratio of 1, i.e. log odds c = 0. The grey lines are the thresholds corresponding to odds ratios of 3
and 1/3 (i.e. 75%/25% and 25%/75%).
c = β0 + β1 x (19.10)
This defines a linear decision boundary – in the region where β0 + β1 x > c, the log odds are greater than the
threshold, and we classify unseen datapoints in this region as “Success”, and elsewhere, we classify unseen
datapoints as “Failure”.
Figure 19.5 shows decision boundaries for various threshold levels when we consider two continuous variables
in the credit data set: age and the log of the income. Note: as with linear regression, it often makes sense with
logistic regression to transform variables so that their distribution is as normal as possible.
Transparency of logistic regression The credit agency might want to explain to its customers why their
application was or was not approved. Logistic regression makes it very easy to do this, since essentially we
have a credit scoring system:
• Multiply your age by 0.029 and add the result to your score
Figure 19.6: Logistic regression (left) versus 11-NN (right) applied to the credit data. The decision boundaries
are shown as dark lines.
• Round your income to the nearest 1000. Multiply the number of zeros in this figure by 0.320 and add the
result to your score5
• If you scored more than 2.246, your credit will be approved
Thus, a logistic regression classifier is potentially a very transparent classifier. It could help to reduce the
ethical harms of data science to individuals by allowing them to understand why their loan was rejected. One of
the recommendations of Vallor’s Introduction to Data Ethics is to “Promote Values of Transparency, Autonomy,
and Trustworthiness” (Vallor, 2018).
Logistic regression versus k-nearest neighbour The logistic regression classifier differs in a number of ways
from the nearest neighbour classifier:
1. The logistic regression decision boundary (obtained by setting a probability criterion) is a straight line,
whereas the nearest neighbour decision boundary is nonlinear (Figure 19.6).
2. The k-NN thus gives more flexibility and the ability to have higher accuracy, but it is also more likely
to over-fit, as seen in the chapter on k-Nearest neighbour classification, setting hyperparameters, and
metrics.
3. The logistic regression algorithm is more transparent than k-NN.
4. k-NN classifiers benefit from having standardised predictor variables as inputs; logistic regression doesn’t
need this, though it can help if we are regularising a logistic regression classifier (which we will not do in
this course).
Often it is worth trying logistic regression first in a classification problem.
19.5 Maximum likelihood estimation of logistic regression coefficients
In linear regression, we estimated the coefficients using the principle of least squares, i.e. by minimising the
sum of squared differences between the predicted and observed response variables. We could do this analytically,
ending up with a formula for the regression coefficients. In logistic regression, it doesn't make sense to minimise
the sum of squared errors, since our response variable is only 0 or +1 whereas the predictor variables can have
an infinite range.
Instead, we use the principle of maximum likelihood, which states that we adjust the model coefficients so as
to maximise the likelihood that the observed data arises from the model. The resulting coefficients are referred
to as the maximum likelihood estimators. Maximum likelihood can be applied to many models, not just logistic
regression.
To apply the principle of maximum likelihood, we need an expression for the likelihood of all the observed
data given the model, which we will derive below. The likelihood is a function of β0 and β1 . However, unlike in
the case of linear regression, we can’t derive formulae to give the best estimates of the coefficients β̂0 and β̂1
that maximise the likelihood. We have instead to use a numerical optimisation procedure that gets us to the best
estimates.
Intuition of maximum likelihood applied to logistic regression The maximum likelihood principle and derivation
may look a bit complicated. We can imagine that the logistic function P(Y = +1|X = xi ) is like a piece of elastic.
Datapoints that are “successes” (yi = 1) pull P(Y = 1|X = xi ) upwards towards 1 at the location xi , since this
will make success more likely. Datapoints that are “failures” (yi = 0) pull P(Y = 1|X = xi ) downwards towards
0 at the location xi . Of course, the successes and failures may be mixed up, in which case they will be competing
with each other to pull the logistic function up or down.
Derivation of maximum likelihood function Assuming that all our datapoints are independent of each other,
the likelihood of obtaining the full set of response variables y1 , . . . , yn is the product of the likelihood of getting
each individual variable6 . The probability of getting a success or failure is given by Equations 19.3 and 19.6. We
use a trick to combine them so that the probability of success used when yi = 1 and the probability of failure is
used when yi = 0:
Π_{i=1}^{n} P(Y = yi | X = xi) = Π_{i=1}^{n} ( yi f(β0 + β1xi) + (1 − yi) f(−β0 − β1xi) )    (19.11)
Substitute yi = 1 or yi = 0 in the equation above to verify that each one “selects” the correct probability.
We can now use another trick. Notice that 2yi − 1 is equal to 1 when yi = 1 and equal to −1 when yi = 0.
We can now express the arguments of the logistic functions in terms of 2yi − 1:
Π_{i=1}^{n} P(Y = yi | X = xi) = Π_{i=1}^{n} ( yi f((2yi − 1)(β0 + β1xi)) + (1 − yi) f((2yi − 1)(β0 + β1xi)) )    (19.12)
There’s now a common factor, f((2yi − 1)(β0 + β1 xi )), so the equation simplifies:
Π_{i=1}^{n} P(Y = yi | X = xi) = Π_{i=1}^{n} (yi + 1 − yi) f((2yi − 1)(β0 + β1xi))
                               = Π_{i=1}^{n} f((2yi − 1)(β0 + β1xi))    (19.13)
Since the log function is a monotonically increasing function, maximising this probability is equivalent to
maximising the log likelihood7 :
Σ_{i=1}^{n} ln P(Y = yi | X = xi) = Σ_{i=1}^{n} ln f((2yi − 1)(β0 + β1xi))    (19.14)
6 We will use the product notation (Greek capital letter “pi”, Π) to represent a product of probabilities. For example, the probability of
three independent events all happening is p1p2p3, which we represent as Π_{i=1}^{3} pi.
7 The log law log ab = log a + log b can be generalised using product and sum notation: log Π_{i=1}^{n} xi = Σ_{i=1}^{n} log xi.
The log of f(u) is − ln(1 + e−u ), so we can now write the optimisation equation as maximising:
Σ_{i=1}^{n} ln P(Y = yi | X = xi) = − Σ_{i=1}^{n} ln(1 + exp(−(2yi − 1)(β0 + β1xi)))    (19.15)
with respect to β0 and β1 . Although we can compute the gradients with respect to β0 and β1 , we can’t solve the
resulting equations analytically; we have to use numerical optimisation to find the best estimates β̂0 and β̂1 .
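A sketch of this numerical optimisation using scipy, on hypothetical simulated data; np.logaddexp is used for numerical stability:
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(18, 70, size=200)                 # hypothetical predictor (e.g. age)
p = 1 / (1 + np.exp(-(-1.2 + 0.03 * x)))          # hypothetical "true" model used to simulate data
y = rng.binomial(1, p)                            # hypothetical 0/1 responses

def neg_log_likelihood(beta):
    b0, b1 = beta
    u = (2 * y - 1) * (b0 + b1 * x)
    return np.sum(np.logaddexp(0.0, -u))   # sum of ln(1 + exp(-u)), i.e. Equation 19.15 negated

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)   # numerical estimates of beta0 and beta1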
Chapter 20
Linear regression and inference
Inference on the slope of the regression line When we first encountered linear regression, we hadn't learned
about inferential statistics. It’s now time to apply what we’ve learned about statistical inference to linear
regression.
In the chapter on Linear Regression we introduced the linear regression model. The response variable y
depends linearly on the predictor variable x (Equation 10.1 in the chapter on Linear Regression):
y = β0 + β1 x (20.1)
The variables β0 and β1 are called parameters or regression coefficients and are the intercept and slope of the
line respectively.
In the chapter on Linear Regression we calculated what we now know as point estimates of the intercept and
slope β̂0 and β̂1 of the regression line, using the principle of least squares:
β̂0 = ȳ − β̂1 x̄ ;  β̂1 = (Σ xiyi − n x̄ ȳ) / (Σ xi² − n x̄²) = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²    (20.2)
For any set of data, we would now like to answer the following questions about the regression coefficients:
• Do they represent a real effect, and not just a chance correlation in the data?
Bootstrap estimation of confidence intervals Since the value of the regression coefficients can be calculated from
sample data, they fit the definition of a statistic (see chapter on Randomness, sampling and simulation). We can
therefore use the bootstrap estimator to estimate the uncertainty in the coefficients (Figure 20.1).
As shown in Figure 20.1, there is considerable variation in the slope – the 95% CI is (2.50, 4.24). This is
evident in the sample of fits shown in the top left plot. There is also variation in the intercept – however, the
scatter plot shows that it is very closely negatively correlated with the slope, since a steeper slope means that
the intercept needs to be more negative to ensure that the regression lines pass through the centre of the data.
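A sketch of the bootstrap for regression coefficients; the data below are hypothetical stand-ins for the squirrel measurements, which are not reproduced in these notes:
import numpy as np

rng = np.random.default_rng(0)
lengths = rng.normal(218, 5, size=43)                         # hypothetical lengths (mm)
weights = -390 + 3.4 * lengths + rng.normal(0, 15, size=43)   # hypothetical weights (g)

n = len(lengths)
slopes, intercepts = [], []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)                  # resample rows with replacement
    b1, b0 = np.polyfit(lengths[idx], weights[idx], deg=1)
    slopes.append(b1)
    intercepts.append(b0)

print(np.percentile(slopes, [2.5, 97.5]))       # 95% CI for the slope
print(np.percentile(intercepts, [2.5, 97.5]))   # 95% CI for the intercept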
[Figure 20.1: four panels – Weight (g) against Length (mm) with sample regression lines; bootstrap distributions of the slope and the intercept; scatter plot of intercept against slope.]
Figure 20.1: Bootstrap applied to squirrel length and weight data. Top left: the Squirrel data of Wauters and
Dhondt (1989). A sample of 5 linear regression lines from the bootstrap distribution are shown. Top right:
bootstrap distribution of Slope β̂1 . Bottom left: bootstrap distribution of Intercept β̂0 . Bottom right: scatter plot
of the slope and intercept coefficients from the bootstrap simulations.
Figure 20.2: Bootstrap applied to squirrel length and weight data. Distribution of weight predictions for squirrels 210 mm long (left) and 225 mm long (right).
Bootstrap hypothesis testing Often we want to know if the slope of a regression model is significantly different
from zero, since a non-zero slope suggests a relationship between the predictor and response variables. In other
words, can we reject the null hypothesis that β1 = 0? The bootstrap distribution of β̂1 (Figure 20.1, top right)
answers this question immediately, since the range of the bootstrap distribution does not even include 0.
Prediction uncertainty Suppose we want to predict the weight of a squirrel that is 225 mm long. We can use the
bootstrap distribution of coefficients to predict the weight of squirrels that are 210mm long versus those that are
225mm long (Figure 20.2). All we do is substitute in the values x = 210 or x = 225 in the equation y = β̂0∗ + β̂1∗ x
for each pair of the bootstrap estimates (β̂0∗ , β̂1∗ ). We can see that the confidence interval for squirrels of length
225mm is wider, reflecting that there are fewer data for longer squirrels than shorter ones (Figure 20.1, top left).
The more tightly the points cluster around the regression line (small s in the numerator), the smaller the
standard error of the estimate of the gradient. Also, the larger the spread of points on the x-axis (large
denominator in Equation 20.3), the smaller the standard error of the estimator.
Confidence interval for β̂1 As in our estimate of the confidence interval around the mean of small samples
(Confidence intervals for the mean from small samples), we regard both β̂1 and the estimated standard error Sβ̂1
as random variables. In order to determine the confidence intervals, we define the quantity:
\[
T = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}}
\tag{20.5}
\]
Figure 20.3: Output from Python statsmodels applied to the squirrel dataset.
where β1 is the true but unknown value. It turns out that this quantity has a t-distribution with n − 2 degrees of
freedom. To determine a 100(1 − α)% confidence interval, we therefore set the probability that the variable T
lies between the t critical values ±tα/2,n−2 :
\[
P\!\left( -t_{\alpha/2,\,n-2} < \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} < t_{\alpha/2,\,n-2} \right) = 1 - \alpha
\tag{20.6}
\]
We rearrange the inside of the probability statement to find that the 100(1 − α)% confidence interval for the slope is $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\, S_{\hat\beta_1}$.
Testing hypotheses about β̂1 Often we are interested to know if the slope of a regression model is significantly
different from zero – in other words, can we reject the null hypothesis that β1 = 0? To do this we assume
the null hypothesis by setting β1 = 0 in Equation 20.5. Under the null hypothesis, the estimator for the slope is
expected to be zero. If the estimated value of β̂1 is large relative to Sβ̂1 then the T value will be large, and
therefore far out in one of the tails of the distribution. By computing the probability mass more extreme than the
T value, we get the p-value, and we can then decide whether or not to reject the null hypothesis that the slope
coefficient is zero.
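The same calculation can be sketched directly from the formulas above. This is a sketch under the assumption that the data sit in numpy arrays x and y; it is not the statsmodels implementation.

import numpy as np
from scipy import stats

def slope_inference(x, y, alpha=0.05):
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    residuals = y - (beta0 + beta1 * x)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # residual standard deviation, n - 2 degrees of freedom
    se_beta1 = s / np.sqrt(sxx)                      # estimated standard error of the slope
    t_stat = beta1 / se_beta1                        # T of Equation 20.5 with beta1 = 0 under the null
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)
    return beta1, se_beta1, t_stat, p_value, ci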
If the p-value is large, this suggests that there may be no real relationship between the predictor and
response variables, but one appears to be there by chance. Generally when the coefficient of determination R 2
is large and the number of data points n is reasonably large, the p-value will be low, and it is reasonable to start drawing
conclusions about the relationship between x and y.
In summary, we’re now in a position to understand a lot more of the output produced by a stats package
when it fits a linear regression model (Figure 20.3).
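For example, output like Figure 20.3 can be produced with statsmodels. The snippet below uses a made-up stand-in DataFrame, since the real squirrel data file and its column names are not given here.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in for the squirrel data (made-up numbers, column names assumed)
rng = np.random.default_rng(1)
length = rng.uniform(205, 230, size=50)
weight = -450 + 3.5 * length + rng.normal(0, 15, size=50)
squirrels = pd.DataFrame({'length': length, 'weight': weight})

model = smf.ols('weight ~ length', data=squirrels).fit()
print(model.summary())              # coefficient table: estimates, standard errors, t values, p-values
print(model.conf_int(alpha=0.05))   # 95% confidence intervals for intercept and slope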
We will assume that ε is a random variable with an expected value of E[ε] = 0. In principle, we can assume that
it is drawn from any distribution we would like, but a very common assumption is that it is drawn from a normal
distribution with mean 0 and variance σ 2 :
ε ∼ N(0, σ 2 ) (20.9)
The term Yi − Y is exactly equal to εi − ε, which, from the properties of εi has E[εi − ε] = 0 and variance
V[εi − ε] = σ 2 n/(n − 1). Therefore, the expected variance of β̂1 is:
\[
V[\hat\beta_1] = \frac{\sum (x_i - \bar x)^2 \, V[\varepsilon_i - \bar\varepsilon]}{\left( \sum (x_i - \bar x)^2 \right)^2}
               = \frac{\sigma^2 n/(n-1)}{\sum (x_i - \bar x)^2}
\tag{20.11}
\]
Since the sample variance s2 of the residuals is an estimator of σ 2 n/(n − 1), we get the estimated standard
error:
\[
\hat\sigma_{\hat\beta_1} = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}}
\tag{20.13}
\]
For more details see pp. 642–643 of Devore and Berk (2012).
A probabilistic model for linear regression We’ve just introduced the probabilistic model of linear regression:
Y = β0 + β1 x + ε ; ε ∼ N(0, σ 2 ) (20.14)
If we wanted to, we could use the model to generate “fake” data points (the polite way of referring to this “fake”
data is “synthetic data”). Assume that we have chosen values for the coefficients β0 , β1 and σ 2 . We pick a value xi , compute
β0 + β1 xi , and then add a number drawn at random from a normal distribution with zero mean and variance σ 2 .
However, we are not interested in generating fake data. Rather, under the assumption that our data was indeed
generated by a model of this form, we want to infer the set of parameters β̂0 , β̂1 and σ̂ 2 that is most consistent
with the data.
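A minimal sketch of that generative recipe follows; the parameter values below are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = -450.0, 3.5, 15.0      # invented "true" parameter values

x = rng.uniform(205, 230, size=50)           # pick the predictor values
epsilon = rng.normal(0.0, sigma, size=50)    # noise drawn from N(0, sigma^2)
y = beta0 + beta1 * x + epsilon              # synthetic responses from Equation 20.14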
Principle of maximum likelihood revisited To recap from the topic on logistic regression (Maximum likelihood
estimation of logistic regression coefficients), the principle of maximum likelihood states that we adjust the
model coefficients so as to maximise the likelihood that the observed data arises from the model. The resulting
coefficients are referred to as the maximum likelihood estimators.
To apply the principle of maximum likelihood, we need an expression for the likelihood of all the observed
data given the model, which we will derive below. The likelihood is a function of β0 , β1 , and σ 2 .
First, we want to compute the likelihood of one data point. The likelihood of finding a value yi of the response
variable at a value xi of the predictor variable is:
\[
P(Y = y_i \mid x_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right)
\tag{20.15}
\]
This is exactly the same as saying that it’s the probability density of drawing the point yi from a normal distribution with
mean β0 + β1 xi and variance σ 2 .
Derivation of maximum likelihood function Assuming that all our datapoints are independent of each other,
the likelihood of the model having given rise to the full set of response variables y1 , . . . , yn given the predictor
variables x1 , . . . , xn is the product of the likelihoods of the model giving rise to each yi at position xi :
\[
\begin{aligned}
P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) &= P(Y = y_1 \mid X = x_1)\, P(Y = y_2 \mid X = x_2) \cdots P(Y = y_n \mid X = x_n) \\
&= \prod_{i=1}^{n} P(Y = y_i \mid X = x_i)
\end{aligned}
\tag{20.16}
\]
The $\prod_{i=1}^{n}$ notation is a short way of writing out the multiplication in the first line – it’s exactly like the $\sum_{i=1}^{n}$
notation for addition.
It’s convenient to work with the log likelihood. This is because the log of a product is the sum of the logs of the
components of the product (ln ab = ln a + ln b). Also, the log function is a monotonically increasing function, so
the values of β0 , β1 and σ 2 that maximise the log likelihood will be exactly the same values that maximise the
likelihood. Here is the log likelihood of all the data:
\[
\ln P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln P(Y = y_i \mid X = x_i)
\tag{20.17}
\]
We now need to find the log of the likelihood of one point (Equation 20.15). It is:
\[
\begin{aligned}
\ln P(Y = y_i \mid x_i) &= \ln\left( \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right) \right) \\
&= \ln \frac{1}{\sqrt{2\pi}\,\sigma} + \ln \exp\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right) \\
&= -\ln(\sqrt{2\pi}\,\sigma) - \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}
\end{aligned}
\tag{20.18}
\]
We now substitute Equation 20.18 in Equation 20.17 to give the log likelihood of all the data:
\[
\begin{aligned}
\ln P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) &= \sum_{i=1}^{n} \ln P(Y = y_i \mid X = x_i) \\
&= \sum_{i=1}^{n} \left( -\ln(\sqrt{2\pi}\,\sigma) - \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right) \\
&= -n \ln(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^{n} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}
\end{aligned}
\tag{20.19}
\]
Take a moment to look at this equation. From the first linear regression topic, remember that we defined the sum
of squared errors (SSE) as:
\[
\mathrm{SSE} = \sum_i (y_i - \hat y_i)^2
\tag{20.20}
\]
where the predicted value ŷi = β0 + β1 xi . The second term in Equation 20.19 is the negative of the SSE,
divided by 2σ 2 . Thus, to maximise the log likelihood, we need to minimise the SSE with respect to β0 and β1 ,
which is exactly what we did when we derived the linear regression coefficients in semester 1. We thus have a
probabilistic motivation for the principle of least squares.
The values of the coefficients We’ve already calculated β̂0 and β̂1 using the principle of least squares in the
topic on Linear Regression:
\[
\hat\beta_0 = \bar y - \hat\beta_1 \bar x ; \qquad
\hat\beta_1 = \frac{\sum x_i y_i - n \bar x \bar y}{\sum x_i^2 - n \bar x^2}
            = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}
\tag{20.21}
\]
However, we now have one more parameter to estimate: σ 2 . To do this we maximise Equation 20.19 by
differentiating it with respect to σ 2 , and arrive at the maximum likelihood estimator for σ 2 :
\[
\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2
             = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2
             = \frac{\mathrm{SSE}}{n}
\tag{20.22}
\]
Note that this is a biased estimator. An unbiased estimator, which can be obtained via estimation theory, is:
\[
\hat\sigma^2 = s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2
             = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat y_i)^2
             = \frac{\mathrm{SSE}}{n-2}
\tag{20.23}
\]
because there are n − 2 degrees of freedom: knowing β̂0 and β̂1 means that we only need to know n − 2 of the yi to
work out the values of the remaining two. In practice, n is large enough that the difference between the two
estimates of σ̂ 2 is negligible.
It is also worth noting that, since the (yi − ŷi ) are the residuals, σ̂ 2 is the variance of the residuals.
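As a quick sketch of the difference between Equations 20.22 and 20.23 (assuming the data are held in numpy arrays x and y; the function name is ours):

import numpy as np

def sigma2_estimates(x, y):
    # Least-squares coefficients (Equation 20.21)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    sse = np.sum((y - (beta0 + beta1 * x)) ** 2)
    n = len(x)
    return sse / n, sse / (n - 2)   # Equation 20.22 (biased) and Equation 20.23 (unbiased)

For moderately large n the two returned values are almost identical, as noted above.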
Uncertainty estimates We can also use the maximum likelihood formulation to derive the standard error and confidence
intervals for β̂1 – essentially this falls out of Equation 20.19 when we consider varying β1 .
Maximum likelihood elsewhere Maximum likelihood is a widely-used method for deriving estimators for
parameters, ranging from simple models like linear regression to more complicated probabilistic machine learning
ones. If it is possible to write a likelihood function for a model, it is generally possible to find its maximum.
Chapter 21
Ethical issues with supervised learning
Part VI
Project skills
Chapter 22
Software engineering for data science
• If I have carried out an experiment or done some analysis, I can give you the instructions to re-run the
experiment and analysis, and you can reproduce my results
• I can also come back to my files and run the analysis again in 3 months’ time – your past self from 3 months
ago doesn’t answer email!
• Data
• Code
• Documentation
• Data lost or in incompatible format. Here are some emails I have received when requesting data from the
authors of scientific papers:
• Journal policies
• Data repositories
Although some journals now ask peer reviewers to rerun and verify code, sharing it publicly
is still far from an academic norm. The amount of time researchers have to spend either helping
people use their software or refuting claims stemming from its misuse is a “big worry” among
many academics, says Neil Chue Hong, founding director of the Software Sustainability Institute
in Edinburgh. “There are ways you can run the code that mean you won’t get sensible results,
but the researchers who use the code know what those ways are,” he says. “It’s like you or me
being given a Formula One racing car, and being surprised when it crashes when it goes around
the first corner.” (Chawla, 2020)
• Why?
• Unit testing: continuous integration (e.g. Travis-CI) helps to identify problems if underlying libraries get
updated (a minimal test sketch follows this list)
• Virtual environments/containers
• Notebooks?
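A minimal unit-testing sketch with pytest; the function and file names are invented for illustration only.

# wrangling.py -- a small, testable data-cleaning function
import pandas as pd

def clean_weights(df):
    """Drop rows where the weight column is missing."""
    return df.dropna(subset=['weight'])

# test_wrangling.py -- run with:  pytest
def test_clean_weights_drops_missing_values():
    raw = pd.DataFrame({'weight': [350.0, None, 410.0]})
    assert clean_weights(raw)['weight'].notna().all()

def test_clean_weights_keeps_complete_rows():
    raw = pd.DataFrame({'weight': [350.0, 410.0]})
    assert len(clean_weights(raw)) == 2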
Notebooks
• Mathematica (first released in 1988) was the first package to have a notebook interface
• Idea of combining text with computation or maths to produce a living paper, which is also reproducible
• Related to literate programming (invented by Donald Knuth, the author of TEX, which is itself written as a literate
program)1
• The open-source Jupyter (first released under the name iPython in 2011) system2 has implemented the
notebook principle for Python, R and Julia and many other languages3
2 (Perkel, 2021).
3 Although Mathematica is a closed source package, Wolfram Research have made the Wolfram engine available as a kernel for Jupyter.
• Object-oriented code
• http://compbio.ucsd.edu/reproducible-analysis-automated-jupyter-notebook-pipelines/
• Version control
• If you’re wanting to repeat analyses, or you’re finding that you are repeating snippets between notebooks,
it might be time to create a Python module to contain the functions (see the sketch after this list)
• Version control for code (e.g. using Github) is a good thing but does it work for data?
• VC systems deal well with “small” text files – kilobytes rather than megabytes, and definitely not gigabytes
• Thus they are good for code but not so good for data, especially large datasets
• Thus we need different solutions for storing and keeping track of changes to data and code
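For instance, a repeated notebook snippet could be moved into a module like this; the module, function and path names are hypothetical.

# squirrel_utils.py -- functions shared between notebooks
import pandas as pd

def load_squirrels(path):
    """Read a squirrel CSV and standardise the column names."""
    df = pd.read_csv(path)
    df.columns = [c.strip().lower() for c in df.columns]
    return df

# In a notebook cell:
# from squirrel_utils import load_squirrels
# squirrels = load_squirrels('data/squirrels.csv')   # illustrative path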
Data
– Cleaned
– Open data format, e.g. CSV, JSON, YAML, XML
– Meaningful variable names and file names
– Metadata about meaning of columns (if not clear from original dataset)
Code
• Version control – git is currently dominant
• Documentation
• Specifying versions – this can be done using Conda environments and requirements files, for example:
jupyter=1.0.0
matplotlib=2.2.3
numpy=1.15.0
pandas=0.23.4
scikit-learn=0.19.1
scipy=1.1.0
seaborn=0.9.0
python-graphviz=0.8.4
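As a rough guide (the exact commands depend on how your environment is managed): a Conda environment with pinned versions like those above can usually be captured with conda env export > environment.yml and recreated with conda env create -f environment.yml; the pip equivalents are pip freeze > requirements.txt and pip install -r requirements.txt.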
Chapter 23
Writing skills
Chapter 24
LaTeX and BibTeX
Appendix A
Resources
https://www.engage-csedu.org/find-resources/analyzing-airbnb-data
http://www.cse.msu.edu/~cse231/PracticeOfComputingUsingPython/06_Dictionaries/Cancer/project8.pdf
https://www.engage-csedu.org/find-resources/bmi-body-mass-index
List of datasets
p. § Description
13 2.6.0 Titanic
13 2.6.0 Life expectancy (WHO)
13 2.6.0 Drinks by country
14 2.6.0 Bad form entry data
14 2.6.0 University of Edinburgh timetables
15 3.2.0 Squirrels (Wauters and Dhondt, 1989)
30 5.3.0 CO2 and other Greenhouse Gas Emissions (Ritchie et al., 2023)
30 5.3.0 Drinks by country
31 5.3.0 Diabetes (Kahn, 1994)
34 6.2.0 Squirrels (Wauters and Dhondt, 1989)
43 7.1.0 Oranges, lemons and apples (Murray, 2006)
58 8.3.0 Wine quality
59 9.1.0 Oranges, lemons and apples (Murray, 2006)
68 9.4.0 Breast cancer (Wolberg et al., 1995)
71 10.1.0 Galton’s heights
78 10.3.0 World population 1940–2000 (Klein Goldewijk et al., 2017)
80 10.3.0 Synthetic dataset showing heteroscedasticity
81 10.4.0 Diabetes (Kahn, 1994)
84 11.2.0 Galton’s heights
87 11.4.0 Grades (Edge and Friedberg, 1984)
94 12.2.0 Synthetic dataset showing correlation
97 13.1.0 Scottish Index of Multiple Deprivation, 2016 edition. https://simd.scot
104 13.3.0 Grades (Edge and Friedberg, 1984)
110 13.4.0 Breast cancer (Wolberg et al., 1995)
113 14.1.0 Squirrels (Wauters and Dhondt, 1989)
117 14.2.0 Swain versus Alabama
122 14.4.0 Basketball players
122 14.4.0 Swain versus Alabama
135 16.3.0 Japanese restaurant data
141 16.6.0 Basketball players
144 17.1.0 Swain versus Alabama
147 17.3.0 Alameda County Jury pools
158 18.5.0 Grades (Edge and Friedberg, 1984)
167 19.1.0 Credit approval dataset, from UCI.
178 20.1.0 Squirrels (Wauters and Dhondt, 1989)
Bibliography
Adhikari, A., DeNero, J. et al. (2020). ‘Computational and inferential thinking: The foundations of data science’.
Online, Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0),
retrieved at various times between June 2020 and March 2021. URL https://www.inferentialthinking.com
Anaconda (2020). ‘The state of data science 2020: Moving from hype toward maturity’. URL https://www.
anaconda.com/state-of-data-science-2020
Anscombe, F. J. (1973). ‘Graphs in statistical analysis’. The American Statistician 27:17–21. URL http:
//www.jstor.org/stable/2682899
Berkson, J. (1944). ‘Application of the logistic function to bio-assay’. Journal of the American Statistical
Association 39:357–365
Bornioli, A., Lewis-Smith, H. et al. (2020). ‘Body dissatisfaction predicts the onset of depression among adolescent
females and males: a prospective study.’ J Epidemiol Community Health
Chawla, D. S. (2020). ‘Critiqued coronavirus simulation gets thumbs up from code-checking efforts’. Nature
582:323–324. URL https://doi.org/10.1038/d41586-020-01685-y
Dasu, T. and Johnson, T. (2003). Exploratory data mining and data cleaning. Wiley
Devore, J. and Berk, K. (2012). Modern Mathematical Statistics with Applications. Springer Texts in Statistics.
Springer, New York, second ed.
Edge, O. P. and Friedberg, S. H. (1984). ‘Factors affecting achievement in the first course in calculus’. Journal of
Experimental Education 52:136–140
Gelman, A., Carlin, J. B. et al. (2004). Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, second ed.
Gelman, A. and Nolan, D. (2017). Teaching statistics – a bag of tricks. Oxford University Press
Hastie, T., Tibshirani, R. et al. (2009). The elements of statistical learning. Springer, New York, second ed. URL
http://dx.doi.org/10.1007/b94608
Ivory, S. B. (2021). Becoming a critical thinker: for your university studies and beyond. Oxford University Press
Klein Goldewijk, K., Beusen, A. et al. (2017). ‘Anthropogenic land use estimates for the holocene; HYDE
3.2’. Earth System Science Data 9:927–953
Kohavi, R., Henne, R. M. et al. (2007). ‘Practical guide to controlled experiments on the web: Listen to your
customers not to the HiPPO’. In P. Berkhin, R. Caruana, X. Wu and S. Gaffney, eds., KDD-2007 Proceedings of
the thirteenth ACM SIGKDD international conference on knowledge discovery and data mining, pp. 959–967.
Association for Computing Machinery, New York, USA
Kramer, A. D., Guillory, J. E. et al. (2014). ‘Experimental evidence of massive-scale emotional contagion
through social networks.’ Proc Natl Acad Sci U S A 111:8788–90. URL https://dx.doi.org/10.1073/pnas.1320040111
Lane-Claypon, J. E. (1926). A Further Report on Cancer of the Breast With Special Reference to its Associated
Antecedent Conditions. No. 32 in Reports on Public Health and Medical Subjects. HMSO, London
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press,
Cambridge, UK, sixth ed. URL http://www.inference.phy.cam.ac.uk/itprnn/book.pdf
Murray, I. (2006). ‘Oranges, lemons and apples dataset’. URL http://homepages.inf.ed.ac.uk/imurray2/
teaching/oranges_and_lemons/
Perkel, J. M. (2021). ‘Ten computer codes that transformed science’. Nature 589:344–348. URL https:
//doi.org/10.1038/d41586-021-00075-2
Press, D. J. and Pharoah, P. (2010). ‘Risk factors for breast cancer: A reanalysis of two case-control studies from
1926 and 1931’. Epidemiology 21:566–572. URL https://doi.org/10.1097/EDE.0b013e3181e08eb3
Ritchie, H., Rosado, P. et al. (2023). ‘CO2 and greenhouse gas emissions’. Our World in Data. URL https:
//ourworldindata.org/co2-and-greenhouse-gas-emissions
Rutherford, A. (2022). Control: the Dark History and Troubling Present of Eugenics. Weidenfeld & Nicolson,
London
Schwabish, J. (2021). Better data visualizations: a guide for scholars, researchers and wonks. Columbia University
Press
Scottish Government (2016). ‘Scottish index of multiple deprivation (SIMD) 2016’. URL https://www.
webarchive.org.uk/wayback/archive/20200117165925/https://www2.gov.scot/SIMD
The Turing Way Community (2022). The Turing Way: A Handbook for Reproducible Data Science (Version
v1.0.3). Zenodo. URL http://doi.org/10.5281/zenodo.6909298
Tufte, E. (1982). The visual display of quantitative information. Graphics Press, Cheshire, Connecticut
Tufte, E. (2001). The visual display of quantitative information. Graphics Press, Cheshire, Connecticut, second ed.
Vallor, S. (2018). ‘An introduction to data ethics’. Online. URL https://www.scu.edu/media/ethics-center/
technology-ethics/IntroToDataEthics.pdf
Verhulst, P.-F. (1845). ‘Recherches mathématiques sur la loi d’accroissement de la popula-
tion’. Nouveaux mémoires de l’Académie Royale des Sciences et Belles-Lettres de Bruxelles
18:4–55. URL https://gdz.sub.uni-goettingen.de/id/PPN129323640_0018?tify={%22pages%22:
[14],%22panX%22:0.459,%22panY%22:0.815,%22view%22:%22info%22,%22zoom%22:0.721}
Verma, I. M. (2014). ‘Editorial expression of concern: Experimental evidence of massive-scale emotional contagion
through social networks.’ Proc Natl Acad Sci U S A 111:10779. URL https://dx.doi.org/10.1073/pnas.
1412469111
Vopson, M. M. (2021). ‘The world’s data explained: how much we’re produc-
ing and where it’s stored’. The Conversation URL https://theconversation.com/
the-worlds-data-explained-how-much-were-producing-and-where-its-all-stored-159964
Wasserstein, R. L. and Lazar, N. A. (2016). ‘The ASA statement on p-values: Context, process, and purpose’. The
American Statistician 70:129–133. URL https://doi.org/10.1080/00031305.2016.1154108
Wauters, L. A. and Dhondt, A. A. (1989). ‘Variation in length and body weight of the red squirrel (Sciurus
vulgaris) in two different habitats’. Journal of Zoology 217:93–106
Wexler, S., Shaffer, J. et al. (2017). The Big Book of Dashboards: Visualizing Your Data Using Real-World
Business Scenarios. Wiley
Wickham, H. (2014). ‘Tidy data’. Journal of Statistical Software 59. URL https://www.jstatsoft.org/index.
php/jss/article/view/v059i10/v59i10.pdf
Wilson, G., Bryan, J. et al. (2017). ‘Good enough practices in scientific computing’. PLOS Computational Biology
13:1–20. URL https://doi.org/10.1371/journal.pcbi.1005510
Witten, I. H., Frank, E. et al. (2017). Data Mining. Morgan Kaufmann, Cambridge, MA, fourth ed.
Wolberg, W., Mangasarian, O. et al. (1995). ‘Breast cancer Wisconsin (diagnostic)’. UCI Machine Learning
Repository. URL https://doi.org/10.24432/C5DW2B
Xu, D. and Tian, Y. (2015). ‘A comprehensive survey of clustering algorithms’. Annals of Data Science 2:165–193.
URL https://doi.org/10.1007/s40745-015-0040-1
Yanai, I. and Lercher, M. (2020). ‘A hypothesis is a liability’. Genome Biol 21:231. URL https://doi.org/10.
1186/s13059-020-02133-w
Index
unbalanced classes, 55
unbiased estimator, see estimator, unbiased
under-generalisation, 52
unimodal, 15
unit vector, 108
univariate, 8
univariate data, 15
unstructured data, see data
unsupervised learning, 43, 59
upper quartile, 19
upper-tailed, 149
validation data, 53
variable, 8
categorical, 9
numerical, 9
ordinal, 9
string, 9
variables