
M348

Applied statistical modelling

Book 3
Applications
This publication forms part of the Open University module M348 Applied statistical modelling. Details of this
and other Open University modules can be obtained from Student Recruitment, The Open University, PO Box
197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)300 303 5303; email [email protected]).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn more about
the wide range of modules and packs offered at all levels by The Open University.

The Open University, Walton Hall, Milton Keynes, MK7 6AA.


First published 2023.
Copyright © 2023 The Open University
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, transmitted or
utilised in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without
written permission from the publisher or a licence from The Copyright Licensing Agency Ltd. Details of such
licences (for reprographic reproduction) may be obtained from The Copyright Licensing Agency Ltd, 5th Floor,
Shackleton House, 4 Battle Bridge Lane, London, SE1 2HX (website www.cla.co.uk).
Open University materials may also be made available in electronic formats for use by students of the
University. All rights, including copyright and related rights and database rights, in electronic materials and
their contents are owned by or licensed to The Open University, or otherwise used by The Open University as
permitted by applicable law.
In using electronic materials and their contents you agree that your use will be solely for the purposes of
following an Open University course of study or otherwise as licensed by The Open University or its assigns.
Except as permitted above you undertake not to copy, store in any medium (including electronic storage or use
in a website), distribute, transmit or retransmit, broadcast, modify or show in public such electronic materials in
whole or in part without the prior written consent of The Open University or in accordance with the Copyright,
Designs and Patents Act 1988.
Edited, designed and typeset by The Open University, using LaTeX.
Printed sustainably in the UK by Pureprint, a CarbonNeutral® company with FSC® chain of custody and an
ISO 14001 certified environmental management system, diverting 100% of dry waste from landfill.

ISBN 978 1 4730 3554 6


Contents
Unit A1 Introduction to econometrics 1

Welcome to Strand A 3

Introduction to Unit A1 4

1 The economic problem 6


1.1 Considering income and consumption 7
1.2 Towards an economic model 9

2 The econometric model 10


2.1 The Keynesian consumption function 11
2.2 Writing multivariate econometric models 15
2.2.1 Human capital theory 15
2.2.2 Using R to write econometric models 16
2.3 Identifying the effect of price on quantity demanded 17
2.3.1 Estimating the law of demand 17
2.3.2 Causal relationships: considering possible endogeneity
and omitted variables 22
2.4 Identification and the expected value of the error term 23
2.5 The importance of OLS 25
2.5.1 Unbiasedness of OLS 26
2.5.2 Consistency of OLS 28
2.5.3 Efficiency of OLS 29

3 Data: structures, sampling and measurement 31


3.1 Competing theories of consumption following Keynes 32
3.1.1 Modigliani’s lifecycle theory of consumption 32
3.1.2 Friedman’s permanent income hypothesis 33
3.1.3 Duesenberry’s relative income hypothesis 34
3.2 Data structures 34
3.2.1 Panel data 35
3.2.2 What/who to observe, and how often 37
3.2.3 Using R to explore different data structures 39
3.3 Sampling 42
3.4 Measurement 44

4 Estimating causal relationships 45


4.1 Transforming variables 46
4.2 Choosing the regressors 54
4.2.1 Omitted variable bias 56
4.2.2 Using R to compare wage equation specifications 57
4.3 Dealing with possible endogeneity and omitted variable biases 58
4.3.1 Using proxy variables 59
4.3.2 Using twin studies 61
4.3.3 Using instrumental variables 64
5 Estimating causal relationships with panel data 68
5.1 Pooled OLS 69
5.2 Fixed effects estimators 69
5.2.1 The least squares dummy variable (LSDV) estimator 71
5.2.2 Within groups (WG) estimator 73
5.2.3 Using panel data models to estimate a consumption
function 75
5.2.4 Additional examples of fixed effects modelling 79
5.3 Random effects estimator 80
5.4 Choosing between estimators for panel data 82
5.5 Using R to estimate parameters in models for panel data 85

Summary 85

Learning outcomes 87

References 88

Acknowledgements 89

Solutions to activities 90

Unit A2 Time series econometrics 97

Introduction 99

1 Stock and flow variables 101

2 Describing intertemporal properties of time series 103


2.1 Exploring the evolution of the UK unemployment rate 103
2.2 Autoregressive scatterplots 108
2.3 Autocorrelation and the correlogram 110
2.4 Persistence: an application to GDP 114
2.5 Using R to explore time series 119

3 Random walks 119


3.1 Simple random walk 121
3.2 A random walk with drift 123
3.3 Using R to simulate random walks 126
3.4 Random walk with drift versus a model with a deterministic
trend 127

4 Stationarity and lagged dependence of a time series 130


4.1 Stochastic processes 130
4.2 Stationarity and weak dependence 130
4.3 Transforming to stationarity 135

5 Testing for stationarity 136


5.1 The Dickey–Fuller (DF) test 138
5.2 The Augmented Dickey–Fuller (ADF) test 140
5.3 Stationarity of a transformation of GDP 143
5.4 Using R to test for stationarity 146

6 Modelling more than one time series variable 146


6.1 Spurious regression 147
6.2 Cointegration 150
6.2.1 GDP and consumption in the USA from 1970 to 1991 151
6.2.2 Testing the theory of relative purchasing power parity 158
6.3 Error correction model (ECM) 161
6.4 Using R to explore cointegration 164

7 Modelling time with panel data 165


7.1 Fixed effects models with time 166
7.1.1 The Anderson–Hsiao estimator 167
7.1.2 Using R to obtain the Anderson–Hsiao estimator 168
7.2 Exploring the time dimension in panel data in the presence
of omitted ability bias 169

Summary 171

Learning outcomes 173

References 174

Acknowledgements 176

Solutions to activities 177

Unit B1 Cluster analysis 185

Welcome to Strand B 187

Introduction to Unit B1 188

1 Clusters in data 191


1.1 Spotting clusters 192
1.2 Measuring the closeness of data points 201

2 Assessing clusters 208


2.1 Plotting the dissimilarity matrix 209
2.2 Silhouette plots 216
2.3 Mean silhouette statistic 221

3 Hierarchical clustering 223


3.1 Starting agglomerative hierarchical clustering 225
3.2 Finding and merging the closest clusters 226
3.3 The dendrogram 232
3.4 Using R to do hierarchical clustering 235
4 Partitional clustering 236
4.1 Subtask 1: allocating observations to clusters 237
4.2 Subtask 2: estimating the centres of clusters 239
4.3 Solving the bigger task 243
4.4 Starting and stopping 245
4.5 Selecting k 250
4.6 Using R to implement partitional clustering 254

5 Density-based clustering 255


5.1 Phase 1: finding a new cluster 258
5.2 Phase 2: identifying all the observations in a cluster 262
5.3 Putting it together 265
5.4 Using R to implement DBScan 269

6 Comparing clustering methods 269

Summary 273

Learning outcomes 275

References 275

Acknowledgements 276

Solutions to activities 277

Unit B2 Big data and the application of data science 295

Introduction 297

1 What is so special about big data? 299


1.1 Uses of big data 300
1.2 The three (or four or more) V’s 302
1.2.1 Volume 303
1.2.2 Variety 304
1.2.3 Velocity 305
1.2.4 Other V’s 305

2 Handling big data 307


2.1 Computational time 309
2.2 Distributed computing 314
2.3 Distributed computing in practice 318
2.3.1 Using R for distributed computing 319
2.3.2 Structuring calculations for distributed computing 320

3 Models and algorithms 321


3.1 Convergence 321
3.2 One solution or many solutions? 325
3.3 Is there a ‘best’ answer? 327
4 Outputs from big data analysis 328
4.1 Correlation versus causation 329
4.2 Sample or population 331
4.3 Prediction rather than explanation 333

5 Privacy 336
5.1 Consent 336
5.2 Anonymisation 341

6 Fairness 346
6.1 Inequality 347
6.2 Feedback loops 353

7 Guidelines for good practice 355

Summary 358

Learning outcomes 360

References 361

Acknowledgements 364

Solutions to activities 366

Review Unit Data analysis in practice 381

Introduction 383

1 Multiple regression: understanding the data 386


1.1 A first look at some data from a pharmaceutical company 386
1.2 An initial multiple regression model 388
1.3 Visualising the dataset 390
1.4 Designed experiments and observational studies 393

2 Multiple regression: model fitting 396


2.1 Fitting the initial model 396
2.2 Checking the assumptions of the initial model 397
2.3 Using transformations to improve model fit 399
2.4 Adding interactions to the model 403
2.5 Checking the assumptions of the selected model 407

3 Multiple regression with factors 410


3.1 Introducing a dataset from biology 410
3.2 Modelling the data 413
3.3 The ANOVA table 415
3.4 Is the model a good one? 416
3.5 Another look at how the experiment was planned 417
4 The larger framework of generalised linear models 417
4.1 Review of GLMs 418
4.2 Example: Modelling citations of published articles 421
4.3 Example: Modelling a dose escalation trial 426
4.4 Example: Modelling child measurements 433

5 Relating the model to the research question 435


5.1 The cells dataset revisited 436
5.2 The child measurements dataset revisited 437
5.2.1 Log-linear model or logistic regression? 438
5.2.2 Using R to fit a logistic regression model from contingency
table data 439
5.3 The dose escalation dataset revisited 439
5.4 The desilylation dataset revisited 443

6 Another look at the model assumptions 446


6.1 Models for responses which are percentages 446
6.2 Modelling counts 447
6.3 Choosing between models 450
6.4 Using R for model comparison: the Olympics dataset revisited 451
6.4.1 A Poisson GLM with only main effects 452
6.4.2 A Poisson GLM with two-way interactions and main
effects 452

7 To transform, or not to transform, that is the question 453


7.1 Transformations before fitting a model 454
7.2 Transformations after fitting a model 454
7.3 The citations dataset revisited 455

8 Who’s afraid of outliers? 459


8.1 Reasons for outliers 459
8.2 Changing the model 461
8.3 Can we just delete outliers? 468

9 The end of your journey: what next? 470

Summary 472

Learning outcomes 474

References 475

Acknowledgements 476

Solutions to activities 477

Index 497
Unit A1
Introduction to econometrics

Welcome to Strand A
Strand A of M348 is the econometrics strand, consisting of Units A1
and A2. The spirit of this strand is to show ways in which economists
apply statistical modelling and techniques to economic data.
(Remember that you should study either Strand A or Strand B. See the
module guide for further details.)
Some of the reasons why economists use statistics differently relate to the
prominent roles that famous economists have had in advising government
policy. Some of the most famous economists of the last century and
beyond had relevant positions in national and international policy
committees. This emphasis on policy has created an additional set of priorities
and objectives for what economists draw from data and statistical analysis.
For this reason, there are two important differences between the
data-related branch of economics – called econometrics – and how we
have so far done statistical modelling in this module.
The first difference stems from the role of economic theory in the relations
that econometrics establishes between data and theory, and between
theory and measurement. These relations are not stable over time, nor
agreed amongst economists at a particular point in time. In fact, they have
given rise to long-running debates in economics, and to different
schools of thought within the economics discipline. We will give you a
flavour of these debates by exploring two case studies that show how the
relation between data and theory progressed economic understanding of
key economic problems. We will look at:
• what economists have said about the relation between income and
consumption, and how the relation of economic theory with data and
measurement influenced this debate
• what economists have said about how the levels of wages paid to workers
are determined, and the main competing theories explaining them.
In line with the rest of the module, and as stated in the module guide, we
will keep mathematical manipulations to a minimum.
The second difference relates to the importance attached to key
parameters of econometric models. When economists use data and
statistical modelling to investigate economic problems, they are often
interested in the magnitudes and signs of particular parameters of this
model. The ‘correct’ estimation of these parameters, and the search for
estimators and data which will yield the best possible estimates, is often
called the identification phase of statistical modelling; it is a topic of
particular interest to many econometricians.
In this strand, you will learn what economists mean by:
• the ‘correct’ estimators
• unbiasedness and consistency of estimators
• identification and causality.


You will also learn some of the techniques and uses of the theory involved
in correctly estimating parameters of interest.
This is not to say economists will not use statistics to describe and explore
patterns in data with an agnostic attitude towards what they will find.
However, and for the purposes of showing what is distinctive about
econometrics from what you have learnt so far in this module, we will be
focusing on the instrumental way that economists have used data to
confirm or refute theoretical explanations of economic observations.
The type of data available often conditions the way economic relations can
be modelled and estimated, and conditions the type of identification
strategies available. This strand is therefore organised as follows. Unit A1
will explore two types of data structures: the so-called cross-sectional
structure, and the longitudinal or panel structure. Unit A2 will then
discuss the third main data structure, called time series, and end with
revisiting panel data models which account for time.

A note on notebooks
Make sure you have all the relevant files installed for the notebook
activities in this strand. Check the module website for any additional
instructions.

Introduction to Unit A1
This unit starts by showing you the main steps of doing econometrics when
its main purpose is to use it as a tool for economic policy. For this reason,
the engagement with economic theory, and the ways in which this theory is
represented and seeks support in the data, warrant some thought and
practice.
We will start by showcasing key elements of a historical economic problem
that looks at the relation between income and consumption. We will show
you how the main debates have evolved and we will discuss the use (and
neglect) of data and of modelling in furthering these debates over time. No
knowledge of economics is assumed in this unit. In doing this, we will
introduce you to some of the developments in alternatives to OLS as ways
of improving the properties of the estimators of key parameters.
(Econometricians refer to the ‘standard’ linear regression model as OLS –
ordinary least squares.) These additional estimators are often also
dependent on the data structure, and on the assumptions made about the
behaviour of economic variables included in the model.


The treadmill of an econometrician’s work


In Unit 1, you explored the notion of the statistical modelling process as a
sequential process: from the formulation of the relevant question; through
to the design of the study and exploration of the data; to the selection,
estimation and diagnosis of the model; and finally to the interpretation
and reporting of results. While this process remains valid and used by
econometricians, there are some additional challenges when using
statistical modelling to answer economic questions, such as:
• how questions posed are often driven by economic theory or models
• the importance often attached to particular parameters in the model,
requiring a careful selection and analysis of the properties of the
econometric model and of its random term – often referred to as the
error term or the disturbance term
• the measurement and data availability of key variables suggested by
economic theory; it is often the case that finding suitable measurements
is a substantial stage of modelling in econometrics, which entangles the
relation between data, model and estimators.
Figure 1 summarises the econometrician’s modelling process. You will
consider this further in Activity 1.

Figure 1 The econometrician's modelling process: a diagram linking the economic problem, the econometric model, data and estimation

Activity 1 Comparing modelling processes

Compare the econometrician's modelling process in Figure 1 with the statistical modelling process shown in Figure 1 of Unit 1. What is the
same and what is different about the two modelling processes?

In this unit, we will deal with the different parts of the econometrician’s
modelling process. Whilst doing this, we will introduce you to one of the
oldest economic problems, which is still very much debated: the relationship
between income and consumption.
The structure of Unit A1 in terms of how the unit’s sections fit together is
represented diagrammatically in the following route map.


The Unit A1 route map

Section 1: The economic problem
Section 2: The econometric model
Section 3: Data: structures, sampling and measurement
Section 4: Estimating causal relationships
Section 5: Estimating causal relationships using panel data

Note that Subsections 2.2.2, 3.2.3, 4.2.2 and 5.5 contain a number of
notebook activities, so you will need to switch between the written
unit and your computer to complete these.

1 The economic problem


The first key stage of doing econometrics is to bring in an economic model.
Economics is a contested discipline; that is to say, there is not an accepted
base of ‘settled science’ as there often is in the natural sciences. The field
is divided into ‘schools of thought’ working from differing sets of
assumptions with diverse methodologies and arriving at different (often
contradictory) conclusions. One of the roles of statistical modelling and
econometrics has therefore been to provide evidence to support or to reject
at least one of the competing theories. This is what is called confirmatory
analysis and it uses data and evidence within a predefined framework and
hypothesis. This is very different from exploratory analysis, where data
is analysed freely and openly, and relationships between variables arise
organically from the exploration of the data. A key feature in confirmatory
analysis is the fact that relationships between key variables are suggested
by economic theory rather than data.
This means we need some economic theory to get started. So, in
Subsection 1.1 we will see how exploratory analysis of economic data can
lead to the generation of such theory, in the particular context of


consumption and income. Then in Subsection 1.2 we will consider what is desirable in a model for income and consumption.

1.1 Considering income and consumption


The differences in consumption between poor and rich families
excited attention and often compassion, but apparently never
quantitative analysis, for many centuries.
(Stigler, 1954)

The above quote comes from a paper by George J. Stigler. The next
activity will introduce you to some early studies about income and
consumption that he did find.

Activity 2 Empirical studies of income and consumption

Read pages 95 to 98 of the article ‘The early history of empirical studies of consumer behavior’ by Stigler (1954), provided on the module website. We
will be analysing these pages in this subsection.

Stigler identifies three main early studies from David Davies, Sir Frederick
Morton Eden and Ernst Engel. These arose from a need to show evidence
of the extent of poverty in England. They collected ‘budget data’ on
several English families, which is data on income and on the expenditure
on various items.
In his article, Stigler reproduces a table for each of the studies,
summarising the information on income and consumption in each one.
Often, and until now, consumption is measured as expenditure. All three
studies use a particular measurement of expenditure called expenditure
share. This is described in Box 1.

Box 1 Expenditure share


The expenditure share on good x is the proportion of total
expenditure spent on good x, and is calculated as
expenditure share on good x = (expenditure on good x) / (total expenditure).
Expenditure shares are expressed as a number between 0 and 1 and
can also be presented as percentages between 0% and 100%.
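
As a quick illustration of this calculation (using made-up weekly budgets rather than Stigler's data), the following R sketch computes food expenditure shares:

  # Hypothetical weekly family budgets (not Stigler's data)
  food_exp  <- c(30, 45, 60)    # expenditure on food
  total_exp <- c(40, 75, 150)   # total expenditure
  food_share <- food_exp / total_exp   # expenditure share on food
  food_share          # 0.75, 0.60, 0.40
  100 * food_share    # as percentages: 75%, 60%, 40%

Note that the family with the smallest budget has the highest food share, the pattern that Box 2 will formalise.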

In the next activity, you will consider a couple of tables given in Stigler (1954).


Activity 3 Expenditure shares and income


Look again at Davies’ and Engel’s budget data provided in Tables 1 and 3
of Stigler (1954, pp. 97 and 98), repeated below. Describe some of the
main features.


The relationship you saw in Activity 3 between income and expenditure shares on food is so regular across so many studies and datasets that it has
become a law in economics. This law is given in Box 2.

Box 2 Engel’s law


Engel’s law states the following: the lower the income, the higher the
food expenditure share or, more broadly, the higher the expenditure
share on necessities such as food.

What the tables given in Activity 3 also show is that when looking at
absolute values of expenditure, rather than at relative shares, and when
comparing absolute values of expenditure with absolute values of income,
one can see that for poorer groups, expenditure is much larger than
income, and this difference is reduced as income increases. In these studies
therefore, most people are spending more than they are earning, and this is
particularly noticeable for poorer families.
Activity 3 has shown you one way in which economics uses data and relates
it to economic theory. As Stigler (1954, p. 98) says, ‘[Engel’s law] was the
first empirical generalization from budget data’, where exploration of the
data informed economic theory. So we have just done some exploratory
analysis of economic data! However, later studies and moments in the
economics discipline have tilted more towards confirmatory analysis. That
is, data analysis aimed at investigating the appropriateness of a
pre-specified model. This is what we will be doing in most of the unit.

1.2 Towards an economic model


In trying to inform policy about the standards of living of poorer people,
and whether they are consuming subsistence levels of necessities, earlier
studies looked at the relationship between income and consumption, with
consumption often measured in terms of expenditure shares. To propose
an economic model, more needs to be said about how income and
consumption are related.
Income is the variable on which consumption depends. Changing income
allows families to change their consumption levels of food, clothes, health
and recreation. In an economic model, income is the key explanatory
variable, and consumption is the dependent variable whose level depends
on income.
You will have seen in Table 1 of Unit 1 (in ‘Setting the scene’) that there
are several ways in which we can refer to the dependent variable (or
regressand, explained variable, predicted variable, endogenous variable,
and so on), and to the explanatory variable (or regressor, independent
variable, and so on). We will be using the terms dependent variable and
explanatory variable or regressor most of the time. We carefully avoid
using regressors and exogenous variables (one of the other terms given in Table 1 of Unit 1) interchangeably. As you will see in Box 3, an exogenous variable is not the same as a regressor. Verifying that a key regressor is
exogenous is one important step in econometrics and is thus not taken for
granted. We will get to that when discussing econometric models and
identification in Section 2.

Box 3 Exogenous and endogenous variables


In a model, variables are either exogenous or endogenous.
Exogenous variables are variables whose values are not influenced
by the values of other variables in the model. (Exogenous is Greek for
‘from without’.)
Endogenous variables are variables whose values are influenced by
at least one of the other variables in the model. (Endogenous is Greek
for ‘from within’.)

For policy purposes, we would like consumption to be strongly dependent on income. Governments who would raise disposable income (that is,
income net of tax paid and benefits received) by, for instance, reducing tax
rates for poorer people, or for goods which are bought mostly by poor
families, would want to see a clear predictable impact of this fiscal policy
change on the well-being of the poorer segments of society. Let’s nail this
down.
The economic model where consumption C is a function of income I,
C = f (I), will be a useful model for policy purposes when income has a
causal effect on consumption: a change in income will cause a change in
consumption. Crucially a causal effect must be predictable on average, and
independent of changes occurring elsewhere which may also be affecting
consumption. For this to happen, changes to income must be driven by
forces which are exogenous to the economic system. Our economic model
is therefore a model where income is an exogenous variable as opposed to
consumption which, by responding to income changes, is an endogenous
variable explained within the model. In the next section, and when
discussing identification and estimation in later sections, we will
demonstrate how these distinctions materialise when doing econometrics.

2 The econometric model


Statistical modelling in economics requires an estimable version of an
economic model. In this strand, we will only consider econometric models
which are called linear in parameters. In Box 4 we define what it means
for a model to be deemed linear in parameters, then in Activity 4 you will
practise applying this definition.


Box 4 Linear in parameters


A model is linear in parameters when the dependent variable is
represented by the sum of a list of additive terms where the
‘measurable’ terms are each the product of one parameter and one
explanatory variable, and an error term. A model with K explanatory variables and error term ε is linear in parameters when it takes the form
Y = a0 + a1 X1 + a2 X2 + a3 X3 + · · · + aK XK + ε.

As part of being linear in parameters, all econometric models have an error term which, in this strand, will always be an additional term added to the
model. We will discuss the error term, and the importance of carefully
analysing its properties, in later sections.
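
In R's model-formula notation, a model that is linear in parameters corresponds to an additive formula, with one coefficient estimated per term. A minimal sketch, assuming a hypothetical data frame mydata with columns Y, X1, X2 and X3, is:

  # One parameter per additive term; lm() includes the error term implicitly
  fit <- lm(Y ~ X1 + X2 + X3, data = mydata)
  coef(fit)   # estimates of a0 (the intercept), a1, a2 and a3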

Activity 4 Models which are linear in parameters

Consider models (a) to (e) below. Assuming that Y, X1 and X2 are variables and a0, a1 and a2 are parameters (in this case, model coefficients), identify which models are linear in parameters.
(a) Y = a0
(b) Y = a0 + a1 + a2 X1
(c) Y = a0 + a1 X1 + a2 X1²
(d) Y = a0 + a1 X1 a2 X2
(e) Y = a0 + a1 X1 X2

Notice in Activity 4 that we used lower-case Roman letters to represent


model coefficients. You may be more used to using Greek letters to
represent model coefficients. In economics, we use both lower-case Roman
and Greek letters.

2.1 The Keynesian consumption function


John Maynard Keynes was one of the most famous economists active and
prominent in England and beyond in the period between the First and
Second World Wars, and his theories have recently been revived as a result of the 2007–08 financial crisis. During the period between the First and Second World Wars, which included the Great Depression (starting in 1929), one of the periods of most extreme poverty in England and the USA in the twentieth century, Keynes observed a stable relation between aggregate consumption and aggregate income. In its simplest form, the econometric model representing the Keynesian consumption function represents consumption C as a linear function of income I, as


C = c0 + bI + ε. (1)
Keynes argued this relation was stable when considering both consumption
and income in real terms. What is meant by ‘real terms’ is described in
Box 5.

Box 5 Measuring in real and nominal terms


The analysis of economic variables over time or when looking at
different regions of the globe with different price levels is often carried
out in real terms by removing the effect of changing prices and
inflation from variable measurements.
Inflation is measured by collecting information on a basket of goods
and services, recording their price levels over time, and summarising
the composite price in an index number. This index number,
depending on how it is calculated, is either a deflator, or a consumer
or retail price index. It is often expressed with base 100.
A variable expressed in nominal terms is expressed in terms of
prices (and often in local currencies) at the time of measurement.
A variable expressed in real terms has been adjusted for inflation,
often by dividing its value with a deflator or price index. Changes
over time of a variable measured in real terms will only be picking up
‘real’ changes in quantities and not of their prices.
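
As a minimal sketch of this adjustment, with made-up numbers rather than any module dataset, the R code below converts a nominal series to real terms using a price index with base 100:

  # Hypothetical nominal series and price index (base 100)
  nominal  <- c(100, 110, 125, 140)
  deflator <- c(100, 105, 115, 130)
  real <- nominal / (deflator / 100)   # values in real terms
  round(real, 1)   # 100.0, 104.8, 108.7, 107.7: 'real' growth is much weaker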

This relation exists for all possible income and consumption levels, even
hypothetical levels which have not been observed. This is what we call an
a priori relationship, as opposed to an a posteriori relationship which we
can analyse by observing when it occurs in the data. In this case, plans to
consume are a function of expected income. While Keynes recognised that
expectations were also likely to influence consumption, he argued that the
simplicity and stability of the above model outweighed the added value of
explicitly including expectations in the model. In the next activity, you
will consider the interpretation of the Keynesian consumption function.

Activity 5 The parameters of the Keynesian consumption function
Looking back at Model (1), and leaving the error term aside, consumption is explained as a linear function of income. Use your knowledge of linear functions to interpret the parameters of this model, c0 and b.


In the Solution to Activity 5, you have seen that Keynes called the
parameter c0 autonomous consumption and the parameter b the
marginal propensity to consume (MPC). Keynes did not assume
that the marginal propensity to consume was fixed for all income levels.
As populations get richer, a larger fraction of income is allocated to
savings. Keynes observed and modelled that increases in income would
allocate an increasing share to savings and a decreasing share to
consumption, but in a way that absolute levels of consumption would still
increase, although at a decreasing rate. This mirrors Engel’s law when it
observes that richer people will use up a decreasing share of their income
for necessities such as food, but it goes one step further and claims some of
the unspent income is saved.
The way in which economic theory has trickled into the interpretation of
these parameters means we can use it to suggest ranges of values for each
that are more likely to be observed, should the Keynesian consumption
function be a good approximation to the relationship between consumption
and income. You will do this in Activity 6.

Activity 6 Suggesting ranges for the parameters

From the discussion of the parameters c0 and b, suggest constraints placed on the possible ranges of values for each if the Keynesian consumption
function were a plausible explanation of the relationship between income
and consumption.

In Activity 6, you have seen how the economic theory underlying the
model given by Model (1) places constraints on the value of the
autonomous consumption, c0 , and the value of the marginal propensity to
consume, b. A third implication is to do with the relation between the
marginal propensity to consume and the average propensity to consume,
which is explained next.
The average propensity to consume (APC) is the ratio of
consumption to income; that is, for the ith pair of values for income (Ii )
and consumption (Ci ),
APCi = Ci / Ii .
The relation between the marginal and average propensities to consume is
depicted in Figure 2, shown next.


Figure 2 Marginal and average propensity to consume. The figure plots consumption C against income I. A red solid line shows the consumption function C = c0 + bI, with intercept c0, and the points (I1, C1) and (I2, C2) lie on this line. A blue dotted line joins the origin to (I1, C1) and a green dashed line joins the origin to (I2, C2), representing the average propensity to consume at a low and a high level of income respectively.

The marginal propensity to consume is represented by the slope of the consumption function line (shown as a red solid line in Figure 2). The
average propensity for an observation (I, C) is the slope of the line joining
(I, C) and the origin. In Figure 2, we illustrate this for two observations,
(I1 , C1 ) and (I2 , C2 ), both of which lie on the line given by the consumption
function. The slope of the line joining (I1 , C1 ) and the origin (blue dotted
line) depicts the average propensity to consume at a low level of income,
whereas the slope of the line through (I2 , C2 ) and the origin (green dashed
line) depicts the average propensity to consume at a higher level of income.
Notice that the slope of the line joining (I2 , C2 ) and the origin is less than
the slope of the line joining (I1 , C1 ) and the origin. So, the average
propensity to consume is lower for higher values of income. Furthermore, it
does not matter how far the point (I2 , C2 ) moves to the right along the line
representing the consumption function: the slope of the line joining it with
the origin will remain higher than the slope of the consumption function.
So, the average propensity to consume is never lower than the marginal
propensity to consume, b.
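
A small numerical check of these claims, using illustrative parameter values rather than estimates, can be run in R:

  c0 <- 20    # autonomous consumption (assumed value)
  b  <- 0.8   # marginal propensity to consume (assumed value)
  income <- c(50, 100, 200, 400)
  consumption <- c0 + b * income
  apc <- consumption / income   # average propensity to consume
  round(apc, 2)   # 1.20, 1.00, 0.90, 0.85: always above b = 0.8 and falling with income
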
Figure 2 is a simplification of the relation between marginal and average
propensity to consume, but it illustrates Keynes’ two main observations:
• empirically, we should observe the marginal propensity to consume being
lower than the average propensity to consume
• the average propensity to consume decreases with income.
After analysing the ways in which economic theory restricts the range of
values each coefficient should have if the theory is valid, statistical
modelling, estimation and hypothesis testing are then used to see if these
restrictions hold in practice. That is confirmatory analysis, which will be


the scope of Section 4. But first, in Subsection 2.2, you will look at other
examples of economic problems and how they translate into possible
econometric model specifications.

2.2 Writing multivariate econometric models
The consumption function you have considered so far has just two
variables: a dependent variable and one regressor. In this subsection, you
will be looking at models and theories which have more than two variables:
one dependent variable and more than one regressor. Subsection 2.2.1
introduces a model that arises out of human capital theory.
Subsection 2.2.2 will consider models of the production process and the
import behaviour of countries.
In interpreting such models there is the important concept of ceteris
paribus – holding all else constant – as described in Box 6.

Box 6 Ceteris paribus


You were introduced to multiple regression models in Unit 2. In a
model represented as
Y = a0 + a1 X1 + a2 X2 + · · · + aK XK + ε,
a slope coefficient ak , where k = 1, . . . , K, is interpreted as the change
in Y as a result of a marginal change in Xk , while holding all other
regressors constant. Economists often also use the Latin expression for
‘holding all else constant’, which is ceteris paribus.
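
As a hedged illustration of the ceteris paribus interpretation, the R sketch below (using simulated data, not a module dataset) compares predictions when one regressor changes by one unit while the other is held fixed:

  # Simulated data purely for illustration
  set.seed(1)
  mydata <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
  mydata$Y <- 1 + 2 * mydata$X1 - 3 * mydata$X2 + rnorm(100)
  fit <- lm(Y ~ X1 + X2, data = mydata)
  # A one-unit change in X1 with X2 held fixed (ceteris paribus)
  baseline <- data.frame(X1 = 10, X2 = 5)
  shifted  <- data.frame(X1 = 11, X2 = 5)
  predict(fit, shifted) - predict(fit, baseline)   # equals the coefficient on X1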

2.2.1 Human capital theory


Human capital is a concept that has been widely used in a field of
economics called labour economics. Just as firms invest in plant and
equipment (called capital goods) to enhance their productivity and hence
their profits, human capital theory (HCT) holds that individuals invest in
education and skills to enhance their own productivity and hence their
earning power. One of the most prominent contributions came from Jacob
Mincer who gave his name to the Mincer equation, which relates the
wage rate to years of education and years of work experience. The most
common econometric model representing a Mincer equation is
log(w) = log(w0) + α educ + β1 exper + β2 exper² + u, (2)
where w is the wage rate (most often the hourly wage); w0 represents a
notional ‘base wage’ for someone with no education and no experience;
educ and exper represent education and experience, respectively; and u is
the error term. Notice that this model is linear in parameters even though
it contains both a linear and a non-linear term in exper; this captures the


possibility for returns to experience not being linear. For instance, it is often observed that there is a diminishing marginal increase in wages with
increasing experience. That is, as years of work experience increase, so
does the wage, but not by as much; it may even decline towards the end of
someone’s career. Figure 3 illustrates the concept.

Figure 3 A Mincer model showing diminishing marginal returns to work experience: a plot of hourly wage (£) against work experience (years), with the wage curve rising at a decreasing rate as experience increases.
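
If suitable wage data were to hand, a Mincer equation such as Model (2) could be estimated along the following lines; the data here are simulated, and the parameter values are assumptions chosen only to produce a curve like the one in Figure 3:

  # Simulated stand-in for a wage dataset (purely illustrative)
  set.seed(5)
  n <- 500
  educ  <- sample(8:20, n, replace = TRUE)
  exper <- runif(n, 0, 45)
  wage  <- exp(1 + 0.08 * educ + 0.05 * exper - 0.0008 * exper^2 + rnorm(n, sd = 0.3))
  mincer <- lm(log(wage) ~ educ + exper + I(exper^2))
  coef(mincer)   # positive slope on exper, negative on I(exper^2)

The positive coefficient on exper combined with the negative coefficient on exper² is what produces diminishing marginal returns to experience.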

2.2.2 Using R to write econometric models


In Subsections 2.1 and 2.2.1, you have seen two examples of econometric
models, relating to consumption and wage rate. Notebook activities A1.1
and A1.2 will ask you to write down examples of econometric models
representing other important relations in economics.

Notebook activity A1.1 An econometric model of a production function
In this notebook, you will explore the economic theory behind the
function which represents how firms and producers combine their
inputs to create their output of goods and services. You will then
write down a possible econometric model which represents this
production function.


Notebook activity A1.2 An econometric model of the propensity to import
In this notebook, you will explore one theory economists use to
understand the import behaviour of countries. You will then write
down a possible econometric model which represents this behaviour.

2.3 Identifying the effect of price on quantity demanded
In Subsections 2.1 and 2.2, we concentrated on writing down econometric
models. In this subsection, we move onto another stage of developing an
econometric model – identification.
The identification stage in econometrics requires you to look at statistical
modelling with an emphasis on the extent to which we can interpret slope
coefficients as the causal impact of their regressor on the dependent
variable, holding all else constant (ceteris paribus). For policy purposes
identifying causal effects of each key regressor on the dependent variable
allows for their estimate and confidence interval to be used to infer how
the dependent variable can be manipulated by changing the regressor.
This information may be the difference between the success or failure of
economic policies. Economists draw both on the statistical properties of
the variables and of the error term of the model, as well as on economic
theory in this stage. Issues of causality and therefore identification also
apply to non-economic issues related to policy; for example, it has
applications in, say, politics or sociology.
In Subsection 2.3.1, we will use one of the earliest studies in economics to
highlight problems which arise from lack of identification. Then, in
Subsection 2.3.2, we discuss some issues that can lead to a lack of
identification.

2.3.1 Estimating the law of demand


Early econometric studies analysed how consumers respond to changes in
prices of goods and services. The quantities that consumers would be
prepared to buy at each possible price form what is known as the demand
curve. This is an a priori relationship between demand and price, the
same way the consumption function explained earlier was an a priori
relationship between consumption and income. One such demand curve,
typical for most goods, is depicted in Figure 4, shown next.


Figure 4 A demand curve: price on the vertical axis and quantity on the horizontal axis, with a downward-sloping curve labelled ‘Demand’.

Notice that quantity, which is thought of as dependent on price changes, is represented on the horizontal axis, while price is on the vertical axis; this
goes against the conventions in the module so far, and in most of
economics. The visualisation of a demand curve with dependent variable
and regressor switched over in a diagram is one of the very few exceptions
in economics.
The estimation of a demand curve allows sellers and suppliers of goods and
services to estimate how much revenue they could get from selling their
produce at different prices.
The demand for most goods is a decreasing function of price: when the
price of a good increases, consumers will consume less of it as it has become
more expensive. This inverse relationship is what is called the law of
demand. It is so common that most economic theory assumes that goods
satisfy the law of demand. But the law of demand is not always observed.
An increase in the price of necessities such as food reduces the amount of
income available to buy other products, and this sometimes leads to the
increase of the demand for a good even when its price increases.
A similar curve, the supply curve, gives the quantities that will be
offered for purchase at each possible price. The supply curve, holding all
else constant, is often assumed to be an increasing or, at most, flat
function of price.
Economists and policy analysts also model demand and supply to get a
sense of which prices will be observed in each market, and whether these
prices generate good outcomes and incentives for both buyers and sellers.


Using the model, they predict which prices will be operating in each
market, given the way that suppliers set their prices in response to how
they think demand will respond – but holding all else constant. This way
of arriving at equilibrium prices as an alignment of how suppliers and
consumers behave for each price is represented in Figure 5 as the
intersection between the demand and the supply curves.

Figure 5 A demand curve and a supply curve: price plotted against quantity, with a downward-sloping demand curve and an upward-sloping supply curve intersecting at the equilibrium price and quantity.

In 1914, Henry Moore, one of the early pioneers of econometrics (even if at the time he called this field statistical economics), used OLS to estimate a
demand curve for US corn using annual observations for the years
1867–1911. Using data on prices and corn output, he estimated an
equation that fitted the data well and showed a relationship between price
and quantity demanded. That line had a negative slope, as the law of
demand says it should have.
In a later study, Moore estimated the demand for pig iron. The resulting equation was
P = −4.58 + 0.5211 Q + û,
where P is price and Q is quantity demanded. Even though statistical significance tests were not done, this suggested a positive relationship between price and quantity demanded, so the estimated demand curve is upward-sloping. This positive slope could be because the demand for pig iron was behaving as an exception to the law of demand at the time. However, this did not seem to be a convincing explanation, given the knowledge of the market for pig iron at the time. Instead, the parameter of interest, the slope of the demand curve, seemed to have been estimated poorly.
(Molten iron produced in a furnace was made into forms called ‘pigs’, the name given because the outline of the casts it was poured into resembled a sow nursing a litter of piglets.)
It turns out that one of the key limitations of Moore’s study, and several
studies of demand behaviour, is the fact that the econometrician often
does not observe a priori quantities demanded. Let’s think about this. We
would like to observe how much consumers would buy of a good for all
possible prices. But at any given time, in one particular location, there is
only one price which results from the interaction between what consumers
want to buy, and what sellers or suppliers want to sell. When time
changes, we may observe another price, but the change in time may well
have brought further changes to both demand and supply. In other words,
by not modelling what else may have changed during this period that
would have influenced the relationship between demand and price, Moore
failed to keep all else constant: the ceteris paribus assumption was not
safeguarded.
Over 100 years ago, when Henry Moore published this work, the
understanding of identification of causal effects of, say, prices on the
demand for goods, was still in its infancy. In fact, in a letter to Moore,
Alfred Marshall – another prominent contemporaneous economist in the
analysis of prices, and of demand and supply of quantities – wrote of any
effort to attempt to identify such causal effects, holding all else constant,
as ‘though formally adequate seems to me impracticable’ (Stigler, 1962).
So if the ceteris paribus assumption was not safeguarded in the study
estimating the demand for corn and for pig iron, why do estimates of these
two demand curves have opposite slopes?
While there are other explanations and omitted variables, some of which
we consider in Subsection 2.3.2, it is likely that when estimating the
demand for corn, supply was also shifting over the period 1867–1911 in the
USA due to rising agricultural productivity. This could have resulted in
the supply curve shifting downwards, as shown by curves S1 to S4 in Figure 6. This results in the amount supplied increasing at each and every price level, so the observed price–quantity pairs trace out the expected downward-sloping demand schedule.
The period 1867–1911 was a period soon after the Industrial Revolution
that witnessed dramatic transformations in agricultural and industrial
production processes, and the use of machinery proliferated in the USA.
This may have increased the preferences and willingness to pay higher
prices for pig iron, a key input to the production of machinery, and this
could be represented by higher demand schedules, as shown by curves D1 to D4 in Figure 7. This results in an increasing relationship between observed prices and quantities demanded.

Figure 6 Demand and supply for corn: a single downward-sloping demand curve together with supply curves S1 to S4, each successive supply curve shifted so that more is supplied at every price.

Figure 7 Demand and supply for pig iron: a single upward-sloping supply curve together with demand curves D1 to D4, each successive demand curve shifted outwards so that more is demanded at every price.


2.3.2 Causal relationships: considering possible endogeneity and omitted variables
So why did Moore fail to identify the effect of price on the quantity
demanded of pig iron? Let’s put aside the fact that the measurement of
quantity demanded may have been done with error since it used observed
values of quantities bought and sold, and not a schedule of planned
demand values which, by its nature, is not observable. We will return to
issues to do with errors in measurement in Subsection 3.4.
There were three issues with Moore’s modelling of the relationship between
price and quantity demanded.
• The ceteris paribus assumption was not satisfied. As discussed in
Subsection 2.3.1, the relationship between demand and price was
modelled and estimated in a way which could not hold all else constant.
For instance, demand could not have been modelled separately from
supply as, according to theory, both are simultaneously and
interdependently determined by responding to exogenous prices. So
while Moore estimated
Qd = α0 + α1 P + ud ,
the true econometric model (and already boldly assuming that changes
in the price for this particular market do not have an impact in any
other markets!) would be more like
Qd = α0 + α1 P + ud ,
Qs = β0 + β1 P + us ,
Qd = Qs ,
for all P , and where Qd is the quantity demanded and Qs is the quantity
supplied.
• There were additional omitted variables (such as increasing agricultural
productivity, and increasing preferences for pig iron) whose change over
the period influenced demand, and because these omitted variables were
also correlated with price, their impact was being picked up by changing
prices. The coefficient on price was not the impact of prices alone, but
also of omitted variables’ impacts on quantities demanded.
• Price was not an exogenous regressor either: a price change does not
only cause a change in demand, but also in supply, which requires the
price to adjust again until demand and supply schedules intersect.
(Recall Figure 5 in Subsection 2.3.1.) While prices may be exogenous in
the full demand and supply model, they were not exogenous in the
model of demand alone.
Because of these three issues, the coefficient measuring the effect of price
on quantity demanded was not identifying the causal effect of price on
quantity demanded; the influence of other variables and processes was so
substantial that it is likely the coefficient estimate ended up having the
opposite sign of what the causal effect would have been. While these three


issues became impossible to ignore in the study of the demand for pig iron,
all these issues are likely to have occurred in other studies of demand,
including the one for corn. While the sign of the effect of price on the
demand for corn was the right one according to the law of demand, there is
no reason to expect that the magnitudes are the right ones if one or several
of these issues were at play.
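
A short simulation shows how this simultaneity can flip the sign of the estimated slope; this is only a sketch with assumed parameter values, not a reconstruction of Moore's data:

  # Demand and supply determined simultaneously in each period
  set.seed(1)
  n <- 200
  demand_shock <- rnorm(n)             # e.g. growing preference for pig iron
  supply_shock <- rnorm(n, sd = 0.2)
  # Equilibrium solves Qd = Qs, with Qd = 10 - P + demand_shock
  # and Qs = 2 + P + supply_shock
  price    <- (8 + demand_shock - supply_shock) / 2
  quantity <- 10 - price + demand_shock
  # OLS of quantity on price does not recover the demand slope of -1
  coef(lm(quantity ~ price))["price"]   # comes out positive here
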
In Subsections 2.4 and 2.5, we will be more precise about the conditions
that are required to identify the causal effect of a regressor on the
dependent variable.

2.4 Identification and the expected value of the error term
In the discussion of identification of the effect of price on quantity
demanded in Subsection 2.3.2, we mentioned some related concepts. These
are summarised in Figure 8 along with a condition for such identification,
E(u|X) = 0, that will be discussed in this subsection.

Figure 8 What identification of causal effects requires: a diagram linking the question ‘Is the parameter identifying the causal effect of its regressor on the dependent variable?’ to the conditions ‘Ceteris paribus?’, ‘Omitted variables?’, ‘Exogenous explanatory variable?’ and ‘E(u|X) = 0?’.

What identification of the causal effect of our key explanatory variable requires is that it is exogenous, that there is no omitted variable being
picked up by changing prices, and therefore that all else can be held
constant (ceteris paribus). If all these conditions hold, then we can be
certain that we can attribute the change in the dependent variable with
being due to the change in our variable of interest, and not because of
anything else changing!


We can analyse identification by looking at the properties of the error term of our econometric model. In an econometric model such as
Y = a0 + a1 X1 + a2 X2 + · · · + aK XK + u, (3)
we can claim ak is the causal effect of Xk on Y if and only if the error
term, u, is independent from Xk , for all k = 1, . . . , K.
Identification does not mean that all variables Xk , for all k, need to be
independent from each other. You saw in Unit 2 that, in a multiple
regression model, the coefficient ak will pick up the singled-out effect of Xk
on Y once its correlation with all other X’s has been removed. What
identification of each coefficient requires is that its regressor is not
correlated with the error term.
In a model which is linear in parameters, such as the ones we cover in this
strand, we can use the concept of expected value, described in Box 7, to
formalise this statement.

Box 7 Expected value


An expected value can be interpreted as a theoretical weighted
average. (That is, it is a population average, not a sample average.)
The expected value E(X) of a given variable X defined over the
interval [Xmin, Xmax] is
E(X) = ∫_{Xmin}^{Xmax} X f(X) dX,
where f(X) is the probability density function of X.
This can be extended to the conditional expected value E(X|Z) of
variable X given a random variable Z:
E(X|Z) = ∫_{Xmin}^{Xmax} X f(X|Z) dX,
where f(X|Z) is the conditional probability density function.
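
As a small numerical sketch of this definition, using a standard distribution rather than anything from the module, R's integrate() can approximate an expected value for a given density f(X):

  # E(X) for X exponentially distributed with rate 1, so f(x) = exp(-x) on [0, Inf)
  integrand <- function(x) x * dexp(x, rate = 1)
  integrate(integrand, lower = 0, upper = Inf)   # approximately 1, the theoretical mean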

To put it simply, the causal effects of regressors in a model such as the one
in Model (3) are identified when
E(u|X1 , X2 , . . . , XK ) = 0.
In this class of models, it can be shown that this condition is exactly the
same as
Cov(u, Xk ) = 0, for all k = 1, . . . , K,
where Cov(u, Xk ) is the covariance between u and Xk .
These conditions are summarised in Box 8.


Box 8 Exogeneity assumption


When using OLS to estimate model parameters, we assume the
following about the error term, u:
E(u|Xk ) = 0
for every regressor, Xk , included in the model.
This is equivalent to assuming that there is no correlation, or
covariance, between u and Xk ; that is, Cov(u, Xk ) = 0.

You have seen in Box 8 that OLS assumes that for every regressor Xk in
the model, there is no correlation between it and the error term. Other
methods have been developed to deal with the situation when there is a
correlation between Xk and u. In the rest of the unit, you will be exploring
some of these.

2.5 The importance of OLS


In Unit 2, we considered the OLS estimators of a multiple regression
equation (see in particular Subsection 3.1 and Box 7). Error terms were
assumed independent of each other, and normally distributed with zero
mean and constant variance σ². You also saw in Subsection 5.1 of Unit 2
that to avoid multicollinearity you should, when possible, avoid including
regressors in your model that are very highly correlated with each other.
At the extreme, if two regressors have correlation +1 or −1 their separate
variation cannot be discerned and their coefficients cannot be separately
estimated. Often you will see in the literature that this assumption of no
perfect correlation between regressors, together with the assumption of
positive variance of all regressors, are called identification assumptions.
In Subsection 2.4, we went a step further in discussing identification and
introduced additional assumptions on the error term which go beyond a
zero mean for identification.
This assumption, together with the assumption of no perfect
multicollinearity and positive variances, is very important – and its failure
is problematic – when using OLS.
To see why consider the simple univariate econometric model with a
constant term
Yi = a + bXi + ui ,
where i represents observation i, and assume there are N observations, so
that i = 1, . . . , N .
You saw in Subsection 4.2.1 of Unit 1 that the OLS estimator is the
solution to the minimisation of the sum of the squared differences between
each point, and the regression line it generates. Using a notation which
you may not be very familiar with, OLS is the set of estimators (â, b̂) which, amongst all possible linear estimators (ã, b̃), minimises the sum of the squared residuals:
(â, b̂) = min_{(ã, b̃)} S, where S = ∑_{i=1}^{N} (Yi − ã − b̃Xi)².

It can be shown that this quadratic function of estimators ã and b̃ is convex, which means that the solution is found by solving what is called the first-order conditions of the function, that is, equating its derivative with respect to each unknown to zero. With two unknown estimators, the two first-order conditions, already simplified, are
∂S/∂ã = 0 ⇐⇒ ∑_{i=1}^{N} (Yi − ã − b̃Xi) = 0 ⇐⇒ E(ũ) = 0
and
∂S/∂b̃ = 0 ⇐⇒ ∑_{i=1}^{N} (Yi − ã − b̃Xi) Xi = 0 ⇐⇒ Cov(ũ, Xi) = 0.

The first-order condition associated with the constant term estimator ã repeats the assumption that gives the residuals ũi the characteristic of zero mean expected from the error terms ui.
The first-order condition for the slope coefficient estimator b̃, however, imposes zero correlation between the regressor X and the residuals ũi. So
while OLS assumptions often do not explicitly include the independence
between the error term and the regressors, it sneaks in this assumption by
construction! This is why, given the importance of causality in economics,
we need to analyse whether this assumption holds in practice.
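To see this mechanically, here is a minimal R sketch using simulated data (the variable names and parameter values are illustrative, not taken from the module datasets): the residuals returned by lm() have zero mean and zero sample covariance with the included regressor, because the first-order conditions impose this by construction.

# Minimal sketch with simulated data: the OLS first-order conditions force
# the residuals to have zero mean and zero sample covariance with the regressor.
set.seed(123)
N <- 200
X <- rnorm(N, mean = 5, sd = 2)
u <- rnorm(N)            # error term, generated independently of X here
Y <- 1 + 0.5 * X + u     # true values: a = 1, b = 0.5
fit <- lm(Y ~ X)
uhat <- resid(fit)
mean(uhat)               # essentially zero, by construction
cov(uhat, X)             # essentially zero, by construction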
When the error term and the regressors are uncorrelated, we can expect
OLS estimators to have good properties. Next we will discuss three of
these properties: unbiasedness (in Subsection 2.5.1), consistency (in
Subsection 2.5.2) and efficiency (in Subsection 2.5.3).

2.5.1 Unbiasedness of OLS


One of the most important properties we can expect OLS estimators to
have when the error term and the regressors are uncorrelated is
unbiasedness, described next in Box 9.

Box 9 Unbiasedness of an estimator


We say that an estimator β̂ is unbiased when its expected value equals the true parameter β.


Let’s look at the relation between the unbiasedness of the OLS estimator
for the slope, b, and the error term in the univariate model
Yi = a + bXi + ui .
Furthermore, assume that
• E(ui |Xi ) = 0
• E(u2i |Xi ) = σ 2
• Cov(ui , uj |Xi ) = 0, for all i ̸= j
• V (Xi ) > 0, for all i = 1, . . . , N, N > 2.
(If you are unsure why we specify N > 2, look again at the cartoon at the
beginning of Subsection 4.3 in Unit 1.)
Solving the first-order conditions in Subsection 2.5, the OLS estimators are

â = Ȳ − b̂ X̄

and

b̂ = ∑_{i=1}^{N} (Yi − Ȳ)(Xi − X̄) / ∑_{i=1}^{N} (Xi − X̄)²,

where b̂ is well-defined because V(Xi) > 0.
Replacing Yi − Ȳ by b(Xi − X̄) + (ui − ū) gives

b̂ = b + ∑_{i=1}^{N} (Xi − X̄)(ui − ū) / ∑_{i=1}^{N} (Xi − X̄)².

Unbiasedness requires E(b̂) = b, so in this case it means that

E( ∑_{i=1}^{N} (Xi − X̄)(ui − ū) / ∑_{i=1}^{N} (Xi − X̄)² ) = 0.

By calculating E(b̂ | X), and using the assumption that E(u | X) = 0, we reach the conclusion that E(b̂ | X) = b, which, holding for all X, means that E(b̂) = b must also be true.
This result can be generalised to a multiple regression model by analysing whether the error term is independent of each and all explanatory variables (though the algebra gets messier). So, by assuming this independence, we ensure that we can identify the causal effects of our X's on the dependent variable and that OLS delivers unbiased estimators.
Unbiasedness is a property which can be verified for any sample size.
Being a theoretical average, the expected value of our estimator is the
average of the estimates generated from all possible random same-size
samples from the population. That would be an infinite number of fixed
same-size samples. No particular sample will deliver the true value, but
were we able to estimate our model for this infinite number of samples,
then an unbiased estimator would deliver the true value if we averaged all
estimates obtained.
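The following short R simulation sketches this idea with artificial data (the sample size, number of replications and parameter values are illustrative assumptions): averaging the OLS slope estimates over many same-size samples gives a value very close to the true slope, even though no single estimate equals it.

# Minimal simulation sketch of unbiasedness: average the OLS slope estimate
# over many random samples of the same size and compare it with the true b.
set.seed(42)
b_true <- 0.5
reps <- 5000
b_hat <- numeric(reps)
for (r in 1:reps) {
  X <- rnorm(100, mean = 5, sd = 2)
  u <- rnorm(100)                  # E(u|X) = 0 holds by design
  Y <- 1 + b_true * X + u
  b_hat[r] <- coef(lm(Y ~ X))["X"]
}
mean(b_hat)                        # very close to 0.5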


2.5.2 Consistency of OLS


We say an estimator is consistent when applying it to a sample of infinite N delivers the true value; in notational form, this is when

lim_{N→∞} b̂ = b.

So consistency is an asymptotic property. This means it is a property


which looks at how an estimator would behave if we applied it to a sample
with an infinite number of observations, that is, when N → ∞.
The variance of the OLS estimator decreases with the number of
observations; see Figure 9. Consistency requires this variance to collapse to
zero when N → ∞.

Figure 9 Consistency requires the variance of our estimator, say β̂, to collapse to zero when N → ∞ (sampling distributions shown for N = 25, 50, 80 and 120)
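A similar R simulation, again with illustrative artificial data, echoes Figure 9: the spread of the OLS slope estimates shrinks as the sample size grows (here using the same sample sizes shown in the figure).

# Minimal sketch of consistency: the standard deviation of the OLS slope
# estimates decreases as the sample size N grows.
set.seed(7)
sim_sd <- function(N, reps = 2000, b_true = 0.5) {
  est <- replicate(reps, {
    X <- rnorm(N, mean = 5, sd = 2)
    Y <- 1 + b_true * X + rnorm(N)
    coef(lm(Y ~ X))["X"]
  })
  sd(est)
}
sapply(c(25, 50, 80, 120), sim_sd)   # decreases as N increases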

Other than for a few pedagogical exceptions, an estimator which is


unbiased will also be consistent, while there are many consistent estimators
which are biased. For this reason, we often focus on sources of bias when
analysing the quality of an estimator.


2.5.3 Efficiency of OLS


We end this subsection about the importance of OLS by discussing
whether it is efficient. For this, we make use of the Gauss–Markov
theorem, described in Box 10.

Box 10 Gauss–Markov theorem


The OLS estimator of the econometric model

Yi = α + β1 X1 + · · · + βK XK + ui

will generate α̂, β̂1, . . . , β̂K and a set of residuals ûi.
If the assumptions discussed in Subsection 2.5 hold, including the identification assumption

E(ui | X1, . . . , XK) = 0, for all i = 1, . . . , N,

then the Gauss–Markov theorem states that the OLS estimator will have the BLUE property; that is, it will be the Best Linear Unbiased Estimator.

In this case, ‘efficient’ means that OLS has the lowest possible variance in the class of linear unbiased estimators; that is, it is the best linear unbiased estimator (BLUE). This is important, as a lower variance of an estimator means smaller and more informative confidence intervals for each of the parameters of our model.
For example, Figure 10 represents the sampling distribution of two
estimators for a parameter β.

Figure 10 Sampling distributions of two estimators of β, both unbiased

Both estimators in Figure 10 are unbiased, as the average value of the


estimator in both cases is the true value of β. However, the distribution of
the estimator represented by the red solid line is more concentrated around
the true value and hence is more precise and informative than the
estimator represented by the blue dotted line. OLS is the estimator with
the most concentrated distribution around β when all assumptions hold.
We have already seen that OLS will be unbiased, and so will generate an
estimator with a distribution centred around the true value, often
bell-shaped such as the normal or the t-distribution, and a mean equal to
the true value β. In contrast, a biased estimator would have a distribution
to the left or the right of β, as shown in Figure 11. The distance between
the two means is the bias.

Figure 11 Sampling distributions of two estimators of β, one biased, one unbiased

Unbiasedness and minimum variance are known as finite sample


properties; they hold irrespective of the size of the sample, as opposed to
consistency as seen in Subsection 2.5.2. The Gauss–Markov theorem only
holds for linear estimators. Notice that as explained in Box 11, the
linearity of an estimator is different from – although strongly related to –
the linearity in parameters of the econometric model.


Box 11 Class of linear estimators


We say an estimator is linear if it can be written as a weighted
average of the observations of the dependent variable Y , and where
the weights do not depend on Y .

We have shown that in the univariate model, the OLS estimator for the slope coefficient can be written as

b̂ = ∑_{i=1}^{N} (Yi − Ȳ)(Xi − X̄) / ∑_{i=1}^{N} (Xi − X̄)²
  = ∑_{i=1}^{N} [ (Xi − X̄) / ∑_{i=1}^{N} (Xi − X̄)² ] Yi

(the second line uses the fact that ∑_{i=1}^{N} (Xi − X̄) Ȳ = 0).
Therefore, the weights wi, for all i = 1, . . . , N, are

wi = (Xi − X̄) / ∑_{i=1}^{N} (Xi − X̄)²,

which do not depend on Yi. Hence, b̂ is a linear estimator.
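As a quick check of this weighted-average representation, the following R sketch (with simulated, illustrative data) computes the weights directly and confirms that the weighted sum of the Yi reproduces the slope reported by lm().

# Minimal check with simulated data: the OLS slope equals a weighted sum of
# the Yi with weights wi = (Xi - mean(X)) / sum((Xi - mean(X))^2).
set.seed(1)
X <- rnorm(50, mean = 10, sd = 3)
Y <- 2 + 0.7 * X + rnorm(50)
w <- (X - mean(X)) / sum((X - mean(X))^2)
sum(w * Y)              # slope written as a weighted sum of Y
coef(lm(Y ~ X))["X"]    # the same value from lm()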


OLS has become the workhorse of econometrics because it is BLUE under particular assumptions. In Section 4, we will look at estimators which are more robust to possible sources of endogeneity and bias, but if we use these alternative estimators when they are not needed, we give up precision and the lowest possible variance of our estimates. The remainder of this unit will look at sources of bias and at alternative estimators. Some of the latter depend on the data structures available. The next section discusses data issues in detail.

3 Data: structures, sampling and measurement
In the preface of his 1957 book, A Theory of the Consumption Function,
Milton Friedman stated that:
The theory of the consumption function proposed in this book
evolved over a number of years. During most of this period I was not
engaged in empirical work on consumption.
(Friedman, 1957)

This section will look at how the use of data and the limits and
opportunities present in data relate to the type of confirmatory analysis
done in econometrics and discussed in this unit.
As was mentioned in Subsection 1.1, in confirmatory analysis we look to
see the extent to which data support a theory. So, in Subsection 3.1 we
introduce a number of different theories about consumption and income.


In Subsection 3.2 we will consider three different structures that


econometric data may have. We then consider two sources of bias which
are data-driven: representativeness in Subsection 3.3, and measurement
error in Subsection 3.4.

3.1 Competing theories of consumption following Keynes
In Subsection 2.1, we introduced the Keynesian consumption function
C = c0 + bI + ε,
where C is consumption and I is income, both measured in real terms.
This has become known as the absolute income hypothesis (AIH). As
you saw in that subsection, two implications of this model are that the
average propensity to consume (APC = C/I) is never lower than the
marginal propensity to consume (MPC, the slope in the Keynesian
consumption function) and that the average propensity to consume
decreases with income. Very shortly after its proposal, and after a period
where empirical evidence strongly supported this theory, the Keynesian
consumption function and absolute income hypothesis (AIH) faced strong
opposition within economics.
Kuznets, an influential economist, used US annual data on consumption and income between 1869 and 1938 (the longest time series used in that period) to produce results showing that the average propensity to consume was constant, not decreasing in income as Keynes had proposed. The disparity between the estimated average propensity to consume obtained from data such as these US annual data and the estimates obtained from cross-sectional family budget data or shorter time series became known as Kuznets’ consumption puzzle.
Kuznets’ consumption puzzle gave rise to several alternative theories
explaining the relationship between consumption and income. In
Subsections 3.1.1 to 3.1.3 we will briefly, and in a very stylised manner,
discuss three of these theories, along with the implications they have for
the parameters of a consumption function econometric model.

3.1.1 Modigliani’s lifecycle theory of consumption


Modigliani’s lifecycle theory of consumption models consumption not as a function of current income, but as a function of the income individuals expect to accumulate over their entire life. This stock of all the income generated is what economists call wealth, which is often denoted by W (which, unfortunately, is how random error was represented earlier in the module!).
Modigliani’s basic econometric model of his lifecycle income hypothesis
(LIH) represents consumption C as a function of wealth W , and not of
current income I:
C = b W + u,


where b is a parameter to be estimated. The intercept in Modigliani’s consumption function is omitted since it does not have the same strong meaning as it did in Keynes’ theory. Subsistence consumption does not enter the Modigliani world, which assumes consumption smoothing is seamless and that no consumption is spent beyond each individual’s lifetime means. As a result, the average propensity to consume (the ratio of consumption to income) becomes

C/I = b W/I.
Modigliani could explain Kuznets’ consumption puzzle by looking at the relation between wealth and income. For an individual, current income is likely to be increasing over their working life, while bulky purchases such as a house, a car or a degree become less likely! So with constant wealth (the numerator) and an increasing income (the denominator), one would observe individuals’ decreasing APC. For countries, assuming economies grow and accumulate wealth over time, both numerator and denominator would increase at the same rate, creating a constant APC.

3.1.2 Friedman’s permanent income hypothesis


A more popular theory, however, was proposed by Milton Friedman (quoted at the start of this section). He argued that while individuals make their consumption choices based on income, their consumption does not respond to transitory changes which they know will not affect their permanent or stable income. He envisaged current income as the sum of a permanent and a transitory component; let’s call these I^P and I^T, where superscripts P and T refer to the permanent and the transitory component, respectively. (They are purely notational and not meant to represent the power function.)
Friedman’s basic econometric model of his permanent income
hypothesis (PIH) represents consumption C as a function of permanent
income only, and not of total current income I:
C = b I^P + u,

where b is a parameter to be estimated. As in Modigliani’s model, the intercept is also zero. There is a strong assumption in Friedman’s model regarding individuals’ ability to know what their permanent income is throughout their lifetime, and to know what constitutes a transitory income shock and what constitutes an update on their permanent long-run income.
The average propensity to consume in Friedman’s model is

APC = C/I = b I^P / (I^T + I^P).


3.1.3 Duesenberry’s relative income hypothesis


A third explanation questioned what consumers value and consider when
making their choices. The idea that consumers consume more when they
can afford more was questioned by Duesenberry. Drawing on psychology,
sociology and a vast knowledge and evidence of how humans behave, he
claimed that individuals often buy what their reference group or peers are
buying, which adds a new dimension to the simple linear relationship
between consumption and income of alternative theories. For Duesenberry,
consumption choices were therefore not absolute, but relative to this
group. He claimed individuals get more pleasure out of – and therefore
seek – shared consumption experiences with others they relate to more
strongly, or aspire to be.
Duesenberry’s relative income hypothesis (RIH) translates into a
basic econometric model where the strength of the relationship between
consumption and income lies in the way each relates to the consumption
and the income of a reference group. A possible model (bearing in mind
there are several different ways of representing what others are earning and
consuming) is
C / ∑_j wj Cj = a + b I / ∑_j wj Ij + u,

where j represents a member of the relevant reference group, and wj represents the weight w given to each individual j.
When income increases slowly and steadily in the economy, it can be
shown that consumption is proportional to income, which explains the
constant APC found by Kuznets.

3.2 Data structures


In Subsection 3.1, you have seen that there are a number of theories about
the relationship between consumption and income. So how do we choose
between them? The following quote is worth bearing in mind.
No amount of data can prove a hypothesis, but they can disprove it.
All we can do is to take the available materials and see whether they
are consistent with our theory.
(Duesenberry, 1949, ch. IV, p. 47)

Earlier studies of the consumption function, most of which provided strong support for the Keynesian consumption function, either used data on a specific country over a short period of time (from soon after the First World War until the 1930s), or family budget data focused on particular categories of goods, such as food. You will consider the differences between these types of data in the next activity.


Activity 7 Units of observation and level of analysis


What are the main differences between country-level data and family-level
data on consumption and income?

Varying the unit of observation from a country to a family or individual


also changes the perspective and level of analysis in economics. Analysis of
country-level behaviour, and theories stemming from such aggregate
behaviours, fall into the realm of macroeconomics, whereas the analysis
of smaller units such as individuals and families, and theorising about their
individual behaviours separately, is the subject of microeconomics.
In the Solution to Activity 7, two different types of data structure were described: time series and cross-sectional data. There is a third data structure, which has become more popular in recent years, and which combines
cross-sectional and time series data. It records information on a group of
cross-sectional units over more than one time period and is called panel (or
longitudinal ) data. A more detailed description of panel data is given in
Subsection 3.2.1. Then Subsection 3.2.2 deals with the issue of choosing
the data structure that is most appropriate, before you use R in
Subsection 3.2.3 to explore data that conforms to each of these three data
structures.

3.2.1 Panel data


Panel data (also known as longitudinal data) is a dataset in which the
behaviour of a set of entities is observed across time. These entities could
be individuals, households, firms, countries or any other economic agent.
With panel data, information about observations is collected in waves,
which means it is collected repeatedly at regular time intervals. This
makes a panel dataset a combination of cross-sectional and time series
data. Whereas cross-sectional data is arranged as a two-dimensional table
with units of observation (or subjects) as rows and variables as columns,
by adding time to the sample, panel data is more like a three-dimensional
cube as in Figure 12, shown next.
In practice, panel datasets are rarely perfect cubes; if the number of
entities (N) is large and the number of time periods (T) is small, we have a
short panel; conversely if T > N , we have a long panel (Figure 13).
Similar to the ‘short/long’ in time distinction is the idea of wide and
narrow panels in the cross-sectional dimension.
Where a panel has the same number of observations for each entity, i.e. no
missing data for any entities or time periods, it is called a balanced
panel; so an unbalanced panel is one in which there is missing data for
at least some of the cells in the cube.
In this unit, we will only discuss short panels. The properties of estimators and of statistical modelling when there is a large number of time periods (and when asymptotics are analysed as time tends to infinity) are left to Unit A2.


Figure 12 The three dimensions of panel data: subjects (i = 1, . . . , N), variables (Y, X1, . . . , XK) and time periods (t = 1, . . . , T)

Figure 13 Shapes of panel data: narrow (N small) versus wide (N large) in the cross-sectional dimension, and short (T small) versus long (T large) in the time dimension


The most general form of a model for panel data would be


Yit = β0it + β1it X1it + · · · + βKit XKit + uit ,
where uit is a zero mean disturbance term which satisfies E(uit |X) = 0.
The subscript i ranges across the N entities of the panel, i = 1, . . . , N and
t ranges across the T time periods t = 1, . . . , T . In an unbalanced panel,
the number of observations per individual may be lower than T; we call these Ti. The total number of observations is then ∑_{i=1}^{N} Ti. In a balanced panel, the number of observations is just NT.
As before, the analysis of the error term will help decide what are the best
model specification and estimation.
Panel data offer more options for identifying causal effects: there are estimators designed specifically for such data, estimators for cross-sectional data can also be used, and, under additional assumptions (as you will see in Unit A2), so can estimators for time series data. However, panel data have their own limitations and reasons for caution. Panel data collection is expensive and participants often drop out of the study as the years go by. If participants drop out of the study in a random way, this will not cause additional problems for identification. However, often the participants who drop out differ from the stayers in ways which are likely to be correlated with the regressors of interest. This is a type of bias called attrition bias, where attrition refers to the loss of observations as time goes by.
Due to costs, and often the impossibility of contacting dropouts, an analysis of the dropouts is usually not carried out. Instead, as with other data structures and studies, panel data require a holistic view of what is being observed, and what is not.
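In R, a panel is typically stored as a data frame with one row per entity-period pair, plus an entity identifier and a time identifier. The sketch below uses the plm package with a toy data frame; the data frame and its column names are purely illustrative and are not the module’s psid data.

# Minimal sketch of a balanced panel in R using the plm package (toy data).
library(plm)
toy <- data.frame(
  id     = rep(1:3, each = 4),    # N = 3 entities
  period = rep(1:4, times = 3),   # T = 4 time periods
  y      = rnorm(12),
  x      = rnorm(12)
)
ptoy <- pdata.frame(toy, index = c("id", "period"))
pdim(ptoy)   # reports N, T and whether the panel is balanced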

3.2.2 What/who to observe, and how often


As you have already seen in Subsection 3.2, three data structures that are
used with economic data are: cross-sectional data, time series data and
panel data. These are summarised in Box 12.

Box 12 Data structures


Cross-sectional data relate to a sample of units (individuals,
countries, bags of coffee, and so on) collected at a particular moment
in time.
Time series data relate to one unit collected over several time
periods (minutes, days, months, years, centuries, etc.).
Panel data combine cross-sectional and time series data, collecting
information on several cross-sectional units over more than one time
period.


In Subsection 3.1, four different hypotheses about the relationship between


consumption and income were described:
• Keynes’ absolute income hypothesis (AIH)
• Modigliani’s lifecycle income hypothesis (LIH)
• Friedman’s permanent income hypothesis (PIH)
• Duesenberry’s relative income hypothesis (RIH).
The next question for the econometrician is which data structure and units
of observation are appropriate to estimate an econometric model
representing each hypothesis.
It turns out that studies using the AIH have used individuals, families,
countries and regions as units of observation. The same types of units of
observation can also be used to investigate the LIH, PIH and RIH. So this
leaves us with considering which data structure, or structures, are required
for each hypothesis. This is what you will do in the next activity.

Activity 8 Data requirements

What data structures can be used to estimate an econometric model for


each of the four theories of consumption?

Econometric models are often written in a way that signals the type of
data they use. The index i is often used in cross-sectional studies and the
index t in time series. Panel data, because each variable is
two-dimensional, uses index it. The univariate version of the generic model
in Model (3) in Subsection 2.4 becomes, for each data structure:
Cross-sectional: Yi = a + bXi + ui , for all i = 1, . . . , N
Time series: Yt = a + bXt + ut , for all t = 1, . . . , T
Panel: Yit = a + bXit + uit , for all i = 1, . . . , N and t = 1, . . . , T.

When we add data-related information, it also becomes clearer


that a and b do not vary across the observations, but the error term and
included variables do. In this unit, we will look at examples which use
cross-sectional data and panel data. Unit A2 will explore time series and
will revisit the consumption/income relation using this data structure.
In Table 1 we give a summary of basic econometric models and resulting
APC formulations for each of the theories of consumption covered in this
unit. Testing these theories requires specific data structures, and more
often than not, differences in results and conclusions about which theory
receives support depend on the data. In the coming sections, we will look
at some of the biases data can add to an econometric study. But before
that, you will explore different data structures using R.


Table 1 Theories and econometric models representing the relationship between consumption and income

Keynes’ AIH: econometric model C = a + bI + u; APC = C/I = a/I + b; minimum data structure required: cross-sectional or time series.
Modigliani’s LIH: econometric model C = bW + u; APC = b W/I; minimum data structure required: time series.
Friedman’s PIH: econometric model C = b I^P + u; APC = b I^P / (I^T + I^P); minimum data structure required: time series.
Duesenberry’s RIH: econometric model C/∑_j wj Cj = a + b I/∑_j wj Ij + u; APC = a (∑_j wj Cj)/I + b (∑_j wj Cj)/(∑_j wj Ij); minimum data structure required: cross-sectional.

3.2.3 Using R to explore different data structures


In this subsection, we will explore one example of each data structure in R:
• a cross-sectional dataset in Notebook activity A1.3
• a time series dataset in Notebook activity A1.4
• a panel dataset in Notebook activity A1.5.
The datasets – the clothing firms dataset, the imports and exports dataset
and the PSID dataset – are described next.

European firms producing textiles and footwear


The dataset we will use as an example of a cross-sectional dataset
comes from the Amadeus database that was introduced in
Subsection 1.2.1 of Unit 6.
The clothing firms dataset (clothingFirms)
This is a cross-sectional dataset of 20 096 European firms engaged in
the industry sector involved with the manufacture of textiles, wearing
apparel and footwear.
There are five variables in the dataset:
• segment: a standard industry classification (SIC) code indicating
the precise segment of industry that the firm is engaged in
• year: the year that the observation was made
• production: the output each firm produces represented by its gross
sales
• labour: the labour input to the firm’s production, represented by
the number of employees
• capital: the capital input to the firm’s production, represented by
the value of its fixed assets in the year of observation.
The data for the first six observations from this dataset are given in
Table 2.


Table 2 First six observations from clothingFirms

segment year production labour capital


1413 2019 557 35 88
1413 2018 2210 18 202
1511 2019 166403 2467 17580
1413 2019 1085 15 37
1413 2019 856 72 12
1320 2020 9837 78 2929

Source: Amadeus, 2020, accessed 22 November 2022

UK imports and exports


This is a time series dataset containing quarterly data for imports,
exports and income for the UK from 1995 to 2019.
The imports and exports dataset (importExport)
The dataset contains five variables:
• year: the year that the observation relates to
• quarter: the quarter that the observation relates to
• exports: the value of real exports, in billions of £
• income: the value of national income (as measured by GDP), in
billions of £
• imports: the value of real imports, in billions of £.
The data for the first six observations from this dataset are given in
Table 3.
Table 3 First six observations from importExport

year quarter exports income imports


1995 1 53.361 207.213 49.217
1995 2 52.396 207.345 54.541
1995 3 54.676 213.422 56.340
1995 4 56.827 222.201 54.845
1996 1 56.954 220.555 56.799
1996 2 58.658 222.577 60.249

Source: Eurostat (2022)

The Panel Study of Income Dynamics


This dataset is an extract from a lengthy longitudinal study
conducted in the USA called the Panel Study of Income Dynamics
(PSID). This study began in 1968 with a nationally representative


sample of over 18 000 individuals living in 5000 families in the USA.


Information on these individuals and their descendants has been
collected continuously, including data covering employment, income,
wealth, expenditures, health, marriage, childbearing, child
development, philanthropy, education and numerous other topics.
The PSID dataset (psid)
This extract has 4165 observations, which relate to 595 individuals for
each of 7 years (out of more than 53 years of data available at the
time of writing) to form a balanced panel.
The dataset contains data for the following variables:
• exper: the number of years of full-time experience
• occupation: taking values 1 (blue-collar occupation) and 0 (other
occupations)
• gender: taking values 0 (male) and 1 (female)
• educ: the number of years of education
• ethnicity: taking values 1 (black) and 0 (otherwise)
• wageLog: natural logarithm of hourly wages (in cents of a dollar)
• period: year of the study that the observation relates to
• id: individual.
The data for the first three observations of individuals 1 and 2 are
given in Table 4.
Table 4 First three observations from psid for id values 1 and 2

exper occupation gender educ ethnicity wageLog period id


3 0 0 9 0 5.56 1 1
4 0 0 9 0 5.72 2 1
5 0 0 9 0 6.00 3 1
30 1 0 11 0 6.16 1 2
31 1 0 11 0 6.21 2 2
32 1 0 11 0 6.26 3 2

Source: Stata Press, no date, accessed 10 January 2023

Notebook activity A1.3 Exploring a cross-sectional


dataset
In this notebook, you will explore a cross-sectional dataset,
clothingFirms, and summarise some of the main features of its
variables.


Notebook activity A1.4 Exploring a time series dataset


In this notebook, you will explore a time series dataset,
importExport, and summarise some of the main features of its
variables.

Notebook activity A1.5 Exploring a panel dataset


In this notebook, you will open psid and set it up in R as a panel,
then explore its shape and some of its properties, and summarise the
main features of its variables.

3.3 Sampling
Some of the earlier cross-sectional studies of consumption and income have
been criticised for having been too narrow in terms of goods consumed, or
in terms of regions or socio-economic groups considered. Some of the
earlier time series studies in this field were instead criticised and their
results dismissed because the time period considered was too short, or
because the period observed had undergone structural breaks, such as
world wars.
The reasoning behind this criticism relates to the relationship between the
sample of observations and the population the sample aims to represent
and generalise back to. As discussed in Section 2, economists aim to
estimate causal relationships between variables for which they need
unbiased and consistent estimators. This will be difficult if not impossible
to achieve if the sample in the dataset used is not representative of the
population. The key question is: what is the relevant population for each
data structure?
Cross-sectional and panel datasets aspire to represent the group of subjects from which their individuals, firms or families were sampled. For consistency, the asymptotic behaviour of estimators which use these structures requires analysing their properties as the number of subjects increases and, in the limit, tends to infinity. In contrast, time series asymptotics require time to tend to infinity. You will explore more about time series data structures, and the properties required of a time series sample, in Unit A2.
Unbiasedness of estimators based on cross-sectional and panel data can be achieved with a small sample of subjects. What is key is that each subject is sampled with the same probability as its weight in the population, such as in the scenario given in Example 1.


Example 1 A representative sample


Suppose the population of interest is some orange and green balls that
are equal in everything except their colour. Further suppose that in
this population 50% of the balls are orange (and hence the other 50%
are green).
A representative sample of this population would also consist of roughly 50% orange balls and roughly 50% green balls. (Randomness from the sampling process means that we would not expect the proportions to exactly match those in the population, but they should be close.)

An unrepresentative sample is one that is too different from the population, for reasons which go beyond randomness and chance. One such scenario is given in Example 2.

Example 2 An unrepresentative sample


It often happens in surveys that the families or individuals who do
not fill out a survey are different in some aspects of their life, which
may be relevant to the study, from the ones who do fill it out. The
sample of individuals who fill out the survey is therefore different from
the population of individuals the survey was targeting, because those
with features associated with the tendency not to fill out the survey
end up under-represented in the study.

Increasing the size of an unrepresentative sample will not change it from


being unrepresentative to being representative. The differences between an
unrepresentative sample and the population introduce what is known as
selection bias. This is summarised in Box 13.

Box 13 Representative samples and selection bias


A sample should be representative of the population from which it is
drawn; that is, each observation should be sampled with the same
probability as its weight in the population. When this does not occur,
and the sample differs from the population in ways that bias
estimation, there is selection bias.


As before, unbiasedness of OLS would require E(u|X) = 0. We can show


that selection bias, and the lack of representativeness of the sample used, violate this assumption and render OLS a biased and inconsistent estimator.
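The following R simulation is a minimal sketch of how this can happen (the data-generating process and the selection rule are illustrative assumptions): when whether a unit is observed depends on its outcome, the errors retained in the sample are correlated with the regressor and the OLS slope is biased.

# Minimal simulation sketch of selection bias: units are only observed when Y
# exceeds a threshold, so the retained errors are correlated with X and the
# OLS slope is biased (here, towards zero).
set.seed(99)
N <- 10000
d <- data.frame(X = rnorm(N, mean = 5, sd = 2))
d$Y <- 1 + 0.5 * d$X + rnorm(N)
keep <- d$Y > 3                           # selection depends on the outcome
coef(lm(Y ~ X, data = d))["X"]            # full sample: close to 0.5
coef(lm(Y ~ X, data = d[keep, ]))["X"]    # selected sample: noticeably smaller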

3.4 Measurement
Other than the units of observation, the data structure and the sampling,
a fourth consideration to make when analysing data is the measurement of
variables.
In economics, specifically in the theories of consumption proposed, an
additional data challenge is the gap between existing observable data and
what is often conceptualised. The Keynesian consumption function and
AIH, which is by far the theory of the four introduced in Subsection 3.1
which has fewer challenges with measurement given widely available data,
already imposes several assumptions on the data. For instance, while data
available often measure observed consumption and earned income, the
Keynesian theory of consumption models a relationship between planned
consumption and planned income, two theoretical constructs of which
existing data can only be an approximation. Assumptions needed to
measure the consumption or income of a reference group (in Duesenberry’s
RIH), or to measure lifetime income or permanent income (in Modigliani’s
LIH or Friedman’s PIH), are however unquestionably bolder.
What do measurement issues add to the challenges of estimating causal
relationships? The immediate answer is measurement error and possible
bias.
Intuitively, one may expect that if a variable is measured with error, but
this error is not correlated with the error term of the econometric model,
then one would hope this would only add randomness and extra noise to
our estimation results. While this is true for errors in the measurement of
the dependent variable, things get more complicated when it is the
explanatory variable which is measured with error; see Box 14.

Box 14 Impact when the explanatory variable is


measured with error
If the econometric model is
Y = a + bX + u
but instead of observing X exactly we can only observe x where
x = X + εX ,
then the estimated model is
Y = a + bx + (u − b εX ),
where (u − bεX ) is the error term in the estimated model.


In this situation the covariance between the values used for the explanatory variable and the error term of the regression is
Cov(x, u − b εX) = Cov(X + εX, u − b εX),
which is no longer zero. Assuming εX is uncorrelated with both X and u, this covariance equals −b V(εX): a function of the variance of the measurement error and the slope b.
• When b is positive, this covariance is negative.
• When b is negative, this covariance is positive.

The bias introduced by measurement error is called attenuation bias since it reduces, or attenuates, the magnitude of the estimate of the parameter. This type of bias occurs when we assume a very simple type of error which is purely exogenous and independent of everything else in the model. Measurement error can cause other types of bias too; it all depends on how the model and measurement error terms correlate with the measured variables.
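A minimal R simulation sketch of attenuation bias is given below (the parameter values are illustrative assumptions): measuring the regressor with independent noise pulls the estimated slope towards zero, by a factor that depends on how large the measurement error variance is relative to the variance of the true regressor.

# Minimal simulation sketch of attenuation bias: the regressor is observed
# with independent measurement error, which pulls the OLS slope towards zero.
set.seed(2023)
N <- 10000
X_true <- rnorm(N, mean = 5, sd = 2)
Y <- 1 + 0.5 * X_true + rnorm(N)
x_obs <- X_true + rnorm(N, sd = 2)   # observed regressor = truth + noise
coef(lm(Y ~ X_true))["X_true"]       # close to 0.5
coef(lm(Y ~ x_obs))["x_obs"]         # roughly halved: attenuated towards zero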

4 Estimating causal relationships


In Section 2, we showed the importance of OLS as a workhorse of
econometrics. You saw in Subsection 2.5.1 that if the error term is
uncorrelated with each and all the regressors then OLS, when applied to
cross-sectional data, will always deliver unbiased estimates. You also saw
in Subsection 2.5.3 that if all additional assumptions hold (see
Subsection 2.4), OLS is the best linear unbiased estimator (BLUE).
Now we are in a position to evaluate whether, in each econometric
problem, OLS is a suitable estimator in cross-sectional data. One reason,
as you will see in Subsection 4.1, why it might not be is due to the
distribution of the error term. Also, as you will see in Subsection 4.2, it
might be due to which regressors are – or more importantly are not –
included in the model. This can affect the key assumption made when using OLS, which is E(u|X) = 0.
For cases when this key assumption is unreasonable, we discuss in Subsection 4.3
three alternative methods which are available with cross-sectional data. In
Section 5, we will also look at alternatives to OLS which are available with
panel data. The analysis and estimation of time series models is left to
Unit A2.


4.1 Transforming variables


The quality of our estimation can often be improved by transforming some
or all of our data. You will have done this in other units, in particular in
Section 4 of Unit 2. One of the assumptions discussed earlier in the
module was the normality of the error term, for example in Section 5 of
Unit 1 and Subsection 3.1 of Unit 2. Ideally, variables should have their
observations scattered in a way which has a high concentration around the
mean, and a thinning concentration towards more extreme values. The
speed at which the density of observations declines should also be
symmetric around the mean. This is what is called a bell-shaped
distribution and is represented in Figure 14.

Figure 14 PDFs of the standard normal, N(0, 1), and t distributions

In this unit, we do not always need to assume normality of the errors since
we can sometimes rely on the central limit theorem to be able to infer
properties of the estimators. The central limit theorem states that, even if the underlying population from which we are sampling is non-normal, the distribution of the sample mean (or sum) approaches normality as the number of observations becomes large (i.e. tends to ∞). The key
assumptions of the central limit theorem are:
• a constant mean µ
• a finite variance σ 2
• independence; in the models we encounter in this strand (and elsewhere
in the module), independence means zero covariance between
observations, cov(ui, uj) = 0, for all i ≠ j.
This means that the observations are independent and identically
distributed, which we refer to simply as i.i.d.
You will recall that these three assumptions are part of the set of
assumptions required for OLS to be BLUE. However, very often in economics data, variables are not symmetric around a mean value, and
often exhibit a large concentration of observations at low values and a very
sparse frequency of higher values. That is, the data are right-skew.


Log transformation of variables which are skewed to the right is a popular


solution. It is a transformation for right-skew data suggested by the ladder
of powers (Box 17, Subsection 4.2, Unit 2). Common terminology for the
untransformed and transformed data is given in Box 15.

Box 15 In levels and in logs


In econometrics, a variable which has not been transformed is often
referred to as being measured in levels. The same variable to which
the log transformation has been applied is often referred to as being
measured in logs.

Another reason why the log transformation is popular in econometrics is


because the interpretation of slope coefficients when either the dependent
variable or the regressor (or both!) are in logs is very insightful in
economics.
In a multiple regression equation where all variables are measured in levels,
generally defined as
Y = a + b1 X1 + b2 X2 + · · · + bK XK + u,
any slope coefficient, such as b1 , measures the absolute change in Y as a
response to an absolute change in X1 .
The absolute change in a variable Y as a result of the change in another
variable X can be a helpful piece of information, but we may also want to
know, for instance, the relative change in Y as a result of a change in X.
This change in X could also be either an absolute change of one unit, or a
relative change.
The difference between absolute change and relative change is explained in
Box 16.

Box 16 Absolute change, relative change, and elasticity


Absolute change is the difference between two levels of the variable
Z, that is,
∆ Z = new level of Z − initial level of Z.
(The symbol ∆, pronounced ‘delta’, is the upper-case version of the
fourth letter in the Greek alphabet.)
Relative change (and percentage change) is the absolute change
in a variable expressed as a proportion (percentage) of its initial level,
that is,

relative change in Z = ∆Z / (initial level of Z).


Economists are often interested in the relative change in a variable Y as a


result of a relative change in another variable X. In terms of the notation
above, this is represented by the ratio between two relative changes: the
relative change in Y divided by the relative change in X. This is what is
called the elasticity of Y with respect to X and is described in Box 17.

Box 17 Elasticity
The elasticity of Y with respect to X is often denoted εY,X (read
as ‘epsilon of Y with respect to X’) and can be written as
εY,X = (relative change in Y) / (relative change in X) = (∆Y/Y) / (∆X/X).

A model in levels will deliver non-constant elasticities. To obtain constant


elasticities from a regression equation, or at least a good approximation to
constant elasticities, the log-log model is often used. When there are just
the variables Y and X, the log-log model can be given as
log Y = a + b log X + u.
As you can see in Aside 1, the log-log model estimates a relationship
between X and Y that has the useful feature that the estimated slope
coefficient b is approximately equal to εY,X , the elasticity of Y with respect
to X, and is constant for all values of X.

Aside 1 Interpreting regression coefficients as elasticities


Box 17 defined elasticities. Here, we describe a useful feature of the
coefficients of log-log models.
Consider the regression model
log Y = a + b log X + u.
The coefficient b represents the slope of the regression line. Using
differential calculus, this slope can be expressed as
b = ∂(log Y) / ∂(log X).

Now

∂(log Y) = ∂Y / Y   (since ∂(log Y)/∂Y = 1/Y);

similarly, for X,

∂(log X) = ∂X / X.

Putting these together, we see that

b = ∂(log Y) / ∂(log X) = (∂Y/Y) / (∂X/X).

Since ∆Y ≃ ∂Y for small changes in Y (and ∆X ≃ ∂X for small changes in X), this means

b ≃ (∆Y/Y) / (∆X/X),

which, from Box 17, is the definition of elasticity.
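The following R sketch illustrates this interpretation with simulated data (the elasticity of 0.6 and the other numbers are illustrative assumptions, not estimates from the module datasets).

# Minimal sketch: in a log-log regression the slope is (approximately) the
# elasticity of Y with respect to X; here the data are generated so that the
# true elasticity is 0.6.
set.seed(11)
N <- 1000
X <- exp(rnorm(N))                       # positive, right-skewed regressor
Y <- exp(0.2 + 0.6 * log(X) + rnorm(N, sd = 0.3))
coef(lm(log(Y) ~ log(X)))["log(X)"]      # close to 0.6: a 1% rise in X goes
                                         # with about a 0.6% rise in Y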

In Subsection 3.2.3 we introduced the clothing firms dataset. Table 5


shows some summary statistics for the variables production, labour and
capital from this dataset.
Table 5 Selected summary statistics for production, labour and capital

Variable name Mean Median Standard deviation


production 9724 1953 77054
labour 83 33 302
capital 2979 334 29504

The values in Table 5 indicate that all three variables are skewed to the
right because
• the mean is much larger than the median for all three variables
• the standard deviations are much larger than the means.
The following plots of the distribution for each variable, in Figures 15(a),
(b) and (c), confirm the right-skewness of the data.


Figure 15 Distribution of (a) production (millions of euros), (b) labour (thousands of employees) and (c) capital (millions of euros), all variables in levels


Despite this skewness, we can still fit a model to these data. In Activity 9,
you will consider the results from fitting one such model.

Activity 9 Considering a model for production

Using the data in the clothing firms dataset, the following model for
production was fitted: production ∼ labour + capital.
The coefficients for this model are given in Table 6 and some summary
statistics of the residuals are given in Table 7.
Table 6 Coefficients for production ∼ labour + capital

Coefficient    Estimate    Standard error    t-value    p-value
Intercept 1189 415.3 2.86 0.004
labour 53.96 1.760 30.66 < 0.001
capital 1.364 0.018 75.84 < 0.001

Table 7 Summary statistics for the residuals from production ∼ labour + capital

Minimum      First quartile   Median   Third quartile   Maximum
−1 042 572   −3091            −1840    12               5 851 346

(a) Interpret the coefficients given in Table 6.


(b) For this model, R2 is 0.4632 and Ra2 is 0.4631. Use these values, and
the summary statistics for the residuals given in Table 7, to comment
on this model.

After creating the log transformation of each of these variables, the


distribution of values more closely resembles a bell-shaped curve. This can be seen
in Figures 16(a), (b) and (c), next.


Figure 16 Distribution of (a) production, (b) labour and (c) capital, all variables in logs

In Activity 10, you will consider the results from a model similar to that
considered in Activity 9 but this time the variables are in logs, not in levels.

Activity 10 Interpreting coefficients in log-log models

Using the data in the clothing firms dataset, the following model for
production was also fitted:
log(production) ∼ log(labour) + log(capital).
The coefficients for this model are given in Table 8 and some summary
statistics of the residuals are given in Table 9.


Table 8 Coefficients for log(production) ∼ log(labour) + log(capital)

Coefficient    Estimate    Standard error    t-value    p-value
Intercept 3.339 0.030 112.77 < 0.001
labour 0.334 0.008 40.25 < 0.001
capital 0.528 0.004 123.16 < 0.001

Table 9 Summary statistics for the residuals from log(production) ∼ log(labour) + log(capital)

Minimum   First quartile   Median    Third quartile   Maximum
−5.9618   −0.6441          −0.0145   0.6447           4.5923

(a) Interpret the coefficients given in Table 8.


(b) For this model, R2 is 0.5843 and Ra2 is 0.5843. Use these values, and
the summary statistics for the residuals given in Table 9, to comment
on the reasonableness of this model. Is this model better than the one
considered in Activity 9?

In Activity 10, we considered a model where the dependent variable and


the regressors were all logged. We may also use logs just in the dependent
variable, or in the regressor. The interpretation of the coefficient changes
with the presence of logs. Box 18 summarises this.

Box 18 Interpretation of a slope coefficient


Suppose Y is a dependent variable and X is a regressor.
• When Y = a + bX + u (that is, Y in levels and X in levels),
b represents the change in Y when X changes by 1 unit.
• When log Y = a + bX + u (that is, Y in logs and X in levels),
b is a semi-elasticity, and 100b represents the percentage change in
Y when X changes by 1 unit.
• When Y = a + b log X + u (that is, Y in levels and X in logs),
b/100 represents the change in Y when X changes by 1%.
• When log Y = a + b log X + u (that is, Y in logs and X in logs),
b is an elasticity, and represents the percentage change in Y when
X changes by 1%.


4.2 Choosing the regressors


As seen before in the module (in particular, in Units 1 to 4), the
econometrician will also choose a specification of the model which gives the
estimator good statistical properties. When discussing multivariate
regression, you saw that adding regressors will increase the R2 of the
model, but may result in high multicollinearity and in estimators which are
not computationally stable, that is, whose estimates change too
substantially given minor changes in the sample. As the number of regressors increases, the net variation of each regressor, and hence its role in explaining the dependent variable, decreases, to the point of becoming statistically non-significant if its entire variation is captured by correlations with other explanatory variables. Finding the right
balance between including all relevant factors in the model and
safeguarding the explanatory role of the model and its variables is no easy
task.
The added challenge of this unit, and of this section in particular, is to find
the balance which also delivers estimators of coefficients that can be
interpreted as the causal effect of its variable on the dependent variable.
In Subsection 2.2.1, we discussed the human capital theory in economics.
One of the econometric models used to assess the role of human capital on
wage formation was described in Model (2), repeated below:
log(w) = log(w0) + α educ + β1 exper + β2 exper² + u.
While we are mostly interested in the coefficient of education, experience is
added to the model, and, as you have seen earlier in the module, adding
more regressors changes the magnitude and statistical significance of the
estimates of interest. So which ones are the right ones?
We start to address this by considering in Activity 11 what this model says
about the effect of experience on wages.

Activity 11 Effect of experience when modelled as a


quadratic function
With two variables representing experience, how can we calculate the effect
of experience on wages?

In this subsection, we want you to explore how coefficient estimates change


depending on which regressors are included in the model. We will be
exploring the following two models:
log(wage) = log(w0 ) + α1 educ + u. (4)
and
log(wage) = log(w0) + α1 educ + α2 exper + α3 exper² + u. (5)


Start in Activity 12 by considering the results obtained after fitting both


these models to some data.

Activity 12 Coefficient of education in a Mincerian wage


equation
Tables 10 and 11 show the OLS results from fitting the two Mincerian wage equations, Model (4) and Model (5), to the first wave of the PSID dataset. Describe how the coefficient of educ and its interpretation have
changed between the two models.
Table 10 Coefficients for Model (4)

Parameter    Estimate    Standard error    t-value    p-value
Intercept 5.6704 0.0691 82.10 < 0.001
educ 0.0549 0.0053 10.44 < 0.001

Table 11 Coefficients for Model (5)

Parameter    Estimate    Standard error    t-value    p-value
Intercept 5.1475 0.0774 66.55 < 0.001
educ 0.0657 0.0049 13.50 < 0.001
exper 0.0388 0.0049 7.87 < 0.001
exper2 −0.0007 0.0001 −5.40 < 0.001

When we have identified the causal effects of all regressors, this can be
represented simply as in Figure 17.

Figure 17 A model for wages including education and experience

In Figure 17, shaded boxes represent variables included in the model and
the arrows indicate the effects of education and experience on hourly
wages. As both these arrows separately point to hourly wages, it indicates
that in this situation we can estimate the separate effects of education and
experience on wages. If the true model of hourly wages only includes
education and experience, and if these are exogenous, we have identified
causal effects of each on the dependent variable, and their coefficient
estimators will be unbiased and return (on average) the true effects.


In Subsection 4.2.1, we will consider what happens to the estimates if we


don’t include both variables. Then in Subsection 4.2.2 you will use R to
compare different specifications of the Mincerian wage equations.

4.2.1 Omitted variable bias


In Activity 12, you saw that the coefficient of educ changes when terms
relating to experience are added to the Mincerian wage equation. If
experience explains hourly wages, did the model using only educ provide
an unbiased estimator of its coefficient? Unfortunately, the answer is no,
for the following reasons.
• Experience is statistically significant, and so there is evidence that it
helps to explain hourly wages.
• Experience is also correlated with education.
◦ For younger individuals who stay in school longer, their experience
will be lower than for those who left school earlier and acquired some
work experience, so this correlation may be negative.
◦ For older individuals, this correlation is unclear: higher-educated
individuals may have more work experience because they are more
productive and sought after, but because they earn more per hour
worked, they may also prefer to work less or retire earlier if they have
accumulated enough income for their later years.
While the relation is complex and its direction difficult to predict, it will
be difficult to claim education and experience are not related. In the
first wave of the PSID dataset, the correlation between the two variables
is −0.2219, and this correlation is significant at the 1% significance level.
Because education and experience are correlated, we cannot use our ceteris
paribus assumption in interpreting the coefficient of education in the
univariate model. Some of the apparent effect of changes in education on
wages (the causal effect we are after) will be due to changes in experience.
This entanglement is represented in Figure 18.

Figure 18 A representation of the source of omitted variable bias in a model

Notice in Figure 18 there is no direct link between experience and hourly


wages because the univariate model only includes a term for education.
However, the correlation between education and experience, indicated by
the dotted line, means that the effect of experience on hourly wages merges
with the effect of education on hourly wages, causing an omitted variable
bias as defined in Box 19, next.


Box 19 Omitted variable bias


Excluding an explanatory variable that should be in the model results
in an omitted variable bias of the estimator of the regressors that
are included.

Whether it is possible to say anything about the direction of this bias is


something you will consider in the next activity.

Activity 13 Direction of omitted variable bias

Let’s consider now whether, knowing the correlation between education


and experience, we could have anticipated the way in which the coefficient
estimate of education would change when we add experience.

Let’s explore the statement given in Box 19 a little more using the case
when we include just one of two regressors that should be in a model.
So suppose the true model is
Yi = β0 + β1 X1i + β2 X2i + ui
and that instead, we estimate
Yi = β0 + β1 X1i + ui .
In other words, we leave out X2 . (In Figure 18 this was experience.)
It can be shown that if E(ui | X1, X2) = 0 (which is necessary to identify causal effects) then the expectation of β̃1, the OLS estimator of the coefficient of X1, is

E(β̃1 | X1, X2) = β1 + β2 Cov(X1, X2) / V(X1). (6)

X2 creates an omitted variable bias on β̃1 if the second term in Equation (6) is not zero, which requires the two following conditions:
• its coefficient β2 ≠ 0
• the included and omitted variables are correlated.
Both conditions are illustrated in the short simulation sketch below.
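This R simulation is a minimal sketch of Equation (6) using artificial data (the coefficients and the strength of the correlation are illustrative assumptions): omitting the correlated regressor X2 shifts the estimated coefficient of X1 by approximately β2 Cov(X1, X2)/V(X1).

# Minimal simulation sketch of Equation (6): omitting a relevant, correlated
# regressor biases the coefficient of the included one.
set.seed(5)
N <- 100000
X1 <- rnorm(N)
X2 <- 0.5 * X1 + rnorm(N)               # the two regressors are correlated
Y  <- 1 + 2 * X1 + 3 * X2 + rnorm(N)    # true beta1 = 2, beta2 = 3
coef(lm(Y ~ X1 + X2))["X1"]             # close to 2 (correct specification)
coef(lm(Y ~ X1))["X1"]                  # close to 3.5: biased upwards
3 * cov(X1, X2) / var(X1)               # sample version of the bias term, about 1.5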

4.2.2 Using R to compare wage equation


specifications
In Notebook activity A1.6, you will estimate different Mincerian wage
equations with a view to select the best specification for each variable, and
the best set of regressors which may avoid omitted variable bias of your
coefficient(s) of interest.


Notebook activity A1.6 Comparing alternative


specifications of a Mincerian
wage equation
In this notebook, you will be using psid to compare alternative
specifications of a Mincerian wage equation.

4.3 Dealing with possible endogeneity and omitted variable biases
The empirical literature on human capital and the Mincer wage equation
often argues that inherent ability is an important factor determining wages.
By inherent ability is meant the ability which is not affected by
schooling or work experience and is idiosyncratic to each individual.
(Blackburn and Neumark, 1993)

In the next activity, you will consider why such biasing might occur if
ability is not included in the model.

Activity 14 Omitted ability bias

Why would ability create an omitted variable bias in the estimator of the
education coefficient? In other words, in what ways is it correlated with
education? And why should it be in the model?

The explanations given in the solution to Activity 14, and there are many
more, suggest ability has a direct positive effect on wages, which does not
depend on its correlation with education.
Researchers in other social sciences may find these arguments, and the way
ability is thought of here, rather simplistic; but the point is that such a
relation exists, and that the omission of ability from a wage equation leads
to omitted variable bias.

Activity 15 Sign of the omitted ability bias

Using Equation (6) given in Subsection 4.2.1, anticipate the direction of
the bias resulting from excluding ability from a wage regression.

As you have seen, omitting ability from the Mincerian wage equation can
lead to omitted variable bias. However, the solution is not as
straightforward as ‘include ability in the model’ because it’s not clear how
to measure the ability of an individual. So, in the rest of this subsection,
we will consider three alternative ways of dealing with this bias: using


proxy variables (Subsection 4.3.1), studying twins (Subsection 4.3.2) and
using instrumental variables (Subsection 4.3.3).

4.3.1 Using proxy variables


A widely used technique to deal with the omitted variable bias from
excluding ability in wage regressions is to think of a measurement of ability
which picks up its main variation but does not introduce its own biases.
We call these measurements proxy variables. As Figure 19 shows, these
proxy variables are then included in the regression model.

[Diagram: nodes for Education, Proxy variable for ability, and Hourly wages]
Figure 19 A representation of use of a proxy variable in a model

To see how this works, suppose the original model is


Y = a + bX + u,
where an unobserved variable W is absorbed into the error term,
u = W + ν, so that E(u|X) is different from 0 due to a correlation between
X and W . Estimation with a proxy variable involves running OLS on the
model
Y = a + bX + cZ + (ν − ε),
where Z is the proxy variable for W , measuring it with error so that
Z = W + ε, where ε has zero mean, constant variance and values which are
i.i.d. Substituting W = Z − ε shows that (ν − ε) is the new error term. It
is assumed that this new error term is now uncorrelated with the regressors.
This strategy is summarised in Box 20.

Box 20 The proxy variable estimator


Proxy variables are used in situations where there is an excluded
variable that is correlated with one of the regressors which is hard or
impossible to measure. If another variable can be found that is
correlated with the unobserved variable, it can be included in the
model as a proxy for the unobserved variable, removing the correlation
between the error term and the regressors with which it is correlated.
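As a rough illustration of Box 20, the following minimal R sketch (simulated data; every name and value is invented) shows a noisy but observable proxy removing most of the bias caused by an unobserved variable.

# Minimal sketch of the proxy variable strategy on simulated data
set.seed(2)
n <- 10000
w <- rnorm(n)                        # unobserved variable (think 'ability')
x <- 0.5 * w + rnorm(n)              # regressor correlated with w
y <- 1 + 2 * x + 3 * w + rnorm(n)    # true coefficient on x is 2
z <- w + rnorm(n, sd = 0.3)          # observable but noisy proxy for w

coef(lm(y ~ x))["x"]                 # biased by the omission of w
coef(lm(y ~ x + z))["x"]             # the proxy removes most of the bias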

Example 3 describes one study that has made use of proxy variables.


Example 3 Proxy variables for ability


Blackburn and Neumark (1993) argued that the increasing wage
differential between higher- and lower-educated white men in the USA
in the 1980s could be explained by changes in the average ability of
different educational groups. To find empirical evidence for their
hypothesis, and contrary to earlier empirical studies of the
wage/education relationship, they chose to include a proxy variable for
ability in their econometric model. They used a US dataset called the
National Longitudinal Survey (NLS) and focused on a cohort of young
people whose information is collected at different points in time. This
dataset includes information on the test scores participants obtained
while still at school. These test scores capture several cognitive and
technical elements which are argued to represent inherent ability.

While the study described in Example 3 tried different methods to
investigate the wage/education relationship, here we present only the
results obtained for a log wage regression with and without proxy variables
for ability. A summary of the main results is in Table 12. Column (1)
shows the wage regression results when ability is excluded. Columns (2)
and (3) include different sets of measurements of ability. Standard errors
are reported in parentheses.
Table 12 Results for models of wages, in logs, with proxy variables for ability

(1) (2) (3)


Years of education 0.032 0.013 0.012
(0.007) (0.008) (0.008)
Years of education × trend 0.0034 0.0048 0.0048
(0.0017) (0.0017) (0.0017)
Academic test −0.010
(0.017)
Technical test 0.044
(0.013)
Computational test 0.041
(0.010)
Non-academic test 0.038
(0.006)

R2 0.393 0.404 0.405

Note: this is a panel dataset which also contains regressors related to time. We
will leave the discussion of time-related variables for Unit A2.


Their econometric model also includes experience, age, union status, a
measurement of urban versus rural location, and marital status. All these
other regressors are the same in the three model specifications. In
Activity 16, you will interpret some of the results given in Table 12.

Activity 16 Interpreting the results of the three models

Interpret the coefficient estimates of the three models given in columns (1),
(2) and (3) of Table 12. Do these results suggest an omitted variable bias
which is positive, as suggested in the previous activity?

While this strategy has been used often, and earnings and income datasets
often collect measurements of ability, we have seen in Section 3 that
measurement error also biases our results. Proxy variables are prone to
more complex types of measurement error, so results for the coefficients of
test scores have to be taken with a pinch of salt. But what is more,
simply including the test scores as regressors may not be the best
way to use the information in these variables to control for ‘ability’.
It seems reasonable to expect that the productive ability that
employers value is at least partly reflected in our test scores but that
several other factors also affect the outcome of the tests (e.g.,
test-taking ability, sleep the previous night, etc.).
(Blackburn and Neumark, 1993)
Alternative methodologies to deal with such issues include instrumental
variables, which we will discuss in Subsection 4.3.3.

4.3.2 Using twin studies


This subsection discusses a creative and compelling strategy to deal with
omitted ability – using twins. This draws on a common identification
strategy which tries to look at observations which are the same in
everything except the variables whose change we are interested in
modelling. But this is often easier said than done . . .
In medical and natural sciences, this is the principle used in lab
experiments estimating the impact of one drug. One group receives a
placebo and the other group receives the actual drug. Differences between
their outcomes will provide an estimate of the impact of the drug, as long
as the two groups are comparable (and the ones receiving the placebo will
not alter their behaviour if they know they are not being treated). You
will have seen similar treatment and placebo set-ups in the pea growth,
rats and protein, and placebo effect datasets (Subsection 5.4 of Unit 3,
Subsection 4.4 of Unit 4 and Subsection 5.1 of Unit 5, respectively).
When it comes to estimating the wage returns to education, finding people
who are the same in everything except the variables we are interested in is
far more tricky, which is where twins come in. In the words of some
researchers that have used this approach:


We estimate the returns to schooling by contrasting the wage rates of
identical twins with different schooling levels. Our goal is to ensure
that the correlation we observe between schooling and wage rates is
not due to a correlation between schooling and a worker’s ability or
other characteristics. We do this by taking advantage of the fact that
monozygotic (from the same egg) twins are genetically identical and
have similar family backgrounds.
(Ashenfelter and Krueger, 1994)
Ashenfelter and Krueger went to the largest gathering of twins and
multiples in the summer of 1991 and – with a team of five interviewers –
collected information on 298 identical (or monozygotic) twins and
92 fraternal (or dizygotic) twins. They compared their sample of twins
with the sample obtained in a nationally representative survey in the USA
(the Current Population Survey (CPS)) and noted that there were
discrepancies in terms of race and age.
Their identification strategy for the effect of education on earnings assumes
that identical twins with different education levels have the same inherent
ability so that the difference in their education explains the difference in
their earnings, accounting for race and age.
We can see the rationale in a simplified econometric model where the
earnings of each pair of twins, each pair represented by twin i and twin j,
are a function of ability, familyBackground, and educ:
log(wagei ) = β0 + β1 educi + β2 abilityi
+ β3 familyBackgroundi + ui
and
log(wagej ) = β0 + β1 educj + β2 abilityj
+ β3 familyBackgroundj + uj .

The term familyBackground captures the variation between individuals
due to upbringing and genetics in a non-specific way. That is, it captures
general, unobserved, heterogeneity between individuals, not represented by
education or ability.
When the two equations are differenced, the intercept β0 cancels and we obtain
∆ log(wagei−j ) = β1 ∆ educi−j + β2 ∆ abilityi−j
+ β3 ∆ familyBackgroundi−j + ∆ ui−j ,
where here ∆ Xi−j is taken to represent Xi − Xj .
The similar upbringing of most twins, and the genetic similarity of
identical twins, means that it can be reasonable to assume that for
identical twins, family backgrounds are the same. So if ability is also the
same for identical twins, this equation simplifies to
∆ log(wagei−j ) = β1 ∆ educi−j + ∆ ui−j .
This is depicted in Figure 20.


[Diagram: ∆ Education pointing to ∆ Hourly wages]
Figure 20 A model resulting from a twin study

Note that, even though the variables are no longer the collected earnings
and education levels, and instead the differences in earnings and in
education between the twins in each pair need to be created, the coefficient
on the education variable is still β1, and is argued to be free of the
omitted variable bias present in standard models which exclude ability.
Moreover, this strategy can account for family background without
having to measure it explicitly.
Generically, the twin study approach can be written in the following way.
Suppose the original model is
Yi = a + bXi + ui ,
where ui = cWi + νi , and E(ui |Xi ) is different from 0 due to a correlation
between X and W . Further suppose there are groups consisting of
individuals who share the same value of unobserved heterogeneity W , so
Wi = Wj when i and j belong to the same group. (In twin studies, each
group contains just two members.) Differencing then yields
Yi − Yj = α + b(Xi − Xj ) + vi ,
where vi = νi − νj . Unobserved heterogeneity due to W is not in the
transformed model, so E(v|X) = 0. The original intercept is cancelled out
by the differencing; nevertheless, we often run OLS on a model with an
intercept.
The twin studies technique is summarised in Box 21.

Box 21 The twin studies technique


Twin study techniques can be used in situations where there is an
excluded variable that is correlated with one of the regressors and
which is difficult or impossible to measure/observe. If more than one
of the study’s cross-sectional units have the same value for the
unobserved variable, they can be differenced to eliminate the
unobserved variable. This way, the bias due to the omitted variable is
removed without having to measure it.
Twin studies can be used with all data structures.
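A minimal R sketch of this differencing idea, on simulated twin pairs (all names and data-generating values are invented), is given below: pooled OLS across twins is biased by the shared unobserved ability, while the within-pair differences recover the schooling coefficient.

# Minimal sketch of within-pair differencing on simulated twin data
set.seed(3)
n_pairs <- 500
ability <- rnorm(n_pairs)                    # shared within each twin pair
educ1   <- 12 + ability + rnorm(n_pairs)     # twin 1's years of schooling
educ2   <- 12 + ability + rnorm(n_pairs)     # twin 2's years of schooling
logw1   <- 1 + 0.08 * educ1 + 0.5 * ability + rnorm(n_pairs, sd = 0.1)
logw2   <- 1 + 0.08 * educ2 + 0.5 * ability + rnorm(n_pairs, sd = 0.1)

coef(lm(c(logw1, logw2) ~ c(educ1, educ2)))[2]    # pooled: biased upwards
coef(lm(I(logw1 - logw2) ~ I(educ1 - educ2)))[2]  # differenced: close to 0.08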


Ashenfelter and Krueger (1994) went on to estimate their model in twin
differences to find that, contrary to most studies which account for ability
one way or another, the coefficient of education is higher than the ‘biased’
estimates in the literature, suggesting a negative ability bias.
They were aware of possible flaws with twin studies and models in
differences, including measurement error. They discussed that if education
was measured with error, then differencing education values could
potentially exacerbate any initial error in the collected values for
education. To account for this criticism, they collected two measures of
education, one self-reported by the individual and the other reported by
their twin, and used their average as the indicator of each twin’s education
(on the assumption that measurement error was lower in this averaged
measurement of each twin’s education). They further explored the
discrepancies between the two values of each individual’s education to
describe the effect of measurement error in their work.
These studies have received a lot of criticism, mainly because of the
identification strategy which only uses twin pairs who have had different
education levels. If twins are indeed the same in family background and
ability, why would they have different education levels? And why would
reasons that led them to have different education levels not have a direct
impact on earnings?
We are inferring the returns to schooling for the US population from pairs
of twins who are argued to be similar but who make different economic
choices, for reasons which could well be related to both education and,
ceteris paribus, to earnings themselves, undermining the results. This type
of bias is called selection bias (as discussed in Subsection 3.3). Selection
bias occurs when the sample used to estimate an econometric model is not
representative of the population in unobservable ways.
Ashenfelter and Krueger (1994) accounted for how different their sample of
twins was from the CPS sample by adding regressors which would make both
measurable distributions more comparable. But race and age in this case
would not necessarily explain why twins would choose different education
levels, leaving a lot of questions unanswered.
Selection bias in twin studies, when left unaccounted for, is the main
reason why some economists argue that twin studies may be more creative
than compelling.

4.3.3 Using instrumental variables


The most widely used technique to deal with an endogenous regressor,
whether or not endogeneity is due to an omitted variable, and when only
cross-sectional data is available, is called the instrumental variable (IV)
estimator. The principle of instrumental variables is to replace the
endogenous regressor with an alternative that resembles the regressor in
some way, but is only correlated with the dependent variable via the
endogenous regressor and, ceteris paribus, not directly.


This is illustrated in Figure 21. Here the endogenous regressor is education
due to its correlation with ability, which here is not in the model. Instead,
an instrument for education is included in the model. This is correlated
with education. However, conditional on education, it is assumed to have
no impact on the dependent variable.

[Diagram: nodes for Instrument for education, Education, Ability and Hourly wages]
Figure 21 A model using an instrumental variable

In Activity 17, you will consider the model depicted in Figure 21 a bit
further.

Activity 17 Selection of instrumental variables


In our wage regression example, if we are interested in the coefficient of
education and assume education is endogenous because of its correlation
with ability which is unobservable and excluded, which variable would our
instrumental variable replace?

[Margin image caption: Not an instrumental variable, but does the slide in a trombone make it a variable instrument?]
Suppose that the original model is
Yi = a + bXi + ui ,

where ui = cWi + νi , and E(ui |Xi ) is different from 0 due to a correlation
between X and W . Further suppose that we have an instrumental
variable Z. That is:
• X and Z are correlated. (Ideally highly correlated; otherwise we are in
the realm of what is called weak instruments.)
• Cov(u, Z|X) = 0. (This means that conditional on X, the instrumental
variable Z is not correlated with the dependent variable Y .)
The instrumental variables approach works by breaking the regression
modelling into two steps.
1. Estimate the relationship between the instrument Z and the
regressor X, and use this to predict values, X̂, of X using Z.
2. Use the predicted values, X̂, in place of X in the original regression
model. That is, fit the model Yi = a + bX̂i + ui (as sketched in R below).
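The following minimal sketch carries out the two steps on simulated data (all names and values are invented). Note that the naive second-stage standard errors are not valid; dedicated routines, such as ivreg() in the AER package, correct them.

# Minimal two-step IV sketch on simulated data
set.seed(4)
n <- 5000
w <- rnorm(n)                        # unobserved variable (e.g. ability)
z <- rnorm(n)                        # instrument: affects x, unrelated to w
x <- 1 + 0.8 * z + 0.6 * w + rnorm(n)
y <- 2 + 0.5 * x + w + rnorm(n)      # true coefficient on x is 0.5

coef(lm(y ~ x))["x"]                 # OLS is biased: x is endogenous via w

x_hat <- fitted(lm(x ~ z))           # step 1: predict x from the instrument
coef(lm(y ~ x_hat))["x_hat"]         # step 2: IV point estimate, close to 0.5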


With cross-sectional data, IV estimation is the most common way of
dealing not only with omitted variable bias, but also with measurement
error bias or selection bias. It is also often used in combination with other
methods, even with richer data structures.
This is summarised in Box 22.

Box 22 The instrumental variables (IV) estimator


The instrumental variables approach can be applied when the error
term is correlated with at least one of the regressors because there is
an omitted variable which is also correlated with the regressor. It
consists of finding another variable Z that is correlated with the
regressor X but, ceteris paribus, not with the dependent variable Y .
The regression proceeds in two steps: the first step is to estimate the
relationship between the instrument Z and the regressor X, and this
is used to ‘predict’ values of X from Z; the second step uses the
predicted values of X in place of X in the original regression model.
Instrumental variables can be used with all data structures.

In Example 4, we return to investigating the impact of education on wages,
this time using instrumental variables.

Example 4 Using IV in a wage regression


In Subsection 4.3.1, we discussed proxy variables for unobservable
factors and looked at Blackburn and Neumark (1993). We saw that
they acknowledged that simply replacing ability with test scores in
their wage regression equations would not necessarily generate
unbiased estimates due to measurement error of proxy variables. They
argued that test scores may still be endogenous due to family
background and support. That being the case, not only would
coefficients on test scores be biased, but it is also likely that the
endogeneity of education remains.
Blackburn and Neumark (1993) included several variables measuring
family background, including parental education, to account for the
potential unobserved relationship test scores may have with wages
and education. They estimated three alternatives to OLS:
• an IV estimator that uses family background as an IV for test
scores in the wage regression
• an IV estimator that uses family background variables to
instrument for both education and test scores
• an IV estimator that uses family background variables to instrument for
education only.


The assumptions required to confirm that IV, using family background
characteristics as instruments, is a valid estimator vary for the three cases
mentioned in Example 4. Let’s consider them:
• IV for test scores in a wage regression. This assumes
◦ family background is related to test scores
◦ using the ceteris paribus assumption, when looking at individuals with
similar test scores (proxy for ability), any observed differences in
wages are not driven by family background
• IV for test scores and for education in a wage regression. This assumes
◦ family background is related to test scores and separately to education
◦ using the ceteris paribus assumption, when looking at individuals with
similar test scores (proxy for ability) and education, any observed
differences in wages are not driven by family background
• IV for education in a wage regression. This assumes
◦ family background is related to education
◦ using the ceteris paribus assumption, when looking at individuals with
similar education, any observed differences in wages are not driven by
family background.
Note that IV requires at least as many instruments as endogenous
regressors. So when the authors instrument for both education and test
scores, they have at least two instruments which are different indicators of
family background.
Some of the results that Blackburn and Neumark (1993) obtained are
given in Table 13.

Table 13 OLS and IV log wage equation estimates

Columns (1)–(2): OLS; (3)–(4): IV for test scores; (5)–(6): IV for test scores and schooling; (7)–(8): IV for schooling
(1) (2) (3) (4) (5) (6) (7) (8)
Years of education 0.013 0.012 −0.000 −0.001 0.029 0.022 0.028 0.024
(0.008) (0.008) (0.013) (0.013) (0.045) (0.043) (0.021) (0.018)
Years of education × trend 0.0048 0.0048 0.0062 0.0057 0.0077 0.0062 0.0084 0.0070
(0.0017) (0.0017) (0.0023) (0.0018) (0.0047) (0.0042) (0.0029) (0.0030)
Academic test −0.010 −0.057 −0.110 −0.042
(0.017) (0.152) (0.155) (0.024)
Technical test 0.044
(0.013)
Computational test 0.041
(0.010)
Non-academic test 0.038 0.094 0.064 0.083 0.035 0.042 0.028
(0.006) (0.081) (0.020) (0.081) (0.044) (0.008) (0.009)

The first two columns repeat the results given in Table 12
(Subsection 4.3.1) when test scores were included in a wage regression


directly and OLS was used to estimate its coefficients (Columns (2)
and (3) of Table 12).
All education coefficient estimates remain lower than in the model which
did not account for ability at all (Column (1) of Table 12). The models
which instrument for test scores (Columns (3), (4), (5) and (6)) even
generate returns to education close to zero, which are not statistically
significant. Estimates when schooling is instrumented with family
background are almost twice the size of the estimates using proxy variables
only, but have the additional advantage of not being as susceptible to
measurement error bias.
All in all, and for this dataset, it is likely that estimates in Columns (5)
to (8) are the best estimates given the econometric model used to explain
wages.
The work of the econometrician would not necessarily stop here.
Blackburn and Neumark (1993) went on to refine and augment their model
by interacting education with test scores (see Unit 4 for a discussion of
interactions between variables and how they model non-parallel slopes),
explored the time series nature of their dataset, and made a few more
robustness checks to strengthen the evidence and results obtained. But
within the spirit of confirmatory analysis, the analysis presented suggests
that estimates in Columns (5) to (8) would be convincing enough, as they
have the ‘right’ size: lower than the initial ones which were subject to
positive omitted variable bias, and higher than the estimates which were
subject to proxy variable measurement error.

5 Estimating causal relationships with panel data
In Section 4, we discussed methods for estimating causal relationships
using cross-sectional data. In this section, we will discuss some techniques
that can be used to estimate causal relationships when we have panel data:
using pooled OLS (Subsection 5.1), using fixed effects models
(Subsection 5.2) and using random effects (Subsection 5.3). In
Subsection 5.4 we will then discuss a means for selecting which technique.
Finally, in Subsection 5.5 you will use R to fit pooled, fixed effects and
random effects models.


5.1 Pooled OLS


Recall from Subsection 3.2.1 that the generic form of a model for panel
data is
Yit = β0it + β1it X1it + · · · + βKit XKit + uit .
Pooled OLS is the estimator of the following econometric model which uses
panel data:
Yit = β0 + β1 X1it + · · · + βK XKit + uit . (7)
Comparing this model with the generic form, you will notice that we have
dropped the subscripts i and t from the coefficients βk , for all
k = 0, . . . , K. This implies that the intercept and slope are the same for all
individuals i in the panel and for all time periods t, and the error term is
still a zero mean error term which satisfies E(uit |X) = 0. Under these
conditions, the best estimator is OLS applied to the pooled data: the
pooled OLS estimator. This is summarised in Box 23.

Box 23 Pooled model


As the name suggests, pooling the data ignores the panel structure
and treats all observations as if they were independent, effectively
reducing the panel to a cross-sectional dataset.

Using a pooled model for panel data can appear to defeat the purpose of
having a panel at all, but there are occasions where it is appropriate, which
makes pooled OLS not too different from cross-sectional OLS.
However, one needs to remember that observations in a cross-sectional
dataset are assumed to be randomly sampled from the population, and to
be representative of the population. With panel data, it would be difficult
to convince an econometrician that two observations taken from the same
individual in two different time periods are as independent as two
observations from different individuals in a cross-section. Some software
packages use variance–covariance correction methods to estimate pooled
models which account for this data structure.
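For illustration, here is a minimal R sketch of pooled OLS on a small simulated panel (invented data and names, not psid); one widely used R package for panel data models is plm, which is assumed to be installed.

# Pooled OLS on a simulated panel
library(plm)

set.seed(5)
d <- data.frame(id = rep(1:100, each = 6), year = rep(1:6, times = 100))
d$fe <- rep(rnorm(100), each = 6)        # unobserved individual effect
d$x  <- 0.5 * d$fe + rnorm(600)          # regressor correlated with the effect
d$y  <- 1 + 2 * d$x + d$fe + rnorm(600)  # true slope is 2

pdat   <- pdata.frame(d, index = c("id", "year"))  # declare the panel structure
pooled <- plm(y ~ x, data = pdat, model = "pooling")
coef(pooled)   # the slope is biased away from 2 because the effect is ignored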

5.2 Fixed effects estimators


While pooled OLS can be the most efficient estimator with panel data, it is
often the case that the error term is not as well-behaved as in Model (7),
where uit has zero mean and E(uit |X) = 0. A common assumption made
about the error term in a panel data model is as follows:
Yit = β0 + β1 X1it + · · · + βK XKit + νit ,
where νit = fi + uit , and uit represents the same well-behaved error term
with zero mean and E(uit |X) = 0. The added error term fi does not vary


with t and is often assumed to be unobservable and correlated with at
least one of the regressors.
These fi are often called fixed effects or individual effects. The term fi
delivers an intercept that is specific to individual i. It is assumed at least
some of these individual effects are not zero – otherwise we would just
estimate a pooled model.
This panel data model, summarised in Box 24, is called the fixed effects
(FE) model.

Box 24 Fixed effects model


A fixed effects model is
Yit = β0 + fi + β1 X1it + · · · + βK XKit + uit , (8)
where the fi are correlated with at least one Xk , making Xk
endogenous.

In the fixed effects model, all individuals are affected the same way by
changes in the regressors (the Xk ’s). This means the regression lines for all
individuals have the same slope, but they have different intercepts due to
the differing individual effects. So this is a parallel slopes model which can
be extended with interactions between dummy variables and other
regressors. (Remember from Subsection 2.1 of Unit 3 that ‘dummy
variable’ is an alternative term for ‘indicator variable’.)
In this strand, you will learn three main estimators of a fixed effects model:
• the least squares dummy variable (LSDV) estimator (introduced in
Subsection 5.2.1)
• the within groups (WG) estimator (introduced in Subsection 5.2.2)
• the first difference (FD) estimator (introduced in Subsection 7.1 of
Unit A2).

Example 5 Modelling ability as a fixed effect in a wage regression
Several researchers have argued that properties like ability vary across
individuals, but are specific to each individual, and do not vary over
time. If this is the case, then a fixed effects model and estimator are
appropriate.
Returning to our wage regression example, and thinking of ability as
an omitted variable, this econometric model can be written as
log(wageit ) = β0 + β1 educit + abilityi + uit ,


where abilityi is the fixed effect, with a subscript i because it is
different for each individual, but no subscript t because it is
time-invariant. We have no data for abilityi (it is unobservable),
and so it is included in the error term. In a fixed effects model, it is
assumed that ability is correlated with education, rendering
education endogenous, and an OLS estimator of β1 biased and
inconsistent.

Let’s now explore alternative fixed effects estimators and their strategies to
estimate the parameters of the model consistently and with no bias.

5.2.1 The least squares dummy variable (LSDV) estimator
Returning to Model (8) and rewriting it as
Yit = (β0 + fi ) + β1 X1it + · · · + βK XKit + uit
suggests that fixed effects can be modelled and estimated as
individual-specific constant terms. By grouping the original intercept β0
with fixed effect fi , we have rewritten a fixed effects model as a model
which has individual-specific intercepts, and a well-behaved error term uit
which is uncorrelated with the regressors.
The time-invariant unobserved heterogeneity represented by the fixed
effects can be dealt with by estimating the following model using OLS:
Yit = α1 D1 + · · · + αN DN + β1 X1it + · · · + βK XKit + uit ,
where Di takes the value 1 for observations on cross-sectional unit i, and
Dj for j ≠ i takes the value 0. That is, D1 , . . . , DN are indicator variables
as introduced in Subsection 2.1 of Unit 3.
In econometrics, we call indicator variables dummy variables, or
dummies for short, leading to the name of this estimator, the least
squares dummy variable (LSDV) estimator described in Box 25.

Box 25 The least squares dummy variable (LSDV) estimator
The LSDV estimator is one of the fixed effects estimators of panel
data models.
The most common estimable version of this model is
Yit = β0 + α2 D2 + · · · + αN DN + β1 X1it + · · · + βK XKit + uit ,
where β0 is the model intercept and common to all cross-sectional
units i, and where one of the individual-specific dummy variables has
been dropped.


In Example 5, we argued that ability is often modelled as a fixed effect in a
panel data model. In Example 6, we will apply the LSDV estimator to this
model.

Example 6 Ability in a wage regression and LSDV


In Example 5, the following model was given for the wage regression
example:
log(wageit ) = β0 + β1 educit + abilityi + uit ,
where abilityi is the fixed effect.
Estimating this model using LSDV, i.e. including ability in the model
as an individual-specific dummy variable, leads to the following
specification:
log(wageit ) = α1 D1 + · · · + αN DN + β1 educit + uit . (9)

Ability has been modelled as a dummy variable for each individual,
and the new error term uit is assumed to be uncorrelated with
education now that ability is no longer concealed in it. The αi
effectively define a separate intercept for each individual, which
removes the need for an overall intercept β0 . While the model for each
individual has a different intercept, all still have the same education
coefficient β1 . So it is a parallel slopes model, according to the
terminology used in Section 2 of Unit 4.
Often, however, and for reasons which will become clearer when we
discuss tests for fixed effects, we choose to include the model intercept
β0 and to drop one of the individual-specific dummy variables just as
we did in Units 3 and 4.
So the estimable version of Model (9) is often as follows (and for the
sake of this example, we have dropped the dummy variable for
individual i = 1):
log(wageit ) = β0 + α2 D2 + · · · + αN DN + β1 educit + uit .
However, when estimating LSDV on a basic wage equation with
education and experience as regressors, and using psid, it turns out
that an extra individual-specific dummy variable needs to be dropped
(making two such dummy variables dropped in total). Table 14 shows
two sets of results obtained by dropping different pairs of dummy
variables. (Only the first 10 individual-specific dummy variable
coefficient estimates are shown in each case.)


Table 14 LSDV results for a wage regression with education and experience

Dropping D1 and D595 Dropping D1 and D468

Parameter Estimate Standard error Estimate Standard error
Intercept 4.3457 0.2888 7.7600 0.1481
educ 0.1058 0.0272 −0.2736 0.1178
exper 0.1140 0.0025 0.1140 0.0025
exper2 −0.0004 0.0001 −0.0004 0.0001
D2 −2.3033 0.0792 −1.5446 0.0765
D3 −0.0942 0.0819 1.0440 0.0732
D4 −2.3942 0.0793 −2.0148 0.0816
D5 −0.5419 0.1663 2.1137 0.0859
D6 −1.4652 0.0870 −0.3271 0.0732
D7 −1.2154 0.0841 0.0774 0.0733
D8 −1.4836 0.0764 −1.1043 0.0796
D9 0.1917 0.1652 2.8473 0.0873
D10 −0.1933 0.1652 2.4622 0.0873
D11 −1.4488 0.0865 −0.3108 0.0732

Notice from Table 14 that the coefficients for exper and exper2 are
the same despite different pairs of dummy variables being dropped. In
contrast, the coefficient for educ is one of the coefficients that does
change. Even the sign of the coefficient for educ changes! This happens
because education does not vary over time for the individuals in psid, so
it is collinear with the individual-specific dummies and its coefficient is
not identified. This means we have to take great care when fitting the
model and interpreting what it is telling us about the effect of education
on wages.
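To see the mechanics of LSDV in general (leaving aside the time-invariance issue above), here is a minimal R sketch on simulated data with a time-varying regressor; all names and values are invented.

# LSDV on simulated data: one dummy per individual via factor()
set.seed(6)
d <- data.frame(id = rep(1:100, each = 6))
d$fe <- rep(rnorm(100), each = 6)          # unobserved individual effect
d$x  <- 0.5 * d$fe + rnorm(600)            # time-varying regressor
d$y  <- 1 + 2 * d$x + d$fe + rnorm(600)    # true slope is 2

lsdv <- lm(y ~ x + factor(id), data = d)   # lm() drops one dummy as baseline
coef(lsdv)["x"]                            # close to 2 once the dummies are in
coef(lm(y ~ x, data = d))["x"]             # pooled OLS for comparison: biased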

5.2.2 Within groups (WG) estimator


The most commonly used method to correct for the omitted variable bias
of a time-invariant variable is the within groups (WG) estimator. Instead
of trying to model ability as the LSDV estimator does, it follows the same
principle used in twin studies and transforms the variables of the model so
that ability is removed from the econometric model which uses the
transformed variables. In the case of twin studies, the new variables were
the differences between twins for each variable of the original model and,
ability being the same within each pair, it was differenced out. In this case,
WG eliminates the fixed effects αi from Model (9) by expressing the
dependent and explanatory variables for each entity in terms of the
deviations from their mean values. Starting from the form of the fixed
effects model
Yit = β0 + fi + β1 X1it + · · · + βK XKit + uit , (10)
we would average the observations on the ith individual over the Ti time
periods to get
Ȳi = β0 + fi + β1 X̄1i + · · · + βK X̄Ki + ūi . (11)


Subtracting Model (11) from Model (10) yields the following form for the
within groups model:
Yit − Ȳi = β1 (X1it − X̄1i ) + · · · + βK (XKit − X̄Ki ) + uit − ūi .
These are called demeaned or mean corrected variables; the intercepts
have been eliminated. OLS estimation of the new model using
X1it∗ = X1it − X̄1i , . . . , XKit∗ = XKit − X̄Ki
yields the WG estimator. In practice, it is not necessary to perform this
transformation yourself: software packages that handle panel data do it
automatically.

Box 26 The within groups (WG) estimator


The WG estimator is another fixed effects estimator of panel data
models. It uses OLS on a model of the form
Yit∗ = β1 X1it∗ + · · · + βK XKit∗ + uit∗ ,
where for a variable Zit , Zit∗ = (Zit − Z̄i ). That is, Zit∗ is a demeaned
variable.
Due to the use of demeaned variables which pick up observations from
all time periods for each individual, WG requires that the new error
term is uncorrelated with the new regressors; this assumption is more
stringent than the one needed for OLS or LSDV.

Notice that in the demeaning process, the intercept and the fixed effect in
the error term u that characterise the fixed effects model have been
eliminated. So the WG estimator avoids the complexity of adding extra
variables and the computational cost of the LSDV estimator. We often
estimate WG with an intercept nevertheless.
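As a rough R illustration of the within groups idea (simulated data, invented names; plm's model = "within" option is one common implementation), the sketch below compares plm's within estimator with explicit demeaning using ave().

# Within groups (WG) estimation on simulated data
library(plm)

set.seed(7)
d <- data.frame(id = rep(1:100, each = 6), year = rep(1:6, times = 100))
d$fe <- rep(rnorm(100), each = 6)          # unobserved individual effect
d$x  <- 0.5 * d$fe + rnorm(600)
d$y  <- 1 + 2 * d$x + d$fe + rnorm(600)    # true slope is 2

wg <- plm(y ~ x, data = pdata.frame(d, index = c("id", "year")),
          model = "within")
coef(wg)                                   # close to 2

# The same slope from explicitly demeaned variables:
y_star <- d$y - ave(d$y, d$id)             # subtract each individual's mean
x_star <- d$x - ave(d$x, d$id)
coef(lm(y_star ~ x_star))["x_star"]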
In Example 7, we will estimate the same fixed effects model as in
Example 6 but this time using the WG estimator.

Example 7 Wage regression and WG


Running WG on our basic wage regression with educ and exper
delivers the results shown in Table 15.


Table 15 WG results for a wage regression with education and experience

Parameter Estimate Standard error t-value p-value
Intercept 4.6342 0.0279 166.27 < 0.001
educ – – – –
exper 0.1140 0.0025 46.24 < 0.001
exper2 −0.0004 0.0001 −7.88 < 0.001

In the PSID dataset, education does not vary over time for any
individual, and so its effect cannot be estimated using a WG estimator.
Notice that the coefficients on exper and exper2 are the same as the
LSDV estimates presented in Table 14.

As you saw in Example 7, the challenge with WG is that it eliminates all


time-invariant variables, and not just ability and family background.
Education itself, in most panel studies, does not tend to vary for each
individual, as most people will have finished their education early in life.
Some studies will use samples which include several school attendees, but
the identification of the coefficient of education then depends on the
behaviour of this group of individuals, which may be different from the
behaviour of the relevant population, resulting in selection bias. For this
reason, and despite its computational effort, LSDV is often used as a fixed
effects estimator instead. (Recall that for LSDV we need to fit a dummy
variable for all but one of the individuals in the panel, something that is
not required for WG.)

5.2.3 Using panel data models to estimate a consumption function
This subsection will estimate a consumption function using the panel data
estimators discussed so far, and will test whether a pooled model or fixed
effects model is better, by using simulated macroeconomic data.

Aggregate income and consumption


This fictional dataset includes observations of aggregate income and
aggregate consumption for five hypothetical countries over 12 time
periods.
The simulated panel dataset (simulatedPanel)
This dataset has 60 observations and contains data for the following
four variables:
• country: the country the observation relates to
• year: the time period the observation relates to


• income: the aggregate income of the country


• consumption: the aggregate consumption of the country.
Seven observations from the simulated panel dataset are given in
Table 16.
Table 16 Selected observations from simulatedPanel

country year income consumption


1 1 2.51 2.31
1 2 2.62 2.40
1 3 2.65 2.43
2 1 4.26 3.58
2 2 4.37 3.65
3 2 2.96 2.76
4 1 4.59 4.01

Figure 22 shows the variation in the dependent variable consumption.
Notice that in this simulated dataset, consumption varies
considerably between countries and that over time consumption has been
increasing.

12
5 11
10
4 9
8
Country

7
3
Year

6
5
2 4
3
1 2
1
2.5 3.0 3.5 4.0 4.5 2.5 3.0 3.5 4.0 4.5
Consumption Consumption

Figure 22 Boxplots of consumption given in simulatedPanel (a) by country and (b) by period

Let’s compare the various estimators introduced so far – the fixed effects
estimators (LSDV and WG) with the pooled model.
Scatterplots of the data are given in Figure 23. In this plot, different
plotting symbols are used for each country. (From the plot it is not
possible to determine which period each point corresponds to.) In
Figure 23(a) the pooled model is also plotted. Notice that this corresponds
to just one line. Although the pooled model fits the data reasonably well,
it fits some countries better than others. For example, the points for country 3 lie


much closer to the line than the points for country 5.

[Two scatterplot panels of consumption against income for countries 1–5, with fitted consumption functions.]
Figure 23 Scatterplot of aggregate consumption versus aggregate income for five hypothetical countries with
(a) the pooled OLS consumption function and (b) the pooled OLS and the LSDV consumption functions

In Figure 23(b) the fixed effects model is shown as well as the pooled
model. As you would expect from a parallel slopes type model, the fixed
effects model corresponds to separate lines for each country. Furthermore,
these lines are all parallel. Notice these lines fit the data better than the
pooled model. However, the quality of the fit still varies (slightly) between
countries.
Table 17 shows the estimated consumption function results using pooled
OLS and Table 18 shows the estimated consumption function results using
LSDV.
Table 17 Consumption function: pooled OLS results

Parameter Estimate Standard error t-value p-value
Intercept 0.940 0.112 8.43 < 0.001
income 0.609 0.025 24.49 < 0.001

Table 18 Consumption function: LSDV results

Parameter Estimate Standard error t-value p-value
Intercept 0.791 0.031 25.455 < 0.001
income 0.632 0.010 62.716 < 0.001
Country 2 0.102 0.024 4.249 < 0.001
Country 3 0.113 0.013 8.816 < 0.001
Country 4 0.341 0.026 13.282 < 0.001
Country 5 −0.321 0.028 −11.277 < 0.001


By comparing Tables 17 and 18, we can see the change of slope from the
pooled model to the LSDV model – an increase in the coefficient of the
income term. We can also see the addition of dummy variables for each
country except the first, which shift the intercept for those countries.
To avoid estimating a model with as many dummy variables as
cross-sectional units, we use the WG estimator. As mentioned in
Subsection 5.2.2, the WG estimator starts by demeaning the data. The
simulated panel dataset, after demeaning, is shown in Figure 24(a).
Notice that once this is done, the data for all five countries overlap
considerably and are centred about (0, 0). The WG estimator then
corresponds to fitting a single line to these demeaned data. The resulting
line is also shown in Figure 24(a). Transforming back to the original scale
results in the single line becoming separate lines for each country, all
parallel – as shown in Figure 24(b). Notice in Figure 24(b) these lines are
identical to those found using LSDV. The two regressions, WG and LSDV,
are mathematically the same – WG just represents a translation of LSDV,
not a distinct model.

[Two scatterplot panels: (a) demeaned consumption against demeaned income with the WG fitted line; (b) consumption against income with the LSDV and WG fitted lines.]
Figure 24 (a) Scatterplot of demeaned consumption and income and the WG estimated consumption function
(b) the LSDV and WG estimated consumption functions

To select between fixed effects and pooled OLS models, we test the joint
significance of the individual-specific intercepts using an F-test.
Testing the hypothesis that the same intercept applies to all individuals
with these data returns a very small p-value. So the null hypothesis that
the same coefficients apply to all individuals (pooled model) is rejected in
favour of the presence of fixed effects. Either LSDV or WG can be used (or
FD, as we will see in Unit A2).
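A minimal R sketch of this F-test, using plm::pFtest() on simulated data (invented names, not simulatedPanel), might look as follows.

# F-test of fixed effects against the pooled model
library(plm)

set.seed(8)
d <- data.frame(id = rep(1:50, each = 8), year = rep(1:8, times = 50))
d$fe <- rep(rnorm(50), each = 8)           # individual effects
d$x  <- 0.5 * d$fe + rnorm(400)
d$y  <- 1 + 2 * d$x + d$fe + rnorm(400)
pdat <- pdata.frame(d, index = c("id", "year"))

fe_mod     <- plm(y ~ x, data = pdat, model = "within")
pooled_mod <- plm(y ~ x, data = pdat, model = "pooling")
pFtest(fe_mod, pooled_mod)   # a small p-value rejects the pooled model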


5.2.4 Additional examples of fixed effects modelling


By collecting observations of the same individual over an extended time
period, differences in individual characteristics are captured in the data
without necessarily being measured and included in the model. Let’s look
at other examples of unobserved heterogeneity in studies where, amongst
other estimation methods, fixed effects estimators were used.

Example 8 Debt repayments: Hajivassiliou (1987)


Hajivassiliou (1987) conducted a study of the external debt
repayments problem using a panel of 79 developing countries observed
over the period 1970–82. These countries differ in terms of their
colonial history, financial institutions, religious affiliations and
political regimes. All of these country-specific variables affect the
problems that these countries have with regards to borrowing and
defaulting, and the way they are treated by their lenders. Not
accounting for this country heterogeneity causes omitted variable bias.
The unobserved heterogeneity in this example is at the country level,
and so are the fixed effects which account for it.

Example 9 Smoking behaviour: Baltagi and Levin (1992)


A study by Baltagi and Levin (1992) considers cigarette demand
across 46 American states for the years 1963–88. Consumption is
affected by variables that vary across states, e.g. religion and
education, and through time, e.g. advertising on TV and radio, which
is nationwide so does not vary by state but may change with time.
There are also variables that may be state-invariant or time-invariant
which are difficult to measure or hard to obtain. Panel data are able
to control for these state- and time-invariant variables in a way that is
much more difficult with a pure time series or cross-sectional study. In
this particular study, there are two types of fixed effects: at the US
state level, and for each time period collected.

Example 10 Agricultural productivity: Deaton (1995)


Finally, Deaton (1995) gives an example from agricultural economics.
This addresses the question of whether small farms are more
productive than large farms. OLS regressions of yield per hectare on
inputs such as land, labour, fertiliser, farmer’s education, etc. usually
find that the sign of the estimate of the land coefficient is negative,


implying that smaller farms are more productive. Some explanations
from economic theory argue that higher output per head is an optimal
response to uncertainty by small farmers, or that hired labour requires
more monitoring than family labour. Deaton offers an alternative
explanation. This regression suffers from the omission of unobserved
heterogeneity, in this case ‘land quality’, and this omitted variable is
systematically correlated with the explanatory variable (farm size). In
fact, farms in low-quality marginal areas (semi-desert) are typically
large, while farms in high-quality land areas are often small. The
author models farm-specific fixed effects in their estimation.
Deaton argues that while gardens add more value per hectare than a
sheep station, this does not imply that sheep stations should be
organised as gardens.

5.3 Random effects estimator


Subsection 5.2 dealt with fixed effects models. Referring back to the most
general form of the equation for panel data (Subsection 3.2.1),
Yit = β0it + β1it X1it + · · · + βKit XKit + uit ,
it is often assumed that the uit term can be broken out into separate
components
uit = µi + λt + νit .
One of the components, µi , captures disturbances that are specific to each
entity and do not change over time. The second, λt , captures variations
across time that are common to all entities. The third component, νit , is a
random variable that captures incidental variation across all observations.
However, in what follows we will ignore λt .
The fixed effects estimators assumed that the individual effects µi (often
denoted fi in fixed effects models) were fixed parameters to be estimated,
and in the various models these formed part of the intercept term or were
eliminated from the model through the transformation of the variables.
The reasoning behind this is that, in these cases, the error component uit
is correlated with the regressors because of µi .
But in cases where the uit , and all of its components, are uncorrelated with
the regressors, a random effects (RE) model is the most suitable estimation
method with panel data. The µi are assumed to be random variables
which remain components of the error term. Starting from the basic form
of the fixed effects model in Model (8) in Box 24 (Subsection 5.2),
Yit = β0 + µi + β1 X1it + · · · + βK XKit + uit ,
instead of treating αi = β0 + µi as fixed, we assume that it is a random
variable with a mean of α (no subscript). The intercept value for a


cross-sectional unit can then be expressed as


α i = α + εi ,
where εi is a random error term with a mean value of zero and a variance
of σε2 . For the example using the simulated panel dataset discussed in
Subsection 5.2.3, this is saying that the five countries in our sample are a
drawing from a much larger population of such countries (e.g. the EU, or
the OECD, or the UN . . .).
So the general form of the RE model becomes
Yit = α + β1 X1it + · · · + βK XKit + (εi + νit ).

There are two error components and both are assumed to be normally
distributed with zero mean and constant variance:
εi ∼ N (0, σε2 )
and
νit ∼ N (0, σν2 ).
The εi capture the random individual effects; νit , often called the
idiosyncratic error, captures remaining elements of variation.
Consequently, the composite error term has a constant variance: σε2 + σν2 .
That is, the model is homoskedastic. (A model where the variance is not
the same is often referred to as being heteroskedastic.) Neither of the
error terms, ε and ν, is correlated with the regressors Xk , for all
k = 1, . . . , K.
OLS is not suitable for estimating a random effects model, since the error
terms for observations on the same cross-sectional unit are not independent
of each other, and the estimation needs to take this error covariance
structure into account.

Box 27 The random effects (RE) estimator


Similarly to the LSDV and WG estimators, the random effects
(RE) estimator is concerned with individual effects, but treats them
as random variations rather than fixed effects.
The random effects model (also known as the error components
model) is
Yit = α + β1 X1it + · · · + βK XKit + (εi + νit ).

The disturbance term has two components: εi captures random


variation related to the individual, and νit captures other randomness
associated with individuals and time.
The RE estimator assumes errors are not correlated with the
regressors Xk .
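As an illustration (not module code), the R sketch below fits a random effects model with plm (model = "random") to simulated data in which the individual effect is uncorrelated with the regressor, so the RE assumptions hold; all names are invented.

# Random effects estimation on simulated data
library(plm)

set.seed(9)
d <- data.frame(id = rep(1:100, each = 6), year = rep(1:6, times = 100))
d$eff <- rep(rnorm(100), each = 6)          # random individual effect
d$x   <- rnorm(600)                         # uncorrelated with the effect
d$y   <- 1 + 2 * d$x + d$eff + rnorm(600)   # true slope is 2

re <- plm(y ~ x, data = pdata.frame(d, index = c("id", "year")),
          model = "random")
summary(re)   # reports the two variance components and the slope estimate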


5.4 Choosing between estimators for panel data
So far in this section, we have introduced a number of different estimators
for panel data:
• the pooled OLS estimator
• the fixed effects estimators, LSDV and WG
• the random effects estimator.
In this subsection, we consider how to choose between the different models.
First, suppose we know we want to use a fixed effects model. As mentioned
in Subsection 5.2.3, the two estimators LSDV and WG are mathematically
the same. Either could be used, but both have downsides.
• The downside of the LSDV estimator is the computational effort,
especially since it is common (and desirable in terms of precision and
hypothesis testing) to have panel datasets with a very large number of
individuals.
• The downside of the WG estimator is not only its stricter assumptions
on the relationship between the error term and the regressors and the
fact it will not estimate the impact of regressors which are
time-invariant, but also that, by only using the variation within units
(over time), it discards the between-unit variation, reducing the usable
variance of the regressors and so becoming less efficient.
So, given these downsides of estimating fixed effects models, it is important
to test whether they are worthwhile compared with the pooled OLS model.
As discussed in Subsection 5.2.3, this can be done by using an
F-test to compare the LSDV model with the pooled model.
But if a fixed effects model is preferred over a pooled model, what about
choosing between fixed and random effects models? The following are a
few considerations:
• If the data exhaust the population, e.g. data on all 50 states of the USA,
then the fixed effects approach, which produces results conditional on
the cross-sectional units in the dataset, seems appropriate because
inference is confined to these cross-sectional units. However, if the data
are a drawing of observations from a large population (like 1000
individuals from a city many times that size) and we wish to draw
inferences regarding other members of that population, then the RE
estimator seems more appropriate (provided it is unbiased).
• The RE estimator can estimate coefficients of time-invariant variables
such as gender and ethnicity. The WG estimator does control for such
variables, but it cannot estimate their coefficients. Furthermore, the
LSDV estimator for the FE model will be impractical for large panels.
• FE models control for all time-invariant variables, whereas RE models
can only estimate the effects of time-invariant variables that are
explicitly introduced in the model.


More generally, the RE estimator is recommended whenever it is unbiased.


That is, when there is no endogeneity – the error components are not
correlated with the explanatory variables. This is because when it is
unbiased, it is more efficient than an FE estimator.
The unbiasedness of the RE estimator can be tested using the Hausman
test. This test makes use of the fact that the FE estimators are consistent
whether or not the individual effects are correlated with the regressors,
whereas the RE estimator is only consistent if there is no such endogeneity.
The null hypothesis underlying the Hausman test is that the FE and RE
estimators are both providing consistent estimates and hence do not differ
substantially.
• If the null hypothesis is rejected, the conclusion is that the RE estimator
is biased due to endogeneity. So the fixed effects model is preferred.
• If the null hypothesis is not rejected, the conclusion is that the RE
estimator is producing a consistent estimate. So the random effects
model is preferred as it is more efficient.
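In R, one way to carry out the Hausman test is plm::phtest(); the minimal sketch below applies it to simulated data in which the RE assumptions hold (all names and values are invented).

# Hausman test comparing fixed and random effects estimators
library(plm)

set.seed(10)
d <- data.frame(id = rep(1:100, each = 6), year = rep(1:6, times = 100))
d$eff <- rep(rnorm(100), each = 6)
d$x   <- rnorm(600)                          # not correlated with the effect
d$y   <- 1 + 2 * d$x + d$eff + rnorm(600)
pdat  <- pdata.frame(d, index = c("id", "year"))

fe <- plm(y ~ x, data = pdat, model = "within")
re <- plm(y ~ x, data = pdat, model = "random")
phtest(fe, re)   # a large p-value is expected here, so RE would be preferred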
In Subsection 5.2.3, you saw that for the simulated panel dataset, a fixed
effects model is preferred over a pooled model. Example 11 will explore
whether, for these data, a random effects model is preferred over a fixed
effects model.

Example 11 Revisiting the consumption function using


an RE estimator
To decide whether a random effects or fixed effects model is better for
the simulated panel dataset, the Hausman test was performed.
For these data it turns out that the p-value is 0.8957. So the null
hypothesis that the two estimators are essentially the same cannot be
rejected. This means that the random effects estimator should be
preferred as it is more efficient.
The fixed effects and random effects models fitted to the simulated
panel dataset are shown in Figure 25. Notice that the models are very
similar, though not identical. This backs up the conclusion from the
Hausman test – that the models are sufficiently similar that a null
hypothesis that they are the same is not rejected.


[Scatterplot of consumption against income with the fixed effects and random effects fitted lines overlaid.]
Figure 25 Comparing FE and RE consumption functions using simulatedPanel

The process of deciding between the different panel data models is
summarised in Figure 26.

[Flowchart: Are fixed effects significant? No: use pooled OLS. Yes: carry out the Hausman test. If H0 is rejected: use a fixed effects estimator. If H0 is not rejected: use the random effects estimator.]
Figure 26 Decision procedure for panel data models


5.5 Using R to estimate parameters in models for panel data
In this subsection, you will learn how to apply pooled, fixed effects and
random effects estimators to panel data:
• the fitting of a pooled model in Notebook activity A1.7
• the fitting of fixed effects models in Notebook activity A1.8
• the fitting of a random effects model in Notebook activity A1.9.

Notebook activity A1.7 Estimating pooled OLS


In this notebook, you will be estimating a basic wage equation for the
data in psid using pooled OLS.

Notebook activity A1.8 Testing for fixed effects in a wage equation
In this notebook, you will be estimating a wage equation for the data
in psid using a fixed effects model. You will also be testing whether
fixed effects are needed against pooled OLS.

Notebook activity A1.9 Finding the best estimation method with panel data
In this notebook, you will be estimating a wage equation for the data
in psid using a random effects model. You will also be exploring
whether random effects or fixed effects would be better for these data.

This concludes the panel data estimation methods covered in Unit A1, but
Unit A2 contains more alternatives.

Summary
This unit emphasised two main ways in which econometrics differs from
statistical modelling more generally: the use of economic theory and the
confirmatory analysis that economic theories often require from statistical
models, and the need to identify causal effects in the relationships between
the dependent variable and each of the regressors of interest.
To identify causal effects, you learnt that three conditions have to be met:
the ceteris paribus assumption, the exogeneity of the regressors, and the
lack of omitted variable bias. You also learnt that to meet these conditions,
a key method is to think about the existence of correlation between the
error term of the model and the regressors of interest. If there is a
suspicion that there might be such correlation, this needs to be addressed.


You explored different data structures and their relation to the solutions
available to identify causal effects. You also learnt that data themselves
can add biases, such as measurement error or selection biases. Panel data,
in particular, can include a specific type of selection bias called attrition
bias.
With cross-sectional data, the most common approaches to identifying
causal effects when the econometrician suspects there are biases are proxy
variables, twin studies and instrumental variables. With panel data, three
more types of estimator are available: pooled OLS, fixed effects estimators,
and the random effects estimator.
We looked at statistical tests in the context of panel data models, namely
the F -test of joint significance of fixed effects and the Hausman test, to
choose between panel data estimators.
You were given the opportunity to apply these methods using R. You
should now be in a position to understand studies that model and estimate
economic problems, and to model your own.
The Unit A1 route map, repeated from the Introduction, provides a
reminder of what has been studied and how the different sections link
together.

The Unit A1 route map

The route map shows Section 1 (The economic problem), Section 2 (The
econometric model), Section 3 (Data: structures, sampling and
measurement), Section 4 (Estimating causal relationships) and Section 5
(Estimating causal relationships using panel data).


Learning outcomes
After you have worked through this unit, you should be able to:
• identify the key stages of doing econometrics
• write down an economic model which represents an economic problem
• discuss the identification of key parameters of an econometric model
• analyse the error term of an econometric model and assess ways in which
it may be correlated with the regressors
• engage with the possibilities and limits of different data structures when
estimating an econometric model
• critically evaluate the limits of different estimation techniques in
providing unbiased and consistent estimators
• apply these estimation methods using R.


References
Ashenfelter, O. and Krueger, A. (1994) ‘Estimates of the economic return
to schooling from a new sample of twins’, The American Economic Review,
84(5), pp. 1157–1173.
Baltagi, B.H. and Levin, D. (1992) ‘Cigarette taxation: Raising revenues
and reducing consumption’, Structural Change and Economic Dynamics,
3(2), pp. 321–335. doi:10.1016/0954-349X(92)90010-4.
Blackburn, M.L. and Neumark, D. (1993) ‘Omitted-ability bias and the
increase in the return to schooling’, Journal of Labor Economics, 11(3),
pp. 521–544.
Bureau van Dijk (2020) Amadeus. Available at: https://www.open.ac.uk/
libraryservices/resource/database:350727&f=33492 (Accessed:
22 November 2022). (The Amadeus database can be accessed from The
Open University Library using the institution login.)
Deaton, A. (1995) ‘Data and econometric tools for development analysis’,
in Behrman, J. and Srinivasan, T.N. (eds) Handbook of Development
Economics. Amsterdam: Elsevier Science, pp. 1785–1882.
Duesenberry, J. (1949) Income, saving, and the theory of consumer
behaviour. Cambridge, Massachusetts: Harvard University Press.
Eurostat (2022) GDP and main components (output, expenditure and
income). Available at: https://ec.europa.eu/eurostat/en/ (Accessed:
17 January 2023).
Friedman, M. (1957) A theory of the consumption function. Princeton:
Princeton University Press.
Hajivassiliou, V.A. (1987) ‘The external debt repayments problems of
LDCs: An econometric model based on panel data’, Journal of
Econometrics, 36(1–2), pp. 205–230. doi:10.1016/0304-4076(87)90050-9.
Stata Press (no date) ‘Datasets for Stata Longitudinal-Data/Panel-Data
Reference Manual, Release 17’. Available at:
https://www.stata-press.com/data/r17/xt.html
(Accessed: 10 January 2023).
Stigler, G.J. (1954) ‘The early history of empirical studies of consumer
behavior’, Journal of Political Economy, 62(2), pp. 95–113.
doi:10.1086/257495.
Stigler, G.J. (1962) ‘Henry L. Moore and statistical economics’,
Econometrica, 30(1), pp. 1–21. doi:10.2307/1911284.


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 2.1, grocery shopping: The Print Collector / Alamy Stock
Photo
Subsection 2.3.1, pig iron production: Pi3.124 / Wikimedia. This file is
licensed under the Creative Commons Attribution-Share Alike 4.0
International license.
Subsection 3.3, green and orange balls: Laurent Sauvel / Getty
Subsection 4.3.3, trombone: C Squared Studios / Getty
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
At the heart of both modelling processes is a cycle. In particular,
estimation or fitting of a model is done many times. So in both modelling
processes there is the idea of using results from model estimation to
improve the modelling, leading to what is hopefully an improved model
being estimated.
The main difference is that in the statistical modelling process some tasks
are not part of the cycle, whereas in the econometric modelling process
everything is part of the cycle. In particular, in the statistical modelling
process steps relating to ‘Collect data’ and ‘Explore data’ appear before
the cycle, whereas a step relating to ‘Data’ appears in the cycle that forms
the econometrician’s modelling process. This reflects that in econometrics
the relationship between economic theory and the selection of data is
often entwined. Given the key variables and their relationships expressed
by different economic models, the econometrician will choose an
econometric model. Using that model, they will estimate parameters given
the data available, and use the results to shed light on the key
relationships of the economic problem.

Solution to Activity 3
Both studies show expenditure shares for several categories of goods, food
included. They show these shares for different groups.
In Table 1, Davies tabulates these shares across different income bands,
starting from the £10–£20 income band to the £30–£45 income band. He
only looks at agricultural workers.
In Table 3, Engel uses an alternative to income which tries to capture
overall poverty, economic position, and income security. He uses a
categorical variable which has three groups: families on relief, poor but
independent, and comfortable families.
Both studies show that as income or economic position improves, the
individualised share of expenditure spent on food decreases. This
individualised share divides the family expenditure share by the number of
people in the household. In contrast, and looking specifically at Engel’s
study, the share spent on health and recreation increases with income.


Solution to Activity 4
The models that are linear in parameters are (a), (b), (c) and (e).
(a) Note that a0 can be thought of as a0 × 1. So if we think of a variable
which takes the value 1 for all units of observation, this model is
linear in parameters.
(b) Like for the model in part (a), we can think of a0 as a0 × 1. (And/or
a1 too if we like.) So this model is linear in parameters.
(c) Note that each parameter is associated with one variable only, and that
all terms are additively separated. One of the variables is non-linear –
quadratic in fact – but that does not contradict the definition of linear
in parameters. It just means that the original variable X would need
to be transformed and its transformation added to the model.
(d) Note that there is one term that includes two parameters in a
multiplicative way: a1 a2 . So this model is not linear in parameters.
(e) Note that each parameter is associated with one variable only, and that
all terms are additively separated. One of the variables is non-linear,
the product of two variables, but that does not contradict the
definition of linear in parameters. It means, as for the model in
part (c), that a new variable needs to be created to estimate this
model.

Solution to Activity 5
The parameter c0 is the intercept of this linear function and, by definition,
it is the amount of consumption needed when income (I) is zero. (The
literature on the Keynesian consumption function often uses c0 to
represent the intercept because this notation reminds us that it represents
consumption c when income is zero. You will remember from Activity 3
that, for poorer families, Davies as well as Engel found that consumption
was often higher than income. Keynes called this level of subsistence
consumption autonomous consumption.)
The parameter b is the slope of the linear function and, by definition, it is
the amount of the change in consumption when income changes. At each
income level, the consumption increase is a defined proportion of the
income increase. This parameter b can be interpreted as the change in
consumption divided by change in income. (Keynes called b the marginal
propensity to consume.)


Solution to Activity 6
The parameter c0 , the autonomous consumption, should be a positive
number, and in line with what poorer people would consume.
The marginal propensity to consume, b, should be fixed and lie between 0
and 1 for a particular income group, since income increases will be used as
consumption or savings.

Solution to Activity 7
Data focusing on a particular country will record observations over time.
This was the case for the data Henry Moore used to estimate the demand
for corn and pig iron in the USA, and it included annual data
from 1867–1911. (This type of data structure is called a time series and it
will be the subject of Unit A2.)
In contrast, data on several families and their expenditure and income may
be observed at a specific point in time, such as the data used by Engel in
Table 3 of Stigler (1954) shown in Activity 3. (This type of data structure
is said to be cross-sectional.)

Solution to Activity 8
The AIH, by focusing on current income only, and not having an in-built
lifecycle or intertemporal dimension, can be and has been analysed using
both cross-sectional and time series data.
The LIH and PIH require different observations on the same subject over
time, which is why studies using these theories require time series data.
Alternatively, the RIH requires information on each subject’s peers or
reference group so that the basic data structure is cross-sectional, looking
at several individuals observed at the same time.
All the theories can be tested using panel data.

Solution to Activity 9
(a) For this model in levels, we interpret the coefficient of labour as the
amount (in thousands of euros) by which production increases when labour
is increased by one thousand euros. That amount is almost 54, and is
statistically significant. By the same token, production increases by
1.364 thousand euros when capital increases by 1 thousand euros, and
this is also statistically significant.
(b) The R2 of this model is 0.4632, which is reasonable in economics
studies of production functions, and when only two regressors are
included. However, the distribution of the residuals suggests that they
are very skewed to the right, and therefore too far from the normality
assumption.


Solution to Activity 10
(a) As this model is fitted to logged data, the coefficients are elasticities.
When labour increases by 1%, production increases by 0.33%; when
capital increases by 1%, the production increases by 0.52%.
(b) The R2 of this model is higher than the R2 for the model fitted in
Activity 9. Table 9 also suggests that the distribution of the residuals
is now much closer to a symmetric distribution. All in all, the
interpretation of the coefficients is clearer, the model fit has improved,
and hypothesis testing is more reliable since the distribution of the
residuals is now closer to a normal distribution. So this model seems
better.

Solution to Activity 11
Experience shows up in the model as a quadratic function, with both a
linear and a squared term. To calculate the impact of experience on hourly
wages, we need to find the derivative of log(wage) with respect to exper.
After some calculations, we derive that the effect of experience on
log(wage) is
β1 + 2β2 exper
and so it varies with the individual’s experience, and the effect of
experience on hourly wages is
(β1 + 2β2 exper)/wage.
So we will need to resist the temptation to apply the ceteris paribus
assumption to each coefficient separately when its effect is being
represented by more than one term.

Solution to Activity 12
The interpretation of a coefficient in a multiple regression, in particular the
assumption of holding all else constant (ceteris paribus), provides an
additional mechanism to choose the right specification. In Model (5) we
can interpret the coefficient of educ as the effect of education on the log of
hourly wages, holding experience constant. Its own direct effect on the log
of hourly wages has been modelled. If both variables are exogenous (and
that is the situation the econometrician ideally wants), we have identified
the causal effect of both education and experience on the log of hourly
wages.
The coefficient of educ is statistically significant in both specifications. As
seen in Box 18 (Subsection 4.1), we can interpret the coefficient estimate
multiplied by 100 as the percentage increase in hourly wages given by a
unit increase in education. In Model (4), a univariate model, this
percentage is approximately 5.5%, and in Model (5), an augmented model,
it increases to almost 6.6%. While the difference between the two estimates
is not large, this difference does force the econometrician to think about
which specification, if any, is the one which should be analysed further.

Solution to Activity 13
As education increases, on average in our dataset, experience decreases
(due to their negative correlation). Looking back at Figure 18, the effect of
experience on wages (a positive and statistically significant effect in our
example) is being picked up by changes in education. Because the effect of
experience on wages is positive, higher experience means higher wages on
average. In these data, however, high experience is associated with lower
levels of education, so higher values of education are associated with lower
values of experience. An estimator in a model which omits experience will
therefore pick up the dampening effect on wages of decreasing experience
as education increases. In Activity 12, this dampening effect was
1.1 percentage points: including experience increased the estimated effect
of education on wages from 5.5% to 6.6%.

Solution to Activity 14
Some economists argue that those with higher ability will go further in
their education, which creates a positive correlation between ability and
education. But to explore why ability should be in the model as well as
education, we will need to use the ceteris paribus assumption: holding all
else constant, including education, what is the direct effect of ability on
wages?
Consider two individuals with the same education, same experience, but
with different ability. A human capital explanation argues that ability
translates into more productivity, ease at picking up new tasks and
efficiency at existing ones, which translates into higher wages. An
alternative explanation, known as a signalling model explanation, assumes
higher-ability individuals choose to have more education to signal to the
labour market that they are more productive, which is then confirmed once
they are employed.

Solution to Activity 15
The direction of the bias is given by the sign of the second term on the
right-hand side of Equation (6), which is the product of the signs of two
factors: the effect of the excluded variable on the dependent variable (β2 in
the formula), and the sign of the correlation between included and
excluded regressors (in this case, education and ability). Given economic
theory and competing explanations, the effect of ability on wages is
positive. And the correlation between education and ability is also
positive. The product of two positive factors is also positive. So the biased
estimator of the education coefficient is higher than the true one, and we
expect the true value to be lower. This is what we call positive bias.


Solution to Activity 16
Column (1) shows the results of the model which omits ability. It shows
that, on average, each year of education increases hourly wages by 3.2%.
In columns (2) and (3), this return to each year of education decreases to
1.3% and 1.2%, respectively.
Economic theory suggests that there is a positive bias when ability is
excluded from the econometric model. In this particular study, we can see
that this positive bias is quite substantial. You may also note that the R2
has not increased substantially when test scores were added. In
econometrics, the choice between models often requires an analysis of
possible omitted variables biasing our parameters of most interest, as we
have just done. Looking solely at the R2 , we would be tempted to reject
all additional variables given such a low increase in the goodness of fit. In
doing so however, we would be using estimates for returns to schooling
which are broadly three times what they were likely to be with this
sample in this time period, and we would be drawing misguided policy
recommendations.

Solution to Activity 17
The instrumental variable, or variables (as we can use several variables to
replace one endogenous regressor), would replace education. The IV
would need to be correlated with education but, conditional on this
correlation, ceteris paribus, it should not have a direct impact on the
dependent variable (in our example, this would be wages).
Effectively, the IV would not enter the original wage equation model as it
only works via education.

Unit A2
Time series econometrics
Introduction
Sequences of observations recorded at regular intervals over time are called
time series. This second (and final) unit of Strand A is concerned with the
economic analysis of time series data. A key feature of time series is that
they are numerical footprints of history. In other words, they track the
evolution of a variable through time and, therefore, tell you something
about its past evolution. History matters, therefore, when dealing with
time series.
Historical variation, however, is rarely random in nature. Almost
invariably, successive measurements of an economic or social time series
variable will be related to each other in some way, and so will
measurements across different time series, revealing patterns that give us
structure to understand what went on in the past and possibly also provide
us with a basis for making predictions (forecasts) about their future
evolution.
Time series data, therefore, pose particular challenges for statistical and
econometric modelling because each variable on its own reflects historical
variation through time. In practical terms, some of these challenges
include: the fact that time series variables cannot simply be viewed as a
randomly drawn sample from a wider population; that observations are
also likely to be correlated with each other; that changes due to time can
be difficult to distinguish from changes of a time series variable; and that
when adding additional time series regressors to an econometric model,
these time-dependence issues get compounded with the likely correlation
between the regressors and the error terms. These have consequences for
the way we model the evolution of a time series and for how we model and
estimate the relationship between more than one time series in a regression
equation.
In Section 1, we start by considering two types of time series variable:
stocks and flows. Then, in Section 2, the focus is on the concepts and
techniques that help us to make sense of the patterns inherent in a single
time series variable. In Section 3, we will use the example of particular
time series models, called random walks, to rehearse the techniques from
Section 2. A key concept you will learn and need to ensure when modelling
time series is stationarity. We will be discussing how to model and
transform time series data so that they become stationary in Section 4,
and how to test for stationarity in Section 5. We will argue that most of
the estimation and hypothesis testing you know from the rest of the
module is still valid if and only if variables are stationary and all OLS
assumptions hold (including the condition, studied in Unit A1, that errors
and regressors are independent of each other).
Subsequently, based on this understanding of the analysis of a single time
series, in Section 6 we will discuss how the concerns with stationarity
extend to the modelling of an econometric model with additional
regressors. To do so, we will first focus on what is called the problem of
spurious regressions with time series data, which is a problem that results
when two or more non-stationary time series variables display similar
patterns of movement through time without there being a relationship
between them. When there is a structural relationship between them,
variables which are initially non-stationary are called cointegrated and
their joint behaviour can be meaningfully modelled together. The section
ends by discussing relationships between cointegrated variables and how
error correction models can further add to the modelling and analysis of
the relationship between cointegrated variables.
Finally, in Section 7, we will discuss how time series data can be modelled
using panel data.
The following route map shows how the sections of the unit link together.

The Unit A2 route map

The route map shows Section 1 (Stock and flow variables), Section 2
(Describing intertemporal properties of time series), Section 3 (Random
walks), Section 4 (Stationarity and lagged dependence of a time series),
Section 5 (Testing for stationarity), Section 6 (Modelling more than one
time series variable) and Section 7 (Modelling time with panel data).

Note that Subsections 2.5, 3.3, 5.4, 6.4 and 7.1.2 contain a number of
notebook activities, which means you will need to switch between the
written unit and your computer to complete these sections.

1 Stock and flow variables


In econometric analysis, time series are commonly denoted with the
subscript t, where t denotes the particular point in time, or the specific
period of time, that each observation of the variable measures. Time series that are
measured at successive points in time are called stock variables, whereas
those measured over successive periods of time are called flow variables.
For example, consider the following variables relating to a country’s
population and economy.
• The size of a country’s population is a stock variable as it makes sense
to refer to the population at a particular point in time. The number of
births, deaths and net migration are all flow variables because they are
measured over successive periods of time.
• Net public debt is a stock variable because it is measured at successive
points in time. GDP is a flow variable because it is measured over
successive periods of time (yearly, quarterly or monthly).
Furthermore, if you measure the change in a stock variable from one point
in time to the next, the resulting variable will be a flow variable, because
this change takes place over a period of time (the period between two
different points in time). And if you measure the change in the level of a
flow variable from one period to the next, the resulting variable will also
be a flow variable.
For example:
• The change in population size from one Olympics to the next is a flow
variable. The individual components of the change in population size –
number of births, number of deaths and net migration – are flow
variables.
• The change in net public debt from one point in time to the next is a
flow variable. The change in GDP from one Olympics to the next is a
flow variable.
This is summarised in Box 1. You will then see in Example 1 how a stock
variable and a flow variable were used to monitor the COVID pandemic.

Box 1 Stocks and flows


Stocks and flows are both types of time series variables.
• A stock variable is measured at successive points in time.
• A flow variable is measured over successive periods in time.
• A change in a stock variable from one point in time to the next is a
flow variable defined over the period in between both points in time.
• A change in a flow variable from one period to the next is also a
flow variable.
(Margin photo: The number of cats in a household each month is a stock
variable and the number of kittens born in a month is a flow variable.)


Example 1 Monitoring the COVID pandemic


Two important indicators were routinely made available each day to
inform public policy and awareness during the COVID pandemic.
The first indicator gave the reported number of newly infected people
each day. This is a flow variable. Its trajectory through time signalled
whether the number of new infections was expanding or contracting
and thus whether the pandemic was under control or not.
The second indicator reported the total number of COVID patients in
hospital care in the country each day. This is a stock variable. This
indicator mattered for public policy because its trajectory signalled
the strain the pandemic put on the health care system and, more
specifically, the extent to which the pandemic threatened to
overwhelm available hospital capacity to care for patients. The
trajectory of this stock variable through time depended not only on
the percentage of people infected with COVID who subsequently –
with a time lag of 1 to 2 weeks – needed hospital care, but also on the
average number of days COVID patients stayed in the hospitals before
being discharged or before mortality occurred. The distribution of the
length of these hospital stays altered during the pandemic as the virus
mutated, changing its impact on those infected, and as treatments for
COVID were developed and improved.

In the following activity, you will practise distinguishing between stock and
flow variables yourself.

Activity 1 Distinguishing stocks from flows


For each of the following time series variables, indicate whether it is a
stock or a flow variable.
(a) Your bank balance.
(b) Your expenses on food.
(c) Daily miles travelled to and from work.
(d) The number of unemployed people in the UK.
(e) Infant mortality in the UK.
(f) The rate of inflation of consumer prices in a country.

In econometrics, the technique of calculating absolute changes of a time


series variable (whether flows or stocks) from one period to the next or
from one point in time to the next is called first differencing. This is
detailed in Box 2.


Box 2 First differencing a time series


First differencing a time series (say, Yt ) is commonly denoted using
the symbol ∆.
Hence
∆ Yt = Yt − Yt−1 .
Note that ∆ Yt is often referred to as ‘first differences of Yt ’. (And, as
you saw in Unit A1, the untransformed time series is often referred to
as the variable ‘in levels’.)

Taking first differences of a time series variable results in a different
variable and hence highlights a different feature of the original time series
(its change over time) while abstracting from another feature (its level in a
given period or at a given point in time). Number of births, for example,
tells you how many children are born during a particular period while
abstracting from how high or low the population is in any given period in
time. In this unit, you will see that this technique of first differencing plays
an important part in the econometric analysis of time series.
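As a small, self-contained illustration (the numbers are made up), first
differences can be computed in R with the diff() function:

y <- c(100, 103, 101, 106, 110)   # a made-up time series in levels
diff(y)                           # first differences: 3 -2 5 4 (one observation shorter)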

2 Describing intertemporal
properties of time series
In this section, we will look at a couple of time series, relating to
unemployment and GDP, and consider some properties that can be used to
describe them.
In Subsection 2.1, we will look at the trajectory of the quarterly rate of
unemployment using a time plot. In Subsections 2.2 and 2.3, we will
introduce two different plots that are used to explore time series – the
autoregressive scatterplot and the correlogram – and you will see how they
can be used to describe the unemployment data. In Subsection 2.4, you
will then apply the ideas in Subsections 2.1 to 2.3 to a time series relating
to GDP. Finally, in Subsection 2.5, you will use R to explore times series.

2.1 Exploring the evolution of the UK unemployment rate
In this subsection, we introduce some real time series data – the quarterly
rate of unemployment. (Quarterly data provides four data points per
year.) Whilst the exact dates can vary between datasets, these quarters
roughly correspond to the following: Quarter 1 (Q1) is January to March;
Quarter 2 (Q2) is April to June; Quarter 3 (Q3) is July to September; and
Quarter 4 (Q4) is October to December.

The reason for selecting unemployment is that, as Adam Tooze (a
well-known professor of economic history) put it, ‘the best general
well-known professor of economic history) put it, ‘the best general
indicator of an economy’s health is the rate of employment and
unemployment’ (Tooze, 2021, p. 61). The data are described next.

Unemployment in the UK
The available labour force of a country at a particular point in time
consists of the total number of those who are currently employed plus
those who are unemployed. The rate of unemployment of a country is
then defined as the number of unemployed as a percentage of the total
labour force. Similarly to the number of unemployed people discussed
in Activity 1 (Section 1), the unemployment rate is also a stock
variable. (Generally, to be unemployed a person has to be actively
seeking work or about to start working.)
The unemployment dataset (unemployment)
In the UK, the Office for National Statistics (ONS) routinely
publishes the seasonally adjusted rate of UK unemployment for people
aged 16 and over on a monthly, quarterly and annual basis. Monthly
and quarterly data are available from 1971 onwards.
(Margin photo: A sight that becomes more common when unemployment is
rising.)
The data considered in this dataset are the quarterly data for the
seasonally adjusted UK unemployment rate for the period 1971 Q1 to
2020 Q4.
The dataset contains data for the following variables:
• year: the year that the observation relates to
• quarter: the quarter (of the year) that the observation relates to,
taking the value 1, 2, 3 or 4
• unemploymentRate: the seasonally adjusted UK unemployment
rate for the quarter that the observation relates to.
The data for the first six observations in the unemployment dataset
are shown in Table 1.
Table 1 The first six observations from unemployment

year quarter unemploymentRate


1971 1 3.8
1971 2 4.1
1971 3 4.2
1971 4 4.4
1972 1 4.5
1972 2 4.4

Source: Office for National Statistics, 2021a, release date 23 February 2021

Notice that the rate of unemployment given in the unemployment dataset
is seasonally adjusted. Box 3 explains what is meant by ‘seasonally
adjusted’ time series and why seasonal adjustment is so frequently used
when presenting aggregate data for the economy as a whole.

Box 3 Seasonal adjustment of time series data


Seasonal adjustment of time series data is a process that aims to
remove regular pattens in the data associated with the time of the
year.
The ONS includes the following in its explanation of the purpose of
seasonal adjustment of time series:
[Seasonal adjustment] facilitates comparisons between
consecutive time periods. . . .
Those analysing time series typically seek to establish the
general pattern of the data, the long-term movements and
whether any unusual occurrences have had major effects on the
series. This type of analysis is not straightforward when one is
reliant on raw time series data, because there will normally be
short-term effects, associated with the time of the year, which
obscure or confound other movements.
For example, retail sales rise each December due to Christmas.
The purpose of seasonal adjustment is to remove systematic
calendar-related variation associated with the time of the year,
that is, seasonal effects. This facilitates comparisons between
consecutive time periods.
(Office for National Statistics, no date)
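The ONS uses considerably more sophisticated methods, but the basic idea
can be sketched in R as follows; raw_values is an assumed name for an
unadjusted quarterly series.

# A minimal sketch of a simple seasonal adjustment (not the ONS method).
x <- ts(raw_values, start = c(1971, 1), frequency = 4)  # raw_values: assumed data
dec <- decompose(x)        # classical additive decomposition
x_sa <- x - dec$seasonal   # remove the estimated seasonal component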

The simplest and often the first approach to describing a time series is to
use a time plot: that is, a connected scatterplot (line plot) of the data
against time. An example of a time plot is given next – Figure 1 shows the
evolution of the seasonally adjusted quarterly rate of unemployment
from 1971 to 2020. It yields an immediate historical panorama of the
evolution of unemployment in the UK over a period of 50 years.
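A time plot such as Figure 1 could be drawn in R roughly as follows; the
sketch assumes the unemployment data frame contains the variables
described above.

# A minimal sketch using the columns year, quarter and unemploymentRate.
time <- unemployment$year + (unemployment$quarter - 1) / 4
plot(time, unemployment$unemploymentRate, type = "l",
     xlab = "Year", ylab = "Unemployment rate (%)")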

Figure 1 Seasonally adjusted quarterly rate of unemployment: 1971 Q1
to 2020 Q4 (a time plot of the unemployment rate (%) against year)

Figure 1 shows that the trajectory of the rate of unemployment left very
distinctive historical footprints about what happened in the past. The rate
of unemployment clearly varied quite markedly over time, showing a
landscape of hills, valleys and occasional flatlands.
Economic historians interpret these patterns taking into account the
broader context of economic history of the UK, and particularly of the
successive distinctive policy regimes that prevailed during these years.
During the 1950s and 1960s (not shown in Figure 1) the rates of
unemployment fluctuated around or below 2% (Santos and Wuyts, 2010,
p. 33). These were the heyday of Keynesian economic policies
characterised by the primacy given to maintaining full employment in the
economy. As Figure 1 shows, the rate of unemployment started rising
during the 1970s but still remained historically relatively low in light of
subsequent developments.
A radical break with the past occurred at the start of the 1980s when
exceptionally high levels of unemployment prevailed during most of this
decade and into the 1990s. The 1980s was a decade for which Margaret
Thatcher was the UK’s prime minister, when maintaining low
unemployment no longer featured as a key focus of economic policy.
Most of the 2000s (up to the second half of 2008) showed moderate
variation in the rate of unemployment at relatively lower levels. This
period became known as the period of the ‘great moderation’. But also,
taking into account subsequent events, it turned out to be a period of
silence that preceded the storm that hit Western economies as a result of
the 2008 financial crash, which led to a steep rise in unemployment in the
late 2000s and early 2010s. The rate of unemployment subsequently
declined before the COVID pandemic hit the world economy in 2020.
For the purpose of modelling time series, you need to understand some of
their ‘intertemporal properties’ (Leamer, 2010, p. 91); that is, the
relationships between past, present and future values. An explanation of
two of these properties – persistence and momentum – is given by the
econometrician Leamer, as described in Box 4.

Box 4 Persistence and momentum


Two intertemporal properties of time series are persistence and
momentum.
• Persistence relates to how quickly the time series responds to
changes; that is, whether a variable that increases rapidly tends to
stay there or whether it quickly comes back down to its normal
level. A persistent variable will tend to stay where it is and
converge only slowly, if at all, to its historical mean.
• Momentum relates to whether the direction of changes tends to
persist; that is, whether a variable that moves up (or down) tends
to continue moving in the same direction as it was moving, at least
for a significant stretch of time, if not indefinitely.

In the next activity, you will consider the extent to which the
unemployment data have these two properties.

Activity 2 Persistence or momentum?

What does the Figure 1 time plot of the seasonally adjusted quarterly rate
of unemployment suggest about the variable’s persistence and momentum
over time?

While a time plot, such as that in Figure 1, is often the first technique
used to describe a time series, other visual techniques are used to start
breaking into the intertemporal properties of a time series, and to focus on
how a variable is explained in terms of its own past behaviour. To do this,
it is convenient at this stage to introduce you in Box 5 to the idea of the
lag operator.


Box 5 The lag operator L


In time series analysis, the lag operator L (also referred to as the
backshift operator) is defined as
L Yt = Yt−1 .
Hence, the first differencing operator described in Box 2 (Section 1)
can be rewritten using the lag operator as
∆ Yt = Yt − Yt−1
= Yt − L Yt
= (1 − L) Yt .
The lag operator can be used to refer to observations which are more
than one time period apart. For instance,
L2 Yt = L(L Yt )
= L Yt−1
= Yt−2 .
Repeated application of the lag operator is denoted as
L Yt , L2 Yt , L3 Yt , L4 Yt , ...,
where Lp Yt = Yt−p , for all p = 1, 2, . . . .
The time series Yt−p , created by applying the lag operator p times,
will have p fewer observations than the original Yt .
Note that with quarterly data the L4 operator takes you back to the
same quarter in the previous year.
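As a small illustration (not module code), the lag operator can be mimicked
in R by shifting a vector; lag_p() below is a hypothetical helper and the
values are made up.

lag_p <- function(y, p = 1) c(rep(NA, p), head(y, -p))  # applies L^p to y

y <- c(3.8, 4.1, 4.2, 4.4, 4.5)
lag_p(y, 1)       # L y_t:   NA 3.8 4.1 4.2 4.4
lag_p(y, 2)       # L^2 y_t: NA  NA 3.8 4.1 4.2
y - lag_p(y, 1)   # first differences, (1 - L) y_t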

We will make use of the lag operator and lagged variables in the two plots
that are introduced next: autoregressive scatterplots in Subsection 2.2 and
correlograms in Subsection 2.3.

2.2 Autoregressive scatterplots


An autoregressive scatterplot is a scatterplot of a time series variable Yt
against the lagged variable Yt−p for some p ≥ 1. Such a plot allows the
dependence of a variable on a lagged version of it to be explored in a visual
way. This is detailed more formally in Box 6.

Box 6 Autoregressive scatterplots


An autoregressive scatterplot is a scatterplot of a time series
variable against its own lagged variables.
To examine intertemporal properties of a time series, the
econometrician visually inspects a sequence of autoregressive
scatterplots of variable Yt against L Yt , variable Yt against L2 Yt ,
variable Yt against L3 Yt , variable Yt against L4 Yt , and so on, in
search of the scatterplot – and therefore the lag – that shows no
correlation between the two plotted variables.

For example, Figure 2 shows the autoregressive scatterplots of the
unemployment rate against its lag 1, then lag 2, then lag 3 and then lag 4.
The unemployment rate is measured quarterly in this dataset, so L4 of the
unemployment rate is effectively its own value four quarters in the past,
which is its own value one year ago.

Figure 2 Autoregressive scatterplots of the rate of unemployment against:
(a) lag 1, (b) lag 2, (c) lag 3 and (d) lag 4 (in each panel, the unemployment
rate (%) is plotted against its lagged value)
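A single panel of a plot like Figure 2 could be produced in base R along the
following lines; this is a sketch that assumes the unemployment rates are
held, in time order, in a numeric vector ur (an assumed name).

n <- length(ur)
y_t <- ur[2:n]            # Y_t
y_lag1 <- ur[1:(n - 1)]   # L Y_t
plot(y_lag1, y_t, xlab = "Unemployment rate (%) lag 1",
     ylab = "Unemployment rate (%)")
abline(a = 0, b = 1, lty = 2)   # 45-degree line; points close to it indicate persistence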

Recall that Box 4 (Subsection 2.1) introduced two properties of time
series: persistence and momentum. Persistence can be investigated using
autoregressive scatterplots, as it occurs when the values of a time series
and its lagged counterpart are close to each other. Bear in mind that on an
autoregressive scatterplot the variables on both axes are measured with
the same units and at the same scale, so persistence implies that the points
representing their joint values are close to the 45-degree line in each
scatterplot.
You will consider the plots in Figure 2 in the next activity.

Activity 3 Interpreting autoregressive scatterplots

Use Figure 2 to answer the following questions.


(a) Is it true that the closer together in time the observations are, the
closer they are correlated?
(b) Would you say that the rate of unemployment time series is highly
persistent?

As has already been noted, the lag 4 autoregressive plot for quarterly data
plots values against those from the same quarter of the previous year. So
dependencies in such plots often reflect seasonality in such data. Recall,
however, that in Figure 2 we are plotting seasonally adjusted rates of
unemployment and, hence, any seasonal particularities have been removed
from the raw data on which this series was based. This explains why
Figure 2(d) shows no specific features that might be due to seasonality.
While the autoregressive scatterplot technique already goes a long way in
suggesting how far back past values of a time series determine its current
value, visual inspections are often not enough to model the degree of
persistence of a time series variable. This is why autoregressive
scatterplots are often used in conjunction with a plot called a correlogram.
So, in the next subsection we will introduce the correlogram and the
summary statistic it plots: the autocorrelation coefficient.

2.3 Autocorrelation and the correlogram


In Activity 3 (Subsection 2.2), you considered the correlation between
unemployment rate and each of its first four lags. As you will see in Box 7,
the correlation coefficient between a time series variable and one of its
lagged counterparts has a special name – the autocorrelation coefficient.
(‘Autocorrelation’ literally means self-correlation.)


Box 7 The autocorrelation coefficient


For a time series Yt , the (population) autocorrelation coefficient,
ρp , at a lag p, is defined as the (population) correlation coefficient
between Yt and Lp Yt .
In the population, the variance of Yt is the same as the variance
of Lp Yt . So this is equivalent to the ratio between the (population)
covariance of Yt and Lp Yt , and the variance of Yt .
The sample equivalent of ρp is usually denoted as rp .
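In symbols (a restatement of Box 7 using the unit's notation), this
definition can be written as:
\rho_p = \mathrm{Corr}(Y_t, L^p Y_t) = \frac{\mathrm{Cov}(Y_t, Y_{t-p})}{\mathrm{Var}(Y_t)}.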

Notice that Box 7 did not detail how an autocorrelation coefficient is
estimated. Furthermore, with time series there is often interest in when
there is no correlation between the variable and a lagged version of it. So
it is also important to have some assessment of whether a sample
autocorrelation provides evidence that the corresponding population
autocorrelation is not zero. In Activity 4 you will use one method, based
on regression, to achieve both of these goals.

Activity 4 Estimating the autocorrelation between the rate of unemployment and its first lag
One way to obtain an estimate of the autocorrelation coefficient, rp , for the
rate of unemployment, and to test whether it is significantly different from
zero, is to compute a simple linear regression of the rate of unemployment
against the pth lag of it. It turns out that an estimate of the
autocorrelation is given by the slope, β̂.
For example, r1 for the rate of unemployment can be estimated by fitting
the regression equation
unemploymentRatet = α + β unemploymentRatet−1 + ut .
The results from fitting this model are summarised in Table 2.
Table 2 Coefficients for unemploymentRatet ∼ unemploymentRatet−1

Parameter Estimate Standard error t-value p-value
Intercept 0.069 0.0579 1.19 0.234
unemploymentRatet−1 0.991 0.0080 124.25 < 0.001

The R2 value for this regression is 0.9874.


Based on this regression:
(a) What is the estimate of r1 , the first-order autocorrelation for the rate
of unemployment?
(b) Is it reasonable to assume that the corresponding population
autocorrelation is zero?
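A sketch of how the regression in this activity could be set up in R is given
below; it is not the module's notebook code, and assumes the unemployment
rates are in a vector ur ordered by time.

n <- length(ur)
d <- data.frame(rate = ur[2:n], rate_lag1 = ur[1:(n - 1)])
fit <- lm(rate ~ rate_lag1, data = d)
summary(fit)   # the estimated slope is the estimate of r1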

111
Unit A2 Time series econometrics

In Activity 4, you used the results from a simple linear regression to


estimate the first-order autocorrelation. It is also worth noting the
following about this fitted model.
• The intercept term is close to zero, with a p-value of 0.234, and hence is
not significantly different from zero. As you will see in Section 3,
whether or not the intercept term is present can make a difference to the
type of time series we think we have.
• The coefficients of the slope and intercept, together with the very
high R2 , mean that the rate of unemployment in the previous quarter
nearly mirrors itself into the next quarter.
• As the model suggests that the rate of unemployment from quarter to
quarter is practically the same (as indicated by the very high R2 value),
it does not enable us to explain why or how the rate of unemployment
changes, and so it is not a useful model for policy. We will return to this
issue in Section 6.
So far we have been considering each autocorrelation coefficient (ρp or rp )
for p = 1, 2, . . . individually. These are also thought of as values of a
function: the autocorrelation function (ACF).
As you will see in Box 8, a correlogram plots estimated autocorrelations of
a time series for a sequence of successive lags to investigate the
relationship between a time series variable and its past values.

Box 8 The correlogram


A correlogram is a plot of the autocorrelation function, rp , against
its lags, p, for all p between 1 and P (where P is the maximum
number of lags to be plotted).
A correlogram provides an elegant way to look at the pattern of these
correlation coefficients as the lag increases, and to choose a lag order
where the autocorrelation is no longer significantly different from zero.
It is also useful for detecting cyclical and seasonal behaviour in time
series.

The correlogram will visually and quickly show how far back the
autocorrelation of a variable and its lag is significantly different from zero,
as you will see in Example 2 and Activity 5.


Example 2 Correlogram for the rate of unemployment


Figure 3 shows the estimated autocorrelations between unemployment
and its successive lagged variables.
The blue dashed lines are the critical values of the t-statistic
representing the statistical significance of each autocorrelation
coefficient. So sample autocorrelations that go within the band
around 0, between the blue dashed lines, suggest that it is reasonable
to assume the corresponding population autocorrelation is 0.

Figure 3 The correlogram of a quarterly rate of unemployment time series
(the ACF plotted against lags up to 20)
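A correlogram like Figure 3 can be produced with R's acf() function; the
sketch below again assumes the unemployment rates are in a vector ur in
time order.

acf(ur, lag.max = 20)   # the dashed lines mark approximate significance bounds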

Activity 5 Correlogram for the rate of unemployment

What does the correlogram in Figure 3 suggest about how far back
unemployment and its lags are correlated?
Reflect upon your answers for Activities 3 to 5; what is your takeaway
message in terms of predicting the rate of unemployment based on its past
evolution?


While the autoregressive scatterplots and the correlogram are tools that
most time series econometricians will use to start the exploration of the
intertemporal properties of a time series variable, their suggestions about
how far back a time series variable is correlated with its lags can be
misleading. Activity 5 showed how slowly the correlation between
unemployment and its lags was decreasing, and it suggested that
unemployment is still significantly correlated with its value from five years
back. This pattern of slowly decreasing correlation can, however, be
explained by the presence of a time trend. While this is unlikely to be the
case for variables such as unemployment rates, which are finite and with a
limited range, you will see throughout this unit how difficult it can be to
disentangle persistence from a time trend – either a deterministic trend
(no random element) or what is referred to as drift or a stochastic trend
(a random element to the trend).
In Activity 4, you also came across another alarm signal when using these
tools. The R2 was close to 1, which – at least with time series data – is
often a sign of misspecification and high multicollinearity. In this unit, you
will see that one solution in such cases is to use the first differencing
operator (∆) discussed in Section 1. Let’s look at such an example in the
next subsection.

2.4 Persistence: an application to GDP


We will now take a look at the growth rate of GDP – an aggregate measure
that is frequently used to measure the success or failure of economic policy
and performance. As discussed in Section 1, the GDP of a country is a
flow variable that measures the aggregate value of domestically produced
goods and services over some time period. GDP can be expressed in
nominal or monetary terms (using prices prevailing in the period in which
it is measured) or in real terms (focusing on volume expansion of output
by removing the effects of price changes from period to period). The rate
of growth of GDP is generally calculated in real terms.
The calculation of GDP growth creates a new variable that emphasises the
relative change in the level of GDP, rather than the level of GDP itself.
This calculation of GDP growth therefore involves a transformation of the
original variable GDP – a ratio of the first difference of GDP to its prior
level:
gGDP,t = (GDPt − GDPt−1 )/GDPt−1 = ∆ GDPt /(L GDPt ),
where gGDP,t is GDP growth rate at time t. Note that this equation
expresses the growth rate of GDP as a proportion, not as a percentage.
(Margin photo: GDP or GDP growth rate? It can make a difference!)
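In R, this calculation could be sketched as follows, assuming gdp is a
numeric vector of quarterly real GDP in time order; multiplying by 100
gives the rate as a percentage, as reported in the dataset.

gdp_growth <- 100 * diff(gdp) / head(gdp, -1)   # quarterly growth rate (%)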

Let’s use the autoregressive scatterplots and the correlogram tools to
visually illustrate how this transformation has changed the persistence and
momentum of GDP variables. We will use the UK GDP dataset described
next.

Gross domestic product for the UK


The ONS routinely publishes GDP data for the UK economy on a
monthly, quarterly and annual basis.
The UK GDP dataset (ukGDP)
The data considered in this dataset are seasonally adjusted quarterly
GDP growth figures from 1955 Q2 up to 2020 Q4. Keep in mind that
this time series includes the year 2020 – the year the COVID
pandemic hit the world with full force.
The dataset contains data for the following variables:
• year: the year that the observation relates to
• quarter: the quarter (of the year) that the observation relates to,
taking the value 1, 2, 3 or 4
• gdp: the seasonally adjusted quarterly GDP in real terms (in
millions of pounds sterling (£))
• gdpGrowth: the seasonally adjusted quarterly GDP growth rate
(%).
The data for the first six observations in the UK GDP dataset are
shown in Table 3.
Table 3 The first six observations from ukGDP

year quarter gdp gdpGrowth


1955 2 4687 0.1
1955 3 4860 2.0
1955 4 4949 −0.5
1956 1 5083 1.2
1956 2 5152 −0.2
1956 3 5233 −0.1

Source: Office for National Statistics, 2021b and 2021c, release date
12 February 2021

Figure 4 shows the time plot for quarterly GDP in the UK between 1955
and 2020. As you will see, GDP exhibits an overall upward trend over the
period, interspersed with episodes of recession. Falls in GDP associated
with the recessionary period after the financial crisis of 2007–2008 and the
sharp fall in GDP at the onset of the COVID pandemic in early 2020 are
clearly visible in the plot.
Figure 5 shows the correlogram for quarterly GDP.

Figure 4 Real UK GDP: 1955 Q2 to 2020 Q4 (a time plot of GDP in levels
against year)
Figure 5 Correlogram of UK GDP: 1955 Q2 to 2020 Q4 (the ACF plotted
against lags up to 25)

Comparing Figure 5 with the correlogram for the rate of unemployment in
Figure 3 (Subsection 2.3), GDP appears to exhibit higher degrees of
persistence. However, unlike the series for the unemployment rate, GDP
can change dramatically between one quarter and the next. This suggests
that, rather than persistence, GDP follows a time trend. When working
with time series variables, it is critical to ascertain whether this trend is
deterministic or random in nature.
When a variable such as GDP displays a trend, first differences should be
analysed. We will do this by considering GDP growth rates. In the next
activity, you will examine a time plot of GDP growth rates.

116
2 Describing intertemporal properties of time series

Activity 6 Analysing the time plot of GDP growth rates

Figure 6 is a time plot of the UK GDP growth rate. How did the quarterly
rate of GDP growth vary over the period 1955 Q2 to 2020 Q4?

Figure 6 Seasonally adjusted quarterly GDP growth rate: 1955 Q2
to 2020 Q4 (a time plot of the GDP growth rate (%) against year)

The unemployment and UK GDP datasets cover similar time periods; this
enables us to consider how historical events have impacted on both time
series. In Activity 7, you will consider one such event – the onset of the
COVID pandemic.

Activity 7 Comparing GDP growth and unemployment in 2020
On 11 March 2020 the World Health Organization declared the emergence
of COVID-19 to be a pandemic (WHO, 2021) and on 23 March 2020 the
first lockdown in the UK due to the pandemic began (Institute for
Government, 2021).
Use Figure 1 (Subsection 2.1) and Figure 6 to compare the impact of the
beginning of the COVID pandemic on quarterly unemployment and GDP
growth rates. Do they show the same pattern?


The Solution to Activity 7 may have surprised you. During the Great
Depression of the 1930s, the fall in output (GDP) and the rise in unemployment
went hand in hand. Similarly, after the 2008 financial crisis, unemployment
rose quite steeply when GDP growth fell in the UK and lingered on at this
level for quite some time. In 2020 unemployment rose but not exceptionally
so. The difference in 2020 was the widespread and massive use of wage
subsidies paid out of public funds to prevent a fall in employment and in
wage incomes during extended periods of lockdown, a policy widely
implemented in most European countries with different levels of generosity
and inclusion. In the UK, this system became known as the furlough
system. For economists it is interesting to note that whenever a crisis hits,
Keynes’ ideas that a government’s fiscal policies can and should play an
active role in preventing output and employment of an economy imploding
downwards suddenly appear not to be so outdated or obsolete after all.
(Margin photo: During the COVID pandemic, many businesses were forced
to close their doors.)
Note that the exceptional outlying values of GDP growth rates in 2020,
what econometricians often refer to as a structural break in the time
series, render it more difficult to get a good view of the patterns of
variation in the period from 1955 to 2019. For this reason, after taking
note of the exceptional behaviour of GDP growth during 2020, it is useful
to restrict the data range to 1955 Q2 to 2019 Q4 to get a better view of
the long-term pattern of variation. The resulting time plot is shown in
Figure 7. This time plot makes it easier to get a more in-depth look at the
level and variation around a mean of the GDP growth rates before 2020.

Figure 7 Seasonally adjusted quarterly GDP growth rate (%): 1955 Q2
to 2019 Q4 (a time plot of the GDP growth rate against year)


2.5 Using R to explore time series


In this subsection, we will use R to explore a couple of univariate time
series. In Notebook activity A2.1, you will explore the data about
unemployment in unemployment. Then in Notebook activity A2.2, you will
explore data about GDP in ukGDP.

Notebook activity A2.1 Exploring time series in R


In this notebook, you will learn how to produce autoregressive
scatterplots and correlograms.

Notebook activity A2.2 More exploring time series in R


In this notebook, you will use autoregressive scatterplots and
correlograms to explore data in ukGDP.

3 Random walks
As shown in Section 2, one way to make sense of time series is to plot them
and relate what you see back to the historical contexts within which they
arose. For example, in Subsection 2.1 we looked at the evolution of the
rate of unemployment in the UK (Figure 1). There we noted distinctive
wave-like patterns over time in that time series, which we briefly sought to
explain in terms of past changes in policy emphases and policy regimes not
unlike what economic historians would do. Alternatively, as we also
showed in Subsections 2.2 and 2.3, we can seek to explain a time series in
terms of its own past by looking at its autoregressive behaviour using
autoregressive scatterplots and correlograms. All of this can be thought of
as exploratory data analysis for time series data.
Another approach to the analysis of time series, confirmatory analysis, was
developed under the impulse of the Haavelmo–Cowles research programme
in econometrics, which was initiated in the 1940s. This approach focused
on modelling the causal economic mechanisms that produced economic
outcomes with the explicit aim to test how well these models fitted the
actual observed patterns in the data. This was the type of analysis laid out
in Unit A1; this necessitated a consideration of the relation of each of the
regressors with the error term to detect possible correlation, and so to
ensure that the regressors of interest really are exogenous or to seek
estimators which accounted for endogenous regressors. As we will see later
in the unit, this is particularly challenging in the presence of
autocorrelations of the dependent variable and of the additional regressors
that a time series model may include. In this context, if a time series
model featured a dependent variable that showed cyclical behaviour, this
cyclical behaviour would be modelled by regressors and by time-related


dummy variables.
But what about the possibility that a time series displays wave-like
patterns which are not accounted for by observable regressors? What if
wave-like behaviour of a time series variable is due to random variation?
As the eminent mathematician George Pólya remarked, we should never
forget that chance is ‘an ever present rival conjecture’ when we seek to
explain observable phenomena (Pólya, 1968, p. 55). So, in this section, we
will look at why we should never ignore chance variation as a potential
explanation for fluctuations displayed in time series.
In 1927, the Russian probability theorist Eugen Slutzky had already
formulated this alternative perspective on modelling the behaviour of time
series – and of business cycles in particular – by asking the following very
intriguing question:
Is it possible that a definite structure of a connection between
random fluctuations could form them into a system of more or less
regular waves?
(Slutzky, 1937, p. 106)

Slutzky answered his own question in a very experimental way. He was


familiar with Dorothy Swaine Thomas’s quarterly index of English
business cycles from 1855 to 1877, which displayed a distinctive wave-like
pattern. He then constructed and plotted a ten-term moving summation of
the last digits of the numbers drawn in a Soviet lottery to see how this
summation of random effects behaved over time.
As you can see in Figure 8, the striking result was that not only did both
series show very distinctive cyclical patterns, but they also tallied each
other’s movements remarkably well.

Figure 8 The summation of random causes as the source of cyclical


processes, taken from Slutzky (1937). Note that the original version of this
figure is in an earlier Russian publication by Slutzky (1927, cited in Slutzky,
1937), which is reproduced in Klein (1997).

Slutzky’s experiment, therefore, showed that cycles could be produced


purely by the cumulation of random effects. Note, however, that Slutzky


never asserted that economic cycles are merely due to the cumulation of
random effects, but rather warned us that this possibility is a plausible
rival conjecture that should not be left out of the picture. Ignoring this
possibility would mean that we might seek to infer causal structural
mechanisms where in fact none exists.
Slutzky dramatically changed the perspective on the analysis of time series
because his analysis implied that random disturbances can be an
important part of the data-generating process. The concept of a random
walk, which has become integral to modern econometric theory and
practice, encapsulates Slutzky’s idea that the behaviour of time series may
be driven by the cumulation of random effects.
In Subsection 3.1, we will introduce the simplest of random walks: the
simple random walk. In Subsection 3.2, we consider the random walk with
drift. We will use R to simulate both of these in Subsection 3.3 and hence
explore what impact changing parameters has. Finally, in Subsection 3.4,
we will compare a random walk with drift with a model that incorporates
a deterministic trend instead.

3.1 Simple random walk


The simplest model of a random walk, Y1 , Y2 , Y3 , . . ., can be written as
Yt = Yt−1 + εt , (1)
where we assume that the error terms εt are distributed normally with zero
mean, variance σ 2 and Cov(εt , εt−k ) = 0, for all k ̸= 0. These assumptions,
therefore, also imply that the error terms are independent and identically
distributed (i.i.d.) – which was introduced in Subsection 4.1 of Unit A1.
Substituting backwards in time yields
Yt = (Yt−2 + εt−1 ) + εt
Yt = (Yt−3 + εt−2 ) + εt−1 + εt
..
.
Yt = Y0 + ε1 + ε2 + ε3 + · · · + εt−3 + εt−2 + εt−1 + εt .
Or, more compactly,
Yt = Y0 + ∑_{i=1}^{t} εi .

Note that Y0 represents the value of the time series at time t = 0 and so in
principle it could be a known quantity, but, as you’ll see in Subsection 3.4,
it can also be a quantity that needs to be estimated.
The definition of the simple random walk means that the trajectory
through time depends exclusively on the behaviour of random
disturbances. Hence, there is no structural component – no in-built
momentum – that determines its movement through time. You will
consider further what simple random walks look like in the next activity.


Activity 8 Characteristics of the simple random walk

Figure 9 shows 30 simulated random walks. Comment on the patterns you


observe in these walks.
Figure 9 30 simulated simple random walks

Formally, it can be shown that, for the random walk model given in
Equation (1), the variance of Yt increases linearly with t (time):
V (Yt ) = σ 2 t.
A summary of the simple random walk is given in Box 9.

Box 9 The simple random walk


A simple random walk is given by
Yt = Yt−1 + εt = Y0 + ∑_{i=1}^{t} εi ,

where, for t = 1, 2, . . . , the εt are i.i.d. with distribution N (0, σ 2 ).


For this model, ∆ Yt = εt and V (Yt ) = σ 2 t.


An example of the use of the simple random walk model is given in


Example 3.

Example 3 Modelling unemployment using a simple


random walk
For the unemployment dataset introduced in Subsection 2.1, the
model
unemploymentRatet = unemploymentRatet−1 + εt
corresponds to a simple random walk.
This is similar to a simple linear regression model with
unemploymentRate as the dependent variable and the lagged variable
L (unemploymentRate) as the explanatory variable. However, there is
no intercept and the slope is set equal to 1 instead of being estimated.
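The trajectories in Figure 9 were produced by simulation. The following minimal
sketch shows one way a single simple random walk could be simulated in R; the
sample size and error standard deviation are illustrative choices, and Notebook
activity A2.3 will show the module's own approach.

set.seed(123)
n_obs <- 50
eps <- rnorm(n_obs, mean = 0, sd = 2)   # i.i.d. N(0, sigma^2) error terms
Y <- cumsum(eps)                        # Y_t = Y_0 + sum of errors, with Y_0 = 0
plot(Y, type = "l", xlab = "Time", ylab = "Value")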

3.2 A random walk with drift


In Subsection 3.1, you saw that the trajectory through time of a simple
random walk depends exclusively on the behaviour of random
disturbances. Hence, there is no structural component – no in-built
momentum – that determines its movement through time. Random walks
can be augmented with what is called drift, often denoted as d: a constant
added in each period which, together with the cumulated random
disturbances, gives the random walk a stochastic trend around which it moves. As
you will see in Subsection 3.4, this is different from a deterministic trend t.
A definition of the random walk with drift is given in Box 10.

Box 10 Random walk with drift


A random walk with drift is given by
Yt = Yt−1 + d + εt = Y0 + dt + ∑_{i=1}^{t} εi ,

where the error terms εt are i.i.d. with distribution N (0, σ 2 ).


For this model, ∆ Yt = Yt − Yt−1 = d + εt .

You will consider an application of the random walk with drift in


Activity 9.


Activity 9 Modelling a random walk with drift along the


Camino del Norte in Spain
In this activity, you will model the cumulative distances of 30 hikers
walking independently from one another along the spectacular coastal
(pilgrimage) route along the Camino del Norte, from Hondarribia to
Santiago de Compostela in Spain. (The total trajectory is 818 km
(507 miles).)
Part of the Camino del Norte
To model these walks, you need to make the following assumptions.
• Each walker walks alone and proceeds independently from the other
walkers.
• Each walker has an average speed of 5 km/hour and follows a strict
routine of walking exactly 4½ hours each day.
• The actual distances walked each day will vary according to weather
conditions, the moods and fitness of the walkers, slowdowns to admire
the scenery, and, of course, chance. This accounts for random variations
in distances covered each day. Assume that for each walker these random
disturbances are distributed normally with zero mean and standard
deviation of 10 km, including the heroic assumption that all covariances
between the random disturbances of walks on different days are 0.
• Each subsequent day the walkers continue their journeys from the point
reached the day before.
• Walkers should continue for 37 days (even if they go beyond Santiago de
Compostela) to adhere to the strict routine of walking 4½ hours each
day throughout the period as a whole.
(a) Write down the random walk with drift model to be simulated based
on the assumptions above.
(b) Figure 10 shows a graph with the 30 simulated random walks with
drift representing 30 walkers along the Camino del Norte. How does
this random walk with drift differ from the patterns of a simple
random walk (such as the one in Figure 9, Subsection 3.1)?


Figure 10 30 random walks with drift along the Camino del Norte in Spain.
The dashed line indicates the total length of the Camino del Norte.

As you have seen in Activity 9, the random walk with drift described in
that activity differs markedly from that of a simple random walk. The
reason is that a random walk with drift includes a systematic component.
These walkers set themselves the explicit goal to reach Santiago de
Compostela by walking steadily towards it from day to day.
It turns out that the higher the value of the drift (d) relative to the
standard deviation of the random disturbances (σ), the closer together the
different trajectories will be.
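As an illustration, the following minimal sketch shows how the 30 Camino walks
in Figure 10 could be simulated, using the assumptions of Activity 9 (drift
d = 5 × 4.5 = 22.5 km per day and σ = 10 km). The random seed and plotting
details are illustrative choices rather than the module's own simulation code.

set.seed(348)
days <- 37
walkers <- 30
d <- 5 * 4.5      # average distance walked per day (km)
sigma <- 10       # standard deviation of the daily random disturbance (km)

# Each column holds one walker's cumulative distance over the 37 days
distances <- sapply(1:walkers, function(w) cumsum(d + rnorm(days, 0, sigma)))

matplot(distances, type = "l", lty = 1,
        xlab = "Day", ylab = "Total distance travelled (km)")
abline(h = 818, lty = 2)   # total length of the Camino del Norte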
A random walk with drift is a special case of the type of random process
described next in Box 11.

Box 11 The autoregressive process of order 1, AR(1)


An autoregressive process of order 1 (also written as AR(1)),
with values Y1 , Y2 , Y3 , . . . , is
Yt = d + β Yt−1 + εt ,
where the error terms εt are i.i.d. with distribution N (0, σ 2 ).
When β = 1, this is a random walk with drift.


We can model a time series variable as a random walk with drift, as


Example 4 demonstrates.

Example 4 Modelling unemployment using a random


walk with drift
For the unemployment dataset introduced in Subsection 2.1, the
model
unemploymentRatet = Y0 + unemploymentRatet−1 + εt
corresponds to a random walk with drift. This model is similar to a
simple linear regression model with unemploymentRate as the
dependent variable and the lagged variable L (unemploymentRate) as
the explanatory variable. However, the slope is set equal to 1 instead
of being estimated. For these data, the estimated value for Y0 (the
intercept) turns out to be 0.0065 with a standard error of 0.0189 and
the residual standard error is estimated to be 0.267.

3.3 Using R to simulate random walks


In this subsection, you will use simulation to explore some random walks.
In Notebook activity A2.3, you will explore the simple random walk and
the random walk with drift. Then, in Notebook activity A2.4, you will
explore a particularly simple type of time series, white noise, described
next in Box 12.

Box 12 White noise


A white noise process has i.i.d. observations and follows a normal
distribution with zero mean and constant variance:
Yt ∼ N (0, σ 2 ).

Notebook activity A2.3 Simulating random walks in R


This notebook explains how to use R to simulate a simple random
walk and a random walk with drift.

Notebook activity A2.4 Exploring the properties of a


white noise process
In this notebook, you will simulate a time series that is generated by a
white noise process and use correlograms to begin to compare the
intertemporal properties of white noise with those of random walks.
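As a taster of Notebook activity A2.4, the following minimal sketch simulates a
white noise process and a simple random walk built from the same errors, and
compares their correlograms; all values are illustrative and this is not the
notebook's own code.

set.seed(42)
eps <- rnorm(200)          # white noise: i.i.d. N(0, 1)
rw  <- cumsum(eps)         # a simple random walk built from the same errors

par(mfrow = c(1, 2))       # show the two correlograms side by side
acf(eps, main = "White noise")
acf(rw, main = "Simple random walk")
par(mfrow = c(1, 1))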


3.4 Random walk with drift versus a model


with a deterministic trend
To conclude this section, it is useful to reflect for a moment on how the
model of a random walk with drift differs from the model with a
deterministic trend.
A definition of a model with a deterministic linear trend line is given in
Box 13.

Box 13 Deterministic linear trend model


A model for Yt that has a deterministic linear trend is given by
Yt = Y0 + dt + εt ,
where the error terms εt are i.i.d. with distribution N (0, σ 2 ).

Regarding Y0 and d as intercept and slope parameters respectively, this is


a simple linear regression model with dependent variable Y and
explanatory variable t, where t = 1, . . . , T . We can also model cyclical and
seasonal effects in a deterministic way. For instance, if the data are
quarterly, we can add quarter-specific dummy variables. (Dummy variables
were defined in Subsection 5.2.1 of Unit A1.)
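For instance, a deterministic trend model with quarter-specific dummy variables
could be fitted in R as in the following minimal sketch; the simulated quarterly
series and its parameter values are illustrative, not the module's own data or code.

set.seed(7)
n_obs <- 120
t <- 1:n_obs
quarter <- factor(rep(1:4, length.out = n_obs))   # quarter of each observation

# Placeholder quarterly series: linear trend plus quarter effects plus noise
y <- 2 + 0.05 * t + c(0, 0.5, 0, -0.5)[as.integer(quarter)] + rnorm(n_obs)

fit <- lm(y ~ t + quarter)   # intercept, deterministic trend and quarterly dummies
summary(fit)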

Example 5 Modelling unemployment using a model with


a deterministic trend
Applying the deterministic trend model to the unemployment dataset,
where time is given as t = 1, 2, 3, 4, 5, . . . , gives the results in Table 4.
Table 4 Coefficients for the deterministic trend model fitted to
unemployment

Parameter    Estimate   Standard error   t-value   p-value
Intercept     7.671      0.331            23.21    < 0.001
t            −0.008      0.003            −2.86    0.005

For this model, the residual standard error is estimated to be 2.318.

The model with a deterministic trend shown above is not a random walk,
since period by period Y is expected to increase on average by d. The
assumption here is that the history of past disturbances has no effect on
the future trajectory of the time series variable over time. Only the
disturbance of the current period accounts for the random variation
around the trend line.


As given in Box 10 (Subsection 3.2), the model of a random walk with


drift is
Yt = Yt−1 + d + εt = Y0 + dt + ∑_{i=1}^{t} εi ,

where the εi are i.i.d. with distribution N (0, σ 2 ). So, in the model of the
random walk with drift, the history of past disturbances carries over into
the present. Past disturbances, therefore, are part of the data-generating
process for the random walk with drift.
It is also possible to combine the random walk with drift with a
deterministic trend, as demonstrated in Example 6.

Example 6 Modelling unemployment using a model with


drift and a deterministic trend
A random walk model that includes drift and a deterministic trend,
applied to the unemployment dataset, is
unemploymentRatet = Y0 + dt + unemploymentRatet−1 + εt .
Writing the variable t as t = 1, 2, 3, . . . , gives the results in Table 5.
Table 5 Coefficients for the random walk model with drift and
a deterministic trend fitted to unemployment

Parameter              Estimate   Standard error   t-value   p-value
Intercept               0.0804     0.0378            2.12    0.035
t                      −0.0007     0.0003           −2.25    0.06
unemploymentRatet−1     1          –                 –       –

For this model, the residual standard error is 0.264.

Keep in mind that it is important to think carefully about what


assumptions you make about error variation when modelling a real-world
situation. The simple linear trend model may look simple, but it is
actually based on a very strong assumption that past random disturbances
do not matter for the future direction of movement of a variable. This is
why earlier researchers have struggled to explain cyclical patterns without
accounting for a stochastic trend or drift.
In the early days of econometric practice, for example, it was fairly
common to add a deterministic trend variable t to a regression model of a
dependent variable Yt on one or more Xt variables when working with time
series. But merely inserting a time trend in a regression model does


nothing to eliminate the potential cumulative effects of past disturbances


on the trajectory of a time series variable.
While the random walk with drift and the model with a deterministic
trend make strikingly different assumptions about the persistence of
history and trend formation of the time series variable, they may be quite
difficult to distinguish when looking at time plots, or correlograms. Formal
statistical modelling and estimation is often required to select the best
specification for a time series. The following activity will explore modelling
a time series variable with a stochastic trend, a deterministic trend and
with both trends.

Activity 10 Does a trend line eliminate the effect of past


disturbances?
Example 4 (Subsection 3.2), Example 5 and Example 6 described three
different models for the unemployment dataset.
• Model 1: A model with drift. (Example 4)
• Model 2: A model with a deterministic trend. (Example 5)
• Model 3: A model with a deterministic trend and drift. (Example 6)
(a) Write down the estimated regression equation and residual standard
error for each of the models.
(b) Interpret the results from the three different regressions. In
particular, is it a good idea to model unemployment rate by using an
additional time trend variable?

While Activity 10 offers a clear rejection of the simple model with a


deterministic trend, it does not provide any conclusive answer regarding
the best model. These results would not allow us to distinguish between a
random walk with a stochastic trend and a random walk with stochastic
and deterministic trends, which suggests that the random walks
representing the unemployment rate in this period may need to be
specified differently for more conclusive results. The next section will
discuss essential properties of time series variables that are required for
estimation results to be more reliable: stationarity and weak dependence.


4 Stationarity and lagged


dependence of a time series
In this section, we will introduce the concepts of stationarity and weak
dependence of a time series, explain why they matter, and show how to
systematically test for stationarity and for the degree of lag dependence of
a time series. To do this, however, you first need to understand the
concept of a stochastic process. This is the subject of Subsection 4.1. Then
in Subsection 4.2 we introduce two important properties that a stochastic
process might have: stationarity and weak dependence. As you will discover
shortly, not all stochastic processes are stationary. In Subsection 4.3 you
will see how differencing, which was introduced in Box 2 (Section 1), can
be used to transform a non-stationary time series into a stationary one.

4.1 Stochastic processes


A stochastic process can be understood as a process that generates a
particular sample of data, or as the family of possible samples which could
have been realised when looking at one sample alone. This concept allows
us to come to grips with what randomness means when dealing with time
series. In time series analysis our ‘sample’ is the observed data points
ordered in time, which can thus be interpreted as the actual realisation of
a stochastic process. The corresponding concept to population, therefore,
is all possible realisations of the stochastic process that generates the
observed data points in time.
Examples of stochastic processes include the random walk in any of the
three basic specifications discussed in the previous section (simple random
walk, random walk with drift and random walk with drift and
a deterministic trend – see Activity 10), as well as the simple linear model
with a deterministic trend and the AR(1) model. It is impossible to know in
advance the actual value that a time series variable modelled in any of these
ways will take in the future.

4.2 Stationarity and weak dependence


Stationarity is a key property of any time series variable whose behaviour
we wish to model and estimate. Most estimation and hypothesis testing
techniques can only deal with stationary variables, or with their stationary
transformation. Finding out whether a variable is stationary or needs to
be transformed to become stationary is an essential part of statistical
modelling of time series data. In Box 14 we give a definition of stationarity
before giving an example of a process that is stationary.


Box 14 Stationarity in the strong sense


A stochastic process is called stationary in the strong sense if the
joint probability distribution of (Y1 , Y2 , Y3 , . . . , YT ) is the same as the
joint probability distribution of (Y1+p , Y2+p , Y3+p , . . . , YT +p ) for all
integers p. Put differently, shifting the start of the series in time does
not change its joint probability distribution.

Example 7 Stationarity of white noise


Recall from Box 12 (Subsection 3.3) that the white noise process has
i.i.d. observations and follows a normal distribution with zero mean
and constant variance. That is, Y1 , Y2 , . . . , is a white noise process
when
Yt ∼ N (0, σ 2 ).
Notice that in this case the distribution of Yt does not depend on t in
any way. This means that for any p the joint distribution of
(Y1 , Y2 , Y3 , . . . , YT ) must be the same as the joint distribution of
(Y1+p , Y2+p , Y3+p , . . . , YT +p ). So the white noise process is stationary
in the strong sense.

However, in most applications of time series econometrics, analysts utilise


the concept of weak stationarity, and we will do the same in this unit.
Box 15, next, gives the properties that a weakly stationary time series
must have.

Box 15 Weak stationarity


A weakly stationary time series is defined by the following three
properties (for the mean, variance and covariance, respectively):
• E(Yt ) = µ, for all t
• V (Yt ) = E(Yt − µ)2 = σ 2 , for all t
• γp = E(Yt − µ)(Yt+p − µ) is a function of p but not of t.
Therefore, a time series is said to be weakly stationary if its mean,
variance and successive covariances (for different intervals p in time)
all remain constant over time. This explains why weak stationarity is
often also referred to as mean-covariance stationarity.

Note that the definition given in Box 15 is not as restrictive as that given
in Box 14. So any time series that is stationary in the strong sense must
also be weakly stationary.


A time series is said to be non-stationary if it fails to satisfy one or more


of the three properties given in Box 15. For example, if a time series is
consistently trending over time – moving upwards or downwards – it will
be non-stationary.
In a growing economy, for example, most macroeconomic time series will
be non-stationary. In particular, this means that the random walk with
drift and the model with a deterministic trend must both be
non-stationary. Moreover, a time series that displays a clear pattern of
heteroskedasticity over time will be non-stationary, which indicates that
even the simple random walk is non-stationary.
In the next activity, you will practise deciding whether a time series
appears to be (weakly) stationary or non-stationary based on its time plot.

Activity 11 Stationary or non-stationary?


In Subsection 2.4, GDP was plotted in Figure 4 and GDP growth was
plotted in Figure 7. Based on these plots, answer the following questions.
(a) Does quarterly GDP appear to be (weakly) stationary or
non-stationary?
(b) Does quarterly GDP growth rate appear to be (weakly) stationary or
non-stationary?

While visual inspections of data go a long way towards suggesting the


intertemporal properties of a variable, and their stationarity, there are
confounding aspects of modelling a time series that we need to keep track
of. For instance, if we added a deterministic trend or drift to the GDP
variable, or if we first differenced this series, would the transformed time
series display constant mean? Is GDP non-stationary, or does it have a
deterministic trend only? Are there structural breaks in the time series
that warrant a separate analysis of each period of GDP separately?
In Section 5, we will provide more systematic ways of analysing and testing
for stationarity. These ways will not, however, deal with the issue of
structural breaks – something that the econometrician needs to always
bear in mind. (Recall that we left out the quarterly data on GDP and
GDP growth from the year 2020 because these observations seemed to
suggest a structural break.) Statistical tests for the presence of structural
breaks exist but are beyond the scope of this unit.
Covariance stationarity, the third property in Box 15, requires more
explanation. What it tells us is that, for example, the covariance of Y2
and Y4 will be the same as that of Y16 and Y18 if the series is stationary,
because the time interval between them is the same. Conversely, the covariances
of two pairs of values in a stationary time series can only differ if the time
intervals, p, separating the values within each pair are different.
Notice that this property of covariance stationarity does not say how the
correlation between two observations depends on their separation in time.


This latter property relates to the lag dependence of a time series.


Modelling and estimating this lag dependence can be challenging unless
time series variables are weakly dependent. The definition of weak
dependence is given in Box 16. Notice the properties of the covariance
required for weak stationarity are different from the property required for
weak dependence.

Box 16 Weak dependence


A stationary time series is defined as weakly dependent if the
autocorrelation ρp → 0 as p → ∞. That is, if the correlation between
Yt and Lp Yt tends to 0 as the lag, p, increases.

Persistent time series will display high correlation between observations


that may be quite far apart. So weak dependence means that variables are
not very persistent over time. Furthermore, finding the value of p after
which the covariance is sufficiently small that it is reasonable to assume
that observations do not correlate with each other is a key step in
modelling a time series. This is because, as you will see in Example 8, it
allows a suitable model to be written down.

Example 8 Modelling a weakly dependent time series


Suppose a time series variable Y is stationary, where
Cov(Yt , Yt+p ) ̸= 0, for p = 1, 2, 3,
Cov(Yt , Yt+p ) = 0, for p > 3.
An econometric model explaining the lagged dependence structure
of Y can be written as
Yt = Y0 + α1 Yt−1 + α2 Yt−2 + α3 Yt−3 + ut .
That is, the model regressors include all lags of Y for which there is a
non-zero covariance.

You may be wondering whether a simpler model would also be suitable for
the time series described in Example 8. You will explore this in Activity 12.

Activity 12 Modelling lagged dependence of a time series

Consider again the time series described in Example 8.


What happens if Yt−3 is left out of the model?


In Activity 12, you considered what would happen if Yt−3 was left out of
the model given in Example 8. The same argument holds for missing out
Yt−1 or Yt−2 , or more than one of Yt−1 , Yt−2 and Yt−3 .
As Example 8 and Activity 12 show, lag dependence is therefore dealt with
by including all lags that our time series variable correlates with. The
challenge in any application is to find out how far back the dependence
goes. You can see that weak dependence ensures there comes a point where the
correlation is small enough not to worry about, but it does not say for what
value of p this will occur. One strategy for determining the value of p is given in
Box 17.

Box 17 Dealing with lagged dependence


Lag dependence is dealt with by including all lags that a time series
variable correlates with.
A common strategy is to start from a high lag order and test for the
statistical significance of the highest. If not significant, delete it and
test for the significance of the second highest in the reduced model.
Repeat until the t-test of the statistical significance of the highest
remaining lag rejects the null hypothesis.
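A minimal sketch of this strategy on a simulated series is given below. The
starting order of 6, the 5% significance level and the use of embed() to build
the lagged regressors are illustrative choices rather than the module's own
approach.

set.seed(11)
y <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = 300))

max_lag <- 6   # start from a high lag order
repeat {
  X <- embed(y, max_lag + 1)        # column 1 is y_t, columns 2, 3, ... are its lags
  fit <- lm(X[, 1] ~ X[, -1])
  p_highest <- summary(fit)$coefficients[max_lag + 1, 4]   # p-value of the highest lag
  if (p_highest < 0.05 || max_lag == 1) break   # keep this lag (or stop once only lag 1 remains)
  max_lag <- max_lag - 1                        # otherwise drop the highest lag and refit
}
max_lag   # retained lag order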

Once we know how far back observations correlate, dealing with lag
dependence is a straightforward exercise. However, stationarity is more
challenging. Most time series we come across in economics are
non-stationary. This should not surprise you. It would indeed require quite
a stretch of the imagination and a complete denial of the importance of
historical context and conjuncture if we were to view the evolution of most
macroeconomic variables of the UK economy during the 20 years covering
the 1950s and 1960s, and their evolution during the 20 years covering the 2000s
and 2010s, as two distinct realisations of the same underlying stochastic
process. Most, if not all, economies are very different today compared with
50 or 60 years ago – and national economies are bigger today. These
differences are likely to have affected the ways in which each evolved
dynamically through time.
For example, during this time much of the UK economy has shifted from
manufacturing to services. Canary Wharf in London reflects this change.
As you can see in Figure 11, in the 1950s it was an area of docklands,
whereas in the 2020s it had transformed into a centre for banking.


Figure 11 Canary Wharf in London in (a) the 1950s and (b) the 2020s

However, in many (but not all) cases, it is possible to transform a


non-stationary time series into a stationary time series: for example, as
you will see in Subsection 4.3, by removing a linear deterministic trend or
by using first- or higher-order differencing. This is not the only
transformation that can be, and is, used. The variance of a time series
growing at an approximately constant rate can be stabilised by using a
logarithmic transformation and then removing a log-linear trend or using
first differencing. Keep in mind, however, that by using such
transformations we do so at the cost of removing key historical insights
from the analysis, and sometimes by increasing (but luckily other times by
decreasing) the order of the lag dependence.

4.3 Transforming to stationarity


Let us revisit the simple random walk summarised in Box 9
(Subsection 3.1):
Yt = Yt−1 + εt = Y0 + ∑_{i=1}^{t} εi .

This time series is non-stationary because, as mentioned in Box 9, it will


display a heteroskedastic pattern over time: V (Yt ) = σ 2 t.
However, if we were to transform this random walk series by subtracting
Yt−1 , we would end up with a series of first differences that are equal to
the error terms εt :
∆ Yt = Yt − Yt−1
= (Yt−1 + εt ) − Yt−1 = εt ,
where εt are i.i.d. with zero mean.


This is the white noise process, which as you saw in Example 7


(Subsection 4.2) is stationary in the strong sense (and hence is weakly
stationary too). Hence for the simple random walk, ∆Yt satisfies all three
assumptions of weak stationarity.
Since the simple random walk can be transformed into a stationary series
by taking first differences, it is referred to as a difference stationary
series or as a variable with a unit root.
Variables which become stationary by first differencing are also called I(1),
where I(1) stands for a series whose order of integration is 1. This
means that an I(1) series needs to be differenced only once to obtain a
stationary series. What is meant by order of integration is summarised in
Box 18.

Box 18 Order of integration


Non-stationary variables that become stationary by first differencing
are said to have a unit root and are called integrated of order 1
(denoted as I(1)).
Similarly, non-stationary variables that become stationary after
differencing p times are said to be I(p).
A stationary series is I(0); that is, it has an order of integration of 0,
as it does not require differencing to become stationary.

A random walk with drift, as well as a random walk with drift and
a deterministic trend, can also be transformed into a stationary series by
taking first differences. So, like the simple random walk, these processes
are also I(1).
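A minimal sketch illustrating this in R: first differencing a simulated random
walk with drift removes the stochastic trend, leaving a series that looks
stationary. The drift and sample size below are illustrative.

set.seed(21)
y <- cumsum(0.5 + rnorm(300))    # a random walk with drift d = 0.5, so y is I(1)

plot(y, type = "l")              # trending and non-stationary
plot(diff(y), type = "l")        # first difference: drift plus white noise
acf(diff(y))                     # autocorrelations die out almost immediately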

5 Testing for stationarity


Stationarity is a necessary condition to be able to model and estimate a
time series. Thus it is important to know when a time series can be
assumed to be stationary.
In Activity 11 (Subsection 4.2), you have informally used time plots to
judge whether a time series is (weakly) stationary or not. There are also
formal tests for stationarity that look for the presence of a unit root in the
series. They therefore test for a unit root (which signifies non-stationarity)
against the alternative of stationarity.
Let’s start with a simple form of one such test, based on the AR(1) process
that was described in Box 11 (Subsection 3.2).
Consider now AR(1) models where a time series variable Yt is modelled as
dependent on its first lag as
Yt = d + ρ Yt−1 + εt ,
where −1 ≤ ρ ≤ 1 and the εt are i.i.d. with distribution N (0, σ 2 ).


Figure 12 shows the plots for four different simulated AR(1) series
featuring different values for ρ.

Figure 12 Simulated AR(1) series with different values for ρ (ρ = 1, 0.9, 0.5 and 0.2)

On casual inspection, the plots of series generated by the autoregressive


process with ρ = 0.5 and ρ = 0.2 look similar to that of a white noise
process. The reason is that persistence in the series diminishes as ρ
decreases.
The plot with ρ = 1 is a random walk with drift which, as discussed in
Subsection 4.2, is non-stationary. More general results about the
stationarity of the AR(1) process are given in Box 19.

Box 19 Stationarity of the AR(1) process


Suppose that Yt is from an AR(1) model of the form
Yt = d + ρ Yt−1 + εt ,
where −1 ≤ ρ ≤ 1 and the εt are i.i.d. with distribution N (0, σ 2 ).
• When ρ = 1, the time series Yt is non-stationary but ∆ Yt is
stationary. So, Yt contains a unit root and is I(1) (that is,
integrated of order 1).
• When |ρ| < 1, the time series is stationary. So, Yt does not contain
a unit root and it is I(0) (that is, integrated of order 0).


It can be shown that random walks with or without drift, or with or


without a deterministic trend, all satisfy the condition that ρ = 1; hence:
• are non-stationary but become stationary when first differenced so are
integrated of order 1, denoted I(1)
• contain a unit root.
Unit root tests are based on discovering whether or not it is reasonable to
assume that a model used for a time series has a unit root (null
hypothesis) against stationarity (alternative hypothesis). In
Subsection 5.1, we will introduce one such unit root test, the Dickey–Fuller
test. This test only models lag dependence of order 1. In Subsection 5.2,
we extend this to consider dependencies at more than just the first lag.
You have seen in Subsection 4.3 that when a time series is non-stationary
it might be possible to transform it to be stationary. In Activity 11
(Subsection 4.2), you have already seen that quarterly GDP and quarterly
GDP growth rate are non-stationary. So in Subsection 5.3 we will consider
transformations of GDP and how to formally test whether this
transformation has been successful in producing a stationary series. In
Subsection 5.4, you will use R to actually do this testing.

5.1 The Dickey–Fuller (DF) test


The Dickey–Fuller (DF) test is used to decide if a time series has a unit
root. It has three variants, depending on what type of non-stationary
random walk the series is most likely to be:
• the simple random walk
• the random walk with drift
• the random walk with drift and a deterministic trend.
Consider first the variant based on the random walk with drift.
For this to be a test of stationarity, we need to know what the null
hypothesis is. This is the subject of the next activity.

Activity 13 Determining the null hypothesis of the DF test

Unit root tests are based on discovering whether or not it is reasonable to


assume that a model used for a time series has a unit root (null
hypothesis) against stationarity (alternative hypothesis). For the model
Yt = d + ρ Yt−1 + εt :
(a) Write down the null hypothesis in terms of δ = ρ − 1.
(b) What is the model for ∆ Yt when the null hypothesis is true?

The null and alternative hypotheses of the three variants of the DF test
are given in Box 20.


Box 20 Variants of the DF test


For a time series Yt , three variants of the DF test are as follows.
Simple random walk
Model specification:
∆ Yt = δ Yt−1 + ut .
Hypotheses:
H0 : δ = 0 or equivalently ρ = 1
(that is, series is non-stationary),
H1 : δ < 0 and |ρ| < 1
(that is, series is stationary).

Random walk with drift


Model specification:
∆ Yt = d + δ Yt−1 + ut .
Hypotheses:
H0 : δ = 0 or equivalently ρ = 1
(that is, series is non-stationary),
H1 : δ < 0 and |ρ| < 1
(that is, series is stationary).

Random walk with drift and a deterministic trend


Model specification:
∆ Yt = d + βt + δ Yt−1 + ut .
Hypotheses:
H0 : δ = 0 or equivalently ρ = 1
(that is, series is non-stationary),
H1 : δ < 0 and |ρ| < 1
(that is, series is stationary around a deterministic trend).

The DF test is effectively a t-test of statistical significance on the


coefficient of Yt−1 in a regression model with ∆Yt as the dependent
variable. After choosing the right specification, the second step of the test
is to estimate the first-differenced model using OLS and to carry out this
t-test, which is the usual ratio of the estimator for δ to its standard error.
However, the test statistic does not follow the usual t-distribution under
the null hypothesis H0 : δ = 0. Dickey and Fuller (1979) showed that
hypothesis testing was still possible with the DF distribution, which


has critical values that are larger in magnitude (more negative) than those of
the standard Student’s t or the standard normal distribution. We will discuss
this distribution in the following subsection.
The DF test, although intuitive, is limited in the way it models lag
dependence. Errors are assumed to have no autocorrelation, while it is
common for autocorrelation to exist beyond lag 1. For this reason, the
most common and reliable unit root test is the Augmented Dickey–Fuller
(ADF) test, which we will discuss next.

5.2 The Augmented Dickey–Fuller (ADF)


test
The Augmented Dickey–Fuller (ADF) test is similar to the DF test but it
allows for the presence of lag dependence larger than of order 1. For the
ADF test, each of the three DF specifications given in Box 20 is
augmented by adding the lags of the dependent variable ∆Yt that are
necessary to model the time series lag dependence, so that the error terms
ut can be assumed to be white noise.
For the case of a random walk with drift, the model specification for the
ADF test would be
∆ Yt = d + δ Yt−1 + ∑_{i=1}^{m} γi ∆ Yt−i + ut ,

where m is the number of lags.


Similarly if it is assumed that Yt is stationary around a deterministic linear
time trend, the model specification would be
∆ Yt = d + βt + δ Yt−1 + ∑_{i=1}^{m} γi ∆ Yt−i + ut .

The null and alternative hypotheses will be the same as those in Box 20.
The appropriate number of lags (m) can be selected on the basis of
information criteria such as the Akaike information criterion (AIC)
discussed in Unit 2 or using the iterative deletion procedure of high lags
described in Box 17 (Subsection 4.2).
Similarly to DF tests, the test statistic for ADF tests does not follow the
usual t-distribution. Figure 13 plots the probability distribution for the
test statistic estimated from the simple random walk, the random walk
with drift, and the random walk with drift and a deterministic trend. The
distribution of the ADF test statistic, given the null hypothesis, is
obtained via simulation.
Under the null hypothesis that a series contains a unit root, the ADF test
statistic generally takes negative values. Tables 6, 7 and 8 give some of the
critical values for the DF distribution, at 1%, 5% and 10% significance
levels.


Figure 13 The probability distribution of ADF test statistics under the null
hypothesis that the series contains a unit root compared with the usual
t-distribution
Table 6 Critical values of the ADF test for a simple random walk

Degrees of freedom, n 1% 5% 10%


25 −2.661 −1.955 −1.609
50 −2.612 −1.947 −1.612
100 −2.588 −1.944 −1.614
250 −2.575 −1.942 −1.616
500 −2.570 −1.942 −1.616
> 500 −2.567 −1.941 −1.616

Table 7 Critical values of the ADF test for a random walk with drift

Degrees of freedom, n 1% 5% 10%


25 −3.724 −2.986 −2.633
50 −3.568 −2.921 −2.599
100 −3.498 −2.891 −2.582
250 −3.457 −2.873 −2.573
500 −3.443 −2.867 −2.570
> 500 −3.434 −2.863 −2.568


Table 8 Critical values of the ADF test for a random walk with drift and
a deterministic trend

Degrees of freedom, n 1% 5% 10%


25 −4.375 −3.589 −3.238
50 −4.152 −3.495 −3.181
100 −4.052 −3.452 −3.153
250 −3.995 −3.427 −3.137
500 −3.977 −3.419 −3.132
> 500 −3.963 −3.413 −3.128

ADF tests have a low power, that is, a high probability of committing a
type II error. (Recall from previous study that in hypothesis testing a
type II error corresponds to failing to reject the null hypothesis that the
series contains a unit root when it actually does not.) In other words, the
ADF test will conclude a series is I(1) too often. It will also reject the null
hypothesis of a unit root when the ADF model specification does not
capture the stochastic process of the time series variable. Because of the
low power, a unit root test is often done in two steps.
1. Test if the original variable has a unit root; the test will too often fail to
reject the null hypothesis of a unit root, and so conclude the series is I(1).
2. Test if the first difference of the variable is stationary (which should be
the case if the previous ADF test concludes the series is I(1)).
Even so, problems with the power of the test may arise if:
• the time span is short
• ρ is close to, but not exactly, 1
• there is more than a single unit root (i.e. the series is I(2), I(3), etc.)
• there are structural breaks in the series.
The ADF test is also sensitive to the specification used. Using the wrong
one of the three models can impact on the size of the test, that is, increase
the probability of committing a type I error (rejecting the null hypothesis
when it is true).
In the next two subsections, we will put the ADF test into practice with a
real time series. However, it is worth noting first that whilst the ADF test
is a popular test for the presence of a unit root, there are many other unit
root tests, including the Phillips–Perron test. Econometricians frequently
test for a unit root in a series using several tests. This is because unit root
tests differ in terms of size and power.


5.3 Stationarity of a transformation of GDP


In the Solution to Activity 11, it was noted that quarterly GDP is
non-stationary. In this subsection and the next, we will see if a
transformation of this GDP variable is stationary. The transformation we
will consider is the first differences of the natural log of the series. We
often transform data by taking the natural log of the series to reduce the
skewness of the data and to reduce the impact of any outliers. It also acts
to stabilise variances, which is a necessary condition for weak stationarity
(Box 15, Subsection 4.2).
A Taylor expansion can be used to show that the first differences of the
natural logs of GDP have a close link with the GDP growth rate, gGDP ,
introduced in Subsection 2.4. Recall that the Taylor expansion of
log(1 + x) for −1 < x ≤ 1 is
log(1 + x) = x − x²/2 + x³/3 − x⁴/4 + · · · .
Therefore,
Yt = log(GDPt ) − log(GDPt−1 )
   = log(GDPt /GDPt−1 )
   = log(1 + gGDP )
   = gGDP − (gGDP )²/2 + (gGDP )³/3 − (gGDP )⁴/4 + · · ·
   ≃ gGDP ,
where the last step is an approximation. This approximation works well
for small values of gGDP , but becomes less accurate if gGDP is less
than −0.25 or more than 0.25 (Cleveland, 1993, p. 126).
So, calculating period-to-period growth rates of GDP is approximately
equivalent to taking the (natural) logarithms of GDP and then
first differencing these logged GDP values.
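A quick numerical check of this approximation in R: the simulated ‘GDP’ series
below, growing by about 0.5% per period, is purely illustrative.

set.seed(3)
GDP <- 100 * cumprod(1 + rnorm(40, mean = 0.005, sd = 0.01))  # roughly 0.5% average growth

g <- diff(GDP) / head(GDP, -1)    # exact growth rate (GDP_t - GDP_{t-1}) / GDP_{t-1}
dlog <- diff(log(GDP))            # first difference of logged GDP

summary(dlog - g)                 # the two are virtually identical for small growth rates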
In the following activity, you will begin to explore the transformed time
series.


Activity 14 Using correlograms to explore whether UK


GDP might contain a unit root
Figures 14(a) and 14(b) show the time plot for log(GDP) and the first
differences of log(GDP), respectively. Figures 14(c) and 14(d) show the
correlograms for log(GDP) and the first differences of log(GDP),
respectively.
Comment on the plots in Figure 14. What do these plots tell us about the
appropriate model for log(GDP)?


Figure 14 (a) Time plot of log(GDP), (b) time plot of first differences of log(GDP), (c) correlogram
for log(GDP) and (d) correlogram for first differences of log(GDP)

We can conclude, therefore, that, if the underlying pattern of growth of


GDP is characterised by a fairly constant and low rate of growth, taking
logarithms of GDP will stabilise its period-to-period increments over time.


Subsequently, taking first differences of these logarithms of GDP will


stabilise the mean of the resulting time series. This double transformation
– first taking logarithms and then first differences – is often used in the
analysis of economic time series to stabilise the mean and the variance of a
time series that displays strong persistence.
Example 9 continues with the use of the GDP dataset, testing for the
presence of a unit root.

Example 9 Testing for the presence of a unit root in UK


GDP
The ADF test can be used to assess whether there is a unit root in
UK GDP using the following regression (which assumes there is a
random walk with drift):
∆ log(GDPt ) = β0 + δ log(GDPt−1 ) + γ1 ∆ log(GDPt−1 )
+ γ2 ∆ log(GDPt−2 ) + γ3 ∆ log(GDPt−3 )
+ γ4 ∆ log(GDPt−4 ) + ut .
When selecting lag lengths for the ADF regression with quarterly
data, it is common to begin with four lags. The appropriate model
can be selected through a process of successive estimation, dropping
one lagged term each time and comparing across information criteria;
this will be shown in the next subsection. For the moment, we will
continue with the model that includes four lags. The results from
fitting this model are given in Table 9.
Table 9 Results of running the ADF test for log(GDP)

Parameter         Estimate   Standard error   Usual t-statistic
Intercept          0.0176     0.0067            2.63
log(GDPt−1 )      −0.0011     0.0005           −2.08
∆ log(GDPt−1 )    −0.0794     0.0632            1.26
∆ log(GDPt−2 )     0.2514     0.0598            4.21
∆ log(GDPt−3 )     0.3368     0.0597            5.65
∆ log(GDPt−4 )     0.0474     0.0630            0.75

The test statistic is the usual t-statistic calculated as the ratio of the
estimator to its standard error. So, reading off from the log(GDPt−1 )
row of Table 9, the ADF test statistic for this example is −2.08.
It turns out that the critical values for this particular test are −3.44
(1%), −2.87 (5%) and −2.57 (10%). So, as −2.08 is greater than
−2.57, this suggests that the p-value is greater than 0.10. Thus there
is insufficient evidence to reject the null hypothesis, and therefore we
conclude that log(GDPt ) has a unit root and is therefore
non-stationary.


In Example 9, you have seen that the ADF test suggests that log(GDPt ) has a
unit root. However, the results in Table 9 also suggest the ADF test model
could drop the lag 4 term. So the regression without it should be
re-estimated. The final test statistic, the usual t-statistic for the
log(GDPt−1 ) term in the re-estimated regression, would then be compared
with the critical value of the DF distribution assuming a random walk
with drift.
At the end of Subsection 5.2, we explained that the unit root test is often
done in two steps, given its high probability of a type II error. So if the test suggests that
log(GDPt ) has a unit root, then an ADF test on its first differences should
return a result suggesting the latter is stationary. In the next subsection,
you will explore the specification of an ADF test.

5.4 Using R to test for stationarity


As you have seen in Subsections 5.2 and 5.3, performing the ADF test
requires fitting a regression model, and requires the use of a computer. In
Notebook activity A2.5, you will see how R can be used to achieve this.

Notebook activity A2.5 Performing ADF tests in R


This notebook will show you how to perform ADF tests in R using
the data in ukGDP. The different specifications of the ADF test will be
explored.
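As a taster of the notebook, the following minimal sketch runs ADF tests on a
simulated I(1) series and on its first difference. It assumes the urca package
is available (the notebook may use a different function), and the series and
lag length are illustrative.

library(urca)   # assumed to be installed; provides ur.df()

set.seed(5)
y <- cumsum(0.1 + rnorm(200))   # an illustrative I(1) series (random walk with drift)

summary(ur.df(y, type = "drift", lags = 4))         # level: expect not to reject a unit root
summary(ur.df(diff(y), type = "drift", lags = 4))   # first difference: expect to reject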

6 Modelling more than one time


series variable
So far in this unit, we have looked at the properties of a univariate time
series and focused in particular on testing whether a time series is
stationary or not, taking into account its lag dependence. As we have seen,
most time series you come across in econometric practice are
non-stationary but can often be rendered stationary by using log
transformations, taking first differences or removing a linear trend. When
time series variables are transformed to a stationary variable, then
estimation and hypothesis testing from OLS are valid.
You are already familiar with the notion of prediction. Indeed, regression
analysis allows you to express each observation of the dependent variable
as the sum of its predicted value (predicted by the regression line) and its
corresponding residual. Hence
actual value = fitted value + residual
= predicted value + residual.


With time series, one can predict the (as yet unknown) future trajectory of
a variable in terms of its own past behaviour (as well as, more generally, of
the past behaviour of other relevant explanatory variables). This is called
forecasting, which is one of the main uses of time series data.
When using more than one time series variable, the issues related to
stationarity persist. Every variable in an econometric model has
to be stationary and weakly dependent to render any forecasting or
modelling work useful. Individually, the procedure of finding and
modelling the lag dependence, and then using an ADF test to find a
stationary transformation, needs to be performed for each variable.
However, with more than one variable, there are additional challenges and
additional solutions to non-stationarity. Subsection 6.1 will discuss one
major challenge of having an econometric model with at least two time
series variables that are non-stationary, for instance I(1): the problem of
spurious regression. Subsection 6.2 will then discuss cointegration, which is
a characteristic of I(1) variables (or I(p) variables more generally) which
allows them to be included in an econometric model without first
differencing. Then Subsection 6.3 will discuss error correction models,
which allow the econometrician to better understand how cointegration
works. Finally, in Subsection 6.4, you will see how to implement the ideas
in this section using R.

6.1 Spurious regression


Recall Slutzky’s experiment discussed at the beginning of Section 3.
Slutzky showed how a ten-term moving summation of the last digits of
numbers drawn in a Soviet lottery closely tracked the wave-like pattern of
Dorothy Swaine Thomas’s quarterly index of English business cycles in
the 1800s.
Both time series moved together like two partners on a dance floor in a
carefully choreographed routine, although in this particular experiment we
know that the relation between them is purely coincidental. It is perfectly
possible of course that Slutzky had to do several experiments using
different sets of lottery random numbers before picking out that realisation
of the joint stochastic process that tallied best with the business cycle data.
The key lesson from Slutzky’s experiment is that regressions featuring
stochastically trended time series may appear to yield highly significant
results based on the conventional normal critical values used in OLS
regression, even when these time series are substantively unrelated. This is
what econometricians refer to as the problem of spurious regression. As
Slutzky’s experiment showed, this problem occurs when variables seem to
be highly correlated, even though the cumulation of random disturbances
plays an important part of the data-generation process.
Example 10 demonstrates another example of spurious regression.


Example 10 Correlation between two simulated random


walks

Figure 15 shows two simulated time series generated by a random


walk with a drift process: process 1 (rwd1) and process 2 (rwd2). As
you can see, the series appear to move closely together for the first
ten observations and then diverge slightly. The series move roughly in
parallel with each other between observations 15 and 30. The distance
between the two series appears to close slightly from around
observation 35. We know these series are unrelated because they have
been simulated to be so. However, on casual inspection, if we did not
know the series were simulated, we might think that they were
correlated.

Figure 15 Two simulated time series generated by a random walk with
drift (rwd1 and rwd2)

Table 10 contains the results from regressing one of the random walks
with drift against the other using OLS. The regression results show a
high degree of correlation between the two series rwd1 and rwd2. The
coefficient on rwd2 is close to 1 and appears to be highly statistically
significant.


Table 10 Results of the regression rwd1 ∼ rwd2

Parameter   Estimate   Standard error   t-statistic   p-value
Intercept   −2.542      1.116            −2.28        0.027
rwd2         0.938      0.017            56.31        < 0.001

The R2 and adjusted R2 values for this regression are 0.9851


and 0.9848, respectively. These very high values suggest that the line
has fitted very well. Indeed, the regression results look almost too
good to be true. However, in reality, rwd1 and rwd2 are unrelated.
The series have been simulated and their values are simply the
cumulative sum of random shocks. The regression is spurious in that
it implies a strong relationship between the variables when there is in
fact no relationship.

In Example 10, you saw that with time series data a spurious regression
can fit the data very well. So when should we suspect that a regression
might be spurious?
One rule of thumb, arising out of Granger and Newbold (1974), involves a
statistic known as the Durbin–Watson (DW) statistic, described in Box 21.

Box 21 Durbin–Watson statistic


The Durbin–Watson (DW) statistic is used to detect
autocorrelation.
The DW statistic always takes a value between 0 and 4.
• A value of 2 means that there is no autocorrelation detected.
• A value of 0 indicates there is perfect positive autocorrelation.
• A value of 4 indicates there is perfect negative autocorrelation.

The rule of thumb is that we should suspect spurious regression if the R2


value is greater than the value of the corresponding DW statistic.
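As an illustration, the following minimal sketch regresses two freshly simulated,
independent random walks with drift on each other and computes the DW statistic
directly from the residuals. All values are illustrative; in practice a
dedicated function from an add-on package could be used to compute the DW
statistic instead.

set.seed(99)
rw_a <- cumsum(2 + rnorm(50))    # two independent random walks with drift
rw_b <- cumsum(2 + rnorm(50))

fit <- lm(rw_a ~ rw_b)
summary(fit)$r.squared           # typically very high, despite no real relationship

res <- residuals(fit)
DW <- sum(diff(res)^2) / sum(res^2)   # Durbin-Watson statistic
DW                               # typically well below the R-squared value: suspect spuriousness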
The problem of spurious regression, therefore, is a serious issue in time
series econometrics since many economic time series exhibit a stochastic
trend. Consequently, simply running a regression between two time series
with stochastic trends and taking the results at face value is not
appropriate because of the danger of spurious regression. This does not
mean, however, that it is impossible to study the statistical relationship
between non-stationary time series variables. But it does pose the question
of how to differentiate between the following two cases.


1. Two variables that appear to move together during a particular time


span simply because they both happen to be non-stationary and are
moving in a similar direction
2. Two variables that move together because there is a substantively
meaningful relationship between them that ties them together in their
movements even if random variations are part of their data-generating
processes.
The next subsection will shed light on these two cases.

6.2 Cointegration
The concept of cointegration is used by econometricians to infer whether
there is a meaningful statistical relationship between two or more I(1)
variables (or I(p) variables more generally) over time. Note that, whereas
the concept of order of integration is a property of a single time series
variable, the concept of cointegration is a property of how two or more
time series variables relate.
Think of examples such as the relationship between short- and long-term
interest rates, household incomes and expenditures, or commodity prices in
geographically separated markets. Economic theory suggests that these
respective pairs of variables have a long-run equilibrium relationship
between them. In other words, when the series move apart from one time
point to the next – for example, due to a force that acts only in the short
term – an equilibrating process tends to make them converge so that in the
long term they tend to trend together.
For example, if gold prices were higher in the USA compared with South
Africa, then merchants would be able to profit from buying in South Africa
and selling in the USA if the price differential were greater than the costs
for transportation and marketing. The possibility to benefit in this way
from price differences is called arbitrage. As more merchants engage in
this arbitrage trading, the price of gold in South Africa would rise until it
was no longer worthwhile for anyone to buy gold in South Africa simply to
sell it for a higher return in the USA; effectively, this would be until the
last individual achieved zero profit and was indifferent between doing
arbitrage or not.
If we were to plot the gold price in South Africa and the USA over time,
we would expect to see them trending closely together, and to come closer
after temporary departures.
If such an equilibrium relationship exists between two I(1) time series, you
would expect the error term of the regression between them to be weakly
dependent and stationary. This is the intuition behind the idea of
cointegration, and behind the strategy used by the Engle–Granger test for
cointegration described in Box 22. If two (or more) non-stationary time
series, integrated of order 1, are cointegrated, there is a stable equilibrium
relationship between them. The error term resulting from regressing one
on the other will be stationary, i.e. integrated of order 0. In effect, the


‘persistence/drift’ processes in the two I(1) series will have cancelled each
other out, resulting in an error term with no persistence or drift.
Suppose Yt and Xt are two I(1) time series. The Engle–Granger test for
cointegration is an ADF unit root test applied to the residuals ût obtained
from the regression
Yt = β0 + β1 Xt + ut .

Box 22 Engle–Granger test


The Engle–Granger test is an ADF unit root test on the residuals ût.
• If the null hypothesis cannot be rejected, this means residuals are
I(1) and the variables Yt and Xt are not cointegrated. Running a
regression of Yt as a function of Xt without further transforming
the variables and seeking to find a stationary transformation will
result in a spurious regression.
• If the null hypothesis can be rejected, we can conclude that the
residuals are stationary, i.e. I(0), and that Yt and Xt are
cointegrated. If two series are cointegrated, then the cointegrating
regression is not spurious. The case of cointegration is the only case
where OLS and standard hypothesis testing is valid even when
variables included in the regression model are non-stationary.

While this is effectively an ADF test, the exact critical values of the
Engle–Granger test are slightly different (MacKinnon, 2010). For
simplicity, however, we will use the critical values of the DF distribution.
We will apply the Engle–Granger test to a couple of real datasets in
Subsections 6.2.1 and 6.2.2.
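To make the two steps concrete, here is a minimal base R sketch of the Engle–Granger procedure; y and x are placeholder names for two I(1) series, and the ADF regression on the residuals includes one lagged difference with no drift or trend.

# Step 1: cointegrating regression in levels
coint_fit <- lm(y ~ x)
u <- resid(coint_fit)                  # estimated equilibrium errors

# Step 2: ADF-type regression on the residuals
du <- diff(u)                          # delta u_t,  t = 2, ..., T
u_lag1 <- u[-length(u)]                # u_{t-1},    t = 2, ..., T
adf_fit <- lm(du[-1] ~ 0 + u_lag1[-1] + du[-length(du)])
summary(adf_fit)   # compare the t-statistic on u_lag1 with the appropriate critical values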

6.2.1 GDP and consumption in the USA from 1970 to 1991
In this subsection, we will use the Engle–Granger test to test whether
GDP and personal consumption expenditure (PCE) in the USA were
cointegrated between 1970 and 1991. We start by introducing the US
consumption dataset.


GDP and consumption in the USA


In Subsection 2.1 of Unit A1, we introduced the idea of the
consumption function. The consumption function is as valid at the
microeconomic level, when units of observation are individuals or
families, as it is at the macroeconomic level, when units of observation
are countries or regions. This is the first example in this strand which
analyses a consumption function with time series data on a specific
country (at the macroeconomic level) over time. The data we will use
are as follows.
The US consumption dataset (usConsumption)
This dataset contains quarterly time series data for GDP and personal
consumption expenditure for the US economy from Q1 of 1970 to Q4
of 1991, a total of 88 observations.
The dataset contains data for the following variables:
• year: the year the observation relates to
• quarter: the quarter (of the year) the observation relates to, taking
the value 1, 2, 3 or 4
• gdp: the quarterly US GDP in billions – in real terms or constant
prices (1987 US dollars)
• pce: the quarterly US PCE for the US economy in billions – in real
terms or constant prices (1987 US dollars).
The data for the first six observations in the US consumption dataset
are shown in Table 11.
Table 11 The first six observations from usConsumption

year quarter gdp pce


1970 1 2872.8 1800.5
1970 2 2860.3 1807.5
1970 3 2896.6 1824.7
1970 4 2873.7 1821.2
1971 1 2942.9 1849.9
1971 2 2947.4 1863.5

Source: Table 21.1 in Gujarati, 2004, p. 794


Having sourced some data, let’s start with some exploratory data analysis
in the next two activities.

Activity 15 Considering time plots of GDP and PCE

Figure 16 shows the US GDP and PCE plotted over the period 1970
to 1991. Describe the trajectories of the two time series.

Figure 16 Time plot of quarterly US GDP (gdp) and US PCE (pce) from 1970 to 1991

Activity 16 Considering correlograms of GDP and PCE

Figure 17 shows plots of the ACF for US GDP and US PCE as well as for
their first differences. What do these correlograms suggest about the order
of integration and lag dependence of GDP and PCE?


Figure 17 Correlograms: (a) ACF of GDP, (b) ACF for the first difference of GDP, (c) ACF of PCE, (d) ACF for the first difference of PCE

In order to establish the order of integration of the GDP and PCE series,
we perform the ADF test for each series and their first differences. We will
do it in two different ways:
• by performing the ADF tests on the variables in levels (Activity 17)
• by performing the ADF tests on the variables after first differencing –
which, in the case that the original variables are I(1), will be I(0)
(Activity 18).
For both of these ways, we choose the ADF test that includes a drift term
to test for a unit root of gdp and pce. This is because, in an analysis that
we are not showing, the AIC criterion confirmed the lag dependence of
order 1 of the first differences.
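As an indication of how such tests might be run in R, a sketch using the urca package is given below (the module notebooks may take a different route); the column names gdp and pce are those of the US consumption dataset.

library(urca)

# ADF tests with a drift term and one lagged difference (variables in levels)
summary(ur.df(usConsumption$gdp, type = "drift", lags = 1))
summary(ur.df(usConsumption$pce, type = "drift", lags = 1))

# ADF tests on the first differences, with no drift or trend
summary(ur.df(diff(usConsumption$gdp), type = "none", lags = 1))
summary(ur.df(diff(usConsumption$pce), type = "none", lags = 1))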


Activity 17 ADF tests on gdp and pce


Tables 12 and 13 show the results from the regressions for the ∆ gdp and
∆ pce fitted as part of the ADF tests.
(a) What is the value of the test statistic in each case?
(b) For both tests it turns out the critical values are −3.51 at 1%, −2.89
at 5% and −2.58 at 10%. Hence interpret the results from both these
tests.
Table 12 ADF results for gdp, using ∆ gdpt ∼ gdpt−1 + constant + ∆ gdpt−1

Parameter    Estimate   Standard error   t-statistic   p-value
Intercept     28.719     23.650            1.21         0.228
gdpt−1        −0.003      0.006           −0.55         0.586
∆ gdpt−1       0.320      0.104            3.09         0.003

Table 13 ADF results for pce, using ∆ pcet ∼ pcet−1 + constant + ∆ pcet−1

Parameter    Estimate   Standard error   t-statistic   p-value
Intercept     17.970     11.325            1.59         0.116
pcet−1        −0.002      0.004           −0.37         0.715
∆ pcet−1       0.181      0.108            1.67         0.098

Activity 18 ADF tests on ∆ gdp and ∆ pce


Tables 14 and 15 show the results from the ADF regressions for ∆ gdp and
∆ pce.
(a) What is the value of the test statistic in each case?
(b) For both tests it turns out the critical values are −2.60 at 1%, −1.95
at 5% and −1.61 at 10%. Hence interpret the results from both these
tests.
Table 14 ADF results for ∆ gdp , using ∆ (∆ gdpt ) ∼ ∆ gdpt−1 + ∆ (∆ gdpt−1 )

Parameter        Estimate   Standard error   t-statistic   p-value
∆ gdpt−1         −0.377      0.104           −3.61          0.001
∆ (∆ gdpt−1 )    −0.207      0.107           −1.95          0.055

Table 15 ADF results for ∆ pce, using ∆ (∆ pcet ) ∼ ∆ pcet−1 + ∆ (∆ pcet−1 )

Parameter        Estimate   Standard error   t-statistic   p-value
∆ pcet−1         −0.279      0.095           −2.93          0.004
∆ (∆ pcet−1 )    −0.366      0.102           −3.58          0.001


We have found that both variables in levels are I(1). In general, and in the
absence of cointegration, we would need to find a transformation of each
variable which was stationary.
The next step is to test for cointegration. If there is cointegration, the
variables can be used in levels, in the econometric model that has greater
economic significance.
Since the series pce and gdp are of the same order of integration, I(1), we
can estimate the cointegrating regression
pcet = β0 + β1 gdpt + ut . (2)
The results from the cointegrating regression are shown in Table 16.
Table 16 Regression results from the cointegrating regression in pcet ∼ gdpt

Parameter   Estimate   Standard error   t-statistic   p-value
Intercept    −298.1     20.71            −14.39        < 0.001
gdpt          0.733      0.005           138.69        < 0.001

Furthermore, for this regression R2 = 0.9955 and Ra2 = 0.9955.


As expected, since both variables are I(1), the p-values associated with the
estimated coefficients are small and the R2 is too good to be true. In order
to check that this is not a spurious regression, we need to test whether the
residuals from the regression are stationary. This is what we will do in
Activities 19 and 20.
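In R, the cointegrating regression and its residuals might be obtained along the following lines; this is a sketch assuming the usConsumption data frame described above, with the plotted residuals corresponding to Figure 18.

# Fit the cointegrating regression in Equation (2) and save its residuals
coint_us <- lm(pce ~ gdp, data = usConsumption)
summary(coint_us)

u_hat <- resid(coint_us)      # the residual series plotted in Figure 18
plot(u_hat, type = "l")       # a quick visual check of the residuals over time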

Activity 19 Specifying the ADF test

The residuals from the estimated regression are shown in Figure 18.
Based on this plot, state which of the variants (given in Box 20 of
Subsection 5.1) would be more appropriate:
• simple random walk
• random walk with drift
• random walk with drift and a deterministic trend,
and give the model specification.


Figure 18 Plot of residuals from the cointegrating regression in Equation (2)

Activity 20 Testing for cointegration between gdp and pce

The results from the regression fitted as part of the ADF test on the
residuals from the estimated cointegrating regression, using the
specification given in the Solution to Activity 19, are shown in Table 17.
Table 17 ADF test results with û modelled as a simple random walk

Parameter   Estimate   Standard error   t-statistic   p-value
ût−1         −0.212     0.068            −3.11         0.003
∆ ût−1        0.074     0.108             0.69         0.494

The critical values for this particular ADF test are −2.6 at 1%, −1.95
at 5% and −1.61 at 10%. Interpret the results.

Since the series are cointegrated, we can substitute the coefficients from
the cointegrating regression in Table 16 into Equation (2) to interpret the
regression results:
pcet = −298.1 + 0.73 gdpt + ût .


Because the series are cointegrated, we can say that there is a long-run
equilibrium statistical relationship between US GDP and PCE between
1970 and 1991. We will return to this idea of long-run equilibrium
relationships in Subsection 6.3.

6.2.2 Testing the theory of relative purchasing power parity
Let’s now apply the process of finding out whether variables are
cointegrated (and, whilst doing this, of finding the best econometric model
that relates them) to the co-movement of prices of goods across countries
with different general price levels and currencies.
We are all used to the concept of the nominal exchange rate as the price of
one currency in terms of another. The theory of purchasing power
parity (PPP) states that national price levels should tend to be equal
when expressed in a common currency. If PPP holds, then the nominal
exchange rate between two currencies should be equal to the ratio of the
general price levels in the two countries, such that the price of identical
items should be the same in each country when expressed in a common
currency (in the absence of transaction costs and other trade barriers).
PPP is a contested theory and there are many examples where PPP does
not hold. One famous example is the price of a Big Mac, which clearly
differs from country to country. In January 2022, a Big Mac would cost
you US$6.98 in Switzerland, US$5.81 in the USA, US$3.38 in Japan and
just US$1.86 in Turkey. (The study of prices relative to a Big Mac is
sometimes referred to as ‘Burgernomics’.)
While absolute PPP, as described above, is unlikely to hold in the real
world, it might be that the difference in the rate of depreciation of one
currency relative to another matches differences in inflation between the
two countries. This is the theory of relative PPP, and we can use the time
series techniques discussed in this unit to investigate whether there is
evidence to support the theory. (We are back in the realm of confirmatory
analysis.)
In this unit, we will explore the theory of relative PPP by comparing
prices in Japan and the USA. For a particular time t, suppose we have the
following:
• pJapan,t , the log of the price level (in Japanese Yen) in Japan
• pUS,t , the log of the price level (in US dollars) in the USA (which in
what follows we will denote by uspt )
• et , the log of the US dollar price of the Japanese Yen.
Then
et + pJapan,t
is the log of the price level (in US dollars) in Japan (which in what follows
we will denote by jppt ).


If relative PPP holds for Japan and the USA, we would expect there to be
a stable long-run statistical relationship between jppt and uspt . So testing
for cointegration between jppt and uspt over time would therefore serve to
test the theory of relative PPP.
We will use the data described next to explore this.

Prices in Japan and the USA


Prices in Japan and the USA can be compared using wholesale price
indices and the bilateral exchange rate between the USA and Japan.
The PPP (purchasing power parity) dataset (ppp)
Quarterly data for Japan and the USA for the period from the first
quarter of 1973 until the second quarter of 2008 are given in the PPP
dataset.
It contains the following variables:
• year: the year, starting at 1973 and ending at 2008
• quarter: the quarter of the year, taking values 1, 2, 3 and 4
• exchangeRate: the index of the exchange rate between the US
dollar and Japanese Yen (1973 Q1 = 100)
• japanPrices: the index of the general price level in Japan
(1973 Q1 = 100)
• usPrices: the index of the general price level in the USA
(1973 Q1 = 100).
The data for the first six observations in the PPP dataset are shown
in Table 18.
Table 18 The first six observations from ppp

year quarter exchangeRate japanPrices usPrices


1973 1 100.00 100.00 100.00
1973 2 93.92 103.43 104.85
1973 3 93.93 108.41 109.15
1973 4 97.37 117.67 110.09
1974 1 103.08 132.80 117.39
1974 2 99.22 136.08 121.56

Source: Enders, 2015, accessed 5 January 2023


Using data from the PPP dataset, Figure 19 shows the evolution over time
of the two time series et + pJapan,t and pUS,t .

Figure 19 Price levels in Japan and the USA at US constant prices from 1973 Q1 to 2008 Q2
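A sketch of how these two log series might be constructed from the ppp dataset is given below; treating exchangeRate as the (indexed) US dollar price of the yen is an assumption about the direction in which the exchange rate index is quoted.

# Construct the log price series used in the relative PPP analysis
usp <- log(ppp$usPrices)                             # log of the US price level
jpp <- log(ppp$exchangeRate) + log(ppp$japanPrices)  # log of the Japanese price level in US dollars

ppp_series <- ts(cbind(jpp, usp), start = c(1973, 1), frequency = 4)
ts.plot(ppp_series, col = c("blue", "red"))          # compare with Figure 19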

The two series were tested for the presence of unit roots using ADF tests.
Both series were found to be I(1). If we found that the series had different
orders of integration, then it would not be possible to perform a test for
cointegration and we would be able to conclude immediately that the
theory of PPP does not hold.
Table 19 shows the results from the cointegrating regression
jppt = β0 + β1 uspt + ut .

Table 19 Regression results from the cointegrating regression in jppt ∼ uspt

Parameter   Estimate   Standard error   t-statistic   p-value
Intercept    13.541     0.373             36.32        < 0.001
uspt         −0.800     0.067            −11.87        < 0.001

For this regression, R2 is 0.5014 and Ra2 is 0.4978.


Table 20 gives the results of the ADF regression performed on the
residuals from the cointegrating regression (assuming no drift and no
deterministic time trend).


Table 20 ADF regression results for û, using ∆ ût ∼ ût−1 + ∆ ût−1

Parameter   Estimate   Standard error   t-statistic   p-value
ût−1         −0.052     0.018            −2.85         0.005
∆ ût−1        0.391     0.077             5.09         < 0.001

The ADF test statistic was found to be −2.852. It turns out that the 1%
critical value is −2.58. So we can reject the null hypothesis that the
residuals contain a unit root in favour of the alternative hypothesis that
they are stationary. Since both time series are I(1) and the residuals of the
cointegrating regression are I(0), we can conclude that the time series are
cointegrated, so there is a long-term stable statistical relationship between
the two series. This provides some support for the theory of relative PPP.

6.3 Error correction model (ECM)


If Xt and Yt are cointegrated, then we can conclude that there is a
long-run equilibrium statistical relationship between them. You have
already seen two examples of this, firstly in Subsection 6.2.1 between GDP
and personal expenditure in the USA, and again in Subsection 6.2.2 on
PPP in the relationship between the US dollar value of the Japanese price
level and the price level in the USA.
According to the Granger representation theorem, if two variables are
cointegrated, then the dynamic statistical relationship between them can
be expressed as an error correction model (ECM); but what is this?
If two time series are cointegrated, they are not locked together in a rigid
relationship. Instead, the time series evolve according to a long-run
pattern, sometimes moving apart, but always reverting back to the
long-run equilibrium. (The iconic example of a random walk is a drunken
farmer walking home from the pub; the corresponding example of
cointegration is a drunken farmer and his dog walking home from the
pub.) An error correction model is based on this idea of a correction from
any deviation from the equilibrium.
We can think of the error term ut from the cointegrating regression as this
deviation from equilibrium; starting from a cointegration regression
Yt = β0 + β1 Xt + ut ,
we can express ut as Yt − (β0 + β1 Xt ), that is, the difference between the
actual value of Yt and β0 + β1 Xt , which is the value of Yt predicted by the
cointegration relationship. So, ut captures the error from the equilibrium
which, in the error correction model, is assumed to influence the behaviour
of Yt in the next period.
The error correction model, as you will see next in Box 23, expresses
changes in Yt as driven by two things – changes in Xt and an attraction
mechanism back to equilibrium captured by a lagged version of ut .


Box 23 An error correction model


Let Yt and Xt be two time series linked by the cointegrating regression
Yt = β0 + β1 Xt + ut .
Furthermore, let ût be the residuals from the fitted cointegrating
regression. That is,
ût = Yt − (β̂0 + β̂1 Xt ).
Then in an error correction model (ECM),
∆ Yt = α0 + α1 ∆ Xt + α2 ût−1 + εt .
In the ECM:
• The term α1 ∆ Xt captures changes in the dependent variable due
to current period changes in the independent variable; this is the
short-term adjustment.
• The term α2 ût−1 captures changes in the dependent variable
resulting from the equilibrium error in the previous period; this is the
error correction term.

The coefficient α2 is the key to the error correction process.


• It should be negative to ensure that the error is reducing the difference
from the cointegrating equilibrium relationship between Y and X, so
that the time series are moving back together. (If it were positive, the
series would move apart forever as the equilibrium error would increase
in every period.)
• Its value should lie between −1 and 0.
◦ When α2 = 0, it means that there is no error correction.
◦ When α2 = −1, it means that the entire equilibrium error is corrected
in a single period.
Normally, α2 is between these two extremes, governing how long it takes
for a disturbance from equilibrium to die out. The closer α2 is to 0, the
longer this process is likely to take. Hence it is called the speed of
adjustment factor.
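In base R, an ECM could be estimated along the following lines; this is a sketch with y and x as placeholder names for two cointegrated I(1) series.

# Step 1: cointegrating regression and its residuals
coint_fit <- lm(y ~ x)
u <- resid(coint_fit)
n <- length(u)

# Step 2: error correction model in first differences
dy <- diff(y)              # delta Y_t,  t = 2, ..., T
dx <- diff(x)              # delta X_t,  t = 2, ..., T
u_lag1 <- u[1:(n - 1)]     # u-hat_{t-1}

ecm_fit <- lm(dy ~ dx + u_lag1)
summary(ecm_fit)           # the coefficient on u_lag1 estimates the speed of adjustment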
We will apply the error correction model to the PPP dataset in
Example 11. Recall that we showed that there was cointegration for these
data in Subsection 6.2.2.


Example 11 Estimating the relative PPP relationship for Japan and the USA as an ECM
In this example, we will estimate the error correction model for the
relationship between the price level in the USA (uspt ) and the US
dollar value of the Japanese price level (jppt ), using the PPP dataset.
To estimate the relationship between the two, we applied the
Engle–Granger test for cointegration and found that the series were
indeed cointegrated. The estimated long-run statistical relationship
between them was
jppt = 13.54 − 0.80 uspt + ût . (3)
Now, we can use the saved residuals from the cointegrating regression
to estimate the error correction model
∆ jppt = α0 + α1 ∆ uspt + α2 ût−1 + εt . (4)
Recall that ût−1 is the previous period’s gap between the actual value
of jpp and its predicted value from the cointegrating regression. As
such, it represents the ‘out of equilibrium error’ and, in order to
return to equilibrium (as cointegration tells us it must), jpp will be
adjusted by some proportion (α2 ) of the previous period’s error.
The estimation results are shown in Table 21. The coefficients are
significant at the 10% significance level but the R2 values turn out to
be extremely low (R2 = 0.0646 and Ra2 = 0.0510). The coefficient of
ût−1 is indeed negative as expected, but small – only 3% of any
disturbance will be restored in each period, so departures from PPP
will take a long time to dissipate.
Table 21 Results from estimating the ECM in Equation (4)

Parameter   Estimate   Standard error   t-statistic   p-value
Intercept    −0.010     0.006            −1.82         0.071
∆ uspt        0.640     0.291             2.20         0.029
ût−1         −0.031     0.018            −1.70         0.092

Taking the estimated values from Table 21 and substituting into
Equation (4) yields the relationship
∆ jppt = −0.010 + 0.640 ∆ uspt − 0.031 ût−1 + εt .
This is the error correction model corresponding to the cointegration
regression in Equation (3). Together, they provide some limited
support for the theory of relative PPP, but the extremely low R2
suggests that the relationship is rather weak.


6.4 Using R to explore cointegration


In this subsection, we will use R to explore cointegration. In Notebook
activity A2.6, you will simulate data in R to explore how spurious
regression can arise. Then, in Notebook activities A2.7 and A2.8, you will
explore cointegration using a couple of real datasets. Notebook
activity A2.7 will focus on the PPP dataset. Finally, Notebook
activity A2.8 will focus on a dataset about coffee prices. This dataset is
described next.

The price of coffee


The daily prices for the Coffee C Futures contract, traded on the
New York futures market on weekdays, and the International Coffee
Organization’s (ICO) daily indicator price for Brazilian and other
natural arabicas (Brazilian naturals) from 1 April 2020 to
31 March 2021 were collected.
The coffee prices dataset (coffeePrices)
The dataset contains data for the following variables:
• spot: the daily indicator price for Brazilian naturals compiled by
the International Coffee Organization – these are the weighted
average of the prices of physical coffee delivered to the ports of
New York, Hamburg and Marseille in US cents per lb
• futures: the price of the Coffee C Futures contract traded on the
ICE exchange in New York.
The data for the first six observations in the coffee prices dataset are
shown in Table 22.
Table 22 The first six observations from coffeePrices

spot futures
112.94 116.00
115.86 119.35
112.62 114.90
113.65 116.65
116.14 119.90
116.36 119.80

Source: International Coffee Organization (no date) for the Brazilian
naturals, and ICE Futures U.S. (no date) for the Coffee C Futures, both
accessed 1 April 2021

Notebook activity A2.6 Spurious regression in R


In this notebook, you will be simulating random walks with drift and
examining their relationship.


Notebook activity A2.7 Testing for cointegration and estimating error correction models using R
This notebook explains how to use R to test for cointegration and to
estimate error correction models using the PPP dataset.

Notebook activity A2.8 More testing for cointegration using R
In this notebook, you will have another chance to practise testing for
cointegration between two series and estimating an error correction
model where a long-run equilibrium relationship exists between the
series. This notebook uses the coffee prices dataset.

7 Modelling time with panel data


In Unit A1, you looked at modelling and methods which require a large
number of cross-sectional units. In Unit A2 so far, you have looked at
modelling and methods which require a large number of time periods,
that is, methods which rely on large T. The gap we want to fill in this
section, and to close Strand A, is to discuss what happens when modelling
and methods require both a large number of cross-sectional units (large N )
and time periods (large T ). We are back in the realm of panel data and can
now explore long panel data structures and time in a panel data set-up.
Most econometric models discussed in Unit A1 have left out time and have
used variables which are likely to be persistent over time; that was
certainly the case for the dependent variables. The current values of wages,
consumption, external debt repayment and yield per hectare are all very
likely to depend on their past values, and are likely to trend over time. Not
modelling time and lag dependence when they are important will generate
biased estimators (Alvarez and Arellano, 2003).
This section, at the end of the strand, brings the techniques you have
learnt in these two units together and discusses modelling and estimation
with panel data when variables are likely to be autocorrelated, or when
time needs to be modelled explicitly. In Subsection 7.1, we will revisit
fixed effects models and introduce a new estimator, the first difference
estimator which is suitable when there is lag dependence. Then, to round
off this strand, in Subsection 7.2 we will return to the issue of the
dependency of wages on education and ability. Here you will see how both
omitted variable bias and time can be addressed using panel data.


7.1 Fixed effects models with time


In Subsection 5.2 of Unit A1, you learnt two main estimation methods for
dealing with fixed effects, that is, when unobserved omitted variables are
time invariant and correlated with the explanatory factors of interest.
These were the least squares dummy variable (LSDV) estimator and the
within groups (WG) estimator.
Variables in Unit A1 were modelled assuming they did not have any
persistence over time, which can be a strong assumption in a lot of cases.
For instance, our consumption today will depend on our income and prices
of possible goods and services, but it will likely depend on what we
consumed in the previous period, given our habits and preferences for
particular goods. This is more so if goods are what are called ‘addictions’,
such as cigarettes, sugary goods or alcohol. We also assumed errors were
not autocorrelated which, in most economic problems, can also be a strong
assumption.
This subsection revisits the panel data models as you have learnt them in
Unit A1 and studies how these models and the estimation of them change
when we model persistence over time of variables or error terms. To
summarise, when the dependent variable is a function of its past values, or
if the error terms are autocorrelated, neither of these estimators is
unbiased or consistent.
Let’s see this using the WG estimator, which consists of estimating a
model using OLS where original variables have been demeaned.
Suppose the original model is
Yi,t = α0 + α1 Xi,t + ui,t ,
where ui,t = fi + νi,t and fi (the fixed effect) may be correlated with Xi ,
but νi,t is i.i.d. with zero mean and constant variance. In this case, the
WG estimator is the OLS estimator of the demeaned model

Yi,t − Ȳi = α1 (Xi,t − X̄i ) + (νi,t − ν̄i ). (5)
The error term in Equation (5) will be treated like any other by software
which estimates this model, but we need to remember ourselves how it
relates to the original model to be able to assess whether it delivers
unbiased and consistent estimators.
Both the constant term and the fixed effect disappear from the
transformed model, but it’s advisable to add a constant in the estimation;
that way, it is easier to ensure the error terms have zero mean. WG
generates unbiased and consistent estimates if the explanatory variable
(Xi,t − X̄i ) is uncorrelated with the error term (νi,t − ν̄i ). Bearing in mind
that within-group means use all observations from each individual, the two
remain uncorrelated with each other if X is strictly exogenous – it needs to
be independent of past, present and future values of ν.
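To illustrate the demeaning involved, here is a small base R sketch of the WG transformation, using a hypothetical long-format data frame dat with columns id, y and x.

# Within (WG) transformation by hand: subtract each unit's own mean
demean <- function(v, g) v - ave(v, g)     # ave() returns the within-group means

wg_fit <- lm(demean(dat$y, dat$id) ~ demean(dat$x, dat$id))
summary(wg_fit)   # point estimates match a fixed effects (within) estimator;
                  # the standard errors need a degrees-of-freedom correction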
In the context of persistence over time, you may already notice how easily
the assumption of strict exogeneity will crumble. Now, let’s add lag


dependence by assuming the dependent variable Y follows an AR(1)


process; that is,
Yi,t = α0 + α1 Yi,t−1 + α2 Xi,t + ui,t ,
where ui,t = fi + νi,t and fi may be correlated with Xi , but νi,t is still i.i.d.
with zero mean and constant variance. WG estimation would be
estimating OLS on the demeaned model
Yi,t − Ȳi = α1 (Yi,t−1 − Ȳi* ) + α2 (Xi,t − X̄i ) + (νi,t − ν̄i ),
where the asterisk (*) is to remind ourselves that this mean is calculated
without the last observation as a result of taking lags.
The use of means in these transformations has brought in
contemporaneous relationships between lagged Y and ν, and so, even with
i.i.d. error terms, there will be a correlation between the transformed
lagged dependent variable and the new error term. This creates a bias and
inconsistency of OLS in estimating the coefficient for Yi,t−1 . If there is a
correlation between the lagged dependent variable and the current value
for X, the bias and inconsistency are also transmitted to the estimators of
the coefficient of X.
So while WG is the most efficient estimator when there is time-invariant
unobserved heterogeneity that is correlated with the explanatory variables,
provided these are strictly exogenous, its efficiency depends on there being
a large number of cross-sectional units and a fixed number of time periods,
and on there being no autocorrelation or persistence. In the next
subsection, we will
show you an alternative which starts addressing these caveats.

7.1.1 The Anderson–Hsiao estimator


The attraction of the WG estimator is its computational simplicity. By
removing the fixed effects instead of trying to measure them, and by using
OLS, the new model remains as small as the original model, and efficient
due to the Gauss–Markov theorem.
But since demeaning variables is not an option when there is lag
dependence, it is important to explore alternative ways to remove the fixed
effects. An often-used alternative is to take first differences instead (see
Section 1).
The first difference (FD) estimator expresses each observation as a
difference from its previous value. So for each original observation, we
subtract its lag of order 1. The transformed variable has T − 1
observations only, because the first observation has no predecessor. The
resulting model looks like
∆ Yi,t = β ∆ Xi,t + (νi,t − νi,t−1 ),
where i = 1, . . . , n and t = 2, . . . , T . As with the WG estimator, the FD
estimator has no intercept. Interestingly, the FD and the fixed effects
estimators are the same when there are only two time periods, but with
longer panels they differ. FD will perform better with long panels, and is


the basis of estimators which deal with autocorrelated variables. Let’s see
how.
Consider an AR(1) model with a stationary regressor X which has no lag
dependence:
Yit = α0 + α1 Yi,t−1 + α2 Xi,t + ui,t ,
where ui,t = fi + νi,t , such that fi may be correlated with Xi but νi,t is still
i.i.d. with zero mean and constant variance. The FD estimator would
apply OLS on the differenced model
Yi,t − Yi,t−1 = α1 (Yi,t−1 − Yi,t−2 ) + α2 (Xi,t − Xi,t−1 ) + νi,t − νi,t−1 .
However, OLS would be inconsistent and biased because there is now a
correlation between Yi,t−1 and νi,t−1 in the relationship between the
explanatory factor Yi,t−1 − Yi,t−2 and the error term νi,t − νi,t−1 .
In Unit A1, you saw alternatives to OLS when regressors were endogenous.
One such alternative was the instrumental variable (IV) estimator. This
relied on finding an instrumental variable which is correlated with the
endogenous regressor but has no direct impact on the dependent variable.
Anderson and Hsiao (1981) realised that the correlation exists because the
first term of the regressor is correlated with the last term of the error term.
By using second and third lags of the dependent variable, this correlation
would be removed, provided that the AR(1) model is sufficient to model
the persistence of the dependent variable. Either the lagged variables
themselves, or their differences, would be adequate instruments, as long as
they are of lag order 2 or higher.
This is the Anderson–Hsiao estimator. Starting from an FD estimator,
they analyse how far back the error terms go in terms of lags, and use
instruments of higher order still. Even if we had reason to believe that the
error terms might be following an AR(1) process, we could still follow this
strategy, ‘backing off’ one period and using the third and fourth lags of Y
(presuming that the time series for each cross-sectional unit is long enough
to do so).
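To give a flavour of the estimator before you meet it in the notebooks, here is a sketch on simulated panel data; using ivreg() from the AER package for the 2SLS step is an assumption of this sketch, not necessarily how the module notebooks implement it.

library(AER)   # for ivreg()

set.seed(1)
n <- 200   # cross-sectional units
nT <- 10   # time periods
panel <- do.call(rbind, lapply(1:n, function(i) {
  f <- rnorm(1)                                # unobserved fixed effect
  x <- rnorm(nT)
  y <- numeric(nT)
  y[1] <- f + x[1] + rnorm(1)
  for (t in 2:nT) y[t] <- 0.5 * y[t - 1] + x[t] + f + rnorm(1)
  data.frame(id = i, t = 1:nT, y = y, x = x)
}))

# First differences and the lagged-level instrument, built within each individual
panel <- do.call(rbind, lapply(split(panel, panel$id), function(d) {
  d$dy      <- c(NA, diff(d$y))                # delta y_t
  d$dy_lag1 <- c(NA, head(d$dy, -1))           # delta y_{t-1} (endogenous regressor)
  d$dx      <- c(NA, diff(d$x))                # delta x_t
  d$y_lag2  <- c(NA, NA, head(d$y, -2))        # y_{t-2}, the Anderson-Hsiao instrument
  d
}))

# 2SLS: instrument delta y_{t-1} with y_{t-2} (the true values here are 0.5 and 1)
ah_fit <- ivreg(dy ~ dy_lag1 + dx | y_lag2 + dx, data = panel)
summary(ah_fit)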

7.1.2 Using R to obtain the Anderson–Hsiao estimator
In Notebook activity A2.9, we return to a dataset that was introduced in
Unit A1: PSID.

Notebook activity A2.9 Applying the Anderson–Hsiao estimator using R
In this notebook, you will practise applying the Anderson–Hsiao
estimator using psid.


7.2 Exploring the time dimension in panel data in the presence of omitted ability bias
In Subsections 4.3.1 and 4.3.3 of Unit A1, we saw how Blackburn and
Neumark (1993) addressed the issue of omitted variable bias due to
inherent ability in a wage regression. They used a cohort of the NLS panel
dataset, from its beginning in 1979 up until 1987 (nine waves).
The main purpose of their study was to test whether the sharp increase in
returns to education, and hence the increase in wage inequality between
education groups observed in the USA in the 1980s, could be due to a
change in the mean level of ability of workers in different education groups.
This is not an easy task when the main variable in their theory is not
easily measured! To recap:
• The basic econometric model explaining wages as a function of
education and ability was
log(wagei,t ) = β0 + β1 educi,t + β2 abilityi + ui,t .

• The authors aimed to overcome omitted ability bias by measuring it
explicitly with test scores; they used proxy variables (Subsection 4.3.1
of Unit A1).
• The authors were concerned with measurement error and worried that
measurement error of test scores was not independent of test scores or of
education. So they used instrumental variables to deal with
measurement error, using family background indicators as instruments
for test scores, education and for both of these explanatory factors
(Subsection 4.3.3 of Unit A1).
What we have not yet discussed is how the authors explore the way that
mean ability of individuals in different education groups was changing over
the observed period, and how they account for the fact they have
observations from different time periods. For this, we need to explore the
time dimension of panel data, and the techniques of this unit.
Table 3 from Blackburn and Neumark (1993) – which you saw as Table 13
in Unit A1, and is repeated next as Table 23 – presents the main
specifications used and their estimation results. We now explore the
modelling of time in their study.


Table 23 OLS and IV log wage equation estimates

Columns (1)–(2): OLS; (3)–(4): IV for test scores;
(5)–(6): IV for test scores and schooling; (7)–(8): IV for schooling

(1) (2) (3) (4) (5) (6) (7) (8)
Years of education 0.013 0.012 −0.000 −0.001 0.029 0.022 0.028 0.024
(0.008) (0.008) (0.013) (0.013) (0.045) (0.043) (0.021) (0.018)
Years of education × trend 0.0048 0.0048 0.0062 0.0057 0.0077 0.0062 0.0084 0.0070
(0.0017) (0.0017) (0.0023) (0.0018) (0.0047) (0.0042) (0.0029) (0.0030)
Academic test −0.010 −0.057 −0.110 −0.042
(0.017) (0.152) (0.155) (0.024)
Technical test 0.044
(0.013)
Computational test 0.041
(0.010)
Non-academic test 0.038 0.094 0.064 0.083 0.035 0.042 0.028
(0.006) (0.081) (0.020) (0.081) (0.044) (0.008) (0.009)

Wages tend to increase over time, often to counter the effects of rises of
prices of goods and services. Rather than using a deterministic trend,
Blackburn and Neumark (1993) explain that they use dummy variables
for time.
gradual increase, but instead will pick up any year-specific effects on
hourly wages at a time when the wage distribution seemed to be changing
rapidly. These time dummies pick up factors such as productivity levels,
the general price level in the economy, and other cyclical economic factors.
Time trend t takes the value 0 for 1979, and increases by one unit each
year. On top of that, the authors also use a variable which is the product
of years of education and a deterministic time trend, that is, they include
an interaction between years of education and the time trend.
Let’s write this augmented model as
log(wagei,t ) = β0 + β1 educi,t + β2 educit × t + β3 abilityi
+ β4 year1980 + β5 year1981 + β6 year1982
+ β7 year1983 + β8 year1984 + β9 year1985
+ β10 year1986 + β11 year1987 + ui,t ,
where the time dummies, yearYYYY, take the value 1 if the observation is
collected in that year, and 0 otherwise.
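A sketch of how such a specification might be set up in R is given below; the data frame nlsData and the variable names wage, educ, testScore and year are hypothetical stand-ins for the authors' NLS data, with testScore standing in for the ability proxy.

# Augmented wage equation with year dummies and an education-by-trend interaction
nlsData$trend <- nlsData$year - 1979        # deterministic trend, equal to 0 in 1979

wage_fit <- lm(log(wage) ~ educ + educ:trend + testScore + factor(year),
               data = nlsData)
summary(wage_fit)   # factor(year) supplies the time dummies; only the interaction with
                    # the trend enters, as a separate trend would be collinear with them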
The variable showing the interaction between education and a time trend
is crucial for the authors’ study. It shows how the change in wages for an
additional year of education varies over time during this period. The
positive and statistically significant coefficient estimate for this interaction
in all model specifications, and using all methods, suggests that returns to
schooling were increasing during this period, even accounting for ability
and for measurement error. The authors continue their exploration of the
data to tease out more about this increase, and with further tests – and


including further interactions between variables – conclude that the


increase in the return to education is due to shifting ability distributions in
each education group, but it has occurred only for workers with relatively
high levels of ‘academic’ ability. Their theory explaining increasing returns
to education through changes in ability was the one receiving most support
in their comprehensive study.
If you would like to read the original Blackburn and Neumark (1993)
article, you can access it via the module website.

Summary
In this unit, we have explored approaches to model both univariate time
series and relationships between time series variables. Some of the main
challenges with modelling and estimating time series models relate to
stationarity and lag dependence. You have seen that while visual
inspections of each time series through time plots and then through
autoregressive scatterplots and correlograms go a long way towards a
theoretical proposal for the data-generating process of the time series, they
fail to capture subtle differences between non-stationarity and the
existence of a trend or between deterministic and stochastic trends, or the
presence of structural breaks. More formal procedures are often required.
You have seen that modelling the lag dependence of a variable often relies
on starting with a higher lag order and then, through inspection of the
t-test on the highest-order term, deciding whether the lag order should be
revised. Once lag dependence is established, testing for stationarity of a
time series variable is important to make sure that only a stationary
transformation is used in modelling. To transform a non-stationary
variable into a stationary variable, first differencing will suffice if the type
of non-stationarity is integration of order 1 (I(1)).
Non-stationarity in the context of more than one time series variable offers
both challenges and opportunities. We discussed the case of spurious
regression of I(1) time series variables that are not cointegrated, and how
to test for cointegration. We also discussed the benefits of cointegration:
on one hand, the original cointegrated variables can be modelled in levels
as themselves since error terms are stationary; and on the other hand, an
error correction model can be analysed as well as the equilibrating
short-term forces attracting variables to a long-run equilibrium
relationship.
The final section revisited a data structure discussed in Unit A1 – panel
data – in order to introduce time and time series considerations to panel
data modelling and estimation. We also revisited a study from Unit A1 to
highlight how it modelled time.


As a reminder of what has been studied in Unit A2 and how the sections
in the unit link together, the route map is repeated below.

The Unit A2 route map

Section 1: Stock and flow variables
Section 2: Describing intertemporal properties of time series
Section 3: Random walks
Section 4: Stationarity and lagged dependence of a time series
Section 5: Testing for stationarity
Section 6: Modelling more than one time series variable
Section 7: Modelling time with panel data


Learning outcomes
After you have worked through this unit, you should be able to:
• apply time plots, autoregressive scatterplots and correlograms to the
exploration of time series data and as first steps towards modelling time
series variables
• define persistence and momentum and their role in modelling time series
data
• define the concept of stationarity and its implication for analysing time
series data
• define weak stationarity
• use logs, lags and differencing in modelling time series data
• model the lag dependence of a variable
• test for the presence of a unit root using ADF tests
• recognise spurious regressions and techniques for how to avoid
estimating them in applied work
• understand the concept of cointegration and its importance in modelling
I(1) series
• use the Engle–Granger test to test for cointegration between two I(1)
series
• estimate an error correction model (ECM) to analyse the long- and
short-term relationship between two I(1) series that are cointegrated
• simulate data-generation processes including white noise, random walk
and random walk with drift using R
• apply the Anderson–Hsiao estimator to panel data
• describe the main modelling and data choices made by economists in the
analysis of their economic problems.


References
Alvarez, J. and Arellano, M. (2003) ‘The time series and cross-section
asymptotics of dynamic panel data estimators’, Econometrica, 71(4),
pp. 1121–1159.
Anderson, T.W. and Hsiao, C. (1981) ‘Estimation of dynamic models with
error components’, Journal of the American Statistical Association,
76(375), pp. 598–606. doi:10.2307/2287517.
Blackburn, M.L. and Neumark, D. (1993) ‘Omitted-ability bias and the
increase in the return to schooling’, Journal of Labor Economics, 11(3),
pp. 521–544.
Cleveland, W.S. (1993) Visualizing data. Peterborough: Hobart Press.
Dickey, D.A. and Fuller, W.A. (1979) ‘Distribution of the estimators for
autoregressive time series with a unit root’, Journal of the American
Statistical Association, 74(366), pp. 427–431. doi:10.2307/2286348.
Enders, W. (2015) ‘COINT PPP.XLS’. Available at:
https://wenders.people.ua.edu/3rd-edition.html
(Accessed: 5 January 2023).
Granger, C.W.J. and Newbold, P. (1974) ‘Spurious regressions in
econometrics’, Journal of Econometrics, 2(2), pp. 111–120.
doi:10.1016/0304-4076(74)90034-7.
Gujarati, D. (2004) Basic econometrics, 4th edn. New York: McGraw Hill.
ICE Futures U.S. (no date) ‘Coffee C Futures’. Available at:
https://www.theice.com/products/15/Coffee-C-
Futures/data?marketId=6244298&span=3 (Accessed: 1 April 2021).
Institute for Government (2021) ‘Timeline of UK coronavirus lockdowns,
March 2020 to March 2021’. Available at:
https://www.instituteforgovernment.org.uk/sites/default/files/timeline-
lockdown-web.pdf (Accessed: 10 December 2022).
International Coffee Organization (no date) ‘Historical data on the global
coffee trade’. Available at: https://www.ico.org/coffee prices.asp
(Accessed: 1 April 2021).
Klein, J.L. (1997) Statistical visions in time. A history of time series
analysis 1662–1938. Cambridge: Cambridge University Press.
Leamer, E.E. (2010) Macroeconomic patterns and stories. Berlin:
Springer-Verlag.
MacKinnon, J.G. (2010) Critical values for cointegration tests, Queen’s
Economics Department Working Paper No. 1227, Queen’s University,
Kingston, Ontario, Canada. Available at: https://www.researchgate.net/
publication/4804830 Critical Values for Cointegration Tests
(Accessed: 6 January 2023).


Office for National Statistics (no date) ‘Seasonal adjustment’. Available at:
https://www.ons.gov.uk/methodology/methodologytopicsandstatistical
concepts/seasonaladjustment (Accessed: 22 December 2022).
Office for National Statistics (2021a) ‘Unemployment rate (aged 16 and
over, seasonally adjusted): %’, release date 23 February 2021.
Available at: https://www.ons.gov.uk/employmentandlabourmarket/
peoplenotinwork/unemployment/timeseries/mgsx/lms/previous
(Accessed: 13 December 2022).
Office for National Statistics (2021b) ‘Gross Domestic Product at market
prices: Current price: Seasonally adjusted £m’, release date 12 February
2021. Available at: https://www.ons.gov.uk/economy/
grossdomesticproductgdp/timeseries/ybha/pn2/previous
(Accessed: 13 December 2022).
Office for National Statistics (2021c) ‘Gross Domestic Product: Quarter on
Quarter growth: CVM SA %’, release date 12 February 2021. Available at:
https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/
ihyq/pn2/previous (Accessed: 13 December 2022).
Pólya, G. (1968) Patterns of plausible inference (Volume II of Mathematics
and plausible reasoning), 2nd edn. Princeton, NJ: Princeton University
Press.
Santos, C. and Wuyts, M. (2010) ‘Economics, recession and crisis’. In:
DD209 Running the economy. Milton Keynes: The Open University.
Slutzky, E.E. (1937) ‘The summation of random causes as the source of
cyclic processes’, Econometrica, 5(2), pp. 105–146. doi:10.2307/1907241.
Tooze, A. (2021) Shutdown: How Covid shook the world’s economy.
London: Penguin.
WHO (2021) ‘Listing of WHO’s response to COVID-19’. Available at:
https://www.who.int/news/item/29-06-2020-covidtimeline (Accessed:
10 December 2022).


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 2.1, closing down sale: bunhill / Getty
Subsection 2.4, Closed businesses during COVID: shaunl / Getty
Subsection 3.2, Camino del Norte: José Antonio Gil Martínez / Flickr.
This file is licensed under Creative Commons-by-2.0.
https://creativecommons.org/licenses/by/2.0/
Figure 11(a): Heritage Image Partnership Ltd / Alamy Stock Photo
Figure 11(b): Commission Air / Alamy Stock Photo
Subsection 6.2.2, burger: Bennyartist / Shutterstock
Subsection 6.4, coffee beans: Helen Camacaro / Getty
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
(a) Stock. Your bank balance is always given for a particular point in
time (for example, when you last checked your balance).
(b) Flow. Your food expenses relate to a given period of time (for
example, daily, weekly or monthly).
(c) Flow. Miles travelled is measured over a period of time (in this case,
daily).
(d) Stock. The number of unemployed people is measured at a particular
point in time.
(e) Flow. Infant mortality is measured as the number of infant deaths per
thousand of live births during a given period of time (usually, one
year).
(f) Flow. The rate of inflation of consumer prices is effectively the
relative change of consumer prices in a country. It is the change of a
stock variable and is defined over a period of time – say, month,
quarter or year.

Solution to Activity 2
The property of persistence is not immediately obvious from looking at
Figure 1. However, the fact that observations that are close together in
time also tend to be close together in the value of their rates of
unemployment suggests this is a persistent time series variable.
There is no overall pattern of momentum in this time series for the period
as a whole. But the data clearly display a number of episodes (a succession
of fairly long stretches in terms of several quarters in a row) that move
either upwards or downwards.

Solution to Activity 3
(a) Autoregressive scatterplots show high correlation between a variable
and its lag when the plot is a cloud closely clustered around the
45-degree line. We can see that the greater the lag, the larger the
spread around the line, and so the lower the correlation between
unemployment and its lagged counterpart. In particular, Figure 2(a)
(plotting lag 1) shows very high linear correlation.
(b) In Figure 2, even at lag 4 there is a clear correlation between the two
variables. So the rate of unemployment is a highly persistent time
series.


Solution to Activity 4
(a) The slope coefficient gives the first-order autocorrelation for the rate
of unemployment, which has been estimated to be 0.991. This is very
close to 1. An autocorrelation of 1 represents a perfect
autocorrelation.
(b) The p-value associated with this slope is very small, so there is strong
evidence that the population autocorrelation coefficient is not zero.

Solution to Activity 5
Autocorrelations decline as the lag increases, but do so only very slowly. It
is only when the gap is 22 lags or more (corresponding to five and a half
years or more) that the unemployment rate between two time points is
uncorrelated. While the time plot and the autoregressive scatterplots
already suggested a high degree of persistence of the rate of unemployment,
the correlogram makes it easier to visualise how far back one needs to go
to fully pick up the time dependence between time series values.
This time dependence and persistence suggest that, if you hear, for
example, that last quarter’s unemployment rate jumped up by
0.5 percentage points, that’s a really important piece of news. The reason
is that this jump in unemployment is not going away any time soon.

Solution to Activity 6
This time plot lies practically flat with no discernible long-term trend. In
the long period before 2020, the time plot shows that the earlier decades
were characterised by greater volatility in the growth rates with larger
spikes up and down a flat long-term trend. (At the time, particularly
during the 1960s and 1970s, the UK economy was often typified by its
stop–go pattern of economic growth. This was the period of fixed exchange
rates and the resulting pattern of growth was strongly influenced by the
tension between the explicit policy objectives to maintain full employment,
on the one hand, while at the same time protecting the pound from
devaluation.)
The most startling feature of the time plot, however, is the exceptionally
large fluctuations in the quarterly growth rate during 2020, particularly in
the second and third quarters – as can be seen in the table below.

Year Quarter GDP growth rate (%)


2020 1 −2.9
2020 2 −19.0
2020 3 16.1
2020 4 1.0


These growth rates and the output compression in the 2nd quarter of 2020
in particular were truly exceptional and hadn’t been witnessed since the
depression of the 1930s (Santos and Wuyts, 2010). The financial crisis
of 2008 looks like a small disturbance in comparison.

Solution to Activity 7
No, they do not show the same pattern. In particular, the unemployment
rate changed in one direction only (i.e. up), whereas the GDP growth rate
decreased then increased dramatically. Whilst unemployment went up, this
rise was not exceptional. In contrast, the exceptional fluctuations in
quarterly growth rate have already been noted in the Solution to
Activity 6.

Solution to Activity 8
Although the data-generating process is the same for each of these random
walks, the resulting plot shows a wide variety of different trajectories. This
is a typical display of a random walk.
The random walks move up and down (at times crossing the zero line once
or more) and several random walks show distinctive wave-like patterns.
The scatter of trajectories shows a clear heteroskedastic pattern: as time
moves on, so does the spread of the scatter. (Recall from Subsection 5.3 of
Unit A1 that ‘heteroskedasticity’ is another term for ‘non-constant
variance’.)

Solution to Activity 9
(a) Let Yt denote the distance travelled in km by a hiker from day 1 up
to t, where t is the cumulative number of days that the walker has
been on the path, for t = 1, . . . , 37.
From the assumptions above, the drift component is d = 22.5 since
the average daily distance covered will be 5 km/hour × 4.5 hours, and
Y0 = 0 because all the hikers start by having covered 0 km.
This gives
Yt = Yt−1 + 22.5 + εt ,
where the error terms εt are i.i.d. with distribution N (0, 100).
(b) A random walk with drift clearly differs markedly from a simple
random walk. The graph of distances reached each day varies around
a linear trajectory that depicts the progress of the average hiker
walking consistently at 22.5 km per day.
Similarly to a random walk without drift, the graph of trajectories
shows a clear heteroskedastic pattern: as time moves on, so does the
spread of the scatter.


Solution to Activity 10
(a) The estimated regression equations and residual standard errors are
as follows.
• Model 1:
unemploymentRatet = 0.0065 + unemploymentRatet−1 .
The residual standard error is 0.267.
• Model 2:
unemploymentRatet = 7.671 − 0.008t.
The residual standard error is 2.318.
• Model 3:
unemploymentRatet = 0.0804 − 0.0007t
+ unemploymentRatet−1 .
The residual standard error is 0.264.
In Model 3, the intercept was statistically significant, as was the
deterministic trend. Despite Model 3 being more complex, the
residual standard errors of Models 1 and 3 are practically the same.
In Model 1, the estimate for the intercept is not statistically
significant as it is smaller than its standard error, suggesting that
perhaps the drift term is not needed. Regression results of Model 2
suggest the simple model with a deterministic trend is a much poorer
fit for the data. Its slope coefficient is negative, as you would expect
from looking at Figure 1, given the very high rates of unemployment
that prevailed during the 1980s.
Given what we know about unemployment and the socioeconomic
context when this variable was recorded, it seems – and results
confirm – that a deterministic trend model is not a good idea because
an additional trend variable fails to capture potential autoregressive
behaviour inherent in its time trajectory.

Solution to Activity 11
(a) In Figure 4, quarterly GDP is clearly trending. So, according to the
mean property of weakly stationary time series variables, the time
plot suggests GDP is non-stationary.
(b) Quarterly GDP growth rate does not appear to be trending. In
Figure 7, quarterly GDP growth rate from 1955 Q2 to 2019 Q4
displays a variance that is not constant over time; it was markedly
larger in the earlier decades than in the later decades of the overall
period. This suggests that quarterly GDP growth rate is
non-stationary.


Solution to Activity 12
By leaving out Y_{t−3} we would be modelling Y_t as
Y_t = Y_0 + α_1 Y_{t−1} + α_2 Y_{t−2} + v_t,
where the error term is now v_t = u_t + α_3 Y_{t−3}.
Both regressors would be correlated with this error term, since Y_{t−1} can be
written as
Y_{t−1} = Y_0 + α_1 Y_{t−2} + α_2 Y_{t−3} + α_3 Y_{t−4} + u_{t−1}
and Y_{t−2} can be written as
Y_{t−2} = Y_0 + α_1 Y_{t−3} + α_2 Y_{t−4} + α_3 Y_{t−5} + u_{t−2}.
This violates the main OLS assumption of exogeneity discussed in Box 8 of
Unit A1 (Subsection 2.4).

Solution to Activity 13
(a) As you saw in Box 19, Y_t has a unit root when ρ = 1, and is
stationary when |ρ| < 1. Since δ = ρ − 1, a suitable null hypothesis is
H_0: δ = 0, which corresponds to ρ = 1.
(b) When Y_t = d + ρ Y_{t−1} + ε_t, subtracting Y_{t−1} from both sides gives
∆Y_t = Y_t − Y_{t−1} = d + (ρ − 1) Y_{t−1} + ε_t = d + δ Y_{t−1} + ε_t.
This reduces to ∆Y_t = d + ε_t when δ = 0.
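A minimal R sketch of this regression is given below (illustrative only, not the module's code). It assumes y is a numeric vector holding the series in time order and that no lagged differences are needed.

# Sketch: the Dickey-Fuller regression Delta Y_t = d + delta * Y_{t-1} + e_t.
dy    <- diff(y)                 # Delta Y_t, for t = 2, ..., n
y_lag <- y[-length(y)]           # Y_{t-1}
df_reg <- lm(dy ~ y_lag)
summary(df_reg)$coefficients["y_lag", "t value"]
# The t-statistic on y_lag is compared with Dickey-Fuller critical values
# (not the usual t distribution) to test H0: delta = 0 (a unit root).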

Solution to Activity 14
The plot in Figure 14(a) shows that log(GDP) has a positive trend over
time. Whether the trend is stochastic or deterministic cannot be
established from the graph, but either way it means that log(GDP) is not
stationary.
The first difference of log(GDP) plotted in Figure 14(b) appears to have a
relatively stable mean, except for the bump corresponding to the period of
economic turmoil in the 1970s. It is not obvious whether the variance
changes with time. This means that the first difference of log(GDP) could
be stationary and hence that the log(GDP) could be difference
stationary I(1).
Figure 14(c) indicates that there is strong persistence in the log(GDP)
series. The autocorrelation series declines only very slowly as the lag
increases.
In Figure 14(d), the ACF falls rapidly with the first lag and remains low
with subsequent lags. It looks a bit like the ACF for the white noise
process, which is a process that we know is stationary (Example 7,
Subsection 4.2); but in the case of a white noise process, the ACF would
stay zero for subsequent lags.


Solution to Activity 15
Both series exhibit upward trends, but it is not obvious if the trend is
stochastic (as in a random walk with drift) or deterministic. It is clear that
both series are non-stationary, since the mean increases over time. The two
series appear to trend together over time.

Solution to Activity 16
In the Solution to Activity 15, it was noted that neither GDP nor PCE
look stationary, so they are not I(0).
The correlograms for GDP and PCE indicate there is strong persistence in
both series and hence dependence on many lags.
The correlograms for the first differences of GDP and PCE suggest that
possibly the autocorrelations after order 1 could be assumed to be 0. So
both first differences may have a lag dependence of order 1. As with the
correlograms of the first difference of log(GDP) given in Activity 14
(Subsection 5.3), a correlogram like this is a bit like the ACF for the white
noise process. So they suggest that both first differences might be
stationary. This means that both GDP and PCE might be I(1).
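Correlograms like those described here can be produced with acf(); a minimal sketch, assuming gdp and pce are available as time series objects (names illustrative):

# Sketch: correlograms for the levels and the first differences.
par(mfrow = c(2, 2))
acf(gdp, main = "gdp")
acf(pce, main = "pce")
acf(diff(gdp), main = "first difference of gdp")
acf(diff(pce), main = "first difference of pce")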

Solution to Activity 17
(a) For the gdp series, the ADF test statistic is −0.55. For the pce series,
the ADF test statistic is −0.37.
(b) In both cases, the test statistic is more than the 10% critical value.
(Or equivalently, the magnitude of the test statistic is less than the
magnitude of the 10% critical value.) We therefore cannot reject the
null hypothesis in either case. So we conclude that both series are
non-stationary, contain a unit root and, hence, that both variables are
I(1) with drift.

Solution to Activity 18
(a) For the first difference of gdp, ∆ gdp, the ADF test statistic was
calculated to be −3.61.
The ADF test statistic for the first difference of pce, ∆ pce, was
found to be −2.93.


(b) In both cases, the test statistic is larger in magnitude than the 1%
critical value of −2.60, so we can reject the null hypothesis that the
series contains a unit root and conclude it is therefore stationary, I(0).
Since both gdp and pce can be transformed into a stationary series by
taking first differences, we can conclude that gdp and pce are
difference stationary or integrated of order 1, I(1).

Solution to Activity 19
The residuals appear to fluctuate around a stable mean, which suggests
that the correct specification for the ADF test is that of a simple random
walk; that is,
∆û_t ∼ û_{t−1} + ∆û_{t−1}.

Solution to Activity 20
The calculated test statistic is −3.11, the t-statistic for û_{t−1} in the
regression, which is larger in magnitude than the 1% critical value of −2.6.
We therefore reject the null hypothesis that the series is non-stationary in
favour of the alternative hypothesis that the series is stationary with zero
mean. Since the residuals are found to be I(0), we can conclude that the
two series are cointegrated based on the Engle–Granger test.
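For reference, a minimal R sketch of the Engle–Granger steps (illustrative only; gdp and pce are assumed to be numeric vectors in time order): first estimate the long-run relation by OLS, then run the ADF-type regression from Activity 19 on its residuals, comparing the t-statistic on the lagged residual with Engle–Granger critical values.

# Sketch: Engle-Granger test for cointegration between pce and gdp.
longrun <- lm(pce ~ gdp)            # long-run (cointegrating) regression
u_hat   <- residuals(longrun)
n   <- length(u_hat)
d_u <- diff(u_hat)                  # Delta u_t, for t = 2, ..., n
y   <- d_u[-1]                      # Delta u_t, for t = 3, ..., n
x1  <- u_hat[2:(n - 1)]             # u_{t-1}
x2  <- d_u[-(n - 1)]                # Delta u_{t-1}
eg_reg <- lm(y ~ 0 + x1 + x2)       # no intercept: residuals have mean zero
summary(eg_reg)$coefficients["x1", "t value"]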

Unit B1
Cluster analysis

Welcome to Strand B
Strand B of M348 is the data science strand, consisting of Units B1
and B2. In this strand, we will be looking at data analysis from a data
science point of view. (Remember that you should study either Strand A
or Strand B. See the module guide for further details.)
So far, the emphasis in this module has been on regression.
• In Book 1, you have learnt about the linear models – regression models
where the response variable is assumed to follow a normal distribution,
and the mean depends on a number of explanatory variables. These
explanatory variables might be covariates or factors or a combination of
the two.
• Book 2 then moved on to considering generalised linear models; that is,
models where the response variable could follow one of a number of
other distributions – in particular, the binomial distribution and the
Poisson distribution.
Whilst you have been learning how to use R to fit such models, little has
been said about what calculations R is actually doing to come up with
estimates for all of the parameters. The implicit presumption is that it
does not matter much; that is, the computing power that R is able to draw
on is sufficient for the estimates to be provided quickly enough, and the
computations will end up with the ‘right’ estimates being reported. For
example, in multiple regression, the estimates will indeed be the least
squares estimates and in generalised linear regression, the estimates will
indeed be maximum likelihood estimates. In this strand, we will move
away from the fitting of such models to consider situations where
computational considerations become more important.
In Unit B1, we will consider cluster analysis. This is a technique that aims
to find groups – clusters – in data. The approaches that we will consider
will not be defined by explicit models. Instead, they emerge from thinking
about different ways of searching for groups in data: for example, by
repeatedly merging observations which are closest together, or by placing
observations in groups according to their proximity to a number of group
centres. The approach chosen impacts on the clusters that are found.
In Unit B2, we will focus on the data, in particular ‘big data’. As you will
see in this unit, just handling ‘big data’ brings its own challenges. The
number of observations can go into the millions, and fresh observations
might accumulate rapidly. Also, the data might not fit into the neat
structure of a data frame. All this means that even doing relatively simple
data analytical tasks, such as computing a mean, brings challenges. For
example, the dataset might be so vast that it is beyond the storage
capacity of a single computer; and without careful thought about how the
calculation is done, the time required for the computation might be far too
long. The unit does not just show how such data can be handled, but also
what uses such data should – or, more importantly, should not – be put to.


Data need to be gathered and used only in an ethically responsible way,
whether that is as a primary dataset or a secondary one. Otherwise, as you
will see, much harm can be done to individuals, and to society as a whole.

A note on notebooks
Make sure you have all the relevant files installed for the notebook
activities in this strand. Check the module website for any additional
instructions.

Introduction to Unit B1
In this first unit specifically about data science, we will consider cluster
analysis; that is, identifying groups – clusters – in data. Cluster analysis
is one of a number of techniques, such as discrimination, principal
component analysis, factor analysis and segmentation analysis, that are
designed to be used when the data consist of many variables.
The key feature in cluster analysis is that at the outset little is assumed
about the clusters. That is, initially there are no examples of what
members of each of the clusters look like – and often it’s not known how
many clusters there are. Knowledge about the clusters is gained only from
the data themselves. As such, cluster analysis can be seen as an
exploratory data analysis tool which is used to suggest structure in the
data. This knowledge about the structure in the data can then be used to
add insight and/or allow simplification. For example, in the analysis of a
survey, cluster analysis is used to group respondents. Typical responses
given in each of these groups are then used to discover how these groups
differ, as Example 1 will demonstrate. Replacement of observations in each
of the groups by a single typical observation can lead to much
simplification, as Example 2 will demonstrate. This means that the fewer
clusters that adequately represent the structure, the better.
Finding groups in data is useful in many situations, so it is not surprising
to find cluster analysis used across a wide range of disciplines, such as in
food chemistry (Qian et al., 2021), social welfare (Tonelli, Drobnič and
Huinink, 2021), tourism (Höpken et al., 2020) and archaeology (Jordanova
et al., 2020).


Example 1 Attitudes to the EU


In 2017, a Chatham House–Kantar survey explored attitudes to the
European Union (EU) held by over 10 000 European citizens,
including some living in the UK. Using cluster analysis, Raines,
Goodwin and Cutts (2017) identified six political ‘tribes’ (presented
here in size order, from largest to smallest):
• ‘Hesitant Europeans’
• ‘Contented Europeans’
• ‘EU Rejecters’
• ‘Frustrated Pro-Europeans’
• ‘Austerity Rebels’
• ‘Federalists’.
As Raines, Goodwin and Cutts describe in their report:
These tribes differ in terms of their members’ social and
demographic characteristics and attitudes towards a wide range
of issues, including European integration, immigration and
political responsiveness.
(Raines, Goodwin and Cutts, 2017, p. 4)

For instance, two of the clusters are described as follows.


‘Hesitant Europeans’ . . . sit in the middle on many issues, and
need persuading on the merits of the EU. They tend to be
apathetic about politics, are concerned about immigration and
tend to prioritize national sovereignty over deeper EU
integration.
‘Contented Europeans’ are optimistic and pro-European. Often
young and broadly socially liberal, they feel that they benefit
from the EU but tend to favour the status quo over further
integration.
(Raines, Goodwin and Cutts, 2017, p. 1)

These groups were identified on the basis of the responses by those


EU citizens who completed the questionnaires.


Example 2 Simplifying an image


Figure 1(a) is a digital image of some beads. This image consists of a
grid of 640 × 587 pixels. The colour of each pixel can be represented
as a combination of three values: the amount of red, green and blue.
In the case of Figure 1(a), the amount of red, green and blue is given
on an integer scale between 0 and 255. As such, there is a range of
more than 10 000 000 colours that could occur in the image. However,
many of the colours are very similar.
The image can be simplified by clustering pixels with similar colours.
Then the colours of all the pixels in a single cluster can be
approximated by just one colour. For example, Figure 1(b) is a
simplified version of Figure 1(a) created using just ten different
colours.


Figure 1 Images of a bead necklace (a) original, and (b) produced using
just ten colours after application of a clustering algorithm

Such a reduction in the number of colours allows adjacent pixels that


have the same colour to be easily identified and hence assists
identifying parts of the image which correspond to beads and those
which correspond to the background.

In Section 1, we describe in more detail what we mean by a cluster and see


different ways in which the closeness of observations can be defined. In
Section 2, you will be introduced to some methods that can be used to
assess how good an allocation of observations to clusters is. Then
Sections 3 to 5 will consider different approaches to finding clusters in
data. You will discover that they all have different merits, and hence which
approach is best depends on both the data and what sort of characteristics

the clusters are assumed to have. Finally, in Section 6, the approaches you
have been learning about will be compared.
The structure of the unit is illustrated in the following route map.

The Unit B1 route map

Section 1
Clusters in data

Section 2
Assessing clusters

Section 3 Section 4 Section 5


Hierarchical Partitional Density-based
clustering clustering clustering

Section 6
Comparing
clustering methods

Note that you will need to switch between the written unit and your
computer for Subsections 3.4, 4.6 and 5.4.

1 Clusters in data
Cluster analysis is designed to identify groups within a dataset. It applies
in situations where very little might be known about the groups
beforehand. Often, the number of groups is not known in advance, let
alone what any of the elements in a group look like.
In learning about cluster analysis, the first step is to think about what a
cluster in some data actually is. A general, although vague, definition of a
cluster is given in Box 1.

Box 1 A definition of a cluster


A group of observations in a dataset form a cluster when these
observations are relatively similar to each other and relatively
different to observations that are not in the cluster.


So, a cluster represents a group of observations that are relatively similar
to each other and distinct from the other observations that do not belong
to the cluster. In Subsection 1.1, we give you some practice at spotting
clusters graphically.
There are many ways in which the similarity, or difference, between two
observations can be measured mathematically. In Subsection 1.2, you
will learn about a few of these methods.

1.1 Spotting clusters


In this subsection, you will be asked to use your own judgement to spot
clusters in data. That is, you will apply the definition given in Box 1 in an
informal way. The aim will be for you to develop an appreciation of the
complexity of the task that cluster analysis represents.
To do so, we will introduce four new datasets. To begin with, a dataset
relating to the occupancy of a car park, which is described next.

Occupancy of a car park


The numbers of cars parked in various Birmingham car parks in the
autumn of 2016 were recorded by a car park company.
The parking dataset (parking)
The data given in this dataset relate to just one of these car parks –
Broad Street Car Park – around noon. Data for 73 days are given. The
variables in this dataset are:
• calendarDate: the date to which the observation refers
• day: the day of the week
• occupancy: the number of cars parked.
(The Bullring is the biggest car park in Birmingham city centre, with
1195 spaces.)
The first six observations from the parking dataset are given in
Table 1.
Table 1 The first six observations from parking

calendarDate day occupancy


04-Oct-16 Tuesday 677
05-Oct-16 Wednesday 653
06-Oct-16 Thursday 673
07-Oct-16 Friday 545
08-Oct-16 Saturday 126
09-Oct-16 Sunday 108

Source: Stolfi, Alba and Yao, 2017

In Example 3, we will use a histogram to spot clusters in the parking


dataset.


Example 3 Occupancy in a car park


The occupancies given in the parking dataset are displayed in
Figure 2.

Figure 2 Occupancy of Broad Street Car Park around noon (a histogram of
occupancy, with frequency on the vertical axis)
As the histogram shows, there were some days when there were more than
450 cars in the car park. (This is out of a total of 690 spaces.) On
other days, there were fewer than 250 cars in the car park. However,
there were no days on which the number of cars was between 250
and 450. So these data appear to have two clusters. One cluster
corresponds to days when the occupancy was fewer than 250 cars.
The other cluster corresponds to days when the occupancy was more
than 450 cars.
As you can see in Table 2, it turns out that all the days when the
occupancy was high were weekdays, and all the days when the
occupancy was low were at the weekend. So the cluster analysis is
revealing a split between weekday/weekend use of the car park.
Table 2 Occupancy of Broad Street Car Park on different days of the week

Occupancy Mon Tue Wed Thu Fri Sat Sun


> 450 cars 11 10 11 11 10 0 0
< 250 cars 0 0 0 0 0 10 10


In Example 4, we will use a scatterplot to spot clusters in the Old Faithful


geyser dataset; these data are introduced next.

Eruptions of the Old Faithful geyser


The Old Faithful geyser in Yellowstone National Park, USA, is
renowned for regular and frequent erupting. The data in the Old
Faithful geyser dataset were collected from almost 300 eruptions in
August 1978.
The Old Faithful geyser dataset (faithful)

The variables in this dataset include:
• eruptions: the duration of the eruption (in minutes)
• waiting: the waiting time until the next eruption (in minutes).
(People in Yellowstone Park watch the Old Faithful geyser erupt every
90 minutes.)
The first six observations from the Old Faithful geyser dataset are
given in Table 3.
Table 3 The first six observations from faithful

eruptions waiting
3.600 79
1.800 54
3.333 74
2.283 62
4.533 85
2.883 55

Source: Azzalini and Bowman, 1990

Example 4 Are eruptions clustered?


Figure 3 shows a scatterplot of the data in the Old Faithful geyser
dataset.
The scatterplot suggests that the data appear to be in clusters – two,
in fact. There is one cluster of eruptions where both the durations
and the waiting times are relatively short. The other cluster of
eruptions corresponds to those where the duration was relatively long
and the waiting time until the next eruption was relatively long too.
Note that there are a few eruptions where it is not clear to which, if
either, of the two clusters they belong. These eruptions with a
duration of about 3 to 4 minutes and a waiting time of about 60 to
70 minutes lie somewhere between the two clusters. As you will see
later in this unit, there are methods that help allocate intermediate
points such as these to the most appropriate cluster. However, as with
much of exploratory statistics, there is not a definitive answer.


Figure 3 Durations of, and waiting times between, eruptions (a scatterplot
of waiting time in minutes against duration in minutes)

In Activity 1, you will try using a histogram to spot clusters in a different


situation.

Activity 1 Spotting clusters of grey pixels

Example 2 introduced a colour image of a necklace of beads. An


alternative version of Figure 1(a) can be produced using greyscale. For
such a version, each pixel is represented by one value that corresponds to
how grey it is. (The higher the value, the lighter the pixel.)
A histogram of the greyness of the pixels is given in Figure 4.
Based on Figure 4, how many different clusters of greyness values do you
think there might be?


Figure 4 Greyness of pixels in a greyscale version of Figure 1(a) (a histogram
of greyness, on a scale from 0.0 to 1.0, with frequency on the vertical axis)

In Activity 1, you were asked to find clusters in a greyscale version of


Figure 1(a) using a histogram. We will now introduce a dataset
corresponding to a sample of the pixels used in the full-colour version of
Figure 1(a). Then, in Activity 2, you will use a scatterplot matrix to spot
clusters in this sample of pixels from the full-colour version.

Quantification of a digital image


Digital images consist of grids of pixels. Mathematically, the colour of
pixels can be represented as a vector. (Remember that a vector is an
ordered list of numbers. Generally, vectors are denoted by bold
lower-case letters, such as x or y.)
There are different ways in which this can be done. In this unit we
will use the convention that a colour is defined by a vector of length 3:
the first element is the amount of red, the second element is the
amount of green and the third element is the amount of blue. All
these elements are integers between 0 and 255. Some examples of
vectors and their corresponding colours are given in Figure 5.


(255, 0, 0) (0, 255, 0) (0, 0, 255)

(0, 0, 0) (128, 128, 128) (255, 255, 255)

Figure 5 Examples of vectors of colours

The beads dataset (beads)


The data in this dataset correspond to pixels from the digital image of
some beads given in Figure 1(a). The entire image consists of
640 × 587 pixels. This dataset is just a sample of 250 of these pixels.
The variables in the data include:
• red: the intensity of red
• green: the intensity of green
• blue: the intensity of blue.
All three intensities are given as integers between 0 and 255. The first
six observations from the beads dataset are given in Table 4.
Table 4 The first six observations from beads

red green blue


208 198 196
181 181 179
127 126 121
174 167 174
185 179 183
164 172 175


Activity 2 Spotting clusters of coloured pixels

In Figure 6, a sample of the pixels in the colour version of Figure 1(a)


given in beads is depicted using a scatterplot matrix.

Figure 6 Redness, greenness and blueness of a sample of pixels in Figure 1(a)
(a scatterplot matrix of the variables red, green and blue)

Based on Figure 6, how many different clusters of colours do you think


there might be?

As you have seen in Activities 1 and 2, it is possible for clusters to have


different shapes and sizes even with the same dataset.
Clusters are often assumed to have a convex (oval) shape. That is, in any
given direction out from the centre of the cluster, if a point is deemed to
be within the cluster, all points closer to the centre are also in the cluster.
However, this does not always have to be the case.
Within a dataset, clusters may not have the same shape and they may not
even have the same size. Equally, there is no reason why clusters should
contain roughly the same number of points. It may be more appropriate to
have some of the clusters containing many points, and others with very few
points, such as was suggested in the Solution to Activity 2.
For some observations in a dataset, it might be more appropriate to regard
them as outliers. That is, the observation does not belong to any cluster.
Alternatively, and equivalently in practice, each of such observations might
be deemed to belong to a cluster of size 1 (itself!).
So far in Examples 3 and 4, and in Activities 1 and 2, only datasets with
one, two and three variables have been considered. Furthermore, for ease
of presentation in this unit, we will continue to work with datasets that
have no more than three variables.
In Activities 1 and 2, the plots enabled you to use your judgement to

determine what clusters there are in a dataset. With more variables,


matrix scatterplots can still be used to plot the data, such as was done in
Activity 2. But this quickly gets impractical as the number of variables
increases and datasets may contain many more variables. It is possible to
overcome this by using other statistical techniques to generate
approximations of the data that can be more easily plotted. One such
situation is given in Example 5.

Example 5 Visualising the composition of a liquor sample


It can be important to know the difference between a genuine product
and one that is counterfeit or has been tampered with. This is where
cluster analysis can help. Identified clusters can highlight what
characteristics samples from genuine products have, thus giving a
basis for spotting when the profile for a suspect sample does not fit.
A particular type of Chinese (Luzhou-flavour) liquor was investigated
by a group of researchers. They obtained a total of 40 samples from
such liquor spanning 13 brands. (All the samples were obtained from
a reliable source.) The flavour profile of each sample was determined
from its chemical composition, leading to each sample being described
by many variables. They produced a two-dimensional approximation
of the data, which is given in Figure 7.

Figure 7 The plot of the data from Qian et al., 2021 (a two-dimensional
approximation of the samples, with axes given by the first two principal
components, PC1 and PC2 (23.2%); points are labelled by brand)


On this plot, the brands are distinguished by plotting symbols and


the regions in which points from a brand sit are highlighted. Notice
that many of the groups overlap. This could be because the groups in
the data really do overlap. Alternatively, it could simply be because
of the approximation that has been made.

Example 5 involved data with many variables. A simplified version of these


data are introduced below. (We will return to these data later in the unit.)

Composition of a type of liquor


In Example 5 the flavour profiles of samples of Luzhou-flavour liquor
were considered. This was done by measuring the concentration of
different chemical compounds.
We will consider a subset of the data collected by the researchers,
giving information about just two of the chemical compounds they
measured in each sample: ethyl acetate (an apple flavour) and ethyl
lactate (a grassy flavour).
The Chinese liquor dataset (liquor)
The variables in this dataset are:
• sample: an identifier for the sample
• brand: an identifier for the brand
• ethylAcetate: concentration of ethyl acetate in milligrams per litre
• ethylLactate: concentration of ethyl lactate in milligrams per litre.
The first six observations from the Chinese liquor dataset are given in
Table 5.
Table 5 The first six observations from liquor

sample brand ethylAcetate ethylLactate


GJ1573 LZLJ 756.8 512.2
JL90 LZLJ 804.6 550.0
JL30 LZLJ 760.5 523.1
TQ LZLJ 703.9 485.3
WLY-1 WLY 849.4 616.2
WLY-2 WLY 764.4 496.5

Source: Qian et al., 2021

Using plots to determine clusters is a subjective approach. A more
objective approach is needed – one in which the closeness of
observations can be quantified. So in the following subsection you will
explore how the closeness of points can be defined mathematically.


1.2 Measuring the closeness of data points


The definition of a cluster given in Box 1 assumes that it is known what it
means for a data point to be close to another data point. However, as you
will discover in this subsection, ‘closeness’ is something that depends on
the context in which the data arose. Box 2 gives two approaches to
thinking about closeness: similarity and dissimilarity.

Box 2 Similarities and dissimilarities


In statistics, the following two types of function are used to capture
the difference between two observations:
• Similarity measures indicate how close, or similar, two
observations are to each other. Generally, the larger the value of a
similarity measure, the more similar the observations are.
• Dissimilarity measures indicate how far apart, or dissimilar, two
observations are from each other. Generally, the smaller the value
of a dissimilarity measure, the more similar the observations are.
The minimum value of a dissimilarity measure is usually defined to
be 0. When two observations are identical, the dissimilarity
measure has this minimum value, 0.

In this unit, we will concentrate on dissimilarity measures. The following


examples, Examples 6 to 9, illustrate different ways in which such
measures can be applied.

Example 6 Dissimilarity between pixels in a greyscale


image
Activity 1 introduced a dataset in which observations correspond to
the greyness of pixels making up an image.
For these data, one reasonable dissimilarity measure between two
pixels is the (absolute) difference between the amount of greyness
attributed to the two pixels. So, if the greyness of one pixel is x and
the greyness for the other pixel is y, the dissimilarity between the two
pixels is d(x, y) = |x − y|.
(Remember that |x| means the modulus, or absolute value, of x.)
This dissimilarity measure corresponds to the distance between the
two observations when plotted on a number line.


Example 7 Dissimilarity between pixels in a colour image


Example 2 introduced a dataset in which observations correspond to
the colour of pixels. One way of measuring the difference, d(x, y), in
colour for two pixels given by x = (xr , xg , xb ) and y = (yr , yg , yb ) is
d(x, y) = \sqrt{(x_r − y_r)^2 + (x_g − y_g)^2 + (x_b − y_b)^2},

where the subscripts r, g and b denote the amount of red, green and
blue in a pixel, respectively.
For example, suppose the colours of two pixels are x = (183, 181, 182)
and y = (152, 149, 156). Then
d(x, y) = \sqrt{(183 − 152)^2 + (181 − 149)^2 + (182 − 156)^2}
       = \sqrt{961 + 1024 + 676}
       = \sqrt{2661}
       ≃ 51.6.
This dissimilarity measure corresponds to the distance between the
two observations if they were plotted on a three-dimensional
scatterplot (with the same scale used for each axis).
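A quick R sketch of this calculation (illustrative only):

# Sketch: Euclidean distance between the two example pixel colours.
x <- c(183, 181, 182)
y <- c(152, 149, 156)
sqrt(sum((x - y)^2))      # about 51.6
dist(rbind(x, y))         # the same, via R's dist() function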

Example 8 Dissimilarities between areas of coastline


In a study investigating the impact of jetties on marine life (Bonnici
et al., 2018), biologists collected data on the presence or absence of
150 algal and faunal species at five sites in Malta (three corresponding
to natural rocky shores and two to concrete jetties).
The biologists were interested in describing how different two sites are
in terms of the species found. They used a dissimilarity measure
known as the Bray–Curtis distance. This dissimilarity measure is the
proportion of species found at one site but not the other out of the
total number of species found at both sites.
(A rocky shore in Qawra, Malta; Majjistral, Malta)
Mathematically, this can be written down in the following way.
Suppose that the species diversity at the first site is given by a
vector x = (x_1, x_2, . . . , x_{150}), where the ith entry, x_i, in this
vector is such that x_i = 1 if species i is present at site x, and
x_i = 0 if it is absent.


Similarly, suppose that the species at the other site is given by the
vector y = (y_1, y_2, . . . , y_{150}), where the ith entry, y_i, in this
vector is such that y_i = 1 if species i is present at site y, and
y_i = 0 if it is absent.
Then the dissimilarity is given by d(x, y), where
d(x, y) = \frac{\sum_i |x_i − y_i|}{\sum_i (x_i + y_i)}.

Example 9 Dissimilarities between words


The dissimilarity between words can be measured using the ‘edit’
distance. That is, the minimum number of steps to transform one
word into another word where a step is:
• changing a letter for another letter
• adding a letter
• removing a letter.
For example, consider the following words for ‘cat’ in different
languages: ‘chat’ (French) and ‘gato’ (Spanish).
• The edit distance between ‘cat’ and ‘chat’ is 1, as one letter, ‘h’, is
added to ‘cat’ in the second position.
• The edit distance between ‘cat’ and ‘gato’ is 2, as ‘c’ is changed to
‘g’ and ‘o’ is added to the end.
• The edit distance between ‘chat’ and ‘gato’ is 3, as ‘c’ is changed to
‘g’, ‘h’ is removed and ‘o’ is added to the end.
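These edit distances can be computed in R with the base function adist(), which implements this (Levenshtein) edit distance; a quick sketch:

# Sketch: edit distances between 'cat', 'chat' and 'gato'.
words <- c("cat", "chat", "gato")
adist(words)   # a 3-by-3 matrix of edit distances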

So, as these examples show, not all dissimilarity measures correspond


directly to distances on plots of the data. Indeed, it is not even necessary
that the variables used to characterise observations are quantitative. It is
possible to define dissimilarities for other types of data too. None the less,
all these dissimilarity measures have the following four features in common.
• They are functions of two, and only two, variables. The first variable
represents an observation and the second variable represents the
observation it is being compared with. Furthermore, it does not matter
which way round the two observations are labelled: the dissimilarity is
the same either way.
• The value of the dissimilarity must be at least 0.


• If the observations are identical, the dissimilarity is exactly 0. However,


the converse is not necessarily true; the dissimilarity might be 0 but the
observations might not be identical. (Though in this case it would be
assumed that the observations would be identical in any way that
mattered given the context.)
• Implicit in its definition, the bigger the value of a dissimilarity measure,
the more different the two observations are.
Mathematically, it means that all dissimilarity measures have the
properties given in Box 3.

Box 3 Properties of dissimilarity measures


Measures of dissimilarity, d(x, y), between two observations, x and y,
have the following properties.
• For any x and y, d(x, y) ≥ 0. So dissimilarities must be at least 0.
• If the observations x and y are identical then d(x, y) = 0.
• d(x, y) = d(y, x), so that the dissimilarity of x to y is the same as
the dissimilarity of y to x.
• If d(x_1, y_1) > d(x_2, y_2), then the pair of observations x_1, y_1 is more
dissimilar than the pair x_2, y_2.

These properties are demonstrated in the following example.

Example 10 Measuring the dissimilarity between pixels


To compare the greyness of two pixels, x and y, Example 6 suggested
using the following dissimilarity measure: d(x, y) = |x − y|.
Note that d(x, y) has the properties required for a dissimilarity
measure.
• It is not possible for the modulus to have any negative values. So
the smallest possible value of d(x, y) is 0.
• If two pixels have the same greyness, then x = y.
Hence d(x, y) = |x − y| = |x − x| = 0.
• d(y, x) = |y − x| = |x − y| = d(x, y).

Box 4 gives a few dissimilarity measures that are commonly used. This list
is not intended to be exhaustive. Dissimilarity can be, and in fact is,
mathematically defined in other ways.


Box 4 Some common dissimilarity measures


Suppose there are two observations x = (x_1, x_2, . . . , x_p)
and y = (y_1, y_2, . . . , y_p).
Some common dissimilarity measures that are used are:
• Euclidean distance:
  d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i − y_i)^2}
• L1 distance (also known as city block or Manhattan distance):
  d(x, y) = \sum_{i=1}^{p} |x_i − y_i|
• Bray–Curtis distance (also known as binary distance):
  d(x, y) = \frac{\sum_{i=1}^{p} |x_i − y_i|}{\sum_{i=1}^{p} (x_i + y_i)},
  where for each i, x_i and y_i take the value 0 or 1.
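In R, the first two of these are available through dist(); the Bray–Curtis distance as defined here can be computed directly from its formula. A minimal sketch (illustrative only):

# Sketch: common dissimilarity measures in R.
X <- rbind(c(183, 181, 182),
           c(152, 149, 156),
           c(188, 200, 188))
dist(X, method = "euclidean")   # Euclidean distances between the rows
dist(X, method = "manhattan")   # L1 (city block) distances

# Bray-Curtis distance for a pair of 0/1 vectors, as in Box 4
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)
bray_curtis(c(1, 0, 1, 1), c(1, 1, 0, 1))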

Aside 1 On foot or as the crow flies?


The alternative names for the L1 distance – city block distance or
Manhattan distance – are inspired by cities where the road structure
is laid out in a grid-like design. In such places, the layout of the
buildings means that it is not possible to cut any corners. So, the
distance between two locations amounts to the distance in one
direction of the grid plus the distance along the other.
Mathematically, this is the L1 distance for two-dimensional data.
Euclidean distance is also sometimes referred to as the distance ‘as the
crow flies’ – that is, the length of the straight line that a bird flying
above the buildings could take.

Which dissimilarity measure is used in any specific analysis depends partly


on the type of data it will be applied to. For example, Euclidean and
L1 distances are used with continuous data, Bray–Curtis distance is used
with binary data and the edit distance (given in Example 9) is used with
text. The choice of dissimilarity measure also depends on the context of
the data. It all depends on when two observations should be deemed to be
far away and also on factors such as whether absolute differences between
values matter, or whether relative differences matter more.
In this unit, we will generally use the Euclidean distance as the
dissimilarity measure. This has the nice interpretation that when two
observations are plotted on a scatterplot, using the same scale for all the

axes, it is the same as the distance between the two points. When there is
just one variable, Euclidean distance is the same as L1 distance – it is just
the modulus of the difference between the two values. You will practise
calculating some dissimilarities in Activities 3 and 4.

Activity 3 Comparing corner pixels in terms of greyness

The greyness values of the four pixels in the corners of Figure 1(a) are
given in Table 6.
Table 6 Greyness of the corner pixels in Figure 1(a)

Corner Greyness
Top left (xtl ) 0.713
Bottom left (xbl ) 0.591
Top right (xtr ) 0.765
Bottom right (xbr ) 0.847

(a) Using the L1 distance as the dissimilarity measure, calculate the


dissimilarity between all pairs of pixels given in Table 6.
(b) Based on your answers to part (a), which pair of corner pixels is the
most different? Referring back to Figure 1(a), is this reasonable?

Activity 4 Comparing corner pixels in terms of colour

The colour values of the four pixels in the corners of Figure 1(a) are given
in Table 7.
Table 7 Colour of the corner pixels in Figure 1(a)

Corner Redness Greenness Blueness


Top left (xtl ) 183 181 182
Bottom left (xbl ) 152 149 156
Top right (xtr ) 188 200 188
Bottom right (xbr ) 222 212 223

(a) Using Euclidean distance as the dissimilarity measure, calculate the


dissimilarity between the pixels in the top right and top left of the
image.
(b) For all the pairs of pixels, the dissimilarities are as follows. (The entry
for the dissimilarity between the colours of the top left and top right
pixels is left blank as you should have calculated this in part (a).)

Corners Dissimilarity
Top left and bottom left 51.6
Top left and top right
Top left and bottom right 64.5
Bottom left and top right 70.1
Bottom left and bottom right 115.6
Top right and bottom right 50.2


Compare these dissimilarities with those based on the greyscale values


(Activity 3). How much difference does it make comparing the colour
rather than just the level of greyness?

When calculating the dissimilarities between observations, choosing the


right dissimilarity measure to use is often not the only decision that needs
to be taken. We also need to decide in what way, if any, should any of the
variables be transformed? In particular, should any variables be rescaled?
This matters as it can make a big difference to the relative dissimilarities
between observations, as you will find out in Activity 5.

Activity 5 Impact of rescaling on dissimilarities

In a study comparing countries in East and Southeast Asia, data about


child-related family policies were gathered (Tonelli, Drobnič and Huinink,
2021). This included information about public social expenditure for
children (as a percentage of GDP) and the length of maternity leave
(in days).
Three countries included in the study were Malaysia, Mongolia and
Singapore. The data for these countries are given in Table 8.
(Malaysian families have enjoyed an increase in paid maternity leave
from 60 to 90 days since 2021.)

Table 8 Expenditure for children and length of maternity leave for three
Asian countries

Country         Public social expenditure      Maternity leave
                as % GDP    standardised       in days   in years   standardised
Malaysia (x)    0.02        −0.86              60        0.16       −1.16
Mongolia (y)    1.30        1.90               120       0.33       0.79
Singapore (z)   0.01        −0.88              112       0.31       0.53

(a) Using the L1 distance, calculate the dissimilarity between each pair of
countries for the following.
(i) When public social expenditure is measured as a percentage of
GDP and maternity leave in days.
(ii) When public social expenditure is measured as a percentage of
GDP and maternity leave in years.
(iii) When both public social expenditure and maternity leave have
been standardised.
(b) Using the values you calculated in part (a)(i), which two of the three
countries have the most similar child-related policies and which two
countries are most different in this respect?
(c) Does transforming the data make a difference to which countries
appear most similar? Why or why not?


So, as Activity 5 demonstrates, the scale on which variables are measured


can make a difference to which observations are deemed close together and
which are not. This in turn will impact on which clusters are found. So,
which units are used for each variable matters. The same is true when
Euclidean distance is used and can be true with other dissimilarity
measures.
In some situations, such as with the colours of pixels you considered in
Activity 4, it is reasonable that all the variables have the same units. More
often, though, the same units can’t be used for each variable, such as data
about child-related policies used in Activity 5. With such datasets, it is
usual to standardise the data first. This avoids the solution being
dependent on an arbitrary decision (such as recording maternity leave in
days, months or years). This is summarised in Box 5.

Box 5 Standardisation of data


When using Euclidean and L1 distances, it is usual to standardise the
data first, particularly if not all variables are measured in the same
units.
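In R, standardisation before computing distances can be done with scale(); a minimal sketch, assuming policies is a data frame of numeric variables (the name is illustrative):

# Sketch: standardise the variables, then compute L1 distances.
policies_std <- scale(policies)            # subtract means, divide by st. devs
dist(policies_std, method = "manhattan")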

Having explored what is meant by ‘dissimilarity’, the definition of a cluster


can now be put in terms of dissimilarity.

Box 6 A revised definition of a cluster


A cluster is a group of observations in which the dissimilarities
between observations in the same group are smaller than the
dissimilarities between these observations and the other observations
not in the group.

As you will see in later sections, there are various ways this definition is
used to develop techniques for identifying clusters in data. First, in the
next section, you will consider how to assess the extent to which clusters,
once found, meet the definition given in Box 6.

2 Assessing clusters
In Subsection 1.1, you performed cluster analysis in both Activities 1
and 2 by looking at a plot of the data. Later in the unit, you will learn
about some other methods for cluster analysis. However, in this section,
you will consider the issue of how we know whether a suggested clustering
of observations is any good. That is, to what extent do the clusters that
have been found reflect structure that is really there in the data?
Sometimes there is other information available about what the clusters
should be. For example, clusters might be suggested by the context in
which the data arose, as is demonstrated in Example 11.


Example 11 A suggested clustering for liquor


Subsection 1.1 introduced some data relating to a Chinese liquor. The
observations in that dataset correspond to samples from a
Luzhou-flavour liquor. These samples, 40 in total, came from
13 different brands. Thus the brand of liquor gives one possible
clustering of the data.

However, it is more common that other information about the clusters is
not known. Using the data themselves is then the only way of judging
how good the results from a cluster analysis are. This approach is known as
internal validation. We look at a few methods of internal validation in
this section: plotting the dissimilarity matrix (in Subsection 2.1),
silhouette plots (in Subsection 2.2) and the mean silhouette statistic (in
Subsection 2.3). Each approach is based on the principle that observations
within the same cluster should be more similar than observations in
different clusters.

2.1 Plotting the dissimilarity matrix


This first approach involves plotting the dissimilarities between all pairs of
observations. The power of the approach comes from the way in which the
dissimilarities are arranged prior to plotting them. The format used, which
is described in Box 7, is known as a dissimilarity matrix.

Box 7 The dissimilarity matrix


The dissimilarity matrix is a matrix of numbers such that the (i, j)th
element, d_{ij} = d(x_i, x_j), is the dissimilarity between the ith and jth
observations, x_i and x_j. A dissimilarity matrix has the following
properties.
• It is a square matrix with the number of rows and the number of
columns equal to the number of observations.
• It is a symmetric matrix. That is, element d_{ij} is the same as
element d_{ji}. This is because the dissimilarity of an observation i to
observation j is the same as the dissimilarity of an observation j to
observation i.
• The dissimilarity of an observation from itself is always taken to be
equal to 0. So, the main diagonal elements d_{ii} = 0.
• All the elements are non-negative. This is because dissimilarities
cannot be negative.


You will see a couple of examples of dissimilarities written in the form of


dissimilarity matrices in Examples 12 and 13.

Example 12 Writing down a dissimilarity matrix


In Activity 3, you compared the dissimilarities in the greyness of
corner pixels in an image. The same information can be displayed in
the following dissimilarity matrix. Note that this dissimilarity matrix
is only showing half of the matrix. This is just as you saw with
correlation matrices (in Subsection 5.1 of Unit 2), where the ‘missing’
values are mirror images of the values that are displayed.
top left bottom left top right bottom right
top left 0.000
bottom left 0.122 0.000
top right 0.052 0.174 0.000
bottom right 0.134 0.256 0.082 0.000

Example 13 A different dissimilarity matrix


In Activity 4, you compared the dissimilarities in the colour of corner
pixels in an image. The same information can be displayed in the
following dissimilarity matrix.
top left bottom left top right bottom right
top left 0.0
bottom left 51.6 0.0
top right 20.5 70.1 0.0
bottom right 64.5 115.6 50.2 0.0

In Activity 6, you will have a go at constructing the dissimilarity matrix


for the heights of a group of friends pictured next.


(Illustration: the five friends and their heights – Adnan 180 cm, Billy 170 cm,
Cath 164 cm, Dan 193 cm, Elise 182 cm.)

Activity 6 Constructing a dissimilarity matrix

Suppose the heights of a group of friends were as follows.


Table 9 Heights of a group of friends

Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182

(a) Using the L1 distance as the dissimilarity measure, calculate the


dissimilarities between Adnan and the others.
(b) Show that the dissimilarity, using the L1 distance, between Adnan
and Adnan is 0.
(c) Hence complete the following dissimilarity matrix.

Adnan Billy Cath Dan Elise


Adnan
Billy
Cath 6
Dan 23 29
Elise 12 18 11


With a dissimilarity matrix, it does not matter what order the rows are
presented in, so long as it is known what each row represents. (Note that
the symmetric nature of the matrix means that the columns always have
the same ordering as the rows.) However, when assessing the extent to
which a set of proposed clusters reflects structure in the data, it is useful
to group rows together by cluster. This means placing all the rows
corresponding to the first cluster first, then placing all the rows
corresponding to the second cluster second, and so on. Arranging the rows
(and with them the columns) in this way means that dissimilarities for
observations in the same cluster will appear together as one block in the
dissimilarity matrix. If the clustering is a good one, these blocks will
become noticeably clear. The values in these blocks should be lower than
elsewhere in the dissimilarity matrix. You will see a dissimilarity matrix
arranged in this way in Example 14.

Example 14 A clustering of friends’ heights


In Activity 6, some data about five friends was introduced. The
heights of these five friends are given in Figure 8.

Figure 8 A clustering of five friends based on their heights (the friends
placed on a number line of height in cm: Cluster 1 contains Cath and Billy,
Cluster 2 contains Adnan and Elise, and Cluster 3 contains Dan)

On this plot, which places each of the friends on a number line


according to their height, notice that the points representing Adnan
and Elise are close together. Also, the points representing Billy and
Cath are not too far apart from each other. So it could be argued that,
based on their heights, the five friends can be split into three clusters:
• Cluster 1 comprising Billy and Cath
• Cluster 2 comprising Adnan and Elise
• Cluster 3 comprising Dan (only).
We will write this clustering of the friends as
{Billy, Cath}, {Adnan, Elise} and {Dan}.
Note that the order of the names within brackets does not matter. So
this clustering could, for example, also be expressed as
{Cath, Billy}, {Elise, Adnan} and {Dan}.


The ordering of the clusters also does not matter. So, for example, we
could also express this clustering as
{Adnan, Elise}, {Dan} and {Cath, Billy}.
Reorganising the dissimilarity matrix you obtained in Activity 6, so
that the rows are sorted by cluster, results in the following matrix.
Billy Cath Adnan Elise Dan
Billy 0
Cath 6 0
Adnan 10 16 0
Elise 12 18 2 0
Dan 23 29 13 11 0
In this matrix, the lines separate friends in different clusters and
hence divide the matrix into blocks. Notice that the numbers in the
blocks along the main diagonal are noticeably less than those in the
other blocks. This confirms that the heights of the friends within the
same cluster are closer than the heights of the friends who are in
different clusters.
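The dissimilarity matrix in this example can be reproduced in R with dist(); a quick sketch, with the cluster ordering chosen by hand:

# Sketch: L1 dissimilarities between the friends' heights, ordered by cluster.
heights <- c(Billy = 170, Cath = 164, Adnan = 180, Elise = 182, Dan = 193)
dist(heights, method = "manhattan")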

In Example 14, the dissimilarity matrix is small enough that looking at the
individual numbers to assess patterns (or lack thereof) is not too daunting
a task. However, in most cases the matrix will be too big to do this. For
example, the data on occupancy in the car park given in Example 3
consists of 73 observations (which is a small sample size in many contexts).
Thus the dissimilarity matrix would be a 73 × 73 matrix. So to fit such a
matrix onto an A4 page (210 mm by 297 mm), each element must be tiny:
about 3 mm by 4 mm, which is far too small for most people to read.
However, when judging how good a clustering is, it is only important to be
able to see general patterns in the sizes of the dissimilarities. This can be
conveyed using colour as a scale. Then patterns in the values of the
dissimilarities get translated to colour patterns. This is called plotting
the dissimilarity matrix, which is described in Box 8. An example of its
use is then given in Example 15.

Box 8 Plotting the dissimilarity matrix


Plotting the dissimilarity matrix refers to the process of converting
the elements of the matrix into colours using a colour scale and
displaying them on a square plot. Patterns in the elements of the
dissimilarity matrix then show up as patterns in the colours. This
process is particularly helpful when the matrix is big.
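One way to produce such a plot in R is with image(); a minimal sketch, assuming d is a dist object whose observations are already ordered by cluster (the names are illustrative):

# Sketch: plot a dissimilarity matrix as colours.
D <- as.matrix(d)
n <- nrow(D)
image(1:n, 1:n, D,
      col = gray.colors(30, start = 0.95, end = 0.2),  # light = similar
      xlab = "", ylab = "", axes = FALSE)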


Example 15 Assessing the clustering of friends’ heights


In Example 14, the dissimilarity matrix obtained from a clustering of
some data on friends’ heights was given. In this dissimilarity matrix,
the values go from 0 to 29. Figure 9 shows one possible colour scale.

Figure 9 A possible colour scale (running from 0 to 30)

Applying this colour scale to the dissimilarity matrix in Example 14


(all of it, not just one half) results in the picture given in Figure 10.

Figure 10 A plot of a dissimilarity matrix (rows and columns ordered as
Billy, Cath, Adnan, Elise, Dan)

Notice that the use of colour encourages consideration of general


patterns instead of focusing on small differences between numbers. In
this example, the block diagonal structure of the matrix stands out,
particularly along the main diagonal. This is where the lightest
squares (representing lower values) are. The plot also makes clear the
symmetry in the matrix: the top-right half of the matrix is a mirror image
of the bottom-left half.
The block diagonal structure also shows that Clusters 1 and 2, and
Clusters 2 and 3, are more similar than Clusters 1 and 3. This is
because the darkest squares are at the edge, in the columns

third cluster (or the rows corresponding to the first cluster and the
column corresponding to the third cluster). For these data, this is not
a surprising observation as the one-dimensional nature of the data
means that if Cluster 2 is between Clusters 1 and 3 then, inevitably,
Clusters 1 and 3 must be further apart.

One use of the plots of the dissimilarity matrix is to informally judge the
extent to which a proposed clustering of a dataset reflects real clusters in
the data. You will do this using some simulated data in Activity 7.

Activity 7 Judging dissimilarity matrices


Figure 11 gives plots of three different clusterings (that is, allocations of
points to clusters). In each, the total number of observations is the same
(there are 100 observations in each plot) and the observations have been
split into 5 different clusters. Plots of three dissimilarity matrices are given
in Figure 12, each corresponding to one of the clusterings.

Clustering 1 Clustering 2 Clustering 3


Figure 11 Three different clusterings

Matrix 1 Matrix 2 Matrix 3


Figure 12 Plots of three different dissimilarity matrices


(a) How convincing is each clustering in Figure 11? In other words, does
it appear to be the ‘right’ clustering, or does it appear that other
clusterings would be at least as appropriate?
(b) Each of the plots in Figure 12 uses a colour scale similar to that given
in Figure 9. (So the lighter the colour, the more similar two
observations are.) By using your answer to part (a), which of the
plotted dissimilarity matrices correspond to which clustering?

2.2 Silhouette plots


Recall that the definition of a cluster refers to a group of observations
being close to each other, and far from observations not in the cluster. As
you have seen in Subsection 2.1, plotting the dissimilarity matrix allows us
to see the extent to which clusters fulfil this definition in an informal way.
We can quantify the extent to which observations within a cluster are close
to each other and distant from observations in other clusters using the
silhouette statistic. The definition of the silhouette statistic is given in
Box 9.

Box 9 The silhouette statistic


For observation i in a dataset, the silhouette statistic, s_i, is
s_i = \frac{b_i − a_i}{\max(a_i, b_i)},
where a_i is the mean dissimilarity between observation i and other
observations in the same cluster, b_i is the mean dissimilarity between
observation i and observations in the next nearest cluster, and
max(a_i, b_i) is the maximum of a_i and b_i. (In this context, ‘next
nearest’ corresponds to the cluster for which b_i is minimised.)
For observations that are in a cluster by themselves, or equivalently
being treated as outliers, the value of s_i is taken to be 0.
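In R, silhouette statistics and their plot are available from the cluster package; a minimal sketch, assuming clusters is a vector of cluster labels (coded 1, 2, . . .) and d is the corresponding dist object (names illustrative):

# Sketch: silhouette statistics for a given clustering.
library(cluster)
sil <- silhouette(clusters, dist = d)
summary(sil)
plot(sil)   # silhouette plot, with observations grouped by cluster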

In Example 16, you will learn how to calculate silhouette statistics for a
couple of friends from Example 14.


Example 16 Calculating silhouette statistics


Recall, from Example 14, that the following clustering of friends based
on their heights was suggested: {Billy, Cath}, {Adnan, Elise}
and {Dan}. The dissimilarity matrix ordered by cluster was
calculated to be as follows.
Billy Cath Adnan Elise Dan
Billy 0
Cath 6 0
Adnan 10 16 0
Elise 12 18 2 0
Dan 23 29 13 11 0

Using the values in this dissimilarity matrix, the value of the


silhouette statistic can be calculated for any friend. In this example,
we concentrate on calculating the silhouette statistic for Dan and
Cath.
As Dan is in a cluster by himself, the value of the silhouette statistic
for Dan is 0 (see Box 9).
Next, consider Cath who is in Cluster 1.
• The mean dissimilarity between Cath and the other members of
Cluster 1 (i.e. just Billy) is 6. So, a_Cath = 6.
• The mean dissimilarity between Cath and the members of Cluster 2
(Adnan and Elise) is
(16 + 18)/2 = 17.
Similarly, the mean dissimilarity between Cath and Cluster 3 (just
Dan) is 29.
So, Cluster 2 is closer to Cath than Cluster 3, meaning
that b_Cath = 17.
Thus, the silhouette statistic for Cath is
s_Cath = (b_Cath − a_Cath) / max(a_Cath, b_Cath)
       = (17 − 6) / max(6, 17)
       = 11/17 ≃ 0.647.


In Activity 8, you will have a go at calculating silhouette statistics for the


rest of the friends.

Activity 8 Calculating more silhouette statistics


In Example 16, silhouette statistics were calculated for Dan and Cath
using the clustering {Billy, Cath}, {Adnan, Elise} and {Dan}. Still using
this clustering, calculate the silhouette statistics for the other three friends:
(a) Adnan, (b) Billy and (c) Elise.

Note that, for any observation i, the silhouette statistic si can only take
values between −1 and +1.
When an observation is very close to other observations in its cluster and
far away from observations in the other clusters, the value of ai will be
very small relative to bi , and hence the value of si will be close to +1. So,
a value of si close to +1 indicates that the point sits comfortably within its
allocated cluster.
If it turns out that the observation is far more similar to observations in a
different cluster than to the one it was allocated to, ai will be large relative
to bi and hence si will be negative. So, a negative value of si suggests that
the observation might have been allocated to the wrong cluster.
An interpretation of one set of silhouette statistics is given in Example 17.

Example 17 Interpreting silhouette statistics


In Example 16 and Activity 8, the silhouette statistics were calculated
for the five friends based on the clustering {Billy, Cath}, {Adnan,
Elise} and {Dan}.
The values of these silhouette statistics are brought together in
Table 10.
Table 10 Silhouette statistics for the five friends based on the clustering
{Billy, Cath}, {Adnan, Elise} and {Dan}

Friend Adnan Billy Cath Dan Elise


si value 0.846 0.455 0.647 0 0.818

So, it can be seen that Cluster 2, containing Adnan and Elise, is


convincing because both Adnan and Elise have high values for the
silhouette statistic. In contrast, the value of the silhouette statistic for
Billy suggests that it is not clear that Billy is in the correct cluster.
Looking back at the plot of the data (Figure 8), this is not that
surprising. The dissimilarities between Billy and each of Adnan and
Elise (who are in a different cluster) are not that much more than the
dissimilarity between Billy and Cath.


As with the dissimilarity matrix, the number of observations in a dataset


means that it is usually impractical to just look at a table of silhouette
statistics. Instead, the values of the silhouette statistic can be plotted. As
with plotting the dissimilarity matrix, the values are ordered by cluster.
Additionally, within clusters, the values are also placed in descending
order. This way, it should be clear where the clusters are, and the number of points with low values of the silhouette statistic is also highlighted. One
such plot is given in Example 18. You will then interpret other plots of
silhouette statistics in Activity 9.

Example 18 A plot of silhouette statistics


A plot of the silhouette statistics given in Table 10 is presented in
Figure 13.

Figure 13 A plot of the silhouette statistics for the five friends based on the clustering {Billy, Cath}, {Adnan, Elise} and {Dan} (bars ordered by cluster, clusters 1 to 3; vertical axis: silhouette statistic)

In this plot, notice that the bars representing the five friends are
ordered with respect to which cluster they are in. Furthermore, within
each cluster, the bars are in order of the value of the silhouette
statistic. From this plot, it is easy to see that the silhouette statistics
for Adnan and Elise (in Cluster 2) are higher than those for Billy and
Cath (in Cluster 1).


Activity 9 Interpreting plots of silhouette statistics

In Activity 7 (Subsection 2.1), you matched plots of the dissimilarity


matrix to different clustering solutions. Plots of the silhouette statistics for
these different clustering solutions are given in Figure 14. Match each plot
in Figure 14 to the clustering solution from which it was derived.

Figure 14 Plots (a), (b) and (c) of the silhouette statistics for the clustering solutions given in Activity 7 (in each panel, bars are ordered by cluster, clusters 1 to 5; vertical axis: silhouette statistic)

Similarly to plots of the dissimilarity matrix, plots of the silhouette


statistics can be used to informally judge how good a clustering solution is.
In Activity 10, you will use a plot of the dissimilarity matrix and a plot of
silhouette statistics to judge a clustering of the Chinese liquor data (which
was introduced in Subsection 1.1).


Activity 10 Clustering of Chinese liquor by brand

The 40 observations in the Chinese liquor dataset (Subsection 1.1) come


from 13 brands of liquor. So the brand of liquor is one way of splitting the
data into clusters. However, we don’t know if this clustering is reflected by
the rest of the data. To find out if the samples from the same brand are
more similar than samples from different brands, we will use the Euclidean
distance, after the concentrations of ethyl acetate and ethyl lactate have
been standardised, to measure the distances between samples.
Consider the plot of the resulting dissimilarity matrix in Figure 15(a) and
the plot of the silhouette statistics in Figure 15(b). Do you think using
brand gives a good clustering?

Figure 15 Clustering of samples of Chinese liquor by brand: (a) plot of the dissimilarity matrix; (b) plot of the silhouette statistics

2.3 Mean silhouette statistic


Plotting the dissimilarity matrix and silhouette statistics allows clustering
solutions to be assessed visually. This is particularly useful for assessing
the extent to which the suggested clusters are really there in the data.
However, it is useful to have an overall numerical measure of how good the
clustering is – for example, when comparing two alternative clusterings
proposed for a given set of data. The silhouette statistic introduced in
Subsection 2.2 provides one way of doing so. The mean of the silhouette
statistics for all the observations can be calculated to produce a single
number summarising how good (or bad) the clustering is. This number is
called the mean silhouette statistic, and a formal definition is given in
Box 10. (There are other numerical measures of how good a clustering is
overall but we will not cover them in this unit.)


Box 10 Mean silhouette statistic


The mean silhouette statistic, s, is given by
s = (s1 + s2 + . . . + sn)/n,

where si is the silhouette statistic for the ith observation.


Values for s range between +1 and −1. The better the clustering is
overall, the closer s is to +1.

In Example 19, you will learn how to calculate the mean silhouette
statistic for the heights of a group of friends.

Example 19 Calculating a mean silhouette statistic


Table 10 from Example 17, shown here again for convenience,
summarises the silhouette statistics calculated in Example 16 and
Activity 8.

Friend Adnan Billy Cath Dan Elise


si value 0.846 0.455 0.647 0 0.818

So, for this clustering,


s = (0.846 + 0.455 + 0.647 + 0 + 0.818)/5 = 0.553.
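The same summary can be checked in R with one line, since the mean silhouette statistic is just the mean of the individual values from Table 10 (a base R sketch, not module notebook code):

si <- c(Adnan = 0.846, Billy = 0.455, Cath = 0.647, Dan = 0, Elise = 0.818)
mean(si)  # 0.5532, which rounds to 0.553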

In the following activity, you will be asked to calculate silhouette statistics


and the mean silhouette statistic for an alternative clustering to that in
Example 19.

Activity 11 Assessing alternative clusterings of the five


friends
In Example 14, it was suggested that the friends should be split into the
following three clusters:
{Billy, Cath}, {Adnan, Elise} and {Dan}.
An alternative way of dividing the friends into three clusters is:
{Cath}, {Adnan, Billy} and {Elise, Dan}.
(a) Calculate the silhouette statistics for Cath and for Billy based on this
alternative clustering.


(b) The silhouette statistics for the other three friends, Adnan, Elise and
Dan are, respectively, sAdnan = −0.250, sElise = −0.364 and
sDan = 0.389. Using your answer to part (a), calculate the mean
silhouette statistic for this alternative clustering. Hence comment on
which clustering seems to be better: {Billy, Cath}, {Adnan, Elise}
and {Dan}, or {Cath}, {Adnan, Billy} and {Dan, Elise}.
(c) There are 25 potential ways in which the friends can be split into
(exactly) three clusters, six of which are reasonable (that is, each
cluster only contains friends who are next to each other in terms of
height). These six reasonable clusterings are listed below, along with
their associated average silhouette statistics. (A space is left blank to
add your answer from part (b). Remember that the order of names
within a cluster is not important.) Based on the mean silhouette
statistic, which of these six clusterings seems best? With reference to
the plot of the data given in Figure 8 (Subsection 2.1), does this seem
reasonable?

Cluster 1 Cluster 2 Cluster 3 s


Cath, Billy, Adnan Elise Dan −0.025
Cath, Billy Adnan Elise, Dan 0.072
Cath, Billy Adnan, Elise Dan 0.553
Cath Billy, Adnan Elise, Dan
Cath Billy, Adnan, Elise Dan 0.090
Cath Billy Adnan, Elise, Dan 0.237

3 Hierarchical clustering
So far in this unit, the only method you have seen for allocating
observations to clusters has been informally using plots. However, this
approach is not a good one in many circumstances. Firstly, it is only
suitable for datasets which can be easily displayed graphically. If a dataset
is large, with many variables, it becomes very difficult to display the data
well in a graphical format. More importantly, it is a subjective process
which does not lend itself to being automated. In this section, and in
Sections 4 and 5, you will learn about different approaches for doing
automatic cluster allocation using algorithms. The approach we will
consider in this section is a form of hierarchical clustering, for which we
give a definition in Box 11.

Box 11 Hierarchical clustering


Hierarchical clustering refers to clustering techniques that use an
allocation of observations to k clusters to find an allocation to
either k − 1 or k + 1 clusters. Hierarchical clustering techniques are
either agglomerative or divisive.


• Agglomerative hierarchical clustering techniques use the


allocation of observations to k clusters to find an allocation of
observations to k − 1 clusters by merging two clusters.

• Divisive hierarchical clustering techniques use the allocation of


observations to k clusters to find an allocation of observations
to k + 1 clusters by dividing a cluster into two.


In this unit, we will only deal with agglomerative hierarchical clustering.


As mentioned in Box 11, these techniques use an allocation of observations
to k clusters to find an allocation of observations to k − 1 clusters. The
way this is done is outlined in Figure 16, although the starting number k
of clusters in the algorithm will be discussed in Subsection 3.1.

Figure 16 Agglomerative hierarchical clustering (flowchart: starting from an allocation of observations to k clusters, check whether k = 1; if yes, stop; if no, identify the two closest clusters, merge them so that k reduces by 1, and repeat)

At the heart of the algorithm is a loop that takes an allocation of


observations to k clusters, finds the two closest clusters and merges them
to get an allocation of the observations to k − 1 clusters. How this can be
done will be discussed in Subsection 3.2, after Subsection 3.1 deals with
how to find an initial allocation of observations to clusters so that the
process can start. Then, in Subsection 3.3, a method of plotting the results


from such a clustering – the dendrogram – will be introduced. In


Subsection 3.4, you will use R to do some agglomerative hierarchical
clustering yourself.

3.1 Starting agglomerative hierarchical


clustering
For agglomerative hierarchical clustering to start, it is necessary to have an
allocation of observations to k clusters for some value of k. It is important
that the number of clusters, k, is at least as big as the number of clusters
that are thought to be actually in the data. This is because during the
algorithm the number of clusters steadily reduces. In Activity 12, you will
establish a starting allocation for a dataset.

Activity 12 Finding the maximum number of clusters

Activity 6 (Subsection 2.1) introduced some data about five friends:


Adnan, Billy, Cath, Dan and Elise.
(a) What is the maximum number of clusters these five friends could be
split into? (Hint: a cluster needs to have at least one observation
allocated to it. Also, an observation cannot be allocated to more than
one cluster.)
(b) Suppose the data were expanded to include three other friends:
Freddie, Gill and Hari. What is the maximum number of clusters
these eight friends could be split into?

In Activity 12, the maximum number of clusters matched the number of


friends. This result holds generally. For a dataset with n observations, the
maximum number of clusters is n. Furthermore, the n-cluster allocation of
observations to clusters is easy to write down. It simply allocates each
observation to its own single-element cluster. As such, this provides an
excellent starting point for agglomerative hierarchical clustering. There
cannot be more than n clusters in the data. Also, it is an allocation
that can be written down immediately, so does not take time to find. So,
in summary, the starting point for agglomerative hierarchical clustering is
as given in Box 12.

Box 12 Starting point for agglomerative hierarchical


clustering
For a dataset of size n, the agglomerative hierarchical clustering starts
with an allocation such that each observation is in its own
single-element cluster. Thus there are n clusters in total.


3.2 Finding and merging the closest clusters


The other main task in agglomerative hierarchical clustering is to find, and
then merge, the two closest clusters. Now, merging clusters is an easy,
clear-cut process. All we need to do is to say that the observations
allocated to the two closest separate clusters are now allocated to the same
cluster.
Finding the two closest clusters is trickier. Recall that, in Section 1,
dissimilarity measures were introduced to quantify how different
observations are. These dissimilarity measures are also used to quantify
how different clusters are. However, the dissimilarity measure, as described
in Box 3 (Subsection 1.2), just measures the difference between two, and
only two, observations. There are different ways in which this can be
extended to deal with differences between groups of observations. Box 13
details three such ways, which are types of ‘linkage’.

Box 13 Dissimilarities between clusters


Three different ways of defining the distance between two clusters,
A and B, are as follows.
• Single linkage: the smallest dissimilarity between an observation
in Cluster A and an observation in Cluster B.

• Complete linkage: the largest dissimilarity between an


observation in Cluster A and an observation in Cluster B.


• Average linkage: the average dissimilarity between an observation


in Cluster A and an observation in Cluster B.

The linkage that is chosen will impact the shape of cluster that is likely to
be found by agglomerative hierarchical clustering. For instance, single
linkage can result in clusters that are long and thin, whereas complete
linkage is more likely to result in clusters that are more ‘ball’-like. Thus
the best choice will depend, in part, on the context in which the cluster
analysis is done.
The calculation of some dissimilarities between clusters is demonstrated in
Example 20. You will then calculate some yourself in Activity 13.

Example 20 Measuring the dissimilarity between two


clusters of pixels
In Example 12, you saw that the dissimilarities between the four
corner pixels in a greyscale image could be written in the form of the
following matrix.
top left bottom left top right bottom right
top left 0.000
bottom left 0.122 0.000
top right 0.052 0.174 0.000
bottom right 0.134 0.256 0.082 0.000

Suppose that these pixels are placed in two clusters:


{top left, top right} and {bottom left, bottom right}.
The values in the dissimilarity matrix for the pixels can then be used
to calculate the dissimilarity between these two clusters.
For single linkage, complete linkage and average linkage, the
dissimilarity between the clusters depends on the dissimilarities
between an observation in the first cluster and an observation in the
second cluster. For the clusters in this example, these are the
dissimilarities between a corner pixel at the top and a corner pixel at


the bottom. Reading off from the dissimilarity matrix, these are
0.122, 0.134, 0.174 and 0.082.
Using single linkage, the dissimilarity between the clusters is the
minimum of these four dissimilarities. So, its value is 0.082.
Using complete linkage, the dissimilarity between the clusters is the
maximum of these four dissimilarities. So, its value is 0.174.
Using average linkage, the dissimilarity between the clusters is the
mean of these four dissimilarities. So, its value is
(0.122 + 0.134 + 0.174 + 0.082)/4 = 0.128.
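Each of the three linkages is straightforward to compute in R once the relevant between-cluster dissimilarities have been collected. The base R sketch below (the object name between is just an illustrative choice) reproduces the three values in this example.

# Dissimilarities between a pixel in the 'top' cluster and a pixel in the
# 'bottom' cluster, read off the greyscale dissimilarity matrix
between <- c(0.122, 0.134, 0.174, 0.082)

min(between)   # single linkage:   0.082
max(between)   # complete linkage: 0.174
mean(between)  # average linkage:  0.128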

Activity 13 Calculating more dissimilarities between


clusters of pixels
In Activity 4 (Subsection 1.2), you compared the dissimilarities between
the corner pixels based on their colour instead of the amount of greyness.
The dissimilarity matrix in this case is as follows.
top left bottom left top right bottom right
top left 0.0
bottom left 51.6 0.0
top right 20.5 70.1 0.0
bottom right 64.5 115.6 50.2 0.0

Suppose that the pixels are split into the following two clusters:
{top left, bottom left} and {top right, bottom right}.
(a) Which dissimilarities between pixels is the calculation of the
dissimilarity between the clusters based on?
(b) What is the dissimilarity between the clusters using single linkage?
(c) What is the dissimilarity between the clusters using complete linkage?
(d) What is the dissimilarity between the clusters using average linkage?

To illustrate the general principle underlying agglomerative hierarchical clustering, we will focus on complete linkage; that is, the dissimilarity between Cluster A and Cluster B will be taken as the largest
dissimilarity between an observation in Cluster A and an observation in
Cluster B.
We now have a method for finding and merging the two nearest clusters,
along with a method for starting (Box 12). This means everything is in


place to find clusters using agglomerative hierarchical clustering, as you


will see in Example 21.

Example 21 Clustering the friends using hierarchical


clustering
Activity 6 (Subsection 2.1) introduced some data about five friends
and calculated the dissimilarity between each pair of individuals based on the
L1 distance between heights as follows.
Adnan Billy Cath Dan Elise
Adnan 0
Billy 10 0
Cath 16 6 0
Dan 13 23 29 0
Elise 2 12 18 11 0

In this example, we will show how the agglomerative hierarchical


clustering algorithm works with these data step-by-step.
Start
We start with each person corresponding to a separate cluster. So
there is a total of five clusters, and the dissimilarity matrix between
the clusters is as follows.
{Adnan} {Billy} {Cath} {Dan} {Elise}
{Adnan} 0
{Billy} 10 0
{Cath} 16 6 0
{Dan} 13 23 29 0
{Elise} 2 12 18 11 0

Iteration 1
Looking at the starting dissimilarity matrix, the two clusters which
are closest together are {Adnan} and {Elise}, since the dissimilarity
between them is the smallest value in the dissimilarity matrix
(excluding the main diagonal). So these two clusters can be merged.
Thus, we are left with the following clusters: {Adnan, Elise}, {Billy},
{Cath}, {Dan}. This is a total of four clusters.
The dissimilarities between the three clusters {Billy}, {Cath}
and {Dan} are not changed by merging the clusters {Adnan}
and {Elise}. The dissimilarity between the new cluster {Adnan, Elise}
and {Billy} is 12, which is the maximum of the dissimilarities between
{Adnan} and {Billy}, 10, and between {Elise} and {Billy}, 12.
Similarly, the dissimilarity between the clusters {Adnan, Elise} and
{Cath} is 18, and the dissimilarity between the clusters {Adnan,
Elise} and {Dan} is 13. So the dissimilarity matrix between these four
clusters is as follows.


{Adnan, Elise} {Billy} {Cath} {Dan}


{Adnan, Elise} 0
{Billy} 12 0
{Cath} 18 6 0
{Dan} 13 23 29 0

Iteration 2
Looking at the dissimilarity matrix between clusters calculated at the
end of Iteration 1, the clusters {Billy} and {Cath} are closest because
the dissimilarity, 6, is the smallest off-diagonal value. So, they are the
next clusters to be merged.
After this merger, we have the following clusters: {Adnan, Elise},
{Billy, Cath} and {Dan}. The dissimilarity matrix between the three
clusters is as follows.
{Adnan, Elise} {Billy, Cath} {Dan}
{Adnan, Elise} 0
{Billy, Cath} 18 0
{Dan} 13 29 0

Iteration 3
Looking at the dissimilarity matrix between clusters calculated at the
end of Iteration 2, the clusters {Adnan, Elise} and {Dan} are closest,
so they are the next clusters to be merged.
After this merger we have the following clusters: {Adnan, Dan, Elise}
and {Billy, Cath}. This is a total of two clusters. The members of a
cluster form a set. So, as mentioned in Example 14 (Subsection 2.1),
the order in which the members of a cluster are given does not
matter. For example, we could have expressed the cluster {Adnan,
Elise, Dan} as {Adnan, Dan, Elise} or {Elise, Dan, Adnan}. The
dissimilarity matrix between the two clusters is as follows.
{Dan, Adnan, Elise} {Billy, Cath}
{Dan, Adnan, Elise} 0
{Billy, Cath} 29 0

Iteration 4
At the end of Iteration 3, we only have two clusters. So these two
clusters must automatically be the closest two clusters and hence the
pair that is merged next. This leads to the cluster {Adnan, Billy,
Cath, Dan, Elise}. This is just one cluster, so the algorithm stops.
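In practice, the whole procedure is automated. The base R sketch below is one way it might be done for these data, using the dist() and hclust() functions with the L1 distance and complete linkage to match this example; it is only a sketch, and is not a substitute for Notebook activity B1.1 (Subsection 3.4), where hierarchical clustering in R is covered properly.

heights <- c(Adnan = 180, Billy = 170, Cath = 164, Dan = 193, Elise = 182)

# L1 distances between heights (with a single variable, these are just the
# absolute differences used in this example)
d <- dist(heights, method = "manhattan")

# Agglomerative hierarchical clustering with complete linkage
hc <- hclust(d, method = "complete")

hc$merge           # the sequence of mergers, one row per iteration
cutree(hc, k = 3)  # the three-cluster solution: {Adnan, Elise}, {Billy, Cath}, {Dan}
plot(hc)           # the corresponding dendrogram (dendrograms are the topic of Subsection 3.3)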


At each iteration, a couple of things make it easier to calculate the


dissimilarity matrix. First, the dissimilarities between clusters not involved
in the merger do not change. So the partial rows and columns in the
dissimilarity matrix relating to these clusters can be copied across. Then,
for some linkages, such as complete linkage and single linkage, the
dissimilarities involving the newly merged cluster can be easily calculated
from the dissimilarities involving the clusters before they were merged.
With complete linkage, each dissimilarity involving the merged cluster is
just the maximum of the dissimilarities pre-merger.
In the next activity, you will finish a hierarchical clustering of the parking
dataset, which was described in Subsection 1.1.

Activity 14 Clustering occupancies in a car park using


hierarchical clustering
The parking data introduced in Subsection 1.1 consist of occupancies
around noon on 73 days. Agglomerative hierarchical clustering was applied
to these data. The dissimilarity measure between two days, x and y, was
taken to be the L1 distance, and complete linkage was used.
After a number of iterations, the following five-cluster solution was found:
• Cluster A: {1, 2, 3, 7, 8, 9, 10, 14, 15, 16, 19, 20, 21, 22, 26, 27, 28, 29,
30, 33, 34, 36, 40, 41, 42, 43, 44, 47, 48, 49, 50, 54, 55, 56, 57, 58, 59, 60,
61, 62, 66, 67, 68, 69, 73}
• Cluster B: {4, 11, 23, 35, 37, 51, 63, 70}
• Cluster C: {5, 6, 18, 25, 32, 39, 46, 53, 64, 65}
• Cluster D: {12, 17, 24, 38, 45, 52, 71, 72}
• Cluster E: {13, 31}.
(Each number refers to a day in the dataset.)
The dissimilarity matrix between these clusters is as follows.
Cluster A Cluster B Cluster C Cluster D Cluster E
Cluster A 0
Cluster B 202 0
Cluster C 598 477 0
Cluster D 532 411 121 0
Cluster E 448 327 161 95 0

(a) Which pair of clusters would be the next to be merged? Hence, what
is the four-cluster solution?
(b) Write down the dissimilarity matrix for the four-cluster solution.


3.3 The dendrogram


One advantage of agglomerative clustering is that it provides a range of
solutions, one for each number of clusters. This means that the effect of
deciding on a different number of clusters can be explored after the
analysis has been done.
These solutions can be summarised by providing a list of which clusters get
merged at each stage. It is possible to give this information visually via
what is called a dendrogram. A definition of the dendrogram is given in
Box 14.

Box 14 The dendrogram


A dendrogram is a tree-like diagram used to display hierarchical clustering
solutions. In the diagram, observations are depicted by branches.
These branches are linked at a position along the diagram
corresponding to the dissimilarity between the clusters.
The dendrogram also includes how far apart the merged clusters were, which means that it also gives information about how reasonable each merger was. (The dendrogram gets its name from the Greek word for ‘tree’.)

To illustrate, Figure 17 displays a dendrogram for a clustering of five


observations: a, b, c, d and e.

Figure 17 An example of a dendrogram (observations a, b, c, d and e along the bottom; vertical axis: dissimilarity)

At the bottom of the dendrogram there are five branches, labelled a to e,


each branch representing a single observation in the dataset.


The branches d and e are connected by a horizontal line, the position of


which corresponds to the value of 1 on the vertical ‘Dissimilarity’ scale.
This indicates that during the hierarchical clustering process, d and e were
merged together, and that the dissimilarity between the two clusters was 1.
As this horizontal link is the lowest link on the dendrogram, these two
observations were the first to get merged. Similarly, the other horizontal
lines indicate the other mergers that were made during the clustering
process – in this case, another three mergers. Again, the level of the
horizontal line indicates how dissimilar clusters were when they were
merged. The vertical lines that they link indicate which groups were merged. For example, the last merger was between a cluster formed by observations a and b and another cluster formed by c, d and e. The dissimilarity between these two clusters was 7 units.
One thing to bear in mind when looking at a dendrogram is that the order
in which the data points are listed is arbitrary. For each merger, it does
not matter which observation or cluster is put on the left and which on the
right. For example, equally valid representations of the hierarchical
clustering given in Figure 17 are shown in Figure 18.

Figure 18 Two alternative representations of the clustering depicted in Figure 17: in (a) the observations are ordered b, a, c, e, d along the bottom; in (b) they are ordered d, e, c, b, a (vertical axes: dissimilarity, 0 to 8)

You will interpret a dendrogram in Activity 15, next.


Activity 15 Interpreting a dendrogram

Activity 14 involved the application of agglomerative hierarchical


clustering to some data on car park occupancy. The first ten days of these
data are given in Table 11.
Table 11 The first 10 days of the parking data

Day Occupancy
1 677
2 653
3 673
4 545
5 126
6 108
7 615
8 664
9 676
10 610

As part of an initial look at the data, agglomerative hierarchical clustering


was conducted on just the first 10 days’ worth of data. The dendrogram
based on these results is given in Figure 19.

Figure 19 A representation of hierarchical clustering of 10 days’ worth of car park occupancies (vertical axis: dissimilarity, from 0 to 600; days ordered 5, 6, 4, 7, 10, 3, 1, 9, 2, 8 along the bottom)


(a) Which two days were the first to be merged?


(b) Roughly, what was the dissimilarity between the last two clusters to
be merged?
(c) For the two-cluster solution, which days were in which cluster?
(d) For the three-cluster solution, which days were in which cluster?
(e) You should have found that one of the clusters in part (c) is the same
as one of the clusters in part (d). Why is this?
(f) Based on the dendrogram, how many clusters do you think there are
in the data? Justify your answer.

3.4 Using R to do hierarchical clustering


So far in this section, all the agglomerative hierarchical clustering you have
been doing has been done by hand. This has been feasible (though tedious
at times) because the datasets considered have been very small; however,
in practice, computers are used to actually do the clustering. Therefore, in
Notebook activity B1.1, you will learn how to do the clustering and obtain
dendrogram plots in R by using the parking dataset (which was introduced
in Subsection 1.1). The subsection ends with Notebook activity B1.2,
where you will learn how to obtain plots of the dissimilarity matrix and
silhouette statistics using R. For this, you will need a new dataset, which
is introduced below.

Evidence from Bronze Age gold-mining


In a study, archaeologists collected rock samples that were linked with
Bronze Age gold-mining. All the samples were taken from a dig at
Ada Tepe in south-eastern Bulgaria.
The Ada Tepe dataset (adaTepe)
We will use just four of the variables that the archaeologists measured
for each sample. These four variables all relate to the magnetic
properties of the sample, but don’t worry about understanding all the
terms used here:
• magSus: mass specific magnetic susceptibility, which is measured in
10^−8 m^2/kg
• logRatio1: the log of the ratio of magnetic susceptibility measured
by anhysteretic remanent magnetisation to frequency-dependent
magnetic susceptibility
• logRatio2: the log of the ratio of magnetic susceptibility measured
by anhysteretic remanent magnetisation to isothermal
magnetisation
• sqrtHIRM: the square-root of the hard-coercivity remanent
magnetisation, which is measured in 10^−8 m (A/kg)^1/2.


The transformations of the measurements were chosen after some


exploratory data analysis so that the distributions were approximately
normal. The first six observations from the Ada Tepe dataset are
given in Table 12.
Table 12 The first six observations from adaTepe

magSus logRatio1 logRatio2 sqrtHIRM


235.34 4.055084 −0.6348783 23.97916
112.90 4.102974 −0.5798185 27.47726
198.34 4.002412 −0.2876821 16.58312
248.01 4.382027 −0.3566749 27.83882
212.19 4.716622 −0.5978370 30.82207
175.53 4.374750 −0.4620355 23.97916

Source: Jordanova et al., 2020

Notebook activity B1.1 Using R to implement


hierarchical clustering
In this notebook, you will implement hierarchical clustering for
parking and explore the results. In particular, you will learn how to
produce a dendrogram.

Notebook activity B1.2 Applying hierarchical clustering


to the Ada Tepe dataset
In this notebook, you will apply hierarchical clustering to adaTepe
and assess the solution. In particular, you will learn how to obtain
plots of the dissimilarity matrix and the silhouette statistics.

4 Partitional clustering
In the previous section, you learnt about an approach to finding clusters in
data that involved successively merging the closest two clusters. However,
this is not the only strategy that can be used. In this section, you will
learn another approach to finding clusters, one that is based on splitting
(‘partitioning’) the observations into exactly k groups.
Of course, just putting n observations into k groups is easy. All this
requires is allocating an integer between 1 and k to each observation. The
tricky bit is coming up with an allocation of integers to observations that
corresponds to the most convincing clusters. In principle, all possible
allocations could be investigated to see which one comes out best (as measured, for example, by the mean silhouette statistic). However, for all
but the smallest of datasets, the number of possible allocations is too large


for this to be practical. For example, even when the number of


observations is as few as 20, the total number of allocations of
observations to clusters can be in the billions, as Table 13 demonstrates.
Table 13 Number of different possible allocations to clusters

Number of Number of clusters


observations 2 3 4
4 7 6 1
5 15 25 10
6 31 90 65
8 127 966 1701
10 511 9330 34 105
15 16 383 2 375 101 10 391 745
20 524 287 5.81 × 10^8 4.52 × 10^10
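The counts in Table 13 are Stirling numbers of the second kind, and they can be reproduced with a short recursive function. The base R sketch below uses the standard recurrence; the function name n_allocations is just an illustrative choice.

# Number of ways to split n observations into exactly k non-empty clusters,
# using the recurrence S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1)
n_allocations <- function(n, k) {
  if (k < 1 || k > n) return(0)
  if (k == 1 || k == n) return(1)
  k * n_allocations(n - 1, k) + n_allocations(n - 1, k - 1)
}

n_allocations(5, 3)   # 25, the value quoted in Activity 11
n_allocations(20, 3)  # 580606446, i.e. about 5.81 × 10^8, as in Table 13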

Instead, algorithms are used to try to find the best allocation of the
observations to k clusters without trying all possible different ways of
doing the allocation. There are different ways in which these algorithms
work. Here we will consider one approach, based on breaking the task into
two easier subtasks:
• allocating observations to clusters with known centres
• finding the centre of each of the clusters based on the observations that
are allocated to it.
These two subtasks will be the focus of Subsections 4.1 and 4.2,
respectively. In Subsection 4.3 you will see how these two subtasks are
combined, and Subsection 4.4 deals with the issue of how the algorithm
starts and how it stops. Then Subsection 4.5 deals with the issue of how to
choose k, the number of clusters. Finally, in Subsection 4.6, you will use R
to do some partitional clustering yourself.

4.1 Subtask 1: allocating observations to


clusters
The first subtask we will focus on, and the one you will consider in the
next activity, is allocating observations to clusters when the centres of
these clusters are known.

Activity 16 Allocating observations to clusters

Suppose that, in a specific set of data, the centres of the clusters were all
known. Suggest a way in which the observations might be allocated to the
different clusters.


As you have seen in Activity 16, if we know where the centres of the
clusters are, it is reasonable to allocate observations to clusters using the
rule given in Box 15.

Box 15 A rule for allocating observations to clusters


Allocate an observation to the cluster whose centre is the closest to
this observation.

When the dissimilarity measure being used is Euclidean distance (a


restriction we will also use in the next subsection), this rule amounts to
splitting up the space in which observations lie into k convex regions, one
region for each cluster. (In mathematics, such a partition of space is
known as a Voronoi diagram.) For example, Figure 20 shows one such
division of a two-dimensional plane when there are k = 5 clusters. In this
plot, the plotted points correspond to the location of the five different
cluster centres. The shaded areas around each point correspond to the
possible position of observations that would be allocated to each cluster.

Figure 20 The division of a plane into five different clusters using the rule
given in Box 15

Notice that, based on Figure 20, wherever an observation is placed in this


plot, it will be allocated to a cluster. Also, some boundaries are closer to
the centres of the clusters that they are separating than other boundaries
are. It means that some observations might be placed in a cluster despite
not being particularly close to the cluster centre or, possibly, other
observations in the same cluster. In Activity 17, you will use the rule in
Box 15 to allocate some observations to clusters.
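As a sketch of how this rule might be coded, the base R function below returns, for each observation, the index of its nearest centre under Euclidean distance. The function name allocate and the two-dimensional centres and observations are purely illustrative values, not data from the module.

# Box 15 as a function: each observation goes to its nearest cluster centre
allocate <- function(obs, centres) {
  apply(obs, 1, function(x) {
    which.min(apply(centres, 1, function(m) sqrt(sum((x - m)^2))))
  })
}

# Hypothetical centres and observations, purely for illustration
centres <- rbind(c(0, 0), c(3, 3))
obs     <- rbind(c(0.5, 1), c(2.5, 2), c(4, 3))
allocate(obs, centres)  # 1 2 2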


Activity 17 Allocating observations to clusters based on


closeness to cluster centres
In Example 14 (Subsection 2.1), you saw how hierarchical clustering could
be used to analyse the data about five friends: Adnan, Billy, Cath, Dan
and Elise. In this activity, you will allocate these friends to two clusters
based on the centres of these clusters. As a reminder, their heights are
given below.

Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182

Suppose it is known that there are two clusters with centres at 160 cm
(Cluster 1) and 170 cm (Cluster 2), and it is decided that the dissimilarity
function is Euclidean distance.
Allocate the five friends to clusters.

Having the rule given in Box 15 is all very well. However, it does rely on
one very big ‘if’: if the centres of the clusters are known. This information
is not likely to be known at the outset. If the centres of all the clusters are
known before the cluster analysis is done, cluster analysis is unlikely to be
needed at all – sufficient information is probably already known about the
clusters. So what is the way forward? This is where Subtask 2 comes in.

4.2 Subtask 2: estimating the centres of


clusters
Estimating the position of the centres of clusters is in general not an easy
task. First, there is the issue of what exactly we mean by the centre of the
cluster, the definition of which is explored in Activity 18.


Activity 18 Defining the centre

Recall that in Subsection 1.1 you were introduced to some data collected
about the Old Faithful geyser. Figure 21 is the scatterplot of the data, after they have been standardised. Additionally, in Figure 21, the observations have been split into two clusters, with each cluster of observations presented using a different symbol.
Looking at the scatterplot, suggest a way in which the centre of each
cluster could be defined.
Figure 21 Durations and waiting times between eruptions (both standardised), split into two clusters

In Activity 18, you saw that there are different ways of defining the
position of the centre of a cluster. In the context of partitional clustering,
it is possible to base the position of the centre on the dissimilarity
measure: the cluster centre is defined as the position that minimises the total dissimilarity between it and all of the observations in the cluster. In general, this
leads to a non-trivial minimisation problem.


In this unit, though, we will restrict ourselves to the situation when the
chosen dissimilarity function is Euclidean distance. In this case, it is
possible to work out the centre of the cluster using the formula given in
Box 16. This means it can be computed quickly and easily, as you will see
in Example 22. In Activity 19, you will calculate the centre of a cluster
when the dissimilarity function is Euclidean distance.

Box 16 The centre of a cluster when the dissimilarity is


the Euclidean distance
Let xj1 , xj2 , . . . , xjm be m observations allocated to a cluster j,
where xjr = (xjr1 , xjr2 , . . . , xjrp ), for r = 1, 2, . . . , m. Also, let the
dissimilarity between two observations xja and xjb be the Euclidean
distance.
Then the centre, xj , of cluster j corresponds to the position for which
xj = (xj1 , xj2 , . . . , xjp ),
where xjl is the mean of the values for the lth variable of the m
observations in the cluster, for l = 1, 2, . . . , p, that is,
xjl = (xj1l + xj2l + . . . + xjml)/m.
The centre point, xj , is also known as the centroid of the data.

Example 22 Evaluating the centre of a cluster


In Activity 10 (Subsection 2.2), you considered the cluster by brand of
samples of a Chinese liquor. In that activity, just two of the variables
were considered: the concentration of ethyl acetate and the
concentration of ethyl lactate.
The values (after standardisation) for the first seven samples,
representing all the samples from two brands (coded as LZLJ
and WLY), are given in Table 14, next.


Table 14 The first seven observations in the Chinese liquor dataset

Sample Brand Ethyl acetate Ethyl lactate


concentration (X1 ) concentration (X2 )
1 LZLJ 0.407 −0.658
2 LZLJ 0.692 −0.384
3 LZLJ 0.429 −0.579
4 LZLJ 0.092 −0.852
5 WLY 0.959 0.095
6 WLY 0.452 −0.771
7 WLY 0.788 0.293

So, for the brand LZLJ, the mean standardised ethyl acetate
concentration is
(0.407 + 0.692 + 0.429 + 0.092)/4 = 0.405
and the mean standardised ethyl lactate concentration is
(−0.658 + (−0.384) + (−0.579) + (−0.852))/4 ≃ −0.618.
Similarly, for the brand WLY these two values are 0.733 and −0.128,
respectively.
Therefore, the centre of the cluster given by the brand LZLJ
is (0.405, −0.618) and the centre of the cluster given by the
brand WLY is (0.733, −0.128).
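When the data are held in R, these centroids are just column means taken within each cluster. A small sketch (re-entering the Table 14 values into a data frame with the illustrative name liquor7) is given below.

# Standardised concentrations for the first seven samples (Table 14)
liquor7 <- data.frame(
  brand = c("LZLJ", "LZLJ", "LZLJ", "LZLJ", "WLY", "WLY", "WLY"),
  x1    = c(0.407, 0.692, 0.429, 0.092, 0.959, 0.452, 0.788),
  x2    = c(-0.658, -0.384, -0.579, -0.852, 0.095, -0.771, 0.293))

# Centroids are the variable-by-variable means within each brand
colMeans(liquor7[liquor7$brand == "LZLJ", c("x1", "x2")])  # 0.405  -0.618
colMeans(liquor7[liquor7$brand == "WLY",  c("x1", "x2")])  # 0.733  -0.128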

Activity 19 Calculating a cluster centre

We now return to the data about the heights of five friends (introduced in
Subsection 2.1), which was repeated in Activity 17. In Example 21, at
Iteration 3, a cluster consisting of {Adnan, Dan, Elise} was suggested.
Suppose it is decided that the dissimilarity function is Euclidean distance.
Calculate the centre of this cluster.

So, as Example 22 and Activity 19 have shown, finding the centre of each
cluster is straightforward if it is known which cluster each observation
belongs to. But, again, this is another very big ‘if’. If we should happen to
know which cluster each observation belongs to, there would be no need
for any cluster analysis. This means that by themselves neither subtask is
helpful. However, as you will see in the next subsection, by combining
them, we can make progress.


4.3 Solving the bigger task


As was seen in Subsections 4.1 and 4.2, within the problem of finding
cluster centres and the allocation of observations to clusters, there are two
easier subtasks.
• Subtask 1: allocate observations to clusters assuming that the cluster
centres are known.
• Subtask 2: find the cluster centres assuming that the allocation of
observations to clusters is known.
You will compare these two subtasks in the next activity.

Activity 20 Comparing subtasks

Activities 16 and 19 (Subsections 4.1 and 4.2, respectively) considered two


fairly straightforward subtasks.
• Allocate observations to clusters assuming that the cluster centres are
known.
• Find the cluster centres assuming that the allocation of observations to
clusters is known.
Taken together, what do you notice about these two subtasks?

Unfortunately, as was pointed out at the end of Subsections 4.1 and 4.2,
the assumptions associated with both of these subtasks are unreasonable.
Neither assumption is something that is likely to be known before the
cluster analysis. Or at least, if either is known beforehand, there would be
little need for cluster analysis.
So, why bother with either of these subtasks? It turns out that progress
can be made by repeatedly performing each of the subtasks in turn. For
example, the following scheme could be used.
• Perform Subtask 1 to estimate which cluster each observation belongs to.
• Using this cluster allocation, perform Subtask 2 to estimate the cluster
centres.
• Using these cluster centres reperform Subtask 1.
• Using the new cluster allocations reperform Subtask 2.
• And so on, cycling between reperforming Subtask 1 and reperforming
Subtask 2.
This scheme is illustrated in Figure 22, next.


Figure 22 A schematic diagram illustrating using simpler subtasks to estimate cluster centres and the allocation of observations to clusters: assuming the cluster centroids are known, allocate observations to clusters; assuming the allocation of observations to clusters is known, calculate the cluster centres; then cycle between the two

The scheme stops according to a stopping rule that will be discussed in


Subsection 4.4.
When the definition of a cluster centre given in Box 16 is used, the
approach is known as k -means clustering.
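A compact base R sketch of this cycle is given below for two clusters, using Euclidean distance and purely illustrative two-dimensional values (it does not handle the possibility of a cluster becoming empty). The stopping conditions used here anticipate the rules discussed in Subsection 4.4.

# Illustrative observations and initial centres (not data from the module)
obs     <- rbind(c(0.5, 1), c(2.5, 2), c(4, 3), c(0, 0.5))
centres <- rbind(c(0, 0), c(3, 3))

for (iteration in 1:10) {                     # cap on the number of cycles
  # Subtask 1: allocate each observation to its nearest centre
  allocation <- apply(obs, 1, function(x) {
    which.min(colSums((t(centres) - x)^2))
  })
  # Subtask 2: recompute each centre as the mean of its allocated observations
  new_centres <- rbind(colMeans(obs[allocation == 1, , drop = FALSE]),
                       colMeans(obs[allocation == 2, , drop = FALSE]))
  if (identical(new_centres, centres)) break  # nothing changed, so stop cycling
  centres <- new_centres
}

allocation  # 1 2 2 1 for these illustrative values
centres     # the corresponding cluster centres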
In the next activity, you will put this cycling of subtasks into practice.

Activity 21 Cycling through the subtasks

In this activity, we return again to the data about the heights of five
friends (introduced in Subsection 2.1). A repeat of Table 9 from Activity 6
is again given below for convenience.

Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182

Suppose we are now interested in finding two clusters in these data. (This
corresponds to the smallest non-trivial number of clusters. Finding one
cluster is trivial – it’s just all the friends in the same cluster.) Further,
suppose that the cluster centres are initially thought to be 160 cm and
170 cm and that the dissimilarity measure is Euclidean distance.
(a) Based on this information, do Subtask 1. In other words, allocate the
five friends to the two clusters.
(b) Using the allocation of clusters you obtained in part (a), do
Subtask 2. That is, re-estimate the cluster centres.
(c) Using the cluster centres you estimated in part (b), re-allocate the five
friends to the two clusters.
(d) Using the allocation of clusters you obtained in part (c), do
Subtask 2. That is, re-estimate the cluster centres.


However, before this becomes a workable algorithm, there are two issues
that still need to be resolved: how do you start and when do you stop? We
will consider these issues in the next subsection.

4.4 Starting and stopping


In the previous subsection, you saw how to combine Subtask 1 and
Subtask 2 to solve the bigger task. However, for this to work we need to
know how to start, and when to stop. Of the two, the stopping rule is the
easier to deal with.
Stopping rules for k-means clustering are given in Box 17 and then you
will use them in Activity 22.

Box 17 Stopping rules for k-means clustering


In the k-means algorithm, stop successively doing the subtasks when
either of the following conditions is satisfied.
• The allocation of observations to clusters does not change.
• The cluster centres do not change.
In either case, the algorithm is then said to have converged, and hence
it has been successful in finding a solution.
Also, stop if the following condition is satisfied.
• The subtasks have been solved more than a pre-specified number of
times.
This condition ensures that the algorithm will always come to an end.
However, if this is the reason for stopping, the algorithm has not been
successful in finding a solution.

Activity 22 Using the stopping rules

In Activity 21 (Subsection 4.3), you started to iteratively perform the


subtasks in the case of finding two clusters in the heights of five friends.
In that activity you were asked to perform each subtask twice – that is, to
perform two iterations of the algorithm.
(a) Suppose that it is decided that the maximum number of iterations is
five. Should the algorithm stop after the two iterations performed in
Activity 21? Why or why not?
(b) Whatever you decided in part (a), perform one more iteration. That
is, start by assuming the cluster centres are those that you estimated
in Activity 21(d).
(c) Should the algorithm now stop? Why or why not?


In the Solution to Activity 22, two of the criteria for stopping were
simultaneously met. In the following activity, you will consider whether
this is likely to happen in general.

Activity 23 Equivalence of stopping rules

(a) Consider Subtask 1. Will the same allocation of observations to


clusters always be obtained if values of the cluster centres do not
change?
(b) Consider Subtask 2. Can the cluster centres change if the allocation
of observations to clusters does not change?
(c) Hence explain why the first two conditions of the stopping rule are
equivalent.

As you found out in Activity 23, the first two conditions of the stopping
rule are effectively equivalent. If one is satisfied, so is the other. In both
cases it means a solution of the bigger task has been found. Moreover, this
is a stable solution. Performing more iterations of the algorithm will not
change anything.
In theory, the k-means algorithm should always converge. This means that
a stable solution can always be found. However, there is no guarantee
about how long it will take to find this stable solution. So the third
stopping condition, limiting the number of iterations, ensures that the
algorithm does not take an excessive amount of time to come to a halt.
We now just need to address how to get the process started. We either
have to guess the values of the k cluster centres or an allocation of
observations to clusters. Either way, ideally this guess should be a ‘good’
one that represents the clusters well. Such a guess could be based on prior
knowledge about such data, or as a result of some initial data analysis.
However, it is not essential for the guess to be good. So the starting
solution can be as given in Box 18 and demonstrated in Example 23.

Box 18 Starting a partitioning algorithm


One method of starting the k-means algorithm is to select k
observations to represent the cluster centres. The only restriction is
that none of the selected observations can be identical to any of the
other selected observations.


Example 23 A starting point when clustering the five


friends
For the data in Activity 6 (Subsection 2.1) giving the height of five
friends, note that each friend has a different height (when measured to
the nearest cm). So the k-means algorithm can be started by simply
choosing a pair of friends at random, and using their heights as the
cluster centres.

But does the choice of initial positions of the cluster centres matter? This
is what you will explore in Activity 24.

Activity 24 Starting in a different place

In Activities 21 and 22, you used k-means to cluster the five friends. In
those activities, the initial cluster centres were taken to be 160 cm and
170 cm. Suppose instead that at the start of the partitional algorithm the
initial cluster centres were taken to be the heights of Adnan (180 cm) and
Dan (193 cm). Repeat the algorithm to find the stable solution. (You
should not require more than two iterations in this case.) Compare your
solution with the one you obtained in Activity 22.

As you have seen in Activity 24, choosing different starting values for the
cluster centres can lead to different clusters being identified. Even though
in both cases the algorithm converged, these differences are more than just
labelling the clusters in a different order. Thus, any solution produced
by k-means clustering can only be regarded as ‘a’ solution, not
‘the’ solution.
Using an overall numerical measure of how good a clustering solution is,
such as the mean silhouette statistic (Subsection 2.3), it is possible to
compare these different solutions. One way of ensuring that the best stable
solution is found is to use each possible combination of k observations from
the dataset as a starting point. This is done in Example 24 and
Activity 25.


Example 24 Effect of changing the starting point


For the group of five friends, there are ten ways in which two of them
can be picked to represent the initial cluster centres. Applying
k-means clustering with each of these ways as a starting point yields
one of two stable solutions:
• {Billy, Cath} and {Adnan, Dan, Elise}
• {Adnan, Billy, Cath, Elise} and {Dan}.
The stable solution with clusters consisting of {Billy, Cath} and
{Adnan, Dan, Elise} is arguably the better stable solution. This is
because the average silhouette statistic for this stable solution is 0.569
compared with an average silhouette statistic of 0.289 for the stable
solution {Adnan, Billy, Cath, Elise} and {Dan}.
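In R, this strategy of trying several starting points does not have to be coded by hand: the kmeans() function in the stats package has an nstart argument that runs the algorithm from several random starting configurations and keeps the run with the smallest total within-cluster sum of squares (a different overall measure from the mean silhouette statistic used above). A sketch, re-using the seven standardised samples from Table 14 purely as an illustration, is:

# The seven standardised samples from Table 14 (ethyl acetate, ethyl lactate)
x <- cbind(x1 = c(0.407, 0.692, 0.429, 0.092, 0.959, 0.452, 0.788),
           x2 = c(-0.658, -0.384, -0.579, -0.852, 0.095, -0.771, 0.293))

set.seed(1)                     # make the random starts reproducible
km <- kmeans(x, centers = 2,    # k = 2 clusters
             nstart = 25,       # try 25 random starting configurations
             iter.max = 100)    # cap on the number of iterations
km$cluster  # the allocation of the seven samples to the two clusters
km$centers  # the corresponding cluster centres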

Activity 25 Two cluster solutions for the Chinese liquor


dataset
In this activity, we return to the Chinese liquor dataset introduced in
Subsection 1.1. As in Example 22, we will just consider two of the
variables in this dataset: the standardised concentrations of ethyl
acetate (X1 ) and ethyl lactate (X2 ).
Suppose the data are to be placed in two clusters. That is, we apply
k-means clustering with k = 2.
Trying all possible different starting positions, three different solutions are
found. These are shown in Figure 23.
(a) Compare the solutions given in Figure 23. In what way are the
samples in the two clusters the same and in what way do they differ?
(b) Do they all represent reasonable divisions of the observations into two
clusters? Which solution seems best?
(c) The mean silhouette statistics for the three clustering solutions
are 0.404, 0.407 and 0.429, respectively. Which solution now looks
best?


Figure 23 Three different two-cluster solutions, (a), (b) and (c), for the Chinese liquor dataset obtained using k-means clustering (axes: standardised ethyl acetate and ethyl lactate concentrations)

For most datasets, trying all possible combinations of k observations as


initial guesses is not going to be feasible, as there will be too many of
them. So we cannot guarantee that the best solution will be found.
However, a sample of initial guesses can be tried. Without any additional
knowledge about the data, the best strategy is to try a random sample of
them. By trying enough of them, we hope that the best of the resulting
converged solutions will be good enough. That is, it will provide a useful
clustering of the data, and one that is hopefully close to the best solution.


4.5 Selecting k
So far, you have seen how partitional clustering works by making two
complementary assumptions in turn:
• the cluster centres are known
• the allocation of observations to clusters is known.
Notice that implicit in both of these assumptions is the following
assumption:
• the number, k, of clusters is known.
As the number of clusters is an assumption made when performing
Subtask 1 and Subtask 2, it is an assumption that has to be made
for k-means clustering. In some situations, this assumption will be a
reasonable one to make. For example, in situations when a specific number
of clusters are sought. However, often the ‘right’ number of clusters will
not be known. So what then? The answer is simply to repeat k-means
clustering using a range of different values for k. In Example 25, you will
see how such a range can be chosen.

Example 25 Assessing values of k to try


Subsection 1.1 introduced data about occupancy in a car park. It can
be argued that finding more than ten clusters in these data would lead
to an overly fragmented solution with too many clusters to interpret.
This means that the maximum value of k that is worth trying
is k = 10. Also, there is no point trying fewer than k = 2 because k = 1 corresponds to all the data being in a single, large cluster. So
the k-means algorithm would be applied to k = 2, 3, 4, . . . , 9, 10.

As you will see in Example 26, once we have obtained the best stable
solution for each value of k, we then compare them to see which seems best
overall.


Example 26 Comparing solutions for different values of k


In Example 25, it was stated that the range of values of k that is
worth considering is k = 2, 3, 4, . . . , 9, 10.
Using Euclidean distance as the dissimilarity function, k-means
clustering was performed for each value of k to find the best stable
solution. The average silhouette statistics for these stable solutions
are given in Table 15.
Table 15 Average silhouette statistics for a range of k

k Average silhouette statistic


2 0.887
3 0.801
4 0.781
5 0.790
6 0.799
7 0.780
8 0.688
9 0.686
10 0.678

From this table, the value of k with the highest average silhouette
statistic is k = 2. This indicates that there are two clusters in the
data.
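A sketch of how a table like this might be produced in R combines kmeans() with silhouette() from the cluster package. The illustration below uses the seven standardised samples from Table 14 (so only a small range of k is sensible), not the parking data behind Table 15.

library(cluster)  # provides silhouette()

# Seven standardised samples from Table 14, used here only as an illustration
x <- cbind(x1 = c(0.407, 0.692, 0.429, 0.092, 0.959, 0.452, 0.788),
           x2 = c(-0.658, -0.384, -0.579, -0.852, 0.095, -0.771, 0.293))
d <- dist(x)  # Euclidean distances

set.seed(1)
mean_sil <- sapply(2:5, function(k) {
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])     # the mean silhouette statistic for this k
})
names(mean_sil) <- 2:5
mean_sil  # the k with the largest value suggests the number of clusters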

One method for comparing the solutions is to plot the overall silhouette
statistics produced for each value of k. You will do this in the next activity.


Activity 26 Number of clusters in the Chinese liquor data

In Activity 25, you considered two-cluster solutions for the Chinese liquor
dataset using just two variables: the concentrations of ethyl acetate (X1 )
and ethyl lactate (X2 ).
k-means was applied to these data (after standardisation) for a range of
possible k.
The average silhouette statistics for the best stable solution for each value
of k are given in Figure 24.

Figure 24 Average silhouette statistics for a range of values of k (horizontal axis: k; vertical axis: average silhouette statistic)
(a) Based on Figure 24, how many clusters do there appear to be in the
data?
(b) Plots of all the cluster solutions corresponding to 2, 3, 4 and 5
clusters are given in Figure 25. Based on these plots, does your
answer to part (a) make sense?


Figure 25 Cluster solutions for the Chinese liquor dataset with (a) k = 2, (b) k = 3, (c) k = 4 and (d) k = 5, obtained using k-means clustering (axes: standardised ethyl acetate and ethyl lactate concentrations)

Recall from Section 3 that, with agglomerative clustering, the number of


clusters is also found by comparing solutions involving different numbers of
clusters. In Subsection 3.3 you learnt that, when doing agglomerative
clustering, the plot that is often used to do that is the dendrogram.
You might be wondering why the dendrogram is not used with partitional
clustering. The reason is that with partitional clustering, the solution
with k clusters does not necessarily correspond to splitting a cluster in
the (k − 1)-cluster solution. This is demonstrated in the next example.


Example 27 Comparing four-cluster and five-cluster


solutions obtained using k-means clustering
In Activity 26, plots of four-cluster and five-cluster solutions for the
Chinese liquor dataset are given (Figures 25(c) and 25(d)).
Notice, in Figure 25(d), that the central cluster is formed from
observations belonging to two different clusters in
Figure 25(c). So, the five-cluster solution cannot be formed by
sub-dividing only one of the clusters in the four-cluster solution.

So, with partitional clustering, there is not the direct link between a k-
and a (k − 1)-cluster solution needed to construct a dendrogram.

4.6 Using R to implement partitional


clustering
So far in this section, you have been doing partitional clustering by
performing k-means clustering by hand. In this subsection, you will
instead use R. You will start by learning how to use R to implement
k-means clustering in Notebook activity B1.3. You will then complete the
subsection with Notebook activity B1.4, where you will use R to obtain
output that helps with the selection of k, the number of clusters in
parking.

Notebook activity B1.3 Doing k-means clustering


In this notebook, you will implement k-means clustering for the
parking dataset (introduced in Subsection 1.1).

Notebook activity B1.4 Choosing the number of clusters


In this notebook, you will implement partitional clustering for
different numbers of clusters for parking and obtain output suitable
for deciding on the number of clusters.


5 Density-based clustering
The two clustering methods which have been discussed so far both require
a separate step to decide the most appropriate number of clusters.
However, not all approaches to clustering need this. Some, like the
clustering method you will learn about in this section, DBScan, estimate
the number of clusters at the same time as determining cluster
membership.
The ‘DB’ in DBScan stands for ‘density-based’. This is because the
DBScan approach pre-specifies how densely observations need to be packed
together for them to be regarded as forming a cluster. The algorithm then
finds clusters that conform to this specification.
As has already been mentioned, focusing on whether observations are
packed closely enough to be in a cluster has the advantage that we will not
have to separately decide how many clusters there might be. It also means
that, unlike agglomerative clustering and partitional clustering, the
possibility of a few observations not being in any cluster (that is, being
outliers) is allowed for in DBScan.
In DBScan, the definition of ‘packed closely enough’ to be in a cluster
relies on two parameters. These are given in Box 19.

Box 19 Parameters that need to be specified in DBScan


In DBScan the following two parameters need to be specified:
• gmin , the minimum number of observations required to be close
(enough) together – that is, ‘nearby’
• dmax , the radius within which observations are counted as ‘nearby’.
Note that radius dmax is measured relative to whichever dissimilarity
measure is deemed appropriate.

The two parameters, gmin and dmax , will be demonstrated in the following
example.


Example 28 Deciding whether regions are parts of


clusters
Recall that Example 4 was based on data about the Old Faithful
geyser. In that example, the observations were informally split into
two clusters. Figure 26 displays these data again, this time after both
variables have been standardised.

Figure 26 Two circular regions on the plot of durations and waiting times between eruptions (duration and waiting time both standardised)

Now suppose it is decided that we require at least ten observations


scattered within a radius of 0.35 for observations to be regarded as
being dense enough to be in a cluster. This means that gmin = 10 and
dmax = 0.35 (and that the dissimilarity measure is Euclidean distance).
In Figure 26, two circular regions each with radius 0.35 are displayed.
Notice that in the orange region centred on the position (−1.4, −1.4)
there are lots of points, clearly more than ten. In contrast, in the blue
region that is centred on the position (0.15, −0.5), there are just
four points.


So, the density of points in the first region is sufficient for it to be


regarded as part of a cluster, whereas the second region would not be
regarded as being part of a cluster.

In principle, it would be possible to divide up the space in which the


observations sit into subregions and test whether each of them meets the
definition of being part of a cluster. However, the number of subregions
needed to get a reasonable assessment depends on both the number of
variables and the range of reasonable values for each variable, and quickly becomes too
vast for this approach to be practical.
So, instead, the focus is just on where there are observations. This still
allows us to decide whether an observation is somewhere that is part of a
cluster whilst limiting the number of positions that need to be checked to
be no more than the number of observations.
DBScan works by identifying the clusters one at a time. This is done by
first identifying an observation that is in the new cluster, and then adding
other observations to the cluster until there are no more observations
sufficiently close, or there are simply no more observations left. So the
algorithm has two phases:
• Phase 1: finding a new cluster
• Phase 2: identifying all the observations belonging to that cluster.
You will learn more about these two phases in Subsections 5.1 and 5.2,
before seeing how they are put together into a complete algorithm in
Subsection 5.3 and using R to implement it in Subsection 5.4.
At the heart of the algorithm is the idea that every observation is in one of
three states: unlabelled, labelled as an ‘outlier’ or labelled as belonging to
a cluster. This enables the algorithm to keep track of which observations
have been considered and, for those that have been considered, what was
the conclusion. This labelling is summarised in Box 20.

Box 20 Possible states of observations in DBScan


In DBScan, observations are in one of three states:
• unlabelled – that is, it is unknown whether it belongs to a cluster or
not
• labelled as an ‘outlier’ – that is, not belonging to any cluster
• labelled as belonging to a cluster (and if so, which cluster).

When the algorithm starts, every observation is unlabelled. The algorithm


ends when every observation has been given a label. That is, every
observation is labelled as either being in a cluster or as being an outlier.
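As an informal illustration of this bookkeeping (not how any particular implementation must store it), the states could be recorded in R as a single vector with one entry per observation, where NA means ‘unlabelled’, 0 means ‘outlier’ and a positive integer gives the cluster number; the names and numbers used here are purely illustrative.

n <- 100                        # number of observations (illustrative)
labels <- rep(NA_integer_, n)   # every observation starts unlabelled

labels[5] <- 0L    # for example, observation 5 labelled as an outlier
labels[17] <- 1L   # observation 17 labelled as belonging to cluster 1

all(!is.na(labels))   # TRUE once the algorithm has labelled every observation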


5.1 Phase 1: finding a new cluster


The first phase in the DBScan algorithm is all about finding a new cluster
in the data. More specifically, it is about finding an unlabelled point that
is in the interior of a cluster. Because only unlabelled observations are
considered, if such an observation is found, the cluster it is in must be a
new one, and it provides a starting point for Phase 2.
This process of course depends on what is meant by an observation being
in the interior of a cluster. In DBScan the following definition is used.

Box 21 Interior of a cluster


An observation is defined as being in the interior of a cluster if there
are at least gmin observations (including itself) in the dataset whose
dissimilarity from it is no more than dmax .

Identifying the interiors of clusters in practice is discussed in the following


example.

Example 29 Interiors of clusters for the data about the


Old Faithful geyser
In Example 28, the values gmin = 10 and dmax = 0.35 were considered
for clustering the data about the Old Faithful geyser.
Figure 27 is a scatterplot of these data. The blue triangles denote
observations that are in the interior of a cluster. The black dots
correspond to all the observations that are not in the interior of a
cluster.
Notice how the interior points form the core of clusters, with the other
points around the outside of these cores. However, knowing that two
observations are interior points does not mean that they are in the
same cluster.


Figure 27 Observations in the Old Faithful geyser data (duration and waiting time, both standardised) classified according to whether they are in the interior of a cluster

Identifying the interiors of clusters for the heights of a group of friends is


discussed in the following example.

Example 30 Identifying which friends are in the interior


of a cluster
Recall that Activity 6 (Subsection 2.1) introduced data giving the
height of five friends. The table is repeated below for convenience.
Suppose that the dissimilarity measure is the L1 distance and that
friend i is deemed to be in the interior of a cluster if there are at least
three friends such that d(hi , hj ) ≤ 10. That is, dmax = 10
and gmin = 3.


Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182

Consider first Adnan. Adnan’s height is 180 cm, so what matters is


how many of the friends have a height no more than 10 cm different
to 180 cm; that is, how many are in the range 170 cm to 190 cm
inclusive. Looking at the table reveals that there are three friends who
meet this criterion – Adnan, Billy and Elise – and this is the same as
the value of gmin . From Box 21, this means that Adnan is in the
interior of a cluster. Similarly, Billy is also in the interior of a cluster
(having three friends sufficiently close in height, including themself).
Cath, Dan and Elise on the other hand are not in the interior of a
cluster. For Cath and Elise, each only has one other friend whose
height is within 10 cm of their own (making a total of two friends
including themselves). Also, only Dan has a height in the
range 183 cm to 203 cm inclusive (no more than 10 cm from Dan’s
height).
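The check carried out in Example 30 is straightforward to reproduce computationally. Here is a minimal R sketch using the friends' heights, the L1 distance, dmax = 10 and gmin = 3; the object names are just for illustration.

heights <- c(Adnan = 180, Billy = 170, Cath = 164, Dan = 193, Elise = 182)
dmax <- 10
gmin <- 3

# Matrix of pairwise L1 distances between the heights
D <- as.matrix(dist(heights, method = "manhattan"))

# Interior if at least gmin observations (including itself) are within dmax
n_nearby <- rowSums(D <= dmax)
interior <- n_nearby >= gmin
interior   # TRUE for Adnan and Billy only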

In Activity 27, you will be asked to identify the interiors of clusters for the
same dataset as in the previous example, but for different combinations
of dmax and gmin .

Activity 27 Trying other values of dmax and gmin

In Example 30, values of dmax = 10 and gmin = 3 were used to determine


which friends were in the interior of a cluster. For the following choices
of dmax and gmin , again determine which friends are in the interior of a
cluster, and which are not.
(a) dmax = 10 and gmin = 2
(b) dmax = 10 and gmin = 4
(c) dmax = 6 and gmin = 2
(d) dmax = 1 and gmin = 2
(e) dmax = 1 and gmin = 1

In Example 30 and Activity 27, all the observations in the dataset were
checked to see if they were in the interior of a cluster. However, during
Phase 1, this process is speeded up by checking only unlabelled points.


Those which have already been labelled, either as belonging to a cluster or


as an outlier, are skipped. Furthermore, once an unlabelled observation
has been deemed to be in the interior of a cluster, Phase 1 ends and
Phase 2 begins.
Of course, it might be that none of the unlabelled observations are deemed
to be in the interior of a cluster. For example, this was found in parts (b)
and (d) of Activity 27. This then means that there are no more new
clusters to be found and the algorithm ends.
Overall, Phase 1 of the algorithm can be summarised in the way given in
Figure 28.

Figure 28 Outline of Phase 1 of DBScan: select an unlabelled observation and count how many observations are sufficiently close to it; if there are enough, switch to Phase 2, otherwise label the observation as an outlier and, if any unlabelled observations are left, select another unlabelled observation and repeat; if none are left, stop


5.2 Phase 2: identifying all the


observations in a cluster
Phase 2 of the DBScan algorithm deals with finding all the observations in
a cluster. It starts by assuming that one initial observation in the interior
of the cluster has been identified. But is this assumption reasonable? This
is what you will consider in the next activity.

Activity 28 Reasonable assumption?

Why is it reasonable to assume one observation in the interior of a cluster


has been identified?

This phase of the algorithm works by generating a set of observations that


are deemed to be in the same cluster as the initial observation (and only
those observations).
Every observation in this cluster set is considered in turn. If it is deemed
to be an interior point (using the definition in Box 21), it adds other
observations to the cluster set. These extra observations correspond to all
the observations that are within dmax of it and have not already been
labelled as being in a cluster.
However, it is likely that not all observations in the cluster set will be
interior observations. Instead, they might be ‘edge’ observations. The
characteristic of an edge observation is that it is close enough to an interior
observation in the cluster set, but not close enough to at least gmin
observations (which do not have to be in the cluster set). Hence, edge
observations can be thought of as lying on the boundary of the cluster.
This is summarised in Box 22 and demonstrated in Example 31.

Box 22 Types of observation in a cluster


Interior observations are observations that have at least gmin
observations within distance dmax of them.
Edge observations are observations that have fewer than gmin
observations within distance dmax of them (but at least one of these
observations is an interior observation of the cluster).

Example 31 Interior and edge points of a cluster


Figure 27 highlighted all the interior points for the geyser data
when gmin = 10 and dmax = 0.35.


Now suppose that the algorithm enters Phase 2 having found an


interior point that belongs to the cluster in the bottom-left corner.
Figure 29 shows all the interior and edge points associated with that
cluster.

Figure 29 Observations in the data about the Old Faithful geyser classified according to whether they are interior points of the cluster, edge points, or not in the cluster

Notice that all the interior points that are in the cluster appear to
visually form a distinct cluster away from the other points. The
closeness of the edge points to at least one interior point is also
evident: they lie in a region where the points are less dense than for
the interior points.

During Phase 2, there comes a point when all the observations in the
cluster set have been checked. At this point, Phase 2 ends. The application
of Phase 2 in its entirety is demonstrated in Example 32, next. You will
then apply Phase 2 starting with a different observation in Activity 29.


Example 32 Applying Phase 2


Suppose that the DBScan algorithm with dmax = 10 and gmin = 3 is
being applied to the data about the friends. Table 16 shows which
observations are deemed to be sufficiently close to each other. Further
suppose that Phase 1 has just finished, having identified Adnan as an
interior observation in a new cluster.
Table 16 Closeness in height of the friends when dmax = 10

Friend   Height   Number of observations sufficiently close   Observations sufficiently close
Adnan 180 3 Adnan, Billy, Elise
Billy 170 3 Billy, Adnan, Cath
Cath 164 2 Cath, Billy
Dan 193 1 Dan
Elise 182 2 Elise, Adnan

At the start of Phase 2, only Adnan is in the cluster set.


Step 1
Adnan is the first (and currently only) observation in the cluster set.
As has been already established, Adnan is an interior point. So, all
the points sufficiently close to Adnan are added to the cluster set.
That is, the cluster set becomes {Adnan, Billy, Elise}.
Step 2
Now consider Billy. This observation is an interior point as there are
three observations sufficiently close. So, Billy can also add
observations to the cluster set, namely Adnan and Cath. However, as
Adnan is already in the cluster set, this means that the cluster set
becomes {Adnan, Billy, Cath, Elise}.
Step 3
Now consider Cath. Only two observations are sufficiently close, so
this is an edge observation. This means that no more observations are
added to the cluster set.
Step 4
Now consider Elise. Like Cath, only two observations are sufficiently
close. So, this is also an edge observation and hence also does not add
any more observations to the cluster set.
By Step 4, we have checked all the observations in the cluster set.
This means that Phase 2 ends with the cluster set being {Adnan,
Billy, Cath, Elise}.
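The expansion of the cluster set in Example 32 can also be sketched in R. The sketch below only illustrates Phase 2 for a single cluster (it ignores labels belonging to other clusters) and reuses the objects D, dmax and gmin from the sketch following Example 30, with Adnan (observation 1) as the initial interior observation.

cluster_set <- 1   # start with Adnan, the initial interior observation
i <- 1             # position of the next observation in the set to check

while (i <= length(cluster_set)) {
  current <- cluster_set[i]
  nearby <- which(D[current, ] <= dmax)
  if (length(nearby) >= gmin) {
    # An interior observation adds its nearby observations to the set
    cluster_set <- c(cluster_set, setdiff(nearby, cluster_set))
  }
  # An edge observation adds nothing
  i <- i + 1
}

rownames(D)[cluster_set]   # "Adnan" "Billy" "Elise" "Cath"

The final cluster set contains the same four friends as in Example 32, just accumulated in a slightly different order.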


Activity 29 Trying a different initial observation


In Example 32, the initial observation in the cluster set was taken to be
Adnan. However, when gmin = 3 and dmax = 10, Adnan is not the only
interior point – Billy is one too. So, for this activity, repeat Phase 2
starting with Billy as the initial point. Does the cluster set eventually
contain the same friends as in Example 32?

In Activity 29, you found that starting with a different initial interior point
did not make any difference to which observations were in the final cluster
set. This represents a general result: for any given cluster, it does not
matter which observation is chosen as the initial observation – the final
cluster set will be the same.
At this point, all the observations in this cluster set will be labelled as
belonging to the same cluster. Phase 2 then ends and the algorithm flips
back to Phase 1.

5.3 Putting it together


As you have seen in Subsection 5.2, at some point during Phase 2 the
algorithm flips back to Phase 1. Thus the algorithm goes back to trying to
find an interior observation of a cluster.
If another unlabelled interior observation is found, the algorithm flips back
to Phase 2 with this observation as the initial point. This is because the
lack of a label for this observation means it must be part of a new cluster.
Otherwise, as mentioned in Subsection 5.1 and you will see in Example 33,
if an unlabelled interior observation cannot be found, the algorithm comes
to an end because all the clusters have been found.

Example 33 Returning to Phase 1


In Example 32 and Activity 29, you saw that the DBScan algorithm
reverted back to Phase 1 once Adnan, Billy, Cath and Elise had been
labelled as being in the same cluster, say ‘Cluster 1’, and Dan
remained unlabelled.
So, this time in Phase 1 there is only one point to be checked: Dan.
In terms of the number of friends that Dan has sufficiently close in
height, this has not changed since it was considered in Example 30,
and is still one. So, as this is less than gmin = 3, Dan is labelled an
‘outlier’. The algorithm then ends because there is nobody that is
unlabelled.
So, when dmax = 10 and gmin = 3, there is just one cluster identified
consisting of {Adnan, Billy, Cath and Elise} and one outlier, Dan.


Like with partitional clustering, we have an algorithm that flips between


two different tasks – in this case, between finding an unlabelled interior
point and identifying all the points in a cluster. With k-means clustering,
you saw that it is helpful to limit the total number of iterations that will
be done. This is performed to make sure that the algorithm will finish in a
reasonable amount of time. With DBScan, note that the number of
unlabelled observations always diminishes as the algorithm progresses. In
Phase 1, at least one such observation will become labelled – the one that
will then be used as the initial observation in Phase 2. And more might
gain the ‘outlier’ label. Phase 2 will add more labels (or at least not
remove any) as other observations in the same cluster are identified. So,
sooner or later the number of unlabelled observations will run out and the
algorithm comes to a halt.
With k-means clustering, we also saw that resulting clusters can depend on
what the initial guesses for the cluster centres are. With DBScan, we do
not have to supply any estimates about the clusters, such as the cluster
centres. However, the order in which observations are checked in Phase 1 or
Phase 2 has not been tightly specified. So, what difference does the order
make? Thankfully, as you will see in Activity 30, the answer is ‘not much’.

Activity 30 Comparing solutions with the same values


of dmax and gmin
DBScan was applied to the data about the Old Faithful geyser, this time
with dmax = 0.6 and gmin = 16. The algorithm was applied twice, the only
difference being the order in which the observations were presented.
The solutions, given in a graphical form, are presented in Figure 30.
Compare these two solutions. In what way are they different? Is this
difference important?


Figure 30 Two cluster solutions, shown in panels (a) and (b), for the data about the Old Faithful geyser found using DBScan with dmax = 0.6 and gmin = 16 using different orders of observations

As has already been mentioned in Subsection 5.2, for any cluster the same
interior points are always found no matter which initial interior point is
chosen. It also is not possible for an interior point in one cluster to be
regarded as an ‘edge’ point in another cluster because it must be more
than dmax away from the interior points in this other cluster.
Overall, this means that the allocation of interior observations does not
change regardless of the order in which observations are checked. Similarly,
outliers also will not depend on the order either. That just leaves the edge
observations. It is possible for edge points to be within dmax of interior
points in two or more different clusters. For such edge points, the cluster
to which they eventually get allocated depends on the order in which
observations are checked.
This just leaves the values of gmin and dmax to be decided. Together, they
determine how densely packed together observations need to be if they are
going to be considered part of a cluster. The parameter gmin also dictates
the minimum number of observations there must be in a cluster. In
Activity 31, you will explore what impact these choices can have on the
clustering that is found.


Activity 31 Clustering with different values of dmax and gmin

The DBScan algorithm was applied to the Chinese liquor dataset, with
three different combinations of dmax and gmin . The resulting clusterings are
given in Figure 31.

Figure 31 Cluster solutions for liquor with three different combinations of dmax and gmin found using DBScan: (a) dmax = 0.6 and gmin = 4, (b) dmax = 0.6 and gmin = 2, (c) dmax = 1.2 and gmin = 4


(a) Compare the solutions. What has been the effect of changing gmin
and dmax ?
(b) Which solution seems best? Does it seem like a reasonable clustering
of the data?

5.4 Using R to implement DBScan


In Notebook activity B1.5, which makes up this subsection, you will learn
how to implement DBScan for the parking dataset (introduced in
Subsection 1.1) using R.

Notebook activity B1.5 Doing clustering using DBScan


In this notebook, you will use DBScan for clustering parking.
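As a flavour of what the notebook involves, a minimal sketch of the kind of call used is given below. It assumes the dbscan package is available and that the standardised variables from parking are held in a numeric matrix called x; the values given for eps (which plays the role of dmax) and minPts (which plays the role of gmin, counting the point itself) are purely illustrative.

library(dbscan)

db <- dbscan(x, eps = 0.5, minPts = 5)

db$cluster   # 0 denotes an outlier; 1, 2, ... label the clusters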

6 Comparing clustering methods


In Sections 3, 4 and 5, you have been introduced to three different
techniques for clustering data. So, why might one technique be chosen
instead of the others?
You have already seen that in both hierarchical clustering and in
partitional clustering, all observations are assumed to belong to a cluster,
though there is no minimum size for clusters. In contrast, DBScan, with
its density-based approach, can regard some observations as outliers.
Indeed, in the extreme, it is possible to set values for gmin and dmax such
that all the observations will be regarded as outliers! But, alongside this,
DBScan has the notion that there is a minimum size that any cluster
should be. As such, the treatment of outliers might be one reason for
choosing one approach over others.
Another consideration is which dissimilarity measure, or measures, is
deemed appropriate for the data being clustered. As was noted in
Subsection 4.2, k-means clustering works when the chosen dissimilarity
measure is Euclidean distance. (Though other forms of partitional
clustering do work with other dissimilarity measures.) So, if the chosen
dissimilarity measure is not Euclidean distance, k-means clustering is not
appropriate.
In this section, you will explore another important consideration: the size
of the dataset.


All of the clustering techniques you have learnt about in this unit involve
carrying out computations in a loop. For example:
• hierarchical clustering involves iteratively joining clusters together
• k-means clustering involves repeated allocations of data points to
clusters
• DBScan involves working through lists of observations.
Thus, it should be no surprise to learn that it takes longer for a computer
to obtain a clustering solution for a large dataset compared to a smaller
one. However, how much longer differs between clustering techniques.
These differences can be substantial enough to impact whether an answer
can be obtained in a reasonable length of time or not, depending on the
technique used.
One straightforward way of exploring how the implementation time
depends on the size of the dataset is to perform a practical experiment;
that is, to analyse datasets of different sizes on a computer and time how
long it takes to get an answer. (Using the same computer is important, as
computers vary in how quickly they can perform computations.) You will
consider the results from one such experiment in Activity 32.
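A minimal R sketch of such an experiment is given below, using system.time() to record elapsed times. It assumes a hypothetical function make_data() that generates a suitable dataset of a given size, and the dbscan package for DBScan; the settings passed to each technique are purely illustrative.

library(dbscan)

time_methods <- function(n) {
  x <- make_data(n)   # make_data() is a hypothetical data-generating function
  d <- dist(x)        # dissimilarity matrix (computed outside the timings here)
  c(hierarchical = unname(system.time(hclust(d, method = "complete"))["elapsed"]),
    kmeans       = unname(system.time(kmeans(x, centers = 3, nstart = 25))["elapsed"]),
    dbscan       = unname(system.time(dbscan(x, eps = 0.5, minPts = 5))["elapsed"]))
}

sizes <- c(10, 20, 50, 100, 200, 500, 1000, 2000, 5000)
timings <- sapply(sizes, time_methods)   # one column of timings per dataset size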

Activity 32 Comparing timings for the different techniques

An experiment was conducted to compare how long hierarchical clustering,


k-means and DBScan took to find a solution on a particular computer. In
each case, the datasets were intended to be similar in every respect other
than their size. The results, expressed as relative time compared with the
fastest time found in the experiment, are given in Figure 32.
(a) For small datasets, which technique appears to be fastest?
(b) For large datasets, which technique appears to be fastest?
(c) Which do you think is more important: whether a technique is
(relatively) fast with small datasets or (relatively) fast with large
datasets?


Figure 32 Relative timings for different clustering techniques (hierarchical, k-means and DBScan): relative time taken plotted against dataset size

In Activity 32, you saw how the time to obtain a solution varies between
different techniques. However, this only gives a rough guide to how long it
might take to get a solution using these techniques. For a start, it depends
on how efficient the software to implement each of the techniques is. This
is down to the skill of the programmer and how efficiently the algorithm is
designed. The computing resources available have an impact; for example,
the extent to which the implementation is able to make use of distributed
computing where this is available.
The analysis is also based on running the algorithm just once to find a
solution. In practice, the algorithm might be run on the data more than
once, often with different settings. You will consider this further in the
next activity.


Activity 33 Reasons to re-run clustering

Each of the clustering techniques introduced in Sections 3 to 5 required


choices to be made by the analyst. For each of the following techniques,
reflect on the type of choice that has to be made and then consider how
many times the algorithm might be re-run to fully explore the impact of
each choice.
(a) Hierarchical clustering (Section 3)
(b) Partitional clustering (Section 4)
(c) DBScan (Section 5)

So, as you have seen in Activity 33, the choices that some clustering
techniques force the data analyst to make can lead to the algorithm being
re-run more times than for other techniques. If speed is an important consideration, this
can change which technique is best.
Another consideration when thinking about the computational burden of
implementing an algorithm is the memory required. If this gets too big,
the algorithm cannot be implemented without more computing resources
being found.
A similar analysis to that undertaken in Activity 32 can be done to assess
the memory required whilst an algorithm runs. The results from one such
analysis are given in Figure 33.

Figure 33 Memory requirements by different clustering techniques (hierarchical, k-means and DBScan): memory required plotted against dataset size

Figure 33 shows that there are clear differences in memory demand


between the methods, which become noticeable when the sample size goes
up.
In particular, even with just 5000 data points, the memory requirements of
hierarchical clustering have become very large in comparison with the other
two approaches. This means that the memory requirement of hierarchical
clustering makes its application to really large datasets prohibitive.
In contrast, the memory requirements of k-means and DBScan appear to
be very small in comparison. This suggests that their application to really
big datasets is likely to remain feasible.
One explanation for this difference between hierarchical clustering and the
other two methods is that hierarchical clustering uses the dissimilarity
matrix, an object whose size is proportional to the square of the number of
observations. This can be avoided in k-means clustering and DBScan.
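This quadratic growth is easy to observe directly in R; a minimal sketch (with illustrative dataset sizes) is:

# Storage needed for the dissimilarity matrix as the number of observations grows
for (n in c(100, 1000, 5000)) {
  x <- matrix(rnorm(2 * n), ncol = 2)   # n observations on 2 variables
  print(object.size(dist(x)), units = "MB")
}
# Each ten-fold increase in n increases the storage by roughly a factor of 100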

Summary
In this unit, you have been introduced to cluster analysis as one of the
data science techniques that is interested in identifying groups, or clusters,
in the data. Since very little is usually known about clusters beforehand,
especially the number of clusters and which cluster each observation
belongs to, it is important to be able to spot clusters in a dataset.
You have seen how to use your own judgement to identify clusters in a
number of different datasets.
We have discussed a few similarity and dissimilarity measures that are
often used to measure the closeness of data points. You learnt how to
calculate the dissimilarity measures and were introduced to their
mathematical properties. These measures were then used to assess and
compare specific clusters using techniques such as the dissimilarity matrix
and its plot. You were also introduced to the silhouette statistic and used
its plot and mean statistic to assess clustering and informally allocate
observations to clusters.
Agglomerative hierarchical clustering has been introduced as one approach
that uses algorithms for automatic cluster allocation. You have learnt how
to start the algorithm and how to successively find and merge the closest
clusters. The dendrogram was then introduced as a graphical tool to
represent different clustering solutions that you can obtain using
agglomerative hierarchical clustering. The latter has also been
demonstrated in two notebook activities where you learnt how to
implement it in R.
Instead of successively merging the closest clusters, as in agglomerative
clustering, partitional clustering has also been discussed as another
approach for cluster allocation that is based on splitting, or partitioning,
clusters. You have learnt about the Voronoi diagram as a graphical tool
that is used to allocate observations to clusters based on their position on


the plot. Then, you went through the partitional clustering algorithm, from
selecting the starting point and successively partitioning clusters, to selecting a
suitable number, k, of clusters and finally using stopping rules to obtain
a reasonable clustering. Implementing partitional clustering in R was also
demonstrated in two notebook activities.
A third clustering method has been introduced, namely density-based
clustering (DBScan). You have seen that DBScan estimates both the
number of clusters and the cluster membership simultaneously. To achieve
this, the method uses two specific parameters, dmax and gmin , that need to
be pre-specified. You were asked to work through a notebook activity to
see how DBScan is implemented in R.
The unit concluded with a discussion on how to compare different
clustering methods and select a suitable method to use. This assessment is
usually done in terms of a number of considerations such as the treatment
of outliers, the dissimilarity measure that the method uses, the size of the
underlying dataset and the time each method may take.
As a reminder of what has been studied in Unit B1 and how the sections in
the unit link together, the route map is repeated below.

The Unit B1 route map
Section 1: Clusters in data
Section 2: Assessing clusters
Section 3: Hierarchical clustering; Section 4: Partitional clustering; Section 5: Density-based clustering
Section 6: Comparing clustering methods


Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate the importance of cluster analysis and understand its aims,
considerations and different algorithms
• use your own judgement to spot clusters in data based on histograms,
scatterplots and scatterplot matrices
• calculate different dissimilarity measures, especially the Euclidean and
L1 distances
• appreciate the importance of standardising data before calculating the
Euclidean and L1 distances
• calculate the dissimilarity matrix and understand its usage in assessing
clustering solutions
• understand and interpret the dissimilarity matrix plot and the silhouette
statistic plot
• calculate the mean silhouette statistic and use it to assess cluster
allocation
• understand agglomerative hierarchical clustering, the usage of the
dendrogram and the implementation of the technique in R
• understand partitional clustering, the usage of the Voronoi diagram and
the implementation of the technique in R
• understand density-based clustering, DBScan, work out cluster
allocation using different combinations of dmax and gmin , and be able to
implement DBScan in R
• appreciate the different factors you need to consider when selecting a
suitable clustering method for your data.

References
Azzalini, A. and Bowman, A.W. (1990) ‘A look at some data on the Old
Faithful geyser’, Applied Statistics, 39(3), pp. 357–365.
doi:10.2307/2347385.
Bonnici, L., Borg, J.A., Evans, J., Lanfranco, S. and Schembri, P.J. (2018)
‘Of rocks and hard places: Comparing biotic assemblages on concrete
jetties versus natural rock along a microtidal Mediterranean shore’,
Journal of Coastal Research, 34(5), pp. 1136–1148.
doi:10.2112/JCOASTRES-D-17-00046.1.
Höpken, W., Müller, M., Fuchs, M. and Lexhagen, M. (2020) ‘Flickr data
for analysing tourists’ spatial behaviour and movement patterns’, Journal
of Hospitality and Tourism Technology, 11(1), pp. 69–82.
doi:10.1108/JHTT-08-2017-0059.
Jordanova, N., Jordanova, D., Tcherkezova, E., Popov, H., Mokreva, A.,
Georgiev, P. and Stoychev, R. (2020) ‘Identification and classification of


archeological materials from Bronze Age gold mining site Ada Tepe
(Bulgaria) using rock magnetism’, Geochemistry, Geophysics and
Geosystems, 21(12). doi:10.1029/2020GC009374.
Qian, Y., Zhang, L., Sun, Y., Tang, Y., Li, D., Zhang, H., Yuan, S. and Li,
J. (2021) ‘Differentiation and classification of Chinese Luzhou-flavor
liquors with different geographical origins based on fingerprint and
chemometric analysis’, Journal of Food Science, 86(5), pp. 1861–1877.
doi:10.1111/1750-3841.15692.
Raines, T., Goodwin, M. and Cutts, D. (2017) Europe’s political tribes:
Exploring the diversity of views across the EU, Chatham House briefing.
Available at: https://www.chathamhouse.org/
sites/default/files/publications/research/2017-12-01-europes-political-
tribes-raines-goodwin-cutts.pdf (Accessed: 2 October 2018).
Stolfi, D.H., Alba, E. and Yao, X. (2017) ‘Predicting car park occupancy
rates in smart cities’, Smart cities: Second international conference,
Smart-CT 2017, Malaga, Spain, 14–16 June, 2017, pp. 107–117. Springer:
Cham, Switzerland.
Tonelli, S., Drobnič, S. and Huinink, J. (2021) ‘Child-related family policies
in East and Southeast Asia: an intra-regional comparison’, International
Journal of Social Welfare, 30(4), pp. 385–395. doi:10.1111/ijsw.12485.

Acknowledgements
Grateful acknowledgement is made to the following sources:
Figure 1: public domain
Subsection 1.1, the Bullring: jax10289 /Shutterstock
Subsection 1.1, Yellowstone Park geyser: public domain
Figure 7: Taken from Yu Qian, Liang Zhang, Yue Sun, Yongqing Tang,
Dan Li, Huaishan Zhang, Siqi Yuan and Jinsong Li. Food Chemistry.
‘Differentiation and classification of Chinese Luzhou-flavor liquors with
different geographical origins based on fingerprint and chemometric
analysis’, Journal of Food Science. Wiley.
Subsection 1.2, a rocky shore in Qawra, Malta Majjistral: Jocelyn
Erskine-Kellie / Flickr. This file is licenced under Creative
Commons-by-SA 2.0. https://creativecommons.org/licenses/by-sa/2.0/
Subsection 1.2, ‘Cat, chat or gato?’: public domain
Subsection 1.2, Malaysian families enjoyed an increase in paid maternity
leave: Edwin Tan / Getty
Subsection 3.3, the dendrogram: www.cantorsparadise.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
In Figure 4 there appear to be four peaks, which suggests that, in the
greyscale version of the image, there are four clusters of greyness values:
• a cluster centred around a greyness of about 0.25 (that is, a cluster of
relatively dark pixels)
• a cluster centred around a greyness of about 0.65
• a cluster centred around a greyness of about 0.8
• a cluster centred around a greyness of about 0.9 (that is, a cluster of
relatively light pixels).

Solution to Activity 2
In Figure 6, there is one clear cluster that seems to contain most of the
pixels. In this cluster, the values of redness, greenness and blueness are all
roughly equal. This means that pixels in this cluster appear grey.
Beyond this, things are less clear. Looking specifically at the plot of
redness versus greenness, there are arguably another two clusters each only
accounting for a small number of pixels. One of these clusters corresponds
to pixels where the redness is moderate but the greenness is low, and the
other cluster is where both the redness and greenness are low. Looking at
the plot of greenness versus blueness there also seems to be a cluster where
the blueness is noticeably below the greenness.
So, it would not be unreasonable to suggest that there is a total of four
clusters – as shown in Figure S1.

Figure S1 Redness, greenness and blueness of a sample of pixels in Figure 1(a), split into four clusters


Solution to Activity 3
(a) In one dimension, the L1 distance corresponds to d(x, y) = |x − y|.
Taking the two pixels in the left-hand corners, the dissimilarity is
d(xtl , xbl ) = |0.713 − 0.591| = |0.122| = 0.122,
or equivalently
d(xbl , xtl ) = |0.591 − 0.713| = |−0.122| = 0.122.

For all the pairs of pixels, the values are as follows.

Corners Dissimilarity
Top left and bottom left 0.122
Top left and top right 0.052
Top left and bottom right 0.134
Bottom left and top right 0.174
Bottom left and bottom right 0.256
Top right and bottom right 0.082

(b) The corners that are most different are the bottom left and bottom
right as that is the pair with the biggest dissimilarity. This makes
sense because the darkest corner in Figure 1(a) is the bottom left
corner and the lightest corner is the bottom right corner.

Solution to Activity 4
(a) From Table 7, we have that xtl = (183, 181, 182)
and xtr = (188, 200, 188). Then
d(xtl , xtr ) = √((183 − 188)² + (181 − 200)² + (182 − 188)²)
= √(25 + 361 + 36)
= √422 ≃ 20.5.

(b) Comparing the colours of the pixels rather than just their greyness
does not make much difference to the conclusions. The bottom left
and bottom right corners still have the biggest dissimilarity. Equally,
the top left and top right corners are still the most similar corners.
However, between the other pairs of corners, the top right and bottom
right corners appear more different with respect to colour than with
respect to greyness.


Solution to Activity 5
(a) In this case, the L1 distance corresponds to
d(x, y) = |x1 − y1 | + |x2 − y2 |,
where x1 and y1 are the measurements for public social expenditure
for two countries and x2 and y2 are the corresponding maternity
leaves.
(i) When public social expenditure is measured as a percentage of
GDP and maternity leave in days, this means that
d(x, y) = |0.02 − 1.30| + |60 − 120|
= 1.28 + 60
= 61.28.
Similarly, d(x, z) = 52.01 and d(y, z) = 9.29.
(ii) When public social expenditure is measured as a percentage of
GDP and maternity leave in years, this means that
d(x, y) = |0.02 − 1.30| + |0.16 − 0.33|
= 1.28 + 0.17
= 1.45.
Similarly, d(x, z) = 0.16 and d(y, z) = 1.31.
(iii) When both public social expenditure and maternity leave have
been standardised, this means that
d(x, y) = |(−0.86) − 1.90| + |(−1.16) − 0.79|
= 2.76 + 1.95
= 4.71.
Similarly, d(x, z) = 1.71 and d(y, z) = 3.04.
(b) Based on the values calculated in part (a)(i), child-related policies in
Mongolia and Singapore are the most similar, as the dissimilarity is
smallest, and they are the most different between Malaysia and
Mongolia.
(c) Transforming the data makes a lot of difference! When maternity
leave is measured in years, not days, it is the policies in Malaysia and
Singapore that then look the most similar. The policies in Malaysia
and Singapore also look most similar when both variables have been
standardised. This is because the transformations change which
variable contributes most to the overall dissimilarity.
When public social expenditure is measured as a percentage of GDP
and maternity leave in days, the variation in maternity leave is far
bigger. So, the overall dissimilarity mostly reflects the differences in
maternity leave. In contrast, measuring maternity leave in years
means that it is the variation in public social expenditure that is
generally larger, meaning that overall dissimilarity tends to reflect
differences in public social expenditure.


When both variables have been standardised, the variation in both


variables is roughly the same, so neither variable dominates in the
calculation of the overall dissimilarity.

Solution to Activity 6
(a) The dissimilarity between Adnan and Billy is given by
|hAdnan − hBilly | = |180 − 170| = 10.
Similarly, for the other three friends we have
|hAdnan − hCath | = |180 − 164| = 16,
|hAdnan − hDan | = |180 − 193| = |−13| = 13,
|hAdnan − hElise | = |180 − 182| = |−2| = 2.

(b) The dissimilarity between Adnan and Adnan is


|hAdnan − hAdnan | = |180 − 180| = 0.

(c) The complete dissimilarity matrix is as follows.

Adnan Billy Cath Dan Elise


Adnan 0
Billy 10 0
Cath 16 6 0
Dan 13 23 29 0
Elise 2 12 18 11 0

The main diagonal can be filled in immediately as every entry is 0.


The first column comes from entering the values you calculated in
part (a) into the appropriate rows.

Solution to Activity 7
(a) Clustering 1 in Figure 11 is not at all convincing. There is much
overlap between observations that have been placed in different
clusters.
Clustering 2 is more convincing than that in Clustering 1. There are
regions in the plot where observations in the same cluster clearly
dominate. However, observations from different clusters still appear to
overlap. So maybe a slightly different clustering would be more
appropriate.
Clustering 3 is convincing. Each cluster corresponds to a cloud of
observations which is noticeably separated from all the other clouds of
observations.
(b) In Matrix 2 of Figure 12, the blocks along the main diagonal stand
out as much lighter than the other blocks. This indicates that
observations within each cluster are much closer than observations in


different clusters. This situation, indicative of a convincing cluster


solution, therefore corresponds to Clustering 3 in Figure 11.
In Matrix 1, the blocks along the main diagonal also look lighter than
the other blocks, though the contrast is less striking than in Matrix 2.
This suggests that observations in the same cluster are generally
closer together than observations in different clusters, but not as
obviously as for the data corresponding to Matrix 2.
In Matrix 3, the blocks along the main diagonal do not stand out
from the other blocks. This indicates that observations in different
clusters are often as similar as observations in the same cluster. So
Matrix 3 corresponds to Clustering 1 in Figure 11 and Matrix 1
corresponds to Clustering 2 in Figure 11.

Solution to Activity 8
(a) As Adnan is in Cluster 2:
• The mean dissimilarity between Adnan and the other members of
Cluster 2 (i.e. just Elise) is 2. So aAdnan = 2.
• The mean dissimilarity between Adnan and the members of
Cluster 1 (Billy and Cath) is
(10 + 16)/2 = 13.
Similarly, the mean dissimilarity between Adnan and Cluster 3
(just Dan) is 13.
So, Clusters 1 and 3 are equally close to Adnan, meaning
that bAdnan = 13.
Thus, the value of the silhouette statistic for Adnan is
sAdnan = (bAdnan − aAdnan )/max(aAdnan , bAdnan )
= (13 − 2)/max(2, 13)
= 11/13 ≃ 0.846.
(b) As Billy is in Cluster 1:
• The mean dissimilarity between Billy and the other members of
Cluster 1 (i.e. just Cath) is 6. So, aBilly = 6.
• The mean dissimilarity between Billy and the members of Cluster 2
(Adnan and Elise) is
(10 + 12)/2 = 11.
Similarly, the mean dissimilarity between Billy and Cluster 3 (just
Dan) is 23.
So, Cluster 2 is closer to Billy than Cluster 3, meaning
that bBilly = 11.


Thus, the value of the silhouette statistic for Billy is


sBilly = (bBilly − aBilly )/max(aBilly , bBilly )
= (11 − 6)/max(6, 11)
= 5/11 ≃ 0.455.
(c) As Elise is in Cluster 2:
• The mean dissimilarity between Elise and the other members of
Cluster 2 (i.e. just Adnan) is 2. So, aElise = 2.
• The mean dissimilarity between Elise and the members of Cluster 1
(Billy and Cath) is
(12 + 18)/2 = 15.
Similarly, the mean dissimilarity between Elise and Cluster 3 (just
Dan) is 11.
So, Cluster 3 is closer to Elise than Cluster 1, meaning
that bElise = 11.
Thus, the value of the silhouette statistic for Elise is
sElise = (bElise − aElise )/max(aElise , bElise )
= (11 − 2)/max(2, 11)
= 9/11 ≃ 0.818.

Solution to Activity 9
In Figure 14(a), all the silhouette statistics are close to +1. So the plot
suggests that all the observations sit nicely in their allocated clusters. Out
of the clusterings given in Figure 11, this situation only applies to
Clustering 3.
Plots (b) and (c) in Figure 14 both contain observations for which the
silhouette statistic is negative; and this is more pronounced in plot (c). So,
they both correspond to situations in which some observations are closer to
a different cluster than the one they have been put in, particularly the
clustering corresponding to Figure 14(c). Thus, as Clustering 1 in
Figure 11 is worse than Clustering 2 in Figure 11, it suggests that
Figure 14(b) corresponds to Clustering 2 in Figure 11 and Figure 14(c)
corresponds to Clustering 1 in Figure 11.


Solution to Activity 10
In a word, no!
On the plot of the dissimilarity matrix, the blocks along the main diagonal
are generally some of the lightest coloured blocks, indicating that samples
from the same brand are similar. However, there are light-coloured blocks
elsewhere. This indicates that brands are not separate from each other.
This finding is borne out in the silhouette plot. Few of the silhouette
statistics are positive, again suggesting that samples from different brands
are not separate.

Solution to Activity 11
(a) Cath is in a cluster all by herself. So Cath’s silhouette statistic, sCath ,
just takes the value 0.
For observations that are in clusters bigger than size 1, the silhouette
statistic for observation i is
si = (bi − ai )/max(ai , bi ),
where ai is the mean dissimilarity between the observation and other
observations in the cluster and bi is the average dissimilarity between
the observation and observations in the next nearest cluster.
Billy is in a cluster with just Adnan, so
aBilly = |170 − 180|
= 10.
The mean dissimilarity between Billy’s height and the heights of those
in Cluster 1 corresponds to the dissimilarity between Billy’s height
and Cath’s height – that is
|170 − 164| = 6.
Similarly the mean dissimilarity between Billy’s height and the
heights of those in Cluster 3 corresponds to the mean of the
dissimilarities between Billy’s height and Elise’s height, and also
between Billy’s height and Dan’s height – that is
(|170 − 182| + |170 − 193|)/2 = 17.5.
So, for Billy,
aBilly = 10 and bBilly = min(6, 17.5) = 6.
This means that
sBilly = (bBilly − aBilly )/max(aBilly , bBilly )
= (6 − 10)/max(10, 6)
= −0.4.


(b) The mean silhouette statistic for this alternative clustering is


s = (sAdnan + sBilly + sCath + sDan + sElise )/5
= (−0.250 − 0.4 + 0 + 0.389 − 0.364)/5
= −0.125.

Thus the original clustering is better than this alternative clustering


because the mean silhouette statistic of the original clustering is
higher (s = 0.553 from Example 19).
(c) The clustering with the highest mean silhouette statistic is {Cath,
Billy}, {Adnan, Elise} and {Dan}. So this one is judged the best
clustering of these data. This seems reasonable. In Figure 8, it can be
seen that Adnan and Elise are very close in height. Billy is also more
similar in height to Cath compared with Adnan. Finally, Dan is a lot
taller than the others.

Solution to Activity 12
(a) As a cluster must contain at least one of the friends, and none of
these friends can be in more than one cluster, the maximum number
of clusters is five. This corresponds to placing each friend in their own
single-element separate cluster.
(b) When there are eight friends, the maximum number of clusters they
can be put in is eight. As in part (a), this corresponds to placing each
friend in their own single-element separate cluster.

Solution to Activity 13
(a) Here, the corner pixels are split into those on the left and those on the
right. So the dissimilarity between the clusters is based on the four
dissimilarities involving a corner pixel on the left and a corner pixel on
the right. This corresponds to the values 20.5, 64.5, 70.1 and 115.6.
(b) Using single linkage, the dissimilarity between the two clusters is the
minimum of the four values picked out in part (a), which is 20.5.
(c) Using complete linkage, the dissimilarity between the two clusters is
the maximum of the four values picked out in part (a), which is 115.6.
(d) Using average linkage, the dissimilarity between the two clusters is the
mean of the four values picked out in part (a), which is
(20.5 + 64.5 + 70.1 + 115.6)/4 ≃ 67.7.


Solution to Activity 14
(a) The next pair of clusters to be merged is Clusters D and E. This is
because the smallest dissimilarity between different clusters occurs
between Clusters D and E. The four-cluster solution is therefore:
{Cluster A}, {Cluster B}, {Cluster C} and {Cluster D, Cluster E}.
(b) The dissimilarity matrix for the four-cluster solution is as follows.

Cluster A Cluster B Cluster C Clusters D and E


Cluster A 0
Cluster B 202 0
Cluster C 598 477 0
Clusters D and E 532 411 161 0

The part of the dissimilarity matrix relating to just Clusters A, B


and C is unchanged as the dissimilarities between these clusters are
unaffected by the merging of Clusters D and E.
For the dissimilarity between each of the Clusters A, B and C, and the
cluster formed by merging Clusters D and E, it is just the maximum
of the dissimilarity of each of Clusters A, B and C with Cluster D and
with Cluster E separately. For example, the dissimilarity between
{Cluster A} and {Clusters D and E} is the larger of the dissimilarity
between {Cluster A} and {Cluster D} (532) and the dissimilarity
between {Cluster A} and {Cluster E} (448). Therefore the
dissimilarity between {Cluster A} and {Clusters D and E} is 532.

Solution to Activity 15
(a) Looking from the bottom upwards, the first horizontal line appears to
link days 1 and 9. So, these were the first to be merged.
(b) The merging of the last two clusters corresponds to the topmost
horizontal line. Thus it occurred at a dissimilarity around 550. (Being
any more precise than this is difficult, given the scale to read off from.)
(c) The two-cluster solution is found by tracing down from the point at
which there are just two vertical lines on the dendrogram – something
which happens for dissimilarities in the range of about 150 to
about 550. One of these lines leads down to days 5 and 6. This means
that all the other days are in the other cluster – that is, days 1, 2, 3,
4, 7, 8, 9 and 10.
(d) The three-cluster solution is found by tracing down from a point
where there are exactly three vertical lines. Doing this we find that
the clusters are: days 5 and 6; day 4; days 1, 2, 3, 7, 8, 9 and 10.
(e) The hierarchical clustering process means that the two-cluster
solution is found by merging two of the clusters in the three-cluster
solution. Thus, one of the clusters in the three-cluster solution must
be carried through to the two-cluster solution without being changed.


(f) Using the principle of parsimony, we should be looking for the smallest
number of clusters that adequately describes the structure in the data.
The dendrogram suggests that the two-cluster solution is the most
appropriate one, as mergers before this point involve only relatively
small changes in dissimilarity. In contrast, the change in dissimilarity
from the two-cluster solution to the one-cluster solution is relatively
big.

Solution to Activity 16
One way, and the one we will pursue in this subsection, is to work out
which cluster centre each observation is closest to, then allocate
observations to the corresponding cluster. For any given observation,
working out which cluster centre is the closest can be done by identifying
the cluster centre for which the dissimilarity between the observation and
the centre is the smallest.

Solution to Activity 17
With the cluster centres at 160 cm and 170 cm, respectively, heights less
than 165 cm will be closer to the centre of Cluster 1 and heights more than
165 cm will be closer to the centre of Cluster 2. So, anyone with a height
less than 165 cm should be allocated to Cluster 1 and the rest to Cluster 2.
This means that Cluster 1 will just be {Cath}, and Cluster 2 will be
{Adnan, Billy, Dan, Elise}.

Solution to Activity 18
There are different ways in which the centres could reasonably be
calculated. For example:
• the position that, for observations in the cluster, corresponds to the
mean duration and the mean waiting time
• the position that, for observations in the cluster, corresponds to the
median duration and the median waiting time
• the position where the dissimilarities between the centre and all the
observations are minimised.
The different methods have different merits, so what is right in one context
might not be the best in another.

Solution to Activity 19
For these data there is only one variable: height. So, as the dissimilarity
function is Euclidean distance, the centre is just the mean of the heights of
the friends in the cluster. That is,
x = (hAdnan + hDan + hElise )/3
= (180 + 193 + 182)/3
= 185.


Solution to Activity 20
The two subtasks are like mirror images of each other: the assumption for
one subtask is what is found in the other subtask.

Solution to Activity 21
(a) Allocating the friends based on cluster centres at 160 cm and 170 cm
was considered in Activity 17. There, it was shown that the
appropriate allocation is Cluster 1: {Cath} and Cluster 2: {Adnan,
Billy, Dan, Elise}.
(b) In part (a), Adnan, Billy, Dan and Elise were allocated to the second
cluster. Using the mean to estimate cluster centres, the centre of this
cluster is then estimated to be
(180 cm + 170 cm + 193 cm + 182 cm)/4 = 181.25 cm.
Only Cath was allocated to the first cluster. So the centre of this
cluster is just the same as Cath’s height: 164 cm.
(c) As in part (a), each friend should be allocated to the cluster whose
centre is the closest. So, using the cluster centres of 164 cm
and 181.25 cm all friends with a height less than
(164 cm + 181.25 cm)/2 = 172.625 cm
should be allocated to Cluster 1 and the rest to Cluster 2.
So now the second cluster (the one with centre 181.25 cm) comprises
just Adnan, Dan and Elise. Billy has now been allocated to the first
cluster (the one with centre 164 cm), along with Cath.
(d) As the first cluster now consists of Billy and Cath, the centre of this
cluster is now estimated to be
(170 cm + 164 cm)/2 = 167 cm.
Similarly, as the second cluster now consists of Adnan, Dan and Elise,
the centre of this cluster is now estimated to be
(180 cm + 193 cm + 182 cm)/3 = 185 cm.

Solution to Activity 22
(a) Comparing the solutions found in parts (c) and (d) of Activity 21 with
those in parts (a) and (b), we can see that during the second iteration
the allocation of points to clusters has changed (Billy changed from
being allocated to Cluster 1 to being allocated to Cluster 2). Also, the
estimates of the cluster centres have changed. For example, the
estimate of the centre of Cluster 1 changed from 164 cm to 167 cm.
Furthermore, only two iterations have been completed, less than the
maximum number set. So, as none of the stopping criteria has been
met, there is no reason to stop the algorithm after two iterations.


(b) Using the cluster centres of 167 cm and 185 cm the allocation of
friends to the clusters is: allocate to Cluster 1 if the height is less
than (167 + 185)/2 = 176 cm, otherwise allocate to Cluster 2. This
means that Billy and Cath are allocated to Cluster 1 and that Adnan,
Dan and Elise are allocated to Cluster 2.
This is the same allocation as was found in Activity 21(c). This also
means that the estimated cluster centres are the same as in
Activity 21(d): 167 cm and 185 cm.
(c) As noted in the solution to part (b), the allocation of friends to
clusters is the same as found in the previous iteration. Also, the
estimated cluster centres did not change. So, as two of the criteria for
stopping have been met, the algorithm should now stop (successfully).
(We only need one of the criteria to be met.)

Solution to Activity 23
(a) Yes, the same allocation of observations to clusters will always be
obtained if the values of the cluster centres do not change. If the
cluster centres do not change, then which centre an observation is
closest to also cannot change.
(b) No, the cluster centres cannot change if the allocation of observations
to clusters does not change. The value of a cluster centre is completely
determined once it is known which observations are in the cluster.
(c) If the allocation of observations to clusters does not change, the
answer to part (b) implies that the cluster centres do not change.
Equally, if the cluster centres do not change, the answer to part (a)
implies that the allocation of observations to clusters does not change.
So, as the circumstances that lead to one of the stopping rules being
satisfied also mean that the other condition is satisfied, the two
conditions are equivalent.

Solution to Activity 24
This time, with initial cluster centres of 180 cm and 193 cm, any friend
with a height less than
(180 cm + 193 cm)/2 = 186.5 cm
should be allocated to Cluster 1 and the rest to Cluster 2. This means that
the allocation of friends to clusters becomes:
• Cluster 1: Adnan, Billy, Cath and Elise
• Cluster 2: Dan.
Based on this cluster allocation, the cluster centres are then estimated to
be:
• Cluster 1: (180 cm + 170 cm + 164 cm + 182 cm)/4 = 174 cm
• Cluster 2: 193 cm.


Moving on to the second iteration, the dividing line between allocating to
Cluster 1 and allocating to Cluster 2 is now a height of
(174 cm + 193 cm)/2 = 183.5 cm.
So, the allocation of friends to clusters remains as:
• Cluster 1: Adnan, Billy, Cath and Elise, with the centre of 174 cm
• Cluster 2: Dan, with the centre of 193 cm.
As neither the allocation of friends to clusters, nor the estimated cluster
centres, changes from Iteration 1 to Iteration 2, the algorithm stops.
Note that the solution is different to that found in Activity 22. In
Activity 22, at the end of the algorithm Billy and Cath were in one cluster,
leaving Adnan, Dan and Elise in the other. Neither is wrong, merely
different.
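Both runs can be reproduced in R with kmeans(), supplying the starting centres explicitly; in this sketch the ‘Lloyd’ algorithm is chosen so that the iterations broadly follow the allocate-then-update scheme used in these solutions, and the resulting allocations and centres can be inspected via $cluster and $centers.

heights <- c(Adnan = 180, Billy = 170, Cath = 164, Dan = 193, Elise = 182)
# starting centres of 160 cm and 170 cm (as in Activities 21 and 22)
km1 <- kmeans(heights, centers = matrix(c(160, 170), ncol = 1), algorithm = "Lloyd")
# starting centres of 180 cm and 193 cm (as in this activity)
km2 <- kmeans(heights, centers = matrix(c(180, 193), ncol = 1), algorithm = "Lloyd")
km1$cluster; km1$centers
km2$cluster; km2$centers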

Solution to Activity 25
(a) In all of the plots, the dividing line between the two clusters goes
diagonally across the plot. This indicates that the main difference
between the clusters is the concentration of both chemicals together.
In one cluster, this total concentration is higher compared with the
other. In other words, assuming that a higher concentration
corresponds to a stronger flavour in the liquor, the two clusters
correspond to a relatively strongly flavoured liquor and a relatively
less flavoured liquor.
The solutions differ in the weight given to ethyl acetate over ethyl
lactate. This could be translated into differences in the flavour.
(b) In Figure 23(a) and Figure 23(b), there does not seem to be a clear
gap between the two clusters, so they do not represent reasonable
divisions of the data into two clusters.
In Figure 23(c), there does seem to be a gap between the two clusters.
So, this clustering solution seems more reasonable.
(c) Based on the mean silhouette statistics, the clustering solution given
in Figure 23(c) has the highest value of the statistic and thus appears
better than the other two solutions.

Solution to Activity 26
(a) The average silhouette statistic is highest for k = 3. So for these data
there appear to be three clusters.
(b) The points in the upper cluster of the k = 2 solution also form a
cluster in the k = 3 solution, meaning that the k = 3 solution happens
to split the lower cluster of the k = 2 solution. This split seems to
be where there is a bit of a gap between observations. So it seems
reasonable that this three-cluster solution is better than the
two-cluster solution.


The four-cluster solution also appears to split the data nicely into
groups. It is worth noting that the mean silhouette statistic for this
solution is not much less than that for the three-cluster solution. In
the five-cluster solution it is not clear that the observations in the
middle cluster are closer to each other than to observations in other
clusters. So it is not surprising that the mean silhouette statistic for
the k = 5 solution is not as high as for the k = 3 solution.
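For reference, mean silhouette statistics of the kind compared here can be computed in R with the cluster package (assuming it is installed). A minimal sketch, in which x is a placeholder for the data and clus for a vector of cluster labels:

library(cluster)
sil <- silhouette(clus, dist(x))   # silhouette width for each observation
mean(sil[, "sil_width"])           # the mean silhouette statistic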

Solution to Activity 27
(a) In Example 30, it was noted that Adnan and Billy had three friends
(including themselves) within 10 cm of their own heights.
Furthermore, Cath and Elise both had two friends whose height was
within 10 cm of their own heights. So, when dmax = 10 and gmin = 2,
Adnan, Billy, Cath and Elise would be regarded as being in the
interior of a cluster. Dan still would not be deemed to be in the interior
of a cluster.
(b) When dmax = 10, there are at most three friends (including themselves)
within 10 cm of any one friend's height. This means that when dmax = 10
and gmin = 4, none of the friends would be deemed to be in the interior of a cluster.
(c) When dmax = 6, we have the following.

Friend   Height, h (cm)   Range with dmax of height   Number of friends in range
Adnan    180              174 to 186                  2
Billy    170              164 to 176                  2
Cath     164              158 to 170                  2
Dan      193              187 to 199                  1
Elise    182              176 to 188                  2

So when gmin = 2, this means that only Adnan, Billy, Cath and Elise
are in the interior of a cluster. Dan is not.
(d) When dmax = 1, we have the following.

Friend   Height, h (cm)   Range with dmax of height   Number of friends in range
Adnan    180              179 to 181                  1
Billy    170              169 to 171                  1
Cath     164              163 to 165                  1
Dan      193              192 to 194                  1
Elise    182              181 to 183                  1

So when gmin = 2, none of the friends has a sufficient number of
friends who are close enough in height to them. Thus, none of them
would be deemed to be in the interior of a cluster.
(e) As noted in the solution to part (d), all of the friends have one person
within 1 cm of their heights (themselves!). So, when gmin = 1, all of
them would be deemed to be in the interior of a cluster.
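These by-hand classifications can be checked in R with the dbscan package (assuming it is installed), in which the argument eps plays the role of dmax and minPts the role of gmin, with each point counting itself. A minimal sketch for part (a):

library(dbscan)
heights <- matrix(c(180, 170, 164, 193, 182), ncol = 1)  # Adnan, Billy, Cath, Dan, Elise
db <- dbscan(heights, eps = 10, minPts = 2)              # dmax = 10 and gmin = 2
db$cluster   # cluster labels; 0 marks an observation treated as noise (an outlier)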


Solution to Activity 28
Phase 2 of the algorithm is only triggered when one such observation has
been found.

Solution to Activity 29
• Step 1:
This time just Billy is in the cluster set initially. Billy is necessarily an
interior point, and so adds the observations which are sufficiently close
to it to the cluster set. This means that Adnan and Cath get added to
the cluster set, so that it becomes {Billy, Adnan, Cath}.
• Step 2:
Now consider the next observation in the set: Adnan. As in Example 32,
Adnan is also an interior point and potentially adds both Cath and
Elise. However, as Cath is already in the cluster set, only Elise is
actually added, so the cluster set becomes {Billy, Adnan, Cath, Elise}.
• Step 3:
Now considering Cath, this is an edge observation, just as in
Example 32. So, no further observations are added to the cluster set.
• Step 4:
Now considering Elise, this is also an edge observation, just as in
Example 32. So, again, no further observations are added to the cluster
set.
At this point, all the observations in the cluster set have been checked
and no further observations will be added. So, the cluster set ends up as
being {Billy, Adnan, Cath, Elise}. These are exactly the same observations
that were in the final cluster set generated in Example 32. Only the
order of the observations in the cluster set has been changed (which does
not matter).

Solution to Activity 30
From the plots it is clear that the two solutions are very similar. In both
cases two clusters are identified and no points are identified as being an
outlier. In fact, the allocation of just two points is different in the two
solutions. These points are part of a small group of observations that
arguably lie between the hearts of the two clusters.
This difference is not likely to be important. The impression about both
clusters is unchanged. However, it does highlight that these points are not
clearly in one cluster or the other cluster.


Solution to Activity 31
(a) Decreasing gmin (Figures 31(a) and (b)) has led to more clusters being
identified, and fewer observations being classified as outliers.
Increasing dmax (Figures 31(a) and (c)) has, in contrast, led to fewer
clusters being identified. This is because observations that
were in separate clusters are now allocated to the same cluster.
Additionally, observations that were labelled as outliers are now
allocated to the cluster too.
(b) The solution given by dmax = 1.2 and gmin = 4 (Figure 31(c)) is not
helpful. It simply puts all the samples into the same cluster.
There is less to choose between the other two solutions. However,
using the principle of parsimony (i.e. the fewer clusters the better),
the solution given by dmax = 0.6 and gmin = 4 should be preferred.
The high number of outliers in both of these solutions suggests that
perhaps there are no clusters in the data to find!

Solution to Activity 32
(a) In this experiment, hierarchical clustering took less time
than k-means or DBScan when the dataset size was 50 or less. So
hierarchical clustering appears to be fastest for small datasets.
(b) When the dataset size was 500 or more, k-means clustering took less
time than DBScan or hierarchical clustering. So k-means appears to be
the fastest for large datasets.
(c) It is more important for a technique to be fast with large datasets.
When a dataset is small, even a relatively slow technique is still likely
to come up with an answer in an acceptably small amount of time.

Solution to Activity 33
(a) In hierarchical clustering, the main choice that has to be made is the
choice of linkage. For example, single, complete or average. Changing
the linkage means re-running the algorithm. Furthermore, as it is easy
to change the dissimilarity measure in hierarchical clustering, different
choices for this might be tried. This might lead to the algorithm being
re-run a number of times to explore the impact of these choices.
(b) In partitional clustering, there are two main choices to be made: the
initial positions of the cluster centres, and what value of k to use. In
Subsection 4.4, you have already seen that exploring what is a
reasonable value for k can lead to the algorithm being run a few
times. For example, in Example 26 the algorithm was implemented
for nine different values of k. Furthermore, in Subsection 4.4 you saw
that choosing starting configurations at random was one way to try to
ensure that a good, stable solution is found. This could lead to the
algorithm being re-run lots of times to try to ensure there is a
reasonable chance that a good, stable solution is found.


(c) In DBScan there are two main choices: dmax , the maximum
dissimilarity between two data points for them to be in the same
cluster, and gmin , the minimum number of data points to define a new
cluster. Also, it is possible to change the dissimilarity measure.
Trying just a few different values for both dmax and gmin , and a few
different dissimilarity measures can quickly lead to the algorithm
being run ten or more times.

Unit B2
Big data and the application of data science

Introduction
At the time of writing this unit, there is much talk about ‘big data’. This
talk is both in a positive way (what it can make possible) and a negative
way (how it threatens values deemed important by society). This unit, the
second and final unit in the data science strand, considers some of the
issues and applications for ‘big data’.
In Section 1, you will explore when data might be thought of as ‘big data’
and we will describe some of its uses. (For the purposes of this unit, we will
refer to data of the type you have been dealing with so far as ‘small data’.)
An important feature of the analysis of big data is that the computational
aspects cannot be ignored. Data analysis becomes infeasible if it takes the
computer too long to come up with the results. Furthermore, with big
data it cannot be assumed that such computational problems will be
solved simply by switching to a bigger, faster computer. Section 2 will
focus on distributed computing, which harnesses computing power across
multiple processors and makes the analysis of big data feasible. In
Section 3, the focus is on algorithms used by the processors to produce
results for our data analysis – in particular, the extent to which we can
expect them to produce the correct results, or even the extent to which
what constitutes the correct results is known.
In the remaining sections of this unit, Sections 4 to 7, we shift away from
the practical considerations of computation time and algorithms towards
more philosophical ones.
In Section 4, you will explore how having big data can impact on the
interpretational approach to data analysis. Sections 5 and 6 will deal with
wider ethical issues thrown up by big data (and small data too). Data
scientists and statisticians, and indeed others who collect and use data,
have a responsibility for the use to which the results are put. As you will
see, the nature of big data, and the complexity of models built using it,
means that ethical principles surrounding privacy and fairness can be
unwittingly violated. Finally, Section 7 will introduce guidelines for data
scientists and statisticians for dealing with such data. These guidelines aim
to ensure that the public justifiably regards the work of data scientists and
statisticians as being undertaken with integrity.
The following route map shows how the sections connect to each other.


The Unit B2 route map: Section 1 (What is so special about big data?); Section 2 (Handling big data); Section 3 (Models and algorithms); Section 4 (Outputs from big data analysis); Section 5 (Privacy); Section 6 (Fairness); Section 7 (Guidelines for good practice).

Note that Subsection 2.3.1 contains three notebook activities, so you
will need to switch between the written unit and your computer to
complete this subsection.
Additionally, in Subsections 4.3, 5.1 and 5.2, and Section 7, you will
need to access other resources on the module website to complete a
number of other activities. There are also a few activities that suggest
sharing comments on the M348 forums.


1 What is so special about big data?


As its title suggests, this unit is all about big data. The use of the term
‘big data’ in the context that is the subject of this unit is thought to go
back no further than the mid 1990s (Diebold, 2021). Thus, it is a relatively
new concept.
The first thing to consider is: what is meant by ‘big data’? This is what
you will do in Activity 1.

Activity 1 What is big data?

Take a minute or two to think about what the term ‘big data’ currently
means to you. What do you think makes big data different from just data?

As the Solution to Activity 1 has suggested, one aspect that can make
data big is that there are lots (and lots, and lots!) of individual data
points. But there are other aspects too, notably the type of data, and the
speed at which it accumulates. These two aspects, along with the number
of data points, are indicated by the so-called three V’s of data science:
• volume
• velocity
• variety.
These three V’s were highlighted by Laney (2001, cited in Diebold, 2021)
and will be the subject of Subsection 1.2. (There have been further V’s
proposed since, and other letters too.)
A more general definition for big data is given by Walkowiak:
Big data is any data that cause significant processing, management,
analytical and interpretational problems.
(Walkowiak, 2016, p. 11)

This definition focuses on the impact that trying to handle big data
causes, rather than trying to quantify what a set of data needs to look
like in order to be considered big. We will return to this in Section 2.
However, first, in Subsection 1.1, we will consider some of the areas in
which big data have had an impact. This is to give you an idea of what
has already proved possible using big data.


1.1 Uses of big data


In this subsection, you will learn about three uses to which big data
have already been put: with respect to internet shopping (Example 1),
predicting flu epidemics (Example 2) and to win a TV game show
(Example 3)!

Example 1 Recommender systems


Online shopping has now become a firm part of shopping habits in the
UK. Shopping websites offer the convenience of being able to buy
goods whenever or wherever someone wants. (Well, if not everywhere,
it offers huge flexibility over locations.)
One of the issues with such websites is knowing which products to
display to (potential) customers in the hope that they will buy them.
The range of possible products to select from can be huge – far too
many to just place in front of a customer. The challenge also goes
beyond ensuring that the customer finds the product they are looking
for. It is in the interests of an online retailer to bring to a customer’s
attention other items they might be interested in – in other words, to
provide a recommender system. This might lead to the customer
purchasing more items from the online retailer than they initially
thought they might.
Unlike a physical store, online retailers have the advantage of being
able to tailor what is seen by each individual customer. Furthermore,
the web interface allows a retailer to collect a wealth of information
about their customers – for example, who has bought what, when and
with what other items. Also, the retailer is able to collect less obvious
information such as which other items were looked at, and for how
long. The result is a very large and complex dataset that keeps
evolving for online retailers to use to try to predict what else each
individual might like.

If you have ever used internet shopping, try Activity 2.

Activity 2 Success of recommender systems

When you last did some internet shopping, did the website also suggest
other items you might be interested in? If so, did the suggestions seem
reasonable to you? Have you bought items that have been suggested to
you by the website?
Share your responses via the M348 forums.


Example 2 Predicting influenza


Influenza (or just ‘flu’ for short) is a viral disease which is monitored
by public health organisations due to its potential to cause a
pandemic resulting in many millions of cases and thousands of deaths.
However, an early detection of a large outbreak is a challenge. It can
take time for information about cases of flu to filter through to public
health organisations via health care professionals. Furthermore,
individuals with flu might not make contact with health care
professionals and instead choose to self-treat.
In 2008, Google launched a system, Google Flu Trends (Ginsberg et
al., 2009), designed to detect outbreaks of flu (or, more accurately,
flu-like illness) earlier by using the information from searches users
conducted. The idea was that as cases of flu increase in a community,
so flu-related search terms entered in a search engine increase. Thus,
it was hoped the rates and locations of the use of such search terms
would indicate rates of flu in close to real time, hence generating
‘nowcasts’ of prevalence. Google Flu Trends was deemed to have
successfully detected a major outbreak in the USA in 2009, albeit
only once the algorithm was tweaked (Cook et al., 2011).
In subsequent years, the system was accused of overestimating
prevalence (Butler, 2013), used as an example of how the use of
big data can go wrong (Lazer et al., 2014), and discontinued in 2015
(Google, 2015).
Nevertheless, the idea has not fully gone away. It has been argued
that the information about searches can help to improve nowcasts of
prevalence of flu (Kandula and Shaman, 2019) and similar has been
tried with COVID-19 based on posts on Sina Weibo, a Chinese
microblogging platform (Guo et al., 2021).

Example 3 Winning Jeopardy!


Another use that big data, and the techniques underlying it, have
been put to is beating humans at various games. Notably, these games
have included chess, Go, and the popular US game show Jeopardy!.
In Jeopardy!, contestants are expected to provide the question that is
associated with a particular supplied answer. For example, in
response to the supplied answer ‘Applied statistical modelling’ an
appropriate question would be ‘What is the title of the OU module
with code M348?’.


In the 2000s, IBM decided as a ‘grand challenge’ that it would
construct a computer to take on the leading competitors on Jeopardy!.
This seemingly frivolous goal required the developers at IBM to
design software that could quickly interpret language presented in a
natural way. This could be done by trawling vast amounts of data to
find the most appropriate response, before suggesting an appropriate
question. The resulting computer, named Watson after IBM’s
founding chairman Thomas J. Watson, was put to the test in 2011.
Figure 1 is a still from one of the shows, aired between 14 February
and 16 February 2011, which ended with Watson coming out on top.

Figure 1 The IBM Watson computer winning against two human competitors on Jeopardy!

So, as you have seen, big data is making its impact in different ways. In
the next subsection, we discuss in more detail the aspects that can make
data big, or what are known as the three V’s.

1.2 The three (or four or more) V’s


At the beginning of this section, the three V’s of big data were introduced.
These were:
• volume
• variety
• velocity.
In this subsection, we discuss why each of these can lead to significant
problems in the sense that Walkowiak suggested in his general definition of
big data (Walkowiak, 2016, p. 11).


1.2.1 Volume
It probably does not surprise you that the size, or volume, of the datasets
being analysed is one aspect of big data. Some of the datasets being
analysed are huge – many thousands of times larger than the data you
have so far analysed in M348. So why is this a challenge?
In many situations, this involves two considerations: computer memory
and computational time.
You will have noticed as you worked through the Jupyter notebooks
produced for this module that data analysis using R requires the dataset
to be loaded into R. In order for that to happen, the data needs to be
stored somewhere that R can access it. Typically, this means being stored
on the same computer where R is running. The memory of the computer
has to be sufficient to be able to store the data. It also has to have
additional memory available for any temporary R objects created during
the analysis. If sufficient memory is not available, R will not be able to
complete the analysis.
Every time you use R to analyse data, this will involve the software getting
the computer to perform various computations to come up with the
results. Whilst computers can do basic computations very quickly, it is
still possible for the number of computations to build so that they take a
noticeable amount of time. It is also possible for the number of
computations to be so great that the analysis takes so long that it becomes
too long to wait. For example, recommender systems such as those
described in Example 1 have to be able to come up with suggestions very
quickly before the (potential) customer loses interest and moves on to
another website (which could be hosted by a rival retailer).
Both of these issues, about computer memory and computer time, can be
solved by technological improvements in computing up to a point. For
example, in the past, PCs and laptops that came with 1 GB of storage
were top end. Nowadays, computers with 1 TB or more of storage are
commonly available. Computers have also become many orders of
magnitude faster. The pace of improvement in computer power – at least
in the last few decades – has been summarised by Moore’s Law, which is
described in Box 1. However, in the case of big data, switching to a bigger
and faster computer is unlikely to be sufficient. Instead, as you will see
in Section 2, the solution lies in harnessing the power of several processors
together.

Box 1 Moore’s Law


A primary determinant of computer power is the number of
components that can be crammed onto an integrated circuit. The more
components, the greater the capability.
In 1965, Gordon Moore, the future co-founder of Intel, wrote a paper
(Moore, 1965) in which he predicted how the number of components
on an integrated circuit would change over time. Interestingly, in the


same paper, he foresaw home computers, personal portable
communications equipment and even electronic wrist-watches as being
possible at a time when they were not yet reality.
In that paper, he predicted that the complexity of such integrated
circuits would double every two years, for at least the next ten years
(i.e. to 1975). In fact, this prediction has turned out to be largely true
for at least 40 years! For example, Figure 2 shows how the number of
transistors on an integrated circuit has increased over time. (Note the
use of the logarithmic scale for the number of transistors.)

Figure 2 Scatterplot of the number of transistors on an integrated
circuit (on a logarithmic scale) against year, 1970 to 2020 (using data
taken from ‘Transistor count’, 2022)

1.2.2 Variety
So far in this module, the data you have worked with have been in the
form of a data frame. Recall that, in a data frame, the data are laid out in
a tabular format. There are a number of observations: the rows. For each
observation, a number of variables are recorded – the same variables for
each observation (though it is possible for some of the individual values to
be recorded as missing). Such data are said to be structured.
However, big data also encompasses data that cannot be captured in a
neat tabular format. For example, the information in tweets, Facebook
postings and YouTube videos can be used to form a big data dataset. Such
data are said to be unstructured. What makes sense as variables in one
tweet/posting/video does not make sense for all the others.


Finally, some data are said to be semi-structured. That is, observations
contain elements that are structured and elements that are not. For
example, emails can be regarded as providing semi-structured data. Emails
contain elements that provide structure, such as recipient, sender, when
sent and subject. The body of the email is unstructured, as it is just text.
Lack of structure in the data provides challenges over how to store the
information efficiently. A neat table no longer works with unstructured
data. It also provides challenges for data analysis. Pre-processing may be
required so that common elements across observations can be picked out.

1.2.3 Velocity
Velocity refers to the speed at which the data are gathered and need to be
processed. So far in this module, the data have effectively been static –
giving the opportunity to take time over analysing the data. However,
some big datasets will be constantly added to, second by second. For
example, the online retailer Amazon (used by individuals and businesses,
and available in many different countries and languages) will constantly be
gathering information about visitors to its website. Moreover, customer
behaviour is also likely to be constantly evolving. So a model that is
appropriate at one point in time is unlikely to remain appropriate. This means there is the
challenge of dealing with this constantly changing landscape.

1.2.4 Other V’s


Since Laney (2001, cited in Diebold, 2021), other V’s have been proposed
that describe challenges in big data. The most notable of these is veracity.
Veracity relates to the extent that the data can be regarded as correct.
When a dataset is small, time and effort is often put into data cleaning.
After all, with few data points it seems obvious that errors in the data
could make a big impact. However, with big data the quality of each
individual data point is frequently much lower. The creators of the dataset
(and contributors to it) might be much less invested in maintaining
accuracy. This is particularly true for exhaust data.
Exhaust data are data that are created as a by-product of other activity.
For example, postings on social media generate data such as when posted,
who posted, topic or who read the posting within 24 hours. However, as
exhaust data are a by-product, there may be no impact on the creator if
the data are incorrect. This reduces, and possibly entirely removes, the
incentive for the creator to correct errors. For example, in a social media
posting, the creator of the posting might not be bothered about spelling
mistakes, or whether a title matches the content. However, this has to be
taken into account if, say, postings are analysed to assess the general mood
of users.
A key thing to remember is that having more data does not always lead to
better answers. Although increasing the sample size reduces sampling
error, it has no impact on some biases. So, whilst the sheer size of big data
datasets can lead to precise estimates, it does not necessarily mean that
the estimates are accurate. A reminder of the difference between precision
and accuracy is given in Box 2.

Box 2 Precision and accuracy


In estimation, precision and accuracy have two subtly different
meanings.
• Precision measures the degree to which similar measurements are
the same as each other. For example, precision is the extent to
which the arrows land close to each other in Figure 3.
• Accuracy relates to the degree to which the estimate is close to
the ‘right’ or ‘true’ answer. For example, accuracy is the extent to
which the arrows are close to the centre of the target in Figure 3.

Figure 3 Three archery targets: (a) showing accuracy but not
precision, (b) showing precision but not accuracy, and (c) showing
accuracy and precision

In the next activity, you will consider a situation where bias might occur.

Activity 3 Have your say . . .

Comment or ‘have your say’ sections are available on some news media
websites to allow the public to express their thoughts about topical issues.
Suggest reasons why an analysis of such postings on a controversial issue
would not necessarily reflect the opinion of the general public accurately.
Would the biases involved here diminish as the number of postings
increases?


2 Handling big data


As you saw in Subsection 1.2, handling big data brings challenges, not
least of which are computational. The emphasis in Units 1 to 8 of this
module has been on model building. You have been introduced to linear
models and generalised linear models. The next activity discusses how
these models are fitted.

Activity 4 Model fitting

How are the linear and generalised linear models fitted to data?
Hint: think about how this question can be answered using different levels
of detail.

As you saw in Activity 4, the question of how a model is fitted can be
answered to varying levels of detail. In Units 1 to 8, little detail beyond
‘by least squares’ or ‘by using maximum likelihood’ is given. Instead,
reliance is placed on R being able to accomplish this effectively and
efficiently.
In the case of simple linear regression, fitting by least squares leads to the
following formulas:
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} and \hat{\beta} = \frac{S_{XY}}{S_{XX}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2},
where \hat{\alpha} and \hat{\beta} are the estimated values of the intercept and slope.
In the case of linear models, described in Units 2 to 4, similar formulas
exist for finding estimates – such as the formula for the multiple linear
regression model, given in Example 4. (Note that fitting ‘by least squares’
and ‘by maximum likelihood’ comes down to the same thing
mathematically for linear models.)

Example 4 Finding parameter estimates for a multiple linear regression model

Recall that when Y is the response variable and x1, x2, . . . , xq are
q explanatory variables, then the multiple linear regression model – or
more simply, the multiple regression model – for a collection of n data
points can be written as
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi, for i = 1, 2, . . . , n,
where the Wi's are independent normal random variables with zero
mean and constant variance σ^2, and xi1, xi2, . . . , xiq are values of the
q explanatory variables (Box 1 in Unit 2 (Subsection 1.1)).

It turns out that, letting β = (α, β1, β2, . . . , βq) and using matrix notation,
the vector of estimated parameters can be calculated using
\hat{\beta} = (X^T X)^{-1} X^T y,
where y = (y1, y2, . . . , yn) is the vector of the n observed values of the response
variable, and X is the n × (q + 1) matrix whose ith row is
(1, xi1, xi2, . . . , xiq).
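To illustrate, the matrix formula can be evaluated directly in R. In the sketch below, y and X are placeholders for a response vector and a design matrix whose first column is all 1s, and the result can be compared with what lm() produces.

# direct evaluation of (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
# a numerically preferable equivalent using crossprod()
beta_hat2 <- solve(crossprod(X), crossprod(X, y))
# compare with R's built-in fitting (dropping the column of 1s, as lm() adds its own intercept)
coef(lm(y ~ X[, -1]))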

As mentioned in the Solution to Activity 4, estimates of parameters in the
simple linear regression model can be found using formulas. These can be
used to create an algorithm to fit the model. In Box 3 we give a
definition for what an algorithm is.

Box 3 Algorithms
Loosely, an algorithm is often taken to mean a formula or rule.
However, a dictionary definition of algorithm is:
a precisely defined set of mathematical or logical operations
for the performance of a particular task.
(OED Online, 2022)

Note that the operations are not just instructions of the form ‘do
this’. They can also contain loops (e.g. ‘do this 10 times’ or ‘do this
until something happens’) and conditions (e.g. ‘if this, do one thing
otherwise do something else’).

In Example 5 we give one such algorithm – one to fit a simple linear
regression line.

Example 5 An algorithm to calculate the least squares regression line y = α + βx for a set of n data points (x, y)

1. Calculate \sum x, \sum y, \sum x^2 and \sum xy.
2. Calculate the means of x and y:
   \bar{x} = \frac{\sum x}{n} and \bar{y} = \frac{\sum y}{n}.
3. Calculate the sum of the squared deviations of the x-values
   \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2
   and the sum of the products of the deviations
   \sum (x - \bar{x})(y - \bar{y}) = \sum xy - n\bar{x}\bar{y}.
4. The slope, \hat{\beta}, is given by
   \hat{\beta} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}.
5. The intercept, \hat{\alpha}, is given by \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.
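These five steps translate fairly directly into R. The sketch below is one possible translation (the function name simple_lm is just our own label); Notebook activity B2.1, later in this unit, asks you to build something along these lines yourself.

# one possible R translation of the algorithm in Example 5
simple_lm <- function(x, y) {
  n <- length(x)
  xbar <- sum(x) / n                  # Steps 1 and 2: sums and means
  ybar <- sum(y) / n
  Sxx <- sum(x^2) - n * xbar^2        # Step 3: sums of squares and products
  Sxy <- sum(x * y) - n * xbar * ybar
  beta <- Sxy / Sxx                   # Step 4: the slope
  alpha <- ybar - beta * xbar         # Step 5: the intercept
  c(intercept = alpha, slope = beta)
}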

Now, for the generalised linear models you have been fitting to data,
not much has been said about how the parameter estimates are found,
beyond using the method of maximum likelihood estimation (for example,
in Subsection 3.1 of Unit 7).
So far, you have not had to worry about the details of the algorithm
used to do the fitting. You have been able to rely on the fact that such
algorithms exist and work sufficiently quickly that you are not left waiting
too long for the computer to provide an answer.
However, with big data the speed of algorithms becomes important.
This is what we will consider in Subsection 2.1. Then, in Subsection 2.2,
we will consider distributed computing – an approach to computing that
can handle the demands of scale that big data brings – before briefly
discussing how this is done in practice in Subsection 2.3.

2.1 Computational time


In this subsection, we will consider the issue of computational time.
That is, how long it takes to implement an algorithm.
In the next activity, you will time the implementation of one particular
algorithm, that for fitting a simple linear regression model.

Activity 5 Implementation time for the simple linear regression algorithm

In Unit 1, you used R to fit a simple linear regression model to data in the
manna ash trees dataset. In that model, height, the height of manna ash
trees, was taken as the response variable Y and diameter, the diameter of
the tree at 1.3 m above the ground, was taken as the explanatory variable,
x. In this activity, we return to the fitting of such a model using data from
the manna ash trees dataset, this time to consider the time it takes to
obtain estimates of model parameters.


(a) Table 1 lists five trees from the manna ash trees dataset. Although
you could use R to calculate the estimated values for the intercept, α,
and slope, β, for the purposes of this activity you should do this
calculation ‘by hand’ and time how long it takes you. (Note: ‘by
hand’ includes making use of a calculator or calculator app.)
Table 1 Five trees from mannaAsh

Tree identity number   Diameter (in m)   Height (in m)
244                    0.32              8
243                    0.26              7
242                    0.28              7
241                    0.26              7
240                    0.35              8

(b) Suppose we now consider 15 other trees from this dataset.
Which steps in the algorithm of Example 5 are likely to take longer
if part (a) is repeated using these trees? Overall how many times
longer would you expect calculating \hat{\alpha} and \hat{\beta} to take?
(c) As the estimation becomes based on more and more trees, which step
or steps will take most of the computation time? In general, how does
the computation time vary as the number of trees increases?

Calculating regression estimates by hand might seem unnecessarily slow.
However, the slowness of the computations in comparison to that achieved
by computers emphasises that each step in an algorithm will take time.
As you have seen in Activity 5, this time can depend on the number of
data points. Hence, the overall time can depend on the number of data
points. Notation used to express this dependency is described in Box 4.

Box 4 Big O notation

Suppose that the length of time of a calculation involving n data
points is given by f(n).
This function f(n) can be said to be of order g(n) for any
function g(n) such that
f(n)/g(n) → C as n → ∞, for some positive constant C.
Rather than saying f(n) is of order g(n), we can say that f(n) is
O(g(n)). This is often referred to as ‘big O’ notation.
For example, the order of f(n) = 2n^2 + 5n + 1 is n^2 because
(2n^2 + 5n + 1)/n^2 → 2 as n → ∞,
and so f(n) is O(n^2) using big O notation.


In the case of producing estimates for a simple linear regression line,
some of these calculations take about the same length of time regardless
of the number of data points. That is, they are O(1). Other calculations,
in contrast, will take longer as the number of data points, n, increases.
For example, the time required for calculations that are O(n) goes up
linearly with n. The time required for calculations that are O(n log(n))
or O(n^2) goes up faster than linearly as n increases.
It is the calculations where the function g(n) in O(g(n)) increases with n
that are important with big data. They are the ones that could lead to the
analysis taking too long, even with a fast computer.
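One informal way of seeing this kind of growth is to time the same calculation in R for increasing n; a minimal sketch (the sample sizes chosen are arbitrary):

# timing an O(n log(n)) operation (sorting) for increasing sample sizes
for (n in c(1e5, 1e6, 1e7)) {
  x <- runif(n)
  cat("n =", n, ":", system.time(sort(x))["elapsed"], "seconds\n")
}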
It is not only the number of data points that makes a difference to the
computation time. Often there is more than one algorithm for the same
task. For example, consider the calculation of the standard deviation.
Two methods (algorithms) that you may have come across are given in
Boxes 5 and 6.

Box 5 Calculating the standard deviation: Method 1

1. Calculate the mean \bar{x} = \frac{\text{sum}}{\text{size}} = \frac{\sum x}{n}.
2. Calculate the deviations (x - \bar{x}).
3. Square the deviations (x - \bar{x})^2.
4. Calculate the variance by summing the squared deviations, to give
   \sum (x - \bar{x})^2, and dividing by (n - 1): that is,
   variance (or s^2) = \frac{\sum (x - \bar{x})^2}{n - 1}.
5. Calculate the standard deviation as s = \sqrt{\text{variance}}.

Box 6 Calculating the standard deviation: Method 2

1. Calculate the sum of the data values, \sum x.
2. Calculate the sum of the squares of the data values, \sum x^2.
3. Calculate the sum of the squares of the deviations, \sum (x - \bar{x})^2, as
   \sum x^2 - \frac{(\sum x)^2}{n}.
4. Calculate the variance by dividing \sum (x - \bar{x})^2 by (n - 1): that is,
   variance (or s^2) = \frac{\sum (x - \bar{x})^2}{n - 1}.
5. Calculate the standard deviation as s = \sqrt{\text{variance}}.
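Both methods are straightforward to express in R, and their results can be checked against R's built-in sd() function. In this minimal sketch the function names sd_method1 and sd_method2 are just our own labels, and the data are arbitrary.

sd_method1 <- function(x) {        # Box 5: via deviations from the mean
  n <- length(x)
  xbar <- mean(x)
  sqrt(sum((x - xbar)^2) / (n - 1))
}
sd_method2 <- function(x) {        # Box 6: via the sum and the sum of squares
  n <- length(x)
  sqrt((sum(x^2) - (sum(x))^2 / n) / (n - 1))
}
x <- runif(10)
c(sd_method1(x), sd_method2(x), sd(x))   # all three values should agree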


Mathematically, the methods in Boxes 5 and 6 result in the same value for
the standard deviation. But what about the computational time?
Consider this now in Activity 6.

Activity 6 Options for calculating a standard deviation

(a) Using Method 1, time how long it takes you to calculate by hand
the standard deviation for the following set of data.
14.59 18.97 24.56 54.28 32.15

(b) Using Method 2, time how long it takes you to calculate by hand
the standard deviation for the same set of data.
(c) Which method was faster for this particular dataset? For larger
datasets, which method do you think is going to be faster?

As you have seen in Activity 6, although two different algorithms
might end up with the same results, the computational time can vary.
In particular, the way the computational time depends on n might be
different, as you will see in Example 6.

Example 6 Sorting data


Recall that the calculation of the median for a set of data requires the
data to be sorted. There are various algorithms to sort data such as
bubble sort, merge sort and quick sort. Quick sort, as its name
implies, is regarded as a quick method of sorting data and
is O(n log(n)). In comparison, whilst bubble sort is conceptually
simple, it is generally slower as it has been shown to be O(n^2).

With small datasets, the difference in speed may well be unimportant.
For example, if you only have one standard deviation to calculate, does it
matter if, say, it takes 0.0002 seconds instead of 0.0001 seconds? You are
unlikely to notice the difference. However, for a larger dataset differences
in execution time can become significant. After all, whilst you are unlikely
to notice a doubling of an execution time from 0.0001 seconds to
0.0002 seconds, you might notice a doubling from 1 second to 2 seconds
and you of course would notice a doubling from 1 hour to 2 hours.
Even with the same algorithm, the computation time associated with
different implementations can be substantially different. The skill of the
programmer, along with the language it is programmed in, both make an
impact. The best programmers will be able to exploit features in whatever
programming language they are working in to enable the required
computations to be carried out as efficiently as possible.


Clever programming and fast computers will only go so far. It is not
unusual for big datasets to outstrip what can be stored on a single
computer, let alone for the computations involved with any data analysis
to be done within a reasonable time frame. The solution is to harness the
power of multiple processors working together in tandem – distributed
computing.

Aside 1 Human computers


The term ‘computer’ has not always been used to describe a piece of
technology that carries out computations; in fact, it was first applied
to people who carry out computations. For example, between 1935
and 1970, the Langley Memorial Aeronautical Laboratory in
California employed women as computers. At that time, ‘computer’
was a job title for someone who ‘performed mathematical equations
and calculations by hand’ (NASA, 2016). The women celebrated in
the 2016 film Hidden Figures and depicted in Figure 4 were also
employed as computers (at NASA or its precursor), though their
contributions were far greater than the appellation ‘computer’ implies
(Howell, 2020).

Figure 4 The three brilliant human computers celebrated in Hidden
Figures are, from left to right, Mary Jackson (Janelle Monáe), Katherine
Johnson (Taraji P. Henson) and Dorothy Vaughan (Octavia Spencer)


2.2 Distributed computing


For your study of M348, it has been assumed that when you have been
carrying out data analysis you have used a single computer. Thus, the
amount of data you can feasibly analyse is dependent on the capability of
that computer. Faster computers with more memory are able to process
more data and generate the results in a reasonable length of time.
With distributed computing, the capability of a single computer is no
longer the critical factor. Instead, distributed computing combines the
computing power of multiple individual processors. The processors do not
have to be particularly powerful computers individually, as the following
example, Example 7, demonstrates. Nor do the processors have to be
separate computers combined into a cluster (which may or may not be
geographically close). The processors might all be contained in the same
computer. This is something you will make use of later in Subsection 2.3.

Example 7 The Raspberry Pi as processor


The Raspberry Pi was introduced as an inexpensive computer.
However, its relatively low-end specification does not stop numbers of
these computers being combined as processors in a distributed
computing set-up (Evans, 2019), as demonstrated in Figure 5.

Figure 5 A cluster of Raspberry Pis

In a distributed computing set-up, handling increasing amounts of data
can be done by adding more processors. This means the size of data that
can be handled is only limited by the number of processors. Thus, in
principle, even the biggest of datasets can be analysed, provided enough
processors are available.


Example 8 demonstrates the data handling capability that has already
been created using distributed computing.

Example 8 The Worldwide LHC Computing Grid


As of 2021, the Worldwide LHC Computing Grid is the world’s largest
computing grid, with 1 million computer cores in 170 computing
centres spread across 42 countries (CERN, 2021). This grid, part of
which is depicted in Figure 6, is used by physicists worldwide to
process the vast amount of data generated by scientific experiments
performed at the Large Hadron Collider (LHC). (Data from these
experiments were mentioned in Activity 3 of Subsection 2.1, Unit 1.)

Figure 6 Part of the Worldwide LHC Computing Grid

Note that, in a distributed computing set-up, computations on the data
are structured so that the bulk of the work is spread out amongst the
processors and can be worked on simultaneously by these processors. This
means that the data can also be spread across multiple processors so that
any single processor just holds a part of the data.
To see how this works in principle, consider again Activity 5 where you
estimated the slope and intercept of a simple linear regression model for a
dataset comprising just five trees. In that activity, it was noted that if the
dataset comprised 15 trees, the computation time would be roughly
three times as long. In the following activity, you will explore how, by
enlisting the help of friends to form a distributed computing cluster, it is
possible for the regression estimates based on 15 trees to be found almost
as fast as for that based on five trees. (But don’t worry, you do not need
these friends to be available in order to complete this activity!)


Activity 7 Estimating a simple linear regression by hand – with the help of friends

(a) Look back at Activity 5 (Subsection 2.1). Which step, or steps, in the
algorithm to estimate the simple linear regression parameters take
longer as the number of trees increases?
(b) Recall that Step 1 of the algorithm requires four sums to be
calculated: \sum x, \sum y, \sum x^2 and \sum xy. Focus on the calculation of the
first sum: \sum x.
Which of the following strategies for calculating \sum x are valid (that
is, end with the correct value of this sum)?
(i) Working out the sum x1 + x2 + · · · + x15.
(ii) Calculating
s1 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
s2 = x9 + x10 + x11 + x12 + x13 + x14 + x15,
then calculating the sum as s1 + s2.
(iii) Calculating
s1 = x1 + x2 + x3 + x4 + x5,
s2 = x6 + x7 + x8 + x9 + x10,
s3 = x11 + x12 + x13 + x14 + x15,
then calculating the sum as s1 + s2 + s3.
(iv) Calculating
s1 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
s2 = x9 + x10 + x11 + x12 + x13,
s3 = x14 + x15,
then calculating the sum as s1 + s2 + s3.
(c) Suppose that you have two friends available to help with the
calculation of \sum x (and that your friends are as quick and accurate as
you are when it comes to doing summations by hand). Which of the
valid strategies given in part (b) is likely to be fastest?
(d) Suppose you use strategy (iii), what data does each of your friends
need to be given in order to complete their allocated calculations? For
the data they are given, do you need to have it too?

In a distributed computing cluster, one processor needs to act as a leader.
This processor coordinates the work of the other, follower, processors. For
example, the leader keeps a record of which processor has which chunk of
data and tells each processor what to do with its data. In Activity 7, this
leader was implicitly assumed to be you, though it is possible for the
processor acting as the leader to change during the computation.


As you have seen in Activity 7, splitting a computation amongst a number
of people – or more commonly amongst a number of processors – can help
keep the computation time reasonable as the sample size rises. However, it
is not the case that using distributed computing is better in all
circumstances. Using distributed computing brings its own overheads in
terms of coordinating the calculations. That is, keeping track of which
processor is dealing with which subset of the data and collating the results
at the end. So these overheads may not be worth it if the dataset is small.

Activity 8 Is distributed better?


In Activity 7, you considered the impact of enlisting friends to help
estimate a simple linear regression line by hand from a dataset of 15 trees.
(a) Think about the process of managing the calculation. What tasks
need to be completed? (Here you do not need to consider the
additional burden of persuading two others to help.)
(b) Suppose the dataset only comprised 12 trees, or 9 trees, or just
6 trees. Is enlisting the help of two friends still worth it?

In the Solution to Activity 8, the need to check whether one of your friends
fails to deliver the subsum they promised was mentioned. In distributed
computing, such a circumstance is not confined to the vagaries of human
behaviour: it is also a consideration when using a cluster of processors to
do the computations. There is always the risk that an individual processor
does not complete or return its assigned computation – that it ‘fails’ in
some sense. Now, the probability of an individual failure is usually low.
However, once several processors are combined into a cluster, the
probability that at least one fails can easily become non-negligible, as you
will discover in Activity 9.

Activity 9 The probability of failure


Suppose during a set of computations the probability that an individual
processor fails is 0.1, and that the failure of any one processor is
independent of the failure of any other processor.
(a) Suppose that the cluster consists of 2 processors. What is the
probability that at least one of them fails?
(b) Suppose that the cluster consists of 5 processors. What is the
probability that at least one of them fails?
(c) How many processors do there need to be so that the probability that
at least one processor fails is 0.5?
(d) Now suppose that the probability of failure is 0.01. How many
processors do there need to be before the probability that at least one
fails is 0.5?


So, the risk of at least one processor failing increases with the number of
processors. This means that in setting up distributed computing clusters,
it is something that has to be confronted. Additionally, there might be
problems with the network used to allow computers to communicate with
each other. So, implementations involving distributed computing need to
include strategies for coping when a follower processor, the network or even
the processor acting as the leader fails. Then failure of a processor becomes
a minor inconvenience rather than sabotaging the whole computation.
This is in addition to having strategies for optimising the distribution of
calculations across the processors. This means there is a cost, in terms of
computation time, in using distributed computing. So, whilst distributed
computing can bring great reductions in computation time when big data
is worked on, it is often not worth it for small data.

2.3 Distributed computing in practice


Subsection 2.2 dealt with the general principles underlying distributed
computing. As mentioned in that subsection, implementing distributed
computing includes implementing strategies for ensuring that a cluster of
processors work well together. How this is managed is non-trivial.
The MapReduce algorithm (Figure 7) is a generic algorithm for
implementing computations using distributed computing. This algorithm
dates back to at least 2004, when Dean and Ghemawat (2004) described a
system that had been implemented within Google. Since then, software
programs such as Hadoop and Spark have been developed based on the
MapReduce algorithm. There are also packages that extend the capability
of Python and R to handle computations done in a distributed way.

Figure 7 A simple diagram of the MapReduce algorithm, taken from the
‘Fundamentals of MapReduce’ blog by Oridupa (2018): the input is split
across several map tasks, whose outputs are combined by reduce tasks to
produce the output

A variation on the same principle is the split-apply-combine algorithm


(Wickham, 2011). ‘Split’ means to break the problem into manageable
pieces, to ‘apply’ a calculation or calculations to the pieces separately and
then to ‘combine’ the results.
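To make the idea concrete, here is a minimal split-apply-combine sketch in base R for computing an overall sum in chunks (the data and the choice of three chunks are arbitrary):

x <- runif(15)                            # some arbitrary data
chunks <- split(x, rep(1:3, each = 5))    # split: three chunks of five values
partial <- lapply(chunks, sum)            # apply: sum each chunk separately
total <- Reduce(`+`, partial)             # combine: add up the partial sums
all.equal(total, sum(x))                  # the chunked answer matches the direct sum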


So far in this section, you may have been assuming the multiple processors
for distributed computing have to be in different computers. However,
modern computers generally have more than one processor in them. This
enables your computer to do two or more separate computations at the
same time. We will make use of this in Subsection 2.3.1, where you will
use R for distributed computing. In particular, you will use distributed
computing to fit a simple linear regression model. This is just one of the
many statistical techniques you have used so far in this module. In
Subsection 2.3.2, we will explore whether converting all these techniques to
make the most of a distributed computing environment is likely to be
straightforward.

2.3.1 Using R for distributed computing


In this subsection, you will work through implementing an algorithm using
distributed computing and timing how long it takes compared with an
equivalent algorithm that does not use distributed computing.
As a start, in Notebook activity B2.1 you will write lines of code to
implement an algorithm. If you just want to apply the algorithm once,
such an approach is sufficient. However if, say, the need is to implement
the algorithm many times, with different data, it helps to bundle up the
code in such a way that you can get the computer to refer back to it easily.
This makes it quicker and easier to write subsequent code that relies on
getting the answer from the algorithm. In Notebook activity B2.2, you will
do this for the code you produced in Notebook activity B2.1. Finally, in
Notebook activity B2.3, you will use a distributed computing version of
the same algorithm. Note for this you will still only need one computer.

Notebook activity B2.1 Implementing an algorithm


In this notebook, you will implement an algorithm that does not make
use of distributed computing. The particular algorithm chosen is that
for simple linear regression given in Example 5.

Notebook activity B2.2 Creating a function


In this notebook, you will take the code you produced in Notebook
activity B2.1 and turn it into a function in R. This will enable you to
implement simple regression using just one line of code.

Notebook activity B2.3 Distributed computing in R


In this notebook, you will use distributed computing to fit a simple
linear regression. You will also learn how to time computations in R
and discover whether using the distributed computer version of the
algorithm is faster with the size of dataset you will be working with.
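
To give a flavour of what is involved, here is a minimal sketch of distributed computing on a single computer using R's parallel package. (This is just one possible approach, and not necessarily the code used in the notebook activities.) The work of applying a slow function to many inputs is shared across the available processors.

library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.1)   # stand-in for an expensive calculation
  x^2
}

n_cores <- detectCores()                      # how many processors are available
cl <- makeCluster(n_cores)                    # set up a cluster of follower processes
results <- parLapply(cl, 1:20, slow_square)   # distribute the work across the cluster
stopCluster(cl)                               # release the cluster when finished

unlist(results)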


2.3.2 Structuring calculations for distributed computing
In the MapReduce algorithm and the split-apply-combine algorithm, a first
step is to break the problem down into manageable pieces. So when
formulating an algorithm for a statistical technique that is suitable to be
implemented using distributed computing, it is important to know whether
(and how) analyses from subsets (chunks) of data can be combined. In
Activity 10, you will consider first the calculation of a mean.

Activity 10 Calculating a mean using distributed computing

Suppose that a mean of some data is required. Furthermore, suppose that


these data are split into two subsets. The first subset, consisting of $n_1$ data
points, has sample mean $\bar{x}_1$. Similarly, the second subset consists of $n_2$
data points and has sample mean $\bar{x}_2$.
(a) Give a formula for calculating the overall mean, $\bar{x}$.
(b) Can this formula be extended to deal with lots of subsets? (Assume
that the number of data points and the sample mean in each subset
are known.)
(c) Why does this suggest that a mean can be calculated using
distributed computing?

You have just seen, in Activity 10, that the mean can be calculated using
distributed computing. However, not all statistics are so amenable as the
mean to compute using a distributed computing set-up. One example is
the calculation of the median. To see one reason why, work through
Activity 11.

Activity 11 Calculating the median using subsets

Suppose that we have the following dataset.


1 2 5 15 21 37 103 117 242
(a) What is the median of this dataset?
(b) If the data were split into the following subsets (each containing three
of the numbers), calculate the median of each subset.
1 15 103, 2 21 117, 5 37 242

(c) Repeat part (b) but with the following split.


1 2 5, 15 21 37, 103 117 242

(d) Repeat part (b) but with the following split.


1 5 21, 15 37 117, 2 103 242


(e) Using your answers to parts (b) to (d), calculate the median of the
three medians. Does this always give the median for the whole
dataset?
(f) Repeat part (e), but this time calculate the mean of the medians.
Is this any better?

Activities 10 and 11 demonstrate that computing the median using


distributed computing is a harder problem than computing the mean
(though not one that is insurmountable). So some techniques are easier to
convert to distributed computing than others.
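
The contrast can be seen directly in R. The sketch below (a minimal illustration using the dataset from Activity 11) combines chunk-level summaries: the chunk means combine exactly into the overall mean, whereas the median of the chunk medians depends on how the data happen to be split.

x <- c(1, 2, 5, 15, 21, 37, 103, 117, 242)
groups <- c(1, 3, 1, 2, 1, 2, 3, 2, 3)   # the split used in part (d) of Activity 11
chunks <- split(x, groups)

# Chunk means (with the chunk sizes) combine exactly into the overall mean
ns <- sapply(chunks, length)
chunk_means <- sapply(chunks, mean)
sum(ns * chunk_means) / sum(ns)   # the same value as mean(x)
mean(x)

# The median of the chunk medians is not, in general, the overall median
median(sapply(chunks, median))
median(x)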

3 Models and algorithms


In Subsection 2.1, you have seen how the choice of algorithm can affect the
computation time. With big data, the application of an algorithm to some
data can take a significant amount of computation time. At worst, this
could mean that the answer cannot be obtained sufficiently quickly for it
to be useful.
However, getting an answer sufficiently quickly is not the only
consideration when it comes to algorithms. Can we be sure that the
answer we get is the ‘best’ answer or at least a ‘good’ answer?
In Subsection 3.1, you will see that even if we get the ‘best’ answer, this
answer is often not exact. Instead, the goal is to get sufficiently close to
the best answer. In Subsection 3.2, you will see that a useful algorithm
may not always produce the ‘best’ answer. Further, in Subsection 3.3,
you will see that there may not even be an unambiguous answer as to what
the ‘best’ answer looks like.

3.1 Convergence
Recall that, in Boxes 5 and 6, two different algorithms for calculating the
standard deviation were given. Both algorithms appear to be
straightforward – they each describe a sequence of five steps that ends up
with the standard deviation being calculated. But does this mean that
using either of these algorithms calculates the standard deviation exactly,
regardless of the data it is applied to? Unfortunately, the answer is no, not
quite, as you will discover in Activity 12.


Activity 12 Is the standard deviation calculated exactly?

(a) Consider first a set of data whose values are as follows:


x1 = −3, x2 = 1, x3 = 2.
(i) Calculate the sum of the squares of the deviations, $\sum (x - \bar{x})^2$.
    Is the value you obtain exact? Why or why not?
(ii) Calculate the variance. Is the value you obtain exact? Again,
why or why not?
(iii) Calculate the standard deviation. Is this value exact? Why or
why not?
(b) Consider now a set of data whose values are as follows:
x1 = −1, x2 = 0, x3 = 0, x4 = 1.
(i) Calculate the sum of the squares of the deviations, $\sum (x - \bar{x})^2$.
    Is the value you obtain exact? Why or why not?
(ii) Calculate the variance. Is the value you obtain exact? Again,
why or why not?
(iii) Calculate the standard deviation. Is this value exact? Why or
why not?

So, as you have seen in Activity 12, the standard deviation will not usually
be exact because the exact value corresponds to an irrational number.
Furthermore, the variance, a quantity used in an intermediate step of the
calculation, may not be an exact number either. So, all of this limits how
accurately the standard deviation can be calculated.

Aside 2 Accuracy of numbers stored in a computer


Data, including numbers used in calculations, are usually stored as
binary numbers. A consequence of this is that some numbers which
can be written down in just a few decimal digits are not exactly
stored by a computer. For example, the decimal number 0.1 is, in
binary, 0.000110011001100 . . . , so, in calculations, the computer has
to use a rounded version, which then will not translate back to
equalling 0.1 (in decimal) exactly.

Not being able to calculate a value exactly does not just apply to
calculating the standard deviation: it is a general feature of algorithms.
The issue is, instead, whether the value is close enough to the exact value.
What counts as ‘close enough’ will depend on the context.
First of all, there is no point trying to go beyond what is known as
machine precision. That is, the precision with which the processor (or
processors) is able to perform basic calculations. This is because two


numbers that should mathematically be identical may differ by about this


amount simply due to rounding error. With modern processors, the
machine precision is typically small. For example, in R it is currently of
the order of $1 \times 10^{-16}$ (R Development Core Team, 2022, p. 5).
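
In R, both the machine precision and the effect of storing decimal numbers in binary (described in Aside 2) can be inspected directly:

.Machine$double.eps      # the machine precision: roughly 2.2 * 10^-16
0.1 + 0.2 == 0.3         # FALSE, because 0.1, 0.2 and 0.3 are stored as
                         # rounded binary numbers
abs((0.1 + 0.2) - 0.3)   # the discrepancy is of the order of machine precision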
‘Close enough’ also depends on what would be considered spurious
accuracy. It only matters whether the calculated value differs from the
exact value with respect to the precision that will be used when the value
is reported. For example, in Activity 12, you calculated the standard
deviation for two sets of data. Both these datasets are small, and the
values given as whole numbers. So, it could be argued that quoting the
standard deviation to more than one or two decimal places would represent
spurious accuracy. This means, in both these cases, it could be argued that
‘close enough’ is ‘equal to two decimal places’. Close enough!
Knowing when an answer will be ‘close enough’ is particularly important
with algorithms that take an initial estimate and aim to incrementally
improve it. Then, it provides a means for setting a sensible stopping
criterion. That is, when the algorithm has converged.
In Example 9, we give an algorithm which uses a ‘close enough’ type of
criterion as a stopping rule.

Example 9 An algorithm to calculate the (positive) square root
Suppose that the positive square root, s, of a number, V, is required.
One algorithm for calculating s is as follows.
1. Set $s_0 = c$, where $c > 0$ is an initial guess for s, and set $i = 0$.
   (Although it is better if c is a good guess, it does not have to be.)
2. Set $i = i + 1$.
3. Set
   $$s_i = \tfrac{1}{2}\left(s_{i-1} + \frac{V}{s_{i-1}}\right).$$
4. Repeat Steps 2 and 3 until $|s_i^2 - V| < \varepsilon$, for some preset small
   value $\varepsilon$. The value $s_i$ is then taken to be $\sqrt{V}$.
Here, Step 4 is the stopping criterion with $\varepsilon$ measuring what counts as
‘close enough’.

For the algorithm to calculate a square root described in Example 9, the


value chosen for ε will, in general, affect how long the algorithm takes to
run. The smaller the value, the more times Steps 2 and 3 will be done, and
hence the longer it will take.
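
The algorithm in Example 9 translates directly into a few lines of R. The sketch below is a minimal implementation of that algorithm (it is not the method used by R's built-in sqrt() function), with an iteration counter added so that the effect of changing ε can be seen.

# The square root algorithm of Example 9, with an iteration counter
sqrt_example9 <- function(V, c = 1, eps = 1e-8) {
  s <- c                                # Step 1: the initial guess
  iterations <- 0
  while (abs(s^2 - V) >= eps) {         # Step 4: the stopping criterion
    s <- 0.5 * (s + V / s)              # Steps 2 and 3: update the estimate
    iterations <- iterations + 1
  }
  list(root = s, iterations = iterations)
}

sqrt_example9(2, eps = 1e-2)    # a rougher answer after fewer iterations
sqrt_example9(2, eps = 1e-12)   # a closer answer, but more iterations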


Another way of reducing the computation time is to design the algorithm so
that it converges more quickly. That is, so that the difference between the
current value and the final answer reduces more quickly. However, there is
often a trade-off: fewer steps are required but each step requires more
computation, as Example 10 illustrates.

Example 10 Another algorithm to calculate the (positive) square root
Another algorithm to calculate the positive square root, s, of a
number, V, is as follows.
1. Set $s_0 = c$, where $c > 0$ is an initial guess for s, and set $i = 0$.
   (Although it is better if c is a good guess, it does not have to be.)
2. Set
   $$a_i = \frac{V - s_i^2}{2 s_i}.$$
3. Set $b_i = s_i + a_i$.
4. Set $i = i + 1$.
5. Set
   $$s_i = b_{i-1} - \frac{a_{i-1}^2}{2 b_{i-1}}.$$
6. Repeat Steps 2 to 5 until $|s_i^2 - V| < \varepsilon$, for some preset small
   value $\varepsilon$. The value $s_i$ is then taken to be $\sqrt{V}$.
It can be shown that this algorithm converges faster than that given
in Example 9. That is, in general, Steps 2 to 5 need to be repeated
less often than Steps 2 and 3 in the algorithm given in Example 9 to
obtain a value of $s_i$ whose square is within $\varepsilon$ of V. However, Steps 2 to 5 are
more complicated than Steps 2 and 3 in Example 9.
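
In the same spirit, the algorithm in Example 10 can be sketched in R and its iteration count compared with that of the Example 9 sketch given earlier (again, this is purely an illustration, not how sqrt() works in R).

# The square root algorithm of Example 10, with an iteration counter
sqrt_example10 <- function(V, c = 1, eps = 1e-12) {
  s <- c                             # Step 1: the initial guess
  iterations <- 0
  while (abs(s^2 - V) >= eps) {      # Step 6: the stopping criterion
    a <- (V - s^2) / (2 * s)         # Step 2
    b <- s + a                       # Step 3
    s <- b - a^2 / (2 * b)           # Steps 4 and 5: update the estimate
    iterations <- iterations + 1
  }
  list(root = s, iterations = iterations)
}

# Typically converges in fewer iterations than the Example 9 sketch,
# but each iteration involves more calculation
sqrt_example10(2, eps = 1e-12)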

So, as Examples 9 and 10 have demonstrated, there can be different


algorithms to achieve the same ends (in this case, to calculate the positive
square root of a number.) As you have seen, even when you know that
both algorithms will converge to the right answer, which is fastest may not
be obvious. So, how to choose which one to use? One option that is often
the best one in practice is to go with algorithms that have already been
programmed into standard software or programming languages. These are
likely to have been chosen for their good performance. For example, R has
a command for calculating the square root of a number. Documentation
for R does not make it clear what algorithm it uses to do this task. But a
reasonable assumption is that the developers of R have made a good choice
over what algorithm to use.


3.2 One solution or many solutions?


So far, in our discussion of algorithms, there has been the implicit
assumption that an algorithm will result in the best answer being found,
all of the time. However, this is by no means guaranteed.
First, two algorithms to solve the same problem will not necessarily give
the same result.
Example 11 reminds you of a couple of algorithms you have already met
that try to solve the same problem. In Activity 13, you will consider
whether it matters which algorithm we choose to use.

Example 11 Selecting explanatory variables


Recall, from Unit 2, that one task associated with multiple regression
is deciding what the most parsimonious model is. That is, the
simplest model that fits the data well.
In Unit 2, you were introduced to two different strategies for achieving
this: forward and backward stepwise regression. Each of these
approaches can be thought of as an algorithm for obtaining a
parsimonious model. Each describes a series of steps that can be
methodically worked through to end up with a final model.

Activity 13 Forward or backward stepwise regression: does


it matter?
Would you expect forward and backward stepwise regression to always
result in the same parsimonious model? (When they are applied to the
same data, of course.) Justify your answer.

This does not stop backward or forward stepwise regression from being
useful. Both might produce reasonable parsimonious models, just not
necessarily the same parsimonious model.
Even when you are applying the same algorithm to the same data,
you might not end up with the same result. As you will see in Example 12,
you have already met one such algorithm in Unit B1: the k-means
algorithm.


Example 12 k-means – one dataset, multiple solutions


In Activities 22 and 24, Subsection 4.4 of Unit B1, recall that you
used k-means to find a two-cluster solution based on height for a
group of five friends.
In Activity 22, the clusters that were found were as follows:
• Cluster 1: Billy and Cath
• Cluster 2: Adnan, Dan and Elise.
In Activity 24, the clusters that were found were as follows:
• Cluster 1: Adnan, Billy, Cath and Elise
• Cluster 2: Dan.
So, these are different clusters, and not just different labels attached
to the same clusters. This is because, for k-means, the starting
configuration matters.

As you have seen in Activity 13, algorithms with the same end goal do not
necessarily end up producing the same answers. Even the same algorithm,
applied to the same data, may not result in the same answer, as you have
seen in Example 12. So, why are such algorithms still regarded as useful?
One reason is that proving that an algorithm will always produce the ‘best’
answer is often a non-trivial task. This is particularly true if we also need
to prove that the algorithm will produce an answer in a reasonable amount
of time. The best that might be achieved is that the algorithm is shown to
produce the best answer for all the examples it has been applied to.
Perhaps more importantly, the problem that needs solving may be
sufficiently hard that an algorithm to produce the ‘best’ answer all the
time does not exist – perhaps cannot exist. For example, it has been
famously proven that an algorithm to always correctly detect whether
other algorithms will always come to an end cannot be constructed. Or it
might be that it is possible to find an algorithm that will produce the
‘best’ answer for some classes of the problem but not others, such as the
problem of finding a maximum described in Example 13.

Example 13 Finding the maximum


Maximum likelihood estimation means that finding the maximum of
a function is a problem that occurs frequently in statistics.
Suppose we are just interested in finding the maximum likelihood
estimate of a single parameter $\theta$. That is, the value, $\hat{\theta}$, which
maximises the likelihood function, $f(\theta)$. If $f(\theta)$ is known to be a
unimodal function (that is, a function with a single maximum), this is a relatively
easy problem. Even if calculus does not come to our rescue, various
hill-climbing algorithms exist. That is, algorithms where $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_i$
are chosen so that $f(\hat{\theta}_{i+1}) > f(\hat{\theta}_i)$.
However, if $f(\theta)$ is such that there are multiple maxima, finding $\hat{\theta}$
is much more difficult. In particular, hill-climbing algorithms will tend
to only reliably find a local maximum, not the global maximum.

In such cases, it matters what algorithm is used and what starting point
the algorithm uses. These then become details that should be reported as
part of the data analysis.
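
A quick R experiment illustrates the point. The function below is a hypothetical one, chosen purely because it has more than one maximum; a general-purpose optimiser started from different points can settle on different local maxima.

# A function with more than one local maximum (chosen purely for illustration)
f <- function(theta) sin(theta) + 0.4 * sin(3 * theta)

# optim() minimises by default; fnscale = -1 asks it to maximise instead.
# Starting from different points can give different local maxima.
optim(par = 1, fn = f, method = "BFGS", control = list(fnscale = -1))$par
optim(par = 4, fn = f, method = "BFGS", control = list(fnscale = -1))$par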

3.3 Is there a ‘best’ answer?


Implicit in our discussion so far in this section is that it is easy to define
what properties the answer produced by an algorithm should have. That
is, that there is a recognisable ‘best’ answer. In this subsection, we will
consider some situations that have produced multiple answers, which you
have already met in this module. We start in Activity 14 by comparing
different options for parameters in a simple linear regression model.

Activity 14 Comparing model fits in regression


In Subsection 2.1, a small dataset on the heights and widths of trees was
given. Suppose a couple of algorithms are used to fit the model
height = α + β width.
Further, suppose the first algorithm produces $\hat{\alpha}_1 = 4$ and $\hat{\beta}_1 = 13$ and that
the second algorithm results in $\hat{\alpha}_2 = 3.6$ and $\hat{\beta}_2 = 12.9$. Is it possible to
unambiguously decide which algorithm has produced the better answer?

For the models discussed in Units 1 to 8, the fitting has been done ‘by
least squares’ or ‘by maximum likelihood’. Both these approaches provide
an unambiguous definition of what the estimates should be. As noted in
Example 13 (Subsection 3.2), the MLEs are the values of the parameters
for which the likelihood is maximised. However, in statistics, not all
algorithms are trying to solve problems where there is an unambiguous
notion of what makes one answer better than another. Some algorithms
are developed, and adopted, on the basis that the approach seems like a
sensible one to take. This is backed with evidence that the algorithm
works at least with some test cases. In the following couple of activities,
you will consider whether some algorithms you have already met can be
thought of in such terms.


Activity 15 Comparing variable selections

Activity 13 (Subsection 3.2) demonstrated that when it comes to selecting


a parsimonious model in multiple regression, which algorithm is used can
matter. Forward and backward stepwise regression can produce different
selections of variables. Is it possible to unambiguously judge which
algorithm has produced the better answer?

Activity 16 Cluster analysis

Unit B1 focused on cluster analysis. Think about the methods for cluster
analysis introduced in that unit: hierarchical clustering, k-means,
and DBScan. To what extent are the differences between these methods
about fitting different models to the data and to what extent are they
about using different algorithms?

So, as you have seen in Activities 15 and 16, not only might different
algorithms produce different answers to the same problem, it might be
ambiguous as to which solution is better. In these situations, it is
important to keep track of which algorithm has been used. For example,
when reporting a cluster analysis, the report should include whether
clusters have been found using agglomerative hierarchical clustering (along
with which linkage), a partitional clustering method such as k-means or a
density-based clustering method such as DBScan.

4 Outputs from big data analysis


As its name suggests, the focus of this module has been on statistical
modelling. In particular, on linear and generalised linear modelling.
One pertinent question is, therefore, to what extent these methods are
relevant when it comes to big data.
The simple answer is that, in principle, this type of modelling is still
relevant. After all, as noted in Section 1, there is no clear dividing line
between when data changes from being small data to being big data.
However, as you will discover in this section, the outputs from the analysis
of big data, and how they are interpreted, can feel very different. In some
cases, we have data from an entire population rather than just a sample –
this is the subject of Subsection 4.2. The goal of the analysis may almost
exclusively be on prediction rather than explanation – the subject of
Subsection 4.3. Also in some cases, as we will consider first in
Subsection 4.1, correlations are regarded as important, whether they seem
plausible or not.


4.1 Correlation versus causation


From your previous study in statistics you are probably familiar with the
mantra:
Correlation is not causation!
This mantra is there to remind statisticians (and indeed anyone using
the output from an analysis) of the limitations of statistical analysis.
Determining that two variables are related to each other is often a
straightforward statistical task. In contrast, determining whether a
relationship is causal is more difficult and requires assumptions in addition
to appropriate statistical analysis.

Example 14 Strength and weight of footballers


In Unit 2, you saw that the strength of a footballer can be related to
their weight via the following equation.
strength = α + β weight.
Does this mean that weight has a causal impact on a footballer’s
strength? No, of course it doesn’t. Otherwise, it would be saying that a
simple way for a footballer to become stronger would be to
consistently eat far more calories than they burn and become
overweight.

Furthermore, you may have come across the term ‘spurious correlation’.
Such correlations are those that are deemed to have happened just by
chance. That is, the sample of values for each of the two variables turns
out to produce a strong correlation even though the two variables are not
linked. The underlying principle of this is that if you look hard enough
(i.e. compare enough variables), some will show a correlation just by
chance. As you will see in Example 15, this limits how we can interpret
results.

Example 15 Cheese consumption and civil engineering


You might be thinking there should be no connection between
mozzarella cheese eating and civil engineering. Surely, eating lots of
mozzarella cheese does not turn someone into a civil engineer. Nor
does it seem reasonable that becoming a civil engineer is likely to
develop someone’s taste for mozzarella cheese.
However, consider the following real dataset from the US Department
of Agriculture and National Science Foundation, which is given in


Table 2. This shows yearly values for two variables Y1 and Y2


described as:
• Y1 : the per capita consumption of mozzarella cheese, measured in
pounds (lb)
• Y2 : the number of civil engineering doctorates awarded.
Table 2 A dataset from the US Department of Agriculture and National
Science Foundation (Vigen, no date) for yearly values of two variables

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Y1 9.3 9.7 9.7 9.7 9.9 10.2 10.5 11.0 10.6 10.6
Y2 480 501 540 552 547 622 655 701 712 708

It turns out that the correlation between these two variables (cheese
and engineering) is 0.96. A strong positive correlation!
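
The correlation quoted in Example 15 can be checked directly in R using the values in Table 2:

cheese <- c(9.3, 9.7, 9.7, 9.7, 9.9, 10.2, 10.5, 11.0, 10.6, 10.6)
doctorates <- c(480, 501, 540, 552, 547, 622, 655, 701, 712, 708)
cor(cheese, doctorates)   # approximately 0.96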

The reasoning underlying the ‘spurious correlation’ terminology is that


correlations which do not appear to reflect a causal link, either directly
or indirectly, should be dismissed.

Activity 17 Spurious or not?

The correlation between a footballer’s strength and weight in the FIFA 19


dataset is 0.62. Is this correlation likely to be spurious?

You may be surprised to learn that in some big data analyses the
plausibility of correlations is not considered. If two items are correlated
then this is something that can be exploited. It does not matter how
unreasonable this association seems. Some statisticians are concerned and
unhappy about this.
Such an approach brings advantages. In particular, it allows the
opportunity for new relationships to be found and exploited. For example,
loyalty cards allow supermarket chains to collect a wealth of data about
their customers. From such data, the supermarket can seek out non-obvious
associations between the purchases that customers make. This information
can then be used to better target marketing.


4.2 Sample or population


So far in this module, the data you have analysed have been regarded as a
sample from a larger population. (And, we hope, a representative sample
too.) The task is usually then to infer information about the wider
population based on this sample. Examples include: what the population
mean is, whether two population means are equal, what the slope of the
relationship is between two variables (in the population).
As you have already seen, one feature of big data is that the size of the
dataset can be huge. This removes the need to work with a sample simply
to get a dataset that is manageable in size. It is possible to process all of
it. Thus, the sample size, n, is such that ‘n = all’.
When this is the case, much of inferential statistics becomes far less
important. At the extreme end, the data can be thought of being the
entire population rather than just a sample. In this situation, the need for
techniques to estimate population quantities disappears. They can simply
be evaluated from the data, as completing Activity 18 will demonstrate.

Activity 18 Strength of international football players

In Section 3 of Unit 1, we introduced the FIFA 19 dataset about


100 international footballers. One of the measurements given for each
footballer was their strength. (Recall that this variable is given on a scale
between 0 and 100, with larger values given to the stronger footballers.)
(a) In this dataset, the mean strength is 71.07 and the standard deviation
is 5.821. Use these to give an estimate of the population mean
strength of footballers included in FIFA 19.
(b) Why is it helpful to also quote a 95% confidence interval for the
population mean?
(c) As mentioned in the description of the FIFA 19 dataset, the
100 footballers are a subset from a database of more than 18 000
footballers. (Although this database therefore contains a lot of data,
it is small in big data terms.) Based on all of the data in the
database, the mean strength is 65.32 and the standard deviation
is 12.55. What is the population mean strength and why is the 95%
confidence interval no longer required?

However, unless we are sure we have the entire population, having a large
amount of data does not diminish the importance of it being
representative. Increasing a sample size only reduces the sampling error;
it does not address biases due to lack of representativeness. So, we will
consider the representativeness of the FIFA 19 database in the next
activity.


Activity 19 Are the data representative?

In Activity 18, it was stated that data about all the footballers in the
FIFA 19 database are available. Are such data likely to be representative
of all international footballers?

Even when the data are not the entire population, and we are happy that
they are representative of the population we are interested in, statistically
significant results will often be obtained even for effects that are in
practice too small to be important.
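
One way to see why is to look at how the standard error of an estimate behaves as the sample size grows. The quick R calculation below (using the standard deviation of strength from the full FIFA 19 database, quoted in Activity 18, purely for illustration) shows the standard error of a sample mean shrinking as n increases, so that ever smaller effects end up 'statistically significant'.

s <- 12.55                 # standard deviation of strength in the database
n <- c(100, 1000, 18147)   # the three sample sizes used in Activity 20
s / sqrt(n)                # the standard error of the sample mean shrinks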

Activity 20 Back to the football players

In Unit 2, you fitted various models for the strength of international


footballers. One such model that can be fitted is
strength ∼ height + weight + marking + preferredFoot
+ skillMoves.
(a) Fitting such a model to the sample of 100 footballers (the FIFA 19
dataset) used in Units 1 and 2 produces the output given in Table 3.
Based on this output, does it appear that all these variables are
contributing to the model?
Table 3 Coefficients when fitting the model to a sample of 100 footballers

Parameter        Estimate   Standard error   t-value   p-value
Intercept −26.150 17.672 −1.480 0.142
height 0.664 0.270 2.455 0.016
weight 0.271 0.051 5.366 < 0.001
marking 0.085 0.020 4.136 < 0.001
preferredFoot −0.279 1.046 −0.267 0.790
skillMoves −0.110 0.505 −0.217 0.829

(b) A sample of 1000 players is now taken from the FIFA 19 database and
the same model fitted. The output is given in Table 4. What do you
conclude about the variables now?
Table 4 Coefficients when fitting the model to a sample of 1000 footballers

Parameter        Estimate   Standard error   t-value   p-value
Intercept −104.027 8.540 −12.182 < 0.001
height 1.155 0.147 7.830 < 0.001
weight 0.415 0.025 16.772 < 0.001
marking 0.253 0.014 18.710 < 0.001
preferredFoot 1.172 0.623 1.883 0.060
skillMoves 2.175 0.402 5.413 < 0.001


(c) The same model is fitted to all 18 147 footballers in the database. The
output is given in Table 5. What do you conclude about the variables
now?
Table 5 Coefficients when fitting the model to all 18 147 footballers

Parameter        Estimate   Standard error   t-value   p-value
Intercept −103.455 2.148 −48.16 < 0.001
height 1.194 0.037 32.04 < 0.001
weight 0.400 0.006 65.22 < 0.001
marking 0.215 0.003 66.11 < 0.001
preferredFoot 1.004 0.150 6.70 < 0.001
skillMoves 2.667 0.094 28.31 < 0.001

(d) For each of these three models (or, more accurately, the same model
to different data) calculate the 95% confidence interval for the slope
associated with preferredFoot. (Hint: when doing this, use the fact
that as n gets big the t-distribution with n degrees of freedom gets
increasingly close to the standard normal distribution. So,
$t_n(0.975) \simeq z(0.975) \simeq 1.96$ when n is big.)
Does the width of the 95% confidence interval for the slope associated
with preferredFoot get bigger or smaller as the sample size
increases?
In each case, is it plausible that the effect of preferredFoot is to
increase strength by 1 unit? (Remember that the variable
preferredFoot can only take the values 0 or 1.)

So, as you have seen in Activity 20, when more data are available,
variables with small effects are more likely to feature in parsimonious
models. This means that models fitted to big data can be, and are, more
complicated than models fitted to smaller datasets. This leads to the
change in emphasis that is the subject of Subsection 4.3 – prediction
rather than explanation.

4.3 Prediction rather than explanation


In this module, you have been encouraged to interpret your model. That
is, to look over the form of the model and then think about what it implies
about the relationships between variables (particularly the response
variable and the explanatory variables.) However, for many of the
techniques applied to big data, the emphasis is on classification and
prediction. Such techniques are often referred to as machine learning
techniques. Examples include predicting what a customer might like to buy
next, and what the correct response is to an answer given on Jeopardy!.
We have already been doing some prediction and classification in this
module. In Unit 5, we used data to try to predict the number of medals a


country will win at a future Olympics. These data contain information
about how many medals countries have won in the past, so we were
able to form training and test datasets. Techniques such as regression,
where the values of the response variable are known for the data used to
fit the model, are sometimes referred to as supervised learning techniques.
In contrast, Unit B1 was all about classification. There, it was assumed
that there were no observations in the data for which we knew which
cluster they definitely belonged to. Such techniques are sometimes referred
to as unsupervised learning techniques.
So, what difference does having big data make? One answer lies in the
complexity of the models that are fitted.

Activity 21 Cost of living revisited

In Unit 8 (Subsection 1.1), you analysed data from the UK’s 2013 Living
Costs and Food Survey, given again here in Table 6.
Table 6 Counts of households from the UK survey dataset classified by
employment, gender and incomeSource

                              incomeSource
                      earned                other
employment        female    male        female    male
full-time 626 1688 31 95
part-time 235 112 123 66
unemployed 18 16 72 58
inactive 68 78 815 1043
Total 947 1894 1041 1262

Below are listed four different log-linear models that could be fitted to
these data. Which of these models is the most complex? Which is the
most difficult to interpret?
count ∼ incomeSource + employment + gender,
count ∼ incomeSource + employment + gender + employment:gender,
count ∼ incomeSource + employment + gender + employment:gender
+ employment:incomeSource,
count ∼ incomeSource + employment + gender + employment:gender
+ employment:incomeSource + gender:incomeSource.
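
If you want to see what fitting these models involves, the sketch below sets up the counts from Table 6 as a data frame and fits the simplest and the most complex of the four log-linear models using glm() with a Poisson family. (This is a minimal sketch; the data frame name households is chosen here just for illustration.)

# The counts from Table 6 arranged as one row per cell
households <- data.frame(
  employment   = rep(c("full-time", "part-time", "unemployed", "inactive"),
                     times = 4),
  gender       = rep(rep(c("female", "male"), each = 4), times = 2),
  incomeSource = rep(c("earned", "other"), each = 8),
  count        = c(626, 235, 18, 68,      # earned, female
                   1688, 112, 16, 78,     # earned, male
                   31, 123, 72, 815,      # other, female
                   95, 66, 58, 1043)      # other, male
)

# The simplest of the four models: main effects only
m1 <- glm(count ~ incomeSource + employment + gender,
          family = poisson, data = households)

# The most complex: all three two-way interactions
m4 <- glm(count ~ incomeSource + employment + gender
                + employment:gender + employment:incomeSource
                + gender:incomeSource,
          family = poisson, data = households)

summary(m4)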

As you have seen in Subsection 4.2, using big data means that even small
effects end up as statistically significant. This makes it difficult to simplify
models, which in turn makes it more difficult to interpret them. As you
will see in Example 16, the complexity of the model can go beyond just
including more terms in a linear regression or generalised linear model.


Example 16 Neural networks


One type of machine learning uses neural networks (or just ‘neural
nets’ for short). This approach takes as inspiration how neurons are
thought to work in the brain. (This inspiration drives much of the
terminology. For example, variables are often referred to as ‘neurons’.)
Input variables are linked to output variables via layers of artificial or
hidden variables using a structure such as that depicted in Figure 8.
The value of a variable in each layer is assumed to depend on the
value of the variables in the previous layer. (If there are many layers
then this is sometimes referred to as deep learning.) The strength of
each of these associations between variables is something that is
estimated when the model is fitted.

Figure 8 A diagram of a neural network, showing an input layer, hidden layers and an output layer

This approach offers an immensely flexible non-linear model, but not


one that can be easily interpreted. The number of variables involved
and the layers of artificial nodes obscure why the model associates a
particular prediction with any given set of input values.
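
As a small, hedged illustration, the nnet package (one of R's recommended packages, which fits neural networks with a single hidden layer, so a long way short of deep learning) can be used to fit a network to a built-in dataset purely as an example. Predictions are easy to obtain, but little can be read into the fitted weights.

library(nnet)

# A neural network with one hidden layer of 4 'neurons', fitted to the
# built-in iris data purely for illustration
set.seed(1)
fit <- nnet(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
            data = iris, size = 4, trace = FALSE)

head(predict(fit, type = "class"))   # predictions are straightforward ...
summary(fit)                         # ... but the weights are hard to interpret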

The flexibility of neural networks, coupled with big data, means that they
have been applied in a number of different settings, as you will discover in
Activity 22.


Activity 22 Uses of neural nets

Read the blog ‘10 Business Applications of Neural Network (With


Examples!)’ (Mach, 2021) provided on the module website.
What are the ten applications that the author gives?

The size, complexity and flexibility of the models that can be fitted using
big data often mean that a succinct summary of how changes in an
explanatory variable impact on the response variable is not possible.
Indeed, any such interpretation may only be possible by looking for
differences (or lack of differences) in predictions when individual values
are changed.
In many situations, it may not matter how any particular prediction is
arrived at, simply whether a good prediction can be arrived at. However,
as you will discover in Subsections 6.1 and 6.2, understanding how a model
behaves with respect to different sets of inputs is important. It is critical
for ensuring that a model made possible by big data does not lead to
real-world impacts that are detrimental to society.
Furthermore, having a prediction or classification based on big data rather
than small data does not mean that it is certain or inevitable. There
remains the need to include some estimate of the uncertainty associated
with the prediction or classification, for example, by giving a confidence
interval. Otherwise, we have no means of telling the difference between a
reliable prediction and a wild guess!

5 Privacy
So far in this unit, you have seen what is meant by big data and how it is
dealt with. (Admittedly this is in a general way.) You have also seen how
the analysis of big data can be different in character to the analysis of
small data. In this section, you will consider some of the ethics
surrounding collection and storage of big data: consent (Subsection 5.1)
and preserving anonymity (Subsection 5.2). At the heart of this is the idea
of allowing individuals to maintain privacy, should they wish to. As such,
these ideas apply to small data as well as big data.

5.1 Consent
For any data analysis, one of the first considerations should be to check
that the data have been collected in an ethical way. The analysis of data,
particularly data relating to people or animals, that has not been obtained
ethically could be seen as condoning the unethical practices. Within the
research community, projects involving people or animals have to gain
ethical clearance before they are given approval to go ahead. So, as you
will see in Example 17, institutions such as The Open University have to
consider how such ethical clearance can be obtained.


Example 17 OU Research and ethical review


The Open University has a system for ensuring that all its research
meets high ethical standards. At the time of writing, it has three
ethics bodies: the Human Research Ethics Committee, the Animal
Welfare Ethical Review Body and the Ethical Research Review Body.
These bodies will only give approval to research that they think meets
the highest ethical standards. (A link to information about the OU’s
research ethics is provided on the module website.)

Enshrined in the guidelines for ethical research involving people is the


principle of informed consent. That is, consent that is freely given by
a participant when they fully understand what they are consenting to.
One means of providing participants with information about research they are
being asked to take part in is via an information sheet. In Activity 23, you
will examine one such information sheet from a study related to COVID-19
that was running in 2020 and 2021. The aim of this study was to gauge
what proportion of the UK population had detectable antibodies against
COVID-19. This study was running at a time when a vaccination
programme aimed at all adults in the UK was in progress and pandemic
restrictions were still in place.

Activity 23 Informed consent


Read the leaflet ‘COVID-19 in-home antibody testing research study:
Information sheet for participants’ provided on the module website.
(a) Who is the intended reader of the leaflet?
(b) Do participants have to take part?
(c) What information about themselves will a participant be asked
to provide?
(d) Who will have access to the data?

The principle of informed consent is seen as the gold standard for


obtaining and using data about people. However, there are situations in
which it is not possible for such consent to be obtained. In medical
research, such situations include:
• when the participant is a child – they are deemed not to be old enough
to be able to fully understand what they are consenting to
• when the participant does not have the mental capacity, either
temporarily or permanently, to understand what is being asked of them
• when there is not enough time to obtain informed consent – for example,
because the research concerns an emergency procedure.


However, even in these cases it is expected that the participant


understands and agrees as far as they are able to, along with assent given
by the next of kin. Additionally it is expected that, where possible,
informed consent is given as soon as possible afterwards.
When it comes to big data, issues of consent may not be as transparent as
being able to point to informed consent being given. It may, for example,
be deemed to have been given by a customer when agreeing to some
(often lengthy) terms and conditions, such as those you will consider in
Activity 24.

Activity 24 Looking at some terms and conditions

Look at the terms and conditions stated by a social media platform or an


online retailer.
• What sort of data do they directly collect about their users/customers?
• What sort of data (if any) do they collect about their users/customers
from others?
(For example, you may wish to consider a copy of the Privacy Notice for
Amazon.co.uk (2022) provided on the module website, which gives
information about the data they collect.)

Whilst completing Activity 24, you may have come to your own
conclusions about whether the customers or users of the social media
platform or online retailer understand the full range of data they are
consenting to the company collecting about them. More importantly, this
includes whether the users/customers would still be happy with the terms
and conditions if they did understand them.
However, data captured by social media platforms and online retailers can
include data about others. These others may not even know that such data
is being gathered, let alone have consented to it in any way. You will consider the
possibility for such data to be gathered in the next activity.

Activity 25 What about others?

Look at the Privacy Notice for Amazon.co.uk (2022) provided on the


module website. Give two examples where the information they collect
could include information about other people.

So, as the Solution to Activity 25 shows, a person consenting to Amazon’s


terms and conditions might not just be consenting to data being collected
about themselves but also about others. Such inadvertent data collection
is not always seen as harmless, as Example 18 demonstrates.


Example 18 Alexa and children


In June 2019, reports emerged about Amazon being sued over
recordings of children’s voices picked up by Alexa. The case is being
brought in the US courts (Kelion, 2019, and Keller Postman, 2021).

So far in this section, we have just been considering consent, or lack of it,
surrounding the initial collection of data. That is, consent issues
concerning primary data. (Definitions of primary and secondary data were
introduced in Subsection 2.1 of Unit 1.)
Recall that secondary data are data that have already been collected for
another use. However, just because someone else has been willing to
supply data does not mean that the users of secondary data can ignore
issues over consent. They still need to consider whether this secondary
data analysis is reasonable, considering the consent that was given during
the primary data collection. Even when the end goal is clearly desirable,
this does not automatically mean that the transfer of data will be seen as
okay, as Example 19 demonstrates.

Example 19 Detecting kidney injury


Acute kidney injury is a serious illness which requires quick treatment
if complications are to be avoided. In 2016, the Royal Free London
NHS Foundation Trust began a collaboration with Alphabet (the
parent company of Google) to try to develop a system to quickly and
easily diagnose the problem. This collaboration included handing over
patient-level data to Alphabet. (The Royal Free Hospital, which is part of
the Royal Free London NHS Foundation Trust, was founded in 1828 to
provide free healthcare to those who could not afford medical treatment.)
However, in 2017, concerns were raised about this transfer of data,
particularly as it was done without seeking the consent of patients
(some of whom had acute kidney disease but also others who did not).
This led to the Information Commissioner’s Office (ICO) in the UK
looking at the way the Royal Free London was handling patient data,
and poor publicity for the project (for example, it was reported on the
BBC News website (BBC News, 2017)). Following this, the Royal Free
BBC News website (BBC News 2017)). Following this, the Royal Free
London was able to address the shortcomings identified by the
Information Commissioner’s Office (Royal Free London, no date) and
the success of the project to build an app to quickly identify acute
kidney injury was announced in August 2019 (Pym, 2019).


Concerns about the use of data, particularly exhaust data (which was
introduced in Subsection 1.2.4), have led to the introduction of legislation.
For example, the General Data Protection Regulation (GDPR) in the UK
and in the EU. As Box 7 details, this legislation sets out a number of
principles that should be applied to the processing of personal data.

Box 7 Seven principles of the General Data Protection


Regulation in the UK
In the UK, the General Data Protection Regulation (GDPR) sets out
the following seven principles which should apply to the processing of
personal data (The National Archives, 2020):
• lawfulness, fairness and transparency
• purpose limitation
• data minimisation
• accuracy
• storage limitation
• integrity and confidentiality (security)
• accountability.

So, how do the principles set out in Box 7 help? First, let us consider
transparency. In guidance about the legislation (ICO, no date) the ICO
explains this as ‘being clear, open and honest with people from the start
about who you are, and how and why you use their personal data’. This
means that even people whose data are included as part of a secondary
dataset have a chance to assert rights such as the right to object.
Another of the principles, that of ‘purpose limitation’, the ICO explains as
meaning that personal data should be ‘collected for specified, explicit and
legitimate purposes’ and crucially ‘not further processed in a manner
incompatible with these purposes’. So, this regulation prohibits the
analysis of personal data by an organisation for whatever purpose they
choose just because they happen to have the data. Instead, the purpose
has to be in line with why the data were obtained, or fresh consent has to
be obtained – or a clear obligation needs to be demonstrated (which, of
course, includes a legal obligation).


5.2 Anonymisation
In Subsection 5.1, you saw that when data relates to individuals it is
important to obtain informed consent when that is feasible. More
generally, you saw that the data should only be used in a way that
is compatible with that consent.
Often, that consent comes with provisos over who can access data that
allows the individual to be identified. These people with access often
form a very restricted list and are given such access for specific reasons.
For example, for the COVID-19 study you considered in Activity 23, it is
clearly stated in the participant information sheet that only the research
team will have access to all of the data.
However, always keeping data restricted to small groups of people brings
its own disadvantages. For a start, it means that others are denied the
chance to scrutinise any results obtained from it. It also means that
data cannot be put to uses that bring benefits beyond the primary
purpose. For example, in this module you have been making use of data
that has been made publicly available. So, if nothing else, the use of these
data is enhancing the teaching of data science and increasing the number
of people skilled in making sense of data.
The competing imperatives of maintaining confidentiality and making the
best use of data are usually resolved by anonymising the data. That is,
removing sufficient information from the data to make it very unlikely that
individuals could be identified. Note that this requirement about not
being able to identify individuals applies to all the individuals in the
dataset. The anonymisation will have failed if even just one individual
can be too easily identified.
On the face of it, anonymisation may seem easily achievable by just
removing names from a dataset. However, it is not only names that can
lead to individuals being identified from a dataset. Other information that
is commonly known about people, such as their addresses, can also cause
a problem. In Activity 26, you will consider the variables in the OU
students dataset that relate to student location.

Activity 26 Identifying OU students

Recall that, in Unit 4, you began fitting regression models to data about
some OU students. As stated in the description of the data, care was taken
with these data to ensure that they are anonymised.
Look back at the data description given in Subsection 5.3 of Unit 4. Which
variables include information about where each student was located?
Based on the information given from these variables, explain why the
location of a student cannot be precisely deduced.


As you have seen in Activity 26, the data in the OU students dataset only
provides crude information about where a student is based. In the case of
region, all of the categories correspond to geographical areas of large
numbers of people. For imd, the categorisations ‘most’, ‘middle’ and ‘least’
also correspond to large areas of the UK and the category ‘other’ to
everywhere else in the world. So, each combination of region and imd
translates to a large number of people in the population. So, knowing the
value of region and imd for someone in the dataset means that there
are still very many people this student could be.
One of the other variables given in the dataset is age. In the next activity,
you will consider whether this can be used to identify students instead.

Activity 27 Age of OU students

Figure 9 is a histogram of the ages given in the OU students dataset.


Given this histogram, explain why it seems unlikely that students could be
identified on the basis of age.

Figure 9 Students’ ages in the OU students dataset (a histogram of frequency against age in years)

In Activity 27, you have seen that it does seem unlikely that students
could be identified on the basis of age. Unfortunately, this by itself does
not mean that we can be satisfied that confidentiality has been maintained
just yet. With the advent of the internet, it is difficult for individuals not


to leave a digital trace of themselves. These traces can be picked up via


searches on the internet. We need confidentiality to be maintained taking
into account this extra information which is publicly available. You will
consider this with respect to the students in the OU students dataset in
the next activity.

Activity 28 Looking for older OU students

The ability of students in their 70s, 80s and even older to pass OU
modules and earn degrees is something to be celebrated. When Clifford
Dadson graduated, he became the oldest person to do so from the OU.
Spend five minutes looking online for other information about Clifford.
Could he be the oldest student in the OU students dataset?

In Activity 28, you have seen that finding out extra information about the
oldest person to graduate from the OU is relatively easy. You may feel
that this is only possible with people achieving something exceptional such
as Clifford Dadson. However, the growth of social media means that trivial
and not-so-trivial information is available about people, often posted by
themselves. This might include information such as name, age and – in the
case of OU students – which modules they are studying and when. To
avoid this information being used to identify someone in the OU students
dataset, recall an extra protection is built in with respect to age – a
random amount between −2 and +2 has been added to each age. Thus, for
any individual represented in the dataset, we only know their age within a
five-year age band.
skydiving for charity.
Deliberately adding extra variation to data might seem at first glance to
be something that would make the data unusable. The key thing is that it
needs to be done in such a way that it does not introduce any biases. For a
start, this means that changes to individual values are randomly
generated. Then, the only impact of the adjustment is that parameters
will not be estimated as precisely as they would be from the original
dataset; this is a cost of improving the anonymisation
applied to the data.
In the OU students dataset, extra variation was added to the variable age
by the holders of the data (the OU) to improve the anonymisation; the true
age (or at least the age that was given) is known by the holders of the data.
It is also possible to obtain usable anonymised data by asking individuals
to apply some randomisation when they give data. Such an approach is
known as randomised response (Warner, 1965). This then places the
randomisation in the hands of the respondents instead of the researchers.
For example, this could be done by employing a scheme such as the one
given in Box 8.


Box 8 Responding using randomisation


To answer a question using randomisation, the respondent could toss
a coin and then do the following.
• If the coin toss is ‘heads’, answer ‘yes’.
• If the coin toss is ‘tails’, answer truthfully.

Randomised response can be particularly useful when the collectors of the


data ask individuals to supply information about themselves that could be
particularly embarrassing or damaging if revealed. Even if the collectors do
not intend to ever release such compromising information about
such as via a court order.
such as via a court order.
For instance, randomised response is used when researchers want to
measure engagement with activities that are illegal, such as drug use.
(Just because something is illegal doesn’t mean that nobody does it.)
Researching how prevalent illegal activities are, and what factors might
lead people to engage in them, can help formulate effective prevention and
harm-reduction strategies, as illustrated in Examples 20 and 21.

Example 20 Investigating consumption of wildlife


In a paper published in 2021, some researchers were interested in
estimating the consumption of wildlife by people living in the
Brazilian Amazon, and the factors that were associated with increased
consumption (Chaves et al., 2021). This included meats that were
illegal. Being able to collect such data is important. The estimation of
how much consumption of illegal meat is occurring, and what is
associated with it, can help with the design of effective conservation
interventions.

Example 21 Plagiarism and essay mills


As you will be aware when you have submitted TMAs, the work you
submit needs to be your own, and not copied from others without
acknowledgement. You are required to make a declaration about this
when you submit your assignments. Despite this, material available
online from ‘essay mills’ (companies offering to provide examples of
essays) suggests that a few students have been tempted to cheat.
To assess the scale of the problem, the OU could conduct a survey of
students asking about the use of essay mills. However, the challenge


would then be how to ensure that students who indicate that they
have used them are not placed in a compromised position.
(Note: guidance about what plagiarism is and how to avoid it is
provided on the module website.)

As you have seen in Example 21, investigating plagiarism is something


that researchers might be interested in doing. In Activity 29, you will
consider how using randomised response can help.

Activity 29 Asking about cheating

Suppose a survey of OU students asked them to use the procedure


described in Box 8 to answer the following question.
‘Have you ever submitted someone else’s work as your own?’
If a student responds with ‘yes’, does this mean that they are admitting to
having submitted someone else’s work as their own?

As Activity 29 has shown, it is possible to ask questions about sensitive


activities in such a way that an individual does not place themselves in a
compromised position. However, this is only of some use if the resulting
data still allows the true proportion to be estimated. The scheme given in
Box 8 is such that we can, as you will explore in Activity 30.

Activity 30 Estimating the prevalence of cheating

Consider again the following question to which students provide their


answer according to the scheme in Box 8.
‘Have you ever submitted someone else’s work as your own?’
(a) Suppose that no OU student has ever submitted someone else’s work
as their own. What percentage of ‘yes’ replies would you expect?
(b) Now suppose that 1% of OU students have ever submitted someone
else’s work as their own. What percentage of ‘yes’ replies would you
expect?
(c) Now suppose that p% of OU students have ever submitted someone
else’s work as their own. What percentage of ‘yes’ replies would you
expect?
(d) Why does your answer to part (c) mean it is possible to estimate the
percentage of OU students who have ever submitted someone else's
work as their own from the percentage of ‘yes’ replies recorded?


Thus, schemes such as that described in Box 8 allow researchers to obtain


useful responses in such a way that they cannot be sure what the true
response is for any given individual.
You may at this point be wondering why all surveys are not set up using
schemes such as that given in Box 8. One answer is that there is a
trade-off when steps are taken to reduce the risks that individuals can be
identified. An inevitable cost is that it increases variability in the data.
Parameters cannot be estimated as precisely. Or, equivalently, more data
have to be gathered to estimate the parameters with the same precision.
For example, using the scheme described in Box 8, effectively only about
50% of the responses will be informative about the question, rather than
all of the responses if the question is asked directly.
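To make this concrete, here is a minimal R sketch of the Box 8 coin scheme.
The sample size and the true prevalence below are invented purely for
illustration, and the estimator simply rearranges the fact that, under the
scheme, a 'yes' arises either from the coin or from a truthful 'yes'.

## A minimal simulation of the Box 8 scheme (all values hypothetical)
set.seed(1)
n <- 10000                             # hypothetical number of respondents
p_true <- 0.03                         # hypothetical proportion whose honest answer is 'yes'
truth <- rbinom(n, 1, p_true)          # 1 = honest answer would be 'yes'
heads <- rbinom(n, 1, 0.5)             # each respondent's coin toss
reply <- ifelse(heads == 1, 1, truth)  # 'heads' -> answer 'yes', 'tails' -> answer truthfully

## Under the scheme, P('yes') = 0.5 + 0.5 * p_true, so rearranging gives an estimate:
p_hat <- 2 * (mean(reply) - 0.5)
p_hat
mean(truth)                            # for comparison: the estimate if asked directly

Re-running this simulation shows that p_hat varies from sample to sample
rather more than the direct estimate does, which is the loss of precision
described above.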
Thus, a balance needs to be struck between the risks of anonymity being
broken (and/or people refusing to provide any response) and the precision
with which parameters can be estimated. A scheme such as that described
in Box 8 may only be worth it if a question is asking something sensitive.
You will consider what such sensitive questions might be in Activity 31.

Activity 31 When might randomised response be worth it?

Take a few minutes to think about the type of important questions that
are worth researching but might benefit from a randomised response. Use
the M348 forums on the module website to share your thoughts.

6 Fairness
The previous section, Section 5, dealt with ethical issues surrounding the
collection and storage of data, big and small alike. In this section, we will
discuss ethical issues surrounding the analysis of data. This is important
because an analysis being possible does not mean it should be done.
Before reading and working through this section, please note that the
material discusses some forms of injustice. This content may evoke
powerful emotions in some people. If you think that you may need to
discuss your thoughts and/or emotions on the topics presented in this
section, or elsewhere in this unit, please contact the Open University
Report and Support service or the Student Support Team (SST) – see the
module website for links to these.
In Subsection 6.1, you will see how it is possible for data analysis to
exacerbate inequality despite that not being the intention. You will also
see in Subsection 6.2 that, without care, feedback loops can be introduced
so that inadequacies of a model could get worse over time, not better.
As in Section 5, these ideas apply to the analysis of small data as well as
big data. However, the reach of big data is such that analyses have the
potential to impact on daily lives, including critical areas such as health,


crime, justice and insurance. Unfortunately, as you will see, this has led
to high-profile cases of big data analyses going wrong.

6.1 Inequality
One aim of analysing big data is to build predictive models that can bring
objectivity to decision-making. So, instead of relying on a person, or group
of people, to predict outcomes with all the unconscious bias they can’t
help but have, the predictions are based on actual data. This aspiration is
particularly important with respect to predictions used to make decisions
that have a big impact on people’s lives. Examples include: if someone
should be deemed a good enough risk to grant them a loan for a house,
car, etc.; what the most appropriate medical diagnosis is, given a set of
symptoms; how likely it is that someone has committed a crime.
With big data, it is possible to make finer-grained predictions: for
example, whether somebody (and people like them) will repay a loan on
time, or suffer a heart attack in the next 5 years. This brings benefits, but
also dangers. Should a decision about giving somebody a loan depend on
factors over which they have little or no control? And who should have
access to their health records anyway? Legislation such as the Equality
Act and GDPR (see Box 7 in Subsection 5.1) exists to prevent
discrimination and data misuse. At the heart of such legislation is the
notion of protected characteristics – characteristics about people that
are covered by the legislation. There is no universal list of protected
characteristics, as you can see in Box 9, though there can be much overlap.
But, as we will see in the rest of this subsection, avoiding discrimination is
not always straightforward.

Box 9 Protected characteristics


Equality legislation in the UK, and elsewhere, refers to ‘protected
characteristics’ – groups which are covered by the legislation. The
exact list depends on which country you are considering.
In England, Scotland and Wales, the Equality Act 2010 defines the
following as protected characteristics (The National Archives, 2013):
• age
• disability
• gender reassignment
• marriage or civil partnership
• race
• religion or belief
• pregnancy and maternity
• sex
• sexual orientation.


The EU, in its Charter of Fundamental Rights, similarly lists as


protected characteristics: sex, race, colour, ethnic or social origin,
genetic features, language, religion or belief, political or any other
opinion, membership of a national minority, property, birth, disability,
age and sexual orientation (EU-FRA, 2021).

We start, in Example 22, by considering a problem that arose with respect


to facial recognition.

Example 22 Errors in facial recognition


Facial recognition is increasingly being used to identify people: for
example, to unlock smartphones and to pass through electronic passport
gates at international borders.
In 2018, a couple of researchers published a paper focusing on the
performance of three commercial facial recognition algorithms,
produced by Microsoft, IBM and Face++. In this study, just one
aspect of the algorithm was examined – the accuracy with which the
algorithm could deduce whether a face was that of a man or a woman
(Buolamwini and Gebru, 2018).
Whilst they found that, for the sample of faces they tested, overall the
algorithms got the right classification 88% to 94% of the time, this hid
great disparities. For men with lighter skin types, the algorithm got
the right answer at least 99% of the time. However, for women with
darker skin types this dropped to only 65% to 80%. The algorithms were
also, to a lesser extent, less accurate for women's faces than for men's
faces, and less accurate for those with darker skins than for those with
lighter skins.
Such disparities in accuracy could be important and lead to
disadvantages for those with darker skins and/or for women. For
example, it could lead to such individuals becoming more likely to be
erroneously picked out, should the algorithm contribute to systems
designed to pick up individuals of interest in crowds.

At the heart of this system was data analysis and, presumably, there was
not an intention to discriminate. So, what went wrong?
In this case, the blame has been laid on the selection of the data used to
build the facial recognition system. The data used consisted of
predominantly lighter faces, with few darker faces. You will explore why
this matters in the next couple of activities.


Activity 32 Effect of dataset size on predictions

Suppose a model to predict whether someone will repay a loan is required


– and that this prediction is based on their income (only).
(a) First suppose that a logistic regression model with just a linear term
for income reflects in the population the probability that someone will
repay the loan. In this case, which model is more likely to produce
better predictions: a model estimated using data from 100 people or a
model using data from 100 000 people? Why?
(b) Now suppose that in the population, the relationship between income
and the probability of repaying the loan has a complicated and
unknown shape. In which situation are you more likely to get a model
that reflects this shape: using data from 100 people or using data
from 100 000 people? Why?

Activity 33 Effect of relative group sizes on predictions

In Activity 32, it was assumed that in the population the only thing that
influences the probability of repaying a loan is a person’s income. Now
suppose that for a subgroup of the population the relationship between the
probability of repaying the loan is different. Furthermore, suppose that in
the data used to build the predictive model only 1% of the data
corresponds to people in this subgroup.
(a) If the predictive model is built ignoring whether someone belongs to
this subgroup, is the model likely to give predictions for people in the
subgroup as good as for everyone else? Why or why not?
(b) Suppose a separate model is fitted for people in this subgroup. Is this
model likely to give predictions for people in the subgroup as good as
the one for everyone else? Why or why not?

So, as Activities 32 and 33 have demonstrated, representation within


datasets used to build predictive models matters. If a group is not well
represented, the resulting predictive model is likely to be less appropriate
for them.
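The following small R simulation sketches this issue, echoing the
loan-repayment setting of Activities 32 and 33; the sample size, the 1%
subgroup share and all the coefficients are invented for illustration only.
A single logistic regression is fitted ignoring group membership, and its
classification error is then compared between the subgroup and everyone else.

## A toy illustration of poor subgroup representation (all numbers hypothetical)
set.seed(42)
n <- 10000
minority <- rbinom(n, 1, 0.01)           # about 1% of people belong to the subgroup
income <- rnorm(n, mean = 30, sd = 8)    # hypothetical income, in thousands

## In this made-up population, the income-repayment relationship differs by group.
eta <- ifelse(minority == 1, 2 - 0.05 * income, -3 + 0.12 * income)
repaid <- rbinom(n, 1, plogis(eta))

fit <- glm(repaid ~ income, family = binomial)   # model ignoring group membership
pred <- ifelse(fitted(fit) > 0.5, 1, 0)

## Misclassification rate within each group (1 = subgroup):
tapply(pred != repaid, minority, mean)

In runs of this simulation, the error rate for the subgroup is typically
noticeably higher than for everyone else, because the fitted model is
dominated by the majority relationship.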
Notice that, in Activity 33, an assumption was made that the relationship
between income and the probability that a loan would be repaid was
different for a subgroup of the population. This raises the thorny issue of
what is the right thing to do when this subgroup is a group with one or
more specific categories of protected characteristics. As you will see in
Example 23, this comes up in relation to setting insurance premiums.


Example 23 Insurance pricing


In the UK and in many other places around the world, drivers are
expected to be insured whilst driving their cars. This insurance is
provided by private companies. The private companies have to decide
what price they are going to charge individual drivers. If the prices
are generally too low, the company is likely to end up paying out
more in claims than they receive in premiums. However, if the prices
are generally too high, the company is likely to lose business as drivers
obtain their insurance from other providers.
Before the ability to process lots of information about drivers, these
premiums only depended on a few factors. It is now possible to make
the premium much more tailored to the individual, not only taking
into account the type of car (including the colour) but also
information about how the driver actually drives (and not just how
they say they drive).
Prior to 2012, in the EU it was possible for these premiums to depend
on the gender of the driver. For example, insurers could offer lower
premiums to women than to men if data suggested they were less
likely to be involved in accidents and hence less likely to make a claim
on their insurance. However, a ruling from the Court of Justice of the
European Union prevented such differences in premiums from
21 December 2012 as a form of discrimination based on sex (EU-CJ,
2011).

A means to achieve equitable treatment is to avoid including any protected


characteristics when building the model. That way, one hopes that
predictions produced using the model should not treat groups differently
according to where they fall in each protected characteristic. In the case of
the insurance example, applications for car insurance do not generally
request information about race, nor should they. However, as you will see
in Example 24, this may be an ineffective strategy.

Example 24 More on insurance pricing


In early 2018, The Sun newspaper carried out an investigation into
the pricing of car insurance (Leo, 2018). They found that by just
changing the name from John Smith to Mohammed Ali, the price
increased, in some cases by a considerable amount. This suggests the
algorithms that the companies used were setting different premiums
on the basis of the racial group someone belonged to.


So, how could such a model come up with pricing that seems to depend on
race? You will begin to explore this in Activity 34.

Activity 34 Inferring ethnicity

Take a look again at Example 24. What information about the applicant
did The Sun change when doing their investigation? Is this a protected
characteristic in the UK?

So, as you have seen in Activity 34, removing a variable such as sex and
race (and gender and ethnicity too) is not always sufficient to remove all
information about protected characteristics in a dataset. Such information
could be inferred from other variables that by themselves are not covered
by equality legislation. That is, other variables can act as proxies for
protected characteristics. This might be someone’s name, but also where
they live or, less obviously, something seemingly innocuous such as the
products that they buy. These variables could include variables that you
might not want to drop from the modelling.
Preventing a model from using other variables as proxies for protected
characteristics and thereby implicitly using them to improve the model is
difficult. If using a protected characteristic would improve the predictive
accuracy of the model, then the search for the best model is likely to end
up trying to include this information via proxies.
In one sense, this issue is not different from modelling discussed in Units 1
to 8. For example, forward and backward stepwise regression only consider
the impact of including a variable on the fit of the model. Value
judgements about what variables mean, either by themselves, or in
combination, do not come into it at this point. However, the difference
comes with the complexity of the models. For the simpler models of Units 1
to 8, interpreting the model makes it possible to examine why it gives the
predictions that it does. This, in turn, makes it possible to detect when protected
characteristics are inadvertently being used.
The complexity of models used with big data, such as neural nets, makes it
very difficult, if not impossible, to figure out why a model gives the
predictions that it does. This means that situations where protected
characteristics are inadvertently used to improve predictions can remain
hidden. Detection might only arise after comparing predictions for similar
sets of inputs, such as what was done in Example 24.
The final example we give in this subsection relates to a health app.


Example 25 Diagnosing heart attacks


In Autumn 2019, a health app hit the headlines when it was accused
of sexism (for example, Das, 2019). The app, Babylon, is designed to
act as a substitute General Practitioner (GP) by diagnosing illness.
However, it emerged that for a set of symptoms, just changing the
gender of the patient (and nothing else) substantially changed
whether a heart attack or panic attack was given as the most likely
cause. A worrying consequence of this is that it changed the medical
advice offered to the user from going immediately to an Accident &
Emergency department (A&E) to waiting a few hours to see what
happens.

One issue this example highlights is whether the app should give the same
diagnosis regardless of gender. Arguably, gender equity is better served by
an app that is equally accurate for people of any gender, though going
down this route raises issues about how best to measure accuracy. In
Box 9 of Unit 5 (Subsection 4.3.1), a couple of measures for continuous
data were introduced: MSE and MAPE. For categorical outcomes, such as
‘heart attack’ or ‘panic attack’, the misclassification rate is often used.
That is, the percentage of outcomes for which the model came up with the
wrong diagnosis. However, any calculation also needs to factor in the cost
of making the wrong diagnosis. In the next activity, you will consider what
this cost might be in the case of heart attack versus panic attack.
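Before that, here is a sketch in R of how such a calculation might look once
costs have been decided; the confusion matrix and the costs below are
entirely made up, and are only meant to show how unequal costs change the
picture relative to the raw misclassification rate.

## Hypothetical confusion matrix: rows = true condition, columns = predicted condition
conf <- matrix(c(40, 10,
                 5, 45),
               nrow = 2, byrow = TRUE,
               dimnames = list(true = c("heart", "panic"),
                               predicted = c("heart", "panic")))
misclass_rate <- 1 - sum(diag(conf)) / sum(conf)
misclass_rate                      # every error counted equally

## Hypothetical, unequal costs for the two types of error
costs <- matrix(c(0, 100,
                  1, 0),
                nrow = 2, byrow = TRUE)
mean_cost <- sum(conf * costs) / sum(conf)
mean_cost                          # errors weighted by their assumed cost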

Activity 35 What if the prediction is wrong?


Suppose, for a set of symptoms, there are only two possible diagnoses:
heart attack or panic attack. Furthermore, suppose the medical advice
that an app offers is as follows.
• Heart attack: go straight to A&E.
• Panic attack: wait 4 hours to see if symptoms disappear.
(a) Suggest a cost that would be incurred if the app misdiagnoses a panic
attack as a heart attack. (Note these costs are not necessarily
monetary.)
(b) Suggest a cost that would be incurred if the app misdiagnoses a heart
attack as a panic attack.
(c) Given your answers to parts (a) and (b), does this mean that each
type of misdiagnosis is equally bad?

There is also the question of whether the app is reflecting real differences
in the rates of heart attack versus panic attack in men and women. As the


developers of Babylon point out, the app is built on historical data and
symptoms. Thus, it is reflecting a difference that is there in the data.
However, there is also a danger that the app is reflecting a historical bias
in the diagnosis of heart attacks. That is, whether historically heart
attacks in women have been more likely to be missed by doctors (which is
a real concern – expressed, for instance, by the British Heart Foundation,
2019). This could be because the doctors themselves thought of heart
attacks as something that more affects men and hence were more likely to
associate the same symptoms in women with something else (such as a
panic attack). Or this could be because the ‘typical’ symptoms of heart
attack might occur less in women who are having a heart attack.
This leads to another issue about fairness, which we will consider in
Subsection 6.2: whether historical biases in data will diminish over time as
the app is used and more data are gathered.

6.2 Feedback loops


One aspect of predictive models based on big data is that they can evolve.
That is, as they are used, more data are gathered and incorporated into
the model. It is to be hoped that this will lead to the model becoming
better over time. Unfortunately, this is not guaranteed as feedback loops
might be created. That is, the results from the predictive model could lead
to more biases in the data collection, not less. Thus, these loops can work
to heighten inequalities. Example 26 shows a potential feedback loop.

Example 26 Feedback loop in diagnosing heart attacks


Recall in Example 25 it was stated that the Babylon app was more
likely to suggest that a particular set of symptoms indicates a heart
attack in men than in women. Unfortunately, this has the potential to
set up a feedback loop, as shown in Figure 10.

Figure 10 A feedback loop that could be set up by the Babylon app: the app
is more likely to suggest heart attack to men, so men are more likely to
seek further medical attention, so men are more likely to have a heart
attack confirmed, so more data from men link symptoms with heart attack,
which feeds back into the app's suggestions.


In Example 26, you have seen how a feedback loop could be set up. If this
happens, it means that over time an inequality will not diminish but
instead it is likely to just grow and grow.

Example 27 PredPol
For some while now, it has been recognised that the location of many
crimes is not completely at random. Some areas, at some times, are
unfortunately more likely to experience crime than others. However,
this means that by analysing where, when and what crime has already
happened, predictions can be made as to where crime is more likely to
happen next. This knowledge allows police forces to act in a proactive
way, directing resources to the predicted hotspots. This is the idea
underpinning PredPol, a predictive model based on big data (PredPol,
2020).

It is to be hoped that the objectivity of PredPol leads to policing being


done in a fairer way. Unfortunately, it is another situation where a
feedback loop can easily be set up. You will consider how in Activity 36.

Activity 36 Potential feedback loop in a system like


PredPol
Suppose in an area there are two districts, labelled A and B. Further
suppose based on historical data, and subjective opinion, it is thought that
district A is the higher-crime area and that district B is the lower-crime
area.
(a) Based on the historical data, and subjective opinion, on which district
does it make sense for the police to focus?
(b) Suppose that some crimes (for example, the low-level ones) are more
likely to be recorded only if a police officer happens to be around.
Given your answer to part (a), in which district are more crimes likely
to be detected?
(c) Is your answer to part (b) likely to reinforce, or contradict, the view
that district A is the higher-crime area?

The labelling of some areas as higher-crime simply because crime was more
likely to be observed there, as Activity 36 shows can happen, is bad
enough. However, there is also the danger that it plays into other
prejudices as different districts almost inevitably have different
demographic mixes. For example, if the districts initially targeted happen
to be poorer areas, then the idea that more crime occurs in poor areas
could get reinforced. (Rather than crime being more likely to be reported


in poor areas.) Or even worse, a specific race might be thought to be more


likely to commit crimes if it happens that these districts are occupied by a
majority of that race.
In both Example 26 and Activity 36, notice that the problem is that extra
data is largely gathered from one side of the prediction (that it’s a heart
attack, that this is a higher-crime district). This makes it difficult for
predictions going the other way to be checked. For example, in a district
predicted to have low crime, maybe the same amount of low-level crime is
going on there too; it's just that there are not as many police
officers around to notice it.
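The following toy R simulation sketches this mechanism; every number in it
is hypothetical. Two districts have exactly the same underlying crime rate,
but patrols are sent to whichever district has more recorded crime, and
crimes are more likely to be recorded where the patrols are.

## A toy feedback-loop simulation (all values hypothetical)
set.seed(7)
true_rate <- c(A = 20, B = 20)   # same expected number of crimes per week in each district
recorded  <- c(A = 1, B = 0)     # initial belief: A is the 'higher-crime' district

for (week in 1:52) {
  patrolled <- names(which.max(recorded))   # patrol the district with more recorded crime
  occurred  <- rpois(2, true_rate)          # crimes that actually happen this week
  p_detect  <- ifelse(names(true_rate) == patrolled, 0.6, 0.1)  # recording is likelier where patrols are
  recorded  <- recorded + rbinom(2, size = occurred, prob = p_detect)
}
recorded   # district A ends up with far more recorded crime, despite identical true rates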
Now, in the examples described in this section, it is worth noting that
humans without access to big data would also inevitably make biased
decisions. The issue comes with assuming that use of big data will
automatically lead to results that are objective and unbiased. This is far
from being the case.
In Activity 37, you will get a chance to have your say and think about some
other examples that might involve ethical issues with small or big data.

Activity 37 Other examples involving ethical issues

Have you ever worked on or read about a situation dealing with a dataset
or data analysis where there might be any potential ethical issues? Based
on your study of Sections 5 and 6, clearly determine the potential issues
and give your suggestions on how you could avoid them in practice. Share
your reasoning on the M348 forums.

7 Guidelines for good practice


In Sections 5 and 6, we have concentrated on ethical issues associated with
big data. In this section, we detail some guidelines for good practice aimed
at ensuring that such data are handled with integrity and for the public
good.
At the time of writing this unit, ‘data science’ was still a very new
discipline, so guidance in this area is still developing. The guide we will
focus on in this unit, detailed in Box 10, is acknowledged by its authors to
be a first attempt at providing (non-mandatory) guidance to practitioners.
So this guidance may have been updated by the time you are studying this
module.


Box 10 A guide for data science practitioners


In ‘A guide for ethical data science’, produced by the Institute and
Faculty of Actuaries along with the Royal Statistical Society (2019),
the following five themes are listed as being relevant to practitioners
of data science:
• seek to enhance the value of data science for society
• avoid harm
• apply and maintain professional competence
• seek to preserve or increase trustworthiness
• maintain accountability and oversight.

Activity 38 Exploring the themes

Open ‘A guide for ethical data science’ provided on the module website,
and read the descriptions of the themes given in Section 3.2 of the
Introduction.
(a) Which groups of people are mentioned in the descriptions?
(b) To what extent are the considerations guided by legal and regulatory
requirements?
(c) How much do the data analysis techniques feature in the descriptions?

Having generic statements, such as those contained in the summary of the


themes that you read whilst completing Activity 38, is all very well.
However, data scientists need to be able to implement them in practice on
any project they are involved in. In the next activity, you will consider a
checklist provided at the end of ‘A guide for ethical data science’ that is
designed to help data scientists with this.

Activity 39 A checklist for ethical data science

Now look at the checklist given at the end of ‘A guide for ethical data
science’ on the module website. Which of the suggested actions address
issues surrounding consent, anonymisation, inequality and feedback loops
that we have considered in Sections 5 and 6?

Notice that only one of the categories in this guide relates directly to the
analysis of data and the building of statistical models. This stresses that,
in practice, there is so much more to data science than fitting statistical
models. The management of the data also plays a vital part. As you saw
in Unit 5, preparing data ready for analysis is a non-trivial task.


The ethical checklist highlights the need to be sure that the data have
been ethically sourced, and privacy safeguarded. It also highlights the
importance of communication skills in data scientists, in particular, the
ability to communicate technical issues in a non-technical way.
To round off this section, unit and strand, we end with a reminder of the
benefits that data science can bring to the world. On 11 March 2020 the
World Health Organization (WHO) declared COVID-19 a pandemic
following the spread of this newly identified disease from its first known
source in Wuhan, China. During this pandemic, data scientists have
worked hard to provide an evidence base for strategies to mitigate spread,
to understand more about the disease, and to track how public behaviour
changed as a result.
Whilst data about the pandemic soon started accumulating, making sense
of some of it has been challenging. For example, comparisons of case rates,
either over time or between nations, depended on factors such as the
availability of testing, the type of testing, and reporting rates of the results
– all of which impacted on the numbers of recorded cases. Nevertheless,
data scientists used their skills to help. For example, The Alan Turing
Institute in the UK lists the following as some of the key projects it was
engaged with in response to the pandemic (The Alan Turing Institute,
2021).
• Project Odysseus. A project where they monitored activity on London’s
streets, so aiding infrastructure to be reconfigured to make it easier for
social distancing to be observed.
• DECOVID. A project using anonymised patient data to help improve
treatment plans for COVID-19 patients.
• Rapid Assistance in Modelling the Pandemic initiative. Creating a
model of individuals’ movements around towns and cities so that the
impact of different lockdown strategies could be tested.
• Modelling of positive COVID-19 test counts to provide up-to-date
numbers despite a slight lag in the processing of tests.
• Improving the NHS COVID-19 app to more accurately predict the risk
that a user has been in contact with a COVID-19-positive person.
Finally, remember that big data is a relatively new area for data scientists
and statisticians. Who knows what exciting developments are just around
the corner – or have already happened in the time between this unit being
written (early 2022) and when you are reading these words!


Summary
In this unit, you have been learning about big data, in particular some of
the challenges it brings.
There is not an agreed definition of when data becomes big data. One
definition is that big data are data that cause significant processing,
management, analytical and interpretational problems. These problems
might be because the data possess one, or more, of the three V’s: volume,
variety and velocity.
With big data processing, problems are generally overcome by using
distributed computing. That is, splitting the data storage and
computation across a number of processors. The use of generic algorithms,
such as MapReduce, helps structure computations to take advantage of
distributed computing, though they are easier to apply to some statistical
computations (such as computing a mean) than others (such as computing
a median).
Data analysis via a computer requires algorithms to be implemented.
There might be different algorithms designed to achieve the same task, but
even if there are, how the computation time depends on the number of
data points (for example using big O notation) might vary. Such
differences become important with the size of big data datasets.
The time taken to implement the algorithms is not the only problem. It
may not be possible to obtain exactly the right answer. Even with simple
calculations, some rounding error is often inevitable. Or it may not be
guaranteed that the algorithm will deliver the best answer every time. For
example, that a global maximum will always be obtained, rather than just
a local maximum. In other situations, it may not even be clear which of
two answers is better. In such cases, it is important to document which
algorithm has been used, for example whether clustering was done via
k-means, DBScan or some other algorithm.
The output from big data is sometimes different from that of small data.
Correlations might be exploited whether or not they are spurious.
Furthermore, large amounts of data mean that more complicated models
are fitted. This makes interpretation harder, and hence may not be done.
Also, some of the uses that big data have been put to have focused on
prediction rather than interpretation.
In this unit, you have also been considering ethical issues surrounding the
use of big data, and small data too. Informed consent is an important
principle when it comes to personal data. That is, people freely agree,
having fully understood what they are being asked to do. Furthermore,
legislation such as GDPR covers what personal data can be stored and
what can be done with it. When data are to be shared, it is often only
after it has been anonymised. Care has to be taken with anonymisation to
make sure that individuals cannot be identified again afterwards. By using
randomised response approaches, it is possible to collect usable data in such a


way that respondents do not reveal compromising information.


One hope with big data is that it leads to data-driven and hence objective
decision-making, at least more objective than people could manage.
However, as you have seen in this unit, just using lots of data does not
guarantee this. If the data do not include enough information about
subgroups, they could be disadvantaged. If results are not meant to
depend on a particular protected characteristic, other innocuous variables
could act as proxies. Furthermore, historical biases in the data may get
built into the models and, if feedback loops are set up, might not diminish
over time. To ensure that these types of ethical issues are considered by
data scientists, guidelines for good practice have been produced. One such
set of guidelines is considered in this unit.
As a reminder of what has been studied in Unit B2 and how the sections in
the unit link together, the route map is repeated below.

The Unit B2 route map

• Section 1: What is so special about big data?
• Section 2: Handling big data
• Section 3: Models and algorithms
• Section 4: Outputs from big data analysis
• Section 5: Privacy
• Section 6: Fairness
• Section 7: Guidelines for good practice


Learning outcomes
After you have worked through this unit, you should be able to:
• interpret what is meant by big data and appreciate its importance, uses
and challenges
• explain the different aspects of big data, concentrating on the three V’s
aspects: volume, variety and velocity
• understand the notion of distributed computing and appreciate its
computational power obtained by combining multiple individual
processors
• implement distributed computing in R through working on a set of
Jupyter notebook activities
• understand some computational algorithms that can be used to facilitate
the analysis of big data
• appreciate different aspects of these algorithms and investigate their
convergence, accuracy, and the uniqueness and optimality of the
solutions they give
• describe some differences in the interpretation of big data outputs
compared to that of small data – these include the extent to which
correlations may be spurious, whether the analysis is for sample or
population data, and that the interpretation of the big data output
usually depends on prediction, rather than explanation
• appreciate the importance and legal obligation to maintain the privacy
of individuals while handling both small and big data – this includes
obtaining the appropriate consent and maintaining the anonymisation of
data at all stages of data collection and storage
• appreciate the importance and legal obligation to maintain other ethical
standards at all stages of big data analysis – specifically, ensuring equity
of all individuals and avoiding feedback loops
• understand and use guidelines for good practice to ensure that data are
being handled with integrity and that all ethical issues are considered.


References
Amazon.co.uk (2022) Privacy Notice. Available at:
https://www.amazon.co.uk/gp/help/customer/display.html?nodeId=502584
(Accessed: 29 September 2022).
BBC News (2017) ‘Google DeepMind NHS app test broke UK privacy law’,
3 July. Available at: https://www.bbc.co.uk/news/technology-40483202
(Accessed: 9 February 2021).
British Heart Foundation (2019) Bias and biology: how the gender gap in
heart disease is costing women’s lives. (British Heart Foundation briefing.)
Available at: https://www.bhf.org.uk/informationsupport/heart-matters-
magazine/medical/women-and-heart-disease/download-bias-and-biology-
briefing (Accessed: 20 October 2022).
Buolamwini, J. and Gebru, T. (2018) ‘Gender shades: intersectional
accuracy disparities in commercial gender classification’, Proceedings of
Machine Learning Research, 81, pp. 77–91.
Butler, D. (2013) ‘When Google got flu wrong’, Nature, 494, 14 February,
pp. 155–156. doi:10.1038/494155a.
CERN (2021) Worldwide LHC Computing Grid. Available at:
https://wlcg-public.web.cern.ch (Accessed: 24 March 2021).
Chaves, W.A., Valle, D., Tavares, A.S., von Mühlen, E.M. and Wilcove,
D.S. (2021) ‘Investigating illegal activities that affect biodiversity: the case
of wildlife consumption in the Brazilian Amazon’, Ecological Applications,
31(7), Article e02402. doi:10.1002/eap.2402.
Cook, S., Conrad C., Fowlkes, A.L. and Mohebbi, M.H. (2011) ‘Assessing
Google Flu Trends performance in the United States during the 2009
Influenza Virus A (H1N1) pandemic’, PLoS One, 6(8), Article e23610.
doi:10.1371/journal.pone.0023610.
Das, S. (2019) ‘It’s hysteria, not a heart attack, GP app Babylon tells
women’, 13 October. Available at:
https://www.thetimes.co.uk/edition/news/its-hysteria-not-a-heart-attack-
gp-app-tells-women-gm2vxbrqk (Accessed: 29 September 2022).
Dean, J. and Ghemawat, S. (2004) ‘MapReduce: simplified data processing
on large clusters’, OSDI’04: Proceedings of the 6th Conference on
Symposium on Operating Systems Design & Implementation. San
Francisco, 6–8 December, pp. 137–149. doi:10.5555/1251254.1251264.
Diebold, F.X. (2021) “What’s the big idea? ‘Big Data’ and its origins”,
Significance, 18(1), pp. 36–37. doi:10.1111/1740-9713.01490.
EU-CJ (2011) ‘Taking the gender of the insured individual into account as
a risk factor in insurance contracts constitutes discrimination’, Court of
Justice of the European Union, Press Release No 12/11. Available at:
https://curia.europa.eu/jcms/upload/docs/application/pdf/2011-
03/cp110012en.pdf (Accessed: 23 February 2022).

361
Unit B2 Big data and the application of data science

EU-FRA (2021) ‘EU Charter of Fundamental Rights, Article 21’. Available


at: https://fra.europa.eu/en/eu-charter/article/21-non-discrimination
(Accessed: 15 July 2021).
Evans, P.J. (2019) ‘Build a Raspberry Pi cluster computer’, The MagPi
Magazine. Available at: https://magpi.raspberrypi.com/articles/build-a-
raspberry-pi-cluster-computer (Accessed: 10 February 2022).
Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S.
and Brilliant, L. (2009) ‘Detecting influenza epidemics using search engine
query data’, Nature, 457, 19 February, pp. 1012–1014.
doi:10.1038/nature07634.
Google (2015) ‘The next chapter for Flu Trends’, Google Research,
20 August. Available at:
https://ai.googleblog.com/2015/08/the-next-chapter-for-flu-trends.html
(Accessed: 29 September 2022).
Guo, S., Fang, F., Zhou, T., Zhang, W., Guo, Q., Zeng, R., Chen, X., Liu,
J. and Lu, X. (2021) ‘Improving Google Flu Trends for COVID-19
estimates using Weibo posts’, Data Science and Management, 3, pp. 13–21.
doi:10.1016/j.dsm.2021.07.001.
Howell, E. (2020) ‘NASA’s real “Hidden Figures”’. Available at:
https://www.space.com/35430-real-hidden-figures.html#section-history-of-
human-computers-at-nasa (Accessed: 4 February 2022).
ICO (no date) ‘Guide to data protection – The principles’. Available at:
https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-
general-data-protection-regulation-gdpr/principles
(Accessed: 22 February 2022).
Institute and Faculty of Actuaries and Royal Statistical Society (2019)
'A guide for ethical data science'. Available at: https://actuaries.org.uk/
standards/data-science-ethics (Accessed: 30 September 2022).
Kandula, S. and Shaman, J. (2019) ‘Reappraising the utility of Google Flu
Trends’, PLoS Computational Biology, 15(8), Article e1007258.
doi:10.1371/journal.pcbi.1007258.
Kelion, L. (2019) ‘Amazon sued over Alexa child recordings in US’, BBC
News, 13 June. Available at: https://www.bbc.co.uk/news/technology-
48623914 (Accessed: 9 February 2021).
Keller Postman (2021) ‘9th Circ. won’t let Amazon arbitrate kids’ Alexa
privacy suit’, News & Insight, 25 April. Available at:
https://www.kellerpostman.com/9th-circ-wont-let-amazon-arbitrate-kids-
alexa-privacy-suit (Accessed: 21 October 2022).
Lazer, D., Kennedy, R., King, G. and Vespignani, A. (2014) ‘The parable
of Google Flu: Traps in big data analysis’, Science, 343(6176), 14 March,
pp. 1203–1205. doi:10.1126/science.1248506.


Leo, B. (2018) ‘Motorists fork out £1,000 more to insure their cars if their
name is Mohammed’, The Sun, 22 January. Available at:
https://www.thesun.co.uk/motors/5393978/insurance-race-row-john-
mohammed (Accessed: 29 September 2022).
Mach, P. (2021) ‘10 business applications of neural network (with
examples!)’, Ideamotive, 7 January. Available at:
https://www.ideamotive.co/blog/business-applications-of-neural-network
(Accessed: 16 September 2022).
Moore, G.E. (1965) ‘Cramming more components onto integrated circuits’,
Electronics, 38(8), 19 April.
NASA (2016) ‘When the computer wore a skirt: Langley’s computers,
1935–1970'. Available from: https://www.nasa.gov/feature/when-the-
computer-wore-a-skirt-langley-s-computers-1935-1970
(Accessed: 4 February 2022).
OED Online (2022) ‘algorithm, n.’. Available at:
https://www.oed.com/view/Entry/4959 (Accessed: 18 August 2022).
Oridupa, G. (2018) ‘Fundamentals of MapReduce (new to MapReduce?)’,
Coding and analytics, 23 August. Available at:
https://www.codingandanalytics.com/2018/08/fundamentals-of-
mapreduce.html (Accessed: 20 December 2022).
PredPol (2020) ‘PredPol and community policing’. Available at:
https://blog.predpol.com/predpol-and-community-policing (Accessed:
25 February 2022).
Pym, H. (2019) ‘App warns hospital staff of kidney condition in minutes’,
BBC News, 1 August. Available at: https://www.bbc.co.uk/news/
health-49178891 (Accessed: 9 February 2021).
R Development Core Team (2022) ‘The R Reference Index’ (R-release,
version 4.2.1). Available at: https://cran.r-project.org/manuals.html
(Accessed: 21 October 2022).
Royal Free London (no date) ‘Information Commissioner’s Office (ICO)
investigation’. Available at: https://www.royalfree.nhs.uk/patients-
visitors/how-we-use-patient-information/information-commissioners-office-
ico-investigation-into-our-work-with-deepmind
(Accessed: 22 February 2022).
The Alan Turing Institute (2021) ‘Data science and AI in the age of
COVID-19’. Available at: https://www.turing.ac.uk/research/publications
/data-science-and-ai-age-covid-19-report (Accessed: 5 August 2021).
The National Archives (2013) ‘Equality Act 2010, Part 2, Chapter 1:
Protected Characteristics’. Available at:
https://www.legislation.gov.uk/ukpga/2010/15/part/2/chapter/1
(Accessed: 19 December 2022).


The National Archives (2020) ‘Regulation (EU) 2016/679 of the European


Parliament and of the Council, Article 5: Principles relating to processing
of personal data’. Available at: https://www.legislation.gov.uk/eur/2016/
679/article/5 (Accessed: 19 December 2022).
The Open University (2013) The Open University Annual Report 2012/13,
p. 43. Available at: https://www.open.ac.uk/about/main/sites/
www.open.ac.uk.about.main/files/files/ecms/web-content/Open-
University-Annual-Report-2012-13.pdf (Accessed: 29 September 2022).
‘Transistor count’ (2022) Wikipedia. Available at:
https://en.wikipedia.org/wiki/Transistor_count
(Accessed: 10 October 2022).
Vigen, T. (no date) Spurious correlations. Available at:
www.tylervigen.com (Accessed: 30 January 2020).
Walkowiak, S. (2016) Big data analytics with R. Birmingham: Packt.
Warner, S.L. (1965) ‘Randomized response: A survey technique for
eliminating evasive answer bias’, Journal of the American Statistical
Association, 60(309), pp. 63–69. doi:10.1080/01621459.1965.10480775.
Wickham, H. (2011) ‘The split-apply-combine strategy for data analysis’,
Journal of Statistical Software, 40(1), pp. 1–29. doi:10.18637/jss.v040.i01.

Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, a recommender system: Photo by Siggy Nowak from Pixabay
Subsection 1.1, Google: Lets Design Studio / Shutterstock
Figure 1: Ben Hider / Stringer
Subsection 1.2.1, volume: Richard Bailey / Corbis Documentary / Getty
Subsection 2.1, kids measuring tree width: Jupiterimages / Getty
Figure 4: Kathy Hutchins / Shutterstock
Figure 5: Magpi. This file is licenced under the Creative Commons Attribution-
NonCommercial-ShareAlike 3.0 Unported licence.
https://creativecommons.org/licenses/by-nc-sa/3.0/
Figure 6: James Brittain / Getty
Subsection 2.2, work in parallel: JGI / Jamie Grill / Getty
Subsection 2.2, multiprocessor: Taken from Appliances Direct website
Figure 7: diagram of the map-reduce algorithm: Taken from
https://www.codingandanalytics.com/2018/08/fundamentals-of-
mapreduce.html
Subsection 3.1, Lotoo: Posted by u/Jredrock on Reddit.com
Subsection 3.1, a square root: chaostrophic.com


Subsection 3.1, another square root: Academihaha


Subsection 3.2, local maximum: Roman Kybus / Shutterstock
Figure 8: © TIBCO Software Inc. All rights reserved. Original image
source: https://www.tibco.com/reference-center/what-is-a-neural-network
Subsection 5.1, Alexa and children: Photo illustration by Slate. Photos by
Amazon and Photo by Timo Stern on Unsplash
Subsection 5.1, The London Royal Free Hospital: PA Images / Alamy
Stock Photo
Subsection 5.2, Getting his OU degree: Open University Press
Subsection 5.2, skydiving: Steven Mangini / Twitter
Subsection 5.2, tossing a coin: ICMA Photos / Flickr. This file is licenced
under Creative Commons-by-SA 2.0.
https://creativecommons.org/licenses/by-sa/2.0/
Subsection 5.2, cheating: Photo by Surface on Unsplash
Subsection 6.1, smiles: JohnnyGreig / iStock / Getty Images
Subsection 6.1, mosque: Mohamed Zeineldine / Getty
Subsection 6.1, heart problems: Orlando / Stringer / Getty
Subsection 6.2, PredPol: San Francisco Bay Area Independent Media
Center
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
There is not a single generally accepted definition of ‘big data’. So, it is
not surprising if you struggled to put your finger on what makes data ‘big’.
Going by the adjective ‘big’, you may well have thought in terms of
datasets that have lots of observations and/or lots of variables. This is
indeed one aspect that can make data big. But did you also think about
the type of data, and the speed at which it can be gathered? As you will
see next, these can also be factors that turn data into big data.

Solution to Activity 2
Everyone’s responses are likely to be different. If you have bought items
which were suggested to you, or even if the items were of interest to you,
this suggests that a recommender system has worked – at least this one
time.

Solution to Activity 3
Reasons the analysis of such postings might not accurately reflect the
opinion of the general public include the following.
• Access to the internet is required to be able to leave comments on such
websites. Even though access to the internet has dramatically increased
during this century, it is not universal even in developed countries.
The attitudes of those without access to the internet may be very
different to those with access.
• Even if someone can access the internet, they may not have the time to
do so. Or it may be a website that they don’t like, possibly because of
the perceived political leaning of the news media outlet. Thus, in this
respect, the attitudes of those who notice the comments section may be
different to the attitudes of the general public.
• The attitudes of those people prepared to post in a comment may also
be different to those who are aware of it, but do not post. Furthermore,
people might be more reticent to express particular opinions, for fear of
censure from others.
• Some people might post multiple times using different names in an effort
to make it appear that a particular viewpoint is more common than it
really is. In the extreme, a posting may have been artificially generated.
These issues do not diminish as the number of postings increases,
particularly if the increase in postings is simply a result of there being
more posts per individual rather than more individuals posting.


Solution to Activity 4
The simplest answer you probably thought of is that these models are fitted
using R – or using a different statistics package such as Minitab or SPSS.
But how do these packages do it? Recall that linear models are generally
fitted by least squares. That is, the parameter estimates are chosen so that
the sums of the squared differences between the data points and the model
are minimised.
In contrast, maximum likelihood estimation is used for generalised linear
modelling. That is, the parameter estimates are chosen to be those for
which the likelihood is maximised.
However, saying that the model is fitted using least squares or maximum
likelihood still does not fully answer the question about how the package
comes up with the estimates for model parameters. You need to consider
the computations that are performed. The complexity of these
computations depends on the model that is being fitted.

Solution to Activity 5
(a) When a member of the module team did this, they took 8 minutes
30 seconds to calculate $\hat{\beta}$ and then $\hat{\alpha}$. As the time taken depends on
factors such as dexterity with a calculator, focus on the task and
practice, the time you took is likely to be different.
It turns out that for these data $\hat{\beta} = 12.9747$ and $\hat{\alpha} = 3.5854$. Did you
get the correct values for $\hat{\beta}$ and $\hat{\alpha}$?
(b) The calculations in Step 1 are likely to take three times as long as
there are three times as many numbers to add up. The other steps are
likely to take about the same length of time as when there are only
five trees included. This is because the number of numbers involved in
the calculations is the same whatever the number of trees. Overall
this means that calculating $\hat{\alpha}$ and $\hat{\beta}$ will take a bit less than three
times as long.
(c) As the number of trees increases, it is the time spent on Step 1 that
will dominate. The other steps will only require a few computations,
no matter how many trees there are. So, generally the computation
time will go up proportionally with the number of trees.

Solution to Activity 6
(a) There is no right answer to this. How long is taken will vary from
person to person. When a member of the module team tried this,
they took 4 minutes 15 seconds. (The value of the standard deviation
you should have obtained is 15.628.)
(b) When the same member of the module team tried the second method,
they took 3 minutes 15 seconds.
(c) So the module team member was able to calculate the standard
deviation faster using Method 2. You probably found the same.


As the number of points increases, these differences are likely to


become more pronounced. The differences between the two methods
lie in Steps 1 to 4. Also, both methods require calculating the sample
mean, along with squaring n numbers and adding them up. However,
Method 2 requires the value
$\sum x^2 - \dfrac{(\sum x)^2}{n}$
to be calculated, a calculation that includes the same number of
terms, no matter how big the sample size is. In contrast, Method 1
requires the deviations $(x - \bar{x})$ to be individually calculated. Thus,
the number of individual calculations increases for this step as the
sample size increases.
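If you want to check that the two methods really do give the same value, a
quick R comparison with some made-up data is:

x <- c(12, 7, 19, 23, 15, 9)              # made-up data values
sum((x - mean(x))^2)                      # Method 1: sum of squared deviations
sum(x^2) - (sum(x))^2 / length(x)         # Method 2: the shortcut discussed above

Both lines return the same value; the difference between the methods is in
how much arithmetic they require as the sample size grows.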

Solution to Activity 7
(a) It is the first step that is likely to take the most time. The number of
terms involved with each of the calculations detailed in Step 1 will
increase as the number of trees increases.
(b) All four approaches are valid as they will all result in the same value
of the sum being calculated. This is because it does not matter what
terms are added together.
(c) Strategy (iii) is likely to be fastest. With you and your two friends
each calculating one of s1 , s2 and s3 , these can be calculated
simultaneously. This just leaves the relatively small sum s1 + s2 + s3
to be calculated afterwards. If everyone can add two numbers together
as quickly as everyone else, it does not make sense for one person to
be given more numbers to add together than anyone else because the
final sum can’t be calculated until s1 , s2 and s3 are all known.
(d) Suppose that you give one friend s1 to calculate and the other s2 .
The friend given s2 to calculate only needs to know the values
of x6 , . . . , x10 , just one third of the data. Similarly, the friend given s3
to calculate only needs to know the values x11 , . . . , x15 , again just one
third of the data. Finally, you would only need to know the values
of x1 , . . . , x5 – provided you are sure that your friends have the data
they need.

Solution to Activity 8
(a) There are two key tasks. The first is to decide which subset of the
data each friend is going to work on (and of course, which data points
you are going to leave for yourself). This is important because each
data point must go to one, and only one, person. The other key task
is to keep track of what the subsums s1 , s2 and s3 are once they have
been calculated. Note that keeping track also includes having some
system of noticing if one of your friends fails to get back to you with
their subsum.


(b) Whilst you might think that dividing up the work is worthwhile when
there are 12 trees in the dataset, it is unlikely to be when there are
just 6 trees. The extra hassle of tasking your two friends and keeping
track of what they report for their subsums might make it not worth
it. Instead you may well have decided that, in this circumstance, it is
quicker to do all the computations yourself.

Solution to Activity 9
(a) $P(\text{at least one fails}) = 1 - P(\text{neither fails})$
    $= 1 - (P(\text{a processor does not fail}))^2$
    $= 1 - (0.9)^2 = 1 - 0.81 = 0.19$.
(b) $P(\text{at least one fails}) = 1 - P(\text{none fails})$
    $= 1 - (P(\text{a processor does not fail}))^5$
    $= 1 - (0.9)^5 \simeq 1 - 0.5905 = 0.4095$.
(c) Let $n$ be the minimum number of processors we need. This means
    that we must have
    $1 - (0.9)^n = 0.5$, that is, $(0.9)^n = 0.5$.
    In other words, $n \log(0.9) = \log(0.5)$. Thus,
    $n = \dfrac{\log(0.5)}{\log(0.9)} \simeq 6.58$, so $n = 7$.

(d) When the probability of a single processor failing is 0.01, then the
    number of processors, $n$, such that the probability of at least one
    failure is 0.5 is given by
    $n = \dfrac{\log(0.5)}{\log(0.99)} \simeq 68.97$, so $n = 69$.
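These calculations can be checked quickly in R:

1 - 0.9^2              # part (a): 0.19
1 - 0.9^5              # part (b): about 0.41
log(0.5) / log(0.9)    # part (c): about 6.58, so 7 processors
log(0.5) / log(0.99)   # part (d): about 68.97, so 69 processors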

Solution to Activity 10
(a) The overall mean, $\bar{x}$, can be calculated using
    $\bar{x} = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}.$

(b) A similar formula works if we have p subsets. One way of seeing this
    is by considering that the overall mean can be calculated by
    combining the mean based on the first p − 1 subsets and the mean
    based on the last subset. More directly, this is the same as using
    $\bar{x} = \dfrac{\sum_{k=1}^{p} n_k \bar{x}_k}{\sum_{k=1}^{p} n_k}.$


(c) As parts (a) and (b) show, it is possible to calculate the mean by first
splitting the data into p subsets and working on each subset
independently. Then, the results from each subset can be combined to
get the overall mean. So, it is a computation that is amenable to
being done using distributed computing.
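A small R sketch of this idea, using made-up data, is given below; split()
stands in for distributing the data across different processors.

x <- c(4, 8, 15, 16, 23, 42, 7, 11)        # made-up data
subsets <- split(x, rep(1:2, each = 4))    # two subsets, as if held on two processors
nk <- sapply(subsets, length)              # subset sizes
xk <- sapply(subsets, mean)                # subset means, computed independently
sum(nk * xk) / sum(nk)                     # combined mean
mean(x)                                    # agrees with the mean of the full dataset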

Solution to Activity 11
(a) As the data are already ordered, the median corresponds to the
middle value, which is 21 in this case.
(b) With this split of the data, the median of each subset is 15, 21 and 37,
respectively.
(c) With this split of the data, the median of each subset is 2, 21 and 117,
respectively.
(d) With this split of the data, the median of each subset is 5, 37 and 103,
respectively.
(e) The median of the medians for each subset is shown in the following
table.

Data split   Median of      Median of       Median of      Median of
             first subset   second subset   third subset   the medians
(b)          15             21              37             21
(c)          2              21              117            21
(d)          5              37              103            37

So, using the subsets proposed in parts (b) and (c) leads to the
median of medians that is the same as the median for the whole
dataset. However, not all subsets will lead to the median for the
whole dataset, as the selection in part (d) demonstrates.
(f) The mean of the medians for each subset is shown in the following
table.

Data split   Median of      Median of       Median of      Mean of
             first subset   second subset   third subset   the medians
(b)          15             21              37             24.33
(c)          2              21              117            46.67
(d)          5              37              103            48.33

So, taking the mean of the medians is even worse. The mean of the
medians is not guaranteed to be even close to the median of the whole
dataset.
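The point is easily demonstrated in R with a small hypothetical dataset of nine ordered values:

x <- c(2, 5, 15, 20, 21, 37, 103, 117, 120)    # hypothetical ordered data
median(x)                                       # the true median: 21
split_1 <- list(c(2, 15, 120), c(5, 21, 117), c(20, 37, 103))
split_2 <- list(c(2, 5, 21), c(15, 37, 120), c(20, 103, 117))
median(sapply(split_1, median))  # 21: this split happens to recover the true median
median(sapply(split_2, median))  # 37: this split does not
mean(sapply(split_1, median))    # 24.33: the mean of the medians misses it too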


Solution to Activity 12
(a) (i) For these data, Σ(x − x̄)² = 14. This value is exact as the
        calculation involves the adding and squaring of (small-valued)
        integers.
(ii) The variance is 7. This is exact because 14 is a multiple of 2
(= n − 1).
(iii) The standard deviation is 2.646 (to three decimal places). As the
phrase ‘to three decimal places’ indicates, this value is not exact.
At least, it is not exact if we want to write it down in decimal
form. The exact value is an irrational number, so we cannot write
down the value exactly using a finite number of decimal places.
(b) (i) For these data, Σ(x − x̄)² = 2. Like in part (a), this value is
        exact as it just involves the adding and squaring of
        (small-valued) integers.
(ii) The variance is 0.667 (to three decimal places). This is not exact.
The value 2 is not a multiple of 3 (= n − 1), so we cannot write
down this value exactly using a finite number of decimal places.
(iii) To three decimal places, the value of the standard deviation you
obtained would have been 0.837, 0.819, 0.817 or 0.816, depending
on how many decimal places you used for the variance when
taking the square root, i.e. these are the square root values
of 0.7, 0.67, 0.667 and 0.66667, respectively. (To three decimal
places, the square root of 0.6667 is the same as that of 0.667.)
the standard deviation values will be the exact value, as now it is
not possible to input the exact value to calculate the square root.
Also, we are not able to write down the result exactly using a
finite number of decimal places.
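A one-line check in R shows how rounding the variance feeds through to the standard deviation:

round(sqrt(c(0.7, 0.67, 0.667, 0.66667)), 3)  # 0.837 0.819 0.817 0.816
sqrt(2/3)                                      # 0.8164966..., not expressible exactly in decimals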

Solution to Activity 13
No, forward and backward stepwise regression do not always result in the
same parsimonious model. For example, in Unit 2, both forward and
backward stepwise regression were used to find a parsimonious model for
the income a film will generate.
In Example 14 (Subsection 5.3.1 of Unit 2), forward stepwise regression
found the parsimonious model to be

income = −3.627 + 0.033 budget + 1.268 screens + 0.923 rating.
In contrast, as Example 15 in Subsection 5.3.2 of Unit 2 showed, backward
stepwise regression working from the same set of explanatory variables,
found a different parsimonious model. This model corresponds to

income = − 4.385 + 0.036 budget + 1.058 rating + 1.104 screens
− 0.268 views + 0.099 likes + 1.493 dislikes
− 0.554 comments.
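In R, selections of this kind can be carried out with the step() function; the sketch below assumes a data frame called films containing these variables. Note that step() selects terms using the AIC, so it need not reproduce the Unit 2 results exactly.

full  <- lm(income ~ budget + screens + rating + views + likes +
              dislikes + comments, data = films)
empty <- lm(income ~ 1, data = films)
step(empty, scope = formula(full), direction = "forward")  # forward stepwise
step(full, direction = "backward")                         # backward stepwise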


Solution to Activity 14
Yes it is. There is general agreement that the best model corresponds to
the one that minimises the sum of the differences between the predicted
values (heights of trees) and the actual heights of the trees. (As it turns
out for these data, this sum is 1.027 based on the result from the first
algorithm, and 0.136 using the second algorithm. So in this case it
happens to be that the second algorithm produces the better answer.)

Solution to Activity 15
No, it is not. As you learnt in Subsection 5.2 of Unit 2, there are different
ways of measuring how well a model fits the data. These ways take into
account how many variables are included. In particular, there is the
adjusted R² statistic (Rₐ²) and the Akaike information criterion (AIC). It
might be that Rₐ² and AIC both indicate the same selection of variables is
best. However, this is not guaranteed.

Solution to Activity 16
It could be argued that all the methods are about fitting the same type of
model: a model in which the data can be split into different clusters.
However, this leads to a vague model. The shape of the clusters and the
distribution of the numbers of observations in each cluster are not tied
down.
Instead, the methods are about using different algorithms to achieve the
same end (splitting the data into clusters). The different algorithms have
different strengths.

Solution to Activity 17
No, it is not.
In Example 14, it was agreed that the link between strength and weight is
unlikely to be causal. However, an indirect link is plausible. A footballer
might be heavier because they have more muscle mass, and that greater
muscle mass also makes them stronger.

Figure S2 An increase in muscle mass could lead to an increase in weight
and an increase in strength (diagram: ‘more muscle mass’ points both to
‘heavier’ and to ‘stronger’)


Solution to Activity 18
(a) The sample mean is a good estimator of the population mean.
So, an estimate of the population mean strength is 71.07.
(b) It is good practice to give some indication of the sampling error
associated with an estimate. The 95% confidence interval is one
means of doing so.
(c) As we are now dealing with all the data, the mean strength
(calculated to be 65.32) is the population mean strength. Note that
this is the value of the population mean, not an estimate of it. So,
sampling error is no longer an issue and hence indications of the size
of the sampling error are no longer required.

Solution to Activity 19
Your first thoughts may have been that, yes, these data will be
representative of all international footballers. However, it is not clear
whether the database contains information on both male and female
footballers.
Also, the data may only be representative of footballers in 2019, or a few
years either side. The data are almost certainly not representative of all
international footballers in the past and may not be representative of such
footballers in the future due to factors such as changes in training methods.

Solution to Activity 20
(a) No, it does not appear that all the variables are contributing to the
model. The p-values associated with preferredFoot and skillMoves
are both relatively large.
(b) Now, it looks like most of the variables are contributing significantly
to the model. There is only one variable, preferredFoot, that might
not be, and even here the p-value is close to the level that is normally
considered for statistical significance.
(c) Now, there is strong evidence that all the variables are contributing to
the model as all the p-values are very small.
(d) In each case, the 95% confidence interval can be calculated using the
formula
(β̂ − t(0.975) × s.e.(β̂), β̂ + t(0.975) × s.e.(β̂)),

where ‘s.e.(β̂)’ means ‘the standard error of β̂’. Here, even when the
sample size is n = 100, the degrees of freedom for the relevant
t-distribution is big enough that it is reasonable to replace t(0.975)
by z(0.975) ≃ 1.96.
So, when the model is based on 100 footballers, the 95% confidence
interval is
(−0.279 − 1.96 × 1.046, −0.279 + 1.96 × 1.046) ≃ (−2.329, 1.771).


When the model is based on 1000 footballers, the 95% confidence


interval is
(1.172 − 1.96 × 0.623, 1.172 + 1.96 × 0.623) ≃ (−0.049, 2.393).

Finally, when the model is based on all footballers, the 95%


confidence interval is
(1.004 − 1.96 × 0.1498, 1.004 + 1.96 × 0.1498) ≃ (0.710, 1.298).

As the amount of data increases, the 95% confidence interval has


become narrower. In particular, when the model is based on all
footballers, the interval is narrow enough that the lower bound of the
interval is positive.
In each case, the 95% confidence interval contains the value 1. So for
each model, it is plausible that the effect size of preferredFoot is 1.
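As a quick check, the first of these intervals can be reproduced in R directly from the estimate and standard error quoted above:

beta_hat <- -0.279                       # coefficient of preferredFoot (n = 100 model)
se       <- 1.046                        # its standard error
beta_hat + c(-1, 1) * qnorm(0.975) * se  # approximately (-2.329, 1.771)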

Solution to Activity 21
Out of the models listed, the last model is the most complex as it has the
most terms in it. It is also the most complicated one to interpret. The
effect of any one of the variables incomeSource, employment and gender
depends on what value the other variables take.

Solution to Activity 22
The authors give the following examples:
• recommender systems
• providing a way to see how a selected item of clothing would look on a
model
• providing software to help banks decide whether they should extend
credit to someone
• forecasting the value of currencies, cryptocurrencies and stocks
• assisting doctors with diagnoses and with treatment plans
• detecting cyber attacks
• detecting online fraud
• planning and monitoring the routing of deliveries
• predicting the time of deliveries
• developing an autopilot system for a car.


Solution to Activity 23
(a) The intended reader is someone who is considering taking part in the
study. Notice how the leaflet makes frequent use of the pronoun ‘you’
to mean this person and the pronoun ‘we’ to mean the researchers.
(b) The leaflet makes it very clear the reader does not have to take part
if they don’t want to.
(c) A participant will be asked to give basic details such as their name,
age, gender and ethnicity, details about recent illnesses, and about
any recent COVID-19 test they might have had. They will also be
asked to upload a photograph of their antibody test result.
(d) The research team will have access to all of the data. Only results
of the study, which the researchers stress will not contain personal
identifiable information, will be shared with others. (These others
include the UK’s National Health Service (NHS), Public Health
England and the Department of Health and Social Care.)

Solution to Activity 24
For this, the module team looked at Amazon.co.uk. Their Privacy Notice
(in this case, one dated 22 March 2019) includes a long list of instances
where they gather information, such as:
• communicate with us by phone, email or otherwise
• search for products or services
• upload your contacts
• place an order through Amazon services
• talk to or otherwise interact with our Alexa Voice service.
They also gather other information automatically such as:
• the IP address
• login; email address; password
• the location of your device or computer.
Other services that Amazon.co.uk gather information about their
customers include:
• information about your interactions with products and services offered
by our subsidiaries
• information about internet-connected devices and services with Alexa
• credit history information from credit bureaus.


Solution to Activity 25
Two examples are:
• upload your contacts
• talk or otherwise interact with Alexa Voice service.
By uploading information about your contacts, you are providing Amazon
with information about these other people. Yet unless you have told your
contacts about this, they may not even be aware that you have passed on
information about them to Amazon.
Similarly, the presence of Alexa in a household means that any member of
the household or visitor may inadvertently have what they have said
picked up by Alexa.

Solution to Activity 26
There are two variables that relate to where each student was based:
region and imd. (Remember that imd is a factor representing the index of
multiple deprivation (IMD), which is a measure of the level of deprivation
for the student’s (UK) postcode address.)
However, both variables only give a very rough indication of where each
student was located. The variable region splits the UK into just
13 regions, and one of those regions includes all the students based outside
of the UK. The other variable, imd, combines locations in the UK on the
basis of their index of multiple deprivation. For these data, just four
categorisations are used: ‘most’, ‘middle’, ‘least’ and ‘other’.

Solution to Activity 27
In the dataset, most students are 70 years old or younger. However, there
are a few students over the age of 75, including one over 90 years old. Of
course, there are enough people aged at least 90 years old that this piece of
information does not seem to be useful for identification.

Solution to Activity 28
Other information about Clifford available online includes that he earned
his degree, a BA Open in Arts, in 2013 when he was aged 93 (The Open
University, 2013).
This means it is unlikely he is the oldest student in the OU students
dataset. As stated in the description of the dataset, the dataset only
contains data about students on modules between 2015 and 2020 – after
Clifford graduated. Also, all the students in this dataset were studying
statistics modules, which are modules that Clifford is unlikely to have
studied as part of his degree (though not impossible).


Solution to Activity 29
A student might be giving the response ‘yes’ simply because their coin toss
was ‘heads’.
Furthermore, only the student will know if the coin toss was in fact ‘tails’,
and they have indeed submitted someone else’s work as their own. So,
someone processing the survey cannot say if an individual respondent has
admitted to submitting someone else’s work as their own because they do
not know.

Solution to Activity 30
(a) If no OU students have ever submitted someone else’s work as their
own, the response ‘yes’ would only be a result of coin tosses coming
up heads. Assuming that all the coins are unbiased, we would expect
this to happen 50% of the time. So, we would expect 50% of the
responses to this question to be ‘yes’.
(b) If 1% of OU students have ever submitted someone else’s work as
their own, we would still expect 50% of respondents to answer ‘yes’
because their coin toss is ‘heads’. However, now 1% of students whose
coin toss was ‘tails’ would also answer ‘yes’.
So, overall we would expect 50% + 0.5 × 1% = 50.5% of the responses
to this question to be ‘yes’.
(c) If p% of OU students have ever submitted someone else’s work as
their own, we would still expect 50% of respondents to answer ‘yes’
because their coin toss is ‘heads’. However, now p% of students whose
coin toss was ‘tails’ would also answer ’yes’.
So, overall we would expect 50% + 0.5 × p% = (100% + p%)/2 of the
responses to this question to be ‘yes’.
(d) The percentage of ‘yes’ replies we expect to have depends on the true
percentage, p, of OU students who have ever submitted someone else’s
work as their own. As p increases, the percentage of ‘yes’ replies we
expect goes up. So, we are able to use the percentage of ‘yes’ replies
to estimate p.
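Rearranging the formula above gives p% = 2 × (percentage of ‘yes’ replies) − 100%. A small R sketch, using a hypothetical observed percentage:

pct_yes <- 53                 # hypothetical percentage of 'yes' replies in a survey
p_hat   <- 2 * pct_yes - 100  # estimated percentage who have submitted someone else's work
p_hat                         # 6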

Solution to Activity 31
As has already been mentioned in this subsection, questions that involve
illegal behaviour benefit from using a randomised response approach. So,
for example:
• Why would someone drive even when they know they are not in a fit
state to do so?
• Why would someone knowingly buy counterfeit or stolen goods?


A randomised response approach can also be useful when researching


topics that, whilst being about legal behaviour, may lead to responses that
participants would prefer not to reveal to the researchers (and possibly not
to anybody else). For example:
• opinions about controversial topics
• sexual history.
You may well have thought of other examples.

Solution to Activity 32
(a) All other things being equal, a model estimated using data from
100 000 people should produce better predictions than a model
estimated using data from 100 people. This is because the extra data
should mean that there will be far less sampling error associated with
the estimates of model parameters.
(b) In general, more data means that models with more parameters can
reasonably be fitted. These extra parameters are likely to allow the
shape of the model to more closely reflect the shape of the
relationship between income and the probability of repaying the loan.
For example, think of the shapes that can be achieved by functions of
the form a + bx, a + bx + cx2 and a + bx + cx2 + dx3 .

Solution to Activity 33
(a) No, this model is not likely to be as good for people in this subgroup.
The low proportion of people from this subgroup is likely to only have
a small impact on shaping the estimated relationship between income
and the probability of repaying a loan. So, the modelled relationship
between income and the probability of repaying the loan will largely
reflect people not in the subgroup.
(b) A separate model for people in this subgroup will be based on far
fewer data than the model for everyone else. So, as explored in
Activity 32, this separate model is likely to give worse predictions
than the predictions for people not in the subgroup.

Solution to Activity 34
According to Example 24, only the name of the applicant was changed,
nothing else. In the UK, someone’s name is not a protected characteristic.
However, someone’s name can be, and often is, used to infer race (and sex
too). This is why The Sun’s findings about the cost implications of just
changing an applicant’s name from John Smith to Mohammed Ali
mattered.


Solution to Activity 35
(a) Costs here include the time of the medical staff at A&E, the costs of
any medical tests run to discover it is not a heart attack after all, the
time and expense for the person going to A&E and the stress induced
by receiving such advice for the person concerned (and others who
care about them). You may have thought of other costs.
(b) Costs here include those associated with not getting the required
medical treatment in a timely fashion. This could lead to a much
worse outcome for the person.
(c) The costs associated with misdiagnosing a heart attack as a panic
attack are potentially much greater than those associated with
misdiagnosing a panic attack as a heart attack. So, most people
would agree that the former misdiagnosis is worse.

Solution to Activity 36
(a) It makes sense for the police to focus on where the crime is. So, in
this case, to focus on district A.
(b) If more crime is recorded where the police are looking, then more
crime will be recorded in district A.
(c) If more crime is recorded in district A, then this reinforces the idea
that district A is the higher-crime area.
The answers to parts (a), (b) and (c) set up the undesirable feedback loop
shown in Figure S3.

Figure S3 A potential feedback loop when district policing levels are decided
by crime rates (diagram: ‘district labelled as high crime’ → ‘police focus on
district’ → ‘police notice more crime’ → back to ‘district labelled as high
crime’)

Solution to Activity 37
All kinds of real-life scenarios are welcome. In your discussion of the
potential ethical issues, try to put them in context with the contents of
Sections 5 and 6. Engage with other students on the module and discuss
your thoughts clearly and responsibly on the module forums.


Solution to Activity 38
(a) You should have noticed that a wide range of people are mentioned.
These include people whose information is being used and stakeholders
in the work that is being done. It also includes society as a whole.
(b) There is mention of the need to adhere to legal and regulatory
frameworks (including where that encompasses being a whistleblower
if necessary), but the themes go beyond that. For example, to act
with transparency to build public trust in data science.
(c) The themes just mention using ‘robust statistical and algorithmic
methods that are appropriate to the question being asked’. Which
technique should be used in which situation is not mentioned.

Solution to Activity 39
Consideration about consent occurs under project planning (‘Can data be
ethically sourced?’), under data management (‘Fully understand the
consents and legal uses of the data’) and under analysis and development
(‘Apply consents and permitted uses, professional and regulatory
requirements’).
Anonymisation also appears under data management (‘Consider impacts of
data processes to privacy, bias and error’) as well as analysis and
development (‘Monitor risks identified at planning, assess for additional
risks (harm, bias, error, privacy)’) and implementation and delivery
(‘Applying best practice in anonymisation before sharing data or
disseminating outputs’).
Whilst inequality and feedback loops are not explicitly mentioned, bias,
harm and fairness are. For example, under project planning (‘Are there
risks (privacy, harm, fairness) for individuals, groups, businesses,
environment?’), under data management (‘Detecting and mitigating
sources of bias’) and analysis and development (‘Monitor risks identified at
planning, assess for additional risks (harm, bias, error, privacy)’).

Review Unit
Data analysis in practice

Introduction
This unit is designed to help you consolidate what you have learned in the
core part of M348. It describes some extensions of ideas you have already
met, as well as a small number of new statistical ideas that you may find
useful for data analysis in practice. The main aim, however, is to reinforce
the connections that exist between the various modelling approaches you
have seen in Units 1 to 8 and to give you some practical advice for
choosing a model for your data. Several datasets will be introduced in the
unit, so that you can draw out their features through different modelling
approaches.
To ease you into the spirit of this unit, we’ll start with a repeat of the
quote from Unit 5, attributed to the eminent statistician George Box:
‘All models are wrong, but some are useful.’
Ultimately, Box is saying that there is not one ‘right’ answer when trying
to find a model for your data. Every statistician is likely to choose slightly
differently. This is illustrated in the paper called ‘Many analysts, one data
set: making transparent how variations in analytic choices affect results’
(Silberzahn et al., 2018), where 29 teams of researchers analysed the same
dataset to answer the same research question. Only 20 of the teams found
the effect of interest to be statistically significant, and the estimates for
this effect varied a lot. Moreover, the 29 final models proposed by the
teams used 21 unique combinations of explanatory variables!
Keep this in mind when modelling a dataset. There is a multitude of
techniques you can use to analyse the data, and more than one approach
may lead to an appropriate model. So it’s a case of doing something that
can be justified and defended. This unit will help you along this path.
The unit starts by reviewing linear models, which were the focus of
Units 1 to 5. In Unit 1, we explored the concept of simple linear
regression, where a linear relationship between a continuous response
variable and one covariate is modelled. In Unit 2, this framework was
generalised to multiple linear regression, or more simply, multiple
regression, to incorporate an arbitrary number of covariates that may
affect the response. We will review various features of multiple regression,
including transformations, model selection and model diagnostics, in
Sections 1 and 2: Section 1 will focus on the exploratory analysis stage of
multiple regression, while Section 2 will consider the model fitting stage.
Not all explanatory variables are continuous, and therefore Unit 3 explored
methods to analyse data where the explanatory variable is categorical, that
is, a factor. We saw how a factor can be incorporated into the regression
framework by using indicator variables, and the concept of analysis of
variance, or ANOVA, was introduced. Unit 4 finally brought all these
methods together, introducing multiple regression with an arbitrary
number of covariates and factors, as well as potential interactions between
them. We’ll review multiple regression with factors in Section 3.


Unit 5 presented a case study of linear modelling, including practical


considerations, such as sourcing the data and how to deal with missing
data. Practical considerations will be dotted around this whole unit,
embedded in various examples.
In Units 6 to 8, we broadened our modelling approach to the larger
framework of generalised linear models, or GLMs, for short. GLMs can
model both normal and non-normal response variables. As such, the GLM
modelling framework allows more flexibility regarding the type of response
variable that can be modelled. Section 4 will give a brief review of GLMs,
and illustrate their use for modelling three datasets.
Data are usually collected to answer one or more research questions. In
Section 5, we will discuss research questions from various different areas,
with particular emphasis on how to relate the model to the research
question (and how to answer it!).
Sections 6 to 8 discuss the importance of model choice, with a focus on
three specific aspects of modelling, and potential flexibility within these.
In Section 6, we will take another look at the model assumptions: are they
reasonable, or at least not unreasonable? And in Sections 7 and 8, we will
consider transformations and outliers, respectively.
To finish the unit, Section 9 gives a brief overview of methods that go
beyond those covered in M348, but may be useful for analysing data in a
practical context. You may encounter them in a postgraduate module!
These are just for your interest, and are, of course, not assessed in M348.
The structure of the Review Unit is represented diagrammatically in the
following route map.


The Review Unit route map (diagram) shows how the sections of the unit fit
together:
• Section 1: Multiple regression: understanding the data
• Section 2: Multiple regression: model fitting
• Section 3: Multiple regression with factors
• Section 4: The larger framework of generalised linear models
• Section 5: Relating the model to the research question
• Section 6: Another look at the model assumptions
• Section 7: To transform, or not to transform, that is the question
• Section 8: Who’s afraid of outliers?
• Section 9: The end of your journey: what next?
Note that Subsections 5.2 and 6.4 contain a number of notebook activities,
so you will need to switch between the written unit and your computer to
complete these.


1 Multiple regression:
understanding the data
In this section, we’ll review various aspects of the exploratory analysis
stage of multiple regression. To do this, we’ll use a dataset that is
introduced in Subsection 1.1. We’ll then start an exploratory analysis for
this dataset in Subsection 1.2 and propose a first multiple regression model
for the data. Subsection 1.3 focuses on visualising the data; this raises
several questions about the dataset and how it should be modelled, which
are discussed in Subsection 1.4.

1.1 A first look at some data from a pharmaceutical company
A dataset containing data from an experiment by a pharmaceutical
company is described next.

Optimising yield of an alcohol


The pharmaceutical company GlaxoSmithKline conducted an
experiment where they were interested in identifying the conditions
that would optimise the yield of an alcohol (that is, how much alcohol
was formed) from a chemical reaction. The type of chemical reaction
they used in the experiment is called desilylation.
For information, some of the measurements in this experiment were in
‘moles’, where a ‘mole’ is a unit to measure the amount of a substance
in chemistry, and one mole corresponds to approximately 6 × 10²³
particles such as atoms or molecules in a substance.
(Margin photograph caption: The GlaxoSmithKline Carbon Neutral
Laboratory for Sustainable Chemistry at The University of Nottingham
was the first carbon-neutral chemistry laboratory in the UK.)
Data from 30 runs of the chemical reaction for the desilylation
experiment were recorded. The response variable of interest is:
• yield: the percentage yield of the alcohol of interest (where
percentage yield is the ratio of the actual molar yield and the
theoretical molar yield, multiplied by 100).
Data are also available for four potential explanatory variables:
• temp0: the temperature at which the reaction is run, measured in
degrees Celsius (◦ C)
• time0: the reaction time, measured in hours
• nmp0: the concentration of the solution, measured in volumes of
NMP (where one volume of the solvent corresponds to
approximately two millilitres)
• equiv0: the molar equivalents of the reagent (the substance added
to cause a chemical reaction), where a molar equivalent is the ratio
of the moles of one compound to the moles of another.


The values for the first five observations from the experiment are
given in Table 1.
Table 1 First five observations from the desilylation experiment

yield temp0 time0 nmp0 equiv0


82.93 15 22 4 1.17
94.04 25 22 4 1.17
88.07 15 28 4 1.17
93.97 25 28 4 1.17
77.21 15 22 6 1.17

The desilylation dataset (desilylation)


In the desilylation dataset, the data for the four explanatory variables
from the desilylation experiment (that is, temp0, time0, nmp0 and
equiv0) were standardised. The reason for doing this will be
explained in Subsection 1.4.
So, the desilylation dataset has the same response variable yield, and
the four potential explanatory variables are:
• temp: the variable temp0 standardised
• time: the variable time0 standardised
• nmp: the variable nmp0 standardised
• equiv: the variable equiv0 standardised
The first five observations from the desilylation dataset are given in
Table 2.
Table 2 First five observations of desilylation

yield temp time nmp equiv


82.93 −1 −1 −1 −1
94.04 1 −1 −1 −1
88.07 −1 1 −1 −1
93.97 1 1 −1 −1
77.21 −1 −1 1 −1

Source: Owen et al., 2001

We will start by considering the data from the desilylation experiment


given in Table 1 – that is, the response yield and the four potential
explanatory variables temp0, time0, nmp0 and equiv0. (We’ll use the
standardised data in the desilylation dataset soon.) First, in Activity 1,
let’s categorise the explanatory variables.


Activity 1 Covariates or factors?

There were four explanatory variables considered in the desilylation


experiment: temp0, time0, nmp0 and equiv0. Which of these are
covariates, and which are factors? And why does it matter?

We will consider an initial model for the data from the desilylation
experiment next.

1.2 An initial multiple regression model


The response yield and the four explanatory variables temp0, time0, nmp0
and equiv0 are all continuous, so we will use the multiple regression
framework to build a model for the desilylation data. We’ll consider an
initial model next.
As a reminder, the multiple regression model is summarised in Box 1.

Box 1 The multiple regression model


Suppose that Y is the response variable and x1 , x2 , . . . , xq are q
covariates. The multiple regression model for a collection of n data
points can be written as
Yi = α + β1 xi1 + β2 xi2 + . . . + βq xiq + Wi , for i = 1, 2, . . . , n, (1)
where the Wi ’s are independent normal random variables with zero
mean and constant variance σ 2 , and xi1 , xi2 , . . . , xiq are values of the q
covariates for the ith observation.
Using simpler notation, this model can be denoted as
y ∼ x1 + x2 + · · · + xq .

Model (1) from Box 1 provides us with a starting point for modelling the
data from the desilylation experiment from Table 1 in Subsection 1.1. So,
as a first model, let’s focus on a multiple regression model which includes
all of the possible covariates from the experiment as explanatory variables
– that is, the multiple regression model
yield ∼ temp0 + time0 + nmp0 + equiv0. (2)
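In R’s model-formula notation, Model (2) could be fitted with lm(); the following is only a sketch, and the data frame name desilylation_expt (holding the unstandardised values from Table 1) is a hypothetical placeholder:

# Fit the multiple regression of yield on the four unstandardised covariates
m_expt <- lm(yield ~ temp0 + time0 + nmp0 + equiv0, data = desilylation_expt)
summary(m_expt)  # estimated coefficients, standard errors, t-values and p-values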
It is usually a good idea to have a look at the data before starting to fit a
model. This can give an idea of which covariates affect the response, and
how. We will start in Activity 2 by looking at the data for the first five
observations from the desilylation experiment given in Table 1. Of course,
we cannot draw any strong conclusions from just looking at the first five
observations, but these may already give some indication about the model
which can be followed up in the data analysis.


Activity 2 What can the first five observations tell us?

The first five observations from the desilylation experiment were given in
Table 1; this is repeated here as Table 3 for convenience.
Table 3 Repeat of Table 1

yield temp0 time0 nmp0 equiv0


82.93 15 22 4 1.17
94.04 25 22 4 1.17
88.07 15 28 4 1.17
93.97 25 28 4 1.17
77.21 15 22 6 1.17

As a starting point, we’re considering the multiple regression model


yield ∼ temp0 + time0 + nmp0 + equiv0.
Let’s write the full model, for i = 1, 2, . . . , n, as
yieldi = α + β1 temp0i + β2 time0i + β3 nmp0i + β4 equiv0i + Wi ,
where the Wi ’s are independent normal random variables with zero mean
and constant variance σ 2 ; temp0i , time0i , nmp0i and equiv0i are values of
the four covariates for the ith observation; and yieldi is the value of the
ith response.
(a) Using the data from Table 3, for each of the estimated regression
coefficients, βb1 , βb2 , βb3 and βb4 , indicate if you think they will be
positive or negative, or if it’s impossible to say at this stage. Explain
your answers.
(b) Compare the first observation with the third observation, and the
second observation with the fourth observation. Which covariates are
fixed, which vary, and how is the response affected?
What might this mean for the model
yield ∼ temp0 + time0 + nmp0 + equiv0?

When considering an initial model for yield in Activity 2, it was useful to


consider the data values from the desilylation experiment (from Table 1)
directly. However, as we move forwards with analysing these data, we’ll
now only consider the standardised versions of the explanatory variables
contained in the desilylation dataset, as given in Table 2.
Using the standardised data won’t affect any conclusions with regards to
the relationships between the variables (for example, except for scaling,
plots of the data will look the same). But, as will be explained in
Subsection 1.4, using the standardised data values from this experiment
can be beneficial to the modelling process.


So, we’ll continue to focus our attention on a multiple regression model for
yield, which includes all of the explanatory variables, but we’ll now use
the standardised data in the desilylation dataset so that Model (2) becomes
yield ∼ temp + time + nmp + equiv. (3)

1.3 Visualising the dataset


In Activity 2, you only had five observations out of 30 from the
desilylation experiment for your preliminary analysis. We’ll look at the
whole of the (standardised) desilylation dataset very soon, but first, in
Activity 3, we’ll consider what visualisation tool could be useful here.

Activity 3 Visualising the relationships between the response and explanatory variables
How can the data be illustrated in a way that may reveal the relationships
between the response and all covariates in one plot? What can this plot be
used for?

In Example 1 we will look at the scatterplot matrix for all 30 observations


from the desilylation dataset. (Remember that the scatterplot matrix will
look the same regardless of whether we plot the data from the experiment
or their standardised values.)
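As an aside, a scatterplot matrix of this kind can be produced in R with the pairs() function; a minimal sketch, assuming the standardised data are held in a data frame called desilylation:

pairs(desilylation[, c("temp", "time", "nmp", "equiv", "yield")])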

Example 1 Interpreting the plots in the scatterplot matrix involving the response
Figure 1 shows the scatterplot matrix for all 30 observations from the
standardised desilylation dataset.
In this example, we’ll focus on the scatterplots involving the response
variable, yield, shown on the bottom row of the scatterplot matrix.
What does the bottom row of Figure 1 tell us, and how is this related
to the conclusions made when looking at the first five observations in
Activity 2?
Let’s start by summarising the conclusions from the Solution to
Activity 2 in terms of the response yield and the explanatory
variables temp, time, nmp and equiv.
• Increasing temp increases yield.
• There’s uncertainty about the effect of time on yield.
• Increasing nmp decreases yield.
• No conjecture about the relationships between equiv and yield
can be made based on the data considered in Activity 2.


Figure 1 Scatterplot matrix for the desilylation dataset (pairwise
scatterplots of temp, time, nmp, equiv and yield; the response yield
appears in the bottom row)

So, now let’s look at what the scatterplot matrix in Figure 1 tells us
about these relationships.
• Let’s start with the plot in the bottom-left corner – that is, the
scatterplot of yield and temp. We can see that the yield increases
as the temperature increases. This confirms our conjecture from
Activity 2. However, the relationship does not look like a straight
line, so we may consider a transformation of the explanatory
variable temp for our modelling approach. Also, the response values
do not seem evenly spread around the potential mean curve, which
may indicate a problem with the constant variance assumption.
• From the second scatterplot in the bottom row of Figure 1, the
relationship between yield and time appears to be less strong.
There may be a slight increase in yield over time, but again the
relationship is not linear and a transformation could be considered.
In Activity 2, we were undecided about the effect of time on yield.
Seeing all the data in the scatterplot gives us a slightly stronger
steer towards including this covariate in the model, but not
necessarily as a linear term.
• The third scatterplot in the bottom row of Figure 1, showing the
relationship between yield and nmp, is almost a mirror image of
the previous plot. We see a somewhat decreasing, but not linear,
relationship between these variables. Again, the scatterplot confirms
our conjecture from Activity 2 of a decreasing relationship, while
giving us further information about the possible shape of the model.
• The fourth scatterplot in the bottom row of Figure 1, showing the
relationship between yield and equiv, gives a similar picture to
the second. So, we deduce that there is some evidence of an


increase in yield as the equivalents of the reagent are increased.


Again, this relationship does not look linear, so we may consider a
transformation. In Activity 2, we had not been able to conjecture
any effect of equiv, as this variable had the same value for all
observations in the subset available for the activity.

Overall, the bottom row of the scatterplot matrix considered in Example 1


has confirmed and strengthened the conclusions from the Solution to
Activity 2 (as summarised in Example 1) about the relationships between
the response and individual explanatory variables. Example 1 did,
however, raise the issue that transformations of the explanatory variables
may be needed to help make the relationships between the response and
the explanatory variables linear. We’ll look into which transformations
might be useful when we discuss fitting a model to the desilylation dataset
in Section 2.
To continue our exploratory analysis, we will next take a look at the
relationships between the explanatory variables in the scatterplot matrix.
Recall from Subsection 5.1 of Unit 2 that the scatterplots between pairs of
explanatory variables can show whether some of the explanatory variables
are highly correlated. Let’s think about this a bit more in Activity 4.

Activity 4 Correlation between explanatory variables

Why is correlation between pairs of explanatory variables of interest?

Now let us have a look at the six pairwise scatterplots for the four
explanatory variables in the scatterplot matrix in Figure 1 (in Example 1).
We will explore these in Activity 5.

Activity 5 Interpreting the plots involving pairs of explanatory variables
Describe the six pairwise scatterplots for the four explanatory variables in
the scatterplot matrix given in Figure 1. How do they differ from the
scatterplots for explanatory variables that you encountered in Unit 2?
(We’ll discuss why the scatterplots differ in the next subsection.)


1.4 Designed experiments and observational studies
The pairwise scatterplots for the four explanatory variables in the
scatterplot matrix in Figure 1 (in Example 1) open up a whole host of
questions about the desilylation data. These include:
1. Why are there only nine points when there are 30 observations in the
dataset?
2. Why is there a systematic pattern here?
3. What are the consequences for data analysis? Does it matter?
To answer Question 1, there are only nine points because some values of
the explanatory variables are repeated. There are only nine distinct
combinations of variable values for each pairwise selection of variables. In
fact, there are five distinct values for each explanatory variable. Figure 2
shows the frequency of (standardised) values for the explanatory variable
temp in the desilylation dataset. The bar charts for the other three
explanatory variables are identical.

Figure 2 Bar chart of values for the covariate temp in the desilylation
dataset (frequencies of the five standardised values −2, −1, 0, 1 and 2)

To answer Question 2, you could think about what it is that makes the
desilylation dataset different from the datasets that you have seen
previously. Consider, for example, the FIFA 19 dataset introduced in
Unit 1. This dataset contains the values of several variables that have been
measured for 100 footballers. In Unit 2, we used the height and weight of a


footballer to predict his strength. So, how do explanatory variables such as


the height and weight of footballers differ from explanatory variables such
as the time and temperature of a chemical reaction in a laboratory?
The answer is that the time and temperature for a chemical reaction are
variables whose values are in the control of the experimenters. This means
that the chemists running the experiment can choose these values before
they collect the data. In contrast, an adult footballer’s height is fixed, and
while he may experience some changes in weight during his career, he is
unlikely, for example, to change his weight on purpose in order to see how
this changes his strength! We therefore have experimental data in the
desilylation dataset, whereas the data in the FIFA 19 dataset are
observational. You have seen the distinction between these types of data
before in Subsection 2.2.2 of Unit 1. A summary of the statistical
terminology for the different types of studies is given in Box 2.

Box 2 Designed experiment versus observational study


For a designed experiment, the researchers vary the values of the
explanatory variables in a way that enables them to see how this
affects the response variable. The combinations of values of the
explanatory variables that are used in the experiment are called the
design, or the plan, of the experiment.
For an observational study, the researchers observe the effect of the
explanatory variables on the response variable without having any
influence on the values of the explanatory variables. These are simply
observed.

The type of study is usually determined by the nature of the explanatory


variables. For instance, the study of the desilylation data is a designed
experiment and the study of the FIFA 19 data is an observational study.
The good news is that the statistical techniques you use to analyse the
data, such as linear regression, are the same for both study types, which
answers Question 3.
You will explore the differences between study types further in Activity 6.
(Margin cartoon caption: Phew! We can use the same statistical techniques
whether the data are from a designed experiment or from an observational
study.)

Activity 6 Designed experiment or observational study?

For each of the following two studies, decide if it is a designed experiment
or an observational study. Explain your answer.
(a) A study based on the Olympics dataset first considered in Unit 5,
where the numbers of medals for the participating countries at the
next Olympics are predicted using the explanatory variables GDP and
how many medals each country won at the previous Olympics.


(b) A study where researchers are interested in whether a new training


regime can increase footballers’ strength. Here, half the footballers
from an amateur league are randomly selected to be given the new
training regime and the other half just use their usual training regime.
The strength is then assessed at the end of the season.

There are many considerations that go into designing, or planning, an


experiment. We will finish our journey into the world of experimental
design by revisiting the experiment that generated the desilylation data,
with some practical considerations in Activity 7.

Activity 7 Designing an experiment

Write down a list of considerations the chemists would need to have


thought about when designing the experiment that generated the
desilylation data.

For the desilylation experiment, it was decided to use a sample size of
30 runs, and to set each of the explanatory variables (temp0, time0,
nmp0 and equiv0) to five different values. Table 4
shows the ranges of each of the explanatory variables in the desilylation
experiment as suggested by the chemists.
Table 4 Ranges of the explanatory variables in the desilylation experiment

Covariate   Lowest value   Highest value   Unit
temp0            10              30        °C
time0            19              31        hours
nmp0              3               7        volumes (of 2 ml)
equiv0            1            1.67        molar equivalents

As illustrated in Figure 2, the standardised values for the explanatory


variables (temp, time, nmp and equiv) in the desilylation dataset are
−2, −1, 0, 1, 2. The lowest values of the original explanatory variables given
in Table 4 correspond to a value of −2 when standardised, the highest
to 2, the middle values to 0, and so on. This is a kind of standardisation of
explanatory variables that is common in some types of designed
experiments in order to reduce the correlation between parameter
estimates. A detailed discussion of this approach is beyond the scope of
M348, but we will briefly return to this issue in Subsection 5.4 where we
transform the variables back to their original values after fitting the model.
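For illustration, one plausible coding (consistent with Tables 1, 2 and 4 for temp) is to subtract the midpoint of the range and divide by a quarter of the range, so that the lowest value maps to −2 and the highest to 2; a sketch in R:

temp0 <- c(15, 25, 15, 25, 15)             # first five values from Table 1
(temp0 - (10 + 30) / 2) / ((30 - 10) / 4)  # gives -1, 1, -1, 1, -1, as in Table 2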
The exploratory analysis of the desilylation dataset discussed in this
section has hopefully given you a feel for the data. Now, we’re ready to fit
a model to the data!


2 Multiple regression: model fitting


In this section, we’ll start in Subsection 2.1 by fitting Model (3) from
Subsection 1.2 to the desilylation dataset – that is, we’ll fit the multiple
regression model
yield ∼ temp + time + nmp + equiv.
The assumptions of this multiple regression model are checked in
Subsection 2.2, and Subsection 2.3 then discusses how transformations can
improve our model. In Subsection 2.4, we’ll consider how our model can be
further improved by adding some interactions to the model, and the
assumptions for the selected multiple regression model will be checked in
Subsection 2.5.

2.1 Fitting the initial model


Table 5 summarises the output after fitting the multiple regression model
yield ∼ temp + time + nmp + equiv
to data from the desilylation dataset.
Table 5 Summary of output after fitting the model
yield ∼ temp + time + nmp + equiv

Parameter    Estimate   Standard error   t-value    p-value
Intercept      90.545        0.526       172.040    < 0.001
temp            4.159        0.588         7.068    < 0.001
time            1.258        0.588         2.138     0.0424
nmp            −1.088        0.588        −1.850     0.0762
equiv           1.438        0.588         2.443     0.0220

The F -statistic for the fitted model summarised in Table 5 is 15.98 with 4
and 25 degrees of freedom, and the associated p-value is p < 0.001. We’ll
explore the output for this fitted model in Activity 8.
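For reference, output of the kind shown in Table 5 could be obtained in R along the following lines (a sketch, assuming the desilylation data frame with the standardised variables):

m1 <- lm(yield ~ temp + time + nmp + equiv, data = desilylation)
summary(m1)  # coefficient table, plus the F-statistic on 4 and 25 degrees of freedom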

Activity 8 Exploring the output for the fitted model

In this activity, we’ll focus on the output relating to the regression


coefficients and tests for the model, as described in Subsections 1.2 and 1.3
in Unit 2.
(a) Test whether the regression model contributes information to
interpret changes in the yield of the alcohol. For the p-value you are
using, name the distribution it is obtained from and give (and
explain) its degrees of freedom.
(b) Which distribution are the p-values for the individual regression
coefficients calculated from? What’s the value of the degrees of
freedom?


(c) For the regression coefficient of temp, state what the associated
p-value is testing. What is the conclusion of this test?
(d) Interpret the estimated regression coefficient of temp.
(e) From their p-values, what do you conclude about the regression
coefficients of time, nmp and equiv?

So, we have fitted the multiple regression model


yield ∼ temp + time + nmp + equiv
and considered the output for this fitted model in Activity 8. But is this a
good model? To help us answer this question, we first need to check if the
model assumptions are satisfied. We will do this next.

2.2 Checking the assumptions of the initial model
Box 3 summarises methods for checking the multiple regression model
assumptions, as discussed in Subsection 3.1 of Unit 2.

Box 3 Checking the multiple regression model assumptions
For response Y and explanatory variables x1 , x2 , . . . , xq , suppose that
we have the multiple regression model, for each Yi , i = 1, 2, . . . , n,
Yi = α + β1 xi1 + β2 xi2 + . . . + βq xiq + Wi , Wi ∼ N (0, σ 2 ).

Multiple regression model assumptions


There are four main model assumptions:
• Linearity: the relationship between each of the explanatory
variables x1 , x2 , . . . , xq and Y is linear.
• Independence: the random terms Wi , i = 1, 2, . . . , n, are
independent.
• Constant variance: the random terms Wi , i = 1, 2, . . . , n, all have
the same variance σ 2 across the values of x1 , x2 , . . . , xq .
• Normality: the random terms Wi , i = 1, 2, . . . , n, are normally
distributed with zero mean and constant variance, N (0, σ 2 ).
Ways to check if these assumptions hold
• In the residual plot of the residuals and the fitted values, if the
points are not scattered randomly about the zero residual line, then
the zero mean and linearity assumptions may not hold.

397
Review Unit Data analysis in practice

• If there is some sort of ordering for the observations, then any


pattern in a plot of the residuals in this order may indicate that
there is a problem with the independence assumption.
Otherwise, if there is no obvious ordering of the observations, then
the independence assumption is generally assumed to hold unless
there is reason to believe that the assumption is unrealistic.
• In the residual plot of the residuals and the fitted values, if the
vertical spread of the points varies across the fitted values, then the
constant variance assumption may not be valid.
• If the normality assumption is reasonable, then the points in a
normal probability plot of the residuals should roughly lie on a
straight line.
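As a sketch of how these two checks might be produced in R, for a fitted model object such as m1 above:

# Residuals against fitted values, with a reference line at zero
plot(fitted(m1), resid(m1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)
# Normal probability plot of the standardised residuals
qqnorm(rstandard(m1))
qqline(rstandard(m1))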

In Activity 9, we’ll check the assumptions for our initial fitted model for
the desilylation data.

Activity 9 Checking the assumptions of the fitted model

The residual plot and the normal probability plot after fitting the multiple
regression model
yield ∼ temp + time + nmp + equiv
to the desilylation data are given in Figure 3(a) and Figure 3(b),
respectively.

Figure 3 The residual plot (a), of residuals against fitted values, and the
normal probability plot (b), of standardised residuals against theoretical
quantiles, for yield ∼ temp + time + nmp + equiv

(a) Does the residual plot in Figure 3(a) support the assumption that the
Wi ’s have zero mean and constant variance?


(b) What shape does the pattern of the points in the residual plot in
Figure 3(a) remind you of?
(c) Does the normal probability plot in Figure 3(b) support the
assumption that the Wi ’s are normally distributed?

By inspecting the residual plot and the normal probability plot given in
Figure 3 for the fitted model
yield ∼ temp + time + nmp + equiv,
it is clear that the model assumptions on the Wi ’s are violated. Why is
that, and what could be our next steps?
Well, the scatterplot matrix in Figure 1 (in Example 1, Subsection 1.3)
suggested that, while there was clearly a relationship between the response
and each of the explanatory variables, this relationship was not necessarily
linear. We have, however, just fitted a model with a linear relationship
between the response yield and the explanatory variables. So, the
resulting pattern in the residual plot reflects the non-linear relationships
visible in the scatterplot matrix.

2.3 Using transformations to improve model fit
When analysing the scatterplot matrix for the desilylation dataset in Example 1
(Subsection 1.3), we suggested using transformations of the explanatory
variables to help make the relationships between the response and the
explanatory variables more linear. However, it was not obvious which
transformations might be useful here. In particular, since the data come
from a designed experiment, histograms of the values of the explanatory
variables – such as the one in Figure 2 (Subsection 1.4) – are not useful
here.
Luckily, the residual plot of the residuals and the fitted values can give us
some indication of what transformations might be useful. The residual plot
shown in Figure 3(a) (in Activity 9) shows a clear quadratic relationship
between the fitted values and the residuals, which is a strong indication
that at least some quadratic (that is, squared) terms are needed in the
model. To find out which explanatory variables need to be squared, we can
look at a plot of the residuals and the values of each variable in turn, to
check if a quadratic relationship is visible. Figure 4(a), which follows,
shows such a plot for the variable temp and, for a clearer picture,
Figure 4(b) shows a plot of the means of the residuals for each value of
temp.
As you will see, the plots in Figure 4 show that the quadratic relationship
clearly persists when we consider the variable temp. This means that we
should try transforming the variable temp by taking its square, and then
using the transformed variable, temp², as a covariate in the model.


Figure 4 Plot of (a) the residuals and (b) the means of the residuals
against the values of temp

We could continue in the same manner to plot the residuals and the values
of each of the other explanatory variables in turn, to see if a similar
pattern emerges. Alternatively, we can just transform all of the
explanatory variables by taking the square of each one, use all of these
transformed variables as covariates in the model, and then use formal tests
to see which of the transformed variables are significant. We will pursue
the latter strategy here. There is, however, more than one way that we
could include the transformed variables as covariates in the model. Should
we simply replace the original covariate with the transformed version, or
should we also keep the original covariate in the model in addition to the
transformed version? For example, should we just include temp2 as a
covariate in the model, or should we include both temp and temp2 as
covariates in the model? When analysing the films dataset in
Subsection 4.2.2 of Unit 2, the variable screens was transformed to
become screens3 to improve the model fit, and we did not then include
the untransformed variable in the model. For the desilylation dataset,
however, there are two good reasons why we may want to include both the
transformed and untransformed variables as covariates in the model.
In order to explain the first reason, let’s first consider how it was decided
to fit screens³ to the films dataset. A scatterplot of the response variable
and screens had revealed a non-linear pattern. Consequently, several
transformations of screens had been tried, and the scatterplot of the
response and screens³ turned out to show a reasonably linear pattern.
Therefore the transformation screens³ was adopted. For the desilylation
dataset, we looked instead at a scatterplot of the residuals and the fitted
values (in Figure 3(a), Subsection 2.2) and a scatterplot of the residuals
and the values of temp (in Figure 4(a)). That means that on the vertical
axis we did not plot the response variable, but instead we plotted the
residuals after already fitting a model including the untransformed


variables as covariates. This gives an indication that we may need both the
linear terms in the model (that is, the untransformed variables) and the
quadratic terms (that is, the transformed squares). If these terms are not
all needed in the model, we can always remove insignificant terms using a
model selection approach such as the Akaike information criterion (AIC).
(A reminder of the AIC is given in Box 4, Subsection 2.4.)
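In R, squared terms can be added to a model formula with I(), and insignificant terms can then be removed by AIC-based selection using step(); a sketch:

m_quad <- lm(yield ~ temp + time + nmp + equiv +
               I(temp^2) + I(time^2) + I(nmp^2) + I(equiv^2),
             data = desilylation)
step(m_quad)  # backward elimination of terms using the AIC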
The second reason why we may want to include both the transformed and
untransformed variables as covariates in the model is that quadratic
models without linear terms lack flexibility. To illustrate, for simplicity
consider a model with just one variable x. Figure 5 shows typical graphs of
a quadratic function of the form y = a + bx + cx², that is, a function with
an intercept a, a linear term bx and a quadratic term cx².

Figure 5 Quadratic functions y = a + bx + cx² with (a) various values of the
quadratic parameter c, (b) various values of the intercept parameter a,
(c) various values of the intercept parameter a and the (negative) quadratic
parameter c, (d) various values of the linear parameter b


All the graphs in Figure 5 have the typical parabola shape of quadratic
functions. If the quadratic parameter c is positive, the parabola will be
open at the top, and if c is negative, the graph is flipped to be open at the
bottom.
• Figure 5(a) shows that varying the (positive) quadratic parameter c
changes how quickly the graph increases.
• Figure 5(b) shows that changing the intercept parameter a will shift the
graph up and down.
• Figure 5(c) shows the combined effect of varying the intercept
parameter a and the quadratic parameter c, this time with negative
values of c.
In Figures 5(a), (b) and (c), the minimum (when c > 0) or maximum
(when c < 0) value of the graph is always attained when x = 0. This
cannot be changed by changing the values of a or c.
• Figure 5(d) shows quadratic functions when the linear parameter b is
varied. You can see that this will give us the flexibility to have a
minimum or maximum at a different value of x.
In the desilylation experiment, the researchers are interested in finding the
values of the explanatory variables where the maximum response is
obtained. Therefore, we require a model that is flexible enough to have the
maximum at any plausible combination of values of the explanatory
variables, and not just when they are all equal to 0 (as would happen if
there were no linear terms, as in Figures 5(a), (b) and (c)).
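As a brief aside (this is just the standard calculus step, not something taken from the module text), the location of the turning point of the quadratic y = a + bx + cx2 can be written down explicitly:
\[
  \frac{\mathrm{d}y}{\mathrm{d}x} = b + 2cx = 0
  \quad\Longrightarrow\quad
  x = -\frac{b}{2c},
\]
so the minimum or maximum sits at x = 0 exactly when b = 0; a non-zero linear coefficient b is what moves it away from the origin.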
When we look at the scatterplots of the response variable versus each
explanatory variable in turn on the bottom row of the scatterplot matrix
given in Figure 1 (Subsection 1.3), we can see that the maximum is not
exactly in the centre (horizontally) of each plot, which corresponds to
when each explanatory variable is equal to 0. In the films example in
Subsection 4.2.2 of Unit 2, on the other hand, it is clear that the minimum
of the response variable income will be attained when there are no
screens. Therefore, in that case, it was sufficient to just use the
transformed variable, screens3 , without keeping the original variable
screens.
So, we’ve decided that we want to investigate adding quadratic terms to
our model for yield, in addition to the linear terms for each explanatory
variable. What other terms could we consider adding?
Well, the Solution to Activity 2 noted that the yield of alcohol seemed to
be affected differently by an increase in time, depending on the
temperature of the experiment. In order to accommodate differences
between the effect of one of the explanatory variables on the response for
different values of another explanatory variable, an interaction between the
two explanatory variables can be added to the model. So, we should also
consider adding the interaction between temp and time – that is, the
interaction term temp:time – to the model.


2.4 Adding interactions to the model


Subsection 6.3 of Unit 4 explained how we can introduce an interaction
between two covariates into a linear model: the interaction term between
two covariates x1 and x2 , say, is simply their product x1 × x2 . So, adding
the interaction temp:time into Model (3) from Subsection 1.2, where both
temp and time are covariates, we have the multiple regression model
yield ∼ temp + time + nmp + equiv + temp:time,
where the term temp:time is the covariate temp × time.
What about other interactions? We cannot say anything about them after
only seeing the first five observations from the desilylation experiment
given in Table 1 (Subsection 1.1). Moreover, while scatterplot matrices
such as in Figure 1 (Subsection 1.3) are useful to see individual
relationships between the response variable and each of the explanatory
variables, these cannot be used to determine if and how explanatory
variables affect the response jointly. Since we are planning to fit an
interaction between temp and time, and there is no reason to believe that
other explanatory variables do not interact, it is sensible to try adding all
of the interactions between pairs of explanatory variables to the model and
use formal tests to see which two-way interactions are significant.
So, bringing all of the discussions about the model together, let’s try
fitting the multiple regression model including as explanatory variables all
four covariates, the transformed squares of each covariate, and the six
two-way interactions between the four covariates. In other words, let’s fit
the multiple regression model
yield ∼ temp + time + nmp + equiv
+ temp2 + time2 + nmp2 + equiv2
+ temp:time + temp:nmp + temp:equiv
+ time:nmp + time:equiv + nmp:equiv. (4)
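If you want to try this in R, a minimal sketch is given below; the data frame name desilylation and the use of I() to create the squared terms are illustrative assumptions, not part of the module text.

# Fit the full second-order model: linear terms, squared terms and all
# two-way interactions between the four (standardised) covariates.
model4 <- lm(yield ~ temp + time + nmp + equiv
             + I(temp^2) + I(time^2) + I(nmp^2) + I(equiv^2)
             + temp:time + temp:nmp + temp:equiv
             + time:nmp + time:equiv + nmp:equiv,
             data = desilylation)
summary(model4)  # coefficient table of the kind summarised in Table 6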
The output after fitting this model to the desilylation data is summarised
in Table 6, next, and the value of the F -statistic for this fitted model
is 266.1 with 14 and 15 degrees of freedom, with associated p-value of
p < 0.001.


Table 6 Summary of output after fitting Model (4)

Parameter Estimate Standard error t-value p-value
Intercept 93.070 0.181 512.899 < 0.001
temp 4.159 0.091 45.841 < 0.001
time 1.258 0.091 13.869 < 0.001
nmp −1.088 0.091 −11.995 < 0.001
equiv 1.438 0.091 15.844 < 0.001
temp2 −2.119 0.085 −24.972 < 0.001
time2 −0.204 0.085 −2.408 0.0294
nmp2 −0.244 0.085 −2.879 0.0115
equiv2 −0.588 0.085 −6.930 < 0.001
temp:time −1.179 0.111 −10.608 < 0.001
temp:nmp 1.179 0.111 10.608 < 0.001
temp:equiv −1.386 0.111 −12.475 < 0.001
time:nmp 0.220 0.111 1.980 0.0664
time:equiv −0.323 0.111 −2.902 0.0109
nmp:equiv 0.245 0.111 2.205 0.0435

What can we ascertain from this output? We’ll begin by exploring the
results shown in Table 6 in Activity 10, and then we will compare our
results to those for Model (3) (that is, the model with just the individual
covariates as explanatory variables).

Activity 10 Exploring the output for our new model

(a) The p-value for testing that all regression coefficients are 0 is less
than 0.001. Name the distribution this p-value is obtained from and
give (and explain) its degrees of freedom.
(b) Test whether the regression model contributes information to
interpret changes in the yield of the alcohol.
(c) Which distribution are the p-values for the individual coefficients
calculated from? What’s the value of the degrees of freedom?
(d) Comment on the individual significance of all regression coefficients
by category (linear, quadratic, interaction). Would you consider
removing any terms from the model?

The p-value for the interaction time:nmp indicates that there is only weak
evidence to include this term in the model, if all other terms are in the
model. So, is this interaction needed in the model? To answer this
question, we should compare the models with and without the interaction
time:nmp, to see which model is preferable.
Box 4 provides a reminder of how the model selection criteria from
Subsection 5.2 in Unit 2 can be applied.


Box 4 Selecting a multiple regression model

We seek a model that fits the data well while being as simple as possible. In other words, we want to include the terms in the model that are needed to explain the patterns in the data while excluding unnecessary terms, following the principle of parsimony. Two model selection criteria that provide a balance between model fit and model complexity are as follows.
• The adjusted R2 statistic, also written as Ra2 , represents the percentage of variance accounted for by the model, adjusted by the number of regression terms fitted to the model.
The best model from a set of alternatives is the model with the highest value of Ra2 .
• The Akaike information criterion, usually referred to more simply as the AIC, calculates the amount of information lost by a given model relative to the amount of information lost by other models. It considers the number of explanatory variables in the model, preferring simpler models.
The best model from a set of alternatives is the model with the lowest AIC.

So, in order to decide whether the interaction time:nmp is needed in the model in addition to the other explanatory variables, we can compare
Model (4) (which includes all the covariates, their squares and the two-way
interactions between the covariates) with the model which is the same as
Model (4) but with the interaction time:nmp removed. Table 7 gives the
values of the adjusted R2 statistic and the AIC for these two models.
Table 7 Adjusted R2 and AIC for Model (4) and the model without time:nmp

Model Adjusted R2 AIC


Model (4) 0.9922 47.6910
Model without time:nmp 0.9908 52.6557
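Values of this kind can be obtained in R along the following lines (a sketch, reusing the illustrative object model4 from earlier; note that the absolute AIC values depend on an additive constant, so they may differ from Table 7 by the same amount for both models).

# Refit without the time:nmp interaction and compare the two criteria.
model4_reduced <- update(model4, . ~ . - time:nmp)
summary(model4)$adj.r.squared          # adjusted R^2 of the full model
summary(model4_reduced)$adj.r.squared  # adjusted R^2 of the reduced model
AIC(model4, model4_reduced)            # the model with the lower AIC is preferred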

Which model would be preferable based on these criteria? We'll explore this question in Activity 11.

Activity 11 Model selection

Consider the values of the model selection criteria in Table 7. Which model would you prefer, and why?

After selecting our model, the next step in the data analysis is now to
check the model assumptions through diagnostic plots. We will do this
soon in Subsection 2.5, but before we do, we’ll take a bit of a detour and in


Activity 12 we’ll consider the estimates for the coefficients that are
common to both Model (4) (which includes the four covariates, their
squares and the two-way interactions) and Model (3) (which just includes
the four covariates).

Activity 12 Comparing coefficients

Compare the estimates for the linear coefficients for Model (4) in Table 6
with those for Model (3) in Table 5. For convenience, the relevant parts of
Tables 5 and 6 are repeated here as Tables 8 and 9, respectively.
Table 8 Estimates for the linear coefficients after fitting Model (3)

Parameter Estimate Standard error t-value p-value
temp 4.159 0.588 7.068 < 0.001
time 1.258 0.588 2.138 0.0424
nmp −1.088 0.588 −1.850 0.0762
equiv 1.438 0.588 2.443 0.0220

Table 9 Estimates for the linear coefficients after fitting Model (4)

Parameter Estimate Standard error t-value p-value
temp 4.159 0.091 45.841 < 0.001
time 1.258 0.091 13.869 < 0.001
nmp −1.088 0.091 −11.995 < 0.001
equiv 1.438 0.091 15.844 < 0.001

Did you expect this result? Where in M348 have you seen a similar
situation with a different outcome?

Recall that the partial regression coefficient in the multiple regression model has a different interpretation to the coefficient in the simple linear regression model.
• A partial regression coefficient in the multiple regression model measures
the effect of an increase of one unit in the corresponding explanatory
variable while treating the values of the other variables as fixed.
• A regression coefficient in the simple linear regression model represents the effect of an increase of one unit in the corresponding explanatory variable, assuming that the other explanatory variables are not in the model.
Here, we are in a similar situation, where both models are multiple linear
regression models, but Model (3) is nested within Model (4), so that all
the terms in Model (3) are also in Model (4). We would therefore expect
the estimates for the coefficients of the common terms to be different.
We do not, in general, get the same estimates for coefficients of common
terms in nested models, but (from Activity 12), Models (3) and (4) do


have the same estimates for their common parameters. It turns out that
the desilylation dataset is a special case. So, why is this happening? What
makes the desilylation dataset different to the datasets seen in Unit 2?
Well, the difference lies in the way the desilylation experiment has been
designed. The statistician who helped the chemists design the experiment
deliberately chose to do it this way. This is why the scatterplots of
pairwise explanatory variables in Figure 1 (Subsection 1.3) have this
particularly symmetric and balanced structure. (A detailed discussion of
how exactly to approach the design problem is beyond the scope of M348.)

2.5 Checking the assumptions of the selected model
Activity 11 concluded that the preferred model for the desilylation data is
Model (4) – that is, the model with all four covariates, their squares, and
the two-way interactions between them as the explanatory variables. So,
now we need to check the assumptions for this model. We’ll start by
taking a look at the residual plot and the normal probability plot for this
model in Activity 13.

Activity 13 Interpreting the residual plot and the normal probability plot
The plot of residuals against fitted values and the normal probability plot
after fitting Model (4) to the desilylation data are given in Figure 6.

Figure 6 The residual plot (a) and normal probability plot (b) for Model (4)

(a) Does the residual plot in Figure 6(a) support the assumption that the
Wi ’s have zero mean and constant variance?
(b) Does the normal probability plot in Figure 6(b) support the
assumption that the Wi ’s are normally distributed?


The plots for Model (4) in Figure 6 show a big improvement over their
counterparts from our initial model, Model (3), in Figure 3
(Subsection 2.2). It therefore seems reasonable to conclude from
Activity 13 that the assumptions on the Wi ’s are satisfied for Model (4).
Another step in the assessment of Model (4) is to look for influential
points. Recall from Subsection 3.2 of Unit 2 that a data point is likely to
be influential if it has both high leverage and a large residual. An easy way
to look for potentially influential points is by using the residuals versus
leverage plot; this plot for Model (4) is shown in Figure 7.

Figure 7 The residuals versus leverage plot for Model (4)

The residuals versus leverage plot in Figure 7 shows that there are only
two different values of leverage: six points have leverage of
approximately 0.17, while all the other points have leverage of just
under 0.6. (Again, this is a feature of the design of the experiment for the
desilylation dataset. The six points with low leverage are in the centre of
the experimental region, i.e. where all covariates have the (standardised)
value of 0, and the other points all have the same distance from the
centre.) The most influential points are therefore those which are in the
higher-leverage group and have large standardised residuals. Here,
point 13 in the top-right corner of the plot is most influential as its
standardised residual has the highest absolute value.
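For reference, the quantities behind plots of this kind can be obtained from a fitted lm object using base R functions, as sketched below (again using the illustrative object model4; the module's own figures may have been drawn differently).

hatvalues(model4)        # leverages
rstandard(model4)        # standardised residuals
cooks.distance(model4)   # Cook's distance for each observation
plot(model4, which = 5)  # residuals versus leverage plot
plot(model4, which = 4)  # Cook's distance plot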
In order to decide whether point 13 in Figure 7 actually is an influential
point, we can use the Cook’s distance, as described in Subsection 3.3 of
Unit 2. A Cook’s distance plot gives a graphical representation of the


Cook’s distance values for a dataset; this plot for Model (4) is shown in
Figure 8.

Figure 8 The Cook’s distance plot for fitted Model (4)

We’ll interpret this Cook’s distance plot in Activity 14.

Activity 14 Interpreting the Cook’s distance plot

Explain why the Cook’s distance plot given in Figure 8 suggests that there
aren’t any problems with influential points for these data and Model (4).

Overall, in this section, we have looked at ways to improve an initial model, and we have found a model that fits the data from the desilylation
dataset well and satisfies the model assumptions. We’ll leave the
desilylation dataset for now, but will revisit it in Subsection 5.4 where we
will use Model (4) to answer the research question behind the experiment.


3 Multiple regression with factors


So far in this unit, we have focused on data where all explanatory variables
are numerical, that is, they are covariates. In this section, we will analyse
a dataset where the explanatory variables are categorical, that is, they are
factors, while the response variable is still assumed to follow a normal
distribution.
You saw in Unit 3 how a single factor can be incorporated into the linear
model framework: level one of the factor is set to be the baseline level, and
for all other levels we assign an indicator variable telling us if, for each
respective observation, the factor is at this level or not. We can then fit a
multiple regression model where the indicator variables are treated as
covariates. The coefficient of such an indicator variable is the effect that
the corresponding level of the factor has on the response in comparison to
the effect that the baseline level has on the response.
This idea was extended in Unit 4 to linear models with arbitrary numbers
of covariates and factors, using the same indicator variable strategy to
incorporate the factors into a multiple regression model.
In this section, we’ll review how to incorporate factors into a model,
focusing in particular on applying the techniques to a dataset from biology.
This dataset is explored next in Subsection 3.1. Then, in Subsection 3.2,
the focus shifts to modelling the data.
We can test whether there is a relationship between the response and a
factor through an F -test, as explained in Subsection 3.2 of Unit 3. The
same model fit can also be presented in the form of an ANOVA table, as
explained in Subsection 5.4 of Unit 3, which also provides the result of the
F -test; the ANOVA table for these data is considered in Subsection 3.3.
To round off this section, in Subsection 3.4 we’ll address the question of
whether our chosen model is a good one, and then have another quick look
at how the biology experiment was planned and its impact on our analysis
in Subsection 3.5.

3.1 Introducing a dataset from biology


In this section, we’ll be focusing on modelling a dataset with two factors.
The dataset contains data from an experiment in biology and is described
next.

Measuring the effect of a toxin on the survival of cells


Biologists at the Ruhr University Bochum, Germany, were interested
in the effect of a toxin on the survival of certain cells. In particular,
they wanted to know if the toxin will change the proportion of
surviving cells in a sample. To find out, they conducted an
experiment, called a bioassay (a method to determine the strength of


a substance by comparing its effects on living cells or organisms with the effects of a standard substance or no treatment at all). The
survival of the cells is measured using a measure called the optical
density, with high values of the optical density indicating high
proportions of survival.
The cells dataset (cells)
In their experiment, the researchers treated six samples of the cells
with the toxin of interest and left six further samples untreated. They
then ran the experiment over two consecutive days. On the first day,
measurements of the optical density were taken on three of the treated
samples and three of the untreated samples. Then, on the second day, measurements of the optical density were taken on the three remaining treated samples and the three remaining untreated samples. (When miners took canaries into coal mines, they were carrying out a bioassay experiment on the gases: if the canary experienced ill effects, the miners would exit the mine.)
The response variable of interest for this dataset is:
• opticalDensity: the optical density measurement (to one decimal place).
There are two potential factors:
• treatment: an indicator variable taking the value 1 if the sample
was treated with the toxin, and the value 0 if the sample was
untreated
• day: the day the sample was measured, taking the value 1 if the
measurement was taken on the first day of the experiment, and the
value 2 if the measurement was taken on the second day.
Since there are only 12 observations in this dataset (for the six treated
samples and the six untreated samples), Table 10 shows the values of
the full cells dataset.
Table 10 All 12 observations of cells

opticalDensity treatment day


1.3 0 1
1.1 0 1
1.0 0 1
0.8 1 1
0.6 1 1
0.5 1 1
0.8 0 2
0.7 0 2
0.5 0 2
0.4 1 2
0.3 1 2
0.2 1 2

Source: Biedermann, 2006


It’s usually a good idea to have a look at the data first. Figure 9 shows the
observed values of the optical density plotted against the day of
observation, with the observations for untreated samples distinguished
from those for treated samples.

Figure 9 Values of the optical density on two different days for treated and untreated samples

We’ll interpret Figure 9 next, in Activity 15.

Activity 15 Interpreting the cells dataset

(a) From Figure 9, would you say that the optical density is affected by
the toxin (factor treatment)? Is this what you had expected?
(b) What about the factor day? Is the distribution of responses different
on different days of measuring, and would you have expected this
result?
(c) Would you say the effect of the factor treatment is roughly the same
on both days?

Following on from Activity 15, the distribution of the response opticalDensity differs on the two different days. This is not necessarily
expected. Why would a measurement be affected by the day on which it’s
been taken?


According to one of the biologists involved in the research, it’s actually common for this to happen in a laboratory. This type of measurement is
very sensitive to laboratory conditions and to the measuring device, so it’s
normal to observe day-to-day fluctuations like those seen in Figure 9.
This shows that it’s important to talk to an expert in the application area
of the data, to get a better understanding.
In Activity 15, Figure 9 helped us get a feel for the data. It can also help
us to build a model for the data; we’ll do this next.

3.2 Modelling the data


The visual representation of the cells dataset given in Figure 9 provides us
with some initial ideas about a model for opticalDensity. Activity 16
will explore these ideas.

Activity 16 Initial model ideas

The response variable opticalDensity is continuous, so let’s consider fitting a linear regression model. Use your results from Activity 15 to
conjecture which of the following terms may be needed for a good fit.
(a) The factor treatment.
(b) The factor day.
(c) The interaction between treatment and day.

One way to select a model more rigorously than just by looking at the data
is to look at the values of the adjusted R2 statistic and the AIC for
competing models. Table 11 provides these values for various linear models
for the cells data.
Table 11 Adjusted R2 and AIC for various linear models for the cells data

Model Adjusted R2 AIC


Null model 0 10.589
Model with treatment 0.409 5.128
Model with day 0.334 6.568
Model with treatment and day 0.826 −8.795
Model with treatment and day and their interaction 0.819 −7.755

We’ll consider which model we’d prefer based on the results given in
Table 11 next, in Activity 17.


Activity 17 Model selection for the cells data

Explain why the preferred model based on the values in Table 11 is
opticalDensity ∼ treatment + day.

Is this the model that you expected to be the preferred model from
Activity 16?

Following on from Activity 17, based on the adjusted R2 statistic and the
AIC, our preferred model for the cells dataset is
opticalDensity ∼ treatment + day. (5)
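A minimal sketch of fitting this model in R is given below; the data frame name cells is an illustrative assumption. Since treatment and day are coded numerically in the dataset, they need to be treated as factors when fitting.

cells_fit <- lm(opticalDensity ~ factor(treatment) + factor(day),
                data = cells)
summary(cells_fit)  # coefficient table of the kind shown in Table 12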
The output after fitting this model to the cells data is given in Table 12.
Table 12 Summary of output after fitting Model (5) to the cells data

Parameter Estimate Standard error t-value p-value
Intercept 1.100 0.069 15.853 < 0.001
treatment 1 −0.433 0.080 −5.408 < 0.001
day 2 −0.400 0.080 −4.992 < 0.001

Recall from Subsection 3.1 of Unit 3 that, when we have a factor, there’s a
parameter estimate for the effect for each level of the factor in comparison
to the effect of the first level of the factor. So, ‘treatment 1’ means that
this row corresponds to the indicator variable for when the sample is
treated with the toxin, that is, when treatment = 1, and ‘day 2’ means
that this row corresponds to the indicator variable for the second day, that
is, when day = 2. The factor levels treatment = 0 and day = 1 have been
used as the baseline levels, and the effects of these are part of the intercept
parameter.
The output given in Table 12 is very similar to what we have seen for
covariates, that is, for continuous explanatory variables: the table gives a
row for each parameter estimate, its standard error and the p-value
associated with it. What are the p-values testing? Let’s have a closer look
in Activity 18.

Activity 18 Interpreting Table 12

The row corresponding to the factor treatment, has a p-value less than
0.001.
(a) What is the null hypothesis and what is the alternative hypothesis
corresponding to this p-value?
(b) What is the value of the test statistic for the test in part (a), and
what would be its distribution if the null hypothesis were true?


(c) Based on the evidence in Table 12, do you think the toxin affects cell
survival?

In Unit 3, we showed that the analysis of a dataset where the explanatory variable is a factor can also be presented in an ANOVA table, where we can
immediately see how much of the variability in the responses is explained
by the factor. We’ll consider the ANOVA table for our fitted model next.

3.3 The ANOVA table


When we introduced ANOVA in Unit 3, we only had one factor in the
model. This idea can be extended to more than one explanatory variable,
as long as all of them are factors. The interpretation changes in the same
way as the interpretation of a linear regression model changes when going
from simple to multiple linear regression: each factor has its own explained
sum of squares (ESS), which is calculated when the other factors have been
fitted to the model. As an example, the ANOVA table when fitting
Model (5) from Subsection 3.2 is shown in Table 13.
Table 13 ANOVA table for Model (5)

Source of variation  Degrees of freedom  Sum of squares  Mean square  F-value  p-value
treatment 1 0.563 0.563 29.250 < 0.001
day 1 0.480 0.480 24.923 < 0.001
Residuals 9 0.173 0.019

Here, the ESS for the factor treatment is the variability in optical density
that is explained by the different treatments (toxin or no toxin), given that
the factor day has also been fitted to the model, and the ESS for the factor
day is the variability in optical density that is explained by the different
days, given that the factor treatment has also been fitted to the model.
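In R, a table of this kind can be produced by applying anova() to the fitted model (a sketch, using the illustrative object cells_fit from above).

# anova() reports sequential sums of squares, each term adjusted for the
# terms listed above it. Because this experiment is balanced (three treated
# and three untreated samples on each day), the treatment and day sums of
# squares are the same whichever order the factors are entered, so the
# output matches the 'adjusted for the other factor' interpretation.
anova(cells_fit)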
We actually get the same p-values for the factors in both Table 12 and
Table 13. This is because the factors here have only two levels.
For factors with more than two levels, k say, we would get k − 1 p-values in
the regression table (for the coefficients comparing each level of the factor
with the baseline level). This can make the table of regression coefficients
look quite messy! In contrast, we only have one p-value associated with
each factor in the ANOVA table.
Also recall that, when there are more than two levels for a factor, the
p-values for the individual levels of the factor aren’t terribly useful. This is
because we either include the factor, in which case we need to include the
corresponding indicator variables for all of the factor levels, or we don’t
include the factor, in which case we don’t need any of the indicator
variables for the factor levels. So, when the factors have more than two
levels, the F -values in the ANOVA table are used for testing whether each
factor is required in the model. In this case, the p-value for a factor in the


ANOVA table will not be the same as the p-values associated with each
level of the factor.
The ANOVA table is useful for presenting the information required for
testing whether each factor is required in the model in a concise way, but
the regression table is also useful since it might have extra information
that is needed for specific questions of interest. For example, if we wanted
to predict a new response, we would need the coefficients from the
regression table.

3.4 Is the model a good one?


According to the adjusted R2 statistic and the AIC given in Table 11
(Subsection 3.2), Model (5) is the preferred model for the cells dataset
when compared with various other possible models. But is it a good model
that satisfies the model assumptions we have made? It remains to look at
the diagnostic plots. These are given in Figure 10.

Figure 10 The residual plot (a) and the normal probability plot (b) for fitted Model (5)

The diagnostic plots in Figure 10 do not flag up any cause for concern.
The residual plot of residuals and fitted values shows reasonably equal
spread about 0, and the normal probability plot shows that the
standardised residuals are close to the straight line. Minor deviations in
these plots are expected as we have a very small dataset. We conclude that
Model (5) provides a good fit for the cells data.


3.5 Another look at how the experiment was planned
We will conclude our modelling exercise of the cells data by having another
look at how the experiment was planned. The biologists had 12 samples for
experimentation. It therefore made sense to treat six of these with the
toxin and to leave the other six untreated for comparison. They also knew
that they could only take six measurements per day, so had to decide
which samples should be measured on which day. They decided to use
three treated and three untreated cells on each of the two days, as they
knew there can be high day-to-day variation in this type of experiment.
(By the way, factors such as day, which we are not interested in in their
own right, but which may cause variability in the response values, are also
called ‘blocks’.) You will explore the advantage of their plan in Activity 19.
Activity 19 Planning the cells experiment

Consider the plot of data from the cells dataset given in Figure 9
(Subsection 3.1) and answer the following questions.
(a) Suppose that the biologists had measured all of the treated samples
on Day 1 and all the untreated ones on Day 2. What could be a
problem here?
(b) Now suppose that the biologists had measured all of the untreated
samples on Day 1 and all the treated ones on Day 2. What could be a
problem this time?
(c) Hence, was it a good idea to plan the experiment in the way it was
run?

That concludes our review of linear models. In the next section, we’ll turn
our attention to generalised linear models.

4 The larger framework of generalised linear models
So far in this unit, we have focused on linear models, reviewing material
covered in Units 1 to 5 of M348. In this section, we’ll consider the larger
modelling framework of generalised linear models – that is, GLMs – which
were the focus of Units 6 to 8 of this module.
We’ll start in Subsection 4.1 by reviewing GLMs. We’ll then use GLMs to
model three separate datasets: we’ll model a dataset with a Poisson
response in Subsection 4.2, a dataset with a binary response in
Subsection 4.3, and some contingency table data in Subsection 4.4.


4.1 Review of GLMs


We will start this section with a brief review of GLMs. Box 5 summarises
the key components of a GLM.

Box 5 The GLM


The relationship between a response variable Y and a set of
explanatory variables follows a GLM if
• Y1 , Y2 , . . . , Yn all have the same distribution, but each Yi has a
different mean
• for i = 1, 2, . . . , n, the regression equation has the form
g(E(Yi )) = ηi ,
where g is a link function and ηi is the linear predictor.
The link function, g, is the function which links the mean response
E(Yi ) to the linear component of the model, ηi .

How do the linear models we have met so far in this unit fit into this
framework? Well, in linear models, the distribution for the response
variable is the normal distribution, and the link function, g, is the ‘identity
link’, so that g(E(Yi )) = E(Yi ). So, when modelling a dataset using a
linear model, we therefore only had to decide which terms to include in the
linear predictor.
The definition of a GLM allows us to be more flexible. We can now fit
responses that are not normally distributed, such as, count data. This
means we have to make three decisions in the modelling process:
• which distribution we want to fit to the response variable
• which link function to use
• which terms to include in the linear predictor.
How do we know which distribution to pick? In many cases, a distribution
will spring to mind quite naturally considering the nature of the response.
Often, this will result in a good fit, but keep in mind that there may be
datasets where a different distribution provides a better fit. In this
module, we considered the following response distributions for GLMs.
• The normal distribution is a natural choice when the data are
continuous and their distribution roughly follows a bell-shaped curve.
For example, a normal distribution may be a natural choice to model the
heights of footballers.
• The Bernoulli distribution is a natural choice when the data have two
possible outcomes, often called ‘success’ and ‘failure’. For example, you
will either pass or fail your driving test, or a footballer taking a penalty
kick may, or may not, score a goal.


• The binomial distribution is related to the Bernoulli distribution and describes the number of successes out of a pre-selected number of trials.
For example, if we group learner drivers by driving instructor, then, for
each group, we could use the binomial distribution to model the number
of learners who passed the driving test, out of the total number of
learners in the group. Similarly, the number of penalties that resulted in
a goal, out of the total number of penalty kicks awarded to a football
team, can be modelled by a binomial distribution.
• The Poisson distribution is typically used to model count data. For
example, the Poisson distribution could be used to model the number of
people ahead of you in a supermarket queue, or the number of times
‘filler words’ such as ‘um’, ‘er’, ‘like’, ‘well’, and ‘you know’ are used in a
conversation.
• The exponential distribution is often used to model ‘time to event’ data.
For example, the response variable could be the time until your mobile
phone breaks, the duration of a period of unemployment, or the length
of time of a hospital stay.
The exponential distribution is a special case of a distribution called the
‘gamma distribution’, which can also be used to model ‘time to event’
data. If the exponential distribution is not a good fit for a ‘time to
event’ dataset, the more flexible gamma distribution may be a natural
next step.
In Activity 20, you will have the opportunity to reflect on some further
examples.

Activity 20 Which distribution?

Which distribution could be used to model the responses in the following situations?
(a) A manufacturer of electric cars measures the distance travelled in a
single trip for a sample of cars on fully charged batteries.
(b) For each policy holder, a car insurance company records whether or
not they have made a claim in the last year.
(c) A car insurance company records the number of claims each policy
holder has made in the last year.
(d) A car insurance company records the number of policy holders who
have made a claim in the last year.
(e) A biologist measures the height of British bluebells (Hyacinthoides non-scripta) in several places in Birmingham, UK.


Often there seems to be a ‘natural’ or ‘obvious’ distribution to model a dataset, but this choice is not definitive. Sometimes more than one model will fit
the data well, and sometimes none of those we try will. Also, a
distribution that is not the obvious choice for the response variable may fit
better than a more natural distribution.
For example, in Unit 5 we saw that a linear model based on a normal
response provided an acceptable fit for the Olympics dataset, although the
response variable – the number of Olympic medals each country achieved –
is a count. In fact, in Notebook activity 7.3, when we revisited the
Olympics dataset and fitted a GLM with a Poisson response and the same
linear predictor, we found that, based on the MSE, the predictive
performance of the linear model was better than the predictive
performance of the Poisson GLM!
So, once we have decided which distribution to use for our response
variable, the next decision is which link function to use. In M348, the link
functions summarised in Table 14 were used for the response variable
distributions considered; by choosing the link function for the response as
given in this table, the decision of which link function to use is made for
us! Note that, as indicated in Table 14, in M348 we’ve used the canonical
link functions for normal, Bernoulli, Poisson and binomial responses, but
used a non-canonical link function for exponential responses.
Table 14 The link functions used in M348

Response  Link function g  Link name  Canonical?
Normal  g(E(Yi)) = E(Yi)  identity  yes
Bernoulli  g(E(Yi)) = log(E(Yi)/(1 − E(Yi)))  logit  yes
Poisson  g(E(Yi)) = log(E(Yi))  log  yes
Exponential  g(E(Yi)) = log(E(Yi))  log  no
Binomial  g(E(Yi)) = log(E(Yi)/(1 − E(Yi)))  logit  yes

The final decision required for a GLM is which terms to include in the
linear predictor. For this, we can use the same techniques that we’ve used
for selecting which terms to include in a linear model.
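In R, these three choices map directly onto a glm() call: the family argument sets the response distribution and, by default, its canonical link, while the model formula sets the linear predictor. A generic sketch is given below (the data frame mydata and the variables y, x1 and x2 are placeholders, not from the module text).

glm(y ~ x1 + x2, family = gaussian, data = mydata)  # normal response, identity link
glm(y ~ x1 + x2, family = binomial, data = mydata)  # Bernoulli/binomial response, logit link
glm(y ~ x1 + x2, family = poisson, data = mydata)   # Poisson response, log link
glm(y ~ x1 + x2, family = Gamma(link = "log"),
    data = mydata)                                  # gamma response (covers exponential-type data), log link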
In the following subsections, we will have a close look at three different
datasets, which we’ll use to build generalised linear models.


4.2 Example: Modelling citations of published articles
In this subsection, we’ll use a GLM to model a dataset concerning
citations of published articles. The dataset is described next.

Citations of published articles


In order to disseminate their results, researchers publish articles in
peer-reviewed academic journals. These articles are then cited in work
that builds on them. As a rule of thumb, the more groundbreaking
the article, the more it is cited in the literature. Typical numbers of
citations vary between disciplines. In statistical methodology, for
example, it is rare to achieve more than 100 citations for an article,
with numbers typically in single digits or low double digits.
The citations dataset (citations)
A member of the Statistics group at the School of Mathematics and
Statistics at the Open University has provided her citation data (as of
10 February 2021, according to Scopus). The dataset contains the
following variables:
• numCitations: the number of times the article has been cited (response variable)
• pubYear: the year the article was published
• yearDiff: a variable created as yearDiff = 2021 − pubYear, to
reflect how long an article has been published
• journal: a factor indicating the type of academic journal, taking
the coded values 0 (for standard statistics journal), 1 (for
prestigious statistics journal) and 2 (for medical journal).
For the analysis, we will use yearDiff as an explanatory variable
instead of pubYear.
The values for the first five observations (from a total of 23) of the
citations dataset are given in Table 15.
Table 15 First five observations from citations

numCitations pubYear yearDiff journal


0 2019 2 0
3 2018 3 0
3 2018 3 0
1 2018 3 0
2 2017 4 0

Source: Elsevier, 2021


As an initial model for the citations dataset, let’s try fitting a Poisson
GLM (with a log link) of the form
numCitations ∼ yearDiff + journal + yearDiff:journal. (6)
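A sketch of fitting Model (6) in R is given below; the data frame name citations is an illustrative assumption, and journal is assumed to be stored as a factor.

fit6 <- glm(numCitations ~ yearDiff + journal + yearDiff:journal,
            family = poisson, data = citations)
summary(fit6)  # coefficient table of the kind shown in Table 16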
We’ll consider this model in Activity 21.

Activity 21 An initial model

Explain why Model (6) would be a first good GLM for the citations
dataset.

Table 16 shows the estimated coefficients after fitting Model (6).


Table 16 Summary of output from R after fitting Model (6)

Parameter Estimate Standard error t-value p-value
Intercept 0.964 0.223 4.330 < 0.001
yearDiff 0.082 0.015 5.412 < 0.001
journal 1 2.701 0.375 7.205 < 0.001
journal 2 2.310 0.140 16.548 < 0.001
yearDiff:journal 1 −0.073 0.026 −2.813 0.005
yearDiff:journal 2 NA NA NA NA

What happened here? Why are we not getting an estimate for the ‘slope’
of yearDiff when journal takes the value 2?
In order to answer this question, we need to have a look at the data. (In
fact, we should have done this before we even started proposing a model!)
Figure 11 shows a scatterplot of the number of citations (numCitations)
and the time since the article was published (yearDiff), with the different
journal types identified.
We’ll consider Figure 11 next in Activity 22.

Activity 22 Looking more closely at the citations dataset

Consider Figure 11 showing the data from the citations dataset.


(a) What conclusions can you draw from Figure 11 about the relationship
between the type of journal and the number of citations?
(b) What do you notice about the data for medical journals?


Figure 11 Scatterplot of numCitations and yearDiff, with the different values of journal identified

As there is only one observation for which journal takes value 2, could this
be the cause of the problem with the interaction term?
Unfortunately, the answer is ‘Yes’. Having only one observation in the
dataset for which journal takes value 2 means that we cannot estimate a
separate slope for this category. It is clear that in order to estimate a
slope, there must be at least two observations, or, in other words, we need
at least two points to draw a line (of best fit) between them. (Imagine you
tried to fit a line to the single observation for medical journals. You could
fit the intercept as the value of the response for this one data point, but
then there is no way to know what the slope should be.) So, sadly, we have
to give up on Model (6). What can we do?
Well, the best way round this problem is to get more data! (In this case,
that means more articles published in medical journals, and their numbers
of citations.) However, given the context of the data, this is easier said
than done! One possible solution in such a situation (where there are too
few data in some of the categories of a factor) could be to combine some of
the categories. This is sometimes called conflation. You need to make sure
though that this makes sense from the context. For example, here it might
seem tempting to have just two categories: one category for standard
statistics journals and a second category combining prestigious statistics
journals with medical journals. Activity 23 considers this approach.


Activity 23 Combine categories?


Explain why combining prestigious statistics journals and medical journals
into a single category of journal may not be sensible.

Combining prestigious statistics journals with medical journals into a single journal category may not be sensible. So, we need to think of
another approach.
Another possible solution could be to consider the importance of medical journals. Are we interested in making comparisons between the numbers of
citations for statistics articles in medical journals with the numbers of
citations for statistics articles in (different types of) statistics journals? If
the answer is ‘no’, then we might consider removing the observation
corresponding to the medical journal. In this particular case, where there
is only one observation in a category, removing this observation will not
change the conclusions for the remaining categories, and would enable us
to fit the interaction between the two explanatory variables. Therefore,
removing this observation would be an option here.
Alternatively, as we cannot fit an interaction between the two explanatory
variables in the full dataset, we could consider fitting a model without the
interaction term in the linear predictor – that is, we could fit a Poisson
GLM (with log link) of the form
numCitations ∼ yearDiff + journal. (7)
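In R, this simply amounts to dropping the interaction from the previous call (a sketch, reusing the illustrative object fit6).

fit7 <- update(fit6, . ~ . - yearDiff:journal)
# equivalently:
# fit7 <- glm(numCitations ~ yearDiff + journal,
#             family = poisson, data = citations)
summary(fit7)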
Let’s fit Model (7) and check if it’s a good fit to the citations data. First,
we’ll investigate the residual deviance for Model (7) in Activity 24.

Activity 24 Interpreting the residual deviance for Model (7)

When fitting Model (7), the null deviance is 462.801 with 22 degrees of
freedom, and the residual deviance is 80.933 with 19 degrees of freedom.
(a) Is there a significant gain in fit for Model (7) compared with the null
model?
(b) Do you think Model (7) is a good fit to the citations data? What
could be an issue here?

In Activity 24, we detected a possible problem with overdispersion for Model (7). A poor model fit might also be caused by other problems, such
as important explanatory variables missing from the model or the model
needing a different link function. For a broader overview of the model fit,
let’s look at the diagnostic plots for the model. (Diagnostic plots for GLMs
were discussed in Subsection 6.2 of Unit 7.)
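Before turning to those plots, a quick numerical check of the overdispersion issue flagged in Activity 24 can be made as sketched below (reusing the illustrative object fit7): for a well-fitting Poisson model, the residual deviance should be roughly comparable to its residual degrees of freedom.

deviance(fit7)                       # residual deviance (80.933 for Model (7))
df.residual(fit7)                    # residual degrees of freedom (19 here)
deviance(fit7) / df.residual(fit7)   # a ratio well above 1 points to overdispersion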
The diagnostic plots for Model (7) are given in Figure 12. Plot (a) shows the standardised deviance residuals against a transformation of µ̂, plot (b)
shows the standardised deviance residuals against index, taking the data in
order of publication date, plot (c) shows the squared standardised deviance
residuals against index number, again taking the data in order of
publication date (where the red circles denote positive residuals and the
blue triangles denote negative residuals), and plot (d) shows the normal
probability plot. We’ll consider Figure 12 in Activity 25.

Figure 12 Residual plots for Model (7) for the citations data

Activity 25 Assessing the diagnostic plots for Model (7)


Do the diagnostic plots in Figure 12 suggest that the assumptions for the
Poisson GLM Model (7) are met?


While we can fit all of the terms in Model (7), the residual deviance for the
model considered in Activity 24 suggests that the model may not be a
good fit, and the diagnostic plots considered in Activity 25 suggest that
some of the model assumptions may be questionable. We will revisit this
dataset later, in Subsection 6.2, to explore further modelling options.

4.3 Example: Modelling a dose escalation trial
We will now analyse a dataset from what is known as a ‘dose escalation
trial’ for a new medical treatment. The dataset is described next.

A dose escalation trial


Dose escalation trials are often used in the early stages of developing
novel treatments, when the aim is to find a safe dose for
administration in humans. The dose of the test drug is increased a
little at a time in different groups of volunteers until the highest dose
that does not cause harmful side effects is found.
The data on which decisions are based are usually binary indicators of whether a patient experienced a severe side effect (toxicity) or not. A common assumption in dose escalation trials is that toxicity increases
with the dose of the treatment. It is further assumed that increasing
the dose also increases the efficacy of the treatment.
In cancer trials, some level of toxicity associated with the dose of the
treatment is deemed acceptable, since the potential of curing a patient
can outweigh the dangers of toxicity. Researchers seek the highest
dose whose probability of toxicity is equal to or less than a
pre-specified value. This value can be as high as around one third in
some cases, which means that the recommended dose for investigating
efficacy is expected to cause toxicity in up to 1 in 3 of patients in later
phase trials.
The dose escalation dataset (doseEscalation)
This dataset describes the outcomes of a Phase I (safety) trial for the
drug temozolomide in 49 patients with certain types of cancer.
In this trial, some of the patients had experienced a specific prior
treatment, whereas the other patients had not. In similar trials of the
treatment, patients who had had the prior treatment responded to the
doses differently to those who hadn’t had the prior treatment. So,
whether or not the patient had the prior treatment was used as a
‘biomarker’ in the trial: those patients who had the prior treatment
were said to be ‘biomarker positive’ and those who didn’t have the
prior treatment were said to be ‘biomarker negative’. (A biomarker is
used as an indication that a biological process in the body has
happened or is ongoing.)


The response variable is:


• toxicity: an indicator for adverse reaction, taking the values 1 if
the patient has an adverse reaction, and 0 otherwise.
There are two explanatory variables of interest:
• dose: the dose of the drug administered (in mg/m2 )
• biomarker: the value of the biomarker for each patient, taking the
values 1 if the patient is biomarker positive, and 0 if the patient is
biomarker negative.
The data for the first five patients from the dose escalation dataset
are shown in Table 17. So, the first five patients from the dose
escalation dataset all had a dose 100 mg/m2 , didn’t have the prior
treatment and didn’t have an adverse reaction.
Table 17 First five observations from doseEscalation

toxicity dose biomarker


0 100 0
0 100 0
0 100 0
0 100 0
0 100 0

Source: Data originally reported in Nicholson et al., 1998

A useful summary of the dataset is shown in Table 18. This table shows the proportion of toxicity ‘events’ for the patients in each dose
and biomarker group. For example, the value in the first row and
column of the table is 0/5, which means that, of the five patients who
had a dose of 100 mg/m2 and didn’t have the prior treatment, none
had an adverse reaction. The bottom row gives the pooled proportion
of toxicity events for each of the doses, while the last column gives the
pooled proportion of toxicity events for each of the biomarker groups.
Table 18 Proportion of toxicity events observed by dose (mg/m2 ) and
biomarker group

Dose (mg/m2 )
Biomarker group 100 150 180 215 245 260 Total
Biomarker negative 0/5 0/4 0/4 0/6 2/7 1/1 3/27
Biomarker positive 1/6 0/4 0/8 2/4 – – 3/22
Pooled data 1/11 0/8 0/12 2/10 2/7 1/1 6/49

Source: A partial reproduction of Table 1 in Cotterill and Jaki, 2018


In medical studies on toxicity (and also on efficacy), the logarithm of the dose tends to provide a better fit than the dose itself. We will therefore use
the log-transformation on the variable dose in what follows. In particular,
following the analysis in Cotterill and Jaki (2018), instead of the
explanatory variable dose we will use the transformed variable
 
logDose = log(dose/200 + 1). (8)
200
Why are we dividing the dose by 200 and adding 1 here? Dividing by 200
is not strictly necessary. It standardises the doses to values around 1 and
makes them less spread out, but will not change the overall conclusions. A
dose of 0 corresponds to a placebo in the medical context and the
logarithm is not defined when its argument is 0. Therefore, many
statisticians add 1 (or another small positive value) to the dose when the
log-transformation is used, so that the placebo dose can potentially be
included in the analysis. (This is also common practice in other
application areas where the explanatory variable may take the value 0.)
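A sketch of constructing this variable in R (the data frame name doseEscalation is an illustrative assumption):

doseEscalation$logDose <- log(doseEscalation$dose / 200 + 1)
# For the first five patients, who all had dose = 100, this gives
# log(100/200 + 1) = log(1.5), which is approximately 0.41.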
There are three choices we need to make when fitting a generalised linear
model: the distribution of the response variable, the link function and the
form of the linear predictor.
• Distribution: The response variable is binary, no toxicity or toxicity,
Yi = 0 or Yi = 1. This means our responses, Y1 , Y2 , . . . , Y49 come from a
Bernoulli distribution and it is thus appropriate to use the Bernoulli
distribution for modelling here.
• Link function: From Table 14 (Subsection 4.1), the logit link is the
canonical link function for the Bernoulli distribution. It therefore makes
sense to fit a logistic regression model as introduced in Unit 6.
• Linear predictor: We have two explanatory variables, the covariate dose
and the factor biomarker. We are interested in how these affect the
response, so both should be fitted to the model. As already explained,
instead of dose, we’re using its log-transformed version – that is,
logDose – as defined in Equation (8).
What about an interaction between them? This would give more
flexibility to the model and in particular could incorporate the situation
where the effects of the dose on the toxicity is different for the two
values of biomarker.
Alternatively, we could use the summaries in Table 18 directly, and model
the number of toxicity events in each dose and biomarker group through a
binomial distribution with a logit link and the same linear predictor. The
results would be exactly the same.
A third option could be to rewrite Table 18 as a contingency table and to
fit a log-linear model as introduced in Unit 8.
Here, we’ll model the data using a logistic regression model; Activity 26
considers why this would be a natural choice given the context of the data.


Activity 26 Modelling for the dose escalation data

Explain why a logistic regression model would be a natural choice of model for the dose escalation dataset.

So, here we’ll fit a logistic regression model of the form


toxicity ∼ logDose + biomarker + logDose:biomarker. (9)
From Table 14 (Subsection 4.1), the (canonical) link function, g, for a logistic regression model with a Bernoulli response is the logit link given by
g(E(Yi)) = log(E(Yi)/(1 − E(Yi))).
Therefore, from Subsection 2.3 in Unit 6,
log(E(Yi)/(1 − E(Yi))) = ηi,
where ηi is the linear predictor. So, since E(Yi) = pi, where pi is the probability of an adverse reaction (that is, the ‘success’ probability) for the ith patient, we have that
pi = exp(ηi)/(exp(ηi) + 1). (10)
Recall that Equation (10) shows the equation of the logistic function (as
described in Subsection 2.2 in Unit 6).
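A sketch of fitting Model (9) in R, and of evaluating Equation (10) at the fitted values, is given below (reusing the illustrative doseEscalation data frame from above, with biomarker converted to a factor).

doseEscalation$biomarker <- factor(doseEscalation$biomarker)
fit9 <- glm(toxicity ~ logDose + biomarker + logDose:biomarker,
            family = binomial, data = doseEscalation)
summary(fit9)
# Fitted probabilities of toxicity for each patient, i.e. Equation (10)
# evaluated at the estimated linear predictor:
predict(fit9, type = "response")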
Now, for the dose escalation dataset, we have the single covariate logDose
and the single factor biomarker with two levels. This means that we are
modelling pi by a logistic function curve across the values of the covariate
logDose, and that there will be two of these curves, one for each level of
the factor biomarker.
This is analogous to modelling parallel and non-parallel slopes in linear
regression, as described in Sections 2 and 3 in Unit 4: whereas we had
separate regression (straight) lines for each level of a factor across the
values of the covariate in Unit 4, when we’re using logistic regression we
have separate logistic function curves for each level of a factor across the
values of the covariate.
As you saw in Subsection 2.2 in Unit 6, the logistic function is flexible, its
shape being determined by ηi . We’ll now look at how changes in the linear
predictor, ηi , affect the shapes of the logistic function curves for each level
of biomarker across the values of logDose. Since we’re fitting curves
across the values of the (log-transformation of the) dose, these curves are
often referred to as dose response curves in this type of context.


Figure 13 shows typical shapes of the logistic regression curves for the
probability of toxicity for the two levels of the factor biomarker across the
values of the covariate logDose. Notice that each of the plots shows two
curves – one for each level of the factor.
• Figure 13(a) shows the situation where there is no interaction. In this
case, the curves for the two levels of biomarker have the same shape
and are simply shifted to the left or right according to the parameter for
biomarker. This is similar to the parallel slopes model that we
introduced in Section 2 in Unit 4.
• Figure 13(b) shows the situation where there is an interaction between
biomarker and logDose. In this case, the curves for the two levels of
biomarker can also differ in shape by being stretched or shrunk. This is
similar to the non-parallel slopes model that we introduced in Section 3
in Unit 4.

Figure 13 Typical shapes of the logistic regression curves for the probability of toxicity for models (a) with no interaction and (b) with an interaction (in each panel, the probability of adverse reaction is plotted against logDose)

Next, we’ll look at the fit of Model (9) and see whether all terms in the
model are needed, or if some of the parameters may be equal to 0. To do
this, we want to compare Model (9) with the most plausible of its nested
submodels. In logistic regression (and also more generally for generalised
linear models), we can compare nested models through their deviance
differences. To aid this comparison, consider Table 19, which provides the
residual deviances of Model (9) and several of its submodels.


Table 19 Residual deviances of various logistic regression models for the dose
escalation data

Model                                                        Residual deviance   Degrees of freedom
M0: toxicity ∼ 1 (null model)                                36.434              48
M1: toxicity ∼ logDose                                       32.528              47
M2: toxicity ∼ logDose + biomarker                           31.149              46
(9): toxicity ∼ logDose + biomarker + logDose:biomarker      25.398              45
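In R, the residual deviances in Table 19, and the tests based on their differences, could be obtained along the following lines. This is a sketch only (again assuming a data frame called doseEscalation), not the module’s notebook code.

# Sketch: the nested models of Table 19 and their deviance comparisons.
m0 <- glm(toxicity ~ 1, family = binomial, data = doseEscalation)
m1 <- update(m0, . ~ . + logDose)
m2 <- update(m1, . ~ . + biomarker)
m9 <- update(m2, . ~ . + logDose:biomarker)

c(M0 = deviance(m0), M1 = deviance(m1),
  M2 = deviance(m2), "Model (9)" = deviance(m9))

# Deviance difference tests of Model (9) against each submodel:
anova(m2, m9, test = "Chisq")
anova(m1, m9, test = "Chisq")
anova(m0, m9, test = "Chisq")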

Let’s establish our preferred model in Activity 27, using what we’ve
learned about model comparisons for nested generalised linear models in
Units 6 and 7.

Activity 27 Model comparisons for the dose escalation data

(a) What are the deviance differences between Model (9) and Models M2 ,
M1 and M0 , respectively?
(b) If all models fit the data, which chi-squared distributions should be
used to test whether there is a significant gain in choosing
Model (9) over each of the respective Models M2, M1 and M0?
(c) Taking the test statistics from part (a) with the chi-squared
distributions from part (b), the associated p-values are 0.0165,
0.0283 and 0.0115, respectively, for comparing Model (9) with Models
M2, M1 and M0. Interpret these p-values in terms of the models.

We have now established that Model (9) provides a better fit to the dose
escalation data, compared with its submodels containing fewer terms. We
also need to check that our chosen model satisfies the assumptions that are
needed in order to use logistic regression.
Figure 14, given next, shows the diagnostic plots for Model (9).
Plot (a) in Figure 14 shows the standardised deviance residuals against a
transformation of µ̂, plot (b) shows the standardised deviance residuals
against index, taking the data in the order they were entered into the data
file, plot (c) shows the squared standardised deviance residuals against
index number, again taking the data in the order they were entered into
the data file (where the red circles denote positive residuals and the blue
triangles denote negative residuals), and plot (d) shows the normal
probability plot. Note that, for these data, the actual order in which the
data were collected is unknown.

Figure 14 Diagnostic plots for Model (9) for the dose escalation dataset: (a) the standardised deviance residuals against a transformation of µ̂, (b) the standardised deviance residuals against index, (c) the squared standardised deviance residuals against index (red circles denote positive residuals and blue triangles denote negative residuals), and (d) the normal probability plot

We’ll assess these diagnostic plots in Activity 28.

Activity 28 Diagnostic plots for Model (9) for the dose


escalation dataset
Describe the diagnostic plots in Figure 14. Do they support the model
assumptions?


Overall, there are some concerns about the assumptions for Model (9),
particularly in relation to the large (squared) deviance residuals for five of
the positive residuals where Yi = 1. With this number of unusual
(according to the model) observations, we could try to accommodate them
by adding further terms to the model. However, the dataset includes only
six points where Yi = 1 in total – that is, only six out of 49 patients
experienced toxicity. (A low number of toxicity events is good for the
participants, but makes the modelling more difficult!) There is some
danger of overfitting if we add extra terms to Model (9): we might just be
matching this particular dataset very closely, but our fit might not be a
good prediction for future toxicity trials. We will revisit this issue in
Subsection 8.2 when we discuss strategies for dealing with outliers.

4.4 Example: Modelling child


measurements
The final dataset that we’ll consider in this section concerns child
measurements; the dataset is described next.

Public health programme on child measurements


The National Child Measurement Programme is a public health
programme in England. As part of the programme, children in
England are weighed and measured at school when they are 4 to 5
years old, and when they are 10 to 11 years old. Data are collected
annually, and the information is used to plan and provide better
health care services for children.
The child measurements dataset (childMeasurements)
This dataset contains data collected as part of the National Child
Measurement Programme for the 2019/20 school year. In that school
year, the process was interrupted by school closures due to the
COVID-19 pandemic. As a result, the sample size (of 890 608
children) is smaller than in previous years.
(The National Child Measurement Programme was established in 2005.)
For each child being measured as part of the programme, their body
mass index (BMI) was calculated to check that they were growing as
expected. The sample of children were then classified in this dataset
according to the following three categorical variables:
• ethnicity: the child’s ethnicity, taking the six possible values
Asian, Black, Mixed, White, Chinese or other, and Unknown
• ageGroup: the age group (in years) that the child belongs to, taking
the two possible values 4 to 5 and 10 to 11
• obese: whether a child is classified as being obese based on their
BMI, taking the two possible values no and yes.
The contingency table for these data is given as Table 20, next.


Table 20 Contingency table for the data in childMeasurements

                          obese: no                  obese: yes
                      ageGroup                   ageGroup
ethnicity             4 to 5     10 to 11        4 to 5     10 to 11
Asian                  33241      36837            3537      12463
Black                  14800      17646            2607       7468
Mixed                  19171      17766            2194       5413
White                 227637     251319           24240      60565
Chinese/other           9830       9522            1062       3098
Unknown                55387      54686            5764      14355

Source: NHS Digital, 2020, accessed 17 September 2022

We will now work through choosing log-linear models for this three-way
contingency table in Activity 29.
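The log-linear models compared in Activity 29 are Poisson GLMs for the cell counts of Table 20. As a hedged sketch of how such models might be fitted in R (assuming a data frame childCounts with one row per cell and variables ethnicity, ageGroup, obese and count; this is not the module’s notebook code):

# Sketch: log-linear models for the three-way table (assumed data frame
# childCounts, one row per cell, with the cell frequencies in count).
all2way <- glm(count ~ (ethnicity + ageGroup + obese)^2,
               family = poisson, data = childCounts)

# Omitting one two-way interaction at a time, e.g. ethnicity:ageGroup:
no.eth.age <- update(all2way, . ~ . - ethnicity:ageGroup)

deviance(all2way)     # residual deviance with all two-way interactions
deviance(no.eth.age)  # residual deviance with ethnicity:ageGroup omitted
anova(no.eth.age, all2way, test = "Chisq")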

Activity 29 Choosing a log-linear model for the child


measurements dataset
The data in Table 20 from the child measurements dataset are to be
modelled by a log-linear model. We would like to investigate if any of the
two-way interactions can be left out of the model. To do this, we look at
some model diagnostics for the model which has all main effects and all
two-way interactions and compare this with the models omitting one
two-way interaction in turn. Table 21 shows the residual deviances for
these models.
Table 21 Models omitting one two-way interaction term and the model with all
two-way interactions

Two-way interaction omitted      Residual deviance   Degrees of freedom
obese:ethnicity                   2712.30             10
obese:ageGroup                   21390.00              6
ethnicity:ageGroup                 976.62             10
None                               302.78              5

(a) How would you go about testing whether the model containing all
two-way interactions fits better than the model omitting the
interaction ethnicity:ageGroup? Find the value of an appropriate
test statistic and the distribution this comes from if both models fit
equally well.
(b) Suppose your test in part (a) gives a p-value close to 0. What is your
conclusion? What does this tell you about the comparison between
the model containing all two-way interactions and the other two
models where one interaction is omitted?


(c) Which model, if any, provides an adequate fit for the data? You can
assume that a value of 30.856 would give a p-value of less than 0.001
for a χ²(5) distribution.

While it is disappointing that we could not find a well-fitting model for the
child measurements data (apart from the saturated model, which is not
very useful for predictions), we should also ask ourselves why we fitted a
log-linear model in the first place. Section 5 will introduce the concept of
the research question behind a study, and how this affects our modelling
approach. In Section 5, we will revisit several datasets, in particular the
cells, the child measurements, the dose escalation and the desilylation
datasets.

5 Relating the model to the


research question
Often, data are collected to answer a specific research question. In this
case, in order to answer the question of interest, we need to translate the
research question into a statistical problem associated with our model.
This is illustrated in Example 2.

Example 2 Research questions for the Olympics dataset


A gambler may be interested in the Olympics dataset from Unit 5 and
a specific research question such as ‘Which countries will win the most
medals at the next Olympics?’.
This research question translates into the statistical problem of
finding a model for the Olympics dataset that provides good
predictions for the number of medals each country will win at the
next Olympics. (Warning: even with a good predictive model the
gambler may face ruin!)

In this section, we’ll revisit some of the datasets from earlier in this unit:
• the cells dataset (from Subsection 3.1) in Subsection 5.1
• the child measurements dataset (from Subsection 4.4) in Subsection 5.2
• the dose escalation dataset (from Subsection 4.3) in Subsection 5.3
• the desilylation dataset (from Subsection 1.1) in Subsection 5.4.
For each dataset, we’ll identify the research question of interest and a
statistical model to try to find its answer.


5.1 The cells dataset revisited


The cells dataset was first introduced in Subsection 3.1. In Subsection 3.2,
we decided that our preferred model for these data is
opticalDensity ∼ treatment + day, (11)
where opticalDensity measures the proportion of surviving cells in a
sample, treatment is a factor indicating whether the sample was treated
with a particular toxin, and day is a factor indicating whether the sample
was measured on the first or second day of the experiment.
In Activity 30, we’ll look at the research question of interest to the
biologists who carried out this experiment, and how we can relate
Model (11) to the biologists’ research question.
(These cells hope they aren’t treated with the toxin!)

Activity 30 The research question for the cells dataset

(a) Looking back at the description of the cells dataset given in


Subsection 3.1, formulate the biologists’ research question.
(b) How can we use Model (11) to translate the biologists’ research
question into a statistical problem?
(c) The summary output after fitting Model (11) to the cells data was
given in Table 12, and is repeated here for convenience in Table 22.
What is the answer to the biologists’ research question?
Table 22 Repeat of Table 12

Parameter      Estimate   Standard error   t-value   p-value
Intercept         1.100    0.069            15.853    < 0.001
treatment 1      −0.433    0.080            −5.408    < 0.001
day 2            −0.400    0.080            −4.992    < 0.001

The biologists might have formulated the problem as a one-sided research


question: ‘Will the toxin decrease the proportion of surviving cells in a
sample?’
Following the notation used in the solution to Activity 30, β1 measures the
change in opticalDensity when we go from untreated to treated cells.
Therefore, β1 < 0 corresponds to a decrease in opticalDensity for treated
cells and so, in this case, we would test the hypotheses
H0: β1 = 0,   H1: β1 < 0   (assuming that β2 = β̂2).
The corresponding p-value is half the p-value of the two-sided test. So,
since the value of the p-value for the two-sided test was less than 0.001, the
p-value for this new test will also be less than 0.001. There is therefore


strong evidence that the toxin does indeed decrease the proportion of
surviving cells in a sample.
In Unit 3, and in Section 3 of this unit, we saw that datasets where all
explanatory variables are factors can be analysed by either creating an
ANOVA table or by reporting the output of a linear regression model in a
table of coefficients. Which one would be preferred for the cells dataset,
given the research question? We’ll consider this in Activity 31.

Activity 31 ANOVA or linear regression for the cells


dataset?
Suppose that β1 is the regression coefficient for the indicator variable for
the second level of the factor treatment.
(a) To answer the research question represented by the hypotheses
H0: β1 = 0,   H1: β1 ≠ 0,   assuming that β2 = β̂2,
would you use an ANOVA table or the output from fitting the
regression model?
(b) Suppose that the research question was instead represented by the
hypotheses
H0: β1 = 0,   H1: β1 < 0,   assuming that β2 = β̂2.
Would you use an ANOVA table or the output from fitting the
regression model to answer the research question this time?

5.2 The child measurements dataset


revisited
In Section 6 of Unit 8, we showed how some contingency tables can be
modelled by either a log-linear model or a logistic regression model. The
choice of model depends on what we want to find out about the data. If
interest is focused on the outcome of a binary categorical variable which is
an obvious response for the research question, then a logistic regression
model would be a sensible way forward. But, if there is no obvious
response variable and we are more interested in investigating the
relationships between all of the categorical variables, then a log-linear
model is a better way forward.
In Subsection 4.4, we analysed the child measurements contingency table
data by fitting a log-linear model. But was that the right choice? We’ll
revisit these data in Subsection 5.2.1 and then use R to fit a logistic
regression model from contingency table data in Subsection 5.2.2.


5.2.1 Log-linear model or logistic regression?


Given that average child heights vary from country to country, suppose
that some public health researchers wish to find the answer to the research
question: ‘Does the probability that the variable obese takes the value 1
change for different ethnicities and age groups?’. Activity 32 will consider
which model to fit in order to answer this.

Activity 32 Log-linear or logistic?


To answer the public health researchers’ question, would you fit a log-linear
model or a logistic regression model? What is your response variable?
(Rodriguez-Martinez et al. (2020) found the same mean height for 19-year-old
girls in Bangladesh as for 11-year-old girls in The Netherlands!)
In Activity 33 we’ll consider a possible logistic regression model to address
the public health researchers’ question of interest, and relate the model to
a log-linear model for the same data.

Activity 33 Relationship between log-linear and logistic


regression models
(a) In order to help answer the public health researchers’ question
regarding whether the probability of a child being obese differs for
different ethnicity and age groups, the logistic regression model of the
form
obese ∼ ethnicity + ageGroup
was fitted to the contingency table data in the child measurements
dataset.
Which terms would we expect in a log-linear model for the same data?
(b) When fitting the logistic regression model from part (a), the residual
deviance for the fitted model is 302.78 with 5 degrees of freedom. Is
this model a good fit to the data?

Activity 33 considered the fit of the logistic regression model


obese ∼ ethnicity + ageGroup.
This is the only logistic regression model for obese short of the saturated
logistic regression model
obese ∼ ethnicity + ageGroup + ethnicity:ageGroup.
In Activity 34, we’ll interpret these two models.


Activity 34 Interpreting the logistic regression model


Consider the following two logistic regression models:
• without interaction:
obese ∼ ethnicity + ageGroup
• with interaction:
obese ∼ ethnicity + ageGroup + ethnicity:ageGroup.
Explain how these two logistic regression models differ in terms of their
interpretation.

We’ve seen in this subsection that the research question is important for
how we analyse the data. It is therefore essential to find out as much as
possible about the study from the researchers who are conducting it. If in
doubt, ask!

5.2.2 Using R to fit a logistic regression model from


contingency table data
What can we do if the data are provided as a contingency table, but we
would like to fit a logistic regression model in R? Notebook activity RU.1
shows a possible way of manipulating the child measurements data into the
required form.
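One general possibility (sketched here under assumed names, and not necessarily the approach taken in the notebook) is to fit the logistic regression to grouped counts, giving glm() the numbers of obese and non-obese children in each ethnicity-by-ageGroup cell:

# Sketch: logistic regression from grouped counts (assumed data frame
# childGrouped with one row per ethnicity/ageGroup combination and
# columns nYes and nNo holding the counts of obese and non-obese children).
fitLogistic <- glm(cbind(nYes, nNo) ~ ethnicity + ageGroup,
                   family = binomial, data = childGrouped)
summary(fitLogistic)  # its residual deviance (302.78 on 5 df) was quoted in Activity 33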

Notebook activity RU.1 Fitting a logistic regression


model from a contingency table
This notebook activity explains how to manipulate contingency table
data into a form to which a logistic regression model can be fitted.

5.3 The dose escalation dataset revisited


In Subsection 4.3, we fitted a logistic regression model to the dose
escalation data given in Table 17. However, fitting the model is only part
of the story. We now need to put the model to good use to answer the
research problem of identifying which dose is associated with a given
probability of toxicity. From Nicholson et al. (1998), we learn that the
level of toxicity deemed acceptable here is 0.16, or 16%. This means that
we are looking for a dose at which no more than approximately one in six
patients are expected to experience severe side effects of the treatment.
(What’s the dose so that no more than one in six patients experience severe side effects?)
We need to take into account that there are two different groups of
patients: those patients who are biomarker positive and those patients who
are biomarker negative. From Subsection 4.3, our preferred model for these
data is the logistic regression model
toxicity ∼ logDose + biomarker + logDose:biomarker.


This model means that the dose affects the potential of a toxicity outcome
differently in the two biomarker groups. In other words, the dose where
toxicity is expected to have probability 0.16 is different in the two
biomarker groups, and we are therefore looking for two different doses. A
quick and easy way of finding these doses is through a graphical approach.
Figure 15 shows the fitted probabilities of toxicity for the dose escalation
data for the two biomarker groups. In each plot, the horizontal arrow is at
the value where the probability of toxicity is 0.16.

Figure 15 Plots of the fitted probabilities of toxicity against dose (mg/m²) for (a) biomarker negative patients, and (b) biomarker positive patients

In Activity 35 you will find the recommended doses for each biomarker
group of patients by looking at the graphs in Figure 15.

Activity 35 Using a graphical approach to finding the


recommended doses
Use the graphs in Figure 15 to read off the approximate values of the
recommended doses in the two biomarker groups.

The scale in the graphs in Figure 15 is not fine enough to allow reading off
the exact values of the required dose for a value of 0.16 for the probability
of toxicity. In order to get a more accurate result, we need to use our fitted
model equation.
Recall that we can predict the values for p, the probability of toxicity, at
given dose levels for each biomarker group, by using the fitted model
equation
\[ \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\eta}, \tag{12} \]


where η̂ is the fitted value of the linear predictor. For our particular
logistic regression model
toxicity ∼ logDose + biomarker + logDose:biomarker,
for biomarker negative patients we get the fitted model equation
\[ \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -424 + 529\,\texttt{logDose} = -424 + 529 \log\left(\frac{\texttt{dose}}{200} + 1\right), \tag{13} \]
and for biomarker positive patients we get the fitted model equation
\[ \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -4.3 + 4.1\,\texttt{logDose} = -4.3 + 4.1 \log\left(\frac{\texttt{dose}}{200} + 1\right). \tag{14} \]
However, we are not interested in predicting values for the probability of
toxicity, p. Our aim, instead, is to find the dose where p equals 0.16. How
can this be done? The answer is that we can substitute p̂ = 0.16 into
Equations (13) and (14), and then solve for dose. Substituting this value of p̂
in these equations, the left-hand side of each equation becomes
\[ \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \log\left(\frac{0.16}{1 - 0.16}\right) \simeq \log(0.1905) \simeq -1.6582. \]
So, in order to find the dose for the biomarker negative group when
p̂ = 0.16, we need to find the value of dose which satisfies
\[ -1.6582 = -424 + 529 \log\left(\frac{\texttt{dose}}{200} + 1\right), \tag{15} \]
and to find the dose for the biomarker positive group when p̂ = 0.16, we
need to find the value of dose which satisfies
\[ -1.6582 = -4.3 + 4.1 \log\left(\frac{\texttt{dose}}{200} + 1\right). \tag{16} \]
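Activity 36 asks you to solve these two equations by hand. Afterwards, a numerical check in R could look like the following sketch, which uses uniroot() with the coefficient values quoted in Equations (15) and (16).

# Sketch: numerical check of the doses solving Equations (15) and (16).
lhs <- log(0.16 / (1 - 0.16))   # left-hand side, approximately -1.6582

# Biomarker negative group, Equation (15):
f.neg <- function(dose) -424 + 529 * log(dose / 200 + 1) - lhs
uniroot(f.neg, interval = c(100, 260))$root

# Biomarker positive group, Equation (16):
f.pos <- function(dose) -4.3 + 4.1 * log(dose / 200 + 1) - lhs
uniroot(f.pos, interval = c(100, 260))$root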

Activity 36 What are the doses?

Use Equations (15) and (16) to calculate the dose required for each
biomarker group when p̂ = 0.16.

So, we’ve seen how we can use our fitted model to address the research
question of interest. However, before recommending these doses, it’s
always a good idea to check how sensible the results seem to be.


We’ll start by looking back at the fitted curves given in Figure 15. The
graph in Figure 15(a) for the biomarker negative group shows a very steep
S-shaped curve, whereas the graph in Figure 15(b) for the biomarker
positive group shows a slowly increasing curve. Why do they look so
different?
Well, the first thing to note is that the graph for the biomarker positive
group would be S-shaped if we plotted it over a larger range of the dose.
So, although it’s not clear from the plots, in fact both curves are S-shaped.
However, the graph for the biomarker negative group causes some concern.
While having the desired S-shape of the logistic regression model, it is too
steep for a statistician’s liking. It would seem rather unlikely that the
probability of toxicity in the biomarker negative group is almost 0 up to a
dose of about 240 mg/m², and then goes up to one when reaching a dose of
about 250 mg/m².
In addition to having some concern about the fitted curve for the
biomarker negative group, we also weren’t entirely happy with the
diagnostic plots for this model given in Figure 14 (Subsection 4.3). So,
how else might we check whether our results seem to be sensible? Well,
there’s another check that we can do for dose escalation trials.
One of the features of a dose escalation trial is that several patients are
given the same dose. For example, from Table 18 (Subsection 4.3), we can
see that five patients in the biomarker negative and six patients in the
biomarker positive group were given dose 100 mg/m². For each of the
doses used in the trial, we can therefore find an estimate for the
probability of toxicity by simply dividing the number of patients who
experienced toxicity at this dose by the total number of patients who
received this dose, separately for the two biomarker groups. Table 23
shows these estimates for the dose escalation dataset.
Table 23 Estimated probability of toxicity by dose and biomarker groups

                 Dose (mg/m²)
Biomarker        100            150        180        215          245            260
negative         0/5 = 0        0/4 = 0    0/4 = 0    0/6 = 0      2/7 ≃ 0.286    1/1 = 1
positive         1/6 ≃ 0.167    0/4 = 0    0/8 = 0    2/4 = 0.5    –              –

We then proceed as follows. For each group separately, starting at the


lowest dose, we move up until we find a dose where the estimated
probability of toxicity is unacceptably high. The next lower dose is then
the recommended dose for use in this group. Let’s try this method in
Activity 37 using the estimates in Table 23.


Activity 37 Finding the recommended doses through direct


estimation
(a) For the biomarker negative group, apply the method described above
to find the recommended dose.
(b) Find the recommended dose for the biomarker positive group,
assuming an estimated probability of 1/6 is still deemed acceptable.

Comparing these results with the recommended doses from Model (9), we
find that for the biomarker positive group, the values essentially coincide
(181 mg/m² using the fitted model in Activity 36 and 180 mg/m² using the
direct estimation in Activity 37), which is strong evidence that this dose can be
recommended for this group. In addition, the plot of fitted probabilities in
this group is what we would expect, so the results from Model (9) should
be ‘trustworthy’ in the biomarker positive group.
For the biomarker negative group, however, the values are quite far apart
(244 mg/m² using the fitted model in Activity 36 and 215 mg/m² using the
direct estimation in Activity 37). We also had some concern about the
steepness of the S-shaped curve of fitted probabilities in this group. In
hindsight, it would have been good if an intermediate dose, such as
230 mg/m², had been investigated in this group. Overall, with the tools
and the data we have, we cannot give a strong recommendation for a
specific dose in the biomarker negative group. In this type of situation in
practice, medical statisticians might use more sophisticated statistical
methods that allow them to incorporate prior knowledge into the model, or
they might take forward more than one dose to the next stage of the trial.
Any of these methods will require close collaboration with the clinicians.
In any case, it is worth noting that the recommended doses will be different
for the two groups. This again highlights the importance of considering all
variables that may affect the response. In this example, we accounted for
potential differences between patients through the biomarker. If this
information had been neglected, we would have recommended the same
dose for all patients. It is likely that this dose would have unacceptable
levels of toxicity for the biomarker positive group, while being too low to
reach the best level of efficacy for the biomarker negative group.

5.4 The desilylation dataset revisited


The desilylation experiment data contained in the desilylation dataset
described in Section 1 had the aim of finding the maximum yield of the
alcohol of interest and the values of the covariates where this is achieved.
So, the research question here is ‘What is the maximum yield of the
alcohol, and at which values of the covariates is this maximum achieved?’.
Activity 38 will explore how we can relate our statistical model to this
research question.
(Working in a sterile environment in the pharmaceutical industry)

Activity 38 The research question for the desilylation


experiment
(a) How could the answer to the research question being addressed by the
chemists at GlaxoSmithKline impact on practice at GlaxoSmithKline?
(b) How can you use the statistical model you fitted in Section 2 to help
the chemists answer this research question?

Following on from Activity 38, we would now like to maximise the fitted
model. Luckily, any statistical software will do this for us; maximisation
in R provides the results given in Table 24.
Table 24 Optimal values for the variables in the desilylation dataset

temp time nmp equiv yield


−0.1445 2.0000 −1.3979 0.5529 95.3112
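As an indication of how such a maximisation might be carried out (a sketch only, assuming the fitted second-order model from Section 2 is available as an lm object called fit.desilylation with covariates temp, time, nmp and equiv), one option is optim() with box constraints:

# Sketch: maximising the predicted yield over the standardised region [-2, 2]
# for each covariate (assumed fitted lm object fit.desilylation).
predYield <- function(x) {
  newdata <- data.frame(temp = x[1], time = x[2], nmp = x[3], equiv = x[4])
  predict(fit.desilylation, newdata = newdata)
}

opt <- optim(par = c(0, 0, 0, 0), fn = predYield,
             method = "L-BFGS-B",
             lower = rep(-2, 4), upper = rep(2, 4),
             control = list(fnscale = -1))   # fnscale = -1 requests maximisation
opt$par     # estimated optimal standardised covariate values
opt$value   # estimated maximum yield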

We will interpret these values in Activity 39.

Activity 39 Prediction of the maximum yield for the


desilylation experiment
Given the optimal values for the variables in the desilylation dataset shown
in Table 24, does this mean that, for these values of the covariates, the
yield of the alcohol will always be 95.3112?

There are two types of uncertainty when predicting an individual response:


there is uncertainty in the estimation of the regression model, and there is
uncertainty in how the new response varies about the estimated regression
model. A prediction interval takes into account both types of variability, so
that it provides a range of plausible values for the new response that will
actually be observed at these covariate values.
A 95% prediction interval for the yield attained at the values of the
covariates from Table 24 is (93.8610, 96.7614). We’ll interpret this interval
in Activity 40.
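For a normal linear model fitted with lm(), such a prediction interval can be obtained with predict() and interval = "prediction". A sketch (using the same assumed fitted object fit.desilylation as above):

# Sketch: 95% prediction interval for the yield at the estimated optimal
# covariate values in Table 24 (assumed fitted lm object fit.desilylation).
newdata <- data.frame(temp = -0.1445, time = 2.0000,
                      nmp = -1.3979, equiv = 0.5529)
predict(fit.desilylation, newdata = newdata,
        interval = "prediction", level = 0.95)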

Activity 40 Interpretation of the prediction interval

(a) What is wrong with the interpretation that this is a prediction


interval for the maximum yield?
(b) Interpret the 95% prediction interval given the research question.


We now have a prediction interval for the yield of the alcohol of interest
when the reaction is run at the estimated optimal covariate values in
Table 24. What is the next step for the chemists to optimise the
production process of the alcohol? We’ll explore this question in
Activity 41.

Activity 41 Optimising the production process


The chemists want to optimise the production process of the alcohol by
running the reaction at the optimal estimated covariate values. These are
−0.1445, 2.0000, −1.3979 and 0.5529, respectively, for temp, time, nmp and
equiv. What do they need to consider before setting the values of the
temperature, time, concentration of NMP and equivalents of the reagents?
(Hint: The description of the desilylation dataset and Table 2
(Subsection 1.1) might be useful here.)

The original ranges for the covariates used in the desilylation experiment
(that is, temp0, time0, nmp0 and equiv0) were provided in Table 4 in
Subsection 1.4. The variables were standardised by setting the lowest value
for each respective variable to −2 and the highest value to 2. The
midpoint of the range was then set to 0, and so on. We can reverse the
standardisation to get the optimal estimated covariate values in their
original scale as given in Table 25.
Table 25 Optimal values for the variables in the desilylation experiment in
original units

temp0 time0 nmp0 equiv0


19.2775 31 3.6021 1.4276
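The reversal of the standardisation is just a linear rescaling: a standardised value z in [−2, 2] corresponds to midpoint + z × (range/4) on the original scale, where range = max − min. A small sketch in R (the value of min used below is a placeholder, but note that the answer for z = 2 does not depend on it):

# Sketch: converting a standardised value back to the original scale.
# lowest value -> -2, highest value -> 2, midpoint -> 0 (linear in between).
unstandardise <- function(z, min, max) (min + max) / 2 + z * (max - min) / 4

# For example, time = 2 on the standardised scale maps to the upper limit
# of the original range (31 hours); min = 15 here is only a placeholder.
unstandardise(2, min = 15, max = 31)   # gives 31 whatever min is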

Notice that the value for time0 is at the upper limit of the range for this
covariate (31 hours in the original scale or 2 in the standardised scale). In
the optimisation procedure where we found these values, we had implicitly
constrained the values of the variables to the interval [−2, 2] in the
standardised scale, so that we only looked for the maximum in this range
for each of the covariates. Why did we do that, and does this mean a
higher yield is possible outside this range?
We constrained the ranges of the covariates to [−2, 2] in order to avoid
extrapolation. You have seen in Subsection 6.1 of Unit 1 that a fitted
model is only valid for prediction for values of the explanatory variables
within the range of values in the sample of the data. There is no guarantee
that the same relationship between the response variable and the
explanatory variable will hold outside the range of data used to calculate
the fitted model. Going outside this range is extrapolation, which can lead
to incorrect conclusions.
If we had not constrained the optimisation routine to only look for the
maximum yield when time is less than or equal to 2, then R might have


given us a higher maximum yield than the one we found, with a value of
time greater than 2. However, this would be extrapolation, since we have
no observed responses when time is greater than 2, and therefore we would
not be able to trust these results. The model may not be valid in this area.
In this situation, the chemists have two options to ensure efficient mass
production of the alcohol. The first one is to set their equipment to the
values in Table 25. This way, they have a prediction interval which has
been derived from a reliable model. The second one is to increase the
sample size. In particular, they could run the reaction a few times for
larger values of time0 and then re-estimate the model including these new
observations. This way, they might find an even higher yield for the
alcohol of interest.
In deciding what to do, the chemists may also take into account the costs
of producing the alcohol on a large scale. For example, even if they could
produce slightly more alcohol per run by running the reaction for 35 hours,
say, they may still produce more alcohol in total by going with 31 hours as
in Table 25, because they can then use the equipment to run the reaction more
often! Of course, the costs for heating, for the materials and for equipment
maintenance also need to be taken into account.

6 Another look at the model


assumptions
In previous units, and indeed previous sections of this unit, we have often
started our data analysis with proposing a model that seemed to be
‘natural’ or ‘obvious’. Often this turned out to be a good choice, but there
were cases where we found that the model assumptions were not satisfied,
and we had to look further to find a better fit for the data. In this section,
we will look back at several examples, with a focus on what seems to be a
‘natural’ model, and what could be alternatives.
We’ll start by considering datasets which have responses which are
percentages in Subsection 6.1, and then responses which are counts in
Subsection 6.2. We’ll discuss how to choose between models in
Subsection 6.3, and then to round off the section, in Subsection 6.4 we’ll
use R to compare possible ‘natural’ models with alternatives.

6.1 Models for responses which are


percentages
In Activity 42, we’ll investigate some of the model assumptions for the
desilylation data.


Activity 42 Model assumptions for the desilylation data


The response variable in the desilylation dataset is yield, measured in
percentage yield. In Section 2, we assumed that the data come from a
normal distribution to build the model.
(a) Do you think this was reasonable? Explain your answer.
(b) What could potentially be an issue in such a situation?

A similar situation to that discussed in Activity 42 arose in


Subsection 1.3.3 of Unit 2, where the Brexit dataset was introduced. The
response variable, leave, in the Brexit dataset is the percentage of voters
who voted ‘Leave’ in the 2016 referendum, in 237 local authority areas.
The data were modelled by a linear regression model which assumes a
normal distribution for the response.
Again, this turned out to be appropriate for this dataset. This is because
the minimum and maximum percentages of the response leave are 37.29
and 75.56, respectively, and so it would be unlikely that predictions could
get outside the range of 0 to 100, at least within the given ranges of values
of the explanatory variables.
For both the desilylation and Brexit datasets, we have continuous response
variables, whose values are bounded within an interval. However, the
normal distribution, which is continuous but unbounded, turned out to
provide an appropriate fit to the data. The assumption of a normal model
has therefore been reasonable.
What about count data? Do they always have to be modelled by a Poisson
distribution? We’ll discuss this next.

6.2 Modelling counts


In Subsection 4.2, we modelled the citations data assuming a Poisson
distribution for the response numCitations. This distribution seemed a
natural choice, since the responses are counts. However, we were not
happy with the fitted model; Activity 24 suggested that there could
possibly be an overdispersion problem or important explanatory variables
missing from the model, or that the model may need a different link
function, while the diagnostic plots for the fitted model considered in
Activity 25 suggested that the model might not be appropriate for the
data. So, we should ask ourselves whether the assumption of a Poisson
model was the best we can do.
There are further distributions that can be used to model count data. For
example, a distribution called the negative binomial distribution suffers
considerably less from overdispersion issues than the Poisson distribution.
However, we cannot create a GLM from this distribution, and it therefore
goes beyond the scope of M348. Alternatively, let’s consider the tools we
already have.


Figure 16 shows a repeat of the scatterplot (given in Figure 11,


Subsection 4.2) of numCitations and yearDiff, with values of journal
identified. From this scatterplot, it seems reasonable to try and fit a linear
model. In particular, a parallel slopes model looks promising here, and
while the responses are non-negative integers, they don’t seem too far away
from being continuous.

Figure 16 Scatterplot of numCitations (number of citations) against yearDiff (years since publication), with the different values of journal (standard statistics, prestigious statistics, medical) identified (repeat of Figure 11)

Recall that we’ve already seen an example of using a linear model to model
a dataset with a count response when we analysed the Olympics dataset in
Unit 5. In that unit, we found that the fitted linear model worked well for
these data. So, let’s try fitting the linear model
numCitations ∼ yearDiff + journal (17)
to data from the citations dataset. A summary of the output after fitting
the model is given in Table 26.


Table 26 Summary of output after fitting the linear model


numCitations ∼ yearDiff + journal

Parameter     Estimate   Standard error   t-value   p-value
Intercept        1.174    2.683             0.438    0.6666
yearDiff         0.552    0.216             2.556    0.0193
journal 1       35.610    3.703             9.617    < 0.001
journal 2       74.092    6.066            12.215    < 0.001
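A sketch of fitting Model (17) in R (assuming the data are in a data frame called citations with variables numCitations, yearDiff and the factor journal; not the module’s notebook code):

# Sketch: fitting the parallel slopes linear model (17) (assumed data frame).
m17 <- lm(numCitations ~ yearDiff + journal, data = citations)
summary(m17)

# The F-test quoted later in Activity 43(c) compares Model (17) with the
# model containing yearDiff only:
m.year <- lm(numCitations ~ yearDiff, data = citations)
anova(m.year, m17)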

What can we learn from the output in Table 26? We’ll explore the
interpretation of the results in Activity 43.

Activity 43 Interpreting the output for the fitted linear


model
(a) Interpret the parameter estimates labelled ‘Intercept’, ‘yearDiff’ and
‘journal 1’ in the output for the fitted linear model given in Table 26.
(b) Using the fitted regression equation, find the fitted value, ŷ, for an
article that was published in a standard statistics journal eight years
ago.
(c) A second linear model
numCitations ∼ yearDiff
was fitted to the citations data. An ANOVA test to compare the RSS
values for this model and Model (17) was carried out; the test statistic
was calculated to be F = 110.13, and the associated p-value was less
than 0.001. Considering Table 26 and this extra information, would
you recommend removing any of the variables from the model?
Explain your answer.

Model (17) is a parallel slopes model as described in Section 2 of Unit 4.


However, there is no reason why the increase in citations over time should
be the same for all types of journal. So, from the context of the data, we
might also consider adding an interaction between journal and yearDiff.
Unfortunately, however, for the same reason as explained in Subsection 4.2
when we fitted a Poisson GLM to the citations data, we cannot fit a linear
model with the interaction journal:yearDiff to these data. There is only
one observation that has the value 2 (medical journal) for journal, and so
it is not possible to determine a slope for this level. So, we’ll need to stick
with the model without an interaction.
Let’s see if the fit of Model (17) satisfies the linear model assumptions. To
check these, the residual plot and normal probability plot are shown in
Figure 17, next.

Figure 17 The residual plot (a) and normal probability plot (b) for Model (17)

Activity 44 Checking the model assumptions

(a) Does the residual plot in Figure 17(a) support the assumption that
the Wi ’s have zero mean and constant variance?
(b) Does the normal probability plot in Figure 17(b) support the
assumption that the Wi ’s are normally distributed?

In conclusion, the model assumptions for the linear model seem to be


satisfied, and our linear model fits the data well. Comparing this with the
Poisson GLM given by Model (7) (in Subsection 4.2) and the
corresponding diagnostic plots in Figure 12, we find that the linear model
is a much better choice for these data than the Poisson GLM. As the
responses are counts, we initially expected the Poisson distribution to be
appropriate, but after applying model diagnostics we found this not to be
the case. This example shows that we should be pragmatic. If a model
provides a poor fit to the data, we should consider alternatives.

6.3 Choosing between models


The main focus in this section has been on choosing an appropriate
distribution for the response variable. However, even for the same
distribution and link function, we may find ourselves in a situation where
we have more than one linear predictor that seems to fit the data well. For
example, for a dataset with a large number of explanatory variables, we
may have used both forward and backward stepwise regression to find out
which explanatory variables should be in the model. What do we do if
these two methods give us different results? For example, we might end up

450
6 Another look at the model assumptions

with two models with similar values of the AIC, but different sets of
explanatory variables in the linear predictor.
First, the AIC is only a relative measure. It can be used to decide between
models, but does not tell us anything about the fit, or if the model
assumptions are satisfied. We need to use model diagnostics on both
models. If neither of the two models satisfies the model assumptions, we will
need to look for a better model. If only one of them satisfies the model
assumptions, we can pick this one. But what do we do if both models are
appropriate?
There is nothing wrong with reporting two models. You could go back to
the researchers who collected the data, and discuss your findings with
them. This may help them (and you) to interpret the results.
Alternatively, or in addition, you could split the data into a test and a
training dataset as in Subsection 4.3 of Unit 5 and assess how well the
models found by fitting the training data can predict the test data.
To round off this section, we’ll revisit the Olympics dataset and use R to
compare some different potential models.

6.4 Using R for model comparison: the


Olympics dataset revisited
The response variable medals in the Olympics dataset represents the
number of medals won by a nation at a summer Olympics (as it stood at
the end of Tokyo 2020). In Unit 5, we fitted a linear model to these data.
However, since medals is a count variable, a Poisson GLM would be a
natural choice for these data.
We have, in fact, already fitted a Poisson GLM to these data in
Notebook activity 7.3. In that notebook activity, we tried fitting a Poisson
GLM using exactly the same explanatory variables that had been selected
when fitting the same data using multiple regression in Unit 5, and we
found that the linear model outperformed this particular Poisson GLM as
measured by the MSE.
In this section, rather than simply using the same explanatory variables in
the linear predictor that had been selected when using multiple regression
for these data, we’ll widen our search for a Poisson GLM for medals by
using model selection criteria to decide which explanatory variables should
be included in the Poisson GLM’s linear predictor. We’ll then compare our
Poisson GLM with the linear model proposed in Unit 5.
We’ll start in Subsection 6.4.1 by selecting a Poisson GLM so that the
linear predictor only includes main effects, and then in Subsection 6.4.2
we’ll consider both main effects and two-way interactions in the linear
predictor.


6.4.1 A Poisson GLM with only main effects


In Notebook activity RU.2, we’ll consider Poisson GLMs for the Olympics
dataset, where only the main effects of the explanatory variables are
included in the linear predictor. Then, in Notebook activity RU.3, we’ll
check whether the model assumptions of the Poisson GLM selected in
Notebook activity RU.2 are satisfied.

Notebook activity RU.2 Poisson GLM with only main


effects for the Olympics dataset
In this notebook, we’ll select the explanatory variables for the linear
predictor of a Poisson GLM, considering only the main effects.

Notebook activity RU.3 Checking the model assumptions


for a Poisson GLM with only
main effects
This notebook will check the model assumptions for the Poisson GLM
selected in Notebook activity RU.2.

6.4.2 A Poisson GLM with two-way interactions and


main effects
In Notebook activity RU.3, we found that the model selected in Notebook
activity RU.2 does not seem to satisfy the model assumptions. Is it time to
give up on Poisson modelling for the Olympics data, or can you think of a
way to possibly save this idea?
Well, we could try adding terms to the linear predictor of the model. So
far, we’ve only considered the main effects as explanatory variables.
However, two of the explanatory variables may affect the response jointly,
which can be modelled by a two-way interaction. For example, a country
with high GDP which is hosting the Olympic games may have put
substantial funding towards talented athletes, whereas a country with low
GDP which is hosting may not have had the same opportunity. Therefore
it seems plausible that the effect of variable host may not be the same for
different countries, depending on the value of gdp. Similar reasoning can
be applied to other pairs of variables.
So, in Notebook activity RU.4, we will add two-way interactions to the
model to see if this improves the fit.
Then, in Notebook activity RU.5, the final notebook activity of this
section (indeed the final notebook activity in this unit, and in
M348!), we’ll assess how well the Poisson GLMs selected in Notebook
activity RU.4 predict the numbers of Olympic medals won in the Tokyo
2020 Olympics, and we’ll compare the performance of these Poisson GLMs
with the linear model proposed in Unit 5.


Notebook activity RU.4 Adding two-way interactions to


the Poisson GLM
In this notebook, we will add two-way interactions to the Poisson
GLM to see if this improves the fit.

Notebook activity RU.5 Assessing predictions for the


Poisson GLMs
In this notebook, we’ll assess predictions made using the GLMs
selected in Notebook activity RU.4.

7 To transform, or not to transform,


that is the question
It is sometimes useful to transform the response and/or one or more of the
explanatory variables. An appropriate transformation can make the
relationship between an explanatory variable and the response variable
more linear. In this case, the explanatory variable is transformed to attain
linearity. If you have doubts about the model assumptions, in particular if
the ‘constant variance’ assumption on the normal distribution for the error
terms is violated, you can transform the response variable so that the
assumptions seem to be satisfied for the transformed data. In some cases,
it may be useful to transform both the response and one or more of the
explanatory variables. (A different transformation!)
When considering transformations, what tools can we use to decide if a
transformation could be useful? And if we decide to transform, how do we
know which transformation to use? You have already seen some of these
considerations in Section 4 of Unit 2. We will revise these here in the
context of linear models.
For generalised linear models for non-normal response variables, a direct
transformation of the response variable would often not make sense. For
example, we can’t use a log-transformation when the response comes from
a Bernoulli distribution, as we can’t take the log of 0. The closest we get
to transformations here would be to choose a different link function that
links the mean of the distribution to the linear predictor. For explanatory
variables, transformations work in the same way as for linear models, but
it is harder to figure out when and how to transform. If the diagnostic
plots for a GLM imply there are issues with the model assumptions,
transformations may help, but finding a good one may have to be done by
trial and error.
The methods can be divided into those you can apply before and after
fitting a model to the data. We’ll consider transformations before fitting a
model in Subsection 7.1 and after fitting a model in Subsection 7.2. We’ll


then illustrate the use of transformations when modelling the citations


dataset in Subsection 7.3.

7.1 Transformations before fitting a model


For linear models, let’s look at the typical process of deciding if (and
which) transformations are needed.
Before you fit a model, it is always a good idea to look at the scatterplot(s)
of the response versus (each of) the explanatory variable(s). To see if a
variance stabilising transformation may be needed, try to identify a
relationship (a line or potentially a curve) between the explanatory
variable and the response. Then check if the points in the scatterplot are
evenly spread around this line (curve) across the whole range of the
explanatory variable, or if the points tend to be further away in some areas
of this range. If the latter is the case, this is a hint you should transform
the response variable.
To aid your decision on which transformation to use, you can look at a
histogram of the responses. If the histogram is right-skewed, try
transformations down the ladder of powers as summarised in Box 17 in
Unit 2. If the histogram is left-skewed, go up the ladder. You are advised
to try more than one transformation and redo the scatterplot(s) with the
transformed responses. Choose the transformation that seems to conform
the most to the constant variance assumption.
When you have selected a transformation for the response (or have decided
transforming the response is not necessary), you next consider what to do
with the explanatory variable(s). Again, you can look at the scatterplot(s)
of the (possibly transformed) response versus (each of) the explanatory
variable(s). Does the relationship look approximately linear? If not, a
transformation of the explanatory variable may help. You could again look
at the histogram, this time of the explanatory variable, to see if you should
go up or down the ladder of powers. As we have seen in the desilylation
example, this may not always work, in particular for some designed
experiments. Alternatively, you can propose a transformation by simply
looking at the relationship between an explanatory variable and the
response. Then redo the scatterplot(s) with the transformed explanatory
variable(s). Choose the transformation that makes the relationship most
linear.
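As a rough sketch of this exploratory process in R (y and x are placeholder names for a generic response and explanatory variable, not variables from a particular M348 dataset):

# Sketch: plots used when deciding whether (and how) to transform.
plot(x, y)        # is the spread about the trend roughly even?
hist(y)           # right-skewed? try moving down the ladder of powers

# Candidate transformations of the response; redo the scatterplot for each:
plot(x, sqrt(y))
plot(x, log(y))

# If the relationship still looks non-linear, consider transforming x as well:
plot(log(x), y)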

7.2 Transformations after fitting a model


After any initial transformations have been applied along the lines of the
methods described in Subsection 7.1, we can then fit a model. There may
be some intermediate steps needed before a model is selected, such as
dropping insignificant explanatory variables, but once we have our chosen
model, the next step is to assess the fit of the model. At this stage, we need
to check that any transformations used have led to a model that satisfies


the model assumptions. Similarly, if no transformations have been used at


this stage, the diagnostic plots provide ways to find out if they should be.
If the plot of residuals versus fitted values shows unequal spread around
the horizontal zero-line, for example a funnel shape, then there is a
problem with the constant variance assumption, and a transformation of
the response variable is recommended. You could look at a histogram of
residuals and use it in the same way as explained above for the histogram
of responses. In fact, the histogram of residuals may be more meaningful
for making this decision, as it incorporates not only the responses but your
assumed relationship between responses and explanatory variables.
In the same plot, if there is an unusual pattern, as for example in Figure 3
(Subsection 2.2) where the residual plot had a clear quadratic shape, this
implies that a transformation of at least one explanatory variable or an
extra term is needed. In the desilylation example, we saw that sometimes
it is not enough to simply transform an explanatory variable, but it may
be necessary to include both the original (linear) term and an additional
non-linear term, for example a quadratic term to improve the model fit. To
find out which explanatory variable is the culprit, we can look at the
scatterplot(s) of residuals versus each explanatory variable in turn. For
example, for the desilylation data, it was clear from Figure 4
(Subsection 2.3) that a quadratic (or similarly shaped) term in the variable
temp was needed.
We’ll complete this (short) section, by investigating a transformation of
the response variable in the citations dataset in Subsection 7.3.

7.3 The citations dataset revisited


In Subsection 6.2, we fitted a linear model (Model (17)) to the citations
dataset. The diagnostic plots given in Figure 17 showed that this model
provides a good fit overall. However, the residual plot and the normal
probability plot shown in Figure 17 are not the whole story. We can learn
more about the model fit by looking at a scatterplot of the residuals
against the covariate yearDiff and a comparative boxplot of the factor
journal; these are shown for this model and data in Figure 18, next.

Figure 18 (a) Scatterplot of the residuals against the covariate yearDiff and (b) comparative boxplot of the residuals for the factor journal, for Model (17)

We’ll consider Figure 18 next in Activity 45.

Activity 45 Any problems with the model assumptions?

Do the two plots in Figure 18 support the assumption that the Wi ’s have
zero mean and constant variance?

Following on from Activity 45, from the scatterplot of the residuals against
the values of yearDiff, it looks like the variance increases as the time
since publication increases. This makes intuitive sense, since soon after
publication, when only few researchers have read the article, there will
generally be a small number of citations, so the variability of
numCitations when yearDiff is small will also be small. However, an
article that has been published for a long time may get a large number of
citations or very few, depending on how interesting the research in the
article is to other researchers. This means that the variability of
numCitations when yearDiff is large is likely to be larger than when
yearDiff is small.
We could try a variance stabilising transformation of the response variable
numCitations. Here, we want to decrease the variance as yearDiff
increases, so we could try going down the ladder of powers. Typical
transformations in this case are the square root and the log
transformation. The increasing vertical spread of the residuals with the
values of yearDiff didn’t look too severe, so we should start with the
‘mildest’ transformation, in this case the square root.


Figure 19 shows a scatterplot of the square root of the number of citations
versus the time since the article was published, with the journal types
identified.

Figure 19 Square root of numCitations versus yearDiff, with the journal type (standard statistics, prestigious statistics or medical) identified

A parallel slopes model with constant error variance looks promising for
the data in Figure 19.
Figure 20, given next, shows the diagnostic plots for the citations data
after fitting the linear model to the square-root transformed response,

√numCitations ∼ yearDiff + journal. (18)
Figure 20 also shows the scatterplot of the residuals for fitted Model (18)
against the values of yearDiff, and a comparative boxplot of these
residuals for the factor journal.
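As a sketch of how Model (18) and the plots in Figure 20 might be obtained in R (the data frame name citations is an assumption, and journal is assumed to be stored as a factor):

# Fit the linear model to the square-root transformed response
model18 <- lm(sqrt(numCitations) ~ yearDiff + journal, data = citations)

# (a) Residuals versus fitted values and (b) normal probability plot
plot(fitted(model18), residuals(model18),
     xlab = "Fitted values", ylab = "Residuals")
qqnorm(rstandard(model18))
qqline(rstandard(model18))

# (c) Residuals against yearDiff and (d) comparative boxplots by journal
plot(citations$yearDiff, residuals(model18),
     xlab = "yearDiff", ylab = "Residuals")
boxplot(residuals(model18) ~ citations$journal,
        xlab = "journal", ylab = "Residuals")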


Figure 20 Diagnostic plots after fitting Model (18): (a) residual plot, (b) normal probability plot, (c) scatterplot of residuals against yearDiff, and (d) comparative boxplots of residuals for the three levels of journal

Let’s compare the diagnostic plots in Figure 20 with the corresponding
plots in Figures 17 and 18.
• Residuals versus fitted values: This plot looks acceptable for both
models. There are very few large fitted values, so what might look like a
decreasing vertical spread in Figure 20(a) is no cause for concern.
• Normal probability plot of standardised residuals: This plot looks
acceptable for both models, with the points in Figure 20(b) roughly
following the straight line.
• Residuals versus time since publication: Here, the square root
transformation wins. The increasing vertical spread that we observed in
Figure 18(a) has gone in Figure 20(c).


• Residuals versus type of journal: Again, the square root transformation
wins. While we could have accepted this plot for the untransformed
responses in Figure 18(b) due to the small sample size when journal
is 1, the boxplot when journal is 1 in Figure 20(d) is centered close to 0,
and the length of the box is closer to that of the box when journal is 0.
Overall, the linear model for the square root transformed responses seems
to satisfy the model assumptions even better than its untransformed
counterpart.
It is to some extent up to the statistician to decide when to accept a model
as it is, and when to try a transformation. Some statisticians would have
been perfectly happy with the model diagnostics in Figures 17 and 18,
whereas others would suggest a transformation of the response. Ultimately,
there is not always a right or a wrong decision.

8 Who’s afraid of outliers?


Outliers are data points which do not seem to follow the pattern of the
majority of data. We discussed possible methods of detecting potential
outliers and some strategies for dealing with them in Subsection 4.2 in
Unit 5.
In this section, we will try to understand intuitively some possible reasons
for outliers, and further explore how best to deal with them. We will then
apply our strategy to various datasets. The take home message from this
section is that potential outliers should always be considered in the context
of the dataset, and how it has been collected.
We’ll start in Subsection 8.1 by looking at some possible reasons for
outliers. Then in Subsections 8.2 and 8.3 we’ll consider strategies to deal
with outliers.

8.1 Reasons for outliers


There can be many different reasons for outliers, and it is often useful to
have an initial look at a dataset to search for unusual values, and also to
consider how the data have been collected, to rule out potential mistakes
in data collection or recording. A histogram or similar plot to visualise
your data is often useful here. In particular, watch out for measurements
that may have been recorded in different units!


Aside 1 COVID vaccine invitation mix-up


In the UK during February 2021, a healthy man in his 30s was offered
a COVID vaccine far ahead of his age cohort, after a National Health
Service error mistakenly listed him as just 6.2 cm in height. He was
told he qualified for the jab because his measurements gave him a
body mass index of 28 000! It turned out his height had been recorded
in imperial units, and he was 6 feet 2 inches tall.
(BBC News, 2021)

Spot the odd one out!
Suppose, for example, that in the FIFA 19 dataset you find that a
footballer’s height has been recorded as 7 inches. This is clearly a mistake!
In this case, you have two options. The first one is to delete the value of
height for this footballer and to analyse the data with this particular
value missing. The other option is to try and find the correct value for the
footballer’s height. In this particular example, this may be possible as it is
likely that the heights of these international footballers have been recorded
in more than one database. Make sure you use a reliable source, though!
Similarly, if in the desilylation dataset one of the (standardised) values of
the variable temp, say, had been recorded as 10, this is very likely a
mistake, since the chemists had decided that all (standardised) values of
temp are between −2 and 2. In such a case, it is recommended to go back
to the person who recorded the results, if at all possible, to confirm the
correct value. It may be tempting to use the symmetry of the design you
saw in the scatterplot matrix in Figure 1 (Subsection 1.3) to find out
which value should be the correct one, and you are likely to be right in this
case of a carefully designed experiment. However, there is a remote
possibility that, after all, the wrong temperature has accidentally been
used in this particular run of the experiment!
Another example where outliers may occur by mistake is when a dataset
has been merged from a large database or different sources, such as the
Olympics dataset in Unit 5. There is room for error by omitting values
that should be in the dataset, duplicating rows of data or adding data that
do not belong. For example, we could accidentally add data from a winter
Olympics, and suddenly countries such as Switzerland and Austria have
far more medals than expected. These data are not ‘wrong’, but do not
belong to the ‘population’ you are studying, in this case the medals at
summer Olympics.
Some of the data may also have been influenced by a situation beyond
anyone’s control. Imagine a group of UK city planners in 2022, say, who
study how employees commute to work, in order to plan public transport
provision for their city. An important source of data for their research
would be the national census, which is run every 10 years by the Office
for National Statistics. (The 2021 census may have been the last, but there
will be equivalent ways of sourcing these data in the future.) If they look,


for example, at the three most recent censuses to model commuting habits
in their city, it seems likely that the data from the March 2021 census will
come up as outliers. Far more employees than expected are working from
home. A whole dataset is an outlier here!
While the 2021 data provide a unique snapshot of commuting patterns
during a pandemic, this may not be particularly useful to predict demand
for public transport in the future. However, just using data from
pre-pandemic times may also not give the full picture. Some employees
may change their commuting habits permanently after the pandemic. In
this case, it may be best for the city planners to conduct their own survey
to get the most reliable data to predict demand.
The COVID-19 lockdowns meant that places which are usually very busy, like St Mark’s Square in Venice, Italy, were empty
For the rest of this section, we will investigate strategies to deal with
outliers that do not result from mistakes or where there are no obvious
reasons for them being outliers.

8.2 Changing the model


Once we have ruled out any errors as best we can, we may find potential
outliers through statistical methods such as Cook’s distance, as introduced
in Subsection 3.3 of Unit 2. A simple alternative method can be to look at
the values of the standardised residuals after fitting a model. If the model
assumptions hold, the standardised residuals follow a standard normal
distribution, so roughly 95% of them should lie in the interval (−1.96, 1.96) or,
approximately, (−2, 2). Therefore, we would expect no more than about 5% of the
standardised residuals to be outside (−2, 2). If standardised residuals are
either much larger than 2 (or much smaller than −2), or if a much larger
percentage than 5% is outside (−2, 2), then we may have an outlier
problem.
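Both checks are straightforward to carry out in R once a model has been fitted. A minimal sketch, assuming a fitted model object called model (an illustrative name):

stres <- rstandard(model)     # standardised residuals
sum(abs(stres) > 2)           # how many lie outside (-2, 2)?
100 * mean(abs(stres) > 2)    # percentage outside (-2, 2)

cd <- cooks.distance(model)   # Cook's distances
plot(cd, type = "h",
     xlab = "Observation number", ylab = "Cook's distance")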
These methods are applied after fitting a model to the data, so any outliers
that come up are dependent on the model we have selected. This may
mean that the model we have fitted is not able to capture important
features of the data. In such a case, it makes sense to look for a better
model for the data. This ‘better’ model may be found by adding extra
terms to the linear predictor, such as interactions between explanatory
variables, transformations of explanatory variables or even (if available)
new explanatory variables. Other changes to the model that may help
address outliers include transformations of the response, changing the link
function or changing the underlying distribution of the model.
We will first look at two examples where the model was inadequate to
capture the data and, as a result, the linear predictor of the model was
extended.


Example 3 Outliers in the desilylation dataset


Consider once again the desilylation dataset. In Section 2, we fitted
Model (3) (from Subsection 1.2) to these data – that is, we fitted the
multiple regression model
yield ∼ temp + time + nmp + equiv.
The Cook’s distance plot for this model is shown in Figure 21. Notice
that the plot flags up four potential outliers, for observations 5, 12, 17
and 18.

Figure 21 Cook’s distance plot for Model (3) for the desilylation data

The normal probability plot in Figure 3(b) in Subsection 2.2 also
shows four potential outliers with standardised residuals that are
around −2 or smaller at the bottom left of the plot. These also
correspond to observations 5, 12, 17 and 18, so the same outliers are
identified through this simple rule-of-thumb method! The sample size
is 30, and 5% of 30 is 1.5, so we would expect no more than
two values outside (−2, 2).
By assessing the plots in Figure 3, we previously decided that
Model (3) is a poor fit to the data, so the outliers that are flagged up
in Figure 21 may describe a feature that is not captured by the model.
In particular, the quadratic shape in the ‘Residuals versus fitted
values’ plot in Figure 3(a) gave a strong hint that quadratic terms
were missing from the model.


To obtain a model with a good fit, we added squared terms of all
explanatory variables and interactions between all pairs of explanatory
variables into the model, to end up with Model (4) (Subsection 2.4).
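One way in which these two models might be specified in R is sketched below; the data frame name desilylation is an assumption, and I() is used so that the squared terms are interpreted arithmetically:

# Model (3): linear terms only
model3 <- lm(yield ~ temp + time + nmp + equiv, data = desilylation)

# Model (4): main effects, all pairwise interactions and squared terms
# ((temp + time + nmp + equiv)^2 expands to the main effects plus all
# two-way interactions)
model4 <- lm(yield ~ (temp + time + nmp + equiv)^2
                     + I(temp^2) + I(time^2) + I(nmp^2) + I(equiv^2),
             data = desilylation)

cooks.distance(model3)   # the Cook's distances plotted in Figure 21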

Example 3 shows that it is a good idea to look at the diagnostic plots
together with the Cook’s distance plot, to find a better model, which will
then (hopefully) not have any (or, at least, not too many) outliers.
Next, we will look into an example where initially a factor had been missed
out from the model.

Example 4 Outliers in the citations dataset


An example where a new variable was needed is an earlier version of
the citations dataset. In this early version of the dataset, pubYear
(the year of publication) was the only explanatory variable that had
been recorded. After transforming the variable pubYear to yearDiff
(the number of years since publication), a linear model of the form
numCitations ∼ yearDiff (19)
was fitted to the citations dataset. The Cook’s distance plot for this
model is shown in Figure 22. This flags up that potential outliers are
observations 8, 15 and 20.

Figure 22 Cook’s distance plot for Model (19) for the citations data


A look at the data reveals that the potential outliers all correspond to
very large response values, so that they correspond to articles that
have large numbers of citations. This is visualised in Figure 23 where
a scatterplot of numCitations and yearDiff is shown, together with
the regression line from Model (19).

Figure 23 Scatterplot of numCitations and yearDiff, with the regression line for Model (19)

It is clear from Figure 23 that the regression line cannot capture the
very high values of the response. Is it just chance that these articles
have so many more citations than the rest? We asked the researcher
who provided the data and an explanation was found.
The number of citations an article will get depends on which type of
journal it has been published in. Articles in prestigious journals are
read by more researchers and are thus more likely to be cited. Also,
articles published in journals that are also read by researchers from
other disciplines may get more citations. For the citations dataset, it
turned out that the article with the largest number of citations
(observation 15) is about the statistical analysis of a clinical trial, and
had therefore been published in a medical journal. The three articles
with the next highest numbers of citations (observations 8, 17 and 20)
had been published in prestigious statistics journals. This was the
motivation behind creating the factor journal reflecting the type of
journal the respective article has been published in.


Example 4 shows that it is useful to understand as much about the data as
possible.
We have now seen two examples where the presence of outliers, together
with model diagnostics that suggested a poor fit, led us to improve the
model by adding extra terms to the linear predictor. We have to be
careful, however, not to overfit. In Example 5, we will investigate adding
an extra term to the linear predictor in Model (9) (from Subsection 4.3)
for the dose escalation dataset – that is, the logistic regression model
toxicity ∼ logDose + biomarker + logDose:biomarker.

Example 5 The dose escalation dataset revisited


In Activity 28, we found that there are five potential outliers in the
dose escalation dataset when Model (9) is fitted. The small U-shaped
dip in the plot of standardised deviance residuals versus transformed
fitted values in Figure 14 (Subsection 4.3) suggests that the terms in
the linear predictor cannot completely capture all of the features of
the data. What can we do?
At this point in time, it is unrealistic to go back to the researchers
who collected the data. Therefore, we cannot find out if they have
collected further data on the patients, such as the age, the body mass
index or certain pre-existing conditions, which we could have used as
potential further explanatory variables.
The next idea could be to fit an extra term in the continuous
explanatory variable. As there was a U-shaped dip in the plot in
Figure 14(a), a squared term looks promising, since a squared
function follows a U-shape (as we saw in Figure 5 in Subsection 2.3).
So, we’ll try fitting a logistic regression model of the form
toxicity ∼ logDose + logDose2 + biomarker
+ logDose:biomarker. (20)

Example 5 proposed adding an extra squared term into the linear predictor
for the logistic regression model for the dose escalation dataset, giving the
proposed model as Model (20).
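In R, the squared term can be included in the linear predictor using I(); a sketch, assuming the data are held in a data frame called doseEscalation (an illustrative name) with variables toxicity, logDose and biomarker:

# Model (9): without the squared term
model9 <- glm(toxicity ~ logDose + biomarker + logDose:biomarker,
              family = binomial, data = doseEscalation)

# Model (20): with an additional squared term in logDose
model20 <- glm(toxicity ~ logDose + I(logDose^2) + biomarker
                          + logDose:biomarker,
               family = binomial, data = doseEscalation)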


The diagnostic plots after fitting Model (20) are shown in Figure 24, and
they look promising!

Figure 24 Diagnostic plots for Model (20) for the dose escalation data

But . . .
Why a ‘but’ ? Aren’t the diagnostic plots looking much better than those
in Figure 14 for Model (9) without the squared term? Surely, only a
gloomy and pessimistic statistician can find fault here!
Or is there anything else we should consider?
We now have a large number of residuals that are essentially 0. Shouldn’t
we have a look to see whether we are overfitting? Are we fitting a model
that fits this particular dataset really well, but would not be useful for
predicting toxicity in future patients?
The revelation comes when we try to find the optimum dose. Let’s have a
look at the fitted probabilities of toxicity in the biomarker positive group
in Figure 25.

Figure 25 Plot of the fitted probabilities in the biomarker positive group for Model (20): probability of toxicity against dose (mg/m2)

We’ll interpret Figure 25 next in Activity 46.

Activity 46 Interpreting Figure 25

(a) Can you spot a problem with the plot in Figure 25?
(b) What could happen if we used this model to find the optimum dose
(the dose where the probability of toxicity is 0.16)?
(c) Can you think of a statistical explanation for this shape of the plot?

The unexpected shape of the fitted probabilities seen in Figure 25 is a
feature of this particular dataset: one patient in the lowest dosage group,
where the probability of toxicity should be quite low, got unlucky, while in
the next two dosage groups, where the probability of toxicity should be
higher than in the lowest group (while probably still being low), no
patients experienced toxicity.
In summary, Model (20) follows the data too closely to be of use for
answering the research question. While improving the ‘outlier’ problem we
had with Model (9), our alternative Model (20) would be of no use to
clinicians.


In this section, we have explored several examples where we changed the
model to accommodate what was flagged up as potential outliers. In two
cases (Examples 3 and 4), this worked well. In Example 3 we were guided
by the patterns in the diagnostic plots to come up with a better model,
whereas in Example 4 we needed more context to the dataset before we
could improve the model. In Example 5, however, changing the model led
to overfitting, and we could not find a sensible model that would
accommodate all the data.

8.3 Can we just delete outliers?


We have seen that outliers can be annoying. Can we just get rid of them
to get a model that fits the data better? Well, unless the value is an
obvious mistake, such as the man who is 6.2 cm tall, we may be throwing
away important information. Example 6 revisits the dose escalation trial
to point out the potential dangers of excluding data.

Example 6 Deleting outliers in the dose escalation dataset
In the dose escalation trial, we identified five points with large
standardised residuals when Model (9) (from Subsection 4.3) is fitted,
and adding a term to the linear predictor did not result in an
improved model. Let’s consider what would happen if we deleted
these potential outliers. Closer inspection of the residuals reveals that
all five large standardised residuals belong to observations where
Yi = 1. There are only six patients (out of 49 in the trial) who
experienced toxicity. If we removed the five observations that are
flagged up as outliers, the entire study would no longer make sense.
The only observation of toxicity would then belong to a patient in the
biomarker negative group, at the highest dose, 260 mg/m2 . We would
have deleted all records of toxicity in the biomarker positive group! A
conclusion from those data might then be that none of the studied
doses causes toxicity in the biomarker positive group, with the
consequence that patients in this group might be given even higher
doses.
Having seen in Subsection 5.3 that for the biomarker positive group a
relatively low dose of around 180 mg/m2 should be recommended, this
could lead to disastrous outcomes.

Not all decisions around data will have severe consequences like the ones
described in Example 6, but we should be aware that our decisions on
analysing and interpreting data may have some impact in the real world.
This could be patients’ health if the data come from a clinical trial, or a


company’s financial health as in the desilylation example. In the latter
case, the company was interested in finding the settings where the
maximum yield of an alcohol was achieved, so they could optimise their
production processes. If we had simply excluded some of the desilylation
data after looking at the Cook’s distance plot in Figure 21
(Subsection 8.2), instead of improving the model, then we may have ended
up with a very different answer to the research question. As a
consequence, we might have recommended a setting far from the optimal
one, and the production process might have become inefficient.
So, what can we do if we find ourselves lumbered with some stubborn
outliers, for which we cannot find an obvious explanation or an obvious
way to change the model such that they can be accommodated? In that
case, we could consider doing the statistical analysis twice, once with all
data, and then again with the potential outliers removed. If both analyses
give similar conclusions, for example very similar values for the optimal
settings in the desilylation experiment, then we can be confident that
we’ve found a sensible answer to the research question.
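The two analyses might be organised in R along the following lines, assuming a data frame mydata and a vector flagged containing the row numbers of the potential outliers (all names are illustrative, and the model formula is the one used for the desilylation data):

# Analysis 1: fit the chosen model to all of the data
fit_all <- lm(yield ~ temp + time + nmp + equiv, data = mydata)

# Analysis 2: refit the same model with the flagged observations removed
fit_reduced <- update(fit_all, data = mydata[-flagged, ])

# Compare the estimated coefficients from the two analyses
cbind(all_data = coef(fit_all), outliers_removed = coef(fit_reduced))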
If the two analyses give different answers, compare these carefully. If
possible, discuss the results with the researchers who collected the data
and/or who have set the research question. They may be able to find a
reason for the results. Get as much context on the data collection process
and the research question as possible! In any case, report both analyses
and point out the similarities and the differences.
Alternatively, if there are concerns that outliers may skew the conclusions,
there are also statistical methods that are robust to outliers and are
designed to be not overly affected by them. These are often summarised by
the term ‘robust regression’. We won’t go into detail of these methods
here, but you can think of them as generalisations of the solution to the
following problem.
Suppose we want to estimate the income of an average household in the
UK, and we have a sample of household income data. Some households
have very high incomes, and if we just calculate the sample mean of the
observations, these high values will drag the mean up to a level that
‘average’ households will never reach.
The sample mean is affected by unusual values. The sample median of the
data can be used instead of the mean to get a feel for what an ‘average’
household earns, since the sample median is not affected by some very high
(or very low) values in the sample. Robust regression extends this (and
similar) idea(s) to linear and generalised linear regression models.
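To illustrate the idea, the small sketch below contrasts the sample mean and sample median for a simulated income sample; the incomes are invented purely for illustration, and rlm() from the MASS package is mentioned only as one possible route to robust regression:

set.seed(1)
# Simulated household incomes: mostly moderate, with a few very high values
income <- c(rnorm(95, mean = 28000, sd = 6000),
            rnorm(5, mean = 400000, sd = 50000))
mean(income)     # pulled upwards by the few very high incomes
median(income)   # closer to what a 'typical' household earns

# A robust regression could then be fitted along the lines of
# library(MASS)
# robust_fit <- rlm(y ~ x1 + x2, data = mydata)   # illustrative names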


9 The end of your journey: what next?
Here, your journey through M348 comes to an end. But before we say ‘goodbye’,
we’ll have a glimpse of what the future might hold, possibly in a
postgraduate degree, or in work as a statistician/data scientist.
You have seen that generalised linear models provide us with a flexible
toolbox to fit many types of data. Of course, there is much more to
statistical modelling than could be covered here in M348. In this section,
some pointers are given to further areas of applied statistical modelling.
There are so many more exciting areas of applied statistical modelling to explore beyond M348 . . .
Distributions which can’t be used in a GLM
Some distributions are too complex to form the basis for a GLM, but can
still be used to build a similar type of model including a linear predictor
and a link function. The extra complexity increases the flexibility of the
resulting model, so this could be the way to go if none of the GLMs
(including with transformations) provides a good fit.
An example of such a distribution is the negative binomial distribution,
which can be used to model count data. This distribution has one more
parameter than the Poisson distribution and is therefore better equipped
to avoid the issue of overdispersion.
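For example, a negative binomial regression with a log link can be fitted in R using glm.nb() from the MASS package; the data frame and variable names below are illustrative only:

library(MASS)

# Negative binomial model for an overdispersed count response
nb_model <- glm.nb(numCitations ~ yearDiff + journal, data = citations)
summary(nb_model)   # includes an estimate of the extra parameter (theta)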
Another example is the beta distribution, which is a continuous
distribution defined on the interval (0, 1). It tends to be used to model
continuous rates, proportions or indices.
Non-linear predictors
In some applications, the experimenters can derive a model from physical
laws, and such a model is not necessarily linear. For example, in
biochemistry, the rate, v, of an enzymatic reaction is related to the
concentration, x, of the substrate by the Michaelis–Menten equation
v = αx/(β + x),
where α and β are the unknown model parameters we want to estimate.
Such knowledge about the model can be incorporated into the modelling
process by using non-linear predictors in our GLMs. The resulting models
are then called generalised non-linear models.
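As a simple illustration, a Michaelis–Menten curve could be fitted by non-linear least squares in R using nls(); the data frame reaction, its columns x and v, and the starting values are all assumptions for the sake of the sketch:

# Michaelis-Menten curve fitted by non-linear least squares
mm_fit <- nls(v ~ alpha * x / (beta + x),
              data = reaction,
              start = list(alpha = 10, beta = 1))
summary(mm_fit)   # estimates of alpha and beta with standard errors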
Censored data
In Unit 7, you met the exponential regression model, which is used to
model ‘time to event’ data. However, we are not necessarily able to
observe when the event of interest happens. For example, doctors may be
interested in how long their patients survive after a liver transplant.
Luckily, even five years after the operation, the majority of transplant
recipients are still alive, so at this point the actual survival time is
unknown. We only know it is more than five years. Such data are called


‘censored’, and we need to take censoring into account when modelling the
data.

Modelling extreme events


In extreme value modelling, researchers are interested in modelling the
extreme events that might occur. For example, they might be interested in
modelling the height of flood water, to try to estimate how bad that
‘once-a-century’ flood might be. There are many further applications in
this area, such as modelling/predicting extreme heatwaves or large
wildfires. To do this, researchers use a distribution known as the
generalised extreme value distribution, which is a family of continuous
probability distributions specifically developed within the extreme value
framework.

Generalised linear mixed models


In some experiments, researchers take repeated measurements on the same
subjects to investigate the trajectory they follow. This could be in a
medical context where the long-term effects of a treatment are assessed by
examining the patients every year since starting the treatment. Similarly,
learning trajectories could be assessed by testing students several times,
each time a couple of weeks apart. It seems likely that responses for the
same patient (or student) cannot be assumed to be independent. For
example, students who scored highly in the first three tests are likely to do
well in the fourth test. We cannot handle dependent data with the tools
from Units 1 to 8 of M348, but a simple generalisation to so-called
generalised linear mixed models will do the trick. These models have many
further applications, not just for modelling repeated measurements data.

Non-parametric/semi-parametric methods
All generalisations of GLMs so far are still based on parametric models
where we assume a general form, including a distribution, and then we
estimate some unknown model parameters. If we have no idea what model
might fit our data, or if we tried several seemingly plausible models
without success, a non-parametric or semi-parametric method could be an
option. These methods include splines, wavelets, local polynomials,
support vector machines and generalised additive models, to name but a
few. They are more flexible than parametric models, but may be more
difficult to fit and to interpret.


Summary
In this unit, we have reviewed the main themes of M348: linear models and
generalised linear models, how they are related, and their application in
practice. You may have seen more clearly the similarities and links between
many of the concepts and methods that you have learned in the module.
You will also have gained more practice at using many of these methods.
When thinking about possible models for your data, you have many
options. You select a distribution for the response variable, a link function
that links the distribution mean to the linear predictor, and finally the
form of the linear predictor itself. Which variables should be included?
Are there significant interactions? Do we need transformations? From this
arguably incomplete list, you can see the options seem endless.
Ultimately, we should be pragmatic about model choice. There is usually
not one ‘best’ model, but many models that don’t fit, and some that do.
The aim is not to find an elusive perfect model, but a model that is
justifiable. Questions you could ask yourself are: Are the assumptions on
my model reasonable (or at least not too unreasonable)? Can I justify
what I have done? Does my analysis answer the research question?
If in doubt, and where possible, discuss your findings with the researchers
who collected the data. Often, modelling is just a means to an end to
answer a specific research question in the application area. Do your results
make sense to them? Don’t be afraid of asking questions. Lots of
questions! Communication is key.
The route map for this final M348 unit is repeated here, as a reminder of
what has been studied and how the sections link together.


The Review Unit route map
• Section 1: Multiple regression: understanding the data
• Section 2: Multiple regression: model fitting
• Section 3: Multiple regression with factors
• Section 4: The larger framework of generalised linear models
• Section 5: Relating the model to the research question
• Section 6: Another look at the model assumptions
• Section 7: To transform, or not to transform, that is the question
• Section 8: Who’s afraid of outliers?
• Section 9: The end of your journey: what next?


Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate that there is no ‘correct’ model for a dataset and different
statisticians might recommend different models for the same data
• appreciate the importance of communication between the statistician
doing the modelling and the researcher who collected the data
• understand that the model needs to be able to address the research
question of interest
• have an understanding of linear models and generalised linear models,
and the connection between them
• appreciate that fitting a generalised linear model requires three decisions:
◦ which distribution to use for the response variable
◦ which link function to use
◦ which terms to include in the linear predictor
• test whether individual covariates, factors and interactions need to be
included in a model
• compare models using the adjusted R2 statistic and/or the AIC
• check the model assumptions through appropriate diagnostic plots
• appreciate that there can be more than one way to model a dataset, and
the ‘natural’ model for a dataset may not necessarily be the best model
to use
• appreciate that a model can sometimes be improved by transforming one
or more of the variables, or adding another explanatory variable or an
interaction
• appreciate some of the possible reasons for outliers and strategies for
dealing with them
• use R to fit a logistic regression model to contingency table data
• use R to compare models.


References
BBC News (2021) ‘Covid: Man offered vaccine after error lists him as
6.2 cm tall’, 18 February. Available at: https://www.bbc.co.uk/news/
uk-england-merseyside-56111209 (Accessed: 5 December 2022).
Biedermann, S. (2006) Private communication with one of the researchers
who conducted the experiment.
Cotterill, A. and Jaki, T. (2018) ‘Dose-escalation strategies which use
subgroup information’, Pharmaceutical Statistics, 17, pp. 414–436.
Elsevier (2021) ‘Biedermann, Stefanie’, Scopus (Accessed: 10 February
2021).
NHS Digital (2020) National Child Measurement Programme, England
2019/20 School Year. Available at: https://digital.nhs.uk/data-and-
information/publications/statistical/national-child-measurement-
programme/2019-20-school-year (Accessed: 17 September 2022).
Nicholson, H.S., Krailo, M., Ames, M.M., Seibel, N.L., Reid, J.M.,
Liu-Mares, W., Vezina, L.G., Ettinger, A.G. and Reaman, G.H. (1998)
‘Phase I study of Temozolomide in children and adolescents with recurrent
solid tumors: a report from the children’s cancer group’, Journal of
Clinical Oncology, 16(9), pp. 3037–3043.
Owen, M.R., Luscombe, C., Lai, L., Godbert, S., Crookes, D.L. and
Emiabata-Smith, D. (2001) ‘Efficiency by design: optimisation in process
research’, Organic Process Research and Development, 5, pp. 308–323.
Rodriguez-Martinez, A. et al. (2020) ‘Height and body-mass index
trajectories of school-aged children and adolescents from 1985 to 2019 in
200 countries and territories: a pooled analysis of 2181 population-based
studies with 65 million participants’, The Lancet, 396(10261),
pp. 1511–1524. doi: 10.1016/S0140-6736(20)31859-6.
Silberzahn, R. et al. (2018) ‘Many analysts, one data set: making
transparent how variations in analytic choices affect results’, Advances in
Methods and Practices in Psychological Science, 1(3), pp. 337–356. doi:
10.1177/2515245917747646.


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, The GlaxoSmithKline Carbon Neutral Laboratory for
Sustainable Chemistry: Michael Thomas / Flickr. This file is licensed
under the Creative Commons Attribution 2.0 Licence (CC BY 2.0),
https://creativecommons.org/licenses/by/2.0/
Subsection 1.4, ‘Phew!’: Mykola Kravchenko / 123rf
Subsection 3.1, coal miner with a canary: Laister / Stringer / Getty
Subsection 4.1, British bluebells: Ket Sang Tai / 123RF
Subsection 4.2, Albert Einstein: Orren Jack Turner / Wikipedia / Public
Domain
Subsection 4.2, apples and pears: Inna Kyselova / 123RF
Subsection 4.3, medicine dose: rawpixel / 123RF
Subsection 4.4, child’s height being measured: Janie Airey / Getty
Subsection 5.2.1, average child heights: Lingkon Serao / 123RF
Subsection 5.4, working in a pharmaceutical industry: traimak / 123RF
Section 7, transformations: Sutisa Kangvansap / 123RF
Subsection 8.1, St Mark’s square during lockdown: federicofoto / 123RF
Section 9, looking out over the horizon: mihtiander / 123RF
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
All four explanatory variables are continuous, and are therefore covariates.
To see this, you can think of the units that the variables are measured in.
• The temperature at which the reaction is run, temp0, is measured in °C,
and can be set to any value the experimenters deem sensible. This is
clearly a continuous variable.
• The time for which the reaction is run, time0, is measured in hours, and
can also be set to any value the experimenters deem sensible (not
necessarily just whole hours). Therefore, this is also a continuous
variable.
• The variable nmp0 measures the concentration of the solution in terms of
volumes of the solvent NMP. Again, this can take any value within a
range the experimenters may specify. This is also a continuous variable.
• The variable equiv0 measures the molar equivalents of the reagent. A
molar equivalent is the ratio of the moles of one compound to the moles
of another. Again, this is something the experimenters can vary within a
sensible range, and is thus continuous.
This matters because the type of variable will affect the way the model is
set up: both types of explanatory variables can be handled in the
framework of regression models, but the inclusion of factors requires the
use of indicator variables.

Solution to Activity 2
(a) First, you should reacquaint yourself with the interpretation of the
estimated coefficients, as explained in Subsection 1.2 of Unit 2.
Each β̂j represents the expected change in the response when the
corresponding jth covariate increases by one unit, assuming all other
covariates remain fixed.
• For β̂1: We would expect this to be positive. The yield is
considerably higher for the second and fourth observations, where a
higher temperature is used, than for the first, third and fifth
observations. In particular, when comparing the second observation
with the first observation, we can see that the yield is higher and
the temperature was the only covariate that changed its value. The
same argument holds when comparing the fourth observation with
the third observation. This indicates that increasing the
temperature may increase the yield.
• For β̂2: This is less clear-cut. Both the highest and the lowest yield
occur at the shorter reaction time. We would need to see more data
to get a better idea about β̂2.


• For β̂3: There is some indication this may be negative. The
minimum yield occurs at the highest value of nmp0, so increasing
this variable may lead to a decrease in yield.
• For β̂4: We cannot say anything about β̂4 at this stage, because in
all five observations listed here, the variable equiv0 has the same
value. We therefore cannot assess what happens to the yield when
the value of equiv0 is increased.
(b) For the first and third observations, the only covariate that changes
its value is time0, while all other covariates are fixed. An increase in
time0 coincides with an increase in the response variable yield.
For the second and fourth observations, again the only covariate that
changes its value is time0, while all other covariates are fixed. This
time, however, an increase in time0 coincides with essentially no
change in the response variable yield.
Covariates nmp0 and equiv0 are fixed for all four observations,
whereas the temperature for the first and third observations is lower
(at 15 °C) than it is for the second and fourth observations (25 °C).
In summary, at a lower temperature the yield seems to increase when
time0 is increased, whereas at a higher temperature the yield remains
essentially constant when time0 is increased. In other words, the
yield is affected differently by an increase in time0, depending on the
value of temp0.
We have seen this phenomenon before, in Unit 4. There, we defined
interactions between explanatory variables to accommodate
differences between the effect of one of the explanatory variables on
the response for different values of the other explanatory variable. In
the example being considered here, this means that we should
consider adding an interaction between time0 and temp0 to the
proposed model. We will return to the idea of including an interaction
when modelling yield in Subsection 2.4.

Solution to Activity 3
In Subsection 5.1 of Unit 2, the scatterplot matrix was introduced. This is
a graphical tool that shows scatterplots for all pairs of variables in the
dataset.
You can use it to assess how the response variable is related to each
explanatory variable. For example, if the relationship between an
explanatory variable and the response is non-linear, then a scatterplot can
help us to decide whether a transformation of the explanatory variable
may be useful. A scatterplot can also indicate issues with non-constant
variance and can therefore help us to decide whether a transformation of
the response may be useful. Additionally, the scatterplot matrix can be
useful for spotting relationships between the explanatory variables.


Solution to Activity 4
The regression coefficients in a multiple regression model are partial
regression coefficients, which means they are associated with a variable’s
contribution after allowing for the contributions of the other explanatory
variables. Therefore, if a variable is highly correlated with another
variable, it will have little or no additional contribution over and above
that of the other. This means there is a case for omitting one of the
variables from the model.

Solution to Activity 5
Notice that all six pairwise scatterplots look the same. They consist of
nine points, four of which are placed on the vertices of a square that is
standing on one vertex. Four further points are in the middle of each edge
connecting the vertices. The last point is in the centre of the square.
The striking differences to previous scatterplots are that all of the plots are
the same, and that their patterns appear to be systematic.

Solution to Activity 6
(a) This is an observational study, since we simply observe the values of
the explanatory variables rather than influencing their values. It is
not easily possible to influence the GDP or the number of medals won
by a country at the previous Olympics.
(b) This is a designed experiment, since the researchers selected the
values (‘new’ or ‘usual’) of the explanatory variable describing the
training regime, and they also decided that half of the participants
should receive each training regime. They could then compare the two
training groups and see how the choice of training regime affects the
strength of football players.

Solution to Activity 7
Some considerations the chemists would need to think about include:
• Which variables are likely to affect the yield of the alcohol of interest,
and should therefore be included as potential explanatory variables?
• What are plausible values or ranges for these variables?
• Which combinations of values should be used in the experiment?
• Sample size versus costs: how many runs of the reaction can be afforded,
and is there time to run them?
You may well have thought of other considerations!


Solution to Activity 8
(a) Let β1 , β2 , β3 and β4 be the regression coefficients of temp, time, nmp
and equiv, respectively.
To test whether the regression model contributes information to
interpret changes in the yield of the alcohol, we need to test the
hypotheses
H0 : β1 = β2 = β3 = β4 = 0,
H1 : at least one of the four coefficients differs from 0.
The test statistic for testing these hypotheses is the F -statistic (which
was reported to be 15.98).
Since the p-value associated with the F -statistic is less than 0.001,
there is strong evidence that at least one of the four regression
coefficients is different from 0. Hence, there is strong evidence that
the regression model contributes information to interpret changes in
the yield of the alcohol.
Denoting the number of regression coefficients that we’re testing by q,
the p-value for this test is calculated from an F (ν1 , ν2 ) distribution
where
ν1 = q = 4,
ν2 = n − (q + 1) = 30 − (4 + 1) = 25.

(b) Each p-value for the individual regression coefficients is calculated
from a t(ν) distribution where the degrees of freedom, ν, is the same
for each of the coefficients and is
ν = n − (q + 1) = 30 − (4 + 1) = 25.

(c) The p-value associated with the regression coefficient of temp, tests
the hypotheses
H0 : β1 = 0, H1 : β1 ̸= 0,
(assuming that β2 = β̂2, β3 = β̂3 and β4 = β̂4).
The p-value for testing these hypotheses is less than 0.001. Therefore,
there is strong evidence to suggest that the regression coefficient of
temp, β1 , is not 0, when time, nmp and equiv are in the model.
(d) The estimated regression coefficient of temp is 4.159. This means that
the yield of the alcohol (yield) is expected to increase by 4.159 if the
value of temp increases by one unit, and the values of time, nmp and
equiv remain fixed.
(e) The p-values associated with the partial regression coefficients of time
and equiv are both small (p = 0.0424 for time and p = 0.0220 for
equiv), indicating that there is evidence that β2 and β4 are different
from 0, if the other variables are in the model.


The p-value associated with the partial regression coefficient of nmp,
however, is quite large (p = 0.0762), indicating only weak evidence
that β3 is different from 0, if the other variables are in the model.
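For reference, p-values such as those quoted above can be obtained in R from the relevant F and t distributions; for example (the t-value of 2.15 is invented purely for illustration):

# p-value for the F-statistic of 15.98 on (4, 25) degrees of freedom
pf(15.98, df1 = 4, df2 = 25, lower.tail = FALSE)   # less than 0.001

# two-sided p-value for an illustrative t-value of 2.15 on 25 degrees
# of freedom
2 * pt(abs(2.15), df = 25, lower.tail = FALSE)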

Solution to Activity 9
(a) The residual plot in Figure 3(a) suggests that the zero mean
assumption of the Wi ’s does not hold. This is because the points in
this plot are not randomly scattered about zero, but instead show a
systematic pattern; there are large negative values of the residuals for
the smallest and largest fitted values, and exclusively positive
residuals for fitted values in the middle of the range.
Note, however, that the vertical scatter is fairly constant across the
fitted values, indicating that the assumption of constant variance does
seem reasonable.
(b) The systematic pattern in the plot in Figure 3(a) looks like a parabola
– that is, a quadratic function.
(c) The normal probability plot in Figure 3(b) shows that the
standardised residuals clearly do not follow the straight line, but wrap
around it. So, it seems that the normality assumption also does not
hold.

Solution to Activity 10
(a) Let q be the number of regression coefficients we are testing. Then, in
Model (4) we have four coefficients for the linear terms, four for the
quadratic terms and six for the interactions. Therefore, the number of
regression coefficients in the model is q = 4 + 4 + 6 = 14.
The p-value for the test is then calculated from an F (ν1 , ν2 )
distribution where
ν1 = q = 14,
ν2 = n − (q + 1) = 30 − (14 + 1) = 15.

(b) Here we need to test the hypotheses


H0 : all 14 regression coefficients are equal to 0,
H1 : at least one of the 14 coefficients differs from 0.
The F -statistic is the test statistic for this test (reported to be 266.1).
Since the p-value associated with this test is less than 0.001, there is
strong evidence that at least one of the 14 regression coefficients is
different from 0. Hence, there is strong evidence that the regression
model contributes information to interpret changes in the yield of the
alcohol.
(c) Each p-value for the individual coefficients is calculated from a t(ν)
distribution where the degrees of freedom, ν, is the same for each of
the coefficients and is
ν = n − (q + 1) = 30 − (14 + 1) = 15.


(d) The p-values associated with all partial regression coefficients of the
linear terms are less than 0.001, indicating strong evidence for being
different from 0, if all terms are in the model.
For the quadratic terms, we find that two of the p-values (for the
coefficients of temp² and equiv²) are also less than 0.001, again
indicating strong evidence that the coefficients for these terms are
not 0, if all terms are in the model. Although the p-values for the
other two quadratic terms (time² and nmp²) are not as small, they are
still small enough to indicate that there is evidence for their
coefficients being different from 0, if all terms are in the model.
We can see that the p-values for the three interactions involving temp
are less than 0.001, indicating strong evidence that their coefficients
are different from 0, if all terms are in the model. The p-values
associated with the coefficients of time:equiv and nmp:equiv are not
so small, but still small enough to suggest that there is evidence for
being different from 0, if all terms are in the model. The evidence that
the coefficient for the interaction time:nmp is different from 0, if all
terms are in the model, is, however, only weak, since the associated p-value is larger.
The only term that we might consider dropping from the model is the
interaction time:nmp, since there was only weak evidence that this
term needs to be in the model.

Solution to Activity 11
From Box 4, a high value of the adjusted R2 is preferable, whereas a low
value of the AIC is preferable. Model (4) has the higher adjusted R2 and
the lower AIC, and therefore Model (4) is the preferred model.

Solution to Activity 12
The estimates for the linear coefficients are the same for both models.
However, their standard errors are smaller in Model (4).
This is unexpected. Back in Subsection 1.2 of Unit 2, we compared
coefficients of the simple and multiple regression models, and in all
examples we found that the estimated coefficient in the simple linear
regression model was different to the partial coefficient for the
corresponding explanatory variable in the multiple regression model.

Solution to Activity 13
(a) The plot of residuals against fitted values in Figure 6(a) does not give
any evidence to doubt the model assumptions of zero mean and
constant variance for the Wi ’s.
(b) The normal probability plot in Figure 6(b) shows that the
standardised residuals follow the straight line quite well. There is
therefore no cause for concern about the normality assumption.


Solution to Activity 14
From Subsection 3.3 of Unit 2, there is no standard rule of thumb for
deciding that a point is influential. Sometimes a Cook’s distance greater
than 0.5 is considered to indicate an influential point, but a point could
also be considered as being influential for smaller Cook’s distance values if
its Cook’s distance is large in comparison to the other Cook’s distance
values.
All of the Cook’s distance values in Figure 8 are less than 0.5, and none of
them are particularly large in comparison to the other values, and so this
Cook’s distance plot suggests that there doesn’t seem to be any problems
with influential points for these data and Model (4).

Solution to Activity 15
(a) Yes, the optical density is affected by whether or not the sample was
treated. The distribution of values for untreated samples is centered
at a higher value than the distribution of values for treated samples.
This is expected. Adding a toxin to a sample is likely to decrease the
proportion of surviving cells in a sample, resulting in lower values of
the optical density.
(b) Yes, the distribution of responses is different on different days. It’s
shifted down on the second day compared with the first.
This is not necessarily expected. Why would a measurement be
affected by the day on which it’s been taken?
(c) The data look quite similar on both days, just shifted. In particular,
the mean difference between opticalDensity for treated cells and for
untreated cells (that is, the treatment effect) looks roughly the same
on both days.

Solution to Activity 16
(a) The factor treatment should be included in the model, since we have
seen that there are differences between the responses of treated and
untreated cells.
(b) The factor day should also be included. We have seen that the
responses appear to be ‘shifted’ down on the second day.
It is good that we looked at the responses by day so that we noticed
the difference a day can make, which was confirmed by the biologist.
If we had neither looked at the data visually nor spoken with an
expert, then we might not have bothered fitting the factor day, as it
was not obvious this could be important.
(c) The interaction is probably not needed in the model, since the effect
of the treatment is roughly the same on both days.


Solution to Activity 17
The model with both factors, treatment and day, but without their
interaction, is the preferred model because it has the largest value of the
adjusted R2 statistic and the smallest value of the AIC.
Yes, this is as expected. In Activity 16, we conjectured that both factors,
but not their interaction, should be in the model.

Solution to Activity 18
(a) Let α be the intercept parameter, β1 be the regression coefficient for
the indicator variable ‘treatment 1’, and β2 be the regression
coefficient for the indicator variable ‘day 2’. Then the p-value on the
row corresponding to the factor treatment is testing the hypotheses
H0 : β1 = 0, H1 : β1 ̸= 0,
assuming that β2 = β̂2 = −0.4.
(b) The value of the test statistic is −5.408. If the null hypothesis were
true, the distribution of the test statistic would be a t(ν) distribution,
where, for a model with q regression coefficients,
ν = n − (q + 1) = 12 − (2 + 1) = 9.

(c) The p-value is very small, so there is strong evidence against the null
hypothesis that β1 = 0. Therefore, we conclude that there is evidence
that the toxin does affect cell survival.

Solution to Activity 19
(a) There is clearly day-to-day variation in the data. In particular, in this
experiment the responses on Day 2 seem lower than those on Day 1.
If the treated cells had been measured on Day 1, we would expect the
responses to be in a similar region as the three values for treated cells
for Day 1 of Figure 9 (Subsection 3.1). Similarly, if the untreated cells
had been measured on Day 2, we would expect the responses to be in
a similar region as the three values for untreated cells for Day 2 of
Figure 9. If that were the case, the model would not show a
significant effect of the treatment, as the values in the two treatment
groups would be fairly similar. The biologists would thus (wrongly)
conclude that the toxin does not affect cell survival.
(b) In this situation, we would have the opposite problem. The difference
in optical density between treated and untreated cells would appear
to be much larger than it actually is.
(c) Yes, it was a good idea. The way the experiment was planned gave us
the opportunity to separate the treatment effect from the day-to-day
variation, so the treatment effect can be estimated more precisely.


Solution to Activity 20
(a) Exponential distribution. We can view the distance until the battery
runs out of power as a ‘time to event’ variable.
(b) Bernoulli distribution. The outcome is either ‘made a claim’ or ‘no
claim’.
(c) Poisson distribution. For each policy holder, the number of claims in
the last year is counted.
(d) Binomial distribution. The number of policy holders who made a
claim, out of all policy holders, is recorded.
(e) Normal distribution. Height is a continuous variable, and it seems
plausible that it could follow a bell-shaped curve.

Solution to Activity 21
There are three potential choices when fitting a GLM: the distribution of
the response variable, the link function and the form of the linear
predictor. For the link function, we’ll use Table 14 to guide our choice, and
so we only need to choose the response distribution and the form of the
linear predictor.
So, for the citations data:
• Distribution: The response variable is a count, the number of citations
an article has accrued. The Poisson distribution is a natural choice for
modelling here.
• Link function: From Table 14, we’ll use the log link, the canonical link
function for the Poisson distribution.
• Linear predictor: We have two explanatory variables, the covariate
yearDiff and the factor journal. Both seem relevant here. We would
expect that an article that has been published for longer may have more
citations than a newer article, since more researchers will have had the
chance to read it. It also seems intuitive that the type of journal where
an article has been published could affect the number of citations.
Therefore both variables should be fitted to the model.
What about an interaction between them? This would correspond to
allowing non-parallel slopes in the linear predictor, or, in other words,
the number of citations could increase at different rates over time for
different journal types. This seems plausible.
Putting this together, the proposed model for the citations data is a
Poisson GLM (with a log link) of the form
numCitations ∼ yearDiff + journal + yearDiff:journal.


Solution to Activity 22
(a) The three categories of the factor journal coincide with different
ranges for numbers of numCitations, with articles in standard
statistics journals (journal = 0) having fewer citations than those in
prestigious statistics journals (journal = 1), which in turn have fewer
citations than those in medical journals (journal = 2). This means
that we can expect journal to be a good variable to have in our
model as it seems to explain some of the differences in the number of
citations. We were also right to expect that the number of citations
increases with the number of years since an article was published.
(b) There is only one article in a medical journal, so there is only one
observation for which journal takes value 2.

Solution to Activity 23
It is important to consider this question in the context of the data. There
is no reason to assume the numbers of citations for articles in prestigious
statistics journals and medical journals should be similar. We would not
be comparing like with like. We can also see this from our (albeit limited)
dataset, as the one response we have for a medical journal is quite different
from those for prestigious statistics journals.

Solution to Activity 24
(a) The null model, which does not take any explanatory variables into
account, is nested in Model (7). As explained in Units 6 and 7, we can
compare nested GLMs by calculating the deviance difference of the
two models, which gives
462.801 − 80.933 = 381.868.
If both models fit equally well, this difference has a χ2 (d) distribution,
where
d = difference in the degrees of freedom associated with
the null deviance and the residual deviance
= 22 − 19 = 3.
It is clear that such a large value of the deviance difference (381.868)
will yield a tiny p-value when compared against a χ2 (3) distribution.
Indeed, conducting this comparison in R yields a p-value close to 0.
We conclude that Model (7) provides a highly significant gain in fit
over the null model.
(b) If Model (7) is a good fit, then the residual deviance should come
from a χ2 (r) distribution, where
r = n − number of parameters in the proposed model.
The value of the degrees of freedom for the residual deviance (r) was
given in the question to be 19. This comes from the fact that there
are 23 observations in the citations dataset, and so n = 23, and there
are four parameters in Model (7) (the intercept parameter, the
regression coefficient for yearDiff and the two coefficients associated
with values 1 and 2 of journal). Therefore
r = 23 − 4 = 19.
So, since the residual deviance is 80.933, which is much larger than
the degrees of freedom, we can conclude that Model (7) is not a good
fit for the citations data. (This is confirmed by the p-value obtained
by comparing 80.933 against a χ2 (19) distribution, which is less
than 0.001.)
Because we have a Poisson GLM which isn’t a good fit to the citations
dataset, a potential issue here could be overdispersion, as explained in
Subsection 7.1 of Unit 7. To detect possible overdispersion, we look at
the residual deviance divided by its degrees of freedom:
80.933/19 ≃ 4.2596.
Since this is greater than 2, there is likely to be a problem with
overdispersion. (Incidentally, when removing the observation
corresponding to the medical journal and fitting Model (6), the model
with the interaction, we get
73.17/18 ≃ 4.065.
So, since this ratio is also greater than 2, removing the observation for
the medical journal does not solve the possible problem with
overdispersion.)
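These calculations can be checked directly in R from the deviances and degrees of freedom quoted above; a minimal sketch:
# deviance difference between the null model and Model (7)
pchisq(462.801 - 80.933, df = 22 - 19, lower.tail = FALSE)
# goodness-of-fit p-value for Model (7): residual deviance against chi-squared(19)
pchisq(80.933, df = 19, lower.tail = FALSE)
# rough check for overdispersion: residual deviance divided by its degrees of freedom
80.933 / 19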

Solution to Activity 25
• The red line shown in the plot of the standardised deviance residuals
against a transformation of µ̂ given in Figure 12(a) shows some possible
curvature, suggesting that either the Poisson model or its canonical link
may not be appropriate, or some important term may be missing from
the linear predictor.
• In the plot of the standardised deviance residuals versus index shown in
plot (b), the standardised deviance residuals appear to be fairly
randomly scattered about zero across the index, except for a cluster of
negative residuals associated with the first few index values.
• The plot of the squared standardised deviance residuals against index
shown in plot (c) suggests that the magnitude of the standardised
deviance residuals remains fairly constant across the index.
• The normal probability plot shown in plot (d) is reasonably close to the
straight line, with some small deviation at the lower end.
Because of the hint of curvature in Figure 12(a), it appears that Model (7)
may not be appropriate for the citations data, and that further
investigation or a different model may be needed.
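Diagnostic plots along these lines could be produced in R as follows; this is only a sketch, in which Model (7) is refitted as citations.glm7 and 2√µ̂ is used as the transformation of µ̂ (a common choice for a Poisson response, though not necessarily the one used for Figure 12):
citations.glm7 <- glm(numCitations ~ yearDiff + journal,
                      family = poisson(link = "log"), data = citations)
rd <- rstandard(citations.glm7)   # standardised deviance residuals (the default type)
mu.hat <- fitted(citations.glm7)  # fitted means
plot(2 * sqrt(mu.hat), rd)        # residuals against a transformation of the fitted means
plot(rd)                          # residuals against index
plot(rd^2)                        # squared residuals against index
qqnorm(rd); qqline(rd)            # normal probability plot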

Solution to Activity 26
In dose escalation trials, the binary variable toxicity is an obvious
response variable since the researchers want to find out how this variable is
related to the other variables in the dataset. A logistic regression model
with the binary response toxicity is therefore a natural choice for these
data. (Section 5 will provide more detail on how to select a model
according to the research question of the study.)

Solution to Activity 27
(a) The deviance difference for Models (9) and M2 is
D(M2 ) − D(9) = 31.149 − 25.398 = 5.751.
The deviance difference for Models (9) and M1 is
D(M1 ) − D(9) = 32.528 − 25.398 = 7.13.
The deviance difference for Models (9) and M0 is
D(M0 ) − D(9) = 36.434 − 25.398 = 11.036.

(b) The appropriate chi-square distribution for the deviance difference has
degrees of freedom equal to the difference in the degrees of freedom
for the models we are comparing, or the number of extra parameters
in the larger model.
For Models (9) and M2 , this difference is 46 − 45 = 1. Therefore the
distribution is χ2 (1).
For Models (9) and M1 , this difference is 47 − 45 = 2. Therefore the
distribution is χ2 (2).
For Models (9) and M0 , this difference is 48 − 45 = 3. Therefore the
distribution is χ2 (3).
(c) In all three comparisons, the p-value is small. The p-values therefore
suggest there is significant gain in model fit when including the extra
parameters from Model (9). Therefore Model (9) should be chosen.
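The corresponding p-values can be computed in R from the deviances and degrees of freedom given above; a sketch:
pchisq(31.149 - 25.398, df = 46 - 45, lower.tail = FALSE)   # Model (9) versus M2
pchisq(32.528 - 25.398, df = 47 - 45, lower.tail = FALSE)   # Model (9) versus M1
pchisq(36.434 - 25.398, df = 48 - 45, lower.tail = FALSE)   # Model (9) versus M0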

Solution to Activity 28
In the plot of the standardised deviance residuals against fitted values
given in plot (a), the typical (for logistic regression) ‘lines’ of positive and
negative deviance residuals correspond to response values of Yi = 1 and
Yi = 0, respectively. The red line, which in the ideal case should be a
horizontal line at 0, has a slight upward trajectory which is interrupted by
a small U-shaped dip. This dip may indicate that the terms in the linear
predictor cannot completely capture all of the features of the data. The
dip is rather small, but we note some small concern here.
In the plot of the standardised deviance residuals against index shown in
plot (b), if the responses are independent, the standardised deviance
residuals in the plot should fluctuate randomly and there shouldn't be
systematic patterns in the plot. Although there do appear to be some
patterns in the plot, since the actual order in which the data were collected
is unknown, we cannot assess the assumption of independence through this
plot. It is, however, noticeable that the positive residuals seem to be much
larger than the negative ones.
The plot of the squared standardised deviance residuals against index
given in plot (c) also shows some patterns. Again, since the order in which
the data were collected is unknown, we can’t really use this plot to assess
the independence assumption. This plot does, however, highlight how large
the positive standardised deviance residuals are in comparison to the
negative ones.
In the normal probability plot shown in plot (d), if the model assumptions
are correct, the standardised deviance residuals should be roughly on a
straight line. In Figure 14(b), there are five points away from the line in
the top-right corner, and some minor deviation from the line at its lower
end. So there is some evidence against the normality assumption.

Solution to Activity 29
(a) To compare the fit of these models, we look at the deviance difference.
This gives a test statistic of
976.62 − 302.78 = 673.84.
If both models fit equally well, this value comes from a χ2 (d)
distribution, where d is the difference between the degrees of freedom
of the two models, 10 − 5 = 5, or the number of extra terms fitted to
the larger model.
(b) If the p-value is close to 0, then there is very strong evidence that the
extra terms in the larger model, with all two-way interactions, are
needed.
The model omitting the interaction obese:ethnicity has a larger
residual deviance than the model omitting ethnicity:ageGroup, so
the deviance difference (the test statistic) will be even larger than the
value in part (a), with the same degrees of freedom, so the p-value will
be even smaller.
The model omitting the interaction obese:ageGroup also has a larger
residual deviance than the model omitting ethnicity:ageGroup, so
the deviance difference (the test statistic) will again be larger than
the value in part (a). This time, the degrees of freedom will be
6 − 5 = 1, but again, the p-value will be very small, since the deviance
difference is so much larger than the degrees of freedom.
(c) We have already ruled out (in part (b)) the models that omit one of
the two-way interactions. They are all significantly worse than the
model with all two-way interactions, so we just need to investigate the
latter.
In order to decide whether the model with all two-way interactions is
a good fit to the data, we can use its residual deviance: from
Table 21, this is 302.78.
If the model fits well, the residual deviance should come from a
χ2 (5) distribution. For this distribution, a value of 30.856 already
gives a p-value of less than 0.001, so a test statistic as large as
302.78 gives an even smaller p-value. We therefore conclude that the
model with all two-way interactions does not provide an adequate fit
for the data.
(A possible explanation for needing the saturated model is that the
large sample size allows us to pick up very small differences in
response for different levels of the factors. We therefore can’t leave
anything out without reducing the model fit significantly.)
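Both of these calculations can be checked in R using the deviances and degrees of freedom quoted above; a sketch:
# all two-way interactions versus main effects only (part (a))
pchisq(976.62 - 302.78, df = 10 - 5, lower.tail = FALSE)
# goodness of fit of the model with all two-way interactions (part (c))
pchisq(302.78, df = 5, lower.tail = FALSE)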

Solution to Activity 30
(a) The biologists’ research question could be written as: ‘Will the toxin
change the proportion of surviving cells in a sample?’
(b) The proportion of surviving cells in a sample is measured in terms of
the sample’s optical density; these values are recorded in the variable
opticalDensity. The variable treatment indicates whether the
sample has been treated with the toxin. So, statistically we must ask
if the factor treatment has an effect on the response variable
opticalDensity.
So, for Model (11), we’re ultimately interested in testing whether the
factor treatment needs to be in the model in addition to day. In
other words, we’re interested in testing the hypotheses
H0 : β1 = 0, H1 : β1 ≠ 0 (assuming that β2 = β̂2),
where β1 and β2 denote, respectively, the regression coefficients for the
indicator variables for the second levels of factors treatment and day.
(c) The p-value for the test of interest to the biologists is the p-value
associated with treatment in Table 22, which is less than 0.001.
Therefore, there is strong evidence that the toxin does indeed affect
cell viability. (It’s good that we fitted the effect of day to the model.
Without this, the p-value for this test would be 0.0149, so that there
is still evidence against H0 : β1 = 0, but it’s less strong.)
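A sketch of the corresponding fit in R, assuming the data are in a data frame called cells containing the response opticalDensity and the factors treatment and day:
cells.lm <- lm(opticalDensity ~ treatment + day, data = cells)
summary(cells.lm)   # the coefficient row for the treatment indicator gives the p-value of interest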

Solution to Activity 31
(a) If the biologists are interested in the research question formulated in
Activity 30, so that they want to find out if there is an effect of the
toxin on the proportion of surviving cells in a sample, both methods
can be used. We have already found the p-value for the test in
Table 12 (Subsection 3.2) from the output when fitting a linear
regression model. Because the factor treatment only has two levels,
the ANOVA table for the same model given in Table 13
(Subsection 3.3) produces the same p-value in the row associated with
treatment, so can also be used.
(b) If the biologists are interested in testing the one-sided hypothesis that
the toxin decreases the proportion of surviving cells, then we need the
table of coefficients from the regression model. (Because the
t-distribution is symmetric about 0, the p-value of the one-sided test
is either half the p-value of the two-sided test or one minus half the
p-value of the two-sided test, depending on the sign of the estimated
coefficient.) The ANOVA table can only tell us if a factor affects the
response (and should therefore be in the model), but not in which
direction (increase or decrease).
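As an illustration of this calculation in R (a sketch only; the t-value and residual degrees of freedom below are placeholders, not values from the module output):
t.value <- -2.6                                    # placeholder t-value for the treatment coefficient
res.df <- 9                                        # placeholder residual degrees of freedom
p.two.sided <- 2 * pt(-abs(t.value), df = res.df)  # the two-sided p-value reported by summary()
p.one.sided <- pt(t.value, df = res.df)            # one-sided p-value for H1: beta1 < 0
# here p.one.sided is half of p.two.sided, because the placeholder estimate is negative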

Solution to Activity 32
The logistic regression model with response variable obese and factors
ethnicity and ageGroup should be fitted. The binary variable obese is
an obvious response for the research question since the researchers want to
find out how this variable is related to the other two variables in the
dataset.
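A sketch of this fit in R, assuming one row per child in a data frame called childMeasurements, with obese coded 0/1 and with ethnicity and ageGroup stored as factors (the data may equally well be held in aggregated form, in which case the call would differ):
obese.glm <- glm(obese ~ ethnicity + ageGroup,
                 family = binomial(link = "logit"),
                 data = childMeasurements)
summary(obese.glm)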

Solution to Activity 33
(a) The proposed logistic regression model corresponds to the log-linear
model where the response variable is the count in Table 20
(Subsection 4.4). Since ethnicity and ageGroup are in the logistic
regression model for obese, we’d expect all three main effects
(ethnicity, ageGroup and obese) to be in the log-linear model for
the same data, and we’d also expect the model to include the two-way
interactions obese:ethnicity and obese:ageGroup.
(b) The residual deviance for the logistic regression model in part (a)
is 302.78, which is much larger than 5, the value of the degrees of
freedom. Therefore, by the usual ‘rule of thumb’ comparing the values
of the residual deviance and its degrees of freedom, the residual
deviance is much larger than expected if the model were a good fit
and so we conclude that the model from part (a) is not a good fit to
the data.

Solution to Activity 34
The interaction ethnicity:ageGroup models how these two factors are
jointly related to the response, or, in other words, how changing them
jointly will affect the response. For example, the odds of an Asian child
being obese change in a particular way when we move from the younger
ageGroup to the older ageGroup; for a child from a different ethnic group,
for example Black, the odds may change in a different way.
In the model without the interaction, the odds of being obese would
change in exactly the same way for children of all ethnicities when we
move from the younger ageGroup to the older ageGroup. This might not be
realistic, which could explain why the fit of the smaller model (considered
in Activity 33) was not adequate.

Solution to Activity 35
In the biomarker negative group, we can see that the value of x where the
vertical line crosses the x-axis is approximately 240 mg/m2 . In the
biomarker positive group, this dose value is approximately 180 mg/m2 .

Solution to Activity 36
Solving Equation (15) for dose, we get
−1.6582 = −424 + 529 log(dose/200 + 1)
(−1.6582 + 424)/529 = log(dose/200 + 1)
exp(0.7984) ≃ dose/200 + 1
(2.2220 − 1) × 200 ≃ dose
244 ≃ dose.
So, for biomarker negative patients, the dose required so that the
probability of toxicity is 0.16 is (approximately) 244 mg/m2 .
Solving Equation (16) for dose, we get
−1.6582 = −4.3 + 4.1 log(dose/200 + 1)
(−1.6582 + 4.3)/4.1 = log(dose/200 + 1)
exp(0.6443) ≃ dose/200 + 1
(1.9047 − 1) × 200 ≃ dose
181 ≃ dose.
Therefore, for biomarker positive patients, the dose required so that the
probability of toxicity is 0.16 is (approximately) 181 mg/m2 .
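The same back-calculation can be written as a small R function; a sketch, taking the intercept and slope of the fitted linear predictors in Equations (15) and (16) as arguments:
dose.for.p <- function(a, b, p = 0.16) {
  # dose at which a logistic model with linear predictor a + b * log(dose/200 + 1)
  # has fitted probability of toxicity equal to p
  eta <- log(p / (1 - p))   # logit of the target probability (about -1.6582 for p = 0.16)
  200 * (exp((eta - a) / b) - 1)
}
dose.for.p(a = -424, b = 529)   # biomarker negative: approximately 244
dose.for.p(a = -4.3, b = 4.1)   # biomarker positive: approximately 181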

Solution to Activity 37
(a) For the biomarker negative group, the estimated probabilities are
equal to 0 up until dose 215 mg/m2 . Dose 245 mg/m2 , however, has
probability of toxicity estimated as 0.286, which is considerably larger
than the acceptable 0.16. We therefore conclude that dose 245 mg/m2
is likely to cause too many toxicity events, and we recommend the
next lower dose, 215 mg/m2 .
(b) For the biomarker positive group, the estimated probability at dose
100 mg/m2 is 1/6, which is just about acceptable. Moving up the doses,
we get estimates of 0 for doses 150 mg/m2 and 180 mg/m2 . Then, at
215 mg/m2 , the probability of toxicity is estimated to be 0.5, which is
far higher than acceptable. We therefore recommend the next lower
dose, 180 mg/m2 , for use in this group.

Solution to Activity 38
(a) The answer to the research question could impact the production
process of the alcohol, since GlaxoSmithKline can set the covariates
to their (estimated) optimal values. This is likely to increase the yield
and would thus make production more efficient.
(b) You could find the maximum of your fitted model from Section 2 and
use this to estimate the maximum yield. Similarly, you can use the
values of the covariates where the maximum of the fitted model is
attained as estimates of the optimal covariate values.

Solution to Activity 39
No. The value of 95.3112 for the yield at these values of the covariates is a
prediction from the model. We need to take into account the uncertainty
that we have around this prediction.

Solution to Activity 40
(a) The correct answer is that the interval is the prediction interval for
the yield of alcohol at the values of the covariates given in Table 24.
How is this different from the interpretation in the question? Well, we
need to take into account that the values of the covariates where the
maximum yield is attained are also estimated. This means that there
is an extra level of uncertainty around these values, which has not
been included in the prediction interval. So, a prediction interval for
the maximum yield would be wider than the interval shown here.
(b) When the covariates are fixed at the values provided in Table 24,
then, in the long run, we expect 95% of new responses (yield of
alcohol during a run of the experiment/production process) to fall
into the interval (93.8610, 96.7614).
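For reference, an interval of this kind is typically obtained in R with predict(); a sketch, assuming the final fitted model is stored as desilylation.lm and optimal.values is a one-row data frame holding the covariate values of Table 24 (on the standardised scale used for fitting):
predict(desilylation.lm, newdata = optimal.values,
        interval = "prediction", level = 0.95)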

Solution to Activity 41
In the desilylation dataset, the values of the covariates had been
standardised before the data were analysed. The optimal estimated values
are therefore on the standardised scale, and the chemists must reverse the
standardisation to obtain the corresponding values on the original scale.
For example, the value of nmp is given as −1.3979 here, and it would be
impossible to prepare a solution with a negative volume of the solvent!

Solution to Activity 42
(a) Yes, it seems reasonable to assume a normal distribution. First, the
unit of measurement for the response variable, percentage yield, is
continuous, in which case a normal distribution is often a sensible
starting point. Second, after fitting the final Model (4), we
investigated the normal probability plot of standardised residuals (in
Activity 13, Subsection 2.5). The points in the plot are close to the
straight line, confirming the validity of this model assumption.
(b) The normal distribution can attain any real value. The response
variable in this example, however, is measured in percentage yield,
which is restricted to the interval from 0 to 100. A potential issue
could be that the normal distribution might provide predictions
outside this interval. Luckily, in the example this did not happen.
From Subsection 5.4, the prediction interval for the percentage yield
of the alcohol of interest when the reaction is run at the estimated
optimal covariate values, is (93.8610, 96.7614), which does not exceed
the limit of 100.
If our predictions had exceeded the upper limit of 100%, we could
have tried a distribution bounded on (0, 100) (or a transformed
response bounded on (0, 1)) instead of the normal distribution.
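As an illustration of the second option (a sketch only, not something done in the module; yield, x1 and x2 below are placeholder variable names), the percentage yield could be mapped onto the logit scale before fitting, so that back-transformed predictions always lie between 0 and 100:
desilylation$logitYield <- qlogis(desilylation$yield / 100)   # maps (0, 100) onto the whole real line
logit.lm <- lm(logitYield ~ x1 + x2, data = desilylation)     # placeholder linear predictor
100 * plogis(predict(logit.lm, interval = "prediction"))      # back-transformed predictions stay within (0, 100)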

Solution to Activity 43
(a) The term called ‘Intercept’ is the expected value of numCitations
when the covariate yearDiff is 0 and the factor journal takes its
baseline level. The baseline level of journal is the coded value 0,
which represents a ‘standard statistics journal’, and yearDiff = 0
means that the article has just been published. So, since the
estimated value of ‘Intercept’ for the fitted model is 1.174 in
Table 26, we’d expect the number of citations for an article in a
standard statistics journal that has just been published to be 1.174.
The regression coefficient for yearDiff is estimated to be 0.552. This
means that, after controlling for the type of journal, the expected
number of citations for an article increases by 0.552 with each
additional year since publication.
The value 1 for journal represents a ‘prestigious statistics journal’.
The estimate for this parameter is 35.610. So, after controlling for the
number of years since publication, the expected number of citations is
35.610 higher for an article in a prestigious statistics journal than for
an article in a standard statistics journal.
(b) If the article was published in a standard statistics journal eight years
ago, then journal takes the value 0 (the baseline level) and yearDiff
takes the value 8. So, the fitted value is
ŷ = 1.174 + (0.552 × 8) = 5.590.

(c) Since yearDiff is a covariate, we can use the p-value for the
regression coefficient for yearDiff given in the summary output table
to assess whether or not yearDiff should be kept in the model. From
Table 26, this p-value is 0.0193, which means there is evidence to
suggest that this coefficient is different from 0 (provided the factor
journal is also in the model). Therefore the covariate yearDiff
should be kept in the model.
Since journal is a factor, in order to assess whether journal should
be kept in the model, we can use the ANOVA test comparing the RSS
values for Model (17) and the model without journal included. The
p-value from this ANOVA test is very small, so there is strong
evidence that the factor journal should also be kept in the model in
addition to yearDiff.
Overall, we can say that there is evidence to suggest that each of the
two explanatory variables, journal and yearDiff, influences the
number of an article’s citations, given the existence of the other
explanatory variable in the model.
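A sketch of this analysis in R, again assuming a data frame called citations with journal stored as a factor:
citations.lm <- lm(numCitations ~ yearDiff + journal, data = citations)
summary(citations.lm)                      # coefficient table as in Table 26
anova(lm(numCitations ~ yearDiff, data = citations),
      citations.lm)                        # ANOVA test for adding the factor journal
1.174 + 0.552 * 8                          # fitted value from part (b)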

Solution to Activity 44
(a) The plot of residuals against fitted values in Figure 17(a) does not
give strong evidence to doubt the model assumptions of zero mean
and constant variance for the Wi ’s.
(b) The normal probability plot in Figure 17(b) shows that the
standardised residuals follow the straight line quite well. There is no
reason for concern about the normality assumption.

Solution to Activity 45
In the scatterplot of the residuals against values of yearDiff, it looks like
the variance of the Wi ’s increases as yearDiff increases, since the vertical
spread of points moving from left to right seems to increase. This may
indicate that a transformation of the response variable might be needed to
make the variance constant.
The plot of residuals against the levels of journal does not give strong
evidence to doubt the model assumptions of zero mean and constant
variance for the Wi ’s. The boxplot for values where journal is 0 looks
reasonable, and the other levels simply do not contain enough data (three
points where journal is 1, and only one point where journal is 2) to draw
any strong conclusions.

Solution to Activity 46
(a) The estimated probabilities of toxicity decrease before they increase!
This makes no sense scientifically.
(b) There are now two fitted values of p̂ = 0.16, one at a dose of roughly
100 mg/m2 , and the other at a dose of roughly 215 mg/m2 (by looking
at the plot). From the data, neither of these doses seems a good
choice. If we choose 100 mg/m2 as the recommended dose, we may
lose efficacy. The higher dose levels of 150 mg/m2 and 180 mg/m2 did
not cause toxicity in this study, while likely being more effective. But,
if we choose 215 mg/m2 as the recommended dose, we may expose
more patients to toxicity than planned. At this dose level, two out of
four patients experienced toxicity in the study, so this dose may be
too high to be safe.
(c) When we look at the data in Table 18 (Subsection 4.3), there is one
case of toxicity in the biomarker positive group at the lowest dose
level (100 mg/m2 ) and no cases in the two next higher dosage groups.
The model we have fitted follows this pattern very closely.

Index
∆ 47, 103 selection 43
big data
absolute income hypothesis 32 definition 299
accuracy 306 three V’s 299
adaTepe 235 variety 304
adjusted R2 statistic 405 velocity 305
agglomerative hierarchical clustering 224 veracity 305
start 225 volume 303
AIC 405 big O notation 310
AIH 32 binary distance 205
Akaike information criterion (AIC) 405 BLUE 29
algorithm Bray–Curtis distance 202, 205
definition 308
for simple linear regression 308 cells 411
for sorting data 312 central limit theorem 46
hill-climbing 327 centroid 241
MapReduce 318 ceteris paribus 15
split-apply-combine 318 change
to calculate a square root 323, 324 absolute 47
to calculate a standard deviation 311 percentage 47
Analysis of variance (ANOVA) 415 relative 47
Anderson–Hsiao estimator 168 Charter of Fundamental Rights 348
ANOVA 415 childMeasurements 433
explained sum of squares (ESS) 415 citations 421
F -value 415 city block distance 205
table 415 clothingFirms 39
APC 13 cluster analysis 188
AR(1) 125 cluster definition 191, 208
arbitrage 150 cluster, allocation to 238
asymptotic property 28 cluster, centre 241
Augmented Dickey–Fuller (ADF) test 140 clustering
autocorrelation coefficient 111 k -means 244
autocorrelation function (ACF) 112 density-based 255
autonomous consumption 13 hierarchical 223, 224
autoregressive process 125 partitional 236
autoregressive scatterplot 108 clusters 188
average linkage 227 coffeePrices 164
average propensity to consume 13 cointegration 150
cointegration test 151
Babylon app 352 complete linkage 226
backshift operator 108 confirmatory analysis 6, 119
beads 197 conflation 423
bias 30 consistency 28
attenuation 45 contingency table 434
attrition 37 Cook’s distance 408
omitted variable 57 plot 408


correlated explanatory variables 392 plot 213


correlogram 112 dissimilarity measure
covariate 388 between clusters 226
cross-sectional data 37 binary distance 205
Bray–Curtis distance 202, 205
data city block distance 205
longitudinal 35 definition 201
panel 35 edit distance 203
dataset Euclidean distance 205
Ada Tepe 235 L1 distance 205
beads 197 Manhattan distance 205
cells 411 properties 204
child measurements 433 distributed computing 313
Chinese liquor 200 disturbance term 5
citations 421 divisive hierarchical clustering 224
clothing firms 39 dmax 255
coffee prices 164 dose response curve 429
desilylation 387 doseEscalation 426
dose escalation 426 drift 114, 123
imports and exports 40 dummy variable 71
Old Faithful geyser 194 Durbin–Watson statistic 149
parking 192 DW statistic 149
PPP 159
PSID 41 ECM 162
simulated panel 75 econometrics 3
UK GDP 115 edge observation 262
unemployment 104 edit distance 203
US consumption 152 efficiency 29
DBScan 255 elasticity 48
dmax 255 endogenous variable 10
edge observation 262 Engel’s law 9
gmin 255 Engle–Granger test 151
interior observation 262 Equality Act 347
outlier 257 error components 81
start 266 error correction model 162
unlabelled observation 257 error correction term 162
deep learning 335 error term 5
demand curve 17 ESS 415
demeaned 74 estimator
dendrogram 232 instrumental variable 66
designed experiment 393, 394, 417 LSDV 71
desilylation 387 pooled 69
deterministic trend 114 proxy variable 59
Dicky–Fuller (DF) test 139 random effects 81
difference stationary 136 within groups 73
dissimilarity matrix Euclidean distance 205
definition 209 exhaust data 305
interpretation 212 exogeneity assumption 25


exogenous variable 10 relative income 34


expected value 24
expenditure share 7 I(p) 136
explained sum of squares (ESS) 415 i.i.d. 46
exploratory analysis 6 identification 3
identification, assumption 25
F -value (in ANOVA) 415 identity link function 420
factor 388 idiosyncratic error 81
faithful 194 importExport 40
FD estimator 167 in levels 47
finite sample property 30 in logs 47
first difference estimator 167 independent and identically distributed 46
first differencing 103 indicator variable 410
fixed effects 70 individual effects 70
flow variable 101 influential point 408
forecasting 147 informed consent 337
instrumental variable 65
Gauss–Markov theorem 29 interaction 402
General Data Protection Regulation (GDPR) interior observation 262
340 internal validation 209
generalised linear model (GLM) 417 IV 65
GLM 417, 418
diagnostic plots 424, 431 k-means clustering 244
linear predictor 418 start 246
link function 418, 420 stopping rule 245
model comparison 431 Keynesian consumption function 12
nested 431 Kuznets’ consumption puzzle 32
overdispersion 424
regression equation 418 L1 distance 205
relationship between log-linear and logistic labour economics 15
438 lag operator 108
residual deviance 424 law of demand 18
response distribution 418 level (of a variable) 103
gmin 255 leverage 408
Google Flu Trends 301 lifecycle income hypothesis 32
Granger representation theorem 161 LIH 32
growth rate (GDP) 114 linear estimator 30
linear in parameters 11
Hausman test 83 linear predictor 418
HCT 15 link function 418, 420
heteroskedasticity 81 identity 420
hierarchical clustering 223, 224 log 420
homoskedasticity 81 logit 420
human capital theory 15 linkage
hypothesis average 227
absolute income 32 complete 226
lifecycle income 32 single 226
permanent income 33 liquor 200


log link function 420 outlier 198, 257, 459


logistic function 429 finding outliers 461
logit link function 420 reasons for 459
longitudinal data 35 overdispersion 424
LSDV 71
panel
machine learning 333 balanced 35
machine precision 322 long 35
macroeconomics 35 short 35
Manhattan distance 205 unbalanced 35
MapReduce algorithm 318 panel data 35, 37
marginal propensity to consume 13 parking 192
mean corrected 74 partitional clustering 236
mean silhouette statistic 222 PCE 151
mean-covariance stationarity 131 permanent income hypothesis 33
microeconomics 35 persistent variable 107
Mincer equation 15 Phillips–Perron test 142
model PIH 33
generalised linear model (GLM) 417 ppp 159
log-linear 434 precision 306
multiple regression (with covariates and PredPol 354
factors) 410 protected characteristics 347
multiple regression (with covariates) 388 proxy variable 59, 351
multiple regression (with factors) 410 psid 41
momentum 107 purchasing power parity (PPP) 158
MPC 13
quadratic function 401
multiple regression
quarter (of a year) 103
constant variance assumption 397
independence assumption 397 randomised response 343
interaction term 402 RE 81
linearity assumption 397 real terms 12
model fitting 396 recommender system 300
model selection 405 relative income hypothesis 34
normality assumption 397 research question 435
with covariates 388 residual plot 397
with covariates and factors 410 response distribution
with factors 410 for counts 447
for percentages 446
n = all 331 RIH 34
neural network 335
nominal terms 12 sample
non-stationary 132 representative 43
normal probability plot 398 unrepresentative 43
scatterplot matrix 390
observational study 393, 394 semi-structured data 305
OLS 4 short-term adjustment 162
pooled 69 silhouette statistic
order of integration 136 definition 216


interpretation 218 waves 35


mean 222 weakly dependent 133
plot 219 weakly stationary 131
similarity measure 201 WG 73
simulatedPanel 75 white noise 126
single linkage 226
speed of adjustment factor 162
split-apply-combine algorithm 318
spurious correlation 329
spurious regression 147
stable solution 246
standardisation 208
standardised data 387, 395
stationarity
mean-covariance 131
strong 131
weak 131
statistical economics 19
stochastic process 130
stochastic trend 114
stock variable 101
structural break 118
structured data 304
supervised learning 334
supply curve 18

three V’s 299


time plot 105
time series data 37
transformation 399, 453
residuals (use of) 399

ukGDP 115
unbiasedness 26
unemployment 104
unit root 136
unlabelled observation 257
unstructured data 304
unsupervised learning 334
usConsumption 152

variable
dummy 71
endogenous 10
exogenous 10
instrumental 65
proxy 59
Voronoi diagram 238

