M348 Applied Statistical Modelling - Applications
Book 3
Applications
This publication forms part of the Open University module M348 Applied statistical modelling. Details of this
and other Open University modules can be obtained from Student Recruitment, The Open University, PO Box
197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)300 303 5303; email [email protected]).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn more about
the wide range of modules and packs offered at all levels by The Open University.
Welcome to Strand A 3
Introduction to Unit A1 4
Summary 85
Learning outcomes 87
References 88
Acknowledgements 89
Solutions to activities 90
Introduction 99
Summary 171
References 174
Acknowledgements 176
Summary 273
References 275
Acknowledgements 276
Introduction 297
5 Privacy 336
5.1 Consent 336
5.2 Anonymisation 341
6 Fairness 346
6.1 Inequality 347
6.2 Feedback loops 353
Summary 358
References 361
Acknowledgements 364
Introduction 383
Summary 472
References 475
Acknowledgements 476
Index 497
Unit A1
Introduction to econometrics
Welcome to Strand A
Strand A of M348 is the econometrics strand, consisting of Units A1
and A2. The spirit of this strand is to show ways in which economists
apply statistical modelling and techniques to economic data.
(Remember that you should study either Strand A or Strand B. See the
module guide for further details.)
Some of the reasons why economists use statistics differently relate to the
prominent roles that famous economists have had in advising government
policy. Some of the most famous economists of the last century and
beyond had relevant positions in national and international policy
committees. This emphasis on policy created an additional set of priorities
and objectives shaping what economists draw from data and statistical analysis.
For this reason, there are two important differences between the
data-related branch of economics – called econometrics – and how we
have so far done statistical modelling in this module.
The first difference stems from the role of economic theory in the relations
that econometrics establishes between data and theory, and between
theory and measurement. These relations are not stable over time, nor
agreed amongst economists at a particular point in time. In fact, they have
given rise to long-running debates in economics, and to different
schools of thought within the economics discipline. We will give you a
flavour of these debates by exploring two case studies that show how the
relation between data and theory advanced economic understanding of
key economic problems. We will look at:
• what economists have said about the relation between income and
consumption, and how the relation of economic theory with data and
measurement influenced this debate
• what economists have said about how the levels of wages paid to workers
are determined, and the main competing theories explaining them.
In line with the rest of the module, and as stated in the module guide, we
will keep mathematical manipulations to a minimum.
The second difference relates to the importance attached to key
parameters of econometric models. When economists use data and
statistical modelling to investigate economic problems, they are often
interested in the magnitudes and signs of particular parameters of this
model. The ‘correct’ estimation of these parameters, and the search for
estimators and data which will yield the best possible estimates, is often
called the identification phase of statistical modelling; it is a topic of
particular interest to many econometricians.
In this strand, you will learn what economists mean by:
• the ‘correct’ estimators
• unbiasedness and consistency of estimators
• identification and causality.
You will also learn some of the techniques and uses of the theory involved
in correctly estimating parameters of interest.
This is not to say economists will not use statistics to describe and explore
patterns in data with an agnostic attitude towards what they will find.
However, and for the purposes of showing what is distinctive about
econometrics from what you have learnt so far in this module, we will be
focusing on the instrumental way that economists have used data to
confirm or refute theoretical explanations of economic observations.
The type of data available often conditions the way economic relations can
be modelled and estimated, and conditions the type of identification
strategies available. This strand is therefore organised as follows. Unit A1
will explore two types of data structures: the so-called cross-sectional
structure, and the longitudinal or panel structure. Unit A2 will then
discuss the third main data structure, called time series, and end with
revisiting panel data models which account for time.
A note on notebooks
Make sure you have all the relevant files installed for the notebook
activities in this strand. Check the module website for any additional
instructions.
Introduction to Unit A1
This unit starts by showing you the main steps of doing econometrics when
its main purpose is to serve as a tool for economic policy. For this reason,
the engagement with economic theory, and the ways in which this theory is
represented and seeks support in the data, warrant some thought and
practice.
We will start by showcasing key elements of a historical economic problem
that looks at the relation between income and consumption. We will show
you how the main debates have evolved and we will discuss the use (and
neglect) of data and of modelling in furthering these debates over time. No
knowledge of economics is assumed in this unit. In doing this, we will
introduce you to some of the developments in alternatives to OLS as ways
of improving the properties of the estimators of key parameters.
(Econometricians refer to the ‘standard’ linear regression model as OLS –
ordinary least squares.) These additional estimators are often also
dependent on the data structure, and on the assumptions made about the
behaviour of economic variables included in the model.
[Diagram: the econometrician's modelling cycle, including the economic problem and data.]
In this unit, we will deal with the different parts of the econometrician’s
modelling process. Whilst doing this, we will introduce you to one of the
oldest economic problems, which is still very much debated: the relationship
between income and consumption.
The structure of Unit A1 in terms of how the unit’s sections fit together is
represented diagrammatically in the following route map.
Section 1: The economic problem
Section 2: The econometric model
Section 3: Data: structures, sampling and measurement
Section 4: Estimating causal relationships
Section 5: Estimating causal relationships using panel data
Note that Subsections 2.2.2, 3.2.3, 4.2.2 and 5.5 contain a number of
notebook activities, so you will need to switch between the written
unit and your computer to complete these.
1 The economic problem
The above quote comes from a paper by George J. Stigler. The next
activity will introduce you to some early studies about income and
consumption that he did find.
Stigler identifies three main early studies from David Davies, Sir Frederick
Morton Eden and Ernst Engel. These arose from a need to show evidence
of the extent of poverty in England. They collected ‘budget data’ on
several English families, which is data on income and on the expenditure
on various items.
In his article, Stigler reproduces a table for each of the studies,
summarising the information on income and consumption in each one.
Often, and still today, consumption is measured as expenditure. All three
studies use a particular measurement of expenditure called expenditure
share. This is described in Box 1.
What the tables given in Activity 3 also show is that when looking at
absolute values of expenditure, rather than at relative shares, and when
comparing absolute values of expenditure with absolute values of income,
one can see that for poorer groups, expenditure is much larger than
income, and this difference is reduced as income increases. In these studies
therefore, most people are spending more than they are earning, and this is
particularly noticeable for poorer families.
Activity 3 has shown you one way in which economics uses data and relates
it to economic theory. As Stigler (1954, p. 98) says, ‘[Engel’s law] was the
first empirical generalization from budget data’, where exploration of the
data informed economic theory. So we have just done some exploratory
analysis of economic data! However, later studies and moments in the
economics discipline have tilted more towards confirmatory analysis. That
is, data analysis aimed at investigating the appropriateness of a
pre-specified model. This is what we will be doing in most of the unit.
2 The econometric model
This relation exists for all possible income and consumption levels, even
hypothetical levels which have not been observed. This is what we call an
a priori relationship, as opposed to an a posteriori relationship which we
can analyse by observing when it occurs in the data. In this case, plans to
consume are a function of expected income. While Keynes recognised that
expectations were also likely to influence consumption, he argued that the
simplicity and stability of the above model outweighed the added value of
explicitly including expectations in the model. In the next activity, you
will consider the interpretation of the Keynesian consumption function.
In the Solution to Activity 5, you have seen that Keynes called the
parameter c0 autonomous consumption and the parameter b the
marginal propensity to consume (MPC). Keynes did not assume
that the marginal propensity to consume was fixed for all income levels.
As populations get richer, a larger fraction of income is allocated to
savings. Keynes observed and modelled that increases in income would
allocate an increasing share to savings and a decreasing share to
consumption, but in a way that absolute levels of consumption would still
increase, although at a decreasing rate. This mirrors Engel’s law when it
observes that richer people will use up a decreasing share of their income
for necessities such as food, but it goes one step further and claims some of
the unspent income is saved.
The way in which economic theory has trickled into the interpretation of
these parameters means we can use it to suggest ranges of values for each
that are more likely to be observed, should the Keynesian consumption
function be a good approximation to the relationship between consumption
and income. You will do this in Activity 6.
In Activity 6, you have seen how the economic theory underlying the
model given by Model (1) places constraints on the value of the
autonomous consumption, c0 , and the value of the marginal propensity to
consume, b. A third implication is to do with the relation between the
marginal propensity to consume and the average propensity to consume,
which is explained next.
The average propensity to consume (APC) is the ratio of consumption to income; that is, for the ith pair of values for income (Ii) and consumption (Ci),
\[
\mathrm{APC}_i = \frac{C_i}{I_i}.
\]
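To see how the two propensities relate under this model, we can substitute the Keynesian consumption function Ci = c0 + bIi into this definition (a short worked step, using only the notation already introduced):
\[
\mathrm{APC}_i = \frac{C_i}{I_i} = \frac{c_0 + bI_i}{I_i} = \frac{c_0}{I_i} + b.
\]
So whenever autonomous consumption c0 is positive, the APC exceeds the MPC (which equals b), and the APC falls towards b as income rises.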
The relation between the marginal and average propensities to consume is
depicted in Figure 2, shown next.
Figure 2  Marginal and average propensity to consume (the consumption function C = c0 + bI plotted against income I, with intercept c0 and two points (I1, C1) and (I2, C2) marked)
the scope of Section 4. But first, in Subsection 2.2, you will look at other
examples of economic problems and how they translate into possible
econometric model specifications.
Figure 3  A Mincer model showing diminishing marginal returns to work experience (hourly wage in £ plotted against work experience in years)
Figure 4  A demand curve (price plotted against quantity)
Using the model, they predict which prices will be operating in each
market, given the way that suppliers set their prices in response to how
they think demand will respond – but holding all else constant. This way
of arriving at equilibrium prices as an alignment of how suppliers and
consumers behave for each price is represented in Figure 5 as the
intersection between the demand and the supply curves.
[Figure 5: the demand and supply curves plotted against quantity, intersecting at the equilibrium price.]
interest, the slope of the demand curve, seemed to have been estimated
poorly.
It turns out that one of the key limitations of Moore’s study, and several
studies of demand behaviour, is the fact that the econometrician often
does not observe a priori quantities demanded. Let’s think about this. We
would like to observe how much consumers would buy of a good for all
possible prices. But at any given time, in one particular location, there is
only one price which results from the interaction between what consumers
want to buy, and what sellers or suppliers want to sell. When time
changes, we may observe another price, but the change in time may well
have brought further changes to both demand and supply. In other words,
by not modelling what else may have changed during this period that
would have influenced the relationship between demand and price, Moore
failed to keep all else constant: the ceteris paribus assumption was not
safeguarded.
Over 100 years ago, when Henry Moore published this work, the
understanding of identification of causal effects of, say, prices on the
demand for goods, was still in its infancy. In fact, in a letter to Moore,
Alfred Marshall – another prominent contemporaneous economist in the
analysis of prices, and of demand and supply of quantities – wrote of any
effort to attempt to identify such causal effects, holding all else constant,
as ‘though formally adequate seems to me impracticable’ (Stigler, 1962).
So if the ceteris paribus assumption was not safeguarded in the study
estimating the demand for corn and for pig iron, why do estimates of these
two demand curves have opposite slopes?
While there are other explanations and omitted variables, some of which
we consider in Subsection 2.3.2, it is likely that when estimating the
demand for corn, supply was also shifting over the period 1867–1911 in the
USA due to rising agricultural productivity. This could have resulted in
the supply curve shifting downwards, as shown by curves S1 to S4 in
Figure 6. This results in the amount supplied increasing at each and every
price level, so the observed price and quantity pairs trace out the expected downward-sloping demand schedule.
The period 1867–1911 was a period soon after the Industrial Revolution
that witnessed dramatic transformations in agricultural and industrial
production processes, and the use of machinery proliferated in the USA.
This may have increased the preferences and willingness to pay higher
prices for pig iron, a key input to the production of machinery, and this
could be represented by higher demand schedules, as shown by curves D1
to D4 on Figure 7. This results in an increasing relationship between
observed prices and demand.
[Figure 6: a fixed demand curve with supply curves S1 to S4 shifting downwards, plotted against quantity.]
[Figure 7: a fixed supply curve with successively higher demand curves D1 to D4, plotted against quantity.]
issues became impossible to ignore in the study of the demand for pig iron,
all these issues are likely to have occurred in other studies of demand,
including the one for corn. While the sign of the effect of price on the
demand for corn was the right one according to the law of demand, there is
no reason to expect that the magnitudes are the right ones if one or several
of these issues were at play.
In Subsections 2.4 and 2.5, we will be more precise about the conditions
that are required to identify the causal effect of a regressor on the
dependent variable.
[Figure: questions to ask when assessing whether a causal effect is identified: Ceteris paribus? Omitted variables? Exogenous explanatory variable? E(u|X) = 0?]
To put it simply, the causal effects of regressors in a model such as the one
in Model (3) are identified when
E(u|X1 , X2 , . . . , XK ) = 0.
In this class of models, it can be shown that this condition is exactly the
same as
Cov(u, Xk ) = 0, for all k = 1, . . . , K,
where Cov(u, Xk ) is the covariance between u and Xk .
These conditions are summarised in Box 8.
You have seen in Box 8 that OLS assumes that for every regressor Xk in
the model, there is no correlation between it and the error term. Other
methods have been developed to deal with the situation when there is a
correlation between Xk and u. In the rest of the unit, you will be exploring
some of these.
and
\[
\frac{\partial S}{\partial \tilde{b}} = 0 \iff \sum_{i=1}^{N} \bigl(Y_i - \tilde{a} - \tilde{b} X_i\bigr) X_i = 0 \iff \mathrm{Cov}(\tilde{u}, X_i) = 0.
\]
Let’s look at the relation between the unbiasedness of the OLS estimator
for the slope, b, and the error term in the univariate model
Yi = a + bXi + ui .
Furthermore, assume that
• E(ui | Xi) = 0
• E(ui² | Xi) = σ²
• Cov(ui, uj | Xi) = 0, for all i ≠ j
• V(Xi) > 0, for all i = 1, . . . , N, N > 2.
(If you are unsure why we specify N > 2, look again at the cartoon at the
beginning of Subsection 4.3 in Unit 1.)
Solving the first-order conditions in Subsection 2.5, the OLS estimators are
\[
\hat{a} = \bar{Y} - \hat{b}\,\bar{X}
\]
and
\[
\hat{b} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}.
\]
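As an illustration of these formulas, the following R sketch computes â and b̂ directly from simulated data and checks that they match the coefficients returned by lm(). The data and variable names are invented purely for illustration.

# A minimal check of the OLS formulas on simulated data
set.seed(123)
N <- 100
X <- rnorm(N, mean = 5, sd = 2)    # an arbitrary regressor
u <- rnorm(N)                      # error term with E(u | X) = 0
Y <- 2 + 0.5 * X + u               # true intercept 2, true slope 0.5

b_hat <- sum((Y - mean(Y)) * (X - mean(X))) / sum((X - mean(X))^2)
a_hat <- mean(Y) - b_hat * mean(X)

c(a_hat = a_hat, b_hat = b_hat)
coef(lm(Y ~ X))                    # agrees with a_hat and b_hat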
[Figure: distributions for sample sizes N = 25, N = 50, N = 80 and N = 120.]
In this case, ‘efficient’ means whether OLS will have the lowest possible
variance in the class of linear unbiased estimators; that is, the best linear
unbiased estimator (BLUE). This is important, as lower variance of an
estimator means smaller and more informative confidence intervals for each
of the parameters of our model.
For example, Figure 10 represents the sampling distribution of two
estimators for a parameter β.
Figure 10  Sampling distributions of two estimators of β, both unbiased
3 Data: structures, sampling and measurement
We have shown that in the univariate model, the OLS estimator for the
slope coefficient can be written as
\[
\hat{b} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
        = \sum_{i=1}^{N} \frac{X_i - \bar{X}}{\sum_{j=1}^{N} (X_j - \bar{X})^2}\, Y_i.
\]
This section will look at how the use of data and the limits and
opportunities present in data relate to the type of confirmatory analysis
done in econometrics and discussed in this unit.
As was mentioned in Subsection 1.1, in confirmatory analysis we look to
see the extent that data supports a theory. So, in Subsection 3.1 we
introduce a number of different theories about consumption and income.
Figure 12  The three dimensions of panel data (variables Y, X1, X2, . . . , XK observed over time periods t = 1, 2, . . . , T)
Figure 13  Shapes of panel data (short, T small, versus long, T large, in the time dimension; narrow, N small, versus wide, N large, in the cross-section dimension)
Econometric models are often written in a way that signals the type of
data they use. The index i is often used in cross-sectional studies and the
index t in time series. Panel data, because each variable is
two-dimensional, uses index it. The univariate version of the generic model
in Model (3) in Subsection 2.4 becomes, for each data structure:
Cross-sectional: Yi = a + bXi + ui , for all i = 1, . . . , N
Time series: Yt = a + bXt + ut , for all t = 1, . . . , T
Panel: Yit = a + bXit + uit , for all i = 1, . . . , N and t = 1, . . . , T.
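In R, one widely used package for panel data is plm (this is only a sketch, with invented data and column names; the module notebooks may use different tools). Its pdata.frame() function records which observations share an i and a t:

# Sketch: declaring the (i, t) structure of a panel in R
library(plm)

set.seed(1)
mydata <- data.frame(
  id   = rep(1:4, each = 3),    # i = 1, ..., N with N = 4
  year = rep(1:3, times = 4),   # t = 1, ..., T with T = 3
  X    = rnorm(12)
)
mydata$Y <- 1 + 0.5 * mydata$X + rnorm(12, sd = 0.2)

pdata <- pdata.frame(mydata, index = c("id", "year"))
head(index(pdata))    # the (i, t) pair attached to each observation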
Table 1 Theories and econometric models representing the relationship between consumption and income
3.3 Sampling
Some of the earlier cross-sectional studies of consumption and income have
been criticised for having been too narrow in terms of goods consumed, or
in terms of regions or socio-economic groups considered. Some of the
earlier time series studies in this field were instead criticised and their
results dismissed because the time period considered was too short, or
because the period observed had undergone structural breaks, such as
world wars.
The reasoning behind this criticism relates to the relationship between the
sample of observations and the population the sample aims to represent
and generalise back to. As discussed in Section 2, economists aim to
estimate causal relationships between variables for which they need
unbiased and consistent estimators. This will be difficult if not impossible
to achieve if the sample in the dataset used is not representative of the
population. The key question is: what is the relevant population for each
data structure?
Cross-sectional and panel datasets aspire to represent the group of subjects
from which their individuals, firms or families were sampled. For
consistency, the asymptotic behaviour of estimators which use these
structures is analysed as the number of subjects
increases and, in the limit, tends to infinity. In contrast, time series
asymptotics require time to tend to infinity. You will explore more about
time series data structures and properties required from a time series
sample in Unit A2.
Unbiasedness of estimators using cross-sectional and panel data can be achieved
even with a small sample of subjects. What is key is that each subject is as
likely to be in the sample as they are represented in the population, such
as in the scenario given in Example 1.
3.4 Measurement
Other than the units of observation, the data structure and the sampling,
a fourth consideration to make when analysing data is the measurement of
variables.
In economics, specifically in the theories of consumption proposed, an
additional data challenge is the gap between existing observable data and
what is often conceptualised. The Keynesian consumption function and the
AIH, which of the four theories introduced in Subsection 3.1 has by far the
fewest measurement challenges given widely available data,
already imposes several assumptions on the data. For instance, while data
available often measure observed consumption and earned income, the
Keynesian theory of consumption models a relationship between planned
consumption and planned income, two theoretical constructs of which
existing data can only be an approximation. Assumptions needed to
measure the consumption or income of a reference group (in Duesenberry’s
RIH), or to measure lifetime income or permanent income (in Modigliani’s
LIH or Friedman’s PIH), are however unquestionably bolder.
What do measurement issues add to the challenges of estimating causal
relationships? The immediate answer is measurement error and possible
bias.
Intuitively, one may expect that if a variable is measured with error, but
this error is not correlated with the error term of the econometric model,
then one would hope this would only add randomness and extra noise to
our estimation results. While this is true for errors in the measurement of
the dependent variable, things get more complicated when it is the
explanatory variable which is measured with error; see Box 14.
4 Estimating causal relationships
In this situation the covariance between the values used for the explanatory variable and the error term for the regression is
\[
\mathrm{Cov}(x, u - b\varepsilon_X) = \mathrm{Cov}(X + \varepsilon_X,\; u - b\varepsilon_X),
\]
which is no longer zero. Instead, this covariance is a function of the
variance of the measurement error and the slope b.
• When b is positive, this covariance is negative.
• When b is negative, this covariance is positive.
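A small simulation can illustrate this result. In the sketch below (simulated data, not from any module dataset), the true slope b is positive, so the covariance above is negative and the OLS slope estimated from the error-ridden regressor is biased towards zero.

# Sketch: measurement error in the explanatory variable
set.seed(42)
N <- 10000
X <- rnorm(N, mean = 10, sd = 2)   # true regressor
u <- rnorm(N)                      # model error, uncorrelated with X
b <- 2                             # true (positive) slope
Y <- 1 + b * X + u

eps_X <- rnorm(N, sd = 2)          # measurement error, uncorrelated with X and u
x_obs <- X + eps_X                 # what we actually observe

coef(lm(Y ~ X))[2]      # close to the true slope, 2
coef(lm(Y ~ x_obs))[2]  # biased towards zero (here, close to 1)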
In this unit, we do not always need to assume normality of the errors since
we can sometimes rely on the central limit theorem to be able to infer
properties of the estimators. The central limit theorem states that,
even if the underlying population from which we are sampling is
non-normal, the distribution of the sample mean (or sum) approaches normality as the
number of observations becomes large (i.e. tends to ∞). The key
assumptions of the central limit theorem are:
• a constant mean µ
• a finite variance σ²
• independence; in the models we encounter in this strand (and elsewhere
in the module), independence means zero covariance between
observations, Cov(ui, uj) = 0, for all i ≠ j.
This means that the observations are independent and identically
distributed, which we refer to simply as i.i.d.
You will recall that these three assumptions are part of the set of
assumptions required for OLS to be BLUE. However, and very often in
economics data, variables will not be symmetric around a mean value, and
often exhibit a large concentration of observations at low values and a very
sparse frequency of higher values. That is, the data are right-skew.
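The following sketch (simulated, right-skewed data, chosen only for illustration) shows the point of the central limit theorem in this setting: individual observations are far from normal, but the distribution of sample means is close to normal once the samples are reasonably large.

# Sketch: the central limit theorem with right-skewed data
set.seed(7)
skewed_sample <- rexp(1000, rate = 1)                      # exponential data: heavily right-skewed
sample_means  <- replicate(5000, mean(rexp(100, rate = 1)))

hist(skewed_sample, main = "Right-skewed observations")
hist(sample_means, main = "Means of samples of size 100")  # approximately normal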
\[
\text{relative change in } Z = \frac{\Delta Z}{\text{initial level of } Z}.
\]
Box 17 Elasticity
The elasticity of Y with respect to X is often denoted εY,X (read
as ‘epsilon of Y with respect to X’) and can be written as
\[
\varepsilon_{Y,X} = \frac{\text{relative change in } Y}{\text{relative change in } X} = \frac{\Delta Y / Y}{\Delta X / X}.
\]
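For small relative changes, this definition links directly to changes in logs, which is why (as you will see in Activity 10) the coefficients of a regression fitted to logged variables can be read as elasticities:
\[
\Delta \log Y \approx \frac{\Delta Y}{Y}, \qquad \Delta \log X \approx \frac{\Delta X}{X}, \qquad \text{so} \qquad \varepsilon_{Y,X} \approx \frac{\Delta \log Y}{\Delta \log X}.
\]
In a model of the form log Y = a + b log X + u, the coefficient b is therefore the elasticity of Y with respect to X.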
The values in Table 5 indicate that all three variables are skewed to the
right because
• the mean is much larger than the median for all three variables
• the standard deviations are much larger than the means.
The following plots of the distribution for each variable, in Figures 15(a),
(b) and (c), confirm the right-skewness of the data.
Figure 15  Distribution of (a) production (millions of euros), (b) labour (thousands of employees) and (c) capital (millions of euros), all variables in levels
Despite this skewness, we can still fit a model to these data. In Activity 9,
you will consider the results from fitting one such model.
Using the data in the clothing firms dataset, the following model for
production was fitted: production ∼ labour + capital.
The coefficients for this model are given in Table 6 and some summary
statistics of the residuals are given in Table 7.
Table 6 Coefficients for production ∼ labour + capital
Figure 16  Distribution of (a) production, (b) labour and (c) capital, all variables in logs
In Activity 10, you will consider the results from a model similar to that
considered in Activity 9 but this time the variables are in logs, not in levels.
Using the data in the clothing firms dataset, the following model for
production was also fitted:
log(production) ∼ log(labour) + log(capital).
The coefficients for this model are given in Table 8 and some summary
statistics of the residuals are given in Table 9.
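In R, this comparison can be sketched as follows. The data frame and values below are simulated stand-ins (with elasticities set near the 0.33 and 0.52 reported in the Solution to Activity 10), not the module's clothing firms dataset.

# Sketch: levels versus log-log specifications on simulated firm data
set.seed(10)
n <- 500
labour  <- rexp(n, rate = 1)       # right-skewed, like the real variables
capital <- rexp(n, rate = 0.5)
production <- exp(0.1 + 0.33 * log(labour) + 0.52 * log(capital) + rnorm(n, sd = 0.3))

model_levels <- lm(production ~ labour + capital)
model_logs   <- lm(log(production) ~ log(labour) + log(capital))

summary(model_levels)$r.squared
summary(model_logs)$r.squared   # compare the fit of the two specifications
coef(model_logs)                # slope coefficients are elasticities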
When we have identified the causal effects of all regressors, this can be
represented simply as in Figure 17.
[Figure 17: shaded boxes for education and experience, each with an arrow pointing to hourly wages.]
In Figure 17, shaded boxes represent variables included in the model and
the arrows indicate the effects of education and experience on hourly
wages. As both these arrows separately point to hourly wages, it indicates
that in this situation we can estimate the separate effects of education and
experience on wages. If the true model of hourly wages only includes
education and experience, and if these are exogenous, we have identified
causal effects of each on the dependent variable, and their coefficient
estimators will be unbiased and return (on average) the true effects.
[Figure 18: education and experience both affect hourly wages, but experience is omitted from the model.]
Let’s explore the statement given in Box 19 a little more using the case
when we include just one of two regressors that should be in a model.
So suppose the true model is
Yi = β0 + β1 X1i + β2 X2i + ui
and that instead, we estimate
Yi = β0 + β1 X1i + ui .
In other words, we leave out X2 . (In Figure 18 this was experience.)
It can be shown that if E(ui | X1, X2) = 0 (which is necessary to identify
causal effects) then the expectation of β̃1, the OLS estimator of the
coefficient of X1, is
\[
E(\tilde{\beta}_1 \mid X_1, X_2) = \beta_1 + \beta_2 \, \frac{\mathrm{Cov}(X_1, X_2)}{V(X_1)}. \qquad (6)
\]
X2 creates an omitted variable bias on β̃1 if the second term in
Equation (6) is not zero, which requires the two following conditions:
• its coefficient β2 ≠ 0
• the included and omitted variables are correlated.
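A quick simulation (entirely invented data) can be used to check Equation (6): when X2 is omitted, the estimated coefficient of X1 drifts away from β1 by roughly β2 Cov(X1, X2)/V(X1).

# Sketch: omitted variable bias in a simulated two-regressor model
set.seed(99)
N  <- 50000
X1 <- rnorm(N)
X2 <- 0.6 * X1 + rnorm(N)        # X1 and X2 are correlated
u  <- rnorm(N)
Y  <- 1 + 2 * X1 + 3 * X2 + u    # beta1 = 2, beta2 = 3

coef(lm(Y ~ X1 + X2))[2]         # close to beta1 = 2
coef(lm(Y ~ X1))[2]              # biased away from 2
2 + 3 * cov(X1, X2) / var(X1)    # the value predicted by Equation (6)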
In the next activity, you will consider why such biasing might occur if
ability is not included in the model.
Why would ability create an omitted variable bias in the estimator of the
education coefficient? In other words, in what ways is it correlated with
education? And why should it be in the model?
The explanations given in the solution to Activity 14, and there are many
more, suggest ability has a direct positive effect on wages, which does not
depend on its correlation with education.
Researchers in other social sciences may find these arguments, and the way
ability is thought of here, quite simplistic; but the point is that such a
relation exists, and that the omission of ability from a wage equation leads
to omitted variable bias.
As you have seen, omitting ability from the Mincerian wage equation can
lead to omitted variable bias. However, the solution is not as
straightforward as ‘include ability in the model’ because it’s not clear how
to measure the ability of an individual. So, in the rest of this subsection,
we will consider three alternative ways of dealing with this bias: using
proxy variables, twin studies and instrumental variables.
[Diagram: education and a proxy variable for ability, each with an arrow pointing to hourly wages.]
Example 3 describes one study that has made use of proxy variables.
Note: this is a panel dataset which also contains regressors related to time. We
will leave the discussion of time-related variables for Unit A2.
Interpret the coefficient estimates of the three models given in columns (1),
(2) and (3) of Table 12. Do these results suggest an omitted variable bias
which is positive, as suggested in the previous activity?
While this strategy has been used often, and earnings and income datasets
often collect measurements of ability, we have seen in Section 3 that
measurement error also biases our results. Proxy variables are prone to
more complex types of measurement error so results for the coefficients of
tests scores have to be taken with a pinch of salt. But what is more,
simply including the test scores as regressors may not be the best
way to use the information in these variables to control for ‘ability’.
It seems reasonable to expect that the productive ability that
employers value is at least partly reflected in our test scores but that
several other factors also affect the outcome of the tests (e.g.,
test-taking ability, sleep the previous night, etc.).
(Blackburn and Neumark, 1993)
Alternative methodologies to deal with such issues include instrumental
variables, which we will discuss in Subsection 4.3.3.
Note that, even though the variables are no longer the collected earnings
and education levels (the differences between twin earnings and between
twin education levels need to be created instead), the coefficient on the
education variable is still β1, and it is argued to be free of the omitted
variable bias present in standard models which exclude ability.
Moreover, this strategy can account for family background without
having to measure it explicitly.
Generically, the twin study approach can be written in the following way.
Suppose the original model is
Yi = a + bXi + ui ,
where ui = cWi + νi , and E(ui |Xi ) is different from 0 due to a correlation
between X and W . Further suppose there are groups consisting of
individuals who share the same value of unobserved heterogeneity W , so
Wi = Wj when i and j belong to the same group. (In twin studies, each
group contains just two members.) Differencing then yields
Yi − Yj = α + b(Xi − Xj ) + vi ,
where vi = νi − νj . Unobserved heterogeneity due to W is not in the
transformed model, so E(v|X) = 0. The original intercept is cancelled out
by the differencing; nevertheless, we often run OLS on a model with an
intercept.
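As a rough illustration of the differencing idea (with entirely simulated 'twin pairs', not the Ashenfelter and Krueger data), the within-pair difference removes the shared unobserved component W:

# Sketch: within-pair differencing with simulated twin data
set.seed(2023)
n_pairs <- 1000
W  <- rnorm(n_pairs)                             # unobserved factor shared within each pair
X1 <- 12 + 0.8 * W + rnorm(n_pairs)              # education of twin 1, correlated with W
X2 <- 12 + 0.8 * W + rnorm(n_pairs)              # education of twin 2, same W
Y1 <- 1 + 0.10 * X1 + 0.5 * W + rnorm(n_pairs, sd = 0.2)   # true b = 0.10
Y2 <- 1 + 0.10 * X2 + 0.5 * W + rnorm(n_pairs, sd = 0.2)

coef(lm(c(Y1, Y2) ~ c(X1, X2)))[2]     # ignores W: biased upwards
coef(lm(I(Y1 - Y2) ~ I(X1 - X2)))[2]   # within-pair differences: close to 0.10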
The twin studies technique is summarised in Box 21.
[Figure 21: diagram showing an instrument for education, education, ability and hourly wages.]
In Activity 17, you will consider the model depicted in Figure 21 a bit
further.
directly and OLS was used to estimate its coefficients (Columns (2)
and (3) of Table 12).
All education coefficient estimates remain lower than in the model which
did not account for ability at all (Column (1) of Table 12). The models
which instrument for test scores (Columns (3), (4), (5) and (6)) even
generate returns to education that are close to zero and not statistically
significant. Estimates when schooling is instrumented with family
background are almost twice the size of the estimates using proxy variables
only, but have the additional advantage of not being as susceptible to
measurement error bias.
All in all, and for this dataset, it is likely that estimates in Columns (5)
to (8) are the best estimates given the econometric model used to explain
wages.
The work of the econometrician would not necessarily stop here.
Blackburn and Neumark (1993) went on to refine and augment their model
by interacting education with test scores (see Unit 4 for a discussion of
interactions between variables and how they model non-parallel slopes),
explored the time series nature of their dataset, and made a few more
robustness checks to strengthen the evidence and results obtained. But
within the spirit of confirmatory analysis, the analysis presented suggests
that estimates in Columns (5) to (8) would be convincing enough, as they
have the ‘right’ size: lower than the initial ones which were subject to
positive omitted variable bias, and higher than the estimates which were
subject to proxy variable measurement error.
5 Estimating causal relationships with panel data
Using a pooled model for panel data can appear to defeat the purpose of
having a panel at all, but there are occasions where it is appropriate. This
makes pooled OLS not too different from cross-sectional OLS.
However, one needs to remember that observations in a cross-sectional
dataset are assumed to be randomly sampled from the population, and are
representative of the population. So, with panel data, it would be difficult
to convince the econometrician that two observations taken from the same
individual in two different time periods are as random as two
cross-sectional observations. Some software packages use
variance-covariance correction methods to estimate pooled models which
account for this data structure.
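As a sketch of what such a correction can look like in R (using the plm and lmtest packages with invented data; the module notebooks may do this differently), standard errors can be clustered by individual when a pooled model is fitted:

# Sketch: pooled OLS on a simulated panel, with standard errors clustered by individual
library(plm)
library(lmtest)

set.seed(5)
panel <- data.frame(id = rep(1:50, each = 4), year = rep(1:4, times = 50))
alpha   <- rnorm(50)[panel$id]                  # individual effect, repeated over time
panel$X <- rnorm(200)
panel$Y <- 1 + 0.5 * panel$X + alpha + rnorm(200, sd = 0.3)

pooled <- plm(Y ~ X, data = panel, index = c("id", "year"), model = "pooling")
coeftest(pooled)                                            # conventional standard errors
coeftest(pooled, vcov = vcovHC(pooled, cluster = "group"))  # clustered by individual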
In the fixed effects model, all individuals are affected the same way by
changes in the regressors (the Xk ’s). This means the regression lines for all
individuals have the same slope, but they have different intercepts due to
the differing individual effects. So this is a parallel slopes model which can
be extended with interactions between dummy variables and other
regressors. (Remember from Subsection 2.1 of Unit 3 that ‘dummy
variable’ is an alternative term for ‘indicator variable’.)
In this strand, you will learn three main estimators of a fixed effects model:
• the least squares dummy variable (LSDV) estimator (introduced in
Subsection 5.2.1)
• the within groups (WG) estimator (introduced in Subsection 5.2.2)
• the first difference (FD) estimator (introduced in Subsection 7.1 of
Unit A2).
Let’s now explore alternative fixed effects estimators and their strategies to
estimate the parameters of the model consistently and with no bias.
Table 14 LSDV results for a wage regression with education and experience
Notice from Table 14 that the coefficients for exper and exper2 are
the same despite different pairs of dummy variables being dropped. In
contrast, the coefficient for educ is one of the coefficients that does
change. Even the sign of the coefficient for educ changes! This means
we have to take great care when fitting the model and interpreting
what it is telling us about the effect of education on wages.
Subtracting Model (11) from Model (10) yields the following form for the within groups model:
\[
Y_{it} - \bar{Y}_i = \beta_1 (X_{1it} - \bar{X}_{1i}) + \dots + \beta_K (X_{Kit} - \bar{X}_{Ki}) + u_{it} - \bar{u}_i.
\]
These are called demeaned or mean corrected variables; the intercepts have been eliminated. OLS estimation of the new model using
\[
X^{*}_{1it} = X_{1it} - \bar{X}_{1i}, \quad \dots, \quad X^{*}_{Kit} = X_{Kit} - \bar{X}_{Ki}
\]
yields the WG estimator. In practice, it's not necessary to perform this transformation: software packages do it all.
Notice that in the demeaning process, the intercept and the fixed effect in
the error term u that characterise the fixed effects model have been
eliminated. So the WG estimator avoids the complexity of adding extra
variables and the computational cost of the LSDV estimator. We often
estimate WG with an intercept nevertheless.
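As a sketch in R (simulated data; object and column names are illustrative), the LSDV and WG estimators can be obtained as follows, and their slope estimates coincide:

# Sketch: LSDV versus within groups (WG) estimation of a fixed effects model
library(plm)

set.seed(8)
panel <- data.frame(id = rep(1:5, each = 10), t = rep(1:10, times = 5))
effect  <- c(-1, -0.5, 0, 0.5, 1)[panel$id]    # fixed individual effects
panel$X <- rnorm(50) + effect                  # regressor correlated with the effects
panel$Y <- 2 + 0.7 * panel$X + effect + rnorm(50, sd = 0.2)

lsdv <- lm(Y ~ X + factor(id), data = panel)                              # a dummy for each individual
wg   <- plm(Y ~ X, data = panel, index = c("id", "t"), model = "within")  # demeaned data

coef(lsdv)["X"]   # same slope ...
coef(wg)["X"]     # ... as the WG estimator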
In Example 7, we will estimate the same fixed effects model as in
Example 6 but this time using the WG estimator.
In the PSID dataset, education does not vary over time for each individual,
and so its effect cannot be estimated using a WG estimator. Notice that the coefficients on
exper and exper2 are the same as the LSDV estimates presented in
Table 14.
Figure 22  Boxplots of consumption given in simulatedPanel (a) by country and (b) by period
Let’s compare the various estimators introduced so far – the fixed effects
estimators (LSDV and WG) with the pooled model.
Scatterplots of the data are given in Figure 23. In this plot, different
plotting symbols are used for each country. (From the plot it is not
possible to determine which period each point corresponds to.) In
Figure 23(a) the pooled model is also plotted. Notice that this corresponds
to just one line. Although the pooled model fits the data reasonably well,
it is better for some countries. For example, the points for country 3 lie
76
Figure 23  Scatterplot of aggregate consumption versus aggregate income for five hypothetical countries with (a) the pooled OLS consumption function and (b) the pooled OLS and the LSDV consumption functions
In Figure 23(b) the fixed effects model is shown as well as the pooled
model. As you would expect from a parallel slopes type model, the fixed
effects model corresponds to separate lines for each country. Furthermore,
these lines are all parallel. Notice these lines fit the data better than the
pooled model. However, the quality of the fit still varies (slightly) between
countries.
Table 17 shows the estimated consumption function results using pooled
OLS and Table 18 shows the estimated consumption function results using
LSDV.
Table 17 Consumption function: pooled OLS results
By comparing Tables 17 and 18, we can see the change of slope from the
pooled model to the LSDV model – an increase in the coefficient of the
income term. We can also see the replacement of the common intercept by
the dummies for each country except the first.
To avoid estimating a model with as many dummy variables as
cross-sectional units, we use the WG estimator. As mentioned in
Subsection 5.2.2, the WG estimators start by demeaning the data. The
simulated panel dataset, after demeaning, are shown in Figure 24(a).
Notice that once this is done, the data for all five countries overlaps
considerably and are centered about (0, 0). The WG estimator then
corresponds to fitting a single line to these demeaned data. The resulting
line is also shown on Figure 24(a). Transforming back to the original scale
results in the single line becoming separate lines for each country, all
parallel – as shown in Figure 24(b). Notice in Figure 24(b) these lines are
identical to those found using LSDV. The two regressions, WG and LSDV,
are mathematically the same – WG just represents a translation of LSDV,
not a distinct model.
Figure 24  (a) Scatterplot of demeaned consumption and income and the WG estimated consumption function; (b) the LSDV and WG estimated consumption functions
To select between fixed effects or pooled OLS models, we test the joint
significance of the individual-specific intercepts using an F -test.
Testing the hypothesis that the same intercept applies to all individuals
with these data returns a very small p-value. So the null hypothesis that
the same coefficients apply to all individuals (pooled model) is rejected in
favour of the presence of fixed effects. Either LSDV or WG can be used (or
FD, as we will see in Unit A2.)
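In R with the plm package, this F-test is available directly through pFtest(); a sketch with simulated data (names are illustrative):

# Sketch: testing pooled OLS against fixed effects
library(plm)

set.seed(3)
panel <- data.frame(id = rep(1:20, each = 5), t = rep(1:5, times = 20))
effect  <- rnorm(20)[panel$id]                 # individual effects are present
panel$X <- rnorm(100)
panel$Y <- 1 + 0.5 * panel$X + effect + rnorm(100, sd = 0.3)

pooled <- plm(Y ~ X, data = panel, index = c("id", "t"), model = "pooling")
within <- plm(Y ~ X, data = panel, index = c("id", "t"), model = "within")

pFtest(within, pooled)   # small p-value: reject the pooled model in favour of fixed effects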
There are two error components and both are assumed to be normally distributed with zero mean and constant variance:
\[
\varepsilon_i \sim N(0, \sigma_\varepsilon^2) \quad \text{and} \quad \nu_{it} \sim N(0, \sigma_\nu^2).
\]
The εi capture the random individual effects; νit, often called the idiosyncratic error, captures remaining elements of variation. Consequently, the model is assumed to have a constant variance: σε² + σν². That is, the model is homoskedastic. (A model where the variance is not the same is often referred to as being heteroskedastic.) Neither of the error terms, ε and ν, is correlated with the regressors Xk, for all k = 1, . . . , K.
OLS is not suitable to estimate a random effects model since the error
structure and error covariance matrix need to acknowledge that error terms
are not independent of each other for each set of cross-sectional data.
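In R, a random effects model can be estimated with the plm package, and the Hausman test used to choose between fixed and random effects is also available there. A sketch with simulated data (names and numbers are illustrative only):

# Sketch: random effects estimation and the Hausman test
library(plm)

set.seed(11)
panel <- data.frame(id = rep(1:30, each = 6), t = rep(1:6, times = 30))
eps_i   <- rnorm(30, sd = 0.5)[panel$id]    # random individual effect
panel$X <- rnorm(180)                       # uncorrelated with eps_i here
panel$Y <- 1 + 0.5 * panel$X + eps_i + rnorm(180, sd = 0.3)

re <- plm(Y ~ X, data = panel, index = c("id", "t"), model = "random")
fe <- plm(Y ~ X, data = panel, index = c("id", "t"), model = "within")

summary(re)      # GLS estimates of the random effects model
phtest(fe, re)   # Hausman test: typically does not reject random effects here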
Figure 25  Comparing FE and RE consumption functions using simulatedPanel
This concludes the panel data estimation methods covered in Unit A1, but
Unit A2 contains more alternatives.
Summary
This unit emphasised two main differences from statistical modelling when
used by economists: the use of economic theory and the confirmatory
analysis that economic theories often require from statistical models, and
the need to identify causal effects in the relationships between the
dependent variable and each of the regressors of interest.
To identify causal effects, you learnt that three conditions have to be met:
the ceteris paribus assumption, the exogeneity of the regressors, and the
lack of omitted variable bias. You also learnt that to meet these conditions,
a key method is to think about the existence of correlation between the
error term of the model and the regressors of interest. If there is a
suspicion that there might be such correlation, this needs to be addressed.
You explored different data structures and their relation with the solutions
available to identify causal effects. You also learned that data themselves
can add biases, such as measurement error or selection biases. Panel data,
in particular, can include a specific type of selection bias called attrition
bias.
With cross-sectional data, the most common approaches to identifying
causal effects when the econometrician suspects there are biases are proxy
variables, twin studies and instrumental variables. In the presence of
panel data, three more types of estimators are available: pooled OLS, fixed
effects estimators, and the random effects estimator.
We looked at statistical tests in the context of panel data models, namely
the F -test of joint significance of fixed effects and the Hausman test, to
choose between panel data estimators.
You were given the opportunity to apply these methods using R. You
should now be in a position to understand studies that model and estimate
economic problems, and to model your own.
The Unit A1 route map, repeated from the Introduction, provides a
reminder of what has been studied and how the different sections link
together.
Section 1: The economic problem
Section 2: The econometric model
Section 3: Data: structures, sampling and measurement
Section 4: Estimating causal relationships
Section 5: Estimating causal relationships using panel data
Learning outcomes
After you have worked through this unit, you should be able to:
• identify the key stages of doing econometrics
• write down an economic model which represents an economic problem
• discuss the identification of key parameters of an econometric model
• analyse the error term of an econometric model and assess ways in which
it may be correlated with the regressors
• engage with the possibilities and limits of different data structures when
estimating an econometric model
• critically evaluate the limits of different estimation techniques in
providing unbiased and consistent estimators
• apply these estimation methods using R.
References
Ashenfelter, O. and Krueger, A. (1994) ‘Estimates of the economic return
to schooling from a new sample of twins’, The American Economic Review,
84(5), pp. 1157–1173.
Baltagi, B.H. and Levin, D. (1992) ‘Cigarette taxation: Raising revenues
and reducing consumption’, Structural Change and Economic Dynamics,
3(2), pp. 321–335. doi:10.1016/0954-349X(92)90010-4.
Blackburn, M.L. and Neumark, D. (1993) ‘Omitted-ability bias and the
increase in the return to schooling’, Journal of Labor Economics, 11(3),
pp. 521–544.
Bureau van Dijk (2020) Amadeus. Available at: https://www.open.ac.uk/
libraryservices/resource/database:350727&f=33492 (Accessed:
22 November 2022). (The Amadeus database can be accessed from The
Open University Library using the institution login.)
Deaton, A. (1995) ‘Data and econometric tools for development analysis’,
in Behrman, J. and Srinivasan, T.N. (eds) Handbook of Development
Economics. Amsterdam: Elsevier Science, pp. 1785–1882.
Duesenberry, J. (1949) Income, saving, and the theory of consumer
behaviour. Cambridge, Massachusetts: Harvard University Press.
Eurostat (2022) GDP and main components (output, expenditure and
income). Available at https://ec.europa.eu/eurostat/en/ (Accessed:
17 January 2023).
Friedman, M. (1957) A theory of the consumption function. Princeton:
Princeton University Press.
Hajivassiliou, V.A. (1987) ‘The external debt repayments problems of
LDCs: An econometric model based on panel data’, Journal of
Econometrics, 36(1–2), pp. 205–230. doi:10.1016/0304-4076(87)90050-9.
Stata Press (no date) ‘Datasets for Stata Longitudinal-Data/Panel-Data
Reference Manual, Release 17’. Available at:
https://www.stata-press.com/data/r17/xt.html
(Accessed: 10 January 2023).
Stigler, G.J. (1954) ‘The early history of empirical studies of consumer
behavior’, Journal of Political Economy, 62(2), pp. 95–113.
doi:10.1086/257495.
Stigler, G.J. (1962) ‘Henry L. Moore and statistical economics’,
Econometrica, 30(1), pp. 1–21. doi:10.2307/1911284.
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 2.1, grocery shopping: The Print Collector / Alamy Stock
Photo
Subsection 2.3.1, pig iron production: Pi3.124 / Wikimedia. This file is
licensed under the Creative Commons Attribution-Share Alike 4.0
International license.
Subsection 3.3, green and orange balls: Laurent Sauvel / Getty
Subsection 4.3.3, trombone: C Squared Studios / Getty
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
At the heart of both modelling processes is a cycle. In particular,
estimation or fitting of a model is done many times. So in both modelling
processes there is the idea of using results from model estimation to
improve the modelling, leading to what is hopefully an improved model
being estimated.
The main difference is that in the statistical modelling process some tasks
are not part of the cycle, whereas in the econometric modelling process
everything is part of the cycle. In particular, in the statistical modelling
process steps relating to ‘Collect data’ and ‘Explore data’ appear before
the cycle, whereas a step relating to ‘Data’ appears in the cycle that forms
the econometrician’s modelling process. This reflects that in econometrics
the relationship between economic theory and the selection of data are
often entwined. Given the key variables and their relationships expressed
by different economic models, the econometrician will choose an
econometric model. Using that model, they will estimate parameters given
the data available, and use the results to shed light on the key
relationships of the economic problem.
Solution to Activity 3
Both studies show expenditure shares for several categories of goods, food
included. They show these shares for different groups.
In Table 1, Davies tabulates these shares across different income bands,
starting from the £10–£20 income band to the £30–£45 income band. He
only looks at agricultural workers.
In Table 3, Engel uses an alternative to income which tries to capture
overall poverty, economic position, and income security. He uses a
categorical variable which has three groups: families on relief, poor but
independent, and comfortable families.
Both studies show that as income or economic position improves, the
individualised share of expenditure spent on food decreases. This
individualised share divides the family expenditure share by the number of
people in the household. In contrast, and looking specifically at Engel’s
study, the share spent on health and recreation increases with income.
Solution to Activity 4
The models that are linear in parameters are (a), (b), (c) and (e).
(a) Note that a0 can be thought of as a0 × 1. So if we think of a variable
which takes the value 1 for all units of observation, this model is
linear in parameters.
(b) Like for the model in part (a), we can think of a0 as a0 × 1. (And/or
a1 too if we like.) So this model is linear in parameters.
(c) Note that each parameter is associated with one variable only, and that
all terms are additively separated. One of the variables is non-linear –
quadratic in fact – but that does not contradict the definition of linear
in parameters. It just means that the original variable X would need
to be transformed and its transformation added to the model.
(d) Note that there is one term that includes two parameters in a
multiplicative way: a1 a2 . So this model is not linear in parameters.
(e) Note that each parameter is associated with one variable only, and that
all terms are additively separated. One of the variables is non-linear,
the product of two variables, but that does not contradict the
definition of linear in parameters. It means, as for the model in
part (c), that a new variable needs to be created to estimate this
model.
Solution to Activity 5
The parameter c0 is the intercept of this linear function and, by definition,
it is the amount of consumption needed when income (I) is zero. (The
literature on the Keynesian consumption function often uses c0 to
represent the intercept because this notation reminds us that it represents
consumption c when income is zero. You will remember from Activity 3
that, for poorer families, Davies as well as Engel found that consumption
was often higher than income. Keynes called this level of subsistence
consumption autonomous consumption.)
The parameter b is the slope of the linear function and, by definition, it is
the amount of the change in consumption when income changes. At each
income level, the consumption increase is a defined proportion of the
income increase. This parameter b can be interpreted as the change in
consumption divided by change in income. (Keynes called b the marginal
propensity to consume.)
Solution to Activity 6
The parameter c0 , the autonomous consumption, should be a positive
number, and in line with what poorer people would consume.
The marginal propensity to consume, b, should be fixed and lie between 0
and 1 for a particular income group, since increases in income will be split
between consumption and savings.
Solution to Activity 7
Data focusing on a particular country will record data over time. This was
the case of the data Henry Moore used to estimate the demand for corn
and pig iron in the USA, and it included annual data from 1867–1911.
(This type of data structure is called a time series and it will be the
subject of Unit A2.)
In contrast, data on several families and their expenditure and income may
be observed at a specific point in time, such as the data used by Engel in
Table 3 of Stigler (1954) shown in Activity 3. (This type of data structure
is said to be cross-sectional.)
Solution to Activity 8
The AIH, by focusing on current income only, and not having an in-built
lifecycle or intertemporal dimension, can be and has been analysed using
both cross-sectional and time series data.
The LIH and PIH require different observations on the same subject over
time, which is why studies using these theories require time series data.
Alternatively, the RIH requires information on each subject’s peers or
reference group so that the basic data structure is cross-sectional, looking
at several individuals observed at the same time.
All the theories can be tested using panel data.
Solution to Activity 9
(a) For this model in levels, we interpret the coefficient of labour as the
amount (in thousands of euros) that production increases when labour
is increased by one thousand euros. That amount is almost 54, and is
statistically significant. By the same token, production increases by
1.364 thousand euros when capital increases by 1 thousand euros, and
this is also statistically significant.
(b) The R2 of this model is 0.4632, which is reasonable in economics
studies of production functions, and when only two regressors are
included. However, the distribution of the residuals suggests that they
are very skewed to the right, and therefore too far from the normality
assumption.
Solution to Activity 10
(a) As this model is fitted to logged data, the coefficients are elasticities.
When labour increases by 1%, production increases by 0.33%; when
capital increases by 1%, the production increases by 0.52%.
(b) The R2 of this model is more than the R2 for the model fitted in
Activity 9. Table 9 also suggests that the distribution of the residuals
is now much closer to a symmetric distribution. All in all, the
interpretation of the coefficients is clearer, the model fit has improved,
and hypothesis testing is more reliable since the distribution of the
residuals is now closer to a normal distribution. So this model seems
better.
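To make the comparison between the levels and logged specifications concrete, here is a minimal R sketch using simulated firm data (the variable names and coefficient values are illustrative, not those of the module’s dataset); fitting both models with lm() reproduces the kind of comparison made in Activities 9 and 10.

# Simulated Cobb-Douglas-style data (illustrative values only)
set.seed(1)
labour <- exp(rnorm(200, mean = 4))
capital <- exp(rnorm(200, mean = 5))
production <- exp(0.5 + 0.33 * log(labour) + 0.52 * log(capital) +
                  rnorm(200, sd = 0.3))
fit_levels <- lm(production ~ labour + capital)               # model in levels
fit_logs <- lm(log(production) ~ log(labour) + log(capital))  # logged model
summary(fit_logs)$coefficients   # coefficients are elasticities
hist(resid(fit_logs))            # residuals should look roughly symmetric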
Solution to Activity 11
Experience shows up in the model as a quadratic function, with both a
linear and a squared term. To calculate the impact of experience on hourly
wages, we need to find the derivative of log(wage) with respect to exper.
After some calculation, we derive that the effect of experience on
log(wage) is
β1 + 2β2 exper,
and so it varies with the individual’s experience. Since
∂wage/∂exper = wage × ∂log(wage)/∂exper, the corresponding effect of
experience on hourly wages themselves is
(β1 + 2β2 exper) × wage.
So we will need to resist the temptation to apply the ceteris paribus
assumption to each coefficient separately when a variable’s effect is
represented by more than one term.
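A short numerical sketch of this point, using purely illustrative coefficient values (not estimates from the module), shows how the marginal effect on log(wage) changes with experience:

beta1 <- 0.04     # illustrative coefficient on exper
beta2 <- -0.0006  # illustrative coefficient on exper squared
exper <- c(1, 5, 10, 20, 30)
round(beta1 + 2 * beta2 * exper, 4)  # effect on log(wage) at each experience level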
Solution to Activity 12
The interpretation of a coefficient in a multiple regression, in particular the
assumption of holding all else constant (ceteris paribus), provides an
additional mechanism to choose the right specification. In Model (5), we
can interpret the coefficient of educ as the effect of education on the log of
hourly wages, holding experience constant. Experience’s own direct effect on
the log of hourly wages has also been modelled. If both variables are
exogenous (and that is the situation the econometrician ideally wants), we
have identified the causal effect of both education and experience on the log
of hourly wages.
The coefficient of educ is statistically significant in both specifications. As
seen in Box 18 (Subsection 4.1), we can interpret the coefficient estimate
multiplied by 100 as the percentage increase in hourly wages given by a
unit increase in education. In Model (4), a univariate model, this
percentage is approximately 5.5%, and in Model (5), an augmented model,
it increases to almost 6.6%. While the difference between the two estimates
is not large, this difference does force the econometrician to think about
which specification, if any, is the one which should be analysed further.
Solution to Activity 13
As education increases, on average in our dataset, experience decreases
(due to their negative correlation). Looking back at Figure 18, the effect of
experience on wages (a positive and statistically significant effect in our
example) is being picked up by changes in education. Because the effect of
experience on wages is positive, higher experience means higher wages on
average. In these data, however, higher experience is associated with lower
levels of education, so higher values of education are associated with lower
values of experience. An estimator in a model which omits experience will
therefore pick up the dampening effect on wages of the decrease in experience
as education increases. In Activity 12, this dampening effect was
1.1 percentage points: including experience increased the estimated effect of
education on wages from 5.5% to 6.6%.
Solution to Activity 14
Some economists argue that those with higher ability will go further in
their education, which creates a positive correlation between ability and
education. But to explore why ability should be in the model as well as
education, we will need to use the ceteris paribus assumption: holding all
else constant, including education, what is the direct effect of ability on
wages?
Consider two individuals with the same education, same experience, but
with different ability. A human capital explanation argues that ability
translates into more productivity, ease at picking up new tasks and
efficiency at existing ones, which translates into higher wages. An
alternative explanation, known as a signalling model explanation, assumes
higher-ability individuals choose to have more education to signal to the
labour market that they are more productive, which is then confirmed once
they are employed.
Solution to Activity 15
The direction of the bias is given by the sign of the second term on the
right-hand side of Equation (6), which is determined by the signs of two
factors: the effect of the excluded variable on the dependent variable (β2 in
the formula), and the correlation between the included and
excluded regressors (in this case, education and ability). Given economic
theory and competing explanations, the effect of ability on wages is
positive. And the correlation between education and ability is also
positive. The product of two positive factors is also positive. So the biased
estimator of the education coefficient is higher than the true one, and we
expect the true value to be lower. This is what we call positive bias.
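The direction of this bias can also be illustrated with a small simulation in R. The data-generating process below is hypothetical (the coefficient values and sample are invented for illustration), but it has the two key features discussed above: ability raises wages and is positively correlated with education.

set.seed(123)
n <- 5000
ability <- rnorm(n)
educ <- 12 + 2 * ability + rnorm(n)              # education positively correlated with ability
logwage <- 1 + 0.05 * educ + 0.10 * ability + rnorm(n, sd = 0.3)
coef(lm(logwage ~ educ))["educ"]                 # omits ability: biased upwards
coef(lm(logwage ~ educ + ability))["educ"]       # close to the true value, 0.05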
Solution to Activity 16
Column (1) shows the results of the model which omits ability. It shows
that, on average, each year of education increases hourly wages by 3.2%.
In columns (2) and (3), this return to each year of education decreases to
1.3% and 1.2%, respectively.
Economic theory suggests that there is a positive bias when ability is
excluded from the econometric model. In this particular study, we can see
that this positive bias is quite substantial. You may also note that the R2
has not increased substantially when test scores were added. In
econometrics, the choice between models often requires an analysis of
possible omitted variables biasing our parameters of most interest, as we
have just done. Looking solely at the R2 , we would be tempted to reject
all additional variables given such a low increase in the goodness of fit. In
doing so, however, we would be using estimates of the returns to schooling
that are broadly three times what they were likely to be for this
sample in this time period, and we would risk drawing misguided policy
recommendations.
Solution to Activity 17
The instrumental variable, or variables (as we can use several variables to
replace one endogenous regressor), would replace education. The IV
would need to be correlated with education but, conditional on this
correlation (ceteris paribus), it should not have a direct impact on the
dependent variable (in our example, wages).
Effectively, the IV would not enter the original wage equation model, as it
affects wages only via education.
Unit A2
Time series econometrics
Introduction
Sequences of observations recorded at regular intervals over time are called
time series. This second (and final) unit of Strand A is concerned with the
economic analysis of time series data. A key feature of time series is that
they are numerical footprints of history. In other words, they track the
evolution of a variable through time and, therefore, tell you something
about its past evolution. History matters, therefore, when dealing with
time series.
Historical variation, however, is rarely random in nature. Almost
invariably, successive measurements of an economic or social time series
variable will be related to each other in some way, and so will
measurements across different time series, revealing patterns that give us
structure to understand what went on in the past and possibly also provide
us with a basis for making predictions (forecasts) about their future
evolution.
Time series data, therefore, pose particular challenges for statistical and
econometric modelling because each variable on its own reflects historical
variation through time. In practical terms, some of these challenges
include: the fact that time series variables cannot simply be viewed as a
randomly drawn sample from a wider population; that successive observations
are likely to be correlated with each other; that the effect of the passage of
time itself (a trend) can be difficult to distinguish from other changes in a
time series variable; and that, when additional time series regressors are
added to an econometric model, these time-dependence issues are compounded
by the likely correlation between the regressors and the error terms. These
have consequences for
the way we model the evolution of a time series and for how we model and
estimate the relationship between more than one time series in a regression
equation.
In Section 1, we start by considering two types of time series variable:
stocks and flows. Then, in Section 2, the focus is on the concepts and
techniques that help us to make sense of the patterns inherent in a single
time series variable. In Section 3, we will use the example of particular
time series models, called random walks, to rehearse the techniques from
Section 2. A key concept you will learn about, and a property you will need
to ensure holds when modelling time series, is stationarity. We will discuss
how to model and transform time series data so that they become stationary
in Section 4, and how to test for stationarity in Section 5. We will argue that
most of the estimation and hypothesis testing you know from the rest of the
module is still valid if and only if variables are stationary and all OLS
assumptions hold (including the condition, studied in Unit A1, that errors
and regressors are independent of each other).
Subsequently, based on this understanding of the analysis of a single time
series, in Section 6 we will discuss how the concerns with stationarity
extend to the modelling of an econometric model with additional
regressors. To do so, we will first focus on what is called the problem of
spurious regressions with time series data, which is a problem that results
when two or more non-stationary time series variables display similar
patterns of movement through time without there being a relationship
between them. When there is a structural relationship between them,
variables which are initially non-stationary are called cointegrated and
their joint behaviour can be meaningfully modelled together. The section
ends by discussing relationships between cointegrated variables and how
error correction models can further add to the modelling and analysis of
the relationship between cointegrated variables.
Finally, in Section 7, we will discuss how time series data can be modelled
using panel data.
The following route map shows how the sections of the unit link together.
Section 1 (Stock and flow variables) → Section 2 (Describing intertemporal
properties of time series) → Section 3 (Random walks) → Section 4
(Stationarity and lagged dependence of a time series) → Section 5 (Testing
for stationarity) → Section 6 (Modelling more than one time series
variable) → Section 7 (Modelling time with panel data)
Note that Subsections 2.5, 3.3, 5.4, 6.4 and 7.1.2 contain a number of
notebook activities, which means you will need to switch between the
written unit and your computer to complete these sections.
1 Stock and flow variables
In the following activity, you will practise distinguishing between stock and
flow variables yourself.
2 Describing intertemporal
properties of time series
In this section, we will consider a couple of time series, relating to
unemployment and GDP, and consider some properties that can be used to
describe them.
In Subsection 2.1, we will look at the trajectory of the quarterly rate of
unemployment using a time plot. In Subsections 2.2 and 2.3, we will
introduce two different plots that are used to explore time series – the
autoregressive scatterplot and the correlogram – and you will see how they
can be used to describe the unemployment data. In Subsection 2.4, you
will then apply the ideas in Subsections 2.1 to 2.3 to a time series relating
to GDP. Finally, in Subsection 2.5, you will use R to explore time series.
Unemployment in the UK
The available labour force of a country at a particular point in time
consists of the total number of those who are currently employed plus
those who are unemployed. The rate of unemployment of a country is
then defined as the number of unemployed as a percentage of the total
labour force. Similarly to the number of unemployed people discussed
in Activity 1 (Section 1), the unemployment rate is also a stock
variable. (Generally, to be unemployed a person has to be actively
seeking work or about to start working.)
The unemployment dataset (unemployment)
In the UK, the Office for National Statistics (ONS) routinely
publishes the seasonally adjusted rate of UK unemployment for people
aged 16 and over on a monthly, quarterly and annual basis. Monthly
and quarterly data are available from 1971 onwards.
The data considered in this dataset are the quarterly data for the
seasonally adjusted UK unemployment rate for the period 1971 Q1 to
2020 Q4.
The dataset contains data for the following variables:
• year: the year that the observation relates to
• quarter: the quarter (of the year) that the observation relates to,
taking the value 1, 2, 3 or 4
• unemploymentRate: the seasonally adjusted UK unemployment
rate for the quarter that the observation relates to.
The data for the first six observations in the unemployment dataset
are shown in Table 1.
Table 1 The first six observations from unemployment
Source: Office for National Statistics, 2021a, release date 23 February 2021
The simplest and often the first approach to describing a time series is to
use a time plot: that is, a connected scatterplot (line plot) of the data
against time. An example of a time plot is given next – Figure 1 shows the
evolution of the seasonally adjusted quarterly rate of unemployment
from 1971 to 2020. It yields an immediate historical panorama of the
evolution of unemployment in the UK over a period of 50 years.
Figure 1 Time plot of the seasonally adjusted quarterly rate of UK unemployment, 1971 to 2020
Figure 1 shows that the trajectory of the rate of unemployment left very
distinctive historical footprints about what happened in the past. The rate
of unemployment clearly varied quite markedly over time, showing a
landscape of hills, valleys and occasional flatlands.
Economic historians interpret these patterns taking into account the
broader context of economic history of the UK, and particularly of the
successive distinctive policy regimes that prevailed during these years.
During the 1950s and 1960s (not shown in Figure 1) the rates of
unemployment fluctuated around or below 2% (Santos and Wuyts, 2010,
p. 33). These were the heyday of Keynesian economic policies
characterised by the primacy given to maintaining full employment in the
economy. As Figure 1 shows, the rate of unemployment started rising
during the 1970s but still remained historically relatively low in light of
subsequent developments.
A radical break with the past occurred at the start of the 1980s when
exceptionally high levels of unemployment prevailed during most of this
decade and into the 1990s. The 1980s was the decade in which Margaret
Thatcher was the UK’s prime minister, when maintaining low
unemployment no longer featured as a key focus of economic policy.
Most of the 2000s (up to the second half of 2008) showed moderate
variation in the rate of unemployment at relatively lower levels. This
period became known as the period of the ‘great moderation’. But also,
In the next activity, you will consider the extent to which the
unemployment data have these two properties.
What does the Figure 1 time plot of the seasonally adjusted quarterly rate
of unemployment suggest about the variable’s persistence and momentum
over time?
While a time plot, such as that in Figure 1, is often the first technique
used to describe a time series, other visual techniques are used to start
breaking into the intertemporal properties of a time series, and to focus on
how a variable is explained in terms of its own past behaviour. To do this,
it is convenient at this stage to introduce you in Box 5 to the idea of the
lag operator.
We will make use of the lag operator and lagged variables in the two plots
that are introduced next: autoregressive scatterplots in Subsection 2.2 and
correlograms in Subsection 2.3.
Figure 2 Autoregressive scatterplots of the rate of unemployment against: (a) lag 1, (b) lag 2, (c) lag 3
and (d) lag 4
As has already been noted, the lag 4 autoregressive plot for quarterly data
plots values against those from the same quarter of the previous year. So
dependencies in such plots often reflect seasonality in such data. Recall,
however, that in Figure 2 we are plotting seasonally adjusted rates of
unemployment and, hence, any seasonal particularities have been removed
from the raw data on which this series was based. This explains why
Figure 2(d) shows no specific features that might be due to seasonality.
While the autoregressive scatterplot technique already goes a long way
towards suggesting how far back past values of a time series influence its
current value, visual inspections are often not enough to model the degree of
persistence of a time series variable. This is why autoregressive
scatterplots are often used in conjunction with a plot called a correlogram.
So, in the next subsection we will introduce the correlogram and the
summary statistic it plots: the autocorrelation coefficient.
The correlogram will visually and quickly show how far back the
autocorrelation of a variable and its lag is significantly different from zero,
as you will see in Example 2 and Activity 5.
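As a minimal R sketch of these two tools (using a simulated persistent series rather than the module’s unemployment dataset, so that it runs on its own), an autoregressive scatterplot can be built from the series and its lag, and acf() produces a correlogram with approximate confidence bounds:

set.seed(42)
unemp <- as.numeric(arima.sim(model = list(ar = 0.95), n = 200)) + 6  # a persistent series
n <- length(unemp)
plot(unemp[1:(n - 1)], unemp[2:n],
     xlab = "Unemployment rate (%) lag 1",
     ylab = "Unemployment rate (%)")   # lag-1 autoregressive scatterplot
acf(unemp, lag.max = 20)               # correlogram: autocorrelations at lags 0 to 20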
Figure 3 The correlogram of a quarterly rate of unemployment time series
What does the correlogram in Figure 3 suggest about how far back
unemployment and its lags are correlated?
Reflect upon your answers for Activities 3 to 5; what is your takeaway
message in terms of predicting the rate of unemployment based on its past
evolution?
While the autoregressive scatterplots and the correlogram are tools that
most time series econometricians will use to start the exploration of the
intertemporal properties of a time series variable, their suggestions about
how far back a time series variable is correlated with its lags can be
misleading. Activity 5 showed how slowly the correlation between
unemployment and its lags was decreasing, and it suggested that
unemployment is still significantly correlated with its value from five years
back. This pattern of slowly decreasing correlation can, however, be
explained by the presence of a time trend. While this is unlikely to be the
case for variables such as unemployment rates, which are bounded within a
limited range, you will see throughout this unit how difficult it can be to
disentangle persistence from a time trend – either a deterministic trend
(no random element) or what is referred to as drift or a stochastic trend
(a random element to the trend).
In Activity 4, you also came across another warning sign when using these
tools. The R2 was close to 1, which – at least with time series data – is
often a sign of misspecification and high multicollinearity. In this unit, you
will see that one solution in such cases is to use the first differencing
operator (∆) discussed in Section 1. Let’s look at such an example in the
next subsection.
Source: Office for National Statistics, 2021b and 2021c, release date
12 February 2021
Figure 4 shows the time plot for quarterly GDP in the UK between 1955
and 2020. As you will see, GDP exhibits an overall upward trend over the
period, interspersed with episodes of recession. Falls in GDP associated
with the recessionary period after the financial crisis of 2007–2008 and the
sharp fall in GDP at the onset of the COVID pandemic in early 2020 are
clearly visible in the plot.
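A growth-rate series like the one plotted in Figure 6 can be obtained from the GDP level using the first difference of its logarithm. The sketch below uses a simulated GDP-like series (illustrative only) so that it is self-contained:

set.seed(7)
gdp <- 100000 * exp(cumsum(rnorm(260, mean = 0.006, sd = 0.01)))  # simulated quarterly GDP level
growth <- 100 * diff(log(gdp))   # approximate quarterly growth rate (%)
plot(growth, type = "l", xlab = "Quarter", ylab = "GDP growth rate (%)")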
Figure 5 shows the correlogram for quarterly GDP.
Figure 4 Real UK GDP: 1955 Q2 to 2020 Q4
Figure 5 Correlogram of UK GDP: 1955 Q2 to 2020 Q4
Figure 6 is a time plot of the UK GDP growth rate. How did the quarterly
rate of GDP growth vary over the period 1955 Q2 to 2020 Q4?
Figure 6 Seasonally adjusted quarterly GDP growth rate: 1955 Q2 to 2020 Q4
The unemployment and UK GDP datasets cover similar time periods; this
enables us to consider how historical events have impacted on both time
series. In Activity 7, you will consider one such event – the onset of the
COVID pandemic.
The Solution to Activity 7 may have surprised you. During the Great
Depression of the 1930s, the fall in output (GDP) and the rise in
unemployment went hand in hand. Similarly, after the 2008 financial crisis, unemployment
rose quite steeply when GDP growth fell in the UK and lingered on at this
level for quite some time. In 2020 unemployment rose but not exceptionally
so. The difference in 2020 was the widespread and massive use of wage
subsidies paid out of public funds to prevent a fall in employment and in
wage incomes during extended periods of lockdown, a policy widely
During the COVID pandemic, implemented in most European countries with different levels of generosity
many businesses were forced to and inclusion. In the UK, this system became known as the furlough
close their doors system. For economists it is interesting to note that whenever a crisis hits,
Keynes’ ideas that a government’s fiscal policies can and should play an
active role in preventing an economy’s output and employment from imploding
suddenly appear not to be so outdated or obsolete after all.
Note that the exceptional outlying values of GDP growth rates in 2020,
what econometricians often refer to as a structural break in the time
series, render it more difficult to get a good view of the patterns of
variation in the period from 1955 to 2019. For this reason, after taking
note of the exceptional behaviour of GDP growth during 2020, it is useful
to restrict the data range to 1955 Q2 to 2019 Q4 to get a better view of
the long-term pattern of variation. The resulting time plot is shown in
Figure 7. This time plot makes it easier to get a more in-depth look at the
level and variation around a mean of the GDP growth rates before 2020.
Figure 7 Seasonally adjusted quarterly GDP growth rate: 1955 Q2 to 2019 Q4
3 Random walks
As shown in Section 2, one way to make sense of time series is to plot them
and relate what you see back to the historical contexts within which they
arose. For example, in Subsection 2.1 we looked at the evolution of the
rate of unemployment in the UK (Figure 1). There we noted distinctive
wave-like patterns over time in that time series, which we briefly sought to
explain in terms of past changes in policy emphases and policy regimes not
unlike what economic historians would do. Alternatively, as we also
showed in Subsections 2.2 and 2.3, we can seek to explain a time series in
terms of its own past by looking at its autoregressive behaviour using
autoregressive scatterplots and correlograms. All of this can be thought of
as exploratory data analysis for time series data.
Another approach to the analysis of time series, confirmatory analysis, was
developed under the impulse of the Haavelmo–Cowles research programme
in econometrics, which was initiated in the 1940s. This approach focused
on modelling the causal economic mechanisms that produced economic
outcomes with the explicit aim to test how well these models fitted the
actual observed patterns in the data. This was the type of analysis laid out
in Unit A1; this necessitated a consideration of the relation of each of the
regressors with the error term to detect possible correlation, and so to
ensure that the regressors of interest really are exogenous or to seek
estimators which accounted for endogenous regressors. As we will see later
in the unit, this is particularly challenging in the presence of
autocorrelations of the dependent variable and of the additional regressors
that a time series model may include. In this context, if a time series
model featured a dependent variable that showed cyclical behaviour, this
cyclical behaviour would be modelled by regressors and by time-related
dummy variables.
But what about the possibility that a time series displays wave-like
patterns which are not accounted for by observable regressors? What if
wave-like behaviour of a time series variable is due to random variation?
As the eminent mathematician George Pólya remarked, we should never
forget that chance is ‘an ever present rival conjecture’ when we seek to
explain observable phenomena (Pólya, 1968, p. 55). So, in this section, we
will look at why we should never ignore chance variation as a potential
explanation for fluctuations displayed in time series.
In 1927, the Russian probability theorist Eugen Slutzky had already
formulated this alternative perspective on modelling the behaviour of time
series – and of business cycles in particular – by asking the following very
intriguing question:
Is it possible that a definite structure of a connection between
random fluctuations could form them into a system of more or less
regular waves?
(Slutzky, 1937, p. 106)
Slutzky never asserted that economic cycles are merely due to the cumulation of
random effects, but rather warned us that this possibility is a plausible
rival conjecture that should not be left out of the picture. Ignoring this
possibility would mean that we might seek to infer causal structural
mechanisms where in fact none exists.
Slutzky dramatically changed the perspective on the analysis of time series
because his analysis implied that random disturbances can be an
important part of the data-generating process. The concept of a random
walk, which has become integral to modern econometric theory and
practice, encapsulates Slutzky’s idea that the behaviour of time series may
be driven by the cumulation of random effects.
In Subsection 3.1, we will introduce the simplest of random walks: the
simple random walk. In Subsection 3.2, we consider the random walk with
drift. We will use R to simulate both of these in Subsection 3.3 and hence
explore what impact changing parameters has. Finally, in Subsection 3.4,
we will compare a random walk with drift with a model that incorporates
a deterministic trend instead.
Note that Y0 represents the value of the time series at time t = 0 and so in
principle it could be a known quantity, but, as you’ll see in Subsection 3.4,
it can also be a quantity that needs to be estimated.
The definition of the simple random walk means that the trajectory
through time depends exclusively on the behaviour of random
disturbances. Hence, there is no structural component – no in-built
momentum – that determines its movement through time. You will
consider further what simple random walks look like in the next activity.
Figure 9 30 simulated simple random walks
Formally, it can be shown that, for the random walk model given in
Equation (1), the variance of Yt increases linearly with t (time):
V(Yt) = σ²t.
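A minimal simulation sketch of this result (not taken from the module’s notebooks): generate many simple random walks as cumulative sums of i.i.d. disturbances, and check that the variance across walks at time t is roughly σ²t.

set.seed(1)
sigma <- 10
walks <- replicate(1000, cumsum(rnorm(50, mean = 0, sd = sigma)))  # 50 time points x 1000 walks
matplot(walks[, 1:30], type = "l", lty = 1,
        xlab = "Time", ylab = "Value")   # a plot similar in spirit to Figure 9
var(walks[50, ])                         # close to sigma^2 * t = 100 * 50 = 5000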
A summary of the simple random walk is given in Box 9.
Figure 10 30 random walks with drift along the Camino del Norte in Spain.
The dashed line indicates the total length of the Camino del Norte.
As you have seen in Activity 9, the random walk with drift described in
that activity differs markedly from that of a simple random walk. The
reason is that a random walk with drift includes a systematic component.
These walkers set themselves the explicit goal to reach Santiago de
Compostela by walking steadily towards it from day to day.
It turns out that the higher the value of the drift (d) relative to the
standard deviation of the random disturbances (σ), the closer together the
different trajectories will be.
A random walk with drift is a special case of the type of random process
described next in Box 11.
The model with a deterministic trend shown above is not a random walk,
since period by period Y is expected to increase on average by d. The
assumption here is that the history of past disturbances has no effect on
the future trajectory of the time series variable over time. Only the
disturbance of the current period accounts for the random variation
around the trend line.
where the εi are i.i.d. with distribution N(0, σ²). So, in the model of the
random walk with drift, the history of past disturbances carries over into
the present. Past disturbances, therefore, are part of the data-generating
process for the random walk with drift.
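The contrast between the two data-generating processes can be seen in a short simulation sketch (parameter values are illustrative): both series share the same trend of d per period, but only the random walk with drift accumulates its past disturbances.

set.seed(2)
tt <- 1:100
d <- 0.5
sigma <- 2
det_trend <- 10 + d * tt + rnorm(100, sd = sigma)          # deterministic trend: only the current disturbance
rw_drift  <- 10 + d * tt + cumsum(rnorm(100, sd = sigma))  # random walk with drift: disturbances accumulate
matplot(cbind(det_trend, rw_drift), type = "l", lty = 1,
        xlab = "Time", ylab = "Value")  # the drift series wanders persistently away from the trend line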
It is also possible to combine the random walk with drift with a
deterministic trend, as demonstrated in Example 6.
4 Stationarity and lagged dependence of a time series
Note that the definition given in Box 15 is not as restrictive as that given
in Box 14. So any time series that is stationary in the strong sense must
also be weakly stationary.
You may be wondering whether a simpler model would also be suitable for
the time series described in Example 8. You will explore this in Activity 12.
In Activity 12, you considered what would happen if Yt−3 was left out of
the model given in Example 8. The same argument holds for missing out
Yt−1 or Yt−2 , or more than one of Yt−1 , Yt−2 and Yt−3 .
As Example 8 and Activity 12 show, lag dependence is therefore dealt with
by including all lags that our time series variable correlates with. The
challenge in any application is to find out how far back the dependence
goes. Weak dependence ensures that the dependence eventually becomes
small enough not to worry about, but it does not say for what value
of p this will occur. One strategy for determining the value of p is given in
Box 17.
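One way this strategy can be sketched in base R (the series here is simulated and the starting lag order is arbitrary; this is not the module’s notebook code) is to regress the series on its first p lags and inspect the t-test of the highest-order lag:

set.seed(3)
y <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = 300))  # an AR(2) series
p <- 4                       # generous starting lag order
X <- embed(y, p + 1)         # column 1 is y_t; columns 2 to p+1 are lags 1 to p
fit <- lm(X[, 1] ~ X[, -1])
summary(fit)$coefficients    # last row: t-test of the highest-order lag
# If the highest lag is not significant, refit with p - 1 lags and repeat.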
Once we know how far back observations correlate, dealing with lag
dependence is a straightforward exercise. However, stationarity is more
challenging. Most time series we come across in economics are
non-stationary. This should not surprise you. It would indeed require quite
a stretch of the imagination and a complete denial of the importance of
historical context and conjuncture if we were to view the evolution of most
macroeconomic variables of the UK economy during the 20 years covering
the 1950s and 1960s, and their evolution during the 20 years covering the
2000s and 2010s, as two distinct realisations of the same underlying stochastic
process. Most, if not all, economies are very different today compared with
50 or 60 years ago – and national economies are bigger today. These
differences are likely to have affected the ways in which each evolved
dynamically through time.
For example, during this time much of the UK economy has shifted from
manufacturing to services. Canary Wharf in London reflects this change.
As you can see in Figure 11, in the 1950s it was an area of docklands,
whereas in the 2020s it had transformed into a centre for banking.
Figure 11 Canary Wharf in London in (a) the 1950s and (b) the 2020s
A random walk with drift, as well as a random walk with drift and
a deterministic trend, can also be transformed into a stationary series by
taking first differences. So, like the simple random walk, these processes
are also I(1).
5 Testing for stationarity
Figure 12 shows the plots for four different simulated AR(1) series
featuring different values for ρ.
Figure 12 Simulated AR(1) series with different values for ρ
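Series like those in Figure 12 can be generated with a few lines of R (the values of ρ below are illustrative; ρ = 1 gives the simple random walk):

set.seed(4)
sim_ar1 <- function(rho, n = 50, sd = 1) {
  y <- numeric(n)
  for (t in 2:n) y[t] <- rho * y[t - 1] + rnorm(1, sd = sd)
  y
}
rhos <- c(0.2, 0.5, 0.9, 1.0)
series <- sapply(rhos, sim_ar1)
matplot(series, type = "l", lty = 1, xlab = "Time", ylab = "Value")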
The null and alternative hypotheses of the three variants of the DF test
are given in Box 20.
has larger critical values than the standard Student’s t or the standard
normal distribution. We will discuss this distribution in the following
subsection.
The DF test, although intuitive, is limited in the way it models lag
dependence. The errors are assumed to have no autocorrelation, whereas it is
common for autocorrelation to extend beyond lag 1. For this reason, the
most common and reliable unit root test is the Augmented Dickey–Fuller
(ADF) test, which we will discuss next.
The null and alternative hypotheses will be the same as those in Box 20.
The appropriate number of lags (m) can be selected on the basis of
information criteria such as the Akaike information criterion (AIC)
discussed in Unit 2 or using the iterative deletion procedure of high lags
described in Box 17 (Subsection 4.2).
Similarly to DF tests, the test statistic for ADF tests does not follow the
usual t-distribution. Figure 13 plots the probability distribution for the
test statistic estimated from the simple random walk, the random walk
with drift, and the random walk with drift and a deterministic trend. The
distribution of the ADF test statistic, given the null hypothesis, is
obtained via simulation.
Under the null hypothesis that a series contains a unit root, the ADF test
statistic generally takes negative values. Tables 6, 7 and 8 give some of the
critical values for the DF distribution, at 1%, 5% and 10% significance
levels.
Figure 13 The probability distribution of ADF test statistics under the null
hypothesis that the series contains a unit root compared with the usual
t-distribution
Table 6 Critical values of the ADF test for a simple random walk
Table 7 Critical values of the ADF test for a random walk with drift
Table 8 Critical values of the ADF test for a random walk with drift and
a deterministic trend
ADF tests have a low power, that is, a high probability of committing a
type II error. (Recall from previous study that in hypothesis testing a
type II error corresponds to failing to reject the null hypothesis that the
series contains a unit root when it actually does not.) In other words, the
ADF test will conclude a series is I(1) too often. It will also reject the null
hypothesis of a unit root when the ADF model specification does not
capture the stochastic process of the time series variable. Because of the
low power, a unit root test is often done in two steps.
1. Test if the original variable has a unit root; the test will too often fail to
reject the null hypothesis of a unit root (the I(1) conclusion).
2. Test if the first difference of the variable is stationary (which should be
the case if the previous ADF test concludes the series is I(1)).
Even so, problems with the power of the test may arise if:
• the time span is short
• ρ is close to, but not exactly, 1
• there is more than a single unit root (i.e. the series is I(2), I(3), etc.)
• there are structural breaks in the series.
The ADF test is also sensitive to the specification used. Using the wrong
one of the three models can impact on the size of the test, that is, increase
the probability of committing a type I error (rejecting the null hypothesis
when it is true).
In the next two subsections, we will put the ADF test into practice with a
real time series. However, it is worth noting first that whilst the ADF test
is a popular test for the presence of a unit root, there are many other unit
root tests, including the Phillips–Perron test. Econometricians frequently
test for a unit root in a series using several tests. This is because unit root
tests differ in terms of size and power.
Figure 14 (a) Time plot of log(GDP), (b) time plot of first differences of log(GDP), (c) correlogram
for log(GDP) and (d) correlogram for first differences of log(GDP)
The test statistic is the usual t-statistic calculated as the ratio of the
estimator to its standard error. So, reading off from the log(GDPt−1 )
row of Table 9, the ADF test statistic for this example is −2.08.
It turns out that the critical values for this particular test are −3.44
(1%), −2.87 (5%) and −2.57 (10%). So, as −2.08 is greater than
−2.57, this suggests that the p-value is greater than 0.10. Thus there
is insufficient evidence to reject the null hypothesis, and therefore we
conclude that log(GDPt ) has a unit root and is therefore
non-stationary.
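The regression behind an ADF test of this kind can be fitted directly with lm() in base R. The sketch below mirrors the specification of Table 9 (a drift term plus four lagged differences) but runs on a simulated log(GDP)-like series, so the numbers it produces are not those of Example 9:

set.seed(5)
lgdp <- 9 + cumsum(rnorm(260, mean = 0.006, sd = 0.01))  # simulated log(GDP): a random walk with drift
d <- diff(lgdp)                    # delta log(GDP_t)
lag1 <- lgdp[-length(lgdp)]        # log(GDP_{t-1}), aligned with d
m <- 4                             # number of lagged differences
n <- length(d)
adf_data <- data.frame(dy = d[(m + 1):n], ylag1 = lag1[(m + 1):n])
for (j in 1:m) adf_data[[paste0("dlag", j)]] <- d[(m + 1 - j):(n - j)]
adf_fit <- lm(dy ~ ., data = adf_data)
summary(adf_fit)$coefficients["ylag1", ]  # this t-value is the ADF statistic, to be compared
                                          # with DF (not Student's t) critical values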
In Example 9, you saw that the ADF test suggests that log(GDPt ) has a
unit root. However, the results in Table 9 also suggest the ADF test model
could drop the lag 4 term. So the regression without it should be
re-estimated. The final test statistic, the usual t-statistic for the
log(GDPt−1 ) term in the re-estimated regression, would then be compared
with the critical value of the DF distribution assuming a random walk
with drift.
At the end of Subsection 5.2, we explained that the unit root test is often
done in two steps, given its high probability of a type II error. So if the test suggests that
log(GDPt ) has a unit root, then an ADF test on its first differences should
return a result suggesting the latter is stationary. In the next subsection,
you will explore the specification of an ADF test.
6 Modelling more than one time series variable
With time series, one can predict the (as yet unknown) future trajectory of
a variable in terms of its own past behaviour (as well as, more generally, of
the past behaviour of other relevant explanatory variables). This is called
forecasting, which is one of the main uses of time series data.
When using more than one time series variable, the issues related to
stationarity persist. All the variables in an econometric model have
to be stationary and weakly dependent to render any forecasting or
modelling work useful. Individually, the procedure of finding and
modelling the lag dependence, and then using an ADF test to find a
stationary transformation, needs to be performed for each variable.
However, with more than one variable, there are additional challenges and
additional solutions to non-stationarity. Subsection 6.1 will discuss one
major challenge of having an econometric model with at least two time
series variables that are non-stationary, which is called
spurious regression. Subsection 6.2 will then discuss cointegration, which is
a characteristic of I(1) variables (or I(p) variables more generally) which
allows them to be included in an econometric model without first
differencing. Then Subsection 6.3 will discuss error correction models,
which allow the econometrician to better understand how cointegration
works. Finally, in Subsection 6.4, you will see how to implement the ideas
in this section using R.
Figure 15 Two simulated time series generated by a random walk with
drift (rwd1 and rwd2)
Table 10 contains the results from regressing one of the random walks
with drift against the other using OLS. The regression results show a
high degree of correlation between the two series rwd1 and rwd2. The
coefficient on rwd2 is close to 1 and appears to be highly statistically
significant.
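A spurious regression of this kind is easy to reproduce. The sketch below simulates two independent random walks with drift (parameter values are illustrative, not those behind Figure 15 and Table 10) and regresses one on the other:

set.seed(6)
rwd1 <- 2 + cumsum(rnorm(50, mean = 2, sd = 3))
rwd2 <- 1 + cumsum(rnorm(50, mean = 2, sd = 3))
spurious <- lm(rwd1 ~ rwd2)
summary(spurious)$r.squared      # typically high, even though the series are unrelated
summary(spurious)$coefficients   # rwd2 usually looks 'highly significant'
# Granger and Newbold's rule of thumb: be suspicious when R-squared exceeds the
# Durbin-Watson statistic (available, for example, via lmtest::dwtest if lmtest is installed).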
In Example 10, you saw that with time series data a spurious regression
can fit the data very well. So when should we suspect that a regression
might be spurious?
One rule of thumb, arising out of Granger and Newbold (1974) involves a
statistic known as the Durbin–Watson (DW) statistic described in Box 21.
6.2 Cointegration
The concept of cointegration is used by econometricians to infer whether
there is a meaningful statistical relationship between two or more I(1)
variables (or I(p) variables more generally) over time. Note that, whereas
the concept of order of integration is a property of a single time series
variable, the concept of cointegration is a property of how two or more
time series variables relate.
Think of examples such as the relationship between short- and long-term
interest rates, household incomes and expenditures, or commodity prices in
geographically separated markets. Economic theory suggests that these
respective pairs of variables have a long-run equilibrium relationship
between them. In other words, when the series move apart from one time
point to the next – for example, due to a force that acts only in the short
term – an equilibrating process tends to make them converge so that in the
long term they tend to trend together.
For example, if gold prices were higher in the USA compared with South
Africa, then merchants would be able to profit from buying in South Africa
and selling in the USA if the price differential were greater than the costs
for transportation and marketing. The possibility to benefit in this way
from price differences is called arbitrage. As more merchants engage in
this arbitrage trading, the price of gold in South Africa would rise until it
was no longer worthwhile for anyone to buy gold in South Africa simply to
sell it for a higher return in the USA; effectively, this would be until the
last individual achieved zero profit and was indifferent between doing
arbitrage or not.
If we were to plot the gold price in South Africa and the USA over time,
we would expect to see them trending closely together, and to come closer
after temporary departures.
If such an equilibrium relationship exists between two I(1) time series, you
would expect the error term of the regression between them to be weakly
dependent and stationary. This is the intuition behind the idea of
cointegration, and behind the strategy used by the Engle–Granger test for
cointegration described in Box 22. If two (or more) non-stationary time
series, integrated of order 1, are cointegrated, there is a stable equilibrium
relationship between them. The error term resulting from regressing one
on the other will be stationary, i.e. integrated of order 0. In effect, the
‘persistence/drift’ processes in the I(1) series will have cancelled each other
out, resulting in an error term with no persistence or drift.
Suppose Yt and Xt are two I(1) time series. The Engle–Granger test for
cointegration is an ADF unit root test applied to the residuals ût obtained
from the regression
Yt = β0 + β1 Xt + ut .
While this is effectively an ADF test, the exact critical values of the
Engle–Granger test are slightly different (MacKinnon, 2010). For
simplicity, however, we will use the critical values of the DF distribution.
We will apply the Engle–Granger test to a couple of real datasets in
Subsections 6.2.1 and 6.2.2.
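As a minimal sketch of the two steps in Box 22 (using two simulated cointegrated I(1) series rather than real data, and the simple random walk variant of the residual test), the procedure is: estimate the cointegrating regression by OLS, then run an ADF-type regression on its residuals.

set.seed(8)
x <- cumsum(rnorm(200, mean = 0.5))                                        # an I(1) series
y <- 2 + 0.7 * x + as.numeric(arima.sim(model = list(ar = 0.5), n = 200))  # cointegrated with x
step1 <- lm(y ~ x)             # cointegrating regression
u_hat <- resid(step1)
du <- diff(u_hat)
ulag <- u_hat[-length(u_hat)]
summary(lm(du ~ 0 + ulag))$coefficients["ulag", ]  # compare this t-value with the
                                                   # Engle-Granger (here, DF) critical values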
Having sourced some data, let’s start with some exploratory data analysis
in the next two activities.
Figure 16 shows the US GDP and PCE plotted over the period 1970
to 1991. Describe the trajectories of the two time series.
Figure 16 Time plot of quarterly US GDP (gdp) and US PCE (pce)
from 1970 to 1991
Figure 17 shows plots of the ACF for US GDP and US PCE as well as for
their first differences. What do these correlograms suggest about the order
of integration and lag dependence of GDP and PCE?
Figure 17 Correlograms: (a) ACF of GDP, (b) ACF for the first difference of GDP, (c) ACF of PCE, (d) ACF
for the first difference of PCE
In order to establish the order of integration of the GDP and PCE series,
we perform the ADF test for each series and their first differences. We will
do it in two different ways:
• by performing the ADF tests on the variables in levels (Activity 17)
• by performing the ADF tests on the variables after first differencing –
which, in the case that the original variables are I(1), will be I(0)
(Activity 18).
For both of these ways, we choose the ADF test that includes a drift term
to test for a unit root of gdp and pce. This is because, in an analysis that
we are not showing, the AIC criterion confirmed a lag dependence of
order 1 in the first differences.
Table 13 ADF results for pce, using ∆ pcet ∼ pcet−1 + constant + ∆ pcet−1
We have found that both variables in levels are I(1). In general, and in the
absence of cointegration, we would need to find a transformation of each
variable which was stationary.
The next step is to test for cointegration. If there is cointegration, the
variables can be used in levels, and so the econometric model that has the
greater economic significance can be used.
Since the series pce and gdp are of the same order of integration, I(1), we
can estimate the cointegrating regression
pcet = β0 + β1 gdpt + ut . (2)
The results from the cointegrating regression are shown in Table 16.
Table 16 Regression results from the cointegrating regression in pcet ∼ gdpt
The residuals from the estimated regression are shown in Figure 18.
Based on this plot, state which of the variants (given in Box 20 of
Subsection 5.1) would be more appropriate:
• simple random walk
• random walk with drift
• random walk with drift and a deterministic trend,
and give the model specification.
Figure 18 Plot of residuals from the cointegrating regression in Equation (2)
The results from the regression fitted as part of the ADF test on the
residuals from the estimated cointegrating regression, using the
specification given in the Solution to Activity 19, are shown in Table 17.
Table 17 ADF test results with ût modelled as a simple random walk
The critical values for this particular ADF test are −2.6 at 1%, −1.95
at 5% and −1.61 at 10%. Interpret the results.
Since the series are cointegrated, we can substitute the coefficients from
the cointegrating regression in Table 16 into Equation (2) to interpret the
regression results:
pcet = −298.1 + 0.73 gdpt + ût .
Because the series are cointegrated, we can say that there is a long-run
equilibrium statistical relationship between US GDP and PCE between
1970 and 1991. We will return to this idea of long-run equilibrium
relationships in Subsection 6.3.
If relative PPP holds for Japan and the USA, we would expect there to be
a stable long-run statistical relationship between jppt and uspt . Testing
for cointegration between jppt and uspt over time would therefore serve to
test the theory of relative PPP.
We will use the data described next to explore this.
Using data from the PPP dataset, Figure 19 shows the evolution over time
of the two time series et + pJapan,t and pUS,t .
Figure 19 Price levels in Japan and the USA at US constant prices from
1973 Q1 to 2008 Q2
The two series were tested for the presence of unit roots using ADF tests.
Both series were found to be I(1). If we found that the series had different
orders of integration, then it would not be possible to perform a test for
cointegration and we would be able to conclude immediately that the
theory of PPP does not hold.
Table 19 shows the results from the cointegrating regression
jppt = β0 + β1 uspt + ut .
The ADF test statistic was found to be −2.852. It turns out that the 1%
critical value is −2.58. So we can reject the null hypothesis that the
residuals contain a unit root in favour of the alternative hypothesis that
they are stationary. Since both time series are I(1) and the residuals of the
cointegrating regression are I(0), we can conclude that the time series are
cointegrated, so there is a long-term stable statistical relationship between
the two series. This provides some support for the theory of relative PPP.
spot futures
112.94 116.00
115.86 119.35
112.62 114.90
113.65 116.65
116.14 119.90
116.36 119.80
7 Modelling time with panel data
the basis of estimators which deal with autocorrelated variables. Let’s see
how.
Consider an AR(1) model with a stationary regressor X which has no lag
dependence:
Yi,t = α0 + α1 Yi,t−1 + α2 Xi,t + ui,t ,
where ui,t = fi + νi,t , such that fi may be correlated with Xi but νi,t is still
i.i.d. with zero mean and constant variance. The FD estimator would
apply OLS on the differenced model
Yi,t − Yi,t−1 = α1 (Yi,t−1 − Yi,t−2 ) + α2 (Xi,t − Xi,t−1 ) + νi,t − νi,t−1 .
However, OLS would be biased and inconsistent because the explanatory
factor Yi,t−1 − Yi,t−2 and the error term νi,t − νi,t−1 are now correlated,
since Yi,t−1 depends on νi,t−1 .
In Unit A1, you saw alternatives to OLS when regressors were endogenous.
One such alternative was the instrumental variable (IV) estimator. This
relied on finding an instrumental variable which is correlated with the
endogenous regressor but has no direct impact on the dependent variable.
Anderson and Hsiao (1981) realised that the correlation exists because the
first term of the regressor is correlated with the last term of the error term.
By using second and third lags of the dependent variable, this correlation
would be removed, provided that the AR(1) model is sufficient to model
the persistence of the dependent variable. Either the lagged variables
themselves, or their differences, provided they are of order 2 or higher,
would be adequate instruments.
This is the Anderson–Hsiao estimator. Starting from an FD estimator,
they analyse how far back the error terms go in terms of lags, and use
instruments of higher order still. Even if we had reason to believe that the
error terms might be following an AR(1) process, we could still follow this
strategy, ‘backing off’ one period and using the third and fourth lags of Y
(presuming that the time series for each cross-sectional unit is long enough
to do so).
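A sketch of the Anderson–Hsiao idea, written as a manual two-stage least squares on simulated panel data (all names, parameter values and the naive second-stage standard errors are for illustration only, and this is not the estimator code used in the module):

set.seed(9)
N <- 200
Tn <- 6
panel <- do.call(rbind, lapply(1:N, function(i) {
  f <- rnorm(1)                 # unobserved individual effect
  x <- f + rnorm(Tn)            # regressor correlated with the individual effect
  y <- numeric(Tn)
  y[1] <- f + rnorm(1)
  for (t in 2:Tn) y[t] <- 0.5 * y[t - 1] + 0.3 * x[t] + f + rnorm(1)
  data.frame(id = i, t = 1:Tn, y = y, x = x)
}))
lag_by_id <- function(v, id, k) ave(v, id, FUN = function(z) c(rep(NA, k), head(z, -k)))
panel$y_lag1 <- lag_by_id(panel$y, panel$id, 1)
panel$y_lag2 <- lag_by_id(panel$y, panel$id, 2)   # the instrument
panel$x_lag1 <- lag_by_id(panel$x, panel$id, 1)
panel$dy     <- panel$y - panel$y_lag1
panel$dy_lag <- panel$y_lag1 - panel$y_lag2       # endogenous differenced regressor
panel$dx     <- panel$x - panel$x_lag1
d <- panel[complete.cases(panel[, c("dy", "dy_lag", "dx", "y_lag2")]), ]
stage1 <- lm(dy_lag ~ y_lag2 + dx, data = d)      # first stage: instrument dy_lag with y_lag2
d$dy_lag_hat <- fitted(stage1)
stage2 <- lm(dy ~ 0 + dy_lag_hat + dx, data = d)  # second stage on the differenced model
coef(stage2)   # estimates of alpha1 and alpha2 (these naive second-stage SEs are not valid)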
Wages tend to increase over time, often to counter the effects of rises in the
prices of goods and services. Instead of using a deterministic trend,
Blackburn and Neumark (1993) explain that they use dummy variables for
time instead. Time dummies will not model wage trends as a smooth
gradual increase, but instead will pick up any year-specific effects on
hourly wages at a time when the wage distribution seemed to be changing
rapidly. These time dummies pick up factors such as productivity levels,
the general price level in the economy, and other cyclical economic factors.
The time trend t takes the value 0 for 1979 and increases by one unit each
year. In addition, the authors use a variable which is the product
of years of education and this time trend; that is, they include
an interaction between years of education and the time trend.
Let’s write this augmented model as
log(wagei,t ) = β0 + β1 educi,t + β2 educi,t × t + β3 abilityi
+ β4 year1980 + β5 year1981 + β6 year1982
+ β7 year1983 + β8 year1984 + β9 year1985
+ β10 year1986 + β11 year1987 + ui,t ,
where the time dummies, yearYYYY, take the value 1 if the observation is
collected in that year, and 0 otherwise.
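A sketch of how such a specification might be set up with lm() in R, using simulated data and illustrative names (this is not the Blackburn and Neumark dataset): factor(year) creates the year dummies and educ:trend the interaction between education and the time trend.

set.seed(10)
n <- 1000
wd <- data.frame(year = sample(1979:1987, n, replace = TRUE),
                 educ = sample(10:18, n, replace = TRUE),
                 ability = rnorm(n))
wd$trend <- wd$year - 1979   # time trend: 0 in 1979, increasing by 1 each year
wd$wage <- exp(0.5 + 0.03 * wd$educ + 0.01 * wd$educ * wd$trend +
               0.10 * wd$ability + 0.02 * wd$trend + rnorm(n, sd = 0.3))
fit <- lm(log(wage) ~ educ + educ:trend + ability + factor(year), data = wd)
coef(summary(fit))["educ:trend", ]   # the interaction of interest: the change over time
                                     # in the return to a year of education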
The variable showing the interaction between education and a time trend
is crucial for the authors’ study. It shows how the change in wages for an
additional year of education varies over time during this period. The
positive and statistically significant coefficient estimate for this interaction
in all model specifications, and using all methods, suggests that returns to
schooling were increasing during this period, even accounting for ability
and for measurement error. The authors continue their exploration of the
data to tease out more about this increase, and with further tests – and
Summary
In this unit, we have explored approaches to model both univariate time
series and relationships between time series variables. Some of the main
challenges with modelling and estimating time series models relate to
stationarity and lag dependence. You have seen that while visual
inspections of each time series through time plots and then through
autoregressive scatterplots and correlograms go a long way towards a
theoretical proposal for the data-generating process of the time series, they
fail to capture subtle differences between non-stationarity and the
existence of a trend or between deterministic and stochastic trends, or the
presence of structural breaks. More formal procedures are often required.
You have seen that modelling the lag dependence of a variable often relies
on starting with a higher lag order and then, through inspection of the t-test
of the highest-order term, deciding whether the lag order should be
revised. Once lag dependence is established, testing for stationarity of a
time series variable is important to make sure that only a stationary
transformation is used in modelling. To transform a non-stationary
variable into a stationary variable, first differencing will suffice if the type
of non-stationarity is integration of order 1 (I(1)).
Non-stationarity in the context of more than one time series variable offers
both challenges and opportunities. We discussed the case of spurious
regression of I(1) time series variables that are not cointegrated, and how
to test for cointegration. We also discussed the benefits of cointegration:
on the one hand, the original cointegrated variables can be modelled in levels,
since the error terms are stationary; on the other hand, an
error correction model can be used to analyse the short-term equilibrating
forces that attract the variables towards their long-run equilibrium
relationship.
The final section revisited a data structure discussed in Unit A1 – panel
data – in order to introduce time and time series considerations to panel
data modelling and estimation. We also revisited a study from Unit A1 to
highlight how it modelled time.
As a reminder of what has been studied in Unit A2 and how the sections
in the unit link together, the route map is repeated below.
Section 1 (Stock and flow variables) → Section 2 (Describing intertemporal
properties of time series) → Section 3 (Random walks) → Section 4
(Stationarity and lagged dependence of a time series) → Section 5 (Testing
for stationarity) → Section 6 (Modelling more than one time series
variable) → Section 7 (Modelling time with panel data)
Learning outcomes
After you have worked through this unit, you should be able to:
• apply time plots, autoregressive scatterplots and correlograms to the
exploration of time series data and as first steps towards modelling time
series variables
• define persistence and momentum and their role in modelling time series
data
• define the concept of stationarity and its implication for analysing time
series data
• define weak stationarity
• use logs, lags and differencing in modelling time series data
• model the lag dependence of a variable
• test for the presence of a unit root using ADF tests
• recognise spurious regressions and know techniques for avoiding them in
applied work
• understand the concept of cointegration and its importance in modelling
I(1) series
• use the Engle–Granger test to test for cointegration between two I(1)
series
• estimate an error correction model (ECM) to analyse the long- and
short-term relationship between two I(1) series that are cointegrated
• simulate data-generation processes including white noise, random walk
and random walk with drift using R
• apply the Anderson–Hsiao estimator to panel data
• describe the main modelling and data choices made by economists in the
analysis of their economic problems.
References
Alvarez, J. and Arellano, M. (2003) ‘The time series and cross-section
asymptotics of dynamic panel data estimators’, Econometrica, 71(4),
pp. 1121–1159.
Anderson, T.W. and Hsiao, C. (1981) ‘Estimation of dynamic models with
error components’, Journal of the American Statistical Association,
76(375), pp. 598–606. doi:10.2307/2287517.
Blackburn, M.L. and Neumark, D. (1993) ‘Omitted-ability bias and the
increase in the return to schooling’, Journal of Labor Economics, 11(3),
pp. 521–544.
Cleveland, W.S. (1993) Visualizing data. Peterborough: Hobart Press.
Dickey, D.A. and Fuller, W.A. (1979) ‘Distribution of the estimators for
autoregressive time series with a unit root’, Journal of the American
Statistical Association, 74(366), pp. 427–431. doi:10.2307/2286348.
Enders, W. (2015) ‘COINT PPP.XLS’. Available at:
https://wenders.people.ua.edu/3rd-edition.html
(Accessed: 5 January 2023).
Granger, C.W.J. and Newbold, P. (1974) ‘Spurious regressions in
econometrics’, Journal of Econometrics, 2(2), pp. 111–120.
doi:10.1016/0304-4076(74)90034-7.
Gujarati, D. (2004) Basic econometrics, 4th edn. New York: McGraw Hill.
ICE Futures U.S. (no date) ‘Coffee C Futures’. Available at:
https://www.theice.com/products/15/Coffee-C-
Futures/data?marketId=6244298&span=3 (Accessed: 1 April 2021).
Institute for Government (2021) ‘Timeline of UK coronavirus lockdowns,
March 2020 to March 2021’. Available at:
https://www.instituteforgovernment.org.uk/sites/default/files/timeline-
lockdown-web.pdf (Accessed: 10 December 2022).
International Coffee Organization (no date) ‘Historical data on the global
coffee trade’. Available at: https://www.ico.org/coffee_prices.asp
(Accessed: 1 April 2021).
Klein, J.L. (1997) Statistical visions in time. A history of time series
analysis 1662–1938. Cambridge: Cambridge University Press.
Leamer, E.E. (2010) Macroeconomic patterns and stories. Berlin:
Springer-Verlag.
MacKinnon, J.G. (2010) Critical values for cointegration tests, Queen’s
Economics Department Working Paper No. 1227, Queen’s University,
Kingston, Ontario, Canada. Available at: https://www.researchgate.net/
publication/4804830_Critical_Values_for_Cointegration_Tests
(Accessed: 6 January 2023).
Office for National Statistics (no date) ‘Seasonal adjustment’. Available at:
https://www.ons.gov.uk/methodology/methodologytopicsandstatistical
concepts/seasonaladjustment (Accessed: 22 December 2022).
Office for National Statistics (2021a) ‘Unemployment rate (aged 16 and
over, seasonally adjusted): %’, release date 23 February 2021.
Available at: https://www.ons.gov.uk/employmentandlabourmarket/
peoplenotinwork/unemployment/timeseries/mgsx/lms/previous
(Accessed: 13 December 2022).
Office for National Statistics (2021b) ‘Gross Domestic Product at market
prices: Current price: Seasonally adjusted £m’, release date 12 February
2021. Available at: https://www.ons.gov.uk/economy/
grossdomesticproductgdp/timeseries/ybha/pn2/previous
(Accessed: 13 December 2022).
Office for National Statistics (2021c) ‘Gross Domestic Product: Quarter on
Quarter growth: CVM SA %’, release date 12 February 2021. Available at:
https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/
ihyq/pn2/previous (Accessed: 13 December 2022).
Pólya, G. (1968) Patterns of plausible inference (Volume II of Mathematics
and plausible reasoning), 2nd edn. Princeton, NJ: Princeton University
Press.
Santos, C. and Wuyts, M. (2010) ‘Economics, recession and crisis’. In:
DD209 Running the economy. Milton Keynes: The Open University.
Slutzky, E.E. (1937) ‘The summation of random causes as the source of
cyclic processes’, Econometrica, 5(2), pp. 105–146. doi:10.2307/1907241.
Tooze, A. (2021) Shutdown: How Covid shook the world’s economy.
London: Penguin.
WHO (2021) ‘Listing of WHO’s response to COVID-19’. Available at:
https://www.who.int/news/item/29-06-2020-covidtimeline (Accessed:
10 December 2022).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 2.1, closing down sale: bunhill / Getty
Subsection 2.4, Closed businesses during COVID: shaunl / Getty
Subsection 3.2, Camino del Norte: José Antonio Gil Martínez / Flickr.
This file is licensed under Creative Commons-by-2.0.
https://creativecommons.org/licenses/by/2.0/
Figure 11(a): Heritage Image Partnership Ltd / Alamy Stock Photo
Figure 11(b): Commission Air / Alamy Stock Photo
Subsection 6.2.2, burger: Bennyartist / Shutterstock
Subsection 6.4, coffee beans: Helen Camacaro / Getty
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
(a) Stock. Your bank balance is always given for a particular point in
time (for example, when you last checked your balance).
(b) Flow. Your food expenses relate to a given period of time (for
example, daily, weekly or monthly).
(c) Flow. Miles travelled is measured over a period of time (in this case,
daily).
(d) Stock. The number of unemployed people is measured at a particular
point in time.
(e) Flow. Infant mortality is measured as the number of infant deaths per
thousand of live births during a given period of time (usually, one
year).
(f) Flow. The rate of inflation of consumer prices is effectively the
relative change of consumer prices in a country. It is the change of a
stock variable and is defined over a period of time – say, month,
quarter or year.
Solution to Activity 2
The property of persistence is not immediately obvious from looking at
Figure 1. However, the fact that observations that are close together in
time also tend to be close together in the value of their rates of
unemployment suggests this is a persistent time series variable.
There is no overall pattern of momentum in this time series for the period
as a whole. But the data clearly display a number of episodes (a succession
of fairly long stretches in terms of several quarters in a row) that move
either upwards or downwards.
Solution to Activity 3
(a) Autoregressive scatterplots show high correlation between a variable
and its lag when the plot is a cloud closely clustered around the
45-degree line. We can see that the greater the lag, the larger the
spread around the line, and so the lower the correlation between
unemployment and its lagged counterpart. In particular, Figure 2(a)
(plotting lag 1) shows very high linear correlation.
(b) In Figure 2, even at lag 4 there is a clear correlation between the two
variables. So the rate of unemployment is a highly persistent time
series.
Solution to Activity 4
(a) The slope coefficient gives the first-order autocorrelation for the rate
of unemployment, which has been estimated to be 0.991. This is very
close to 1. An autocorrelation of 1 represents a perfect
autocorrelation.
(b) The p-value associated with this slope is very small, so there is strong
evidence that the population autocorrelation coefficient is not zero.
Solution to Activity 5
Autocorrelations decline as the lag increases, but do so only very slowly. It
is only when the gap is 22 lags or more (corresponding to five and a half
years or more) that the unemployment rate between two time points is
uncorrelated. While the time plot and the autoregressive scatterplots
already suggested a high degree of persistence of the rate of unemployment,
the correlogram makes it easier to visualise how far back one needs to go
to fully pick up the time dependence between time series values.
This time dependence and persistence suggest that, if you hear, for
example, that last quarter’s unemployment rate jumped up by
0.5 percentage points, that’s a really important piece of news. The reason
is that this jump in unemployment is not going away any time soon.
Solution to Activity 6
This time plot lies practically flat with no discernible long-term trend. In
the long period before 2020, the time plot shows that the earlier decades
were characterised by greater volatility in the growth rates, with larger
spikes up and down around a flat long-term trend. (At the time, particularly
during the 1960s and 1970s, the UK economy was often typified by its
stop–go pattern of economic growth. This was the period of fixed exchange
rates and the resulting pattern of growth was strongly influenced by the
tension between the explicit policy objectives to maintain full employment,
on the one hand, while at the same time protecting the pound from
devaluation.)
The most startling feature of the time plot, however, is the exceptionally
large fluctuations in the quarterly growth rate during 2020, particularly in
the second and third quarters – as can be seen in the table below.
These growth rates, and the output compression in the second quarter
of 2020 in particular, were truly exceptional and had not been witnessed
since the depression of the 1930s (Santos and Wuyts, 2010). In
comparison, the financial crisis of 2008 looks like a small disturbance.
Solution to Activity 7
No, they do not show the same pattern. In particular, the unemployment
rate changed in one direction only (i.e. up), whereas the GDP growth rate
decreased then increased dramatically. Whilst unemployment went up, this
rise was not exceptional. In contrast, the exceptional fluctuations in
quarterly growth rate have already been noted in the Solution to
Activity 6.
Solution to Activity 8
Although the data-generating process is the same for each of these random
walks, the resulting plot shows a wide variety of different trajectories. This
is a typical display of a random walk.
The random walks move up and down (at times crossing the zero line once
or more) and several random walks show distinctive wave-like patterns.
The scatter of trajectories shows a clear heteroskedastic pattern: as time
moves on, so does the spread of the scatter. (Recall from Subsection 5.3 of
Unit A1 that ‘heteroskedasticity’ is another term for ‘non-constant
variance’.)
Solution to Activity 9
(a) Let Yt denote the distance travelled in km by a hiker from day 1 up
to t, where t is the cumulative number of days that the walker has
been on the path, for t = 1, . . . , 37.
From the assumptions above, the drift component is d = 22.5 since
the average daily distance covered will be 5 km/hour × 4.5 hours, and
Y0 = 0 because all the hikers start by having covered 0 km.
This gives
Yt = Yt−1 + 22.5 + εt ,
where the error terms εt are i.i.d. with distribution N (0, 100).
(b) A random walk with drift clearly differs markedly from a simple
random walk. The graph of distances reached each day varies around
a linear trajectory that depicts the progress of the average hiker
walking consistently at 22.5 km per day.
Similarly to a random walk without drift, the graph of trajectories
shows a clear heteroskedastic pattern: as time moves on, so does the
spread of the scatter.
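A random walk with drift like this is straightforward to simulate in R. The following is a minimal sketch (an illustration rather than the module's own notebook code; the number of hikers and the random seed are arbitrary choices) that generates and plots several trajectories from the model above.

set.seed(42)                  # arbitrary seed, for reproducibility
n_days   <- 37                # length of the path, in days
n_hikers <- 20                # number of simulated trajectories (arbitrary)
drift    <- 22.5              # average distance walked per day, in km
sigma    <- 10                # standard deviation of the error term, since the variance is 100

# Each column is one hiker's random walk with drift:
# Y_t = Y_{t-1} + 22.5 + eps_t, with Y_0 = 0
Y <- sapply(seq_len(n_hikers), function(i) {
  cumsum(drift + rnorm(n_days, mean = 0, sd = sigma))
})

# Plot all trajectories, together with the deterministic trend 22.5 * t
matplot(seq_len(n_days), Y, type = "l", lty = 1,
        xlab = "Day", ylab = "Cumulative distance (km)")
abline(a = 0, b = drift, lwd = 2)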
Solution to Activity 10
(a) The estimated regression equations and residual standard errors are
as follows.
• Model 1:
unemploymentRatet = 0.0065 + unemploymentRatet−1 .
The residual standard error is 0.267.
• Model 2:
unemploymentRatet = 7.671 − 0.008t.
The residual standard error is 2.318.
• Model 3:
unemploymentRatet = 0.0804 − 0.0007t
+ unemploymentRatet−1 .
The residual standard error is 0.264.
In Model 3, the intercept was statistically significant, as was the
deterministic trend. Despite Model 3 being more complex, the
residual standard errors of Models 1 and 3 are practically the same.
In Model 1, the estimate for the intercept is not statistically
significant as it is smaller than its standard error, suggesting that
perhaps the drift term is not needed. Regression results of Model 2
suggest the simple model with a deterministic trend is a much poorer
fit for the data. Its slope coefficient is negative, as you would expect
from looking at Figure 1, given the very high rates of unemployment
that prevailed during the 1980s.
Given what we know about unemployment and the socioeconomic
context when this variable was recorded, it seems – and results
confirm – that a deterministic trend model is not a good idea because
an additional trend variable fails to capture potential autoregressive
behaviour inherent in its time trajectory.
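Regressions of this general form can be reproduced with ordinary lm() calls once a lagged copy of the series has been constructed. The sketch below is only an illustration: it assumes the quarterly rates are stored in a numeric vector called unemploymentRate (a name I have chosen), and the exact specifications used to produce the estimates quoted above may differ.

u     <- unemploymentRate          # assumed numeric vector of quarterly rates
n     <- length(u)
u_lag <- c(NA, u[-n])              # the rate lagged by one quarter
trend <- seq_len(n)                # deterministic time trend t = 1, 2, ...

model1 <- lm(u ~ u_lag)            # Model 1: lagged dependent variable
model2 <- lm(u ~ trend)            # Model 2: deterministic trend
model3 <- lm(u ~ trend + u_lag)    # Model 3: both

summary(model3)                    # coefficients, t-values and residual standard error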
Solution to Activity 11
(a) In Figure 4, quarterly GDP is clearly trending. So, according to the
mean property of weakly stationary time series variables, the time
plot suggests GDP is non-stationary.
(b) Quarterly GDP growth rate does not appear to be trending. In
Figure 7, quarterly GDP growth rate from 1955 Q2 to 2019 Q4
displays a variance that is not constant over time; it was markedly
larger in the earlier decades than in the later decades of the overall
period. This suggests that quarterly GDP growth rate is
non-stationary.
Solution to Activity 12
By leaving out Yt−3 we would be modelling Yt as
Yt = Y0 + α1 Yt−1 + α2 Yt−2 + vt ,
where the error term is now vt = ut + α3 Yt−3 .
Both regressors would be correlated with this error term, since Yt−1 can be
written as
Yt−1 = Y0 + α1 Yt−2 + α2 Yt−3 + α3 Yt−4 + ut−1
and Yt−2 can be written as
Yt−2 = Y0 + α1 Yt−3 + α2 Yt−4 + α3 Yt−5 + ut−2 .
This violates the main OLS assumption of exogeneity discussed in Box 8 of
Unit A1 (Subsection 2.4).
Solution to Activity 13
(a) As you saw in Box 19, Yt has a unit root when ρ = 1, and is
stationary when |ρ| < 1. Writing δ = ρ − 1, a suitable null hypothesis is H0 : δ = 0 (that is, ρ = 1).
(b) When Yt = d + ρ Yt−1 + εt this means that
∆ Yt = Yt − Yt−1 = d + δ Yt−1 + εt .
This reduces to ∆ Yt = d + εt , when δ = 0.
Solution to Activity 14
The plot in Figure 14(a) shows that log(GDP) has a positive trend over
time. Whether the trend is stochastic or deterministic cannot be
established from the graph, but either way it means that log(GDP) is not
stationary.
The first difference of log(GDP) plotted in Figure 14(b) appears to have a
relatively stable mean, except for the bump corresponding to the period of
economic turmoil in the 1970s. It is not obvious whether the variance
changes with time. This means that the first difference of log(GDP) could
be stationary and hence that the log(GDP) could be difference
stationary I(1).
Figure 14(c) indicates that there is strong persistence in the log(GDP)
series. The autocorrelation series declines only very slowly as the lag
increases.
In Figure 14(d), the ACF falls rapidly with the first lag and remains low
with subsequent lags. It looks a bit like the ACF for the white noise
process, which is a process that we know is stationary (Example 7,
Subsection 4.2); but in the case of a white noise process, the ACF would
stay zero for subsequent lags.
Solution to Activity 15
Both series exhibit upward trends, but it is not obvious if the trend is
stochastic (as in a random walk with drift) or deterministic. It is clear that
both series are non-stationary, since the mean increases over time. The two
series appear to trend together over time.
Solution to Activity 16
In the Solution to Activity 15, it was noted that neither GDP nor PCE
look stationary, so they are not I(0).
The correlograms for GDP and PCE indicate there is strong persistence in
both series and hence dependence on many lags.
The correlograms for the first differences of GDP and PCE suggest that
possibly the autocorrelations after order 1 could be assumed to be 0. So
both first differences may have a lag dependence of order 1. As with the
correlograms of the first difference of log(GDP) given in Activity 14
(Subsection 5.3), a correlogram like this is a bit like the ACF for the white
noise process. So they suggest that both first differences might be
stationary. This means that both GDP and PCE might be I(1).
Solution to Activity 17
(a) For the gdp series, the ADF test statistic is −0.55. For the pce series,
the ADF test statistic is −0.37.
(b) In both cases, the test statistic is more than the 10% critical value.
(Or equivalently, the magnitude of the test statistic is less than the
magnitude of the 10% critical value.) We therefore cannot reject the
null hypothesis in either case. So we conclude that both series are
non-stationary, contain a unit root and, hence, that both variables are
I(1) with drift.
Solution to Activity 18
(a) For the first difference of gdp, ∆ gdp, the ADF test statistic was
calculated to be −3.61.
The ADF test statistic for the first difference of pce, ∆ pce, was
found to be −2.93.
(b) In both cases, the test statistic is larger in magnitude than the 1%
critical value of −2.60, so we can reject the null hypothesis that the
series contains a unit root and conclude it is therefore stationary, I(0).
Since both gdp and pce can be transformed into a stationary series by
taking first differences, we can conclude that gdp and pce are
difference stationary or integrated of order 1, I(1).
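One way of carrying out ADF tests like these in R is with ur.df() from the urca package. The sketch below is only illustrative: the object names gdp and pce, the lag order and the choice of deterministic terms are my assumptions, so the statistics it returns need not match those quoted above exactly.

library(urca)   # install.packages("urca") if necessary

# ADF test on the level of gdp, allowing for a drift term
summary(ur.df(gdp, type = "drift", lags = 1))

# ADF test on the first difference, with no deterministic terms
summary(ur.df(diff(gdp), type = "none", lags = 1))

# The same two tests for pce
summary(ur.df(pce, type = "drift", lags = 1))
summary(ur.df(diff(pce), type = "none", lags = 1))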
Solution to Activity 19
The residuals appear to fluctuate around a stable mean which suggests
that the correct specification for the ADF test is that of a simple random
walk; that is,
∆ût ∼ ût−1 + ∆ût−1 .
Solution to Activity 20
The calculated test statistic is −3.11, the t-statistic for ût−1 in the
regression, which is larger in magnitude than the 1% critical value of −2.6.
We therefore reject the null hypothesis that the series is non-stationary in
favour of the alternative hypothesis that the series is stationary with zero
mean. Since the residuals are found to be I(0), we can conclude that the
two series are cointegrated based on the Engle–Granger test.
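The two steps of the Engle–Granger procedure can be sketched in R as follows, assuming gdp and pce are numeric vectors of the same length. This is an illustration rather than the module's notebook code: which series is regressed on which, and the lag structure, follow the specification in the Solution to Activity 19, but the unit may set these up differently.

# Step 1: the long-run (cointegrating) regression and its residuals
longrun <- lm(pce ~ gdp)
u_hat   <- residuals(longrun)

# Step 2: Dickey-Fuller-type regression on the residuals with no constant,
# including one lagged difference, as in the specification above
n      <- length(u_hat)
du     <- diff(u_hat)              # Delta u_hat_t, for t = 2, ..., n
u_lag  <- u_hat[-n]                # u_hat_{t-1}
du_lag <- c(NA, du[-(n - 1)])      # Delta u_hat_{t-1}

eg <- lm(du ~ 0 + u_lag + du_lag)  # '0 +' suppresses the intercept
summary(eg)  # the t-statistic on u_lag is compared with the critical values used above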
Unit B1
Cluster analysis
Welcome to Strand B
Strand B of M348 is the data science strand, consisting of Units B1
and B2. In this strand, we will be looking at data analysis from a data
science point of view. (Remember that you should study either Strand A
or Strand B. See the module guide for further details.)
So far, the emphasis in this module has been on regression.
• In Book 1, you have learnt about the linear models – regression models
where the response variable is assumed to follow a normal distribution,
and the mean depends on a number of explanatory variables. These
explanatory variables might be covariates or factors or a combination of
the two.
• Book 2 then moved on to considering generalised linear models; that is,
models where the response variable could follow one of a number of
other distributions – in particular, the binomial distribution and the
Poisson distribution.
Whilst you have been learning how to use R to fit such models, little has
been said about what calculations R is actually doing to come up with
estimates for all of the parameters. The implicit presumption is that it
does not matter much; that is, the computing power that R is able to draw
on is sufficient for the estimates to be provided quickly enough, and the
computations will end up with the ‘right’ estimates being reported. For
example, in multiple regression, the estimates will indeed be the least
squares estimates and in generalised linear regression, the estimates will
indeed be maximum likelihood estimates. In this strand, we will move
away from the fitting of such models to consider situations where
computational considerations become more important.
In Unit B1, we will consider cluster analysis. This is a technique that aims
to find groups – clusters – in data. The approaches that we will consider
will not be defined by explicit models. Instead, they emerge from thinking
about different ways of searching for groups in data. For example, by
repeatedly merging observations which are closest together or placing
observations in groups according to their proximity to a number of group
centres. The approach chosen impacts on the clusters that are found.
In Unit B2, we will focus on the data, in particular ‘big data’. As you will
see in this unit, just handling ‘big data’ brings its own challenges. The
number of observations can go into the millions, and fresh observations
might accumulate rapidly. Also, the data might not fit into the neat
structure of a data frame. All this means that even doing relatively simple
data analytical tasks, such as computing a mean, brings challenges. For
example, the dataset might be so vast that it is beyond the storage
capacity of a single computer; and without careful thought about how the
calculation is done, the time required for the computation might be far too
long. The unit does not just show how such data can be handled, but also
what uses such data should – or, more importantly, should not – be put to.
A note on notebooks
Make sure you have all the relevant files installed for the notebook
activities in this strand. Check the module website for any additional
instructions.
Introduction to Unit B1
In this first unit specifically about data science, we will consider cluster
analysis; that is, identifying groups – clusters – in data. Cluster analysis
is one of a number of techniques, such as discrimination, principal
component analysis, factor analysis and segmentation analysis, that are
designed to be used when the data consist of many variables.
The key feature in cluster analysis is that at the outset little is assumed
about the clusters. That is, initially there are no examples of what
members of each of the clusters look like – and often it’s not known how
many clusters there are. Knowledge about the clusters is gained only from
the data themselves. As such, cluster analysis can be seen as an
exploratory data analysis tool which is used to suggest structure in the
data. This knowledge about the structure in the data can then be used to
add insight and/or allow simplification. For example, in the analysis of a
survey, cluster analysis is used to group respondents. Typical responses
given in each of these groups are then used to discover how these groups
differ, as Example 1 will demonstrate. Replacement of observations in each
of the groups by a single typical observation can lead to much
simplification, as Example 2 will demonstrate. This means that the fewer
clusters that adequately represent the structure, the better.
Finding groups in data is useful in many situations, so it is not surprising
to find cluster analysis used across a wide range of disciplines, such as in
food chemistry (Qian et al., 2021), social welfare (Tonelli, Drobnič and
Huinink, 2021), tourism (Höpken et al., 2020) and archaeology (Jordanova
et al., 2020).
Figure 1 Images of a bead necklace: (a) original, and (b) produced using
just ten colours after application of a clustering algorithm
the clusters are assumed to have. Finally, in Section 6, the approaches you
have been learning about will be compared.
The structure of the unit is illustrated in the following route map.
• Section 1: Clusters in data
• Section 2: Assessing clusters
• Section 3: Hierarchical clustering
• Section 4: Partitional clustering
• Section 5: Density-based clustering
• Section 6: Comparing clustering methods
Note that you will need to switch between the written unit and your
computer for Subsections 3.4, 4.6 and 5.4.
1 Clusters in data
Cluster analysis is designed to identify groups within a dataset. It applies
in situations where very little might be known about the groups
beforehand. Often, the number of groups is not known in advance, let
alone what any of the elements in a group look like.
In learning about cluster analysis, the first step is to think about what a
cluster in some data actually is. A general, although vague, definition of a
cluster is given in Box 1.
Figure 2 Occupancy of Broad Street Car Park around noon
In this histogram, there were some days when there were more than
450 cars in the car park. (This is out of a total of 690 spaces.) On
other days, there were fewer than 250 cars in the car park. However,
there were no days on which the number of cars was between 250
and 450. So these data appear to have two clusters. One cluster
corresponds to days when the occupancy was fewer than 250 cars.
The other cluster corresponds to days when the occupancy was more
than 450 cars.
As you can see in Table 2, it turns out that all the days when the
occupancy was high were weekdays, and all the days when the
occupancy was low were at the weekend. So the cluster analysis is
revealing a split between weekday/weekend use of the car park.
Table 2 Occupancy of Broad Street Car Park on different days of the week
eruptions waiting
3.600 79
1.800 54
3.333 74
2.283 62
4.533 85
2.883 55
Figure 3 Durations of, and waiting times between, eruptions
Figure 4 Greyness of pixels in a greyscale version of Figure 1(a)
where the subscripts r, g and b denote the amount of red, green and
blue in a pixel, respectively.
For example, suppose the colours of two pixels are x = (183, 181, 182)
and y = (152, 149, 156). Then
d(x, y) = √((183 − 152)² + (181 − 149)² + (182 − 156)²)
= √(961 + 1024 + 676)
= √2661
≃ 51.6.
This dissimilarity measure corresponds to the distance between the
two observations if they were plotted on a three-dimensional
scatterplot (with the same scale used for each axis).
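This calculation is easy to reproduce in R, either directly from the definition or with the built-in dist() function; a minimal sketch:

x <- c(183, 181, 182)   # red, green and blue values of the first pixel
y <- c(152, 149, 156)   # red, green and blue values of the second pixel

sqrt(sum((x - y)^2))                     # Euclidean distance from its definition
dist(rbind(x, y), method = "euclidean")  # the same value using dist()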
Similarly, suppose that the species at the other site is given by the
vector y = (y1 , y2 , . . . , y150 ), where the ith entry, yi , in this vector is
such that
yi = 1 if species i is present at site y, and yi = 0 if it is absent.
Then the dissimilarity is given by d(x, y), where
d(x, y) = Σi |xi − yi | / Σi (xi + yi ).
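In R, this dissimilarity can be computed directly from the formula; the sketch below uses a small made-up example with six species rather than the 150 referred to above.

# Dissimilarity between two sites based on presence (1) / absence (0) vectors
species_dissimilarity <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Made-up example with six species
x <- c(1, 1, 0, 1, 0, 0)
y <- c(1, 0, 0, 1, 1, 1)
species_dissimilarity(x, y)   # 3/7, approximately 0.43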
Box 4 gives a few dissimilarity measures that are commonly used. This list
is not intended to be exhaustive. Dissimilarity can be, and in fact is,
mathematically defined in other ways.
axes, it is the same as the distance between the two points. When there is
just one variable, Euclidean distance is the same as L1 distance – it is just
the modulus of the difference between the two values. You will practise
calculating some dissimilarities in Activities 3 and 4.
The greyness values of the four pixels in the corners of Figure 1(a) are
given in Table 6.
Table 6 Greyness of the corner pixels in Figure 1(a)
Corner Greyness
Top left (xtl ) 0.713
Bottom left (xbl ) 0.591
Top right (xtr ) 0.765
Bottom right (xbr ) 0.847
The colour values of the four pixels in the corners of Figure 1(a) are given
in Table 7.
Table 7 Colour of the corner pixels in Figure 1(a)
Corners Dissimilarity
Top left and bottom left 51.6
Top left and top right
Top left and bottom right 64.5
Bottom left and top right 70.1
Bottom left and bottom right 115.6
Top right and bottom right 50.2
(a) Using the L1 distance, calculate the dissimilarity between each pair of
countries for the following.
(i) When public social expenditure is measured as a percentage of
GDP and maternity leave in days.
(ii) When public social expenditure is measured as a percentage of
GDP and maternity leave in years.
(iii) When both public social expenditure and maternity leave have
been standardised.
(b) Using the values you calculated in part (a)(i), which two of the three
countries have the most similar child-related policies and which two
countries are most different in this respect?
(c) Does transforming the data make a difference to which countries
appear most similar? Why or why not?
As you will see in later sections, there are various ways this definition is
used to develop techniques for identifying clusters in data. First, in the
next section, you will consider how to assess the extent to which clusters,
once found, meet the definition given in Box 6.
2 Assessing clusters
In Subsection 1.1, you performed cluster analysis in both Activities 1
and 2 by looking at a plot of the data. Later in the unit, you will learn
about some other methods for cluster analysis. However, in this section,
you will consider the issue of how we know whether a suggested clustering
of the observations is any good. That is, to what extent do the clusters that
have been found reflect structure that is really there in the data?
Sometimes there is other information available about what the clusters
should be. For example, clusters might be suggested by the context in
which the data arose, as is demonstrated in Example 11.
Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182
With a dissimilarity matrix, it does not matter what order the rows are
presented in, so long as it is known what each row represents. (Note that
the symmetric nature of the matrix means that the columns always have
the same ordering as the rows.) However, when assessing the extent to
which a set of proposed clusters reflects structure in the data, it is useful
to group rows together by cluster. This means placing all the rows
corresponding to the first cluster first, then placing all the rows
corresponding to the second cluster second, and so on. Arranging the rows
(and with them the columns) in this way means that dissimilarities for
observations in the same cluster will appear together as one block in the
dissimilarity matrix. If the clustering is a good one, these blocks will
become noticeably clear. The values in these blocks should be lower than
elsewhere in the dissimilarity matrix. You will see a dissimilarity matrix
arranged in this way in Example 14.
The ordering of the clusters also does not matter. So, for example, we
could also express this clustering as
{Adnan, Elise}, {Dan} and {Cath, Billy}.
Reorganising the dissimilarity matrix you obtained in Activity 6, so
that the rows are sorted by cluster, results in the following matrix.
Billy Cath Adnan Elise Dan
Billy 0
Cath 6 0
Adnan 10 16 0
Elise 12 18 2 0
Dan 23 29 13 11 0
In this matrix, the rows and columns are grouped by cluster, which
divides the matrix into blocks. Notice that the numbers in the
blocks along the main diagonal are noticeably less than those in the
other blocks. This confirms that the heights of the friends within the
same cluster are closer than the heights of the friends who are in
different clusters.
In Example 14, the dissimilarity matrix is small enough that looking at the
individual numbers to assess patterns (or lack thereof) is not too daunting
a task. However, in most cases the matrix will be too big to do this. For
example, the data on occupancy in the car park given in Example 3
consists of 73 observations (which is a small sample size in many contexts).
Thus the dissimilarity matrix would be a 73 × 73 matrix. So to fit such a
matrix onto an A4 page (210 mm by 297 mm), each element must be tiny:
about 3 mm by 4 mm, which is far too small for most people to read.
However, when judging how good a clustering is, it is only important to be
able to see general patterns in the sizes of the dissimilarities. This can be
conveyed using colour as a scale. Then patterns in the values of the
dissimilarities get translated to colour patterns. This is called plotting
the dissimilarity matrix, which is described in Box 8. An example of its
use is then given in Example 15.
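A plot of this kind can be produced in R with the image() function. The sketch below does so for the five friends' dissimilarity matrix, with the rows ordered by cluster; the grey colour scale is my own choice rather than the module's.

heights <- c(Billy = 170, Cath = 164, Adnan = 180, Elise = 182, Dan = 193)

# Dissimilarity matrix (with one variable, Euclidean distance is just the
# absolute difference in heights); the rows are already ordered by cluster
d <- as.matrix(dist(heights))
n <- nrow(d)

# Plot the matrix as a grid of grey cells: lighter means more similar
image(1:n, 1:n, d[, n:1], axes = FALSE, xlab = "", ylab = "",
      col = gray.colors(30, start = 0.95, end = 0.1))
axis(1, at = 1:n, labels = rownames(d), las = 2)
axis(2, at = 1:n, labels = rev(rownames(d)), las = 2)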
Figure 9 A possible colour scale
[Plot of the dissimilarity matrix for the five friends, with rows and columns in the order Billy, Cath, Adnan, Elise, Dan]
One use of the plots of the dissimilarity matrix is to informally judge the
extent to which a proposed clustering of a dataset reflects real clusters in
the data. You will do this using some simulated data in Activity 7.
(a) How convincing is each clustering in Figure 11? In other words, does
it appear to be the ‘right’ clustering, or does it appear that other
clusterings would be at least as appropriate?
(b) Each of the plots in Figure 12 uses a colour scale similar to that given
in Figure 9. (So the lighter the colour, the more similar two
observations are.) By using your answer to part (a), which of the
plotted dissimilarity matrices correspond to which clustering?
In Example 16, you will learn how to calculate silhouette statistics for a
couple of friends from Example 14.
Note that, for any observation i, the silhouette statistic si can only take
values between −1 and +1.
When an observation is very close to other observations in its cluster and
far away from observations in the other clusters, the value of ai will be
very small relative to bi , and hence the value of si will be close to +1. So,
a value of si close to +1 indicates that the point sits comfortably within its
allocated cluster.
If it turns out that the observation is far more similar to observations in a
different cluster than to the one it was allocated to, ai will be large relative
to bi and hence si will be negative. So, a negative value of si suggests that
the observation might have been allocated to the wrong cluster.
An interpretation of one set of silhouette statistics is given in Example 17.
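In the usual definition, which is consistent with the behaviour just described, si = (bi − ai )/ max(ai , bi ). In R, silhouette statistics for a given clustering can be obtained from the cluster package; the sketch below, for the five friends, is an illustration rather than the module's notebook code.

library(cluster)   # provides silhouette()

heights  <- c(Adnan = 180, Billy = 170, Cath = 164, Dan = 193, Elise = 182)
clusters <- c(2, 1, 1, 3, 2)   # {Billy, Cath}, {Adnan, Elise} and {Dan}

sil <- silhouette(clusters, dist(heights))
sil         # one silhouette statistic per friend
            # (note: silhouette() assigns 0 to an observation in a singleton cluster)
plot(sil)   # a silhouette plot, similar in spirit to Figure 13

mean(sil[, "sil_width"])   # the mean silhouette statistic for this clustering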
Figure 13 A plot of the silhouette statistics for the five friends based on
the clustering {Billy, Cath}, {Adnan, Elise} and {Dan}
In this plot, notice that the bars representing the five friends are
ordered with respect to which cluster they are in. Furthermore, within
each cluster, the bars are in order of the value of the silhouette
statistic. From this plot, it is easy to see that the silhouette statistics
for Adnan and Elise (in Cluster 2) are higher than those for Billy and
Cath (in Cluster 1).
Figure 14 Plots of the silhouette statistics for the clustering solutions given
in Activity 7
In Example 19, you will learn how to calculate the mean silhouette
statistic for the heights of a group of friends.
(b) The silhouette statistics for the other three friends, Adnan, Elise and
Dan are, respectively, sAdnan = −0.250, sElise = −0.364 and
sDan = 0.389. Using your answer to part (a), calculate the mean
silhouette statistic for this alternative clustering. Hence comment on
which clustering seems to be better: {Billy, Cath}, {Adnan, Elise}
and {Dan}, or {Cath}, {Adnan, Billy} and {Dan, Elise}.
(c) There are 25 potential ways in which the friends can be split into
(exactly) three clusters, six of which are reasonable (that is, each
cluster only contains friends who are next to each other in terms of
height). These six reasonable clusterings are listed below, along with
their associated average silhouette statistics. (A space is left blank to
add your answer from part (b). Remember that the order of names
within a cluster is not important.) Based on the mean silhouette
statistic, which of these six clusterings seems best? With reference to
the plot of the data given in Figure 8 (Subsection 2.1), does this seem
reasonable?
3 Hierarchical clustering
So far in this unit, the only method you have seen for allocating
observations to clusters has been informally using plots. However, this
approach is not a good one in many circumstances. Firstly, it is only
suitable for datasets which can be easily displayed graphically. If a dataset
is large, with many variables, it becomes very difficult to display the data
well in a graphical format. More importantly, it is a subjective process
which does not lend itself to being automated. In this section, and in
Sections 4 and 5, you will learn about different approaches for doing
automatic cluster allocation using algorithms. The approach we will
consider in this section is a form of hierarchical clustering, for which we
give a definition in Box 11.
The linkage that is chosen will impact the shape of cluster that is likely to
be found by agglomerative hierarchical clustering. For instance, single
linkage can result in clusters that are long and thin, whereas complete
linkage is more likely to result in clusters that are more ‘ball’-like. Thus
the best choice will depend, in part, on the context in which the cluster
analysis is done.
The calculation of some dissimilarities between clusters is demonstrated in
Example 20. You will then calculate some yourself in Activity 13.
the bottom. Reading off from the dissimilarity matrix, these are
0.122, 0.134, 0.174 and 0.082.
Using single linkage, the dissimilarity between the clusters is the
minimum of these four dissimilarities. So, its value is 0.082.
Using complete linkage, the dissimilarity between the clusters is the
maximum of these four dissimilarities. So, its value is 0.174.
Using average linkage, the dissimilarity between the clusters is the
mean of these four dissimilarities. So, its value is
(0.122 + 0.134 + 0.174 + 0.082)/4 = 0.128.
Suppose that the pixels are split into the following two clusters:
{top left, bottom left} and {top right, bottom right}.
(a) Which dissimilarities between pixels is the calculation of the
dissimilarity between the clusters based on?
(b) What is the dissimilarity between the clusters using single linkage?
(c) What is the dissimilarity between the clusters using complete linkage?
(d) What is the dissimilarity between the clusters using average linkage?
Iteration 1
Looking at the starting dissimilarity matrix, the two clusters which
are closest together are {Adnan} and {Elise}, since the dissimilarity
between them is the smallest value in the dissimilarity matrix
(excluding the main diagonal). So these two clusters can be merged.
Thus, we are left with the following clusters: {Adnan, Elise}, {Billy},
{Cath}, {Dan}. This is a total of four clusters.
The dissimilarities between the three clusters {Billy}, {Cath}
and {Dan} are not changed by merging the clusters {Adnan}
and {Elise}. The dissimilarity between the new cluster {Adnan, Elise}
and {Billy} is 12, which is the maximum of the dissimilarities between
{Adnan} and {Billy}, 10, and between {Elise} and {Billy}, 12.
Similarly, the dissimilarity between the clusters {Adnan, Elise} and
{Cath} is 18, and the dissimilarity between the clusters {Adnan,
Elise} and {Dan} is 13. So the dissimilarity matrix between these four
clusters is as follows.
Iteration 2
Looking at the dissimilarity matrix between clusters calculated at the
end of Iteration 1, the clusters {Billy} and {Cath} are closest because
the dissimilarity, 6, is the smallest off-diagonal value. So, they are the
next clusters to be merged.
After this merger, we have the following clusters: {Adnan, Elise},
{Billy, Cath} and {Dan}. The dissimilarity matrix between the three
clusters is as follows.
{Adnan, Elise} {Billy, Cath} {Dan}
{Adnan, Elise} 0
{Billy, Cath} 18 0
{Dan} 13 29 0
Iteration 3
Looking at the dissimilarity matrix between clusters calculated at the
end of Iteration 2, the clusters {Adnan, Elise} and {Dan} are closest,
so they are the next clusters to be merged.
After this merger we have the following clusters: {Adnan, Dan, Elise}
and {Billy, Cath}. This is a total of two clusters. The members of a
cluster form a set. So, as mentioned in Example 14 (Subsection 2.1),
the order in which the members of a cluster are given does not
matter. For example, we could have expressed the cluster {Adnan,
Elise, Dan} as {Adnan, Dan, Elise} or {Elise, Dan, Adnan}. The
dissimilarity matrix between the two clusters is as follows.
{Dan, Adnan, Elise} {Billy, Cath}
{Dan, Adnan, Elise} 0
{Billy, Cath} 29 0
Iteration 4
At the end of Iteration 3, we only have two clusters. So these two
clusters must automatically be the closest two clusters and hence the
pair that is merged next. This leads to the cluster {Adnan, Billy,
Cath, Dan, Elise}. This is just one cluster, so the algorithm stops.
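This merging sequence can be reproduced in R with hclust(), since the 'maximum' rule used in these iterations is complete linkage; a minimal sketch:

heights <- c(Adnan = 180, Billy = 170, Cath = 164, Dan = 193, Elise = 182)

# With a single variable, Euclidean distance reduces to the absolute
# difference in heights, matching the dissimilarity matrix used above
d  <- dist(heights)
hc <- hclust(d, method = "complete")   # agglomerative clustering, complete linkage

plot(hc, ylab = "Dissimilarity")       # dendrogram of the merging sequence
cutree(hc, k = 3)   # three-cluster solution: {Adnan, Elise}, {Billy, Cath}, {Dan}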
(a) Which pair of clusters would be the next to be merged? Hence, what
is the four-cluster solution?
(b) Write down the dissimilarity matrix for the four-cluster solution.
Figure 17 An example of a dendrogram
Day Occupancy
1 677
2 653
3 673
4 545
5 126
6 108
7 615
8 664
9 676
10 610
4 Partitional clustering
In the previous section, you learnt about an approach to finding clusters in
data that involved successively merging the closest two clusters. However,
this is not the only strategy that can be used. In this section, you will
learn another approach to finding clusters, one that is based on splitting
(‘partitioning’) the observations into exactly k groups.
Of course, just putting n observations into k groups is easy. All this
requires is allocating an integer between 1 and k to each observation. The
tricky bit is coming up with an allocation of integers to observations that
corresponds to the most convincing clusters. In principle, all possible
allocations could be investigated to see which one comes up the best. (As
measured, for example, by the mean silhouette statistic.) However, for all
but the smallest of datasets, the number of possible allocations is far too
large for this to be feasible.
Instead, algorithms are used to try to find the best allocation of the
observations to k clusters without trying all possible different ways of
doing the allocation. There are different ways in which these algorithms
work. Here we will consider one approach, based on breaking the task into
two easier subtasks:
• allocating observations to clusters with known centres
• finding the centre of each of the clusters based on the observations that
are allocated to it.
These two subtasks will be the focus of Subsections 4.1 and 4.2,
respectively. In Subsection 4.3 you will see how these two subtasks are
combined, and Subsection 4.4 deals with the issue of how the algorithm
starts and how it stops. Then Subsection 4.5 deals with the issue of how to
choose k, the number of clusters. Finally, in Subsection 4.6, you will use R
to do some partitional clustering yourself.
Suppose that, in a specific set of data, the centres of the clusters were all
known. Suggest a way in which the observations might be allocated to the
different clusters.
As you have seen in Activity 16, if we know where the centres of the
clusters are, it is reasonable to allocate observations to clusters using the
rule given in Box 15.
Figure 20 The division of a plane into five different clusters using the rule
given in Box 15
Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182
Suppose it is known that there are two clusters with centres at 160 cm
(Cluster 1) and 170 cm (Cluster 2), and it is decided that the dissimilarity
function is Euclidean distance.
Allocate the five friends to clusters.
Having the rule given in Box 15 is all very well. However, it does rely on
one very big ‘if’: if the centres of the clusters are known. This information
is not likely to be known at the outset. If the centres of all the clusters are
known before the cluster analysis is done, cluster analysis is unlikely to be
needed at all – sufficient information is probably already known about the
clusters. So what is the way forward? This is where Subtask 2 comes in.
Recall that in Subsection 1.1 you were introduced to some data collected
about the Old Faithful geyser. Figure 21 is the scatterplot of the data,
after they have been standardised. Additionally in Figure 21, the
observations have been split into two clusters where each cluster of
observations has been presented using a different symbol.
Looking at the scatterplot, suggest a way in which the centre of each
cluster could be defined.
Figure 21 Durations and waiting times between eruptions, split into two
clusters
In Activity 18, you saw that there are different ways of defining the
position of the centre of a cluster. In the context of partitional clustering,
it is possible to base the position of the centre on the dissimilarity
measure: the position where the dissimilarity between it and all of the
observations is minimised is defined as the cluster centre. In general, this
leads to a non-trivial minimisation problem.
In this unit, though, we will restrict ourselves to the situation when the
chosen dissimilarity function is Euclidean distance. In this case, it is
possible to work out the centre of the cluster using the formula given in
Box 16. This means it can be computed quickly and easily, as you will see
in Example 22. In Activity 19, you will calculate the centre of a cluster
when the dissimilarity function is Euclidean distance.
So, for the brand LZLJ, the mean standardised ethyl acetate
concentration is
(0.407 + 0.692 + 0.429 + 0.092)/4 = 0.405
and the mean standardised ethyl lactate concentration is
(−0.658 + (−0.384) + (−0.579) + (−0.852))/4 ≃ −0.618.
Similarly, for the brand WLY these two values are 0.733 and −0.128,
respectively.
Therefore, the centre of the cluster given by the brand LZLJ
is (0.405, −0.618) and the centre of the cluster given by the
brand WLY is (0.733, −0.128).
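When the dissimilarity is Euclidean distance, the cluster centre is just the vector of coordinate-wise means, so in R it can be computed with colMeans(); a minimal sketch using the four LZLJ observations quoted above:

# Standardised (ethyl acetate, ethyl lactate) values for the four LZLJ samples
lzlj <- rbind(c(0.407, -0.658),
              c(0.692, -0.384),
              c(0.429, -0.579),
              c(0.092, -0.852))

colMeans(lzlj)   # the cluster centre: approximately (0.405, -0.618)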
We now return to the data about the heights of five friends (introduced in
Subsection 2.1), which was repeated in Activity 17. In Example 21, at
Iteration 3, a cluster consisting of {Adnan, Dan, Elise} was suggested.
Suppose it is decided that the dissimilarity function is Euclidean distance.
Calculate the centre of this cluster.
So, as Example 22 and Activity 19 have shown, finding the centre of each
cluster is straightforward if it is known which cluster each observation
belongs to. But, again, this is another very big ‘if’. If we should happen to
know which cluster each observation belongs to, there would be no need
for any cluster analysis. This means that by themselves neither subtask is
helpful. However, as you will see in the next subsection, by combining
them, we can make progress.
Unfortunately, as was pointed out at the end of Subsections 4.1 and 4.2,
the assumptions associated with both of these subtasks are unreasonable.
Neither assumption is something that is likely to be known before the
cluster analysis. Or at least, if either is known beforehand, there would be
little need for cluster analysis.
So, why bother with either of these subtasks? It turns out that progress
can be made by repeatedly performing each of the subtasks in turn. For
example, the following scheme could be done.
• Perform Subtask 1 to estimate which cluster each observation belongs to.
• Using this cluster allocation, perform Subtask 2 to estimate the cluster
centres.
• Using these cluster centres reperform Subtask 1.
• Using the new cluster allocations reperform Subtask 2.
• And so on, cycling between reperforming Subtask 1 and reperforming
Subtask 2.
This scheme is illustrated in Figure 22, next.
[Figure 22: a diagram showing the algorithm cycling between two assumptions: that the cluster centroids are known, and that the allocation of observations to clusters is known]
In this activity, we return again to the data about the heights of five
friends (introduced in Subsection 2.1). A repeat of Table 9 from Activity 6
is again given below for convenience.
Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182
Suppose we are now interested in finding two clusters in these data. (This
corresponds to the smallest non-trivial number of clusters. Finding one
cluster is trivial – it’s just all the friends in the same cluster.) Further,
suppose that the cluster centres are initially thought to be 160 cm and
170 cm and that the dissimilarity measure is Euclidean distance.
(a) Based on this information, do Subtask 1. In other words, allocate the
five friends to the two clusters.
(b) Using the allocation of clusters you obtained in part (a), do
Subtask 2. That is, re-estimate the cluster centres.
(c) Using the cluster centres you estimated in part (b), re-allocate the five
friends to the two clusters.
(d) Using the allocation of clusters you obtained in part (c), do
Subtask 2. That is, re-estimate the cluster centres.
However, before this becomes a workable algorithm, there are two issues
that still need to be resolved: how do you start and when do you stop? We
will consider this in the next subsection.
In the Solution to Activity 22, two of the criteria for stopping were
simultaneously met. In the following activity, you will consider whether
this is likely to happen in general.
As you found out in Activity 23, the first two conditions of the stopping
rule are effectively equivalent. If one is satisfied, so will the other. In both
cases it means a solution of the bigger task has been found. Moreover, this
is a stable solution. Performing more iterations of the algorithm will not
change anything.
In theory, the k-means algorithm should always converge. This means that
a stable solution can always be found. However, there is no guarantee
about how long it will take to find this stable solution. So the third
stopping condition, limiting the number of iterations, ensures that the
algorithm does not take an excessive amount of time to come to a halt.
We now just need to address how to get the process started. We either
have to guess the values of the k cluster centres or an allocation of
observations to clusters. Either way, ideally this guess should be a ‘good’
one that represents the clusters well. Such a guess could be based on prior
knowledge about such data, or as a result of some initial data analysis.
However, it is not essential for the guess to be good. So the starting
solution can be as given in Box 18 and demonstrated in Example 23.
But does the choice of initial positions of the cluster centres matter? This
is what you will explore in Activity 24.
In Activities 21 and 22, you used k-means to cluster the five friends. In
those activities, the initial cluster centres were taken to be 160 cm and
170 cm. Suppose instead that at the start of the partitional algorithm the
initial cluster centres were taken to be the heights of Adnan (180 cm) and
Dan (193 cm). Repeat the algorithm to find the stable solution. (You
should not require more than two iterations in this case.) Compare your
solution with the one you obtained in Activity 22.
As you have seen in Activity 24, choosing different starting values for the
cluster centres can lead to different clusters being identified. Even though
in both cases the algorithm converged, these differences are more than just
labelling the clusters in a different order. Thus, any solution produced
by k-means clustering can only be regarded as ‘a’ solution not
‘the’ solution.
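In R, this iterative procedure is implemented by kmeans(). Its nstart argument re-runs the algorithm from several random starting configurations and keeps the best solution found (judged by the total within-cluster sum of squares rather than by the mean silhouette statistic used in this unit), and setting algorithm = "Lloyd" selects the variant that alternates between the two subtasks as described above. A minimal sketch, assuming the standardised concentrations are held in an object called liquor_std (an assumed name):

# k-means with k = 2 clusters on the standardised concentrations
km <- kmeans(liquor_std, centers = 2, nstart = 25, algorithm = "Lloyd")

km$centers   # the estimated cluster centres
km$cluster   # the allocation of each observation to a cluster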
Using an overall numerical measure of how good a clustering solution is,
such as the mean silhouette statistic (Subsection 2.3), it is possible to
compare these different solutions. One way of ensuring that the best stable
solution is found is to use each possible combination of k observations from
the dataset as a starting point. This is done in Example 24 and
Activity 25.
4.5 Selecting k
So far, you have seen how partitional clustering works by making two
complementary assumptions in turn:
• the cluster centres are known
• the allocation of observations to clusters is known.
Notice that implicit in both of these assumptions is the following
assumption:
• the number, k, of clusters is known.
As the number of clusters is an assumption made when performing
Subtask 1 and Subtask 2, it is an assumption that has to be made
for k-means clustering. In some situations, this assumption will be a
reasonable one to make. For example, in situations when a specific number
of clusters are sought. However, often the ‘right’ number of clusters will
not be known. So what then? The answer is simply to repeat k-means
clustering using a range of different values for k. In Example 25, you will
see how such a range can be chosen.
As you will see in Example 26, once we have obtained the best stable
solution for each value of k, we then compare them to see which seems best
overall.
From this table, the value of k with the highest average silhouette
statistic is k = 2. This indicates that there are two clusters in the
data.
One method for comparing the solutions is to plot the overall silhouette
statistics produced for each value of k. You will do this in the next activity.
In Activity 25, you considered two-cluster solutions for the Chinese liquor
dataset using just two variables: the concentrations of ethyl acetate (X1 )
and ethyl lactate (X2 ).
k-means was applied to these data (after standardisation) for a range of
possible k.
The average silhouette statistics for the best stable solution for each value
of k is given in Figure 24.
Figure 24 Average silhouette statistics for a range of k
(a) Based on Figure 24, how many clusters does there appear to be in the
data?
(b) Plots of all the cluster solutions corresponding to 2, 3, 4 and 5
clusters are given in Figure 25. Based on these plots, does your
answer to part (a) make sense?
[Figure 25: four scatterplots of ethyl lactate against ethyl acetate, panels (a)–(d), showing the cluster solutions for 2, 3, 4 and 5 clusters]
So, with partitional clustering, there is not the direct link between a k-
and a (k − 1)-cluster solution needed to construct a dendrogram.
5 Density-based clustering
The two clustering methods which have been discussed so far both require
a separate step to decide the most appropriate number of clusters.
However, not all approaches to clustering need this. Some, like the
clustering method you will learn about in this section, DBScan, estimate
the number of clusters at the same time as determining cluster
membership.
The ‘DB’ in DBScan stands for ‘density-based’. This is because the
DBScan approach pre-specifies how densely observations need to be packed
together for them to be regarded as forming a cluster. The algorithm then
finds clusters that conform to this specification.
As has already been mentioned, focusing on whether observations are
packed closely enough to be in a cluster has the advantage that we will not
have to separately decide how many clusters there might be. It also means
that, unlike agglomerative clustering and partitional clustering, DBScan
allows for the possibility that a few observations are not in any cluster –
that is, that they are outliers.
In DBScan, the definition of ‘packed closely enough’ to be in a cluster
relies on two parameters. These are given in Box 19.
The two parameters, gmin and dmax , will be demonstrated in the following
example.
Figure 26 Two circular regions on the plot of durations and waiting
times between eruptions
Figure 27 Observations in the Old Faithful geyser data classified
according to whether they are in the interior of a cluster
Friend Height h, in cm
Adnan 180
Billy 170
Cath 164
Dan 193
Elise 182
In Activity 27, you will be asked to identify the interiors of clusters for the
same dataset of the previous example, but for different combinations
of dmax and gmin .
In Example 30 and Activity 27, all the observations in the dataset were
checked to see if they were in the interior of a cluster. However, during
Phase 1, this process is speeded up by checking only unlabelled points.
[Figure (residue): flowchart fragment – ‘Select an unlabelled observation’, a decision ‘Are there enough?’ with yes/no branches, and ‘Stop’.]
Figure 29 Observations in the data about the Old Faithful geyser
classified according to whether they are in the interior of a cluster
Notice that all the interior points that are in the cluster appear to
visually form a distinct cluster away from the other points. The
closeness of the edge points to at least one interior point is also
evident: they lie in a region where the points are less dense than for
the interior points.
During Phase 2, there comes a point when all the observations in the
cluster set have been checked. At this point, Phase 2 ends. The application
of Phase 2 in its entirety is demonstrated in Example 32, next. You will
then apply Phase 2 starting with a different observation in Activity 29.
In Activity 29, you found that starting with a different initial interior point
did not make any difference to which observations were in the final cluster
set. This represents a general result: for any given cluster, it does not
matter which observation is chosen as the initial observation – the final
cluster set will be the same.
At this point, all the observations in this cluster set will be labelled as
belonging to the same cluster. Phase 2 then ends and the algorithm flips
back to Phase 1.
Figure 30 Cluster solutions for the data about the Old Faithful geyser found using DBScan with dmax = 0.6
and gmin = 16 using different orders of observations
As has already been mentioned in Subsection 5.2, for any cluster the same
interior points are always found no matter which initial interior point is
chosen. It also is not possible for an interior point in one cluster to be
regarded as an ‘edge’ point in another cluster because it must be more
than dmax away from the interior points in this other cluster.
Overall, this means that the allocation of interior observations does not
change regardless of the order in which observations are checked. Similarly,
which observations are classed as outliers does not depend on the order. That just leaves the edge
observations. It is possible for edge points to be within dmax of interior
points in two or more different clusters. For such edge points, the cluster
to which they eventually get allocated depends on the order in which
observations are checked.
This just leaves the values of gmin and dmax to be decided. Together, they
determine how densely packed together observations need to be if they are
going to be considered part of a cluster. The parameter gmin also dictates
the minimum number of observations there must be in a cluster. In
Activity 31, you will explore what impact these choices can have on the
clustering that is found.
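To give a flavour of what this involves in R, the sketch below uses the CRAN dbscan package (an assumption: the module's notebooks may use different software), in which dmax corresponds to the argument eps and gmin to minPts; the built-in faithful dataset again stands in for the data being analysed.

```r
# Try DBScan with a few different combinations of dmax (eps) and gmin (minPts).
library(dbscan)

x <- scale(faithful)

fits <- list(
  a = dbscan(x, eps = 0.6, minPts = 4),
  b = dbscan(x, eps = 0.6, minPts = 2),
  c = dbscan(x, eps = 1.2, minPts = 4)
)

# In the output, cluster 0 denotes observations not allocated to any cluster
# (that is, outliers).
lapply(fits, function(fit) table(fit$cluster))
```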
The DBScan algorithm was applied to the Chinese liquor dataset, with
three different combinations of dmax and gmin . The resulting clusterings are
given in Figure 31.
[Figure residue: three scatterplot panels of ethyl lactate against ethyl acetate – (a) dmax = 0.6 and gmin = 4, (b) dmax = 0.6 and gmin = 2, (c) dmax = 1.2 and gmin = 4.]
Figure 31 Cluster solutions for liquor with three different combinations of dmax and gmin found using DBScan
(a) Compare the solutions. What has been the effect of changing gmin
and dmax ?
(b) Which solution seems best? Does it seem like a reasonable clustering
of the data?
6 Comparing clustering methods
All of the clustering techniques you have learnt about in this unit involve
carrying out computations in a loop. For example:
• hierarchical clustering involves iteratively joining clusters together
• k-means clustering involves repeated allocations of data points to
clusters
• DBScan involves working through lists of observations.
Thus, it should be no surprise to learn that it takes longer for a computer
to obtain a clustering solution for a large dataset compared to a smaller
one. However, how much longer differs between clustering techniques.
These differences can be substantial enough to impact whether an answer
can be obtained in a reasonable length of time or not, depending on the
technique used.
One straightforward way of exploring how the implementation time
depends on the size of the dataset is to perform a practical experiment;
that is, to analyse datasets of different sizes on a computer and time how
long it takes to get an answer. (Using the same computer is important, as
computers vary in how quickly they can perform computations.) You will
consider the results from one such experiment in Activity 32.
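A rough sketch of this kind of experiment is given below. It is illustrative only (simulated two-variable data, and just two of the techniques), not the experiment behind Activity 32.

```r
# Time hierarchical and k-means clustering on simulated datasets of
# increasing size. All runs should be made on the same computer.
sizes <- c(100, 500, 1000, 2000, 5000)

timings <- sapply(sizes, function(n) {
  x <- matrix(rnorm(2 * n), ncol = 2)
  t_hier <- system.time(hclust(dist(x)))["elapsed"]
  t_km   <- system.time(kmeans(x, centers = 3, nstart = 10))["elapsed"]
  c(hierarchical = unname(t_hier), k_means = unname(t_km))
})
colnames(timings) <- sizes
timings   # elapsed time, in seconds, for each technique and dataset size
```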
[Figure residue: relative time taken (log scale, 1 to 10 000) plotted against dataset size (10 to 5000).]
Figure 32 Relative timings for different clustering techniques
In Activity 32, you saw how the time to obtain a solution varies between
different techniques. However, this only gives a rough guide to how long it
might take to get a solution using these techniques. For a start, it depends
on how efficiently each technique has been implemented in software. This
is down to the skill of the programmer and how efficiently the algorithm is
designed. The computing resources available also have an impact; for example,
the extent to which the implementation is able to make use of distributed
computing where this is available.
The analysis is also based on running the algorithm just once to find a
solution. In practice, the algorithm might be run on the data more than
once, often with different settings. You will consider this further in the
next activity.
So, as you have seen in Activity 33, the choices that some clustering
techniques force the data analyst to make can lead to the algorithm being
re-run more times than for other techniques. If speed is an important consideration, this
can change which technique is best.
Another consideration when thinking about the computational burden of
implementing an algorithm is the memory required. If this gets too big,
the algorithm cannot be implemented without more computing resources
being found.
A similar analysis to that undertaken in Activity 32 can be done to assess
the memory required whilst an algorithm runs. The results from one such
analysis are given in Figure 33.
[Figure residue: memory required (0 to 250) plotted against dataset size (10 to 5000).]
Figure 33 Memory requirements by different clustering techniques
Summary
In this unit, you have been introduced to cluster analysis as one of the
data science techniques concerned with identifying groups, or clusters,
in data. Since very little is usually known about the clusters beforehand –
in particular, how many there are and which cluster each observation
belongs to – it is important to be able to spot clusters in a dataset.
You have seen how to use your own judgement to identify clusters in a
number of different datasets.
We have discussed a few similarity and dissimilarity measures that are
often used to measure the closeness of data points. You learnt how to
calculate the dissimilarity measures and were introduced to their
mathematical properties. These measures were then used to assess and
compare specific clusters using techniques such as the dissimilarity matrix
and its plot. You were also introduced to the silhouette statistic and used
its plot and mean statistic to assess clustering and informally allocate
observations to clusters.
Agglomerative hierarchical clustering has been introduced as one approach
that uses algorithms for automatic cluster allocation. You have learnt how
to start the algorithm and how to successively find and merge the closest
clusters. The dendrogram was then introduced as a graphical tool to
represent different clustering solutions that you can obtain using
agglomerative hierarchical clustering. The latter has also been
demonstrated in two notebook activities where you learnt how to
implement it in R.
Instead of successively merging the closest clusters, as in agglomerative
clustering, partitional clustering has also been discussed as another
approach for cluster allocation that is based on splitting, or partitioning,
clusters. You have learnt about the Voronoi diagram as a graphical tool
that is used to allocate observations to clusters based on their position on
the plot. Then, you went through the partitional clustering algorithm from
selecting the starting point, successively partitioning clusters, selecting a
suitable number, k, of clusters and finally using stopping rules to obtain
reasonable clustering. Implementing partitional clustering in R was also
demonstrated in two notebook activities.
A third clustering method has been introduced, namely density-based
clustering (DBScan). You have seen that DBScan estimates both the
number of clusters and the cluster membership simultaneously. To achieve
this, the method uses two specific parameters, dmax and gmin , that need to
be pre-specified. You were asked to work through a notebook activity to
see how DBScan is implemented in R.
The unit concluded with a discussion on how to compare different
clustering methods and select a suitable method to use. This assessment is
usually done in terms of a number of considerations such as the treatment
of outliers, the dissimilarity measure that the method uses, the size of the
underlying dataset and the time each method may take.
As a reminder of what has been studied in Unit B1 and how the sections in
the unit link together, the route map is repeated below.
[Route map (residue): Section 1 ‘Clusters in data’, Section 2 ‘Assessing clusters’, through to Section 6 ‘Comparing clustering methods’.]
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate the importance of cluster analysis and understand its aims,
considerations and different algorithms
• use your own judgement to spot clusters in data based on histograms,
scatterplots and scatterplot matrices
• calculate different dissimilarity measures, especially the Euclidean and
L1 distances
• appreciate the importance of standardising data before calculating the
Euclidean and L1 distances
• calculate the dissimilarity matrix and understand its usage in assessing
clustering solutions
• understand and interpret the dissimilarity matrix plot and the silhouette
statistic plot
• calculate the mean silhouette statistic and use it to assess cluster
allocation
• understand agglomerative hierarchical clustering, the usage of the
dendrogram and the implementation of the technique in R
• understand partitional clustering, the usage of the Voronoi diagram and
the implementation of the technique in R
• understand density-based clustering, DBScan, work out cluster
allocation using different combinations of dmax and gmin , and be able to
implement DBScan in R
• appreciate the different factors you need to consider when selecting a
suitable clustering method for your data.
References
Azzalini, A. and Bowman, A.W. (1990) ‘A look at some data on the Old
Faithful geyser’, Applied Statistics, 39(3), pp. 357–365.
doi:10.2307/2347385.
Bonnici, L., Borg, J.A., Evans, J., Lanfranco, S. and Schembri, P.J. (2018)
‘Of rocks and hard places: Comparing biotic assemblages on concrete
jetties versus natural rock along a microtidal Mediterranean shore’,
Journal of Coastal Research, 34(5), pp. 1136–1148.
doi:10.2112/JCOASTRES-D-17-00046.1.
Höpken, W., Müller, M., Fuchs, M. and Lexhagen, M. (2020) ‘Flickr data
for analysing tourists’ spatial behaviour and movement patterns’, Journal
of Hospitality and Tourism Technology, 11(1), pp. 69–82.
doi:10.1108/JHTT-08-2017-0059.
Jordanova, N., Jordanova, D., Tcherkezova, E., Popov, H., Mokreva, A.,
Georgiev, P. and Stoychev, R. (2020) ‘Identification and classification of
archeological materials from Bronze Age gold mining site Ada Tepe
(Bulgaria) using rock magnetism’, Geochemistry, Geophysics and
Geosystems, 21(12). doi:10.1029/2020GC009374.
Qian, Y., Zhang, L., Sun, Y., Tang, Y., Li, D., Zhang, H., Yuan, S. and Li,
J. (2021) ‘Differentiation and classification of Chinese Luzhou-flavor
liquors with different geographical origins based on fingerprint and
chemometric analysis’, Journal of Food Science, 86(5), pp. 1861–1877.
doi:10.1111/1750-3841.15692.
Raines, T., Goodwin, M. and Cutts, D. (2017) Europe’s political tribes:
Exploring the diversity of views across the EU, Chatham House briefing.
Available at: https://www.chathamhouse.org/
sites/default/files/publications/research/2017-12-01-europes-political-
tribes-raines-goodwin-cutts.pdf (Accessed: 2 October 2018).
Stolfi, D.H., Alba, E. and Yao, X. (2017) ‘Predicting car park occupancy
rates in smart cities’, Smart cities: Second international conference,
Smart-CT 2017, Malaga, Spain, 14–16 June, 2017, pp. 107–117. Springer:
Cham, Switzerland.
Tonelli, S., Drobnič, S. and Huinink, J. (2021) ‘Child-related family policies
in East and Southeast Asia: an intra-regional comparison’, International
Journal of Social Welfare, 30(4), pp. 385–395. doi:10.1111/ijsw.12485.
Acknowledgements
Grateful acknowledgement is made to the following sources:
Figure 1: public domain
Subsection 1.1, the Bullring: jax10289 /Shutterstock
Subsection 1.1, Yellowstone Park geyser: public domain
Figure 7: Taken from Yu Qian, Liang Zhang, Yue Sun, Yongqing Tang,
Dan Li, Huaishan Zhang, Siqi Yuan and Jinsong Li,
‘Differentiation and classification of Chinese Luzhou-flavor liquors with
different geographical origins based on fingerprint and chemometric
analysis’, Journal of Food Science. Wiley.
Subsection 1.2, a rocky shore in Qawra, Malta Majjistral: Jocelyn
Erskine-Kellie / Flickr. This file is licenced under Creative
Commons-by-SA 2.0. https://creativecommons.org/licenses/by-sa/2.0/
Subsection 1.2, ‘Cat, chat or gato?’: public domain
Subsection 1.2, Malaysian families enjoyed an increase in paid maternity
leave: Edwin Tan / Getty
Subsection 3.3, the dendrogram: www.cantorsparadise.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
In Figure 4 there appear to be four peaks, which suggests that, in the
greyscale version of the image, there are four clusters of greyness values:
• a cluster centred around a greyness of about 0.25 (that is, a cluster of
relatively dark pixels)
• a cluster centred around a greyness of about 0.65
• a cluster centred around a greyness of about 0.8
• a cluster centred around a greyness of about 0.9 (that is, a cluster of
relatively light pixels).
Solution to Activity 2
In Figure 6, there is one clear cluster that seems to contain most of the
pixels. In this cluster, the values of redness, greenness and blueness are all
roughly equal. This means that pixels in this cluster appear grey.
Beyond this, things are less clear. Looking specifically at the plot of
redness versus greenness, there are arguably another two clusters each only
accounting for a small number of pixels. One of these clusters corresponds
to pixels where the redness is moderate but the greenness is low, and the
other cluster is where both the redness and greenness are low. Looking at
the plot of greenness versus blueness there also seems to be a cluster where
the blueness is noticeably below the greenness.
So, it would not be unreasonable to suggest that there is a total of four
clusters – as shown in Figure S1.
[Figure S1 (residue): scatterplot matrix of the variables red, green and blue.]
Solution to Activity 3
(a) In one dimension, the L1 distance corresponds to d(x, y) = |x − y|.
Taking the two pixels in the left-hand corners, the dissimilarity is
d(xtl , xbl ) = |0.713 − 0.591| = |0.122| = 0.122,
or equivalently
d(xbl , xtl ) = |0.591 − 0.713| = |−0.122| = 0.122.
Corners Dissimilarity
Top left and bottom left 0.122
Top left and top right 0.052
Top left and bottom right 0.134
Bottom left and top right 0.174
Bottom left and bottom right 0.256
Top right and bottom right 0.082
(b) The corners that are most different are the bottom left and bottom
right as that is the pair with the biggest dissimilarity. This makes
sense because the darkest corner in Figure 1(a) is the bottom left
corner and the lightest corner is the bottom right corner.
Solution to Activity 4
(a) From Table 7, we have that xtl = (183, 181, 182)
and xtr = (188, 200, 188). Then
d(xtl, xtr) = √((183 − 188)² + (181 − 200)² + (182 − 188)²) = √(25 + 361 + 36) = √422 ≃ 20.5.
(b) Comparing the colours of the pixels rather than just their greyness
does not make much difference to the conclusions. The bottom left
and bottom right corners still have the biggest dissimilarity. Equally,
the top left and top right corners are still the most similar corners.
However, between the other pairs of corners, the top right and bottom
right corners appear more different with respect to colour than with
respect to greyness.
Solution to Activity 5
(a) In this case, the L1 distance corresponds to
d(x, y) = |x1 − y1 | + |x2 − y2 |,
where x1 and y1 are the measurements for public social expenditure
for two countries and x2 and y2 are the corresponding maternity
leaves.
(i) When public social expenditure is measured as a percentage of
GDP and maternity leave in days, this means that
d(x, y) = |0.02 − 1.30| + |60 − 120|
= 1.28 + 60
= 61.28.
Similarly, d(x, z) = 52.01 and d(y, z) = 9.29.
(ii) When public social expenditure is measured as a percentage of
GDP and maternity leave in years, this means that
d(x, y) = |0.02 − 1.30| + |0.16 − 0.33|
= 1.28 + 0.17
= 1.45.
Similarly, d(x, z) = 0.16 and d(y, z) = 1.31.
(iii) When both public social expenditure and maternity leave have
been standardised, this means that
d(x, y) = |(−0.86) − 1.90| + |(−1.16) − 0.79|
= 2.76 + 1.95
= 4.71.
Similarly, d(x, z) = 1.71 and d(y, z) = 3.04.
(b) Based on the values calculated in part (a)(i), child-related policies in
Mongolia and Singapore are the most similar, as the dissimilarity is
smallest, and they are the most different between Malaysia and
Mongolia.
(c) Transforming the data makes a lot of difference! When maternity
leave is measured in years, not days, it is the policies in Malaysia and
Singapore that then look the most similar. The policies in Malaysia
and Singapore also look most similar when both variables have been
standardised. This is because the transformations change which
variable contributes most to the overall dissimilarity.
When public social expenditure is measured as a percentage of GDP
and maternity leave in days, the variation in maternity leave is far
bigger. So, the overall dissimilarity mostly reflects the differences in
maternity leave. In contrast, measuring maternity leave in years
means that it is the variation in public social expenditure that is
generally larger, meaning that overall dissimilarity tends to reflect
differences in public social expenditure.
Solution to Activity 6
(a) The dissimilarity between Adnan and Billy is given by
|hAdnan − hBilly | = |180 − 170| = 10.
Similarly, for the other three friends we have
|hAdnan − hCath | = |180 − 164| = 16,
|hAdnan − hDan | = |180 − 193| = |−13| = 13,
|hAdnan − hElise | = |180 − 182| = |−2| = 2.
Solution to Activity 7
(a) Clustering 1 in Figure 11 is not at all convincing. There is much
overlap between observations that have been placed in different
clusters.
Clustering 2 is more convincing than Clustering 1. There are
regions in the plot where observations in the same cluster clearly
dominate. However, observations from different clusters still appear to
overlap. So maybe a slightly different clustering would be more
appropriate.
Clustering 3 is convincing. Each cluster corresponds to a cloud of
observations which is noticeably separated from all the other clouds of
observations.
(b) In Matrix 2 of Figure 12, the blocks along the main diagonal stand
out as much lighter than the other blocks. This indicates that
observations within each cluster are much closer to each other than observations in different clusters are.
Solution to Activity 8
(a) As Adnan is in Cluster 2:
• The mean dissimilarity between Adnan and the other members of
Cluster 2 (i.e. just Elise) is 2. So aAdnan = 2.
• The mean dissimilarity between Adnan and the members of
Cluster 1 (Billy and Cath) is
(10 + 16)/2 = 13.
Similarly, the mean dissimilarity between Adnan and Cluster 3
(just Dan) is 13.
So, Clusters 1 and 3 are equally close to Adnan, meaning
that bAdnan = 13.
Thus, the value of the silhouette statistic for Adnan is
sAdnan = (bAdnan − aAdnan)/max(aAdnan, bAdnan) = (13 − 2)/max(2, 13) = 11/13 ≃ 0.846.
(b) As Billy is in Cluster 1:
• The mean dissimilarity between Billy and the other members of
Cluster 1 (i.e. just Cath) is 6. So, aBilly = 6.
• The mean dissimilarity between Billy and the members of Cluster 2
(Adnan and Elise) is
(10 + 12)/2 = 11.
Similarly, the mean dissimilarity between Billy and Cluster 3 (just
Dan) is 23.
So, Cluster 2 is closer to Billy than Cluster 3, meaning
that bBilly = 11.
Solution to Activity 9
In Figure 14(a), all the silhouette statistics are close to +1. So the plot
suggests that all the observations sit nicely in their allocated clusters. Out
of the clusterings given in Figure 11, this situation only applies to
Clustering 3.
Plots (b) and (c) in Figure 14 both contain observations for which the
silhouette statistic is negative; and this is more pronounced in plot (c). So,
they both correspond to situations in which some observations are closer to
a different cluster than the one they have been put in, particularly the
clustering corresponding to Figure 14(c). Thus, as Clustering 1 in
Figure 11 is worse than Clustering 2 in Figure 11, it suggests that
Figure 14(b) corresponds to Clustering 2 in Figure 11 and Figure 14(c)
corresponds to Clustering 1 in Figure 11.
Solution to Activity 10
In a word, no!
On the plot of the dissimilarity matrix, the blocks along the main diagonal
are generally some of the lightest coloured blocks, indicating that samples
from the same brand are similar. However, there are light-coloured blocks
elsewhere. This indicates that brands are not separate from each other.
This finding is borne out in the silhouette plot. Few of the silhouette
statistics are positive, again suggesting that samples from different brands
are not separate.
Solution to Activity 11
(a) Cath is in a cluster all by herself. So Cath’s silhouette statistic, sCath ,
just takes the value 0.
For observations that are in clusters bigger than size 1, the silhouette
statistic for observation i is
si = (bi − ai)/max(ai, bi),
where ai is the mean dissimilarity between the observation and other
observations in the cluster and bi is the average dissimilarity between
the observation and observations in the next nearest cluster.
Billy is in a cluster with just Adnan, so
aBilly = |170 − 180| = 10.
The mean dissimilarity between Billy’s height and the heights of those
in Cluster 1 corresponds to the dissimilarity between Billy’s height
and Cath’s height – that is
|170 − 164| = 6.
Similarly the mean dissimilarity between Billy’s height and the
heights of those in Cluster 3 corresponds to the mean of the
dissimilarities between Billy’s height and Elise’s height, and also
between Billy’s height and Dan’s height – that is
(|170 − 182| + |170 − 193|)/2 = 17.5.
So, for Billy,
aBilly = 10 and bBilly = min(6, 17.5) = 6.
This means that
sBilly = (bBilly − aBilly)/max(aBilly, bBilly) = (6 − 10)/max(10, 6) = −0.4.
Solution to Activity 12
(a) As a cluster must contain at least one of the friends, and none of
these friends can be in more than one cluster, the maximum number
of clusters is five. This corresponds to placing each friend in their own
single-element separate cluster.
(b) When there are eight friends, the maximum number of clusters they
can be put in is eight. As in part (a), this corresponds to placing each
friend in their own single-element separate cluster.
Solution to Activity 13
(a) Here, the corner pixels are split into those on the left and those on the
right. So the dissimilarity between the clusters is based on the four
dissimilarities involving a corner pixel on the left and a corner pixel on
the right. This corresponds to the values 20.5, 64.5, 70.1 and 115.6.
(b) Using single linkage, the dissimilarity between the two clusters is the
minimum of the four values picked out in part (a), which is 20.5.
(c) Using complete linkage, the dissimilarity between the two clusters is
the maximum of the four values picked out in part (a), which is 115.6.
(d) Using average linkage, the dissimilarity between the two clusters is the
mean of the four values picked out in part (a), which is
(20.5 + 64.5 + 70.1 + 115.6)/4 ≃ 67.7.
Solution to Activity 14
(a) The next pair of clusters to be merged is Clusters D and E. This is
because the smallest dissimilarity between different clusters occurs
between Clusters D and E. The four-cluster solution is therefore:
{Cluster A}, {Cluster B}, {Cluster C} and {Cluster D, Cluster E}.
(b) The dissimilarity matrix for the four-cluster solution is as follows.
Solution to Activity 15
(a) Looking from the bottom upwards, the first horizontal line appears to
link days 1 and 9. So, these were the first to be merged.
(b) The merging of the last two clusters corresponds to the topmost
horizontal line. Thus it occurred at a dissimilarity around 550. (Being
any more precise than this is difficult, given the scale to read off from.)
(c) The two-cluster solution is found by tracing down from the point at
which there are just two vertical lines on the dendrogram – something
which happens for dissimilarities in the range of about 150 to
about 550. One of these lines leads down to days 5 and 6. This means
that all the other days are in the other cluster – that is, days 1, 2, 3,
4, 7, 8, 9 and 10.
(d) The three-cluster solution is found by tracing down from a point
where there are exactly three vertical lines. Doing this we find that
the clusters are: days 5 and 6; day 4; days 1, 2, 3, 7, 8, 9 and 10.
(e) The hierarchical clustering process means that the two-cluster
solution is found by merging two of the clusters in the three-cluster
solution. Thus, one of the clusters in the three-cluster solution must
be carried through to the two-cluster solution without being changed.
(f) Using the principle of parsimony, we should be looking for the smallest
number of clusters that adequately describes the structure in the data.
The dendrogram suggests that the two-cluster solution is the most
appropriate one, as mergers before this point involve only relatively
small changes in dissimilarity. In contrast, the change in dissimilarity
from the two-cluster solution to the one-cluster solution is relatively
big.
Solution to Activity 16
One way, and the one we will pursue in this subsection, is to work out
which cluster centre each observation is closest to, then allocate
observations to the corresponding cluster. For any given observation,
working out which cluster centre is the closest can be done by identifying
the cluster centre for which the dissimilarity between the observation and
the centre is the smallest.
Solution to Activity 17
With the cluster centres at 160 cm and 170 cm, respectively, heights less
than 165 cm will be closer to the centre of Cluster 1 and heights more than
165 cm will be closer to the centre of Cluster 2. So, anyone with a height
less than 165 cm should be allocated to Cluster 1 and the rest to Cluster 2.
This means that Cluster 1 will just be {Cath}, and Cluster 2 will be
{Adnan, Billy, Dan, Elise}.
Solution to Activity 18
There are different ways in which the centres could reasonably be
calculated. For example:
• the position that, for observations in the cluster, corresponds to the
mean duration and the mean waiting time
• the position that, for observations in the cluster, corresponds to the
median duration and the median waiting time
• the position where the dissimilarities between the centre and all the
observations are minimised.
The different methods have different merits, so what is right in one context
might not be the best in another.
Solution to Activity 19
For these data there is only one variable: height. So, as the dissimilarity
function is Euclidean distance, the centre is just the mean of the heights of
the friends in the cluster. That is,
x̄ = (hAdnan + hDan + hElise)/3 = (180 + 193 + 182)/3 = 185.
Solution to Activity 20
The two subtasks are like mirror images of each other: the assumption for
one subtask is what is found in the other subtask.
Solution to Activity 21
(a) Allocating the friends based on cluster centres at 160 cm and 170 cm
was considered in Activity 17. There, it was shown that the
appropriate allocation is Cluster 1: {Cath} and Cluster 2: {Adnan,
Billy, Dan, Elise}.
(b) In part (a), Adnan, Billy, Dan and Elise were allocated to the second
cluster. Using the mean to estimate cluster centres, the centre of this
cluster is then estimated to be
(180 cm + 170 cm + 193 cm + 182 cm)/4 = 181.25 cm.
Only Cath was allocated to the first cluster. So the centre of this
cluster is just the same as Cath’s height: 164 cm.
(c) As in part (a), each friend should be allocated to the cluster whose
centre is the closest. So, using the cluster centres of 164 cm
and 181.25 cm, all friends with a height less than
(164 cm + 181.25 cm)/2 = 172.625 cm
should be allocated to Cluster 1 and the rest to Cluster 2.
So now the second cluster (the one with centre 181.25 cm) comprises
just Adnan, Dan and Elise. Billy has now been allocated to the first
cluster (the one with centre 164 cm), along with Cath.
(d) As the first cluster now consists of Billy and Cath, the centre of this
cluster is now estimated to be
(170 cm + 164 cm)/2 = 167 cm.
Similarly, as the second cluster now consists of Adnan, Dan and Elise,
the centre of this cluster is now estimated to be
(180 cm + 193 cm + 182 cm)/3 = 185 cm.
Solution to Activity 22
(a) Comparing the solutions found in parts (c) and (d) of Activity 21 with
those in parts (a) and (b), we can see that during the second iteration
the allocation of points to clusters has changed (Billy changed from
being allocated to Cluster 2 to being allocated to Cluster 1). Also, the
estimates of the cluster centres have changed. For example, the
estimate of the centre of Cluster 1 changed from 164 cm to 167 cm.
Furthermore, only two iterations have been completed, less than the
maximum number set. So, as none of the stopping criteria has been
met, there is no reason to stop the algorithm after two iterations.
(b) Using the cluster centres of 167 cm and 185 cm the allocation of
friends to the clusters is: allocate to Cluster 1 if the height is less
than (167 + 185)/2 = 176 cm, otherwise allocate to Cluster 2. This
means that Billy and Cath are allocated to Cluster 1 and that Adnan,
Dan and Elise are allocated to Cluster 2.
This is the same allocation as was found in Activity 21(c). This also
means that the estimated cluster centres are the same as in
Activity 21(d): 167 cm and 185 cm.
(c) As noted in the solution to part (b), the allocation of friends to
clusters is the same as found in the previous iteration. Also, the
estimated cluster centres did not change. So, as two of the criteria for
stopping have been met, the algorithm should now stop (successfully).
(We only need one of the criteria to be met.)
Solution to Activity 23
(a) Yes, the same allocation of observations to clusters will always be
obtained if the values of the cluster centres do not change. If the
cluster centres do not change, then which centre an observation is
closest to also cannot change.
(b) No, the cluster centres cannot change if the allocation of observations
to clusters does not change. The value of a cluster centre is completely
determined once it is known which observations are in the cluster.
(c) If the allocation of observations to clusters does not change, the
answer to part (b) implies that the cluster centres do not change.
Equally, if the cluster centres do not change, the answer to part (a)
implies that the allocation of observations to clusters does not change.
So, as the circumstances that lead to one of the stopping rules being
satisfied mean that the other condition is also satisfied, the two
conditions are equivalent.
Solution to Activity 24
This time, with initial cluster centres of 180 cm and 193 cm, any friend
with a height less than
(180 cm + 193 cm)/2 = 186.5 cm
should be allocated to Cluster 1 and the rest to Cluster 2. This means that
the allocation of friends to clusters becomes:
• Cluster 1: Adnan, Billy, Cath and Elise
• Cluster 2: Dan.
Based on this cluster allocation, the cluster centres are then estimated to
be:
• Cluster 1: (180 cm + 170 cm + 164 cm + 182 cm)/4 = 174 cm
• Cluster 2: 193 cm.
Moving on to the second iteration we then have the dividing line between
allocating to Cluster 2 instead of Cluster 1 as a height of
(174 cm + 193 cm)/2 = 183.5 cm.
So, the allocation of friends to clusters remains as:
• Cluster 1: Adnan, Billy, Cath and Elise, with the centre of 174 cm
• Cluster 2: Dan, with the centre of 193 cm.
As neither the allocation of friends to clusters, nor the estimated cluster
centres, changes from Iteration 1 to Iteration 2, the algorithm stops.
Note that the solution is different to that found in Activity 22. In
Activity 22, at the end of the algorithm Billy and Cath were in one cluster,
leaving Adnan, Dan and Elise in the other. Neither is wrong, merely
different.
Solution to Activity 25
(a) In all of the plots, the dividing line between the two clusters goes
diagonally across the plot. This indicates that the main difference
between the clusters is the concentration of both chemicals together.
In one cluster, this total concentration is higher compared with the
other. In other words, assuming that a higher concentration
corresponds to a stronger flavour in the liquor, the two clusters
correspond to a relatively strongly flavoured liquor and a relatively
less flavoured liquor.
The solutions differ in the weight given to ethyl acetate over ethyl
lactate. This could be translated into differences in the flavour.
(b) In Figure 23(a) and Figure 23(b), there does not seem to be a clear
gap between the two clusters, so they do not represent reasonable
divisions of the data into two clusters.
In Figure 23(c), there does seem to be a gap between the two clusters.
So, this clustering solution seems more reasonable.
(c) Based on the mean silhouette statistics, the clustering solution given
in Figure 23(c) has the highest value of the statistic and thus appears
better than the other two solutions.
Solution to Activity 26
(a) The average silhouette statistic is highest for k = 3. So for these data
there appears to be three clusters.
(b) The points in the upper k = 2 cluster solution also form a cluster in
the k = 3 cluster solution, meaning that the k = 3 solution happens to
split the lower cluster given for the k = 2 solution. This split seems to
be where there is a bit of a gap between observations. So it seems
reasonable that this three-cluster solution is better than the
two-cluster solution.
The four-cluster solution also appears to split the data nicely into
groups. It is worth noting that the mean silhouette statistic for this
solution is not much less than that for the three-cluster solution. In
the five-cluster solution it is not clear that the observations in the
middle cluster are closer to each other than to observations in other
clusters. So it is not surprising that the mean silhouette statistic for
the k = 5 solution is not as high as for the k = 3 solution.
Solution to Activity 27
(a) In Example 30, it was noted that Adnan and Billy had three friends
(including themselves) within 10 cm of their own heights.
Furthermore, Cath and Elise both had two friends whose height was
within 10 cm of their own heights. So, when dmax = 10 and gmin = 2,
Adnan, Billy, Cath and Elise would be regarded as being in the
interior of a cluster. Dan still would not deemed to be in the interior
of a cluster.
(b) When dmax = 10, at most there are only three friends within 10 cm of
one of their heights. This means that when dmax = 10 and gmin = 4,
none of the friends would be deemed to be in the interior of a cluster.
(c) When dmax = 6, we have the following.
So when gmin = 2, this means that only Adnan, Billy, Cath and Elise
are in the interior of a cluster. Dan is not.
(d) When dmax = 1, we have the following.
Solution to Activity 28
Phase 2 of the algorithm is only triggered when one such observation has
been found.
Solution to Activity 29
• Step 1:
This time just Billy is in the cluster set initially. Billy is necessarily an
interior point, and so adds the observations which are sufficiently close
to it to the cluster set. This means that Adnan and Cath get added to
the cluster set, so that it becomes {Billy, Adnan, Cath}.
• Step 2:
Now consider the next observation in the set: Adnan. As in Example 32,
Adnan is also an interior point and potentially adds both Cath and
Elise. However, as Cath is already in the cluster set, this means that the
cluster set becomes {Billy, Adnan, Cath, Elise}.
• Step 3:
Now considering Cath, this is an edge observation, just as in
Example 32. So, no further observations are added to the cluster set.
• Step 4:
Now considering Elise, this is also an edge observation, just as in
Example 32. So, again, no further observations are added to the cluster
set.
At this point, all the observations in the cluster set have been checked
and no further observations will be added. So, the cluster set ends up as
being {Billy, Adnan, Cath, Elise}. These are exactly the same observations
as were in the final cluster set generated in Example 32. Only the
order of the observations in the cluster set has been changed (which does
not matter).
Solution to Activity 30
From the plots it is clear that the two solutions are very similar. In both
cases two clusters are identified and no points are identified as being an
outlier. In fact, the allocation of just two points is different in the two
solutions. These points are part of a small group of observations that
arguably lie between the hearts of the two clusters.
This difference is not likely to be important. The impression about both
clusters is unchanged. However, it does highlight that these points are not
clearly in one cluster or the other cluster.
Solution to Activity 31
(a) Decreasing gmin (Figures 31(a) and (b)) has led to more clusters being
identified, and fewer observations being classified as outliers.
Increasing dmax (Figures 31(a) and (c)) has, in contrast, led to fewer clusters
being identified. This is because observations that
were in separate clusters are now allocated to the same cluster.
Additionally, observations that were labelled as outliers are now
allocated to the cluster too.
(b) The solution given by dmax = 1.2 and gmin = 4 (Figure 31(c)) is not
helpful. It simply puts all the samples into the same cluster.
There is less to choose between the other two solutions. However,
using the principle of parsimony (i.e. the fewer clusters the better),
the solution given by dmax = 0.6 and gmin = 4 should be preferred.
The high number of outliers in both these solutions suggests that
perhaps there are not clusters in the data to find!
Solution to Activity 32
(a) In this experiment, hierarchical clustering took less time
than k-means or DBScan when the dataset size was 50 or less. So
hierarchical clustering appears to be fastest for small datasets.
(b) When the dataset size was 500 or more, k-means clustering took less
time than DBScan or hierarchical clustering. So, this appears to be
the fastest for large datasets.
(c) It is more important for a technique to be fast with large datasets.
When a dataset is small, even a relatively slow technique is still likely
to come up with an answer in an acceptably small amount of time.
Solution to Activity 33
(a) In hierarchical clustering, the main choice that has to be made is the
choice of linkage. For example, single, complete or average. Changing
the linkage means re-running the algorithm. Furthermore, as it is easy
to change the dissimilarity measure in hierarchical clustering, different
choices for this might be tried. This might lead to the algorithm being
re-run a number of times to explore the impact of these choices.
(b) In partitional clustering, there are two main choices to be made: the
initial positions of the cluster centres, and what value of k to use. In
Subsection 4.4, you have already seen that exploring what is a
reasonable value for k can lead to the algorithm being run a few
times. For example, in Example 26 the algorithm was implemented
for nine different values of k. Furthermore, in Subsection 4.4 you saw
that choosing starting configurations at random was one way to try to
ensure that a good, stable solution is found. This could lead to the
algorithm being re-run lots of times to try to ensure there is a
reasonable chance that a good, stable solution is found.
(c) In DBScan there are two main choices: dmax , the maximum
dissimilarity between two data points for them to be in the same
cluster, and gmin , the minimum number of data points to define a new
cluster. Also, it is possible to change the dissimilarity measure.
Trying just a few different values for both dmax and gmin , and a few
different dissimilarity measures can quickly lead to the algorithm
being run ten or more times.
Unit B2
Big data and the application of data science
Introduction
At the time of writing this unit, there is much talk about ‘big data’. This
talk is both in a positive way (what it can make possible) and a negative
way (how it threatens values deemed important by society). This unit, the
second and final unit in the data science strand, considers some of the
issues and applications for ‘big data’.
In Section 1, you will explore when data might be thought of as ‘big data’
and we will describe some of its uses. (For the purposes of this unit, we will
refer to data of the type you have been dealing with so far as ‘small data’.)
An important feature of the analysis of big data is that the computational
aspects cannot be ignored. Data analysis becomes infeasible if it takes the
computer too long to come up with the results. Furthermore, with big
data it cannot be assumed that such computational problems will be
solved simply by switching to a bigger, faster computer. Section 2 will
focus on distributed computing, which harnesses computing power across
multiple processors and makes the analysis of big data feasible. In
Section 3, the focus is on algorithms used by the processors to produce
results for our data analysis – in particular, the extent to which we can
expect them to produce the correct results, or even the extent to which
what constitutes the correct results is known.
In the remaining sections of this unit, Sections 4 to 7, we shift away from
the practical considerations of computation time and
algorithms to more philosophical ones.
In Section 4, you will explore how having big data can impact on the
interpretational approach to data analysis. Sections 5 and 6 will deal with
wider ethical issues thrown up by big data (and small data too). Data
scientists and statisticians, and indeed others who collect and use data,
have a responsibility for the use to which the results are put. As you will
see, the nature of big data, and the complexity of models built using it,
means that ethical principles surrounding privacy and fairness can be
unwittingly violated. Finally, Section 7 will introduce guidelines for data
scientists and statisticians for dealing with such data. These guidelines aim
to ensure that the public justifiably regards the work of data scientists and
statisticians as being undertaken with integrity.
The following route map shows how the sections connect to each other.
[Route map: Section 1 ‘What is so special about big data?’, Section 2 ‘Handling big data’, Section 3 ‘Models and algorithms’, Section 4 ‘Outputs from big data analysis’, Section 5 ‘Privacy’, Section 6 ‘Fairness’, Section 7 ‘Guidelines for good practice’.]
1 What is so special about big data?
Take a minute or two to think about what the term ‘big data’ currently
means to you. What do you think makes big data different from just data?
As the Solution to Activity 1 has suggested, one aspect that can make
data big is that there are lots (and lots, and lots!) of individual data
points. But there are other aspects too, notably the type of data, and the
speed at which it accumulates. These two aspects, along with the number
of data points, are indicated by the so-called three V’s of data science:
• volume
• velocity
• variety.
These three V’s were highlighted by Laney (2001, cited in Diebold, 2021)
and will be the subject of Subsection 1.2. (There have been further V’s
proposed since, and other letters too.)
A more general definition for big data is given by Walkowiak:
Big data is any data that cause significant processing, management,
analytical and interpretational problems.
(Walkowiak, 2016, p. 11)
This definition focuses on the impact that trying to handle big data
causes, rather than trying to quantify what a set of data needs to look
like in order to be considered big. We will return to this in Section 2.
However, first, in Subsection 1.1, we will consider some of the areas in
which big data have had an impact. This is to give you an idea of what
has already proved possible using big data.
When you last did some internet shopping, did the website also suggest
other items you might be interested in? If so, did the suggestions seem
reasonable to you? Have you bought items that have been suggested to
you by the website?
Share your responses via the M348 forums.
So, as you have seen, big data is making its impact in different ways. In
the next subsection, we discuss in more detail the aspects that can make
data big, or what are known as the three V’s.
1.2.1 Volume
It probably does not surprise you that the size, or volume, of the datasets
being analysed is one aspect of big data. Some of the datasets being
analysed are huge – many thousands of times larger than the data you
have so far analysed in M348. So why is this a challenge?
In many situations, this involves two considerations: computer memory
and computational time.
You will have noticed as you worked through the Jupyter notebooks
produced for this module that data analysis using R requires the dataset
to be loaded into R. In order for that to happen, the data needs to be
stored somewhere that R can access it. Typically, this means being stored
on the same computer where R is running. The memory of the computer
has to be sufficient to be able to store the data. It also has to have
additional memory available for any temporary R objects created during
the analysis. If sufficient memory is not available, R will not be able to
complete the analysis.
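As a small illustration of the memory issue, the sketch below creates a simulated data frame with a million rows and asks R how much memory it occupies. (The numbers are made up and purely indicative.)

```r
# How much memory does a moderately large data frame occupy once loaded into R?
big_df <- as.data.frame(matrix(rnorm(1e6 * 10), ncol = 10))   # 10^6 rows, 10 columns
format(object.size(big_df), units = "MB")
```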
Every time you use R to analyse data, this will involve the software getting
the computer to perform various computations to come up with the
results. Whilst computers can do basic computations very quickly, it is
still possible for the number of computations to build so that they take a
noticeable amount of time. It is also possible for the number of
computations to be so great that the analysis takes so long that it becomes
too long to wait. For example, recommender systems such as those
described in Example 1 have to be able to come up with suggestions very
quickly before the (potential) customer loses interest and moves on to
another website (which could be hosted by a rival retailer).
Both of these issues, about computer memory and computer time, can be
solved by technological improvements in computing up to a point. For
example, in the past, PCs and laptops that came with 1 GB of storage
used to be top end. Nowadays, computers with 1 TB or more of storage are
commonly available. Computers have also become many orders of
magnitude faster. The pace of improvement in computer power – at least
in the last few decades – has been summarised by Moore’s Law, which is
described in Box 1. However, in the case of big data, switching to a bigger
and faster computer is unlikely to be sufficient. Instead, as you will see
in Section 2, the solution lies in harnessing the power of several processors
together.
[Figure (residue): vertical axis ‘Number of transistors’ on a log scale from 10⁴ to 10¹⁰.]
1.2.2 Variety
So far in this module, the data you have worked with have been in the
form of a data frame. Recall that, in a data frame, the data are laid out in
a tabular format. There are a number of observations: the rows. For each
observation, a number of variables are recorded – the same variables for
each observation (though it is possible for some of the individual values to
be recorded as missing). Such data are said to be structured.
However, big data also encompasses data that cannot be captured in a
neat tabular format. For example, the information in tweets, Facebook
postings and YouTube videos can be used to form a big data dataset. Such
data are said to be unstructured. What makes sense as variables in one such source may not make sense in another.
1.2.3 Velocity
Velocity refers to the speed at which the data are gathered and need to be
processed. So far in this module, the data have effectively been static –
giving the opportunity to take time over analysing the data. However,
some big datasets will be constantly added to, second by second. For
example, the online retailer Amazon (used by individuals and businesses,
and available in many different countries and languages) will constantly be
gathering information about visitors to its website. Moreover, customer
behaviour is also likely to be constantly evolving. So an appropriate model
at one point is not likely to remain appropriate. This means there is the
challenge of dealing with this constantly changing landscape.
error, it has no impact on some biases. So, whilst the sheer size of big data
datasets can lead to precise estimates, it does not necessarily mean that
the estimates are accurate. A reminder of the difference between precision
and accuracy is given in Box 2.
In the next activity, you will consider a situation where bias might occur.
Comment or ‘have your say’ sections are available on some news media
websites to allow the public to express their thoughts about topical issues.
Suggest reasons why an analysis of such postings on a controversial issue
would not necessarily reflect the opinion of the general public accurately.
Would the biases involved here diminish as the number of postings
increase?
2 Handling big data
How are the linear and generalised linear models fitted to data?
Hint: think about how this question can be answered using different levels
of detail.
Box 3 Algorithms
Loosely, an algorithm is often taken to mean a formula or rule.
However, a dictionary definition of algorithm is:
a precisely defined set of mathematical or logical operations
for the performance of a particular task.
(OED Online, 2022)
Note that the operations are not just instructions of the form ‘do
this’. They can also contain loops (e.g. ‘do this 10 times’ or ‘do this
until something happens’) and conditions (e.g. ‘if this, do one thing
otherwise do something else’).
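As a toy illustration (not taken from the module), the short piece of R code below is itself an algorithm containing a loop and a condition as well as simple instructions.

```r
# A toy algorithm: keep halving a number until it falls below 1,
# counting how many halvings are needed.
x <- 100
steps <- 0
while (x >= 1) {       # loop: repeat the operations below while x >= 1
  x <- x / 2           # operation: halve x
  steps <- steps + 1   # operation: record that another step has been taken
}
steps                  # the number of halvings performed
```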
4. The slope, β̂, is given by
β̂ = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)².
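For concreteness, here is a minimal R sketch of this calculation using made-up x and y values (not the trees in Table 1); the intercept is then obtained in the usual way from the slope and the two sample means.

```r
# Least squares 'by hand': the slope from the formula above, and the
# intercept from alpha-hat = ybar - beta-hat * xbar. Data values are made up.
x <- c(5.2, 6.1, 7.4, 8.0, 9.3)
y <- c(3.1, 3.6, 4.8, 5.1, 6.0)

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

c(intercept = alpha_hat, slope = beta_hat)
# The same estimates are produced by coef(lm(y ~ x)).
```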
Now, for the generalised linear models you have been fitting to data,
not much has been said about how the parameter estimates are found,
beyond using the method of maximum likelihood estimation (for example,
in Subsection 3.1 of Unit 7).
So far, you have not had to worry about the details of the algorithm
used to do the fitting. You have been able to rely on the fact that such
algorithms exist and work sufficiently quickly that you are not left
waiting too long for the computer to provide an answer.
However, with big data the speed of algorithms becomes important.
This is what we will consider in Subsection 2.1. Then, in Subsection 2.2,
we will consider distributed computing – an approach to computing that
can handle the demands of scale that big data brings – before briefly
discussing how this is done in practice in Subsection 2.3.
(a) Table 1 lists five trees from the manna ash trees dataset. Although
you could use R to calculate the estimated values for the intercept, α,
and slope, β, for the purposes of this activity you should do this
calculation ‘by hand’ and time how long it takes you. (Note: ‘by
hand’ includes making use of a calculator or calculator app.)
Table 1 Five trees from mannaAsh
Σx² − (Σx)²/n.
4. Calculate the variance by dividing Σ(x − x̄)² by (n − 1): that is,
variance (or s²) = Σ(x − x̄)²/(n − 1).
5. Calculate the standard deviation as s = √variance.
Mathematically, the methods in Boxes 5 and 6 result in the same value for
the standard deviation. But what about the computational time?
Consider this now in Activity 6.
(a) Using Method 1, time how long it takes you to calculate by hand
the standard deviation for the following set of data.
14.59 18.97 24.56 54.28 32.15
(b) Using Method 2, time how long it takes you to calculate by hand
the standard deviation for the same set of data.
(c) Which method was faster for this particular dataset? For larger
datasets, which method do you think is going to be faster?
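If you want to check your hand calculations, both methods can be coded directly from their formulas in a few lines of R. (The correspondence assumed here is that Method 1 works with the deviations from the mean, while Method 2 uses Σx² − (Σx)²/n, as in the surviving part of the box above.)

```r
# Two ways of computing the standard deviation of the Activity 6 data;
# both should agree with each other and with R's built-in sd().
x <- c(14.59, 18.97, 24.56, 54.28, 32.15)
n <- length(x)

s_method1 <- sqrt(sum((x - mean(x))^2) / (n - 1))        # via deviations from the mean
s_method2 <- sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))   # via sums and sums of squares

c(method1 = s_method1, method2 = s_method2, built_in = sd(x))
```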
In the Solution to Activity 8, the need to check whether one of your friends
fails to deliver the subsum they promised was mentioned. In distributed
computing, such a circumstance is not confined to the vagaries of human
behaviour: it is also a consideration when using a cluster of processors to
do the computations. There is always the risk that an individual processor
does not complete or return its assigned computation – that it ‘fails’ in
some sense. Now, the probability of an individual failure is usually low.
However, once several processors are combined into a cluster, the
probability that at least one fails can easily become non-negligible, as you
will discover in Activity 9.
So, the risk of at least one processor failing increases with the number of
processors. This means that in setting up distributed computing clusters,
it is something that has to be confronted. Additionally, there might be
problems with the network used to allow computers to communicate with
each other. So, implementations involving distributed computing need to
include strategies for coping when a follower processor, the network or even
the processor acting as the leader fails. Then failure of a processor becomes
a minor inconvenience rather than sabotaging the whole computation.
This is in addition to having strategies for optimising the distribution of
calculations across the processors. This means there is a cost, in terms of
computation time, in using distributed computing. So, whilst distributed
computing can bring great reductions in computation time when big data
is worked on, it is often not worth it for small data.
Figure 7 A diagram of the MapReduce algorithm
So far in this section, you may have been assuming the multiple processors
for distributed computing have to be in different computers. However,
modern computers generally have more than one processor in them. This
enables your computer to do two or more separate computations at the
same time. We will make use of this in Subsection 2.3.1, where you will
use R for distributed computing. In particular, you will use distributed
computing to fit a simple linear regression model. This is just one of the
many statistical techniques you have used so far in this module. In
Subsection 2.3.2, we will explore whether converting all these techniques to
make the most of a distributed computing environment is likely to be
straightforward.
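As a flavour of what this can look like (a sketch only, using base R's parallel package; the module's notebook activities may organise things differently):

library(parallel)
cl <- makeCluster(2)                         # two local cores act as 'follower' processors
chunks <- split(1:20, rep(1:2, each = 10))   # the 'leader' divides the work into two chunks
subsums <- parLapply(cl, chunks, sum)        # each follower computes its subsum
Reduce(`+`, subsums)                         # the leader combines the subsums: 210
stopCluster(cl)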
You have just seen, in Activity 10, that the mean can be calculated using
distributed computing. However, not all statistics are so amenable as the
mean to compute using a distributed computing set-up. One example is
the calculation of the median. To see one reason why, work through
Activity 11.
(e) Using your answers to parts (b) to (d), calculate the median of the
three medians. Does this always give the median for the whole
dataset?
(f) Repeat part (e), but this time calculate the mean of the medians.
Is this any better?
3.1 Convergence
Recall that, in Boxes 5 and 6, two different algorithms for calculating the
standard deviation were given. Both algorithms appear to be
straightforward – they each describe a sequence of five steps that ends up
with the standard deviation being calculated. But does this mean that
using either of these algorithms calculates the standard deviation exactly,
regardless of the data it is applied to? Unfortunately, the answer is no, not
quite, as you will discover in Activity 12.
So, as you have seen in Activity 12, the standard deviation will not usually
be exact because the exact value corresponds to an irrational number.
Furthermore, the variance, a quantity used in an intermediate step of the
calculation, may not be an exact number either. All of this limits how
accurately the standard deviation can be calculated.
Not being able to calculate a value exactly does not just apply to
calculating the standard deviation: it is a general feature of algorithms.
The issue is, instead, whether the value is close enough to the exact value.
What counts as ‘close enough’ will depend on the context.
First of all, there is no point trying to go beyond what is known as
machine precision: that is, the precision with which the processor (or
processors) is able to perform basic calculations. This is because two
values that differ by less than the machine precision cannot be told apart
by the processor.
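A quick illustration of this limit in R (a standard floating-point example, not taken from the unit):

.Machine$double.eps        # about 2.2e-16: the smallest x for which 1 + x is stored differently from 1
(0.1 + 0.2) == 0.3         # FALSE: the two sides are stored as slightly different numbers
all.equal(0.1 + 0.2, 0.3)  # TRUE: comparison up to a small tolerance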
This does not stop backward or forward stepwise regression from being
useful. Both might produce reasonable parsimonious models, just not
necessarily the same parsimonious model.
Even when you are applying the same algorithm to the same data,
you might not end up with the same result. As you will see in Example 12,
you have already met one such algorithm in Unit B1: the k-means
algorithm.
As you have seen in Activity 13, algorithms with the same end goal do not
necessarily end up producing the same answers. Even the same algorithm,
applied to the same data, may not result in the same answer, as you have
seen in Example 12. So, why are such algorithms still regarded as useful?
One reason is that proving that an algorithm will always produce the ‘best’
answer is often a non-trivial task. This is particularly true if we also need
to prove that the algorithm will produce an answer in a reasonable amount
of time. The best that might be achieved is that the algorithm is shown to
produce the best answer for all the examples it has been applied to.
Perhaps more importantly, the problem that needs solving may be
sufficiently hard that an algorithm to produce the ‘best’ answer all the
time does not exist – perhaps cannot exist. For example, it has been
famously proven that an algorithm to always correctly detect whether
other algorithms will always come to an end cannot be constructed. Or it
might be that it is possible to find an algorithm that will produce the
‘best’ answer for some classes of the problem but not others, such as the
problem of finding a maximum described in Example 13.
In such cases, it matters what algorithm is used and what starting point
the algorithm uses. These then become details that should be reported as
part of the data analysis.
For the models discussed in Units 1 to 8, the fitting has been done ‘by
least squares’ or ‘by maximum likelihood’. Both these approaches provide
an unambiguous definition of what the estimates should be. As noted in
Example 13 (Subsection 3.2), the MLEs are the values of the parameters
for which the likelihood is maximised. However, in statistics, not all
algorithms are trying to solve problems where there is an unambiguous
notion of what makes one answer better than another. Some algorithms
are developed, and adopted, on the basis that the approach seems like a
sensible one to take. This is backed with evidence that the algorithm
works at least with some test cases. In the following couple of activities,
you will consider whether some algorithms you have already met can be
thought of in such terms.
Unit B1 focused on cluster analysis. Think about the methods for cluster
analysis introduced in that unit: hierarchical clustering, k-means,
and DBScan. To what extent are the differences between these methods
about fitting different models to the data and to what extent are they
about using different algorithms?
So, as you have seen in Activities 15 and 16, not only might different
algorithms produce different answers to the same problem, it might be
ambiguous as to which solution is better. In these situations, it is
important to keep track of which algorithm has been used. For example,
when reporting a cluster analysis, the report should include whether
clusters have been found using agglomerative hierarchical clustering (along
with which linkage), a partitional clustering method such as k-means or a
density-based clustering method such as DBScan.
Furthermore, you may have come across the term ‘spurious correlation’.
Such correlations are those that are deemed to have happened just by
chance. That is, the sample of values for each of the two variables turns
out to produce a strong correlation even though the two variables are not
linked. The underlying principle of this is that if you look hard enough
(i.e. compare enough variables), some will show a correlation just by
chance. As you will see in Example 15, this limits how we can interpret
results.
Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Y1 9.3 9.7 9.7 9.7 9.9 10.2 10.5 11.0 10.6 10.6
Y2 480 501 540 552 547 622 655 701 712 708
It turns out that the correlation between these two variables (cheese
and engineering) is 0.96. A strong positive correlation!
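As a quick check (not part of the original example), this correlation can be reproduced in R from the values in the table:

y1 <- c(9.3, 9.7, 9.7, 9.7, 9.9, 10.2, 10.5, 11.0, 10.6, 10.6)
y2 <- c(480, 501, 540, 552, 547, 622, 655, 701, 712, 708)
cor(y1, y2)   # approximately 0.96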
You may be surprised to learn that in some big data analyses the
plausibility of correlations is not considered. If two items are correlated
then this is something that can be exploited. It does not matter how
unreasonable this association seems. Some statisticians are concerned and
unhappy about this.
Such an approach brings advantages. In particular, it allows the
opportunity for new relationships to be found and exploited. For example,
loyalty cards allow supermarket chains to collect a wealth of data about
their customers. From such data, the supermarket can seek out
associations between purchases that their customers make that are not
obvious. This information can then be used to better target marketing.
However, unless we are sure we have the entire population, having a large
amount of data does not diminish the importance of it being
representative. Increasing a sample size only reduces the sampling error;
it does not address biases due to a lack of representativeness. So, we will
consider the representativeness of the FIFA 19 database in the next
activity.
331
Unit B2 Big data and the application of data science
In Activity 18, it was stated that data about all the footballers in the
FIFA 19 database are available. Are such data likely to be representative
of all international footballers?
Even when the data are not the entire population, and we are happy that
they are representative of the population we are interested in, statistically
significant results will often be obtained even for effects that are in
practice too small to be important.
(b) A sample of 1000 players is now taken from the FIFA 19 database and
the same model fitted. The output is given in Table 4. What do you
conclude about the variables now?
Table 4 Coefficients when fitting the model to a sample of 1000 footballers
(c) The same model is fitted to all 18 147 footballers in the database. The
output is given in Table 5. What do you conclude about the variables
now?
Table 5 Coefficients when fitting the model to all 18 147 footballers
(d) For each of these three models (or, more accurately, the same model
to different data) calculate the 95% confidence interval for the slope
associated with preferredFoot. (Hint: when doing this, use the fact
that as n gets big the t-distribution with n degrees of freedom gets
increasingly close to the standard normal distribution. So,
$t_{n}(0.975) \simeq z(0.975) \simeq 1.96$ when $n$ is big.)
Does the width of the 95% confidence interval for the slope associated
with preferredFoot get bigger or smaller as the sample size
increases?
In each case, is it plausible that the effect of preferredFoot is to
increase strength by 1 unit? (Remember that the variable
preferredFoot can only take the values 0 or 1.)
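For reference, the calculation asked for in part (d) takes the following form in R; the estimate and standard error below are hypothetical placeholders, since the tables of coefficients are not reproduced here.

slope_hat <- 0.85   # hypothetical slope estimate for preferredFoot
se_slope  <- 0.40   # hypothetical standard error
slope_hat + c(-1, 1) * 1.96 * se_slope   # approximate 95% confidence interval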
So, as you have seen in Activity 20, when more data are available,
variables with small effects are more likely to feature in parsimonious
models. This means that models fitted to big data can be, and are, more
complicated than models fitted to smaller datasets. This leads to the
change in emphasis that is the subject of Subsection 4.3 – prediction
rather than explanation.
In Unit 8 (Subsection 1.1), you analysed data from the UK’s 2013 Living
Costs and Food Survey, given again here in Table 6.
Table 6 Counts of households from the UK survey dataset classified by
employment, gender and incomeSource
                             incomeSource
                      earned                  other
employment       female      male       female      male
full-time           626      1688           31        95
part-time           235       112          123        66
unemployed           18        16           72        58
inactive             68        78          815      1043
Total               947      1894         1041      1262
Below are listed four different log-linear models that could be fitted to
these data. Which of these models is the most complex? Which is the
most difficult to interpret?
count ∼ incomeSource + employment + gender,
count ∼ incomeSource + employment + gender + employment:gender,
count ∼ incomeSource + employment + gender + employment:gender
+ employment:incomeSource,
count ∼ incomeSource + employment + gender + employment:gender
+ employment:incomeSource + gender:incomeSource.
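For instance, the last (and most complex) of these models could be fitted in R along the following lines. This is a sketch that rebuilds the data frame from Table 6; the name ukSurvey and the exact coding are assumptions, not necessarily those used in Unit 8.

ukSurvey <- data.frame(
  employment   = rep(c("full-time", "part-time", "unemployed", "inactive"), times = 4),
  gender       = rep(rep(c("female", "male"), each = 4), times = 2),
  incomeSource = rep(c("earned", "other"), each = 8),
  count        = c(626, 235, 18, 68, 1688, 112, 16, 78,
                   31, 123, 72, 815, 95, 66, 58, 1043)
)
fit <- glm(count ~ incomeSource + employment + gender
                 + employment:gender + employment:incomeSource
                 + gender:incomeSource,
           family = poisson, data = ukSurvey)
summary(fit)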
As you have seen in Subsection 4.2, using big data means that even small
effects end up as statistically significant. This makes it difficult to simplify
models, which in turn makes it more difficult to interpret them. As you
will see in Example 16, the complexity of the model can go beyond just
including more terms in a linear regression or generalised linear model.
The flexibility of neural networks, coupled with big data, means that they
have been applied in a number of different settings, as you will discover in
Activity 22.
The size, complexity and flexibility of the models that can be fitted using
big data often mean that a succinct summary of how changes in an
explanatory variable impact on the response variable is not possible.
Indeed, any such interpretation may only be possible by looking for
differences (or lack of differences) in predictions when individual values
are changed.
In many situations, it may not matter how any particular prediction is
arrived at, simply whether a good prediction can be arrived at. However,
as you will discover in Subsections 6.1 and 6.2, understanding how a model
behaves with respect to different sets of inputs is important. It is critical
for ensuring that a model made possible by big data does not lead to
real-world impacts that are detrimental to society.
Furthermore, having a prediction or classification based on big data rather
than small data does not mean that it is certain or inevitable. There
remains the need to include some estimate of the uncertainty associated
with the prediction or classification, for example, by giving a confidence
interval. Otherwise, we have no means of telling the difference between a
reliable prediction and a wild guess!
5 Privacy
So far in this unit, you have seen what is meant by big data and how it is
dealt with. (Admittedly this is in a general way.) You have also seen how
the analysis of big data can be different in character to the analysis of
small data. In this section, you will consider some of the ethics
surrounding collection and storage of big data: consent (Subsection 5.1)
and preserving anonymity (Subsection 5.2). At the heart of this is the idea
of allowing individuals to maintain privacy, should they wish to. As such,
these ideas apply to small data as well as big data.
5.1 Consent
For any data analysis, one of the first considerations should be to check
that the data have been collected in an ethical way. The analysis of data,
particularly data relating to people or animals, that has not been obtained
ethically could be seen as condoning the unethical practices. Within the
research community, projects involving people or animals have to gain
ethical clearance before they are given approval to go ahead. So, as you
will see in Example 17, institutions such as The Open University have to
consider how such ethical clearance can be obtained.
Whilst completing Activity 24, you may have come to your own
conclusions about whether the customers or users of the social media
platform or online retailer understand the full range of data they are
consenting to the company collecting about them. More importantly, this
includes whether the users/customers would still be happy with the terms
and conditions if they did understand them.
However, data captured by social media platforms and online retailers can
include data about others. These others may not even know that such data
are being gathered, let alone have consented to this in any way. You will consider the
possibility for such data to be gathered in the next activity.
Concerns about the use of data, particularly exhaust data (which was
introduced in Subsection 1.2.4), have led to the introduction of legislation,
such as the General Data Protection Regulation (GDPR) in the UK and
in the EU. As Box 7 details, this legislation sets out a number of
principles that should be applied to the processing of personal data.
So, how do the principles set out in Box 7 help? First, let us consider
transparency. In guidance about the legislation (ICO, no date) the ICO
explains this as ‘being clear, open and honest with people from the start
about who you are, and how and why you use their personal data’. This
means that even people whose data are included as part of a secondary
dataset have a chance to assert rights such as the right to object.
Another of the principles, that of ‘purpose limitation’, the ICO explains as
meaning that personal data should be ‘collected for specified, explicit and
legitimate purposes’ and crucially ‘not further processed in a manner
incompatible with these purposes’. So, this regulation prohibits the
analysis of personal data by an organisation for whatever purpose they
choose just because they happen to have the data. Instead, the purpose
has to be in line with why the data were obtained, or fresh consent has to
be obtained – or a clear obligation needs to be demonstrated (which, of
course, includes a legal obligation).
5.2 Anonymisation
In Subsection 5.1, you saw that when data relates to individuals it is
important to obtain informed consent when that is feasible. More
generally, you saw that the data should only be used in a way that
is compatible with that consent.
Often, that consent comes with provisos over who can access data that
allows the individual to be identified. These people with access often
form a very restricted list and are given such access for specific reasons.
For example, for the COVID-19 study you considered in Activity 23, it is
clearly stated in the participant information sheet that only the research
team will have access to all of the data.
However, always keeping data restricted to small groups of people brings
its own disadvantages. For a start, it means that others are denied the
chance to scrutinise any results obtained from it. It also means that
data cannot be put to uses that bring benefits beyond the primary
purpose. For example, in this module you have been making use of data
that has been made publicly available. So, if nothing else, the use of these
data is enhancing the teaching of data science and increasing the number
of people skilled in making sense of data.
The competing imperatives of maintaining confidentiality and making the
best use of data are usually resolved by anonymising the data. That is,
removing sufficient information from the data to make it very unlikely that
individuals could be identified. Note that this requirement about not
being able to identify individuals applies to all the individuals in the
dataset. The anonymisation will have failed if even just one individual
can be too easily identified.
On the face of it, anonymisation may seem easily achievable by just
removing names from a dataset. However, it is not only names that can
lead to individuals being identified from a dataset. Other information that
is commonly known about people, such as their addresses, can also cause
a problem. In Activity 26, you will consider the variables in the OU
students dataset that relate to student location.
Recall that, in Unit 4, you began fitting regression models to data about
some OU students. As stated in the description of the data, care was taken
with these data to ensure that they are anonymised.
Look back at the data description given in Subsection 5.3 of Unit 4. Which
variables include information about where each student was located?
Based on the information given by these variables, explain why the
location of a student cannot be precisely deduced.
As you have seen in Activity 26, the data in the OU students dataset only
provides crude information about where a student is based. In the case of
region, all of the categories correspond to geographical areas of large
numbers of people. For imd, the categorisations ‘most’, ‘middle’ and ‘least’
also correspond to large areas of the UK and the category ‘other’ to
everywhere else in the world. So, each combination of region and imd
translates to a large number of people in the population. So, knowing the
values of region and imd for someone in the dataset still leaves very many
people that this student could be.
One of the other variables given in the dataset is age. In the next activity,
you will consider whether this can be used to identify students instead.
Figure 9 Students' ages in the OU students dataset (a histogram of frequency against age in years)
In Activity 27, you have seen that it does seem unlikely that students
could be identified on the basis of age. Unfortunately, this by itself does
not mean that we can be satisfied that confidentiality has been maintained
just yet. With the advent of the internet, it is difficult for individuals not to have at least some information about themselves publicly available online.
The ability of students in their 70s, 80s and even older to pass OU
modules and earn degrees is something to be celebrated. When Clifford
Dadson graduated, he became the oldest person to do so from the OU.
Spend five minutes looking online for other information about Clifford.
Could he be the oldest student in the OU students dataset?
In Activity 28, you have seen that finding out extra information about the
oldest person to graduate from the OU is relatively easy. You may feel
that this is only possible with people achieving something exceptional such
as Clifford Dadson. However, the growth of social media means that trivial
and not-so-trivial information is available about people, often posted by
themselves. This might include information such as name, age and – in the
case of OU students – which modules they are studying and when. To
avoid this information being used to identify someone in the OU students
dataset, recall that an extra protection is built in with respect to age: a
random amount between −2 and +2 has been added to each age. Thus, for
any individual represented in the dataset, we only know their age to within
a five-year age band.
Deliberately adding extra variation to data might seem at first glance to
be something that would make the data unusable. The key thing is that it
needs to be done in such a way that it does not introduce any biases. For a
start, this means that changes to individual values are randomly
generated. Then, the only impact of the adjustment is that parameters
will not be estimated as precisely as they would be from the original
dataset; this is a cost of improving the anonymisation applied to the data.
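A crude sketch of this kind of perturbation in R (hypothetical ages, and not the exact procedure used for the OU students dataset):

age_true     <- c(23, 37, 41, 58, 72)   # hypothetical true ages
age_released <- age_true + sample(-2:2, length(age_true), replace = TRUE)   # random amount between -2 and +2
age_released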
In the OU students dataset, extra variation was added to the variable age
by the holders of the data (the OU) to improve the anonymisation; the true
age (or at least the age that was given) is known by the holders of the data.
It is also possible to obtain usable anonymised data by asking individuals
to apply some randomisation when they give data. Such an approach is
known as randomised response (Warner, 1965). This then places the
randomisation in the hands of the respondents instead of the researchers.
For example, this could be done by employing a scheme such as the one
given in Box 8.
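Box 8 is not reproduced here, but as a flavour of the general idea, here is a sketch of one simple 'forced response' variant (not necessarily the scheme given in Box 8). Each respondent privately tosses a fair coin: on heads they answer the sensitive question truthfully, on tails they answer 'yes' regardless. Since P(yes) = 0.5p + 0.5, where p is the true proportion, p can still be estimated from the observed proportion of 'yes' answers, even though no individual answer can be taken at face value.

set.seed(1)
p_true   <- 0.3                       # hypothetical true proportion with the sensitive attribute
n        <- 10000
truthful <- rbinom(n, 1, 0.5)         # each respondent's private coin toss
answer   <- ifelse(truthful == 1, rbinom(n, 1, p_true), 1)   # truthful on heads, forced 'yes' on tails
2 * mean(answer) - 1                  # estimate of p: close to 0.3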
would then be how to ensure that students who indicate that they
have used them are not placed in a compromised position.
(Note: guidance about what plagiarism is and how to avoid it is
provided on the module website.)
Take a few minutes to think about the type of important questions that
are worth researching but might benefit from a randomised response. Use
the M348 forums on the module website to share your thoughts.
6 Fairness
The previous section, Section 5, dealt with ethical issues surrounding the
collection and storage of data, big and small alike. In this section, we will
discuss ethical issues surrounding the analysis of data. This is important
because an analysis being possible does not mean it should be done.
Before reading and working through this section, please note that the
material discusses some forms of injustice. This content may evoke
powerful emotions in some people. If you think that you may need to
discuss your thoughts and/or emotions on the topics presented in this
section, or elsewhere in this unit, please contact the Open University
Report and Support service or the Student Support Team (SST) – see the
module website for links to these.
In Subsection 6.1, you will see how it is possible for data analysis to
exacerbate inequality despite that not being the intention. You will also
see in Subsection 6.2 that, without care, feedback loops can be introduced
so that inadequacies of a model could get worse over time, not better.
As in Section 5, these ideas apply to the analysis of small data as well as
big data. However, the reach of big data is such that analyses have the
potential to impact on daily lives, including critical areas such as health,
crime, justice and insurance. Unfortunately, as you will see, this has led
to high-profile cases of big data analyses going wrong.
6.1 Inequality
One aim of analysing big data is to build predictive models that can bring
objectivity to decision-making. So, instead of relying on a person, or group
of people, to predict outcomes with all the unconscious bias they can’t
help but have, the predictions are based on actual data. This aspiration is
particularly important with respect to predictions used to make decisions
that have a big impact on people’s lives. Examples include: if someone
should be deemed a good enough risk to grant them a loan for a house,
car, etc.; what the most appropriate medical diagnosis is, given a set of
symptoms; how likely it is that someone has committed a crime.
With big data, it is possible to make finer-grained predictions: for
example, whether somebody (and people like them) will repay a loan on
time, or suffer a heart attack in the next 5 years. This brings benefits, but
also dangers. Should a decision about giving somebody a loan depend on
factors over which they have little or no control? And who should have
access to their health records anyway? Legislation such as the Equality
Act and GDPR (see Box 7 in Subsection 5.1) exists to prevent
discrimination and data misuse. At the heart of such legislation is the
notion of protected characteristics – characteristics about people that
are covered by the legislation. There is no universal list of protected
characteristics, as you can see in Box 9, though there can be much overlap.
But, as we will see in the rest of this subsection, avoiding discrimination is
not always straightforward.
At the heart of this system was data analysis and, presumably, there was
not an intention to discriminate. So, what went wrong?
In this case, the blame has been laid on the selection of the data used to
build the facial recognition system. The data used consisted of
predominantly lighter faces, with few darker faces. You will explore why
this matters in the next couple of activities.
In Activity 32, it was assumed that in the population the only thing that
influences the probability of repaying a loan is a person’s income. Now
suppose that for a subgroup of the population the relationship between
income and the probability of repaying the loan is different. Furthermore, suppose that in
the data used to build the predictive model only 1% of the data
corresponds to people in this subgroup.
(a) If the predictive model is built ignoring whether someone belongs to
this subgroup, is the model likely to give predictions for people in the
subgroup as good as for everyone else? Why or why not?
(b) Suppose a separate model is fitted for people in this subgroup. Is this
model likely to give predictions for people in the subgroup as good as
the one for everyone else? Why or why not?
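A rough simulation sketch of the situation described in this activity (all numbers are hypothetical):

set.seed(2)
n        <- 1000
income   <- rgamma(n, shape = 2, rate = 0.1)            # hypothetical incomes
subgroup <- rbinom(n, 1, 0.01)                          # about 1% of people are in the subgroup
p_repay  <- plogis(-2 + 0.1 * income - 1.5 * subgroup)  # the subgroup has a different relationship
repaid   <- rbinom(n, 1, p_repay)
fit_all <- glm(repaid ~ income, family = binomial)      # one model ignoring subgroup membership
fit_sub <- glm(repaid ~ income, family = binomial,
               subset = subgroup == 1)                  # separate model: only a handful of observations
sum(subgroup)                                           # how little data the subgroup model is built on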
So, how could such a model come up with pricing that seems to depend on
race? You will begin to explore this in Activity 34.
Take a look again at Example 24. What information about the applicant
did The Sun change when doing their investigation? Is this a protected
characteristic in the UK?
So, as you have seen in Activity 34, removing variables such as sex and
race (and gender and ethnicity too) is not always sufficient to remove all
information about protected characteristics in a dataset. Such information
could be inferred from other variables that by themselves are not covered
by equality legislation. That is, using other variables as proxies for
protected characteristics. This might be someone’s name, but also where
they live or, less obviously, something seemingly innocuous such as the
products that they buy. These variables could include variables that you
might not want to drop from the modelling.
Preventing a model from using other variables as proxies for protected
characteristics and thereby implicitly using them to improve the model is
difficult. If using a protected characteristic would improve the predictive
accuracy of the model, then the search for the best model is likely to end
up trying to include this information via proxies.
In one sense, this issue is no different from the modelling discussed in
Units 1 to 8. For example, forward and backward stepwise regression only
consider the impact of including a variable on the fit of the model. Value
judgements about what variables mean, either by themselves or in
combination, do not come into it at this point. However, the difference
comes with the complexity of the models. For the simpler models met so
far, interpreting the model makes it possible to examine why it gives the
predictions that it does. This in turn makes it possible to detect when
protected characteristics are inadvertently being used.
The complexity of models used with big data, such as neural nets, makes it
very difficult, if not impossible, to figure out why a model gives the
predictions that it does. This means that situations where protected
characteristics are inadvertently used to improve predictions can remain
hidden. Detection might only arise after comparing predictions for similar
sets of inputs, as was done in Example 24.
The final example we give in this subsection relates to a health app.
One issue this example highlights is whether the app should give the same
diagnosis regardless of gender. Arguably, gender equity is better served by
an app that is equally accurate for people of any gender, though going
down this route raises issues about how best to measure accuracy. In
Box 9 of Unit 5 (Subsection 4.3.1), a couple of measures for continuous
data were introduced: MSE and MAPE. For categorical outcomes, such as
‘heart attack’ or ‘panic attack’, the misclassification rate is often used.
That is, the percentage of outcomes for which the model came up with the
wrong diagnosis. However, any calculation also needs to factor in the cost
of making the wrong diagnosis. In the next activity, you will consider what
this cost might be in the case of heart attack versus panic attack.
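For example, with hypothetical vectors of actual and predicted diagnoses, the misclassification rate is simply the proportion of mismatches:

actual    <- c("heart attack", "panic attack", "panic attack", "heart attack")   # hypothetical outcomes
predicted <- c("heart attack", "panic attack", "heart attack", "heart attack")   # hypothetical app output
100 * mean(predicted != actual)   # percentage misclassified: 25%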
There is also the question of whether the app is reflecting real differences
in the rates of heart attack versus panic attack in men and women. As the
developers of Babylon point out, the app is built on historical data and
symptoms. Thus, it is reflecting a difference that is there in the data.
However, there is also a danger that the app is reflecting a historical bias
in the diagnosis of heart attacks. That is, whether historically heart
attacks in women have been more likely to be missed by doctors (which is
a real concern – expressed, for instance, by the British Heart Foundation,
2019). This could be because the doctors themselves thought of heart
attacks as something that more affects men and hence were more likely to
associate the same symptoms in women with something else (such as a
panic attack). Or this could be because the ‘typical’ symptoms of heart
attack might occur less in women who are having a heart attack.
This leads to another issue about fairness, which we will consider in
Subsection 6.2: whether historical biases in data will diminish over time as
the app is used and more data are gathered.
6.2 Feedback loops
In Example 26, you have seen how a feedback loop could be set up. If this
happens, it means that over time an inequality will not diminish but
instead it is likely to just grow and grow.
Example 27 PredPol
For some while now, it has been recognised that the location of many
crimes is not completely at random. Some areas, at some times, are
unfortunately more likely to experience crime than others. However,
this means that by analysing where, when and what crime has already
happened, predictions can be made as to where crime is more likely to
happen next. This knowledge allows police forces to act in a proactive
way, directing resources to the predicted hotspots. This is the idea
underpinning PredPol, a predictive model based on big data (PredPol,
2020).
The labelling of some areas as higher-crime simply because crime was more
likely to be observed there, as you saw can happen in Activity 36, is bad
enough. However, there is also the danger that this plays into other
prejudices, as different districts almost inevitably have different
demographic mixes. For example, if the districts initially targeted happen
to be poorer areas, then the idea that more crime occurs in poor areas
could get reinforced (rather than the idea that crime is simply more likely
to be recorded in the areas where the police are concentrating their resources).
7 Guidelines for good practice
Have you ever worked on, or read about, a situation involving a dataset
or data analysis where there might be potential ethical issues? Based on
your study of Sections 5 and 6, identify the potential issues and suggest
how you could avoid them in practice. Share your reasoning on the M348
forums.
Open ‘A guide for ethical data science’ provided on the module website,
and read the descriptions of the themes given in Section 3.2 of the
Introduction.
(a) Which groups of people are mentioned in the descriptions?
(b) To what extent are the considerations guided by legal and regulatory
requirements?
(c) How much do the data analysis techniques feature in the descriptions?
Now look at the checklist given at the end of ‘A guide for ethical data
science’ on the module website. Which of the suggested actions address
issues surrounding consent, anonymisation, inequality and feedback loops
that we have considered in Sections 5 and 6?
Notice that only one of the categories in this guide relates directly to the
analysis of data and the building of statistical models. This stresses that,
in practice, there is so much more to data science than fitting statistical
models. The management of the data also plays a vital part. As you saw
in Unit 5, preparing data ready for analysis is a non-trivial task.
The ethical checklist highlights the need to be sure that the data have
been ethically sourced, and privacy safeguarded. It also highlights the
importance of communication skills in data scientists, in particular, the
ability to communicate technical issues in a non-technical way.
To round off this section, unit and strand, we end with a reminder of the
benefits that data science can bring to the world. On 11 March 2020 the
World Health Organization (WHO) declared COVID-19 a pandemic
following the spread of this newly identified disease from its first known
source in Wuhan, China. During this pandemic, data scientists have
worked hard to provide an evidence base for strategies to mitigate spread,
to understand more about the disease and how public behaviour changed
as a result.
Whilst data about the pandemic soon started accumulating, making sense
of some of it has been challenging. For example, comparisons of case rates,
either over time or between nations, depended on factors such as the
availability of testing, the type of testing, and the reporting rates of the results
– all of which impacted on the numbers of recorded cases. Nevertheless,
data scientists used their skills to help. For example, The Alan Turing
Institute in the UK lists the following as some of the key projects it was
engaged with in response to the pandemic (The Alan Turing Institute,
2021).
• Project Odysseus. A project where they monitored activity on London’s
streets, so aiding infrastructure to be reconfigured to make it easier for
social distancing to be observed.
• DECOVID. A project using anonymised patient data to help improve
treatment plans for COVID-19 patients.
• Rapid Assistance in Modelling the Pandemic initiative. Creating a
model of individuals’ movements around towns and cities so that the
impact of different lockdown strategies could be tested.
• Modelling of positive COVID-19 test counts to provide up-to-date
numbers despite a slight lag in the processing of tests.
• Improving the NHS COVID-19 app to more accurately predict the risk
that a user has been in contact with a COVID-19-positive person.
Finally, remember that big data is a relatively new area for data scientists
and statisticians. Who knows what exciting developments are just around
the corner – or have already happened in the time between this unit being
written (early 2022) and when you are reading these words!
Summary
In this unit, you have been learning about big data, in particular some of
the challenges it brings.
There is not an agreed definition of when data becomes big data. One
definition is that big data are data that cause significant processing,
management, analytical and interpretational problems. These problems
might be because the data possess one, or more, of the three V’s: volume,
variety and velocity.
With big data, processing problems are generally overcome by using
distributed computing: that is, splitting the data storage and
computation across a number of processors. The use of generic algorithms,
such as MapReduce, helps structure computations to take advantage of
distributed computing, though they are easier to apply to some statistical
computations (such as computing a mean) than others (such as computing
a median).
Data analysis via a computer requires algorithms to be implemented.
There might be different algorithms designed to achieve the same task, but
even if there are, how the computation time depends on the number of
data points (for example using big O notation) might vary. Such
differences become important with the size of big data datasets.
The time taken to implement the algorithms is not the only problem. It
may not be possible to obtain exactly the right answer. Even with simple
calculations, some rounding error is often inevitable. Or it may not be
guaranteed that the algorithm will deliver the best answer every time. For
example, that a global maximum will always be obtained, rather than just
a local maximum. In other situations, it may not even be clear which of
two answers is better. In such cases, it is important to document which
algorithm has been used, for example whether clustering was done via
k-means, DBScan or some other algorithm.
The output from big data is sometimes different from that of small data.
Correlations might be exploited whether or not they are spurious.
Furthermore, large amounts of data mean that more complicated models
are fitted. This makes interpretation harder, and hence it may not be done.
Also, some of the uses that big data have been put to have focused on
prediction rather than interpretation.
In this unit, you have also been considering ethical issues surrounding the
use of big data, and small data too. Informed consent is an important
principle when it comes to personal data. That is, that people freely agree
having fully understood what they are being asked to do. Furthermore,
legislation such as GDPR covers what personal data can be stored and
what can be done with it. When data are to be shared, it is often only
after it has been anonymised. Care has to be taken with anonymisation to
make sure that individuals cannot be identified again afterwards. By using
randomised response approaches, it is possible to collect usable data in a way that still protects the anonymity of each respondent's individual answers.
The route map of the unit: Section 1 (What is so special about big data?), Section 2 (Handling big data), Section 3 (Models and algorithms), Section 4 (Outputs from big data analysis), Section 5 (Privacy), Section 6 (Fairness) and Section 7 (Guidelines for good practice).
Learning outcomes
After you have worked through this unit, you should be able to:
• interpret what is meant by big data and appreciate its importance, uses
and challenges
• explain the different aspects of big data, concentrating on the three V’s
aspects: volume, variety and velocity
• understand the notion of distributed computing and appreciate its
computational power obtained by combining multiple individual
processors
• implement distributed computing in R through working on a set of
Jupyter notebook activities
• understand some computational algorithms that can be used to facilitate
the analysis of big data
• appreciate different aspects of these algorithms and investigate their
convergence, accuracy, and the uniqueness and optimality of the
solutions they give
• describe some differences in the interpretation of big data outputs
compared to that of small data – these include the extent to which
correlations may be spurious, whether the analysis is for sample or
population data, and that the interpretation of the big data output
usually depends on prediction, rather than explanation
• appreciate the importance and legal obligation to maintain the privacy
of individuals while handling both small and big data – this includes
obtaining the appropriate consent and maintaining the anonymisation of
data at all stages of data collection and storage
• appreciate the importance and legal obligation to maintain other ethical
standards at all stages of big data analysis – specifically, ensuring equity
of all individuals and avoiding feedback loops
• understand and use guidelines for good practice to ensure that data are
being handled with integrity and that all ethical issues are considered.
References
Amazon.co.uk (2022) Privacy Notice. Available at:
https://www.amazon.co.uk/gp/help/customer/display.html?nodeId=502584
(Accessed: 29 September 2022).
BBC News (2017) ‘Google DeepMind NHS app test broke UK privacy law’,
3 July. Available at: https://www.bbc.co.uk/news/technology-40483202
(Accessed: 9 February 2021).
British Heart Foundation (2019) Bias and biology: how the gender gap in
heart disease is costing women’s lives. (British Heart Foundation briefing.)
Available at: https://www.bhf.org.uk/informationsupport/heart-matters-
magazine/medical/women-and-heart-disease/download-bias-and-biology-
briefing (Accessed: 20 October 2022).
Buolamwini, J. and Gebru, T. (2018) ‘Gender shades: intersectional
accuracy disparities in commercial gender classification’, Proceedings of
Machine Learning Research, 81, pp. 77–91.
Butler, D. (2013) ‘When Google got flu wrong’, Nature, 494, 14 February,
pp. 155–156. doi:10.1038/494155a.
CERN (2021) Worldwide LHC Computing Grid. Available at:
https://wlcg-public.web.cern.ch (Accessed: 24 March 2021).
Chaves, W.A., Valle, D., Tavares, A.S., von Mühlen, E.M. and Wilcove,
D.S. (2021) ‘Investigating illegal activities that affect biodiversity: the case
of wildlife consumption in the Brazilian Amazon’, Ecological Applications,
31(7), Article e02402. doi:10.1002/eap.2402.
Cook, S., Conrad C., Fowlkes, A.L. and Mohebbi, M.H. (2011) ‘Assessing
Google Flu Trends performance in the United States during the 2009
Influenza Virus A (H1N1) pandemic’, PLoS One, 6(8), Article e23610.
doi:10.1371/journal.pone.0023610.
Das, S. (2019) ‘It’s hysteria, not a heart attack, GP app Babylon tells
women’, 13 October. Available at:
https://www.thetimes.co.uk/edition/news/its-hysteria-not-a-heart-attack-
gp-app-tells-women-gm2vxbrqk (Accessed: 29 September 2022).
Dean, J. and Ghemawat, S. (2004) ‘MapReduce: simplified data processing
on large clusters’, OSDI’04: Proceedings of the 6th Conference on
Symposium on Operating Systems Design & Implementation. San
Francisco, 6–8 December, pp. 137–149. doi:10.5555/1251254.1251264.
Diebold, F.X. (2021) “What’s the big idea? ‘Big Data’ and its origins”,
Significance, 18(1), pp. 36–37. doi:10.1111/1740-9713.01490.
EU-CJ (2011) ‘Taking the gender of the insured individual into account as
a risk factor in insurance contracts constitutes discrimination’, Court of
Justice of the European Union, Press Release No 12/11. Available at:
https://curia.europa.eu/jcms/upload/docs/application/pdf/2011-
03/cp110012en.pdf (Accessed: 23 February 2022).
Leo, B. (2018) ‘Motorists fork out £1,000 more to insure their cars if their
name is Mohammed’, The Sun, 22 January. Available at:
https://www.thesun.co.uk/motors/5393978/insurance-race-row-john-
mohammed (Accessed: 29 September 2022).
Mach, P. (2021) ‘10 business applications of neural network (with
examples!)’, Ideamotive, 7 January. Available at:
https://www.ideamotive.co/blog/business-applications-of-neural-network
(Accessed: 16 September 2022).
Moore, G.E. (1965) ‘Cramming more components onto integrated circuits’,
Electronics, 38(8), 19 April.
NASA (2016) ‘When the computer wore a skirt: Langley’s computers,
1935–1970’. Available from: https://www.nasa.gov/feature/when-the-
computer-wore-a-skirt-langley-s-computers-1935-1970
(Accessed: 4 February 2022).
OED Online (2022) ‘algorithm, n.’. Available at:
https://www.oed.com/view/Entry/4959 (Accessed: 18 August 2022).
Oridupa, G. (2018) ‘Fundamentals of MapReduce (new to MapReduce?)’,
Coding and analytics, 23 August. Available at:
https://www.codingandanalytics.com/2018/08/fundamentals-of-
mapreduce.html (Accessed: 20 December 2022).
PredPol (2020) ‘PredPol and community policing’. Available at:
https://blog.predpol.com/predpol-and-community-policing (Accessed:
25 February 2022).
Pym, H. (2019) ‘App warns hospital staff of kidney condition in minutes’,
BBC News, 1 August. Available at: https://www.bbc.co.uk/news/
health-49178891 (Accessed: 9 February 2021).
R Development Core Team (2022) ‘The R Reference Index’ (R-release,
version 4.2.1). Available at: https://cran.r-project.org/manuals.html
(Accessed: 21 October 2022).
Royal Free London (no date) ‘Information Commissioner’s Office (ICO)
investigation’. Available at: https://www.royalfree.nhs.uk/patients-
visitors/how-we-use-patient-information/information-commissioners-office-
ico-investigation-into-our-work-with-deepmind
(Accessed: 22 February 2022).
The Alan Turing Institute (2021) ‘Data science and AI in the age of
COVID-19’. Available at: https://www.turing.ac.uk/research/publications
/data-science-and-ai-age-covid-19-report (Accessed: 5 August 2021).
The National Archives (2013) ‘Equality Act 2010, Part 2, Chapter 1:
Protected Characteristics’. Available at:
https://www.legislation.gov.uk/ukpga/2010/15/part/2/chapter/1
(Accessed: 19 December 2022).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, a recommender system: photo by Siggy Nowak from
Pixabay
Subsection 1.1, Google: Lets Design Studio / Shutterstock
Figure 1: Ben Hider / Stringer
Subsection 1.2.1, volume: Richard Bailey / Corbis Documentary / Getty
Subsection 2.1, kids measuring tree width: Jupiterimages / Getty
Figure 4: Kathy Hutchins / Shutterstock
Figure 5: Magpi. This file is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported licence.
https://creativecommons.org/licenses/by-nc-sa/3.0/
Figure 6: James Brittain / Getty
Subsection 2.2, work in parallel: JGI / Jamie Grill / Getty
Subsection 2.2, multiprocessor: Taken from Appliances Direct website
Figure 7: diagram of the map-reduce algorithm: Taken from
https://www.codingandanalytics.com/2018/08/fundamentals-of-
mapreduce.html
Subsection 3.1, Lotoo: Posted by u/Jredrock on Reddit.com
Subsection 3.1, a square root: chaostrophic.com
Solutions to activities
Solution to Activity 1
There is not a single generally accepted definition of ‘big data’. So, it is
not surprising if you struggled to put your finger on what makes data ‘big’.
Going by the adjective ‘big’, you may well have thought in terms of
datasets that have lots of observations and/or lots of variables. This is
indeed one aspect that can make data big. But did you also think about
the type of data, and the speed at which it can be gathered? As you will
see next, these can also be factors that turn data into big data.
Solution to Activity 2
Everyone’s responses are likely to be different. If you have bought items
which were suggested to you, or even if the items were of interest to you,
this suggests that a recommender system has worked – at least this one
time.
Solution to Activity 3
Reasons the analysis of such postings might not accurately reflect the
opinion of the general public include the following.
• Access to the internet is required to be able to leave comments on such
websites. Even though access to the internet has dramatically increased
during this century, it is not universal even in developed countries.
The attitudes of those without access to the internet may be very
different to those with access.
• Even if someone can access the internet, they may not have the time to
do so. Or it may be a website that they don’t like, possibly because of
the perceived political leaning of the news media outlet. Thus, in this
respect, the attitudes of those who notice the comments section may be
different to the attitudes of the general public.
• The attitudes of those people prepared to post in a comment may also
be different to those who are aware of it, but do not post. Furthermore,
people might be more reticent to express particular opinions, for fear of
censure from others.
• Some people might post multiple times using different names in an effort
to make it appear that a particular viewpoint is more common than it
really is. In the extreme, a posting may have been artificially generated.
These issues do not diminish as the number of postings increases,
particularly if the increase in postings is simply a result of there being
more posts per individual rather than more individuals posting.
Solution to Activity 4
The simplest answer you probably thought of is that these models are fitted
using R – or using a different statistics package such as Minitab or SPSS.
But how do these packages do it? Recall that linear models are generally
fitted by least squares. That is, the parameter estimates are chosen so that
the sums of the squared differences between the data points and the model
are minimised.
In contrast, maximum likelihood estimation is used for generalised linear
modelling. That is, the parameter estimates are chosen to be those for
which the likelihood is maximised.
However, saying that the model is fitted using least squares or maximum
likelihood still does not fully answer the question about how the package
comes up with the estimates for model parameters. You need to consider
the computations that are performed. The complexity of these
computations depends on the model that is being fitted.
Solution to Activity 5
(a) When a member of the module team did this, they took 8 minutes
30 seconds to calculate $\hat{\beta}$ and then $\hat{\alpha}$. As the time taken depends on
factors such as dexterity with a calculator, focus on the task and
practice, the time you took is likely to be different.
It turns out that for these data $\hat{\beta} = 12.9747$ and $\hat{\alpha} = 3.5854$. Did you
get the correct values for $\hat{\beta}$ and $\hat{\alpha}$?
(b) The calculations in Step 1 are likely to take three times as long, as
there are three times as many numbers to add up. The other steps are
likely to take about the same length of time as when there are only
five trees included. This is because the number of numbers involved in
these calculations is the same whatever the number of trees. Overall,
this means that calculating $\hat{\alpha}$ and $\hat{\beta}$ will take a bit less than three
times as long.
(c) As the number of trees increases, it is the time spent on Step 1 that
will dominate. The other steps will only require a few computations,
no matter how many trees there are. So, generally the computation
time will go up proportionally with the number of trees.
Solution to Activity 6
(a) There is no right answer to this. How long is taken will vary from
person to person. When a member of the module team tried this,
they took 4 minutes 15 seconds. (The value of the standard deviation
you should have obtained is 15.628.)
(b) When the same member of the module team tried the second method,
they took 3 minutes 15 seconds.
(c) So the module team member was able to calculate the standard
deviation faster using Method 2. You probably found the same.
Solution to Activity 7
(a) It is the first step that is likely to take the most time. The number of
terms involved with each of the calculations detailed in Step 1 will
increase as the number of trees increases.
(b) All four approaches are valid as they will all result in the same value
of the sum being calculated. This is because it does not matter in what
order the terms are added together.
(c) Strategy (iii) is likely to be fastest. With you and your two friends
each calculating one of s1 , s2 and s3 , these can be calculated
simultaneously. This just leaves the relatively small sum s1 + s2 + s3
to be calculated afterwards. If everyone can add two numbers together
as quickly as everyone else, it does not make sense for one person to
be given more numbers to add together than anyone else because the
final sum can’t be calculated until s1 , s2 and s3 are all known.
(d) Suppose that you give one friend s2 to calculate and the other s3.
The friend given s2 to calculate only needs to know the values
of x6, . . . , x10, just one third of the data. Similarly, the friend given s3
to calculate only needs to know the values x11, . . . , x15, again just one
third of the data. Finally, you would only need to know the values
of x1, . . . , x5 – provided you are sure that your friends have the data
they need.
Solution to Activity 8
(a) There are two key tasks. The first is to decide which subset of the
data each friend is going to work on (and of course, which data points
you are going to leave for yourself). This is important because each
data point must go to one, and only one, person. The other key task
is to keep track of what the subsums s1 , s2 and s3 are once they have
been calculated. Note that keeping track also includes having some
system of noticing if one of your friends fails to get back to you with
their subsum.
(b) Whilst you might think that dividing up the work is worthwhile when
there are 12 trees in the dataset, it is unlikely to be when there are
just 6 trees. The extra hassle of tasking your two friends and keeping
track of what they report for their subsums might make it not worth
it. Instead you may well have decided that, in this circumstance, it is
quicker to do all the computations yourself.
Solution to Activity 9
(a) P(at least one fails) = 1 − P(neither fails)
        = 1 − (P(a processor does not fail))²
        = 1 − (0.9)² = 1 − 0.81 = 0.19.
(b) P(at least one fails) = 1 − P(none fails)
        = 1 − (P(a processor does not fail))⁵
        = 1 − (0.9)⁵ ≃ 1 − 0.5905 = 0.4095.
(c) Let n be the minimum number of processors we need. This means
that we must have
        1 − (0.9)ⁿ = 0.5,   that is,   (0.9)ⁿ = 0.5.
In other words, n log(0.9) = log(0.5). Thus,
        n = log(0.5)/log(0.9) ≃ 6.58,
so the minimum number of processors needed is n = 7.
(d) When the probability of a single processor failing is 0.01, the number
of processors, n, for which the probability of at least one failure is 0.5
is given by
        n = log(0.5)/log(0.99) ≃ 68.97,
so n = 69 processors are needed.
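These calculations are easily checked in software; the following Python sketch is purely illustrative (the function names are ours, not part of the module).

```python
import math

def prob_at_least_one_failure(n, p):
    """Probability that at least one of n processors fails,
    when each fails independently with probability p."""
    return 1 - (1 - p) ** n

print(round(prob_at_least_one_failure(2, 0.1), 4))   # 0.19    (part (a))
print(round(prob_at_least_one_failure(5, 0.1), 4))   # 0.4095  (part (b))

def min_processors(p, target=0.5):
    """Smallest n for which the probability of at least one failure
    reaches the target probability."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(min_processors(0.1))     # 7   (part (c))
print(min_processors(0.01))    # 69  (part (d))
```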
Solution to Activity 10
(a) The overall mean, x̄, can be calculated using
        x̄ = (n₁x̄₁ + n₂x̄₂)/(n₁ + n₂).
(b) A similar formula works if we have p subsets. One way of seeing this
is by considering that the overall mean can be calculated by
combining the mean based on the first p − 1 subsets and the mean
based on the last subset. More directly, this is the same as using
        x̄ = (Σ nₖx̄ₖ)/(Σ nₖ),
where each sum runs over k = 1, …, p.
(c) As parts (a) and (b) show, it is possible to calculate the mean by first
splitting the data into p subsets and working on each subset
independently. Then, the results from each subset can be combined to
get the overall mean. So, it is a computation that is amenable to
being done using distributed computing.
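As a small illustration of this, the following Python sketch (with made-up numbers) combines subset means and sizes into the overall mean:

```python
# Three made-up subsets of data, as might be held on three machines.
subsets = [
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0],
    [2.0, 10.0],
]

# Each machine only needs to report its subset size n_k and mean xbar_k.
sizes = [len(s) for s in subsets]
means = [sum(s) / len(s) for s in subsets]

# Weighted average of the subset means, as in the formula above.
overall_mean = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# This matches the mean calculated directly from all the pooled data.
pooled = [x for s in subsets for x in s]
assert abs(overall_mean - sum(pooled) / len(pooled)) < 1e-9
print(overall_mean)
```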
Solution to Activity 11
(a) As the data are already ordered, the median corresponds to the
middle value, which is 21 in this case.
(b) With this split of the data, the median of each subset is 15, 21 and 37,
respectively.
(c) With this split of the data, the median of each subset is 2, 21 and 117,
respectively.
(d) With this split of the data, the median of each subset is 5, 37 and 103,
respectively.
(e) The median of the medians is 21 for the split in part (b), 21 for the
split in part (c) and 37 for the split in part (d).
So, using the subsets proposed in parts (b) and (c) leads to a
median of medians that is the same as the median for the whole
dataset. However, not all splits will lead to the median for the
whole dataset, as the selection in part (d) demonstrates.
(f) The mean of the medians is approximately 24.3 for the split in
part (b), 46.7 for the split in part (c) and 48.3 for the split in part (d).
So, taking the mean of the medians is even worse. The mean of the
medians is not guaranteed to be even close to the median of the whole
dataset.
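These comparisons can be checked directly from the subset medians quoted above; a short Python sketch:

```python
from statistics import mean, median

whole_dataset_median = 21          # from part (a)
subset_medians = {
    "split (b)": [15, 21, 37],
    "split (c)": [2, 21, 117],
    "split (d)": [5, 37, 103],
}

for split, medians in subset_medians.items():
    print(split,
          "median of medians:", median(medians),
          "mean of medians:", round(mean(medians), 1))

# Only splits (b) and (c) recover the whole-dataset median of 21,
# and none of the means of the medians is close to 21.
```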
Solution to Activity 12
(a) (i) For these data, Σ(x − x̄)² = 14. This value is exact as the
calculation involves the adding and squaring of (small-valued)
integers.
(ii) The variance is 7. This is exact because 14 is a multiple of 2
(= n − 1).
(iii) The standard deviation is 2.646 (to three decimal places). As the
phrase ‘to three decimal places’ indicates, this value is not exact.
At least, it is not exact if we want to write it down in decimal
form. The exact value is an irrational number, so we cannot write
down the value exactly using a finite number of decimal places.
(b) (i) For these data, Σ(x − x̄)² = 2. Like in part (a), this value is
exact as it just involves the adding and squaring of
(small-valued) integers.
(ii) The variance is 0.667 (to three decimal places). This is not exact.
The value 2 is not a multiple of 3 (= n − 1), so we cannot write
down this value exactly using a finite number of decimal places.
(iii) To three decimal places, the value of the standard deviation you
obtained would have been 0.837, 0.819, 0.817 or 0.816, depending
on how many decimal places you used for the variance when
taking the square root, i.e. these are the square root values
of 0.7, 0.67, 0.667 and 0.66667, respectively. (To three decimal
places, the square root of 0.6667 is the same as 0.667.) None of
the standard deviation values will be the exact value, as now it is
not possible to input the exact value to calculate the square root.
Also, we are not able to write down the result exactly using a
finite number of decimal places.
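The effect of rounding the variance before taking the square root can be seen directly; for instance, in Python:

```python
import math

# Square roots of the rounded values of the variance 2/3 quoted above,
# each reported to three decimal places.
for rounded_variance in [0.7, 0.67, 0.667, 0.6667, 0.66667]:
    print(rounded_variance, round(math.sqrt(rounded_variance), 3))
# 0.7     -> 0.837
# 0.67    -> 0.819
# 0.667   -> 0.817   (0.6667 also gives 0.817)
# 0.66667 -> 0.816
```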
Solution to Activity 13
No, forward and backward stepwise regression do not always result in the
same parsimonious model. For example, in Unit 2, both forward and
backward stepwise regression were used to find a parsimonious model for
the income a film will generate.
In Example 14 (Subsection 5.3.1 of Unit 2), forward stepwise regression
found the parsimonious model to be
√income = −3.627 + 0.033 budget + 1.268 screens + 0.923 rating.
In contrast, as Example 15 in Subsection 5.3.2 of Unit 2 showed, backward
stepwise regression working from the same set of explanatory variables,
found a different parsimonious model. This model corresponds to
√income = −4.385 + 0.036 budget + 1.058 rating + 1.104 screens
          − 0.268 views + 0.099 likes + 1.493 dislikes − 0.554 comments.
Solution to Activity 14
Yes it is. There is general agreement that the best model corresponds to
the one that minimises the sum of the differences between the predicted
values (heights of trees) and the actual heights of the trees. (As it turns
out for these data, this sum is 1.027 based on the result from the first
algorithm, and 0.136 using the second algorithm. So in this case it
happens to be that the second algorithm produces the better answer.)
Solution to Activity 15
No, it is not. As you learnt in Subsection 5.2 of Unit 2, there are different
ways of measuring how well a model fits the data. These ways take into
account how many variables are included. In particular, there is the
adjusted R2 statistic (Ra2 ) and the Akaike information criterion (AIC). It
might be that Ra2 and AIC both indicate the same selection of variables is
best. However, this is not guaranteed.
Solution to Activity 16
It could be argued that all the methods are about fitting the same type of
model: a model in which the data can be split into different clusters.
However, this leads to a vague model. The shape of the clusters and the
distribution of the numbers of observations in each cluster are not tied
down.
Instead, the methods are about using different algorithms to achieve the
same end (splitting the data into clusters). The different algorithms have
different strengths.
Solution to Activity 17
No, it is not.
In Example 14, it was agreed that the link between strength and weight is
unlikely to be causal. However, an indirect link is plausible. A footballer
might be heavier because they have more muscle mass and hence they
are stronger.
[Diagram illustrating the indirect link between ‘Heavier’ and ‘Stronger’ via greater muscle mass.]
Solution to Activity 18
(a) The sample mean is a good estimator of the population mean.
So, an estimate of the population mean strength is 71.07.
(b) It is good practice to give some indication of the sampling error
associated with an estimate. The 95% confidence interval is one
means of doing so.
(c) As we are now dealing with all the data, the mean strength
(calculated to be 65.32) is the population mean strength. Note that
this is the value of the population mean, not an estimate of it. So,
sampling error is no longer an issue and hence indications of the size
of the sampling error are no longer required.
Solution to Activity 19
Your first thoughts may have been that, yes, these data will be
representative of all international footballers. However, it is not clear
whether the database contains information on both male and female
footballers.
Also, the data may only be representative of footballers in 2019, or a few
years either side. The data are almost certainly not representative of all
international footballers in the past and may not be representative of such
footballers in the future due to factors such as changes in training methods.
Solution to Activity 20
(a) No, it does not appear that all the variables are contributing to the
model. The p-values associated with preferredFoot and skillMoves
are both relatively large.
(b) Now, it looks like most of the variables are contributing significantly
to the model. There is only one variable, preferredFoot, that might
not be, and even here the p-value is close to the level that is normally
considered for statistical significance.
(c) Now, there is strong evidence that all the variables are contributing to
the model as all the p-values are very small.
(d) In each case, the 95% confidence interval can be calculated using the
formula
        (β̂ − t(0.975) × s.e.(β̂), β̂ + t(0.975) × s.e.(β̂)),
where ‘s.e.(β̂)’ means ‘the standard error of β̂’. Here, even when the
sample size is n = 100, the degrees of freedom for the relevant
t-distribution is big enough that it is reasonable to replace it
by z(0.975) ≃ 1.96.
So, when the model is based on 100 footballers, the 95% confidence
interval is
(−0.279 − 1.96 × 1.046, −0.279 + 1.96 × 1.046) ≃ (−2.329, 1.771).
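Numerically, this interval is simple to reproduce; for instance, in Python (using the estimate and standard error quoted above):

```python
beta_hat = -0.279   # estimated regression coefficient
se_beta = 1.046     # its standard error
z = 1.96            # approximate 97.5% point of the standard normal

lower = beta_hat - z * se_beta
upper = beta_hat + z * se_beta
print(round(lower, 3), round(upper, 3))   # -2.329 1.771
```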
Solution to Activity 21
Out of the models listed, the last model is the most complex as it has the
most terms in it. It is also the most complicated one to interpret. The
effect of any one of the variables incomeSource, employment and gender
depends on what value the other variables take.
Solution to Activity 22
The authors give the following examples:
• recommender systems
• providing a way to see how a selected item of clothing would look on a
model
• providing software to help banks decide whether they should extend
credit to someone
• forecasting the value of currencies, cryptocurrencies and stocks
• assisting doctors with diagnoses and with treatment plans
• detecting cyber attacks
• detecting online fraud
• planning and monitoring the routing of deliveries
• predicting the time of deliveries
• developing an autopilot system for a car.
Solution to Activity 23
(a) The intended reader is someone who is considering taking part in the
study. Notice how the leaflet makes frequent use of the pronoun ‘you’
to mean this person and the pronoun ‘we’ to mean the researchers.
(b) The leaflet makes it very clear the reader does not have to take part
if they don’t want to.
(c) A participant will be asked to give basic details such as their name,
age, gender and ethnicity, details about recent illnesses, and about
any recent COVID-19 test they might have had. They will also be
asked to upload a photograph of their antibody test result.
(d) The research team will have access to all of the data. Only results
of the study, which the researchers stress will not contain personal
identifiable information, will be shared with others. (These others
include the UK’s National Health Service (NHS), Public Health
England and the Department of Health and Social Care.)
Solution to Activity 24
For this, the module team looked at Amazon.co.uk. Their Privacy Notice
(in this case, one dated 22 March 2019) includes a long list of instances
where they gather information, such as:
• communicate with us by phone, email or otherwise
• search for products or services
• upload your contacts
• place an order through Amazon services
• talk to or otherwise interact with our Alexa Voice service.
They also gather other information automatically such as:
• the IP address
• login; email address; password
• the location of your device or computer.
Amazon.co.uk also gathers other information about its customers,
such as:
• information about your interactions with products and services offered
by our subsidiaries
• information about internet-connected devices and services with Alexa
• credit history information from credit bureaus.
Solution to Activity 25
Two examples are:
• upload your contacts
• talk to or otherwise interact with the Alexa Voice service.
By uploading information about your contacts, you are providing Amazon
with information about these other people. Yet unless you have told your
contacts about this, they may not even be aware that you have passed on
information about them to Amazon.
Similarly, the presence of Alexa in a household means that any member of
the household or visitor may inadvertently have what they have said
picked up by Alexa.
Solution to Activity 26
There are two variables that relate to where each student was based:
region and imd. (Remember that imd is a factor representing the index of
multiple deprivation (IMD), which is a measure of the level of deprivation
for the student’s (UK) postcode address.)
However, both variables only give a very rough indication of where each
student was located. The variable region splits the UK into just
13 regions, and one of those regions includes all the students based outside
of the UK. The other variable, imd, combines locations in the UK on the
basis of their index of multiple deprivation. For these data, just four
categorisations are used: ‘most’, ‘middle’, ‘least’ and ‘other’.
Solution to Activity 27
In the dataset, most students are 70 years old or younger. However, there
are a few students over the age of 75, including one over 90 years old. Of
course, there are enough people aged at least 90 years old that this piece of
information does not seem to be useful for identification.
Solution to Activity 28
Other information about Clifford available online includes that he earned
his degree, a BA Open in Arts, in 2013 when he was aged 93 (The Open
University, 2013).
This means it is unlikely he is the oldest student in the OU students
dataset. As stated in the description of the dataset, the dataset only
contains data about students on modules between 2015 and 2020 – after
Clifford graduated. Also, all the students in this dataset were studying
statistics modules, which are modules that Clifford is unlikely to have
studied as part of his degree (though not impossible).
Solution to Activity 29
A student might be giving the response ‘yes’ simply because their coin toss
was ‘heads’.
Furthermore, only the student will know if the coin toss was in fact ‘tails’,
and they have indeed submitted someone else’s work as their own. So,
someone processing the survey cannot say if an individual respondent has
admitted to submitting someone else’s work as their own because they do
not know.
Solution to Activity 30
(a) If no OU students have ever submitted someone else’s work as their
own, the response ‘yes’ would only be a result of coin tosses coming
up heads. Assuming that all the coins are unbiased, we would expect
this to happen 50% of the time. So, we would expect 50% of the
responses to this question to be ‘yes’.
(b) If 1% of OU students have ever submitted someone else’s work as
their own, we would still expect 50% of respondents to answer ‘yes’
because their coin toss is ‘heads’. However, now 1% of students whose
coin toss was ‘tails’ would also answer ‘yes’.
So, overall we would expect 50% + 0.5 × 1% = 50.5% of the responses
to this question to be ‘yes’.
(c) If p% of OU students have ever submitted someone else’s work as
their own, we would still expect 50% of respondents to answer ‘yes’
because their coin toss is ‘heads’. However, now p% of students whose
coin toss was ‘tails’ would also answer ’yes’.
So, overall we would expect 50% + 0.5 × p% = (100% + p%)/2 of the
responses to this question to be ‘yes’.
(d) The percentage of ‘yes’ replies we expect to have depends on the true
percentage, p, of OU students who have ever submitted someone else’s
work as their own. As p increases, the percentage of ‘yes’ replies we
expect goes up. So, we are able to use the percentage of ‘yes’ replies
to estimate p.
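Turning this relationship around gives a simple way to estimate p from the observed percentage of ‘yes’ replies. A small Python sketch (the 53% figure used below is made up purely for illustration):

```python
def estimate_p(yes_fraction):
    """Estimate the proportion p from the observed fraction of 'yes'
    replies, using yes_fraction = 0.5 + 0.5 * p."""
    return 2 * yes_fraction - 1

# If, say, 53% of responses were 'yes', the estimated proportion of
# students who have submitted someone else's work is about 6%.
print(round(estimate_p(0.53), 2))   # 0.06
```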
Solution to Activity 31
As has already been mentioned in this subsection, questions that involve
illegal behaviour benefit from using a randomised response approach. So,
for example:
• Why would someone drive even when they know they are not in a fit
state to do so?
• Why would someone knowingly buy counterfeit or stolen goods?
Solution to Activity 32
(a) All other things being equal, a model estimated using data from
100 000 people should produce better predictions than a model
estimated using data from 100 people. This is because the extra data
should mean that there will be far less sampling error associated with
the estimates of model parameters.
(b) In general, more data means that models with more parameters can
reasonably be fitted. These extra parameters are likely to allow the
shape of the model to more closely reflect the shape of the
relationship between income and the probability of repaying the loan.
For example, think of the shapes that can be achieved by functions of
the form a + bx, a + bx + cx² and a + bx + cx² + dx³.
Solution to Activity 33
(a) No, this model is not likely to be as good for people in this subgroup.
The low proportion of people from this subgroup is likely to only have
a small impact on shaping the estimated relationship between income
and the probability of repaying a loan. So, the modelled relationship
between income and the probability of repaying the loan will largely
reflect people not in the subgroup.
(b) A separate model for people in this subgroup will be based on far
fewer data than the model for everyone else. So, as explored in
Activity 32, this separate model is likely to give worse predictions
than the predictions for people not in the subgroup.
Solution to Activity 34
According to Example 24, only the name of the applicant was changed,
nothing else. In the UK, someone’s name is not a protected characteristic.
However, someone’s name can be, and often is, used to infer race (and sex
too). This is why The Sun’s findings about the cost implications of just
changing an applicant’s name from John Smith to Mohammed Ali
mattered.
Solution to Activity 35
(a) Costs here include the time of the medical staff at A&E, the costs of
any medical tests run to discover it is not a heart attack after all, the
time and expense for the person going to A&E and the stress induced
by receiving such advice for the person concerned (and others who
care about them). You may have thought of other costs.
(b) Costs here include those associated with not getting the required
medical treatment in a timely fashion. This could lead to a much
worse outcome for the person.
(c) The costs associated with misdiagnosing a heart attack as a panic
attack are potentially much greater than those associated with
misdiagnosing a panic attack as a heart attack. So, most people
would agree that the former misdiagnosis is worse.
Solution to Activity 36
(a) It makes sense for the police to focus on where the crime is. So, in
this case, to focus on district A.
(b) If more crime is recorded where the police are looking, then more
crime will be recorded in district A.
(c) If more crime is recorded in district A, then this reinforces the idea
that district A is the higher-crime area.
The answers to parts (a), (b) and (c) set up the undesirable feedback loop
shown in Figure S3.
[Feedback loop diagram: district labelled as ‘high crime’ → more police focus on the district → more crime recorded in the district → district labelled as ‘high crime’.]
Figure S3 A potential feedback loop when district policing levels are decided
by crime rates
Solution to Activity 37
All kinds of real-life scenarios are welcome. In your discussion of the
potential ethical issues, try to put them in context with the contents of
Sections 5 and 6. Engage with other students on the module and discuss
your thoughts clearly and responsibly on the module forums.
Solution to Activity 38
(a) You should have noticed that a wide range of people are mentioned.
These include people whose information is being used and stakeholders
in the work that is being done. It also includes society as a whole.
(b) There is mention of the need to adhere to legal and regulatory
frameworks (including where that encompasses being a whistleblower
if necessary), but the themes go beyond that. For example, to act
with transparency to build public trust in data science.
(c) The themes just mention using ‘robust statistical and algorithmic
methods that are appropriate to the question being asked’. Which
technique should be used in which situation is not mentioned.
Solution to Activity 39
Consideration about consent occurs under project planning (‘Can data be
ethically sourced?’), under data management (‘Fully understand the
consents and legal uses of the data’) and under analysis and development
(‘Apply consents and permitted uses, professional and regulatory
requirements’).
Anonymisation also appears under data management (‘Consider impacts of
data processes to privacy, bias and error’) as well as analysis and
development (‘Monitor risks identified at planning, assess for additional
risks (harm, bias, error, privacy)’) and implementation and delivery
(‘Applying best practice in anonymisation before sharing data or
disseminating outputs’).
Whilst inequality and feedback loops are not explicitly mentioned, bias,
harm and fairness are. For example, under project planning (‘Are there
risks (privacy, harm, fairness) for individuals, groups, businesses,
environment?’), under data management (‘Detecting and mitigating
sources of bias’) and analysis and development (‘Monitor risks identified at
planning, assess for additional risks (harm, bias, error, privacy)’).
Review Unit
Data analysis in practice
Introduction
This unit is designed to help you consolidate what you have learned in the
core part of M348. It describes some extensions of ideas you have already
met, as well as a small number of new statistical ideas that you may find
useful for data analysis in practice. The main aim, however, is to reinforce
the connections that exist between the various modelling approaches you
have seen in Units 1 to 8 and to give you some practical advice for
choosing a model for your data. Several datasets will be introduced in the
unit, so that you can draw out their features through different modelling
approaches.
To ease you into the spirit of this unit, we’ll start with a repeat of the
quote from Unit 5, attributed to the eminent statistician George Box:
‘All models are wrong, but some are useful.’
Ultimately, Box is saying that there is not one ‘right’ answer when trying
to find a model for your data. Every statistician is likely to choose slightly
differently. This is illustrated in the paper called ‘Many analysts, one data
set: making transparent how variations in analytic choices affect results’
(Silberzahn et al., 2018), where 29 teams of researchers analysed the same
dataset to answer the same research question. Only 20 of the teams found
the effect of interest to be statistically significant, and the estimates for
this effect varied a lot. Moreover, the 29 final models proposed by the
teams used 21 unique combinations of explanatory variables!
Keep this in mind when modelling a dataset. There is a multitude of
techniques you can use to analyse the data, and more than one approach
may lead to an appropriate model. So it’s a case of doing something that
can be justified and defended. This unit will help you along this path.
The unit starts by reviewing linear models, which were the focus of
Units 1 to 5. In Unit 1, we explored the concept of simple linear
regression, where a linear relationship between a continuous response
variable and one covariate is modelled. In Unit 2, this framework was
generalised to multiple linear regression, or more simply, multiple
regression, to incorporate an arbitrary number of covariates that may
affect the response. We will review various features of multiple regression,
including transformations, model selection and model diagnostics, in
Sections 1 and 2: Section 1 will focus on the exploratory analysis stage of
multiple regression, while Section 2 will consider the model fitting stage.
Not all explanatory variables are continuous, and therefore Unit 3 explored
methods to analyse data where the explanatory variable is categorical, that
is, a factor. We saw how a factor can be incorporated into the regression
framework by using indicator variables, and the concept of analysis of
variance, or ANOVA, was introduced. Unit 4 finally brought all these
methods together, introducing multiple regression with an arbitrary
number of covariates and factors, as well as potential interactions between
them. We’ll review multiple regression with factors in Section 3.
[Route map of the unit:
Section 1  Multiple regression: understanding the data
Section 2  Multiple regression: model fitting
Section 3  Multiple regression with factors
Section 4  The larger framework of generalised linear models
Section 5  Relating the model to the research question
Section 6  Another look at the model assumptions
Section 7  To transform, or not to transform, that is the question
Section 8  Who’s afraid of outliers?
Section 9  The end of your journey: what next?]
1 Multiple regression:
understanding the data
In this section, we’ll review various aspects of the exploratory analysis
stage of multiple regression. To do this, we’ll use a dataset that is
introduced in Subsection 1.1. We’ll then start an exploratory analysis for
this dataset in Subsection 1.2 and propose a first multiple regression model
for the data. Subsection 1.3 focuses on visualising the data; this raises
several questions about the dataset and how it should be modelled, which
are discussed in Subsection 1.4.
The values for the first five observations from the experiment are
given in Table 1.
Table 1 First five observations from the desilylation experiment
We will consider an initial model for the data from the desilylation
experiment next.
Model (1) from Box 1 provides us with a starting point for modelling the
data from the desilylation experiment from Table 1 in Subsection 1.1. So,
as a first model, let’s focus on a multiple regression model which includes
all of the possible covariates from the experiment as explanatory variables
– that is, the multiple regression model
yield ∼ temp0 + time0 + nmp0 + equiv0. (2)
It is usually a good idea to have a look at the data before starting to fit a
model. This can give an idea of which covariates affect the response, and
how. We will start in Activity 2 by looking at the data for the first five
observations from the desilylation experiment given in Table 1. Of course,
we cannot draw any strong conclusions from just looking at the first five
observations, but these may already give some indication about the model
which can be followed up in the data analysis.
The first five observations from the desilylation experiment were given in
Table 1; this is repeated here as Table 3 for convenience.
Table 3 Repeat of Table 1
So, we’ll continue to focus our attention on a multiple regression model for
yield, which includes all of the explanatory variables, but we’ll now use
the standardised data in the desilylation dataset so that Model (2) becomes
yield ∼ temp + time + nmp + equiv. (3)
[Figure 1: a scatterplot matrix of the variables temp, time, nmp, equiv and yield from the desilylation dataset.]
So, now let’s look at what the scatterplot matrix in Figure 1 tells us
about these relationships.
• Let’s start with the plot in the bottom-left corner – that is, the
scatterplot of yield and temp. We can see that the yield increases
as the temperature increases. This confirms our conjecture from
Activity 2. However, the relationship does not look like a straight
line, so we may consider a transformation of the explanatory
variable temp for our modelling approach. Also, the response values
do not seem evenly spread around the potential mean curve, which
may indicate a problem with the constant variance assumption.
• From the second scatterplot in the bottom row of Figure 1, the
relationship between yield and time appears to be less strong.
There may be a slight increase in yield over time, but again the
relationship is not linear and a transformation could be considered.
In Activity 2, we were undecided about the effect of time on yield.
Seeing all the data in the scatterplot gives us a slightly stronger
steer towards including this covariate in the model, but not
necessarily as a linear term.
• The third scatterplot in the bottom row of Figure 1, showing the
relationship between yield and nmp, is almost a mirror image of
the previous plot. We see a somewhat decreasing, but not linear,
relationship between these variables. Again, the scatterplot confirms
our conjecture from Activity 2 of a decreasing relationship, while
giving us further information about the possible shape of the model.
• The fourth scatterplot in the bottom row of Figure 1, showing the
relationship between yield and equiv, gives a similar picture to
the second. So, we deduce that there is some evidence of an
Now let us have a look at the six pairwise scatterplots for the four
explanatory variables in the scatterplot matrix in Figure 1 (in Example 1).
We will explore these in Activity 5.
Figure 2 Bar chart of values for the covariate temp in the desilylation dataset
To answer Question 2, you could think about what it is that makes the
desilylation dataset different from the datasets that you have seen
previously. Consider, for example, the FIFA 19 dataset introduced in
Unit 1. This dataset contains the values of several variables that have been
measured for 100 footballers. In Unit 2, we used the height and weight of a
The F -statistic for the fitted model summarised in Table 5 is 15.98 with 4
and 25 degrees of freedom, and the associated p-value is p < 0.001. We’ll
explore the output for this fitted model in Activity 8.
(c) For the regression coefficient of temp, state what the associated
p-value is testing. What is the conclusion of this test?
(d) Interpret the estimated regression coefficient of temp.
(e) From their p-values, what do you conclude about the regression
coefficients of time, nmp and equiv?
In Activity 9, we’ll check the assumptions for our initial fitted model for
the desilylation data.
The residual plot and the normal probability plot after fitting the multiple
regression model
yield ∼ temp + time + nmp + equiv
to the desilylation data are given in Figure 3(a) and Figure 3(b),
respectively.
Figure 3 The residual plot (a) and normal probability plot (b) for yield ∼ temp + time + nmp + equiv
(a) Does the residual plot in Figure 3(a) support the assumption that the
Wi ’s have zero mean and constant variance?
(b) What shape does the pattern of the points in the residual plot in
Figure 3(a) remind you of?
(c) Does the normal probability plot in Figure 3(b) support the
assumption that the Wi ’s are normally distributed?
By inspecting the residual plot and the normal probability plot given in
Figure 3 for the fitted model
yield ∼ temp + time + nmp + equiv,
it is clear that the model assumptions on the Wi ’s are violated. Why is
that, and what could be our next steps?
Well, the scatterplot matrix in Figure 1 (in Example 1, Subsection 1.3)
suggested that, while there was clearly a relationship between the response
and each of the explanatory variables, this relationship was not necessarily
linear. We have, however, just fitted a model with a linear relationship
between the response yield and the explanatory variables. So, the
resulting pattern in the residual plot reflects the non-linear relationships
visible in the scatterplot matrix.
Figure 4 Plot of (a) the residuals and (b) the means of the residuals against the values of temp
We could continue in the same manner to plot the residuals and the values
of each of the other explanatory variables in turn, to see if a similar
pattern emerges. Alternatively, we can just transform all of the
explanatory variables by taking the square of each one, use all of these
transformed variables as covariates in the model, and then use formal tests
to see which of the transformed variables are significant. We will pursue
the latter strategy here. There is, however, more than one way that we
could include the transformed variables as covariates in the model. Should
we simply replace the original covariate with the transformed version, or
should we also keep the original covariate in the model in addition to the
transformed version? For example, should we just include temp2 as a
covariate in the model, or should we include both temp and temp2 as
covariates in the model? When analysing the films dataset in
Subsection 4.2.2 of Unit 2, the variable screens was transformed to
become screens³ to improve the model fit, and we did not then include
the untransformed variable in the model. For the desilylation dataset,
however, there are two good reasons why we may want to include both the
transformed and untransformed variables as covariates in the model.
In order to explain the first reason, let’s first consider how it was decided
to fit screens3 to the films dataset. A scatterplot of the response variable
and screens had revealed a non-linear pattern. Consequently, several
transformations of screens had been tried, and the scatterplot of the
response and screens³ turned out to show a reasonably linear pattern.
Therefore the transformation screens³ was adopted. For the desilylation
dataset, we looked instead at a scatterplot of the residuals and the fitted
values (in Figure 3(a), Subsection 2.2) and a scatterplot of the residuals
and the values of temp (in Figure 4(a)). That means that on the vertical
axis we did not plot the response variable, but instead we plotted the
residuals after already fitting a model including the untransformed
variables as covariates. This gives an indication that we may need both the
linear terms in the model (that is, the untransformed variables) and the
quadratic terms (that is, the transformed squares). If these terms are not
all needed in the model, we can always remove insignificant terms using a
model selection approach such as the Akaike information criterion (AIC).
(A reminder of the AIC is given in Box 4, Subsection 2.4.)
The second reason why we may want to include both the transformed and
untransformed variables as covariates in the model, is because quadratic
models without linear terms lack flexibility. To illustrate, for simplicity
consider a model with just one variable x. Figure 5 shows typical graphs of
a quadratic function of the form y = a + bx + cx², that is, a function with
an intercept a, a linear term bx and a quadratic term cx².
Figure 5 Quadratic functions with (a) various values of the quadratic parameter c, (b) various values of the
intercept parameter a, (c) various values of the intercept parameter a and the (negative) quadratic parameter c,
(d) various values of the linear parameter b
All the graphs in Figure 5 have the typical parabola shape of quadratic
functions. If the quadratic parameter c is positive, the parabola will be
open at the top, and if c is negative, the graph is flipped to be open at the
bottom.
• Figure 5(a) shows that varying the (positive) quadratic parameter c
changes how quickly the graph increases.
• Figure 5(b) shows that changing the intercept parameter a will shift the
graph up and down.
• Figure 5(c) shows the combined effect of varying the intercept
parameter a and the quadratic parameter c, this time with negative
values of c.
In Figures 5(a), (b) and (c), the minimum (when c > 0) or maximum
(when c < 0) value of the graph is always attained when x = 0. This
cannot be changed by changing the values of a or c.
• Figure 5(d) shows quadratic functions when the linear parameter b is
varied. You can see that this will give us the flexibility to have a
minimum or maximum at a different value of x, as the short derivation
below makes explicit.
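To see why, note that when c ≠ 0 the quadratic can be rewritten by completing the square as
        y = a + bx + cx² = c(x + b/(2c))² + a − b²/(4c),
so its minimum (when c > 0) or maximum (when c < 0) is attained at x = −b/(2c). When there is no linear term (b = 0), this turning point is forced to be at x = 0; a non-zero b moves it away from 0.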
In the desilylation experiment, the researchers are interested in finding the
values of the explanatory variables where the maximum response is
obtained. Therefore, we require a model that is flexible enough to have the
maximum at any plausible combination of values of the explanatory
variables, and not just when they are all equal to 0 (as would happen if
there were no linear terms, as in Figures 5(a), (b) and (c)).
When we look at the scatterplots of the response variable versus each
explanatory variable in turn on the bottom row of the scatterplot matrix
given in Figure 1 (Subsection 1.3), we can see that the maximum is not
exactly in the centre (horizontally) of each plot, which corresponds to
when each explanatory variable is equal to 0. In the films example in
Subsection 4.2.2 of Unit 2, on the other hand, it is clear that the minimum
of the response variable income will be attained when there are no
screens. Therefore, in that case, it was sufficient to just use the
transformed variable, screens³, without keeping the original variable
screens.
So, we’ve decided that we want to investigate adding quadratic terms to
our model for yield, in addition to the linear terms for each explanatory
variable. What other terms could we consider adding?
Well, the Solution to Activity 2 noted that the yield of alcohol seemed to
be affected differently by an increase in time, depending on the
temperature of the experiment. In order to accommodate differences
between the effect of one of the explanatory variables on the response for
different values of another explanatory variable, an interaction between the
two explanatory variables can be added to the model. So, we should also
consider adding the interaction between temp and time – that is, the
interaction term temp:time – to the model.
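Looking ahead, the model eventually considered (Model (4)) includes the four covariates, their squares and the two-way interactions. Purely as an illustration of how such a model could be specified in software, here is a sketch using Python and the statsmodels package (this is not necessarily the software used in M348, and the file name desilylation.csv is an assumption; the response is renamed because yield is a reserved word in Python):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed: the standardised desilylation data with columns
# yield, temp, time, nmp and equiv.
data = pd.read_csv("desilylation.csv").rename(columns={"yield": "yield_"})

# Linear terms, squared terms (via I(...)) and all two-way interactions.
formula = ("yield_ ~ temp + time + nmp + equiv"
           " + I(temp**2) + I(time**2) + I(nmp**2) + I(equiv**2)"
           " + temp:time + temp:nmp + temp:equiv"
           " + time:nmp + time:equiv + nmp:equiv")

model4 = smf.ols(formula, data=data).fit()
print(model4.summary())
```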
What can we ascertain from this output? We’ll begin by exploring the
results shown in Table 6 in Activity 10, and then we will compare our
results to those for Model (3) (that is, the model with just the individual
covariates as explanatory variables).
(a) The p-value for testing that all regression coefficients are 0 is less
than 0.001. Name the distribution this p-value is obtained from and
give (and explain) its degrees of freedom.
(b) Test whether the regression model contributes information to
interpret changes in the yield of the alcohol.
(c) Which distribution are the p-values for the individual coefficients
calculated from? What’s the value of the degrees of freedom?
(d) Comment on the individual significance of all regression coefficients
by category (linear, quadratic, interaction). Would you consider
removing any terms from the model?
The p-value for the interaction time:nmp indicates that there is only weak
evidence to include this term in the model, if all other terms are in the
model. So, is this interaction needed in the model? To answer this
question, we should compare the models with and without the interaction
time:nmp, to see which model is preferable.
Box 4 provides a reminder of how the model selection criteria from
Subsection 5.2 in Unit 2 can be applied.
After selecting our model, the next step in the data analysis is now to
check the model assumptions through diagnostic plots. We will do this
soon in Subsection 2.5, but before we do, we’ll take a bit of a detour and in
Activity 12 we’ll consider the estimates for the coefficients that are
common to both Model (4) (which includes the four covariates, their
squares and the two-way interactions) and Model (3) (which just includes
the four covariates).
Compare the estimates for the linear coefficients for Model (4) in Table 6
with those for Model (3) in Table 5. For convenience, the relevant parts of
Tables 5 and 6 are repeated here as Tables 8 and 9, respectively.
Table 8 Estimates for the linear coefficients after fitting Model (3)
Table 9 Estimates for the linear coefficients after fitting Model (4)
Did you expect this result? Where in M348 have you seen a similar
situation with a different outcome?
have the same estimates for their common parameters. It turns out that
the desilylation dataset is a special case. So, why is this happening? What
makes the desilylation dataset different to the datasets seen in Unit 2?
Well, the difference lies in the way the desilylation experiment has been
designed. The statistician who helped the chemists design the experiment
deliberately chose to do it this way. This is why the scatterplots of
pairwise explanatory variables in Figure 1 (Subsection 1.3) have this
particularly symmetric and balanced structure. (A detailed discussion of
how exactly to approach the design problem is beyond the scope of M348.)
Figure 6 The residual plot (a) and normal probability plot (b) for Model (4)
(a) Does the residual plot in Figure 6(a) support the assumption that the
Wi ’s have zero mean and constant variance?
(b) Does the normal probability plot in Figure 6(b) support the
assumption that the Wi ’s are normally distributed?
The plots for Model (4) in Figure 6 show a big improvement over their
counterparts from our initial model, Model (3), in Figure 3
(Subsection 2.2). It therefore seems reasonable to conclude from
Activity 13 that the assumptions on the Wi ’s are satisfied for Model (4).
Another step in the assessment of Model (4) is to look for influential
points. Recall from Subsection 3.2 of Unit 2 that a data point is likely to
be influential if it has both high leverage and a large residual. An easy way
to look for potentially influential points is by using the residuals versus
leverage plot; this plot for Model (4) is shown in Figure 7.
[Figure 7: the residuals versus leverage plot for fitted Model (4), plotting standardised residuals against leverage; point 13 is labelled.]
The residuals versus leverage plot in Figure 7 shows that there are only
two different values of leverage: six points have leverage of
approximately 0.17, while all the other points have leverage of just
under 0.6. (Again, this is a feature of the design of the experiment for the
desilylation dataset. The six points with low leverage are in the centre of
the experimental region, i.e. where all covariates have the (standardised)
value of 0, and the other points all have the same distance from the
centre.) The most influential points are therefore those which are in the
higher-leverage group and have large standardised residuals. Here,
point 13 in the top-right corner of the plot is most influential as its
standardised residual has the highest absolute value.
In order to decide whether point 13 in Figure 7 actually is an influential
point, we can use the Cook’s distance, as described in Subsection 3.3 of
Unit 2. A Cook’s distance plot gives a graphical representation of the
Cook’s distance values for a dataset; this plot for Model (4) is shown in
Figure 8.
[Cook’s distance plotted against observation number; the largest Cook’s distance values belong to observations 13, 11 and 24.]
Figure 8 The Cook’s distance plot for fitted Model (4)
Explain why the Cook’s distance plot given in Figure 8 suggests that there
aren’t any problems with influential points for these data and Model (4).
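If the model had been fitted in software along the lines of the earlier sketch, the leverages and Cook’s distances plotted in Figures 7 and 8 could be extracted as follows (again an illustrative sketch, with the same assumed data file, rather than the module’s own software):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Refit Model (4) as in the earlier sketch (same assumed data file).
data = pd.read_csv("desilylation.csv").rename(columns={"yield": "yield_"})
model4 = smf.ols(
    "yield_ ~ temp + time + nmp + equiv"
    " + I(temp**2) + I(time**2) + I(nmp**2) + I(equiv**2)"
    " + temp:time + temp:nmp + temp:equiv + time:nmp + time:equiv + nmp:equiv",
    data=data).fit()

influence = model4.get_influence()
leverage = influence.hat_matrix_diag               # leverage of each point
std_resid = influence.resid_studentized_internal   # standardised residuals
cooks_d = influence.cooks_distance[0]              # Cook's distance values

# List the three observations with the largest Cook's distances
# (numbering observations from 1, as in Figure 8).
largest = sorted(enumerate(cooks_d, start=1),
                 key=lambda t: t[1], reverse=True)[:3]
print(largest)
```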
It’s usually a good idea to have a look at the data first. Figure 9 shows the
observed values of the optical density plotted against the day of
observation, with the observations for untreated samples distinguished
from those for treated samples.
Figure 9 Values of the optical density on two different days for treated and
untreated samples
(a) From Figure 9, would you say that the optical density is affected by
the toxin (factor treatment)? Is this what you had expected?
(b) What about the factor day? Is the distribution of responses different
on different days of measuring, and would you have expected this
result?
(c) Would you say the effect of the factor treatment is roughly the same
on both days?
One way to select a model more rigorously than just by looking at the data
is to look at the values of the adjusted R2 statistic and the AIC for
competing models. Table 11 provides these values for various linear models
for the cells data.
Table 11 Adjusted R2 and AIC for various linear models for the cells data
We’ll consider which model we’d prefer based on the results given in
Table 11 next, in Activity 17.
Is this the model that you expected to be the preferred model from
Activity 16?
Following on from Activity 17, based on the adjusted R2 statistic and the
AIC, our preferred model for the cells dataset is
opticalDensity ∼ treatment + day. (5)
The output after fitting this model to the cells data is given in Table 12.
Table 12 Summary of output after fitting Model (5) to the cells data
Recall from Subsection 3.1 of Unit 3 that, when we have a factor, there’s a
parameter estimate for the effect for each level of the factor in comparison
to the effect of the first level of the factor. So, ‘treatment 1’ means that
this row corresponds to the indicator variable for when the sample is
treated with the toxin, that is, when treatment = 1, and ‘day 2’ means
that this row corresponds to the indicator variable for the second day, that
is, when day = 2. The factor levels treatment = 0 and day = 1 have been
used as the baseline levels, and the effects of these are part of the intercept
parameter.
The output given in Table 12 is very similar to what we have seen for
covariates, that is, for continuous explanatory variables: the table gives a
row for each parameter estimate, its standard error and the p-value
associated with it. What are the p-values testing? Let’s have a closer look
in Activity 18.
The row corresponding to the factor treatment has a p-value less than
0.001.
(a) What is the null hypothesis and what is the alternative hypothesis
corresponding to this p-value?
(b) What is the value of the test statistic for the test in part (a), and
what would be its distribution if the null hypothesis were true?
(c) Based on the evidence in Table 12, do you think the toxin affects cell
survival?
Here, the ESS for the factor treatment is the variability in optical density
that is explained by the different treatments (toxin or no toxin), given that
the factor day has also been fitted to the model, and the ESS for the factor
day is the variability in optical density that is explained by the different
days, given that the factor treatment has also been fitted to the model.
We actually get the same p-values for the factors in both Table 12 and
Table 13. This is because the factors here have only two levels.
For factors with more than two levels, k say, we would get k − 1 p-values in
the regression table (for the coefficients comparing each level of the factor
with the baseline level). This can make the table of regression coefficients
look quite messy! In contrast, we only have one p-value associated with
each factor in the ANOVA table.
Also recall that, when there are more than two levels for a factor, the
p-values for the individual levels of the factor aren’t terribly useful. This is
because we either include the factor, in which case we need to include the
corresponding indicator variables for all of the factor levels, or we don’t
include the factor, in which case we don’t need any of the indicator
variables for the factor levels. So, when the factors have more than two
levels, the F -values in the ANOVA table are used for testing whether each
factor is required in the model. In this case, the p-value for a factor in the
ANOVA table will not be the same as the p-values associated with each
level of the factor.
The ANOVA table is useful for presenting the information required for
testing whether each factor is required in the model in a concise way, but
the regression table is also useful since it might have extra information
that is needed for specific questions of interest. For example, if we wanted
to predict a new response, we would need the coefficients from the
regression table.
Figure 10 The residual plot (a) and the normal probability plot (b) for fitted Model (5)
The diagnostic plots in Figure 10 do not flag up any cause for concern.
The residual plot of residuals and fitted values shows reasonably equal
spread about 0, and the normal probability plot shows that the
standardised residuals are close to the straight line. Minor deviations in
these plots are expected as we have a very small dataset. We conclude that
Model (5) provides a good fit for the cells data.
Consider the plot of data from the cells dataset given in Figure 9
(Subsection 3.1) and answer the following questions.
(a) Suppose that the biologists had measured all of the treated samples
on Day 1 and all the untreated ones on Day 2. What could be a
problem here?
(b) Now suppose that the biologists had measured all of the untreated
samples on Day 1 and all the treated ones on Day 2. What could be a
problem this time?
(c) Hence, was it a good idea to plan the experiment in the way it was
run?
That concludes our review of linear models. In the next section, we’ll turn
our attention to generalised linear models.
How do the linear models we have met so far in this unit fit into this
framework? Well, in linear models, the distribution for the response
variable is the normal distribution, and the link function, g, is the ‘identity
link’, so that g(E(Yi )) = E(Yi ). So, when modelling a dataset using a
linear model, we therefore only had to decide which terms to include in the
linear predictor.
The definition of a GLM allows us to be more flexible. We can now fit
responses that are not normally distributed, such as count data. This
means we have to make three decisions in the modelling process:
• which distribution we want to fit to the response variable
• which link function to use
• which terms to include in the linear predictor.
How do we know which distribution to pick? In many cases, a distribution
will spring to mind quite naturally considering the nature of the response.
Often, this will result in a good fit, but keep in mind that there may be
datasets where a different distribution provides a better fit. In this
module, we considered the following response distributions for GLMs.
• The normal distribution is a natural choice when the data are
continuous and their distribution roughly follows a bell-shaped curve.
For example, a normal distribution may be a natural choice to model the
heights of footballers.
• The Bernoulli distribution is a natural choice when the data have two
possible outcomes, often called ‘success’ and ‘failure’. For example, you
will either pass or fail your driving test, or a footballer taking a penalty
kick may, or may not, score a goal.
The final decision required for a GLM is which terms to include in the
linear predictor. For this, we can use the same techniques that we’ve used
for selecting which terms to include in a linear model.
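To make these three decisions concrete, here is a minimal Python sketch using statsmodels (not necessarily the module’s own software; the data frame and its columns are invented purely for illustration):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented count-response data.
df = pd.DataFrame({
    "counts": [2, 0, 5, 3, 8, 1, 4, 6],
    "exposure": [1.0, 0.5, 2.0, 1.5, 3.0, 0.8, 1.7, 2.4],
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
})

# The three modelling decisions:
#   response distribution -> Poisson (a natural choice for counts)
#   link function         -> log (the default for the Poisson family)
#   linear predictor      -> exposure + group
fit = smf.glm("counts ~ exposure + group",
              data=df,
              family=sm.families.Poisson()).fit()
print(fit.summary())
```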
In the following subsections, we will have a close look at three different
datasets, which we’ll use to build generalised linear models.
As an initial model for the citations dataset, let’s try fitting a Poisson
GLM (with a log link) of the form
numCitations ∼ yearDiff + journal + yearDiff:journal. (6)
We’ll consider this model in Activity 21.
Explain why Model (6) would be a first good GLM for the citations
dataset.
What happened here? Why are we not getting an estimate for the ‘slope’
of yearDiff when journal takes the value 2?
In order to answer this question, we need to have a look at the data. (In
fact, we should have done this before we even started proposing a model!)
Figure 11 shows a scatterplot of the number of citations (numCitations)
and the time since the article was published (yearDiff), with the different
journal types identified.
We’ll consider Figure 11 next in Activity 22.
Figure 11 Scatterplot of numCitations and yearDiff, with the different
values of journal identified
As there is only one observation for which journal takes value 2, could this
be the cause of the problem with the interaction term?
Unfortunately, the answer is ‘Yes’. Having only one observation in the
dataset for which journal takes value 2 means that we cannot estimate a
separate slope for this category. It is clear that in order to estimate a
slope, there must be at least two observations, or, in other words, we need
at least two points to draw a line (of best fit) between them. (Imagine you
tried to fit a line to the single observation for medical journals. You could
fit the intercept as the value of the response for this one data point, but
then there is no way to know what the slope should be.) So, sadly, we have
to give up on Model (6). What can we do?
Well, the best way round this problem is to get more data! (In this case,
that means more articles published in medical journals, and their numbers
of citations.) However, given the context of the data, this is easier said
than done! One possible solution in such a situation (where there are too
few data in some of the categories of a factor) could be to combine some of
the categories. This is sometimes called conflation. You need to make sure
though that this makes sense from the context. For example, here it might
seem tempting to have just two categories: one category for standard
statistics journals and a second category combining prestigious statistics
journals with medical journals. Activity 23 considers this approach.
When fitting Model (7), the null deviance is 462.801 with 22 degrees of
freedom, and the residual deviance is 80.933 with 19 degrees of freedom.
(a) Is there a significant gain in fit for Model (7) compared with the null
model?
(b) Do you think Model (7) is a good fit to the citations data? What
could be an issue here?
shows the standardised deviance residuals against index, taking the data in
order of publication date, plot (c) shows the squared standardised deviance
residuals against index number, again taking the data in order of
publication date (where the red circles denote positive residuals and the
blue triangles denote negative residuals), and plot (d) shows the normal
probability plot. We’ll consider Figure 12 in Activity 25.
[Four diagnostic plots, (a) to (d); plot (a) shows the standardised deviance residuals against the fitted values µ̂, and the axes of plots (b), (c) and (d) are as described above.]
Figure 12 Residual plots for Model (7) for the citations data
While we can fit all of the terms in Model (7), the residual deviance for the
model considered in Activity 24 suggests that the model may not be a
good fit, and the diagnostic plots considered in Activity 25 suggest that
some of the model assumptions may be questionable. We will revisit this
dataset later, in Subsection 6.2, to explore further modelling options.
Table 18 Number of patients experiencing toxicity out of the number of patients treated, by dose and biomarker group

                                      Dose (mg/m²)
Biomarker group        100     150     180     215     245     260     Total
Biomarker negative     0/5     0/4     0/4     0/6     2/7     1/1     3/27
Biomarker positive     1/6     0/4     0/8     2/4     –       –       3/22
Pooled data            1/11    0/8     0/12    2/10    2/7     1/1     6/49
Figure 13 shows typical shapes of the logistic regression curves for the
probability of toxicity for the two levels of the factor biomarker across the
values of the covariate logDose. Notice that each of the plots shows two
curves – one for each level of the factor.
• Figure 13(a) shows the situation where there is no interaction. In this
case, the curves for the two levels of biomarker have the same shape
and are simply shifted to the left or right according to the parameter for
biomarker. This is similar to the parallel slopes model that we
introduced in Section 2 in Unit 4.
• Figure 13(b) shows the situation where there is an interaction between
biomarker and logDose. In this case, the curves for the two levels of
biomarker can also differ in shape by being stretched or shrunk. This is
similar to the non-parallel slopes model that we introduced in Section 3
in Unit 4.
Figure 13 Typical shapes of the logistic regression curves for the probability of toxicity for models (a) with no
interaction and (b) with an interaction
Next, we’ll look at the fit of Model (9) and see whether all terms in the
model are needed, or if some of the parameters may be equal to 0. To do
this, we want to compare Model (9) with the most plausible of its nested
submodels. In logistic regression (and also more generally for generalised
linear models), we can compare nested models through their deviance
differences. To aid this comparison, consider Table 19, which provides the
residual deviances of Model (9) and several of its submodels.
Table 19 Residual deviances of various logistic regression models for the dose
escalation data
Let’s establish our preferred model in Activity 27, using what we’ve
learned about model comparisons for nested generalised linear models in
Units 6 and 7.
(a) What are the deviance differences between Model (9) and Models M2, M1 and M0, respectively?
(b) If all models fit the data, what is the chi-squared distribution which should be used to test whether there is a significant gain in choosing Model (9) over each of the respective Models M2, M1 and M0?
(c) The p-values associated with the test statistics from part (a), using the chi-squared distributions from part (b), are 0.0165, 0.0283 and 0.0115, respectively, for comparing Model (9) with Models M2, M1 and M0. Interpret these p-values in terms of the models.
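If you would like to reproduce p-values of this kind yourself, the calculation is just an upper-tail chi-squared probability. A minimal sketch in R, in which the deviance difference and degrees of freedom below are placeholders to be replaced by your answers to parts (a) and (b):

  # p-value for a deviance difference on d degrees of freedom
  dev.diff <- 5.7   # placeholder: use your deviance difference from part (a)
  d <- 1            # placeholder: use your degrees of freedom from part (b)
  pchisq(dev.diff, df = d, lower.tail = FALSE)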
We have now established that Model (9) provides a better fit to the dose
escalation data, compared with its submodels containing fewer terms. We
also need to check that our chosen model satisfies the assumptions that are
needed in order to use logistic regression.
Figure 14, given next, shows the diagnostic plots for Model (9).
Plot (a) in Figure 14 shows the standardised deviance residuals against a
transformation of µ̂, plot (b) shows the standardised deviance residuals
against index, taking the data in the order they were entered into the data
file, plot (c) shows the squared standardised deviance residuals against
index number, again taking the data in the order they were entered into
the data file (where the red circles denote positive residuals and the blue
triangles denote negative residuals), and plot (d) shows the normal
probability plot. Note that, for these data, the actual order in which the
data were collected is unknown.
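Plots of this kind can be obtained from any fitted GLM object. The sketch below uses base R functions and a hypothetical fitted model object fit9; note that it plots the standardised deviance residuals against the fitted values, rather than against the transformation of µ̂ used in the module's own plots.

  # 'fit9' is an assumed name for the fitted logistic regression model.
  sdr <- rstandard(fit9, type = "deviance")   # standardised deviance residuals
  plot(fitted(fit9), sdr)                     # residuals against fitted values
  plot(seq_along(sdr), sdr)                   # residuals against index number
  plot(seq_along(sdr), sdr^2)                 # squared residuals against index
  qqnorm(sdr); qqline(sdr)                    # normal probability plot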
Figure 14 Diagnostic plots for Model (9) for the dose escalation dataset: (a) the standardised deviance residuals against a transformation of µ̂, (b) the standardised deviance residuals against index, (c) the squared standardised deviance residuals against index (red circles denote positive residuals and blue triangles denote negative residuals), and (d) the normal probability plot.
Overall, there are some concerns about the assumptions for Model (9),
particularly in relation to the large (squared) deviance residuals for five of
the positive residuals where Yi = 1. With this many unusual
(according to the model) observations, we could try to accommodate them
by adding further terms to the model. However, the dataset includes only
six points where Yi = 1 in total – that is, only six out of 49 patients
experienced toxicity. (A low number of toxicity events is good for the
participants, but makes the modelling more difficult!) There is some
danger of overfitting if we add extra terms to Model (9): we might just be
matching this particular dataset very closely, but our fit might not be a
good prediction for future toxicity trials. We will revisit this issue in
Subsection 8.2 when we discuss strategies for dealing with outliers.
                      obese: no                  obese: yes
                      ageGroup                   ageGroup
ethnicity             4 to 5      10 to 11       4 to 5      10 to 11
Asian                  33241        36837          3537        12463
Black                  14800        17646          2607         7468
Mixed                  19171        17766          2194         5413
White                 227637       251319         24240        60565
Chinese/other           9830         9522          1062         3098
Unknown                55387        54686          5764        14355
We will now work through choosing log-linear models for this three-way
contingency table in Activity 29.
(a) How would you go about testing whether the model containing all
two-way interactions fits better than the model omitting the
interaction ethnicity:ageGroup? Find the value of an appropriate
test statistic and the distribution this comes from if both models fit
equally well.
(b) Suppose your test in part (a) gives a p-value close to 0. What is your
conclusion? What does this tell you about the comparison between
the model containing all two-way interactions and the other two
models where one interaction is omitted?
(c) Which model, if any, provides an adequate fit for the data? You can
assume that a value of 30.856 would give a p-value of less than 0.001
for a χ2 (5) distribution.
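For reference, a hedged sketch of how such a comparison could be carried out in R is given below, assuming the cell counts are stored in a data frame called children with a column count and factors obese, ageGroup and ethnicity (all assumed names).

  # Sketch only: 'children', 'count', 'obese', 'ageGroup' and 'ethnicity'
  # are assumed names for the data frame and its columns.
  full    <- glm(count ~ (obese + ageGroup + ethnicity)^2,
                 family = poisson, data = children)
  reduced <- update(full, . ~ . - ethnicity:ageGroup)
  anova(reduced, full, test = "Chisq")   # deviance difference and its p-value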
While it is disappointing that we could not find a well-fitting model for the
child measurements data (apart from the saturated model, which is not
very useful for predictions), we should also ask ourselves why we fitted a
log-linear model in the first place. Section 5 will introduce the concept of
the research question behind a study, and how this affects our modelling
approach. In Section 5, we will revisit several datasets, in particular the
cells, the child measurements, the dose escalation and the desilylation
datasets.
5 Relating the model to the research question
In this section, we'll revisit some of the datasets from earlier in this unit:
• the cells dataset (from Subsection 3.1) in Subsection 5.1
• the child measurements dataset (from Subsection 4.4) in Subsection 5.2
• the dose escalation dataset (from Subsection 4.3) in Subsection 5.3
• the desilylation dataset (from Subsection 1.1) in Subsection 5.4.
For each dataset, we'll identify the research question of interest and a
statistical model to try to find its answer.
strong evidence that the toxin does indeed decrease the proportion of
surviving cells in a sample.
In Unit 3, and in Section 3 of this unit, we saw that datasets where all
explanatory variables are factors can be analysed by either creating an
ANOVA table or by reporting the output of a linear regression model in a
table of coefficients. Which one would be preferred for the cells dataset,
given the research question? We’ll consider this in Activity 31.
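Both summaries can be produced from the same fitted model object in R. A minimal sketch, assuming a fitted linear model object called fitCells (an assumed name) for the cells data:

  # 'fitCells' is an assumed name, for example from
  # fitCells <- lm(opticalDensity ~ treatment + day, data = cells)
  anova(fitCells)     # ANOVA table: one F-test per factor
  summary(fitCells)   # table of coefficients: one t-test per indicator variable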
We’ve seen in this subsection that the research question is important for
how we analyse the data. It is therefore essential to find out as much as
possible about the study from the researchers who are conducting it. If in
doubt, ask!
This model means that the dose affects the probability of a toxicity outcome
differently in the two biomarker groups. In other words, the dose where
toxicity is expected to have probability 0.16 is different in the two
biomarker groups, and we are therefore looking for two different doses. A
quick and easy way of finding these doses is through a graphical approach.
Figure 15 shows the fitted probabilities of toxicity for the dose escalation
data for the two biomarker groups. In each plot, the horizontal arrow is at
the value where the probability of toxicity is 0.16.
Figure 15 Plots of the fitted probabilities for (a) biomarker negative patients, and (b) biomarker positive
patients
In Activity 35 you will find the recommended doses for each biomarker
group of patients by looking at the graphs in Figure 15.
The scale in the graphs in Figure 15 is not fine enough to allow reading off
the exact values of the required dose for a value of 0.16 for the probability
of toxicity. In order to get a more accurate result, we need to use our fitted
model equation.
Recall that we can predict the values for p, the probability of toxicity, at
given dose levels for each biomarker group, by using the fitted model
equation
log(p̂/(1 − p̂)) = η̂,    (12)
where η̂ is the fitted value of the linear predictor. For our particular
logistic regression model
toxicity ∼ logDose + biomarker + logDose:biomarker,
for biomarker negative patients we get the fitted model equation
log(p̂/(1 − p̂)) = −424 + 529 logDose
               = −424 + 529 log(dose/200 + 1),    (13)
and for biomarker positive patients we get the fitted model equation
log(p̂/(1 − p̂)) = −4.3 + 4.1 logDose
               = −4.3 + 4.1 log(dose/200 + 1).    (14)
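As an aside, fitted probabilities are easy to evaluate from these equations. A small sketch, using the rounded coefficients of Equations (13) and (14), R's inverse logit function plogis(), and the assumption that logDose is the natural logarithm of (dose/200 + 1):

  # Fitted probability of toxicity at a given dose (in mg/m^2), using the
  # rounded coefficients of Equations (13) and (14); natural log assumed.
  p.negative <- function(dose) plogis(-424 + 529 * log(dose / 200 + 1))
  p.positive <- function(dose) plogis(-4.3 + 4.1 * log(dose / 200 + 1))
  p.negative(245)   # fitted probability for a biomarker negative patient
  p.positive(180)   # fitted probability for a biomarker positive patient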
However, we are not interested in predicting values for the probability of
toxicity, p. Our aim, instead, is to find the dose where p equals 0.16. How
can this be done? The answer is that we can substitute p̂ = 0.16 into
Equations (13) and (14), and then solve for dose. Substituting this value of
p̂ in these equations, the left-hand side of each equation becomes
log(p̂/(1 − p̂)) = log(0.16/(1 − 0.16)) ≃ log(0.1905) ≃ −1.6582.
So, in order to find the dose for the biomarker negative group when
p̂ = 0.16, we need to find the value of dose which satisfies
−1.6582 = −424 + 529 log(dose/200 + 1),    (15)
and to find the dose for the biomarker positive group when p̂ = 0.16, we
need to find the value of dose which satisfies
−1.6582 = −4.3 + 4.1 log(dose/200 + 1).    (16)
Use Equations (15) and (16) to calculate the dose required for each
biomarker group when p̂ = 0.16.
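One way to carry out this calculation without rearranging the equations by hand is to solve them numerically. A sketch in R, again assuming that logDose is the natural logarithm of (dose/200 + 1):

  # Solve Equations (15) and (16) numerically for the dose at which the
  # fitted log-odds equal log(0.16/0.84).
  target <- log(0.16 / (1 - 0.16))
  f.neg  <- function(dose) -424 + 529 * log(dose / 200 + 1) - target
  f.pos  <- function(dose) -4.3 + 4.1 * log(dose / 200 + 1) - target
  uniroot(f.neg, interval = c(100, 260))$root   # biomarker negative group
  uniroot(f.pos, interval = c(100, 260))$root   # biomarker positive group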
So, we’ve seen how we can use our fitted model to address the research
question of interest. However, before recommending these doses, it’s
always a good idea to check how sensible the results seem to be.
We’ll start by looking back at the fitted curves given in Figure 15. The
graph in Figure 15(a) for the biomarker negative group shows a very steep
S-shaped curve, whereas the graph in Figure 15(b) for the biomarker
positive group shows a slowly increasing curve. Why do they look so
different?
Well, the first thing to note is that the graph for the biomarker positive
group would be S-shaped if we plotted it over a larger range of the dose.
So, although it’s not clear from the plots, in fact both curves are S-shaped.
However, the graph for the biomarker negative group causes some concern.
While having the desired S-shape of the logistic regression model, it is too
steep for a statistician’s liking. It would seem rather unlikely that the
probability of toxicity in the biomarker negative group is almost 0 up to a
dose of about 240 mg/m2 , and then goes up to one when reaching a dose of
about 250 mg/m2 .
In addition to having some concern about the fitted curve for the
biomarker negative group, we also weren’t entirely happy with the
diagnostic plots for this model given in Figure 14 (Subsection 4.3). So,
how else might we check whether our results seem to be sensible? Well,
there’s another check that we can do for dose escalation trials.
One of the features of a dose escalation trial is that several patients are
given the same dose. For example, from Table 18 (Subsection 4.3), we can
see that five patients in the biomarker negative and six patients in the
biomarker positive group were given dose 100 mg/m2 . For each of the
doses used in the trial, we can therefore find an estimate for the
probability of toxicity by simply dividing the number of patients who
experienced toxicity at this dose by the total number of patients who
received this dose, separately for the two biomarker groups. Table 23
shows these estimates for the dose escalation dataset.
Table 23 Estimated probability of toxicity by dose and biomarker group

                                    Dose (mg/m²)
Biomarker    100           150       180       215         245           260
negative     0/5 = 0       0/4 = 0   0/4 = 0   0/6 = 0     2/7 ≃ 0.286   1/1 = 1
positive     1/6 ≃ 0.167   0/4 = 0   0/8 = 0   2/4 = 0.5   –             –
Comparing these results with the recommended doses from Model (9), we
find that for the biomarker positive group, the values essentially coincide
(181 mg/m2 using the fitted model in Activity 36 and 180 mg/m2 using the
direct estimation in Activity 37), which is strong evidence this dose can be
recommended for this group. In addition, the plot of fitted probabilities in
this group is what we would expect, so the results from Model (9) should
be ‘trustworthy’ in the biomarker positive group.
For the biomarker negative group, however, the values are quite far apart
(244 mg/m2 using the fitted model in Activity 36 and 215 mg/m2 using the
direct estimation in Activity 37). We also had some concern about the
steepness of the S-shaped curve of fitted probabilities in this group. In
hindsight, it would have been good if an intermediate dose, such as
230 mg/m2 , had been investigated in this group. Overall, with the tools
and the data we have, we cannot give a strong recommendation for a
specific dose in the biomarker negative group. In this type of situation in
practice, medical statisticians might use more sophisticated statistical
methods that allow them to incorporate prior knowledge into the model, or
they might take forward more than one dose to the next stage of the trial.
Any of these methods will require close collaboration with the clinicians.
In any case, it is worth noting that the recommended doses will be different
for the two groups. This again highlights the importance of considering all
variables that may affect the response. In this example, we accounted for
potential differences between patients through the biomarker. If this
information had been neglected, we would have recommended the same
dose for all patients. It is likely that this dose would have unacceptable
levels of toxicity for the biomarker positive group, while being too low to
reach the best level of efficacy for the biomarker negative group.
Following on from Activity 38, we would now like to maximise the fitted
model – that is, find the values of the covariates at which the fitted mean
yield is largest. Luckily, any statistical software will do this for us;
maximisation in R provides the results given in Table 24.
Table 24 Optimal values for the variables in the desilylation dataset
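For interest, a hedged sketch of how such a constrained maximisation could be set up in base R is given below. It assumes a fitted second-order model object, here called fitDesi (a hypothetical name), whose formula is written in terms of the standardised variables temp, time, nmp and equiv.

  # Sketch only: 'fitDesi' is an assumed name for the fitted model for yield.
  predictedYield <- function(x) {
    predict(fitDesi, newdata = data.frame(temp = x[1], time = x[2],
                                          nmp = x[3], equiv = x[4]))
  }
  # Maximise the predicted yield, constraining each variable to [-2, 2].
  optim(par = c(0, 0, 0, 0), fn = predictedYield,
        method = "L-BFGS-B", lower = -2, upper = 2,
        control = list(fnscale = -1))   # fnscale = -1 makes optim maximise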
We now have a prediction interval for the yield of the alcohol of interest
when the reaction is run at the estimated optimal covariate values in
Table 24. What is the next step for the chemists to optimise the
production process of the alcohol? We’ll explore this question in
Activity 41.
The original ranges for the covariates used in the desilylation experiment
(that is, temp0, time0, nmp0 and equiv0) were provided in Table 4 in
Subsection 1.4. The variables were standardised by setting the lowest value
for each respective variable to −2 and the highest value to 2. The
midpoint of the range was then set to 0, and so on. We can reverse the
standardisation to get the optimal estimated covariate values in their
original scale as given in Table 25.
Table 25 Optimal values for the variables in the desilylation experiment in
original units
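The back-transformation itself is only a linear rescaling: a standardised value of −2 corresponds to the lowest original value, +2 to the highest, and 0 to the midpoint. A small sketch of the idea, where the lower end of the range for time0 used below is illustrative rather than the actual value:

  # Map a standardised value z back to the original scale, given the lowest
  # and highest original values (which were mapped to -2 and +2).
  unstandardise <- function(z, low, high) {
    (low + high) / 2 + z * (high - low) / 4
  }
  unstandardise(2, low = 19, high = 31)   # returns 31; 'low = 19' is illustrative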
Notice that the value for time0 is at the upper limit of the range for this
covariate (31 hours in the original scale or 2 in the standardised scale). In
the optimisation procedure where we found these values, we had implicitly
constrained the values of the variables to the interval [−2, 2] in the
standardised scale, so that we only looked for the maximum in this range
for each of the covariates. Why did we do that, and does this mean a
higher yield is possible outside this range?
We constrained the ranges of the covariates to [−2, 2] in order to avoid
extrapolation. You have seen in Subsection 6.1 of Unit 1 that a fitted
model is only valid for prediction for values of the explanatory variables
within the range of values in the sample of the data. There is no guarantee
that the same relationship between the response variable and the
explanatory variable will hold outside the range of data used to calculate
the fitted model. Going outside this range is extrapolation, which can lead
to incorrect conclusions.
If we had not constrained the optimisation routine to only look for the
maximum yield when time is less than or equal to 2, then R might have
given us a higher maximum yield than the one we found, with a value of
time greater than 2. However, this would be extrapolation, since we have
no observed responses when time is greater than 2, and therefore we would
not be able to trust these results. The model may not be valid in this area.
In this situation, the chemists have two options to ensure efficient mass
production of the alcohol. The first one is to set their equipment to the
values in Table 25. This way, they have a prediction interval which has
been derived from a reliable model. The second one is to increase the
sample size. In particular, they could run the reaction a few times for
larger values of time0 and then re-estimate the model including these new
observations. This way, they might find an even higher yield for the
alcohol of interest.
In deciding what to do, the chemists may also take into account the costs
of producing the alcohol on a large scale. For example, even if a single run
of the reaction produces slightly more alcohol when run for 35 hours, say,
they may still produce more alcohol in total by going with the 31 hours in
Table 25, because the shorter reaction time allows them to run the reaction
more often on the same equipment! Of course, the costs for heating, for the
materials and for equipment maintenance also need to be taken into account.
6 Another look at the model assumptions
Figure 16 Scatterplot of numCitations and yearDiff, with the different
values of journal identified (repeat of Figure 11)
Recall that we’ve already seen an example of using a linear model to model
a dataset with a count response when we analysed the Olympics dataset in
Unit 5. In that unit, we found that the fitted linear model worked well for
these data. So, let’s try fitting the linear model
numCitations ∼ yearDiff + journal (17)
to data from the citations dataset. A summary of the output after fitting
the model is given in Table 26.
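The corresponding R call is a standard linear model fit; a minimal sketch, again assuming the data are held in a data frame called citations (an assumed name):

  # Fit Model (17): a linear model for the count response.
  fit17 <- lm(numCitations ~ yearDiff + journal, data = citations)
  summary(fit17)   # produces the kind of output summarised in Table 26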
What can we learn from the output in Table 26? We’ll explore the
interpretation of the results in Activity 43.
Figure 17 The residual plot (a) and normal probability plot (b) for Model (17)
(a) Does the residual plot in Figure 17(a) support the assumption that
the Wi ’s have zero mean and constant variance?
(b) Does the normal probability plot in Figure 17(b) support the
assumption that the Wi ’s are normally distributed?
Suppose we end up with two models with similar values of the AIC, but
different sets of explanatory variables in the linear predictor.
First, the AIC is only a relative measure. It can be used to decide between
models, but it does not tell us anything about the quality of the fit, or
whether the model assumptions are satisfied. We need to use model
diagnostics on both models. If neither of the two models satisfies the model
assumptions, we will need to look for a better model. If only one of them
satisfies the model assumptions, we can pick that one. But what do we do if
both models are appropriate?
There is nothing wrong with reporting two models. You could go back to
the researchers who collected the data, and discuss your findings with
them. This may help them (and you) to interpret the results.
Alternatively, or in addition, you could split the data into a test and a
training dataset as in Subsection 4.3 of Unit 5 and assess how well the
models found by fitting the training data can predict the test data.
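A rough sketch of such a split, for a generic data frame dat with a response column y and two candidate model formulas formula1 and formula2 (all assumed names), might look as follows.

  # Sketch only: 'dat', 'y', 'formula1' and 'formula2' are assumed names.
  set.seed(1)                                   # reproducible split
  trainRows <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
  train <- dat[trainRows, ]
  test  <- dat[-trainRows, ]
  fit1 <- lm(formula1, data = train)
  fit2 <- lm(formula2, data = train)
  # Compare mean squared prediction error on the test data.
  mean((test$y - predict(fit1, newdata = test))^2)
  mean((test$y - predict(fit2, newdata = test))^2)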
To round off this section, we’ll revisit the Olympics dataset and use R to
compare some different potential models.
7 To transform, or not to transform, that is the question
Figure 18 (a) Scatterplot of the residuals against the covariate yearDiff and (b) comparative boxplot of the
factor journal for Model (17)
Do the two plots in Figure 18 support the assumption that the Wi ’s have
zero mean and constant variance?
Following on from Activity 45, from the scatterplot of the residuals against
the values of yearDiff, it looks like the variance increases as the time
since publication increases. This makes intuitive sense, since soon after
publication, when only a few researchers have read the article, there will
generally be a small number of citations, so the variability of
numCitations when yearDiff is small will also be small. However, an
article that has been published for a long time may get a large number of
citations or very few, depending on how interesting the research in the
article is to other researchers. This means that the variability of
numCitations when yearDiff is large is likely to be larger than when
yearDiff is small.
We could try a variance stabilising transformation of the response variable
numCitations. Here, we want to decrease the variance as yearDiff
increases, so we could try going down the ladder of powers. Typical
transformations in this case are the square root and the log
transformation. The increasing vertical spread of the residuals with the
values of yearDiff didn’t look too severe, so we should start with the
‘mildest’ transformation, in this case the square root.
Figure 19 Square root of numCitations versus yearDiff, with the journal
type identified
A parallel slopes model with constant error variance looks promising for
the data in Figure 19.
Figure 20, given next, shows the diagnostic plots for the citations data
after fitting the linear model
√numCitations ∼ yearDiff + journal.    (18)
Figure 20 also shows the scatterplot of the residuals for fitted Model (18)
against the values of yearDiff, and a comparative boxplot of these
residuals for the factor journal.
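A sketch of how Model (18) and plots like those in Figure 20 could be produced in R, once again assuming a data frame called citations (an assumed name):

  # Fit Model (18): square root of the count response.
  fit18 <- lm(sqrt(numCitations) ~ yearDiff + journal, data = citations)
  plot(fitted(fit18), resid(fit18))                    # residual plot
  qqnorm(rstandard(fit18)); qqline(rstandard(fit18))   # normal probability plot
  plot(citations$yearDiff, resid(fit18))               # residuals against yearDiff
  boxplot(resid(fit18) ~ citations$journal)            # residuals by journal type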
Figure 20 Diagnostic plots after fitting Model (18): (a) residual plot, (b) normal probability plot,
(c) scatterplot of residuals against yearDiff, and (d) comparative boxplots of residuals for the three levels of
journal
8 Who’s afraid of outliers?
Suppose, for example, that in the FIFA 19 dataset you find that a
footballer’s height has been recorded as 7 inches. This is clearly a mistake!
In this case, you have two options. The first one is to delete the value of
height for this footballer and to analyse the data with this particular
value missing. The other option is to try and find the correct value for the
footballer’s height. In this particular example, this may be possible as it is
likely that the heights of these international footballers have been recorded
in more than one database. Make sure you use a reliable source, though!
Similarly, if in the desilylation dataset one of the (standardised) values of
the variable temp, say, had been recorded as 10, this is very likely a
mistake, since the chemists had decided that all (standardised) values of
temp are between −2 and 2. In such a case, it is recommended to go back
to the person who recorded the results, if at all possible, to confirm the
correct value. It may be tempting to use the symmetry of the design you
saw in the scatterplot matrix in Figure 1 (Subsection 1.3) to find out
which value should be the correct one, and you are likely to be right in this
case of a carefully designed experiment. However, there is a remote
possibility that, after all, the wrong temperature has accidentally been
used in this particular run of the experiment!
Another example where outliers may occur by mistake is when a dataset
has been merged from a large database or different sources, such as the
Olympics dataset in Unit 5. There is room for error by omitting values
that should be in the dataset, duplicating rows of data or adding data that
do not belong. For example, we could accidentally add data from a winter
Olympics, and suddenly countries such as Switzerland and Austria have
far more medals than expected. These data are not ‘wrong’, but do not
belong to the ‘population’ you are studying, in this case the medals at
summer Olympics.
Some of the data may also have been influenced by a situation beyond
anyone’s control. Imagine a group of UK city planners in 2022, say, who
study how employees commute to work, in order to plan public transport
provision for their city. An important source of data for their research
would be the national census, which is run every 10 years by the Office
for National Statistics. (The 2021 census may have been the last, but there
will be equivalent ways of sourcing these data in the future.) If they look,
for example, at the three most recent censuses to model commuting habits
in their city, it seems likely that the data from the March 2021 census will
come up as outliers. Far more employees than expected are working from
home. A whole dataset is an outlier here!
While the 2021 data provide a unique snapshot of commuting patterns
during a pandemic, this may not be particularly useful to predict demand
for public transport in the future. However, just using data from
pre-pandemic times may also not give the full picture. Some employees
may change their commuting habits permanently after the pandemic. In
this case, it may be best for the city planners to conduct their own survey
to get the most reliable data to predict demand.
[Margin image: the COVID-19 lockdowns meant that places which are usually very busy, like St Mark's Square in Venice, Italy, were empty.]
For the rest of this section, we will investigate strategies to deal with
outliers that do not result from mistakes or where there are no obvious
reasons for them being outliers.
[Figure 21: the observations with the largest Cook's distance values are labelled 17, 18, 5 and 12.]
Figure 21 Cook’s distance plot for Model (3) for the desilylation data
[Figure 22: the observations with the largest Cook's distance values are labelled 15, 20 and 8.]
Figure 22 Cook’s distance plot for Model (19) for the citations data
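Plots like Figures 21 and 22 can be produced directly from any fitted linear model object in R; a minimal sketch, with fit standing in for the fitted model (a hypothetical name):

  # Cook's distance for each observation of a fitted model.
  d <- cooks.distance(fit)
  plot(d, type = "h",
       xlab = "Observation number", ylab = "Cook's distance")
  # Alternatively, base R's built-in diagnostic plot:
  plot(fit, which = 4)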
A look at the data reveals that the potential outliers all correspond to
very large response values, so that they correspond to articles that
have large numbers of citations. This is visualised in Figure 23 where
a scatterplot of numCitations and yearDiff is shown, together with
the regression line from Model (19).
Figure 23 Scatterplot of numCitations and yearDiff, with the
regression line for Model (19)
It is clear from Figure 23 that the regression line cannot capture the
very high values of the response. Is it just chance that these articles
have so many more citations than the rest? We asked the researcher
who provided the data and an explanation was found.
The number of citations an article will get depends on which type of
journal it has been published in. Articles in prestigious journals are
read by more researchers and are thus more likely to be cited. Also,
articles published in journals that are also read by researchers from
other disciplines may get more citations. For the citations dataset, it
turned out that the article with the largest number of citations
(observation 15) is about the statistical analysis of a clinical trial, and
had therefore been published in a medical journal. The three articles
with the next highest numbers of citations (observations 8, 17 and 20)
had been published in prestigious statistics journals. This was the
motivation behind creating the factor journal reflecting the type of
journal the respective article has been published in.
Example 5 proposed adding an extra squared term into the linear predictor
for the logistic regression model for the dose escalation dataset, giving the
proposed model as Model (20).
The diagnostic plots after fitting Model (20) are shown in Figure 24, and
they look promising!
Figure 24 Diagnostic plots for Model (20) for the dose escalation data
But . . .
Why a 'but'? Aren't the diagnostic plots looking much better than those
in Figure 14 for Model (9) without the squared term? Surely, only a
gloomy and pessimistic statistician can find fault here!
Or is there anything else we should consider?
We now have a large number of residuals that are essentially 0. Shouldn’t
we have a look to see whether we are overfitting? Are we fitting a model
that fits this particular dataset really well, but would not be useful for
predicting toxicity in future patients?
The revelation comes when we try to find the optimum dose. Let's have a
look at Figure 25, which shows the fitted probabilities of toxicity in the
biomarker positive group for Model (20).
Figure 25 Plot of the fitted probabilities in the biomarker positive group for
Model (20)
(a) Can you spot a problem with the plot in Figure 25?
(b) What could happen if we used this model to find the optimum dose
(the dose where the probability of toxicity is 0.16)?
(c) Can you think of a statistical explanation for this shape of the plot?
Not all decisions around data will have severe consequences like the ones
described in Example 6, but we should be aware that our decisions on
analysing and interpreting data may have some impact in the real world.
This could be patients’ health if the data come from a clinical trial, or a
9 The end of your journey: what next?
In survival analysis, the response is the time until an event of interest
occurs, and for some observations the event has not yet happened by the end
of the study. Such observations are said to be ‘censored’, and we need to
take censoring into account when modelling the data.
Non-parametric/semi-parametric methods
All generalisations of GLMs so far are still based on parametric models
where we assume a general form, including a distribution, and then we
estimate some unknown model parameters. If we have no idea what model
might fit our data, or if we tried several seemingly plausible models
without success, a non-parametric or semi-parametric method could be an
option. These methods include splines, wavelets, local polynomials,
support vector machines and generalised additive models, to name but a
few. They are more flexible than parametric models, but may be more
difficult to fit and to interpret.
Summary
In this unit, we have reviewed the main themes of M348: linear models and
generalised linear models, how they are related, and their application in
practice. You may have seen more clearly the similarities and links between
many of the concepts and methods that you have learned in the module.
You will also have gained more practice at using many of these methods.
When thinking about possible models for your data, you have many
options. You select a distribution for the response variable, a link function
that links the distribution mean to the linear predictor, and finally the
form of the linear predictor itself. Which variables should be included?
Are there significant interactions? Do we need transformations? From this
arguably incomplete list, you can see the options seem endless.
Ultimately, we should be pragmatic about model choice. There is usually
not one ‘best’ model, but many models that don’t fit, and some that do.
The aim is not to find an elusive perfect model, but a model that is
justifiable. Questions you could ask yourself are: Are the assumptions on
my model reasonable (or at least not too unreasonable)? Can I justify
what I have done? Does my analysis answer the research question?
If in doubt, and where possible, discuss your findings with the researchers
who collected the data. Often, modelling is just a means to an end to
answer a specific research question in the application area. Do your results
make sense to them? Don’t be afraid of asking questions. Lots of
questions! Communication is key.
The route map for this final M348 unit is repeated here, as a reminder of
what has been studied and how the sections link together.
[Route map: Section 1 (Multiple regression: understanding the data) is followed by Section 2 (Multiple regression: model fitting) and Section 3 (Multiple regression with factors); these lead into Section 4 (The larger framework of generalised linear models) and then Section 5 (Relating the model to the research question); Sections 6 (Another look at the model assumptions), 7 (To transform, or not to transform, that is the question) and 8 (Who's afraid of outliers?) follow, and the unit closes with Section 9 (The end of your journey: what next?).]
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate that there is no ‘correct’ model for a dataset and different
statisticians might recommend different models for the same data
• appreciate the importance of communication between the statistician
doing the modelling and the researcher who collected the data
• understand that the model needs to be able to address the research
question of interest
• have an understanding of linear models and generalised linear models,
and the connection between them
• appreciate that fitting a generalised linear model requires three decisions:
◦ which distribution to use for the response variable
◦ which link function to use
◦ which terms to include in the linear predictor
• test whether individual covariates, factors and interactions need to be
included in a model
• compare models using the adjusted R2 statistic and/or the AIC
• check the model assumptions through appropriate diagnostic plots
• appreciate that there can be more than one way to model a dataset, and
the ‘natural’ model for a dataset may not necessarily be the best model
to use
• appreciate that a model can sometimes be improved by transforming one
or more of the variables, or adding another explanatory variable or an
interaction
• appreciate some of the possible reasons for outliers and strategies for
dealing with them
• use R to fit a logistic regression model to contingency table data
• use R to compare models.
References
BBC News (2021) ‘Covid: Man offered vaccine after error lists him as
6.2 cm tall’, 18 February. Available at: https://www.bbc.co.uk/news/
uk-england-merseyside-56111209 (Accessed: 5 December 2022).
Biedermann, S. (2006) Private communication with one of the researchers
who conducted the experiment.
Cotterill, A. and Jaki, T. (2018) ‘Dose-escalation strategies which use
subgroup information’, Pharmaceutical Statistics, 17, pp. 414–436.
Elsevier (2021) ‘Biedermann, Stefanie’, Scopus (Accessed: 10 February
2021).
NHS Digital (2020) National Child Measurement Programme, England
2019/20 School Year. Available at: https://digital.nhs.uk/data-and-
information/publications/statistical/national-child-measurement-
programme/2019-20-school-year (Accessed: 17 September 2022).
Nicholson, H.S., Krailo, M., Ames, M.M., Seibel, N.L., Reid, J.M.,
Liu-Mares, W., Vezina, L.G., Ettinger, A.G. and Reaman, G.H. (1998)
‘Phase I study of Temozolomide in children and adolescents with recurrent
solid tumors: a report from the children’s cancer group’, Journal of
Clinical Oncology, 16(9), pp. 3037–3043.
Owen, M.R., Luscombe, C., Lai, L., Godbert, S., Crookes, D.L. and
Emiabata-Smith, D. (2001) ‘Efficiency by design: optimisation in process
research’, Organic Process Research and Development, 5, pp. 308–323.
Rodriguez-Martinez, A. et al. (2020) ‘Height and body-mass index
trajectories of school-aged children and adolescents from 1985 to 2019 in
200 countries and territories: a pooled analysis of 2181 population-based
studies with 65 million participants’, The Lancet, 396(10261),
pp. 1511–1524. doi: 10.1016/S0140-6736(20)31859-6.
Silberzahn, R. et al. (2018) ‘Many analysts, one data set: making
transparent how variations in analytic choices affect results’, Advances in
Methods and Practices in Psychological Science, 1(3), pp. 337–356. doi:
10.1177/2515245917747646.
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, The GlaxoSmithKline Carbon Neutral Laboratory for
Sustainable Chemistry: Michael Thomas / Flickr. This file is licensed
under the Creative Commons Attribution 2.0 licence (CC BY 2.0)
https://creativecommons.org/licenses/by/2.0/
Subsection 1.4, ‘Phew!’: Mykola Kravchenko / 123rf
Subsection 3.1, coal miner with a canary: Laister / Stringer / Getty
Subsection 4.1, British bluebells: Ket Sang Tai / 123RF
Subsection 4.2, Albert Einstein: Orren Jack Turner / Wikipedia / Public
Domain
Subsection 4.2, apples and pears: Inna Kyselova / 123RF
Subsection 4.3, medicine dose: rawpixel / 123RF
Subsection 4.4, child’s height being measured: Janie Airey / Getty
Subsection 5.2.1, average child heights: Lingkon Serao / 123RF
Subsection 5.4, working in a pharmaceutical industry: traimak / 123RF
Section 7, transformations: Sutisa Kangvansap / 123RF
Subsection 8.1, St Mark’s square during lockdown: federicofoto / 123RF
Section 9, looking out over the horizon: mihtiander / 123RF
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
All four explanatory variables are continuous, and are therefore covariates.
To see this, you can think of the units that the variables are measured in.
• The temperature at which the reaction is run, temp0, is measured in °C,
and can be set to any value the experimenters deem sensible. This is
clearly a continuous variable.
• The time for which the reaction is run, time0, is measured in hours, and
can also be set to any value the experimenters deem sensible (not
necessarily just whole hours). Therefore, this is also a continuous
variable.
• The variable nmp0 measures the concentration of the solution in terms of
volumes of the solvent NMP. Again, this can take any value within a
range the experimenters may specify. This is also a continuous variable.
• The variable equiv0 measures the molar equivalents of the reagent. A
molar equivalent is the ratio of the moles of one compound to the moles
of another. Again, this is something the experimenters can vary within a
sensible range, and is thus continuous.
This matters because the type of variable will affect the way the model is
set up: both types of explanatory variables can be handled in the
framework of regression models, but the inclusion of factors requires the
use of indicator variables.
Solution to Activity 2
(a) First, you should reacquaint yourself with the interpretation of the
estimated coefficients, as explained in Subsection 1.2 of Unit 2.
Each β̂j represents the expected change in the response when the
corresponding jth covariate increases by one unit, assuming all other
covariates remain fixed.
• For β̂1: We would expect this to be positive. The yield is
considerably higher for the second and fourth observations, where a
higher temperature is used, than for the first, third and fifth
observations. In particular, when comparing the second observation
with the first observation, we can see that the yield is higher and
the temperature was the only covariate that changed its value. The
same argument holds when comparing the fourth observation with
the third observation. This indicates that increasing the
temperature may increase the yield.
• For β̂2: This is less clear-cut. Both the highest and the lowest yield
occur at the shorter reaction time. We would need to see more data
to get a better idea about β̂2.
Solution to Activity 3
In Subsection 5.1 of Unit 2, the scatterplot matrix was introduced. This is
a graphical tool that shows scatterplots for all pairs of variables in the
dataset.
You can use it to assess how the response variable is related to each
explanatory variable. For example, if the relationship between an
explanatory variable and the response is non-linear, then a scatterplot can
help us to decide whether a transformation of the explanatory variable
may be useful. A scatterplot can also indicate issues with non-constant
variance and can therefore help us to decide whether a transformation of
the response may be useful. Additionally, the scatterplot matrix can be
useful for spotting relationships between the explanatory variables.
Solution to Activity 4
The regression coefficients in a multiple regression model are partial
regression coefficients, which means they are associated with a variable’s
contribution after allowing for the contributions of the other explanatory
variables. Therefore, if a variable is highly correlated with another
variable, it will have little or no additional contribution over and above
that of the other. This means there is a case for omitting one of the
variables from the model.
Solution to Activity 5
Notice that all six pairwise scatterplots look the same. They consist of
nine points, four of which are placed on the vertices of a square that is
standing on one vertex. Four further points are in the middle of each edge
connecting the vertices. The last point is in the centre of the square.
The striking differences to previous scatterplots are that all of the plots are
the same, and that their patterns appear to be systematic.
Solution to Activity 6
(a) This is an observational study, since we simply observe the values of
the explanatory variables rather than influencing their values. It is
not easily possible to influence the GDP or the number of medals won
by a country at the previous Olympics.
(b) This is a designed experiment, since the researchers selected the
values (‘new’ or ‘usual’) of the explanatory variable describing the
training regime, and they also decided that half of the participants
should receive each training regime. They could then compare the two
training groups and see how the choice of training regime affects the
strength of football players.
Solution to Activity 7
Some considerations the chemists would need to think about include:
• Which variables are likely to affect the yield of the alcohol of interest,
and should therefore be included as potential explanatory variables?
• What are plausible values or ranges for these variables?
• Which combinations of values should be used in the experiment?
• Sample size versus costs: how many runs of the reaction can be afforded,
and is there time to run them?
You may well have thought of other considerations!
Solution to Activity 8
(a) Let β1 , β2 , β3 and β4 be the regression coefficients of temp, time, nmp
and equiv, respectively.
To test whether the regression model contributes information to
interpret changes in the yield of the alcohol, we need to test the
hypotheses
H0 : β1 = β2 = β3 = β4 = 0,
H1 : at least one of the four coefficients differs from 0.
The test statistic for testing these hypotheses is the F -statistic (which
was reported to be 15.98).
Since the p-value associated with the F -statistic is less than 0.001,
there is strong evidence that at least one of the four regression
coefficients is different from 0. Hence, there is strong evidence that
the regression model contributes information to interpret changes in
the yield of the alcohol.
Denoting the number of regression coefficients that we’re testing by q,
the p-value for this test is calculated from an F (ν1 , ν2 ) distribution
where
ν1 = q = 4,
ν2 = n − (q + 1) = 30 − (4 + 1) = 25.
(c) The p-value associated with the regression coefficient of temp tests
the hypotheses
H0: β1 = 0,  H1: β1 ≠ 0,
(assuming that β2 = β̂2, β3 = β̂3 and β4 = β̂4).
The p-value for testing these hypotheses is less than 0.001. Therefore,
there is strong evidence to suggest that the regression coefficient of
temp, β1 , is not 0, when time, nmp and equiv are in the model.
(d) The estimated regression coefficient of temp is 4.159. This means that
the yield of the alcohol (yield) is expected to increase by 4.159 if the
value of temp increases by one unit, and the values of time, nmp and
equiv remain fixed.
(e) The p-values associated with the partial regression coefficients of time
and equiv are both small (p = 0.0424 for time and p = 0.0220 for
equiv), indicating that there is evidence that β2 and β4 are different
from 0, if the other variables are in the model.
Solution to Activity 9
(a) The residual plot in Figure 3(a) suggests that the zero mean
assumption of the Wi ’s does not hold. This is because the points in
this plot are not randomly scattered about zero, but instead show a
systematic pattern; there are large negative values of the residuals for
the smallest and largest fitted values, and exclusively positive
residuals for fitted values in the middle of the range.
Note, however, that the vertical scatter is fairly constant across the
fitted values, indicating that the assumption of constant variance does
seem reasonable.
(b) The systematic pattern in the plot in Figure 3(a) looks like a parabola
– that is, a quadratic function.
(c) The normal probability plot in Figure 3(b) shows that the
standardised residuals clearly do not follow the straight line, but wrap
around it. So, it seems that the normality assumption also does not
hold.
Solution to Activity 10
(a) Let q be the number of regression coefficients we are testing. Then, in
Model (4) we have four coefficients for the linear terms, four for the
quadratic terms and six for the interactions. Therefore, the number of
regression coefficients in the model is q = 4 + 4 + 6 = 14.
The p-value for the test is then calculated from an F (ν1 , ν2 )
distribution where
ν1 = q = 14,
ν2 = n − (q + 1) = 30 − (14 + 1) = 15.
(d) The p-values associated with all partial regression coefficients of the
linear terms are less than 0.001, indicating strong evidence for being
different from 0, if all terms are in the model.
For the quadratic terms, we find that two of the p-values (for the
coefficients of temp² and equiv²) are also less than 0.001, again
indicating strong evidence that the coefficients for these terms are
not 0, if all terms are in the model. Although the p-values for the
other two quadratic terms (time² and nmp²) are not as small, they are
still small enough to indicate that there is evidence for their
coefficients being different from 0, if all terms are in the model.
We can see that the p-values for the three interactions involving temp
are less than 0.001, indicating strong evidence that their coefficients
are different from 0, if all terms are in the model. The p-values
associated with the coefficients of time:equiv and nmp:equiv are not
so small, but still small enough to suggest that there is evidence for
being different from 0, if all terms are in the model. The evidence that
the coefficient for the interaction time:nmp is 0, if all terms are in the
model, is, however, only weak, since the associated p-value is larger.
The only term that we might consider dropping from the model is the
interaction time:nmp, since there was only weak evidence that this
term needs to be in the model.
Solution to Activity 11
From Box 4, a high value of the adjusted R2 is preferable, whereas a low
value of the AIC is preferable. Model (4) has the higher adjusted R2 and
the lower AIC, and therefore Model (4) is the preferred model.
Solution to Activity 12
The estimates for the linear coefficients are the same for both models.
However, their standard errors are smaller in Model (4).
This is unexpected. Back in Subsection 1.2 of Unit 2, we compared
coefficients of the simple and multiple regression models, and in all
examples we found that the estimated coefficient in the simple linear
regression model was different to the partial coefficient for the
corresponding explanatory variable in the multiple regression model.
Solution to Activity 13
(a) The plot of residuals against fitted values in Figure 6(a) does not give
any evidence to doubt the model assumptions of zero mean and
constant variance for the Wi ’s.
(b) The normal probability plot in Figure 6(b) shows that the
standardised residuals follow the straight line quite well. There is
therefore no cause for concern about the normality assumption.
Solution to Activity 14
From Subsection 3.3 of Unit 2, there is no standard rule of thumb for
deciding that a point is influential. Sometimes a Cook’s distance greater
than 0.5 is considered to indicate an influential point, but a point could
also be considered as being influential for smaller Cook’s distance values if
its Cook’s distance is large in comparison to the other Cook’s distance
values.
All of the Cook’s distance values in Figure 8 are less than 0.5, and none of
them are particularly large in comparison to the other values, and so this
Cook’s distance plot suggests that there doesn’t seem to be any problems
with influential points for these data and Model (4).
Solution to Activity 15
(a) Yes, the optical density is affected by whether or not the sample was
treated. The distribution of values for untreated samples is centred
at a higher value than the distribution of values for treated samples.
This is expected. Adding a toxin to a sample is likely to decrease the
proportion of surviving cells in a sample, resulting in lower values of
the optical density.
(b) Yes, the distribution of responses is different on different days. It’s
shifted down on the second day compared with the first.
This is not necessarily expected. Why would a measurement be
affected by the day on which it’s been taken?
(c) The data look quite similar on both days, just shifted. In particular,
the mean difference between opticalDensity for treated cells and for
untreated cells (that is, the treatment effect) looks roughly the same
on both days.
Solution to Activity 16
(a) The factor treatment should be included in the model, since we have
seen that there are differences between the responses of treated and
untreated cells.
(b) The factor day should also be included. We have seen that the
responses appear to be ‘shifted’ down on the second day.
It is good that we looked at the responses by day so that we noticed
the difference a day can make, which was confirmed by the biologist.
If we had neither looked at the data visually nor spoken with an
expert, then we might not have bothered fitting the factor day, as it
was not obvious this could be important.
(c) The interaction is probably not needed in the model, since the effect
of the treatment is roughly the same on both days.
Solution to Activity 17
The model with both factors, treatment and day, but without their
interaction, is the preferred model because it has the largest value of the
adjusted R2 statistic and the smallest value of the AIC.
Yes, this is as expected. In Activity 16, we conjectured that both factors,
but not their interaction, should be in the model.
Solution to Activity 18
(a) Let α be the intercept parameter, β1 be the regression coefficient for
the indicator variable ‘treatment 1’, and β2 be the regression
coefficient for the indicator variable ‘day 2’. Then the p-value on the
row corresponding to the factor treatment is testing the hypotheses
H0: β1 = 0,  H1: β1 ≠ 0,
assuming that β2 = β̂2 = −0.4.
(b) The value of the test statistic is −5.408. If the null hypothesis were
true, the distribution of the test statistic would be a t(ν) distribution,
where, for a model with q regression coefficients,
ν = n − (q + 1) = 12 − (2 + 1) = 9.
(c) The p-value is very small, so there is strong evidence against the null
hypothesis that β1 = 0. Therefore, we conclude that there is evidence
that the toxin does affect cell survival.
Solution to Activity 19
(a) There is clearly day-to-day variation in the data. In particular, in this
experiment the responses on Day 2 seem lower than those on Day 1.
If the treated cells had been measured on Day 1, we would expect the
responses to be in a similar region as the three values for treated cells
for Day 1 of Figure 9 (Subsection 3.1). Similarly, if the untreated cells
had been measured on Day 2, we would expect the responses to be in
a similar region as the three values for untreated cells for Day 2 of
Figure 9. If that were the case, the model would not show a
significant effect of the treatment, as the values in the two treatment
groups would be fairly similar. The biologists would thus (wrongly)
conclude that the toxin does not affect cell survival.
(b) In this situation, we would have the opposite problem. The difference
in optical density between treated and untreated cells would appear
to be much larger than it actually is.
(c) Yes, it was a good idea. The way the experiment was planned gave us
the opportunity to separate the treatment effect from the day-to-day
variation, so the treatment effect can be estimated more precisely.
Solution to Activity 20
(a) Exponential distribution. We can view the distance until the battery
runs out of power as a ‘time to event’ variable.
(b) Bernoulli distribution. The outcome is either ‘made a claim’ or ‘no
claim’.
(c) Poisson distribution. For each policy holder, the number of claims in
the last year is counted.
(d) Binomial distribution. The number of policy holders who made a
claim, out of all policy holders, is recorded.
(e) Normal distribution. Height is a continuous variable, and it seems
plausible that it could follow a bell-shaped curve.
Solution to Activity 21
There are three potential choices when fitting a GLM: the distribution of
the response variable, the link function and the form of the linear
predictor. For the link function, we’ll use Table 14 to guide our choice, and
so we only need to choose the response distribution and the form of the
linear predictor.
So, for the citations data:
• Distribution: The response variable is a count, the number of citations
an article has accrued. The Poisson distribution is a natural choice for
modelling here.
• Link function: From Table 14, we’ll use the log link, the canonical link
function for the Poisson distribution.
• Linear predictor: We have two explanatory variables, the covariate
yearDiff and the factor journal. Both seem relevant here. We would
expect that an article that has been published for longer may have more
citations than a newer article, since more researchers will have had the
chance to read it. It also seems intuitive that the type of journal where
an article has been published could affect the number of citations.
Therefore both variables should be fitted to the model.
What about an interaction between them? This would correspond to
allowing non-parallel slopes in the linear predictor, or, in other words,
the number of citations could increase at different rates over time for
different journal types. This seems plausible.
Putting this together, the proposed model for the citations data is a
Poisson GLM (with a log link) of the form
numCitations ∼ yearDiff + journal + yearDiff:journal.
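As a sketch, a model of this form could be fitted in R along the following lines (assuming the data frame is called citations, with journal stored as a factor; the object name citPois is illustrative):

# Poisson GLM with a log link, including the yearDiff:journal interaction
# (with very few observations at some journal levels, some interaction
# coefficients may not be estimable)
citPois <- glm(numCitations ~ yearDiff + journal + yearDiff:journal,
               family = poisson(link = "log"), data = citations)
summary(citPois)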
Solution to Activity 22
(a) The three categories of the factor journal correspond to different
ranges of values of numCitations, with articles in standard
statistics journals (journal = 0) having fewer citations than those in
prestigious statistics journals (journal = 1), which in turn have fewer
citations than those in medical journals (journal = 2). This means
that we can expect journal to be a good variable to have in our
model as it seems to explain some of the differences in the number of
citations. The data also support our assumption that the number of
citations increases with the number of years since an article was published.
(b) There is only one article in a medical journal, so there is only one
observation for which journal takes value 2.
Solution to Activity 23
It is important to consider this question in the context of the data. There
is no reason to assume the numbers of citations for articles in prestigious
statistics journals and medical journals should be similar. We would not
be comparing like with like. We can also see this from our (albeit limited)
dataset, as the one response we have for a medical journal is quite different
from those for prestigious statistics journals.
Solution to Activity 24
(a) The null model, which does not take any explanatory variables into
account, is nested in Model (7). As explained in Units 6 and 7, we can
compare nested GLMs by calculating the deviance difference of the
two models, which gives
462.801 − 80.933 = 381.868.
If both models fit equally well, this difference has a χ2 (d) distribution,
where
d = difference in the degrees of freedom associated with
the null deviance and the residual deviance
= 22 − 19 = 3.
It is clear that such a large value of the deviance difference (381.868)
will yield a tiny p-value when compared against a χ2 (3) distribution.
Indeed, conducting this comparison in R yields a p-value close to 0.
We conclude that Model (7) provides a highly significant gain in fit
over the null model.
(b) If Model (7) is a good fit, then the residual deviance should come
from a χ2 (r) distribution, where
r = n − number of parameters in the proposed model.
The value of the degrees of freedom for the residual deviance (r) was
given in the question to be 19. This comes from the fact that there
are 23 observations in the citations dataset and Model (7) has four
parameters (the intercept plus the three regression coefficients counted
in part (a)), so that r = 23 − 4 = 19. The residual deviance, 80.933, is
much larger than 19, so there is some evidence that Model (7) may not
be a good fit to the data.
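Both comparisons could be checked in R using the deviances quoted above:

# p-value for the deviance difference against a chi-squared(3) distribution
pchisq(381.868, df = 3, lower.tail = FALSE)
# Goodness-of-fit check: residual deviance against a chi-squared(19) distribution
pchisq(80.933, df = 19, lower.tail = FALSE)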
Solution to Activity 25
• The red line shown in the plot of the standardised deviance residuals
against a transformation of µ̂ given in Figure 12(a) shows some possible
curvature, suggesting that either the Poisson model or its canonical link
may not be appropriate, or some important term may be missing from
the linear predictor.
• In the plot of the standardised deviance residuals versus index shown in
plot (b), the standardised deviance residuals appear to be fairly
randomly scattered about zero across the index, except for a cluster of
negative residuals associated with the first few index values.
• The plot of the squared standardised deviance residuals against index
shown in plot (c) suggests that the magnitude of the standardised
deviance residuals remains fairly constant across the index.
• The normal probability plot shown in plot (d) is reasonably close to the
straight line, with some small deviation at the lower end.
Because of the hint of curvature in Figure 12(a), it appears that Model (7)
may not be appropriate for the citations data, and that further
investigation or a different model may be needed.
Solution to Activity 26
In dose escalation trials, the binary variable toxicity is an obvious
response variable since the researchers want to find out how this variable is
related to the other variables in the dataset. A logistic regression model
with the binary response toxicity is therefore a natural choice for these
data. (Section 5 will provide more detail on how to select a model
according to the research question of the study.)
Solution to Activity 27
(a) The deviance difference for Models (9) and M2 is
D(M2 ) − D(9) = 31.149 − 25.398 = 5.751.
The deviance difference for Models (9) and M1 is
D(M1 ) − D(9) = 32.528 − 25.398 = 7.13.
The deviance difference for Models (9) and M0 is
D(M0 ) − D(9) = 36.434 − 25.398 = 11.036.
(b) The appropriate chi-square distribution for the deviance difference has
degrees of freedom equal to the difference in the degrees of freedom
for the models we are comparing, or the number of extra parameters
in the larger model.
For Models (9) and M2 , this difference is 46 − 45 = 1. Therefore the
distribution is χ2 (1).
For Models (9) and M1 , this difference is 47 − 45 = 2. Therefore the
distribution is χ2 (2).
For Models (9) and M0 , this difference is 48 − 45 = 3. Therefore the
distribution is χ2 (3).
(c) In all three comparisons, the p-value is small. The p-values therefore
suggest there is significant gain in model fit when including the extra
parameters from Model (9). Therefore Model (9) should be chosen.
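In R, each of these comparisons could be carried out with anova() on the fitted GLM objects, or directly from the deviance differences (object names here are illustrative):

# Compare Model (9) with the smaller model M2, say, using a chi-squared test
anova(fitM2, fit9, test = "Chisq")
# Or compute the three p-values directly from the deviance differences
pchisq(c(5.751, 7.130, 11.036), df = c(1, 2, 3), lower.tail = FALSE)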
Solution to Activity 28
In the plot of the standardised deviance residuals against fitted values
given in plot (a), the typical (for logistic regression) ‘lines’ of positive and
negative deviance residuals correspond to response values of Yi = 1 and
Yi = 0, respectively. The red line, which in the ideal case should be a
horizontal line at 0, has a slight upward trajectory which is interrupted by
a small U-shaped dip. This dip may indicate that the terms in the linear
predictor cannot completely capture all of the features of the data. The
dip is rather small, but we note some small concern here.
In the plot of the standardised deviance residuals against index shown in
plot (b), if the responses are independent, the standardised deviance
residuals in the plot should fluctuate randomly and there shouldn't be
any systematic pattern across the index.
Solution to Activity 29
(a) To compare the fit of these models, we look at the deviance difference.
This gives a test statistic of
976.62 − 302.78 = 673.84.
If both models fit equally well, this value comes from a χ2 (d)
distribution, where d is the difference between the degrees of freedom
of the two models, 10 − 5 = 5, or the number of extra terms fitted to
the larger model.
(b) If the p-value is close to 0, then there is very strong evidence that the
extra terms in the larger model, with all two-way interactions, are
needed.
The model omitting the interaction obese:ethnicity has a larger
residual deviance than the model omitting ethnicity:ageGroup, so
the deviance difference (the test statistic) will be even larger than the
value in part (a), with the same degrees of freedom, so the p-value will
be even smaller.
The model omitting the interaction obese:ageGroup also has a larger
residual deviance than the model omitting ethnicity:ageGroup, so
the deviance difference (the test statistic) will again be larger than
the value in part (a). This time, the degrees of freedom will be
6 − 5 = 1, but again, the p-value will be very small, since the deviance
difference is so much larger than the degrees of freedom.
(c) We have already ruled out (in part (b)) the models that omit one of
the two-way interactions. They are all significantly worse than the
model with all two-way interactions, so we just need to investigate the
latter.
Solution to Activity 30
(a) The biologists’ research question could be written as: ‘Will the toxin
change the proportion of surviving cells in a sample?’
(b) The proportion of surviving cells in a sample is measured in terms of
the sample’s optical density; these values are recorded in the variable
opticalDensity. The variable treatment indicates whether the
sample has been treated with the toxin. So, statistically we must ask
if the factor treatment has an effect on the response variable
opticalDensity.
So, for Model (11), we’re ultimately interested in testing whether the
factor treatment needs to be in the model in addition to day. In
other words, we’re interested in testing the hypotheses
H0: β1 = 0, H1: β1 ≠ 0 (assuming that β2 = β̂2),
where β1 and β2 denote, respectively, the regression coefficients for the
indicator variables for the second levels of factors treatment and day.
(c) The p-value for the test of interest to the biologists is the p-value
associated with treatment in Table 22, which is less than 0.001.
Therefore, there is strong evidence that the toxin does indeed affect
cell viability. (It’s good that we fitted the effect of day to the model.
Without this, the p-value for this test would be 0.0149, so that there
is still evidence against H0 : β1 = 0, but it’s less strong.)
Solution to Activity 31
(a) If the biologists are interested in the research question formulated in
Activity 30, so that they want to find out if there is an effect of the
toxin on the proportion of surviving cells in a sample, both methods
can be used. We have already found the p-value for the test in
Table 12 (Subsection 3.2) from the output when fitting a linear
regression model. Because the factor treatment only has two levels,
the ANOVA table for the same model given in Table 13
(Subsection 3.3) produces the same p-value in the row associated with
treatment, so can also be used.
(b) If the biologists are interested in testing the one-sided hypothesis that
the toxin decreases the proportion of surviving cells, then we need the
table of coefficients from the regression model. (Because the
t-distribution is symmetric about 0, the p-value of the one-sided test
is either half the p-value of the two-sided test or one minus half the
p-value of the two-sided test, depending on the sign of the estimated
coefficient.) The ANOVA table can only tell us if a factor affects the
response (and should therefore be in the model), but not in which
direction (increase or decrease).
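For illustration, using the test statistic of −5.408 on 9 degrees of freedom from Activity 18, and assuming the one-sided alternative corresponds to a negative treatment coefficient, the two p-values are related as follows in R:

# One-sided p-value (lower tail, since the estimated coefficient is negative)
pt(-5.408, df = 9)
# Two-sided p-value, i.e. twice the one-sided value in this case
2 * pt(-5.408, df = 9)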
Solution to Activity 32
The logistic regression model with response variable obese and factors
ethnicity and ageGroup should be fitted. The binary variable obese is
an obvious response for the research question since the researchers want to
find out how this variable is related to the other two variables in the
dataset.
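A sketch of how this model might be fitted in R, assuming the data are held one child per row in the data frame childMeasurements (if the data are instead aggregated counts, a two-column matrix of successes and failures could be used as the response):

# Logistic regression of obese on ethnicity and ageGroup
obeseFit <- glm(obese ~ ethnicity + ageGroup,
                family = binomial(link = "logit"),
                data = childMeasurements)
summary(obeseFit)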
Solution to Activity 33
(a) The proposed logistic regression model corresponds to the log-linear
model where the response variable is the count in Table 20
(Subsection 4.4). Since ethnicity and ageGroup are in the logistic
regression model for obese, we’d expect all three main effects
(ethnicity, ageGroup and obese) to be in the log-linear model for
the same data, and we’d also expect the model to include the two-way
interactions obese:ethnicity and obese:ageGroup.
(b) The residual deviance for the logistic regression model in part (a)
is 302.78, which is much larger than 5, the value of the degrees of
freedom. Therefore, by the usual ‘rule of thumb’ comparing the values
of the residual deviance and its degrees of freedom, the residual
deviance is much larger than expected if the model were a good fit
and so we conclude that the model from part (a) is not a good fit to
the data.
Solution to Activity 34
The interaction ethnicity:ageGroup models how these two factors are
jointly related to the response, or, in other words, how changing them
jointly will affect the response. For example, the odds of an Asian child
being obese change in a particular way when we move from the younger
ageGroup to the older ageGroup; for a child from a different ethnic group,
for example Black, the odds may change in a different way when we make
the same move.
In the model without the interaction, the odds of being obese would change
by exactly the same multiplicative factor for children of all ethnicities when
we go from the younger ageGroup to the older ageGroup. This might not be
realistic, which could explain why the fit of the smaller model (considered
in Activity 33) was not adequate.
Solution to Activity 35
In the biomarker negative group, we can see that the value of x where the
vertical line crosses the x-axis is approximately 240 mg/m². In the
biomarker positive group, this dose value is approximately 180 mg/m².
Solution to Activity 36
Solving Equation (15) for dose, we get
−1.6582 = −424 + 529 log(dose/200 + 1)
(−1.6582 + 424)/529 = log(dose/200 + 1)
exp(0.7984) ≃ dose/200 + 1
(2.2220 − 1) × 200 ≃ dose
244 ≃ dose.
So, for biomarker negative patients, the dose required so that the
probability of toxicity is 0.16 is (approximately) 244 mg/m².
Solving Equation (16) for dose, we get
−1.6582 = −4.3 + 4.1 log(dose/200 + 1)
(−1.6582 + 4.3)/4.1 = log(dose/200 + 1)
exp(0.6443) ≃ dose/200 + 1
(1.9047 − 1) × 200 ≃ dose
181 ≃ dose.
Therefore, for biomarker positive patients, the dose required so that the
probability of toxicity is 0.16 is (approximately) 181 mg/m².
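The same calculations can be reproduced numerically in R:

# Dose at which the fitted probability of toxicity is 0.16, for each group
doseNegative <- 200 * (exp((-1.6582 + 424) / 529) - 1)   # approximately 244
dosePositive <- 200 * (exp((-1.6582 + 4.3) / 4.1) - 1)   # approximately 181
c(doseNegative, dosePositive)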
Solution to Activity 37
(a) For the biomarker negative group, the estimated probabilities are
equal to 0 up to and including dose 215 mg/m². Dose 245 mg/m², however,
has probability of toxicity estimated as 0.286, which is considerably larger
than the acceptable 0.16. We therefore conclude that dose 245 mg/m²
is likely to cause too many toxicity events, and we recommend the
next lower dose, 215 mg/m².
(b) For the biomarker positive group, the estimated probability at dose
100 mg/m² is 1/6 (approximately 0.17), which is just about acceptable.
Moving up the doses, we get estimates of 0 for doses 150 mg/m² and
180 mg/m². Then, at 215 mg/m², the probability of toxicity is estimated
to be 0.5, which is far higher than acceptable. We therefore recommend
the next lower dose, 180 mg/m², for use in this group.
Solution to Activity 38
(a) The answer to the research question could impact the production
process of the alcohol, since GlaxoSmithKline can set the covariates
to their (estimated) optimal values. This is likely to increase the yield
and would thus make production more efficient.
(b) You could find the maximum of your fitted model from Section 2 and
use this to estimate the maximum yield. Similarly, you can use the
values of the covariates where the maximum of the fitted model is
attained as estimates of the optimal covariate values.
Solution to Activity 39
No. The value of 95.3112 for the yield at these values of the covariates is a
prediction from the model. We need to take into account the uncertainty
that we have around this prediction.
Solution to Activity 40
(a) The correct answer is that the interval is the prediction interval for
the yield of alcohol at the values of the covariates given in Table 24.
How is this different from the interpretation in the question? Well, we
need to take into account that the values of the covariates where the
maximum yield is attained are also estimated. This means that there
is an extra level of uncertainty around these values, which has not
been included in the prediction interval. So, a prediction interval for
the maximum yield would be wider than the interval shown here.
(b) When the covariates are fixed at the values provided in Table 24,
then, in the long run, we expect 95% of new responses (yield of
alcohol during a run of the experiment/production process) to fall
into the interval (93.8610, 96.7614).
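As a sketch of how such an interval might be obtained in R, assuming the final fitted model is stored in an object called m4 and the covariate values from Table 24 are held in a one-row data frame called newvals (both names illustrative):

# 95% prediction interval for the yield at the given covariate values
predict(m4, newdata = newvals, interval = "prediction", level = 0.95)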
Solution to Activity 41
In the desilylation dataset, the values of the covariates had been
standardised before the data were analysed. The optimal estimated values
are therefore on the standardised scale, and the chemists must reverse the
standardisation to obtain the corresponding values on the original scale.
For example, the value of nmp is given as −1.3979 here, and it would be
impossible to prepare a solution with a negative volume of the solvent!
Solution to Activity 42
(a) Yes, it seems reasonable to assume a normal distribution. First, the
response variable, percentage yield, is measured on a continuous scale,
in which case a normal distribution is often a sensible
starting point. Second, after fitting the final Model (4), we
investigated the normal probability plot of standardised residuals (in
Activity 13, Subsection 2.5). The points in the plot are close to the
straight line, confirming the validity of this model assumption.
(b) The normal distribution can attain any real value. The response
variable in this example, however, is measured in percentage yield,
which is restricted to the interval from 0 to 100. A potential issue
could be that the normal distribution might provide predictions
outside this interval. Luckily, in the example this did not happen.
From Subsection 5.4, the prediction interval for the percentage yield
of the alcohol of interest when the reaction is run at the estimated
optimal covariate values, is (93.8610, 96.7614), which does not exceed
the limit of 100.
If our predictions had exceeded the upper limit of 100%, we could
have tried a distribution bounded on (0, 100) (or a transformed
response bounded on (0, 1)) instead of the normal distribution.
Solution to Activity 43
(a) The term called ‘Intercept’ is the expected value of numCitations when
the covariate yearDiff is 0 and the factor journal takes the baseline
level. The baseline level of journal is when journal takes coded
value 0, which represents a ‘standard statistics journal’, and if
yearDiff = 0, then the article has just been
published. So, since the estimated value of ‘Intercept’ for the fitted
model is 1.174 in Table 26, we’d expect the number of citations for an
article in a standard statistics journal that’s just been published to
be 1.174.
The regression coefficient for yearDiff is estimated to be 0.552. This
means that, after controlling for the type of journal, the expected
number of citations for an article increases by 0.552 with each
additional year since publication.
The value 1 for journal represents a ‘prestigious statistics journal’.
The estimate for this parameter is 35.610. So, after controlling for the
number of years since publication, an article in a prestigious statistics
journal is expected to have 35.610 more citations than an article in a
standard statistics journal.
(c) Since yearDiff is a covariate, we can use the p-value for the
regression coefficient for yearDiff given in the summary output table
to assess whether or not yearDiff should be kept in the model. From
Table 26, this p-value is 0.0193, which means there is evidence to
suggest that this coefficient is different from 0 (provided the factor
journal is also in the model). Therefore the covariate yearDiff
should be kept in the model.
Since journal is a factor, in order to assess whether journal should
be kept in the model, we can use the ANOVA test comparing the RSS
values for Model (17) and the model without journal included. The
p-value from this ANOVA test is very small, so there is strong
evidence that the factor journal should also be kept in the model in
addition to yearDiff.
Overall, we can say that there is evidence to suggest that each of the
two explanatory variables, journal and yearDiff, influences the
number of an article’s citations, given the existence of the other
explanatory variable in the model.
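A sketch of these two tests in R, assuming Model (17) is the linear model with main effects of yearDiff and journal for the citations data (object names illustrative):

# Fit Model (17) and the model without journal, then compare them
m17 <- lm(numCitations ~ yearDiff + journal, data = citations)
mNoJournal <- lm(numCitations ~ yearDiff, data = citations)
summary(m17)             # t-test and p-value for the coefficient of yearDiff
anova(mNoJournal, m17)   # F-test based on the two models' RSS values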
Solution to Activity 44
(a) The plot of residuals against fitted values in Figure 17(a) does not
give strong evidence to doubt the model assumptions of zero mean
and constant variance for the Wi ’s.
(b) The normal probability plot in Figure 17(b) shows that the
standardised residuals follow the straight line quite well. There is no
reason for concern about the normality assumption.
Solution to Activity 45
In the scatterplot of the residuals against values of yearDiff, it looks like
the variance of the Wi ’s increases as yearDiff increases, since the vertical
spread of points moving from left to right seems to increase. This may
indicate that a transformation of the response variable might be needed to
make the variance constant.
The plot of residuals against the levels of journal does not give strong
evidence to doubt the model assumptions of zero mean and constant
variance for the Wi ’s. The boxplot for values where journal is 0 looks
reasonable, and the other levels simply do not contain enough data (three
points where journal is 1, and only one point where journal is 2) to draw
any strong conclusions.
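For reference, residual plots of this kind could be produced in R along the following lines (continuing with the illustrative object m17 from the sketch above):

# Standardised residuals against fitted values, a normal probability plot,
# and plots against each explanatory variable
plot(fitted(m17), rstandard(m17),
     xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = 0)
qqnorm(rstandard(m17)); qqline(rstandard(m17))
plot(citations$yearDiff, rstandard(m17),
     xlab = "yearDiff", ylab = "Standardised residuals")
abline(h = 0)
boxplot(rstandard(m17) ~ citations$journal,
        xlab = "journal", ylab = "Standardised residuals")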
Solution to Activity 46
(a) The estimated probabilities of toxicity decrease before they increase!
This makes no sense scientifically.
(b) There are now two fitted values of p̂ = 0.16, one at a dose of roughly
100 mg/m², and the other at a dose of roughly 215 mg/m² (by looking
at the plot). From the data, neither of these doses seems a good
choice. If we choose 100 mg/m² as the recommended dose, we may
lose efficacy. The higher dose levels of 150 mg/m² and 180 mg/m² did
not cause toxicity in this study, while likely being more effective. But,
if we choose 215 mg/m² as the recommended dose, we may expose
more patients to toxicity than planned. At this dose level, two out of
four patients experienced toxicity in the study, so this dose may be
too high to be safe.
(c) When we look at the data in Table 18 (Subsection 4.3), there is one
case of toxicity in the biomarker positive group at the lowest dose
level (100 mg/m²) and no cases in the two next higher dosage groups.
The model we have fitted follows this pattern very closely.
Index
∆ 47, 103
absolute income hypothesis 32
accuracy 306
adaTepe 235
adjusted R2 statistic 405
agglomerative hierarchical clustering 224
  start 225
AIC 405
AIH 32
Akaike information criterion (AIC) 405
algorithm
  definition 308
  for simple linear regression 308
  for sorting data 312
  hill-climbing 327
  MapReduce 318
  split-apply-combine 318
  to calculate a square root 323, 324
  to calculate a standard deviation 311
Analysis of variance (ANOVA) 415
Anderson–Hsiao estimator 168
ANOVA 415
  explained sum of squares (ESS) 415
  F-value 415
  table 415
APC 13
AR(1) 125
arbitrage 150
asymptotic property 28
Augmented Dickey–Fuller (ADF) test 140
autocorrelation coefficient 111
autocorrelation function (ACF) 112
autonomous consumption 13
autoregressive process 125
autoregressive scatterplot 108
average linkage 227
average propensity to consume 13
Babylon app 352
backshift operator 108
beads 197
bias 30
  attenuation 45
  attrition 37
  omitted variable 57
  selection 43
big data
  definition 299
  three V’s 299
  variety 304
  velocity 305
  veracity 305
  volume 303
big O notation 310
binary distance 205
BLUE 29
Bray–Curtis distance 202, 205
cells 411
central limit theorem 46
centroid 241
ceteris paribus 15
change
  absolute 47
  percentage 47
  relative 47
Charter of Fundamental Rights 348
childMeasurements 433
citations 421
city block distance 205
clothingFirms 39
cluster analysis 188
cluster definition 191, 208
cluster, allocation to 238
cluster, centre 241
clustering
  k-means 244
  density-based 255
  hierarchical 223, 224
  partitional 236
clusters 188
coffeePrices 164
cointegration 150
cointegration test 151
complete linkage 226
confirmatory analysis 6, 119
conflation 423
consistency 28
contingency table 434
Cook’s distance 408
  plot 408
ukGDP 115
unbiasedness 26
unemployment 104
unit root 136
unlabelled observation 257
unstructured data 304
unsupervised learning 334
usConsumption 152
variable
dummy 71
endogenous 10
exogenous 10
instrumental 65
proxy 59
Voronoi diagram 238