Research Methods Using R
Book:
Baker, Daniel Hart orcid.org/0000-0002-0161-443X (2022) Research Methods Using R:
Advanced Data Analysis in the Behavioural and Biological Sciences. Oxford University
Press, (360pp).
Research Methods Using R: advanced data
analysis in the behavioural and biological sciences
March 2022
Acknowledgements
First and foremost, I would like to thank all of the students who have taken my
Advanced Research Methods module at the University of York over the past
7 years. You have all made an invaluable contribution to this work, by asking
questions, by suggesting new ways to think about methods, and through your
enthusiasm and perseverance on a difficult topic. This book is for you, and for
all future students who want to learn these techniques.
Many of my colleagues have had a huge influence on developing my own under-
standing of the topics covered in the book, especially Tim Meese, who as my PhD
supervisor taught me more than I can remember about Fourier analysis, signal
detection theory, and model fitting! I am further indebted to Tom Hartley, Alex
Wade, Harriet Over and Shirley-Ann Rueschemeyer, who provided invaluable
feedback on the material at various stages.
I am also very grateful to Martha Bailes and the team at Oxford University
Press, who have guided this project from an initial proposal through to a finished
product. Martha coordinated reviews from many anonymous reviewers from
diverse fields, whose comments and suggestions have strengthened the manuscript
immeasurably. I am further indebted to Bronte McKeown for checking the
technical content of the book, including the code and equations.
Finally, a huge thank you to my family, Gemma, Archie and Millie, who have
put up with me being buried in a laptop writing this book and tinkering with
the figures for far too long!
1 Introduction 9
1.1 What is this book about? . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Who will find this book useful? . . . . . . . . . . . . . . . . . . . 10
1.3 Topics covered . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Some words of caution . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Implementation in R . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.10 Putting it all together - importing and cleaning some real data . 57
3.11 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Power analysis 81
5.1 What is statistical power? . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Effect size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 How can we estimate effect size? . . . . . . . . . . . . . . . . . . 83
5.4 Power curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 Problems with low power . . . . . . . . . . . . . . . . . . . . . . 87
5.6 Problems with high power . . . . . . . . . . . . . . . . . . . . . . 88
5.7 Measurement precision impacts power . . . . . . . . . . . . . . . 89
5.8 Reporting the results of a power analysis . . . . . . . . . . . . . . 89
5.9 Post-hoc power analysis . . . . . . . . . . . . . . . . . . . . . . . 91
5.10 Doing power analysis in R . . . . . . . . . . . . . . . . . . . . . . 92
5.11 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Meta analysis 97
6.1 Why is meta analysis important? . . . . . . . . . . . . . . . . . . 97
6.2 Designing a meta analysis . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Conducting and summarising a literature search . . . . . . . . . 99
6.4 Different measures of effect size . . . . . . . . . . . . . . . . . . . 101
6.5 Converting between effect sizes . . . . . . . . . . . . . . . . . . . 103
6.6 Fixed and random effects . . . . . . . . . . . . . . . . . . . . . . 104
6.7 Forest plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.8 Weighted averaging . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.9 Publication bias and funnel plots . . . . . . . . . . . . . . . . . . 107
6.10 Some example meta analyses . . . . . . . . . . . . . . . . . . . . 109
6.11 Calculating and converting effect sizes in R . . . . . . . . . . . . 110
6.12 Conducting a meta analysis in R . . . . . . . . . . . . . . . . . . 112
6.13 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . 114
20 Endnotes 413
20.1 A parting note . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
20.2 Answers to practice questions . . . . . . . . . . . . . . . . . . . . 414
20.3 Alphabetical list of key R packages used in this book . . . . . . . 425
21 References 429
Chapter 1
Introduction
feel worried about the prospect of learning to program: persevere, and believe in
yourself. Anyone can learn to program if they put in the time and effort. It is
not magic, and it is not beyond you, but it is a skill that takes practice to get
the hang of.
we cover some techniques for graph plotting and data visualisation. Unlike many
works on graph plotting in R, we use the ‘base’ plotting functions instead of the
popular ggplot2 package. In Chapter 19, we discuss issues around reproducible
data analysis, including methods of version control, open data formats, and
automatically downloading and uploading data from public repositories.
A further caveat: this book is not meant to be the final word on any method
or technique. It is intended to be an introduction (a primer) to the methods,
and a starting point for your own reading and learning. So there will be lots
of things I don’t mention, and probably many faster or more efficient ways of
programming something. Also, the R community is very fast moving, and new
packages are being created all the time. So it is worthwhile checking online for
package updates with new features, and also for more recent packages for a given
method.
1.5 Implementation in R
Most of the methods covered in this book are not implemented in commercial
statistics packages such as SPSS and SAS. Instead, we use a statistical pro-
gramming language called R for all examples (R Core Team 2013). R has the
advantage that it is an open source language, so anyone can develop their own
packages (collections of code that implement statistical tests) for others to use. It
is now standard practice for papers describing a new statistical technique to have
an associated R package for readers to download. R is free to download, and can
be installed on most operating systems. There are also some online R clients that
can be accessed directly through a web browser without needing any installation.
I mostly used another language, Matlab, throughout my PhD and Postdoctoral
years, and many of the methods we discuss here can also be implemented in
Matlab, or other programming languages such as Python. However, in the
interests of consistency, we use a single language throughout, which is introduced
in Chapter 2. I have also made the R code for all examples and figures available
on a GitHub repository at: https://github.com/bakerdh/ARMbookOUP.
1.6 History
This book has grown from a lecture course that I developed in 2014, and
deliver twice each year at the University of York. The course is taught to
third year undergraduates, and is often audited by postgraduate students on
MSc and PhD courses. Many former students have gone on to have successful
careers in numerate areas. I try to make the material as accessible as possible,
and sometimes use props like 3D printed surfaces to demonstrate function
minimisation, and a bingo machine to explain bootstrapping! The biggest
concern students have is about whether they will understand the mathematical
content for advanced statistical methods. For this reason I deliberately avoid
equations where possible, preferring instead to explain things at a conceptual
level, and I have done the same in this book.
The topics taught on the module go beyond the core undergraduate research
methods syllabus, and so are rarely discussed in typical textbooks. Up until now,
I have mostly used tutorial papers and package manuals as the recommended
reading. But this is not ideal, and I always felt that students on the module
would benefit from a single text that presents everything in a common style.
Producing a written explanation of the course content seemed the natural next
step, and my publishers agreed! The original lecture content equates to about
half of the topics included here, with other content added that seemed related
and useful.
Chapter 2
Introduction to the R
environment
2.1 What is R?
R is a statistical programming language. It has been around since the early
1990s, but is heavily based on an earlier language called S. It is primarily an
interpreted language, meaning that we can run programs directly, rather than
having to translate them into machine code first (though a compiler which does
this is also available). Its particular strengths lie in the manipulation, analysis
and visualisation of data of various kinds. Because of this focus, there are other
tasks that it is less well-suited to - you probably wouldn’t use it to write a
computer game, for example. However, it is rapidly increasing in popularity and
is currently the language of choice for statisticians and data analysts. We will
use it exclusively throughout this book for all practical examples.
The core R language is maintained by a group of around 20 developers known
as the R Development Core Team. The language is freely available for all major
operating systems (Windows, Mac and various flavours of Linux), and the R
Foundation is a not-for-profit organisation. This means that R is an inherently
free software project - nobody has to pay to use it, and there is no parent company
making huge profits from it. This is quite different from many other well-known
programming languages and statistical software packages that you may have
come across. It means that R is available to anyone with a computer and
internet connection, anywhere in the world, regardless of institutional affiliation
or financial circumstances. This seems to me an inherently good thing.
You can download R from the R project website. It is hosted by the Compre-
hensive R Archive Network (CRAN), a collection of around 100 servers, mostly
based at Universities across the world. The CRAN servers all mirror the same
content, ensuring that if one goes down the software is still available from the
rest. The CRAN mirrors also contain repositories of R packages, which we will
discuss further in section 2.8. If you do not have R installed on your computer
and plan to start using it, now would be a good time to download it. Exactly
how the installation works depends on your computer and its operating system,
but instructions are available for all systems at the R project website.
2.2 RStudio
In parallel with the development of the core R language, a substantial amount
of work has been done by a company called RStudio. This is a public benefit
corporation that makes some money from selling things like ‘pro’ and enterprise
versions of its software, web hosting and technical support. However its primary
product, the RStudio program, is free and open source. RStudio (the program)
is an integrated development environment (IDE) for R. It has a number of user-
friendly features that are absent in the core R distribution, and is now the most
widely-used R environment (again it is available for all major operating systems).
I strongly recommend downloading and installing it from the RStudio website.
Note that RStudio requires that you already have a working R installation on
your computer, as it is a separate program that sits ‘on top of’ R itself. So
you need to first install R, and then install RStudio in order for it to work. If
you have insurmountable problems installing R on your computer, there are
now web-based versions that run entirely through a browser and do not require
installation, such as RStudio Cloud and rdrr.io, though these services may not
be free.
The RStudio company has also produced a number of very well-written and
useful R packages designed to do various things. For example, the Shiny package
can produce dynamic interfaces to R code that can run in web browsers. The
RMarkdown package can produce documents that combine text, computer code
and figures (discussed further in section 19.3). A version of it (bookdown) was used
to create the first draft of this book. Finally, a suite of tools collectively referred
to as the Tidyverse offer a uniform approach to organising and manipulating
data (essentially storing even very complex data structures in a spreadsheet-like
format). Although many introductions to R now focus on these tools, they are
advanced-level features that are not required to implement the examples in this
book, and so we will not discuss them further.
2.5, but for now note that they appear in the upper left corner of the RStudio
window.
In the upper right corner is the Environment tab, which contains a summary of
all of the information currently stored in R’s memory. This will usually be empty
when you first launch R, but will fill up with things like data sets, and the results
of various analyses. There are other tabs in the upper right corner, including the
History tab which keeps track of all commands executed in the current R session.
The Environment and History tabs can be saved to files (using the disk icon),
and also loaded back in from a previous session. The broom icon, which appears
in several panels, empties the contents of the Environment (or other section).
The lower right corner of the RStudio window contains several more tabs. The
first is the Files tab, which contains a file browser, allowing you to open scripts
and data files from within RStudio. The second tab is the Plots tab, which will
display any graphs you create. You can also export graphs from this window,
and navigate backwards and forwards through multiple graphs using the left
and right arrow buttons.
The next tab in the lower right pane is the Packages list, which contains all
of the packages of R code currently installed on your computer. We discuss
packages in more detail in Section 2.8. The Help tab is also in this section of
the window, and is used to display help files for functions and packages when
you request them.
As with most other desktop applications, there are a number of drop down
menus that allow you to do various things, such as load and save files, copy and
paste text, and so on. I use these much less frequently than in other programs,
largely because many of the functions are duplicated in the various toolbars in
the main RStudio window. Overall, the RStudio package provides a user-friendly
environment for developing and running R code.
2.5 Scripts
A script is a text file that contains computer code. Instead of entering commands
one at a time into the R console, we can type them into a script. This allows
us to save our work and return to it later, and also to share scripts with others.
Although this sounds like a simple concept, it has enormous benefits. First of
all, we have an exact record of what we did in our analysis that we can
refer back to later. Second of all, someone else could check the code for errors,
reproduce your analysis themselves and so on. This is very different from the
way most people use statistical packages with graphical interfaces like SPSS
(although technically it is possible to record the analysis steps in these packages,
researchers rarely do it in practice). Sharing scripts online through websites like
the Open Science Framework is becoming commonplace, as part of a drive for
openness and reproducibility in research, as we will discuss in greater detail in
Chapter 19.
In RStudio we can create a new script through the File menu. These traditionally
have the file extension .R - however they are just plain text files, so you can
open and edit them in any normal text editor if you want to. It is a good idea
to create scripts for everything you do in R and save them somewhere for later
reference. R works slightly differently from some other programming languages,
in that you can easily run chunks of code from a script without having to run
the whole thing. You do this by highlighting the section of code you wish to
execute (using the mouse) and then clicking the Run icon (the green arrow at
the top of the script panel). You will see the lines of code reproduced in the
Console, along with any output.
Another way to run a script is to use the source command, and provide it with
the location and file name of the script you wish to run. This will execute the
entire script with a single command, as follows:
source('~/MyScripts/script.R')
You can use the source command from the console, and also include it in a script
so that you execute the code from multiple scripts in sequence.
2.6 Data objects
a <- 10
In the above line of code, we have created a data object called a and stored the
number 10 in it. We read the <- operator as 'is given the value'. So the above
example means ‘a is given the value 10’. This data object will then appear in
the Environment pane of the RStudio window. If we want the data object to
contain a string of text, we wrap the text in inverted commas (quotes) so that R
knows it is text data, rather than a reference to another data object:
b <- 'Hello'
Once a data object has been created, we can use it in calculations, in much the
same way that mathematicians use letters to represent numbers in algebraic
expressions. For example, having stored the number 10 in the object a, we can
multiply it by other numbers as follows:
a*2 # multiply the contents of 'a' by 2
## [1] 20
Data objects can contain more than one piece of information. A list of numbers
is called a vector, and can be generated in several ways. A sequence of integers
can be defined using a colon:
numvect <- 11:20
numvect
## [1] 11 12 13 14 15 16 17 18 19 20
Or, we can combine several values using the c (concatenate) operation, which
here has the same result:
numvect <- c(11,12,13,14,15,16,17,18,19,20)
numvect
## [1] 11 12 13 14 15 16 17 18 19 20
Just as for a single value, we can perform operations on the whole vector of
numbers. For example:
numvect^2 # raise the contents of 'numvect' to the power 2
## [1] 121 144 169 196 225 256 289 324 361 400
The above code calculates the square of each value in the data object numvect
(the ^ symbol indicates raising to a power). We can also request (or index)
particular values within a vector, using square brackets after the name of the
data object. So, if we want to know just the fourth value in the numvect object,
we can ask for it like this:
numvect[4] # return the 4th value in 'numvect'
## [1] 14
If we want a range of values we can index them using the colon operator:
numvect[3:8] # just entries 3 to 8 of the vector
## [1] 13 14 15 16 17 18
And if we want some specific entries that are not contiguous, we can again use
the c (concatenate) function:
numvect[c(1,5,7,9)] # some specific entries from the vector
## [1] 11 15 17 19
Finally, we can use other data objects as our indices. For example:
n <- 6
numvect[n]
## [1] 16
Data objects can also have more than one dimension. A two-dimensional data
object is like a spreadsheet with rows and columns, and is referred to as a matrix.
We need to tell R how big a matrix is going to be so that it can reserve the right
amount of memory to store it in. The following line of code generates a matrix
with ten rows and ten columns, storing the values from 1 to 100:
d <- matrix(1:100, nrow=10, ncol=10)
d
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 11 21 31 41 51 61 71 81 91
## [2,] 2 12 22 32 42 52 62 72 82 92
## [3,] 3 13 23 33 43 53 63 73 83 93
## [4,] 4 14 24 34 44 54 64 74 84 94
## [5,] 5 15 25 35 45 55 65 75 85 95
## [6,] 6 16 26 36 46 56 66 76 86 96
## [7,] 7 17 27 37 47 57 67 77 87 97
## [8,] 8 18 28 38 48 58 68 78 88 98
## [9,] 9 19 29 39 49 59 69 79 89 99
## [10,] 10 20 30 40 50 60 70 80 90 100
Again, we can index a particular value. For example, if we want the number
from the 8th row and the 4th column, we can ask for it by adding the indices
inside the square brackets, separated by a comma:
d[8,4] # the value in row 8, column 4
## [1] 38
This is very similar to the way you can refer to rows and columns in spreadsheet
software such as Microsoft Excel, except that in R rows and columns are both
indexed using numbers (whereas Excel uses letters for columns). If we want
all of the rows or all of the columns, omitting the number (i.e. leaving a blank
before or after the comma) will request this, for example:
d[8,] # row 8 with all columns
## [1] 8 18 28 38 48 58 68 78 88 98
d[,4] # column 4 with all rows
## [1] 31 32 33 34 35 36 37 38 39 40
We can also request a range of values for rows and/or columns using the colon
operator:
d[1:3,5:7] # rows 1:3 of columns 5:7
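##      [,1] [,2] [,3]
## [1,]   41   51   61
## [2,]   42   52   62
## [3,]   43   53   63
A particularly useful type of data object is the data frame, which can combine vectors of different types (such as numbers and text) into named columns. The listing that created the example below falls in a part of the text not shown here, but based on the description and output that follow it would have looked something like this:
Position <- 1:5
Songwriter <- c('Bob Dylan','Paul McCartney','John Lennon','Chuck Berry','Smokey Robinson')
chart <- data.frame(Position, Songwriter)
chart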
## Position Songwriter
## 1 1 Bob Dylan
## 2 2 Paul McCartney
## 3 3 John Lennon
## 4 4 Chuck Berry
## 5 5 Smokey Robinson
In the above code, we first created two vectors called Position and Songwriter.
Then, we used the data.frame function to combine these together into a single
data object, that I’ve called chart. When we look at the chart object (by typing
its name), we can see that the data are organised into two columns and five rows.
The first column contains the chart position, and the second column contains
the name of the songwriter. The Environment panel will also contain the two
vectors and the data frame, as shown in Figure 2.2.
Figure 2.2: Example screenshot of the Environment panel, showing two vectors,
and also a data frame that contains both vectors.
Data frames can have as many rows and columns as you like, so they are a
very general and flexible way to store and manipulate data of different kinds.
There are numerous other classes of data object in R, and new classes can be
defined when required, so we will discuss any other data types as they come up
throughout the book.
It is important to have a good conceptual understanding of data objects, because
this is how R stores information. If you load in a data set from an external file,
this will be stored in a data object. You might, for example, have a spreadsheet
file containing questionnaire responses, which you could load into R and store in
a data frame called questionnaire. Similarly, when you run a statistical test, the
results will typically be stored in a data object as well. If you run a t-test (see
Chapter 4 for details of how to do this), this will produce a data object that you
might call ttestresults, and will contain the t-statistic, the p-value, the degrees of
freedom, and lots of other useful information.
2.7 Functions
The real power of high-level programming languages is the use of functions. A
function is a section of code that is wrapped up neatly. It accepts one or more
inputs, and produces one or more outputs. Just like data objects, every function
also has a name (subject to similar restrictions, in that they cannot start with a
number, or contain any reserved characters or spaces). Indeed, any functions
that you create will also appear in the Environment, and can be considered
a type of object. There are many hundreds of functions built into R, which
do different useful things. A very simple example is the mean function, which
calculates the mean (average) of its inputs. The following code calculates the
mean of the numbers from 1 to 10:
mean(1:10)
## [1] 5.5
Notice that the inputs to the function are inside of the brackets after the
function name. Functions can also take data objects as their inputs, and store
their outputs in other data objects. Earlier on, we stored the numbers from 11
to 20 in a data object called numvect. So we can pass these values into the mean
function, and store the output in a new object as follows:
averageval <- mean(numvect)
averageval
## [1] 15.5
You can find out more about a function using the help function, and passing it
the name of the function you are interested in:
help(mean)
In RStudio the documentation will appear in the Help panel in the lower right
corner of the main window.
It is possible to define new functions yourself to do things that you find useful.
To show you the syntax, here is a very simple example of a function that takes
three numbers as inputs. It adds the first two together, and divides by the third:
addanddivide <- function(num1,num2,num3){
  output <- (num1 + num2)/num3
  return(output)
}
addanddivide(7,5,3)
## [1] 4
With these inputs (7, 5 and 3) we get the output four, because (7+5)/3 = 4.
But now that we have created the function, we can call it as many times as we
like with any inputs. You can also call functions from within other functions,
meaning that operations can be nested within each other, producing sequences
of arbitrary complexity.
2.8 Packages
Sets of useful functions with a common theme are collected into packages. There
are many of these built into the basic R distribution, which you can see by
clicking on the Packages tab in the lower right panel of the main RStudio window
(an example is shown in Figure 2.3a). You can also install new packages to
perform specific functions. Those meeting some basic quality standards are
available through CRAN (at time of writing over 10,000 packages), but it is also
possible to download other packages and install them from a package archive
file. In RStudio, the package manager has a graphical interface for installing and
updating packages (click the Install button in the Packages tab). The window
that pops up (see Figure 2.3b) allows you to install any packages by typing the
package name into the dialogue box. However you can also install packages
from the console or within a script using the install.packages command. For
example, the following code will download and install the zip package used for
compressing files:
install.packages('zip')
Once installed, packages need to be activated before they are visible to R and
their functions become available for use. This can be done manually by clicking
the checkbox in the packages list (see examples in Figure 2.3a - the Matrix and
methods packages are active), or using the library function in the console or in a
script. For example, we can activate the zip package that we just downloaded
like this:
library(zip)
Figure 2.3: Example screenshot of the Packages panel (a) and the Install Packages
dialogue box (b).
We will use several packages throughout the book, and some of the more
important ones are summarised in section 20.3.
if (a==1){
  print('a is equal to 1')
}
We can extend this so that something different happens when the condition is not met, by adding an else clause:
if (a==1){
  print('a is equal to 1')
} else {
  print('a is not equal to 1')
}
This form is known as an if...else statement, and can even be extended with
many other conditional statements as follows:
a <- -1
if (a>0){
  print('a is a positive number')
} else if (a<0) {
  print('a is a negative number')
} else {
  print('a is zero')
}
The logical statements can involve calls to various functions. Two useful ones
are the is.na and is.infinite functions. These check if data are classed as not
a number (for example if values are missing, or are the result of undefined
operations such as taking the square root of -1), or are infinite values. The functions return TRUE or FALSE
values, which are interpreted appropriately by the if statement. These functions
are useful for preventing operations that will cause a script to crash, for example
if a missing or infinite number is used in a calculation.
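As a brief illustration (the objects x and y here are hypothetical examples, not taken from the text), such checks might look like this:
x <- sqrt(-1)   # produces NaN, with a warning
y <- 1/0        # produces Inf
if (is.na(x)) {
  print('x is not a number, so skip any calculation that uses it')
}
if (is.infinite(y)) {
  print('y is infinite, so skip any calculation that uses it')
}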
2.10 Loops
Another powerful feature of programming languages is the loop. A loop instructs
the computer to repeat the same section of code many times, which it will
typically do extremely fast. The simplest type of loop is the for loop. This
repeats the operations inside the loop a fixed number of times. For example, the
following code will print out the word ‘Hello’ ten times:
for (n in 1:10){
  print('Hello')
}
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
The terms in the brackets (n in 1:10) define the behaviour of the loop. They
initialise a counter, which can be called anything, but here is called n. The value
of n increases by one each time around the loop, between the values of 1 and
10. We could change these numbers to span any range we like. The instructions
within the curly brackets {print('Hello')} tell R what to do each time around
the loop.
We can also incorporate the value of the counter into our loop instructions,
because it is just a data object containing a single value. In the following
example, we print out the square of the counter:
for (n in 1:10){
  print(n^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
A really useful trick is to use the counter to index another data object. In the
following example, we store the result of an equation in the nth position of the
data object called output:
output <- NULL # create an empty data object to store the results
for (n in 1:10){
  output[n] <- n*10 + 5
}
output
## [1] 15 25 35 45 55 65 75 85 95 105
Of course we can have as many lines of code inside of a loop as we like, and
these can call functions, load in data files and so on. It is also possible to embed
loops inside of other loops to produce more complex code structures.
There are two other types of loop available in R, called while and repeat loops.
These do not always repeat a fixed number of times. Their termination criteria
are defined so that the loop exits when certain conditions are met. For example,
a while loop will continue repeating as long as a particular conditional statement
is satisfied. Below is a while loop that adds a random number to a counter object
on every iteration. The loop continues while the counter value is less than five,
and exits when the value of the counter is greater than (or equal to) five. If you
run this code several times, the loop will repeat a different number of times on
each execution.
counter <- 0
while (counter<5){
  counter <- counter + runif(1)
  print(counter)
}
## [1] 0.7136377
## [1] 1.014727
## [1] 1.62777
## [1] 2.279565
## [1] 3.093874
## [1] 3.776449
## [1] 4.106962
## [1] 4.554587
## [1] 5.174488
A repeat loop is very similar, except that the condition for terminating the
loop (tested with an if statement that calls break) is evaluated at the end of the loop rather than at the start (as
with a while loop). These are used less frequently but are appropriate in some
circumstances.
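For illustration, the while loop above could be rewritten as a repeat loop (a minimal sketch), with the exit condition checked at the end of each iteration:
counter <- 0
repeat {
  counter <- counter + runif(1)
  print(counter)
  if (counter >= 5) break   # condition is evaluated at the end of the iteration
}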
Whereas if you want to treat the first row as data values (and specify your
column names separately), you would enter:
data <- read.csv('filename.csv',header=FALSE)
The spreadsheet contents will be stored as a data frame in the Environment (the
memory) of R. As noted above, there are so many R packages now that you will
be most likely able to find a function to read in virtually any file format you
need, even including specialist data formats like MRI images.
A helpful feature of RStudio is the Import Dataset option from the File menu.
This provides a graphical interface for importing data that is stored in widely-
used formats (including Excel and SPSS). What is especially clever is that the R
code for loading in the data is automatically generated and sent to the console.
This means you can copy the code into a script so that loading data is automated
in the future. In Chapter 3, we will discuss how to inspect and clean up data
that you have imported.
Function Description
mean Calculates the average of a vector of numbers
sd Calculates the standard deviation of a vector of numbers
sqrt Calculates the square root of its inputs
nrow Returns the number of rows in a matrix
ncol Returns the number of columns in a matrix
dim Returns the dimensions of a data object
rowMeans Calculates the mean of each row in a matrix
colMeans Calculates the mean of each column in a matrix
seq Generates a sequence of numbers with specified spacing
rep Generates a repeating sequence of numbers
abs Calculates the absolute value (removes the sign)
sign Returns the sign of the input (-1, 0 or 1)
pmatch Partial matching of strings
which Returns the indices of items satisfying a logical condition
is.nan Returns TRUE for any values that are not a number
is.infinite Returns TRUE for any values that are infinite
apply Applies a function over specified dimensions of a matrix
%% Returns the remainder (modulus) for integer division
%/% Returns the quotient (the bit that’s not the remainder)
unique Removes duplicate values from a vector
paste Combines two or more strings into a single string
and that they should magically know the answer to their question already. This
is not true. Everyone uses search engines to find out how to do something, or to
remind themselves the name of a function, or the specific syntax they need. I
do this all the time - almost everything in this book I have worked out how to
do by reading about it online. Expert professional programmers do it all the
time as well. Often if you ask someone who is a more experienced programmer
for help, they will actually just search for the answer on the internet. This is
such a truism that there is a whole sub-genre of internet memes about how
programmers all have to Google things all the time. There is no shame in this -
it’s the best way to learn.
a <- a + 4
}
if (a>10){print('Bananas')}
if (a<10){print('Apples')}
if (a==10){print('Pears')}
if (a==12){print('Oranges')}
newer tidyverse framework offers functions called gather and spread which fulfil
analogous roles as part of the tidyr package. Since requirements will differ
depending on the idiosyncrasies of a data set, the help documentation for these
functions will likely prove useful if your data needs restructuring. A final useful
function for restructuring data is the transpose function, t. This function turns
rows into columns, and columns into rows, much more straightforwardly than
can be done using spreadsheet software, where copying and pasting with special
options is required.
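As a minimal illustration of the transpose function:
m <- matrix(1:6, nrow=2)   # a matrix with 2 rows and 3 columns
t(m)                       # the transpose has 3 rows and 2 columns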
The entry to the breaks argument defines the start and end points of each bin.
The hist function will add up the total number of values in the data set between
the lower and upper boundaries of each bin. To define these boundaries, the
seq function generates a sequence of numbers between two points, here with a
specified length. For example:
seq(-3,3,length=11)
## [1] -3.0 -2.4 -1.8 -1.2 -0.6 0.0 0.6 1.2 1.8 2.4 3.0
So this line of code generates a sequence of numbers between -3 and 3, with 11
evenly spaced values. The hist function will then use this sequence to bin the
data. Note that there will always be one less bin than the length of the sequence,
and that the mid-points of the bins are the means of successive pairs of break
points.
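For example, assuming the vector of values is stored in an object called data (as in the density example below), a call of the following form would produce a histogram with ten bins between -3 and 3:
hist(data, breaks = seq(-3, 3, length = 11))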
A popular alternative to showing histograms with discrete bins is to plot a
smoothed kernel density function. The smoothing can sometimes obscure discon-
tinuities in the data, but smoothed functions are often more visually appealing.
An example is shown in Figure 3.1c, created using the density function as follows:
a <- density(data)
plot(a$x,a$y,type='l',lwd=2)
The histograms in Figure 3.1a-c show data that are approximately normally
distributed, and have no obvious outliers. But real data are often much less clean.
Figure 3.1: Example histograms. Panel (a) has ten bins, and panel (b) has
20 bins. Panel (c) shows a kernel density function, and panel (d) includes two
extreme outliers.
For example, in Figure 3.1d there are two outlier points, at x-values of 4.5 and
6.1, as indicated by the arrows. These data points are far outside of the range
of values in the rest of the sample, and are likely to have been caused by some
kind of error. In Section 3.4 we will discuss some methods for identifying such
outliers more objectively, and also consider how to replace them in an unbiased
way.
Histograms are informative regarding the shape of a distribution, and will usually
make any outliers clear. However it is often also helpful to plot each individual
data point using a scatterplot. Usually the term scatterplot makes us think
of bivariate plots such as that shown in Figure 3.2b, which show two variables
plotted against each other. But if we have only a single dependent variable, it is
still important to inspect the individual data points. We can use a univariate
scatterplot to do this, in which the x-position is arbitrary, as shown in Figure
3.2a. This plot shows every individual observation on some measure (shown on
the y-axis), but within a given condition the x-position of a data point is not
informative.
Figure 3.2: Examples of univariate (a) and bivariate (b) scatterplots. Kernel
density functions are shown for each measure along the margins of panel (b)
(grey curves).
The scatterplot in Figure 3.2b contains three outliers, which are highlighted in
blue. These values are not particularly remarkable in either their x- or y-values,
so the grey kernel density histograms along the margins do not reveal them.
However they clearly differ from the rest of the population (grey points), and
might therefore be considered to be outliers. Both of the graphs in Figure
3.2 were produced using the generic plot functions (see Chapter 18 for more
detailed discussion of creating plots in R), though alternative methods are also
available, for example using the geom_point function in the ggplot2 library.
Remember that the code to produce all figures is also available in the book’s
GitHub repository at: https://github.com/bakerdh/ARMbookOUP.
Figure 3.3: Illustration of the inner and outer fences, and the interquartile range.
The x-axis is scaled in standard deviation (sigma) units, relative to the mean.
The cloud of grey data points show representative data sampled from the black
population distribution, but also features several outliers (blue points). The
vertical dotted lines, horizontal bar, and blue shaded region of the distribution
illustrate the interquartile range, between which 50% of data points lie. The
dashed lines and error whiskers indicate the inner fence, which extends 1.5 times
the IQR above and below Q1 and Q3 respectively. The outer fence extends 3
times the IQR above and below Q1 and Q3, and is shown by the solid vertical
lines.
and in some conventions non-outlying points are not plotted at all. Sometimes
extreme outliers (beyond the outer fence) use a different symbol from near
outliers (between the inner and outer fences). Rudimentary boxplots can be
created automatically using the boxplot function from the graphics package in R.
For example, the following line of code produces the boxplot in Figure 3.4:
boxplot(ndata)
Figure 3.4: Example rudimentary boxplot, generated using the boxplot function.
threshold criterion as 3 times the standard deviation of the sample. To work out
which values exceed this distance from the mean, we subtract the mean from
each data point, and for convenience take the absolute values (i.e. any negative
numbers become positive). Any values in this normalised data set that exceed
the criterion value will be picked up as outliers. We can find the indices of these
values using the which function.
data <- rnorm(100) # generate some random data
data[57] <- 100 # replace the value in row 57 with an outlier
criterion <- 3*sd(data) # calculate 3 times the standard deviation
normdata <- abs(data-mean(data)) # subtract the mean and take the absolute value
which(normdata>criterion) # find the indices of any outlier values
## [1] 57
Two more formal variants of this approach include Chauvenet’s criterion (Chau-
venet 1863) and Thompson’s Tau (Thompson 1985). These methods use the
total sample size to determine the threshold for deciding that a given data point
is an outlier, based on properties of either the normal (Gaussian) distribution or
the T-distribution.
The procedure for Chauvenet’s criterion is to convert all values to absolute
z-scores (subtracting the mean and scaling by the standard deviation) and
again identifying data points that exceed a criterion. This time the criterion is
calculated by taking the quantile of the normal distribution at 1/(4N), where
N is the sample size. This means that the criterion becomes more stringent
(i.e. larger) as the sample size increases, because with larger samples we expect
a greater number of extreme values. A rudimentary implementation is given by
the following function:
d_chauv <- function(data){
  i <- NULL                              # initialise a data object to store outlier indices
  m <- mean(data)                        # calculate the mean of the data
  s <- sd(data)                          # calculate the standard deviation
  Zdata <- abs(data-m)/s                 # convert data to absolute z-scores
  dmax <- abs(qnorm(1/(4*length(data)))) # determine the criterion
  i <- which(Zdata>dmax)                 # find indices of outliers
  return(i)
}
The above function uses the qnorm function to estimate the appropriate quantile,
and the which function to return the indices of any outlier values. Using this
function with the example data from above (where we added an outlier at entry
57) again correctly identifies the outlier:
d_chauv(data)
## [1] 57
The modified Thompson’s Tau method is conceptually similar, except that the
critical value is obtained from the T-distribution using a given α value, and then
converted to the tau statistic with the equation:
\tau = \frac{t_{\alpha/2}\,(N-1)}{\sqrt{N}\,\sqrt{N-2+t_{\alpha/2}^{2}}} \qquad (3.1)
where tα/2 is the critical t-value with N -2 degrees of freedom using the significance
criterion α (typically α = 0.05), and N is the sample size. Another difference is
that the tau method is iterative, with only the value that deviates most from
the mean being compared with the critical value on each iteration. The τ value
is recalculated on each iteration, using only the data remaining after removing
outliers on previous iterations.
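A minimal sketch of this iterative procedure is given below. The function name thompson_tau and its return value (the indices of rejected data points) are illustrative choices rather than part of any standard package:
thompson_tau <- function(data, alpha = 0.05){
  rejected <- NULL                  # indices of outliers found so far
  keep <- seq_along(data)           # indices of the remaining data
  repeat {
    N <- length(keep)
    if (N < 3) break
    tcrit <- qt(1 - alpha/2, df = N - 2)                        # critical t-value
    tau <- (tcrit * (N - 1)) / (sqrt(N) * sqrt(N - 2 + tcrit^2))
    devs <- abs(data[keep] - mean(data[keep]))                  # deviations from the mean
    worst <- which.max(devs)                                    # most extreme remaining point
    if (devs[worst] > tau * sd(data[keep])) {
      rejected <- c(rejected, keep[worst])                      # flag it as an outlier
      keep <- keep[-worst]                                      # remove it before the next iteration
    } else break                                                # no further outliers
  }
  return(rejected)
}
Applied to the example data from earlier (with the artificial outlier in entry 57), this sketch should again flag entry 57.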
Figure 3.5 illustrates the results of simulations that show how sensitive each
test is for detecting a single outlier of known value. The Thompson’s Tau test
identifies the known outlier with a similar sensitivity to methods that reject
values exceeding 1.5 times the sample standard deviation. The Chauvenet
criterion is more conservative, and similar to rejecting values exceeding about
2.8 times the standard deviation. Note that these values are dependent on the
sample size, for example the Chauvenet criterion becomes more conservative as
sample size increases.
One distinct danger posed by the availability of several different outlier detection
algorithms is that they provide an experimenter with hidden degrees of freedom
in their analysis. An unscrupulous researcher could easily engineer a desired
result by choosing an outlier rejection algorithm after their data have been
collected. This is highly unethical and strongly discouraged. To avoid such
issues, analysis plans should ideally be preregistered before the data have been
collected, and must detail how outliers will be handled. In general, a criterion of
around 2.5 or 3 standard deviations is usually appropriate for univariate data,
though it is worth checking standard practice in a given research area.
[Figure 3.5 appears here: the proportion of outliers detected (y-axis) as a function of outlier position, in SD from the true mean (x-axis), for the Tukey, Thompson and Chauvenet methods and for 1, 2 and 3 SD criteria.]
distance for a given data point is scaled by the variance of the data set in the
direction of that data point. This allows us to take into account any correlations
between the two (or more) dependent variables.
To understand how this works, consider first the data set shown in Figure 3.6a.
Here, the two variables are uncorrelated, and the variable plotted on the x-axis
has a smaller variance than the one plotted on the y-axis. The two white outlier
points are both the same Euclidean distance from the centroid of the data (black
point), because the blue vector lines are the same length. But if we express their
distance in standard deviation units along the appropriate axis, the square outlier
is clearly a more extreme outlier than the triangle outlier (because the standard
deviation is larger in the y direction). This is what the Mahalanobis distance
calculates, and for this example the square point would have a Mahalanobis
distance of around D = 6, and the triangle point a distance of around D = 3.
So, the square is a more extreme outlier than the triangle when expressed with
this metric, which fits with our intuitions.
Figure 3.6: Illustration of outliers in two dimensional (bivariate) data sets. The
black point is the centre of mass, and the white triangle and square symbols are
outliers. The blue lines depict the Euclidean distances between the centroid and
each outlier, and are of identical length.
Next let’s consider the data set shown in Figure 3.6b. This time, there is a
clear correlation between the two variables. Again, the two outliers are the same
Euclidean distance from the centroid. As before, the Mahalanobis distance is
greater for the square point, because the variance is estimated in the direction of
the blue vector line. This is somewhat harder computationally than calculating
the standard deviation for one or other dependent measure. Happily, we do not
need to perform these calculations by hand. The mahalanobis function is part of
the core stats package in R. Imagine we have a data frame containing around
500 rows of bivariate data (from Figure 3.2b) structured as follows:
head(bdata)
## x y
## 1 0.685092067 0.6039170
## 2 -0.005550195 0.2395705
## 3 -0.777641329 -1.0976698
## 4 1.875702830 1.5417293
## 5 -0.377129105 0.3195294
## 6 -0.454686991 0.3052273
We can pass this data frame into the mahalanobis function, along with the
means of each variable (calculated inline using the colMeans function), and
the covariance matrix (calculated inline using the cov function). The function
returns a squared distance estimate for each x-y pair of points:
D2 <- mahalanobis(bdata, colMeans(bdata), cov(bdata))
D2[1:6]
[Figure appears here: a histogram of Mahalanobis D2 values (x-axis: Mahalanobis D2; y-axis: Frequency), with the outliers visible as a separate cluster at high D2 values.]
in a data object that are classed as ‘not a number’. This is especially useful as
missing values loaded in from spreadsheets are assigned NA by default. We can
combine the is.na function with the which function to return the indices of the
missing values.
To give an example, suppose we have a data set called nandata comprising 10
values, 3 of which are missing:
nandata <- c(4, 6, NA, 1, 7, 8, NA, 3, NA, 5)
nandata
## [1] 4 6 NA 1 7 8 NA 3 NA 5
The is.na function will return TRUE for the missing values, and FALSE for the
others:
is.na(nandata)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
Wrapping the which function around this will return the indices of the NA values,
at positions 3, 7 and 9:
which(is.na(nandata))
## [1] 3 7 9
One way to remove the NA values is to invert the output of the is.na function
using the ! operator (making TRUE become FALSE, and vice versa), to return
only the rows that contain real numbers:
trimmeddata <- nandata[which(!is.na(nandata))]
trimmeddata
## [1] 4 6 1 7 8 3 5
A related function is the is.infinite function, which returns TRUE for positive
and negative infinity - these often appear if values are inadvertently divided by
zero. However the which function can take any logical argument, so it can also
be used to restrict values to within a certain range, for example:
trimmeddata[which(trimmeddata<5)] # return only values < 5
## [1] 4 1 3
It might be useful to do this if one’s data have a natural range. For example,
many real-world measurements such as height, weight, heart rate and so on, must
take on positive values. If some sort of error has resulted in negative values for
some observations of these types of variables, it would be reasonable to consider
these observations as out of range.
Finally, we can combine these logical statements with other functions, such as
the mean, sum or sd functions. To calculate the mean of the nandata object,
excluding the NA values, we can nest the which statement inside the indexing of
the variable as follows:
mean(nandata[which(!is.na(nandata))])
## [1] 4.857143
Although this is a good general solution, it can be rather cumbersome. In
some functions, there is an alternative implementation specifically for NA values
known as the na.rm flag. By setting this to TRUE, we instruct the function to
remove the NA values automatically. The mean, sum and sd functions (along
with other core functions) have this facility, for example:
mean(nandata, na.rm=TRUE)
## [1] 4.857143
Notice that this returns the same result as the more involved solution. Not all
functions have an na.rm flag, but you can check the help files to work out which
ones do.
The above tools are extremely useful for cleaning up data and dealing with
missing or problematic values. They can be used in combination with several of
the other outlier detection criteria that were described in Sections 3.4 - 3.4.3.
However this leads us to a more philosophical question of what we should be
doing with outliers and missing values in the first place.
analysis of variance (ANOVA), for which a single missing value usually requires
complete removal of a participant.
Alternatively, techniques exist to estimate (impute) plausible values to replace
those that are missing, which can make some statistical models more stable.
These include mean imputation (replacing a missing value with the sample
mean), Winsorization (replacing an outlying value with a neighbouring value),
and clipping (setting extreme values to a prespecified maximum or minimum
threshold). If you have a solid rationale for replacing missing or outlying values,
and have ideally specified this in advance through preregistration, then these
methods avoid many of the shortcomings of outlier deletion.
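For instance, mean imputation and clipping each take only a line of base R (a sketch, assuming a numeric vector x and a prespecified limit called upperlimit):
x[is.na(x)] <- mean(x, na.rm=TRUE)   # mean imputation: replace missing values with the sample mean
x[x > upperlimit] <- upperlimit      # clipping: set extreme values to a prespecified maximum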
Finally we can decide to leave outliers in a data set. This might affect our choice
of statistical test, as data sets with substantial outliers are unlikely to meet the
assumptions of parametric statistics. We could instead use methods that make
fewer assumptions, such as nonparametric alternatives, or the ‘bootstrapping’
approach we will introduce in Chapter 8. There is also a class of statistics
called robust statistics that are designed to be used with noisy data. A simple
example of a robust statistic is the trimmed mean, in which some percentage of
extreme values is removed from the data set, and the mean calculated using the
remainder. For example, a 10% trimmed mean would involve rejecting the lowest
and highest 10% of values from a data set, and using the remaining 80% of
values to estimate the mean. Other variants forego the assumption that data are
normally distributed, and instead use other distributions such as t-distributions,
which have longer tails (this approach is also common in Bayesian statistics, see
Kruschke 2014). A full discussion of robust statistics is beyond the scope of the
current text, but the interested reader is referred to the book Robust Statistics
by Peter Huber (2004).
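As an aside, the built-in mean function can calculate trimmed means directly through its trim argument; for a vector x, a 10% trimmed mean is given by:
mean(x, trim = 0.1)   # discard the lowest and highest 10% of values before averaging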
It is important to remember that the units of rescaled data will not be the same
as the units of the original measurements. One way to think of normalized data
is as being akin to z-scores, where each data point is expressed in standard
deviation units. The univariate scatterplots in Figure 3.8 illustrate the effect of
rescaling a data set.
Figure 3.8: Illustration of rescaling using an example data set (black points).
The dark blue points show centering, with no change in variance. The grey
points illustrate scaling by the standard deviation - this also affects the mean,
but does not centre the mean on 0. Finally, the light blue points show the effects
of centering and then scaling.
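The three versions shown in Figure 3.8 correspond to different ways of calling R's built-in scale function, sketched here for an arbitrary example vector x (not one of the data sets from the text):
x <- rnorm(50, mean=10, sd=3)                  # some example data
centred <- scale(x, center=TRUE, scale=FALSE)  # centre on zero, leave the variance unchanged
scaledonly <- x / sd(x)                        # scale by the standard deviation without centering
zscores <- scale(x)                            # centre and scale (z-scores)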
with outliers, this is still not the case? If this happens, it is sometimes possible
to transform the data by applying mathematical operations to the full data set.
The most common of these are logarithmic transforms, which pull in the long
tail of a positively skewed distribution, and squaring, which has a similar effect
on negatively skewed distributions (exponential transforms can also be used for
dealing with negative skew). Examples of how each of these transforms can
make skewed data conform more closely to a normal distribution are shown in
Figure 3.9. A deeper discussion on the advantages of data transforms is given
by Bland (2000).
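As a brief example, the data1 object tested below came from a lognormal distribution; the exact generating code and sample size are not shown in the text, so re-running something like the following will give slightly different test statistics to those printed below:
data1 <- rlnorm(100)     # positively skewed (lognormal) sample
hist(data1)              # long upper tail
hist(log(data1))         # approximately normal after a log transform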
As many introductory statistics texts will explain in detail, there are two main
tests to assess whether data are consistent with a normal distribution. These
are the Kolmogorov-Smirnov test (Smirnov 1948; Kolmogorov 1992) and the
Shapiro-Wilk test (Shapiro and Wilk 1965). The Kolmogorov-Smirnov test
involves comparing the cumulative distribution functions of the data and a
reference distribution (i.e. a normal distribution), and is implemented in R’s
built in stats package by the ks.test function. For example:
ks.test(data1,'pnorm',mean(data1),sd(data1))
##
## One-sample Kolmogorov-Smirnov test
##
## data: data1
## D = 0.17671, p-value = 0.003879
## alternative hypothesis: two-sided
Note that the ks.test function requires as inputs the sample of data and the name
of the cumulative distribution you wish to compare it to (pnorm is a cumulative
normal). By default the data will be compared to a distribution with a mean of
0 and a standard deviation of 1. So it is necessary to either provide the actual
mean and standard deviation of your data as additional arguments (as above),
or to first normalize your data using the scale function (see Section 3.7):
ks.test(scale(data1),'pnorm')
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(data1)
## D = 0.17671, p-value = 0.003879
## alternative hypothesis: two-sided
The data we have tested here were generated using a lognormal distribution
(the grey distribution in the top panel of Figure 3.9), so the test is signifi-
cant, indicating a deviation from normality. As with most assumption tests, a
non-significant Kolmogorov-Smirnov would mean that there was no significant
difference between the data and reference distribution, so we could assume the
54 CHAPTER 3. CLEANING AND PREPARING DATA FOR ANALYSIS
Figure 3.9: Examples of data transforms. The upper panel shows some positively
skewed data (grey), and a more normal distribution following a log transform
(blue). The lower panel shows some negatively skewed data (grey), and a more
normal distribution following a squaring transform (blue).
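The Shapiro-Wilk test is run with the shapiro.test function from the built-in stats package; the output below was presumably produced by a call of the form:
shapiro.test(data1)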
##
## Shapiro-Wilk normality test
##
## data: data1
## W = 0.886, p-value = 3.285e-07
Again, a significant result (p < 0.05) implies a deviation from normality. Notice
that the p-value for the Shapiro-Wilk test is generally smaller than that for the
Kolmogorov-Smirnov test using the same data, illustrating its greater power.
However, most implementations of Shapiro-Wilk are limited to samples of less
than 5000 data points, whereas the Kolmogorov-Smirnov test has no such
restrictions.
Finally, Q-Q plots can be very informative in visually assessing and understanding
deviations from normality. These are constructed by plotting the quantiles of the
data against the quantiles of a reference distribution (e.g. a normal distribution).
The qqnorm function generates a plot, and the qqline function adds a reference
line. Examples are shown in Figure 3.10 for normally distributed data (left) and
positively skewed data (right). Note the substantial deviation from the diagonal
reference line at the upper end of the skewed distribution.
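A sketch of the commands used to produce such a plot (here applied to the data1
object from above) would be:
qqnorm(data1)   # plot the sample quantiles against normal quantiles
qqline(data1)   # add a reference line through the quartiles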
Typically, one would use tests of normality and Q-Q plots on a raw data set
first. If this deviates from normality, it is worth trying an appropriate transform,
and then running the normality test again. Unfortunately, even after applying
a transform, some data sets simply cannot be manipulated sufficiently to meet
the assumptions of parametric statistics. This might rule out the use of some
tests; however, there are also many non-parametric alternatives that can be
used instead, just like for data sets with outliers. In general, non-parametric
methods involve rank-ordering a data set, and performing calculations on the
rankings rather than the raw data. This avoids any problems with outliers and
deviations from normality, but at the expense of statistical power (the ability of
a test to detect a true effect). Common examples are the Spearman correlation
coefficient, the Mann-Whitney U test, and Friedman’s ANOVA, all of which are
implemented in R. Non-parametric methods are not the main focus of the current
text, however in Chapter 8 we will discuss bootstrap resampling techniques,
which can be used as a flexible non-parametric approach to statistical testing.
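As a rough sketch (the variable names here are placeholders rather than objects
defined in the text), the tests mentioned above can be called as follows:
cor.test(x, y, method='spearman')                    # Spearman correlation
wilcox.test(scores ~ group, data=mydata)             # Mann-Whitney U (Wilcoxon rank-sum) test
friedman.test(score ~ condition | id, data=mydata)   # Friedman's ANOVA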
Figure 3.10: Example Q-Q plots for normally distributed data (left) and skewed
data (right).
To demonstrate how factors work, let’s create a data object containing sex
information for seven rats:
ratsex <- c('M','M','F','M','F','F','F')
ratsex
By default, this data object contains a list of character strings. We can convert
it to a factor using the factor function:
ratfactor <- factor(ratsex)
ratfactor
## [1] M M F M F F F
## Levels: F M
Now we see that R has defined two levels for the factor, F and M. An integer
code is also assigned to each level, by default in alphabetical order. We can see
these numerical values using the as.numeric function:
as.numeric(ratfactor)
## [1] 2 2 1 2 1 1 1
Notice that all values of M are coded as 2, and all values of F are coded as 1. If
we wanted a particular ordering of the numerical values associated with each
level of the factor, we can specify this when we create the data object using the
factor command:
ratfactor <- factor(ratsex, levels = c('M','F'))
as.numeric(ratfactor)
## [1] 1 1 2 1 2 2 2
Now we have coded M as 1, and F as 2. What if we wanted to change the labels
from ‘M’ and ‘F’ to ‘male’ and ‘female’? We can do this using the levels function
as follows:
levels(ratfactor)[levels(ratfactor)=='M'] <- 'male'
levels(ratfactor)[levels(ratfactor)=='F'] <- 'female'
ratfactor
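An equivalent approach (a brief sketch, not taken from the text) is to supply the
labels argument when the factor is first created:
ratfactor <- factor(ratsex, levels = c('M','F'), labels = c('male','female'))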
3.10 Putting it all together - importing and cleaning some real data

The example data set for this section comes from an online (Qualtrics) survey.
The experiment was quite lighthearted - it was a general knowledge quiz with
ten questions. Before seeing the questions, participants rated their own general
knowledge ability on a scale from 0-100. The idea was to see if these ratings
predicted actual performance in the quiz.
We can load in the data using the read.csv function (the data can be downloaded
from the book’s GitHub repository), and take a look at the first few rows as
follows:
quizdata <- read.csv('data/qualtricsexample.csv')
head(quizdata)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11
## 1 84 D A C A A A B A C C
## 2 77 C D A A B C A A C
## 3 51 D C D B A D D A A A
## 4 83 B A D A C A C A A A
## 5 88 C A A D A A B C A A
## 6 68 A B A B A C D A C A
After restructuring, the first column contains the ratings from 0 - 100. Let’s take
a closer look at these values using a histogram, shown in Figure 3.11a as follows:
Figure 3.11: Histograms and scatterplot for the example qualtrics data. Panel
(a) shows the histogram of self-ratings of general knowledge, panel (b) shows
the histogram of actual quiz performance, and panel (c) shows the correlation
between the two measures.
hist(quizdata$Q1)
Two features are clear from the histogram: there are two outlier points with a
value of 0, and overall the distribution looks negatively skewed. Perhaps the
ratings of 0 were genuinely participants who thought they had very poor quiz
ability. But it could also be that 0 was the default rating on the scale used, and
these participants did not change it for whatever reason. We can use the code
from earlier in the chapter to identify data points that are more than 3 standard
deviations from the mean as follows:
criterion <- 3*sd(quizdata$Q1) # calculate 3 times the standard deviation
normdata <- abs(quizdata$Q1-mean(quizdata$Q1)) # subtract the mean and take the absolute value
which(normdata>criterion) # find the indices of any outlier values
## [1] 21 30
This code identifies the participants in rows 21 and 30, and indeed these are the
ones that produced ratings of 0:
quizdata$Q1[c(21,30)]
## [1] 0 0
Given our concerns about the possibility that these participants did not use the
rating scale correctly, and their distance from the rest of the scores, we might
be justified in removing them from the data set. We can do this using another
which statement, but this time to include only the participants whose ratings
are less than 3 standard deviations from the mean, as follows:
quizdata <- quizdata[which(normdata<criterion),]
To see whether the data are normally distributed, we can run the Shapiro-Wilk
test as follows:
shapiro.test(quizdata$Q1)
##
## Shapiro-Wilk normality test
##
## data: quizdata$Q1
## W = 0.96021, p-value = 0.08514
Although the data show some evidence of negative skew, the test is (just)
non-significant (p = 0.085), so we can proceed assuming a normal distribution.
Next we can look at the answers to the quiz questions themselves. These are
all four-option multiple choice questions, with the answers stored as A,B,C and
D. When we loaded in the data, R converted the responses to factors. However,
some of the questions have not been answered by certain participants, and we
have some missing values. Just as we might in an exam, we will mark the
questions with missing data as incorrect.
To score the quiz, we will use two loops (see section 2.10), one inside the other.
The outer loop will run through each participant in turn, and the inner loop
will run through each question for a given participant. For this example, we will
assume that the correct answer for each question was ‘A’, and we will count up
the number of questions each participant got right and store this in a new data
object called quizscores:
# make a list of zeros to store the scores
quizscores <- rep(0,nrow(quizdata))
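The nested loops themselves might look something like the following sketch (it assumes,
as described above, that the quiz answers are stored in columns 2 to 11 of quizdata and
that the correct answer is always 'A'):
for (p in 1:nrow(quizdata)){       # outer loop: one pass per participant
  for (q in 2:11){                 # inner loop: one pass per question
    # missing answers are never counted, so they are effectively marked incorrect
    if (!is.na(quizdata[p,q]) && quizdata[p,q]=='A'){
      quizscores[p] <- quizscores[p] + 1
    }
  }
}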
We can inspect a histogram of these scores, again using the hist function, as
shown in Figure 3.11b, and also run the Shapiro-Wilk test:
hist(quizscores)
shapiro.test(quizscores)
##
## Shapiro-Wilk normality test
##
## data: quizscores
## W = 0.95898, p-value = 0.07544
Despite the somewhat unusual form of the data, with evidence of a floor effect
at the lower end, the Shapiro-Wilk test is again non-significant. Finally, we
can inspect a scatterplot of the two variables plotted against each other, to
see if there is evidence that participants were able to predict their own general
knowledge ability:
plot(quizdata$Q1,quizscores,type='p')
The graph produced by the above code is shown in Figure 3.11c. It does appear
to be the case that individuals who rated their ability more highly also obtained
generally higher test scores, which we might go on to test using correlation, or
other statistical tests described in Chapter 4. We can also check for multivariate
outliers using the Mahalanobis distance as follows:
bothscores <- data.frame(quizdata$Q1,quizscores)
D <- mahalanobis(bothscores,colMeans(bothscores),cov(bothscores))
sort(round(D,digits=2))
## [1] 0.02 0.18 0.18 0.23 0.49 0.51 0.51 0.51 0.61 0.64 0.72 0.74 0.74 0.79 0.90
## [16] 0.99 1.01 1.04 1.07 1.12 1.17 1.19 1.30 1.30 1.32 1.35 1.47 1.63 1.66 1.72
## [31] 1.84 1.86 1.89 2.21 2.21 2.30 2.47 2.52 2.55 3.11 3.15 3.34 3.46 3.91 3.96
## [46] 3.98 4.36 4.57 5.21 5.71 8.31
By sorting the distances using the sort function, we can see that the largest
value is 8.31, which does not exceed our criterion of D² = 9 (or D = 3).
The Qualtrics quiz data is a deliberately simple example - most data sets
would involve multiple conditions, experimental manipulations, or independent
variables. However this has hopefully given a good indication of how we can
import and clean a data set in preparation for further analysis.
Chapter 4

Statistical tests as linear models
Considered from this perspective, we will see that many statistical tests involve
fitting a model to try and explain our data. Usually the model assumes that
our measurements (known as the dependent variable) can be predicted to some
extent by one or more other factors (known as independent variables). We
can then compare the fit of this statistical model to a ‘null model’, in which
those other factors do not predict our measurements. If our model explains the
data better than the null model (according to some criterion), we consider it to
be statistically significant (we will discuss these assumptions in more detail in
Chapter 17). The clearest way to demonstrate the model comparison approach is
by starting with linear regression (where it is most explicit), and then applying
the same logic to other situations.
4.2 Regression (and correlation)

The example data set for this chapter (bamboodata) contains measurements of diameter
at breast height (DBH) and culm height for moso bamboo, based on Yen (2016). The
first few rows look like this:
## DBH culmheight
## 1 6.39 7.92
## 2 6.19 8.39
## 3 6.88 8.03
## 4 6.69 8.74
## 5 6.48 9.06
## 6 7.18 8.33
It is clear from Figure 4.1 that the data are highly correlated, which we can
confirm by calculating a correlation coefficient by passing the data to the cor
function:
cor(bamboodata)
## DBH culmheight
## DBH 1.0000000 0.9314495
## culmheight 0.9314495 1.0000000
Figure 4.1: Linear regression between diameter at breast height and culm height
for moso bamboo, based on Figure 1 of Yen (2016).
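The statistical significance of the correlation can be assessed with the cor.test
function; this is the call that produces the output below:
cor.test(bamboodata$DBH, bamboodata$culmheight)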
##
## Pearson's product-moment correlation
##
## data: bamboodata$DBH and bamboodata$culmheight
## t = 13.545, df = 28, p-value = 8.122e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8596580 0.9671648
## sample estimates:
## cor
## 0.9314495
The output gives us the same correlation coefficient on the final line (along with
95% confidence intervals just above), and also calculates a t-statistic and p-value
to assess statistical significance (on the fifth line). Notice that the p-value is
expressed in scientific notation as 8.122e-14. This is how R represents very
small numbers, in this case it means 8.122 × 10−14 , or 0.00000000000008122 (the
easiest way to think of this is that you shift the main number by the value given
after the e - here 14 places to the right).
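A quick way to see this in R (a small illustration, not code from the text):
format(8.122e-14, scientific = FALSE)   # prints the same number written out in full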
In regression, we want to fit a linear model (i.e. a straight line) that allows us
to predict values of the dependent variable (culm height) using the values of
the independent variable (DBH; note that in this example these are both things
that have been measured, and so DBH might not meet a strict criterion for being
an independent variable because it will involve measurement error, but this is
just an example). To do this, we will use the lm (linear model) function in R.
The lm function (as well as other related functions, including those for running
ANOVA) uses a syntax to specify models that has the general form DV ~ IV.
The tilde symbol (~) means is predicted by. In other words, we’re saying that the
dependent variable is predicted by the independent variable. For our example,
we want to run the model culmheight ~ DBH, and we will also tell the lm function
the name of the data frame containing our data (bamboodata). Finally, we will
store the output of the model in a new data object called bamboolm. We do this
with a single line of code, and then have a look at the output using the generic
summary function:
bamboolm <- lm(culmheight ~ DBH, data=bamboodata)
summary(bamboolm)
##
## Call:
## lm(formula = culmheight ~ DBH, data = bamboodata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26249 -0.50913 -0.02087 0.45275 1.43376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27726 0.71223 1.793 0.0837 .
## DBH 1.12081 0.08274 13.545 8.12e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6488 on 28 degrees of freedom
## Multiple R-squared: 0.8676, Adjusted R-squared: 0.8629
## F-statistic: 183.5 on 1 and 28 DF, p-value: 8.122e-14
The output tells us what we have done (repeating the function call), and then
gives us a table of (unstandardised) coefficients. These are the intercept (1.28,
given in the Estimate column for the (Intercept) row) and the slope of the fitted
line (1.12, given in the Estimate column for the DBH row). We can use these to
plot a straight line with the equation y = β0 + β1 x (where β0 and β1 are the
model coefficients that correspond to the y-intercept and gradient), which in
this case is culmheight = 1.28 + 1.12*DBH. That is the line that is shown in
Figure 4.1, and which gives an excellent fit to the data. If we need to obtain
standardised regression coefficients we can either standardise the data first (see
section 3.7), or run the lm.beta function from the QuantPsyc package on the
model output object (e.g. lm.beta(bamboolm) for the above example).
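For example (assuming the QuantPsyc package is installed):
library(QuantPsyc)   # provides the lm.beta function
lm.beta(bamboolm)    # standardised (beta) regression coefficients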
The summary table also provides some other useful statistics. There is an
R-squared value (R² = 0.87) telling us the proportion of the variance explained,
and an overall F-statistic (F = 183.5) and p-value for the regression model. The
p-value is the same as the one for the correlation, and for the DBH coefficient
(in the Coefficients table), because we only have one predictor. For multiple
regression models the table will indicate the significance of each predictor, and the
F-statistic will tell you about the full model. We are also given some information
about the residuals (the left over error that the model can’t explain) which can
be used to check the assumptions of the test (more on this in section 4.5).
So that’s how to run a straightforward linear regression in R. But to set things
up for the rest of this chapter, we should dig a little bit deeper into what linear
regression is actually doing. We have fitted a line to describe our data, but
what does the p-value indicate? In regression, we are effectively comparing two
different models. In the null model, the line has a slope of 0, which means there
is no effect of the value of DBH on culmheight. The best we can do to predict
culmheight is to use the overall mean (sometimes known as an intercept-only
model). In the alternative model, the slope of the line is allowed to vary to try
and fit the data better. Usually in regression we don’t bother to show both lines,
but it’s worth making them explicit so that you can see the difference - they are
plotted in Figure 4.2.
Figure 4.2: Null and alternative model fits for the bamboo data. The null model
has a slope of 0, the alternative model can have any slope value. Thin vertical
lines show the residuals between model and data.
In both panels of Figure 4.2 I have also added some thin vertical lines, that join
the (thick black) fitted line to each individual data point. These are called the
residuals - they are the error between the data and the model prediction. One
way of thinking about residuals is that they represent how well the model (thick
line) is able to describe the data (points). If the fit is poor, the residual lines will
be long (as in the null model). If the fit is good, the residual lines will be short
(as in the regression model). The proportion of the variance explained (R²) and
the statistical comparison between the null and alternative models are based on
the lengths of these lines (though we will not go into further details here about
precisely how this works). I think of it as the left over variance (i.e. differences
between points) that the models cannot explain. The p-value in regression is
really telling us whether the alternative model can explain significantly more of
the total variance than the null model.
4.3 T-tests
It is rarely made explicit in introductory texts that this basic idea, of assessing the
fits of two models by comparing the left over variances, is also what is happening
in other tests such as t-tests. The t-test is used to compare the means of two
groups to see if they differ from each other. Usually, this is explained as taking
the mean difference and dividing by the pooled standard error (which is derived
from the pooled standard deviation and the sample sizes). To demonstrate, the bamboo
data can be split into two groups - narrow and wide - based on the DBH value.
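The code that creates this grouping variable might look something like the following
sketch (splitting at a DBH of 8.5, as described below, and converting the labels to a
factor):
bamboodata$sizegroup <- rep(0, nrow(bamboodata))
bamboodata$sizegroup[which(bamboodata$DBH < 8.5)] <- 1   # narrow culms
bamboodata$sizegroup[which(bamboodata$DBH > 8.5)] <- 2   # wide culms
bamboodata$sizegroup                                     # inspect the numeric labels
bamboodata$sizegroup <- as.factor(bamboodata$sizegroup)  # treat as categorical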
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
The which function in the above code returns the indices of the DBH vector
that satisfy the conditional statement (e.g. it tells us which entries in the DBH
vector are less than 8.5, or greater than 8.5). The as.factor command tells R
that the data should be treated as categorical (factor) labels for the purposes of
conducting statistical tests (see section 3.9). The data with the group split are
plotted in Figure 4.3.
In R, we can then run a t-test using the t.test function. The standard way to do
this is to split the culmheight data into two separate data objects for the narrow
and wide groups, and then plug them into the t-test function:
group1 <- bamboodata$culmheight[bamboodata$sizegroup==1]
group2 <- bamboodata$culmheight[bamboodata$sizegroup==2]
t.test(group1, group2, var.equal=TRUE)
##
## Two Sample t-test
##
## data: group1 and group2
## t = -10.016, df = 28, p-value = 9.298e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.66897 -2.42303
## sample estimates:
## mean of x mean of y
## 9.267333 12.313333
Figure 4.3: Culm height data split into narrow and wide groups by DBH value.
The 5th line of the output gives us a large t-value (-10), and a very small p-value,
indicating a highly significant group difference. An alternative syntax would be
to use the same formula structure that we used for regression, which the t.test
function also accepts. This time we are predicting the height values using group
membership, so the appropriate formula is culmheight ~ sizegroup:
t.test(culmheight ~ sizegroup, data=bamboodata, var.equal=TRUE)
##
## Two Sample t-test
##
## data: culmheight by sizegroup
## t = -10.016, df = 28, p-value = 9.298e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.66897 -2.42303
## sample estimates:
## mean in group 1 mean in group 2
## 9.267333 12.313333
This is a different way of achieving exactly the same result, and you can see that
the outcomes are identical. But, we could also run the test explicitly as a linear
model (using the lm function), again with the sizegroup variable as the predictor.
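The call, which can be read from the Call line of the output that follows, is:
summary(lm(culmheight ~ sizegroup, data=bamboodata))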
##
## Call:
## lm(formula = culmheight ~ sizegroup, data = bamboodata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35333 -0.74333 0.06967 0.79667 1.18667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2673 0.2150 43.09 < 2e-16 ***
## sizegroup2 3.0460 0.3041 10.02 9.3e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8329 on 28 degrees of freedom
## Multiple R-squared: 0.7818, Adjusted R-squared: 0.774
## F-statistic: 100.3 on 1 and 28 DF, p-value: 9.298e-11
The output looks different from the output of the t-test function, as it has the
same layout as the regression output we saw in the previous section. But you
can see that the values of the t-statistic and p-value in the table of coefficients
are exactly the same as the ones we got from the t-test function (the minus sign
is missing from the t-statistic, but this is arbitrary anyway because it depends
on the order in which the groups are entered).
Now, this consistency across methods prompts us to think about the t-test in
the context of regression. Just like with regression, we can conceptualise the
t-test as comparing two models. The null model is one in which the means do
not vary with group, given by the horizontal black line in the left panel of Figure
4.4. The alternative model is one where the means can vary with group, given
by the diagonal black line in the right panel of Figure 4.4.
Just as with regression, we can calculate the residual error between each data
point and the accompanying model prediction for its group. The model prediction
for the null model is the grand mean (horizontal black line). The model prediction
for the alternative model is the group mean for each condition. Then we
compare the two model fits statistically to see if the alternative model describes
significantly more of the variance than the null model. This is another way of
thinking about what a significant t-test means: conceptually it is exactly the
same as linear regression.
Figure 4.4: T-tests conceptualised as a comparison between a null model where the
group means do not differ (left) and an alternative model where they do (right).
4.4 ANOVA
Finally, we can extend the same regression logic to analysis of variance (ANOVA),
where the independent variable has more than two levels. In the bamboo paper by
Yen (2016), the data set is split into 5 groups by DBH value, in 1cm increments
as shown in the two graphs in Figure 4.5.
Again, we can think of ANOVA as comparing a null model where the predicted
values are not affected by group (left panel), with an alternative model where the
predicted values change across group (right panel). Notice that the alternative
model involves specifying four separate lines (thick lines joining the means),
which can have different slopes. This is why the number of degrees of freedom
for the independent variable in a one-way ANOVA is always one less than the
number of levels. To conduct the ANOVA in R, we can use the aov function
(the DBHgroup column contains the groupings):
anovamodel <- aov(culmheight ~ DBHgroup, data=bamboodata)
summary(anovamodel)
Or we can achieve the same result using the linear model (lm) function:
Figure 4.5 (described above): culm height plotted against DBH group (1 to 5), showing the null model in the left panel and the alternative model, with lines joining the group means, in the right panel.
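The lm call, again readable from the Call line of the output below, is:
summary(lm(culmheight ~ DBHgroup, data=bamboodata))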
##
## Call:
## lm(formula = culmheight ~ DBHgroup, data = bamboodata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2283 -0.5837 0.1033 0.4988 1.5800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5850 0.3010 28.520 < 2e-16 ***
## DBHgroup2 0.9733 0.4257 2.286 0.031 *
## DBHgroup3 2.1850 0.4257 5.133 2.64e-05 ***
## DBHgroup4 3.6100 0.4257 8.480 7.99e-09 ***
## DBHgroup5 4.2583 0.4257 10.003 3.19e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7373 on 25 degrees of freedom
## Multiple R-squared: 0.8473, Adjusted R-squared: 0.8229
## F-statistic: 34.68 on 4 and 25 DF, p-value: 7.277e-10
Table 4.1: Table summarising how common statistical tests can be implemented
in R, using both a dedicated function and a linear model. The terms DV and
IV (IV1, IV2) are assumed to be column names of the dependent (DV) and
independent (IV) variables in a data object called ‘dataset’, and also to exist
as independent data objects. The term ID indicates a participant identification
variable for repeated measures tests. The lme function is part of the nlme package,
and the ezANOVA function is part of the ez package. Note that alternative
implementations for more complex designs can produce different results, and do
not necessarily test appropriate assumptions, or make the same corrections.
Test Generic function call
One-sample t-test t.test(DV)
Independent t-test t.test(DV[which(IV==1)], DV[which(IV==2)], var.equal=TRUE)
Dependent (paired) t-test t.test(DV[which(IV==1)], DV[which(IV==2)], paired=TRUE)
Linear regression summary(lm(DV ~ IV, data=dataset))
Multiple regression summary(lm(DV ~ IV1 + IV2, data=dataset))
One-way independent ANOVA summary(aov(DV ~ IV, data=dataset))
One-way repeated measures ANOVA ezANOVA(dataset, dv=DV, wid=ID, within=IV)
Factorial independent ANOVA summary(aov(DV ~ IV1 * IV2, data=dataset))
Factorial repeated measures ANOVA ezANOVA(dataset, dv=DV, wid=ID, within=c(IV1,IV2))
Mixed design ANOVA ezANOVA(dataset, dv=DV, wid=ID, within=IV1, between=IV2)
Notice that the F-statistic in the ANOVA summary table, and the final line
of the regression output, are identical (F = 34.68). This is because the un-
derlying calculations for the tests we call ANOVAs, t-tests and regressions are
fundamentally the same thing (a linear model).
Table 4.1 provides example R functions for popular tests, using both the generic
function and the linear model form. Traditionally, we would use regression when
our independent variable is continuous, a t-test when it is discrete with two
levels, and ANOVA when it has more levels. But as Table 4.1 illustrates, these
separate tests are really all part of the wider family of the general linear model,
and can all be implemented within the same framework.
4.7 Practice questions

B) Bamboo growth
C) Crop yields
D) Counting students
2. To test the effect of age on brain volume, the appropriate linear model
formula would be:
A) age ~ brainvolume
B) brainvolume ~ age
C) age - brainvolume
D) brainvolume - age
3. In R, the function to run a t-test is called:
A) ttest
B) t-test
C) t.test
D) Ttest
4. In regression, the residuals indicate:
A) The total variance in the data set
B) The differences between each pair of data points
C) The amount of the variance explained by a model
D) The error between the data and the model fit
5. The null hypothesis produces a model line with a slope of:
A) 1
B) -1
C) 0
D) It depends on the data
6. The alternative hypothesis produces a model line with a slope of:
A) 1
B) -1
C) 0
D) It depends on the data
7. For a one-way ANOVA with three levels, how many regression coefficients
would we expect (not including the intercept)?
A) 1
B) 2
C) 3
D) 4
8. The as.factor function is used to:
A) Define a dependent variable
B) Turn numeric data into text
C) Tell R that a data object is categorical
D) Round a number so that it is an integer
9. For a categorical independent variable with four levels, which R functions
could you use to analyse the data?
A) lm or t.test
B) lm or aov
C) aov or t.test
D) lm only
Chapter 5

Power analysis
Cohen's d is defined as:

d = (x̄1 − x̄2) / σ,    (5.1)
where x̄1 and x̄2 are the group means, and σ is the pooled standard deviation.
This is a standardized score, conceptually similar to the z-score, but for means
rather than individual observations. Because the denominator is the standard
deviation (and not the standard error), the effect size is independent of sample
size, although effect size estimates do become more accurate as sample size
increases. Other related effect sizes include Hedges' g and Glass's δ, which
slightly vary the denominator term. For multivariate statistics, the Mahalanobis
distance (Mahalanobis 1936) extends Cohen’s d to the multivariate case (see
section 3.4.3).
As a heuristic, Cohen (1988) suggested that effect sizes (d) of 0.2, 0.5 and 0.8
correspond to small, medium and large effects respectively. Let’s think about how
this applies to some hypothetical data sets. Imagine we want to know if tortoises
can run faster than hares. We time a group of hares and a group of tortoises
running along a racetrack. In the first race, the mean times are 57 seconds for
the hares, and 108 seconds for the tortoises. The difference (51 seconds) seems
large. But when we look at the raw data, we see that the individual animals are
quite variable in how long they take - maybe some of them get distracted eating
grass, whereas others are more on task. We calculate the standard deviation
as being 200 seconds, meaning that our effect size is d = (108-57)/200 = 0.26 -
quite a small effect according to Cohen’s heuristics.
Next suppose we re-run the race but we remove all of the distractions so that
the animals stay focussed. The mean times are rather shorter overall, 32 seconds
for the hares, and 80 seconds for the tortoises. Notice that the mean difference
is about the same as it was before - 48 seconds this time. But this time the
standard deviation is much smaller, at just 54 seconds. The smaller standard
deviation means that our effect size ends up being much larger: d = (80-32)/54
= 0.89. So even though the raw difference in means has stayed the same, the
precision of the measurement has improved.
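The two calculations above can be checked directly in R:
(108 - 57) / 200   # first race: d is about 0.26, a small effect
(80 - 32) / 54     # second race: d is about 0.89, a large effect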
One way to think about Cohen’s d is to consider the underlying population
distributions implied by different values of the statistic. Figure 5.1 shows four
pairs of distributions, with various mean differences and standard deviations.
The figure illustrates that one can increase d either by increasing the difference
between the group means, or by reducing the variance (spread) of the distri-
butions. Although in many situations these properties are fixed, and we must
simply measure them as best we can, in some experimental contexts it may be
possible to influence them to increase an effect size and boost statistical power.
For example, to increase the effect of a drug or intervention, one could apply a
higher drug dose, or a longer intervention duration. In section 5.7 we will discuss
how collecting more data for each individual in a study can often reduce the
overall standard deviation.
Figure 5.1 (panels labelled d = 1, d = 2, d = 2 and d = 4): pairs of population distributions with different mean separations and spreads.
Figure 5.2 shows the results of a simulation (see Chapter 8 for how to run such simulations) in which t-tests were run on 1000 synthetic data sets, each
comprising N=10 participants. The true (generative) effect size is given by the
white diamond, and the individual points are the estimated effect sizes for each
data set. Grey points below the horizontal line are non-significant, and blue
points above it are significant. If we imagine that only the significant ‘studies’
were published (an effect known as publication bias), we might estimate a mean
effect size around d = 1 (blue diamond), much higher than the true effect size of
d = 0.5 (white diamond).
If the study design has a larger sample size (N = 50), the estimates of effect
size become more accurate and regress to the mean. The spread of effect sizes
becomes tighter about the true value, and most repetitions return an effect size
close to the actual effect size (see right panel of Figure 5.2). Note that because
of the larger sample size, the power is higher, and a smaller observed effect
size is required for statistical significance (i.e. the horizontal line moves down).
Effect size estimates from underpowered studies should therefore be treated with
caution because they are more likely to overestimate the true effect size.
Figure 5.2: Simulations to demonstrate effect size inflation resulting from un-
derpowered studies. 1000 data sets were generated using a mean of 1 and a
standard deviation of 2, for 10 participants per data set (left) or 50 participants
per data set (right). Points above the horizontal lines indicate significant effects,
and points below the lines indicate non-significant effects. White diamonds are
the true effect sizes, and blue diamonds are effect size estimates calculated only
from significant studies. The position of each point along the x-axis is arbitrary.
One way around this problem is meta analysis. We will cover meta analysis in more detail in Chapter 6, but in brief it is a technique
for calculating the average effect size across a number of studies. Because this
increases the overall power, the effect size estimate is likely to be more accurate.
Note that the effects of publication bias can still influence meta analyses, as
non-significant results are less likely to be available for inclusion in the analysis.
When conducting novel or exploratory research, there may not be any suitable
studies on which to base our effect size estimates. One common solution is to
run a pilot study with a smaller sample. This is commonplace in clinical trials,
where the eventual sample size is very large indeed (hundreds or thousands
of participants), and large cost savings can potentially be made by running a
smaller scale pilot study first, perhaps on a few dozen participants. Although
for many lab-based studies this might not be practical, piloting a new experi-
mental paradigm is always worthwhile if possible, as it provides much additional
information besides a possible effect size.
If there really is no existing data to estimate the likely effect size, one can
use Cohen’s heuristics for small, medium or large effect sizes to perform power
calculations. An important concept is the smallest effect size of interest. The
idea here is that effects smaller than some value would be of no theoretical or
practial importance. For example, if a drug treatment had an effect size of d
= 0.01, it would provide no meaningful benefit to patients and would not be
worth the expense of developing. So it might be practical to power a study
to detect a larger effect size that we think would be clinically meaningful. Of
course, as expected effect sizes get smaller, the sample size required to achieve
adequate power will increase, and there is a balance to be struck with practical
considerations around resource allocation for a given study. A useful way to
think about these issues is to plot power curves, as we describe in the next
section.
Figure 5.3: Power curves for different combinations of sample size and effect size.
The criterion for statistical significance was fixed at 0.05 in all cases.
The right panel of Figure 5.3 shows analogous functions for fixed sample sizes
as a function of effect size. These echo the results of the left panel, showing for
example that a sample of 20 individuals (per group) can only detect effect sizes
of d > 0.9 with satisfactory power. Such large effect sizes are unusual in most
areas of life science research (indeed, many consider large effects to be trivial
and not worth investigating at all), yet many published studies across diverse
disciplines tend to have samples around this size. Similarly, effect sizes in the
small-to-medium range will always have low power with double-digit sample
sizes. A consequence of all of this is that many studies are underpowered.
where σw and σb are the within and between participant standard deviations,
and k is the number of trials (D. Baker et al. 2021). Because the sample standard
deviation appears on the denominator of the effect size equation for Cohen’s d
(equation (5.1)), running more trials on each individual increases effect size, and
therefore drives up statistical power. Of course this is only meaningful when
the within-participant variance is high compared to the between-participants
variance, but this appears to be the case for many experimental paradigms in
psychology and neuroscience, and by extension other areas of the life sciences.
A useful way to assess the joint impact of trial number (k) and sample size (N)
on statistical power is to produce a two-dimensional contour plot, as shown in
Figure 5.4. Each curve illustrates the combinations of N and k that lead to a
given level of statistical power. Researchers can therefore optimise their study
design by trading off these factors - if individual participants are hard to recruit,
each one could be tested for longer. If individual participants are plentiful, each
one could be tested for less time. A Shiny application to generate power contours
is available at: https://shiny.york.ac.uk/powercontours
Figure 5.4: Power contour plot. Curves show combinations of N and k that
give a constant level of statistical power. This example assumes a true group
mean of 1, within-participant standard deviation of 10, and between participants
standard deviation of 2.
5.9 Post-hoc power analysis
It is possible to calculate the power of a study that has already been conducted. This is known
as post-hoc power analysis, or observed power, and is available as an option in
commercial statistics packages such as SPSS. Indeed, this is the method that
has been used to estimate the level of power in particular areas of the literature.
It is also sometimes used to interpret null results - determining whether an effect
was likely to be non-significant because a study was underpowered. However,
there are some theoretical concerns with interpreting observed power (Hoenig
and Heisey 2001). Most of these centre around the fact that observed power is
inversely proportional to the p-value (with low p-values equating to high power).
In other words, a non-significant result is likely to have low power because it
is non-significant. This means that calculating observed power provides no
additional information beyond a properly reported p-value and is therefore
misleading. A more fruitful approach to interpreting null results is offered by
Bayesian statistics, as discussed in Chapter 17.
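Power analysis for a two-sample t-test uses the pwr.t.test function from the pwr
package; the output below corresponds to a call of the form:
library(pwr)                                    # power analysis functions
pwr.t.test(d=0.5, sig.level=0.05, power=0.8)    # solve for the required sample size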
##
## Two-sample t test power calculation
##
## n = 63.76561
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
The output returns all of the values we have just entered, and also tells us that
we will require a sample size of N = 63.77. Of course it is not practical to
test 0.77 of a participant, so we always round up to the nearest whole number.
Therefore a sample size of N = 64 per group is required (so N = 128 in total).
Variants for one-sample and paired t-tests are also available, as detailed in the
help.
A similar function can conduct power analysis for correlations. Let’s say we
want to know the smallest correlation coefficient that can be detected with a
power of 0.8 and a sample size of 30 participants.
pwr.r.test(n=30, power=0.8, sig.level=0.05)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 30
## r = 0.4866474
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
This output tells us that a correlation coefficient of r = 0.49 or larger can be
detected with 80% power.
Next let’s look at power calculations for a one-way ANOVA. Here we use a
different measure of effect size, called f (also known as Cohen’s f ). Note that
importantly, this is very different from the F-ratio usually reported in an ANOVA
summary table. The effect size f is closely related to d, such that f = d/2. It is
calculated by taking the standard deviation across the population (i.e. group)
means and dividing it by their pooled standard deviation (i.e. across participants).
We also need to know how many groups there are in the study design (the input
k). Let’s calculate the power of a study design with an effect size of f = 0.1, and
N = 30 participants in each of five groups:
pwr.anova.test(k=5, n=30, f=0.1, sig.level=0.05)
##
## Balanced one-way analysis of variance power calculation
##
## k = 5
## n = 30
## f = 0.1
## sig.level = 0.05
## power = 0.1342476
##
## NOTE: n is number in each group
This design has a very low power indeed, only 0.13.
We can use these functions to produce power curves, such as those shown in
Figure 5.3 by entering a range of effect sizes or sample sizes in a loop (see
section 2.10). For example, we can produce a very instructive power curve for
correlations as follows:
N <- 4:100
r <- NULL
for (n in 1:length(N)){
output <- pwr.r.test(n=N[n], power=0.8, sig.level=0.05)
r[n] <- output$r
}
plot(N,r,type='l',lwd=3,xlim=c(0,100),ylim=c(0,1))
Figure 5.5: Curve showing the minimum correlation coefficient that can be
detected at 80% power, as a function of sample size.
This curve (plotted in Figure 5.5) shows the minimum r value that can be
detected with 80% power at a range of sample sizes. Even studies with N = 100
cannot reliably detect small correlations where r < 0.25.
Power calculations for more complex and sophisticated designs, or those using
statistical techniques not covered by the pwr package, are best done by simulation.
An excellent introduction to power analysis by simulation is given by Colegrave
and Ruxton (2020), and we will discuss some of the stochastic methods required
for this in Chapter 8.
5.11 Practice questions
C) 0.98
D) 1.00
9. What is the smallest effect size (w) that can be detected using a Chi-
squared test with 12 participants, 10 degrees of freedom and a power of
0.5 (assume an alpha level of 0.05)?
A) 1.16
B) 0.91
C) 0.88
D) 0.99
10. The function pwr.f2.test calculates power for factorial ANOVAs using the
general linear model. Assuming numerator and denominator degrees of
freedom of 2 and 12, what is the smallest effect size that can be detected
with 80% power and alpha of 0.05?
A) 0.83
B) 24.4
C) 0.69
D) 0.60
Answers to all questions are provided in section 20.2.
Chapter 6
Meta analysis
Meta analysis is a method for combining the results of several studies computa-
tionally. Usually, it is some measure of effect size (see section 5.2) that we choose
to combine, such as Cohen’s d, or the r value from a correlation. Of course,
the simplest way to do this is just to average the effect sizes from a bunch of
studies that all measure the same thing. But often there are differences in study
design, sample size, and other features, that make a straightforward average
inappropriate. Imagine combining three studies. Two of them are high quality,
testing hundreds of participants using state-of-the-art methods. The other is a
rather shoddy affair that should probably never have been published in the first
place. It would hardly seem fair to give them all equal weight in our calculations.
The tools of meta analysis allow us to take factors such as this into account.
The main outcome of a meta analysis is an aggregate effect size estimate, which
is used to determine whether, on the balance of evidence, a real effect exists.
A number of trials of corticosteroid treatment had been published, but the evidence from reading them individually appeared mixed. For this
reason, corticosteroids were not routinely prescribed in cases of premature birth.
In 1990 a meta analysis was published (Crowley, Chalmers, and Keirse 1990)
that showed a clear benefit of the drugs (a reduction in mortality of 30-50%),
and their use became mainstream clinical practice.
There are two ways of interpreting this story. On the one hand, many thousands
of babies suffered and died unnecessarily during the years when the evidence
supporting corticosteroid use was available but had not been synthesised together.
On the other hand, over the past three decades, many thousands of babies have
been treated using this method, and many lives have been saved. Either way, the
importance of meta analysis is clear - unambiguous answers to medical questions
can save lives.
The corticosteroid example led to the creation in 1993 of the Cochrane Collab-
oration, an international charity organisation dedicated to coordinating meta
analyses on a range of topics. These are freely available in the Cochrane Library
(https://www.cochranelibrary.com/). Most of the Cochrane meta analyses are on
medical topics, including a substantial number on mental health and psychiatric
conditions. They do not focus only on medications - many analyses are concerned
with dietary and lifestyle factors, and other therapeutic techniques. The logo of
the Cochrane Collaboration is a stylised version of the corticosteroid data.
In addition to medical reviews, the tools of meta analysis can be applied to
other topics, including basic experimental laboratory science. These might be
less obviously life saving, but they have become increasingly important in recent
years for establishing whether reported effects are robust. This has led to some
interesting conclusions about entire subfields of research, and is an important
aspect of the replication crisis being widely discussed in many fields. Overall, a
meta analysis should represent the strongest form of evidence on a particular
topic, as it synthesises all of the available data in a systematic and quantitative
way.
The literature search (stages 1-4) can be succinctly summarised using a PRISMA
diagram, which we will introduce in the next section. The remainder of the
chapter will mostly focus on the final two stages, as these comprise the numerical
and computational parts of the process.
Figure 6.1: Example PRISMA diagram, reporting the number of studies included
and excluded at each stage of a literature review.
6.4 Different measures of effect size

Many types of data are well described by effect size measures based on differences in means
(such as Cohen’s d), or those indicating the proportion of the overall variance
explained by some predictor (e.g. the correlation coefficient r, and the ANOVA
effect size measure η²).
In much of the clinical literature, other types of effect size are common, which
you will come across if you read materials on meta analysis. These are based on
the concepts of risk and odds, which are important ideas to know about. They
are used most often for dichotomous (binary) data, which have obvious relevance
in medicine - is the patient dead or alive; are they infected or cured? In fact any
type of data can be arbitrarily made dichotomous, for example by deciding on a
criterion or cut off. For example, continuous measurements of blood pressure
can be categorised into high and low blood pressure groups by choosing some
threshold (currently 120/80 for stage 1 hypertension). So the risk and the odds
can in principle be calculated for any type of data, though this should only be
done when it is a theoretically meaningful thing to do.
The risk is defined as the number of events divided by the sample size. This is a
familiar concept - if one out of every thousand people get a particular disease,
the risk is 1/1000 or 0.001. The odds is very closely related, but subtly different.
It is the number of events divided by the number of non-events. So, in the
example of one in a thousand people, the odds would be 1/999, which will be
very similar indeed to the risk. However the numbers start to diverge as events
become more common. Consider a condition that affects half of a sample of 100
people: the risk will be 50/100 = 0.5, but the odds will be 50/50 = 1. A risk
score can never exceed 1, but an odds score can take on any positive number.
Figure 6.2 shows how the risk and odds diverge as events become more common.
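As a quick check of these definitions (using the 100-person example above):
events <- 50; n <- 100
risk <- events / n              # 50/100 = 0.5
odds <- events / (n - events)   # 50/50  = 1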
Although raw risk and odds scores are sometimes clinically meaningful, in clinical
trials it is more common to report the risk ratio or the odds ratio. These are the
ratios of risk or odds values comparing a treatment group and a control group
(e.g. the risk for the treatment group, divided by the risk for the control group).
They will tell you, for example, how much a treatment or drug changes your risk
of some outcome, such as recovering from a disease. Because these are ratios of
event counts, they will always be positive numbers, and a value of 1 will always
mean there is no difference between the treatment and control groups. However,
whether values above or below 1 indicate a positive outcome will depend on
exactly what is being measured. An odds ratio of 3 might be good news if it
means a drug makes you more likely to recover from an illness, but very bad
news if it makes you more likely to have a heart attack!
Lastly, you will also see that some studies report the log odds ratio. This is just
a log transform of the odds ratio, which is a sensible thing to do given the range
of possible odds ratios. After the log transform, odds ratios >1 will have positive
values, and odds ratios <1 will have negative values. Ratios in general are often
more appropriately represented in log units, and thinking ‘logarithmically’ is
something that gets easier with practice.
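A small worked sketch (the counts here are hypothetical) shows how these quantities
relate to each other:
# 10/100 events in the treatment group, 20/100 in the control group
risk_ratio <- (10/100) / (20/100)    # 0.5: the treatment halves the risk
odds_ratio <- (10/90) / (20/80)      # about 0.44
log_odds_ratio <- log(odds_ratio)    # negative, because the odds ratio is below 1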
Figure 6.2: Comparison of odds and risk scores for events of different probabilities,
assuming a total sample of 100. The right hand panel shows the log transform
of the same values.
Recall the definition of Cohen's d:

d = (x̄1 − x̄2) / σ    (6.1)
where x̄1 and x̄2 are the group means, and σ is the pooled standard deviation.
If these values are reported in a source paper, we can use them to calculate d. If
the means and standard deviation are not available, we can also convert from a
correlation coefficient (r):
d = 2r / √(1 − r²),    (6.2)

or from a t-statistic and its degrees of freedom:

d = 2t / √df    (6.3)
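These conversions are easy to implement as small helper functions (a sketch; the
function names are our own rather than from any package):
r_to_d <- function(r) 2*r / sqrt(1 - r^2)   # equation 6.2
t_to_d <- function(t, df) 2*t / sqrt(df)    # equation 6.3
r_to_d(0.5)       # a correlation of 0.5 corresponds to d of about 1.15
t_to_d(2.5, 30)   # t = 2.5 with 30 degrees of freedom gives d of about 0.91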
In section 6.11 we will discuss how to convert between effect sizes and com-
mon statistics using R. However tools also exist online to perform these cal-
culations, for example a number of tools are available at the website: http://www.psychometrica.de/effect_size.html (Lenhard and Lenhard 2016).
Figure 6.3: Example forest plot of corticosteroid data from Crowley, Chalmers,
and Keirse (1990), available as part of the rmeta package.
If the confidence intervals from one study do not overlap the mean effect size for another study, the
two studies can be considered to have different effect sizes. If our data meet
parametric assumptions, the confidence intervals are usually calculated using the
approximation 1.96*SE, though they can also be derived by bootstrap resampling
(see Chapter 8) if the original data are available.
At the foot of the plot is the summary effect - this is the grand average effect
size across all studies. It is traditionally represented by a diamond. The middle
of the diamond corresponds to the mean, and the left and right corners are the
95% confidence intervals. The effect is deemed to be significant if the error bars
do not overlap the line of no effect, as is clearly the case for the example in
Figure 6.3. As mentioned before, the grand average effect size is not simply the
arithmetic mean of the individual studies. To understand how it is calculated,
we need to discuss the concept of weighted averaging.
Consider taking the ordinary mean of three numbers (2, 7 and 3):

mean = (x1 + x2 + x3) / 3 = (2 + 7 + 3) / 3 = 4
Another way of thinking about this is to assume that each number has a weight of
1, which it is multiplied by, with the denominator being the sum of the weights:
wmean = (ω1×x1 + ω2×x2 + ω3×x3) / (ω1 + ω2 + ω3) = (1×2 + 1×7 + 1×3) / (1 + 1 + 1) = 4
In this example, because the weights are all set to 1, the end result is the same.
But if we weight some values differently from others it will change the outcome.
For example, if we assign the second number a higher weight it will bring the
average up (because it is a bigger number than the others):
wmean = (1×2 + 3×7 + 1×3) / (1 + 3 + 1) = 5.2
Note that the weight appears twice: to multiply the value on the numerator,
and also as part of the sum of the weights on the denominator. In general terms
a weighted average is defined as:
wmean = Σ(ωi × xi) / Σωi
where ωi is the ith weight, and xi is the ith value in the list of numbers we wish
to average. This is the procedure typically used to calculate the grand mean
effect size. But what are the weights?
In meta analysis, we want to use weights that give an indication of the quality
of each study. One very simple way to do this is to use the sample size as the
weights - a study testing ten participants would have a weight of 10, and a
study testing 100 participants would have a weight of 100. This treats each
study as though it were part of a single monolithic study (i.e. it is a fixed effects
approach). Other alternatives are to use uniform weights (e.g. to give each
study a weight of 1), or to choose some predetermined criteria based on the
methodology used. For example, one might decide to weight studies using a
state-of-the-art recording device more highly than those using older technology
(for example in neuroscience, MEG has lower recording noise than EEG; in
genetics, PCR is better than older methods like RFLP).
These options are reasonable and defensible in some situations, but they are not
what is typically done in meta analysis. Instead, the weights are derived from
the variance for each study. Specifically, we use the inverse variance, 1/σ². This
will be a large value for studies with small variance (i.e. very reliable studies),
and small for studies with large variances (i.e. unreliable studies). Note that the
σ term represents the standard deviation of the sampling distribution, which is
the sample standard error. As such, the sample size contributes to the inverse
variance weights (because the standard error calculation includes the sample
size).
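To make this concrete, here is a brief sketch (with made-up numbers) of an inverse
variance weighted mean effect size:
d <- c(0.45, 0.60, 0.30)    # hypothetical effect sizes from three studies
se <- c(0.10, 0.25, 0.40)   # their standard errors
w <- 1 / se^2               # inverse variance weights
weighted.mean(d, w)         # the precise (small-SE) study dominates the average
sum(w * d) / sum(w)         # the same calculation written out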
As well as calculating the weighted average of the effect sizes, we also need to
derive confidence intervals (to tell us where the corners of the diamond should
go). One way to do this is to calculate the variance of the weighted average using
the squares of the weights to combine the variances. Another option is to use
stochastic simulations (see Chapter 8) to estimate the variance (Sánchez-Meca
and Marin-Martinez 2008). Fortunately these rather complex calculations are
done automatically in meta analysis software, so we will not consider them
further here.
Like a forest plot, a funnel plot shows effect size on the x-axis. But this time, the studies are ordered along the y-axis,
usually according to either sample size or inverse variance. Studies with large
samples appear towards the top, and studies with small samples appear near the
bottom. An ideal funnel plot looks like the example in the left panel of Figure
6.4.
Figure 6.4: Example funnel plots, in which each point represents a simulated
study. In the left panel, a symmetrical funnel (triangle) shape is apparent, with
large-sample studies producing estimates close to the true effect size (top) and
small-sample studies producing more variable estimates (bottom). In the right
plot, studies with estimates below the true mean are suppressed (i.e. remain
unpublished). The plot becomes asymmetric, and the mean effect size (dashed
line) is overestimated.
The symmetrical funnel plot gets its triangular shape because studies (points)
with large sample sizes (at the top) are more likely to produce estimates of effect
size close to the true mean (solid line), whereas studies with small sample sizes
(at the bottom) will produce more variable effect sizes. If a funnel plot looks
like this, it is unlikely that publication bias is a big problem for the area under
study.
Now let’s think about what would happen if studies that were non-significant
did not get published. This might happen for nefarious reasons (such as a
pharmaceutical company deliberately suppressing a study that shows their drug
is ineffective), but it is also likely just as a consequence of human nature, and
the current incentive system in scientific publication. Non-significant results are
much harder to get published, as many journals will simply reject them out of
hand as being ‘uninteresting’. Furthermore, most researchers have limited time,
and will often prioritise publishing studies with significant results, which might
be more likely to get published in prestigious journals, and so be better for their
career.
The right hand panel of Figure 6.4 shows an asymmetrical funnel plot, in which
all studies with effects below the true mean of d=1 are omitted. The effects of
this are clearest in the small sample studies, which now skew out to the right.
Studies with a larger effect size were close to the true mean anyway, so these
look much the same as before. One consequence is that the mean effect size
across all the studies (shown by the dashed line) now overestimates the true
mean (solid line).
Funnel plots can be used to test for publication bias, and this is routinely done
as part of a meta analysis. In situations where publication bias is detected,
techniques exist to estimate what the true underlying effect size is likely to be.
One striking example is a meta analysis by Shanks et al. (2015) that looked at
priming studies of consumer choice. The funnel plot they produced (see their
Figure 2) was highly asymmetrical, whereas a funnel plot of replication studies
was symmetrical about an effect of d = 0. This is strong evidence for publication
bias, or other types of questionable research practice (such as p-hacking) in this
particular paradigm.
A second example meta analysis examined the effect of vitamin A supplementation on the likelihood of dying from infectious diseases. The authors included only
controlled trials in their meta analysis, involving measles, respiratory diseases or
diarrhoea, in children in developing countries. They pooled odds ratios across
studies, with aggregate effects being calculated separately for different diseases,
and in community studies. The largest result was for measles: across three
studies, the average odds ratio was 0.34 (i.e. a reduced risk of death of 66%
following supplementation). A subset of five community studies also found a
mortality reduction of 30%. To assess the potential impact of publication bias,
the authors calculated a statistic called the ‘failsafe N’ (Rosenthal 1979) for the
community study result (with 5 studies). This statistic tells us the number of
non-significant studies that would need to exist (yet remain unpublished) for
there to be no effect overall. For this example it was 53, which is an implausibly
large number of studies to remain unpublished. Overall, these results suggest a
strong benefit of either vitamin A supplementation, or a well-balanced diet, in
reducing the risk of death from infectious disease.
For a single line of code (a call to the mes function), the output is very extensive. The idea is to provide all
the various measures of effect size you might need to use. These include Cohen’s
d in the first section (which has a value of 0.36), Hedges' g in the second section,
the equivalent correlation coefficient (r) and z-score, odds ratios and number
needed to treat (NNT). Similar functions exist if you have a p-value, t-value, r
value and so on, for example:
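# illustrative conversions from other test statistics; these functions come from the
# compute.es package (which also provides the mes function used in this section),
# and the input values here are arbitrary examples
tes(t=2.4, n.1=20, n.2=20)    # effect sizes from a t-statistic
fes(f=5.8, n.1=20, n.2=20)    # from an F-ratio
res(r=0.3, n=40)              # from a correlation coefficient
pes(p=0.02, n.1=20, n.2=20)   # from a p-value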
I have hidden the output from the above functions, but it is in exactly the same
format as for the previous example. What we usually want though, is to just
extract the single effect size we are interested in. We can do this by assigning
the output of the function call to a data object as follows:
output <- mes(mean1, mean2, sd1, sd2, n1, n2)
The object called output then contains all of the numbers you might need in
fields with sensible names. For example, you can request Cohen’s d and its
variance as follows:
output$d
## [1] 0.36
output$var.d
## [1] 0.02
And we can convert the variance to a standard deviation by taking the square
root:
sqrt(output$var.d)
## [1] 0.1414214
The effect size, its standard deviation and the sample size are the values you will
need to enter into a meta analysis. It will often help to store them in another
data object so they can be easily accessed and entered into the meta analysis
functions.
The effect sizes might be values of Cohen’s d, and the standard errors will be
the square root of the variance estimates that are returned when the effect size
is calculated (see previous section). The rmeta package contains functions that
will use these values to conduct a meta analysis. There are several varieties of
meta analysis available, but we will use the meta.summaries function to conduct
a random effects meta analysis using the effect size measures.
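For this example, the effect sizes and standard errors of five studies are stored in two vectors. The effect sizes below match the per-study values listed in the summary output later in this section, and the standard errors are those implied by the reported confidence intervals:
effectsizes <- c(0.7, 0.4, 2.1, 0.9, 1.6)      # Cohen's d for each study
standarderrors <- c(0.2, 0.3, 0.9, 0.3, 0.5)   # standard error of d for each study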
meta.summaries(effectsizes,standarderrors,method='random')
## Random-effects meta-analysis
## Call: meta.summaries(d = effectsizes, se = standarderrors, method = "random")
## Summary effect=0.852 95% CI (0.456, 1.25)
## Estimated heterogeneity variance: 0.078 p= 0.149
The output from this function tells us the summary effect size (0.852) and its 95%
confidence intervals. This is useful, but it’s more helpful if we save the output of
the function into a data object, which we can then pass into the metaplot and
funnelplot functions to produce graphical summaries of the results as follows:
metaoutput <- meta.summaries(effectsizes,standarderrors,method='random')
# this line of code tells R to put the next two plots side by side
par(mfrow=c(1,2), las=1)
metaplot(effectsizes,standarderrors,summn=metaoutput$summary,
sumse=metaoutput$se.summary,sumnn= metaoutput$se.summary^-2,
xlab='Effect size (d)',ylab="Study",summlabel='')
funnelplot(metaoutput, plot.conf=TRUE)
Figure 6.5: Auto-generated forest and funnel plots, using the metaplot and
funnelplot functions.
These plots (shown in Figure 6.5) are quite rudimentary, but can be improved
by specifying additional input arguments. For example, the author names
can be specified using the labels argument, and different colours chosen with
the colors argument (see further details in the help files). The funnel plot
can be automatically mirrored about its mid-point by adding the argument
mirror=TRUE to the function call.
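For example, adding study labels and mirroring the funnel plot might look like this (the study names here are hypothetical):
studynames <- c('Study 1','Study 2','Study 3','Study 4','Study 5')
metaplot(effectsizes,standarderrors,labels=studynames,
         summn=metaoutput$summary,sumse=metaoutput$se.summary,
         sumnn=metaoutput$se.summary^-2,
         xlab='Effect size (d)',ylab="Study",summlabel='')
funnelplot(metaoutput, plot.conf=TRUE, mirror=TRUE)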
It is also helpful to use the generic summary function to get a more detailed
summary of the meta analysis, which includes everything you would need to
generate your own forest plot manually (e.g. in another plotting package):
summary(metaoutput)
## Random-effects meta-analysis
## Call: meta.summaries(d = effectsizes, se = standarderrors, method = "random")
## ----------------------------------------------------
## Effect (lower 95% upper) weights
## 1 0.7 0.31 1.09 1.7
## 2 0.4 -0.19 0.99 1.2
## 3 2.1 0.34 3.86 0.2
## 4 0.9 0.31 1.49 1.2
## 5 1.6 0.62 2.58 0.6
## ----------------------------------------------------
## Summary effect: 0.85 95% CI ( 0.46,1.25 )
## Estimated heterogeneity variance: 0.078 p= 0.149
Those are the basics of doing a meta analysis in R. There is much more function-
ality in the rmeta package, and other packages are available for specific types of
meta analysis and other variations on the analysis.
B) 1.25
C) 0.69
D) 4.19
4. What is the value of Hedges' g for an ANOVA with an F-ratio of 13.6 and
17 participants in each group?
A) 1.26
B) 0.53
C) 0.60
D) 1.24
5. What is the log odds ratio for comparing proportions of 0.7 and 0.6 with 3
participants per group?
A) 0.12
B) 1.56
C) 0.44
D) 0.24
6. Conduct a random effects meta analysis using effect sizes of d = 0.1, 0.6,
-0.2, 0.9 and 1.1, with standard deviations of 0.2, 0.3, 0.1, 0.4 and 0.5.
What is the aggregate effect size?
A) 0.80
B) -0.01
C) 0.37
D) 0.20
7. What is the aggregate effect size using the values from question 6, but
conducting a fixed effects analysis instead?
A) 0.80
B) -0.01
C) 0.37
D) 0.20
8. Produce a forest plot using the data from question 6 (assuming random
effects). Is there a significant effect overall?
A) No, because the diamond overlaps the line of no effect
B) Yes, because the diamond overlaps the line of no effect
C) Yes, because most individual studies do not overlap the line of no
effect
D) No, because one of the individual studies has a negative effect
9. Which of the following is not a plausible explanation for an asymmetrical
funnel plot?
A) Small sample studies being more likely to produce significant effects
B) Random sampling
C) P-hacking
D) Publication bias
10. If 10 members of a treatment group of 500 recover from an illness, whereas
only 5 members of a control group of 400 recover, what is the odds ratio?
A) 0.020
B) 0.013
C) 1.60
D) 1.61
Answers to all questions are provided in section 20.2.
Chapter 7
Mixed-effects models
In common with other versions of the general linear model (see Chapter 4), the
general idea of mixed effects models is to try to account for as much of the overall
variance in the data set as possible, using our various predictors. We usually
want to know if our fixed effects are able to account for a significant proportion
of the variance, but we also want to account for the variance due to our random
effects. Sometimes this is because random effects are ‘nuisance’ variables that
we need to control for, but are not really interested in. Including the random
effect in our model means that we can remove this variance, reducing the noise
in our estimate of the fixed effects. Sometimes accounting for a random effect
can reveal structure in a data set that is otherwise masked by group differences,
as we will demonstrate with our first example.
Figure 7.1: Simulated data showing the relationship between one independent
variable (IV) and one dependent variable (DV). The solid grey line is an intercept-
only regression, with slope constrained to be 0. The dashed black line is the best
fitting regression line, with slope and intercept free to vary.
Figure 7.2: The same data as shown previously, but with groups tagged in
different colours (a). Mixed effects models with random intercepts (b), random
slopes (c), and random intercepts and slopes (d), are shown by the regression
lines.
The black dashed line is the grand average regression line. It is quite different
from our traditional regression fit (in Figure 7.1), which was essentially flat. If
we now perform a statistical comparison, we see that there is a highly significant
effect of the IV.
Finally, we can allow both the intercepts and slopes to vary between groups,
as shown in Figure 7.2d. This captures the shallower slope of the bottom
group (in purple), as well as the vertical offsets between groups. Overall this
type of model has more degrees of freedom than the other two. We could
alternatively have run five completely independent linear regressions (or used
multiple regression) instead of our single mixed effects model, but the mixed
effects approach additionally gives us the grand average regression line, that
takes account of the sample size of each group and has greater overall power
than for any individual group. In other words it tells us about the overall effect
of the independent variable, rather than its effect only within each group. For
models where either the slope or intercept are fixed, the mixed effects framework
lets us jointly estimate the value of the fixed parameter across all of our groups.
Mixed-effects regression models are enormously flexible, and we will learn about
the syntax to implement them later in the chapter. The decision of whether to
include random intercepts, random slopes, or both will depend heavily on the
hypothesis you are trying to test. However it is quite rare to find a situation where
only random slopes are required - most models either involve random intercepts,
or allow both parameters to vary. Sometimes it might also be advisable to test
more than one model, and use goodness of fit indicators such as R2 to decide
which model describes the data best (see section 7.6 for more detail). Note
that because a random intercepts model requires fewer degrees of freedom than
a model in which both parameters vary, it can sometimes be fit to data sets
with fewer observations. In the next section, we will run through an example of
mixed-effects regression using data from the literature.
Figure 7.3: Tidal volume for bottlenose dolphins as a function of body mass,
modified and replotted from Fahlman et al. (2018).
principle the relationship with body mass might differ between them.
The mixed-effects approach allows us to deal with both of these issues in a coher-
ent way. We treat body mass and breath direction (expiration and inspiration)
as fixed effects, and animal as a random effect. This allows us to have separate
regression lines for the two directions of breath (expiration and inspiration), and
test if there is evidence overall for an effect of body mass. The two lines in
Figure 7.3 show these fits, and the regression output tells us there is a significant
effect of body mass, and also of breath direction:
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## bodymass 30.719 30.719 1 26.720 26.521 2.096e-05 ***
## direction 34.799 34.799 1 77.414 30.044 5.095e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The significant effect of body mass indicates that the overall regression slope
is significantly steeper than 0. The significant effect of direction tells us that
the two breath directions involve different regression lines (though note this is
largely because I modified the data for this example).
To account for the multiple observations from some animals, we treat animal as
a random effect. In the current context, this is similar to a repeated measures
design for a t-test or ANOVA. But crucially, those tests expect a balanced design,
where each individual contributes the same number of observations. The mixed-
effects approach relaxes this assumption, so it is more flexible when dealing with
real data sets.
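The model behind the output above could be specified along the following lines (a sketch: the data frame and variable names are assumptions, and the lmerTest package is assumed so that the anova function produces the Satterthwaite table shown earlier):
library(lmerTest)
# fixed effects of body mass and breath direction, with a random intercept for each animal
dolphinmodel <- lmer(tidalvolume ~ bodymass + direction + (1|animal), data=dolphindata)
anova(dolphinmodel)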
As you might recall from Chapter 4, the general linear model underlying regression
can also be used in situations where the independent variable is categorical rather
than continuous. The most familiar instances of this are t-tests and ANOVAs.
We can bring the benefits of mixed-effects models to factorial experimental
designs, and also include multiple random effects, as we will see in our next
example.
Table 7.1: Example stimuli for a lexical decision task. I ask the forgiveness
of any real psycholinguists reading this, who would no doubt have numerous
objections to using these stimuli in an actual experiment.
For our next example, we consider a lexical decision task in which participants
are presented with a string of characters, and must decide if they are a word
or a non-word. Examples of non-words are often based on real words, but with
some errors introduced, for example “bekause”, but they can also be nonsense
strings of letters, for example “okjsdfj”. In our experiment, 20 participants each
respond to 20 nouns and 20 verbs, and also 20 non-words based on the original
nouns, and 20 non-words based on the original verbs. The dependent variable is
the reaction time, measured in milliseconds. Our example stimuli are shown in
Table 7.1.
For our first participant, we will show these stimuli in a random order, and
measure reaction times for each decision (word vs non-word). Their data might
look something like the values shown in Table 7.2.
Table 7.2: Example reaction times for a lexical decision task, for one participant
(times in ms).
For a traditional analysis, we would compute this participant's mean reaction time for each of the four conditions (averaging across the items shown in Table 7.2). These means would then be entered, along with the means of the other
19 participants, into a 2x2 repeated measures ANOVA. The two factors for the
ANOVA are word type (noun or verb) and word validity (word or non-word),
and each participant would contribute a single mean reaction time for each of
those four conditions (meaning 80 data points in total - 20 participants × 4
conditions). The ANOVA results might look something like this:
##
## Error: subject
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 1 5470 5470
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## wordtype 1 20654 20654 18.010 6.21e-05 ***
## validity 1 103705 103705 90.431 1.61e-14 ***
## wordtype:validity 1 10878 10878 9.485 0.00289 **
## Residuals 75 86009 1147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see significant effects for word type, validity, and their interaction. We can
also plot the means for each condition, along with individual data points, as
shown in Figure 7.4.
Figure 7.4 is a conventional plot, in which the error bars show the standard
deviation across participants, and each point corresponds to a different individual
person (N=20). Note that some individual participants are generally fast or
generally slow, regardless of the condition. For example, you can see that the
highest points stay near the top in each condition, as linked by the faint grey lines.
It is this between-participant variance (the tendency for individuals to differ
systematically across conditions) that the repeated measures design can discard,
and which gives it a greater statistical power relative to a between-participants
design (where we test different individuals in each condition).
However, there is another way to think about the results of this experiment.
Notice that some of the words in Table 7.1 are likely to be easier than others to
identify. For example, in the non-nouns set, poncake might be quite a challenging
word (because the o looks like an a, and pancake is a noun). On the other
hand, lestuce might be an easier example to identify correctly as a non-word.
We can produce an alternative plot to Figure 7.4, by averaging reaction times
across participants for each item (instead of averaging across items for each
participant). Again, this will involve there being 80 observations: 20 items × 4
conditions. For the current example, we can see from Figure 7.5 that the ‘By
items’ plot has a key similarity with the ‘By participants’ plot (Figure 7.4): the
group means (horizontal black lines) are the same in both graphs. This has to be
the case because these are the grand averages across both items and participants.
Figure 7.4: Graph showing condition means (black bars) and individual data
points (symbols) for the four conditions. Error bars indicate the standard
deviation across participants, and thin grey lines join points for an individual
participant.
However the group variances differ between the plots, as do the individual points.
Figure 7.5: Graph showing condition means (black bars) and data points for
each item (squares) for the four conditions. Error bars indicate the standard
deviation across items, and thin grey lines join points for an individual stimulus
item.
Something to notice about the ‘By items’ plot (Figure 7.5) is that the standard
deviations are much smaller than in the ‘By participants’ plot (Figure 7.4). This
suggests that the responses to the different items are more similar to each other
than are the responses of different individuals. Now, we could in principle do
another ANOVA, but this time treating item as the unit we average within,
instead of participant. Note that item is repeated within the word class of noun
or verb, but not across classes (see Table 7.1), which is why the faint grey lines
only link within a word type category in Figure 7.5. The (mixed) ANOVA output
looks like this:
##
## Error: item
## Df Sum Sq Mean Sq
## wordtype 1 77105 77105
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## wordtype 1 37483 37483 67.41 4.75e-12 ***
## validity 1 12804 12804 23.02 7.95e-06 ***
In the dolphins example earlier in this chapter, we treated breath direction as a fixed effect with two levels - why wasn't this a random effect
instead? This turns out to be quite a complicated question, to which there is not
always a definitive answer, and these decisions are often left to the person doing
the analysis. One good heuristic is to think about whether you are interested in
the effect in question: if you are, it should probably be a fixed effect. In addition,
there are two important factors that prevent a variable from being treated as
a random effect. First, if the variable is continuous, it cannot be used as a
random effect; only categorical variables can be treated in this way. This rules
out variables like age and weight from being treated as random factors, unless
they are discretised into categories first. Second, random effects should have at
least five levels, as with fewer levels the estimate of the standard deviation for
that variable will be inaccurate. This is the reason why breath direction needed
to be a fixed effect in the dolphins example, and it also means that variables
such as sex/gender, handedness, blood group, and (in genetics) single-nucleotide
polymorphism must either be treated as fixed factors or ignored.
For traditional ANOVAs, we usually deal with missing data by either excluding
participants (known as listwise deletion, this can substantially reduce power),
or by averaging across only the data points we have (which means that some
participants contribute more observations than others). In mixed-effects models,
we are estimating the properties of an underlying regression line as best we can
for each comparison. Critically, if some estimates are missing, we can still come
up with a sensible parameter estimate. We can see how this might work by
randomly removing some of the observations from our psycholinguistics data set.
For example, if we remove 5% of the data points, the analysis runs fine, and the
summary table changes only very slightly:
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another situation that causes problems for ANOVA is when different groups
have very different sample sizes, also resulting in an unbalanced design. The
main issues are that it is difficult to accurately test whether groups of very
different sizes meet the homogeneity of variances assumption (i.e. that they
have equal variances), and also that if sample size covaries with an independent
variable, it can confound the main effect (in factorial designs). These issues
can cause particular problems when conducting research on rare conditions and
diseases, or in groups that comprise a minority of a population. Designs that
aim to sample the population at random (such as polling research) will tend
to select relatively few people in such categories, leading to highly unbalanced
designs. Again, mixed effects models are able to deal with this situation more
appropriately than ANOVA, because they correctly account for the variance
structure of the underlying data. This makes them a good choice for analysis of
data that relates to equality and diversity of underrepresented groups.
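Mixed-effects models are specified using a formula syntax that builds on the one used by the lm function for ordinary linear models. A basic lm call looks like this:
# simple linear model: predict height from age
output <- lm(height ~ age, data=dataset)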
The above code would run a linear model using the data stored in a data frame
called dataset, trying to predict values of the height column using the values in
the age column. If we have additional independent variables, such as sex, we
can include them either as single predictors (additive, as in regression formulae),
or as factors that interact with the other independent variables (multiplicative,
as for ANOVA formulae):
# regression notation (no interaction)
output <- lm(height ~ age + sex, data=dataset)
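# ANOVA-style notation (interaction between the predictors, specified with *)
output <- lm(height ~ age * sex, data=dataset)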
Formulae for the lmer function follow similar rules, but there is an additional
piece of syntax to consider. A random effect is always entered after the fixed
effects (i.e. independent variables), it is entered in brackets, and it is entered
after a vertical bar symbol (|). For example, if we wanted to include nationality
as a grouping variable to predict height, we might do so as follows:
# mixed-effects model call with random intercepts
output <- lmer(height ~ age + (1|nationality), data=dataset)
The above line of code will run a mixed-effects model with random intercepts
(see Figure 7.2b), using age as a predictor and nationality as a grouping variable
(random effect). Alternatively, we can specify a random slopes model (see
Figure 7.2c) by incorporating the independent variable into the random effects
specification, and specifying (using a 0) that the intercept is not included as a
random effect:
# mixed-effects model call with random slopes
output <- lmer(height ~ age + (0 + age|nationality), data=dataset)
Finally, we can specify a model with random slopes and intercepts (see Figure
7.2d) by allowing the intercepts to vary again:
# mixed-effects model call with random slopes and intercepts
output <- lmer(height ~ age + (1 + age|nationality), data=dataset)
Just as we can have more than one independent variable, random effects can be
defined for multiple grouping variables, depending on the structure of our data
set. For example, the call for our factorial mixed-effects model for the lexical
decision task was:
model <- lmer(RT ~ wordtype * validity + (1|subject) + (1|item), data=RTlmm)
This line of code specifies that reaction time (RT) is predicted by two factorially
combined independent variables (word type and validity), with random intercepts
on subject and item.
As with many R functions, the output of the model fit is stored in another data
object (the one to the left of the <- assignment). This data structure contains
a lot of information, and we can extract it in several ways. Simply inspecting
the object (by typing its name) will give us some helpful numbers, such as the
number of observations, but it is not generally very informative:
model
Similarly, we can request a summary table for our random effects terms with
the ranova function:
ranova(model)
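The marginal and conditional R2 values shown below can be obtained with the r.squaredGLMM function (an assumption: this function is from the MuMIn package, which reports output in the format shown):
library(MuMIn)
r.squaredGLMM(model)   # marginal (R2m) and conditional (R2c) R-squared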
## R2m R2c
## [1,] 0.3454568 0.6798848
Two values are calculated and reported by this function. The first (R2m) is the marginal R2 value, which represents the proportion of the variance explained by our fixed effects (i.e. traditional independent variables), excluding any random effects. The second (R2c) is the conditional R2 value, which is the proportion of the variance explained by the full model, including both fixed and random effects. It is helpful to report both of these statistics for each model that you run.
If we want to compare two (or more) models, we can again use the anova
function to produce a table of useful statistics, including AIC, BIC and log
likelihood scores. The model with the smallest AIC and BIC scores, and the
largest log-likelihood score gives the best account of the data. Here is an example
comparing the models from Figure 7.2b-d:
anova(simmodel3,simmodel4,simmodel5)
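We can also extract the fitted intercept and slope for each group with the coef function (a sketch: the object name simmodel5 assumes the random slopes and intercepts model from the comparison above):
coef(simmodel5)   # per-group intercepts and slopes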
## $group
## (Intercept) IV
## 1 32.911341 0.9268825
## 2 -1.536704 1.0272352
## 3 4.649459 0.9156914
## 4 -11.237636 0.4863508
## 5 -25.522081 1.0810784
##
## attr(,"class")
## [1] "coef.mer"
The first column contains the intercept values, and the second column the slope
values.
Finally, we can inspect the residuals using a Q-Q plot, which we first encountered
in section 3.8. These graphs show the expected quantiles (based on a normal
distribution) along the x-axis, and the actual residuals (from the data) along
the y-axis. Substantial deviations from the major diagonal line (usually at the
extremes) indicate that the normality of residuals assumption has been violated,
and the model results should be treated with some caution. We can generate a
Q-Q plot (see Figure 7.6) using the qqnorm function, after first extracting the
residuals from the model object using the resid function:
modelresiduals <- resid(model) # extract the residuals
qqnorm(modelresiduals) # create the plot
qqline(modelresiduals) # add the diagonal line
[Figure 7.6 appears here: a Q-Q plot of the model residuals, with Theoretical Quantiles on the x-axis and Sample Quantiles on the y-axis.]
source, Westfall, Kenny, and Judd (2014) discuss issues relating to statistical
power, and Meteyard and Davies (2020) provide recommendations on reporting.
Readers convinced by the arguments for Bayesian statistics (Chapter 17) are
advised to read about Bayesian hierarchical models, which have similar properties
(see e.g. Kruschke 2014). It is also worth identifying papers in your own area
of research that use the mixed-effects approach, to find examples of common
practice. Finally, there are many helpful blog posts and discussion board threads
online that are well worth reading when troubleshooting specific issues.
Chapter 8
Stochastic methods
Stochastic methods use random numbers to allow us to work out useful stuff without requiring formal equations. Note that
the methods involving random numbers we discuss in this chapter are distinct
from the concept of a random effect, that we introduced in Chapters 6 and 7.
Random effects are where individuals, groups or studies differ in their means,
whereas here we use random numbers for several other purposes.
these are not necessary for anything we will discuss here, and are only worth the
bother and expense if you really need them.
The remainder of this chapter is divided into two parts. In part 1, we will
describe how stochastic methods can be used to model different situations, in
order to gain insights into how a system, model, or experiment might behave. In
part 2, we will introduce the concept of resampling. This is a way of analysing
data that can be used to estimate confidence intervals, and also to conduct
statistical hypothesis testing.
a <- runif(100000)
b <- runif(100000)
hist(a, breaks = 100, col = 'white')
hist(b, breaks = 100, col = '#8783CF')
[Two histograms appear here, showing the approximately flat (uniform) distributions of a and b.]
What would we expect the distribution to look like if we added these two samples
together? Intuitively, we might guess that the distribution of the summed values
should also be uniform. However this intuition would be incorrect. In fact, the
summed distribution has a clear peak in the centre, as shown in Figure 8.2,
generated by the following code:
hist(a+b, breaks = 100, col = '#CFCDEC')
Figure 8.2: The sum of two populations of uniformly distributed random numbers.
This is because an extreme total requires both numbers to be extreme, whereas a middling total can be produced by many different combinations, so it is much more likely that the summed numbers will be of middling value. If we
kept on summing lots and lots of uniform distributions, we would eventually end
up with a beautiful normal distribution, as shown in Figure 8.3, and generated
using the following code:
bigsum <- runif(100000)
for (n in 1:99){bigsum <- bigsum + runif(100000)}
hist(bigsum, breaks = 100, col = 'grey')
Figure 8.3: The sum of 100 populations of uniformly distributed random numbers.
With this simple example, using random numbers, we have demonstrated the Central
Limit Theorem in action. Of course, there are a whole load of complex mathe-
matical equations that explain how it works in detail (see e.g. the Wikipedia
entry on Central Limit Theorem). But, in keeping with the spirit of the quote
at the start of the chapter, we have been able to show that the theorem works in practice.
Random numbers also allow us to simulate how the results of an experiment might turn out. Power calculations (see Chapter 5)
can also be done by simulation, affording greater flexibility (e.g. for complex or
unbalanced designs) than analytic approximations (Colegrave and Ruxton 2020).
An added advantage of simulating data is that one can construct an analysis
pipeline in advance of running the experiment. This saves time later, is useful
for clarifying and making explicit one’s assumptions and expectations, and can
also be included in preregistration materials.
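As a flavour of the simulation approach to power (a minimal sketch; the effect size, sample size and number of simulations are arbitrary choices):
nsims <- 1000                                    # number of simulated experiments
issig <- rep(FALSE, nsims)
for (n in 1:nsims){
  groupA <- rnorm(20, mean=0, sd=1)              # 20 participants per group
  groupB <- rnorm(20, mean=0.5, sd=1)            # true effect of d = 0.5
  issig[n] <- t.test(groupA, groupB)$p.value < 0.05
}
mean(issig)                                      # proportion significant = estimated power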
plot(seq(-4,4,0.001),dnorm(seq(-4,4,0.001),mean=0,sd=1),
type='l',lwd=3, main='dnorm',xlab='x',ylab='Density')
plot(seq(-4,4,0.001),pnorm(seq(-4,4,0.001),mean=0,sd=1),
type='l',lwd=3, main='pnorm',xlab='x',ylab='Cumulative probability')
plot(seq(0,1,0.001),qnorm(seq(0,1,0.001),mean=0,sd=1),
type='l',lwd=3, main='qnorm',xlab='Quantile',ylab='x')
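The histogram in the top left panel could be produced with a call along the following lines (a sketch: the sample size and bin settings are assumptions, and in practice par(mfrow=c(2,2)) would be called first to arrange the four panels):
hist(rnorm(200), breaks=20, main='rnorm', xlab='x')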
The top left panel of Figure 8.4 shows the output of the rnorm function, which
generates a sequence of n random numbers drawn from a normal distribution,
with mean and standard deviation defined by the function call (defaults are
mean = 0 and sd = 1). The rnorm function is the most useful function for our
current purposes, but for reference we will also describe the outputs of the other
three related functions.
In the top right panel of Figure 8.4, the dnorm function produces a probability
density plot for the same normal distribution. This gives the probability of
drawing a number with value x from a normal distribution with the mean and
standard deviation specified by the function call. Note that this function does
not produce random numbers directly, but it has many uses, such as plotting
smooth curves to summarise distributions.
In the lower left panel of Figure 8.4, the pnorm function provides the cumulative
distribution function. A good way to understand this is to imagine that at each
Figure 8.4: Example distribution function outputs for random numbers, density
function, cumulative density and quantiles, for a normal distribution with mean
= 0 and sd = 1.
value of x on the curve, you are adding up the probabilities for every number
between −∞ and x. For any input value of x, this function will tell you the
probability that a number drawn from a normal distribution will have a value
smaller than x. Equivalently, subtracting the probability from 1 will tell you
the probability of a number having a value larger than x. This is particularly
important for calculating p-values in statistical testing. For example, if we run
an ANOVA and calculate an F-ratio, we compare this to the (inverse) cumulative
F distribution (from the pf function) with an appropriate number of degrees of
freedom. This provides the ubiquitous p-value that is used to determine if a test
is statistically significant.
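For example (with arbitrary values), the p-value for an F-ratio of 4.5 with 1 and 30 degrees of freedom can be obtained from the cumulative F distribution:
1 - pf(4.5, df1=1, df2=30)   # returns a p-value of roughly 0.04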
Finally, in the lower right panel of Figure 8.4, the qnorm function provides
quantiles from the normal distribution. This is the reverse of the cumulative
distribution - notice that the x and y axes are switched between the lower two
panels. So it can be used to reverse engineer a test statistic if we know the
p-value. This is used in some of the effect size conversion tools we discussed in
section 6.5.
Equivalent functions are available for other distributions with a consistent naming
pattern. For example the rgamma, dgamma, pgamma and qgamma functions
generate a gamma distribution. This has a positive skew, and is sometimes
used for modelling prior distributions in Bayesian statistics. Other particularly
useful distributions include the uniform distribution (runif, dunif, punif and
qunif ) we encountered in the central limit theorem example above, the log-normal
distribution (rlnorm, dlnorm, plnorm and qlnorm), the F distribution (rf,
df, pf and qf ) used in ANOVA and related statistics, and the Poisson distribution
(rpois, dpois, ppois, qpois) that is used to model event probabilities such as the
spiking of neurons.
All of the above functions use the same underlying random number generator.
We can set the seed to a specific (integer) value, and be confident that the
sequence of pseudo-random numbers we generate will always be the same. For
example, setting the seed to 100 and asking for 5 random numbers from a normal
distribution produces the following output:
set.seed(100)
rnorm(5)
If we change the seed to a different value (99), we will get a different sequence:
set.seed(99)
rnorm(5)
Crucially, if we set the seed back to 100, we should get our first sequence of
numbers out again:
set.seed(100)
rnorm(5)
This needs to be done before any random numbers are actually generated, as
each time we sample from the random number generator we change its state.
The current state of the generator can be saved at any point with seed <- .Random.seed, and restored later by setting .Random.seed <- seed, which should permit full reproducibility of the original random sequence.
The set.seed function can also be used to specify the random number generator
algorithm to be used (with the kind argument). The default is the exciting-
sounding Mersenne-Twister (Matsumoto and Nishimura 1998), which is a widely-
used algorithm implemented in a number of programming languages and software
packages (including SPSS). There are half a dozen alternatives with similarly
exotic names, and also the option for users to specify their own algorithms if
required.
Because each resampled data set is the same size as the original data set, this also means that some values from the
original set might not be included at all in the resampled data. I always think of
bootstrapping using the analogy of a bag of ping pong balls, each of which has a
number from the original data set written on it. Resampling involves drawing a
ball from the bag, and noting down the number. In resampling with replacement,
the ball then goes back in the bag, meaning there is the possibility it will be
pulled out again. In resampling without replacement, the ball stays out of the
bag until the end of this bootstrap iteration.
Once the null distribution has been generated, we can calculate a p-value by
determining the proportion of resampled test statistics that are more extreme
than the original test statistic. For a one-sided test this is the proportion of resampled statistics that are larger than (or smaller than, depending on the direction of the hypothesis) the original statistic. For a two-sided test, the absolute values are used instead. Resampling
approaches are inherently non-parametric, so can be used in situations where
the assumptions of more traditional parametric statistics are not met. Crucially,
this method works for any test statistic one might come up with, even if the
expected distribution is unknown. Some variants include the bootstrap test, in
which data are resampled with replacement, and the permutation test, in which
all possible permutations (i.e. combinations of group ordering) of the data are
included in the resampled distribution. A similar approach can also be taken
with correlations by randomly reshuffling the pairings of the two dependent
variables on each iteration (see below). For a more elaborate use of resampling
methods, see section 15.8, which describes a related method for controlling for
multiple comparisons.
Figure 8.6: Illustration of the bootstrap test. The original data (left) consists
of two groups, A and B (columns), which produce a t-statistic of 1.97. These
data are resampled by randomly reshuffling the group allocations, and a new
t-statistic is calculated (here -0.11) using the resampled groups, A’ and B’ (blue
squares indicate values that originated in group A). The distribution of resampled
t-statistics from 10,000 such resampling iterations is shown in the right hand
plot, along with the original t-statistic (blue line). For this example, 3.4% of
the population lies to the right of the blue line, implying a one-sided p-value
of 0.034, or a two-sided p-value of 0.069. This is close to the value from the
original t-test of p = 0.067.
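The sample function performs the resampling itself. For example, resampling the numbers 1 to 10 with replacement can return repeated values (a sketch of the kind of call that produces output like the line below):
sample(1:10,10,replace=TRUE)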
## [1] 7 10 5 3 5 8 9 6 5 9
If we resample without replacement, we get a random permutation of the numbers,
which is useful in some situations (e.g. for randomising the order of conditions
in an experiment).
sample(1:10,10,replace=FALSE)
## [1] 10 3 9 4 6 5 2 7 1 8
Finally, we can also resample either with or without replacement but produce
a smaller data set by specifying how many values we need with the second
argument to the sample function. This is known as subsampling:
sample(1:10,5,replace=FALSE)
## [1] 8 6 7 1 3
The sample function becomes particularly useful when it is embedded in a loop
(see section 2.10) that repeats an operation many times on the resampled data.
The following code resamples the mean of some data, and plots the distribution
of resampled means in Figure 8.7:
# generate some random synthetic data
data <- rnorm(100, mean=1, sd=3)
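# build the distribution of bootstrapped means (a sketch of the resampling loop)
nboot <- 10000                           # number of bootstrap iterations
allmeans <- rep(0, nboot)                # pre-allocate storage for the resampled means
for (n in 1:nboot){
  # resample the data with replacement and store the mean of each bootstrap sample
  allmeans[n] <- mean(sample(data, length(data), replace=TRUE))
}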
b <- hist(allmeans,breaks=20)
# add a vertical line showing the true mean
lines(c(mean(data),mean(data)),c(0,max(b$counts)),col='black',lwd=8,lty=2)
Figure 8.7: Histogram of bootstrapped means. The black dashed line is the true
mean, and the dotted lines are the 95% confidence intervals.
The data object allmeans now contains 10,000 bootstrapped means. We can
estimate the confidence intervals from this population using the quantile function.
This function returns values at a specific proportion of a distribution. To get the
95% confidence intervals, we request proportions of 0.025 for the lower bound,
and 0.975 for the upper bound, because 95% of the values will lie between these
points.
# use the quantile function to get the confidence intervals
# from the population of bootstrapped means
CIs <- quantile(allmeans, c(0.025,0.975))
CIs
## 2.5% 97.5%
## 0.2493662 1.4879034
We can add the limits to our histogram (vertical dotted lines) to visualise them
as follows:
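# add vertical dotted lines at the lower and upper confidence limits
# (a sketch, following the same pattern used for the t-statistic example below)
lines(CIs[c(1,1)],c(0,max(b$counts)/2),lty=3,lwd=4)
lines(CIs[c(2,2)],c(0,max(b$counts)/2),lty=3,lwd=4)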
These upper and lower confidence intervals can then be used to plot error bars
for the mean in other figures. Of course, we are not limited to bootstrapping the
mean. We can bootstrap any test we are interested in, and obtain confidence
intervals on the test statistic. For example, we could bootstrap confidence
intervals on the t-statistic of a one-sample t-test using the same data (see Figure
8.8).
maint <- t.test(data,mu=0) # calculate a t-statistic instead of a mean
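# build the distribution of bootstrapped t-statistics and their confidence limits
# (a sketch of the resampling loop, mirroring the one used above for the mean)
allT <- rep(0, 10000)
for (n in 1:10000){
  resampled <- sample(data, length(data), replace=TRUE)
  allT[n] <- t.test(resampled, mu=0)$statistic
}
CIs <- quantile(allT, c(0.025,0.975))    # 95% confidence limits on the t-statistic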
b <- hist(allT,breaks=20)
lines(c(maint$statistic,maint$statistic),c(0,max(b$counts)),lty=2,lwd=8)
lines(CIs[c(1,1)],c(0,max(b$counts)/2),lty=3,lwd=4)
lines(CIs[c(2,2)],c(0,max(b$counts)/2),lty=3,lwd=4)
Figure 8.8: Distribution of resampled t-statistics, showing the true mean (dashed
line) and 95% confidence intervals (dotted lines).
Finally, let’s conduct a bootstrap test on some weakly correlated data. We’ll
generate these ourselves so that we have control over the extent of the correlation:
# generate a vector of 50 random values
var1 <- rnorm(50)
# generate a vector of 50 values that includes a fraction of var1
var2 <- rnorm(50) + 0.25*var1
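# correlation coefficient for the original data (the value displayed below)
truecor <- cor(var1,var2)
truecor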
## [1] 0.2186587
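A null distribution can then be built by breaking the pairing between the two variables on each iteration (a sketch; the exact resampling scheme used here is an assumption):
nullR <- rep(0,10000)
for (n in 1:10000){
  # independently shuffle each variable, destroying the true pairing between them
  nullR[n] <- cor(sample(var1),sample(var2))
}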
The data object nullR now contains the null distribution of 10,000 resampled
correlation coefficients. We can calculate a one-sided p-value by working out
the proportion of this distribution that is larger than our original correlation
coefficient (stored in the data object truecor):
length(which(nullR>truecor))/10000
## [1] 0.062
The which function here returns the indices of any entries in nullR that are larger
than the value of truecor. Then the length function counts how many indices
have been returned by the which function; this is converted to a proportion by
dividing by the number of resampling iterations (10,000). Finally, it is worth
visualising both the null distribution and the original correlation coefficient, as
shown in Figure 8.9.
hist(nullR,breaks=20)
lines(c(truecor,truecor),c(0,1200),col=pal2tone[1],lwd=6)
Figure 8.9: Null distribution from a bootstrap test of a correlation. Each corre-
lation coefficient in the distribution is calculated using independently resampled
data from each variable. The vertical blue line shows the correlation coefficient
from the original data.
Notice that the number of resampling iterations determines the precision of the
resulting p-value. Running 10,000 iterations gives us a precision of 1/10,000 =
0.0001. If we only ran 100 iterations, we would only have a precision of 1/100 =
0.01. Values smaller than this will default to 0.
If you need to simulate a particular type of data, there may well be specific R packages
and online tutorials designed with this in mind.
Chapter 9
Nonlinear curve fitting
The general problem of fitting a model to data is called parameter optimization, and there is a whole
class of computational techniques designed to solve it. This chapter will discuss
some of the issues involved in fitting models to data, and introduce a well-
established optimization algorithm called the Downhill Simplex Algorithm. We
will first outline the idea of linear and nonlinear models, and how to calculate the
error of a model fit. Next we will introduce the possible parameter space for a
model, and discuss how this depends on the number of parameters in the model.
Then we will introduce the simplex algorithm, and describe some problems it
can encounter during optimization. Finally we will go through an illustrative
example of function fitting in R.
[Figure 9.1 appears here: the height (cm) of a child plotted against age (months), with a straight line fitted to the data points.]
To the extent that height will continue to increase approximately linearly with
age, we could use the fitted line to predict how tall this particular child might
be in another 3 months. In regression notation, the equation of a straight line is:
y = β0 + β1 x, (9.1)
where the β1 parameter determines the slope (gradient) of the line, and the β0
parameter is a vertical offset that determines the value of y when x = 0 (often
called the y-intercept). Performing regression involves finding the values of the
two parameters (β0 and β1 ) that give the best description of the data. For the
above example this turns out to be β0 = 55.4 and β1 = 1.2. The slope value is
telling us that every month the child grows another 1.2 cm on average.
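For example, plugging an age of 20 months into the fitted equation gives a predicted height of 55.4 + 1.2 × 20 = 79.4 cm.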
For real data, there will always be some amount of error between the straight
line fit and the data points (shown by the thin vertical lines in Figure 9.1). One
way of thinking about fitting is that we are trying to make this error as small as
we possibly can. If the parameter estimates were completely wrong this would
give a very poor fit. For example if the slope parameter were negative, the model
(thick blue line) would predict that children should shrink as they age! It follows
that the best fitting parameter values (of β0 and β1 ) are the ones that produce
the smallest error between model and data.
y = β0 + β1 x², (9.2)
This transforms our straight line into a curve, as shown in Figure 9.2.
Figure 9.2: Quadratic model fit to age vs height data (a quadratic function is
one that involves squaring).
The quadratic curve gives a slightly better fit to the data (the vertical lines are
shorter than before). More generally, the exponent might not be exactly 2, and
so its value could become another free parameter in the equation, much like β0
and β1 :
y = β0 + β1 xγ , (9.3)
This means that when we fit the model to our data, we would have three different
parameters to adjust (β0 , β1 and γ) instead of just two. In fact, we could in
principle fit any equation to any set of data if we had reason to do so, and these
equations would have as many parameters as they might need. As we will see
later in the chapter, models with a large number of parameters quickly become
very difficult to fit.
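As a sketch of how such a model might be expressed in code (the function names here are illustrative, and the sum of squared errors is one common choice of error measure):
# the three-parameter model of Equation 9.3, with p = c(beta0, beta1, gamma)
powermodel <- function(p, x){ p[1] + p[2] * x^p[3] }
# error between the model predictions and observed values y, as a sum of squared differences
modelerror <- function(p, x, y){ sum((y - powermodel(p, x))^2) }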
This example illustrates two important points. First, that fitting curves to data
is an important research skill that can factor into critical life or death decision
making at the highest levels. Models of infection were used to guide government
policy about how to control the virus. Second, that extrapolating from curves
fitted to limited data can be extremely misleading - the dashed curve in Figure
9.3 does not give accurate future predictions, and basing important decisions
on it at the end of March would have been a bad idea. Of course most of
the coronavirus modelling was rather more sophisticated than a 3-parameter
exponential function, but the same caveats apply no matter how elaborate the
model.
Figure 9.3: UK coronavirus cases for a 6 week period in early 2020. The dashed
black curve was fit to the first 30 days of data, and predicts the following two
weeks. The blue curve was fit to the full data set.
Figure 9.4: Parameter space for a linear model fit. The star indicates the best
fitting parameters, which give the smallest error. Blue shading indicates the
depth of the surface.
This visualisation tells us that the region of the parameter space that gives the
best fit is somewhere around β1 = 1 and β0 = 60, which corresponds well to our
original parameter estimates from the regression fit (shown by the black star).
The parameters that give the best fit are those that produce the smallest error
between the line and the data points, and so this is the lowest point in a virtual
three dimensional ‘space’ consisting of one dimension for each parameter (β0
and β1 ) and a third dimension for the value of the error (the height, indicated
by the contours and shading). This error surface will be different for every data
set, and for each model equation we might attempt to fit. Note that although in
this chapter we plot several error surfaces, we would not typically visualise them,
as the computational cost of evaluating the model for all possible parameter
combinations is usually too great.
If we only have two free parameters, testing all plausible combinations of pa-
rameter values is possible on a modern computer (assuming some sensible level
of sampling resolution). But as we add more free parameters, the amount
of time required to do this will increase exponentially. This is known as the
combinatorial explosion - the rapid growth in the complexity of a problem as
more dimensions are added. Also, the error surface will have more and more
dimensions (always n+1, where n is the number of free parameters, and the extra
dimension represents the error between the model and the data). Spaces with
more than three dimensions are pretty much impossible to represent graphically
or to imagine in our dimensionally-limited brains. Clearly, we need an algorithm
to do this for us.
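As an illustration of what such an exhaustive search involves, here is a hedged sketch that evaluates the error surface of the two-parameter linear model over a grid of candidate values (the age and height vectors below are made up for the example):

age <- c(6,9,12,15,18,21) # illustrative ages in months
height <- c(62,66,69,73,77,80) # illustrative heights in cm
b0vals <- seq(0,100,length.out=101) # candidate intercept values
b1vals <- seq(-2,2,length.out=81) # candidate slope values
errsurface <- matrix(NA,length(b0vals),length(b1vals)) # storage for the error surface
for (i in 1:length(b0vals)){
for (j in 1:length(b1vals)){
pred <- b0vals[i] + b1vals[j]*age # model prediction for this parameter combination
errsurface[i,j] <- sqrt(mean((pred-height)^2))}} # RMS error between model and data
best <- which(errsurface==min(errsurface),arr.ind=TRUE) # grid point with the smallest error
c(b0vals[best[1]],b1vals[best[2]]) # approximate best-fitting intercept and slope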
Figure 9.5: Path taken by the simplex, proceeding left to right from the starting
position (white triangle), to the final solution (black star). The blue triangles
are intermediate instances, sampled every 5 iterations.
Sometimes the simplex converges on a local minimum: a location in the parameter space where any small change to the parameters makes the model fit worse. So it stays put, and cannot find the global minimum. A
good example of a surface with multiple minima is the Himmelblau function,
shown by the contour plot in Figure 9.6.
Figure 9.6: Contour plot of the Himmelblau function, a surface with multiple minima.
The black star around (x = 3, y = 2) indicates the true global minimum, but
often an optimization algorithm will get stuck in one of the two local minima
on the left hand side of the plot, and return parameter values from one of these
two locations instead. These are also good solutions, and sometimes they will
be sufficient for whatever purpose we have in fitting our model (the model curve
will likely follow the data quite closely). However they are not quite as good as
the global minimum, which we would ideally like to find.
There are two main ways to fix the local minima problem. The first is to restart
the simplex algorithm from many random starting points, in the hope that
one version finds the global minimum. The other is to alter one of the fitted
parameter values by a large amount, and then restart the algorithm from this
new location. This method, referred to as casting the stone, assumes that if the
original solution is the global minimum, the algorithm will not find a better
solution in the new region of the search space to which it has been ‘cast’, and
will reconverge to the original solution.
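A hedged sketch of the first of these strategies, using base R's optim function (which runs a Nelder-Mead simplex by default) and the Himmelblau function as an example error surface:

himmelblau <- function(p){(p[1]^2 + p[2] - 11)^2 + (p[1] + p[2]^2 - 7)^2} # example error surface
best <- NULL
for (n in 1:20){
start <- runif(2,min=-5,max=5) # random starting position
fit <- optim(start,himmelblau) # run the simplex algorithm from this start
if (is.null(best) || fit$value < best$value){best <- fit}} # keep the best solution so far
best$par # parameter values of the best minimum found across the restarts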
Complex models can often take a long time to fit. One way to speed things
up is to optimize the code that calculates the model predictions as much as
possible. This can involve removing extraneous commands, replacing loops with
matrix operations, pre-allocating memory, compiling code, and making use of
parallel processing capabilities. Because the model code will be called hundreds
or thousands of times during optimization, even small increases in efficiency can
often translate to time savings of several hours.
Another way to speed up fitting is to constrain the range of values that one or
more parameters can take. Often this can be determined on practical grounds.
In the baby height example from Figure 9.1, we could constrain the slope value
to always be positive (because babies shrinking as they get older doesn’t make
sense). When model parameters represent real-world properties of a physical
system, it is sometimes reasonable to constrain them to lie within a sensible
range. For example, a model parameter representing body temperature in live
humans could be constrained to lie between 10 and 50◦ C because temperatures
outside of this range would be fatal.
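One hedged way to implement such constraints is via base R's optim function with the L-BFGS-B method, which accepts lower and upper bounds on each parameter (the error function below is a made-up placeholder):

errfn <- function(p){sum((p - c(2,3))^2)} # placeholder error function of two parameters
fit <- optim(par=c(1,1), fn=errfn, method="L-BFGS-B",
lower=c(0,0), upper=c(10,10)) # both parameters constrained to lie between 0 and 10
fit$par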
Models that contain a stochastic (random) element (see Chapter 8) are not
typically suitable for use with optimization algorithms. This is because the
model itself returns a different solution each time it is run, even with a fixed
set of parameters. This means that the error surface changes on every iteration,
causing obvious problems for fitting. One solution to this is to use the same
seed value for the random number generator on each iteration. This freezes the
surface and allows the minimum to be found. However, it will be important to
rerun the fitting with different values of the random seed to check that similar
parameter values are found each time.
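A minimal sketch of the fixed-seed approach, with a made-up stochastic model and data set:

observed <- c(2.1,1.8,2.4,1.9,2.2) # illustrative data to be fitted
stochastic_errfn <- function(p){
set.seed(1234) # freeze the random element so the error surface is stable
pred <- p[1] + rnorm(length(observed),sd=0.1) # model prediction with a stochastic component
sqrt(mean((pred-observed)^2))} # RMS error between model and data
optim(par=1, fn=stochastic_errfn, method="Brent", lower=0, upper=5)$par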
Real data sets often contain some data points that are more reliable than others.
This might be because more observations were made in some conditions than in
others. In such situations, it can be useful to weight the data points by some
measure of their reliability when calculating the error of the fit. Doing this
might prevent very noisy data points from having an undue influence on the
fit. The precise values used for the weights will depend on the type of data
you are fitting, and the general idea of weighting was introduced in section 6.8.
As with many aspects of computational modelling, the precise details of the
implementation are left to the modeller, and as you gain more experience you
will usually develop heuristics that work for the type of data you are interested
in.
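As a hedged illustration, a weighted version of the RMS error might multiply each squared difference by a weight, for example the number of observations contributing to that data point (all values below are made up):

modelvals <- c(1.0,1.5,2.0,2.5) # illustrative model predictions
datavals <- c(1.1,1.3,2.4,2.6) # illustrative data values
weights <- c(20,10,5,1) # reliability weights for each data point
sqrt(sum(weights*(modelvals-datavals)^2)/sum(weights)) # weighted RMS error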
How do we decide which model to fit in the first place? First, it is always worth plotting the data you are trying to model (and see Chapters 3 and 18 for some guidance on
plotting). If the data have a clear form, for example a Gaussian-like distribution,
this might suggest, or rule out, particular mathematical functions. Second, the
simpler a model is, the easier its behaviour will be to understand. One way to
simplify a model is to reduce the number of free parameters as far as possible.
Statistics have been proposed to mathematically compare the performance of
models with different numbers of parameters. The Akaike Information Criterion
(AIC; Akaike 1974) is a widely used example, and contains a penalty term that
increases with the number of free parameters. This can help to avoid ‘overfitting’,
by excluding parts of a model that may not be necessary to provide an acceptable
fit. In general though, reading existing studies on a similar topic is the best
way to get a feel for the types of models that might be suitable for a particular
data set. Some more detailed practical suggestions for model development are
proposed by Blohm, Kording, and Schrater (2020) - although the authors focus
on neuroscience, the points they make are generally applicable to modelling in
other domains.
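A minimal illustration (not from the book) of an AIC comparison between a simpler and a more complex model:

xvals <- 1:20
yvals <- 3 + 0.5*xvals + rnorm(20) # synthetic data generated from a straight line
linmod <- lm(yvals ~ xvals) # linear model: two free parameters
quadmod <- lm(yvals ~ xvals + I(xvals^2)) # quadratic model: three free parameters
AIC(linmod,quadmod) # lower AIC values indicate the preferred model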
Let us work through a concrete example by fitting a Gaussian function of the form:

f(x) = exp(−(x − a)² / (2σ²)),   (9.4)
where a and σ are free parameters, and x is the value along the x-axis. The
parameter a controls the horizontal offset of the Gaussian, and σ controls the
spread (width). Gaussian functions can be used to characterise many biological
processes, such as the tuning functions of neurons. We can implement the equation
as the first line of an R function (see section 2.7 for a refresher on how function
definitions work in R). I have named the function errorfit, because it calculates
the error between the model and some data (in other words, the error of the fit):
errorfit <- function(p){ # define a new function called 'errorfit'
# equation of a gaussian, with parameters from the input, p
gaus <- exp(-((x-p[1])^2)/(2*p[2]^2))
# calculate and return the root-mean-squared (RMS) error between model and data
sqrt(mean((gaus - ydata)^2))}
The second line of the function calculates the root-mean-squared (RMS) error
by taking the differences between the model (stored in the gaus data object)
and the data (stored in the ydata data object), squaring the differences, and
calculating the square root of the mean. We will generate some synthetic data
for the model to fit, where we know the true values of the free parameters, and
see how well the simplex algorithm can recover them. Let’s set the values to be
a = 2 and σ = 3, and generate data using the Gaussian function for a range of
x-values, adding a bit of noise (to simulate measurement error):
x <<- seq(-10,15,1) # sequence of x-values from -10 to 15
p <- c(2,3) # true parameter values used to simulate some data
ydata <- exp(-((x-p[1])^2)/(2*p[2]^2)) # simulated data from gaussian
ydata <<- ydata + 0.05*rnorm(length(x)) # add noise to simulated data
plot(x,ydata,type='p') # plot simulated data
The above code generates the graph shown in Figure 9.7. One small point to
notice. When we define the data objects x and ydata, we use the double arrow
assignment (<<-) to specify that they are global variables. This means they are
available from within the errorfit function, so we do not need to explicitly pass
them to it as inputs.
If we tried to guess the model parameters without knowing them in advance, we
might estimate that the middle of the function was around 5, and the spread
was around 1. This would produce the (very poor) fit shown in Figure 9.8, and
given by the following code:
p <- c(5,1) # a guess at some possible parameter values
pred <- exp(-((x-p[1])^2)/(2*p[2]^2)) # model prediction using these parameters
plot(x,ydata,type='p') # plot the data again
lines(x,pred,lwd=2,col='#8783CF') # add the model prediction
Figure 9.8: Simulated data with a poorly fitting Gaussian curve generated from the guessed parameter values.
Notice that we store the parameters in a data object called p, which we could pass
to the errorfit function to get a numerical estimate of how good (or otherwise)
the fit is:
# calculate the error of the fit with our first guess parameter values
errorfit(p)
## [1] 0.3870659
The fit is obviously poor, and hopefully by optimizing our parameters we will
be able to improve on the RMS error of 0.4. We will do this using the downhill
simplex algorithm, that is called using the nelder_mead function. We provide
the function with the name of the errorfit function and a starting ‘guess’ for
what the parameter values might be. It returns a data object, which contains
the estimated parameters. We can then plug those parameters into our equation,
and generate a curve that fits the data well (see Figure 9.9).
library(pracma) # load the pracma package
sout <- nelder_mead(errorfit,c(5,1)) # run the simplex algorithm, starting from our guess
p <- sout$xmin # extract the estimated parameter values
pred <- exp(-((x-p[1])^2)/(2*p[2]^2)) # model prediction using the fitted parameters
plot(x,ydata,type='p')
lines(x,pred,lwd=2,col='#8783CF') # plot the model fit as a line
Figure 9.9: Simulated data with the best-fitting Gaussian function (curve).
The data object produced by the nelder_mead function (the sout object) contains
several pieces of information besides the final parameter estimates. For example
it includes the value of the function, and the number of iterations that were run.
We don’t need to look at this information now, but it is there if you ever need it.
The estimated parameter values should be close to the original values used to
generate the data, and as you can see (from Figure 9.9) the curve provides a
good fit to the data points. We have extracted the estimated parameters from
sout$xmin and stored them in the data object p:
errorfit(p) # calculate the error of the fit with the optimized parameter values

## [1] 0.04361436
The RMS error is much smaller than for our non-optimized best guess parameters
that we started with. This means that the simplex algorithm has done a good
job of fitting the model and finding some good parameter values. Something we
could potentially do next is to bootstrap this whole process (see Chapter 8) to
obtain confidence intervals on our parameter values.
This example of function fitting contains all of the same steps as we would go
through for a more sophisticated model fit. We need to create an R function that
calculates the model predictions for a given set of parameters, and calculates the
error between the model predictions and the data. We then pass this function
to the simplex algorithm (nelder_mead function), along with an initial guess
about the parameters. This initial guess will often influence the end result, so
it is sensible to repeat the fitting process many times using random starting
parameters, and choose the model parameters with the best overall fit. For the
example above, such a procedure might look something like this:
# first initialise data objects to store the best error value and parameters
bestrms <- 10000
bestp <- c(0,0)
# repeat the fit from many random starting points (the ranges here are arbitrary)
for (n in 1:100){
startp <- c(runif(1,min=-10,max=15),runif(1,min=0.1,max=10)) # random starting guess
sout <- nelder_mead(errorfit,startp) # run the simplex algorithm
p <- sout$xmin # estimated parameters for this run
thiserror <- errorfit(p) # error value for this run
# if this is the best fit we've found so far, store the parameters
if (thiserror<bestrms){
bestrms <- thiserror
bestp <- p}
}
This general approach can be used to fit models of arbitrary complexity to any
type of data, and is an enormously flexible and useful scientific tool. Of course,
more complex models will take longer to fit, and might require additional lines
of code (or extra functions) to specify. You could use different measures of error,
and specify some additional options in the simplex fit, such as the maximum
number of iterations or evaluations, as described in the help files for nelder_mead.
Alternatively, it is also possible to use a Bayesian approach (see Chapter 17)
to fitting models. This involves sampling a version of the error surface using a
stochastic process (see Kruschke 2014). Happy fitting!
Chapter 10

Fourier Analysis
Fourier Analysis takes its name from a 19th Century French mathematician
called Joseph Fourier (see Figure 10.1). Fourier was a polymath: an expert on
many topics. Of particular note, he is generally credited as being the first person
to describe the greenhouse effect - the process by which carbon dioxide traps heat
near the surface of a planet and causes global temperatures to rise.
The basic idea behind the technique that bears his name is that any waveform
can be decomposed into a bunch of sine waves of different frequencies (you may
have encountered the sine and cosine functions when learning trigonometry). Of
course, in Fourier’s era there were no computers, meaning that this procedure
had to be carried out by hand. This was a prohibitively slow process, and so
Fourier Analysis was not widely used until long after his death. However modern
computers make the calculations straightforward and very fast, and Fourier
Analysis has been used in a wide variety of signal processing applications, as we
will describe below.
Figure 10.2: Example sine waves of different frequencies. The 1Hz wave at the
top goes through a single cycle (it increases, then decreases, then returns to
baseline) during the one second of time depicted. The 10Hz wave below it goes
through ten cycles in the same period of time.
10.3 Terminology
A key first step in understanding Fourier Analysis is to get your head around some
important terms. If we take a waveform and calculate the Fourier transform,
this will break the waveform down into its component frequencies. The result
is referred to as the Fourier spectrum, and has two parts as we will describe
in a moment. If we want to convert from the spectrum back to the waveform,
we perform the inverse Fourier transform (see Figure 10.3 for an example).
These operations are also referred to as Fourier analysis and Fourier synthesis
respectively, and the underlying mathematics are known as the Fourier theorem.
Understanding the maths is not required to use these methods, which are
implemented as core functions in most computer programming languages.
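A small hedged illustration of these two operations in R, using a made-up 5Hz sine wave:

wave <- sin(2*pi*5*seq(0,1,length.out=1000)) # one second of a 5Hz sine wave
spec <- fft(wave)/length(wave) # Fourier analysis (the forward transform)
reconstructed <- Re(fft(spec,inverse=TRUE)) # Fourier synthesis (the inverse transform)
max(abs(reconstructed - wave)) # effectively zero: the original waveform is recovered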
Figure 10.3: Illustration of the Fourier transform between a waveform (left) and
its Fourier spectrum (right), for one second of brain activity measured using
electroencephalography (EEG).
The frequency, as mentioned above (see Figure 10.2), is the number of cycles per unit of time. The
amplitude is the vertical difference between the peaks and the troughs of the
sine wave (see Figure 10.4). A low amplitude means a very small change, and a
high amplitude means a large change.
Figure 10.4: Example sine waves of different amplitudes. The sine wave at the
top has a low amplitude, the one below it has a high amplitude.
Figure 10.5 shows an example amplitude spectrum. This is based on the same
data as in Figure 10.3, but here we have zoomed in on the portion of the
spectrum from 0 - 30 Hz. This is where most of the action is in human brain
activity (because of the intrinsic timescales at which neurons operate), and so is a
worthwhile frequency range to focus on. The highest amplitude is at 5Hz, which
is consistent with the clearly periodic nature of the waveform in Figure 10.3,
showing 5 peaks and 5 troughs in one second. The amplitude spectrum therefore
has a direct mapping to features of the waveform it represents. The units of
amplitude are the same as the units used to measure the original waveform.
For the EEG data used in the examples here, these are microvolts (µV ), but
the units will correspond to whatever dependent variable you have chosen to
measure.
Figure 10.5: Amplitude spectrum of the EEG data from Figure 10.3, shown for frequencies between 0 and 30 Hz.
Figure 10.6: Example sine waves of different phases, relative to the vertical
dashed line. The waveform at the top is in sine phase with the line (i.e. the line
is mid-way through a cycle). The waveform below is in cosine phase with the
line (i.e. the line is at a peak).
The waveforms of the two calls look broadly similar - the offset along the x-axis is arbitrary, and determined only
by when in the recording the call began. However the amplitude spectra in the
lower plot have peaks at very different frequencies. The Common Pipistrelle’s
call (black) peaks at around 5000 Hz, whereas the Noctule’s call (blue) peaks at
around 2000 Hz.
There have been many different classification systems proposed that use Fourier
transformed echolocation signals to identify bat species. For example, Walters
et al. (2012) trained an artificial neural network to discriminate between 34
different bat species. The calls were first assigned to one of five different groups;
this classification had an extremely high accuracy of around 98%. Calls were
subsequently assigned to individual species, which had a slightly lower accuracy
of around 84% (but still far above chance performance of 100/34 = 2.9%).
Online tools are available to classify bat calls, and mobile phone applications
and dedicated handheld devices are now available that can perform classification
in real time out in the field. These tools are all based on Fourier analysis.
Figure 10.7: Waveforms (a) and Fourier spectra (b) for example calls from two
bat species: Pipistrellus pipistrellus (black), and Nyctalus noctula (blue).
Figure 10.8: Example sine wave gratings of different spatial frequencies and
orientations. The left grating has a low spatial frequency (3 cycles per image),
the middle grating has a higher spatial frequency (10 cycles per image). The
right grating has an oblique orientation.
A greyscale image is really just a grid of numbers that vary in intensity across space, and if we plot those values for one row of the image (superimposed in blue in Figure 10.9), they look very much like a waveform. In
the right panel is the Fourier spectrum of the image. This has been zoomed into
the central low spatial frequency portion, where most of the energy resides.
Of course in a real Fourier spectrum, the small grating icons are not shown.
Instead the value (brightness) at each point (i.e. each x,y coordinate) represents
the amplitude at that particular combination of orientation and spatial frequency.
The right hand panel of Figure 10.9 shows an example in which most of the
energy is concentrated at low spatial frequencies (as is typical for natural images),
with dominant vertical energy (along the horizontal axis), caused by the vertical
contours (wooden poles) in the original image.
Figure 10.9: Greyscale image of a bug hotel (left), and its Fourier spectrum
(right).
film. Amazingly, the researchers also found an individual who has direct control
over their own piloerection response, and could give themselves goosebumps on
demand!
Figure 10.12: Illustration of low-pass filtering. The left plot shows the Fourier
spectrum, with superimposed low-pass filter (blue), which excludes the high
frequency components outside of the shaded region. The right panel shows the
original waveform (blue) and the filtered waveform (black) which lacks the high
frequency noise and is therefore visibly smoother.
We can also apply filters in two dimensions. Figure 10.13 shows low pass and high
pass filters, and their effect on the bug hotel image. The low pass filtered image
looks blurry, as the fine detail is stored at the higher spatial frequencies which
have been removed by the filter. The high pass filtered image lacks extended
light and dark regions (represented by the lower spatial frequencies), and retains
only edges at higher frequencies. Note also how the overlaid pixel intensities for
the central row (shown in dark blue) are smooth in the low pass filtered version,
and jagged in the high pass filtered version.
Finally, we can filter in the orientation domain. Figure 10.14 shows filters and the
resulting images in which either horizontal (left) or vertical (right) information
is removed, leaving information at the orthogonal orientation. Notice how in
the image where horizontal information is removed (left) we can still clearly see
the vertical poles at either side of the image, and the vertical white parts of the
little drawer unit in the centre at the bottom. On the other hand, in the image
where vertical information is removed (right) these features are missing, but a
central horizontal bar, and the horizontal parts of the drawer unit are visible.
The same approach can be taken to generate images. For example, we can create
sinusoidal stimuli with very tightly defined properties (specified bandwidths) in
the Fourier domain. A popular stimulus in computer vision research is the Gabor
pattern, which is a spatially localised sine wave grating. We can generate these
in the Fourier domain by shifting a two-dimensional Gaussian blob (like the
low pass filter in Figure 10.13) away from the origin of Fourier space. This will
produce a Gabor pattern in the spatial domain (see Figure 10.16 for examples).
An interesting observation is that patterns with a small footprint in the Fourier
domain have a large spatial extent in the spatial domain, and vice versa. This
means that small patches of grating have a broader frequency bandwidth than
large ones, and so their orientation and spatial frequency are less clearly defined.
Figure 10.13: Example of low and high pass filtering on the bughouse image.
The top row shows low pass and high pass filters, in which frequencies in the
lighter regions pass the filter, but frequencies in the darker regions are attenuated.
The lower row shows the resulting filtered images: low pass filtering produces a
blurred image, high pass filtering produces a sharp looking image but without
coarse changes in light and dark.
Figure 10.14: Example of filtering in the orientation domain. The left column
shows a filter that blocks horizontal information, but retains vertical information.
The right column shows the opposite filter.
Figure 10.15: Fourier spectra (left, frequency in Hz) and the corresponding waveforms (right, time in s) for sums of harmonics: 1F, 1F+3F, 1F+3F+5F, 1F+3F+5F+7F, 1F+3F+5F+7F+9F, and 1F to 49F.
Figure 10.16: Gabor stimuli synthesised in Fourier space. The upper row shows
the Fourier spectra, and the lower row the spatial transforms.
The key function is the fft (Fast Fourier Transform) function. This takes a
vector or matrix as its input, and returns a complex-valued Fourier spectrum of
the same dimensions. Complex numbers are a mathematical convenience, and
contain ‘real’ and ‘imaginary’ components. It is not necessary to fully grasp
the mathematics of complex numbers, but in contemporary implementations
of Fourier analysis, the amplitude and phase information are represented in
Cartesian coordinates by the real and imaginary components of the number. An
optional argument to the fft function, inverse = TRUE, will request the inverse
transform. By convention, we also scale the output of the function by the length
of its input. The following lines of code perform the Fourier transform on the
waveform and confirm (using the is.complex function) that we have a complex
valued output:
output <- fft(thiswave)/length(thiswave)
is.complex(output)
## [1] TRUE
We can determine the frequencies for plotting the amplitude spectrum if we
know the duration of the signal (here it was 1 second) and the sample rate (here
1000Hz). The amplitudes can then be plotted as a function of frequency by
taking the absolute values of the Fourier spectrum (e.g. forcing any negative
values to be positive using the abs function) as follows:
samplerate <- 1000
duration <- 1
frequencies <- ((1:(samplerate*duration))-1)/duration
plot(frequencies[2:500],abs(output[2:500]),type='l',lwd=2)
Note that in the above code (see Figure 10.17 for the output), we plot values
only up to the Nyquist limit of 1000/2 = 500Hz. The spectrum is mirrored
about its midpoint, so the values from an index of 501 onwards are a reflection of
the spectrum plotted in Figure 10.17. Notice also that we begin plotting at the
second index of the vectors containing the frequency and Fourier spectrum data.
Figure 10.17: Amplitude spectrum of the waveform, plotted up to the Nyquist limit of 500Hz.
This is because the first entry in the spectrum, known as the DC component (by
analogy to direct current), often has a much larger amplitude than the other
frequencies. The DC component has a frequency of 0 Hz, that corresponds to
the vertical offset of the waveform (a bit like the intercept term in regression
and ANOVA). Since it is often uninteresting, we have omitted it from the plot
above, but it can be included if required.
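As a quick hedged check of these last two points, using the output object computed above (and assuming thiswave is real-valued):

abs(output[1]) # amplitude of the DC (0 Hz) component: the mean offset of the waveform
# the spectrum is mirrored about its midpoint, so the amplitudes above the
# Nyquist limit are a reflection of those below it
all.equal(abs(output[2:500]),rev(abs(output[502:1000])))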
Figure 10.18: The filter kernel stored in the data object filter1, plotted as a function of sample index.
Rather than convolving the filter with the waveform in the time domain, we can multiply the Fourier transform of the filter by the Fourier transform of the signal. This produces the same result, because convolution
in the temporal domain is the same as multiplication in the Fourier
domain. So, we can apply the filter in the Fourier domain, and then take the
inverse transform to view the filtered signal as follows:
# multiply the fourier spectra of the waveform and filter
filteredspectrum <- output*abs(fft(filter1))
# inverse transform and take the Real values
filteredwave <- Re(fft(filteredspectrum,inverse=TRUE))
plot(1:1000,filteredwave,type='l',lwd=2)
The filtered waveform in Figure 10.19 is much smoother than the original, shown
in Figure 10.3.
Figure 10.19: The filtered waveform (filteredwave), which is visibly smoother than the original.
The quadrant shift rearranges the Fourier spectrum of an image so that the low spatial frequencies that were
In many programming languages there is a built in function to implement the
quadrant shift, but in R we need to define the following single line function:
fftshift <- function(im) {im * (-1)^(row(im) + col(im))}
We first load the image in from a file using the readJPEG function from the jpeg
package. The image is stored as a 512x512x3 matrix. The 512x512 is the size of
the image in pixels (in the x and y directions), and the third dimension contains
three colour channels: red, green and blue. We will just use the information in
the red colour channel and discard the others, so that our image is black and
white.
library(jpeg)
bughouse <- readJPEG('images/bughouse.jpg')
bughouse <- bughouse[,,1]
The image is now stored as a 512x512 matrix of pixel intensities. We can take
the Fourier transform, applying the quadrant shift, as follows:
bugspectrum <- fft(fftshift(bughouse)) # Fourier transform of the image, with the quadrant shift applied
Now that we have Fourier transformed the image, we can do some more aggressive
filtering. Perhaps we could include only oblique orientations within a narrow
range of spatial frequencies, using an oriented bandpass filter like those in Figure
10.16. These are created in the Fourier domain using a short function called
offsetgaus as follows:
offsetgaus <- function(n,std,x,y){
i <- matrix(data = (1-(n/2)):(n/2), nrow=n, ncol=n)
j <- t(apply(i,2,rev))
h <- exp(-(((i+x)^2) / (2 * std^2)) - (((j+y)^2) / (2 * std^2)))
return(h)}
# create a Gabor filter using two Gaussian functions, offset from the origin
g <- offsetgaus(512,8,20,20) + offsetgaus(512,8,-20,-20)
The filter and its spatial transform will look very similar to those shown in Figure
10.16. We then multiply the filter by the Fourier spectrum of the image, and take
the inverse transform, with a bit of quadrant shifting sleight of hand. Finally,
we rescale the luminances to between 0 and 1, and then plot the resulting image
(see Figure 10.20).
# apply the filter and inverse transform
filteredimage <- Re(fftshift(fft((bugspectrum*g), inverse=TRUE)))
filteredimage <- filteredimage - min(filteredimage) # scale the luminances
filteredimage <- filteredimage/max(filteredimage) # to between 0 and 1
The clearest feature in the filtered image is a diagonal plank of wood, which has
the most left-oblique energy. This is circled in blue in the original image (right
panel of Figure 10.20).
This section has provided example code for performing Fourier analysis and
filtering in both one and two dimensions. The practice questions below test your
understanding with further examples.
Figure 10.20: Oblique filtered bug hotel image (left). The strongest feature
corresponds to a diagonal plank of wood, circled in blue in the original image
(right).
6. Which line of code will return the amplitude spectrum of the data object waveform?
A) angle(fft(waveform))
B) abs(fft(waveform,inverse=TRUE))
C) abs(fft(waveform))
D) angle(fftshift(waveform))
7. Which pair of operations are equivalent?
A) Convolution in the temporal domain and division in the Fourier
domain
B) Squaring in the temporal domain and subtraction in the Fourier
domain
C) Addition in the temporal domain and convolution in the Fourier
domain
D) Convolution in the temporal domain and multiplication in the Fourier
domain
8. In Fourier space, the highest spatial frequencies are traditionally repre-
sented:
A) In the corners
B) In the upper half
C) In the centre
D) In the lower half
9. What will the following line of code do? angle(fft(waveform))
A) Return the amplitude spectrum
B) Return the phase spectrum
C) Return the full Fourier spectrum
D) Return a smoothed waveform
10. Which line of code will return a filtered version of the data object signal?
A) abs(fft(fft(signal)*filter,inverse=TRUE))
B) Re(fft(fft(signal)*filter,inverse=TRUE))
C) Re(fft(fft(signal,inverse=TRUE)*filter))
D) abs(fft(signal*filter,inverse=TRUE))
Answers to all questions are provided in section 20.2.
Chapter 11
Multivariate t-tests
Many widely-used statistics are univariate in nature, in that they involve a single
dependent variable (outcome measure). If you have more than one dependent
variable, a number of alternative statistical tests are available that can deal with
all of the dependent variables at once, rather than running a series of univariate
tests. The next four chapters will introduce a selection of these methods, which
are referred to as multivariate techniques.
Figure 11.1: Example scatterplots of bivariate data, panels (a) to (d), plotting variable y against variable x.
Figure 11.2: Further example scatterplots of bivariate data, panels (a) to (d), plotting variable y against variable x.
In essence, the one-sample test compares the sample mean to some other point in the space, for example the origin (x = 0, y = 0) in this
example. The distance between the two points is the length of the vector that
joins them, shown by the black line. The variance term is calculated from the
lengths of the residuals. These are the thin grey lines that join the mean to each
data point. Also included in the variance term is the covariance between the two
variables, which is best thought of conceptually as the correlation between them.
Figure 11.3: Example scatterplot showing the sample mean (black point), vector
line between the sample mean and the origin (black line), and residual lines
joining each data point to the sample mean (grey lines).
The one-sample version of the statistic is calculated as:

T² = N(x̄ − µ)′C⁻¹(x̄ − µ),   (11.1)

where N is the sample size, (x̄ − µ) is a vector of differences between the sample
mean (x̄) and the point we are comparing it to (µ; i.e. the black point and the
origin in Figure 11.3), C is the covariance matrix (and C −1 its inverse). The
tick symbol (′ ) indicates transposition of the vector. Calculating the inverse
covariance matrix is impractical by hand, so it is always done by computer.
However I have included the equation here so that you can see the role the
covariance matrix plays in calculating the test statistic.
An example covariance matrix for a bivariate data set (variables x and y) looks like this:

## x y
## x 0.904 -0.924
## y -0.924 1.221
The values on the diagonal of the matrix (x,x and y,y) give the variance for each
of the two variables (which must always be positive). The off-diagonal values
(x,y and y,x) give the covariance between the two variables (note that both these
values are identical, and may be negative as in the above example). All of the
values are in the original units of measurement - if the matrix is standardised, it
becomes a correlation matrix. The covariance matrix fully describes the variance
and covariance of a multivariate data set.
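A minimal illustration of that standardisation in R, assuming a two-column matrix of observations called data:

C <- cov(data) # covariance matrix, in the original units of measurement
cov2cor(C) # the standardised version: a correlation matrix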
Statistical significance can be assessed by converting T² to an F-ratio:

F = ((N − m) / (m(N − 1))) T²,   (11.2)
For repeated measures designs (where the same participants complete two
different conditions), a paired-samples version of T 2 is achieved by subtracting
each participant’s scores across the two conditions, and performing the one
sample test comparing to zero. (It is not always appreciated that for univariate
t-tests, a paired samples test is identical to a one-sample test conducted on the
differences between the conditions). Furthermore, the same approach works with
an arbitrary number of dependent variables (m > 2), making T 2 a multivariate
(rather than a bivariate) statistic.
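The parenthetical point is easy to verify with a quick sketch using made-up univariate data:

a <- c(5.1,6.2,5.9,7.0,6.4) # condition 1 (illustrative values)
b <- c(4.8,5.9,6.1,6.5,6.0) # condition 2
t.test(a,b,paired=TRUE)$statistic # paired-samples t-test
t.test(a-b,mu=0)$statistic # one-sample t-test on the differences: identical t value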
For comparing two independent groups, the statistic is calculated as:

T² = ((N1 × N2) / (N1 + N2)) (x̄1 − x̄2)′C⁻¹(x̄1 − x̄2),   (11.3)
where x̄1 and x̄2 are the vectors of sample means for the two groups, and N1 and N2 are the sample sizes. The covariance matrix (C) is the pooled covariance matrix across the two samples, taking sample size into account:

C = ((N1 − 1)C1 + (N2 − 1)C2) / (N1 + N2 − 2),   (11.4)

where m is the number of dependent variables, and C1 and C2 are the covariance matrices of the two groups.

Figure 11.4: Example SSVEP data. Each blue point is an individual participant (N=100), the black points are the group means, and the orthogonal lines show the eigenvectors of the bounding ellipse. Panel (a) shows data from the baseline condition where no stimulus was shown, panel (b) shows data from a condition where 32% contrast sine wave grating patches flickered at 7Hz. Both data sets are from the 7Hz frequency bin of the Fourier spectrum of the EEG data recorded at the occipital pole. The x-axis represents the real component, and the y-axis the imaginary component of the complex number.

For the two-sample version, the F-ratio is calculated as:
F = ((N1 + N2 − m − 1) / (m(N1 + N2 − 2))) T²,   (11.5)
11.5 Example: visual motor responses in zebrafish larvae
Figure 11.5: Zebrafish larvae visual motor reflex data from Liu et al. (2015).
Panel (a) shows the timecourse for the burst duration index (BDI; upper) and
the average activity count (lower), for larvae 6 (black) and 9 (blue) days post
fertilization (DPF). Shaded regions indicate 95% confidence intervals across 192
individuals, and the vertical dashed line indicates light onset. Panel (b) shows
the bivariate means across both variables (BDI and activity count) for the 1
second period before (dark, squares) and after (light, circles) the light stimulus
onset, again for 6 (black) and 9 (blue) days post fertilization.
The left panel of Figure 11.5 shows the timecourses of the two measurement indices - the burst duration index (top) and the burst count
(bottom). The right panel illustrates the two measures plotted against each
other at two time points (one second before or after light onset). Two-sample
T 2 tests indicate no difference at the time point immediately before stimulus
onset (squares; T 2 = 3.45, F(2,381) = 1.72, p = 0.18), but a significant effect
one second after the light was presented (circles; T 2 = 8.85, F(2,381) = 4.41, p
= 0.01). This suggests that older larvae have a slightly weaker initial response
to light, though it is clear from Figure 11.5a that movement persists for longer
in the 9 day old larvae. Overall, the Liu et al. (2015) study is a good example
of how multivariate statistics can be used to analyse complex data sets.
11.6 The T²circ statistic
Victor and Mast (1991) proposed a variant of the T² statistic called T²circ (the circ is short for circular). This was intended specifically for analysing complex
Fourier components like those we encountered in Figure 11.4. The test has some
additional assumptions - specifically that the units of the dependent variables
have equal variance, and that there is no correlation between them. In other
words, the data should conform to a circular cloud of points (as in Figure 11.1)
and not an ellipsoidal one (as in Figure 11.2). If these conditions are met, the
one-sample version of the statistic is calculated as:
T²circ = (N − 1)|x̄ − µ|² / Σ|xj − x̄|²,   (11.6)
where N is the sample size, x̄ is the sample mean, µ is the point of comparison,
and xj represents individual observations. The vertical slash symbols ( | | )
denote the absolute value of the numbers inside (i.e. the vector lengths). In
words, this equation takes the squared length of the line joining the sample mean
to the comparison point (i.e. the black line in Figure 11.3), and divides by the
sum of the squared residuals (i.e. the grey lines in Figure 11.3).
Note that crucially there is no covariance term in this equation, which makes it
substantially simpler to calculate. As with the original T² statistic, statistical significance is estimated by comparison with an F-distribution, which for two dependent variables has 2 and 2N−2 degrees of freedom for F = N × T²circ. Repeated
measures and two-sample versions are also possible.
Victor and Mast (1991) demonstrate that the T²circ statistic can be more sensitive (i.e. have greater power) than Hotelling's T² when its assumptions are met.
However there is an issue with the false positive rate when the assumptions
are violated (i.e. when the variables are correlated or have different variances).
I recently (D. Baker 2021) proposed a method for testing the assumptions,
that involves comparing the condition index of a data set to that expected
by chance. The condition index is the square root of the ratio of eigenvector
lengths (eigenvectors are the axes of a bounding ellipse, see examples given by
the grey lines in Figure 11.4). This functions like other assumption tests, in that
a significant result means that T²circ should not be used, and Hotelling's T² is a safer alternative.
A standardised measure of multivariate effect size is given by the Mahalanobis distance, D, which for two groups is calculated as:

D = √((x̄1 − x̄2)′C⁻¹(x̄1 − x̄2)),   (11.7)
where all terms are as defined previously, and C is the pooled covariance matrix
calculated using eqn. (11.4). Note that some implementations of the Mahalanobis
distance actually return D2 , which can be converted back to D by taking the
square root (as in eqn. (11.7)). As with Cohen’s d, the D statistic is standardised
so it can be compared across different data sets, studies, and dependent variables,
and could in principle be used as an effect size for meta analysis (see Chapter 6).
I strongly recommend reporting it alongside the results of any T² or T²circ test.
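The remaining examples in this chapter use functions from the FourierStats package, which is assumed here to have been installed already; a minimal loading step would be:

library(FourierStats) # load the FourierStats package (assumed to be installed)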
The FourierStats package contains a function called tsqh.test that can calculate one-sample, two-sample and repeated measures versions of Hotelling's T² test. Let's assume
that our first data set is stored in an N × 2 array called data:
head(data)
## [,1] [,2]
## [1,] 0.9796330 1.83289050
## [2,] 0.5984391 -0.65855164
## [3,] 2.3306366 1.44780388
## [4,] -0.2141754 -0.06080895
## [5,] -0.1746285 -0.28229951
## [6,] -0.7105413 -0.68882363
We can conduct a one-sample T 2 test using the tsqh.test function as follows:
tsqh.test(data)
To calculate the Mahalanobis distance between the data centroid and the origin, we can use the mahalanobis function that is built into R:

D2 <- mahalanobis(c(0,0),center=colMeans(data),cov=cov(data))
sqrt(D2)

## [1] 0.6339422
We are passing to the function the two points we wish to compare - the data
centroid (calculated using the colMeans function), and the comparison point
(0,0). We also provide the covariance matrix from the data (cov(data)). Note
that the function returns the squared distance, so we must take the square root
to find D. If we want to compare to a different point, we can change the input
to the first argument, for example:
D2 <- mahalanobis(c(0.25,0.25),center=colMeans(data),cov=cov(data))
sqrt(D2)
## [1] 0.2816563
To compare two groups, we can again use the tsqh.test function, providing it
with both data sets, and specifying either a paired or unpaired test:
tsqh.test(data,y=baseline,paired=TRUE)
## [1] 0.6503023
For the independent samples (unpaired) case, we instead use the pairwisemahal
function from the FourierStats package. The function expects the data to be
stored in a single matrix, with an additional grouping variable to identify which
group each observation belongs to. We can combine our two data objects using
the rbind function, and generate the group indices with the rep function:
# combine both data sets into a single 200x2 matrix
alldata <- rbind(data,baseline)
# create group labels of 100 1s and 100 2s
grouplabels <- rep(1:2,each=nrow(data))
Then both of these new data objects are passed to the pairwisemahal function:
pairwisemahal(alldata,grouplabels)
## 1 2
## 1 0.0000000 0.9078837
## 2 0.9078837 0.0000000
Note that this function returns D (like it should!) and not D2 , so there is no
need to take the square root. It returns a data object that is structured like a
correlation matrix, showing the pairwise distance between each pair of groups.
This allows you to pass in any number of groups, and obtain a full matrix of
distances.
The FourierStats package also contains a function called tsqc.test, that implements the T²circ test. The syntax is identical to that for tsqh.test, so these
functions can be used interchangeably (though note that tsqc.test only works
for bivariate data, whereas tsqh.test can cope with any number of dependent
variables). However, in order to justify running a T²circ test, we should first test
the condition index of each data set. The function CI.test runs the condition
index test as follows:
CI.test(data)
## CI N criticalCI pval
## 1 1.484294 100 1.282 0.0005631189
A full explanation of how this test works is given by D. Baker (2021). However
you can think of it as being similar to other assumption tests you might be
familiar with (see section 3.8), such as Mauchly’s test of sphericity that is
used to test the assumptions of repeated measures ANOVA, or Levene’s test of
homogeneity of variances. Just like these other assumption tests, if the condition
index test is significant at p < 0.05 (as it is above), then the assumptions of
T²circ are violated, and we should instead run the T² test.
These are the basics of how to calculate the T² and T²circ statistics, and the
Mahalanobis distance in R. They are quite rarely used tests, and my hope is
that by including them here more people will know about and use them in the
future. Readers interested in the implementation of the tests are welcome to
inspect the code underlying the FourierStats package for further insights.
Chapter 12

Structural equation modelling
Table 12.1: Summary of variables in the Holzinger and Swineford data set.
The version of the data set we will use here contains scores for a subset of 9 of the tests from the full study. Here is a snippet of the data set:
## id sex ageyr agemo school grade x1 x2 x3 x4 x5 x6
## 1 1 1 13 1 Pasteur 7 3.333333 7.75 0.375 2.333333 5.75 1.2857143
## 2 2 2 13 7 Pasteur 7 5.333333 5.25 2.125 1.666667 3.00 1.2857143
## 3 3 2 13 1 Pasteur 7 4.500000 5.25 1.875 1.000000 1.75 0.4285714
## 4 4 1 13 2 Pasteur 7 5.333333 7.75 3.000 2.666667 4.50 2.4285714
## 5 5 2 12 2 Pasteur 7 4.833333 4.75 0.875 2.666667 4.00 2.5714286
## 6 6 2 14 1 Pasteur 7 5.333333 5.00 2.250 1.000000 3.00 0.8571429
## x7 x8 x9
## 1 3.391304 5.75 6.361111
## 2 3.782609 6.25 7.916667
## 3 3.260870 3.90 4.416667
## 4 3.000000 5.30 4.861111
## 5 3.695652 6.30 5.916667
## 6 4.347826 6.65 7.500000
In the above output, the first six columns give demographic data about the
participants, including age, sex, school year, and school attended. These are not
of particular interest for the analysis we have in mind. The remaining columns
contain the nine dependent measures, which correspond to the tests described in
Table 12.1.
The nine tests probe different aspects of mental ability, from basic perception
through to numerical and linguistic functions. We can summarise the relation-
ships between the variables by generating a covariance matrix:
round(cov(HolzingerSwineford1939[,7:15]),digits=2)
## x1 x2 x3 x4 x5 x6 x7 x8 x9
## x1 1.36 0.41 0.58 0.51 0.44 0.46 0.09 0.26 0.46
## x2 0.41 1.39 0.45 0.21 0.21 0.25 -0.10 0.11 0.24
## x3 0.58 0.45 1.28 0.21 0.11 0.24 0.09 0.21 0.38
The relationships are often easier to interpret if we standardise the covariance matrix to produce a correlation matrix:

round(cor(HolzingerSwineford1939[,7:15]),digits=2)
## x1 x2 x3 x4 x5 x6 x7 x8 x9
## x1 1.00 0.30 0.44 0.37 0.29 0.36 0.07 0.22 0.39
## x2 0.30 1.00 0.34 0.15 0.14 0.19 -0.08 0.09 0.21
## x3 0.44 0.34 1.00 0.16 0.08 0.20 0.07 0.19 0.33
## x4 0.37 0.15 0.16 1.00 0.73 0.70 0.17 0.11 0.21
## x5 0.29 0.14 0.08 0.73 1.00 0.72 0.10 0.14 0.23
## x6 0.36 0.19 0.20 0.70 0.72 1.00 0.12 0.15 0.21
## x7 0.07 -0.08 0.07 0.17 0.10 0.12 1.00 0.49 0.34
## x8 0.22 0.09 0.19 0.11 0.14 0.15 0.49 1.00 0.45
## x9 0.39 0.21 0.33 0.21 0.23 0.21 0.34 0.45 1.00
Figure 12.1 shows the same correlation matrix in a graphical format. The
matrix shows generally positive correlations between different combinations of
variables. The strongest of these (r = 0.73) is between x4 and x5 - the paragraph
comprehension and sentence completion tasks - and there appears to be a cluster
of high correlations involving x4, x5 and x6 in the centre of the matrix. But even
so, just from inspecting the correlation matrix it is rather hard to understand
the structure of the data set.
An alternative approach is to construct a hypothetical model of the potential
relationships. One very simple model is that a single underlying factor determines
performance on all tasks. This general intelligence, or g, factor is widely discussed
in the literature on human cognitive ability (Spearman 1904). It is the classic
example of a latent variable - a construct that we hypothesise might exist, but
we cannot measure directly. This model can be expressed diagrammatically, as
shown in Figure 12.2.
The path diagram shown in Figure 12.2 has several key features. The nine
dependent variables from the Holzinger-Swineford dataset are shown in square
boxes. In the centre is the latent variable g, shown in a circle. These shapes
are the accepted conventions in SEM - squares or rectangles contain measured
variables, and circles or ovals contain latent variables. The arrows joining the latent variable to the measured variables indicate the hypothesised relationships (paths) between them.
Figure 12.1: The correlation matrix between the nine test variables (x1-x9), displayed graphically as a colour-coded matrix.
Figure 12.2: Example structural equation model with a single latent variable.
An alternative model might be to propose that there are several latent variables,
which map on to specific abilities that are probed by more than one test. For
example, we might propose a latent variable for the visual tasks (x1-x3), another
for the literacy tasks (x4-x6) and a final one for the timed tasks (x7-x9). We could
allow interdependencies (i.e. correlations) between these three latent variables,
and represent the model with the diagram in Figure 12.3.
Figure 12.3: Example structural equation model with three latent variables.
measures. We could, for example, see which of the above models gives the best
quantitative description of the data set. There might also be a case for altering
the connections between different nodes in a model to obtain a better fit; that
could change our views on how different variables are related. The following
sections will go through four stages involved in SEM, before discussing some
general issues worth being aware of when conducting this type of analysis.
later). It might seem that there are many possible degrees of freedom when
designing a model like this. However, usually we will be guided by previous
studies, and our intuitions about how different variables might be related. If we
have designed the study that generated the data set being analysed, it is likely
that we included measures because we had some sort of expectation about how
they would be related. If we really have no idea about how to design a model,
there is a technique called Exploratory Factor Analysis that can try to derive
the relationships for us. However this is beyond the scope of this chapter, and is
perhaps less well-suited to hypothesis-driven research.
far more data points than free parameters, so both are safely over-identified (as
will typically be the case for data sets with many measures). Notice that model
identification does not depend on the number of cases (i.e. participants) included
in the data set, only on the structure of the data set and the model.
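The lengthy output below summarises a fitted version of the three-factor model. A minimal sketch of the kind of lavaan code that produces this sort of summary is as follows (the object names HS.model and fit are assumptions, chosen to match the objects used later in the chapter; cfa and summary are standard lavaan functions):
library(lavaan)
HS.model <- ' visual =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE)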
##
## Number of observations 301
##
## Model Test User Model:
##
## Test statistic 85.306
## Degrees of freedom 24
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 918.852
## Degrees of freedom 36
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.931
## Tucker-Lewis Index (TLI) 0.896
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -3737.745
## Loglikelihood unrestricted model (H1) -3695.092
##
## Akaike (AIC) 7517.490
## Bayesian (BIC) 7595.339
## Sample-size adjusted Bayesian (BIC) 7528.739
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.092
## 90 Percent confidence interval - lower 0.071
## 90 Percent confidence interval - upper 0.114
## P-value RMSEA <= 0.05 0.001
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.065
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## visual =~
## x1 1.000
## x2 0.554 0.100 5.554 0.000
## x3 0.729 0.109 6.685 0.000
## textual =~
## x4 1.000
## x5 1.113 0.065 17.014 0.000
## x6 0.926 0.055 16.703 0.000
## speed =~
## x7 1.000
## x8 1.180 0.165 7.152 0.000
## x9 1.082 0.151 7.155 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## visual ~~
## textual 0.408 0.074 5.552 0.000
## speed 0.262 0.056 4.660 0.000
## textual ~~
## speed 0.173 0.049 3.518 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 0.549 0.114 4.833 0.000
## .x2 1.134 0.102 11.146 0.000
## .x3 0.844 0.091 9.317 0.000
## .x4 0.371 0.048 7.779 0.000
## .x5 0.446 0.058 7.642 0.000
## .x6 0.356 0.043 8.277 0.000
## .x7 0.799 0.081 9.823 0.000
## .x8 0.488 0.074 6.573 0.000
## .x9 0.566 0.071 8.003 0.000
## visual 0.809 0.145 5.564 0.000
## textual 0.979 0.112 8.737 0.000
## speed 0.384 0.086 4.451 0.000
Next, the sections headed Model Test User Model and Model Test Baseline Model
give us the results of chi-square tests for the model fit and for a baseline (null)
model in which covariances are all fixed at 0. We should expect the model
we designed to do better than the baseline model, and indeed we see that it
has a smaller chi-square test statistic, indicating a closer fit to the data. For
this example both tests are significant; recall that a significant chi-square test
can indicate a poor fit to the data, but that as discussed above this is hard to
evaluate because of the confounding effect of sample size on significance. The
following section of the output compares the model to the baseline using the
Comparative Fit Index and the Tucker-Lewis Index. Both of these values are
quite high, around 0.9, indicating that the model we designed gives a better fit
than the baseline model.
The three subsequent sections of the output report additional measures of
goodness of fit, including the log likelihood, the Akaike Information Criterion,
the Bayesian Information criterion, and the root mean square (RMS) error.
These values are particularly useful for comparing between different possible
models, as we will describe in more detail later in this chapter.
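Individual fit statistics can also be extracted directly from the fitted object with lavaan's fitmeasures function, which is used again later in the chapter; for example:
fitmeasures(fit, c('cfi','tli','aic','bic','rmsea','srmr'))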
The final sections of the output show parameter estimates for the latent variables,
covariances and variances. These are somewhat difficult to interpret in table
format, so we can add the parameter estimates to the path diagram to give a
numerical indication of the strength of the links between variables (see Figure
12.4). This can be done using standardised or unstandardised values. In general,
standardised values are more useful, as the values are then similar to correlation
coefficients. The fitted parameters show high loading of individual measures onto
the three latent variables (coefficients between 0.42 and 0.86), and somewhat
smaller correlations between the latent variables (0.28 to 0.47).
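Diagrams like Figure 12.4 can be drawn with the semPaths function from the semPlot package; a sketch, using the same arguments as the call that appears later in the chapter:
library(semPlot)
semPaths(fit, layout="circle", whatLabels="stand", edge.label.cex=1)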
Figure 12.4: Example structural equation model with three latent variables,
showing standardised parameter estimates.
Adding many new parameters at once is not advisable, as the parameters may be highly
correlated (and therefore not very informative). The order in which parameters
are added and removed can also affect the outcome, so care is advised when
attempting changes to the model.
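For the model comparison a little further on, we also need a fitted version of the single-factor g model from Figure 12.2. Presumably its definition and fit were along the following lines (the name fitG is taken from the anova call below; the rest is an assumption):
HS.model2 <- ' g =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 '
fitG <- cfa(HS.model2, data = HolzingerSwineford1939)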
HS.model2
The fit object stores the model definition, a summary of the fitting process, and
all of the various indices and test statistics. We can request a summary like
the example earlier in the chapter using the generic summary function, and
specifying that we want to see the fit indices as follows (I have suppressed the
output of this command in order to save space, but it is identical to that shown
previously):
summary(fit, fit.measures = TRUE)
From the output, we can extract the various statistics we might want to report.
If we want to compare the fits of two models statistically, we can use the anova
function as follows:
anova(fit,fitG)
There are numerous plotting options, explained in the help file for the semPaths
function. These can be used to change the layout and style of the plot. In these
examples I have used the circle layout, as this shows the latent variables in
the middle of the diagram. Other options include tree and spring - it is worth
checking several of these alternatives to find the most natural and appropriate
way to present a given model. For more general discussion of producing attractive
and informative figures, see Chapter 18.
Model modification can then be conducted. We first calculate modification
indices for the factor loadings, which will tell us the effect of removing one
parameter on the other parameters in the model. The modindices function
calculates this information for all possible operators. Since our model does not
have any covariances between dependent variables, we will only inspect the
links to latent variables (though this does not mean that covariances between
dependent variables do not exist, we are just not considering them here).
mi <- modindices(fit)
mi[mi$op == "=~",1:4] # display only the indices involving latent variables
## lhs op rhs mi
## 25 visual =~ x4 1.211
## 26 visual =~ x5 7.441
## 27 visual =~ x6 2.843
## 28 visual =~ x7 18.631
## 29 visual =~ x8 4.295
## 30 visual =~ x9 36.411
## 31 textual =~ x1 8.903
## 32 textual =~ x2 0.017
## 33 textual =~ x3 9.151
## 34 textual =~ x7 0.098
## 35 textual =~ x8 3.359
## 36 textual =~ x9 4.796
## 37 speed =~ x1 0.014
## 38 speed =~ x2 1.580
## 39 speed =~ x3 0.716
## 40 speed =~ x4 0.003
## 41 speed =~ x5 0.201
## 42 speed =~ x6 0.273
The largest modification index (in the mi column) is 36.4, and corresponds to
the link between the visual latent variable and the speeded discrimination task.
This isn’t part of our original model, but we could consider an updated model
that includes such a link (see Figure 12.5):
HS.model3 <- ' visual =~ x1 + x2 + x3 + x9
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
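Before plotting, the updated model needs to be fitted; presumably this mirrors the earlier fits (fit3 is the object name used below):
fit3 <- cfa(HS.model3, data = HolzingerSwineford1939)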
semPaths(fit3,layout="circle",whatLabels="stand",edge.label.cex=1)
Note that the new link between visual (vsl) and x9 is now included, and has a
substantial coefficient (0.38). We can assess the improvement in fit statistically
using the Lagrange Multiplier test in the lavTestScore function as follows:
a <- lavTestScore(fit, add = 'visual =~ x9')
a$uni
Figure 12.5: Updated structural equation model with an additional link between
variable x9 and the visual latent variable.
We can also compare the RMSEA fit statistics of the original and updated models using the fitmeasures function:
fitmeasures(fit,'rmsea')
## rmsea
## 0.092
fitmeasures(fit3,'rmsea')
## rmsea
## 0.065
These statistics show us that the root mean square error value is smallest for
the updated model (fit3), indicating a better fit to the data.
A similar approach can be taken for removing parameters using the Wald
test (lavTestWald function). This time, let’s remove the link with the lowest
standardised coefficient - the one between the visual latent variable and x2. We
achieve this by introducing a weight term onto this parameter in the model
definition, and then checking what happens when the weight is set to zero:
HS.model4 <- ' visual =~ x1 + b1*x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
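The output below reports the Wald test for fixing the labelled parameter b1 to zero; presumably it was produced by calls along these lines (fit4 is an assumed name; lavTestWald is the lavaan function named above):
fit4 <- cfa(HS.model4, data = HolzingerSwineford1939)
lavTestWald(fit4, constraints = 'b1 == 0')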
## $stat
## [1] 30.84248
##
## $df
## [1] 1
##
## $p.value
## [1] 2.79844e-08
##
## $se
## [1] "standard"
The Wald test also produces a significant p-value, suggesting this change to the
model should be investigated more thoroughly. However, on further inspection,
it actually produces a larger RMS error (and therefore a worse fit) than our
original model:
fitmeasures(fit,'rmsea')
## rmsea
## 0.092
HS.model5 <- ' visual =~ x1 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
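Refitting the reduced model and checking its RMSEA would presumably look like this (fit5 is an assumed name):
fit5 <- cfa(HS.model5, data = HolzingerSwineford1939)
fitmeasures(fit5,'rmsea')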
## rmsea
## 0.099
The above examples provide a basic introduction to the capabilities of structural
equation modelling. Of course, as with most of the techniques in this book,
there is much more to learn, and many excellent resources are available to help.
The book Principles and Practice of Structural Equation Modelling by Kline
(2015) is an authoritative but readable text that goes into much more detail than
we have had space for in this chapter. Another useful resource is the journal
Structural Equation Modeling, which publishes technical papers on this topic.
It is also worthwhile reading some empirical papers that use the methods, to
see how they are implemented and reported in your area of interest. Outside of
the R ecosystem, there are several commercial software packages designed for
structural equation modelling, including LISREL, Stata, Mplus, and the Amos
extension to IBM’s SPSS.
A) Over-identified
B) Under-identified
C) Just identified
D) It is impossible to say without seeing the data
8. Which of the following fit indices indicates a good fit when it has a value
near zero?
A) Bentler-Bonett
B) Chi-square
C) RMSEA
D) McDonald
9. To assess whether a parameter can be removed from a model, we should
use the:
A) Chi-square test
B) Lagrange Multiplier test
C) Comparative fit index
D) Wald test
10. Structural equation modelling is typically unstable with sample sizes less
than:
A) N=200
B) N=300
C) N=400
D) N=1000
Answers to all questions are provided in section 20.2.
Chapter 13
Multidimensional scaling
and k-means clustering
To give an example of how these methods might be used together, let’s imagine
that we discover some new varieties of insect in an underground cave. The
insects are all about 10 mm long, but vary in their colouring from grey to blue,
and in the thickness and angle of the characteristic stripes that cover their backs
(see Figure 13.1a). You suspect that there might be three distinct species of
insect, but how might we test this hypothesis? One option might be to measure
all of the key variables from the insects (stripe thickness and angle, colour) and
use k-means clustering to try to determine the underlying structure of the data
set. Because it is challenging to visualise multivariate data with more than
two dimensions (i.e. variables), we could then use multidimensional scaling to
collapse the data into a two dimensional space for plotting. The end result might
look something like the graph shown in Figure 13.1b. There is evidence of three
primary clusters, for which the example insects in Figure 13.1a are prototypical
examples.
Figure 13.1: (a) Example insects; (b) the insect measurements reduced to two dimensions by multidimensional scaling (Dimension 1 vs Dimension 2), showing three clusters.
Figure 13.3a shows some more complex simulated data (see Chapter 8) generated
from five two-dimensional Gaussian distributions. The colours of the points
indicate the true groupings, and you can see that there is some overlap between
the groups in either the x or y directions. Figure 13.3b shows the k-means
solution with k = 5, where each black point indicates a cluster centroid. The
algorithm has identified sensible clusters, though you can see that some data
points have been grouped with other points that come from a different generating
distribution (i.e. true group). The lines are the residuals that are used to calculate
the distance between each data point and its centroid. We can also see what
happens if we choose different values of k. Figure 13.3c shows clustering with k
= 2, and Figure 13.3d shows clustering with k = 10. These do produce plausible
clusterings, though the original (generating) groupings are not preserved.
Figure 13.2: Illustration of the k-means clustering algorithm. In panel (a), the
data points are shown in grey, with the initial centroid estimates in black and
white. Panel (b) shows the initial cluster assignments, and residual vectors
(lines). Panel (c) shows the revised centroid locations and cluster assignments on
the second iteration of the algorithm. Panel (d) shows the path of each centroid
across four iterations of the algorithm, with data points assigned to their final
clusters.
Figure 13.3: Example k-means clustering on simulated data. Panel (a) shows
data generated from five two-dimensional normal distributions with different
means. Panel (b) shows a k-means solution with k = 5, where black points
indicate the centroids, and lines show the residuals for each point. Panels (c)
and (d) are for k = 2 and k = 10 respectively.
(i.e. the sum of the squared lengths of the residual lines in Figure 13.3b-d)1, and
add a penalty term. For the AIC, the penalty is 2mk, where m is the number
of dimensions (i.e. dependent variables) and k is the number of clusters. For
the BIC, the penalty is 0.5 log(N)mk, where N is the number of data points
(observations).
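As a concrete illustration (this is not code from the chapter; it assumes the total within-cluster sum of squares stored by kmeans as tot.withinss is used as the error term, as described in the footnote, and that dataset is a two-column data matrix):
# hypothetical helper functions for comparing kmeans solutions with different k
kmeansAIC <- function(fit, m){ fit$tot.withinss + 2 * m * nrow(fit$centers) }
kmeansBIC <- function(fit, m, N){ fit$tot.withinss + 0.5 * log(N) * m * nrow(fit$centers) }
k4 <- kmeans(dataset, centers = 4)
k5 <- kmeans(dataset, centers = 5)
c(kmeansAIC(k4, m = 2), kmeansAIC(k5, m = 2))
c(kmeansBIC(k4, m = 2, N = nrow(dataset)), kmeansBIC(k5, m = 2, N = nrow(dataset)))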
For both the AIC and BIC statistics, the best model is the one that produces the
lowest score. Generally both statistics behave similarly, meaning that whichever
one you use is likely to produce the same outcome, so the choice will not matter
for most applications. For the simulated example here, both statistics actually tell
us (see Figure 13.4) that k = 4 clusters gives the most parsimonious description
of the data (despite us actually using 5 generating distributions).
Figure 13.4: Figures of merit as a function of the number of clusters (k). Shaded
regions indicate 95% confidence intervals for 1000 independent data sets generated
from the same underlying distributions.
1 There are variants of AIC and BIC for several different error terms, including the residual
sums of squares and the log-likelihood. The key point though is that a penalty is added that is
dependent on the number of free parameters in the model, which here are the data dimensions
and the number of clusters.
Figure 13.5: (a) Dinosaur weight (log) plotted against height/length ratio (log), with herbivores and carnivores indicated; (b) k-means clustering of these data with k = 2; (c) k-means clustering with k = 3.
Since we have two types of dinosaur, the first thing we can try is setting k =
2 (see Figure 13.5b). This doesn’t do an amazing job, as there are quite a lot
of mis-classifications. In particular, there are lots of carnivores included in the
upper cluster, which should be mostly herbivores. An alternative might be k=3
(see Figure 13.5c), where we could define an intermediate cluster. Given the way
the data appear, this looks like we now have a ‘heavier carnivore’ and a ‘lighter
carnivore’ category, as well as a ‘herbivore’ category. Of course, there are still
some errors, but real data are unlikely to cluster perfectly.
If we had the length, height and weight of a newly discovered species, or
one that doesn’t appear in our original data set, we might use the cluster
arrangement to hazard a decent guess about whether they were a carnivore or
a herbivore. For example, my four year old daughter (who knows much more
about dinosaurs than I do) really likes the protoceratops, which doesn’t feature
in the data set. Apparently these weighed 85 kg and were about 0.6 m tall
and 1.8 m long. That places them firmly in the top left corner of the plot.
Another variant, called k-medoids clustering, has the constraint that the centre
of each cluster must be one of the data points, whereas in k-means clustering this
is only the case for the initial guess. This is also more robust to outliers than
the k-means algorithm because the medoid is a plausible (i.e. already observed)
data point. Finally, the spherical k-means clustering method tries to constrain
both the distance and the angle of each point relative to the cluster centroid, so
that points are evenly spaced radially. All of these variants work in a broadly
similar way, and may be more or less well-suited to a particular situation or data
type.
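Several of these variants are available in existing packages. For example, k-medoids clustering is implemented by the pam function in the cluster package; a minimal sketch, assuming a numeric data matrix called dataset:
library(cluster)
medclusters <- pam(dataset, k = 3)   # 'partitioning around medoids'
medclusters$medoids                  # the data points chosen as cluster centres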
There are also several different algorithms for estimating the clusters. In the
standard method described at the start of the chapter, the centres of the clusters
begin as random samples from the data set, and are iteratively recalculated using
the mean of the points allocated to each cluster. This is sometimes referred to
as Lloyd’s algorithm or the Forgy method (after Lloyd (1982) and Forgy (1965)).
One modification to this algorithm, called the random partition method, is to
assign each data point to a random cluster at the start, instead of choosing k data
points to form the initial cluster centres. An alternative algorithm proposed by
Hartigan and Wong (1979) uses a function minimisation approach (see Chapter
9) to determine cluster membership.
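In R, the base kmeans function lets you choose between several of these update rules through its algorithm argument (a sketch; Hartigan-Wong is the default):
kmeans(dataset, centers = 5, algorithm = "Lloyd")          # Lloyd/Forgy-style updates
kmeans(dataset, centers = 5, algorithm = "Hartigan-Wong")  # the default algorithm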
original data set into the new space. This is achieved by minimising a statistic
called the strain (or in some variants the stress). The strain is a loss function
based on the Euclidean distances between points.
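Although the precise equation is beyond the scope of this chapter, in its simplest (metric) form this loss can be written as

$$\mathrm{Stress} = \sqrt{\frac{\sum_{i<j}\left(\delta_{ij} - d_{ij}\right)^{2}}{\sum_{i<j} d_{ij}^{2}}},$$

where δij is the dissimilarity between observations i and j in the original data, and dij is the distance between the corresponding points in the low-dimensional solution. Non-metric variants replace δij with a monotonic transformation of the dissimilarities.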
A good way to illustrate the results of multidimensional scaling is to use random
colour vectors. Colour is defined using mixtures of the red, green and blue pixels
on a display. We can therefore create random RGB vectors, and use MDS to
reduce from 3 to 2 dimensions for plotting. This is shown in Figure 13.7a, where
colours of a similar hue end up being grouped together. A variant in Figure
13.7b includes a fourth dimension, the alpha (transparency) setting. In this
plot the different hues still group together, but the transparency information is
clearly being factored in too, for example by placing more transparent points
nearer the lower right edge of the cloud.
Figure 13.7: Two-dimensional MDS solutions (Dimension 1 vs Dimension 2) for (a) random RGB colour vectors and (b) random RGBα colour vectors.
The starting data for MDS will be an N × m matrix, where m is the number of
dependent variables (dimensions). For example:
## [,1] [,2] [,3] [,4]
## [1,] 0.692666520 0.98132464 0.08029358 0.6964390816
## [2,] 0.802897572 0.13823851 0.93906565 0.6016717958
## [3,] 0.797127023 0.88163599 0.66954995 0.6361913709
## [4,] 0.007445487 0.06651651 0.37043385 0.1724689843
## [5,] 0.621347463 0.68464959 0.11980545 0.0002071382
The output will be an N × 2 matrix, where the two dimensions are x and y
coordinates:
## [,1] [,2]
## [1,] -0.15160650 0.6517739
## [2,] 0.41646691 -0.2108153
## [3,] -0.03000302 0.3792663
## [4,] -0.05064061 -0.6095281
## [5,] 0.09436012 0.1585279
## [6,] -0.22697391 0.1237203
We can check the mapping between the original dissimilarities and the dissimi-
larities between positions in the lower dimensional space created by the MDS
algorithm (the rescaled data) using a Shepard diagram. This plots the pairwise
distances between points from the original data along the x-axis, and the pairwise
distances for the rescaled data along the y-axis. If there is no loss of information
due to the rescaling, these values should be perfectly correlated. The amount
of scatter around the diagonal is therefore an indication of how faithfully the
data have been mapped by the MDS algorithm. One can also calculate statistics,
such as Spearman’s rank correlation, between the distances in the two spaces.
Examples for the colour data are shown in Figure 13.8.
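A Shepard plot can be constructed by hand from the two sets of pairwise distances; a sketch, using the coldist and scaledxy objects created later in the chapter:
origdist <- as.vector(coldist)            # pairwise distances in the original space
rescaleddist <- as.vector(dist(scaledxy)) # pairwise distances after MDS
plot(origdist, rescaleddist, pch=16, cex=0.5,
     xlab="Original distance", ylab="Rescaled distance")
cor(origdist, rescaleddist, method="spearman")  # rank correlation between the two sets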
Figure 13.8: Shepard plots for rescaling the three- and four-dimensional colour
vectors. Each point represents a pairwise distance between two points. Both
panels show strong ordinal relationships between the original and rescaled values,
with the largest discrepancies in the upper right corner of each plot, representing
pairs of points that were very far apart in both spaces.
The k-means clustering algorithm, with k = 10, was then applied to the matrix
of 2761 × 20 numbers. The choice of k = 10 clusters was intended to produce
a suitable number of stimulus categories for use in a neuroimaging experiment.
The 24 image examples closest to each of the cluster centroids were chosen for
use in the experiment. The distinctness of each cluster was confirmed using
multidimensional scaling to reduce the dimensionality of the image dataset from
20 dimensions to 2. It was also clear that images from individual clusters had
various properties in common - for example all being roughly circular, or oriented
in a particular direction. A summary of the image selection process is provided
in Figure 13.11 (based on Figure 2 of Coggan et al. (2019)).
The final set of 240 images (10 categories × 24 examples) were then presented
to participants in a block design fMRI experiment. The study found that neural
responses in the ventral visual cortex (a region of the brain believed to be
specialised for detecting objects) produced distinct patterns of activity for each
cluster. This is important, because distinct patterns are usually associated with
specific categories of real-world objects (such as faces, buildings etc.), and this
in turn is interpreted as evidence that there are areas of the brain specialised
for different semantic object categories, such as faces, bodies or buildings. By
using object clusters defined entirely by their image properties (and not their
semantic properties), this study demonstrates that low level image features (such
as orientation, curvature and so on) are also important in understanding stimulus
representations in this part of the brain.
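Returning to the simulated two-dimensional data from earlier in the chapter, k-means clustering is performed in R with the base kmeans function. The output discussed below was presumably generated by a call along these lines (dataset is assumed to be the two-column matrix of simulated observations):
clusters <- kmeans(dataset, centers = 5)
clusters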
## [112] 5 3 5 3 5 3 3 3 5 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 2.0466979 2.2336343 0.3633141 2.1496942 1.4766339
## (between_SS / total_SS = 75.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The first line of the output tells us how many clusters we have generated, and
their sizes (i.e. how many observations are assigned to each cluster). Then,
it gives the cluster means as the x and y coordinates of the cluster centres.
Note that for data sets with more than two dependent variables, the means
will contain a value for each dependent variable. The clustering vector gives
cluster assignments to each of the individual observations from the data set. The
summed squared error for each cluster is also provided, and gives an estimate
of the residual variance within each cluster. Finally, the ratio of between and
total sums of squares is given - this is the same as the R2 value from ANOVA or
regression, and tells us the proportion of the total variance that is explained by
cluster assignment.
The output data object allows us to access all of these values, as well as incidental
information about things like the number of iterations required for the clustering
algorithm to converge. We can use this information to plot the lines between
each data point and its assigned cluster centroid as follows (see Figure 13.12 for
the output):
# set up an empty plot axis
plot(x=NULL,y=NULL,axes=FALSE, ann=FALSE, xlim=c(-1,1), ylim=c(-1,1))
axis(1, at=c(-1,1), tck=0.01, lab=F, lwd=2)
axis(2, at=c(-1,1), tck=0.01, lab=F, lwd=2)
# draw lines between each cluster centre and the assigned data point
for (n in 1:(nrow(dataset))){
lines(c(clusters$centers[clusters$cluster[n],1],dataset[n,1]),
c(clusters$centers[clusters$cluster[n],2],dataset[n,2]),
col='grey')}
Additional colours for each cluster, or for true group membership if this is known, can be added with further calls to the points function. For example, the following overlays the data points and the cluster centres:
# redraw the residual lines, then overlay the data points and cluster centres
for (n in 1:(nrow(dataset))){
lines(c(clusters$centers[clusters$cluster[n],1],dataset[n,1]),
c(clusters$centers[clusters$cluster[n],2],dataset[n,2]),
col='grey')}
points(dataset[,1],dataset[,2],pch=16,col=pal2tone[1])
points(clusters$centers[,1],clusters$centers[,2],pch=16,cex=2)
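The colour example relies on a matrix of random RGBα values (colourdata) and the pairwise distances between them (coldist), presumably created along these lines (the number of colours is an assumption):
colourdata <- matrix(runif(400*4),nrow=400,ncol=4)   # random R, G, B and alpha values
coldist <- dist(colourdata)                          # pairwise Euclidean distances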
The coldist data object is a special class of matrix that contains the pairwise
distances in the correct format to use for multidimensional scaling. We can pass
this matrix (but not the raw data) to the cmdscale function as follows:
scaledxy <- cmdscale(coldist,2)
head(scaledxy)
## [,1] [,2]
## [1,] -0.15160650 0.6517739
## [2,] 0.41646691 -0.2108153
## [3,] -0.03000302 0.3792663
## [4,] -0.05064061 -0.6095281
## [5,] 0.09436012 0.1585279
## [6,] -0.22697391 0.1237203
The second argument for the cmdscale function (the number 2) defines the
dimensionality of the output. So, if we wanted to produce a 3D plot of the
data points, we could change this to 3. It is not clear how one might represent
dimensions higher than 3 graphically, but in principle any number of output
dimensions is possible. Each row of the output (scaledxy) shows the x,y position
of a data point (with row number consistent with the original data matrix).
We can then plot the rescaled data. Often it is helpful to colour-code the
individual points by some meaningful category. For this example, each data
point already has an RGBα colour vector associated with it, which we can use
to produce an attractive diagram (see 13.7b) as follows:
plot(x=NULL,y=NULL,axes=FALSE, ann=FALSE, xlim=c(-1,1), ylim=c(-1,1))
axis(1, at=c(-1,1), tck=0.01, lab=F, lwd=2)
axis(2, at=c(-1,1), tck=0.01, lab=F, lwd=2)
title(xlab="Dimension 1", col.lab=rgb(0,0,0), line=1.2, cex.lab=1.5)
title(ylab="Dimension 2", col.lab=rgb(0,0,0), line=1.5, cex.lab=1.5)
title(expression(paste('RGB',alpha,sep='')))
points(scaledxy[,1],scaledxy[,2],pch=16,cex=0.5,col=
rgb(colourdata[,1],colourdata[,2],colourdata[,3],alpha=colourdata[,4]))
A non-metric version of MDS is also available, via the isoMDS function in the MASS package:
library(MASS)
mdsout <- isoMDS(coldist,k=2)
scaledxy <- mdsout$points
points(scaledxy[,1],scaledxy[,2],pch=16,cex=0.5,col=
rgb(colourdata[,1],colourdata[,2],colourdata[,3],alpha=colourdata[,4]))
If you run this code, you will notice that the solution is similar to the metric
version, but the non-metric diagram has more outliers at the extremes. In both
cases, the units of the scaled solution are arbitrary for both dimensions.
Chapter 14

Multivariate pattern analysis
The final multivariate technique we will discuss in this book is called Multivariate
Pattern Analysis (MVPA). In recent years it has been widely used to analyse
MRI recordings of brain activity, where it is sometimes referred to as MultiVoxel
Pattern Analysis, though the techniques (and acronyms) are much the same.
These methods are a subset of machine learning - a family of artificial intelligence
(AI) methods that aim to train computer algorithms to perform classification tasks
on some sort of complex data. Prominent examples of machine learning include
object identification algorithms (i.e. for labelling the contents of photographs)
and dictation software that converts speech to text (and in some cases can
act on verbal instructions). Machine learning methods also have substantial
promise in the area of personalised medicine and automated diagnosis, with one
prominent example being the diagnosis of eye disease (De Fauw et al. 2018).
These techniques will become more widespread and accurate in the future, and
at the time of writing (2021) are attracting substantial media attention and
commercial investment. In such a fast-moving field, it is always worth keeping
up with new developments. However, for a more detailed discussion of the core
aspects of pattern analysis and other machine learning methods, the classic
text, Pattern recognition and machine learning by Bishop (2006), is an excellent
resource.
human operators manually label each one would be prohibitively expensive (not
to mention tedious). An algorithm that can automatically identify their contents
makes the images searchable using text keywords, without requiring extensive
human labour. In other situations, algorithms can be used to identify patterns
in data that would be hard for humans to spot, perhaps owing to the complexity
of the data. The great promise of this aspect of AI is that it could help improve
critical real-world problems such as disease diagnosis and risk prediction in the
insurance industry.
Figure 14.1: (a) Simulated data for two groups (A and B); (b) the same data replotted as a single cloud, with a dashed line indicating a category boundary.
To think about classification, we could replot the data as a single cloud, as shown
in Figure 14.1b (the x-position of each point is arbitrary here). A good way
to try to classify group membership is to plot a category boundary that best
separates the two groups. This is shown by the dashed line in Figure 14.1b, and
a sensible decision rule would be to say that data points below the line (most
of the grey circles) are more likely to be in group A, and those above the line
(most of the blue squares) will be in group B. Of course, this classification is not
totally accurate for the current example - there are several blue squares below
the line and grey circles above it, and these will be misclassified.
The category boundary is a basic classification algorithm, and it prompts some
observations that will generalise to more complex cases. First, we can work out
the accuracy of the classification by calculating the percentage of data points that
are correctly identified. In the example in Figure 14.1b, this is something like
90%. It is also clear that if the group means were more similar, accuracy would
reduce, and if they were more different, accuracy would increase. Additionally,
the variance (spread) of the data points is important. If the variance were
greater, classification accuracy would decrease, and if the variance were smaller,
classification accuracy would increase.
A convenient way of summarising the mean difference and variance is to use the
Cohen’s d metric introduced in Chapter 5. The d statistic is the difference in
means, divided by the standard deviation. We can calculate how classification
accuracy (for a two-category data set) changes as a function of d, as shown by
the black curve in Figure 14.2. As we might expect, increasing the separation
between the group means increases the accuracy of our classifications. So far so
good, but up until this point our examples have had only a single dependent
variable - isn’t MVPA supposed to be a multivariate technique?
The same logic as described for a single variable can easily be extended to the case
of multiple variables. If we have two variables, we can try to classify data points
using both pieces of information by placing a category boundary to separate the
two-dimensional space created by plotting the variables against each other. This
is shown in Figure 14.3a for a linear classifier, where the line separating the white
and grey areas indicates the category boundary. Adding extra informative (and
uncorrelated) variables increases accuracy (see dashed blue curve in Figure 14.2).
In principle this same trick can be applied for any number of variables. In a real
data set, some variables will be informative whereas others will not, and there
will usually be some level of covariance between different measures. However this
is not generally a problem - classifier algorithms will tend to ignore uninformative
variables, and assign more weight to measures that improve accuracy.
Figure 14.2: Increase in accuracy with Cohen’s d for one (black) and two (blue)
dependent variables (DVs).
Figure 14.3: (a) A linear classifier boundary separating two categories in a two-dimensional space; (b) a nonlinear (radial basis function) boundary enclosing one category.
and works in a similar way to the nonlinear curve fitting methods described in
Chapter 9. Support vector machines can also be created with nonlinear (radial)
basis functions (the basis function is the mathematical equation that is used to
construct the boundary line). These work by enclosing an ‘island’ of values from
one category (see Figure 14.3b). These can be more efficient for some types of
data, but also sometimes suffer from problems with generalisation to new data
sets (Schwarzkopf and Rees 2011).
Other types of algorithm can involve neural networks, which are based on
interconnected multi-layer processing of the type that happens in biological
neural systems. The input layer consists of a set of detectors that respond to
particular inputs or combinations of input. Each successive layer of the network
then applies a mathematical operation to the output of the previous layer. These
operations are usually fairly basic ones, such as weighted averaging (described
in section 6.8 in a different context), choosing the largest input, or a nonlinear
transform such as squaring. The end result of the network is an output layer,
from which classification decisions can be read. Neural networks can produce
very sophisticated operations (much like the brain), though it can sometimes be
difficult to understand fully what a trained neural network is actually doing.
A particularly useful variety is the deep convolutional neural network. These are
based on the early stages of sensory processing in the brain, and involve taking
images (or other natural inputs) and passing them through a bank of filters
that pick out specific low-level features from the input. An example filter bank
is shown in Figure 14.4, involving filters of different orientations and spatial
frequencies (see also Chapter 10 for details on how filters are applied to images).
Figure 14.4: Filter bank showing filters of different orientations and spatial
frequencies.
Deep neural networks are now advanced enough to classify images into different
categories, though it can sometimes be unclear precisely which features of an
image set are being used to do this. It is also important to avoid any confounds
in the input images that might produce false levels of precision, such as the
background of an image. For example, if you wanted to train a network to classify
criminals vs non-criminals from their photographs, it would be important to
make sure that all photographs were taken under the same conditions. Otherwise
it could be the case that all criminal photographs were taken against the same
background (e.g. the height gauge traditionally shown in police mugshots) and
these extraneous features would provide the network with a spurious cue.
A rather different approach to MVPA, that has been very influential in the fMRI
literature, is to forgo classification algorithms and instead use a correlation-based
approach (Haxby et al. 2001). In correlational MVPA, the pattern of brain
activity across multiple voxels (a voxel is the volumetric version of a pixel) is
correlated between two data sets derived from the same condition, or data sets
derived from two separate conditions. The logic is that if there is a distinct and
robust pattern of activity in a region of the brain, the correlation scores will
be higher when the data comes from the same condition than when it comes
from separate conditions. There are several variants of this method depending
on whether the data sets are derived within a single individual, or averaged
across multiple participants. Although this approach sounds very different from
the classification-based methods discussed above, in direct comparisons (e.g.
Coggan, Baker, and Andrews 2016; Isik et al. 2014; Grootswagers, Wardle, and
Carlson 2017) they behave quite similarly. A related approach is to directly
calculate the multivariate effect size (the Mahalanobis distance, see section 3.4.3)
between conditions, and use this as a measure of pattern distinctness (Allefeld
and Haynes 2014).
robust than for a single partitioning. Some MVPA software will implement this
automatically using a technique called k-fold cross validation. The value of the k
parameter determines the number of subsets the data are split into. The model
is trained on k-1 subsets, and tested on the remaining subset. This is repeated
for all permutations (i.e. each subset is the test set once). We can replace the k
with its value when referring to this type of analysis, e.g. 5-fold cross validation,
or 10-fold cross validation.
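In the caret package used later in this chapter, k-fold cross validation can be requested through the trainControl function; a sketch (these are not the exact settings used in the examples below):
ctrl <- caret::trainControl(method = "cv", number = 10)  # 10-fold cross validation
svmFit <- caret::train(trainingdata, traininglabels,
                       method = "svmLinear", trControl = ctrl)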
In the first example, in around 2018 a large technology company scrapped its
automated recruitment system because it was shown to rate women’s CVs less
highly than men’s for software and technology-related jobs. It did this in some
surprisingly blatant ways, such as penalizing graduates of female-only universities,
and down-weighting CVs that included the word “women’s” - male candidates
would be unlikely to mention being captain of the women’s basketball team, for
example. Why did this happen? It turns out that the algorithm was trained on
historical data from two groups of candidates - those who had been hired, and
those who had not. All of those hiring decisions were made in the traditional
way, by humans with their own prejudices about what makes a good software
engineer. Far from being unbiased, the algorithm inherited and perpetuated the
prejudices of the industry it was created to serve.
The second controversial example was a study claiming that deep neural networks
could classify sexual orientation from photographs more accurately than humans
(Wang and Kosinski 2018). The authors proposed that subtle differences in
facial morphology might reveal exposure to various sex hormones in the womb,
which also influence sexual orientation in adulthood. This work was criticised
on several grounds, including that most of the photographs used to train the
algorithm were of caucasian models, and that the algorithm appeared to be
classifying photographs based on cues that were unrelated to what the authors
claimed, including makeup and the presence or absence of glasses. But the key
point is that the use of machine learning algorithms in this way is extremely
unethical. There are many societies where homosexuality is illegal, and tools
that can be used to classify sexual orientation (no matter how accurate, or using
what cues) could be used to oppress innocent people. Machine learning is a
powerful tool, but it is crucial that it is used responsibly and ethically, and
that the apparent objectivity of computer algorithms is not used to mask human
prejudice and bias.
We will use two functions from the caret package: train and predict. These do
much as you would expect from the names. The train function is used to train
a pattern classifier algorithm, and outputs a data object containing the model
specification. We can then pass this model specification into the predict function,
along with some unseen data, to get predictions out. If we know the ground
truth categories (i.e. the actual categories) for the unseen data, we can also
calculate the classifier’s accuracy. Of course caret is capable of far more than
what we are doing here, but this is a good starting point to demonstrate the
basics. There is extensive documentation available on the package’s web pages
at http://caret.r-forge.r-project.org.
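The examples below use the segmentationData data set that ships with caret, presumably loaded with something like the following sketch (its dimensions are printed below):
library(caret)
data(segmentationData)
dim(segmentationData)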
## [1] 2019 61
head(segmentationData[,1:6])
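The trainingdata and testdata objects referred to next are presumably created by splitting on the Case column and keeping only the measurement columns, in the same way as the column selection used in the loop later in the chapter (columns 4 to 61 are an assumption consistent with the 58 dependent variables):
trainingdata <- as.matrix(segmentationData[which(segmentationData$Case=='Train'),4:61])
testdata <- as.matrix(segmentationData[which(segmentationData$Case=='Test'),4:61])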
Each data object is now an N × 58 matrix. In the training set, N = 1009, and
in the test set N = 1010, where N corresponds to the number of cells.
Next, we can store the true categories (well segmented, WS, or poorly segmented,
PS) in separate data objects for the training and test sets. These are found in
the third column of the segmentationData data frame, that is headed Class:
traininglabels <- segmentationData[which(segmentationData$Case=='Train'),3]
testlabels <- segmentationData[which(segmentationData$Case=='Test'),3]
## [1] PS WS PS WS PS PS PS WS WS WS WS PS WS PS PS PS PS PS PS PS
## Levels: PS WS
The labels are a factor variable (as described in section 3.9), just as one would
use to specify group membership in an ANOVA design. Now that we have our
data prepared for classification, we can train a classifier on the training data set
using the train function as follows:
svmFit <- caret::train(trainingdata, traininglabels, method = "svmLinear")
svmFit
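The trained model is then used to generate predicted labels for the unseen test set, presumably with a predict call like those used later in the chapter; printing the first few predictions gives output like the following (taking the first 20 is an assumption):
p <- predict(svmFit,newdata = testdata)
p[1:20]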
## [1] PS PS WS WS PS PS PS WS PS PS WS PS PS WS PS WS PS PS WS PS
## Levels: PS WS
We can compare this to the true categories by counting up how many match the
true categories (stored in testlabels) and converting to a percentage:
numbercorrect <- sum(testlabels==p)
totalexamples <- length(testlabels)
100*(numbercorrect/totalexamples)
## [1] 79.80198
The classifier has done very well, getting about 80% of the cells in the test set
correct. We can see if this is statistically significant using a binomial test to
compare this to chance performance (0.5, or 50% correct):
binom.test(numbercorrect,totalexamples,0.5)
##
## Exact binomial test
##
## data: numbercorrect and totalexamples
## number of successes = 806, number of trials = 1010, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.7719120 0.8223772
## sample estimates:
## probability of success
## 0.7980198
The test is highly significant, which is telling us that a linear support vector
machine with the full training set does pretty well. It classifies about 80% of
the test set correctly, which is significantly above chance performance of 50%
correct. What if we used a different kernel? Switching to a radial basis function
improves things by around 1%:
svmFit <- caret::train(trainingdata, traininglabels, method = "svmRadial")
p <- predict(svmFit,newdata = testdata)
100*(sum(testlabels==p)/length(testlabels))
## [1] 80.49505
## [1] 66.53465
Presumably there will be other types of data where the perceptron would be
a better choice. Finally, we can explore how accuracy increases as we include
more of the dependent variables in the classification (see Figure 14.5).
perccor <- NULL
for (n in 1:19){
trainingdata <- as.matrix(segmentationData[which(segmentationData$Case=='Train'),4:(n+4)])
testdata <- as.matrix(segmentationData[which(segmentationData$Case=='Test'),4:(n+4)])
svmFit <- caret::train(trainingdata, traininglabels, method = "svmLinear")
p <- predict(svmFit,newdata = testdata)
perccor[n] <- 100*(sum(testlabels==p)/length(testlabels))
}
plot(2:20,perccor,type='l',ylim=c(50,100),lwd=3)
points(2:20,perccor,pch=16)
Figure 14.5: Classification accuracy (perccor) as a function of the number of dependent variables included in the classifier.
First of all, let’s see if the FFA can tell the difference between face images and
scrambled images. This is a standard test for ‘selectivity’ of a particular category.
We can set up the trainingdata matrix to contain half of the data from each
condition to train the classifier on as follows:
# create an empty matrix
trainingdata <- matrix(0,nrow=12,ncol=30)
# copy half of the face data into the matrix
trainingdata[1:6,] <- facedata[1:6,]
# copy half of the scrambled data into the matrix
trainingdata[7:12,] <- scrambdata[1:6,]
# convert to a data frame
trainingdata <- data.frame(trainingdata)
We will also need to create numerical labels to tell the classifier which condition
each observation corresponds to. We can just use the numbers 1 and 2 for this
as we have two conditions (face and scrambled):
# create a factor with two levels and six repetitions
traininglabels <- gl(2,6)
traininglabels
## [1] 1 1 1 1 1 1 2 2 2 2 2 2
## Levels: 1 2
Then we can do the same thing with the other half of the data to create a testing
set (notice here we choose trials 7 to 12 instead of 1 to 6), for assessing the
accuracy of the trained model:
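Presumably the test set is constructed in the same way as the training set, but using trials 7 to 12; a sketch:
# create an empty matrix for the test set
testdata <- matrix(0,nrow=12,ncol=30)
# copy the second half of the face and scrambled data into the matrix
testdata[1:6,] <- facedata[7:12,]
testdata[7:12,] <- scrambdata[7:12,]
testdata <- data.frame(testdata)
# true condition labels for the test set
testlabels <- gl(2,6)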
As for the previous example, we next train the model using the train function
and test it on an unseen data set with the predict function. We pass in the
training data, the labels identifying the conditions, and specify the algorithm we
want to use (in this case a linear support vector machine). Then we plug the
trained model into the predict function along with the test data:
svmFit <- caret::train(trainingdata, traininglabels, method = "svmLinear")
p <- predict(svmFit,newdata = testdata)
p
## [1] 2 1 1 1 1 1 2 2 2 2 2 2
## Levels: 1 2
The predictions are stored in the data object p, and are condition labels with
values of 1 or 2. You can see that five of the 12 examples have been classified as
condition 1, and 7 have been classified as condition 2. How did this correspond
to the true values? As before, we can work out the accuracy by adding up the
number of correctly classified examples, and then converting to a percentage:
numbercorrect <- sum(testlabels==p)
totalexamples <- length(testlabels)
100*(numbercorrect/totalexamples)
## [1] 91.66667
So for this example, we can see that the algorithm has over 90% accuracy. This
strongly suggests that the brain region we are looking at responds differently to
faces than it does to scrambled images. Next, let’s see if it produces a distinct
response to pictures of houses. The following code duplicates the example above,
except that I have replaced instances of facedata with housedata.
trainingdata <- matrix(0,nrow=12,ncol=30)
trainingdata[1:6,] <- housedata[1:6,]
trainingdata[7:12,] <- scrambdata[1:6,]
trainingdata <- data.frame(trainingdata)
traininglabels <- gl(2,6)
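The test set, model training, and accuracy calculation are constructed exactly as in the face example; a minimal sketch of those steps:
testdata <- matrix(0,nrow=12,ncol=30)
testdata[1:6,] <- housedata[7:12,]
testdata[7:12,] <- scrambdata[7:12,]
testdata <- data.frame(testdata)
testlabels <- gl(2,6)
svmFit <- caret::train(trainingdata, traininglabels, method = "svmLinear")
p <- predict(svmFit,newdata = testdata)
100*(sum(testlabels==p)/length(testlabels))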
## [1] 50
For houses, accuracy is at 50% correct, so the classifier has not exceeded chance
levels. This indicates that the FFA is not selective for images of houses. Finally,
we could do a 3-way classification between all stimulus types (the test set and its labels are extended in the same way, using trials 7 to 12 and gl(3,6), and the training and accuracy steps are unchanged):
trainingdata <- matrix(0,nrow=18,ncol=30)
trainingdata[1:6,] <- facedata[1:6,]
trainingdata[7:12,] <- housedata[1:6,]
trainingdata[13:18,] <- scrambdata[1:6,]
trainingdata <- data.frame(trainingdata)
traininglabels <- gl(3,6)
## [1] 44.44444
This time around, we have above chance decoding at 44% correct (remember
that because there are three categories, the guess rate is 1/3, or 33% correct).
This basic MVPA analysis of MRI data has gone pretty well as it has given us
quite a clear answer. As mentioned above, in a real MRI study, we would repeat
the classifications many times in a loop, randomly reshuffling the examples we
use to train and test the model, and then averaging the accuracies that are
produced (see Chapter 8 for details of resampling methods). Accuracy scores
across multiple participants can then be compared using traditional statistics
such as t-tests. For some data sets, perhaps using EEG or MEG methods, we
can repeat classification at different moments in time to see how brain signals
evolve (see Chapter 15 for further discussion of this).
B) 25% correct
C) 50% correct
D) 70% correct
2. As the Cohen’s d effect size between two conditions increases, classifier
accuracy should:
A) Increase
B) Decrease
C) Stay the same
D) It will depend on how many dependent variables there are
3. What will a linear classifier use to partition data into categories?
A) Any arbitrary curve
B) A straight line, plane or hyperplane
C) A radial curve
D) A sine wave
4. Instead of using a classifier algorithm, MVPA can also be conducted based
on:
A) An extremely fast supercomputer
B) Scores rounded to the nearest integer
C) Reduced data from a factor analysis
D) Correlation
5. An important step in data pre-processing before running MVPA is:
A) Subtracting the mean differences between conditions
B) Conducting univariate analyses
C) Normalization
D) Squaring all measurements
6. Neural network classifier algorithms involve at least one:
A) Spatial scale of filter
B) Simulated calcium channel
C) Hidden network layer
D) Real human neuron
7. If a classifier is trained and tested on the same data, what is the most
likely outcome?
A) Accuracy will be perfect
B) Accuracy will be inflated because of overfitting
C) Accuracy will be reduced because of overfitting
D) Accuracy will be at chance
8. A significant 3-category classification can be interpreted by:
A) Running post-hoc pairwise classifications
B) Running an Analysis of Variance (ANOVA)
C) Removing the least informative category
D) Adjusting the guess rate for a two-category classification
9. In a support vector machine, the support vectors refer to:
A) The dependent variables
B) The distance from the category boundary to the nearest points
C) The distance from the category boundary to each data point
D) The weights that each dependent variable is multiplied by
10. Deep convolutional neural networks are inspired by the structure of:
A) The convoluted (folded) structure of the human brain
B) Complex databases of natural images
C) A bank of filters with different orientations and spatial frequencies
D) Biological sensory systems
Answers to all questions are provided in section 20.2.
Chapter 15
Correcting for multiple comparisons
Most introductory statistics courses introduce the concept of the familywise error
rate, and correction for multiple comparisons. The idea is that the more statistical
tests you run to investigate a given hypothesis, the higher the probability that
one of them will be significant, even if there is no true effect. Traditionally,
the solution to this problem has been to correct the criterion for significance
(i.e. the α-level) to account for the number of comparisons. We will first discuss
several such methods (and their shortcomings), before introducing two newer
ideas: the false discovery rate, and cluster correction. Both approaches deal with
multiple comparisons in a principled way, whilst maintaining statistical power at
higher levels than older methods. Controlling the false discovery rate is generally
appropriate when the tests are independent, whereas cluster correction should
be used in situations where correlations are expected between adjacent levels of
an independent variable (e.g. across space or time).
Assuming the tests are independent and there are no true effects, the familywise error rate (the probability that at least one test is significant by chance) is given by:

$\mathrm{FWER} = 1 - (1 - \alpha)^m$  (15.1)
where α is the criterion for significance, and m is the number of tests. This
function is plotted in Figure 15.1 for α = 0.05, and shows that the false positive
rate rises rapidly, such that with 14 tests there is a 50% chance of at least one
test being significant. With >60 tests, a false positive is virtually guaranteed.
Figure 15.1: Familywise error rate as a function of the number of tests (m),
assuming no true effect. The horizontal dashed line indicates the alpha level of
0.05.
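As a quick check of equation (15.1), the curve in Figure 15.1 is easy to reproduce in R (a minimal sketch; the variable names here are illustrative):
ntests <- 1:100                  # number of tests (m)
fwer <- 1 - (1 - 0.05)^ntests    # familywise error rate at alpha = 0.05
round(fwer[14], digits=2)        # around 0.51 with 14 tests
round(fwer[60], digits=2)        # around 0.95 with 60 tests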
The simplest and most widely used remedy is the Bonferroni correction, in which the criterion for significance is divided by the number of tests:

$\bar{\alpha} = \dfrac{\alpha}{m}$  (15.2)

Substituting the corrected criterion back into the familywise error calculation gives:

$\mathrm{FWER} = 1 - \left(1 - \dfrac{\alpha}{m}\right)^m$  (15.3)
For large values of m, the error rate will always be slightly below α with this formula. For example, with α = 0.05 and m = 100, the error rate is $1 - (1 - \frac{0.05}{100})^{100} = 0.0488$. The corrected α-level ($\bar{\alpha}$) is then used to threshold the p-values of each statistical test to determine significance.
An alternative implementation of Bonferroni correction, which is the default in some statistical software packages such as SPSS, is to adjust the p-values instead of the α-level, by multiplying each p-value by m (adjusted values that remain below α are deemed significant). A slightly less conservative alternative is the Sidak correction, which sets:

$\bar{\alpha} = 1 - (1 - \alpha)^{1/m}$  (15.4)
Notice that this is very similar to equation (15.1), except that the exponent is
1/m instead of m. It has the effect of reversing the familywise error rate, so that
when it is plugged into the familywise error calculation, the false positive rate
remains fixed at α (i.e. 0.05), regardless of the number of tests. For our example
scenario of 8 tests, the Sidak-corrected criterion will be $\bar{\alpha} = 1 - (1 - 0.05)^{1/8} = 0.0064$, slightly more lenient than the Bonferroni-corrected value of 0.00625. A related sequential approach is the Holm-Bonferroni method, in which the p-values are ranked from smallest to largest and the j-th value in the list is compared against a threshold of α/(m − (j − 1)). The smallest p-value therefore faces the full Bonferroni criterion of α/m, whereas the threshold for the final test is α/(m − (m − 1)) = α/1 = α. This has the effect of keeping the familywise error rate at or below α, but reducing the Type II error rate (i.e. being less likely to miss true effects). There is an equivalent method known as the Holm-Sidak method, where the Sidak correction is progressively applied instead of the Bonferroni correction.
Figure 15.2: Power curves for a single corrected test, for different sizes of test
family (m), following Bonferroni correction. Panel (a) shows how power increases
as a function of sample size, for an effect size of d = 0.5. Panel (b) shows how
power increases as a function of effect size, for a sample size of N = 50. The
dashed horizontal line in both panels shows the target of 80% power.
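To see the cost of correction in terms of statistical power, we can calculate power at a Bonferroni-corrected α-level; a minimal sketch using the pwr package (this may not be the exact tool used to generate Figure 15.2):
library(pwr)
# power of a two-sample t-test with d = 0.5 and N = 50 per group,
# at the Bonferroni-corrected alpha level for a family of m = 10 tests
pwr.t.test(n = 50, d = 0.5, sig.level = 0.05/10)$power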
Limiting analyses to a small number of comparisons that are specified in advance has been suggested as one way to reduce researcher
degrees of freedom. But in many situations, such as novel and exploratory work,
it is not possible to specify which conditions are expected to produce significant
results until after the data have been collected. In such cases, there are two
relatively recent developments designed to control the rate of Type I errors
but still maintain statistical power as far as possible. These are false-discovery
rate correction, and cluster correction. The former method generally makes
the assumption that the various tests are independent. The latter method is
specifically used when there are correlations between successive tests.
Figure 15.3: Effect of three types of multiple comparison correction across 100
tests on the threshold for significance (y-axis) at each position in a rank-ordered
list of p-values (x-axis). Note the logarithmic scaling of the y-axis.
As a concrete example, consider the following family of ten p-values (also listed in rank order in Tables 15.1 and 15.2):
## [1] 0.006 0.754 0.012 0.003 0.005 0.049 0.197 0.022 0.088 0.002
For Bonferroni correction, the first two p-values in the list are below the threshold,
and so only these two tests are significant - the other 8 are not. Holm-Bonferroni
correction is a little more forgiving, and finds the first four tests significant. False-
discovery rate correction is the most liberal, with tests 1-6 reaching significance.
This direct comparison shows the potential for these three methods to lead to
different answers for a given family of tests.
Table 15.2: As for the previous table, but the columns headed Bonf, Holm and
FDR (corr) give the corrected p-values for Bonferroni correction, Holm-Bonferroni
correction, and false-discovery rate correction.
With a conventional correction for multiple comparisons, tests are unlikely to reach significance for very many image locations because statistical power will
be massively reduced. But there is clearly something wrong here, as a better
camera should give us better results! Cluster statistics are a solution to this
issue because they group significant observations together into clusters, which
become the meaningful unit for significance testing regardless of the sampling
resolution.
It is reasonable to assume in many situations that adjacent sample points will
be correlated to some extent. For example, two adjacent pixels in an image are
likely to be more similar in luminance than two randomly selected pixels (Field
1987), and samples from successive moments in time are also likely to be highly
correlated (the extent of which can be assessed by calculating the autocorrelation
function). Sometimes, data preprocessing methods such as low-pass filtering (see
Chapter 10) or smoothing will deliberately blur adjacent samples together to
reduce noise. This is a problem for traditional multiple comparison corrections,
which generally assume that tests are independent. Again, cluster correction
avoids these issues because it takes into account correlation between adjacent
observations.
The basic idea of cluster correction is that we identify clusters - contiguous regions
of space and/or time where a test statistic is significant at some threshold level.
We then aggregate (add up) the test statistic within each cluster, and compare
these values to an empirically derived null distribution. Those clusters that fall
outside some quantile (i.e. the 95% region) of the null distribution are retained
and considered significant. Various algorithms have been developed along these
same lines, but some of them have recently been shown to suffer inflated Type
I error rates (see Eklund, Nichols, and Knutsson 2016). The method we will
discuss in detail here is a nonparametric cluster correction technique described
by Maris and Oostenveld (2007), that does not suffer these issues. We will
demonstrate its use on an example data set.
The black trace is the averaged response to stimuli shown on the right side of
the screen, and the blue trace is the response to stimuli shown on the left side of
the screen.
Figure 15.4: Example EEG data from a cueing experiment. Data are averaged
across 400 trials per condition for each participant, and 38 participants. Data
were bandpass filtered from 0.01 to 30Hz, and taken from electrode P8 (grey
point in upper left insert). Shaded regions show ±1SE across participants.
There are some substantial differences between the two waveforms, which we can
summarise by subtracting the two conditions to calculate a difference waveform.
This is shown by the trace in Figure 15.5. There are differences from shortly
after stimulus onset, that persist throughout the epoch. But are these differences
statistically significant? If we conduct a series of paired t-tests to compare
the conditions, these reveal many time points where the waveforms differ, as
summarised by the dark blue lines at y = -2 in Figure 15.5. However, this
involves running 1200 t-tests (because the data are sampled at 1000 Hz), and
with α = 0.05 we would expect around 60 of these to be significant by chance even if there
were no true effects. A good example of a definite false positive is at the very
start of the time window (-200 ms). This is before the stimulus was presented,
so it must necessarily be a statistical artefact.
Figure 15.5: Difference waveform (in µV) between the two conditions, with horizontal lines marking time points that reach significance for uncorrected tests (p < 0.05), Bonferroni correction, and false discovery rate (FDR) correction.
With the Bonferroni-corrected threshold, hardly any time points remain significant with such severe correction (see Figure 15.6). The blue lines at
y = -3 in Figure 15.5 show very few significant time points. Even with the more
liberal false discovery rate correction (see light blue lines at y = -4 in Figure
15.5), we still lose quite a lot of our significant time points. Indeed, our largest
cluster splits into two, and our other clusters are reduced in size.
Figure 15.6: Comparison of uncorrected and Bonferroni corrected significance thresholds.
To perform cluster correction, we need more than just our p-values. We also
need to think about the test statistic, which in this case is a t-value, though it
could equally well be an F-ratio, correlation coefficient, or any other test statistic.
Figure 15.7 shows the trace of t-statistics as a function of time. This looks
broadly similar to the difference waveform, because the t-statistic is the mean
scaled by the standard error at each time point. The thin blue horizontal lines
indicate the critical t-value for a test with 37 degrees of freedom at α = 0.05
(which is around t = ±2). The significant clusters correspond to the time periods
when the t-statistics are outside of these bounds, as shown by the blue lines at y
= -4, and the grey shaded regions.
For the current example, we have five clusters, which have been numbered
consecutively in Figure 15.7. These range from very brief durations (i.e. cluster 1)
to a very long one (cluster 4). We next calculate the summed t-value across all
time points within each cluster. This just involves adding up all of the t-values
Figure 15.7: Trace of t-statistics as a function of time. The thin blue lines bound
the critical t-values, and the blue dashed line shows the critical Bonferroni-
corrected t-value. Grey shaded regions indicate periods of significant t-values,
which also correspond to the clusters at y = -4.
in a single cluster. The summed t-statistics for our five clusters are as follows:
## [1] -7.22676 -73.03767 182.08464 2176.61865 254.14413
We select the largest of our summed t-values (ignoring the sign), which in this
case is cluster 4, with a summed t-statistic of 2177. The raw data from this
cluster will be used to generate a null distribution using a resampling technique
(see also section 8.4.2). The idea here is that we randomly reassign the condition
labels for each participant, and then at each time point within the cluster, we
repeat the t-test and recalculate the summed t-statistic. This is done at least
a thousand times (with different reshufflings of condition labels), to build a
distribution of resampled summed t-statistics. Because of the random condition
assignments, the distribution should have a mean around 0, and some spread
determined by the variability of the data. What we are doing here by randomising
the condition assignments is building up a picture of how we should expect the
data to look if there were no true difference between the groups. We will then
compare this to our observed data to see if there is evidence for real differences.
Figure 15.8: Null distribution of resampled summed t-values. The dashed black
lines give the upper and lower 95% confidence intervals. The numbered vertical
lines correspond to the clusters. Only cluster 4 exceeds the 95% confidence
interval, so only this cluster is considered significant.
The null distribution is shown by the blue shaded curve in Figure 15.8. The
vertical dashed lines indicate the 95% confidence intervals of the null distribution.
These are used as thresholds to compare with each individual cluster’s summed
t-value. The clusters are indicated by the shorter numbered lines. Clusters 1, 2,
3 and 5 are inside the 95% confidence intervals of the null distribution, so these
are rejected as not reaching statistical significance. Cluster 4 falls outside of the
confidence intervals, and is retained as being significant.
Figure 15.9: Difference waveform (in µV) with the single cluster that survives cluster correction indicated by a horizontal line.
The final difference waveform with the surviving cluster is shown in Figure
15.9, indicated by the horizontal blue line. Note that the smaller clusters have
been removed, including the spurious one that occurred before the stimulus
was presented. Also, the remaining cluster has not shrunk - its start and end
points are preserved. So, we would conclude that the significant difference in
activity between conditions begins around 200 ms, and extends to around 850
ms. Cluster correction takes a little getting used to, as it involves very different
processes to other multiple comparison corrections. Nevertheless, it has been
robustly validated, and achieves a good balance between maintaining the desired
familywise error rate, and avoiding Type II errors and cluster shrinkage.
In this section we will demonstrate how to implement these corrections in R, and describe the use of a built-in R function called p.adjust, that can be used to
adjust p-values using various methods. First, let’s look at manual Bonferroni
correction. For a vector of p-values (from Tables 15.1 and 15.2, but these could
be collated across all the tests you have run for a data set), we can identify the
significant ones at α = 0.05 as follows:
pvals
## [1] 0.006 0.754 0.012 0.003 0.005 0.049 0.197 0.022 0.088 0.002
alphalevel <- 0.05
pvals < alphalevel
## [1] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
With no correction, 7 values reach significance. We can Bonferroni correct our α
level by dividing it by the number of tests (which we store in a new data object
called m, defined as the length of the pvals vector):
m <- length(pvals)
alphahat <- alphalevel/m
pvals < alphahat
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
Only two of our p-values survive this correction. Alternatively, we can apply the
Sidak correction, which finds one additional test to be positive:
alphahat <- 1 - (1 - alphalevel)^(1/m)
pvals < alphahat
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
The Holm-Bonferroni method is a sequential technique, so we need to rank order
our list of p-values first using the sort function:
pvals <- sort(pvals)
pvals
## [1] 0.002 0.003 0.005 0.006 0.012 0.022 0.049 0.088 0.197 0.754
The thresholds for significance depend on the position in the list, and can be
calculated as follows:
holm <- alphalevel/(m - ((1:m)-1))
print(holm, digits=1)
## [1] 0.005 0.006 0.006 0.007 0.008 0.010 0.013 0.017 0.025 0.050
Finally, we can compare our p-values to the thresholds as follows:
pvals < holm
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
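The thresholds for false-discovery rate (Benjamini-Hochberg) correction instead rise linearly with rank position (jα/m for the j-th ranked p-value). The next two lines of output can be reproduced with a sketch of this form:
fdrthresholds <- (1:m)*alphalevel/m   # Benjamini-Hochberg thresholds
fdrthresholds
pvals < fdrthresholds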
## [1] 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Next, let’s look at how to correct the p-values themselves, instead of adjusting
the α level (recall that these methods produce the same outcome). If we want
to apply the false discovery rate algorithm manually, we can do so using a loop
(see section 2.10) that starts with the second to largest p-value, and works its
way down the list:
fdr <- pvals
for (i in ((m-1):1)){
fdr[i] <- min(pvals[i]*m/i, fdr[i+1])}
print(fdr, digits=2)
## [1] 0.015 0.015 0.015 0.015 0.024 0.037 0.070 0.110 0.219 0.754
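The built-in p.adjust function can perform these adjustments for us. For Bonferroni correction, the adjusted p-values are simply the raw values multiplied by m (capped at 1); the following call reproduces the output below:
p.adjust(pvals, method='bonferroni')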
## [1] 0.02 0.03 0.05 0.06 0.12 0.22 0.49 0.88 1.00 1.00
Or Holm-Bonferroni correction:
p.adjust(pvals,method='holm')
## [1] 0.020 0.027 0.040 0.042 0.072 0.110 0.196 0.264 0.394 0.754
There are several additional correction methods available that you can read
about in the help file for the p.adjust function.
15.11 Implementing cluster correction in R
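To implement cluster correction, the EEG difference waveforms first need to be arranged with one row per participant and one column per time point. A minimal sketch, assuming the data have been loaded into a matrix called data (the loading code itself is not shown here):
N <- nrow(data)    # number of participants
m <- ncol(data)    # number of time points per participant
N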
## [1] 38
m
## [1] 1200
Each data point is the voltage difference for a particular participant and time
point. We can therefore perform one-sample t-tests (comparing to 0) at each
time point, and store the results in vectors of p-values and t-statistics as follows:
allp <- NULL
allt <- NULL
for (n in 1:m){
output <- t.test(data[,n]) # run a t-test
allp[n] <- output$p.value # store the p-value
allt[n] <- output$statistic # store the t-value
}
allp[1:5]
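Next we identify contiguous clusters of time points where the test is significant, recording where each cluster starts and ends. A minimal sketch of this step, based on the loop inside the doclustcorr function given at the end of this section (using α = 0.05 as the cluster-forming threshold):
# identify contiguous runs of significant time points (p < 0.05)
nclusters <- 0       # number of clusters found so far
incluster <- 0       # are we currently inside a cluster?
clusterstarts <- NULL
clusterends <- NULL
for (n in 1:m){
  if (allp[n] < 0.05){
    if (incluster==0){          # a new cluster begins here
      nclusters <- nclusters + 1
      clusterstarts[nclusters] <- n
      incluster <- 1}
  }
  if (allp[n] >= 0.05){
    if (incluster==1){          # the current cluster ends here
      clusterends[nclusters] <- n-1
      incluster <- 0}
  }
}
# close the final cluster if it runs to the end of the epoch
if (incluster>0 & nclusters>0){clusterends[nclusters] <- m}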
nclusters
## [1] 5
clusterstarts
Now that we have identified our clusters, we can add up the t-values for each
cluster. This uses the vector of t-values that we created earlier, and the start
and end indices of the clusters:
summedt <- NULL
for (n in 1:nclusters){
summedt[n] <- sum(allt[clusterstarts[n]:clusterends[n]])}
summedt
To generate the null distribution, we need to choose the largest cluster and note
its start and endpoints:
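A minimal sketch of these steps, building the null distribution by randomly flipping the sign of each participant's difference waveform (for a paired design this is equivalent to reshuffling the condition labels):
# choose the cluster with the largest absolute summed t-statistic
biggestcluster <- which.max(abs(summedt))
cstart <- clusterstarts[biggestcluster]
cend <- clusterends[biggestcluster]
# build the null distribution by randomly flipping the sign of each
# participant's difference waveform and recalculating the summed t-statistic
nullT <- NULL
for (i in 1:1000){
  flips <- sample(c(-1,1), N, replace=TRUE)   # random condition reassignment
  resampledt <- NULL
  for (n in cstart:cend){
    resampledt[n-cstart+1] <- t.test(data[,n]*flips)$statistic}
  nullT[i] <- sum(resampledt)
}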
The null distribution is now stored in the data object nullT, and contains 1000
summed t-statistics. We need to compare the summed t-value for each cluster to
this distribution (see Figure 15.8). We can do this using the quantile function to
identify the lower and upper 95% quantiles (i.e. the points between which 95%
of the values in the distribution lie). We can then compare each of our summed
t-values to see if they fall outside of these limits, and retain any that meet this
criterion:
distlims <- quantile(nullT,c(0.05/2,1-(0.05/2)))
distlims
## 2.5% 97.5%
## -988.5799 952.6620
sigclusts <- c(which(summedt<distlims[1]),which(summedt>distlims[2]))
sigclusts
## [1] 4
Finally, we can retain the start and end indices of any significant clusters (in
this case, just cluster 4):
if (length(sigclusts)>0){
  clustout <- matrix(0,nrow=length(sigclusts),ncol=2)
  clustout[,1] <- clusterstarts[sigclusts]   # cluster start indices
  clustout[,2] <- clusterends[sigclusts]}    # cluster end indices
clustout
## [,1] [,2]
## [1,] 416 1058
So, our significant cluster starts at index 416, and ends at 1058. We can use
these to index the vector of sample times (from -199 to 1000 ms), to find the
true start and end times for the cluster (in ms):
timevals <- -199:1000
timevals[clustout]
## [1] 216 858
We can package up all of the stages for cluster correction into a single function
(a version of this function is included in the FourierStats package that we
encountered in Chapter 11). The function expects four inputs to define the data,
the number of resamples for building the null distribution, and the α levels for
forming clusters and determining significance. The null-distribution step inside the function below uses the same sign-flipping approach sketched above (for a paired design this is equivalent to reshuffling the condition labels):
doclustcorr <- function(data,nresamples,clustformthresh,clustthresh){
  N <- nrow(data)    # participants (rows)
  m <- ncol(data)    # time points (columns)
  allp <- NULL; allt <- NULL
  for (n in 1:m){    # one-sample t-test at each time point
    output <- t.test(data[,n])
    allp[n] <- output$p.value
    allt[n] <- output$statistic}
  nclusters <- 0; incluster <- 0
  clusterstarts <- NULL; clusterends <- NULL; clustout <- NULL
  for (n in 1:m){    # identify contiguous clusters of significant points
    if (allp[n]<clustformthresh){
      if (incluster==0){
        nclusters <- nclusters + 1
        clusterstarts[nclusters] <- n
        incluster <- 1
      }
    }
    if (allp[n]>=clustformthresh){
      if (incluster==1){
        clusterends[nclusters] <- n-1
        incluster <- 0
      }
    }
  }
  if (incluster>0 & nclusters>0){clusterends[nclusters] <- m}
  if (nclusters>0){
    summedt <- NULL   # sum the t-statistics within each cluster
    for (n in 1:nclusters){
      summedt[n] <- sum(allt[clusterstarts[n]:clusterends[n]])}
    # null distribution for the largest cluster (sign-flip resampling sketch)
    b <- which.max(abs(summedt))
    nullT <- NULL
    for (i in 1:nresamples){
      flips <- sample(c(-1,1),N,replace=TRUE)
      nullT[i] <- sum(apply(data[,clusterstarts[b]:clusterends[b]]*flips,2,
                            function(x){t.test(x)$statistic}))}
    # retain clusters whose summed t falls outside the null distribution
    distlims <- quantile(nullT,c(clustthresh/2,1-(clustthresh/2)))
    sigclusts <- c(which(summedt<distlims[1]),which(summedt>distlims[2]))
    if (length(sigclusts)>0){
      clustout <- matrix(0,nrow=length(sigclusts),ncol=2)
      clustout[,1] <- clusterstarts[sigclusts]
      clustout[,2] <- clusterends[sigclusts]}
  }
  return(clustout)}
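We can then run the whole procedure with a single call; the argument values below are illustrative (1000 resamples, with α = 0.05 for both cluster forming and significance):
clusteroutput <- doclustcorr(data, nresamples=1000, clustformthresh=0.05, clustthresh=0.05)
clusteroutput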
## [,1] [,2]
## [1,] 416 1058
The first column contains the cluster start indices, and the second contains the
cluster end indices. If there were multiple significant clusters, there would be
additional rows in the output matrix.
Cluster correction is a very flexible technique, and I hope that by including a
basic implementation, readers will see how to apply the method to their own
data analysis. Although the example here involves a signal that varies in time,
the same principle applies to other dimensions, such as spatial position. For a
two-dimensional image (or three-dimensional volume), determining the adjacency
of points in a cluster is somewhat more challenging.
Chapter 16
Signal detection theory
Signal detection theory (Green and Swets 1966; Macmillan and Creelman 1991)
is a method for formalising how humans (and other systems) make decisions
under uncertainty. By uncertainty I mean that either the information being
used to make the decision, or the decision process itself, is noisy - that is, it fluctuates over time. Everyday examples might include trying to understand what someone is saying on a loud train, to spot a faint star in the sky at night, or to work out whether you have added salt to your dinner. The theory was developed
in the 1950s, primarily to characterise the performance of radar operators, but
it has much wider applicability. It is the foundation of modern psychophysical
studies of perception and memory, and in recent years the same concepts have
been applied in artificial intelligence (machine learning) research. This chapter
will cover the basic concepts of signal detection theory, and also discuss common
experimental designs. However, we will begin with an example where we consider
how best to determine the sex of a chicken.
Figure 16.1: Recently hatched chicks. Most of them turned out to be male.
In commercial hatcheries, expert chicken sexers inspect newly hatched chicks and sort them by sex. They do this by inspecting the nether regions (the vent) of the chick under a bright light for specific features that indicate sex. The precise features are to do with the shape of a small protuberance called the bead, which
is pointier in female chicks, and rounder in male chicks (see Biederman and
Shiffrar 1987 for diagrams). This is an expert task, and it typically requires many
years of training to reach the levels of performance (>99% accuracy) expected
by the industry. Furthermore, each chick is inspected for only a few seconds, so
professional sexers will process around 1000 chicks per hour (and over a million
each year).
Assigned sex (F / M) for each of the three students, against true sex:

Aya:  True F: 40 (Hits)  10 (Misses)  = 50    True M: 10 (FAs)  40 (CRs)  = 50
Bia:  True F: 49 (Hits)   1 (Miss)    = 50    True M: 19 (FAs)  31 (CRs)  = 50
Che:  True F: 35 (Hits)  15 (Misses)  = 50    True M:  5 (FAs)  45 (CRs)  = 50
Figure 16.2: Grids showing the accuracy of three student chicken sexers. See
text for details. FAs: false alarms, CRs: correct rejections.
Aya, the first of our students, shows no overall bias: she assigns half of the chicks to each category, correctly identifying 40 of the females (10 misses) and 40 of the males (10 false alarms), as shown in the left grid of Figure 16.2. Bia has a somewhat different strategy. She is overall more likely to assign a chick as being female. Out of all her chicks, she assigns 68 as being female, and the remaining 32 as being male. In other words she has a bias towards assigning chicks as being female. This strategy works well for spotting the female chicks, as it means she correctly identifies nearly all of them (49 hits, 1 miss).
However, because of her bias she incorrectly assigns 19 of the male chicks to
the female category (19 false alarms) and correctly identifies the rest as male
(31 correct rejections). Bia’s performance is summarised in the middle grid of
Figure 16.2.
Che has a bias in the opposite direction to Bia. He is more likely to assign
a chick as being male (perhaps to avoid getting fined for letting male chicks
through as female). This strategy means that he misses 15 of the female chicks, but produces only 5 false alarms, correctly identifying 45 of the male chicks (right grid of Figure 16.2).
At one extreme, the two internal response distributions barely overlap (Figure 16.3a): categorisation is trivial, and accuracy can approach 100% correct.

Figure 16.3: Internal response distributions for (a) an easy discrimination with little overlap, (b) an impossible discrimination with complete overlap, and (c) a difficult discrimination with partial overlap.
Next let’s think about the opposite extreme. Imagine we have to categorise
eggs as male and female before they have hatched, just by looking at them.
Obviously this is impossible, and so the two distributions will completely overlap
(see Figure 16.3b), with any given egg being equally likely to be male or female.
In this situation, hits, misses, false alarms and correct rejections are all equally
likely (each consisting of 25% of trials), and accuracy will be at chance (50%
correct).
A more interesting situation occurs for day-old chicks, where sexing is possible
but it is challenging. The probability distributions in Figure 16.3c overlap
somewhat. This means that some individual chicks will be difficult to categorise,
and may produce misses or false alarms, but accuracy should be somewhere
above chance.
So what is the best strategy for categorising male and female chicks? The
example student chicken sexers described above illustrate that there are different
criteria we can adopt when doing this task. Recall that Bia is more likely to
assign chicks as being female, Che more likely to assign them as male, and Aya
somewhere in between. Figure 16.4 shows us the overlapping distributions in
a bit more detail with an extra feature - the vertical dotted line indicates the
criterion. Any individual observations that fall to the left of this line will be
classified as male, and any falling to the right as female. Of the chicks classified
as female, some will be hits (chicks that really are female) and some will be false
alarms (chicks that are actually male). Of the chicks classified as male, some
will be correct rejections (chicks that really are male) and some will be misses
(chicks that are actually female).
Figure 16.4: Internal response distributions with additional features. The vertical
dotted line indicates the criterion - samples to the left of this line will be classified
as male, and those to the right as female.
A key idea in signal detection theory is that we can move our criterion around
if we want to. So, we could instruct our students that misses are a very bad
thing (because potentially productive female chicks would be missed). This
would make them shift their criterion to the left to avoid misses (and increase
hits). However, the number of false alarms would also increase (and correct
rejections would decrease) - in other words, they would be like Bia, who has
a bias towards saying chicks are female. Alternatively we could instruct the
students that false alarms are a bigger problem (because it wastes money raising
male chicks), so that they shift their criterion to the right. Then false alarms
(and hits) would decrease, but misses (and correct rejections) would increase. On
the other hand, the sensitivity at the task is determined by the overlap of the two
internal response distributions, and is not affected by changing the criterion. So
this is generally assumed to be fixed for a given observer and situation (though
learning and training are possible).
Aya's performance converted to proportions (assigned sex F / M, by true sex):

True F:  0.8 P(Hit)    0.2 P(Miss)   Total 1
True M:  0.2 P(FA)     0.8 P(CR)     Total 1
For a fixed number of chicks that are truly female, each of them must either be a hit or a miss. Similarly, for a fixed number of truly male chicks, each of them must be classified as either a correct rejection or a false alarm. This means that $P(\mathrm{Miss}) = 1 - P(\mathrm{Hit})$ and also $P(\mathrm{CR}) = 1 - P(\mathrm{FA})$. Because of these interdependencies, once the data are converted to proportions, we actually only need to know the hit rate and the false alarm rate to fully characterise performance.
Sensitivity is calculated by converting the hit rate and false alarm rate to z-scores using the inverse of the cumulative Gaussian function (Figure 16.6), and taking the difference: d′ = z(Hit) − z(FA). For Aya, this gives d′ = z(0.8) − z(0.2) = 0.84 − (−0.84) = 1.68.

Figure 16.6: Cumulative Gaussian function relating z-scores (in standard deviation units) to cumulative probability, with example hit and false alarm rates marked.
We can also convert the performance of Bia and Che (our biased chicken sexers)
to z-scores and d′ values in the same way. For Bia, her hit rate is 49/50 = 0.98,
and her false alarm rate is 19/50 = 0.38. This works out as d′ = z(0.98) − z(0.38) = 2.05 − (−0.31) = 2.36. For Che, the hit rate is 35/50 = 0.7 and the false alarm rate is 5/50 = 0.1, giving d′ = z(0.7) − z(0.1) = 0.52 − (−1.28) = 1.81.
This means that Che is slightly more sensitive than Aya, but less sensitive than
Bia.
We can also quantify the bias of each rater by calculating the criterion. The
equation for the criterion is $C = -(z(\mathrm{Hit}) + z(\mathrm{FA}))/2$. For Bia, who has a substantial bias, this works out as $C = -(2.05 + (-0.31))/2 = -0.87$; the negative value reflects her liberal tendency to respond 'female'. For Aya, who is unbiased, C = 0, and for Che the criterion is positive (C = 0.38), reflecting his conservative bias towards responding 'male'.
In many contexts, signal detection theory is used to estimate sensitivity (d′ ) for
a task, by removing the confounding effect of bias (C ). For our chicken sexing
example, we might want to decide which student to offer a job to. Although
they all have the same accuracy, Bia was the most sensitive, and so she might
be our best choice - after all, we could ask her to adjust her bias if necessary,
but her underlying ability was the strongest. Without dividing accuracy data
into sensitivity and bias, we would have wrongly judged all three students to
be equally good at the task. Having a good understanding of signal detection
theory might help to avoid analogous errors in your own research.
Estimating sensitivity and bias is important in many lab-based studies of human
perceptual and memory abilities, and more recently has been applied to assess the
ability of artificial intelligence systems to judge the content of images. Sometimes
16.5 Radiology example: rating scales and ROC curves
very few false alarms. On the other hand, if we place a criterion at a rating of
90, this will produce many hits, but also many false alarms. A useful way of
presenting such data is to plot the hit rate against the false alarm rate for a
range of criterion levels, as shown in Figure 16.7a.
Figure 16.7: Example ROC curves. Panel (a) shows hit rates against false alarm
rates for a range of criterion levels, with sensitivity fixed at d’ = 1. Panel (b)
shows ROC curves for three different levels of d’.
ROC curves can be constructed in other ways besides using a rating scale (and
also by using rating scales with fewer levels), and applied in many contexts besides
cancer diagnosis. In some lab experiments, performance can be aggregated across
many participants who all naturally have different levels of bias, mapping out an
ROC curve for the population. In other studies, participants can be instructed
to change their criterion in different blocks of the experiment (Morgan et al.
2012). This can be done explicitly, by asking participants to be more liberal or
more conservative (e.g. “Only respond ‘yes’ if you are really sure”), or implicitly
by implementing a reward and penalty structure (this is popular in behavioural
economics studies). For example, rewarding hits will encourage a more liberal
criterion, whereas penalising false alarms will encourage a more conservative
criterion. The data from blocks with different instructions should map out a
single ROC curve.
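For readers who want to generate curves like those in Figure 16.7 themselves, a minimal sketch assuming equal-variance Gaussian internal response distributions (this is illustrative, and not necessarily the code used for the figure):
# ROC curve for an equal-variance Gaussian observer
dprime <- 1
criterion <- seq(-4, 5, 0.01)           # sweep the criterion along the response axis
hitrate <- pnorm(dprime - criterion)    # P('yes' | target present)
farate  <- pnorm(-criterion)            # P('yes' | target absent)
plot(farate, hitrate, type='l', lwd=2,
     xlab='False alarm rate', ylab='Hit rate')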
In a 2AFC experiment, two stimuli will be presented (either one after the other,
or simultaneously but in different locations). One will always come from the
null distribution (i.e. a male chick, or a healthy breast image), the other will
always come from the target distribution (i.e. a female chick, or a cancerous
breast image). The participant’s task is to indicate which of the two stimuli
contains the target. This seems like a very minor difference to yes/no designs,
but it has an important consequence. In terms of the internal response, we now
have two discrete values to compare (see Figure 16.8). The optimal strategy is
always to pick the stimulus that produces the largest internal response, as that
is more likely to be the target. There is no need for us to have a criterion, and
no opportunity for bias. Forced choice paradigms are therefore often referred to
as bias free.
Forced choice methods are not limited to having two alternatives, but the number
of alternatives will determine the baseline (chance) performance level. For 2AFC,
the guess rate is 0.5 (50% correct). For 3AFC, the guess rate is 0.33 (33%
correct). In general, the guess rate is 1/m, where m is the number of alternatives
(see also Section 14.4). Depending on the other constraints of an experiment,
there may be good reasons for choosing designs where m > 2. For example,
3AFC can often be explained to participants as an odd-one out paradigm, where
they choose the stimulus that appears different from the other two.
Figure 16.8: Internal response distributions for the null and target stimuli in a forced choice trial; the separation between them corresponds to the sensitivity (d′).
For a 2AFC task, sensitivity can be calculated directly from the proportion of correct responses:

$d' = \varphi^{-1}(P_c)\sqrt{2}$

where $P_c$ is the proportion of correct responses, and $\varphi^{-1}$ indicates the inverse of the cumulative Gaussian function shown in Figure 16.6.
Note that it is very common to see single interval experiments described as being
2AFC when they are actually yes/no. The confusion stems from the two possible
response options that a participant is ‘forced’ to choose between. For example,
being shown a line stimulus and having to indicate if it is tilted leftwards or
rightwards. However, this is really a yes/no paradigm in disguise, because there
is only one stimulus being presented. For true 2AFC there must always be two
distinct stimuli, and the participant selects one of them. If bias is negligibly
small then accuracy and d′ are closely related even for yes/no paradigms, so this
distinction becomes largely semantic, but without estimating bias for a yes/no
task this might be difficult to argue convincingly.
Plotting performance as a function of signal strength traces out a psychometric function, which typically rises from chance level to near-perfect performance.

Figure 16.9: (a) Internal response distributions (probability against internal response); (b) a psychometric function showing proportion correct as a function of signal strength.
Occasionally observers will make errors that are unrelated to stimulus strength, for example because of lapses of attention, blinks or sneezes. Lapses can skew psychometric function fits by making them much
shallower than they would usually be (to capture the reduced performance at
high stimulus intensities). This can make the threshold estimate incorrect. Many
software tools for fitting psychometric functions now give the option of setting
an upper asymptote slightly below 1 to counteract this problem (see Section 4.2
of Kingdom and Prins 2010).
16.10 Calculating d′ in R
For yes/no tasks, we can calculate z-scores from proportions directly using the
qnorm function to generate quantiles from the normal distribution. This is the
curve shown in Figure 16.6. Here are the z-scores for a range of proportion
values:
proportions <- seq(0.1,0.9,0.1)
proportions
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
zscores <- qnorm(proportions)
round(zscores,digits=2)
## [1] -1.28 -0.84 -0.52 -0.25 0.00 0.25 0.52 0.84 1.28
We can then calculate d′ manually by subtracting z-scores for hits and false
alarms. For a hit rate of 0.95 and a false alarm rate of 0.1, d′ is calculated as
follows:
hits <- 0.95
FA <- 0.1
dprime <- qnorm(hits) - qnorm(FA)
dprime
## [1] 2.926405
Similarly, bias (C ) is calculated using the same values:
C <- -(qnorm(hits) + qnorm(FA))/2
C
## [1] -0.181651
For 2AFC, the appropriate conversion is:
dprime2afc <- qnorm(proportions)*sqrt(2)
round(dprime2afc,digits=2) # round to 2 decimal places
## [1] -1.81 -1.19 -0.74 -0.36 0.00 0.36 0.74 1.19 1.81
Notice that the proportions below chance performance (0.5) produce negative
d-prime scores. Most participants will tend to perform above chance in 2AFC
experiments, so this should occur only rarely. Because there is only one proportion involved in this calculation, we can convert in the opposite direction as well, by dividing the d′ scores by √2 and using the pnorm function:
proportionsconverted <- pnorm(dprime2afc/sqrt(2))
proportionsconverted
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
For mAFC tasks (where m > 2), the calculations become more complex (see
section 6.3.2 of Kingdom and Prins (2010) for a detailed explanation). However,
there is a function called dprime.mAFC available in the psyphy package. It will
convert proportion correct scores to d′ , provided m is also specified. However
it can only convert one score at a time (it does not work with vectors), so to
convert multiple values we can either use a loop, or use the sapply function (a built-in function that applies another function to every value in a vector) as follows:
library(psyphy)
dprime3afc <- sapply(proportions,dprime.mAFC,3)
dprime4afc <- sapply(proportions,dprime.mAFC,4)
dprime5afc <- sapply(proportions,dprime.mAFC,5)
If we plot the d′ scores (in Figure 16.10), you can see (from the vertical dotted
lines) that chance performance at the baseline (of 1/m) always corresponds to
d′ = 0.
These tools will allow a signal detection analysis to be applied to data from
yes/no and forced choice experiments. Similar tools exist in other programming
languages and toolboxes, for example the Palamedes toolbox provides signal
detection functions, along with tools for fitting psychometric functions, in the
Matlab environment. In R, the quickpsy package (Linares and López-Moliner
2016) can be used to fit psychometric functions. For further reading on signal
detection theory, I recommend Kingdom and Prins (2010), Macmillan and
Creelman (1991), and (for an excellent historical account) Wixted (2020).
Figure 16.10: Mapping from proportion correct to d’ for different forced choice
designs.
3. In a yes/no task with 28 hits and 15 misses, the hit rate (as a proportion)
is:
A) 0.65
B) 0.35
C) 0.54
D) 0.45
4. The sensitivity index in signal detection theory is called:
A) Cohen’s d
B) The criterion
C) d′
D) z
5. For a yes/no task, d’ is calculated as:
A) z(Hit) - z(Miss)
B) z(Hit) - z(FA)
C) z(Hit) - z(CR)
D) z(FA) - z(Hit)
6. Using the graph in Figure 16.6, what is the approximate z-score for a
probability of 0.9?
A) 0.5
B) 1.3
C) 3.1
D) 4.5
7. By collecting data on a task at a range of criterion levels, we can plot:
A) A psychometric function
B) The change in d′ with bias
C) A normal distribution
D) An ROC curve
8. An experiment is described as being 2AFC if there are:
A) Two possible responses
B) Two different stimuli to choose between
C) Two trials, and the participant responds to both of them
D) Additional responses to indicate confidence
9. Use the qnorm function in R to calculate d’ for a yes/no experiment with
a hit rate of 0.95 and a correct rejection rate of 0.8.
A) d′ = 0.80
B) d′ = 2.32
C) d′ = 2.49
D) d′ = 3.28
10. Use the qnorm function in R to calculate d’ for a 2AFC experiment where
the participant got 76% of trials correct.
A) d′ = 1.00
B) d′ = 0.95
C) d′ = 0.77
D) d′ = 0.71
Answers to all questions are provided in section 20.2.
Chapter 17
Bayesian statistics
This chapter will introduce an approach to statistics that is quite different from
that traditionally taught. Statistical tests that are evaluated for significance
using a p-value are known as Frequentist statistics. This category probably
includes most of the tests you are familiar with - things like t-tests, ANOVA,
correlation and regression. Bayesian methods involve quite different assumptions
and different procedures, and there are situations where they could be a sensible
choice in data analysis. But before we go into detail about how they work, it is
worth discussing some of the basic assumptions of, and problems with, frequentist
methods. It might seem counterintuitive to start a chapter on Bayesian statistics
by talking about frequentist methods, but these are issues that people who use
statistics routinely have often not thought about in depth.
Figure 17.1: Outcome of 100000 simulated experiments for different sample sizes
and effect sizes (Cohen’s d).
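Simulations of this kind are straightforward to run; a minimal sketch for the d = 0 case (with far fewer repetitions than were used for the figure):
nsims <- 1000                  # number of simulated experiments (the figure used 100000)
N <- 32                        # sample size for each simulated experiment
pvals <- replicate(nsims, t.test(rnorm(N, mean = 0, sd = 1))$p.value)
mean(pvals < 0.05)             # proportion significant; close to 0.05 when d = 0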
$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$  (17.1)
In plain language, this equation tells us how to work out the probability of A
given B, if we already know the general probabilities of both A and B, and also
the probability of B given A. That probably doesn’t sound very plain! A more concrete example will be useful to show how this works.
Imagine that you are a general practitioner, and you want to know how likely
it is that a patient has a specific disease. If you know nothing else about that
patient, your estimate will be based on the prevalence of that disease in the
population. Of course, in real life doctors know lots of other things about a
patient, not least whether they have any symptoms of the disease, and this
information would also factor into their diagnosis. But for the purposes of this
example, let’s imagine that the best estimate from the medical literature of the
baseline probability is P(A) = 0.01 (e.g. that one in a hundred individuals have
this disease). So your initial expectation, in the absence of any symptoms, is a
1% chance your patient has the disease, and a 99% chance they do not.
To improve your estimate, you take a blood sample from the patient and send it
off to be tested. You know that the test will correctly identify 90% of people
who have the disease; this is referred to as the sensitivity of the test, or the
true positive rate, and is conceptually similar to statistical power (see Chapter
5). This also means that the test will miss 10% of people who have the disease
- they will still be infected, but the test result will be negative, meaning they
appear to be healthy. When real clinical tests are developed, the aim is to get the
sensitivity as high as possible, though there will usually be technical limitations
that prevent perfect sensitivity.
The other aspect of test performance is what happens with uninfected patients,
who do not actually have the disease. This is summarised by the specificity, or
true negative rate of the test. This is how often the test will correctly identify
healthy patients as not having the disease - let’s say it’s 95% of the time for our
example. The remaining 5% of healthy patients will produce a false positive, and
might be incorrectly diagnosed as having the disease. Balancing the trade-off
between these two values (sensitivity and specificity) is a major challenge in test
development, as increasing sensitivity can often decrease specificity. A test that
is very sensitive but has low specificity might catch most people with the disease,
but also give false positives to lots of people who do not, perhaps resulting in
unnecessary treatment. An insensitive test that has high specificity would avoid
false positives, but might also miss many people who are truly ill.
A good way to think about the probabilities involved is to construct a table (see
Figure 17.3) showing the four possible outcomes for a hypothetical group of 1000
people tested. This table will look very familiar to those who have just read
Chapter 16 on signal detection theory, and indeed this can be thought of as a
signal detection problem (where the ‘signal’ is the presence of the disease). Out
of our 1000 people, we expect that 10 of them (1%) actually have the disease.
So the top row of the table (showing people with a true positive disease status)
adds up to 10 individuals, and the lower row (those with a true negative disease
status) is the remaining 990 individuals. The columns indicate the test results.
In the left column are the positive test results. This is 90% of the 10 people who
actually have the disease (so 9 of them), and 5% (the false positive rate) of the
people who don’t have it (~50 people). In the right column are the negative test
results. These comprise the one missed person who actually has the disease, and
940 correct rejections - people who don’t have the disease and tested negative.
Now let’s imagine we get back a positive test result for our patient. Can we
update our estimate from the baseline of 0.01 to incorporate the information
Test result (Positive / Negative) by true disease status:

True status Positive:   9 (Hits)     1 (Misses)    Total 10
True status Negative:  50 (FAs)    940 (CRs)       Total 990
Column totals:         59          941
Figure 17.3: Grid showing test results and true disease status for a group of
1000 individuals. FAs: false alarms; CRs: correct rejections.
from the test, as well as the information we have about how accurate the test is?
We already know the baseline probability P(A) is 0.01. The other part of the
numerator of equation (17.1) is the probability of getting a positive test result if
we actually have the disease. This is the sensitivity value we identified above,
and so P (B | A) = 0.9.
The denominator of the equation is P(B). This is the probability overall of
getting a positive test result. We don’t know this yet, but we can work it out,
because we have enough information about the sensitivity and specificity, as
well as the baseline probability. The overall probability of getting a positive test
result will be a combination of the probability of getting a positive result if you
have the disease, and the probability of a positive result if you don’t have the
disease. Mathematically:

$P(B) = P(B \mid A)P(A) + P(B \mid \bar{A})P(\bar{A})$  (17.2)
The first part of this equation is just the numerator term again (P (B | A)P (A)).
The second part is the probability of getting a positive test result if you don’t
have the disease. The bar over the A symbols indicates the probability of
not having the disease. So the second section of the equation is the probability
of getting a positive result if you do not have the disease (the false positive rate),
multiplied by the probability of not having the disease in the first place. Overall,
the equation we need to calculate is:
$P(A \mid B) = \dfrac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \bar{A})P(\bar{A})}$  (17.3)

Plugging in the values from our example:

$P(A \mid B) = \dfrac{0.9 \times 0.01}{0.9 \times 0.01 + (1 - 0.95) \times 0.99} = 0.15$  (17.5)

So even after a positive test result, the probability that our patient actually has the disease is only around 15%, because the disease is rare in the population.
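This calculation is easy to reproduce in R (a minimal sketch; the variable names are illustrative):
prevalence  <- 0.01    # P(A): baseline probability of having the disease
sensitivity <- 0.90    # P(B|A): probability of a positive test given the disease
specificity <- 0.95    # probability of a negative test given no disease
posterior <- (sensitivity*prevalence) /
  (sensitivity*prevalence + (1-specificity)*(1-prevalence))
posterior
## [1] 0.1538462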
Figure 17.4: Change in posterior probability for different values of test sensitivity
and baseline prevalence. The specificity was fixed at 0.9.
As a topical aside - as previously mentioned, this book was mostly written during
2020 when a novel coronavirus caused a deadly global pandemic. Crucial to
attempts to control the spread of the virus was the development of rapid and
accurate testing. Of course, in the early stages of the pandemic, the baseline
prevalence rate was changing by the day and in fact was acknowledged to be
inaccurate in most countries because only patients with severe symptoms were
being tested. For individual tests, sensitivity and specificity were often unclear,
and were certainly not communicated to the general public. In such situations,
assessing the likelihood of a test result being accurate is very challenging. It
was also the case that many scientists and politicians had prior expectations
about the virus that were ultimately incorrect. It was assumed to be like a
severe flu with a relatively low fatality rate, which meant that implementing
countermeasures such as lockdowns and mass testing were delayed in many
countries. In the next section we will introduce probability distributions, which
allow us to quantify our uncertainty about a given situation and consider a range
of possibilities.
Three critical concepts in Bayesian statistics are the prior, evidence and posterior
distributions. The prior distribution quantifies our expectations before we have
made any new observations. This is equivalent to the baseline probability of
having a disease in the above example (P (A)). In an experimental setting, it
might be generated using data (such as effect sizes) from previous studies, or
we might use a default (neutral) prior if other information is not available. The
evidence distribution (also called the likelihood function) is derived from our
new observations, which in an experimental or observational study would be the
data we collect. The posterior distribution is the outcome, and is calculated by
multiplying the prior and evidence distributions together according to Bayes' rule.
Example distributions are shown in Figure 17.5.
Figure 17.5: Hypothetical probability distributions for the prior (dotted curve),
empirical evidence (dashed curve) and posterior (solid curve). Curves have been
vertically scaled for visualisation.
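The multiplication of prior and evidence can be sketched with a simple grid approximation (this is purely illustrative, and is not the method used to produce Figure 17.5):
d <- seq(-1, 1, length.out = 401)            # grid of candidate effect sizes
prior <- dnorm(d, mean = 0, sd = 0.5)        # prior expectation (assumed values)
evidence <- dnorm(d, mean = 0.4, sd = 0.2)   # likelihood based on hypothetical data
posterior <- prior * evidence                # Bayes' rule (up to a scaling constant)
posterior <- posterior/sum(posterior)        # normalise so the values sum to 1
d[which.max(posterior)]                      # peak of the posterior distribution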
One way to summarise a Bayesian analysis is to calculate the ratio of the evidence for the alternative and null hypotheses, which allows a direct comparison with frequentist methods. (Technically, this value is called the posterior odds, and is the product of the prior odds and the ratio of marginal likelihoods. In most implementations we set the prior odds to equal 1, because we assume that the null and alternative hypotheses are equally likely. The posterior odds then depends entirely on the ratio of marginal likelihoods, which is also called the Bayes factor, so in situations where we consider the null and alternative hypotheses equally likely, the posterior odds and the Bayes factor are equivalent.)

If the experimental hypothesis is more likely, the Bayes factor score will be >1. If the null hypothesis is more likely, the Bayes factor score will be <1. Confusingly, some software calculates the ratio the other way up, which inverts the interpretation of the scores. It is possible to indicate which way round the probabilities were considered by using the subscripts BF10 and BF01 (where 1 is the experimental and 0 is the null hypothesis, and the first value corresponds to the numerator). The Bayes factor tells us which model is best supported by the data.
For simulations with a true effect, Figure 17.6 shows that the Bayes factors increase with sample size, just as with their frequentist counterparts (Figure 17.1). But in the case where
the effect size is d = 0, the Bayes factors decrease as a function of sample size
(black circles). This means that some data sets will provide evidence for the null
hypothesis (Bayes factors <1/3), and this will not be conflated with inconclusive
data sets that may have insufficient sample size (Bayes factors around 1).
Figure 17.6: Bayes factors for 100000 simulated experiments with different
sample sizes and effect sizes.
Much has been written about the relative strengths of each approach, and we
will not attempt to resolve this question here. However, a practical suggestion
that may be of value is to start to report Bayes factors alongside frequentist
tests, in much the same way that effect sizes are now routinely reported in many
papers. In most cases, these will simply reinforce the frequentist outcome. But in
situations where a non-significant result is found, reporting the Bayes factor can
help to distinguish between an inconclusive result, and one in which the evidence
favours the null hypothesis. This adds substantial value to the interpretation of
null results, at very little cost.
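The output below comes from a frequentist one-sample t-test on simulated data; a minimal sketch of code that would produce output of this form (the exact rnorm parameters are an assumption):
data <- rnorm(20, mean = 3, sd = 3)   # simulate a sample with a large true effect
t.test(data)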
##
## One Sample t-test
##
## data: data
## t = 4.644, df = 19, p-value = 0.000177
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1.800734 4.755699
## sample estimates:
## mean of x
## 3.278217
The effect size for the synthetic data is large, so this test gives a highly significant
result (p < 0.05). We can run a Bayesian t-test on the same data by calling the
ttestBF function from the BayesFactor package:
library(BayesFactor)
ttestBF(data)
The output reports a Bayes factor strongly favouring the alternative hypothesis, followed by the line 'Against denominator: Null, mu = 0'. The value of mu = 0 is the comparison value for the test - in other words it is the prediction of the
null hypothesis, which is why it is described as being ‘against denominator’ - the
null appears on the denominator of the Bayes factor equation. The final line of
the output tells us the type of test (a one sample Bayes factor) and the type of
prior (JZS). The JZS prior (standing for Jeffreys-Zellner-Siow) is a particular
type of default prior constructed using a Cauchy distribution for the effect size
and a Jeffreys prior on the variance, as described by Rouder et al. (2009).
What about if we have a null result? To explore this, we can generate a new
data set with a true mean of zero, and run both tests again:
data <- rnorm(20, mean=0, sd=3)
t.test(data)
##
## One Sample t-test
##
## data: data
## t = -0.18955, df = 19, p-value = 0.8517
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -1.442209 1.202684
## sample estimates:
## mean of x
## -0.1197628
ttestBF(data)
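With a true mean of zero and this sample size, the Bayes factor will typically fall below 1, often providing positive evidence for the null hypothesis. The next block of output comes from a further example with a much smaller sample; a minimal sketch (the rnorm parameters are an assumption):
data <- rnorm(5, mean = 1, sd = 1)    # a very small sample with a modest true effect
t.test(data)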
##
## One Sample t-test
##
## data: data
## t = 1.3714, df = 4, p-value = 0.2422
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.6562693 1.9373199
## sample estimates:
## mean of x
## 0.6405253
ttestBF(data)
As you can see from the output, this produces a non-significant p-value, and
a Bayes factor score around 1 (0.75). Because the sample size is so small, we
simply don’t have enough data from our observations to conclude that either
the null or alternative hypothesis is more convincing.
If we have more than two groups to compare, we can conduct a one-way ANOVA
using the frequentist aov function, and the Bayesian anovaBF function. First,
we simulate and plot some data (see Figure 17.7):
# generate three levels of a dependent variable
dv1 <- rnorm(60, mean = 3, sd = 3)
dv2 <- rnorm(60, mean = 4, sd = 3)
dv3 <- rnorm(60, mean = 5, sd = 3)
alldata <- c(dv1,dv2,dv3) # combine together
group <- gl(3,60,labels = c("DV1", "DV2", "DV3")) # make condition labels
dataset <- data.frame(group,alldata) # combine into a data frame
plot(alldata ~ group) # plot as box plots
[Figure 17.7: box plots of alldata (y-axis) for each level of group (x-axis).]
(For those unfamiliar with R’s linear model syntax, the term alldata ~ group
is interpreted as ‘alldata is predicted by group’. In other words, alldata is the
dependent variable, and group is the independent variable - see Chapter 4). The
frequentist effect is complemented by a large Bayes factor score of 26.7:
bf <- anovaBF(alldata ~ group, data = dataset)
summary(bf)
##
## Pearson's product-moment correlation
##
## data: IV and DV
## t = 3.6749, df = 58, p-value = 0.0005212
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2030679 0.6200815
## sample estimates:
## cor
## 0.4345837
The data show a significant positive correlation (not plotted here). There is also
a Bayesian correlation test (the correlationBF function) that can be called as
follows:
correlationBF(IV,DV)
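If you want to try this out with your own simulated values (rather than the data used above), a minimal self-contained sketch is:
# a minimal sketch with illustrative simulated data
IV <- rnorm(60)
DV <- 0.5*IV + rnorm(60)
cor.test(IV, DV)        # frequentist correlation test
correlationBF(IV, DV)   # Bayesian equivalent from the BayesFactor package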
##
## Call:
## lm(formula = DV ~ IV, data = regressiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57064 -0.83313 0.06715 0.65323 2.38935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2038 0.4216 9.971 3.46e-14 ***
B) 0.01
C) 33.5
D) 0.65
10. Suppose that the probability of a man sporting a moustache is 5%. However,
during November lots more people grow moustaches for charity, and the
probability of any individual man having a moustache increases to 30%.
Use Bayes’ rule to calculate the conditional probability that it is November
given that you see a man with a moustache. The probability is:
A) 0.08
B) 0.25
C) 0.35
D) 0.49
Answers to all questions are provided in section 20.2.
Chapter 18
Plotting graphs and data visualisation
There are several plotting packages in R that one might choose, the most popular
being the ggplot2 package (Wickham 2016). This is considered part of the
tidyverse suite of packages created by the makers of RStudio. The ggplot2
functions use a plotting grammar, whereby a plot is constructed by adding layers
of components. This can produce some excellent results; however, in my personal
experience I find that there is usually something I need to change from the
default way that ggplot2 behaves. This often proves to be very difficult indeed,
and so I have found that the built-in plotting functions from the base R graphics
package afford more flexibility. We will use these functions throughout the
chapter. Remember that the code to produce all figures in the book, including
this chapter, is available at: https://github.com/bakerdh/ARMbookOUP.
We will start by discussing some key principles of data visualisation, and con-
siderations regarding the choice of colour palettes. We will then build up a
plot with multiple components using the basic plot command, demonstrating
various options. Finally, we will discuss how to export graphs, and also how
to combine several graphs into a multi-part figure. These tools are sufficient
for assembling publication-quality plots entirely within R, without the need for
additional software (such as Adobe Illustrator).
[Figure 18.1: PROPORTION CORRECT plotted against ELEMENT ORIENTATION (deg), with three conditions (0, 20 and 40 Degrees).]
Figure 18.1: Reproduction of a rubbish graph from my own final year undergrad-
uate project. The experiment involved finding contours in a field of randomly
oriented elements. Performance is best when all elements in the contour have a
similar orientation (circles), but rapidly degrades as each element’s orientation
is jittered. Error bars give 95% confidence intervals.
[Figure 18.2: Percent correct plotted against Orientation (deg), with three conditions (0, 20 and 40 Degrees).]
Figure 18.2: Replotted data from my undergraduate project, using colour, line
style and symbol size to maximally distinguish conditions.
there are some truly awful examples in the media that violate this basic rule.
For interval and ratio data, the units of measurement are meaningful, and so
these need to be represented appropriately in a plot. Some spreadsheet software
that is widely used to produce graphs has a default setting where the x-axis is
categorical even for continuous data, which can result in an unequal spacing of
values that is very misleading.
For some types of data, a logarithmic scale may be more appropriate than a
linear one. Log-scalings can be used when the data have a positive skew in linear
units, as is often the case with values involving time (such as reaction times),
and also when the data are ratios. There are several ways to plot a log axis.
Figure 18.3b shows the tick marks in linear increments, but on a log axis, so
that successive spaces become smaller (as used on old logarithmic graph paper).
An alternative is to log-transform the data, and plot this on a linear axis, where
the labels are in log units (Figure 18.3c). A disadvantage of this method is that
the log units themselves are arbitrary, and hard to relate to the original values.
The method I find most transparent is to give tick labels in the original linear
units (rather than in log units), but spaced in logarithmic steps, such as factors
of 2 or 10 (see Figure 18.3d). Conventions differ across disciplines about which
approach to use, and so it is always best to check some related papers in your
field.
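As a minimal sketch (not the code used to produce Figure 18.3), the style shown in panel (d) can be achieved by plotting log-transformed values but labelling the ticks in the original linear units:
# a minimal sketch: log-spaced axis labelled in linear units
set.seed(1)
rt <- rlnorm(200, meanlog = log(500), sdlog = 0.3)   # simulated reaction times (ms)
plot(log2(rt), pch = 16, col = rgb(0,0,0,0.4), yaxt = 'n',
     xlab = 'Trial', ylab = 'Reaction time (ms)')
ticks <- c(250, 500, 1000, 2000)                     # factors of 2, in linear units
axis(2, at = log2(ticks), labels = ticks)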
[Figure 18.3, panels (a)-(d): scatterplots of Reaction time (ms) and Reaction time (log(ms)) shown on linear and logarithmic axes.]
Figure 18.3: Examples of linear and logarithmic axes. Panel (a) shows a linear
axis for reaction time data in milliseconds (ms). Panel (b) shows a logarithmic
axis for the same data, with tick marks spaced every 100 ms, resulting in uneven
tick placement. Panel (c) shows the logarithmic transform of the data - the
units are now much harder to interpret. Finally, panel (d) shows logarithmic
scaling with tick marks spaced at equal factors, but values given in linear units.
Position on the x-axis is arbitrary.
There is also some disagreement about whether it is necessary for a linear axis
to extend to zero (especially for bar charts). Sometimes starting the axis at
an arbitrary point closer to the means can exaggerate an apparent difference
between conditions (see Figure 18.4). This is bad practice if the intention is to
mislead, and exaggerate the apparent size of an effect. Yet it is not necessarily
the case that zero should always be included. For many variables, zero is not a
plausible value for any of our observations - reaction times being a good example
again. Stretching an axis to zero arbitrarily can obscure real and meaningful
differences in the data. Instead, I favour using axes that span an informative
range, and combining this with error bars to give a clear visual indication of
which differences are meaningful, as we will discuss in the following section.
A good rule of thumb might be to have the axes extend around 2 standard
deviations above the highest point, and around 2 standard deviations below the
lowest point you are plotting. However, this heuristic also needs a good dose
of common sense - round the upper and lower values to a sensible value, such
as an integer, or a factor of 10 or 100, depending on the natural scale of your
data. You should ideally use the same range for multiple plots of the same type
of data.
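A minimal sketch of this heuristic, with simulated scores and rounding to the nearest 10:
# a minimal sketch: axis limits roughly 2 SD beyond the most extreme values
yvals <- rnorm(50, mean = 70, sd = 5)                   # simulated scores
lims <- range(yvals) + c(-2,2)*sd(yvals)                # extend 2 SD beyond the extremes
lims <- c(floor(lims[1]/10), ceiling(lims[2]/10))*10    # round to sensible values
plot(yvals, ylim = lims, pch = 16, xlab = 'Observation', ylab = 'Score')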
[Figure 18.4: bar charts of Height (m) in two panels, (a) with the y-axis starting at 4 and (b) with the y-axis starting at 0.]
visual indication of how precise our estimates of means or medians might be. For
example, the data shown in Figure 18.5a might suggest a large difference between
the two conditions, yet without an indication of how variable each estimate is,
we cannot tell whether the apparent differences are real, or just due to sampling
error.
The most widely used measures of variability are the standard deviation, standard
error, and confidence intervals. The standard deviation (SD, or σ) is the square
root of the variance, so it is the most direct indication of the underlying variability
in the data (see Figure 18.5b). It does not depend on the number of observations,
meaning it should remain approximately constant as the sample size changes.
On the other hand, the standard error (SE) scales the standard deviation by the
square root of the sample size (see Figure 18.5c). This means that as we collect
more data the standard error becomes smaller, reflecting the greater precision of
our estimate of the mean (or other measure of central tendency).
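As a minimal sketch, all three quantities can be calculated directly from a vector of observations:
# a minimal sketch: standard deviation, standard error and 95% confidence interval
x <- rnorm(20, mean = 8, sd = 2)                            # simulated data
sdx <- sd(x)                                                # standard deviation
sex <- sdx/sqrt(length(x))                                  # standard error of the mean
ci95 <- mean(x) + c(-1,1)*qt(0.975, df = length(x)-1)*sex   # 95% confidence interval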
Something worth observing in Figure 18.5 is that the different types of error
bars can appear to indicate different amounts of variance, if we don’t take into
account what they are actually showing (remember it is the same data set in
each case). For example, standard deviations will always produce larger error
bars than standard errors, and without clarity about what is being shown a
reader might develop a spurious understanding of the variability of the data. It
turns out that even trained researchers often misinterpret what error bars are
showing, and treat standard errors and 95% confidence intervals very similarly
(Belia et al. 2005). Many people appear to use heuristics, such as whether the
error bars ‘just touch’, which are not a good indicator of statistical significance.
All of this confirms that we always need to explicitly state in the figure caption
what our error bars represent.
[Figure 18.5, panels (a)-(d): bar charts of Height (m) for two groups, with different types of error bar.]
Figure 18.5: Illustration of different types of error bar. Panel (a) shows two
means with no indication of the variance. Panel (b) shows the same data with
standard deviations plotted. The lower row shows the same data with standard
errors (c) and 95% confidence intervals (d) plotted. In this example, the two
groups are significantly different (t=2.5, df=18, p=0.02, d=1.1).
[Figure: Percent correct plotted against Orientation (deg), with three conditions (0, 20 and 40 Degrees).]
The most straightforward solution is to plot the individual data points that went
into calculating the averages being displayed (see Figure 18.7b). Interestingly, it
has been shown that the presence of a bar can distort the reader’s interpretation of
individual data points. Newman and Scholl (2012) found that people judge points
falling within the bar to be more likely to be part of the underlying distribution
than those falling outside of it. Since the bar itself is not informative, it can
be safely removed, and replaced by a line or point to indicate the average.
In practical terms, individual data points should not get in the way of the
representation of the mean, and so alpha transparency can be useful. Plotting
data points semi-transparently also allows us to see when multiple points with
similar values are overlapping with each other.
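A minimal sketch of this idea, using simulated data, semi-transparent individual points, and opaque symbols for the group means:
# a minimal sketch: semi-transparent raw data with group means plotted on top
group <- rep(1:2, each = 30)
values <- rnorm(60, mean = group*2, sd = 1)
plot(jitter(group, amount = 0.1), values, pch = 16, col = rgb(0,0,1,alpha = 0.3),
     xlim = c(0.5,2.5), xaxt = 'n', xlab = 'Group', ylab = 'Value')
axis(1, at = 1:2, labels = c('A','B'))
points(1:2, tapply(values, group, mean), pch = 21, bg = 'black', cex = 2)  # group means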
For some data sets, it can also be helpful to plot distributions, or kernel density
functions (smoothed histograms). There are several conventions for plotting
these. For example, with bivariate scatterplots showing correlations, it is common
to plot histograms along the margins of the plot (we will see an example of this
later in the chapter). For univariate data, the violin plot replaces bars with
mirrored distributions (Figure 18.7c). A more recent suggestion is the raincloud
plot (Allen et al. 2019), which shows a distribution with individual data points
either underneath or to one side, and a mean and error bar in between (Figure
18.7d).
For data types that involve continuous functions rather than individual points
(e.g. where a timecourse has been measured), individual functions can still be
plotted with a thin line-width, and the grand mean overlaid in a thicker line and
more salient colour (Rousselet, Foxe, and Bolam 2016). An example of this is
shown in Figure 18.8, replotting the ERP difference data described in Section
15.9. For this data set there is one clear outlier (highlighted in blue) that is
contributing only noise to the data set, and might reasonably be excluded on
various criteria. In the absence of outliers, this method of presentation allows
the reader to confirm that nothing has been hidden by the averaging process,
giving them more confidence in the results of any inferential statistics, and the
conclusions drawn from them.
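A minimal sketch of this style of plot, using simulated traces rather than the ERP data:
# a minimal sketch: thin individual timecourses with the group mean overlaid
t <- seq(-200, 1000, by = 10)
traces <- replicate(20, 5*exp(-((t-300)/150)^2) + rnorm(length(t), sd = 1.5))
matplot(t, traces, type = 'l', lty = 1, lwd = 0.5, col = 'grey70',
        xlab = 'Time (ms)', ylab = 'Difference (µV)')
lines(t, rowMeans(traces), lwd = 3, col = 'black')   # grand mean as a thicker line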
Figure 18.7: Example visualisations that provide more information about the
distribution of data. Panel (a) shows a generic bar plot (error bars indicate
95% confidence intervals). Panel (b) shows the raw data points plotted semi-
transparently. Panel (c) shows a violin plot, illustrating the distribution of points
as a symmetrical kernel density function. Panel (d) shows a raincloud plot,
which features the distribution and the raw data.
[Figure 18.8: Difference (µV) plotted against Time (ms).]
Figure 18.8: Replotted ERP difference data, showing results for individual partic-
ipants (thin traces) and the group mean (thick line). The shaded region indicates
±1SE across participants. The blue curve highlights an outlier participant that
might sensibly be excluded.
[Figure 18.9: the Rainbow palette shown in its Original form, in Black/White, and under simulated Deutan, Protan and Tritan colourblindness.]
ing to produce palettes that are appropriate for everyone. Palettes such as the
rainbow palette shown in Figure 18.9 are problematic partly because they involve
many different hues. Better alternatives are colour maps that pass smoothly
from one hue to another, and simultaneously change in brightness. One example,
the parula palette, is shown in Figure 18.10. Both ends of the palette are clearly
distinguishable for all four simulated varieties of colourblindness. When used to
plot real data (e.g. contour maps), the parula palette is still attractive and clear,
even when the colour information is removed (see Figure 18.11).
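As a minimal sketch, the parula palette (available through the pals package) can be applied to R's built-in volcano data set:
# a minimal sketch: applying the parula palette to a built-in test image
library(pals)
image(volcano, col = parula(256), useRaster = TRUE, axes = FALSE)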
[Figure 18.10: the Parula palette shown in its Original form, in Black/White, and under simulated Deutan, Protan and Tritan colourblindness.]
Figure 18.11: The Himmelblau function, plotted with two colour palettes and
their greyscale equivalents - the rainbow palette (left) and the parula palette
(right). The upper row shows the original palettes, and the lower row shows the
same values but with colour information removed (simulating achromatopsia).
Notice how the rainbow palette introduces sharp boundaries of brightness between
regions, whereas the parula palette has a smoother transition.
18.12. Perceptually uniform palettes reverse these priorities, and aim to create
consistent changes in luminance (and perceived hue) by manipulating the red,
green and blue levels nonlinearly (see right panel of Figure 18.12). Notice how
the standard ramp on the left has bands of higher brightness (especially in the
reddest part of the palette), but the perceptually uniform palette has a much
smoother transition across the whole range. This is important when representing
data using a colourmap, because a nonuniform palette will introduce spurious
features that are not present in the data.
[Figure 18.12: red (R), green (G), blue (B) and luminance (L) values plotted against palette level for the two palettes.]
Figure 18.12: Profile of red (R), green (G) and blue (B) values, and overall
luminance (L) for a standard linear ramp from black through red, orange, yellow
and white (left panel), and a perceptually uniform equivalent (right panel).
the data, making the peaks and troughs appear more salient and over-saturated.
The perceptually uniform palette produces smoother transitions between features,
and the true location of the peak in the lower left corner is clearer. For most
types of scientific image, these characteristics are to be preferred for the accurate
communication of results.
As well as being used for displaying scientific image data and surfaces, perceptu-
ally uniform colour palettes can also be used for selecting colours to plot data
points and curves. We will demonstrate this in the practical sections of the
chapter.
[Figure 18.14, panels (a)-(d): example plots with generic X and Y axis titles; panel (b) includes a legend for Condition 1 and Condition 2.]
Figure 18.14: Example outputs from the plotting function. Panel (a) shows an
empty axis, ready for data to be plotted. Panel (b) shows an example of two
conditions plotted with lines, points, error bars and a figure legend. Panel (c)
shows an example of the polygon function with alpha transparency. Panel (d)
shows an example of kernel density functions for an x-y scatterplot.
The above code will appear to do nothing at all, but it will define an empty
axis in the plot window of RStudio. We can then add axes with tick marks at
sensible intervals using the axis command, tick labels with the mtext command,
and axis labels with the title command.
ticklocs <- seq(0,1,0.2) # locations of tick marks
axis(1, at=ticklocs, tck=0.01, lab=F, lwd=2)
axis(2, at=ticklocs, tck=0.01, lab=F, lwd=2)
mtext(text = ticklocs, side = 1, at=ticklocs) # add the tick labels
# 'line' moves the labels away from the axis, and 'las=1' rotates them to be horizontal
mtext(text = ticklocs, side = 2, at=ticklocs, line=0.2, las=1)
title(xlab="X axis title", col.lab=rgb(0,0,0), line=1.2, cex.lab=1.5)
title(ylab="Y axis title", col.lab=rgb(0,0,0), line=1.5, cex.lab=1.5)
The above code will produce the axes shown in Figure 18.14a. These axes are quite minimal - it is also possible to add the opposite axes at the top and right (again using the axis command, but with the side option set to 3 or 4). If required, we can also force the axes to be square by adding par(pty="s") before the call to the plot command.
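For example, a sketch (not the exact code used for the figures in this chapter) of a square region with axes added on all four sides:
# a sketch: a square plotting region with axes on all four sides
par(pty = 's')                                       # force a square plotting region
plot(NULL, xlim = c(0,1), ylim = c(0,1), axes = FALSE, ann = FALSE)
ticklocs <- seq(0,1,0.2)
axis(1, at = ticklocs, tck = 0.01, labels = FALSE, lwd = 2)
axis(2, at = ticklocs, tck = 0.01, labels = FALSE, lwd = 2)
axis(3, at = ticklocs, tck = 0.01, labels = FALSE, lwd = 2)   # top axis (side 3)
axis(4, at = ticklocs, tck = 0.01, labels = FALSE, lwd = 2)   # right axis (side 4)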
Additional text can be added anywhere in the plot using the text function. This
is often useful for adding panel labels (a, b, etc.) for example:
text(-0.05,0.95,'(a)',cex=1.8, pos=4)
The first two inputs are the x- and y-coordinates, and the third input is the text
string we want to plot. The cex option specifies the font size, and the pos option
has several possible values that determine the position of the text relative to the
coordinates: 4 means that the string is plotted to the right of the coordinates.
A brief note about font size and style - many readers have visual impairments
that make small text particularly difficult to read. Therefore it is a good idea to
make text within your figure as large as is practical. Some people also find that
fonts without serifs (the little decorative lines added to some letters) are easier
to read. Popular sans serif fonts include Arial, Calibri and Helvetica.
The generic plotting functions do not require the data to be stored in a specific
format such as a data frame. They work fine with vectors of numbers. Lines and
error bars should ideally be plotted first so that they appear behind the data
points. A line connecting the data points can be defined using the lines function:
# draw a line connecting the points
lines(datax, datay, col='cornflowerblue', lwd=3, lty=1)
We provide the x- and y-values as vectors, define the colour using the col option,
and the width using the lwd option. There are also several line styles that can
be selected with the lty option, though here we have chosen the default option
(1), which plots a continuous straight line. Other options include dashed (option
2) and dotted (option 3) lines.
Error bars are plotted using the arrows function. We need to manually add and
subtract the standard error to define the upper and lower extents of the bars as
follows:
# add lower error bars
arrows(datax, datay, x1=datax, y1=datay-SEdata, length=0.015, angle=90, lwd=2)
# add upper error bars
arrows(datax, datay, x1=datax, y1=datay+SEdata, length=0.015, angle=90, lwd=2)
Notice that we plot all of the lower error bars at once, followed by all of the
upper error bars. The arrows function allows us to enter a vector of values for
each input, just like the lines function. As the name suggests, the arrows
function is designed to plot actual arrows, but setting the angle of the arrow to
90 degrees (as in the above code) produces error bars that are flat at the ends.
Next, we can add our data points with the points function. There are many
different point styles available that can be selected through the pch option. I
tend to favour styles 15-20 (solid shapes of a single colour) and styles 21-25
(filled shapes with an outline). Details on the available shapes are given in the
help file for the points function. For solid shapes of a single colour, the colour is
specified by the col option. For filled shapes, we specify the outline colour with
the col option, and the fill colour with the bg (background) option, allowing the
two parts of the shape to have different colours.
# draw the data points themselves
points(datax, datay, pch=21, col='black', bg='cornflowerblue', cex=1.6, lwd=3)
We can also include a legend to indicate what different symbols refer to. The
legend function requires x and y coordinates corresponding to its upper left
corner. We also provide a list of text labels for the conditions, and repeat some of
the options from the points and lines functions to specify the symbol properties.
Putting all of the above lines of code together (with some additional data for a
second condition) gives us the plot in Figure 18.14b.
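The semi-transparent colours used below can be created with the rgb function, which accepts an alpha (opacity) value between 0 and 1. A sketch consistent with the hexadecimal output that follows:
# a sketch: defining semi-transparent colours with an alpha value of 0.5
colour1 <- rgb(0,0,1, alpha = 0.5)   # 50% opaque blue
colour2 <- rgb(0,0,0, alpha = 0.5)   # 50% opaque black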
colour1
## [1] "#0000FF80"
colour2
## [1] "#00000080"
Notice that the colours are converted to a hexadecimal representation. If we
don’t know the RGB values of the colour we want to use, but instead have the
colour name, the following helper function (addalpha) will convert to the correct
format and add the alpha level:
addalpha <- function(col, alpha=1){
apply(sapply(col, col2rgb)/255, 2, function(x) rgb(x[1], x[2], x[3],alpha=alpha))}
We can then use these colour values with lines, points and polygons. Polygons
are two-dimensional shapes of arbitrary specification. For example, we could
draw two rectangles using the polygon function as follows:
polygon(c(0.2,0.8,0.8,0.2),c(0.2,0.2,0.6,0.6),col=colour1)
polygon(c(0.4,0.6,0.6,0.4),c(0.4,0.4,0.9,0.9),col=colour2)
In these two lines of code, the first two vectors of numbers are x- and y- values
that specify the corners (vertices) of the polygons. Adding the code to an empty
plot produces the image in Figure 18.14c. Of course we are not limited to only
four vertices, and a common use of the polygon function is to plot histograms
as kernel density functions. With a larger set of correlated data, we can add
histograms to the margins of a scatterplot as follows:
datax <- rnorm(1000,mean=0.5,sd=0.1)
datay <- rnorm(1000,mean=0.5,sd=0.1) + (datax-0.5)
a <- density(datax)
a$y <- 0.2*(a$y/max(a$y))
polygon(a$x, 1-a$y, col=colour1,border=NA)
a <- density(datay)
a$y <- 0.2*(a$y/max(a$y))
polygon(1-a$y, a$x, col=colour2,border=NA)
The density function generates the kernel density distribution, and the subsequent
line of code rescales the height to be between 0 and 0.2 so that it fits neatly
onto our plot. Notice that the x- and y- variables are reversed for the second
polygon so that it appears on the right side of the plot. Figure 18.14d plots
these polygons along with the data points, which are also semi-transparent.
## [1] "#FF8000CC"
The col2rgb function converts a colour name or hexadecimal code back to red, green and
blue values (scaled 0 - 255):
col2rgb('orange')
## [,1]
## red 255
## green 165
## blue 0
col2rgb("#FF8000CC")
## [,1]
## red 255
## green 128
## blue 0
Note that this function removes the transparency information.
To obtain a custom colour palette, we can use the colorRamp function to transition smoothly between two or more colours, sampling the ramp at an arbitrary number of intermediate points. For example, the following code
produces the palette shown on the left side of Figure 18.12:
cr <- colorRamp(c("black","red","orange","yellow","white"))
thispal <- rgb(cr(seq(0, 1, length = 256)), max = 255)
The pal.test function produces a series of test images for a given palette, as
shown in Figure 18.15, which shows the output of the following expression:
pal.test(kovesi.linear_bgyw_15_100_c67)
Figure 18.15: Test image for colour palettes, generated by the pal.test function
for the Kovesi linear blue-green-yellow-white palette.
The graph in the lower right shows the red, green, blue and luminance ramps in
the same format as Figure 18.12. The upper right image shows the full palette
with a sine wave modulation along the upper edge. If the oscillations are harder
to see in some parts of the palette, this indicates perceptual non-uniformity.
The volcano images in the lower row are the same image with the colour palette
ranging from 0 - 1 and from 1 - 0. If some features are visible in only one of
these images, it again suggests problems with the palette. Since the Kovesi
palettes are all perceptually uniform, they pass most of these tests. It is worth
also viewing the output of the test function for a non-uniform palette (see Figure
18.16). Notice that the two volcanoes appear to differ in size, and that parts of
the sine-wave oscillation are not visible (particularly in the green band).
To check how a palette will appear to individuals with several types of colour-
blindness, the pal.safe function produces images like those shown in Figures 18.9
and 18.10.
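For example, a minimal sketch comparing two palettes:
# a minimal sketch: simulated colourblind appearance of two palettes
pal.safe(rainbow)
pal.safe(kovesi.linear_bgyw_15_100_c67)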
Once we have selected a suitable colour palette, we can apply it to an image using
the image function. A suitable test image is the interesting two-dimensional
Figure 18.16: Test image for colour palettes, generated by the pal.test function
for the rainbow palette. Observe in particular that the sinusoidal oscillation in
the upper right panel is very hard to see in some regions of the palette (e.g. the
green band).
$$ f(x, y) = \sin(y)\,e^{(1-\cos(x))^2} + \cos(x)\,e^{(1-\sin(y))^2} + (x - y)^2 \qquad (18.1) $$
The following code plots this function using the palette stored in the rpal object,
with the output shown in Figure 18.17:
x <- matrix(rep(seq(-10,0,length.out=200),200),nrow=200,ncol=200)
y <- t(matrix(rep(seq(-6.5,0,length.out=200),200),nrow=200,ncol=200))
z <- sin(y)*exp((1 - cos(x))^2) + cos(x)*exp((1 - sin(y))^2) + (x - y)^2
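The call to image itself might then look something like the following sketch, assuming the chosen palette has been stored in the rpal object:
# a sketch: vectors of x and y values, the matrix z, and the palette stored in rpal
image(x[,1], y[1,], z, col = rpal, useRaster = TRUE, axes = FALSE)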
The image function requires vectors of x and y values, and a matrix of z (intensity)
values, as well as a palette specification for the col option. When it is called,
the function generates a new plot, for which axes and labels can be plotted if
required. The useRaster option tells the function to plot either as a raster image
(if TRUE) or a vector image (if FALSE). Raster images are like photographs,
where the RGB value of each pixel is stored at a particular resolution. Vector
graphics use graphical primitives such as lines, points and polygons to represent
images. For most plots (i.e. lines, points, polygons), vector graphics are preferred
because they can easily be enlarged with no loss of quality. However, the vector
rendering of images such as the one shown in Figure 18.17 often looks worse
than a raster version, because the individual elements of the texture surface get
separated by little white lines. It is worth trying both options to see what looks
best.
Figure 18.17: The Mishra Bird function, plotted using the image function and
the Kovesi diverging bky palette.
sizes can become very large. Second, for a fixed resolution, if we zoom into a
raster image too far it will start to look blocky and pixelated, because we can
see each individual pixel (and sometimes compression artifacts). This can look
particularly bad for text, as shown in the example in Figure 18.18a.
Figure 18.18: Comparison of raster and vector art. Panel (a) shows a figure
legend from a graph that was saved in a raster format (jpeg) and then zoomed
in. Panel (b) shows the same thing for a vector format (pdf) - the text and
objects are sharper, with no visible pixels.
An alternative is to use vector file formats (including .pdf, .svg, .ps and .eps formats).
Vector images are lists of instructions about how to draw an image using a
collection of lines, shapes and text - much like the way we actually construct
graphs in R. Vector files do not have a set resolution, and if we zoom in on a
particular part the computer can redraw it with a high level of detail (see Figure
18.18b). This is the natural format for scientific figures, and most journals prefer
you to submit artwork in a vector file format. An added advantage is that vector
images typically have much smaller file sizes than their raster equivalents. It is
possible to embed raster images as part of a vector file, though these will still
have all of the limitations of the original image (e.g. fixed resolution, and large
file size).
We can export a graph as a vector pdf file (pdf stands for portable document
format) using the pdf function:
pdf('filename.pdf', bg='transparent', height = 5.5, width = 5.5)
The height and width parameters are in inches, and other background colours
can be set using the bg option if required. All plots you create will then be
exported to this file up until the following command is called:
dev.off()
The svg function exports in scalable vector graphics format in a similar way.
Other functions can export as raster graphics formats (jpeg, tiff, png and bmp
functions). For all of the raster functions the file size is set in pixels by default.
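For example, a minimal sketch exporting a png file with its size in pixels and an explicit resolution:
# a minimal sketch: export a raster png at 300 dpi, with the size given in pixels
png('filename.png', width = 1650, height = 1650, res = 300, bg = 'transparent')
# ...plotting commands go here...
dev.off()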
Finally, we can export in ps and eps format using the postscript function:
postscript('filename.ps', horizontal = FALSE, onefile = FALSE,
paper = 'special', height = 5.5, width = 5.5)
For this function the paper option needs to be set to ‘special’ for the height
and width options to be used. Postscript files are vector images (though they
can have raster images embedded inside of them). In the next section we will
discuss how to re-import individual graphs that have been saved in ps format,
to combine into a single figure without losing the vector information. There is
one downside to this format - postscript files cannot store alpha transparency
information. However we will suggest a workaround for this issue.
The above code will generate a 2 x 2 plot like Figure 18.14, with each subsequent
panel being added in sequence. The layout function is slightly more sophisticated,
and can allow for plots of different sizes. However, sometimes publication-quality
figures need more flexibility than is permitted by regular layouts. For this reason,
this section will demonstrate an alternative approach to combining plots that is
built on functions from the grImport and grid packages.
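Before moving on, a minimal sketch of the layout approach mentioned above, with one wide panel above two smaller ones:
# a minimal sketch of layout: panel 1 spans the top row, panels 2 and 3 share the bottom
layout(matrix(c(1,1,2,3), nrow = 2, byrow = TRUE))
plot(rnorm(50), type = 'l')   # panel 1
hist(rnorm(100))              # panel 2
boxplot(rnorm(100))           # panel 3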
The grImport package allows us to import a postscript (.ps) file by first converting
it to a custom xml format as follows:
library(grImport)
PostScriptTrace('filename.ps')
e1 <- readPicture('filename.ps.xml')
Note that for this to work, you will need the free Ghostscript tools installed
on your system (see https://www.ghostscript.com/). The imported figure is
described as a set of lines, text and filled surfaces stored in a systematic way - in
other words it is a vector image. We can draw this figure in its entirety onto a
new plot, either in the Plots window or (more typically) a pdf file that we wish
to export to. To do this, we use the grid.picture function from the grid package
as follows:
grid.picture(e1,x=0.5,y=0.5,width=0.5,height=1)
The advantage of this approach is that we have total control over the size and
placement of the figure panel in the new plot. By default, the image will be
centred on the x and y coordinates provided, with the width value specifying a
proportion of the width of the plot window. Somewhat unintuitively, the height
option specifies a proportion of the width, and so should generally be set to 1 to
avoid distorting the aspect ratio of the figure.
a <- density(datax)
a$y <- 0.2*(a$y/max(a$y))
polygon(a$x, 1-a$y, col=pal2tone[2],border=NA)
a <- density(datay)
a$y <- 0.2*(a$y/max(a$y))
polygon(1-a$y, a$x, col='black',border=NA)
points(datax,datay,col=rgb(0,0,0),pch=16,cex=0.6)
dev.off()
Figure 18.19: Example of multi-panel plot with panels of different sizes and
positions.
This approach to collating plots works very well, but there is one problem - the
postscript format does not support alpha transparency. This means that when
This will plot the image at the x- and y-coordinates given by the first two
numbers for the bottom left corner, to the x- and y-coordinates given by the
last two numbers for the top right corner. Note that graphs in which the axes
have different limits or units will require careful thought about how to set these
coordinates to preserve the correct aspect ratio of the image. For example, if
the x-axis runs from 0 to 1, but the y-axis spans from 0 to 2, the extent of a
square image in the y-direction must be twice that in the x-direction.
It is also possible to rotate the image if required using the additional angle
option, specified in degrees. Rotation is about the lower left corner of the image
(not the centre), so this will also shift the centre of the image, meaning that the
coordinates often need to be adjusted for appropriate placement. A good way to
think about this is that the second pair of coordinates are really specifying the
size of the image (rather than its top right corner). For example, consider the
code:
rasterImage(violin,0.8,0.8,1.1,1.1,angle=180)
This actually plots the image between x=0.5 and x=0.8, because the rotation
through 180 degrees about the lower left hand corner means that the top right
corner has moved. Figure 18.21 shows the violin image plotted at both rotations.
One practical consideration when plotting raster images is that they will cover
up anything plotted underneath them. It is often pragmatic to plot the images
first, so that any points, lines and other features of the figure appear on top of
the image.
[Figure 18.21: the violin image plotted at its original and 180-degree rotated positions, on axes spanning 0 to 1.]
Chapter 19
Reproducible data analysis
6), and to inform power analyses in planning new studies (see Chapter 5). Open
practices should generally act to speed up scientific progress, and even suggest
new projects that were not previously viable. Over the past few years I have
used open data in my own research, and also had others use data that I have
made available. Both experiences have been very positive, and I have no regrets
about sharing data publicly. In particular I have been contacted by researchers
who lack laboratory facilities to collect their own data, but are able to make
progress and test hypotheses by re-using existing data sets. Such secondary data
analysis is also likely to increase citations to the original work.
The aim of this chapter is to discuss some details associated with making data
and scripts open and accessible. Some of these points will transcend specific
implementation issues, but we will also discuss how to interface with platforms
that are widely used today, in particular GitHub and the Open Science Framework
(OSF). In general, making code and data available will improve the integrity
and progress of science, and is worthwhile for researchers in all disciplines and
at all career stages.
set up a public dummy example to demonstrate how this works. Within RStudio,
choose New Project. . . from the File menu. Select Version Control, followed by
Git. You will then see a dialog box asking for details of the repository. The one
I’ve created is at:
https://github.com/bakerdh/gitdemo
You should enter this URL into the dialog box. You will also need to choose a
name for the project, and a directory to save it to on your computer. Then, when
you click Create Project, all of the files from the repository will automatically be
downloaded and stored as a local copy on your computer.
At the moment the project really just has a single file called gitdemo.R, which
doesn’t do very much (it prints out a message to the console). You can make
changes to this file on your computer, and then upload the changes to the
repository. There are three stages to this process, which are all accessed through
the Git tab in the top right corner of the main RStudio window (where the
Environment usually appears).
First you can click the Diff button to stage the changes - this shows you a
colour-coded visual representation of everything that you have changed from the
latest version you downloaded, in a separate window headed Review Changes.
If you are happy with the changes you can Commit them. This means you are
saying that you are ready to upload the changes to the online repository so that
others can see them. You should also add a commit message, where you can
briefly detail the reasons for the changes. Finally, you can push the changes to
the repository using the upward pointing green arrow - this will send everything
you have altered to the GitHub server and incorporate it into the main project.
If you are working on a group project, you can also periodically pull changes
made by your collaborators using the downward pointing blue arrow. This will
update your local copy of the project. But this is where things can get tricky -
if multiple people are working on the same files, there might be conflicts, where
two people have made incompatible changes. There are tools within git to cope
with this, including making separate branches of the code that you can work on
individually, before merging them back with the master branch. Dedicated git
references go into more detail about how this works (Chacon and Straub 2014).
GitHub repositories are designed to contain computer code and other small
files, so it is not generally a good idea to store large data files in them (the
amount of storage space is usually limited for this reason). Happily, as we will
outline in Section 19.6, other repositories exist that can store your data, and
you can include code to automatically download the data to your computer.
Furthermore, the OSF website integrates well with GitHub, in that you can
associate a GitHub repository with an OSF project. The files contained in the
repository will appear as if they are files within the OSF project, meaning that
data and code are accessible from a single location.
There are two additional advantages to maintaining a GitHub repository for your
your code is doing. In R, the hash symbol (#) and anything following it on a line are ignored, allowing you to include comments as follows:
randomSum <- 0 # initialise a data object to store a number in
# add a random number to our data object each time around the loop
randomSum <- randomSum + rnorm(1)
You can use multiple hash symbols to introduce hierarchical structure into your
code, for example by using three hash symbols to indicate a new section of code.
I am pretty bad at commenting my own code, so this is something I need to work
on myself! But it is definitely helpful for anyone trying to understand what you
have done, or attempting to modify your code for their own needs. Beginning
your script with some comment lines giving an overview of what it aims to do is
good practice, especially in the common situation where you might have many
versions of the same piece of code which are all slightly different.
Finally, there is a balance to be struck between writing dense code with multiple
nested functions on a single line, and code that is more spread out and easier
to read. For example, the following section of code contains multiple nested
commands and a logic statement, and is therefore rather difficult to parse:
if (t.test(rnorm(100,mean=2,sd=2))$p.value<0.05){print('Test is significant')}
On the other hand, we can rewrite (or re-factor) the code by splitting things up
and de-nesting the function calls as follows:
a <- rnorm(100, mean=2, sd=2)
tout <- t.test(a)
if (tout$p.value<0.05){
print('Test is significant')}
(In both cases, the code generates some random numbers, runs a t-test on them,
and spits out some text if it is significant). Which of these approaches you
take will depend on what sort of code you are trying to write. If your aim is
to use as few lines of code as possible, and perhaps save on memory allocation
for additional data objects, the first approach might occasionally have some
advantages. If the aim is to be understood by others, the second approach is
much clearer.
If you write dense code without comment lines and meaningful object names, it
will seem like a chore to go back and add them later. Indeed, many programmers
will be reticent to do this, in case it stops the code from working! However, if
you bear these suggestions in mind when you are developing the code, then you
can embed readability into your code as you create it. This often encourages
you to think more deeply about the programming decisions you are making,
and about the structure of your code, which means you are more likely to write
better, more legible code.
But once the package has been installed, repeating this line of code is unnecessary.
In fact it will actually re-install the same package(s), which can substantially
slow down your script, especially if there are many packages to install. As an
alternative, the following code will check the currently installed packages, and
only install those that are missing:
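A sketch of this kind of check (the package names in packagelist are purely illustrative):
# a sketch: install only the packages that are not already present
packagelist <- c('BayesFactor','pals','osfr')
missingpackages <- packagelist[which(!packagelist %in% installed.packages()[,1])]
if (length(missingpackages) > 0){
  install.packages(missingpackages)}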
We can do a similar trick for activating the packages, so this is only done for
packages that are not currently active:
# work out which packages are not currently activated
toactivate <- packagelist[which(!packagelist %in% (.packages()))]
# silently activate the inactive packages using the library function
invisible(lapply(toactivate,library,character.only=TRUE))
These five lines of code will avoid repeatedly installing and activating the same packages, and form a reasonably compact snippet to include at the start of a
script that will make your code more effortlessly accessible to others.
with other systems (such as GitHub, which we discussed in Section 19.1), and
has many additional features such as hosting of preregistration documents and
preprints. This means it can be used as a repository for an entire project, hosting
the data files, analysis scripts, and other information. An ideal to strive for is
that another researcher could download all of your materials, and fully reproduce
your entire analysis pipeline, through from processing the raw data to creating
the figures and statistics reported in a paper. The structure of the OSF website
makes this goal achievable. An example repository from a recent paper of mine
(D. Baker et al. 2021) is available at: https://osf.io/ebhnk/.
The osfr package is hosted on GitHub, rather than the CRAN repository. We
need to use the remotes package to download and install it as follows:
install.packages('remotes')
library(remotes)
remotes::install_github("centerforopenscience/osfr")
library(osfr)
I have created a dummy OSF project to demonstrate how uploading and down-
loading works. However, in order to upload data (to a repository you own) you
would first need to authorise R to interact with your OSF account using an
access token. The instructions for how to create a token are in the settings
section of the OSF website. Once you have a token, this is entered as follows:
osf_auth(token = 'MY_TOKEN_HERE')
Of course your token will only allow you to upload to your own projects, and
importantly it should not be shared with anyone else. This means you need to
remember to remove the token if you are posting your analysis code somewhere
publicly (just as I have done above).
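The project can then be retrieved by its five-character OSF ID (visible in the listing that follows); a minimal sketch using the osf_retrieve_node function:
# a minimal sketch: retrieve the demo project by its OSF id
osfproject <- osf_retrieve_node('thm3j')
osfproject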
## # A tibble: 1 x 3
## name id meta
## <chr> <chr> <list>
## 1 Upload & Download examples thm3j <named list [3]>
The osfproject data object then contains metadata about the project, and we
can use it to list the files contained in the project with the osf_ls_files function:
filelist <- osf_ls_files(osfproject)
filelist
## # A tibble: 5 x 3
## name id meta
## <chr> <chr> <list>
## 1 File1.csv 5ef0d99265982801b4cf0c9f <named list [3]>
## 2 File2.csv 5ef0d99665982801abcf2ad5 <named list [3]>
## 3 File3.csv 5ef0d99a145b1a01cc52cc8f <named list [3]>
## 4 File4.csv 5ef0d99e65982801a8cf2b7b <named list [3]>
## 5 File5.csv 5ef0d9a2145b1a01cb52c874 <named list [3]>
From this table, we see that the project contains five csv files. I originally
uploaded these files using the following code (which of course could also be done
in a loop):
osf_upload(osfproject,'data/File1.csv')
osf_upload(osfproject,'data/File2.csv')
osf_upload(osfproject,'data/File3.csv')
osf_upload(osfproject,'data/File4.csv')
osf_upload(osfproject,'data/File5.csv')
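The same uploads could be written as a loop, along these lines:
# a sketch: the same five uploads in a loop
for (n in 1:5){
  osf_upload(osfproject, paste0('data/File', n, '.csv'))}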
The OSF website has automatically generated ID numbers for each file. We
can then download the files to the R working directory using the osf_download
function:
osf_download(filelist) # download all files in the list
Note that we are downloading all the files at once here, which might be a
dangerous strategy with large data sets! If we want to download only a single
file, we can index different rows of the filelist data object as follows:
osf_download(filelist[3,]) # download the third file in the list only
a <- which(filelist[,1]=='File3.csv')
osf_download(filelist[a,])
not specific to a particular scientific field or data type. The most basic example is
to use a text file format, such as the csv (comma-separated values) format. Each
individual value in a csv file is separated by a comma, with new lines indicated by
a carriage return. This format can store both numbers and text, and the data
can be easily read by any modern programming language, as well as spreadsheet
packages such as Microsoft Excel, Google Sheets, or OpenOffice Calc. Finally,
column headings can be included as the first row of data.
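A minimal sketch of writing and reading a csv file from within R:
# a minimal sketch: write a data frame to csv and read it back in
mydata <- data.frame(id = 1:3, score = c(2.4, 3.1, 2.8))
write.csv(mydata, 'mydata.csv', row.names = FALSE)   # column headings form the first row
reloaded <- read.csv('mydata.csv')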
A shortcoming of text files is that they are not very efficient for storing large
amounts of data. If storage space is an issue, lossless compression algorithms
(such as zip and gzip) can be used to reduce file size. Alternatively, generic
formats for structuring and storing large amounts of data also exist. One
example is the HDF5 (Hierarchical Data Format 5) specification, which can store
arbitrarily large data structures. This is an open source format, developed by
a not-for-profit group dedicated to ensuring accessibility of data stored in the
format. The hdf5r package contains tools for reading and writing in this format
within R, and there are similar libraries for other programming languages.
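A minimal sketch of writing and reading a data set with the hdf5r package (the file and dataset names here are illustrative):
# a minimal sketch using hdf5r; file and dataset names are illustrative
library(hdf5r)
h5file <- H5File$new('mydata.h5', mode = 'w')          # create a new HDF5 file
h5file[['responses']] <- matrix(rnorm(100), 10, 10)    # store a matrix as a dataset
h5file$close_all()
h5file <- H5File$new('mydata.h5', mode = 'r')          # reopen for reading
responses <- h5file[['responses']]$read()
h5file$close_all()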
For several reasons, data formats associated with a specific programming language
or software package are not typically very open. For example, the widely-used
Matlab language is owned by a company (The MathWorks Inc.), who also own
the specification for the .mat file format. Although toolboxes currently exist
to import such data files into other programming environments (such as the
R.matlab toolbox in R), there is no guarantee that this will always be the case.
Similar arguments apply to Microsoft’s xls and xlsx spreadsheet formats. Overall,
making your data available in an open format is a better choice for ensuring it
is accessible to others both now and in the future.
head(hdata[,1:4],n=12)
Chapter 20
Endnotes
Ultimately the scientific enterprise is about finding out how the world works.
But real data are always noisy, and rarely lead to a clear interpretation in their
raw state. To this end, the analysis methods described in this book are tools
that help us to test hypotheses and better understand empirical data. Yet the
choice of statistical tool, and the way we interpret its outcome, will often be
somewhat subjective. My aim is that by understanding the fundamentals of a
range of techniques, researchers can make more informed decisions about how to
conduct their analyses. Mastery of advanced methods allows for creative flair in
the presentation of results, and in some situations may lead to new knowledge
that would be missed by more basic analyses. Although it may not feel like
it to a beginner, it can be very satisfying to immerse oneself in data analysis
(“wallowing in your data” as a former colleague of mine used to say). I hope that
this book helps others on their way to experiencing data analysis as a pleasure
rather than a chore, and leads to many new and exciting discoveries.
Chapter 3:
1. B - The data layouts are called wide and long - see the chapter for examples.
2. C - The error bars on a boxplot usually show the inner fence. Data
points falling outside of the whiskers are classified as outliers using Tukey’s
method.
3. D - the interquartile range gives the points between which 50% of the data
points lie. This is important for some methods of identifying outliers, but
it is not used to replace outlying values.
4. D - the Mahalanobis distance is suitable for identifying outliers with
multivariate data. The other three methods are used for univariate data.
5. C - the inner fence is 1.5 times the interquartile range beyond Q1 and Q3,
which works out as 2.698 standard deviations from the mean.
6. A - a 20% trimmed mean would exclude the highest 20% and lowest 20%
of values, leaving only the central 60%.
7. B - the Kolmogorov-Smirnov test is a test of the normality assumption,
rather than an alternative to parametric methods. The other responses
are all plausible alternatives to using parametric tests.
8. C - the scale function performs normalization of the mean and standard
deviation. If we want to avoid subtracting the mean, we set the ‘centre’
argument to FALSE, as in the third answer.
9. D - all of the previous options are used to assess deviations from normality.
The q-q plot shows the quantiles of the data and reference distributions,
and the two tests give a quantitative comparison.
10. A - logarithmic transforms will squish large values closer together, so the
skewed tail of the distribution will shrink.
Chapter 4:
1. A - Gosset was employed by the Guinness corporation in Dublin as their
head brewer, to analyse data on batches of beer.
2. B - the model formula uses the tilde (~) to indicate relationships, and is
always of the form DV ~ IV. Since we want to know how age predicts brain
volume, we need brainvolume to come first in the formula.
3. C - you can confirm that the other three options are not R functions by
trying to call the help function for them - they do not exist.
4. D - the residuals are the left over variance that cannot be explained by the
model. A good way to think about this is the error between the model’s
predictions and the data points.
5. C - a null regression model is flat, so it has a slope of zero. See Figure 4.2
for an example.
6. D - the slope of a fitted regression line will depend on the data we are
fitting, and the strength of any relationship between the two variables.
7. B - the degrees of freedom are always one less than the number of groups.
A good way to think about this is the number of straight lines required to
join successive pairs of points (see e.g. Figure 4.5).
8. C - factors are categorical variables, used to define groups e.g. in ANOVA.
9. B - we cannot use the t.test function because we have more than two
levels, but we could use either the aov or lm functions, as illustrated in
the chapter (and sketched after this list).
10. A - the asterisk symbol is used to indicate factorial combination, and
to generate interaction terms. A plus symbol would request additive
combination, as in multiple regression.
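A short sketch tying together the formula syntax mentioned in the answers above
(the data, and the group and sex variables, are invented for illustration):
# simulated data: brain volume declining with age
simdata <- data.frame(age = 20:69,
                      brainvolume = 1200 - 2 * (20:69) + rnorm(50, sd = 30))
# regression uses the DV ~ IV form
model1 <- lm(brainvolume ~ age, data = simdata)
summary(model1)    # slope, intercept and R-squared
resid(model1)      # residuals: the data minus the model's predictions
# ANOVA on a categorical factor, using either aov or lm
simdata$group <- factor(rep(c("A", "B", "C"), length.out = 50))
summary(aov(brainvolume ~ group, data = simdata))
# factorial combination of two factors would use the * symbol,
# e.g. aov(brainvolume ~ group * sex, data = simdata)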
Chapter 5:
1. B - the power is the proportion of significant tests, so 2000/10000 = 0.2.
2. C - Cohen’s d is the difference in means divided by the standard deviation,
so (15-12)/20 = 0.15.
3. C - as we collect more data, our precision in estimating the true effect size
is less subject to noise from random sampling error.
4. D - power is the probability of producing a significant effect, so a
low-powered study is unlikely to do this.
5. A - when power is low, only studies with large effect sizes will be significant
(see Figure 5.2), which overestimates the true effect.
6. D - enter pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.8,
type = 'one.sample'). The answer is 14.3, which rounds up to 15 (see the
sketch after this list).
7. A - enter pwr.r.test(r=0.3, n=24, sig.level=0.05). The answer is 0.30.
8. B - enter pwr.anova.test(f=0.33,k=8,n=30,sig.level=0.01). The answer is
0.91.
9. C - enter pwr.chisq.test(N = 12, df = 10, power = 0.5, sig.level=0.05).
The answer is 0.88.
10. A - enter pwr.f2.test(u=2, v=12, sig.level=0.05, power=0.8). The answer
is 0.83.
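To reproduce these calculations, load the pwr package and omit the quantity you
want R to solve for. For question 6, for example:
library(pwr)
result <- pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.8,
                     type = "one.sample")   # n is omitted, so it is solved for
result$n           # approximately 14.3
ceiling(result$n)  # sample sizes must be whole numbers, so round up to 15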
Chapter 6:
1. D - enter mes(0.34,0.3,0.1,0.1,24,24), the value of d is 0.4.
2. B - enter res(0.7,n=10), the value of d is 1.96.
3. A - enter pes(0.01,30,30), the value of the odds ratio is 3.48.
4. D - enter fes(13.6,17,17), the value of g is 1.24.
that might be treated as a random effect, it has only two levels and is
likely to be the independent variable of interest.
Chapter 8:
1. D - pseudo-random numbers are generated by an algorithm. The computer’s
clock is entirely predictable, but can be used to seed a random number
generator. The other two options are inherently random.
2. A - regardless of the shape of the individual distributions, their sum will
be approximately normal, thanks to the central limit theorem (see the example
in Figure 8.3).
3. B - the seed is the name for the value used by the algorithm to generate a
sequence of random numbers.
4. C - bootstrapping is primarily used to estimate confidence intervals, though
it can be applied to any statistic, including the mean, median or t-statistic
(a minimal sketch follows this list).
5. C - it is very difficult to distinguish between truly random and pseudo-
random numbers, and it is unlikely that this could be done with stochastic
simulations.
6. A - enter median(rgamma(100000,shape=2,scale=2)). The output will be
approximately 3.35.
7. D - we use the quantile function to request points from an empirical
distribution that correspond to particular probabilities.
8. B - the 95% confidence intervals are taken at 2.5% and 97.5% on a distri-
bution, which means that 95% of the values lie between those points.
9. A - this is a permutation of the original set, because each of the five
numbers appears once. The other three examples include either duplicate
numbers (implying resampling with replacement), or numbers that are not
in the original set.
10. C - enter hist(rpois(10000,lambda=2)). Poisson distributions have positive
skew as they are bounded at 0.
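As a minimal sketch of the bootstrapping idea from questions 4 and 8, using
simulated data:
set.seed(42)                            # seed the random number generator for reproducibility
x <- rgamma(50, shape = 2, scale = 2)   # a positively skewed sample
# resample with replacement many times, storing the median of each resample
bootmedians <- replicate(10000, median(sample(x, replace = TRUE)))
# the 2.5% and 97.5% quantiles of the bootstrap distribution give a 95% confidence interval
quantile(bootmedians, c(0.025, 0.975))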
Chapter 9:
1. C - there is always one more dimension than the number of free parameters
(which represents the error between model and data).
2. B - for a three dimensional space, the x- and y-coordinates represent the
parameter values, and the height represents the error.
3. A - circulation is not something associated with simplex algorithms, but
the other three options are geometric operations that change the shape
and location of the simplex.
4. D - local minima are low regions of the error space that are not as low as
the global minimum.
10. C - violin plots show a kernel density function mirrored about its midpoint.
They are often said to look rather suggestive (see https://xkcd.com/1967/).
Chapter 19:
1. B - if code is shared publicly, it can be accessed by anyone, so there is
no need to have shared your own code. Note that answer C might seem
like a disadvantage in some ways, but really it is a good thing to correct
mistakes, even if it requires corrections to a publication.
2. A - with some complex analyses, graphical interfaces often have so many
options that it can be difficult to precisely reproduce the original choices
(even if you are the one who made them!).
3. D - digital object identifiers (DOIs) are important to ensure that it will
always be possible to find the data even if website URLs change in the future.
4. A - the PAT gives R the permission to access your own OSF projects
so that you can upload and download data. Public projects are already
accessible without requiring the PAT, and there is no permission setting
that will allow you to access other private projects belonging to other users.
5. C - HDF5 and csv files are both open data formats, whereas .mat and xlsx
are proprietary.
6. B - a pull downloads the current version of the repository, including any
changes made by other programmers working on the project (a pull request, by
contrast, asks for your own changes to be merged).
7. D - we first stage the changes (marking which modified files will be included),
then we commit those changes with a message, and finally push them to the
repository.
8. B - in lower camel case the first letter is always lower case, but subsequent
words start with a capital letter.
9. A - the hash symbol (#) is used to indicate comments in R (other pro-
gramming languages use different characters, for example Matlab uses a %
symbol).
10. C - reproducibility is about repeating an analysis on a given data set,
whereas replication is about repeating an experiment.
20.3 Alphabetical list of key R packages used in this book
compute.es - tools for calculating and converting effect sizes, used in Chapter 6.
Key functions include des, mes, pes and tes.
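For example, the call from practice question 1 of Chapter 6 converts two group
means and standard deviations into a range of effect size statistics:
library(compute.es)
# means, SDs and sample sizes of the two groups
mes(m.1 = 0.34, m.2 = 0.3, sd.1 = 0.1, sd.2 = 0.1, n.1 = 24, n.2 = 24)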
grImport - tools to import and manipulate vector graphics files, used in Chapter
18. The key functions are PostScriptTrace, readPicture and grid.picture.
FourierStats - functions to conduct Hotelling's T² and Tcirc² tests, used in Chapter
11. Key functions include tsqh.test, tsqc.test, CI.test and pairwisemahal.
jpeg - package for loading in JPEG images, used in Chapters 10 and 18. The
package contains two functions, readJPEG and writeJPEG.
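For example (the file names here are placeholders):
library(jpeg)
img <- readJPEG("photo.jpg")   # returns an array of pixel values between 0 and 1
dim(img)                       # height x width x colour channels
writeJPEG(img, target = "copy.jpg", quality = 0.9)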
lavaan - Latent Variable Analysis package for running structural equation
models, used in Chapter 12. The key functions are cfa, lavTestScore and
lavTestWald.
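A minimal sketch of a confirmatory factor analysis, using the Holzinger-Swineford
example data set that ships with lavaan (this is the package's standard tutorial
model, not one from Chapter 12):
library(lavaan)
# three latent variables, each measured by three observed test scores
model <- 'visual  =~ x1 + x2 + x3
          textual =~ x4 + x5 + x6
          speed   =~ x7 + x8 + x9'
fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE)
# the fitted model can be drawn with semPlot::semPaths(fit) - see the semPlot entry below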
MAd - meta-analysis with mean differences, including tools for calculating effect
sizes, used in Chapter 6. Key functions include r_to_d, t_to_d and or_to_d.
MASS - helper package from the textbook Modern Applied Statistics with S
(Venables and Ripley 2002). Contains the isoMDS function for non-
metric multidimensional scaling used in Chapter 13, as well as the Shepard
function to produce Shepard plots.
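For example, a non-metric scaling solution for a built-in data set might be obtained
as follows (the choice of data set is arbitrary):
library(MASS)
d <- dist(scale(swiss))        # distance matrix from a standardized built-in data set
fit <- isoMDS(d, k = 2)        # two-dimensional non-metric MDS
fit$stress                     # stress value (lower is better)
plot(fit$points, xlab = "Dimension 1", ylab = "Dimension 2")
# Shepard plot comparing the original dissimilarities with the fitted distances
sh <- Shepard(d, fit$points)
plot(sh$x, sh$y, xlab = "Dissimilarity", ylab = "Distance")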
pals - tools for creating and evaluating colourmaps and palettes for plotting,
used in Chapter 18. The kovesi palettes are very useful; the pal.safe function
simulates colour blindness, and the pal.test function can be used to evaluate a
palette.
pracma - package containing many practical mathematical functions. This
includes the nelder_mead function to implement the Nelder and Mead (1965)
downhill simplex algorithm, described in Chapter 9.
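As a small illustration (using a standard test function, and assuming the current
pracma argument order of function first, then starting values):
library(pracma)
# Rosenbrock's function: a classic two-parameter test problem with its minimum at (1, 1)
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
result <- nelder_mead(rosenbrock, c(-1.2, 1))   # downhill simplex search from (-1.2, 1)
result$xmin   # estimated parameters at the minimum (should be close to 1, 1)
result$fmin   # value of the error function at that point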
PRISMAstatement - functions for creating PRISMA diagrams for use in meta-
analysis, used in Chapter 6. The prisma_graph function generates a diagram.
psyphy - functions for analysing psychophysical data, including d′ calculation
with the dprime.mAFC function, as used in Chapter 16.
pwr - package of power analysis functions used in Chapter 5. Key functions
include pwr.t.test, pwr.r.test and pwr.anova.test.
quickpsy - package for fitting psychometric functions, used in Chapter 16. The
main function is also called quickpsy.
remotes - tools for installing packages from repositories such as GitHub, as
demonstrated in Chapter 19. The key function is install_github.
rmeta - meta-analysis tools used in Chapter 6. Key functions include
meta.summaries, metaplot and funnelplot.
semPlot - package used to produce graphical representations of structural equa-
tion models, used in Chapter 12. The key function is semPaths.
signal - package for signal processing; the fir1 function was used to construct
filters in Chapter 10.
Chapter 21
References
https://doi.org/10.1364/josaa.4.002379.
Fisher, Ronald A. 1926. “The Arrangement of Field Experiments.” Journal
of the Ministry of Agriculture 33. Ministry of Agriculture and Fisheries: 503–15.
https://doi.org/10.23637/ROTHAMSTED.8V61Q.
Forgy, E.W. 1965. “Cluster Analysis of Multivariate Data: Efficiency Vs Inter-
pretability of Classifications.” Biometrics 21: 768–69.
Fox, J., and S. Weisberg. 2018. An R Companion to Applied Regression. 3rd ed.
SAGE Publications Inc.
Friston, Karl. 2012. “Ten Ironic Rules for Non-Statistical Reviewers.” Neuroim-
age 61 (4): 1300–1310. https://doi.org/10.1016/j.neuroimage.2012.04.018.
Friston, Karl J, Thomas Parr, Peter Zeidman, Adeel Razi, Guillaume Flandin,
Jean Daunizeau, Ollie J Hulme, et al. 2020. “Dynamic Causal Modelling of
Covid-19.” Wellcome Open Res 5: 89. https://doi.org/10.12688/wellcomeopenres.
15881.2.
Galecki, A., and T. Burzykowski. 2013. Linear Mixed-Effects Models Using R.
Springer-Verlag New York.
Glasziou, P P, and D E Mackerras. 1993. “Vitamin A Supplementation in
Infectious Diseases: A Meta-Analysis.” BMJ 306 (6874): 366–70. https://doi.
org/10.1136/bmj.306.6874.366.
Gorgolewski, Krzysztof J, Tibor Auer, Vince D Calhoun, R Cameron Craddock,
Samir Das, Eugene P Duff, Guillaume Flandin, et al. 2016. “The Brain
Imaging Data Structure, a Format for Organizing and Describing Outputs of
Neuroimaging Experiments.” Sci Data 3: 160044. https://doi.org/10.1038/sdata.
2016.44.
Green, D M, and John A Swets. 1966. Signal Detection Theory and Psy-
chophysics. Wiley.
Gronau, Quentin F., Alexander Ly, and Eric-Jan Wagenmakers. 2019. “Informed
Bayesian T-Tests.” The American Statistician 0 (0). Taylor & Francis: 1–14.
https://doi.org/10.1080/00031305.2018.1562983.
Grootswagers, Tijl, Susan G Wardle, and Thomas A Carlson. 2017. “Decoding
Dynamic Brain Patterns from Evoked Responses: A Tutorial on Multivariate
Pattern Analysis Applied to Time Series Neuroimaging Data.” J Cogn Neurosci
29 (4): 677–97. https://doi.org/10.1162/jocn_a_01068.
Harbord, Roger M, Matthias Egger, and Jonathan A C Sterne. 2006. “A Modified
Test for Small-Study Effects in Meta-Analyses of Controlled Trials with Binary
Endpoints.” Stat Med 25 (20): 3443–57. https://doi.org/10.1002/sim.2380.
Hartigan, J. A., and M. A. Wong. 1979. “Algorithm as 136: A K-Means
Clustering Algorithm.” Journal of the Royal Statistical Society. Series C
Karaboga, D. 2005. “An Idea Based on Honey Bee Swarm for Numerical
Optimization.” Technical Report TR06. Erciyes University.
Kass, R.E., and A.E. Raftery. 1995. “Bayes Factors.” Journal of the American
Statistical Association 90 (430): 773–95.
Kennedy, J., and R. Eberhart. 1995. “Particle Swarm Optimization.” Proceedings
of ICNN’95 4: 1942–8. https://doi.org/10.1109/ICNN.1995.488968.
Kingdom, F.A.A., and N. Prins. 2010. Psychophysics: A Practical Introduction.
Elsevier.
Kline, Rex B. 2015. Principles and Practice of Structural Equation Modelling.
4th ed. Guilford, New York.
Kolmogorov, A.N. 1992. Selected Works II: Probability Theory and Mathematical
Statistics. Edited by A.N. Shiryaev. Vol. 26. Springer Netherlands.
Kovesi, P. 2015. “Good Colour Maps: How to Design Them.” arXiv, no.
1509:03700. https://arxiv.org/abs/1509.03700.
Kruschke, John K. 2014. Doing Bayesian data analysis: a tutorial with R, JAGS,
and Stan. 2nd ed. Elsevier, Academic Press.
Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.”
Journal of Statistical Software 28 (5). Foundation for Open Access Statistics.
https://doi.org/10.18637/jss.v028.i05.
Kuznetsova, Alexandra, Per B. Brockhoff, and Rune H. B. Christensen. 2017.
“lmerTest Package: Tests in Linear Mixed Effects Models.” Journal of Statistical
Software 82 (13). Foundation for Open Access Statistics. https://doi.org/10.
18637/jss.v082.i13.
Lakens, Daniel, Federico G. Adolfi, Casper J. Albers, Farid Anvari, Matthew A.
J. Apps, Shlomo E. Argamon, Thom Baguley, et al. 2018. “Justify Your Alpha.”
Nature Human Behaviour 2 (3). Springer Science and Business Media LLC: 168–71.
https://doi.org/10.1038/s41562-018-0311-x.
Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. SAGE Publica-
tions.
Lenhard, W., and A. Lenhard. 2016. Calculation of Effect Sizes. Dettelbach
(Germany): Psychometrica. https://doi.org/10.13140/RG.2.2.17823.92329.
Lilja, David. 2016. Linear Regression Using R: An Introduction to Data Modeling.
University of Minnesota Libraries Publishing. https://doi.org/10.24926/8668/
1301.
Linares, Daniel, and Joan López-Moliner. 2016. “quickpsy: An R Package to
Fit Psychometric Functions for Multiple Groups.” The R Journal 8 (1): 122–31.
https://doi.org/10.32614/RJ-2016-008.