Analysing Data Using Linear Models 5th Ed January 2021
This book is for bachelor students in the social, behavioural and management sciences who want to learn how to analyse their data, with the specific aim of answering research questions. The book has a practical take on data analysis: how to do it, how to interpret the results, and how to report them. All techniques are presented within the framework of linear models: this includes simple and multiple regression models, linear mixed models and generalised linear models. This approach is illustrated using R.
Contents
1.22 Box plots in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.23 Visualising categorical variables . . . . . . . . . . . . . . . . . . . 41
1.24 Visualising categorical and ordinal variables in R . . . . . . . . . 45
1.25 Visualising co-varying variables . . . . . . . . . . . . . . . . . . . 46
1.25.1 Categorical by categorical: cross-table . . . . . . . . . . . 46
1.25.2 Categorical by numerical: box plot . . . . . . . . . . . . . 47
1.25.3 Numeric by numeric: scatter plot . . . . . . . . . . . . . . 48
1.26 Visualising two variables using R . . . . . . . . . . . . . . . . . . 50
1.27 Overview of the book . . . . . . . . . . . . . . . . . . . . . . . . . 51
Appendices 371
Let’s look at the first column in Table 1.1. We see that it regards the variable
name. We call the property name a variable, because it varies across our units
(the students): in this case, every unit has a different value for the variable
name. In sum, a variable is a property of units that shows different values for
different units.
The second column represents the variable grade. Grade is here a variable,
because it takes different values for different students. Note that both Mark
Zimmerman and Mohammed Solmaz have the same value for this variable.
What we see in Table 1.1 is called a data matrix : it is a matrix (a collection
of rows and columns) that contains information on units (in the rows) in the
form of variables (in the columns).
A unit is something we’d like to say something about. For example, I might
want to say something about students and how they score on a course. In that
case, students are my units of analysis.
If my interest is in schools, the data matrix in Table 1.2 might be useful,
which shows a different row for each school with a couple of variables. Here
again, we see a variable for grade on a course, but now averaged per school. In
this case, school is my unit of analysis.
library(tidyverse)
studentID <- seq(4132211, 4132215)
course <- c("Chemistry", "Physics", "Math", "Math", "Chemistry")
grade <- c(4, 6, 3, 6, 8)
shirtsize <- c("medium", "small", "large", "medium", "small")
tibble(studentID, course, shirtsize, grade)
## # A tibble: 5 x 4
## studentID course shirtsize grade
## <int> <chr> <chr> <dbl>
## 1 4132211 Chemistry medium 4
## 2 4132212 Physics small 6
## 3 4132213 Math large 3
## 4 4132214 Math medium 6
## 5 4132215 Chemistry small 8
From the output, you see that the tibble has dimensions 5 × 4: that means it
has 5 rows (units) and 4 columns (variables). Under the variable names, it can
be seen how the data are stored. The variable studentID is stored as a numeric
variable, more specifically as an integer (<int>). The course variable is stored
as a character variable (<chr>), because the values consist of text. The same is
true for shirtsize. The last variable, grade, is stored as <dbl> which stands
for ’double’. Whether a numeric variable is stored as integer or double depends
on the amount of computer memory that is allocated to a variable. Double
variables have a decimal part (e.g., 2.0), integers don’t (e.g., 2).
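To check how a value is stored, you can ask R for its type with typeof(); a small sketch (the 2L notation is R's way of writing an integer literal, and studentID is the vector created above):

typeof(2L)          # "integer": a whole number, no decimal part is stored
typeof(2.0)         # "double": stored with a decimal part
typeof(studentID)   # "integer": seq() over whole numbers returned integers here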
This way of representing data on a variable that was measured more than
once is called wide format. We call it wide because we simply add columns when
we have more measurements, which increases the width of the data matrix. Each
new observation of the same variable on the same unit of analysis leads to a
new column in the data matrix.
Table 1.4: Data matrix with depression levels in long format.
client time depression
1 1 5
1 2 6
1 3 9
1 4 3
2 1 9
2 2 5
2 3 8
2 4 7
3 1 9
3 2 0
3 3 9
3 4 3
4 1 9
4 2 2
4 3 8
4 4 6
Note that this is only one way of looking at this problem of measuring
depression four times. Here, you can say that there are really four depression
variables: there is depression measured at time point 1, there is depression
measured at time point 2, and so on, and these four variables vary only across
units of analysis. This way of thinking leads to a wide format representation.
An alternative way of looking at this problem of measuring depression four
times, is that depression is really only one variable and that it varies across
units of analysis (some people are more depressed than others) and that it also
varies across time (at times you feel more depressed than at other times).
Therefore, instead of adding columns, we could simply stick to one variable
and only add rows. That way, the data matrix becomes longer, which is
the reason that we call that format long format. Table 1.4 shows the same
information from Table 1.3, but now in long format. Instead of four different
variables, we have only one variable for depression level, and one extra variable
time that indicates the time point to which a particular depression measure refers. Thus, both Tables 1.3 and 1.4 tell us that the second depression measure for client number 3 was 0.
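As an illustration (a sketch that is not part of the original example code; the column names time1 to time4 are made up here), the wide-format data of Table 1.3 can be restructured into the long format of Table 1.4 using the pivot_longer() function that is discussed later in this chapter:

library(tidyverse)

# wide format: one row per client, one column per time point (as in Table 1.3)
depression_wide <- tibble(client = 1:4,
                          time1  = c(5, 9, 9, 9),
                          time2  = c(6, 5, 0, 2),
                          time3  = c(9, 8, 9, 8),
                          time4  = c(3, 7, 3, 6))

# long format: one row per client per time point (as in Table 1.4)
depression_wide %>%
  pivot_longer(cols = starts_with("time"),
               names_to = "time",
               names_prefix = "time",
               values_to = "depression")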
Now let’s look at a slightly more complex example, where the advantage
of long format becomes clear. Suppose the depression measures were taken on
different days for different clients. Client 1 was measured on Monday, Tuesday,
Wednesday and Thursday, while client 2 was measured on Thursday, Friday,
Saturday and Sunday. If we were to put that information into a wide format table, it would look like Table 1.5, with missing values for measures on Monday through Wednesday for client 2, and missing values for measures on Friday through Sunday for client 1.
Table 1.5: Data matrix with depression levels in wide format.
client  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
1       5       6        9          3         NA      NA        NA
2       NA      NA       NA         9         5       8         7
Table 1.6 shows the same data in long format. The data frame is considerably
smaller. Imagine that we would also have weather data for the days these
patients were measured: whether it was cloudy or sunny, whether it rained or
not, and what the maximum temperature was. In long format, storing that
information is easy, see Table 1.7. Try and see if you can think of a way to store
that information in a wide table!
Table 1.7: Data matrix with depression levels in long format, including data on the time of measurement.
client depression day maxtemp rain
1 5 Monday 23 rain
1 6 Tuesday 24 no rain
1 9 Wednesday 23 rain
1 3 Thursday 25 no rain
2 9 Thursday 25 no rain
2 5 Friday 22 no rain
2 8 Saturday 21 rain
2 7 Sunday 22 no rain
Thus, storing data in long format is often more efficient in terms of storage of
information. Another reason for preferring long format over wide format is the
most practical one for data analysis: when analysing data using linear models,
software packages require your data to be in long format. In this book, all the
analyses with linear models require your data to be in long format. However,
we will also come across some analyses apart from linear models that require
your data to be in wide format. If your data happen to be in the wrong format,
rearrange your data first. Of course you should never do this by hand as this
will lead to typing errors and would take too much time. Statistical software
packages have helpful tools for rearranging your data from wide format to long
format, and vice versa.
library(tidyverse)
relig_income
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k~ 15 14 15 11 10 35 21
## 6 Evangel~ 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori~ 228 244 236 238 197 223 131
## 9 Jehovah~ 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin~ 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C~ 9 7 11 13 13 14 18
## 16 Other F~ 20 33 40 46 49 63 46
## 17 Other W~ 5 2 3 4 2 7 3
## 18 Unaffil~ 217 299 374 365 341 528 407
## # ... with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>
In this data set there are three variables:
1. religion, stored in the rows,
2. income, spread across the column names,
3. count, stored in the cell values.
To put the values that we see in the columns into one single column, we use pivot_longer():
relig_income %>%
pivot_longer(cols = -religion, # columns that need to be restructured
names_to = "income", # name of new variable with old column names
values_to = "count") # name of new variable with values
## # A tibble: 180 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don’t know/refused 96
## # ... with 170 more rows
The names_to argument gives the name of the variable that will be created using the column names, i.e. income.

The values_to argument gives the name of the variable that will be created from the data stored in the cells, i.e. count.
us_rent_income
## # A tibble: 104 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # ... with 94 more rows
Here both estimate and moe are variables (column names), so we can supply them to the function argument values_from to make new variables:
us_rent_income %>%
pivot_wider(names_from = variable,
values_from = c(estimate, moe))
## # A tibble: 52 x 6
## GEOID NAME estimate_income estimate_rent moe_income moe_rent
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alabama 24476 747 136 3
## 2 02 Alaska 32940 1200 508 13
## 3 04 Arizona 27517 972 148 4
## 4 05 Arkansas 23789 709 165 5
## 5 06 California 29454 1358 109 3
## 6 08 Colorado 32401 1125 109 5
## 7 09 Connecticut 35326 1123 195 5
## 8 10 Delaware 31560 1076 247 10
## 9 11 District of Columbia 43198 1424 681 17
## 10 12 Florida 25952 1077 70 3
## # ... with 42 more rows
The names_from argument gives the name of the variable that will be used for the new column names, i.e. variable.

The values_from argument gives the name(s) of the variable(s) that store the values that you wish to see spread out across several columns. Here we have two such variables, i.e. moe and estimate.
More about restructuring data with pivot_longer() and pivot_wider() can be found in the tidyr package vignette:

vignette("pivot")
measurement you use, the ratio of height for these individuals would always
be 2. Therefore, if we have a variable that measures height in meters, we are
dealing with a ratio variable.
Now let’s look at an example of an interval variable. Suppose we measure
the temperature in two classrooms: one is 10 degrees Celsius and the other is
20 degrees Celsius. The ratio of these two temperatures is 20/10 = 2, but does
that ratio convey meaningful information? Could we state for example that the
second classroom is twice as warm as the first classroom? The answer is no,
and the reason is simple: had we expressed temperature in Fahrenheit, we would
have gotten a very different ratio. Temperatures of 10 and 20 degrees Celsius
correspond to 50 and 68 degrees Fahrenheit, respectively. These Fahrenheit
temperatures have a ratio of 68/50 = 1.36. Based on the Fahrenheit metric, the second classroom would now be 1.36 times warmer than the first classroom. We therefore say that the ratio does not have a meaningful interpretation, since it depends on the metric system that you use (Fahrenheit or Celsius). It would be strange to say that there is twice as much warmth in the second classroom as in the first, but only when you measure temperature in Celsius and not when you measure it in Fahrenheit!
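A quick check of this in R (a small sketch, using the classroom temperatures from the example above):

celsius    <- c(10, 20)
fahrenheit <- celsius * 9/5 + 32   # 50 and 68 degrees Fahrenheit

celsius[2] / celsius[1]            # 2: the ratio in the Celsius metric
fahrenheit[2] / fahrenheit[1]      # 1.36: a different ratio in the Fahrenheit metric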
The reason that the ratios depend on the metric system is that both the Celsius and Fahrenheit metrics have arbitrary zero-points. In the Celsius
metric, 0 degrees does not mean that there is no warmth, nor is that implied in
the Fahrenheit metric. In both metrics, a value of 0 is still warmer than a value
of -1.
Contrast this with the example of height: a height of 0 is indeed the absence
of height, as you would not even be able to see a person with a height of 0,
whatever metric you would use. Thus, the difference between ratio and interval
variables is that ratio variables have a meaningful zero point where zero indicates
the absence of the quantity that is being measured. This meaningful zero-point
makes it possible to make meaningful statements about ratios (e.g., 4 is twice
as much as 2) which gives ratio variables their name.
What ratio and interval variables have in common is that they are both
numeric variables, expressing quantities in terms of units of measurements. This
implies that the distance between 1 and 2 is the same as the distances between
3 and 4, 4 and 5, etcetera. This distinguishes them from ordinal variables.
Similarly, for age, we could code a number of people as young, middle-aged
or old, but on the basis of such a variable we could not state by how much
two individuals differ in age. As opposed to numeric variables that are often
continuous, ordinal variables are usually discrete: there isn’t an infinite number
of levels of the variable. If we have sizes small, medium and large, there are no
meaningful other values in between these values.
Ordinal variables often involve subjective measurements. One example would
be having people rank five films by preference from one to five. A different
example would be having people assess pain: "On a scale from 1 to 10, how bad is the pain?"
Ordinal variables often look numeric. For example, you may have large,
medium and small T-shirts, but these values may end up in your data matrix as
’3’, ’2’ and ’1’, respectively. However, note that with a truly numeric variable
there should be a unit of measurement involved (3 of what? 2 of what?), and
that numeric implies that the distance between 3 and 2 is equal to the distance
between 2 and 1. Here you would not have that information: you only know
that a large T-shirt (coded as ’3’) is larger than a medium T-shirt (coded as
'2'), but how large that difference is, and whether it is the same as the difference between a medium T-shirt ('2') and a small T-shirt ('1'), you do not know. Therefore, even though we see numbers in our data
matrix, the variable is called an ordinal variable.
Categorical variables are not about quantity at all. Categorical variables are
about quality. They have values that describe 'what type' or 'which category' a unit belongs to. For example, a school could either be publicly funded or not,
or a person could either have the Swedish nationality or not. A variable that
indicates such a dichotomy between publicly funded ’yes’ or ’no’, or Swedish
nationality ’yes’ or ’no’, is called a dichotomous variable, and is a subtype of a
categorical variable. The other subtype of a categorical variable is a nominal
variable. Nominal comes from the Latin nomen, which means name. When you
name the nationality of a person, you have a nominal variable. Table 1.8 shows
an example of both a dichotomous variable (Swedish) that always has only two
different values, and a nominal variable (Nationality), that can have as many
different values as you want (usually more than two).
Another example of a nominal variable could be the answer to the question:
”name the colours of a number of pencils”. Nothing quantitative could be
stated about a bunch of pencils that are only assessed regarding their colour.
In addition, there is usually no logical order in the values of such variables,
something that we do see with ordinal variables.
Table 1.8: Nationalities.
ID Swedish Nationality
1 Yes Swedish
2 Yes Swedish
3 No Angolan
4 No Norwegian
5 Yes Swedish
6 Yes Swedish
7 No Danish
8 No Unknown
meaning of your variable and the objective of your data analysis project, and
only then take the most reasonable choice. Often, you can start with numerical
treatment, and if the analysis shows peculiar results2, you can choose categorical
treatment in secondary analyses.
In the coming chapters, we will come back to the important distinction
between categorical and numerical treatment (mostly in Chapter 6). For now,
remember that numeric variables are always treated as numeric variables, categorical
variables are always treated as categorical variables (even when they appear
numeric), and that for ordinal variables you have to think before you act.
## # A tibble: 5 x 4
## studentID course shirtsize grade
## <int> <chr> <chr> <dbl>
## 1 4132211 Chemistry medium 4
## 2 4132212 Physics small 6
## 3 4132213 Math large 3
## 4 4132214 Math medium 6
## 5 4132215 Chemistry small 8
We see that the variable studentID is stored as integer. That means that the
values are stored as numeric values. However, the values are quite meaningless: they are only used to identify persons. If we want to treat this variable as a
categorical variable in data analysis, it is necessary to change this variable into
a factor variable. We can do this by typing:
course_results$studentID <-
course_results$studentID %>%
factor()
When we look at this variable after the transformation, we see that this new
categorical variable has 5 different categories (levels).
2 For instance, you may find that the assumptions of your linear model are not met, see
Chapter 7.
course_results$studentID
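The corresponding transformations for course and shirtsize are not shown in this excerpt; a minimal sketch, assuming that shirtsize should become an ordered factor with levels small < medium < large, could look like this:

course_results$course <-
  course_results$course %>%
  factor()

course_results$shirtsize <-
  course_results$shirtsize %>%
  factor(levels = c("small", "medium", "large"),
         ordered = TRUE)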
The last variable grade is stored as double. Variables of this type will be
treated as numeric in data analyses. If we’re fine with that for this variable, we
leave it as it is. If we want the variable to be treated as ordinal, then we need
the same type of factor transformation as for shirtsize. For now, we leave it as
it is. The resulting data frame then looks like this:
course_results
## # A tibble: 5 x 4
## studentID course shirtsize grade
## <fct> <fct> <ord> <dbl>
## 1 4132211 Chemistry medium 4
## 2 4132212 Physics small 6
## 3 4132213 Math large 3
## 4 4132214 Math medium 6
## 5 4132215 Chemistry small 8
Now both studentID and course are stored as factors and will be treated
as categorical. Variable shirtsize is stored as an ordinal factor and will be
treated accordingly. Variable grade is still stored as double and will therefore
be treated as numeric.
Table 1.9: Frequency table for age, with proportions and cumulative
proportions.
age frequency proportion cum frequency cum proportion
0 2 0.002 2 0.002
1 7 0.007 9 0.009
2 20 0.020 29 0.029
3 50 0.050 79 0.079
4 105 0.105 184 0.184
5 113 0.113 297 0.297
6 159 0.159 456 0.456
7 150 0.150 606 0.606
8 124 0.124 730 0.730
9 108 0.108 838 0.838
10 70 0.070 908 0.908
11 34 0.034 942 0.942
12 32 0.032 974 0.974
13 14 0.014 988 0.988
14 9 0.009 997 0.997
15 2 0.002 999 0.999
17 1 0.001 1000 1.000
The data in the frequency table can also be represented using a frequency
plot. Figure 1.1 gives the same information, not in a table but in a graphical
way. On the horizontal axis we see several possible values for age in years,
and on the vertical axis we see the number of children (the count) that were
observed for each particular age. Both the frequency table and the frequency
plot tell us something about the distribution of age in this imaginary town with
1000 children. For example, both tell us that the oldest child is 17 years old.
Furthermore, we see that there are quite a lot of children with ages between 5
and 8, but not so many children with ages below 3 or above 14. The advantage
[Figure 1.1: frequency plot of age in years (count per age).]
of the table over the graph is that we can get the exact number of children of a
particular age very easily. But on the other hand, the graph makes it easier to
get a quick idea about the shape of the distribution, which is hard to make out
from the table.
Histograms are very convenient for continuous data, for instance if we have
values like 3.473, 2.154, etcetera. Or, more generally, for variables with values
that have very low frequencies. Suppose that we had measured age not in years
but in days. Then we could have had a data set of 1000 children where each and
every child had a unique value for age. In that case, the length of the frequency
table would be 1000 rows (each value observed only once) and the frequency
plot would be very flat. By using age measured in years, what we have actually
done is putting all children with an age less than 365 days into the first bin (age
0 years) and the children with an age of at least 365 but less than 730 days into
the second bin (age 1 year). And so on. Thus, if you happen to have data with
many many values with very low frequencies, consider binning the data, and
using a histogram to visualise the distribution of your numeric variable.
[Figure: histogram of age in years (count per age).]
When we have the frequency for each observed age, we can calculate the relative
frequency or proportion of children that have that particular age. For example,
when we look again at the frequencies in Table 1.9 we see that there are two
children who have age 0. Given that there are in total 1000 children, we know that the proportion of people with age 0 equals 2/1000 = 0.002. Thus, the
proportion is calculated by taking the frequency and dividing it by the total
number.
We can also compute cumulative frequencies. You get cumulative frequencies
by accumulating (summing) frequencies. For instance, the cumulative frequency
for the age of 3, is the frequency for age 3 plus all frequencies for younger ages.
Thus, the cumulative frequency of age 3 equals 50 + 20 (for age 2) + 7 (for age
1) + 2 (for age 0) = 79. The cumulative frequencies for all ages are presented
in Table 1.9.
We can also compute cumulative proportions: if we take for each age the
proportion of people who have that age or less, we get the fifth column in Table
1.9. For example, for age 2, we see that there are 20 children with an age of 2.
This corresponds to a proportion of 0.020 of all children. Furthermore, there
are 9 children who have an even younger age. The proportion of children with
an age of 1 equals 0.007, and the proportion of children with an age of 0 equals
0.002. Therefore, the proportion of all children with an age of 2 or less equals
0.020 + 0.007 + 0.002 = 0.029, which is called the cumulative proportion for the
age of 2.
1.9 Frequencies and proportions in R
The mtcars data set contains information about a number of cars: miles per
gallon (mpg), number of cylinders (cyl), etcetera.
mtcars
The function as_tibble() is available when you load the tidyverse package. From now on, we assume that you load the tidyverse package at the start of every R session.
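The conversion itself is not shown in this excerpt; a minimal sketch of such a conversion could be:

mtcars <- mtcars %>% as_tibble()   # store mtcars as a tibble for the examples below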
If we want to know how many cars belong to which category of number of
cylinders, we can use the function count():
mtcars %>%
count(cyl)
## # A tibble: 3 x 2
## cyl n
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
The new variable n is the frequency. We see that the value 4 occurs 11 times,
the value 6 occurs 7 times and the value 8 occurs 14 times. Thus, in this data
set there are 11 cars with 4 cylinders, 7 cars with 6 cylinders, and 14 cars with
8 cylinders.
We obtain proportions when we divide the frequencies by the total number
of cars (the sum of all the values in the n variable):
mtcars %>%
count(cyl) %>%
mutate(proportion = n/sum(n))
## # A tibble: 3 x 3
## cyl n proportion
## <dbl> <int> <dbl>
## 1 4 11 0.344
## 2 6 7 0.219
## 3 8 14 0.438
mtcars %>%
count(cyl) %>%
mutate(proportion = n/sum(n)) %>%
mutate(cumfreq = cumsum(n),
cumprop = cumsum(proportion))
## # A tibble: 3 x 5
## cyl n proportion cumfreq cumprop
## <dbl> <int> <dbl> <int> <dbl>
## 1 4 11 0.344 11 0.344
## 2 6 7 0.219 18 0.562
## 3 8 14 0.438 32 1
A frequency plot can be made using ggplot() combined with geom_line():
mtcars %>%
count(cyl) %>%
mutate(proportion = n/sum(n)) %>%
ggplot(aes(x = cyl, y = n)) +
geom_line()
mtcars %>%
ggplot(aes(x = mpg)) +
geom_histogram(breaks = seq(5, 40, 5))
It is wise to play around with the number of bins that you’d like to make,
or with the boundaries of the bins. Here we choose boundaries 5, 10, 15, . . . , 40.
largest value in the row (the age of the last child in the row).
4 Note that we could also choose to use 6, because 1 and 5 are lower than 6. Don’t worry,
the method that we show here to compute quartiles is only one way of doing it. In your life,
you might stumble upon alternative ways to determine quartiles. These are just arbitrary
agreements made by human beings. They can result in different outcomes when you have
small data sets, but usually not when you have large data sets.
[Figure 1.3: cumulative proportions for each observed value, with horizontal lines at 0.25, 0.50 and 0.75.]
The quartiles as defined here can also be found graphically, using cumulative
proportions. Figure 1.3 shows for each observed value the cumulative proportion.
It also shows where the cumulative proportions are equal to 0.25, 0.50 and 0.75.
We see that the 0.25 line intersects the other line at the value of 5. This is the
first quartile. The 0.50 line intersects the other line at a value of 7, and the 0.75
line intersects at a value of 10. The three quartiles are therefore 5, 7 and 10.
If you have a large data set, the graphical way is far easier than doing it by
hand. If we plot the cumulative proportions for the ages of the 1000 children,
we obtain Figure 1.4. We see a nice S-shaped curve. We also see that the
three horizontal quartile lines no longer intersect the curve at specific values, so
what do we do? By eye-balling we can find that the first quartile is somewhere
between 4 and 5. But which value should we give to the quartile? If we look at
the cumulative proportion for an age of 4, we see that its value is slightly below
the 0.25 point. Thus, the proportion of children with age 4 or younger is lower
than 0.25. This means that the child that happens to be the 250th cannot be 4
years old. If we look at the cumulative proportion of age 5, we see that its value
is slightly above 0.25. This means that the proportion of children that is 5 years
old or younger is slightly more than 0.25. Therefore, of the total of 1000 children, the 250th child must have age 5. Thus, by definition, the first quartile is 5. The second quartile is somewhere between 6 and 7, so by using the same reasoning as for the first quartile we know that the youngest 50% of the children are 7 years old or younger. The third quartile is somewhere between 8 and 9, and this tells us that the youngest 75% of the children are age 9 or younger. Thus, we can call 5, 7 and 9 our three quartiles.
Alternatively, we could also use the frequency table (Table 1.9). First, if we
want to have 25% of the children that are the youngest, and we know that we
have 1000 children in total, we should have 0.25 × 1000 = 250 children in the
[Figure 1.4: cumulative proportions for the ages of the 1000 children, with horizontal lines at 0.25, 0.50 and 0.75.]
first group. So if we were to put all the children in a row, ordered from youngest to oldest, we want to know the age of the 250th child.
In order to find the age of this 250th child, we look at Table 1.9 and see
that 29.7% of the children have an age of 5 or less (297 children), and 18.4% of
the children have an age of 4 or less (184 children). This tells us that, since 250
comes after 184, the 250th child must be older than 4, and because 250 comes
before 297, it must be younger than or equal to 5, hence the child is 5 years old.
Furthermore, if we want to find a cut-off age for the oldest 25%, we see from
the table, that 83.8% of the children (838 children) have an age of 9 or less, and
73.0% of the children (730) have an age of 8 or less. Therefore, the age of the
750th child (when ordered from youngest to oldest) must be 9.
What we just did for quartiles (i.e., 0.25, 0.50 and 0.75), we can do for any proportion between 0 and 1. We then no longer call them quartiles, but quantiles.
A quantile is the value below which a given proportion of observations in a group
of observations fall. From this table it is easy to see that a proportion of 0.606
of the children have an age of 7 or less. Thus, the 0.606 quantile is 7. One
often also sees percentiles. Percentiles are very much like quantiles, except that
they refer to percentages rather than proportions. Thus, the 20th percentile is
the same as the 0.20 quantile. And the 0.81 quantile is the same as the 81st
percentile.
The reason that quartiles, quantiles and percentiles are important is that
they are very short ways of saying something about a distribution. Remember
that the best way to represent a distribution is either a frequency table or a
frequency plot. However, since they can take up quite a lot of space sometimes,
one needs other ways to briefly summarise a distribution. Saying that "the third quartile is 454" is a condensed way of saying that "75% of the values are 454 or lower". In the next sections, we look at other ways of summarising
information about distributions.
Another way in which quantiles and percentiles are used is to say something
about individuals, relative to a group. Suppose a student has done a test and
she comes home saying she scored in the 76th percentile of her class. What does
that mean? Well, you don’t know her score exactly, but you do know that of
her classmates, 76 percent had the same score or lower. That means she did
pretty well, compared to the others, since only 24 percent had a higher score.
1.11 Quantiles in R
Obtaining quartiles, quantiles and percentiles can be done with the quantile()
function:
quantile(mtcars$mpg,
probs = c(0.25, 0.50, 0.75, 0.90))
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \qquad (1.1)$$
In words, in order to compute $\bar{Y}$, we take every value for variable Y from
i = 1 to i = n and sum them, and the result is divided by n. Suppose we have
5 Variables are symbolised by capitals, e.g., Y . Specific values of a variable are indicated
in lowercase, e.g., y.
variable Y with the values 6, -3, and 21, then the mean of Y equals:
$$\bar{Y} = \frac{\sum_i Y_i}{n} = \frac{Y_1 + Y_2 + Y_3}{n} = \frac{6 + (-3) + 21}{3} = \frac{24}{3} = 8 \qquad (1.2)$$
[Figure 1.5: histogram of income (count vs income).]
Table 1.10: Four series of values and their respective medians and means.
X1 X2 X3 median mean
4 5 8 5 5.7
4 5 80 5 29.7
4 5 800 5 269.7
4 5 8000 5 2669.7
The mode can also be determined for categorical variables. If we have the
observed values ’Dutch’, ’Danish’, ’Dutch’, and ’Chinese’, the mode is ’Dutch’
because that is the value that is observed most often.
If we look back at the distribution in Figure 1.5, we see that the peak of
the distribution is around the value of 19,000. However, whether this is the
mode, we cannot say. Because income is a more or less continuous variable,
every value observed in the Figure occurs only once: there is no value of income
with a frequency more than 1. So technically, there is no mode. However, if
we split the values into 20 bins, like we did for the histogram in Figure 1.5, we
see that the fifth bin has the highest frequency. In this bin there are values
between 17000 and 21000, so our mode could be around there. If we really want
a specific value, we could decide to take the average value in the fifth bin. There
are many other statistical tricks to find a value for the mode, where technically
there is none. The point is that for the mode, we’re looking for the value or the
range of values that are most frequent. Graphically, it is the value under the
peak of the distribution. Similar to the median, the mode is also quite stable: it
is not affected by extreme values and is therefore to be preferred over the mean
in the case of asymmetric distributions.
there would be no single middle value and we would have to average the M and L values, which would be impossible!
7 Unless you see one? But then it would not be a categorical value but an ordinal variable.
”Engineering”. Thus, for categorical variables, both dichotomous and nominal
variables, only the mode is a meaningful measure of central tendency.
As stated earlier, the appearance of a variable in a data matrix can be
quite misleading. Categorical variables and ordinal variables can often look like
numeric variables, which makes it very tempting to compute means and medians
where they are completely meaningless. Take a look at Table 1.11. It is entirely
possible to compute the average University, Size, or Programme, but it would
be utterly senseless to report these values.
It is entirely possible to compute the median University, Size, or Programme,
but it is only meaningful to report the median for the variable Size, as Size is
an ordinal variable. Reporting that the median size is equal to 2 is saying that
about half of the study programmes are of medium size or small, and about half of the study programmes are of medium size or large.
It is entirely possible to compute the mode for the variables University, Size,
or Programme, and it is always meaningful to report them. It is meaningful to
say that in your data there is no University that is observed more than others.
It is meaningful to report that most study programmes are of medium size, and
that most study programmes are study programme number 2 (don’t forget to
look up and write down which study programme that actually is!).
Table 1.11: Study programmes and their relative sizes (1=small, 2=medium,
3=large) for six different universities.
University Size Programme
1 1 2
2 3 2
3 2 3
4 2 3
5 3 4
6 2 1
mtcars %>%
summarise(mean_cyl = mean(cyl),
median_cyl = median(cyl))
## # A tibble: 1 x 2
## mean_cyl median_cyl
## <dbl> <dbl>
## 1 6.19 6
R does not have an in-built function to calculate modes. So we create our
own function getmode(). This function takes a vector as input and gives the
mode value as output.
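Its definition is not included in this excerpt; a minimal sketch that matches this description (returning the most frequently observed value) is:

getmode <- function(v) {
  uniqv <- unique(v)                           # the distinct values in v
  uniqv[which.max(tabulate(match(v, uniqv)))]  # the value with the highest frequency
}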
mtcars %>%
summarise(mode_cyl = getmode(cyl))
## # A tibble: 1 x 1
## mode_cyl
## <dbl>
## 1 8
20, 19, 20 and 454. Then the range is equal to 454 − 19 = 435. That’s a large
range, for a series of values that for the most part hardly differ from each other.
Instead of measuring the distance from the lowest to the highest value, we
could also measure the distance between the first and the third quartile: how
much does the third quartile deviate from the first quartile? This distance or
deviation is called the interquartile range (IQR) or the interquartile distance.
Suppose that we have a large number of systolic blood pressure measurements,
where 25% are 120 or lower, and 75% are 147 or lower, then the interquartile
range is equal to 147 − 120 = 27.
Thus, we can measure variation using the range or the interquartile range.
A third measure for variation is variance, and variance is based on the sum of
squares.
$$SS = (Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + (Y_3 - \bar{Y})^2 = (10 - 11)^2 + (11 - 11)^2 + (12 - 11)^2 = (-1)^2 + 0^2 + 1^2 = 2 \qquad (1.4)$$
Now let’s use some values that are more different from each other, but with
the same mean. Suppose you have the values 9, 11 and 13. The average value
is still 11, but the deviations from the mean are larger. The deviations from 11
are -2, 0 and +2. Taking the squares, you get (−2)2 = 4, 02 = 0 and (+2)2 = 4
and if you add them you get SS = 4 + 0 + 4 = 8.
Thus, the more the values differ from each other, the larger the deviations
from the mean. And the larger the deviations from the mean, the larger the
sum of squares. The sum of squares is therefore a nice measure of how much
values differ from each other.
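In R, the sums of squares from these two examples can be computed directly (a small sketch, not part of the original text):

y1 <- c(10, 11, 12)
sum((y1 - mean(y1))^2)   # 2

y2 <- c(9, 11, 13)
sum((y2 - mean(y2))^2)   # 8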
$$\text{Var}(Y) = \frac{SS}{n} = \frac{\sum_i (Y_i - \bar{Y})^2}{n} \qquad (1.6)$$
8 Online you will often find the formula $\frac{\sum_i (Y_i - \bar{Y})^2}{n-1}$. The difference is that here we are
talking about the definition of the variance of an observed variable Y , and that elsewhere one
talks about trying to figure out what the variance might be of all values of Y when we only see
a small portion of the values of Y . When we use all values of Y , we talk about the population
variance, denoted by σ 2 . When we only see a small part of the values of Y , we talk about a
sample of Y -values. We will come back to the distinction between population variance and
sample variance and why they differ in Chapter 2.
As an example, suppose you have the values 10, 11 and 12, then the average
value is 11. Then the deviations are -1, 0 and 1. If you square them you get
(−1)2 = 1, 02 = 0 and 12 = 1, and if you add these three values, you get
SS = 1 + 0 + 1 = 2. If you divide this by 3, you get the variance: 2/3. Put differently, if the squared deviations are 1, 0 and 1, then the average squared deviation (i.e., the variance) is (1 + 0 + 1)/3 = 2/3.
As another example, suppose you have the values 8, 10, 10 and 12, then
the average value is 10. Then the deviations from 10 are -2, 0, 0 and +2.
Taking the squares, you get 4, 0, 0 and 4 and if you add them you get SS = 8.
To get the variance, you divide this by 4: 8/4 = 2. Put differently, if the
squared deviations are 4, 0, 0 and 4, then the average squared deviation (i.e.,
the variance) is (4 + 0 + 0 + 4)/4 = 2.
Often we also see another measure of variation: the standard deviation. The
standard deviation is the square root of the variance and is therefore denoted
as σ:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\text{Var}(Y)} = \sqrt{\frac{\sum_i (Y_i - \bar{Y})^2}{n}} \qquad (1.7)$$
$$z = \frac{y - \bar{Y}}{\sigma} \qquad (1.8)$$
1.16 Variance, standard deviation, and standardisation
in R
The functions var() and sd() calculate the variance and standard deviation
for a variable, respectively.
mtcars %>%
summarise(var_mpg = var(mpg),
std_mpg = sd(mpg))
## # A tibble: 1 x 2
## var_mpg std_mpg
## <dbl> <dbl>
## 1 36.3 6.03
However, these functions use the formulas $\frac{\sum_i (Y_i - \bar{Y})^2}{n-1}$ and $\sqrt{\frac{\sum_i (Y_i - \bar{Y})^2}{n-1}}$, respectively. We will discuss this further in Chapter 2. If you want to use the formula $\frac{\sum_i (Y_i - \bar{Y})^2}{n}$, you need to write your own function that computes the sum of squares (SS) and divides by n.
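The definition of var_n() is not shown in this excerpt; a minimal sketch consistent with that description is:

var_n <- function(x) {
  # sum of squared deviations from the mean, divided by n (rather than n - 1)
  sum((x - mean(x))^2) / length(x)
}

With such a function in place, the computation looks like this: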
mtcars %>%
summarise(var_mpg = var_n(mpg),
std_mpg = sqrt(var_n(mpg))) # taking the square root
## # A tibble: 1 x 2
## var_mpg std_mpg
## <dbl> <dbl>
## 1 35.2 5.93
Note that you get different results. For large data sets (large n), the differences
will be negligible.
Standardised measures can be obtained using the scale() function:
mtcars %>%
mutate(z_mpg = scale(mpg)) %>%
select(mpg, z_mpg)
## # A tibble: 32 x 2
## mpg z_mpg[,1]
## <dbl> <dbl>
## 1 21 0.151
## 2 21 0.151
## 3 22.8 0.450
## 4 21.4 0.217
## 5 18.7 -0.231
## 6 18.1 -0.330
## 7 14.3 -0.961
## 8 24.4 0.715
## 9 22.8 0.450
## 10 19.2 -0.148
## # ... with 22 more rows
[Figures: histograms of wage (count vs wage).]
[Figure: density plot of wage.]
mtcars %>%
ggplot(aes(x = mpg)) +
geom_density()
$$f(x) = \frac{1}{\sqrt{2\pi \cdot 1000^2}}\, e^{-\frac{(x - 30000)^2}{2 \times 1000^2}} \qquad (1.9)$$
which you are allowed to forget immediately. It is only to illustrate that
distributions observed in the wild (empirical distributions) sometimes resemble
mathematical functions (theoretical distributions).
The density function of that distribution is plotted in Figure 1.9. Because
of its bell-shaped form, the normal distribution is sometimes informally called
’the bell curve’. The histogram in Figure 1.8 and the normal density function
in Figure 1.9 look so similar, they are practically indistinguishable.
Mathematicians have discovered many interesting things about the normal
distribution. If the distribution of a variable closely resembles the normal
Figure 1.9: The theoretical normal distribution with mean 30,000 and standard
deviation 1000.
distribution, you can infer many things. One thing we know about the normal
distribution is that the mean, mode and median are always the same. Another
thing we know from theory is that the inflexion points9 are one standard deviation
away from the mean. Figure 1.9 shows the two inflexion points. From theory we
also know that if a variable has a normal distribution, 68% of the observed values
lie between these two inflexion points. We also know that 5% of the observed
values lie more than 1.96 standard deviations away from the mean (2.5% on both
sides, see Figure 1.9). Theorists have constructed tables that make it easy to
see what proportion of values lies more than 1, 1.1, 1.2 . . . , 3.8, 3.9, . . . standard
deviations away from the mean. These tables are easy to find online or in books,
and these are fully integrated into statistical software like SPSS and R. Because
all these percentages are known for the number of standard deviations, it is
easier to talk about the standard normal distribution.
In such tables online or in books, you find information only about this
standard normal distribution. The standard normal distribution is a normal
distribution where all values have been standardised (see Section 1.15.3). When
values have been standardised, they automatically have a mean of 0 and a
standard deviation of 1. As we saw in Section 1.15.3, such standardised values
are obtained if you subtract the mean score from each value, and divide the
result by the standard deviation. A standardised value is usually denoted as
a z-score. Thus in formula form, a value Y = y is standardised by using the
following equation:
$$z = \frac{y - \bar{Y}}{\sigma} \qquad (1.10)$$
9 The inflexion point is where concave turns into convex, and vice versa. Mathematically,
the inflexion point can be found by equating the second derivative of a function to 0.
Table 1.12: Standardising scores.
Y mean Y minus mean Z
7.2 10.4 -3.2 -0.7
8.8 10.4 -1.5 -0.3
17.8 10.4 7.4 1.6
10.4 10.4 -0.0 -0.0
10.6 10.4 0.3 0.1
18.6 10.4 8.2 1.7
12.3 10.4 1.9 0.4
3.7 10.4 -6.7 -1.4
6.6 10.4 -3.8 -0.8
7.8 10.4 -2.6 -0.5
Table 1.12 shows an example set of values for Y that are standardised. The
mean of the Y -values turns out to be 10.38, and the standard deviation 4.77.
By subtracting the mean, we ensure that the average z-score becomes 0, and
by subsequently dividing by the standard deviation, we make sure that the
standard deviation of the z-scores becomes 1.
This standardisation makes it much easier to look up certain facts about
the normal distribution. For instance, if we go back to the normally distributed
wage values, we see that the average is 30,000, and the standard deviation is
1,000. Thus, if we take all wages, subtract 30,000 and divide by 1,000, we
get standardised wages with mean 0 and standard deviation 1. The result is
shown in Figure 1.10. We know that the inflexion points lie at one standard
deviation below and above the mean. The mean is 30,000, and the standard
deviation equals 1,000, so the inflexion points are at 30000 − 1000 = 29000 and
30000 + 1000 = 31000. Thus we know that 68% of the wages are between 29,000
and 31,000.
How do we know that 68% of the observations lie between the two inflexion
points? Similar to proportions and cumulative proportions, we can plot the
cumulative normal distribution. Figure 1.11 shows the cumulative proportions
curve for the normal distribution. Note that we no longer see dots because the
variable Z is continuous.
We know that the two inflexion points lie one standard deviation below and
above the mean. Thus, if we look at a z-value of 1, we see that the cumulative
probability equals about 0.84. This means that 84% of the z-values are lower
than 1. If we look at a z-value of -1, we see that the cumulative probability
equals about 0.16. This means that 16% of the z-values are lower than -1.
Therefore, if we want to know what percentage of the z-values lie between -1
and 1, we can calculate this by subtracting 0.16 from 0.84, which equals 0.68,
which corresponds to 68%.
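The same 68% can be verified in R with pnorm(), the cumulative distribution function of the normal distribution, which also appears later in this chapter (a small sketch):

pnorm(1) - pnorm(-1)   # proportion of z-values between -1 and 1, about 0.68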
All quantiles for the standard normal distribution can be looked up online10
or in Appendix A, but also using R. Table 1.13 gives a short list of quantiles.
10 See for example www.normaltable.com or www.mathsisfun.com/data/standard-normal-distribution-table.html
[Figure 1.10: density of the standardised wages (Z); 68 percent of the values lie between the two inflexion points.]
[Figure 1.11: cumulative proportions for the standard normal distribution, with horizontal lines at 0.16 and 0.84.]
From this table, you see that 1% of the z-values are lower than -2.33, and that 25% of the z-values are lower than -0.67. We also see that half of all the z-values are lower than 0.00, that 10% of the z-values are larger than 1.28, and that the 1% largest values are higher than 2.33.
Although tables are readily found online, it’s helpful to memorise the so-
called 68 – 95 – 99.7 rule, also called the empirical rule. It says that 68%
of normally distributed values are at most 1 standard deviation away from the
mean, 95% of the values are at most 2 standard deviations away (more precisely,
1.96), and 99.7% of the values are at most 3 standard deviations away. In other
words, 68% of standardised values are between -1 and +1, 95% of standardised
values are between -2 and +2 (-1.96 and +1.96), and 99.7% of standardised
values are between -3 and +3.
Thus, if we return to our wages with mean 30,000 and standard deviation
1,000, we know from Table 1.13 that 99% of the wages are below 30000 + 2.33 times the standard deviation, that is, 30000 + 2.33 × 1000 = 32330.
Let's return to the IQ example of Section 1.15.3. Suppose we have IQ
scores that are normally distributed with a mean of 100 and a standard deviation
of 15. What IQ score would be the 90th percentile? From Table 1.13 we see
that the 90th percentile is a z-value of 1.28. Thus, the 90th percentile for our IQ
scores lies 1.28 standard deviations above the mean (above because the z-value
is positive). The mean is 100 so we have to look at 1.28 standard deviations
above that. The standard deviation equals 15, so we have to look at an IQ score
of 100 + 1.28 × 15, which equals 119.2. This tells us that 90% of the IQ scores
are equal to or lower than 119.2.
As a last example, suppose we have a personality test that measures extraversion.
If we know that test scores are normally distributed with a mean of 18 and a
standard deviation of 2, what would be the 0.10 quantile? From Table 1.13
we see that the 0.10 quantile is a z-value of -1.28. This tells us that the 0.10
quantile for the personality scores lies at 1.28 standard deviations below the
mean. The mean is 18, so the 0.10 quantile for the personality scores lies at
1.28 standard deviations below 18. The standard deviation is 2, so this amounts
to 18 − 1.28 × 2 = 15.44. This tells us that 10% of the scores on this test are
15.44 or lower.
Such handy tables are also available for other theoretical distributions. Theoretical
distributions are at the core of many data analysis techniques, including linear
models. In this book, apart from the normal distribution, we will also encounter
other theoretical distributions: the t-distribution (Chapter 2), the F -distribution
(Chapter 6), the chi-square distribution (Chapters 2, 8, 12, 13 and 14) and the
Poisson distribution (Chapter 14).
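The values interpreted in the next paragraph come from an R chunk that is not shown in this excerpt; it presumably used qnorm(), the quantile function of the normal distribution, along these lines:

qnorm(c(0.05, 0.50, 0.95), mean = 100, sd = 15)

## [1]  75.3272 100.0000 124.6728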
This means that if you have a normal distribution with mean 100 and
standard deviation 15, 5% of the values are 75.3272 or less, 50% of the values
are 100 or less, and 95% of the values are 124.6728 or less.
If you want to know the cumulative proportion for a certain value of a
variable that is normally distributed, you can use pnorm():
pnorm(-1, mean = 0, sd = 1)
## [1] 0.1586553
29,400 and 30,800. The horizontal black line within the white box represents the second quartile (the median), so half of the workers earn less than 30,100.
A box plot also shows whiskers: two vertical lines sprouting from the white
box. There are several ways to draw these two whiskers. One way is to draw
the top whisker to the largest value (the maximum) and the bottom whisker to
the smallest value (the minimum). Another way, used in Figure 1.12, is to have the upper whisker extend from the third quartile to the largest observed value that is at most 1.5 times the interquartile range above the third quartile, and the lower whisker extend from the first quartile to the smallest observed value that is at most 1.5 times the interquartile range below the first quartile (the interquartile range is of course the height of the white box). The dots are outlying values, or simply called outliers: values that lie even further out. This is displayed in Figure
1.12. There you see first and third quartiles of 29,400 and 30,800, respectively, so
an interquartile range (IQR) of 30800 − 29400 = 1400. Multiplying this IQR by
1.5 we get 1.5 × 1400 = 2100. The whiskers therefore extend to 29400 − 2100 =
27300 and 30800 + 2100 = 32900.
Thus, the box plot is a quick way of visualising in what range the middle
half of the values are (the range in the white box), where most of the values are
(the range of the white box plus the whiskers), and where the extreme values
are (the outliers, individually plotted as dots). Note that the white box always
contains 50% of the values. The whiskers are only extensions of the box by a
factor of 1.5. In many cases you see that they contain most of the values, but
sometimes they miss a lot of values. You will see that when you notice a lot of
outliers.
mtcars %>%
ggplot(aes(x = "", y = mpg)) +
geom_boxplot() +
xlab("")
[Figure 1.12: box plot of wage.]
Table 1.14: A frequency table of nationalities.
nationality n
Chinese 10
Dutch 145
German 284
Indian 7
Indonesian 10
[Figure: bar chart of nationality counts (count per nationality).]
Ordinal variables are often visualised using bar charts. Figure 1.15 shows
the variation of the answers to a Likert questionnaire item, where Nairobi
inhabitants are asked ”To what degree do you agree with the statement that
the climate in Iceland is agreeable?”. With ordinal variables, make sure that
the labels are in the natural order.
[Figure: chart of the nationality data, with legend Chinese, Dutch, German, Indian, Indonesian.]
[Figure 1.15: bar chart of counts of answers to the Likert item, with categories ranging from completely disagree to completely agree.]
1.24 Visualising categorical and ordinal variables in R
mtcars %>%
mutate(cyl = factor(cyl, ordered = TRUE)) %>%
ggplot(aes(x = cyl)) +
geom_bar()
[Figure: bar chart of the number of cars per number of cylinders (count vs cyl).]
mtcars %>%
count(cyl) %>%
mutate(proportion = n/sum(n)) %>%
ggplot(aes(x = "",
y = proportion,
fill = factor(cyl))) +
geom_col(width = 1) +
coord_polar(theta = "y") +
xlab("") +
ylab("") +
theme_void() +
scale_fill_brewer(palette = "Blues") +
labs(fill = "Cylinders")
[Figure: pie chart of the proportion of cars per number of cylinders (legend: Cylinders 4, 6, 8).]
Cross-tables are a nice visualisation of how two categorical variables co-vary.
But what if one of the two variables is not a categorical variable?
The alternative for two variables where one is categorical and the other one
is numeric, is to create a box plot. Figure 1.16 shows a box plot of the pencil
data. A box plot gives a quick overview of the distribution of the pencils: one
distribution of the blue pencils, and one distribution of the red pencils. Let’s
have a look at the distribution of the blue pencils on the left side of the plot.
The white box represents the interquartile range (IQR), so that we know that
half of the blue pencils have a length between 4 and 9. The horizontal black
line within the white box represents the median (the middle value), so half of
the blue pencils are smaller than 4.85. The vertical lines are called whiskers.
These typically indicate where the data points are that lie at most 1.5 times
the IQR away from the median. For the blue pencils, we see no whisker on top
of the white box. That means that there are no data points that lie more than
1.5 times the IQR above the median of 4.85 (here the IQR equals 5.03). We see
a whisker on the bottom of the white box, to the lowest observed value of 2.7.
[Figure 1.16: box plots of pencil length for blue and red pencils (length by colour).]
This value is less than 1.5 times 5.03 = 7.545 away from the median of 4.85 so
it is included in the whisker. It is the lowest observed value for the blue pencils
so the whisker ends there.
From a box plot like this it is easy to spot differences in the distribution of a
quantitative measure for different levels of a qualitative measure. From Figure
1.16 we easily spot that the red pencils (varying between 2 and 6 cm) tend to
be shorter than the blue pencils (varying between 3 and 9 cm). Thus, in these
pencils, length and colour tend to co-vary: red pencils are often short and
blue pencils are often long.
[Figures: scatter plots of pencil length against weight.]
Table 1.17: Cross-tabulation of length (rows) and weight (columns) for twenty
pencils.
3.3 3.4 3.5 3.6 3.7 4
2 1 0 0 0 0 0
2.7 0 1 0 0 0 0
3.3 0 1 0 0 0 0
3.4 0 1 0 0 0 0
3.5 0 0 1 0 0 0
3.6 0 0 1 0 0 0
4.1 0 0 2 0 0 0
4.4 0 0 2 0 0 0
4.5 0 0 2 0 0 0
4.7 0 0 0 1 0 0
5.2 0 0 0 1 0 0
5.7 0 0 0 0 1 0
5.8 0 0 0 0 1 0
9 0 0 0 0 0 4
that you can easily overlook when only looking at the values. Cross-tables, box
plots and scatter plots are powerful tools to find regularities but also oddities
in your data that you’d otherwise miss. Some such patterns can be summarised
by straight lines, as we see in Figure 1.19. The remainder of this book focuses
on how we can use straight lines to summarise data, but also how to make
predictions for data that we have not seen yet.
mtcars %>%
ggplot(aes(x = wt, y = mpg)) +
geom_point()
A box plot for one categorical and one numeric variable can be made using geom_boxplot():
mtcars %>%
mutate(cyl = factor(cyl)) %>%
ggplot(aes(x = cyl, y = mpg)) +
geom_boxplot()
A cross table for two categorical variables can be made using table():
Figure 1.19: A scatterplot of length and weight, with a straight line that
summarises the relationship.
table(mtcars$cyl, mtcars$gear)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
Note that the number of cylinders (first-named variable) is in the rows (here
4, 6 and 8 cylinders), and the number of gears (second-named variable) is in the
columns (3, 4, and 5 gears).
numeric predictor variables (multiple regression). In Chapter 5 we will discuss
how you can draw conclusions about linear models for data that you have not
seen. For example, in the previous section we described the relationship between
weight and length of twenty pencils. The question that you may have is whether
this linear relationship also holds for all pencils of the same make, that is,
whether the same linear model holds for both the observed twenty pencils and
the total collection of pencils.
In Chapter 6 we will show how we can use straight lines to summarise
relationships with predictor variables that we want to treat as categorical.
Chapter 7 discusses when it is appropriate to use linear models to summarise
your data, and when it is not. It introduces methods that enable you to decide
whether to trust a linear model or not. Chapter 8 then discusses alternative
methods that you can use when linear models are not appropriate.
Chapter 9 focuses on moderation: how one predictor variable can affect the
effect that a second predictor variable has on the outcome variable.
Chapter ?? shows how you can make elaborate statements about differences
between groups of observations, in case one of the predictor variables is a
categorical variable.
Chapters 10 and 11 show how to deal with variables that are measured more
than once in the same unit of analysis (the same participant, the same pencil,
the same school, etc.). For example, you may measure the weight of a pencil
before and after you have made a drawing with it. Models that we use for
such data are called linear mixed models. As with linear models, linear mixed models are not appropriate for every data set. Therefore, Chapter 12
discusses alternative methods to study variables that are repeatedly measured
in the same research unit.
Chapters 13 and 14 discuss generalised linear models. These are models
where the outcome variable is not numeric and continuous. Chapter 13 discusses
a method that is appropriate when the outcome variable has only two values, say
”yes” and ”no”, or ”pass” and ”fail”. Chapter 14 discusses a method that can
be used when the outcome variable is a count variable and therefore discrete,
for example the number of children in a classroom, or the number of harvested
zucchini from one plant.
Chapter 15 discusses relatively new statistical methodology that is needed
when you have a lot of variables. In such cases, traditional inferential data
analysis as discussed in the previous chapters often fails.
Chapter 2
[Figure: the 48 measurements of the luteinising hormone (LH) level in IU/L, plotted against measure number.]
variance of 0.14. We therefore see that the mean based on only 10 elephants gives only a rough approximation of the mean of all elephants: the sample mean
gives a rough approximation of the population mean. Sometimes it is too low,
sometimes it is too high. The same is true for the variance: the variance based
on only 10 elephants is a rough approximation, or estimate, of the variance of
all elephants: sometimes it is too low, sometimes it is too high.
Table 2.1: Imaginary data on elephant height when 5 random samples (columns)
of 10 elephants (rows) are drawn from the population data.
1 2 3 4 5
1 3.77 2.52 3.26 3.61 3.16
2 3.61 3.41 3.09 3.33 2.74
3 3.12 2.91 3.14 3.22 3.91
4 2.95 3.20 2.85 3.40 3.60
5 2.53 3.45 2.69 3.20 3.19
6 3.12 3.11 3.45 2.31 2.94
7 3.31 3.22 2.98 3.65 4.39
8 2.59 3.76 2.81 2.20 3.24
9 2.91 3.44 3.63 3.12 3.21
10 3.36 2.84 4.15 2.73 2.75
mean 3.13 3.19 3.20 3.08 3.31
variance 0.14 0.12 0.18 0.23 0.24
Figure 2.2: A histogram of 10,000 sample means when the sample size equals
10.
Figure 2.3: A histogram of 10,000 sample variances when the sample size equals
10. The red line indicates the population variance. The blue line indicates the
mean of all variances observed in the 10,000 samples.
also the variance. Figure 2.3 shows the sampling distribution. The red line
shows the variance of the height in the population, and the blue line shows the
mean variance observed in the 10,000 samples. Clearly, the red and blue line do
not overlap: the mean variance in the samples is slightly lower than the actual
variance in the population. We say that the sample variance underestimates the
population variance a bit. Sometimes we get a sample variance that is lower
than the population value, sometimes we get a value that is higher than the
population value, but on average we are on the low side.
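You can see this underestimation for yourself with a small simulation sketch in R. It assumes a normally distributed population of elephant heights with mean 3.25 and variance 0.14 (the population values used in this chapter); the exact numbers will differ slightly from run to run.
set.seed(1)
S2 <- replicate(10000, {
  y <- rnorm(10, mean = 3.25, sd = sqrt(0.14))  # one random sample of 10 elephants
  sum((y - mean(y))^2) / length(y)              # the variance of the sample data
})
mean(S2)   # on average somewhat lower than the population variance of 0.14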
Overview
random sample: values that you observe when you randomly pick a
subset of the population
that on average, the values for the variances are too low.
Another thing we saw was that the distribution of the sample means looked
symmetrical and close to normal. If we look at the sampling distribution of the sample variance, we see that it is less symmetrical (see Figure 2.3). It actually has the shape of a so-called χ²-distribution (pronounced 'chi-square'), which will be discussed in Chapters 8, 12, 13 and 14. Let's see what happens when we do not
take samples with 10 elephants each time, but 100 elephants.
Stop and think: What will happen to the sampling distributions of the mean
and the variance? For instance, in what way will Figure 2.2 change when we
use 100 elephants instead of 10?
Figure 2.4 shows the sampling distribution of the sample mean. Again the
distribution looks normal, again the blue and red lines overlap. The only
difference with Figure 2.2 is the spread of the distribution: the values of the
sample means are now much closer to the population value of 3.25 than with
a sample size of 10. That means that if you use 100 elephants instead of 10
elephants to estimate the population mean, on average you get much closer to
the true value!
Now stop for a moment and think: is it logical that the sample means are
much closer to the population mean when you have 100 instead of 10 elephants?
Yes, of course it is, with 100 elephants you have much more information about
elephant heights than with 10 elephants. And if you have more information,
you can make a better approximation (estimation) of the population mean.
Figure 2.5 shows the sampling distribution of the sample variance. Compared
to a sample size of 10, the shape of the distribution now looks more symmetrical
and closer to normal. Second, similar to the distribution of the means, there is
much less variation in values: all values are now closer to the true value of 0.14.
And not only that: it also seems that the bias is less, in that the blue and the
red lines are closer to each other.
Here we see three phenomena. The first is that if you have a statistic like
a mean or a variance and you compute that statistic on the basis of randomly
picked sample data, the distribution of that statistic (i.e., the sampling distribution)
will generally look like a normal distribution if sample size is large enough.
It can actually be proven that the distribution of the mean will become a
normal distribution if sample size becomes large enough. This phenomenon is
known as the Central Limit Theorem. It is true for any population, no matter
what distribution it has.1 Thus, this means that height in elephants itself does
not have to be normally distributed, but the sampling distribution of the sample
mean will be normal for large sample sizes (e.g., 100 elephants).
The second phenomenon is that the sample mean is an unbiased estimator
1 This is true except for the case that you have fewer than 3 data points, and for a few special cases that you don't need to know about in this book.
of the population mean, but that the variance of the sample data is not an
unbiased estimator of the population variance. Let’s denote the variance of the
sample data as $S^2$. Remember from Chapter 1 that the formula for the variance is
$$S^2 = \text{Var}(Y) = \frac{\Sigma (y_i - \bar{y})^2}{n} \quad (2.1)$$
We saw that the bias was large for small sample sizes and small for larger sample sizes. So somehow we need to correct for sample size. It turns out that the correction is a multiplication by $\frac{n}{n-1}$:
$$s^2 = \frac{n}{n-1} S^2 \quad (2.2)$$
$$s^2 = \frac{\Sigma (y_i - \bar{y})^2}{n-1} \quad (2.3)$$
$$\hat{\sigma}^2 = s^2 = \frac{\Sigma (y_i - \bar{y})^2}{n-1} \quad (2.4)$$
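In R you can see the effect of this correction directly. Below is a small sketch using the heights in the first column of Table 2.1 (the printed values are rounded, so the results are only approximate):
y  <- c(3.77, 3.61, 3.12, 2.95, 2.53, 3.12, 3.31, 2.59, 2.91, 3.36)
n  <- length(y)
S2 <- sum((y - mean(y))^2) / n   # variance of the sample data, equation (2.1)
s2 <- (n / (n - 1)) * S2         # corrected estimate, equations (2.2) and (2.3)
var(y)                           # R's var() divides by n - 1, so it equals s2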
Figure 2.4: A histogram of 10,000 sample means when the sample size equals
100.
Figure 2.5: A histogram of 10,000 sample variances when the sample size equals
100.
Overview
Central Limit Theorem: says that the sampling distribution of the
sample mean will be normally distributed for infinitely large sample sizes.
estimator: a quantity that you compute based on sample data, that
you hope says something about a quantity in the population data. For
instance, you can use the sample mean and hope that it is close to the
population mean. You use the sample mean as an approximation of the
population mean.
estimate: the actual value that you get when computing an estimator. For instance, we can use the sample mean as the estimator of the population mean. The formula for the sample mean is $\frac{\Sigma y_i}{n}$, so this formula is the estimator, and the value it yields for a particular sample is the estimate.
case the standard error of the mean. It is a measure of how uncertain we are
about a population mean when we only have sample data to go on. Think about
this: why would we associate a large standard error with very little certainty?
In this case we have only 10 data points for each sample, and it turns out that
the standard error of the mean is a function of both the sample size n and the
population variance σ 2 .
$$\sigma_{\bar{y}} = \sqrt{\frac{\sigma^2}{n}} \quad (2.5)$$
Here, the population variance equals 0.14 and sample size equals 10, so $\sigma_{\bar{y}}$ equals $\sqrt{\frac{0.14}{10}} = 0.118$, close to our observed value. If we fill in the formula for a sample size of 100, we obtain a value of 0.037. This is a much smaller value for the spread, and this is indeed what we observe in Figure 2.4. Figure 2.6 shows the standard error of the mean for all sample sizes between 1 and 200.
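These standard errors are easy to compute in R. A small sketch (it assumes the tidyverse is loaded, as in Chapter 1) that also reproduces the curve shown in Figure 2.6:
sqrt(0.14 / 10)    # 0.118: the standard error for samples of 10 elephants
sqrt(0.14 / 100)   # 0.037: the standard error for samples of 100 elephants
tibble(n = 1:200, se = sqrt(0.14 / n)) %>%
  ggplot(aes(x = n, y = se)) +
  geom_line()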
In sum, the standard error of the mean is the standard deviation of the
sample means, and serves as a measure of the uncertainty about the population
mean. The larger the sample size, the smaller the standard error, the closer a
sample mean is expected to be around the population mean, the more certain
we can be about the population mean.
Similar to the standard error of the mean, we can compute the standard
error of the variance. This is more complicated – especially if the population
distribution is not normal – and we do not treat it here. Software can do
the computations for you, and later in this book you will see examples of the
standard error of the variance.
Summarising the above: when we have a population mean, we usually see
that the sample mean is close to it, especially for large sample sizes. If you do
not understand this yet, go back before you continue reading.
The larger the sample size, the closer the sample means are to the population mean. Turning this around: if you do not know the population mean, you can take a large sample, calculate the sample mean, and use that as a fairly good estimate of the population mean. This is useful for our problem of the
LH levels, where we have 48 measures. The mean of the 48 measurements could
be a good approximation of the mean LH level in general.
As an indication of how close you are to the population mean, the standard
error can be used. The standard error of the mean is the standard deviation of
the sampling distribution of the sample mean. The smaller the standard error,
the more confident you can be that your sample mean is close to the population
mean. In the next section, we look at this more closely. If we use our sample
mean as our best guess for the population mean, what would be a sensible range
of other possible values for the population mean, given the standard error?
Figure 2.6: Relationship between sample size and the standard error of the
mean, when the population variance equals 0.14.
Overview
standard error of the mean: the standard deviation of the distribution
of sample means (the sampling distribution of the sample mean). Says
something about how spread out the values of the sample means are. It
can be used to quantify the uncertainty about the population mean when
we only have the sample mean to go on.
standard error of the variance: the standard deviation of the sampling
distribution of the sample variance. Says something about how spread
out the values of the sample variances are. It can be used to quantify the
uncertainty about the population variance when we only have the variance
of the sample values to go on.
error, computed as a function of the population variance and sample size, in our case $\sqrt{\frac{0.14}{4}} = 0.19$. Now imagine that for a bunch of samples we compute the
sample means. We know that the means for large sample sizes will look more
or less like a normal distribution, but how about for a small sample size like
n = 4? If it would look like a normal distribution too, then we could use the
knowledge about the standard normal distribution to say something about the
distribution of the sample means.
For the moment, let’s assume the sample size is not 4, but 4000. From
the Central Limit Theorem we know that the distribution of sample means is
almost identical to a normal distribution, so let’s assume it is normal. From
the normal distribution, we know that 68% of the observations lies between 1
standard deviation below and 1 standard deviation above the mean (see Section
1.19 and Figure 1.9). If we would therefore standardise our sample means, we
could say something about their distribution given the standard error, since the
standard error is the standard deviation of the sampling distribution. Thus, if
the sampling distribution looks normal, then we know that 68% of the sample
means lies between one standard error below the population mean and one
standard error above the population mean.
So suppose we take a large number of samples from the population, compute
means and variances for each sample, so that we can compute standardised
scores. Remember from Chapter 1 that a standardised score is obtained by
subtracting the mean from an observed score and dividing by the standard deviation:
$$z_y = \frac{y - \bar{y}}{sd_y} \quad (2.6)$$
If we apply standardisation of the sample means, we get the following: for
a given sample mean ȳ we subtract the population mean µ and divide by the
standard deviation of the sample means (the standard error):
$$z_{\bar{y}} = \frac{\bar{y} - \mu}{\sigma_{\bar{y}}} \quad (2.7)$$
If we then have a bunch of standardised sample means, their distribution
should have a standard normal distribution with mean 0 and variance 1. We
know that for this standard normal distribution, 68% of the values lie between
-1 and +1, meaning that 68% of the values in a non-standardised situation lie
between -1 and +1 standard deviations from the mean (see Section 1.19). That
implies that 68% of the sample means lie between -1 and +1 standard deviations
(standard errors!) from the population mean. Thus, 68% of the sample means
lie between $-1 \times \sigma_{\bar{y}}$ and $+1 \times \sigma_{\bar{y}}$ from the population mean $\mu$. If we have sample size 4000, $\sigma_{\bar{y}}$ is equal to $\sqrt{\frac{0.14}{4000}} = 0.0059161$ and $\mu = 3.25$, so that 68% of the sample means lie between 3.2440839 and 3.2559161.
This means that we also know that 100 − 68 = 32% of the sample means lie farther away from the mean: it occurs in only 32% of the samples that a sample mean is smaller than 3.2440839 or larger than 3.2559161. Taking
this a bit further, since we know that 95% of the values in a standard normal
distribution lie between -1.96 and +1.96 (see Section 1.19), we know that it happens in only 5% of the samples that the sample mean is smaller than $3.25 - 1.96 \times \sqrt{\frac{0.14}{4000}} = 3.2384045$ or larger than $3.25 + 1.96 \times \sqrt{\frac{0.14}{4000}} = 3.2615955$. Another way of putting this is that it happens in 95% of the samples that a sample mean is at most $1.96 \times \sqrt{\frac{0.14}{4000}}$ away from the population mean 3.25.
This distance of 1.96 times the standard error is called the margin of error (MoE). Here we focus on the margin of error that is based on 95% of the observations in the normal distribution:
A 95% confidence interval contains 95% of the sample means had the population
mean been equal to the sample mean. Its construction is based on the estimated
sampling distribution of the sample mean.
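These numbers can be reproduced in R with qnorm(), which gives quantiles of the standard normal distribution. A small sketch, using the population values assumed above (mean 3.25, variance 0.14, sample size 4000):
se <- sqrt(0.14 / 4000)   # the standard error of the mean
3.25 + c(-1, 1) * se      # 68% of the sample means: about 3.244 and 3.256
moe <- qnorm(0.975) * se  # the margin of error: 1.96 standard errors
3.25 + c(-1, 1) * moe     # 95% of the sample means: about 3.238 and 3.262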
[Figure 2.7: two sampling distributions (density against height in m), with the margin of error (MoE) indicated.]
The idea is illustrated in Figure 2.7. There you see two sampling distributions:
one for if the population mean is 3.25 (blue) and one for if the population mean
is 3.26 (black). Both are normal distributions because sample size is large, and
both have the same standard error that can be estimated using the sample
variance. Whatever the true population mean, we can estimate the margin of
error that goes with 95% of the sampling distribution. We can then construct an interval that stretches one margin of error (i.e., 1.96 standard errors) on either side of any value. We can do that for the real population mean (in blue),
but the problem that we face in practice is that we don’t know the population
mean. We do know the sample mean, and if we centre the interval around that
value, we get what is called the 95% confidence interval. We see that it ranges
from 3.248 to 3.272. This we can use as a range of plausible values for the
unknown population mean. With some level of ’confidence’ we can say that the
population mean is somewhere in this interval.
Note that when we say that the 95% confidence interval runs from 3.248 to 3.272, we cannot say that we are 95% sure that the population mean is in there. 'Confidence' is not the same as probability. We'll talk about this in a later
section. First, we look at the situation where sample size is small so that we
cannot use the Central Limit Theorem.
2.6 The t-statistic
In the previous section, we constructed a 95% confidence interval based on the
standard normal distribution. We know from the standard normal distribution
that 95% of the values are between -1.96 and +1.96. We used the standard
normal distribution because the sampling distribution will look normal if sample
size is large. We took the example of a sample size of 4000, and then this
approach works fine, but remember that the actual sample size was 4. What if
sample size is not large? Let’s see what the sampling distribution looks like in
that case.
Remember from the previous section that we standardised the sample means:
$$z_{\bar{y}} = \frac{\bar{y} - \mu}{\sigma_{\bar{y}}}$$
and that zȳ has a standard normal distribution. But, this only works if we
have a good estimate of σȳ , the standard error. If sample size is limited, our
estimate is not perfect. You can probably imagine that if you take one sample
of 4 randomly selected elephants, you get one value for the estimated standard error ($\sqrt{\frac{s^2}{n}}$), and if you take another sample of 4 elephants, you get a slightly
different value for the estimated standard error. Because we do not always have
a good estimate for σȳ , the standardisation becomes a bit more tricky. Let’s
call the standardised sample mean t instead of z:
$$t_{\bar{y}_i} = \frac{\bar{y}_i - \mu}{\sqrt{\frac{s_i^2}{n}}}$$
Figure 2.8: Distribution of t with sample size 4, compared with the standard
normal distribution.
the distribution we get if we have sample size 4 and we compute $t_{\bar{y}_i} = \frac{\bar{y}_i - \mu}{\sqrt{s_i^2/n}}$ for many different samples.
When you compare the two distributions, you see that compared to the
normal curve, there are fewer observations around 0 for the t-distribution: the
density around 0 is lower for the red curve than for the blue curve. That’s
because there are more observations far away from 0: in the tails of the distributions,
you see a higher density for the red curve (t) than for the blue curve (normal).
They call this phenomenon ’heavy-tailed’: relatively more observations in the
tails than around the mean.
That the t-distribution is heavy-tailed has important implications. From
the standard normal distribution, we know that 5% of the observations lie more
than 1.96 away from the mean. But since there are relatively more observations
in the tails of the t-distribution, 5% of the values lie farther away from the
mean than 1.96. This is illustrated in Figure 2.9. If we want to construct a 95%
confidence interval, we can therefore no longer use the 1.96 value.
With this t-distribution, 95% of the observations lie between -3.18 and +3.18.
Of course, that is in the standardised situation. If we move back to our scale
of elephant heights with a sample mean of 3.26, we have to transform this back
to elephant heights. So -3.18 times the standard error away from the mean of 3.26 is equal to $3.26 - 3.18 \times \sqrt{\frac{0.15}{4}} = 2.6441956$, and +3.18 times the standard error away from the mean of 3.26 is equal to $3.26 + 3.18 \times \sqrt{\frac{0.15}{4}} = 3.8758044$.
So the 95% interval runs from 2.64 to 3.88. This interval is called the 95%
confidence interval, because 95% of the sample means will lie in this interval, if
the population mean would be 3.26.
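You can check these numbers in R with qt(), which gives quantiles of the t-distribution. A small sketch, using the sample variance of 0.15 from this example:
qt(0.975, df = 3)                                      # about 3.18
3.26 + c(-1, 1) * qt(0.975, df = 3) * sqrt(0.15 / 4)   # about 2.64 and 3.88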
Notice that the interval includes the population mean of 3.25. If we interpret this interval around 3.26 as containing plausible values for the population mean, we see that in this case this is a fair conclusion, because the true value of 3.25 is indeed inside it.
Figure 2.9: Distribution of t with sample size 4, compared with the standard
normal distribution. Shaded areas represent 2.5% of the respective distribution.
Then if we imagine that we take 100 random samples from this population
distribution, we can calculate 100 sample means and 100 sample variances. If
we then construct 100 confidence intervals around these 100 sample means, we
obtain the confidence intervals displayed in Figure 2.10. We see that 95 of these
intervals contain the value 3.25, and 5 of them don't: only in samples 1, 15, 20, 28 and 36 does the interval not contain 3.25.
It can be mathematically shown that given a certain population mean, when
taking many, many samples and constructing 95% confidence intervals, you can
expect 95% of them will contain that population mean. That does not mean
however that given a sample mean with a certain 95% interval, that interval
contains the population mean with a probability of 95%. It only means that were
this procedure of constructing confidence intervals to be repeated on numerous
samples, the fraction of calculated confidence intervals that contain the true
population mean would tend toward 95%. If you only do it once (you obtain a
sample mean and you calculate the 95% confidence interval) it either contains
the population mean or it doesn’t: you cannot calculate a probability for this.
In the statistical framework that we use in this book, one can only say something
about the probability of data occurring given some population values:
Given that the population value is 3.25, and if you take many, many independent
samples from the population, you can expect that 95% of the confidence intervals
constructed based on resulting sample means will contain that population value
of 3.25.
Using this insight, we therefore conclude that the fact that we see the value of 3.25 in our 95% confidence interval around 2.9 gives us some reason to believe
(’confidence’) that 3.25 could also be a plausible candidate for the population
mean.
Summarising, if we find a sample mean of say 2.9, we know that 2.9 is a
reasonable guess for the population mean (it’s an unbiased estimator). Moreover,
if we construct a 95% confidence interval around this sample mean, this interval
contains other plausible candidates for the population mean. However, it might
be possible that the true population mean is not included.
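This long-run behaviour can be mimicked with a small simulation sketch in R. It assumes a normal population with mean 3.25 and variance 0.14 and a sample size of 4, and uses t.test() to construct each interval:
set.seed(1)
covered <- replicate(10000, {
  y  <- rnorm(4, mean = 3.25, sd = sqrt(0.14))  # one random sample of 4 elephants
  ci <- t.test(y)$conf.int                      # its 95% confidence interval
  ci[1] < 3.25 & 3.25 < ci[2]                   # does it contain the population mean?
})
mean(covered)   # close to 0.95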
[Figure 2.10: the 95% confidence intervals of 100 random samples, one per row (sample on the vertical axis, x on the horizontal axis).]
Figure 2.11: Difference in the shapes of the standard normal distribution and
t-distributions with 1, 3 and 9 degrees of freedom.
values -1.96 and 1.96 for the normal distribution, but for the t-distribution we
have to use other values, depending on the degrees of freedom. We see that for 3
degrees of freedom, we have to use the values -3.18 and 3.18, and for 199 degrees
of freedom the values -1.97 and +1.97. This means that for a t-distribution with
3 degrees of freedom, 95% of the observations lie in the interval from -3.18 to
3.18. Similarly, for a t-distribution with 199 degrees of freedom, the values for
cumulative probabilities 0.025 and 0.975 are -1.97 and 1.97 respectively, so we
can conclude that 95% of the observations lie in the interval from -1.97 to 1.97.
Now instead of looking at 95% intervals for the t-distribution, let’s try to
construct a 90% confidence interval around an observed sample mean. With
a 90% confidence interval, 10% lies outside the interval. We can divide that
equally to 5% on the low side and 5% on the high side. We therefore have to
look at cumulative probabilities 0.05 and 0.95 in Table 2.2. The corresponding
quantiles for the normal distribution are -1.64 and 1.64, so we can say that for
the normal distribution, 90% of the values lie in the interval (-1.64, 1.64). For
a t-distribution with 9 degrees of freedom, we see that the corresponding values
are -1.83 and 1.83. Thus we conclude that with a t-distribution with 9 degrees
of freedom, 90% of the observed values lie in the interval (-1.83, 1.83).
However, now note that we are not interested in the values of the t-distribution,
but in likely values for the population mean. The standard normal and the
t-distribution are standardised distributions. In order to get values for the
confidence interval around the sample mean, we have to unstandardise the
values. The value of 1.83 above means ”1.83 standard errors away from the mean
(the sample mean)”. So suppose we find a sample mean of 3, with a standard
error of 0.5, then we say that a 90% confidence interval for the population mean
runs from 3 − 1.83 × 0.5 to 3 + 1.83 × 0.5, so from 2.085 to 3.915.
Constructing confidence intervals
1. Compute the sample mean $\bar{y}$.
2. Estimate the population variance: $s^2 = \frac{\Sigma_i (y_i - \bar{y})^2}{n-1}$.
3. Estimate the standard error: $\hat{\sigma}_{\bar{y}} = \sqrt{\frac{s^2}{n}}$.
4. Determine the degrees of freedom: $n - 1$.
5. Look up $t_{\frac{1-x}{2}}$. Take the t-distribution with the right number of degrees of freedom and look for the critical t-value for the confidence interval: if $x$ is the confidence level you want, then look for quantile $\frac{1-x}{2}$ and take its absolute value. That's your $t_{\frac{1-x}{2}}$.
6. Compute the margin of error: $\text{MoE} = t_{\frac{1-x}{2}} \times \hat{\sigma}_{\bar{y}}$.
7. Subtract the margin of error from, and add it to, the sample mean: $(\bar{y} - \text{MoE}, \bar{y} + \text{MoE})$.
Note that for a large number of degrees of freedom, the values are very close
to those of the standard normal.
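These steps can also be carried out in R, where qt() plays the role of the t-table. Below is a minimal sketch; the helper name ci_mean is made up for this illustration:
ci_mean <- function(y, level = 0.95) {
  n     <- length(y)
  ybar  <- mean(y)                           # step 1: the sample mean
  s2    <- sum((y - ybar)^2) / (n - 1)       # step 2: the estimated population variance
  se    <- sqrt(s2 / n)                      # step 3: the standard error
  df    <- n - 1                             # step 4: the degrees of freedom
  tcrit <- abs(qt((1 - level) / 2, df))      # step 5: the critical t-value
  moe   <- tcrit * se                        # step 6: the margin of error
  c(lower = ybar - moe, upper = ybar + moe)  # step 7: the confidence interval
}
ci_mean(mtcars$mpg, level = 0.99)  # the same interval that t.test() gives in the next section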
Table 2.2: Quantiles for the standard normal and several t-distributions.
probs norm t199 t99 t9 t5 t3
0.0005 -3.29 -3.34 -3.39 -4.78 -6.87 -12.92
0.0010 -3.09 -3.13 -3.17 -4.30 -5.89 -10.21
0.0050 -2.58 -2.60 -2.63 -3.25 -4.03 -5.84
0.0100 -2.33 -2.35 -2.36 -2.82 -3.36 -4.54
0.0250 -1.96 -1.97 -1.98 -2.26 -2.57 -3.18
0.0500 -1.64 -1.65 -1.66 -1.83 -2.02 -2.35
0.1000 -1.28 -1.29 -1.29 -1.38 -1.48 -1.64
0.9000 1.28 1.29 1.29 1.38 1.48 1.64
0.9500 1.64 1.65 1.66 1.83 2.02 2.35
0.9750 1.96 1.97 1.98 2.26 2.57 3.18
0.9900 2.33 2.35 2.36 2.82 3.36 4.54
0.9950 2.58 2.60 2.63 3.25 4.03 5.84
0.9990 3.09 3.13 3.17 4.30 5.89 10.21
0.9995 3.29 3.34 3.39 4.78 6.87 12.92
2.10 Obtaining a confidence interval for a population
mean in R
Suppose we have values on miles per gallon (mpg) in a sample of cars, and we
wish to construct a 99% confidence interval for the population mean. We can
do that in the following manner. We take all the mpg values from the mtcars
data set, and set our confidence level to 0.99 in the following manner:
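t.test(mtcars$mpg, conf.level = 0.99)   # 99% confidence interval for the mean of mpg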
It shows that the 99% confidence interval runs from 17.2 to 23.0. The
t.test() function does more than simply constructing confidence intervals.
That is the topic of the next section.
answer, then null-hypothesis testing might be a solution. With null-hypothesis
testing, a null-hypothesis is stated, after which you decide based on sample data
whether or not the evidence is strong enough to reject that null-hypothesis. In
our example, the null-hypothesis is that the South-African mean has the value
3.38 (the Tanzanian mean). We write that as follows:
$$H_0: \mu = 3.38$$
you want to know something about the population mean, the only information you need to
get from the sample data is the mean of the sample values. Knowing the exact values does
not give you extra information: the sample mean suffices. The proof for this is beyond this
book.
0.059. And the last bit is easy: the degrees of freedom is simply sample size
minus 1: 40 − 1 = 39.
We plot this sampling distribution of the sample mean in Figure 2.12. This
figure tells us that if the null-hypothesis is really true and that the South-African
mean height is 3.38, and we would take many different random samples of 40
elephants, we would see only sample means between 3.20 and 3.35. Other values
are in fact possible, but very unlikely. But how likely is our observed sample
mean of 3.27: do we feel that it is a likely value to find if the population mean
is 3.38, or is it rather unlikely?
What do you think? Think this over for a bit before you continue to read.
In fact, every unique value for a sample mean is rather unlikely. If the
population mean is 3.38, it will be very improbable that you will find a sample
mean of exactly 3.38, because by sheer chance it could also be 3.39, or 3.40 or
3.37. But relatively speaking, those values are all more likely to find than more
deviant values. The density curve tells you that values around 3.38 are more
likely than values around 3.27 or 3.50, because the density is higher around the
value of 3.38 than around those other values.
What to do?
The solution is to define regions for sample means where we think the sample
mean is no longer probable under the null-hypothesis, and a region where it is
probable enough to believe that the null-hypothesis could be true.
For example, we could define an acceptance region where 95% of the sample
means would fall if the null-hypothesis is true, and a rejection region where only
5% of the sample means would fall if the null-hypothesis is true. Let’s put the
rejection region in the tails of the distribution, where the most extreme values
can be found (farthest away from the mean). We put half of the rejection region
in the left tail and half of it in the right tail of the distribution, so that we have
two regions that each covers 2.5% of the sampling distribution. These regions
are displayed in Figure 2.13. The red ones are the rejection regions, and the
green one is the acceptance region (covering 95% of the area).
Why 5%, why not 10% or 1%? Good question. It is just something that is
accepted in a certain group of scientists. In the social and behavioural sciences,
researchers feel that 5% is a small enough chance. In contrast, in quantum
mechanics, researchers feel that 0.000057% is a small enough chance. Both
values are completely arbitrary. We’ll dive deeper into this arbitrary chance
level in a later section. For now, we continue to use 5%.
From Figure 2.13 we see that the sample mean that we found for your 40
South-African elephants (3.27) does not lie in the red rejection region. We
see that 3.27 lies well within the green section where we decide that sample
means are likely to occur when the population mean is 3.38. Because this is likely, we think that the null-hypothesis is plausible: if the population mean is 3.38, it is plausible to expect a sample mean of 3.27, because in 95% of random samples
we would see a sample mean between 3.255 and 3.500. The value 3.27 is a
very reasonable value and we therefore do not reject the null-hypothesis. We
conclude therefore that it could well be that both Tanzanian and South-African
elephants have the same average height of 3.38, that is, we do not have any
Figure 2.12: The sampling distribution under the null-hypothesis that the
South-African population mean is 3.38. The blue line represents the sample
mean for our observed sample mean of 3.27.
Figure 2.13: The sampling distribution under the null-hypothesis that the
South-African population mean is 3.38. The red area represents the range of
values for which the null-hypothesis is rejected (rejection region), the green
area represents the range of values for which the null-hypothesis is not rejected
(acceptance region).
population mean of 3.38 and a standard error of $\hat{\sigma}_{\bar{y}} = \sqrt{\frac{s^2}{n}} = \sqrt{\frac{0.14}{40}} = 0.059$, we obtain:
$$t = \frac{3.27 - 3.38}{0.059} = -1.864 \quad (2.10)$$
We can then look at a t-distribution of 40 − 1 = 39 degrees of freedom to see
how likely it is that we find such a t-score if the null-hypothesis is true. The t-
distribution with 39 degrees of freedom is depicted in Figure 2.14. Again we see
the population mean represented, now standardised to a t-score of 0 (why?), and
the observed sample mean, now standardised to a t-score of -1.864. As you can
see, this graph gives you the same information as the sampling distribution in
Figure 2.13. The advantage of using standardisation and using the t-distribution
is that we can now easily determine whether or not an observed sample mean
is somewhere in the red zone or in the green zone, without making a picture.
We have to find the point in the t-distribution where the red and green zones
meet. These points in the graph are called critical values. From Figure 2.14 we
can see that these critical values are around -2 and 2. But where exactly? This
information can be looked up in the t-tables that were discussed earlier in this
chapter. We plot such a table again in Table 2.3. A larger version is given in
Appendix B.
In such a table, you can look up the 2.5th percentile. That is, the value for
which 2.5% of the t-distribution is equal to or smaller. Because we are dealing with
a t-distribution with 39 degrees of freedom, we look in the column t39, and then
in the row with cumulative probability 0.025 (equal to 2.5%), we see a value of
-2.02. This is the critical value for the lower tail of the t-distribution. To find
the critical value for the upper tail of the distribution, we have to know how
much of the distribution is lower than the critical value. We know that 2.5% is
higher, so it must be the case that the rest of the distribution, 100−2.5 = 97.5%
is lower than that value. This is the same as a probability of 0.975. If we look
for the critical value in the table, we see that it is 2.02. Of course this is the
opposite of the other critical value, because the t-distribution is symmetrical.
Now that we know that the critical values are -2.02 and +2.02, we know
that for our standardised t-score of -1.864 we are still in the green area, so we
do not reject the null-hypothesis. We don’t need to draw the distribution any
more. For any value, we can directly compare it to the critical values. And not
only for this example of 40 elephants and a sample mean of 3.27, but for any
combination.
Suppose for example that we would have had a sample size of 10 elephants,
and we would have found a sample mean of 3.28 with a slightly different sample
variance, s2 = 0.15. If we want to test the null-hypothesis again that the
population mean is 3.38 based on these results, we would have to do the following
steps:
Null-hypothesis testing
1. Estimate the standard error: $\hat{\sigma}_{\bar{y}} = \sqrt{\frac{s^2}{n}}$.
2. Calculate the t-statistic: $t = \frac{\bar{y} - \mu}{\hat{\sigma}_{\bar{y}}}$, where $\mu$ is the population mean under the null-hypothesis.
3. Determine the degrees of freedom, n − 1.
4. Determine the critical values for lower and upper tail of the appropriate
t-distribution, using Appendix B.
5. If the t-statistic is between the two critical values, then we’re in the green,
we still believe the null-hypothesis is plausible.
6. If the t-statistic is not between the two critical values, we are in the red
zone and we reject the null-hypothesis.
1. Estimate the standard error: $\hat{\sigma}_{\bar{y}} = \sqrt{\frac{0.15}{10}} = 0.1224745$
2. Calculate the t-statistic: $t = \frac{3.28 - 3.38}{0.1224745} = -0.8164966$
3. Determine the degrees of freedom: sample size minus 1 equals 9
4. In Table 2.3 we look for the row with probability 0.025 and the column
for t9. We see a value of -2.26. The other critical value then must be 2.26.
5. The t-statistic of -0.8164966 lies between these two critical values, so these
sample data do not lead to a rejection of the null-hypothesis that the
population mean is 3.38. In other words, these data from 10 elephants do
not give us reason to doubt that the population mean is 3.38.
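A small sketch of these steps in R, using qt() instead of Table 2.3:
se     <- sqrt(0.15 / 10)      # step 1: the standard error, 0.1224745
t_stat <- (3.28 - 3.38) / se   # step 2: the t-statistic, -0.8164966
df     <- 10 - 1               # step 3: the degrees of freedom
qt(c(0.025, 0.975), df)        # step 4: the critical values, about -2.26 and 2.26
# step 5: the t-statistic lies between the critical values, so we do not reject the null-hypothesis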
Table 2.3: Quantiles for the standard normal and several t-distributions.
probs norm t199 t99 t47 t39 t9 t5 t3
0.0005 -3.29 -3.34 -3.39 -3.51 -3.56 -4.78 -6.87 -12.92
0.0010 -3.09 -3.13 -3.17 -3.27 -3.31 -4.30 -5.89 -10.21
0.0050 -2.58 -2.60 -2.63 -2.68 -2.71 -3.25 -4.03 -5.84
0.0100 -2.33 -2.35 -2.36 -2.41 -2.43 -2.82 -3.36 -4.54
0.0250 -1.96 -1.97 -1.98 -2.01 -2.02 -2.26 -2.57 -3.18
0.0500 -1.64 -1.65 -1.66 -1.68 -1.68 -1.83 -2.02 -2.35
0.1000 -1.28 -1.29 -1.29 -1.30 -1.30 -1.38 -1.48 -1.64
0.9000 1.28 1.29 1.29 1.30 1.30 1.38 1.48 1.64
0.9500 1.64 1.65 1.66 1.68 1.68 1.83 2.02 2.35
0.9750 1.96 1.97 1.98 2.01 2.02 2.26 2.57 3.18
0.9900 2.33 2.35 2.36 2.41 2.43 2.82 3.36 4.54
0.9950 2.58 2.60 2.63 2.68 2.71 3.25 4.03 5.84
0.9990 3.09 3.13 3.17 3.27 3.31 4.30 5.89 10.21
0.9995 3.29 3.34 3.39 3.51 3.56 4.78 6.87 12.92
Figure 2.15: Illustration of what a p-value is. The total blue area represents the
probability that under the null-hypothesis, you find a more extreme value than
the t-score or its opposite. The blue area covers a proportion of .217 + .217 =
0.434 of the t-distribution. This amounts to a p-value of .434.
probability that the t-score is more than 0.82 is also 0.217. The blue regions
together therefore represent the probability that you find a t-score of less than
-0.82 or more than 0.82, and that probability equals 0.217 + 0.217 = 0.434.
Therefore, the probability that you find a t-value of ±0.82 or more extreme
equals 0.434. This probability is called the p-value.
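In R this p-value can be computed with pt(), the cumulative distribution function of the t-distribution. A small sketch for a t-value of -0.82 with 9 degrees of freedom:
pt(-0.8164966, df = 9)       # probability of a t-value below -0.82, about 0.22
2 * pt(-0.8164966, df = 9)   # the two-sided p-value, about 0.43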
Why is this value useful?
Let’s imagine that we find a t-score of exactly equal to one of the critical
values. The critical value for a sample size of 10 animals related to a cumulative
proportion of 0.025 equals -2.26 (see Table 2.3). Based on this table, we know
that the probability of a t-value of -2.26 or lower equals 0.025. Because of
symmetry, we also know that the probability of a t-value of -2.26 or higher also
equals 0.025. This brings us to the conclusion that the probability of a t-score
of ±2.26 or more extreme, is equal to 0.025 + 0.025 = 0.05 = 5%. Thus, when
the t-score is equal to the critical value, then the p-value is equal to 5%. You
can imagine that if the t-score becomes more extreme than the critical value
the p-value will become less than 5%, and if the t-score becomes less extreme
(closer to 0), the p-value becomes larger.
In the previous section, we said that if a t-score is more extreme than one of
the critical values (that is, when it does not lie between them), then we reject
the null-hypothesis. Thus, a p-value of 5% or less means that we have a t-score
more extreme than the critical values, which in turn means we have to reject the
null-hypothesis. Thus, based on the computer output, we see that the p-value
is larger than 0.05, so we do not reject the null-hypothesis.
Overview
critical value: the minimum (or maximum) value that a t-score should
have to be in the red zone (the rejection region). If a t-value is more
extreme than a critical value, then the null-hypothesis is rejected. The
red zone is often chosen such that a t-score will be in that zone 5% of the
time, assuming that the null-hypothesis is true.
p-value: indicates the probability of finding a t-value equal or more
extreme than the one found, assuming that the null-hypothesis is true.
Often a p-value of 5% or smaller is used to support the conclusion that
the null-hypothesis is not tenable. This is equivalent to a rejection region
of 5% when using critical values.
Let’s apply this null-hypothesis testing to our luteinising hormone (LH) data.
Based on the medical literature, we know that LH levels for women in their
child-bearing years vary between 0.61 and 56.6 IU/L. Values vary during the
menstrual period. If values are lower than normal, this can be an indication that
the woman suffers from malnutrition, anorexia, stress or a pituitary disorder.
If the values are higher, this is an indication that the woman has gone through
menopause.
We’re going to use the LH data presented earlier in this chapter to make a
decision whether the woman has a healthy range of values for a woman in her
child-bearing years by testing the null-hypothesis that the mean LH level in this
woman is the same as the mean of LH levels in healthy non-menopausal women.
First we specify the null-hypothesis. Suppose we know that the mean LH
level in this woman should be equal to 2.54, given her age and given the timing
of her menstrual cycle. Thus our null-hypothesis is that the mean LH in our
particular woman is equal to 2.54:
H0 : µ = 2.54 (2.11)
Next, we look at our sample mean and see whether this is a likely or unlikely
value to find under this null-hypothesis. The sample mean is 2.40. To know
whether this is a likely value to find, we have to know the standard error
of the sampling distribution, and we can estimate this by using the sample
variance. The sample variance is $s^2 = 0.3042553$ and we had 48 measures, so we estimate the standard error as $\hat{\sigma}_{\bar{y}} = \sqrt{\frac{s^2}{n}} = \sqrt{\frac{0.3042553}{48}} = 0.08$. We then apply
standardisation to get a t-value:
$$t = \frac{2.40 - 2.54}{0.08} = -1.75 \quad (2.12)$$
Next, we look up in a table whether this t-value is extreme enough to be
considered unlikely under the null-hypothesis. In Table 2.3, we see that for 47
degrees of freedom, the critical value for the 0.025 quantile equals -2.01. For the
0.975 quantile it is 2.01. Our observed t-value of -1.75 lies within this range.
This means that a sample mean of 2.40 is likely to be found when the population
mean is 2.54, so we do not reject the null-hypothesis. We conclude that the LH
levels are healthy for a woman her age.
We can do the null-hypothesis testing also with a computer. Let’s analyse
the data in R and do the computations with the following code. First we load
the LH data:
data(lh)
t.test(lh, mu = 2.54)
##
## One Sample t-test
##
## data: lh
## t = -1.7584, df = 47, p-value = 0.08518
## alternative hypothesis: true mean is not equal to 2.54
## 95 percent confidence interval:
## 2.239834 2.560166
## sample estimates:
## mean of x
## 2.4
In the output we see that the t-value is equal to -1.7584, similar to our -1.75.
We see that the number of degrees of freedom is 47 (n − 1) and that the p-value
equals 0.08518. This p-value is larger than 0.05, so we do not reject the null-
hypothesis that the mean LH level in this woman equals 2.54. Her LH level is
healthy.
$$H_0: \mu = 2.54 \quad (2.13)$$
$$H_A: \mu \neq 2.54 \quad (2.14)$$
This kind of null-hypothesis testing is called two-sided or two-tailed testing:
we look at two critical values, and if the computed t-score is outside this
range (i.e., somewhere in the two tails of the distribution), we reject the null-
hypothesis.
The alternative to two-sided testing is one-sided or one-tailed testing. Sometimes
before an analysis you already have an idea of what direction the data will
go. For instance, imagine a zoo where they have held elephants for years.
These elephants always were of Tanzanian origin, with a mean height of 3.38.
Lately however, the manager observes that the opening that connects the indoor
housing with the outdoor housing gets increasingly damaged. Since the zoo
recently acquired 4 new elephants of South-African origin, the manager wonders
whether South-African elephants are on average taller than the Tanzanian elephants.
To figure out whether South-African elephants are on average taller than the
Tanzanian average of 3.38 or not, the manager decides to apply null-hypothesis
testing. She has two hypotheses: null-hypothesis H0 and alternative hypothesis
$H_A$:
$$H_0: \mu_{SA} = 3.38 \quad (2.15)$$
$$H_A: \mu_{SA} > 3.38 \quad (2.16)$$
This set of hypotheses leaves out one option: the South-African mean might
be lower than the Tanzanian one. Therefore, one often writes the set of hypotheses
like this:
$$H_0: \mu_{SA} \leq 3.38 \quad (2.17)$$
$$H_A: \mu_{SA} > 3.38 \quad (2.18)$$
She next tests the null-hypothesis, more specifically the one where µSA =
3.38. From the damaged doorway she expects the sample mean to be higher
than 3.38, but is it high enough to serve as evidence that the population mean
is also higher than 3.38? She decides that when the sample mean is in the
rejection zone in the right tail of the sampling distribution, then she will decide
that the null-hypothesis is not true, but that the alternative hypothesis must
be true. This is illustrated in Figure 2.16.
It shows the sampling distribution if we happen to have 4 new South-African
elephants, with a sample mean of 3.45 and a standard error of 0.059. In red,
we see the rejection region: if the sample mean happens to be in that zone we
decide to reject the null-hypothesis. Similar to two-tailed testing, we decide
that an area of 5% is small enough to suggest that the null-hypothesis is not
true. Note that in two-tailed testing, this area of 5% was divided equally into
the upper tail and the lower tail of the distribution, but with one-tailed testing
we put it all in the tail where we expect to find the sample mean based on a
theory or a hunch.
In this sampling distribution, based on 3 degrees of freedom, we see that the
sample mean is not in the red zone – the rejection region – therefore we do not
reject the null-hypothesis. We conclude that based on this random sample of 4
elephants, there is no evidence to suggest that South-African elephants are on
average taller than Tanzanian elephants.
The same procedure can be done with standardisation. We compute the
t-statistic as
$$t = \frac{3.45 - 3.38}{0.059} = 1.19 \quad (2.19)$$
In Table 2.3 we have to look up where the red zone starts: that is for the 0.95
quantile, because below that value lies 95% (green zone) and above it 5% (the
red zone). We see that the 95th percentile for a t-distribution with 3 degrees
of freedom is equal to 2.35. Our t-value of 1.19 is less than that, so we do not
reject the null-hypothesis.
A third way is to compute a one-tailed p-value. This is illustrated in Figure
2.17. The one-tailed p-value for a t-statistic of 1.19 and 3 degrees of freedom
turns out to be 0.16. That is the proportion of the t-distribution that is blue.
That means that if the null-hypothesis is true, you will find a t-value of 1.19 or
larger in 16% of the cases. Because this proportion is more than 5%, we do not
reject the null-hypothesis.
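Both the critical value and this one-tailed p-value can be found in R; a small sketch:
qt(0.95, df = 3)                       # the critical value, about 2.35
pt(1.19, df = 3, lower.tail = FALSE)   # the one-tailed p-value, about 0.16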
Figure 2.16: The sampling distribution under the null-hypothesis that the
South-African population mean is 3.38. In one-tailed testing, the rejection
area is located in only one of the tails. The red area represents the range
of values for which the null-hypothesis is rejected (rejection region), the green
area represents the range of values for which the null-hypothesis is not rejected
(acceptance region).
Figure 2.17: The sampling distribution under the null-hypothesis that the
South-African population mean is 3.38. For one-tailed testing, the rejection
area is located in only one of the tails. The green area represents the probability
of seeing a t-value smaller than 1.19, the blue area represents the probability of
seeing a t-value larger than 1.19. The latter probability is the p-value.
Figure 2.18: One-tailed decision process for deciding whether the average LH
level in a woman is too low.
We decide beforehand that if a t-value is too far out in the left tail of the
distribution, the LH levels are too low. We again use 5% of the area of the
t-distribution. This decision process is illustrated in Figure 2.18 where we see a
critical t-value of -1.68 when we have 47 degrees of freedom (see Table 2.3).
We calculate our t-value and find -1.75, see section 2.13. We see that this
t-value is smaller than the critical value -1.68, so it is in the red rejection area.
This is the area that we use for the rejection of the null-hypothesis, so based on
these data we decide that the mean LH level in this woman is abnormally low.
Importantly, note that when we applied two-tailed hypothesis testing, we
decided to not reject the null-hypothesis, whereas here with one-tailed testing,
we decide to reject the null-hypothesis. All based on the same data, and the same
null-hypothesis. The difference lies in the choice of the alternative hypothesis.
When doing one-tailed testing, we put all of the critical region in only one tail
of the t-distribution. This way, it becomes easier to reject a null-hypothesis, if
the mean LH level is indeed lower than normal. However, it could also be easier
to make a mistake: if the mean LH level is in fact normal, we could make a
mistake in thinking that the sample mean is deviant, where it is actually not.
Making mistakes in inference is the topic of the next section.
It is generally advised to use two-tailed testing rather than one-tailed testing.
The reason is that in hypothesis testing, it is always the null-hypothesis that
is being used as the starting point: what would the sample means (or their
standardised versions: t-scores) look like if the null-hypothesis is true? Based
on a certain null-hypothesis, say population mean µ equals 2.54, sample means
could be as likely higher or lower than the population mean (since the sampling
distribution is symmetrical). Even if you suspect that µ is actually lower, based
on a very good theory, you would make it too easy for yourself to falsify the null-hypothesis by putting the rejection area only in the left tail of the distribution.
And what do you actually do if you find a sample mean that is in the far end
of the right tail? Do you still accept the null-hypothesis? That would not make
much sense. It is therefore better to just stick to the null-hypothesis, and see
whether the sample mean is far enough removed to reject the null-hypothesis.
If the sample mean is in the anticipated tail of the distribution, that supports
the theory you had, and if the sample mean is in the opposite tail, it does not
support the theory you had.
Compare one-tailed and two-tailed testing in R using the LH data. By
default, R applies two-tailed testing. R gives the following output:
t.test(lh, mu = 2.54)
##
## One Sample t-test
##
## data: lh
## t = -1.7584, df = 47, p-value = 0.08518
## alternative hypothesis: true mean is not equal to 2.54
## 95 percent confidence interval:
## 2.239834 2.560166
## sample estimates:
## mean of x
## 2.4
If you want one-tailed testing, where you expect that the mean LH level is lower than 2.54, you do that in the following manner (if you expect that the LH level will be higher than 2.54, you use "greater" instead of "less"):
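t.test(lh, mu = 2.54, alternative = "less")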
##
## One Sample t-test
##
## data: lh
## t = -1.7584, df = 47, p-value = 0.04259
## alternative hypothesis: true mean is less than 2.54
## 95 percent confidence interval:
## -Inf 2.533589
## sample estimates:
## mean of x
## 2.4
When you compare the p-values, you see that the p-value using one-tailed
testing is half the size of the p-value using two-tailed testing (0.04 vs 0.08).
Based on the previous sections, you should know why the p-value is halved!
In the second output, using a critical p-value of 5% you would reject the null-
hypothesis, whereas in the first output, you would not reject the null-hypothesis.
Using one-tailed testing could lead to a big mistake: thinking that the sample
mean is deviant enough to reject the null-hypothesis, while the null-hypothesis
is actually true. We delve deeper into such mistakes in the next section.
error rate is 5%. It is a conditional probability. Conditional probabilities are
probabilities that start from some given information. In this case, the given
information is that the null-hypothesis is true: given that the null-hypothesis is
true, it is the probability that we reject the null-hypothesis. Because we do not
like to make mistakes, we want to have the probability of a mistake as low as
possible.
In the social and behavioural sciences, one thinks that a probability of 5%
is low enough to take the risk of making the wrong decision. As stated earlier,
in quantum mechanics one is even more careful, using a probability of 0.000057
%. So why don’t we also use a much lower probability of making a type I error?
The answer is that we do not want to make another type of mistake: a type II
error. A type II error is the mistake that we make when we do not reject the
null-hypothesis, while it is not true. Taking the example of the elephants again,
suppose that the population mean is not equal to 3.38, but the t-score is not in
the rejection area, so we believe that the population mean is 3.38. This is then
the wrong decision. The type of mistake we then make is a type II error.
Let’s take this example further. Suppose we have a two-tailed decision
process, where we compare two hypotheses about South-African elephants:
either their mean height is equal to 3.38 (H0 ), or it is not (HA ). We compute
the t-statistic and determine the critical values based on 5% area in the tails of
the t-distribution. This means that we allow ourselves to make a mistake in 5%
of the cases: the probability that we find a t-score in one of the 2.5% tails equals
2.5% + 2.5% = 5%. This is the probability of a type I error. Note that we chose
this value deliberately. This 5% we call α (’alpha’): it is the relative frequency
we allow ourselves to make a type I error. We say then that our α is fixed to
0.05, or 5%. This means that if the null-hypothesis is true, the probability that
the t-statistic will be in the tails will be 5%.
Then what is the probability of a type II error? A type II error is based on
the premise that the alternative hypothesis is true. That alternative hypothesis
states that the population mean is not equal to 3.38. Given that, what is the
probability that we do not reject the null-hypothesis?
This is impossible to compute, because the alternative hypothesis is very
vaguely stated: it could be anything, as long as it is not 3.38. Let’s make it a
bit easier and state that the alternative hypothesis states that the population
mean equals 3.42.
If the population mean height is equal to 3.42, what would sample means
look like? That’s easy, that is the sampling distribution of the sample mean.
The mean of that sampling distribution would be 3.42. This is illustrated in
Figure 2.19. The left curve is the sampling distribution for a population mean
of 3.38. The red area represents the probability of a type I error. The right
curve is the sampling distribution for a population mean of 3.42. The blue area
Figure 2.19: Two sampling distributions, one for a population mean of 3.38
(null-hypothesis) and one for a population mean of 3.42 (alternative hypothesis).
The red areas represent the probability of a type I error, the dark green area
the probability of a type II error. The blue areas represent the probability of
making the (correct) decision that the null-hypothesis is not true when it is
indeed not true.
Figure 2.20: Two sampling distributions, one for a population mean of 3.38
(null-hypothesis) and one for a population mean of 3.42 (alternative hypothesis).
The red areas represent the probability of a type I error, the dark green area
the probability of a type II error. The blue area represents the probability of
making the (correct) decision that the null-hypothesis is not true when it is
indeed not true.
both tails (a two-tailed null-hypothesis test). You immediately see that the blue
areas have also become smaller, and that by consequence the dark green area
becomes larger: the probability of a type II error.
Thus, the α should be chosen wisely: if it is too large, you run a high risk of
a type I error. But if it is too low, you run a high risk of a type II error. Let’s
think about this in the context of our luteinising hormone problem.
We saw that if the LH level is not normal, this is an indication of malnutrition
or a disease and the patient should have further checks to see what the problem
is. But if the LH level is normal or above, there is no disease and no further
checks are required. Again we take the null-hypothesis that the mean LH level
in this woman equals 2.54. What would be a type I error this case, and what
would be type II error?
The type I error is the mistake of rejecting the null-hypothesis while it is in
fact true. Thus, the woman’s mean LH level is 2.54, but by coincidence, the
mean of the 48 measurements that we have turns out to be in the rejection area
of the sampling distribution. If this happens we make the mistake that we do
a lot of tests with this woman to find out what’s wrong with her, while in fact
she is perfectly healthy! How bad would such a mistake be? It would certainly
lead to extra costs, but also a lot of the woman’s time. She would also probably
start worrying that something is wrong with her. So we definitely don’t want
this to happen. We can minimize the risk of a type I error by choosing a low α.
The type II error is the mistake of not rejecting the null-hypothesis while it
is in fact not true. Thus, the woman’s mean LH level is lower than 2.54, but by
coincidence, the sample mean of the 48 measurements that we have turns out
to be in the acceptance area of the sampling distribution. This means that the
woman’s LH level does not seem to be abnormal, and the woman is sent home.
How bad would such a mistake be? Well, pretty bad because the woman’s
hormone level is not normal, but everybody thinks that she is OK. She could
be very ill but nothing is found in further tests, because there are no further
tests. So we definitely don’t want this to happen. We can minimize the risk of
a type II error by choosing a higher α.
So here we have a conflict, and we have to make a balanced choice for α:
too low we run the risk of type II errors, too high we run the risk of a type I
error. Then you have to decide what is worse: a type I mistake or a type II
mistake. In this case, you could say that sending the woman home while she is
ill, is worse than spending money on tests that are actually not needed. Then
you would choose a rather high α, say 10%. That means that if you have several
women who are in fact healthy, 10% of them would receive extra testing. This
is a fairly high percentage, but you are more sure that women with an illness
will be detected and receive proper care.
But if you think it is most important that you don’t spend too much money
and that you don’t want women to start worrying when it is not needed, you
can pick a low α like 1%: then when you have a lot of healthy women, only 1%
of them will receive unnecessary testing.
Overview
Type I error: the mistake of rejecting the null-hypothesis, while it is
true
Type II error: the mistake of not rejecting the null-hypothesis, while it
is not true
α: the relative frequency we allow ourselves to make a type I error
Chapter 3
Inference about a proportion
Figure 3.1: Sampling distribution of the sample proportion, when the population
proportion is 0.60
Suppose we want to know the probability of selecting exactly 2 tall elephants when we randomly and sequentially pick 4 elephants. There are in
fact 6 different ways of randomly selecting 4 elephants where only 2 are tall.
When we use A to denote a tall elephant and B to denote a short elephant,
the 6 possible combinations of having two As and two Bs are in fact: AABB,
BBAA, ABAB, BABA, ABBA, and BAAB.
This number of combinations is calculated using the binomial coefficient:
\[ \binom{4}{2} = \frac{4!}{2!\,2!} = 6 \qquad (3.1) \]
This number $\binom{4}{2}$ ('four choose two') is called the binomial coefficient. It can
be calculated using factorials: the exclamation mark ! stands for factorial. For
instance, 5! (’five factorial’) means 5 × 4 × 3 × 2 × 1.
In its general form, the binomial coefficient looks like:
\[ \binom{n}{r} = \frac{n!}{r!\,(n-r)!} \qquad (3.2) \]
So suppose sample size n is equal to 4 and r equal to 2 (the number of tall
elephants in the sample), we get:
\[ \binom{4}{2} = \frac{4!}{2!\,(4-2)!} = \frac{4!}{2!\,2!} = \frac{4 \times 3 \times 2 \times 1}{2 \times 1 \times 2 \times 1} = 6 \qquad (3.3) \]
The probability of finding exactly 2 tall elephants (A) in a sample of 4, when the population proportion of tall elephants is 0.6, is then:

\[ p(\#A = 2 \mid n = 4, p = 0.6) = \binom{4}{2} \times 0.6^2 \times (1-0.6)^2 = 6 \times 0.0576 = 0.3456 \qquad (3.4) \]
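These quantities are easy to verify in R; a minimal sketch (not part of the original example code):

choose(4, 2)                                  # the binomial coefficient 'four choose two': 6
factorial(4) / (factorial(2) * factorial(2))  # the same value computed via factorials
dbinom(2, size = 4, prob = 0.6)               # probability of exactly 2 tall elephants: 0.3456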
[Figure: the binomial distribution of r, the number of tall elephants in a sample of 4, with the probability of each value of r.]
Table 3.1: The six possible orderings of 2 tall elephants (A) and 2 short elephants (B), together with the probability of each ordering.

ordering   computation of probability   probability
AABB       0.6 x 0.6 x 0.4 x 0.4        0.0576
ABAB       0.6 x 0.4 x 0.6 x 0.4        0.0576
ABBA       0.6 x 0.4 x 0.4 x 0.6        0.0576
BAAB       0.4 x 0.6 x 0.6 x 0.4        0.0576
BABA       0.4 x 0.6 x 0.4 x 0.6        0.0576
BBAA       0.4 x 0.4 x 0.6 x 0.6        0.0576
Overview
sampling distribution of the sample proportion: the distribution
of proportions that you get when you randomly pick new samples from a
population and for each sample compute the proportion.
\[ H_0: p = 0.60 \qquad (3.7) \]
\[ H_A: p \neq 0.60 \qquad (3.8) \]
Is the proportion of 0.84 that we observe in the sample (the zoo) a probable value to find if the proportion in the population of all Tanzanian elephants is equal to 0.60? If this is the case, we do not reject the null-hypothesis, and believe that the zoo data could have been randomly selected from the Tanzanian population and are therefore representative. However, if the proportion of 0.84 is very improbable given that the population proportion is 0.60, we reject the null-hypothesis and believe that the data are not representative.
With null-hypothesis testing we always have to fix our α first: the probability
with which we are willing to accept a type I error. We feel it is really important
that the sample is representative of the population, so we definitely do not want
to make the mistake that we think the sample is representative (not rejecting
the null-hypothesis) while it isn’t (HA is true). This would be a type II error
(check this for yourself!). If we want to minimise the probability of a type II
error (β), we have to pick a relatively high α (see Chapter 2), so let’s choose
our α = .10.
Next, we have to choose a test statistic and determine critical values for it
that go with an α of .10. Because we have a relatively large sample size of 50,
we assume that the sampling distribution for a proportion of 0.60 is normal.
From the standard normal distribution, we know that 90% (1 − α!) of the
values lie between −1.6448536 and 1.6448536 (see Table 2.3). If we therefore
standardise our proportion, we have a measure that should show a standard
normal distribution:
\[ z_p = \frac{p_s - p_0}{sd} \qquad (3.9) \]
where zp is the z-score for a proportion, ps is the sample proportion, p0 is
the population proportion assuming H0 , and sd is the standard deviation of the
sampling distribution, which is the standard error. Note that we should take
the standard error that we get when the null-hypothesis is true. We then get

\[ z_p = \frac{0.84 - 0.60}{\sqrt{0.60 \times 0.40 / 50}} = \frac{0.24}{0.069} = 3.46 \]
90% of the values in any normal distribution lie between ±1.64 standard
deviations away from the mean (see Table 2.3). Here we see a z-score that
exceeds these critical values, and we therefore reject the null-hypothesis. We
conclude that the proportion of tall elephants observed in the sample is larger
than to be expected under the assumption that the population proportion is
0.6. We decide that the zoo data are not representative of the population data.
The decision process is illustrated in Figure 3.4.
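As a sketch, this computation can be done directly in R (the standard error under the null-hypothesis is taken to be $\sqrt{p_0(1-p_0)/n}$, as above):

p_s <- 42 / 50                    # sample proportion: 0.84
p_0 <- 0.60                       # population proportion under the null-hypothesis
se <- sqrt(p_0 * (1 - p_0) / 50)  # standard error under the null-hypothesis
(p_s - p_0) / se                  # z-score: about 3.46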
Figure 3.4: A normal distribution to test the null-hypothesis that the population
proportion is 0.6. The blue line represents the z-score for our observed sample
proportion of 0.84.
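The output below comes from R's exact binomial test; the call that produced it was presumably along these lines:

binom.test(x = 42, n = 50, p = 0.6)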
##
## Exact binomial test
##
## data: 42 and 50
## number of successes = 42, number of trials = 50, p-value = 0.0004116
## alternative hypothesis: true probability of success is not equal to 0.6
## 95 percent confidence interval:
## 0.7088737 0.9282992
## sample estimates:
## probability of success
## 0.84
The output shows the sample proportion: the probability of success is 0.84. This is of course 42/50. If we want to know what the population proportion is, we look at the 95% confidence interval that runs from 0.7088737 to 0.9282992. If you want to test the null-hypothesis that the population proportion is equal to 0.60, then we see that the p-value for that test is 0.0004116.
As said, the binomial test also works fine for small sample sizes. Let’s go
back to the very first example of this chapter: the zoo manager sees that of the
4 elephants they have, 3 bump their head and are therefore taller than 3.40 m.
What does that tell us about the proportion of elephants worldwide that are
taller than 3.40 m? If we assume that the 4 zoo elephants were randomly selected
from the entire population of elephants, we can use the binomial distribution.
In this case we type in R:
binom.test(x = 3, n = 4)
##
## Exact binomial test
##
## data: 3 and 4
## number of successes = 3, number of trials = 4, p-value = 0.625
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.1941204 0.9936905
## sample estimates:
## probability of success
## 0.75
¹ Note in the output that by default, binom.test() chooses the null-hypothesis that the probability of success (the population proportion) equals 0.5.
Chapter 4
Linear modelling: introduction
variable. We can also predict the adult height of a child from the height of the
mother.
The dependent variable is usually the most central variable. It is the variable
that we’d like to understand better, or perhaps predict. The independent
variable is usually an explanatory variable: it explains why some people have
high values for the dependent variable and other people have low values. For
instance, we’d like to know why some people are healthier than others. Health
may then be our dependent variable. An explanatory variable might be age
(older people tend to be less healthy), or perhaps occupation (being a dive
instructor induces more health problems than being a university professor).
Sometimes we’re interested to see whether we can predict a variable. For
example, we might want to predict longevity. Age at death would then be our
dependent variable and our independent (predictor) variables might concern
lifestyle and genetic make-up.
Thus, we often see four types of relations:
In all these four cases, variable A is the independent variable and variable
B is the dependent variable.
Note that in general, dependent variables can be either numeric, ordinal, or
categorical. Also independent variables can be numeric, ordinal, or categorical.
[Figure 4.1: the linear relationship Y = 2X, with X on the horizontal axis and Y on the vertical axis.]
For the linear relationship between X and Y in Figure 4.1 the linear equation
is therefore
Y = 0 + 2X (4.2)
Y = 2X (4.3)
With this equation, we can find the Y -value for all values of X. For instance,
if we want to know the Y -value for X = 3.14, then using the linear equation we
know that Y = 2×3.14 = 6.28. If we want to know the Y -value for X = 49876.6,
we use the equation to obtain Y = 2 × 49876.6 = 99753.2. In short, the linear equation is very helpful to quickly say what the Y-value is on the basis of the X-value, even if we don't have a graph of the relationship or if the graph does not extend to certain X-values.
[Figure 4.2: the linear relationship Y = −2 + 0.5X.]
In the linear equation, we call Y the dependent variable, and X the independent
variable. This is because the equation helps us determine or predict our value
of Y on the basis of what we know about the value of X. When we graph the
line that the equation represents, such as in Figure 4.1, the common way is to
put the dependent variable on the vertical axis, and the independent variable
on the horizontal axis.
Figure 4.2 shows a different linear relationship between X and Y . First we
look at the slope: we see that for every unit increase in X (from 1 to 2, or
from 4 to 5) we see an increase of 0.5 in Y . Therefore the slope is equal to 0.5.
Second, we look at the intercept: we see that when X = 0, Y has the value -2.
So the intercept is -2. Again, we can describe the linear relationship by a linear
equation, which is now:
Y = −2 + 0.5X (4.4)
Linear relationships can also be negative, see Figure 4.3. There, we see that
if we move from 0 to 1, we see a decrease of 2 in Y (we move from Y = −2 to
Y = −4), so −2 is our slope value. Because the slope is negative, we call the
relationship between the two variables negative. Further, when X = 0, we see
a Y -value of -2, and that is our intercept. The linear equation is therefore:
Y = −2 − 2X (4.5)
[Figure 4.3: the negative linear relationship Y = −2 − 2X.]
Overview
dependent variable: the variable that we want to describe, understand,
predict or explain. Usually denoted as Y .
[Figure 4.4: scatter plot of Euros yearly spent on holidays (vertical axis) against yearly income in k Euros (horizontal axis).]
Suppose we are interested in the relationship between yearly income and the amount of Euros spent on holidays. Yearly income is measured in
thousands of Euros (k Euros), and money yearly spent on holidays is measured
in Euros. Let us regard money spent on holidays as our dependent variable
and yearly income as our independent variable (we assume money needs to be
saved before it can be spent). We therefore plot yearly income on the X-axis
(horizontal axis) and holiday spendings on the Y -axis (vertical axis). Let’s
imagine we find the data from 100 women between 30 and 40 years of age that
are plotted in Figure 4.4.
In the scatter plot, we see that one woman has a yearly income of 100,000
Euros, and that she spends almost 1100 Euros per year on holidays. We also
see a couple of women who earn less, between 10,000 and 20,000 Euros a year,
and they spend between 200 and 300 Euros per year on holiday.
The data obviously do not form a straight line. However, we tend to think
that the relationship between yearly income and holiday spending is more or
less linear: there is a general linear trend such that for every increase of 10,000
Euros in yearly income, there is an increase of about 100 Euros.
Let’s plot such a straight line that represents that general trend, with a slope
of 100 straight through the data points. The result is seen in Figure 4.5. We
see that the line with a slope of 100 is a nice approximation of the relationship
between yearly income and holiday spendings. We also see that the intercept of
the line is 100.
Given the intercept and slope, the linear equation for the straight line approximating the relationship is: Euros spent on holidays = 100 + 100 × yearly income (measured in units of 10,000 Euros).
[Figure 4.5: the income and holiday spending data of Figure 4.4, with the approximating straight line added.]
In summary, data on two variables may not show a perfect linear relationship,
but in many cases, a perfect straight line can be a very reasonable approximation
of the data. Another word for a reasonable approximation of the data is a
prediction model. Finding such a straight line to approximate the data points
is called linear regression. In this chapter we will see what method we can
use to find a straight line. In linear regression we describe the behaviour of
the dependent variable (the Y -variable on the vertical axis) on the basis of
the independent variable (the X-value on the horizontal axis) using a linear
equation. We say that we regress variable Y on variable X.
4.4 Residuals
Even though a straight line can be a good approximation of a data set consisting
of two variables, it is hardly ever perfect: there are always discrepancies between
what the straight line describes and what the data actually tell us.
For instance, in Figure 4.5, we see a woman, Sandra Schmidt, who makes 69
k Euros a year and who spends 809 Euros on holidays. According to the linear
equation that describes the straight line, a woman that earns 69 k Euros a year
would spend 100 + 100 × 6.9 = 790 Euros on holidays. The discrepancy between the actual amount spent and the amount prescribed by the linear equation equals 809 − 790 = 19 Euros. This difference is rather small and the same holds
for all the other women in this data set. Such discrepancies between the actual
amount spent and the amount as prescribed or predicted by the straight line
are called residuals or errors. The residual (or error) is the difference between
a certain data point (the actual value) and what the linear equation predicts.
[Figure 4.6: data on X and Y with an approximating straight line; the vertical distances from the dots to the line are the residuals.]
Let us look at another fictitious data set where the residuals (errors) are a
bit larger. Figure 4.6 shows the relationship between variables X and Y . The
dots are the actual data points and the blue straight line is an approximation
of the actual relationship. The residuals are also visualised: sometimes the
observed Y -value is greater than the predicted Y -value (dots above the line)
and sometimes the observed Y -value is smaller than the predicted Y -value (dots
below the line). If we denote the ith predicted Y -value (predicted by the blue
line) as $\hat{Y}_i$ (pronounced 'y-hat-i'), then we can define the residual or error as the discrepancy between the observed $Y_i$ and the predicted $\hat{Y}_i$:

\[ e_i = Y_i - \hat{Y}_i \qquad (4.7) \]

where $e_i$ stands for the error (residual) for the ith data point.
If we compute residual ei for all Y -values in the data set, we can plot them
using a histogram, as displayed in Figure 4.7. We see that the residuals are on
average 0, and that the histogram resembles the shape of a normal distribution.
We see that most of the residuals are around 0, which means that most of the Y-values are close to the line (where the predicted values are). We also see some large residuals, but not many of them. Observing
a more or less normal distribution of residuals happens often in research. Here,
the residuals show a normal distribution with mean 0 and variance of 13336
(i.e., a standard deviation of 115).
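A minimal sketch of how residuals can be inspected in R, using simulated data (the numbers are therefore only illustrative, not those of the data set above):

set.seed(1)
x <- runif(100, 0, 100)
y <- 200 + 15 * x + rnorm(100, mean = 0, sd = 115)  # a straight line plus normal noise
fit <- lm(y ~ x)                                    # fit the regression line
e <- resid(fit)                                     # the residuals: observed y minus predicted y
mean(e)                                             # practically 0
hist(e)                                             # roughly bell-shaped around 0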
[Figure 4.7: histogram of the residuals.]
Figure 4.8: Data on variables X and Y with an added straight line. The sum
of the residuals equals 0.
Figure 4.9: Three times the same data set, but with different regression lines.
Figure 4.10: Histogram of the residuals (errors) for three different regression
lines, and the respective sums of squared residuals (SSR).
In summary, when we want to have a straight line that describes our data best (i.e., the regression line), we'd like a line such that the residuals are on average 0 (i.e., they sum to 0) and are as small as possible. We meet these criteria when we choose the line that gives the lowest possible value for the sum of the squared residuals. This line is therefore called the least squares or OLS (ordinary least squares) regression line.
There are generally two ways of finding the intercept and slope values that satisfy the Least Squares principle.

1. Numerical approach For complex problems, a computer can simply try out very many combinations of intercept and slope values and keep the combination that gives the lowest sum of squared residuals.

2. Analytical approach For problems that are not too complex, like this linear regression problem, there are simple mathematical equations to find the combination of intercept and slope that gives the lowest sum of squared residuals.
Using the analytical approach, it can be shown that the Least Squares slope
can be found by solving:
\[ \text{slope} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} \qquad (4.8) \]
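As a sketch (with simulated data, so only illustrative), you can check in R that this formula gives the same slope as lm():

set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # least squares slope by hand
coef(lm(y ~ x))["x"]                                       # the same slope from lm()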
Overview
residual: the difference between a certain data point (the actual value)
and what the linear equation predicts.
linear regression: When we want to describe the behaviour of the
dependent variable (the Y -variable on the vertical axis) on the basis of the
independent variable (the X-value on the horizontal axis) by a straight
line, linear regression is the process of finding such a straight line.
Least Squares principle: In order to find the best regression line, you
need a criterion. The Least Squares principle is such a criterion and
specifies that the sum of the squares of the residuals should be as small
as possible.
\[ \hat{Y}_i = b_0 + b_1 X_i \qquad (4.10) \]

where we use $b_0$ to denote the intercept, $b_1$ to denote the slope and $X_i$ as the ith value of X.

In reality, the predicted values for Y always deviate from the observed values of Y: there is practically always an error e that is the difference between $\hat{Y}_i$ and $Y_i$. Thus we have for the observed values of Y

\[ Y_i = \hat{Y}_i + e_i = b_0 + b_1 X_i + e_i \qquad (4.11) \]
Typically, we assume that the residuals e have a normal distribution with a mean of 0 and a variance that is often unknown but that we denote by $\sigma^2_e$. Such a normal distribution is denoted by $N(0, \sigma^2_e)$. Taking the linear equation and the normally distributed residuals together, we have a model for the variables X and Y:

\[ Y_i = b_0 + b_1 X_i + e_i \qquad (4.12) \]
\[ e_i \sim N(0, \sigma^2_e) \qquad (4.13) \]
A model is a specification of how a set of variables relate to each other. Note
that the model for the residuals, the normal distribution, is an essential part
of the model. The linear equation only gives you predictions of the dependent
variable, not the variable itself. Together, the linear equation and the distribution
of the residuals give a full description of how the dependent variable depends on
the independent variable.
A model may be an adequate description of how variables relate to each
other or it may not, that is for the data analyst to decide. If it is an adequate
description, it may be used to predict yet unseen data on variable Y (because
we can’t see into the future), or it may be used to draw some inferences on data
that can’t be seen, perhaps because of limitations in data collection. Remember
Chapter 2 where we made a distinction between sample data and population
data. We could use the linear equation that we obtain using a sample of data
to make predictions for data in the population. We delve deeper into that issue
in Chapter 5.
The model that we see in Equations 4.12 and 4.13 is a very simple form
of the linear model. The linear model that we see here is generally known
as the simple regression model : the simple regression model is a linear model
for one numeric dependent variable, an intercept, a slope for only one (hence
’simple’) numeric independent variable, and normally distributed residuals. In
the remainder of this book, we will see a great variety of linear models: with one
or more independent variables, with numeric or with categorical independent
variables, and with numeric or categorical dependent variables. All these models
can be seen as extensions of this simple regression model. What they all have
in common is that they aim to predict one dependent variable from one or more
independent variables using a linear equation.
In the syntax we first indicate that we start from the mtcars data set. Next,
we use the lm() function to indicate that we want to apply the linear model
to these data. Next, we say that we want to model the variable mpg. The ∼
(’tilde’) sign means ”is modelled by” or ”is predicted by”, and next we plug in
the independent variable cyl. Thus, this code says we want to model the mpg
variable by the cyl variable, or predict mpg scores by cyl.

Figure 4.11: Data set on number of cylinders (cyl) and miles per gallon (mpg) in 32 cars.
Next, because we already indicated that we start from the mtcars data set, the data argument of the lm() function only needs a dot (data = .), which refers to the data set that comes in through the pipe. Finally, we store the results in the object model. In the last line of code we indicate that we want to see the results that we stored in model.
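The code chunk being described here was presumably along these lines (its last line, model, is shown below together with the output):

model <- mtcars %>%
  lm(mpg ~ cyl, data = .)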
model
##
## Call:
## lm(formula = mpg ~ cyl, data = .)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
The output above shows us a repetition of the lm() analysis, and then two
coefficients. These are the regression coefficients that we wanted: the first is
the intercept, and the second is the slope. These coefficients are the parameters
of the regression model. Parameters are parts of a model that can vary from
data set to data set, but that are not variables (variables vary within a data set,
parameters do not). Here we use the linear model from Equations 4.12 and 4.13
where b0 , b1 and σe2 are parameters since they are different for different data
sets.
The output does not look very pretty. Using the broom package, we can get
the same information about the analysis, and more:
library(broom)
model <- mtcars %>%
lm(mpg ~ cyl, data = .)
model %>%
tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 37.9 2.07 18.3 8.37e-18
## 2 cyl -2.88 0.322 -8.92 6.11e-10
R then shows two rows of values, one for the intercept and one for the slope
parameter for cyl. For now, we only look at the first two columns. In these
columns we find the least squares values for these parameters for this data set
on 32 cars that we are analysing here.
In the second column, called estimate, we see that the intercept parameter
has the value 37.9 (when rounded to 1 decimal) and the slope has the value
-2.88. Thus, with this output, the regression equation can be filled in:

\[ \widehat{\text{mpg}} = 37.9 - 2.88 \times \text{cyl} \]
Figure 4.12: Two data sets with the same regression line.
In Figure 4.12, the regression line can be used to predict Y-values in data set A very well, almost without error, whereas the
regression line cannot be used to predict Y -values in data set B very precisely.
The regression line is also the least squares regression line for data set B, so any
improvement by choosing another slope or intercept is not possible.
Francis Galton was the first to think about how to quantify this difference in
the ability of a regression line to predict the dependent variable. Karl Pearson
later worked on this measure and therefore it came to be called Pearson’s
correlation coefficient. It is a standardised measure, so that it can be used
to compare different data sets.
In order to get to Pearson’s correlation coefficient, you first need to standardise
both the independent variable, X, and the dependent variable, Y. You standardise scores by subtracting the mean from them and dividing by the standard deviation (see Chapter 1). So, in order to obtain a standardised value for X = x we compute $z_X$,

\[ z_X = \frac{x - \bar{X}}{\sigma_X} \qquad (4.15) \]

and in order to obtain a standardised value for Y = y we compute $z_Y$,

\[ z_Y = \frac{y - \bar{Y}}{\sigma_Y}. \qquad (4.16) \]
Let’s do this both for data set A and data set B, and plot the standardised
scores, see Figure 4.13. If we then plot the least squares regression lines for
the standardised values, we obtain different equations. For both data sets, the
intercept is 0 because by standardising the scores, the means become 0. But
the slopes are different: in data set A, the slope is 0.997 and in data set B, the slope is 0.376.

Figure 4.13: Two data sets, with different regression lines after standardisation.
These two slopes, the slope for the regression of standardized Y -values on
standardized X-values, are the correlation coefficients for data sets A and B,
respectively. For obvious reasons, the correlation is sometimes also referred to
as the standardised slope coefficient or standardised regression coefficient.
Correlation stands for the co-relation between two variables. It tells you
how well one variable can be predicted from the other. The correlation is bi-
directional: the correlation between Y and X is the same as the correlation
between X and Y . For instance in Figure 4.13, if we would have put the ZX -
variable on the ZY -axis, and the ZY -variable on the ZX -axis, the slopes would be
exactly the same. This is true because the variances of the Y - and X-variables
are equal after standardisation (both variances equal to 1).
Since a slope can be negative, a correlation can be negative too. Furthermore,
a correlation is always between -1 and 1. Look at Figure 4.13: the correlation
between X and Y is 0.997. The dots are almost on a straight line. If the dots
would all be exactly on the straight line, the correlation would be 1.
Figure 4.14 shows a number of scatter plots of X and Y with different
correlations. Note that if dots are very close to the regression line, the correlation
can still be close to 0: if the slope is 0 (bottom-left panel), then one variable
cannot be predicted from the other variable, hence the correlation is 0, too.
[Figure 4.14: scatter plots of X and Y with different correlations.]
In summary, the correlation coefficient indicates how well one variable can
be predicted from the other variable. It is the slope of the regression line if both
variables are standardised. If prediction is not possible (when the regression
slope is 0), the correlation is 0, too. If the prediction is perfect, without errors
(no residuals) and with a slope unequal to 0, then the correlation is either -1
or +1, depending on the sign of the slope. The correlation coefficient between
variables X and Y is usually denoted by rXY for the sample correlation and
ρXY (pronounced ’rho’) for the population correlation.
4.9 Covariance
The correlation ρXY as defined above is a standardised measure for how much
two variables co-relate. It is standardised in such a way that it can never be
outside the (-1, 1) interval. This standardisation happened through the division
of X and Y -values by their respective standard deviation. There exists also an
unstandardised measure for how much two variables co-relate: the covariance.
The correlation ρXY is the slope when X and Y each have variance 1. When
you multiply correlation ρXY by a quantity indicating the variation of the two
variables, you get the covariance. This quantity is the product of the two
respective standard deviations.
The covariance between variables X and Y , denoted by σXY , can be computed
as:
\[ \sigma_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n} \qquad (4.21) \]
so it is the mean of the cross-products of the two variables.¹ Note that the numerator bears close resemblance to the numerator of the equation that we use to find the least squares slope, see Equation 4.8. This is not strange since both the slope and the covariance say something about the relationship between two variables. Also note that in the equation that we use to find the least squares slope, the denominator bears a close relationship to the formula for the variance, since $\sigma^2_X = \frac{\sum_i (X_i - \bar{X})^2}{n}$ (see Chapter 1). We could therefore rewrite Equation 4.8 that finds the least squares or OLS slope as:
\[ \text{slope}_{OLS} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{\sigma_{XY} \times n}{\sigma^2_X \times n} = \frac{\sigma_{XY}}{\sigma^2_X} \qquad (4.22) \]
This shows how all three quantities slope, correlation and covariance say
something about the linear relationship between two variables. The slope says
how much the dependent variable increases if the independent variable increases
by 1, the correlation says how much of a standard deviation the dependent
variable increases if the independent variable increases by one standard deviation
(alternatively: the slope after standardisation), and the covariance is the mean
cross-product of two variables (alternatively: the unstandardised correlation).
Table 4.1 shows a small data set on two variables X and Y with 5 observations.
The mean value of X is -0.4 and the mean value of Y is -0.2. If we subtract
¹ Again, similar to what was said about the formula for the variance of a variable, on-line you will often find the formula $\frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$. The difference is that here we are talking about the definition of the covariance of two observed variables, and that elsewhere one talks about trying to estimate the covariance between two variables in the population. Similar to the variance, the covariance in a sample is a biased estimator of the covariance in the population. To remedy this bias, we divide the cross-products not by n but by n − 1.
[Figure 4.15: the data example from Table 4.1 and the regression line (slope −1.23).]
the respective mean from each observed value and multiply, we get a column of
cross-products. For example, take the first row: $X - \bar{X} = -1 - (-0.4) = -0.6$ and $Y - \bar{Y} = 2 - (-0.2) = 2.20$. If we multiply these numbers we get the
cross-product −0.6 × 2.20 = −1.32. If we compute all cross-products and sum
them, we get -6.40. Dividing this by the number of observations (5), yields the
covariance: -1.28.
If we compute the variances of X and Y (see Chapter 1), we obtain 1.04 and
2.16, respectively. Taking the square roots we obtain the standard deviations:
1.0198039 and 1.4696938. Now we can calculate the correlation on the basis of the covariance as $\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{-1.28}{1.0198039 \times 1.4696938} = -0.85$. We can also calculate the least squares slope as $\frac{\sigma_{XY}}{\sigma^2_X} = \frac{-1.28}{1.04} = -1.23$.
The original data are plotted in Figure 4.15 together with the regression
line. The standardised data and the corresponding regression line are plotted
in Figure 4.16. Note that the slopes are different, and that the slope of the
regression line for the standardised data is equal to the correlation.
mtcars %>%
select(cyl, mpg) %>%
cor()
##           cyl       mpg
## cyl  1.000000 -0.852162
## mpg -0.852162  1.000000
Figure 4.16: Data example (standardised values) and the regression line.
In the output we see a correlation matrix. On the diagonal are the correlations
of cyl and mpg with themselves, which are perfect (a correlation of 1). On the
off-diagonal, we see that the correlation between cyl and mpg equals -0.852162.
This is a strong negative correlation, which means that generally, the more
cylinders a car has, the lower the mileage. We can also compute the covariance,
with the function cov:
mtcars %>%
select(cyl, mpg) %>%
cov()
## cyl mpg
## cyl 3.189516 -9.172379
## mpg -9.172379 36.324103
On the off-diagonal we see that the covariance between cyl and mpg equals
-9.172379. On the diagonal we see the variances of cyl and mpg. Note that
R uses the formula with n − 1 in the denominator. If we want R to compute
the (co-)variance using n in the denominator, we have to write an alternative
function ourselves:
cov_alt <- function(x, y) {
  x <- x - mean(x)        # centre x
  y <- y - mean(y)        # centre y
  XY <- x %*% y           # multiply each x with each y and sum them
  return(XY / length(x))  # divide by n
}
cov_alt(mtcars$cyl, mtcars$mpg)
## [,1]
## [1,] -8.885742
To determine the least squares slope for the regression line of mpg on cyl, we divide the covariance by the variance of cyl (Equation 4.22):
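The call that produces the output below is presumably:

cov(mtcars$cyl, mtcars$mpg) / var(mtcars$cyl)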
## [1] -2.87579
Note that both cov() and var() use n − 1. Since this cancels out if we do
the division, it doesn’t matter whether we use n or n − 1.
If we first standardise the data with the function scale() and then compute
the least squares slope, we get
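The code that produces the output below is presumably along these lines:

z_cyl <- scale(mtcars$cyl)      # standardised cyl values
z_mpg <- scale(mtcars$mpg)      # standardised mpg values
cov(z_mpg, z_cyl) / var(z_cyl)  # least squares slope for the standardised data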
## [,1]
## [1,] -0.852162
cor(z_mpg, z_cyl)
## [,1]
## [1,] -0.852162
cov(z_mpg, z_cyl)
## [,1]
## [1,] -0.852162
We see from the output that the slope coefficient for the standardised situation
is equal to both the correlation and the covariance of the standardised values.
The data and the least squares regression line can be plotted using geom_smooth():
mtcars %>%
ggplot(aes(x = cyl, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = F)
\[ Y = b_0 + b_1 X \qquad (4.23) \]
Further, we argued that in most cases, the relationship between X and Y
cannot be completely described by a straight line. Not all of the variation in Y
can be explained by the variation in X. Therefore, we have residuals e, defined
as the difference between the observed Y-value and the Y-value that is predicted by the straight line (denoted by $\hat{Y}$):

\[ e = Y - \hat{Y} \qquad (4.24) \]

Therefore, the relationship between X and Y is denoted by a regression equation, where the relationship is approximated by a linear equation, plus a residual part e:

\[ Y = b_0 + b_1 X + e \qquad (4.25) \]

The linear equation gives us only the predicted Y-value, $\hat{Y}$:

\[ \hat{Y} = b_0 + b_1 X \qquad (4.26) \]
We’ve also seen that the residual e is assumed to have a normal distribution,
with mean 0 and variance $\sigma^2_e$:

\[ e \sim N(0, \sigma^2_e) \]
Remember that linear models are used to explain (or predict) the variation
in Y : why are there both high values and low values for Y ? Where does the
variance in Y come from? Well, the linear model tells us that the variation is
in part explained by the variation in X. If b1 is positive, we predict a relatively
high value for Y for a high value of X, and we predict a relatively low value for
Y if we have a low value for X. If b1 is negative, it is of course in the opposite
direction. Thus, the variance in Y is in part explained by the variance in X,
and the rest of the variance can only be explained by the residuals e.
Because the residuals do not explain anything (we don’t know where these
residuals come from), we say that the explained variance of Y is only that part
of the variance that is explained by independent variable X: Var(b0 +b1 X). The
unexplained variance of Y is the variance of the residuals, σe2 . The explained
variance is often denoted by a ratio: the explained variance divided by the total
variance of Y :
\[ \text{Var}_{\text{explained}} = \frac{\text{Var}(b_0 + b_1 X)}{\text{Var}(Y)} = \frac{\text{Var}(b_0 + b_1 X)}{\text{Var}(b_0 + b_1 X) + \sigma^2_e} \qquad (4.29) \]
From this equation we see that if the variance of the residuals is large, then
the explained variance is small. If the variance of the residuals is small, the
variance explained is large.
In a data set on the weight and volume of books, we see that the variance of the weight, Var(weight), is equal to 72274. Since we also know the variance of the residuals, we can solve for the variance explained by volume:
\[ \text{Var(weight)} = 72274 = \text{Var}(107.7 + 0.7 \times \text{volume}) + 15362 \]
\[ \text{Var}(107.7 + 0.7 \times \text{volume}) = 72274 - 15362 = 56912 \]

So the proportion of explained variance is equal to $\frac{56912}{72274} = 0.7874478$. This is quite a high proportion: nearly all of the variation in the weight of books is explained by the variation in volume.
But let’s see if we can explain even more variance if we add an extra independent
variable. Suppose we know the area of each book. We expect that books with
a large surface area weigh more. Our linear equation then looks like this:

\[ \text{weight} = b_0 + b_1 \times \text{volume} + b_2 \times \text{area} + e \]
4.14 R-squared
With regression analysis, we try to explain the variance of the dependent variable.
With multiple regression, we use more than one independent variable to try to
explain this variance. In regression analysis, we use the term R-squared to
refer to the proportion of explained variance, usually denoted with the symbol
R2 . The unexplained variance is of course the variance of the residuals, Var(e),
usually denoted as σe2 . So suppose the variance of dependent variable Y equals
200, and the residual variance in a regression equation equals say 80, then R2
or the proportion of explained variance is (200 − 80)/200 = 0.60.
\[ R^2 = \sigma^2_{\text{explained}} / \sigma^2_Y = (\sigma^2_Y - \sigma^2_{\text{unexplained}}) / \sigma^2_Y = (\sigma^2_Y - \sigma^2_e) / \sigma^2_Y \qquad (4.34) \]
This is the definition of R-squared at the population level, where we know
the exact values of the variances. However, we do not know these variances,
since we only have a sample of all values.
We know from Chapter 2 that we can take estimators of the variances $\sigma^2_Y$ and $\sigma^2_e$. We should not use the variance of Y observed in the sample, but the unbiased estimator of the variance of Y in the population

\[ \widehat{\sigma^2_Y} = \frac{\sum_i (Y_i - \bar{Y})^2}{n-1} \qquad (4.35) \]
where n is sample size (see Section 2.3).
For $\sigma^2_e$ we take the unbiased estimator of the variance of the residuals e in the population

\[ \widehat{\sigma^2_e} = \frac{\sum_i (e_i - \bar{e})^2}{n-1} = \frac{\sum_i e_i^2}{n-1} \qquad (4.36) \]
Here we do not have to subtract the mean from the residuals, because the
mean is 0 by definition.
If we plug these estimators into Equation 4.34, we get
\[ R^2 = \frac{\widehat{\sigma^2_Y} - \widehat{\sigma^2_e}}{\widehat{\sigma^2_Y}} = \frac{\frac{\sum_i (Y_i - \bar{Y})^2}{n-1} - \frac{\sum_i e_i^2}{n-1}}{\frac{\sum_i (Y_i - \bar{Y})^2}{n-1}} = \frac{\sum_i (Y_i - \bar{Y})^2 - \sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2} = 1 - \frac{SSR}{SST} \qquad (4.37) \]
where SSR refers to the sum of the squared residuals (errors)², and SST
refers to the total sum of squares (the sum of the squared deviations from the
mean for variable Y ).
As we saw in Section 4.5, in a regression analysis, the intercept and slope
parameters are found by minimising the sum of squares of the residuals, SSR.
Since the variance of the residuals is based on this sum of squares, in any
regression analysis, the variance of the residuals is always as small as possible.
The values of the parameters for which the SSR (and by consequence the
variance) is smallest, are the least squares regression parameters. And if the
variance of the residuals is always minimised in a regression analysis, the explained
variance is always maximised!
Because in any least squares regression analysis based on a sample of data,
the explained variance is always maximised, we may overestimate the variance
explained in the population data. In regression analysis, we therefore very often
use an adjusted R-squared that takes this possible overestimation (inflation)
into account. The adjustment is based on the number of independent variables
and sample size.
² In the literature and online, sometimes you see SSR and sometimes you see SSE; both refer to the sum of the squared residuals (errors).
The formula is
\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} \]

where n is sample size and p is the number of independent variables. For example, if $R^2$ equals 0.10, we have a sample size of 100 and 2 independent variables, the adjusted $R^2$ is equal to $1 - (1 - 0.10)\frac{100-1}{100-2-1} = 1 - 0.90 \times \frac{99}{97} = 0.08$. Thus, the estimated proportion of variance explained at population level, corrected for inflation, equals 0.08. Because $R^2$ is inflated, the adjusted $R^2$ is never larger than the unadjusted R-squared:

\[ R^2_{adj} \leq R^2 \]
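A quick sketch of this arithmetic in R (using the numbers from the example above):

r2 <- 0.10
n <- 100
p <- 2
1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared: about 0.08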
library(DAAG)
library(broom)
model <- allbacks %>%
lm(weight ~ volume + area, data = .)
model %>%
tidy()
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 22.4 58.4 0.384 0.708
## 2 volume 0.708 0.0611 11.6 0.0000000707
## 3 area 0.468 0.102 4.59 0.000616
There we see an intercept, a slope parameter for volume and a slope parameter
for area. Remember from Section 4.2 that the intercept is the predicted value
when the independent variable has value 0. This extends to multiple regression:
the intercept is the predicted value when the independent variables all have
value 0. Thus, the output tells us that the predicted weight of a book that has
a volume of 0 and an area of 0, is 22.4. The slopes tell us that for every unit
increase in volume, the predicted weight increases by 0.708, and for every unit
increase in area, the predicted weight increases by 0.468.
So the linear model looks like:

\[ \widehat{\text{weight}} = 22.4 + 0.708 \times \text{volume} + 0.468 \times \text{area} \]

Thus, for a book that has a volume of 10 and an area of 5, the predicted weight is equal to 22.4 + 0.708 × 10 + 0.468 × 5 = 31.82.
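The same prediction can be obtained from the fitted model object with predict(); a minimal sketch (the result differs slightly from 31.82 because the coefficients above were rounded):

predict(model, newdata = data.frame(volume = 10, area = 5))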
In R, the R-squared and the adjusted R-squared can be obtained by first
making a summary of the results, and then accessing these statistics directly.
sum <- model %>% summary()
sum$r.squared
## [1] 0.9284738
sum$adj.r.squared
## [1] 0.9165527
The output tells you that the R-squared equals 0.93 and the adjusted R-
squared 0.92. The variance of the residuals can also be found in the summary
object:
sum$sigma^2
## [1] 6031.052
4.16 Multicollinearity
In general, if you add independent variables to a regression equation, the proportion of explained variance, $R^2$, increases. Suppose you have the following three regression equations:

\[ \text{weight} = b_0 + b_1 \times \text{volume} + e \]
\[ \text{weight} = b_0 + b_1 \times \text{area} + e \]
\[ \text{weight} = b_0 + b_1 \times \text{volume} + b_2 \times \text{area} + e \]
If we carry out these three analyses, we obtain an R2 of 0.8026346 if we only
use volume as predictor, and an R2 of 0.1268163 if we only use area as predictor.
So perhaps you’d think that if we take both volume and area as predictors in
the model, we would get an R2 of 0.8026346 + 0.1268163 = 0.9294509. However,
if we carry out the multiple regression with volume and area, we obtain an R2
of 0.9284738, which is slightly less! This is not a rounding error, but results
from the fact that there is a correlation between the volume of a book and the
area of a book. Here it is a tiny correlation of 0.002, but nevertheless it affects
the proportion of variance explained when you use both these variables.
Let’s look at what happens when independent variables are strongly correlated.
Table 4.2 shows measurements on a breed of seals (only measurements on the
first 6 seals are shown). These data are in the dataframe cfseal in the package
DAAG. Often, the age of an animal is gauged from its weight: we assume that
heavier seals are older than lighter seals. If we carry out a simple regression of
age on weight, we get the output
library(DAAG)
data(cfseal) # available in package DAAG
out1 <- cfseal %>%
lm(age ~ weight , data = .)
out1 %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.4 4.70 2.44 2.15e- 2
## 2 weight 0.817 0.0716 11.4 4.88e-12
## [1] 1090.855
## [1] 200.0776
From the data we calculate the variance of age, and we find that it is
1090.8551724. The variance of the residuals is 200, so that the proportion of
explained variance is (1090.8551724 − 200)/1090.8551724 = 0.8166576.
Since we also have data on the weight of the heart alone, we could try to
predict the age from the weight of the heart. Then we get output
Table 4.2: Part of Cape Fur Seal Data.
age weight heart
33.00 27.50 127.70
10.00 24.30 93.20
10.00 22.00 84.50
10.00 18.50 85.40
12.00 28.00 182.00
18.00 23.80 130.00
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 20.6 5.21 3.95 0.000481
## 2 heart 0.113 0.0130 8.66 0.00000000209
## [1] 307.1985
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.3 4.99 2.06 0.0487
## 2 heart -0.0269 0.0373 -0.723 0.476
## 3 weight 0.993 0.254 3.91 0.000567
## [1] 203.55
Here we see that the regression parameter for weight has increased from 0.82
to 0.99. At the same time, the regression parameter for heart has decreased and even become negative, from 0.11 to -0.03. From this equation we see that
there is a strong relationship between the total weight and the age of a seal, but
on top of that, for every unit increase in the weight of the heart, there is a very
small decrease in the expected age. The slope for heart has become practically
negligible, so we could say that on top of the effect of total weight, there is no
remaining relationship between the weight of the heart and age. In other words,
once we can use the total weight of a seal, there is no more information coming
from the weight of the heart.
This is because the total weight of a seal and the weight of its heart are
strongly correlated: heavy seals generally have heavy hearts. Here the correlation
turns out to be 0.96, almost perfect! This means that if you know the total
weight of a seal, you practically know the weight of its heart. This is logical of
course, since the total weight is a composite of all the weights of all the parts
of the animal: the total weight variable includes the weight of the heart.
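A quick sketch of how this correlation could be checked in R (the value of 0.96 quoted above comes from the text):

cor(cfseal$weight, cfseal$heart)  # about 0.96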
Here we have seen, that if we use multiple regression, we should be aware
of how strongly the independent variables are correlated. Highly correlated
predictor variables do not add extra predictive power. Worse: they can cause
problems in obtaining regression parameters because it becomes hard to tell
which variable is more important: if they are strongly correlated (positive or
negative), then they measure almost the same thing!
When two predictor variables are perfectly correlated (either 1 or -1), regression is no longer possible: the software stops and you get a warning. We call such a situation multicollinearity. But also if the correlation is close to 1 or -1, you
should be very careful interpreting the regression parameters. If this happens,
try to find out what variables are highly correlated, and select the variable that
makes most sense.
In our seal data, there is a very high correlation between the variables heart
and weight that can cause computational and interpretation problems. It makes
more sense to use only the total weight variable, since when seals get older, all
their organs and limbs grow larger, not just their heart.
[Figure 4.17: yearly salary plotted against Neuroticism, with the simple regression line.]
Figure 4.17 shows the data and the regression line. From this visualisation it looks like Neuroticism relates positively to yearly salary: more neurotic people earn a higher salary than less neurotic people. More precisely, we see in the equation that for every unit increase on the Neuroticism scale, the predicted salary increases by 4912 Euros a year.
³ https://paulvanderlaken.com/2017/09/27/simpsons-paradox-two-hr-examples-with-r-code/
Next we run a multiple regression analysis. We suspect that one other very
important predictor for how much people earn is their educational background.
The Education variable has three levels: 0, 1 and 2. If we include both
Education and Neuroticism as independent variables and run the analysis,
we obtain the following regression equation:
Note that we now find a negative slope parameter for the effect of Neuroticism!
This implies there is a relationship in the data where neurotic employees earn
less than their less neurotic colleagues! How can we reconcile this seeming
paradox? Which result should we trust: the one from the simple regression, or
the one from the multiple regression?
The answer is: neither. Or better: both! Both analyses give us different
information.
Let’s look at the last equation more closely. Suppose we make a prediction
for a person with a low educational background (Education = 0). Then the
equation tells us that the expected salary of a person with a neuroticism score
of 0 is around 50935, and of a person with a neuroticism score of 1 is around
47759. That’s an increase of -3176, which is the slope for Neuroticism in the
multiple regression. So for employees with low education, the more neurotic
employees earn less! If we do the same exercise for average education and high
education employees, we find exactly the same pattern: for each unit increase
in neuroticism, the predicted yearly salary drops by 3176 Euros.
It is true that in this company, the more neurotic persons generally earn
a higher salary. But if we take into account educational background, the
relationship flips around. This can be seen from Figure 4.18: looking only
at the people with a low educational background (Education = 0, the red
data points), then the more neurotic people earn less than their less neurotic
colleagues with a similar educational background. And the same is true for
people with an average education (Education = 1, the green data points) and
a high education (Education = 2, the blue data points). Only when you put
all employees together in one group, you see a positive relationship between
Neuroticism and salary.
Simpson’s paradox tells us that we should always be careful when interpreting
positive and negative correlations between two variables: what might be true
at the total group level, might not be true at the level of smaller subgroups.
Multiple linear regression helps us investigate correlations more deeply and
uncover exciting relationships between multiple variables.
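As a sketch of how the two analyses could be run in R, assuming the HR data from the source in the footnote are stored in a data frame called hr_data with the variables Salary, Neuroticism and Education (the data frame name is illustrative):

lm(Salary ~ Neuroticism, data = hr_data) %>% tidy()              # simple regression: positive slope
lm(Salary ~ Neuroticism + Education, data = hr_data) %>% tidy()  # multiple regression: negative slope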
Simpson’s paradox helps us in interpreting the slope coefficients in multiple
regression. In simple regression, when we only have one independent variable,
we saw that the slope for an independent variable A is the increase in the
dependent variable if we increase variable A by one unit. In multiple regression,
we have multiple independent variables, say A, B and C. The interpretation of the slope coefficient for variable A is then the increase in the dependent variable for a one-unit increase in A, while the other variables B and C are held constant.
Figure 4.18: Same HR data, now with markers for different education levels.
Figure 4.19: A spurious correlation between the number of shark attacks and
ice cream sales.
Chapter 5
[Figure 5.1: volume in centilitres at 20 degrees Celsius plotted against temperature in degrees Celsius during production.]
Figure 5.2: The relationship between temperature and volume in all 80,000
bottles.
The discrepancy between the two equations is simply the result of chance:
had we selected another sample of 200 bottles, we probably would have found
a different sample equation with a different slope and a different intercept.
The intercept and slope based on sample data are the result of chance and
therefore different from sample to sample. The population intercept and slope
(the true ones) are fixed, but unknown. If we want to know something about
the population intercept and slope, we only have the sample equation to go on.
Our best guess for the population equation is the sample equation; the unbiased
estimator for a regression coefficient in the population is the sample coefficient.
But how certain can we be about how close the sample intercept and slope are
to the population intercept and slope?
Imagine that we draw a random sample of 200 bottles from the population and use these bottles to determine the intercept and the slope. Next, we put these bottles back into
the population, draw a second random sample of 200 bottles and calculate the
intercept and slope again.
You can probably imagine that if we repeat this procedure of randomly
picking 200 bottles from a large population of 80,000, each time we find a
different intercept and a different slope. Let’s carry out this procedure 100
times by a computer. Table 5.1 shows the first 10 regression equations, each
based on a random sample of 200 bottles. If we then plot the histograms of
all 100 sample intercepts and sample slopes we get Figure 5.3. Remember from
Chapters 2 and 3 that these are called sampling distributions. Here we look at
the sampling distributions of the intercept and the slope.
The sampling distributions in Figure 5.3 show a large variation in the intercepts, and a smaller variation in the slopes (i.e., all slope values are very close to one another).
Table 5.1: Ten different sample equations based on ten different random samples
from the population of bottles.
sample equation
1 volume = 28.87 + 0.06 x temperature + e
2 volume = 30.84 – 0.05 x temperature + e
3 volume = 31.05 – 0.06 x temperature + e
4 volume = 31.67 – 0.09 x temperature + e
5 volume = 30.59 – 0.03 x temperature + e
6 volume = 29.53 + 0.02 x temperature + e
7 volume = 28.36 + 0.08 x temperature + e
8 volume = 27.78 + 0.11 x temperature + e
9 volume = 28.29 + 0.09 x temperature + e
10 volume = 30.75 – 0.03 x temperature + e
For now, let’s focus on the slope. We do that because we are mostly
interested in the linear relationship between volume and temperature. However,
everything that follows also applies to the intercept. In Figure 5.4 we see the
histogram of the slopes if we carry out the random sampling 1000 times. We
see that on average, the sample slope is around 0.001, which is the population
slope (the slope if we analyse all bottles). But there is variation around that
mean of 0.001: the standard deviation of all 1000 sample slopes turns out to be
0.08.
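A sketch of such a computer experiment in R, assuming the population data are stored in a data frame called bottles with the variables temperature and volume (that name is illustrative):

sample_slopes <- replicate(1000, {
  bottles_sample <- bottles %>% sample_n(200)  # draw a random sample of 200 bottles
  coef(lm(volume ~ temperature, data = bottles_sample))["temperature"]
})
mean(sample_slopes)  # close to the population slope
sd(sample_slopes)    # the standard error of the sample slope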
Remember from Chapter 2 that the standard deviation of the sampling
distribution is called the standard error. The standard error for the sampling
distribution of the sample slope represents the uncertainty about the population
slope. If the standard error is large, it means that if we would draw many
different random samples from the same population data, we would get very
different sample slopes. If the standard error is small, it means that if we would
draw many different random samples from the same population data, we would
get sample slopes that are very close to one another, and very close to the
Figure 5.3: Distribution of the 100 sample intercepts and 100 sample slopes.
[Figure 5.4: histogram of the sample slopes from 1000 random samples.]
population slope.²
32
31
volume
30
29
28
18 19 20 21
temperature
Figure 5.5: The averaging effect of increasing sample size. The scatter plot
shows the relationship between temperature and volume for a random sample
of 20 bottles (the dots); the first two bottles in the sample are marked in red.
The red line would be the sample slope based on these first two bottles, the blue
line is the sample slope based on all 20 bottles, and the black line represents the
population slope, based on all 80,000 bottles. This illustrates that the larger
the sample size, the closer the sample regression line is expected to be to the
population regression line.
600
400
count
200
0
−20 −10 0 10 20 −20 −10 0 10 20
sample slope
Figure 5.6: Distribution of the sample slope when sample size is 2 (left panel)
and when sample size is 20 (right panel).
As we have seen, the standard error depends very much on sample size.
Apart from sample size, the standard error for a slope also depends on the
variance of the independent variable, the variance of the dependent variable,
and the correlations between the independent variable and other independent
variables in the equation. We will not bore you with the complicated formula for
the standard error for regression coefficients in the case of multiple regression.³ But here is the formula for the standard error for the slope coefficient if you have only one predictor variable X:
\[ \hat{\sigma}_{b_1} = \sqrt{\frac{s^2_R}{s^2_X \times (n-1)}} = \sqrt{\frac{\frac{\sum_i (Y_i - \hat{Y}_i)^2}{n-2}}{\frac{\sum_i (X_i - \bar{X})^2}{n-1} \times (n-1)}} = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2}{(n-2)\sum_i (X_i - \bar{X})^2}} \qquad (5.1) \]
where $b_1$ is the slope coefficient in the sample, n is sample size, $s^2_R$ is the sample variance of the residuals, and $s^2_X$ the sample variance of independent variable X. From the formula, you can see that the standard error $\hat{\sigma}_{b_1}$ becomes smaller when sample size n becomes larger.

It's not very useful to memorise this formula; you'd better let R do the calculations for you. But an interesting part of the formula is the numerator: $\frac{SSR}{n-2}$. This is the sum of the squared residuals, divided by n − 2. Remember
from Chapter 1 that the definition of the variance is the sum of squares divided
by the number of values. Thus it looks like we are looking at the variance of
the residuals. Remember from Chapter 2 that when we want to estimate a
population variance, a biased estimator is the variance in the sample. In order
to get an unbiased estimate of the variance, we have to divide by n−1 instead of
n. This was because when computing the sum of squares, we assume we know
the mean. Here we are computing the variance of the residuals, but it’s actually
an unbiased estimator of the variance in the population, because we divide by
n − 2: when we compute the residuals, we assume we know the intercept and
the slope. We assume two parameters, so we divide by n − 2. Thus, when we
have a linear model with 2 parameters (intercept and slope), we have to divide
the sum of squared residuals by n − 2 in order to obtain an unbiased estimator
of the variance of the residuals in the population.
From the equation, we see that the standard error becomes larger when there
is a large variation in the residuals, it becomes smaller when there is a large
variation in predictor variable X, and it becomes smaller with large sample size
n.
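As a sketch of Equation 5.1 in R, using the mtcars regression from Chapter 4 (for which tidy() reported a standard error of about 0.322 for the slope):

fit <- lm(mpg ~ cyl, data = mtcars)
n <- nrow(mtcars)
ssr <- sum(resid(fit)^2)                       # sum of squared residuals
ssx <- sum((mtcars$cyl - mean(mtcars$cyl))^2)  # sum of squared deviations of X
sqrt(ssr / ((n - 2) * ssx))                    # standard error of the slope: about 0.322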
5.3 t-distribution for the model coefficients
When we look at the sampling distribution of the sample slope, for instance in Figure 5.4, we notice that the distribution looks very much like a normal
distribution. From the Central Limit Theorem, we know that the sampling
distribution will become very close to normal for large sample sizes. Using this
sampling distribution for the slope we could compute confidence intervals and
do null-hypothesis testing, similar to what we did in Chapters 2 and 3.
For large sample sizes, we could assume the normal distribution, and when
we standardise the slope coefficient, we can look up in tables such as in Appendix
A the critical value for a particular confidence interval. For instance, 200 bottles
is a large sample size. When we standardise the sample slope – let’s assume
we find a slope of 0.05 –, we need to use the values -1.96 and +1.96 to obtain
a 95% confidence interval around 0.05. The margin of error (MoE) is then
1.96 times the standard error. Suppose that the standard error is 0.10. The
MoE is then equal to 1.96 × 0.10 = 0.196. The 95% interval then runs from
0.05 − 0.196 = −0.146 to 0.05 + 0.196 = 0.246.
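In R, this normal-approximation interval can be computed in one line (a small sketch using the numbers above):

# 95% interval for a slope of 0.05 with a standard error of 0.10,
# using the normal approximation (critical value 1.96)
0.05 + c(-1, 1) * 1.96 * 0.10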
However, this approach does not work for small sample sizes. Again this can
be seen when we standardise the sampling distribution. When we standardise
the slope for each sample, we subtract the population slope β1 from the sample
slope, and divide each time by the standard error (the standard deviation of the
sampling distribution). But when we do that,

t = \frac{b_1 - \beta_1}{\hat{\sigma}_{b_1}} = \frac{b_1 - \beta_1}{\sqrt{\frac{s^2_R}{s^2_X \times (n-1)}}}    (5.2)
we immediately see the problem that when we only have sample data, we
have to estimate the standard error. In each sample, we get a slightly different
estimated standard error, because each time, the variation in the residuals (s2R )
is a little bit different, and also the variation in the predictor variable (s2X ). If
sample size is large, this is not so bad: we then can get very good estimates of the
standard error so there is little variation across samples. But when sample size
is small, both s2R and s2X are different from sample to sample (due to chance),
and the estimate of the standard error will therefore also vary a lot. The result
is that the distribution of the standardised t-value from Equation 5.2 will only
be close to normal for large sample size, but will have a t-distribution in general.
Because the standard error is based on the variance of the residuals, and
because the variance of the residuals can only be computed if you assume a
certain intercept and a certain slope, the degrees of freedom will be n − 2.
Let’s go back to the example of the beer bottles. In our first random sample
of 200 bottles, we found a sample slope of -0.121. We also happened to know
the population slope, which was 0.001. From our computer experiment, we saw
that the standard deviation of the sample slopes with sample size 200 was equal
to 0.08. Thus, if we fill in the formula for the standardised slope t, we get for
this particular sample

t = \frac{-0.1207 - 0.001}{0.08} \approx -1.52    (5.3)
In this section, when discussing t-statistics, we assumed we knew the population
slope β, that is, the slope of the linear equation based on all 80,000 bottles. In
reality, we never know the population slope: the whole reason to look at the
sample slope is to have an idea about the population slope. Let’s look at how we
can use the sample slope and its standard error to evaluate hypothesised values
for the population slope, and to construct a confidence interval. Suppose, for
instance, that someone hypothesises that the population slope is 0.1. We can
then compute the following t-statistic:
t = \frac{-0.121 - 0.1}{0.08} \approx -2.7    (5.4)
Thus, we compute how many standard errors the sample value is away from
the hypothesised population value 0.1. If the population value is indeed 0.1,
how likely is it that we find a sample slope of -0.121?
From the t-distribution, we know that such a t-value is very unlikely: the
probability of finding a sample slope 2.7 standard errors or more away from a
population slope of 0.1 (above or below) is only about 0.0075. How do we know
that? Well, the t-statistic is -2.7 and the degrees of freedom are 200 − 2 = 198.
The cumulative proportion up to a t-value of -2.7 can be looked up in R; doubling
it gives the two-sided probability of about 0.0075:
pt(-2.7, df = 198)
## [1] 0.003767051
Now let’s ask Martha. She thinks a reasonable value for the population slope
is 0, as she doesn’t believe there is a linear relationship between temperature
and volume. She suspects that the fact that we found a sample slope that was
not 0 was a pure coincidence. Based on that hypothesis, we compute t again
and find:
t = \frac{-0.121 - 0}{0.08} \approx -1.5    (5.5)
In other words, if we believe Martha, our sample slope is only about 1.5
standard errors away from her hypothesised value. Her hypothesis is not
unreasonable, since from the t-distribution we know that the probability of finding
a value more than 1.5 standard deviations away from the mean (above or below)
is about 13.5%. You can see that by asking R:
pt(-1.5, df = 198) * 2
## [1] 0.1352072
Where should we draw the line between reasonable and unreasonable values for
the population slope? Suppose we feel that 1% is a small enough probability;
the corresponding critical t-value (0.5% in each tail) can be found with qt():
qt(0.005, df = 198)
## [1] -2.600887
This is shown in Figure 5.7. So if our sample slope is more than 2.6 standard
errors away from the hypothesised population slope, then that hypothesised
population slope is not a reasonable guess. Conversely, the values that lie within
2.6 standard errors of the sample slope form the collection of reasonable values
for the population slope.
Thus, in our example of the 200 bottles with a sample slope of −0.121 and a
standard error of 0.08, the interval from −0.121−2.6×0.08 to −0.121+2.6×0.08
contains reasonable values for the population slope. If we do the calculations,
we get the interval from −0.33 to 0.09. If we had to guess the value for
the population slope, our guess would be that it lies somewhere between −0.33
and 0.09, if we feel that 1% is a small enough probability.
In data analysis, such an interval that contains reasonable values for the
population value, if we only know the sample value, is called a confidence
interval, as we know from Chapter 2. Here we’ve chosen to use 2.6 standard
errors as our cut-off point, because we felt that 1% would be a small enough
probability to dismiss the real population value as a reasonable candidate (type
I error rate). Such a confidence interval based on this 1% cut-off point is called
a 99% confidence interval.
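A small sketch of this 99% interval in R, using the exact critical t-value rather than the rounded value of 2.6:

# 99% confidence interval for the sample slope of -0.121 (standard error 0.08)
-0.121 + c(-1, 1) * qt(0.995, df = 198) * 0.08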
Particularly in social and behavioural sciences, one also sees 95% confidence
intervals. The critical t-value for a type I error rate of 0.05 and 198 degrees of
freedom is 1.97.
qt(0.975, df = 198)
## [1] 1.972017
Thus, 5% of the observations lie more than 1.97 standard deviations away
from the mean, so that the 95% confidence interval is constructed by subtracting/adding
1.97 standard errors from/to the sample slope. Thus, in the case of our bottle
sample, the 95% confidence interval for the population slope is from −0.121 −
1.97 × 0.08 to −0.121 + 1.97 × 0.08, so reasonable values for the population slope
are those values between −0.28 and 0.04. Luckily, this corresponds to the truth,
because we happen to know that the population slope is equal to 0.001. In real
life, we don’t know the population slope and of course it might happen that the
true population value is not in the 95% confidence interval. If you want to make
the likelihood of this being the case smaller, then you can use a 99%, a 99.9%
or an even larger confidence interval.
Suppose I have a sample of four values: 2, 6, an unknown value Y3, and 2, and I
tell you that the sample mean is 3.75. Then

\bar{Y} = \frac{\sum_i Y_i}{n} = \frac{2 + 6 + Y_3 + 2}{4} = \frac{10 + Y_3}{4} = 3.75
10 + Y_3 = 4 \times 3.75 = 15
Y_3 = 15 - 10 = 5
[Figure 5.8: scatter plot of the data points with the regression line Y = 3.75 + e.]
Once I tell you that the sample mean is 3.75, I am effectively introducing a
constraint. The unknown fourth value is implicitly determined by the other three
values plus the constraint. That is, once the constraint is introduced, there are
only three logically independent pieces of information in the sample. In other
words, there are only three "degrees of freedom" once the sample mean is revealed.
Let’s carry this example to regression analysis. Suppose I have four observations
of variables X and Y, where the values for X are 1, 2, 3 and 4. Each value of
Y is one piece of information. These Y-values could be anything, so we
say that we have 4 degrees of freedom. Now suppose I use a linear model for
these data points, and suppose I only use an intercept. Let the intercept be 3.75,
so that we have Y = 3.75 + e. The first bit of information, the Y-value for X = 1,
could be anything, say 2. The second and third bits of information, the Y-values
for X = 2 and X = 4, could also be anything, say 6 and 2. Figure 5.8 shows these bits of
information as dots in a scatter plot. Since we know that the intercept is equal
to 3.75, with no slope (slope=0), we can also draw the regression line.
Before we continue, you must know that if we talk about degrees of freedom
in regression analysis, we generally talk about residual degrees of freedom. We
therefore look at residuals. If we compute the residuals, we have residuals -1.75,
2.25 and -1.75 for these data points. When we sum them, we get -1.25. Since
we know that all residuals should sum to 0 in a regression analysis (see Chapter
4), we can derive the fourth residual to be +1.25, since only then the residuals
sum to 0. Therefore, the Y -value for the fourth data point (for X = 3) has to
be 5, since then the residual is equal to 5 − 3.75 = 1.25.
In short, when we use a linear model with only an intercept, the degrees of
freedom is equal to the number of data points (combinations of X and Y) minus
1: n − 1.
[Two figures belong here: a scatter plot with the regression line Y = 3 + 1X + e, and a scatter plot with four different candidate lines.]
To prove this requires matrix algebra, but you can see it when you try it yourself.
The gist of it is that if you have a regression equation with both an intercept
and a slope, the degrees of freedom is equal to the number of data points (sample
size) minus 2: n − 2. Generalising this to linear models with K predictors:
n − K − 1.
Generally, these degrees of freedom based on the number of residuals that
could be freely chosen, given the constraints of the model, are termed residual
degrees of freedom. When using regression models, one usually only reports
these residual degrees of freedom. Later on in this book, we will see instances
where one also should use model degrees of freedom. For now, it suffices to know
what is meant by residual degrees of freedom.
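You can check these residual degrees of freedom in R with the four data points from the example above (a small sketch):

# Four data points: X = 1, 2, 3, 4 and Y = 2, 6, 5, 2
x <- c(1, 2, 3, 4)
y <- c(2, 6, 5, 2)
df.residual(lm(y ~ 1))  # intercept only: n - 1 = 3
df.residual(lm(y ~ x))  # intercept and slope: n - 2 = 2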
Figure 5.11: Distribution of the sample slope when the population slope is 0
and sample size equals 40.
In null-hypothesis testing for a slope, the null-hypothesis usually states that the
population slope is equal to 0 and the alternative hypothesis states that there
is a slope that is different from 0.
equal to 0, that is saying that there is no linear relationship between X and
Y (that is, you cannot predict one variable on the basis of the other variable).
Therefore, the null-hypothesis states there is no linear relationship between X
and Y in the population. If there is a slope, whether positive or negative, is
the same as saying there is a linear relationship, so the alternative hypothesis
states that there is a linear relationship between X and Y in the population.
In formula form, we have

H_0: \beta_{slope} = 0    (5.6)
H_A: \beta_{slope} \neq 0    (5.7)
From the simulation in Figure 5.11 we can see that, if the population slope is 0, it
is very unlikely to find a sample slope of say 1 or -1. Thus, with our sample slope of 1, we know
that this finding is very unlikely if we hold the null-hypothesis to be true. In
other words, if the population slope is equal to 0, it would be quite improbable
to find a sample slope of 1 or larger. Therefore, we regard the null-hypothesis to
be false, since it does not provide a good explanation of why we found a sample
slope of 1. In that case, we say that we reject the null-hypothesis.
5.7 p-values
A p-value is a probability. It represents the probability of observing certain
events, given that the null-hypothesis is true.
In the previous section we saw that if the population slope is 0, and we drew
1000 samples of size 40, we did not observe a sample slope of 1 or larger. In
other words, the frequency of observing a slope of 1 or larger was 0. If we drew
more samples, we could theoretically observe a sample slope of 1, but we can
estimate the probability that this happens for any new sample at less than
1 in 1000, so less than 0.001: p < 0.001.
This estimate of the p-value was based on 1000 randomly drawn samples
of size 40 and then looking at the frequency of certain values in that data set.
But there is a short-cut, for we know that the distribution of sample slopes
has a t-distribution if we standardise the sample slopes. Therefore we do not
have to take 1000 samples and estimate probabilities, but we can look at the
t-distribution directly, using tables online or in statistical packages.
Figure 5.12 shows the t-distribution that is the theoretical distribution corresponding
to the histogram in Figure 5.11. If the standard error is equal to 0.19, and the
hypothetical population slope is 0, then the t-statistic associated with a slope of
1 is equal to t = (1 − 0)/0.19 = 5.26. With this value, we can look up in the tables how
often such a value of 5.26 or larger occurs in a t-distribution with 38 degrees of
freedom. In the tables or using R, we find that the probability that this occurs
is 0.00000294.
1 - pt(5.26, df = 38)
## [1] 0.000002939069
So, the fact that the t-statistic has a t-distribution gives us the opportunity
to exactly determine certain probabilities, including the p-value.
Now let’s suppose we have only one sample of 40 bottles, and we find a slope
of 0.1 with a standard error of 0.19. Then this value of 0.1 is (0.1−0)/0.19 = 0.53
standard errors away from 0. Thus, the t-statistic is 0.53. We then look at the
t-distribution with 38 degrees of freedom, and see that such a t-value of 0.53 is
not very strange: it lies well within the middle 95% of the t-distribution (see
Figure 5.12).
Let’s determine the p-value again for this slope of 0.1: we determine the
probability that we obtain such a t-value of 0.53 or larger. Figure 5.13 shows
Figure 5.12: The histogram of 1000 sample slopes and its corresponding
theoretical t-distribution with 38 degrees of freedom. The vertical line represents
the t-value of 5.26.
the area under the curve for values of t that are larger than 0.53. This area
under the curve can be seen as a probability. The total area under the curve of
the t-distribution amounts to 1. If we know the area of the shaded part of the
total area, we can compute the probability of finding t-values larger than 0.53.
In tables online, in Appendix B, or available in statistical packages, we can
look up how large this area is. It turns out to be 0.3.
1 - pt(0.53, df = 38)
## [1] 0.2995977
pt(-0.53, df = 38)
## [1] 0.2995977
Figure 5.15: The blue vertical line represents a t-value of 0.53. The shaded area
represents the two-sided p-value: the probability of obtaining a t-value smaller
than -0.53 or larger than 0.53.
Remember that the null-hypothesis is that the population slope is 0, and the
alternative hypothesis is that the population slope is not 0. We should therefore
conclude that, if we find a very large positive or negative slope (large in the sense
of the number of standard errors away from 0), the null-hypothesis is unlikely
to be true. Therefore, if we find a slope of 0.1 or -0.1, then we should determine
the probability of finding a t-value that is larger than 0.53 or smaller than -0.53.
This probability is depicted in Figure 5.15 and is equal to twice the one-sided
p-value: 2 × 0.2995977 = 0.5991953.
This probability is called the two-sided p-value. This is the one that should
be used, since the alternative hypothesis is also two-sided: the population slope
can be positive or negative. The question now is: is a sample slope of 0.1 enough
evidence to reject the null-hypothesis? To answer that, we determine how
many standard errors away from 0 the sample slope is and we look up in tables
how often that happens. Thus in our case, we found a slope that is 0.53 standard
errors away from 0 and the tables told us that the probability of finding a slope
that is at least 0.53 standard errors away from 0 (positive or negative) is equal
to 0.5991953. We find this probability rather large, so we decide that we do not
reject the null-hypothesis.
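In R, this two-sided p-value can be computed directly (a small sketch):

# Two-sided p-value for a t-value of 0.53 with 38 degrees of freedom
2 * pt(-abs(0.53), df = 38)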
In the previous section, we found a p-value of 0.60. This probability was rather large, and we decided to
not reject the null-hypothesis. In other words, the probability was so large that
we thought that the hypothesis that the population slope is 0 should not be
rejected based on our findings.
When should we think the p-value is small enough to conclude that the
null-hypothesis can be rejected? When can we conclude that the hypothesis
that the population slope is 0 is not supported by our sample data? This was
a question posed to the founding father of statistical hypothesis testing, Sir
Ronald Fisher. In his book Statistical Methods for Research Workers (1925),
Fisher proposed a probability of 5%. He advocated 5% as a standard level for
concluding that there is evidence against the null-hypothesis. However, he did
not see it as an absolute rule: ”If P is between .1 and .9 there is certainly no
reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated
that the hypothesis fails to account for the whole of the facts. We shall not often
be astray if we draw a conventional line at .05...”. So Fisher saw the p-value
as an informal index to be used as a measure of discrepancy between the data
and the null-hypothesis: The null-hypothesis is never proved, but is possibly
disproved.
Later, Jerzy Neyman and Egon Pearson saw the p-value as an instrument
in decision making: is the null-hypothesis true, or is the alternative hypothesis
true? You either reject the null-hypothesis or you don’t, there is nothing in
between. A slightly milder view is that you either decide that there is enough
empirical evidence to reject the null-hypothesis, or there is not enough empirical
evidence to reject the null-hypothesis (not necessarily accepting H0 as true!).
This view of data analysis is rather popular in the social and behavioural
sciences, but also in particle physics. In order to make such black-and-white
decisions, you decide beforehand, that is, before collecting data, what level of
significance you choose for your p-value to decide whether to reject the null-
hypothesis. For example, as your significance level, you might want to choose
1%. Let’s call this chosen significance level α. Then you collect your data, you
apply your linear model to the data, and find that the p-value associated with
the slope equals p. If this p is smaller than or equal to α, you reject the null-
hypothesis, and if p is larger than α then you do not reject the null-hypothesis.
A slope with a p ≤ α is said to be significant, and a slope with a p > α is
said to be non-significant. If the sample slope is significant, then one should
reject the null-hypothesis and say there is a slope in the population different
from zero. If the sample slope is not significant, then one should not reject the
null-hypothesis and say there is no slope in the population (i.e., the slope is 0).
Alternatively, one could say there is no empirical evidence for the existence of a
slope (this leaves the possibility that there is a slope in the population but that
our method of research failed to find evidence for it).
5.9 Inference for linear models in R
So far, we have focused on standard errors and confidence intervals for the slope
parameter in simple regression, that is, a linear model where there is only one
independent variable. However, the same logic can be applied to the intercept
parameter, and to other slope variables in case you have multiple independent
variables in your model (multiple regression).
For instance, suppose we are interested in the knowledge university students
have of mathematics. We start measuring their knowledge at time 0, when the
students start doing a bachelor programme in mathematics. At time 1 (after 1
year) and at time 2 (after 2 years), we measure their knowledge again. Our dependent
variable is mathematical knowledge, a measure with possible values between 200
and 700. The independent variables are time (the time of measurement) and
distance: the distance in kilometers between university and their home. There
are two research questions. The first question is about the level of knowledge when
students enter the bachelor programme, and the second question is how much
knowledge is acquired in one year of study. The linear model is as follows:

knowledge = b0 + b1 × time + b2 × distance + e
Let’s look at the other columns in the regression table. In the second column
we see the standard errors for each parameter. The third column gives statistics;
these are the t-statistics for the null-hypotheses that the respective parameters
in the population are 0. For instance, the first statistic has the value 39.40. It
belongs to the intercept. If the null-hypothesis is that the population intercept
is 0 (β0 = 0), then the t-statistic is computed as
t = \frac{b_0 - \beta_0}{\hat{\sigma}_{b_0}} = \frac{299.35 - 0}{7.60} = \frac{299.35}{7.60} = 39.40    (5.10)
You see that the t-statistic in the regression table is simply the regression
parameter divided by its standard error. This is also true for the slope parameters.
For instance, the t-statistic of 6.96 for time is simply the regression coefficient
18.13 divided by the standard error 2.60:
t = \frac{b_1 - \beta_1}{\hat{\sigma}_{b_1}} = \frac{18.13 - 0}{2.60} = \frac{18.13}{2.60} = 6.96    (5.11)
The last column gives the two-sided p-values for the respective null-hypotheses.
For instance, the p-value of 0.00 for the intercept says that the probability of
finding an intercept at least as far from 0 as 299.35 (in either direction), under
the assumption that the population intercept is 0, is very small (less than 0.01).
If you want to have confidence intervals for the intercept and the slope for
time, you can use the information in the table to construct them yourself. For
instance, according to the table, the standard error for the intercept equals
7.60. Suppose the sample size equals 90 students; then you know that you have
n − K − 1 = 90 − 2 − 1 = 87 degrees of freedom. The critical value for a t-
statistic with 87 degrees of freedom for a 95% confidence interval can be looked
up in Appendix B. It must be somewhere between 1.98 and 2.00, so let’s use
1.99. The 95% interval for the intercept then runs from 299.35 − 1.99 × 7.60
to 299.35 + 1.99 × 7.60, so the expected level of knowledge at the start of
the bachelor programme for students living close to or on campus is somewhere
between 284.23 and 314.47.
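The same interval can be computed in R with the exact critical value (a small sketch using the numbers from the table):

# 95% confidence interval for the intercept: estimate 299.35, SE 7.60, df 87
299.35 + c(-1, 1) * qt(0.975, df = 87) * 7.60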
To show you how this can all be done using R, we have a look at the R dataset
called ”freeny” on quarterly revenues. We would like to predict the variable
market.potential by the predictors price.index and income.level. Apart
from the tidyverse package, we also need the broom package for the tidy()
function. When we run the following code, we obtain a regression table.
library(broom)
data("freeny")
out <- freeny %>%
lm(market.potential ~ price.index + income.level, data = .)
out %>%
tidy()
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.3 0.291 45.6 1.86e-33
## 2 price.index -0.309 0.0263 -11.8 6.92e-14
## 3 income.level 0.196 0.0291 6.74 7.20e- 8
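The confidence bounds shown in the next table can be requested from tidy() directly. A minimal sketch (assuming a 99% confidence level, which matches the bounds below):

out %>%
  tidy(conf.int = TRUE, conf.level = 0.99)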
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.3 0.291 45.6 1.86e-33 12.5 14.1
## 2 price.index -0.309 0.0263 -11.8 6.92e-14 -0.381 -0.238
## 3 income.level 0.196 0.0291 6.74 7.20e- 8 0.117 0.276
In the last two columns we see for example that the 99% confidence interval
for the price.index slope runs from -0.381 to -0.238.
To illustrate the difference between type I and type II errors, let’s recall
the famous fable by Aesop about the boy who cried wolf. The tale concerns a
shepherd boy who repeatedly tricks other people into thinking a wolf is attacking
his flock of sheep. The first time he cries ”There is a wolf!”, the men working in
an adjoining field come to help him. But when they repeatedly find there is no
wolf to be seen, they realise they are being fooled by the boy. One day, when
a wolf does appear and the boy again calls for help, the men believe that it is
another false alarm and the sheep are eaten by the wolf.
In this fable, we can think of the null-hypothesis as the hypothesis that there
is no wolf. The alternative hypothesis is that there is a wolf. Now, when the boy
cries wolf the first time, there is in fact no wolf. The men from the adjoining
field make a type I error: they think there is a wolf while there isn’t. Later,
when they are fed up with the annoying shepherd boy, they don’t react when
the boy cries ”There is a wolf!”. Now they make a type II error: they think
there is no wolf, while there actually is a wolf. See Table 5.4 for the overview.
Table 5.4: Four different scenarios for wolves and men working in the field.

                     Men do not react     Men come to help
No wolf (H0 true)    correct              type I error
Wolf (HA true)       type II error        correct
Let’s now discuss these errors in the context of linear models. Suppose you
want to determine the slope for the effect of age on height in children. Let the
slope now stand for the wolf: either there is no slope (no wolf, H0 ) or there is a
slope (wolf, HA ). The null-hypothesis is that the slope is 0 in the population of
all children (a slope of 0 means there is no slope) and the alternative hypothesis
that the slope is not 0, so there is a slope. You might study a sample of children
and you might find a certain slope. You might decide that if the p-value is below
a critical value you conclude that the null-hypothesis is not true. Suppose you
think a probability of 10% is small enough to reject the null-hypothesis as true.
In other words, if p ≤ 0.10 then we no longer think 0 is a reasonable value for
the population slope. In this case, we have fixed our α or type I error rate to be
α = 0.10. This means that if we study a random sample of children, look at
the slope, and find a p-value of 0.11, then we do not reject the null-hypothesis.
If we find a p-value of 0.10 or less, then we do reject the null-hypothesis.
Note that the probability of a type I error is the same as our α for the
significance level. Suppose we set our α = 0.05. Then for any p-value equal or
smaller than 0.05, we reject the null-hypothesis. Suppose the null-hypothesis is
true, how often do we then find a p-value of 0.05 or smaller? We find such a
p-value if we find a t-value that is more extreme than a certain threshold. For
instance, for the t-distribution with 198 degrees of freedom, the critical value is
±1.97, because only in 5% of the cases do we find a t-value of ±1.97 or beyond if the
null-hypothesis is true. Thus, if the null-hypothesis is true, we see a t-value of at
least ±1.97 in 5% of the cases. Therefore, we see a significant p-value in 5% of
the cases if the null-hypothesis is true. This is exactly the definition of a Type
I error: the probability that we reject the null-hypothesis (finding a significant
p-value), given that the null-hypothesis is true. So we call our α-value the type
I error rate.
Suppose 100 researchers are studying a particular slope. Unbeknownst to
them, the population slope is exactly 0. They each draw a random sample from
the population and test whether their sample slope is significantly different from
0. Suppose they all use different sample sizes, but they all use the same α of
0.05. Then we can expect that about 5 researchers will reject the null-hypothesis
(finding a p-value of 0.05 or smaller) and about 95 will not reject
the null-hypothesis (finding a p-value of more than 0.05).
Fixing the type I error rate should always be done before data collection.
How willing are you to take a risk of a type I error? You are free to make a
choice about α, as long as you do it before looking at the data, and report what
value you used.
If α represents the probability of making a type I error, then we can use β to
represent the opposite: the probability of not rejecting the null-hypothesis while
it is not true (type II error, thinking there is no wolf while there is). However,
setting the β-value prior to data collection is a bit trickier than choosing your α.
It is not possible to compute the probability that we find a non-significant effect
(p > α), given that the alternative hypothesis is true, because the alternative
hypothesis is only saying that the slope is not equal to 0. In order to compute
β, we need to think first of a reasonable size of the slope that we expect. For
example, suppose we believe that a slope of 1 is quite reasonable, given what
we know about growth in children. Let that be our alternative hypothesis:
H0 : β 1 = 0
HA : β1 = 1
Figure 5.16: Different t-distributions of the sample slope if the population slope
equals 0 (left curve in blue), and if the population slope equals 1 (right curve
in red). Blue area depicts the probability that we find a p-value smaller
than 0.10 if the population slope is 0 (α).
Figure 5.17: Different t-distributions of the sample slope if the population slope
equals 0 (left curve in blue), and if the population slope equals 1 (right curve in
red). Shaded area depicts the probability that we find a p-value smaller
than 0.10 if the population slope is 1 (1 − β).
In sum, in this example with an α of 0.10 and assuming a population slope
of 1, we find that the probability of a type II error is 0.86: if there is a slope of
1, then we have an 86% chance of wrongly concluding that the slope is 0.
Type I and II error rates α and β are closely related. If we feel that a
significance level of α = 0.10 is too high, we could choose a level of 0.01. This
ensures that we are less likely to reject the null-hypothesis when it is true. The
critical value for our t-statistic is then equal to ±2.63, see Figure 5.18. In
Figure 5.19 we see that if we change α, we also get a different value for 1 − β,
in this case about 0.02.
Table 5.5 gives an overview of how α and β are related to type I and type
II error rates. If a p-value for a statistical test is equal to or smaller than a
pre-chosen significance level α, the probability of a type I error equals α. The
probability of a type II error is equal to β.
Table 5.5: How α and β are related to type I and type II error rates.

                 Statistical outcome
Truth            p > α        p ≤ α
H0               1 − α        α
HA               β            1 − β
Thus, if we use smaller values for α, we get smaller values for 1−β, so we get
larger values for β. This means that if we lower the probability of rejecting the
null-hypothesis given that it is true (type I error) by choosing a lower value for α,
we inadvertently increase the probability of failing to reject the null-hypothesis
given that it is not true (type II error).
Think again about the problem of the sheep and the wolf. Instead of the
boy, the men could choose to put a very nervous person on watch, someone
very scared of wolves. With the faintest hint of a wolf’s presence, the man will
call out ”Wolf!”. However, this will lead to many false alarms (type I errors),
but the men will be very sure that when there actually is a wolf, they will be
warned. Alternatively, they could choose to put a man on watch that is very
laid back, very relaxed, but perhaps prone to nod off. This will lower the risk of
false alarms immensely (far fewer type I errors), but it will dramatically increase
the risk of a type II error!
One should therefore always strike a balance between the two types of errors.
One should consider how bad it is to think that the slope is not 0 while it is,
and how bad it is to think that the slope is 0, while it is not. If you feel that
the first mistake is worse than the second one, then make sure α is really small,
and if you feel that the second mistake is worse, then make α not too small.
Another option, and a better one, to avoid type II errors, is to increase sample
size, as we will see in the next section.
Figure 5.18: Different t-distributions of the sample slope if the population slope
equals 0 (left curve), and if the population slope equals 1 (right curve). Blue
area depicts the probability that we find a p-value smaller than 0.01 if the
population slope is 0.
Figure 5.19: Different t-distributions of the sample slope if the population slope
equals 0 (left curve in blue), and if the population slope equals 1 (right curve in
red). Red area depicts the probability that we find a p-value smaller than
0.01 if the population slope is 1: 1 − β.
5.11 Statistical power
Null-hypothesis testing only involves the null-hypothesis: we look at the sample
slope, compute the t-statistic and then see how often such a t-value and larger
values occur given that the population slope is 0. Then we look at the p-
value and if that p-value is smaller than or equal to α, we reject the null-
hypothesis. Therefore, null-hypothesis testing does not involve testing the
alternative hypothesis. We can decide what value we choose for our α, but
not our β. The β is dependent on what the actual population slope is, and we
simply don’t know that.
As stated in the previous section, we can compute β only if we have a more
specific idea of an alternative value for the population slope. We saw that we
needed to think of a reasonable value for the population slope that we might
be interested in. Suppose we have the intuition that a slope of 1 could well
be the case. Then, we would like to find a p-value of less than α if indeed the
slope were 1. We hope that the probability that this happens is very high: the
conditional probability that we find a t-value large enough to reject the null-
hypothesis, given that the population slope is 1. This probability is actually
the complement of β, 1 − β: the probability that we reject the null-hypothesis,
given that the alternative hypothesis is true. This 1 − β is often called the
statistical power of a null-hypothesis test. When we think again about the boy
who cried wolf: the power is the probability that the men think there is a wolf
if there is indeed a wolf. The power of a test should always be high: if there is
a population slope that is not 0, then of course you would like to detect it by
finding a significant t-value!
In order to get a large value for 1 − β, we should have large t-values in our
data analysis. There are two ways in which we can increase the value of the
t-statistic. Since with null-hypothesis testing t = (b − 0)/σ̂_b = b/σ̂_b, we can get large
values for t if 1) we have a small standard error σ̂_b, or 2) we have a large
value for b.
Let’s first look at the first option: a small standard error. We get a small
standard error if we have a large sample size, see Section 5.2.1. If we go back to
the example of the previous section where we had a sample size of 102 children
and our alternative hypothesis was that the population slope was 1, we found
that the t-distribution for the alternative hypothesis was centred around 0.5,
because the standard error was 2. Suppose we increased the sample size
to 1200 children; then our standard error might be 0.2. Then our t-distribution
for the alternative hypothesis is centred at 5. This is shown in Figure 5.20.
We see from the shaded area that if the population slope is really 1, there is
a very high chance that the t-value for the sample slope will be larger than 2.58,
the cut-off point for an α of 0.01 and 1198 degrees of freedom. The probability
of rejecting the null-hypothesis while it is not true is therefore very large. This
is our 1 − β and we call this the power of the null-hypothesis test. We see that
with increasing sample size, the power to find a significant t-value increases too.
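A rough sketch of this power calculation in R, using the noncentral t-distribution (the numbers are the ones assumed in the text; the variable names are our own):

# Power for detecting a population slope of 1 with a standard error of 0.2,
# i.e. an expected t-value of 5, at alpha = 0.01 with 1198 degrees of freedom
df   <- 1198
crit <- qt(0.995, df)   # two-sided critical value for alpha = 0.01
ncp  <- 1 / 0.2         # noncentrality: slope divided by standard error
1 - pt(crit, df, ncp = ncp) + pt(-crit, df, ncp = ncp)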
Now let us look at the second option, a large value of b. Sample slope b1
Figure 5.20: Different t-distributions of the sample slope if the population slope
equals 0 (left curve in blue), and if the population slope equals 1 (right curve
in red). Now for a larger sample size. Shaded area depicts the probability that
we find a p-value smaller than 0.01 if the population slope is 1.
depends of course on the population slope β1 . The power becomes larger when
the population slope is further away from zero. If the population slope were
10, and we only had a sample of 102 children (resulting in a standard error of
2), the t-distribution under the alternative hypothesis would be centred around
10/2 = 5, resulting in the same plot as in Figure 5.20, with a large value for
1 − β. Unfortunately, the population slope is beyond our control: it is a given
fact that we cannot change. The only thing we can usually change is the sample
size.
In sum: the statistical power of a test is the probability that the null-
hypothesis is rejected, given that it is not true. This probability is equal to
1 − β. The statistical power of a test increases with sample size, and depends
on the actual population slope. The further away the population slope is from 0
(positive or negative), the larger the statistical power. Earlier we also saw that
1 − β increases with increasing α: the larger α, the higher the power.
Suppose you want to minimise the probability of a type I error, so you choose
an α = 0.01. Next, you think of what kind of population slope you would like
to find, if it indeed has that value. You could perhaps base this expectation
on earlier research. Suppose that you feel that if the population slope is 0.15,
you would really like to find a significant t-value so that you can reject the
null-hypothesis. Next, you have to specify how badly you want to reject the
null-hypothesis if indeed the population slope is 0.15. If the population slope
is really 0.15, then you would like to have a high probability to find a t-value
large enough to reject the null-hypothesis. This is of course the power of the
test, 1 − β. Let’s say you want to have a power of 0.90. Now you have enough
information to calculate how large your sample size should be.
Let’s look at G*Power, an application that can be downloaded from the
web. If we start the app, we can ask for the sample size required for a slope of
0.15, an α of 0.01, and a power (1 − β) of 0.90. Let the standard deviation of
our dependent variable (Y) be 3 and the standard deviation of our independent
variable (X) be 2. You can guess these numbers, preferably based on other
data that were collected earlier. Then we get the input as displayed in Figure
5.21. Note that you should use two-sided p-values, so tails = two. From the
output we see that the required sample size is 1477 children.
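A similar calculation can be done in R. A sketch with the pwr package (this package is not used in the book, and the conversion of the slope to an effect size is our own assumption): a slope of 0.15 with sd(X) = 2 and sd(Y) = 3 implies a correlation of 0.15 × 2/3 = 0.10, so R² = 0.01 and Cohen's f² = 0.01/0.99.

library(pwr)

# Required sample size for one predictor, alpha = 0.01, power = 0.90
f2 <- 0.01 / (1 - 0.01)
pwr.f2.test(u = 1, f2 = f2, sig.level = 0.01, power = 0.90)
# The required sample size is v + 2, which should be close to the 1477
# reported by G*Power.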
Figure 5.21: G*power output for a simple regression analysis.
It is generally much harder to publish a study with a non-significant effect. For that reason, in scientific journals you will find
mostly studies reported with a significant effect. This has led to the file-drawer
problem: the literature reports significant effects for a particular phenomenon,
but there can be many unpublished studies with non-significant effects for the
same phenomenon. These unpublished studies remain unseen in file-drawers
(or these days on hard-drives). So based on the literature there might seem to
exist a particular phenomenon, but if you would put all the results together,
including the unpublished studies, the effect might disappear completely.
Remember that if the null-hypothesis is true and everyone uses an α of 0.05,
then out of 100 studies of the same phenomenon, about 5 studies will show a
significant result and are likely to be published. The remaining 95 or so studies
with non-significant effects are more likely to remain invisible.
As a result of this bias in publication, scientists who want to publish their
results are tempted to fiddle around a bit more with their data in order to get
a significant result. Or, if they obtain a p-value of 0.07, they decide to increase
their sample size, and perhaps stop as soon as the p-value is 0.05 or less. This
horrible malpractice is called p-hacking and is extremely harmful to science. As
we saw earlier, if you want to find an effect and not miss it, you should carry
out a power analysis before you collect the data and make sure that your sample
size is large enough to obtain the power you want to have. Increasing sample
size after you have found a non-significant effect increases your type I error rate
dramatically: if you keep collecting data until you find a significant p-value and
only then stop, the type I error rate is equal to 1!
There have been wide discussions in recent years about the use and
interpretation of p-values. In a formal statement, the American Statistical
Association published six principles that should be well understood by anyone,
including you, who uses them.
The six principles are:
1. p-values can indicate how incompatible the data are with a specified
statistical model (usually the null-hypothesis).
2. p-values do not measure the probability that the studied hypothesis is true,
or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based
only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect
or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding
a model or hypothesis.
These six principles are further explained in the statement online5 . The
bottom line is, p-values have worth but only when used and interpreted in a
proper way. Some disagree. The philosopher of science William Rozeboom
once called NHST ”surely the most bone-headedly misguided procedure ever
institutionalized in the rote training of science students.” The scientific journal
Basic and Applied Social Psychology even banned NHST altogether: t-values
and p-values are not allowed if you want to publish your research in that journal.
5 https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108
Most researchers now realise that reporting confidence intervals is often a
lot more meaningful than reporting whether a p-value is significant or not. A p-
value only says something about evidence against the hypothesis that the slope
is 0. In contrast, a confidence interval gives a whole range of reasonable values
for the population slope. If 0 lies within the confidence interval, then 0 is a
reasonable value; if it is not, then 0 is not a reasonable value so that we can
reject the null-hypothesis.
Using confidence intervals also counters one fundamental problem of null-
hypotheses: nobody believes in them! Remember that the null-hypothesis states
that a particular effect (a slope) is exactly 0: not 0.0000001, not -0.000201, but
exactly 0.000000000000000000000.
Sometimes a null-hypothesis doesn’t make sense at all. Suppose we are
interested to know what the relationship is between age and height in children.
Nobody believes that the population slope coefficient for the regression of height
on age is 0. Why then test this hypothesis? More interesting would be to know
how large the population slope is. A confidence interval would then be much
more informative than a simple rejection of the null-hypothesis.
In some cases, a null-hypothesis can be slightly more meaningful: suppose
you are interested in the effect of cognitive behavioural therapy on depression.
You hope that the number of therapy sessions has a negative effect on the
severity of the depression, but it is entirely possible that the effect is very close to
non-existing. Of course you can only look at a sample of patients and determine
the sample slope. But think now about the population slope: think about all
patients in the world with depression that theoretically could partake in the
research. Some of them have 0 sessions, some have 1 session, and so on. Now
imagine that there are 1 million of such people. How likely is it that in the
population, the slope for the regression is exactly 0? Not 0.00000001, not -
0.0000000002, but exactly 0.0000000000. Of course, this is extremely unlikely.
The really interesting question in such research is whether there is a meaningful
effect of therapy. For instance, an effect of at least half a point decrease on the
Hamilton depression scale for 5 sessions. That would translate to a slope of
−0.5/5 = −0.1. Also in this case, a confidence interval for the effect of therapy on
depression would be more helpful than a simple p-value. A confidence interval
of -2.30 to -0.01 says that a small population effect of -0.01 might be there,
but that an effect of -0.0001 or 0.0000 is rather unlikely. It also suggests that
a meaningful effect of at least -0.1 is quite plausible. You can then conclude that the
therapy is helpful. The p-value less than α only tells you that a value of exactly
0.0000 is not realistic, but who cares.
So, instead of asking research questions like ”Is there a linear relationship
between x and y?” you might ask: ”How large is the linear effect of x on y?”
Instead of a question like ”Is there an effect of the intervention?” it might be
more interesting to ask: ”How large is the effect of the intervention?”
Summarising, remember the following principles when doing your own research
or evaluating the research done by others:
Inferences about population values can be made on the basis of sample data,
but only in probabilistic terms. This means that a simple statement like "the
value of the population slope is definitely not zero" cannot be made. Only
statements like "a population slope of 0 is not very likely given the sample data"
can be made.
Always report your regression slope or intercept, with the standard error
and the sample size. Based on these, the t-statistics can be computed with
the degrees of freedom. Then if several other researchers have done the
same type of research, the results can be combined in a so-called meta-
analysis, so that a stronger statement about the population can be made,
based on a larger total sample size. The standard error and sample size
moreover allow for the construction of confidence intervals. But it is even better
to report confidence intervals yourself.
Using the same reasoning as above, we also know that if 0 is not within the
99% confidence interval, we know that the p-value is smaller than 0.01, and if 0
is not within the 99.9% confidence interval, we know that the p-value is smaller
than 0.001, etcetera.
A 95% confidence interval can therefore also be seen as the range of possible
values for the null-hypothesis that cannot be rejected with an α of 5%. By the
same token, a 99% confidence interval can be seen as the range of possible values
for the null-hypothesis that cannot be rejected with an α of 1%, etcetera.
Y = b0 + b1 X1 + b2 X2 + · · · + e (5.12)
Remember that in Chapter 2 we discussed inference regarding only a mean.
Here we show that inference regarding the mean can also be done within the
linear model framework. In Chapter 2 we wanted to get a confidence interval
for the mean luteinising hormone (LH) for a woman. We had 48 measures
(n = 48) and the sample mean was 2.4. We computed the standard error as
√(s²/n) = 0.0796, so that we could construct a confidence interval using a t-
distribution with 48 − 1 = 47 degrees of freedom. In Chapter 2 we saw that we
can compute a 95% confidence interval for a population mean as the sample mean
plus or minus the critical t-value times the standard error of the mean.
Here we show that the same inference can be done with a very simple version
of the linear model: an intercept-only model. An intercept-only model has only
an intercept and no slopes.
Y = b0 + e (5.13)
e ∼ N (0, σ 2 )
What is the best estimate for b0? Suppose our Y-values are 4, 5 and 6, so that
their mean is 5. If we take the deviations between this mean of 5 and the Y-values,
we get -1, 0 and 1. And these sum to 0. This is true for any set of Y-values.
Thus, we could use the mean of Y as our estimate for b0, since then the deviations
from the mean (i.e., the residuals) sum to 0.
Earlier we said that the unbiased estimator of the population mean is the
sample mean. Therefore, our b0 parameter represents the unbiased estimator of
the population mean of Y . Let’s see if this works by fitting this model in R. In
R, an intercept is indicated by a 1:
library(broom)
data(lh)
out <- lh %>%
lm(lh ~ 1, data = .)
out %>%
tidy(conf.int = TRUE)
## # A tibble: 1 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.4 0.0796 30.1 2.14e-32 2.24 2.56
In the output we only see an intercept. It is equal to 2.4, which is also the
mean of LH as we saw earlier. The standard error is also exactly the same as
we computed by hand (0.0796), as is the 95% confidence interval. We get the
same results, because in both cases, we use exactly the same standard error and
the same t-distribution with n − K − 1 = 48 − 0 − 1 = 47 degrees of freedom
(K equals 0, the number of independent variables).
In summary, inference about the mean of a set of values can be done using
an intercept-only linear model.
Chapter 6
Categorical predictor variables
Table 6.1 shows the anonymised data. There we see the dichotomous variable
seat with values ’aisle’ and ’window’.
With dummy coding, we make a new variable that only has values 0 and 1,
that conveys the same information as the seat variable. The resulting variable
is called a dummy variable. Let’s call this dummy variable window and give it
the value 1 for all persons that travelled in a window seat. We give the value 0
for all persons that travelled in an aisle seat. We can also call the new variable
window a boolean variable with TRUE and FALSE, since in computer science,
TRUE is coded by a 1 and FALSE by a 0. Another name that is sometimes
used is an indicator variable. Whatever you want to call it, the data matrix
including the new variable is displayed in Table 6.2.
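In R, such a dummy variable can be created with mutate() and ifelse(). A small sketch, using the price values that appear in Table 6.3 (the object name paris is our own):

library(tidyverse)

# The bus trip data with a dummy variable for window seats
paris <- tibble(
  person = c("001", "002", "003", "004", "005"),
  seat   = c("aisle", "aisle", "window", "window", "aisle"),
  price  = c(57, 59, 68, 60, 61)
)
paris <- paris %>%
  mutate(window = ifelse(seat == "window", 1, 0))
paris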
What we have done now is coding the old categorical variable seat into a
variable window with values 0 and 1 that looks numeric. Let’s see what happens
if we use a linear model for the variables price (dependent variable) and window
(independent variable). The linear model is:

price = b0 + b1 × window + e
Let’s use the bus trip data and determine the least squares regression line.
We find the following linear equation:
\widehat{\text{price}} = 59 + 5 \times \text{window}    (6.3)
If the variable window has the value 1, then the expected or predicted price
of the bus ticket is, according to this equation, 59 + 5 × 1 = 64. What does
[Figure 6.1: scatter plot of price against the dummy variable window (0 = aisle, 1 = window), with the regression line.]
this mean? Well, all persons who had a window seat also had a value of 1
for the window variable. Therefore the expected price of a window seat equals
64. By the same token, the expected price of an aisle seat (window = 0) is
59 + 5 × 0 = 59, since all those with an aisle seat scored 0 on the window
variable.
You see that by coding a categorical variable into a numeric dummy variable,
we can describe the ’linear’ relationship between the type of seat and the price
of the ticket. Figure 6.1 shows the relationship between the numeric variable
window and the numeric variable price.
Note that the blue regression line goes straight through the mean of the prices
for window seats (window = 1) and the mean of the prices for aisle seats (window
= 0). In other words, the linear model with the dummy variable actually models
the group means of people with window seats and people with aisle seats.
Figure 6.2 shows the same regression line but now for the original variable
seat. Although the analysis was based on the dummy variable window, it is
more readable for others to show the original categorical variable seat.
[Figure 6.2: the same regression line, now plotted against the original categorical variable seat (aisle vs window).]
To check that this line really is the least squares regression line, we can compute the
residuals and the squared residuals. These are displayed in Table 6.3.
Table 6.3: Bus trip to Paris data, together with residuals and squared residuals
from the least squares regression line.
person seat window price e e squared
001 aisle 0.00 57.00 -2.00 4.00
002 aisle 0.00 59.00 0.00 0.00
003 window 1.00 68.00 4.00 16.00
004 window 1.00 60.00 -4.00 16.00
005 aisle 0.00 61.00 2.00 4.00
If we take the sum of the squared residuals we obtain 40. Now if we use
a slightly different slope, so that we no longer go straight through the average
prices for aisle and window seats (see Figure 6.3) and we compute the predicted
values, the residuals and the squared residuals (see Table 6.4), we obtain a
higher sum: 40.05.
Only the least squares regression line goes through the observed average
prices of aisle seats and window seats. Thus, we can use the least squares
regression equation to describe observed group means for categorical variables.
Conversely, when you know the group means, it is very easy to draw the
regression line: the intercept is then the mean for the category coded as 0, and
the slope is equal to the mean of the category coded as 1 minus the mean of the
category coded as 0 (i.e., the intercept). Check Figure 6.1 to verify this yourself.
But we can also show this for a new data set.
We look at results from an experiment to compare yields (as measured by
dried weight of plants) obtained under a control and two different treatment
conditions. Let’s plot the data first, where we only compare the two experimental
conditions, treatment 1 and treatment 2; the data are shown in Figure 6.4.
Figure 6.3: Relation between type of seat and price, with the regression line
being not quite the least squares line.
Table 6.4: Bus trips to Paris, together with residuals and squared residuals from
a suboptimal regression line.
person seat window price wrongpredict e e squared
001 aisle 0.00 57.00 59.10 -2.10 4.41
002 aisle 0.00 59.00 59.10 -0.10 0.01
003 window 1.00 68.00 63.90 4.10 16.81
004 window 1.00 60.00 63.90 -3.90 15.21
005 aisle 0.00 61.00 59.10 1.90 3.61
Using a dummy variable treatment2 (with the value 1 for treatment 2 and 0 for
treatment 1), the linear model for the predicted weight is

\widehat{\text{weight}} = b_0 + b_1 \times \text{treatment2}    (6.4)
If we fill in the dummy variable and the expected weights (the means!), then
we have the linear equations:
4.661 = b0 + b1 × 0 = b0 (6.5)
5.526 = b0 + b1 × 1 = b0 + b1 (6.6)
So from this, we know that the intercept b0 = 4.661, and if we fill that in for the
second equation, we find the slope: b1 = 5.526 − 4.661 = 0.865. The least squares
regression equation is therefore

\widehat{\text{weight}} = 4.661 + 0.865 \times \text{treatment2}    (6.8)
Figure 6.4: Data on yield under two experimental conditions: treatment 1 and
treatment 2.
Since this regression line goes straight through the average yield for each
treatment, we know that this is the least squares regression equation. We could
have obtained the exact same result with a regression analysis using statistical
software. But this was not necessary: because we knew the group means, we
could find the intercept and the slope ourselves by doing the math.
The interesting thing about a dummy variable is that the slope of the
regression line is exactly equal to the difference between the two averages.
If we look at Equation 6.8, we see that the slope coefficient is 0.865 and this is
exactly equal to the difference in mean weight for treatment 1 and treatment
2. Thus, the slope coefficient for a dummy variable indicates how much the
average of the treatment that is coded as 1 differs from the treatment that is
coded as 0. Here the slope is positive so that we know that the treatment coded
as 1 (trt2) leads to a higher average yield than the treatment coded as 0 (trt1).
This makes it possible to draw inferences about differences in group means.
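You can verify this relation between the group means and the regression coefficients in R. A small sketch using the PlantGrowth data that is analysed in the next sections:

# Group means for the two treatment conditions
PlantGrowth %>%
  filter(group != "ctrl") %>%
  group_by(group) %>%
  summarise(mean_weight = mean(weight))
# trt1: 4.661, trt2: 5.526; the difference of 0.865 equals the dummy slope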
6.3 Making inferences about differences in group means
In the previous section we saw that the slope in a dummy regression is equal to
the difference in group means. Suppose researchers are interested in the effects
of different treatments on yield. They’d like to know what the difference is in
yield between treatments 1 and 2, using a limited sample of 20 data points.
Based on this sample, they’d like to generalise to the population of all yields
based on treatments 1 and 2. They adopt a type I error rate of α = 0.05.
The researchers analyse the data and they find the regression table as displayed
in Table 6.5. The 95% confidence interval for the slope is from 0.26 to 1.47.
This means that reasonable values for the population difference between the
two treatments on yield lie within this interval. All these values are positive, so
we reasonably believe that treatment 2 leads to a higher yield than treatment 1.
We know that it is treatment 2 that leads to a higher yield, because the slope
in the regression equation refers to a variable grouptrt2 (see Table 6.5). Thus,
a dummy variable has been created, grouptrt2, where trt2 has been coded as
1 (and trt1 consequently coded as 0). In the next section, we will see how to do
this yourself.
If the researchers had been interested in testing a null-hypothesis about the
differences in mean yield between treatment 1 and 2, they could also use the
95% confidence interval for the slope. As it does not contain 0, we can reject
the null-hypothesis that there is no difference in group means at an α of 5%.
The exact p-value can be read from Table 6.5 and is equal to 0.01.
Thus, based on this regression analysis the researchers can write in a report
that there is a significant difference between the yield after treatment 1 and the
yield after treatment 2, t(18) = 3.01, p = 0.01. Treatment 2 leads to a yield of
about 0.87 (SE = 0.29) more than treatment 1 (95% CI: 0.26 – 1.47).
PlantGrowth %>%
  filter(group != "ctrl") %>%
  lm(weight ~ group, .) %>%
  tidy()
In this code, we take the PlantGrowth data frame that is available in R, we
omit the data points from the control group (because we are only interested in
the two treatment groups), and we model weight as a function of group. What
then happens depends on the data type of group. Let’s take a quick look at
the variables:
PlantGrowth %>%
select(weight, group) %>%
str()
We see that the dependent variable weight is of type numeric (num), and that
the independent variable group is of type factor. If the independent variable
is of type factor, R will automatically make a dummy variable for the factor
variable. This will not happen if the independent variable is of type numeric.
So here group is a factor variable. Below we see the regression table that
results from the linear model analysis.
data("PlantGrowth")
out <- PlantGrowth %>%
filter(group != "ctrl") %>%
lm(weight ~ group, .)
out %>%
tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.66 0.203 22.9 8.93e-15
## 2 grouptrt2 0.865 0.287 3.01 7.52e- 3
We no longer see the group variable, but we see a new variable called
grouptrt2. Apparently, this new variable was created by R to deal with the
group variable being a factor variable. The slope value of 0.865 now refers to
the effect of treatment 2, that is, treatment 1 is the reference category and the
value 0.865 is the added effect of treatment 2 on the yield. We should therefore interpret these results as follows: in the sample data, the mean of the treatment 2 data points was 0.865 higher than the mean of the treatment 1 data points.
Here, R automatically picked the treatment 1 group as the reference group.
In case you want to have treatment 2 as the reference group, you could make your
own dummy variable. For instance, make your own dummy variable grouptrt1
in the following way and check whether it is indeed stored as numeric in R:
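One minimal way to create it, sketched here with ifelse() (the exact coding is an assumption; it gives trt1 a 1 and every other group a 0, so that trt2 becomes the reference among the two treatment groups):

PlantGrowth <- PlantGrowth %>%
  mutate(grouptrt1 = ifelse(group == "trt1", 1, 0))  # numeric dummy: 1 for trt1, 0 otherwise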
PlantGrowth %>%
select(weight, group, grouptrt1) %>%
str()
Next, you can run a linear model with the grouptrt1 dummy variable that
you created yourself:
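One possible call (a sketch, since the book's own code is not shown in this fragment; the control group is again excluded, as in the earlier analysis):

PlantGrowth %>%
  filter(group != "ctrl") %>%           # keep only the two treatment groups
  lm(weight ~ grouptrt1, data = .) %>%  # use the hand-made dummy as predictor
  tidy()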
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.53 0.203 27.2 4.52e-16
## 2 grouptrt1 -0.865 0.287 -3.01 7.52e- 3
The results now show the effect of treatment 1, with treatment 2 being the reference category. Of course the effect of -0.865 is now the opposite of the effect that we saw earlier (+0.865), when the reference category was treatment 1. The intercept has also changed, as the intercept is now the expected weight for the other treatment group. In other words, the reference group has changed: the intercept is now equal to the expected weight of the treatment 2 group.
In general, store variables that are essentially categorical as factor variables
in R. For instance, you could have a variable group that has two values 1 and
2 and that is stored as numeric. It would make more sense to first turn this
variable into a factor variable, before using this variable as a predictor in a
linear model. You could turn the variable into a factor only for the analysis and leave the data frame unchanged, like this:
model <- dataset %>%
  lm(y ~ factor(group), data = .)
or change the data frame before the analysis:

dataset <- dataset %>%
  mutate(group = factor(group))
That is, your independent variables are the dummy variable window (1
coding for window seat, 0 coding for aisle seat) and the numeric variable legroom.
Table 6.6: Regression table for the regression of price on the dummy variable
window and the numeric variable legroom.
term          estimate  std.error  statistic  p.value
(Intercept)      41.50       9.71       4.27     0.05
window            5.00       2.50       2.00     0.18
legroom           0.25       0.14       1.83     0.21
When we look at the output, we see the regression table in Table 6.6. When
we fill in the coefficients, we obtain the following linear equation:

$$\widehat{\text{price}} = 41.50 + 5.00 \times \text{window} + 0.25 \times \text{legroom}$$
Figure 6.5: The bus trip to Paris data, with the predictions from a linear model
with legroom and window as independent variables.
That is, the regression line for window seats has an intercept that is different:
it is equal to the original intercept plus the slope of the window variable, 41.5 +
5 = 46.5. On the other hand, the slope for legroom is unchanged. With the
same slope for legroom, the two regression lines are therefore parallel.
The second thing you should notice from Figure 6.5 is that these two regression
lines are not the least squares regression lines for window and aisle seats respectively.
For instance, the regression line for window seats (the top one) should be more
positive in order to minimise the difference between the data points and the
regression line (the residuals). On the other hand, the regression line for aisle
seats (the bottom one) should be less steep in order to have smaller residuals.
Why is this so? Shouldn’t the regression lines minimise the residuals?
Yes they should! But there is a problem, because the model also implies, as
we saw above, that the lines are parallel. Whatever values we choose for the multiple regression equation, the regression lines for aisle and window seats will always be parallel. And under that constraint, the current parameter values give the lowest possible value for the sum of the squared residuals, that is, the sum of the squared residuals for both regression lines taken together. The
aisle seat regression line should be less steep, and the window seat regression
line should be steeper to have a better fit with the data, but taken together, the
estimated slope of 0.25 gives the lowest overall sum of squared residuals.
It is possible though to have linear models where the lines are not parallel.
This will be discussed in Chapter 9.
Table 6.7: Height across three different countries.
ID Country height
001 A 120
002 A 160
003 B 121
004 B 125
005 C 140
... ... ...
Table 6.8: Height across three different countries with dummy variables.
ID Country height countryA countryB
001 A 120 1 0
002 A 160 1 0
003 B 121 0 1
004 B 125 0 1
005 C 140 0 0
... ... ... ... ...
Individuals from country C score a 0 on countryA and a 0 on countryB. Therefore a third dummy variable countryC is not necessary (i.e., it is redundant): the two dummy variables give us all the country information we need.
Remember that with two categories, you only need one dummy variable,
where one category gets 1s and another category gets 0s. In this way both
categories are uniquely identified. Here with three categories we also have unique
codes for every category. Similarly, if you have 4 categories, you can code this
with 3 dummy variables. In general, when you have a variable with K categories,
you can code them with K − 1 dummy variables.
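The analysis below uses two hand-made dummy variables, treatment_1 and treatment_2, for the full PlantGrowth data. A minimal sketch of how they could be created (the ifelse() coding, with the control group as the reference, is an assumption):

PlantGrowth <- PlantGrowth %>%
  mutate(treatment_1 = ifelse(group == "trt1", 1, 0),   # 1 for trt1, 0 otherwise
         treatment_2 = ifelse(group == "trt2", 1, 0))   # 1 for trt2, 0 otherwise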
PlantGrowth %>%
lm(weight ~ treatment_1 + treatment_2, data = .) %>%
tidy()
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.03 0.197 25.5 1.94e-20
## 2 treatment_1 -0.371 0.279 -1.33 1.94e- 1
## 3 treatment_2 0.494 0.279 1.77 8.77e- 2
You now have two numeric variables that you use in an ordinary multiple regression analysis. We see the effects (the 'slopes') of the two dummy variables. Based on these slopes and the intercept, we can construct the linear equation for the relationship between treatment and weight (yield):

$$\widehat{\text{weight}} = 5.03 - 0.37 \times \text{treatment\_1} + 0.49 \times \text{treatment\_2}$$

Based on this we can make predictions for the mean weight in the control group, the treatment 1 group and the treatment 2 group.
Control group specimens score 0 on variable treatment_1 and 0 on variable treatment_2. Therefore, their predicted weight equals:

$$\widehat{\text{weight}} = 5.03 - 0.37 \times 0 + 0.49 \times 0 = 5.03$$
Instead of creating the dummy variables ourselves, we can also let R do the dummy coding, by using the factor variable group as the independent variable for the full data set:

PlantGrowth %>%
  lm(weight ~ group, data = .) %>%
  tidy()

## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     5.03     0.197     25.5  1.94e-20
## 2 grouptrt1      -0.371    0.279     -1.33 1.94e- 1
## 3 grouptrt2       0.494    0.279      1.77 8.77e- 2
The regression table now looks slightly different: all values are the same as
in the analysis where we created our own dummies, except that R has used
different names for the dummy variables it created. The parameter values are
exactly the same as in the previous analysis, because in both analyses the control
condition is the reference category. This control condition is chosen by R as the
reference category, because alphabetically, ctrl comes before trt1 and trt2.
Thus, the new variable grouptrt1 codes 1s for all observations where group
= trt1 and 0s otherwise, and the new variable grouptrt2 codes 1s for all
observations where group = trt2 and 0s otherwise.
The slopes of these two dummy variables are therefore the differences between the mean weights of the treatment 1 and treatment 2 groups and the mean weight of the control group, which serves as the reference group. The t-values and p-values relate to null-hypothesis tests of whether these differences are 0 in the population.
As an example, suppose that we want to estimate the difference in mean
weight between plants from the treatment 1 group relative to the control group.
From the output, we see that our best guess for this difference (the least squares estimate) equals -0.37: the yield is lower under treatment 1 than under control conditions. The standard error for this difference equals 0.28. So a rough indication for the 95% confidence interval would be from −0.37 − 2 × 0.28 to −0.37 + 2 × 0.28, that is, from −0.93 to 0.19. Therefore, we infer that in the population, the difference lies somewhere between −0.93 and 0.19.
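This back-of-the-envelope interval can be checked with one line of R (a sketch, not code from the book):

-0.371 + c(-2, 2) * 0.279   # rough 95% CI: approximately -0.93 to 0.19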
If we would want to, we could perform three null-hypothesis tests based on
this output: 1) whether the population intercept equals 0, that is, whether the
population mean of the control group equals 0; 2) whether the slope of the
treatment 1 dummy variable equals 0, that is, whether the difference between
the population means of treatment 1 group and the control group is 0; and 3)
whether the slope of the treatment 2 group dummy variable equals 0, that is,
whether the difference between the population means of the treatment 2 group
and the control group is 0.
Obviously, the first hypothesis is not very interesting: we're not interested in knowing whether the average weight in the control group equals 0. But the other two null-hypotheses could be interesting in some scenarios. What is missing from the table is a test for the null-hypothesis that the means of the two treatment conditions are equal. This could be solved by manually creating two other
dummy variables, where either treatment 1 or 2 is the reference group, or by
looking at tricks in Chapter ??. But what is also missing is a test for the null-
hypothesis that all three population means are equal. In order to do that, we
first need to explain analysis of variance.
If all group means are equal in the population, then all population slopes
would be 0. We want to test this null-hypothesis with a linear model in R. We
then have only one independent variable, group, and if we let R do the dummy
coding for us, R can give us an Analysis of Variance. We do that in the following
way:
out <- PlantGrowth %>%
lm(weight ~ group, data = .)
out %>%
anova() %>%
tidy()
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 group 2 3.77 1.88 4.85 0.0159
## 2 Residuals 27 10.5 0.389 NA NA
To test whether we can reject the null-hypothesis that all three population means are equal, we look at the F-value for the group variable. The F-value equals 4.85. This is
the ratio of the mean square of the group effect, which is 1.88, and the mean
square of the residuals (error), which is 0.389. Thus, the F-value for group is computed as $\frac{1.88}{0.389} = 4.85$. Under the null-hypothesis that all three population
means are equal, this ratio is around 1. Why this is so, we will explain later.
Here we see that the F -value based on these sample data is larger than 1. But
is it large enough to reject the null-hypothesis? That depends on the degrees of
freedom. The F -value is based on two mean squares, and these in turn are based
on two separate numbers of degrees of freedom. The one for the effect of group was 2 (3 groups, so 2 degrees of freedom), and the one for the residual mean
square was 27 (27 residual degrees of freedom). We therefore have to look up in
a table whether an F -value of 4.85 is significant at 2 and 27 degrees of freedom
for a specific α. Such a table is displayed in Table 6.9. It shows critical values if
your α is 0.05. In the columns we look up our model degrees of freedom. Model
degrees of freedom is computed based on the number of independent variables.
Here we have a categorical variable group. But because this categorical variable
is represented in the analysis as two dummy variables, the number of variables
is actually 2. The model degrees of freedom is therefore 2.
In the rows of Table 6.9 we look up our residual degrees of freedom: 27.
For 2 and 27 degrees of freedom we find a critical F -value of 3.35. It means
that if we have an α of 0.05, an F -value of 3.35 or larger is large enough to
reject the null-hypothesis. Here we found an F -value of 4.85, so we reject the
null-hypothesis that the three population means are equal. Therefore, the mean
weight is not the same in the three experimental conditions.
Table 6.9: Critical values for the F -value if α = 0.05, for different model degrees
of freedom (columns) and error degrees of freedom (rows).
1 2 3 4 5 10 25 50
5 6.61 5.79 5.41 5.19 5.05 4.74 4.52 4.44
6 5.99 5.14 4.76 4.53 4.39 4.06 3.83 3.75
10 4.96 4.10 3.71 3.48 3.33 2.98 2.73 2.64
27 4.21 3.35 2.96 2.73 2.57 2.20 1.92 1.81
50 4.03 3.18 2.79 2.56 2.40 2.03 1.73 1.60
100 3.94 3.09 2.70 2.46 2.31 1.93 1.62 1.48
Note that our null-hypothesis that all group means are equal in the population cannot be tested based on a regression table. If the population means are all equal, then the slope parameters should consequently be 0 in the population. Let's have a look again at the regression table, now also showing the 95% confidence intervals:
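One way to obtain such a table is to ask tidy() for confidence intervals (a sketch; out is the model object fitted above):

out %>%
  tidy(conf.int = TRUE)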
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.03 0.197 25.5 1.94e-20 4.63 5.44
## 2 grouptrt1 -0.371 0.279 -1.33 1.94e- 1 -0.943 0.201
## 3 grouptrt2 0.494 0.279 1.77 8.77e- 2 -0.0780 1.07
This hypothesis test is very different from the t-tests in the regression table. The t-test for the grouptrt1 effect specifically tests whether the average weight in the treatment 1 group is different from the average weight in the control group (the reference group). The t-test for the grouptrt2 effect specifically tests whether the average weight in the treatment 2 group is different from the average weight in the control group (the reference group). Since these two hypotheses
do not refer to our original research question regarding overall differences across
all three groups, we do not report these t-tests, but we report the overall F -test
from the ANOVA table.
In general, the rule is that if you have a specific research question that
addresses a particular null-hypothesis, you only report the statistical results
regarding that null-hypothesis. All other p-values that your software happens
to show in its output should be ignored. We will come back to this issue in
Chapter ??.
6.9 The logic of the F -statistic
As stated earlier, the ANOVA is an alternative way of representing the linear
model. Suppose we have a dependent variable Y , and three groups: A, B and
C. In the usual linear model, we have an intercept b0 , and we use two dummy
variables. Suppose we use C as our reference group, then we need two dummy
variables for groups A and B. We could then model the data using the following equation, with normally distributed errors:

$$Y = b_0 + b_1 \times \text{dummy}_A + b_2 \times \text{dummy}_B + e, \quad e \sim N(0, \sigma^2)$$
This is the linear model as we know it. The linear equation has three
unknown parameters that need to be estimated: one intercept and two dummy
effects. The dummy effects are the differences between the means of groups A
and B relative to reference group C.
Alternatively, we could represent the same data as follows:

$$Y = \mu_A \times \text{dummy}_A + \mu_B \times \text{dummy}_B + \mu_C \times \text{dummy}_C + e, \quad e \sim N(0, \sigma^2)$$
That is, instead of estimating one intercept and two dummy effects, we
simply estimate the three population means directly! We leave out the intercept,
and we estimate three population means.
Next, we focus on the variance of the dependent variable, Y in this case,
that is split up into two parts: one part that is explained by the independent
variable (groups in this case) and one part that is not explained (cf. Chapter 4). The unexplained part is easiest of course: that is simply the part shown by the residuals, hence $\sigma^2$.
The logic of the F-statistic is entirely based on this $\sigma^2$. As stated earlier,
under the null-hypothesis the F -statistic should have a value of around 1. This
is because F is a ratio and under the null-hypothesis, the numerator and the
denominator of this ratio should be more or less equal. This is so because
both the numerator and the denominator are estimators of σ 2 . Under the null-
hypothesis, these estimators should result in more or less the same numbers,
and then the ratio is more or less 1. If the null-hypothesis is not true, then the
numerator becomes larger than the denominator and hence the F -value becomes
larger than 1.
In the previous section we saw that the numerator of the F -statistic was
computed by taking the sum of squares of the group variable and dividing it
by the degrees of freedom. What is actually being done is the following: If the
null-hypothesis is really true, then the three population means are equal, and
you simply have three independent samples from the same population. Each
sample mean shows simply a slight deviation from the population mean.
This variance of sample means should remind us of something. If we go
back to Chapter 2, we saw there that if we have a population with mean µ and
variance σ 2 , and if we draw many many random samples of size n and compute
sample means for each sample, their distribution is a sampling distribution (Fig.
2.2). We also saw in Chapter 2 that on average the sampling distribution will
show a mean that is the same as the population mean: the sample mean is an
unbiased estimator of the population mean. And, important for ANOVA, the
standard deviation of the sampling distribution, known as the standard error, will be equal to $\sigma_{\bar{Y}} = \sqrt{\frac{s^2}{n}}$ (Chapter 2). If we take the square, we see that the variance of the sample means is equal to

$$\sigma_{\bar{Y}}^2 = \frac{s^2}{n} \tag{6.20}$$

where $s^2$ is the unbiased estimator of the variance of Y.
In the ANOVA model above, we have three group means. Now, suppose we
have an alternative model, under the null-hypothesis, that there is really only
one population mean µ, and that the observed different group means in groups
A, B and C are only the result of chance (random sampling). Then the variance
of the group means is nothing but the square of the standard error, and the
number of observations per group is the sample size n. If that is the case, then
we can flip the equation of the standard error around and say:
$$\widehat{\sigma^2} = \sigma_{\bar{Y}}^2 \times n = \frac{SS}{2} \times n \tag{6.21}$$
or in words: our estimate of the total variance of Y in the population is the
estimated variance of the group means in the population times the number of
observations per group.
So the numerator is one estimator of the variance of the residuals. For
that estimator we only used information about the group means: we looked
at variation between groups. Now let’s look at the denominator. For that
estimator we use information from the raw data and how they deviate from
the sample group means, that is we look at within-group variation. Similar to
regression analysis, for each observed value, we compute the difference between
the observed value and the group mean. We then compute the sum of squared
residuals, SSR. If we want the variance, we need to divide this by sample size,
n. However, if we want to estimate the variance in the population, we need to
divide by a corrected n. In Chapter 2 we saw that if we wanted to estimate a
variance in the population on the basis of one sample with one sample mean,
we used $s^2 = \frac{SS}{n-1}$. The $n - 1$ was in fact due to the loss of 1 degree of freedom
because by computing the sample variance, we used the sample mean, which was
only an estimate of the population mean. Here, because we have three groups,
we need to estimate three population means, and the degrees of freedom is
therefore n − 3. The estimated variance in the population that is not explained
by the independent variable is therefore SSR/(n − 3).
Figure 6.6: Illustration of ANOVA using a very small data set. In grey the raw
data, in black the overall sample mean, and in red the sample group means.
Suppose we have the small data set shown in Figure 6.6. In group A we see the values 4, 2 and 1, in group B the values 4, 2 and 2, and in group C we see the values 2, 1 and 1. When we sum all these values
and divide by 9, we get the overall mean (the grand mean), which is equal to
$\bar{Y} = 2.1111111$, denoted in black in Figure 6.6. In red, we see the sample group
means. For group A, that is equal to (4+2+1)/3 = 2.3333333, for group B this is
(4+2+2)/3 = 2.6666667, and for group C this equals (2+1+1)/3 = 1.3333333.
Thus, our ANOVA model for these data is the following:

$$Y = \mu_A \times \text{dummy}_A + \mu_B \times \text{dummy}_B + \mu_C \times \text{dummy}_C + e, \quad e \sim N(0, \sigma^2) \tag{6.22}$$

Our OLS estimates for the parameters are the sample means, so that we have the linear equation

$$\widehat{Y} = 2.33 \times \text{dummy}_A + 2.67 \times \text{dummy}_B + 1.33 \times \text{dummy}_C \tag{6.23}$$
Based on this linear equation we can determine the predicted values for each
data point. Table 6.10 shows the Y -values, the group variable, the dummy
variables from the ANOVA model equation (Equation 6.22) and the predicted
values. We see that the predicted value for each observed value is equal to the
sample group mean.
Table 6.10: Small data example for illustrating ANOVA and the F -statistic.
Y group dummy A dummy B dummy C predicted residual
1 A 1 0 0 2.33 -1.33
2 A 1 0 0 2.33 -0.33
4 A 1 0 0 2.33 1.67
2 B 0 1 0 2.67 -0.67
2 B 0 1 0 2.67 -0.67
4 B 0 1 0 2.67 1.33
2 C 0 0 1 1.33 0.67
1 C 0 0 1 1.33 -0.33
1 C 0 0 1 1.33 -0.33
Using these predicted values, we can compute the residuals, also displayed
in Table 6.10, and these help us to compute the first estimate of σ 2 , the one
based on residuals, namely the SSR divided by the degrees of freedom. If we
square the residuals in Table 6.10 and sum them, we obtain SSR = 8. To obtain
the Mean Squared Error (MSE or meansq for Residuals), we divide the SSR by
the degrees of freedom. Because the linear model with 2 dummy variables has $n - K - 1 = 9 - 2 - 1 = 6$ residual degrees of freedom (see Chapter 5), we divide by 6. Thus we get $MSE = 8/6 = 1.3333333$.
We can see these numbers in the bottom row in the ANOVA table, displayed in
Table 6.11.
For our second estimate of σ 2 , the one based on the group means, we look
at the squared deviations of the group means from the overall mean (the grand
mean). We saw that the grand mean equals 2.11. The sample mean for group
A was 2.3333333, so the squared deviation equals 0.0493828. The sample mean
for group B was 2.6666667, so the squared deviation equals 0.3086421. Lastly,
the sample mean for group C was 1.3333333, so the squared deviation equals 0.6049383. Adding these squared deviations gives a sum of squares of 0.962963. To obtain an unbiased estimate for the population variance of these means, we have to divide this sum of squares by the number of groups minus 1 (model degrees of freedom), thus we get 0.962963/2 = 0.4814815. This we must multiply
by the sample size per group to obtain an estimate of σ 2 (see Equation 6.21),
thus we obtain 1.4444444.
Table 6.12: Small data example for illustrating ANOVA and the F -statistic.
Y group predicted grand mean deviation sq deviation
1 A 2.33 2.11 0.22 0.05
2 A 2.33 2.11 0.22 0.05
4 A 2.33 2.11 0.22 0.05
2 B 2.67 2.11 0.56 0.31
2 B 2.67 2.11 0.56 0.31
4 B 2.67 2.11 0.56 0.31
2 C 1.33 2.11 -0.78 0.60
1 C 1.33 2.11 -0.78 0.60
1 C 1.33 2.11 -0.78 0.60
Obtaining the estimate of σ 2 based on the group means can also be illustrated
using Table 6.12. There again we see the raw data values for variable Y , the
predicted values (the group means), but now also the grand mean, the deviations
of the sample means from the grand mean, and their squared values. If we simply add these squared deviations, we no longer have to multiply by the sample size, because each group's squared deviation is now counted once for every observation in that group. Thus we have as the sum of squares 2.8888889. Then we only have to divide by the
number of groups minus 1, so we have 2.8888889/2 = 1.4444444. This sum of
squares, the degrees of freedom of 2, and the resulting MS can also be seen in
the ANOVA table in Table 6.11.
Hence we have two estimates of σ 2 , the one called the Mean Squared Error
(MSE) that is based on the residuals (sometimes also called the MS within or
MSW), and the other one called the Mean Squared Between groups (MSB), that
is based on the sum of squares of group mean differences. For the F -statistic, we
use the MS Between (MSB) as the numerator and the MSE as the denominator,
$$F = \frac{MSB_{group}}{MSE} = \frac{1.4444444}{1.3333333} = 1.0833333 \tag{6.24}$$
We see that the F -statistic is larger than 1. That means that the estimate
for σ 2 , M SBgroup , based on the sample means is larger than the estimate based
on the residuals, M SE. This could indicate that the null-hypothesis, that the
three population means are equal, is not true. However, is the F -value really
large enough to justify such a conclusion?
To answer that question, we need to know what values the F -statistic would
take for various data sets if the null-hypothesis were true (the sampling distribution
of F ). If for each data set we have three groups, each consisting of three observed
values, then we have 2 degrees of freedom for the group effect, and 6 residual
degrees of freedom. Table 6.9 shows critical values if we want to use an α of
0.05. If we look up the column with a 2 (for the number of model degrees of
freedom) and the row with a 6 (for the residual degrees of freedom), we find
a critical F -value of 5.14. This means that if the null-hypothesis is true and
we repeatedly take random samples, we find an F -value equal to or larger than
5.14 only 5% of the time. If we want to reject the null-hypothesis, therefore, at
an alpha of 5%, the F -value has to be equal or larger than 5.14. Here we found
an F -value of only 1.0833333, which is much smaller, so we cannot reject the
null-hypothesis that the means are equal.
For illustration, Figure 6.7 shows the distribution of the F-statistic with 2 and 6 degrees of freedom under the null-hypothesis. The figure shows that, under the null-hypothesis, an F-statistic of 1.0833333 or larger occurs quite often.
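The critical value and the exact proportion can also be obtained from the F-distribution functions in base R (a sketch, not code from the book):

qf(0.95, df1 = 2, df2 = 6)                           # critical F-value for alpha = 0.05: about 5.14
pf(1.0833333, df1 = 2, df2 = 6, lower.tail = FALSE)  # proportion of F-values at least this large under H0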
Always check the degrees of freedom for your F -statistic carefully. The first
number refers to the degrees of freedom for the Mean Square Between: this is
the number of groups minus 1 (K − 1). This is equal to the number of dummy
Figure 6.7: Density plot of the F -distribution with 2 and 6 degrees of freedom.
In blue the observed F -statistic in the small data example, in red the critical
value for an α of 0.05. The blackened area under the curve is 5%.
variables that are used in the linear model. This is also called the model degrees of freedom. The second number refers to the residual degrees of freedom: this is $n - K - 1$, as we saw in Chapter 5, where K is the number of dummy variables.
In this ANOVA model you have 9 data points and you have 2 dummy variables
for the three groups. So your residual degrees of freedom is 9 − 2 − 1 = 6.
This residual degrees of freedom is equal to that of the t-statistic for multiple
regression.
Figure 6.8: The vertical line represents a t-value of -2.40. The shaded area
represents the extreme 5% of the possible t-values
Suppose you randomly draw one million values from a t-distribution with 40 degrees of freedom and take the square of each value (thus, suppose as the first 3 randomly drawn t-values you get -3.12, 0.14, and -1.67; you then square these numbers to get the numbers 9.73, 0.02, and 2.79). If you then make a density plot of these one million squared
numbers, you get the density plot in Figure 6.9. It turns out that this density
is an F -distribution with 1 model degrees of freedom and 40 residual degrees of
freedom.
If we also square the observed test statistic t-value of -2.40, we obtain an
F-value of 5.76. From online tables, we know that, with 1 model degree of freedom and 40 residual degrees of freedom, the proportion of F-values larger than 5.76 equals 0.02. The proportion of t-values with 40 (residual) degrees of freedom that are larger than 2.40 or smaller than -2.40 is also 0.02. Thus, the two-sided p-value associated with a certain t-value is equal to the p-value associated with an F-value that is the square of that t-value.
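This equivalence is easy to verify in R (a sketch): both calls below give the same p-value.

2 * pt(-2.40, df = 40)                               # two-sided p-value for t = -2.40
pf((-2.40)^2, df1 = 1, df2 = 40, lower.tail = FALSE) # p-value for F = 5.76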
Figure 6.9: The F -distribution with 1 model degrees of freedom and 40 error
degrees of freedom. The shaded area is the upper 5% of the distribution. The
vertical line represents the square of -2.40: 5.76
Chapter 7

Assumptions of linear models
7.1 Introduction
Linear models are models. A model describes the relationship between two or
more variables. A good model gives a valid summary of what the relationship
between the variables looks like. Let’s look at a very simple example of two
variables: height and weight. In a sample of 100 children from a distant country,
we find 100 combinations of height in centimetres and weight in kilograms that
are depicted in the scatter plot in Figure 7.1.
We’d like to find a linear model for these data, so we determine the least
squares regression line. We also determine the standard deviation of the residuals
so that we have the following statistical model:
This model, defined above, is depicted in Figure 7.2. The blue line is the
regression line, and the dots are the result of simulating (inventing) independent
normal residuals with standard deviation 4.04. The figure shows how the data would look according to the model.
The actual data, displayed in Figure 7.1, might have arisen from the model in Figure 7.2. The data are only different from the simulated data because of the randomness of the residuals.
A model should be a good model for two reasons. First, a good model is a
summary of the data. Instead of describing all 100 data points on the children,
we could summarise these data with the linear equation of the regression line and
the standard deviation (or variance) of the residuals. The second reason is that
you would like to infer something about the relationship between height and
Figure 7.1: Scatter plot of height and weight in the 100 children.
Figure 7.2: Data set on height and weight in 100 children and the least squares
regression line.
weight in all children from that distant country. It turns out that the standard
error, and hence the confidence intervals and hypothesis testing, are only valid
if the model describes the data well. This means that if the model is not a good description of your sample data, you may draw the wrong conclusions about the population.
For a linear model to be a good model, there are four conditions that need to be fulfilled.

1. independence The residuals are independent of each other
2. linearity The effects of the independent variables are linear and additive
3. equal variance The residuals have equal variance (also called homoskedasticity)
4. normality The residuals are normally distributed
If these conditions (often called assumptions) are not met, the inference
with the computed standard error is invalid. That is, if the assumptions are
not met, the standard error should not be trusted, or should be computed using
alternative methods.
Below we will discuss these four assumptions briefly. For each assumption,
we will show that the assumption can be checked by looking at the residuals. We
will see that if the residuals do not look right, one or more of the assumptions
are violated. But what does it mean that the residuals 'look right'?
Well, the linear model says that the residuals have a normal distribution. So
for the height and weight data, let’s apply regression, compute the residuals for
all 100 children, and plot their distribution with a histogram, see Figure 7.3.
The histogram shows a bell-shaped distribution with one peak that is more or
less symmetric. The symmetry is not perfect, but you can well imagine that if
we had measured more children, the distribution could more and more resemble
a normal distribution.
Another thing the model implies is that the residuals are random: they are
random draws from a normal distribution. This means, if we would plot the
residuals, we should see no systematic pattern in the residuals. The scatter
plot in Figure 7.4 plots the residuals in the order in which they appear in the
data set. The figure seems to suggest a random scatter of dots, without any
kind of system or logic. We could also plot the residuals as a function of the predicted weight (the dependent variable). This is the most usual way to check for any systematic pattern. Figure 7.5 shows there is no systematic relationship between the predicted weight of a child and the residual.
When it looks like this, it shows that the residuals are randomly scattered around the regression line (the predicted weights). Taken together, Figures 7.3,
7.4 and 7.5 suggest that the assumptions of the linear model are met.
Let’s have a look at the same kinds of residual plots when each of the
assumptions of the linear model are violated.
Figure 7.3: Histogram of the residuals after regressing weight on height.

Figure 7.4: The residuals plotted against observation number.

Figure 7.5: The residuals plotted against the predicted weight.
7.2 Independence
The assumption of independence is about the way in which observations are
similar and dissimilar from each other. Take for instance the following regression
equation for children's height predicted by their age:

$$\text{height} = 100 + 5 \times \text{age} + e$$
This regression equation predicts that a child of age 5 has a height of 125 and
a child of age 10 has a height of 150. In fact, all children of age 5 have the same
predicted height of 125 and all children of age 10 have the same predicted height
of 150. Of course, in reality, children of the same age will have very different
heights: they differ. According to the above regression equation, children are
similar in height because they have the same age, but they differ because of
the random term e that has a normal distribution: predictor age makes them
similar, residual e makes them dissimilar. Now, if this is all there is, then this is
a good model. But let’s suppose that we’re studying height in an international
group of 50 Ethiopian children and 50 Vietnamese children. Their heights are
plotted in Figure 7.6.
From this graph, we see that heights are similar because of age: older children
are taller than younger children. But we see that children are also similar
because of their national background: Ethiopian children are systematically
taller than Vietnamese children, irrespective of age. So here we see that a
simple regression of height on age is not a good model. We see this when we estimate the simple regression on age and look at the residuals in Figure 7.7.
As our model predicts random residuals, we expect a random scatter of
residuals. However, what we see here is a systematic order in the residuals: they
Figure 7.6: Data on age and height in children from two countries.
Figure 7.7: Residuals after regressing height on age, plotted per child.

Figure 7.8: Residuals after regressing height on age, plotted by country.
tend to be positive for the first 50 children and negative for the last 50 children.
These turn out to be the Ethiopian and the Vietnamese children, respectively.
This systematic order in the residuals is a violation of independence: the residuals
should be random, and they are not. The residuals are dependent on country:
positive for Ethiopians, negative for Vietnamese children. We see that clearly
when we plot the residuals as a function of country, in Figure 7.8.
Thus, there is more than just age that makes children similar. That means
that the model is not a good model: if there is more than just age that makes
children more alike, then that should be incorporated into our model. If we use
multiple regression, including both age and country, and we do the analysis,
then we get the following regression equation:
When we now plot the residuals we see that there is no longer a clear country
difference, see Figure 7.9.
Another typical example of non-random scatter of residuals is shown in
Figure 7.10. They come from an analysis of reaction times, done on 10 students
where we also measured their IQ. Each student was measured on 10 trials. We
predicted reaction time on the basis of student’s IQ using a simple regression
analysis. The residuals are clearly not random, and if we look more closely, we
see some clustering if we give different colours for the data from the different
students, see Figure 7.11.
We see the same information if we draw a boxplot, see Figure 7.12. We
see that residuals that are close together come from the same student. So,
reaction times are not only similar because of IQ, but also because they come
from the same student: clearly something other than IQ also explains why
Figure 7.9: Residual plot after regressing height on age and country.
Figure 7.10: Residuals after regressing reaction time on IQ, plotted per trial.
Figure 7.11: Residual plot after regressing reaction time on IQ, with separate
colours for each student.
Figure 7.12: Box plot of the residuals per student.
reaction times are different across individuals. The residuals in this analysis
based on IQ are not independent: they are dependent on the student. This
may be because of a number of factors: dexterity, left-handedness, practice,
age, motivation, tiredness, or any combination of such factors. You may or may
not have information about these factors. If you do, you can add them to your
model and see if they explain variance and check if the residuals become more
randomly distributed. But if you don't have any extra information, or if you do but the residuals remain clustered, you might either consider adding the
categorical variable student to the model or use linear mixed models, discussed
in Chapter 10.
The assumption of independence is the most important assumption in linear
models. Just a small amount of dependence among the observations causes your
actual standard error to be much larger than reported by your software. For
example, you may think that a confidence interval is [0.1, 0.2], so you reject the
null-hypothesis, but in reality the standard error is much larger, with a much
wider interval, say [-0.1, 0.4] so that in reality you are not allowed to reject
the null-hypothesis. The reason that this happens can be explained when we
look again at Figure 7.11. Objectively, there are 100 observations, and this is
fed into the software: n = 100. This sample size is then used to compute the
standard error (see Chapter 5). However, because the reaction times from the
same student are so much alike, effectively the number of observations is much
smaller. The reaction times from one student are in fact so much alike, you could
almost say that there are only 10 different reaction times, one for each student,
with only slight deviations within each student. Therefore, the real number of
observations is somewhere between 10 and 100, and thus the reported standard error underestimates the true standard error when there is dependence in your residuals (standard errors are inversely related to sample size, see Chapter 5).
7.3 Linearity
The assumption of linearity is often also referred to as the assumption of additivity.
Contrary to intuition, the assumption is not that the relationship between
variables should be linear. The assumption is that there is linearity or additivity
in the parameters. That is, the effects of the variables in the model should add
up.
Suppose we gather data on height and fear of snakes in 100 children from a
different distant country. Figure 7.13 plots these two variables, together with
the least squares regression line.
Figure 7.14 shows a pattern in the residuals: the positive residuals seem to
be smaller than the negative residuals. We also clearly see a problem when we
plot residuals against the predicted fear (see Fig. 7.15). The same problem is
reflected in the histogram in Figure 7.16, that does not look symmetric at all.
What might be the problem?
Take another look at the data in Figure 7.13. We see that for small heights,
the data points are all below the regression line, and the same pattern we see
Figure 7.13: Least squares regression line for fear of snakes on height in 100
children.
Figure 7.14: Residuals after regressing fear of snakes on height, plotted per observation.
Figure 7.15: Residuals plotted against the predicted fear of snakes.
Figure 7.16: Histogram of the residuals after regressing fear of snakes on height.
Figure 7.17: Observed and predicted fear based on a linear model with height
and height squared
for large heights. For average heights, we see on the contrary all data points
above the regression line. Somehow the data points do not suggest a completely
linear relationship, but a curved one.
This problem of model misfit could be solved by not only using height as
the predictor variable, but also the square of height, that is, height2 . For
each observed height we compute the square. This new variable, let’s call it
height2, we add to our regression model. The least squares regression equation
then becomes:
If we then plot the data and the regression line, we get Figure 7.17. There
we see that the regression line goes straight through the points. Note that the
regression line when plotted against height is non-linear, but equation 7.5 itself
is linear, that is, there are only two effects added up, one from variable height
and one from variable height2. We also see from the histogram (Figure 7.18)
and the residuals plot (Figure 7.19) that the residuals are randomly drawn from
a normal distribution and are not related to predicted fear. Thus, our additive
model (our linear model) with effects of height and height squared results in a
nice-fitting model with random normally scattered residuals.
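In R, such a squared term could be added along these lines (a sketch: the data frame name fear_data and the variable names fear and height are assumptions, since the book's own code is not shown in this fragment):

fear_data %>%
  mutate(height2 = height^2) %>%            # add the squared version of height
  lm(fear ~ height + height2, data = .) %>% # both effects simply add up
  tidy()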
In sum, the relationship between two variables need not be linear in order for
a linear model to be appropriate. A transformation of an independent variable,
such as taking a square, can result in normally randomly scattered residuals.
The linearity assumption is that the effects of a number of variables (transformed
or untransformed) add up and lead to a model with normally and independently,
randomly scattered residuals.
Figure 7.18: Histogram of the residuals of the fear of snakes data with height
squared introduced into the linear model.
Figure 7.19: Residuals plot of the fear of snakes data with height squared
introduced into the linear model.
Figure 7.20: Least squares regression line for reaction time on age in 100 adults.
Figure 7.22: Least squares regression line for log reaction time on age in 100
adults.
Figure 7.23: Residual plot after regressing log reaction time on age.
Figure 7.24: Histogram of the residuals after regressing reaction time on age.
Let's see what the histogram of the residuals looks like if we use reaction time as our
dependent variable. Figure 7.24 shows that in that case the distribution is not
symmetric: it is clearly skewed.
After a logarithmic transformation of the reaction times, we get the histogram
in Figure 7.25, which looks more symmetric.
Remember that if your sample is of limited size, a distribution will never look completely normal, even if it is sampled from a normal distribution. It should however be plausible that it was sampled from a population of data that is normal. That means that the histogram should not be too skewed, or too peaked, or have two peaks far apart. Only if you have a lot of observations, say 1000, can you reasonably say something about the shape of the distribution.
If you have categorical independent variables in your linear model, it is best
to look at the various subgroups separately and look at the histogram of the
residuals: the residuals e are defined as residuals given the rest of the linear
model. For instance, if there is a model for height, and country is the only
predictor in the model, all individuals from the same country are given the same
expected height based on the model. They only differ from each other because of
the normally distributed random residuals. Therefore look at the residuals for all
individuals from one particular country to see whether the residuals are indeed
normally distributed. Then do this for all countries separately. Think about
it: the residuals might look non-normal from country A, and non-normal from
country B, but put together, they might look very normal! This is illustrated
in Figure 7.26. Therefore, when checking for the assumption of normality, do
this for every subgroup separately.
It should be noted that the assumption of normally distributed residuals as
checked with a histogram is the least important assumption. Even when the
Figure 7.25: Histogram of the residuals after a regression of log reaction time
on age.
distribution is skewed, your standard errors are more or less correct. Only in severe cases, like with the residuals in Figure 7.24, do the standard errors start to be somewhat incorrect.
Figure 7.26: Two distributions might be very non-normal, but when taken
together, might look normal nevertheless. Normality should therefore always
be checked for each subgroup separately.
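The model object out used in the code below is created before this fragment; a minimal sketch that makes the code runnable (the exact model formula is an assumption) is:

out <- mpg %>%
  lm(cty ~ cyl, data = .)   # hypothetical linear model for the mpg data; the book's model may differ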
Next, we use the function add_residuals() from the modelr package to add residuals to the data set and plot a histogram.
library(modelr)
mpg %>%
add_residuals(out) %>%
ggplot(aes(x = resid)) +
geom_histogram()
As stated earlier, it’s even better to do this for the different subgroups
separately:
mpg %>%
add_residuals(out) %>%
ggplot(aes(x = resid)) +
geom_histogram() +
facet_wrap(. ~ cyl)
For the second type of plot, we use two functions from the modelr package
to add predicted values and residuals to the data set, and use these to make a
residual plot:
mpg %>%
add_residuals(out) %>%
add_predictions(out) %>%
ggplot(aes(x = pred, y = resid)) +
geom_point()
When there are few values for the predictions, or when you have a categorical
predictor, it’s better to make a boxplot:
mpg %>%
add_residuals(out) %>%
add_predictions(out) %>%
ggplot(aes(x = factor(pred), y = resid)) +
geom_boxplot()
For the third type of plot, we put the predictor on the x-axis and the residual
on the y-axis.
mpg %>%
add_residuals(out) %>%
ggplot(aes(x = cyl, y = resid)) +
geom_point()
mpg %>%
add_residuals(out) %>%
ggplot(aes(x = factor(cyl), y = resid)) +
geom_boxplot()
To check for independence you can also put variables on the x-axis that are
not in the model yet, for example the type of the car (class):
mpg %>%
add_residuals(out) %>%
ggplot(aes(x = class, y = resid)) +
geom_boxplot()
Chapter 8
8.1 Introduction
Linear models do not apply to every data set. As discussed in Chapter 7,
sometimes the assumptions of linear models are not met. One of the assumptions
is linearity or additivity. Additivity requires that one unit change in variable
X leads to the same amount of change in Y , no matter what value X has.
For bivariate relationships this leads to a linear shape. But sometimes you can
only expect that Y will change in the same direction, but you don’t believe
that this amount is the same for all values of X. This is the case for example
with an ordinal dependent variable. Suppose we wish to model the relationship
between the age of a mother and an aggression score in her 7-year-old child.
Suppose aggression is measured on a three-point ordinal scale: ’not aggressive’,
’sometimes aggressive’, ’often aggressive’. Since we do not know the quantitative
differences between these three levels, there are many graphs we could draw for
a given data set.
Suppose we have the data set given in Table 8.1. If we want to make a scatter
plot, we could arbitrarily choose the values 1, 2, and 3 for the three categories,
respectively. We would then get the plot in Figure 8.1. But since the aggression
data are ordinal, we could also choose the arbitrary numeric values 0, 2, and 3,
which would yield the plot in Figure 8.2.
As you can see from the least squares regression lines in Figures 8.1 and 8.2,
when we change the way in which we code the ordinal variable into a numeric
one, we also see the best fitting regression line changing. This does not mean
though, that ordinal data cannot be modelled linearly. Look at the example data
in Table 8.2 where aggression is measured with a 7-point scale. Plotting these
Figure 8.1: Regression of the child’s aggression score (1,2,3) on the mother’s
age.
Figure 8.2: Regression of the child’s aggression score (0,2,3) on the mother’s
age.
Table 8.1: Aggression in children and age of the mother.
AgeMother Aggression
32.00 Sometimes aggressive
31.00 Often aggressive
32.00 Often aggressive
30.00 Not aggressive
31.00 Sometimes aggressive
30.00 Sometimes aggressive
31.00 Not aggressive
31.00 Often aggressive
31.00 Not aggressive
30.00 Sometimes aggressive
32.00 Often aggressive
32.00 Often aggressive
31.00 Sometimes aggressive
30.00 Sometimes aggressive
31.00 Not aggressive
data in Figure 8.3 using the values 1 through 7, we see a nice linear relationship.
So even when the values 1 through 7 are arbitrarily chosen, a linear model can be a good model for a given data set with one or more ordinal variables. Whether
the interpretation makes sense is however up to the researcher.
So with ordinal data, always check that your data indeed conform to a linear
model, but realise at the same time that you’re assuming a quantitative and
additive relationship between the variables that may or may not make sense. If
you believe that a quantitative analysis is meaningless then you may consider a
non-parametric analysis that we discuss in this chapter.
Another instance where we favour a non-parametric analysis over a linear
model one, is when the assumption of normally distributed residuals is not
tenable. For instance, look again at Figure 8.1 where we regressed aggression
in the child on the age of its mother. Figure 8.4 shows a histogram of the
residuals. Because of the limited number of possible values in the dependent
variable (1, 2 and 3), the number of possible values for the residuals is also
very restricted, which leads to a very discrete distribution. The histogram looks
therefore far removed from a continuous symmetric, bell-shaped distribution,
which is a violation of the normality assumption.
Every time we see a distribution of residuals that is either very skew, or has
very few different values, we should consider a non-parametric analysis. Note
that the shape of the distribution of the residuals is directly related to what scale
values we choose for the ordinal categories. By changing the values we change
the regression line, and that directly affects the relative sizes of the residuals.
First, we will discuss a non-parametric alternative for two numeric variables.
We will start with Spearman's ρ (rho, pronounced 'row'), also called Spearman's rank-order correlation coefficient $r_s$. Next we will discuss an alternative to $r_s$,
Figure 8.3: Regression of the child’s aggression 1 thru 7 Likert score on the
mother’s age.
Figure 8.4: Histogram of the residuals after the regression of a child’s aggression
score on the mother’s age.
Table 8.2: Aggression in children on a 7-point Likert scale and age of the mother.
AgeMother Aggression
35.0 6
32.0 4
35.2 6
36.0 5
32.9 3
29.9 1
32.3 4
32.2 2
34.2 4
30.5 2
31.6 3
30.5 2
31.7 3
31.4 3
37.5 7
Kendall’s τ (tau, pronounced ’taw’). After that, we will discuss the combination
of numeric and categorical variables, when comparing groups.
Now we acknowledge the ordinal nature of the data by only having rankings:
a person with rank 1 is brighter than a person with rank 2, but we do not know how large the difference in brightness really is. Now we want to establish to what extent there is a relationship between the rankings on geography and the rankings on history: is it the case that the higher the ranking on geography, the higher the ranking on history?
By eye-balling the data, we see that the brightest student in geography is also
the brightest student in history (rank 1). We also see that the dullest student
in history is also the dullest student in geography (rank 10). Furthermore, we
see relatively small differences between the rankings on the two subjects: high
rankings on geography seem to go together with high rankings on history. Let’s
look at these differences between rankings more closely by computing them, see
Table 8.4.
$$r_s = 1 - \frac{6 \sum d^2}{n^3 - n} \tag{8.1}$$
Table 8.5: Student rankings on geography and history.
rank.geography rank.history difference squared.difference
5 4 -1 1
4 5 1 1
6 7 1 1
7 8 1 1
8 6 -2 4
9 9 0 0
10 10 0 0
2 3 1 1
1 1 0 0
3 2 -1 1
In this formula, d is the difference between the two rankings for a student. The formula is constructed this way because then you get a value between -1 and 1, just like a Pearson correlation,
where a value close to 1 describes a high positive correlation (high rank on one
variable goes together with a high rank on the other variable) and a value close
to -1 describes a negative correlation (a high rank on one variable goes together
with a low rank on the other variable). So in our case the sum of the squared
differences is equal to 10, and n is the number of students, so we get:
$$r_s = 1 - \frac{6 \times 10}{10^3 - 10} = 1 - \frac{60}{990} = 0.94 \tag{8.2}$$
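As a quick numerical check of this computation (a sketch in R, using the differences from Table 8.5):

d <- c(-1, 1, 1, 1, -2, 0, 0, 1, 0, -1)  # differences between the two rankings
n <- 10
1 - 6 * sum(d^2) / (n^3 - n)             # gives about 0.94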
This is called the Spearman rank-order correlation coefficient rs , or Spearman’s
rho (the Greek letter ρ). It can be used for any two variables of which at least
one is ordinal. The trick is to convert the scale values into ranks, and then
apply the formula above. For instance, if we have the variable Grade with the
following values (C, B, D, A, F), we convert them into rankings by saying the A
is the highest value (1), B is the second highest value (2), C is the third highest
value (3), D is the fourth highest value (4) and F is the fifth highest value (5).
So transformed into ranks we get (3, 2, 4, 1, 5). Similarly, we could turn numeric
variables into ranks. Table 8.6 shows how the variables grade, shoesize and
height are transformed into their respective ranked versions. Note that the ranking is done alphanumerically by default: the first alphanumeric value gets rank
1. You could also do the ranking in the opposite direction, if that makes more
sense.
Table 8.6: Ordinal and numeric variables and their ranked transformations.
student grade rank.grade shoesize rank.shoesize height rank.height
1 A 1 6 1 1.70 1
2 D 4 8 3 1.82 2
3 C 3 9 4 1.92 4
4 B 2 7 2 1.88 3
8.3 Spearman’s rho in R
When we let R compute rs for us, it automatically ranks the data for us. Let’s
look at the mpg data on 234 cars from the ggplot2 package again. Suppose we
want to treat the variables cyl (the number of cylinders) and year (year of the
model) as ordinal variables, and we want to see whether the ranking on the cyl variable is related to the ranking on the year variable. We use the function rcorr() from the Hmisc package to compute Spearman's rho:
library(Hmisc)
rcorr(mpg$cty, mpg$year, type = "spearman")
## x y
## x 1.00 -0.01
## y -0.01 1.00
##
## n= 234
##
##
## P
## x y
## x 0.9169
## y 0.9169
In the output you will see a correlation matrix very similar to the one for a Pearson correlation. Spearman's rho is equal to -0.01. You will also see whether
the correlation is significantly different from 0, indicated by a p-value. If the
p-value is very small, you may conclude that on the basis of these data, the
correlation in the population is not equal to 0, ergo, in the population there is a
relationship between the year a car was produced and the number of cylinders.
Table 8.7: Student rankings on geography and history, now ordered according
to the ranking for geography.
student rank.geography rank.history
9 1 1
8 2 3
10 3 2
2 4 5
1 5 4
3 6 7
4 7 8
5 8 6
6 9 9
7 10 10
From this table we see that the history teacher disagrees with the geography
teacher that student 8 is brighter than student 10. She also disagrees with her
colleague that student 1 is brighter than student 2. If we do this for all possible
pairs of students, we can count the number of times that they agree and we can
count the number of times they disagree. The total number of possible pairs is
equal to $\binom{10}{2} = n(n-1)/2 = 90/2 = 45$ (see Chapter 3). This is a rather tedious
job to do, but it can be made simpler if we reshuffle the data a bit. We put
the students in a new order, such that the brightest student in geography comes
first, and the dullest last. This also changes the order in the variable history. We
then get the data in Table 8.7. We see that the geography teacher believes that
student 9 outperforms all 9 other students. On this, the history teacher agrees,
as she also ranks student 9 first. This gives us 9 agreements. Moving down
the list, we see that the geography teacher believes student 8 outperforms 8
other students. However, we see that the history teacher believes student 8 only
outperforms 7 other students. This results in 7 agreements and 1 disagreement.
So now in total we have 9 + 7 = 16 agreements and 1 disagreement. If we
go down the whole list in the same way, we will find that there are in total 41
agreements and 4 disagreements.
The computation is rather tedious. There is a trick to do it faster. Now
focus on Table 8.7 but start in the column of the history teacher. Start at the
top row and count the number of rows beneath it with a rank higher than the
rank in the first row. The rank in the first row is 1, and all other ranks beneath
it are higher, so the number of ranks is 9. We plug that value in the last column
in Table 8.8. Next we move to row 2. The rank is 3. We count the number of
rows below row 2 with a rank higher than 3. Rank 2 is lower, so we are left
with 7 rows and we again plug 7 in the last column of Table 8.8. Then we move
on to row 3, with rank 2. There are 7 rows left, and all of them have a higher
rank. So the number is 7. Then we move on to row 4. It has rank 5. Of the
6 rows below it, only 5 have a higher rank. Next, row 5 shows rank 4. Of the
5 rows below it, all 5 show a higher rank. Row 6 shows rank 7. Of the 4 rows
Table 8.8: Student rankings on geography and history, now ordered according
to the ranking for geography, with number of agreements.
student rank.geography rank.history number
9 1 1 9
8 2 3 7
10 3 2 7
2 4 5 5
1 5 4 5
3 6 7 3
4 7 8 2
5 8 6 2
6 9 9 1
7 10 10 0
below it, only 3 show a higher rank. Row 7 shows rank 8. Of the three rows
below it, only 2 show a higher rank. Row 8 shows rank 6. Both rows below it
show a higher rank. And row 9 shows rank 9, and the row below it shows a
higher rank so that is 1. Finally, when we add up the values in the last column
in Table 8.8, we find 41. This is the number of agreements. The number of
disagreements can be found by reasoning that the total number of pairs equals
the number of pairs that can be formed using a total number of 10 objects: $\binom{10}{2}
= 10(10 - 1)/2 = 45$. In this case we have 45 possible pairs. Of these there are
41 agreements, so there must be 45 − 41 = 4 disagreements. We can then fill in
the formula to compute Kendall's τ:

\[ \tau = \frac{\text{agreements} - \text{disagreements}}{\text{total number of pairs}} = \frac{41 - 4}{45} = \frac{37}{45} = 0.82 \qquad (8.3) \]
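As a quick check of this hand calculation, the same value can be computed in R from the ranks in Table 8.7 (a small illustration, not the book's code):

geography <- 1:10
history   <- c(1, 3, 2, 5, 4, 7, 8, 6, 9, 10)
(41 - 4) / 45                                # tau by hand
cor(geography, history, method = "kendall")  # gives the same 0.82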
8.5 Kendall’s τ in R
Let’s again use the mpg data on 234 cars. We can compute Kendall’s τ for the
variables cyl and year using the Kendall package:
library(Kendall)
Kendall(mpg$cyl, mpg$year)
As said, Kendall’s τ can also be used if you want to control for a third
variable (or even more variables). This can be done with the ppcor package.
Because this package has its own function select(), you need to be explicit
about which function from which package you want to use. Here you want to
use the select() function from the dplyr package (part of the tidyverse suite
of packages).
library(ppcor)
mpg %>%
dplyr::select(cyl, year, cty) %>%
pcor(method = "kendall")
## $estimate
## cyl year cty
## cyl 1.0000000 0.1642373 -0.7599993
## year 0.1642373 1.0000000 0.1210952
## cty -0.7599993 0.1210952 1.0000000
##
## $p.value
## cyl
## cyl 0.000000000000000000000000000000000000000000000000000000000000000000000000
## year 0.000189654827628346264994235736978112072392832487821578979492187500000000
## cty 0.000000000000000000000000000000000000000000000000000000000000000000770657
## year
## cyl 0.0001896548
## year 0.0000000000
## cty 0.0059236967
## cty
## cyl 0.000000000000000000000000000000000000000000000000000000000000000000770657
## year 0.005923696652933938683327497187747212592512369155883789062500000000000000
## cty 0.000000000000000000000000000000000000000000000000000000000000000000000000
##
## $statistic
## cyl year cty
## cyl 0.000000 3.732412 -17.271535
## year 3.732412 0.000000 2.751975
## cty -17.271535 2.751975 0.000000
##
## $n
## [1] 234
##
## $gp
## [1] 1
##
## $method
## [1] "kendall"
In the output, we see that the Kendall correlation between cyl and year,
controlled for cty, equals 0.16, with an associated p-value of 0.00019.
Table 8.9: Field trip data.
student group size rank
001 math extra small 1
002 math extra large 6
003 psych medium 4
004 psych small 2.5
005 engineer large 5
006 math small 2.5
Figure 8.5: Distributions of city mileage (cty) as a function of car type (class).
(1988), that you don’t need to know. The distribution of this KW -statistic
under the null-hypothesis is known, so we know what extreme values are, and
consequently can compute p-values. This tedious computation can be done in
R.
mpg %>%
kruskal.test(cty ~ class, data = .)
##
## Kruskal-Wallis rank sum test
##
## data: cty by class
## Kruskal-Wallis chi-squared = 149.53, df = 6, p-value <
## 0.00000000000000022
Chapter 9
Moderation: testing
interaction effects
\[ \widehat{\text{vocab}} = b_0 + b_1 \times \text{age} + b_2 \times \text{SES} \qquad (9.2) \]
This main effect of SES is yet unknown and denoted by b2 . Note that this
linear equation is an example of multiple regression.
Let’s use a numerical example. Suppose age is coded in years, and SES
is dummy coded, with a 1 for high SES and a 0 for low SES. Let b2 , the effect
Figure 9.1: Two regression lines: one for low SES children and one for high SES
children.
of SES over and above age, be 10. Then we can write out the linear equation
for low SES and high SES separately.
\[ \text{low SES}: \quad \widehat{\text{vocab}} = 200 + 500 \times \text{age} + 10 \times 0 \qquad (9.3) \]
\[ \widehat{\text{vocab}} = 200 + 500 \times \text{age} \qquad (9.4) \]
\[ \text{high SES}: \quad \widehat{\text{vocab}} = 200 + 500 \times \text{age} + 10 \times 1 \qquad (9.5) \]
\[ \widehat{\text{vocab}} = (200 + 10) + 500 \times \text{age} \qquad (9.6) \]
\[ \widehat{\text{vocab}} = 210 + 500 \times \text{age} \qquad (9.7) \]
Figure 9.1 depicts the two regression lines for the high and low SES children
separately. We see that the effect of SES involves a change in the intercept:
the intercept equals 200 for low SES children and the intercept for high SES
children equals 210. The difference in intercept is indicated by the coefficient
for SES. Note that the two regression lines are parallel: for every age, the
difference between the two lines is equal to 10. For every age therefore, the
predicted number of words is 10 words more for high SES children than for low
SES children.
So far, this is an example of multiple regression that we already saw in
Chapter 4. But suppose that such a model does not describe the data that we
actually have, or does not make the right predictions based on our theories.
Suppose our researcher also expects that the yearly increase in vocabulary is a
bit lower than 500 words in low SES families, and a little bit higher than 500
words in high SES families. In other words, he believes that SES might moderate
(affect or change) the slope coefficient for age. Let’s call the slope coefficient
in this case b1 . In the above equation this slope parameter is equal to 500, but
let’s now let itself have a linear relationship with SES:
\[ b_1 = a + b_3 \times \text{SES} \qquad (9.8) \]
In words: the slope coefficient for the regression of vocab on age, is itself
linearly related to SES: we predict the slope on the basis of SES. We model
that by including a slope b3 , but also an intercept a. Now we have two linear
equations for the relationship between vocab, age and SES:
\[ \widehat{\text{vocab}} = b_0 + b_1 \times \text{age} + b_2 \times \text{SES} \qquad (9.9) \]
\[ b_1 = a + b_3 \times \text{SES} \qquad (9.10) \]
We can rewrite this by plugging the second equation into the first one
(substitution):

\[ \widehat{\text{vocab}} = b_0 + (a + b_3 \times \text{SES}) \times \text{age} + b_2 \times \text{SES} = b_0 + a \times \text{age} + b_2 \times \text{SES} + b_3 \times \text{SES} \times \text{age} \]

Now this very much looks like a regression equation with one intercept and
three slope coefficients: one for age ($a$), one for SES ($b_2$) and one for SES × age
($b_3$).
We might want to change the label $a$ into $b_1$ to get a more familiar looking
form:

\[ \widehat{\text{vocab}} = b_0 + b_1 \times \text{age} + b_2 \times \text{SES} + b_3 \times \text{SES} \times \text{age} \]
So the first slope coefficient is the increase in vocabulary for every year that
age increases (b1 ), the second slope coefficient is the increase in vocabulary for
an increase of 1 on the SES variable (b2 ), and the third slope coefficient is the
increase in vocabulary for every increase of 1 on the product of SES and age
(b3 ).
What does this mean exactly?
Suppose we find the following parameter values for the regression equation:

\[ \widehat{\text{vocab}} = 200 + 450 \times \text{age} + 125 \times \text{SES} + 100 \times \text{SES} \times \text{age} \]

If we code low SES children as SES = 0, and high SES children as SES = 1,
we can rewrite the above equation into two regression equations, one for low SES
children (SES = 0) and one for high SES children (SES = 1):
\[ \text{low SES}: \quad \widehat{\text{vocab}} = 200 + 450 \times \text{age} \qquad (9.16) \]
\[ \text{high SES}: \quad \widehat{\text{vocab}} = 200 + 450 \times \text{age} + 125 + 100 \times \text{age} \qquad (9.17) \]
\[ \widehat{\text{vocab}} = (200 + 125) + (450 + 100) \times \text{age} = 325 + 550 \times \text{age} \]
Then for low SES children, the intercept is 200 and the regression slope for
age is 450, so they learn 450 words per year. For high SES children, we see
the same intercept of 200, with an extra 125 (this is the main effect of SES).
So effectively their intercept is now 325. For the regression slope, we now have
450 × age + 100 × age which is of course equal to 550 × age. So we see that the
high SES group has both a different intercept, and a different slope: the increase
in vocabulary is 550 per year: somewhat steeper than in low SES children. So
yes, the researcher was right: vocabulary increase per year is faster in high SES
children than in low SES children.
These two different regression lines are depicted in Figure 9.2. It can be
clearly seen that the lines have two different intercepts and two different slopes.
That they have two different slopes can be seen from the fact that the lines are
not parallel. One has a slope of 450 words per year and the other has a slope of
550 words per year. This difference in slope of 100 is exactly the size of the slope
coefficient pertaining to the product SES × age, b3 . Thus, the interpretation of
the regression coefficient for a product of two variables is that it represents the
difference in slope.
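In R, such a moderation model is specified by adding the product term to the model formula. Below is a minimal sketch using simulated data that mimics this numerical example (the simulated data, object names and noise level are assumptions, not the book's):

library(tidyverse)
set.seed(123)
# simulate children using the parameter values from the example above
vocab_data <- tibble(
  age   = runif(200, min = 0, max = 6),
  SES   = rbinom(200, size = 1, prob = 0.5),
  vocab = 200 + 450 * age + 125 * SES + 100 * SES * age + rnorm(200, sd = 100)
)
# the estimated coefficients should be close to 200, 450, 125 and 100
vocab_data %>%
  lm(vocab ~ age + SES + age:SES, data = .) %>%
  coef()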
The observation that the slope coefficient is different for different groups
is called an interaction effect, or interaction for short. Other words for this
phenomenon are modification and moderation. In this case, SES is called the
modifier variable: it modifies the relationship between age and vocabulary. Note
however that you could also interpret age as the modifier variable: the effect of
SES is larger for older children than for younger children. In the plot you see
that the difference between vocabulary for high and low SES children of age 6
is larger than it is for children of age 2.
Figure 9.2: Two regression lines for the relationship between age and vocab,
one for low SES children (SES = 0) and one for high SES children (SES = 1).
Suppose we have stored the ChickWeight data for Diets 1 and 2 under the name chick_data. When we have a quick look at
the data with glimpse(), we see that Diet is a factor (<fct>).
chick_data %>%
glimpse()
## Rows: 340
## Columns: 4
## $ weight <dbl> 42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205, 40, 49...
## $ Time <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21, 0, 2, 4, 6, 8, 10...
## $ Chick <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Diet <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Figure 9.3: The relationship between Time and weight in all chicks with either
Diet 1 or Diet 2.
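The regression table below presumably comes from a model with a Time by Diet interaction; a sketch of code that could produce such a table (the object name out and the use of broom's tidy() with conf.int = TRUE are assumptions):

library(broom)
out <- chick_data %>%
  lm(weight ~ Time + Diet + Time:Diet, data = .)
out %>%
  tidy(conf.int = TRUE)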
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 30.9 4.50 6.88 2.95e-11 22.1 39.8
## 2 Time 6.84 0.361 19.0 4.89e-55 6.13 7.55
## 3 Diet2 -2.30 7.69 -0.299 7.65e- 1 -17.4 12.8
## 4 Time:Diet2 1.77 0.605 2.92 3.73e- 3 0.577 2.96
In the regression table, we see the effect of the numeric Time variable, which
has a slope of 6.84. For every increase of 1 in Time, there is a corresponding
expected increase of 6.84 grams in weight. Next, we see that R created a dummy
variable Diet2. That means this dummy codes 1 for Diet 2 and 0 for Diet 1.
From the output we see that if a chick gets Diet 2, the predicted weight changes
by −2.3 grams (that means, Diet 2 results in a lower weight).
Next, R created a dummy variable Time:Diet2, by multiplying the variables
Time and Diet2. Results show that this interaction effect is 1.77.
These results can be plugged into the following regression equation:

\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} - 2.3 \times \text{Diet2} + 1.77 \times \text{Time} \times \text{Diet2} \qquad (9.18) \]
Figure 9.4: The relationship between Time and weight in chicks, separately for
Diet 1 and Diet 2.
If we fill in 1s for the Diet2 dummy variable, we get the equation for chicks with Diet 2:

\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} - 2.3 \times 1 + 1.77 \times \text{Time} \times 1 = 28.63 + 8.61 \times \text{Time} \qquad (9.19) \]
If we fill in 0s for the Diet2 dummy variable, we get the equation for chicks
with Diet 1:
\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} \qquad (9.20) \]
When comparing these two regression lines for chicks with Diet 1 and Diet
2, we see that the slope for Time is 1.77 steeper for Diet 2 chicks than for Diet
1 chicks. In this particular random sample of chicks, the chicks on Diet 1 grow
6.84 grams per day (on average), but chicks on Diet 2 grow 6.84 + 1.77 = 8.61
grams per day (on average).
We visualised these results in Figure 9.4. There we see two regression lines:
one for the red data points (chicks on Diet 1) and one for the blue data points
(chicks on Diet 2). These two regression lines are the same as those regression
lines we found when filling in either 1s or 0s in the general linear model. Note
that the lines are not parallel, unlike the parallel lines we saw in Chapter 6. Each regression line is the least
squares regression line for the subsample of chicks on a particular diet.
We see that the difference in slope is 1.77 grams per day. This is what we
observe in this particular sample of chicks. However, what does that tell us
about the difference in slope for chicks in general, that is, the population of all
chicks? For that, we need to look at the confidence interval. In the regression
table above, we also see the 95% confidence intervals for all model parameters.
The 95% confidence interval for the Time × Diet2 interaction effect is (0.58,
2.96). That means that plausible values for this interaction effect are those
values between 0.58 and 2.96.
It is also possible to do null-hypothesis testing for interaction effects. One
could test whether an observed difference of 1.77 is plausible if the value in the entire
population of chicks equals 0. In other words, is the value of 1.77 significantly
different from 0?
The null-hypothesis is
\[ H_0: \beta_{\text{Time} \times \text{Diet2}} = 0 \qquad (9.21) \]
The regression table shows that the test of the null-hypothesis for the interaction effect
has a t-value of t = 2.92, with a p-value of $3.73 \times 10^{-3} = 0.00373$. For research
reports one should always also report the degrees of freedom for a statistical test. The
(residual) degrees of freedom can be found in R by typing

out$df.residual

## [1] 336

We could then report, for example: "The difference in growth rate between chicks on Diet 1 and chicks on Diet 2 was significant, t(336) = 2.92, p = .004."
In general, suppose we have a linear model of the form

\[ Y = b_0 + b_1 \times X + b_2 \times Z + b_3 \times X \times Z + e \]

Then, we call $b_0$ the intercept, $b_1$ the main effect of X, $b_2$ the main effect of
Z, and $b_3$ the interaction effect of X and Z (alternatively, the X by Z interaction
effect).
9.3 Interaction effects with a categorical variable
in R
In the previous section, we looked at the difference in slopes between two groups.
But what we can do for two groups, we can do for multiple groups. The data
set on chicks contains data on chicks with 4 different diets. When we perform
the same analysis using all data in ChickWeight, we obtain the regression table
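A sketch of the corresponding R call (again assuming broom's tidy() with confidence intervals, and the object name out):

out <- ChickWeight %>%
  lm(weight ~ Time + Diet + Time:Diet, data = .)
out %>%
  tidy(conf.int = TRUE)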
## # A tibble: 8 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 30.9 4.25 7.28 1.09e-12 22.6 39.3
## 2 Time 6.84 0.341 20.1 3.31e-68 6.17 7.51
## 3 Diet2 -2.30 7.27 -0.316 7.52e- 1 -16.6 12.0
## 4 Diet3 -12.7 7.27 -1.74 8.15e- 2 -27.0 1.59
## 5 Diet4 -0.139 7.29 -0.0191 9.85e- 1 -14.5 14.2
## 6 Time:Diet2 1.77 0.572 3.09 2.09e- 3 0.645 2.89
## 7 Time:Diet3 4.58 0.572 8.01 6.33e-15 3.46 5.70
## 8 Time:Diet4 2.87 0.578 4.97 8.92e- 7 1.74 4.01
The regression table for four diets is substantially larger than for two diets.
It contains one slope parameter for the numeric variable Time, three different
slopes for the factor variable Diet and three different interaction effects for the
Time by Diet interaction.
The full linear model equation is

\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} - 2.30 \times \text{Diet2} - 12.7 \times \text{Diet3} - 0.14 \times \text{Diet4} + 1.77 \times \text{Time} \times \text{Diet2} + 4.58 \times \text{Time} \times \text{Diet3} + 2.87 \times \text{Time} \times \text{Diet4} \]

You see that R created dummy variables for Diet 2, Diet 3 and Diet 4. We
can use this equation to construct a separate linear model for the Diet 1 data.
Chicks with Diet 1 have 0s for the dummy variables Diet2, Diet3 and Diet4.
If we fill in these 0s, we obtain

\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} \]

For the chicks on Diet 2, we have 1s for the dummy variable Diet2 and 0s
for the other dummy variables. Hence we have
\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} - 2.3 \times 1 + 1.77 \times \text{Time} \times 1 \]
\[ = (30.93 - 2.3) + (6.84 + 1.77) \times \text{Time} = 28.63 + 8.61 \times \text{Time} \qquad (9.24) \]
Here we see exactly the same equation for Diet 2 as in the previous section
where we only analysed two diet groups. The difference between the two slopes
in the Diet 1 and Diet 2 groups is again 1.77. The only difference for this
interaction effect is the standard error, and therefore the confidence interval is
also slightly different. We will come back to this issue in Chapter ??.
For the chicks on Diet 3, we have 1s for the dummy variable Diet3 and 0s
for the other dummy variables. The regression equation is then

\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} - 12.7 \times 1 + 4.58 \times \text{Time} \times 1 = 18.23 + 11.42 \times \text{Time} \]

We see that the intercept is again different from that for the Diet 1 chicks. We
also see that the slope is different: it is now 4.58 steeper than for the Diet 1
chicks. This difference in slopes is exactly equal to the Time by Diet3 interaction
effect. This is also what we saw in the Diet 2 group. Therefore, we can say
that an interaction effect for a specific diet group says something about how
much steeper the slope is in that group, compared to the reference group. The
reference group is the group for which all the dummy variables are 0. Here, that
is the Diet 1 group.
Based on that knowledge, we can expect that the slope in the Diet 4 group
is equal to the slope in the reference group (6.84) plus the Time by Diet4
interaction effect, 2.87, so 9.71.
We can do the same for the intercept in the Diet 4 group. The intercept is
equal to the intercept in the reference group (30.93) plus the main effect of the
Diet4 dummy variable, -0.14, which is 30.79.
The linear equation for the Diet 4 chicks is then:

\[ \widehat{\text{weight}} = 30.93 + 6.84 \times \text{Time} - 0.14 \times 1 + 2.87 \times \text{Time} \times 1 = 30.79 + 9.71 \times \text{Time} \]
Figure 9.5: Four different regression lines for the four different diet groups.
We can test whether the four slopes are significantly different from each other
by running an ANOVA. That is, we apply the anova() function to the results
of an lm() analysis:
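A sketch of such a call (the tidy() step is an assumption, used here only to match the table layout shown below):

out %>%
  anova() %>%
  tidy()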
## # A tibble: 4 x 6
## term df sumsq meansq statistic p.value
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Time 1 2042344. 2042344. 1760. 2.11e-176
## 2 Diet 3 129876. 43292. 37.3 5.07e- 22
## 3 Time:Diet 3 80804. 26935. 23.2 3.47e- 14
## 4 Residuals 570 661532. 1161. NA NA
”The slopes for the four different diets were significantly different
from each other, F (3, 570) = 23.2, M SE = 26935, p < 0.001.”
9.4 Interaction between two dichotomous variables
in R
In the previous section we discussed the situation that regression slopes might
be different in two or in four groups. In Chapter 6 we learned that we could also
look at slopes for dummy variables. The slope is then equal to the difference in
group means, that is, the slope is the increase in the group mean of one group
compared to the reference group.
Now we discuss the situation where we have two dummy variables, and want
to do inference on their interaction. Does one dummy variable moderate the
effect of the other dummy variable?
Let’s have a look at a data set on penguins. It can be found in the palmerpenguins
package.
# install.packages("palmerpenguins")
library(palmerpenguins)
penguins %>%
str()
We see there is a species factor with three levels, and a sex factor with two
levels. Let’s select only the Adelie and Chinstrap species.
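A sketch of how this selection and the model behind the table below could be produced (the object name penguins2 and the tidy() call are assumptions):

penguins2 <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  droplevels()
penguins2 %>%
  lm(flipper_length_mm ~ species, data = .) %>%
  tidy(conf.int = TRUE)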
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 190. 0.548 347. 8.31e-300 189. 191.
## 2 speciesChinstrap 5.87 0.983 5.97 9.38e- 9 3.93 7.81
The output shows that in this sample, the Chinstrap penguins have on
average larger flippers than Adelie penguins. The confidence intervals tell us
that this difference in flipper length is somewhere between 3.93 and 7.81 mm. But
suppose that this is not what we want to know. The real question might be
whether this difference is different for male and female penguins. Maybe there
is a larger difference in flipper length in females than in males?
This difference or change in one variable (flipper length mm) as a function
of another variable (sex) should remind us of moderation: maybe sex moderates
the effect of species on flipper length.
In order to study such moderation, we have to analyse the sex by species
interaction effect. By now you should know how to do that in R:
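For instance (a sketch, reusing the penguins2 object assumed above):

penguins2 %>%
  lm(flipper_length_mm ~ species + sex + species:sex, data = .) %>%
  tidy(conf.int = TRUE)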
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 188. 0.707 266. 2.15e-267 186. 189.
## 2 speciesChinstrap 3.94 1.25 3.14 1.92e- 3 1.47 6.41
## 3 sexmale 4.62 1.00 4.62 6.76e- 6 2.65 6.59
## 4 speciesChinstrap:se~ 3.56 1.77 2.01 4.60e- 2 0.0638 7.06
From this we can make the following predictions. The predicted flipper
length for female Adelie penguins (both dummy variables equal to 0) is
188 + 3.94 × 0 + 4.62 × 0 + 3.56 × 0 × 0 = 188. For male Adelie penguins we get

\[ 188 + 3.94 \times 0 + 4.62 \times 1 + 3.56 \times 0 \times 1 = 188 + 4.62 = 192.62 \qquad (9.30) \]

Similarly, the predicted flipper length for female Chinstrap penguins is 188 + 3.94 = 191.94, and for male Chinstrap penguins it is 188 + 3.94 + 4.62 + 3.56 = 200.12.
These predicted flipper lengths for each sex/species combination are actually
the group means. It is generally best to plot these means with a means and
errors plot. For that we first need to compute the means in R. With left_join()
we add these means to the data set. These diamond-shaped means (shape =
18) are plotted with intervals that are twice (mult = 2) the standard error of
those means (geom = "errorbar").
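A sketch of a plot along these lines (this version uses stat_summary() rather than the left_join() approach described above, so the exact layers are assumptions, not the book's code):

penguins2 %>%
  ggplot(aes(x = sex, y = flipper_length_mm, colour = species)) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4,
               position = position_dodge(width = 0.4)) +
  stat_summary(fun.data = mean_se, fun.args = list(mult = 2),
               geom = "errorbar", width = 0.2,
               position = position_dodge(width = 0.4))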
[Means and errors plot of flipper_length_mm by sex (female, male, NA) and species (Adelie, Chinstrap).]
This plot also shows the data on penguins with unknown sex (sex = NA). If
we leave these out, we get
[Means and errors plot of flipper_length_mm by sex and species, with the penguins of unknown sex removed.]
Comparing the Adelie and the Chinstrap data, we see that for both males
and females, the Adelie penguins have smaller flippers than the Chinstrap
penguins. Comparing males and females, we see that the males have generally
larger flippers than females. More interestingly in relation to this chapter, the
means in the females are farther apart than the means in the males. Thus, in
females the effect of species is larger than in males. This is the interaction effect,
and this difference in the difference in means is equal to 3.56 in this data set.
With a confidence level of 95% we can say that this difference in the effect of
species is probably somewhere between 0.06 and 7.06 mm in the population of
all penguins.
Interaction effects can also involve two numeric variables. Let's look at the mpg data again and plot city mileage (cty) against engine displacement (displ), colouring the points by the number of cylinders (cyl):

mpg %>%
ggplot(aes(x = displ, y = cty)) +
geom_point(aes(colour = cyl)) +
geom_smooth(method = "lm", se = F)
When we run separate linear models for the different numbers of cylinders,
we get
mpg %>%
ggplot(aes(x = displ, y = cty, colour = cyl, group = cyl)) +
geom_point() +
geom_smooth(method = "lm", se = F)
We see that the slope is different, depending on the number of cylinders: the
more cylinders, the less negative the slope: very negative for cars with a low
number of cylinders, and slightly positive for cars with a high number of cylinders.
In other words, the slope increases in value with an increasing number of cylinders.
If we want to quantify this interaction effect, we need to run a linear model with
an interaction effect.
out <- mpg %>%
lm(cty ~ displ + cyl + displ:cyl, data = .)
out %>%
tidy()
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 38.6 2.03 19.0 3.69e-49
## 2 displ -5.29 0.825 -6.40 8.45e-10
## 3 cyl -2.70 0.376 -7.20 8.73e-12
## 4 displ:cyl 0.559 0.104 5.38 1.85e- 7
We see that the displ by cyl interaction effect is 0.559. It means that the
slope of displ changes by 0.559 for every unit increase in cyl.
For example, when we look at the predicted city miles per gallon with cyl
= 2, we get the following model equation:
\[ \widehat{\text{cty}} = 38.6 - 5.285 \times \text{displ} - 2.704 \times \text{cyl} + 0.559 \times \text{displ} \times \text{cyl} \]
\[ \widehat{\text{cty}} = 38.6 - 5.285 \times \text{displ} - 2.704 \times 2 + 0.559 \times \text{displ} \times 2 \]
\[ = (38.6 - 5.408) + (-5.285 + 1.118) \times \text{displ} = 33.19 - 4.167 \times \text{displ} \qquad (9.33) \]
Doing the same exercise for cyl = 3 gives $\widehat{\text{cty}} = (38.6 - 8.112) + (1.677 - 5.285) \times \text{displ} = 30.49 - 3.608 \times \text{displ}$: a different intercept and a different slope. The difference in the slope
between 3 and 2 cylinders equals −3.608 − (−4.167) = 0.559, which is exactly the interaction effect.
If you do the same exercise with 4 and 5 cylinders, or 6 and 7 cylinders, you
will always see this difference of 0.559 again. This parameter for the interaction effect
just says that the best prediction for the change in slope when increasing the
number of cylinders by 1, is 0.559. We can plot the predictions from this
model in the following way:
library(modelr)
mpg %>%
add_predictions(out) %>%
ggplot(aes(x = displ, y = cty, colour = cyl)) +
geom_point() +
geom_line(aes(y = pred, group = cyl))
mpg %>%
add_predictions(out) %>%
ggplot(aes(x = displ, y = cty, group = cyl)) +
geom_point() +
geom_line(aes(y = pred), colour = "black") +
geom_smooth(method = "lm", se = F)
When we compare these model predictions (the black lines) with the separate least squares lines per cylinder group (the blue lines), we see that they are a little bit different. That is because in the model we
treat cyl as numeric: for every increase of 1 in cyl, the slope changes by a
fixed amount. When you treat cyl as categorical, then you estimate the slope
separately for all different levels. You would then see multiple parameters for
the interaction effect:
out <- mpg %>%
mutate(cyl = factor(cyl)) %>%
lm(cty ~ displ + cyl + displ:cyl, data = .)
out %>%
tidy()
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 33.8 1.70 19.9 1.50e-51
## 2 displ -5.96 0.785 -7.60 7.94e-13
## 3 cyl5 1.60 1.17 1.37 1.72e- 1
## 4 cyl6 -11.6 2.50 -4.65 5.52e- 6
## 5 cyl8 -23.4 2.89 -8.11 3.12e-14
## 6 displ:cyl5 NA NA NA NA
## 7 displ:cyl6 4.21 0.948 4.44 1.38e- 5
## 8 displ:cyl8 6.39 0.906 7.06 2.05e-11
When cyl is turned into a factor, you see that cars with 4 cylinders are taken
as the reference category, and there are effects of having 5, 6, or 8 cylinders.
We see the same for the interaction effects: there is a reference category with 4
cylinders, where the slope of displ equals -5.96. Cars with 6 and 8 cylinders
have different slopes: the one for 6 cylinders is −5.96 + 4.21 = −1.75 and the one for 8
cylinders is −5.96 + 6.39 = 0.43. The slope for cars with 5 cylinders can’t be separately
estimated because there is no variation in displ among the 5-cylinder cars in the mpg data set.
You see that you get different results, depending on whether you treat a
variable as numeric or as categorical. Treated as numeric, you end up with a
simpler model with fewer parameters, and therefore a larger number of degrees
of freedom. What to choose depends on the research question and the amount
of data. In general, a model should be not too complex when you have relatively
few data points. Whether the model is appropriate for your data can be checked
by looking at the residuals and checking the assumptions.
Chapter 10
\[ Y = b_0 + b_1 X + e \qquad (10.1) \]
\[ e \sim N(0, \sigma^2) \qquad (10.2) \]
Using this model, we know that for a person with a value of 5 for X, we
expect Y to be equal to b0 + b1 × 5. As another example, if Y is someone’s IQ
score, X is someone’s brain size in millilitres, b0 is equal to 70, and b1 is
equal to 0.01, we expect on the basis of this model that a person with a brain
size of 1500 millilitres has an IQ score of 70 + 0.01 × 1500, which equals 85.
Now, for any model the predicted values usually are not the same as the observed
values. If the model predicts on the basis of my brain size that my IQ is 140,
my true IQ might be in fact 130. This discrepancy is termed the residual:
the observed Y , minus the predicted Y , or Yb , so in this case the residual is
Y − Yb = 130 − 140 = −10.
Here we have the model for the relationship between IQ and brain size.
\[ IQ = 70 + 0.01 \times \text{Brain size} + e \qquad (10.3) \]
\[ e \sim N(0, \sigma^2) \qquad (10.4) \]
Note that in this model, the values of 70 and 0.01 are fixed, that is, we use the
same intercept and the same slope for everyone. You use these values for any
person, for Henry, Jake, Liz, and Margaret. We therefore call these effects of
intercept and slope fixed effects, as they are all the same for all units of analysis.
In contrast, we call the e term, the random error term or the residual in the
regression, a random effect. This is because the error term is different for every
unit. We don’t know the specific values of these random errors or residuals for
every person, but nevertheless, we assume that they come from a distribution,
in this case a normal distribution with mean 0 and an unknown variance. This
unknown variance is given the symbol σ 2 .
Here are a few more examples.
1. Suppose we study a number of schools, and for every school we use a simple
linear regression equation to predict the number of students (dependent
variable) on the basis of the number of teachers (independent variable).
For every unit of analysis (in this case: school), the intercept and the
regression slope are the same (fixed effects), but the residuals are different
(random effect).
2. Suppose we study reaction times, and for every measure of reaction time
– a trial – we use a simple linear regression equation to predict reaction
time in milliseconds on the basis of the characteristics of the stimulus.
Here, the unit of analysis is trial, and for every trial, the intercept and the
regression slope are the same (fixed effects), but the residuals are different
(random effect).
3. Suppose we study a number of students, and for every student we use
a simple linear regression equation to predict the math test score on the
basis of the number of hours of study the student puts in. Here, the unit of
analysis is student, and for every student, the intercept and the regression
slope are the same (fixed effects), but the residuals are different (random
effect).
Let’s focus for now on the last example. What happens when we have a lot
of data on students, but the students come from different schools? Suppose we
want to predict the math test score for every student, on the basis of the number of
hours of study the student puts in. We again could use a simple linear regression
equation.
\[ Y = b_0 + b_1 \times \text{hourswork} + e \qquad (10.5) \]
\[ e \sim N(0, \sigma^2) \qquad (10.6) \]
That would be fine if all schools were very similar. But suppose
that some schools have a lot of high-scoring students, and some schools have a
lot of low-scoring students. Then school itself would also be a very important
predictor, apart from the number of hours of study. One could say that the data
are clustered : math test scores coming from the same school are more similar
than math test scores coming from different schools. When we do not take this
into account, the residuals will not show independence (see Chapter 7 on the
assumptions of linear models).
One thing we could therefore do to remedy this is to include school as a
categorical predictor. We would then have to code this school variable into a
number of dummy variables. The first dummy variable called school1 would
indicate whether students are in the first school (school1 = 1) or not (school1
= 0). The second dummy variable school2 would indicate whether students are
in the second school (school2 = 1) or not (school2 = 0), etcetera. You can
then add these dummy variables to the regression equation like this:

\[ Y = b_0 + b_1 \times \text{hourswork} + b_2 \times \text{school1} + b_3 \times \text{school2} + \ldots + e \]
\[ e \sim N(0, \sigma^2) \]
In the output we would find a large number of effects, one for each dummy
variable. For example, if the students came from 100 different schools, you
would get 99 fixed effects for the 99 dummy variables. However, one could
wonder whether this is very useful. As stated earlier, fixed effects are called
fixed because they are the same for every unit of research, in this case every
student. But working with 99 dummy variables, where students mostly score 0,
this seems very much over the top. In fact, we’re not even interested in these
99 effects. We’re interested in the relationship between test score and hours
of work, meanwhile taking into account that there are test score differences
across schools. The dummy variables are only there to account for differences
across schools; the prediction for one school is a little bit higher or lower than
for another school, depending on how well students generally perform in each
school.
We could therefore try an alternative model, where we treat the school effect
as random: we assume that every school has a different average test score, and
that these averages are normally distributed. We call these average test score
deviations school effects:

\[ Y = b_0 + \text{schooleffect} + b_1 \times \text{hourswork} + e \]
\[ \text{schooleffect} \sim N(0, \sigma^2_{\text{school}}) \]
\[ e \sim N(0, \sigma^2_e) \]
So in this equation, the intercept is fixed, that is, the intercept is the same
for all observed test scores. The regression coefficient b1 for the effect of hours
of work is also fixed. But the schooleffect is random, since it is different for
every school. The residual e is also random, being different for every student.
It could also be written like this:
This representation emphasizes that for every school, the intercept is a little
bit different: for school A the intercept might be b0 + 2, and for school B the
intercept might be b0 − 3.
So, equation 10.10 states that every observed test score is the sum of a school-specific intercept, an effect of the number of hours of work, and a residual.
To put it more formally: test score Yij , that is, the test score from student j
in school i, is the sum of an effect of the school b0 + schooleffecti (the average
test score in school i), plus an effect of hours of work, b1 × hourswork, and an
unknown residual eij (a specific residual for the test score for student j in school
i).
Something similar holds for the reaction time example: participants differ in their average speed of reaction times. To take this into account we can use the following linear
equation:

\[ Y_{ij} = (b_0 + \text{speed}_i) + b_1 \times \text{size}_{ij} + e_{ij} \]

where $Y_{ij}$ is reaction time j from participant i, $(b_0 + \text{speed}_i)$ is a random
intercept representing the average speed for each participant i (where b0 is the
overall average across all participants and speedi the random deviation for each
and every participant), b1 is the fixed effect of the size of the stimulus. Unknown
residual eij is a specific residual for the reaction time for trial j of participant i.
The reason for introducing random effects is that when your observed data
are clustered, for instance student scores clustered within schools, or trial response
times are clustered within participants, you violate the assumption of independence:
two reaction times from the same person are more similar than two reaction
times from different persons. Two test scores from students from the same
school may be more similar than two scores from students in different schools
(see Chapter 7). When this is the case, when data are clustered, it is very
important to take this into account. When the assumption of independence is
violated, you will draw wrong inferences if you use an ordinary linear model,
the so-called general linear model (GLM). With clustered data, it is therefore
necessary to work with an extension of the general linear model or GLM, the
linear mixed model. The above models for students’ test scores across different
schools and reaction times across different participants, are examples of linear
mixed models. The term mixed comes from the fact that the models contain a
mix of both fixed and random effects. GLMs only contain fixed effects, apart
from the random residual.
If you have clustered data, you should take this clustering into account, either
by using the grouping variable as a categorical predictor or by using a random
factor in a linear mixed model. As a rule of thumb: if you have fewer than
10 groups, consider a fixed categorical factor; if you have 10 or more groups,
consider a random factor. Two other rules you can follow are: use a random
factor if the assumption of normally distributed group differences is tenable,
and use a fixed categorical factor if you are actually interested in the size of the group
differences.
Below, we will start with a very simple example of a linear mixed model,
one that we use for a simple pre-post intervention design.
Table 10.1: Headache measurements in NY Times readers suffering from
headaches.
patient pre post
001 55 45
002 63 50
003 66 56
004 50 37
005 63 50
... ... ...
Figure 10.1: Scatter plot of the post-intervention headache scores against the pre-intervention headache scores.
mg of aspirin. These patients are randomly selected among people who read
the NY Times and suffer from regular headaches. So here we have clustered
data: we have 100 patients, and for each patient we have two scores, one before
(pre) and one after (post) the intervention of taking aspirin. Of course, overall
headache severity levels tend to vary from person to person, so we might have
to take into account that some patients have a higher average level of pain than
other patients.
The data could be represented in different ways, but suppose we have the
data matrix in Table 10.1 (showing only the first five patients). What we observe
in that table is that the severity seems generally lower after the intervention
than before the intervention. But you may also notice that the severity of
the headache also varies across patients: some have generally high scores (for
instance patient 003), and some have generally low scores (for example patient
001). Therefore, the headache scores seem to be clustered, violating the assumption
of independence. We can quantify this clustering by computing a correlation
between the pre-intervention scores and the post-intervention scores. We can
also visualise this clustering by a scatter plot, see Figure 10.1. Here it appears
that there is a strong positive correlation, indicating that the higher the pain
score before the intervention, the higher the pain score after the intervention.
There is an alternative way of representing the same data. Let’s look at
the same data in a new format in Table 10.2. In Chapter 1 we saw that this
representation is called long format.
\[ Y_{ij} = b_0 + \text{patient}_i + b_1 \times \text{measure2} + e_{ij} \qquad (10.19) \]
\[ \text{patient}_i \sim N(0, \sigma^2_p) \qquad (10.20) \]
\[ e_{ij} \sim N(0, \sigma^2_e) \qquad (10.21) \]
where Yij is the jth headache severity score (first or second) for patient i,
$(b_0 + \text{patient}_i)$ is the average amount of headache before aspirin for patient
i (with $b_0$ the overall average and $\text{patient}_i$ the deviation of patient i from that average), measure2 is a dummy variable, and $b_1$ is the effect
of the intervention (by how much the severity changes from pre to post). We
assume that the average pain level for each patient shows a normal distribution
with average b0 and variance σp2 . And of course we assume that the residuals
show a normal distribution.
An analysis with this model can be done with the R package lme4. That
package contains the function lmer() that works more or less the same as the
lm() function that we saw earlier, but requires the addition of at least one
random variable. Below, we run a linear mixed model, with dependent variable
headache, a regular fixed effect for the categorical variable measure, and a
random effect for the categorical variable patient.
library(lme4)
out <- data %>%
lmer(headache ~ measure + (1|patient), data = .)
summary(out)
##
## Correlation of Fixed Effects:
## (Intr)
## measure2 -0.342
In the output we see the results. We’re mainly focused on the fixed effect of
the intervention: does aspirin reduce headache? Where it says ’Fixed effects:’
in the output, we see the linear model coefficients, with an intercept of around
59 and a negative effect of the intervention dummy variable measure2, around
−10. We see that the dummy variable was coded 1 for the second measure
(after taking aspirin). So, for our dependent variable headache, we see that
the expected headache severity for the observations with a 0 for the dummy
variable measure2 (that is, measure 1, which is before taking aspirin), is equal
to 59 − (10) × 0 = 59.
Similarly, we see that the expected headache severity for the observations
with a 1 for the dummy variable measure2 (that is, after taking aspirin),
is equal to 59 − (10) × 1 = 59 − 10 = 49. So, expected pain severity is
10 points lower after the intervention than before the intervention. Whether
this difference is significant is indicated by a t-statistic. We see here that the
average headache severity after taking an aspirin is significantly different from
the average headache severity before taking an aspirin, t = −25.46. However,
note that we do not see a p-value. That’s because the degrees of freedom are not
clear, because we also have a random variable in the model. The determination
of the degrees of freedom in a linear mixed model is a complicated matter, with
different choices, the discussion of which is beyond the scope of this book.
However, we do know that for whatever the degrees of freedom really are, a t-
statistic of -25.46 will always be in the far tail of the t-distribution, see Appendix
B. So we have good reason to conclude that we have sufficient evidence to reject
the null-hypothesis that headache levels before and after aspirin intake are the
same.
If we do not want to test a null-hypothesis, but want to construct a confidence
interval, we run into the same problem that we do not know what t-distribution
to use. Therefore we do not know with what value the standard error of 0.40
should be multiplied to compute the margin of error. We could however use
a coarse rule of thumb and say that the critical t-value for a 95% confidence
interval is more or less equal to 2 (see Appendix B). If we do that, then we get the
confidence interval for the effect of aspirin: between −10.36 − 2 × 0.4069 = −11.17
and −10.36 + 2 × 0.4069 = −9.55.
Taking into account the direction of the effect and the confidence interval for
this effect, we might therefore carefully conclude that aspirin reduces headache
in the population of NY Times readers with headache problems, where the
reduction is around 10 points on a 1...100 scale (95% CI: 9.55 – 11.17).
Now let’s look at the output regarding the random effect of patient more
closely. The model assumed that the individual differences in headache severity
in the 100 patients came from a normal distribution. How large are these
individual differences actually? This can be gleaned from the ’Random effects:’
part of the R output. The ’intercept’ (i.e., the patient effect in our model)
seems to vary with a variance of 27, which is equivalent to a standard deviation
of $\sqrt{27}$, which is around 5.2. What does that mean exactly? Well, let’s look at
the equation again and fill in the numbers:

\[ Y_{ij} = 59 + \text{patient}_i - 10 \times \text{measure2} + e_{ij}, \qquad \text{patient}_i \sim N(0, 27) \]

Since R used the headache level before the intervention as the reference
category, we conclude that the average pain level before taking aspirin is 59.
However, not everybody’s pain level before taking aspirin is 59: people show
variance (variation). The pain level before aspirin varies with a variance of 27,
which is equivalent to a standard deviation of around 5.2. Figure 10.2 shows
how much this variance actually is. It depicts a normal distribution with a mean
of 59 and a standard deviation of 5.2.
So before taking aspirin, most patients show headache levels roughly between
50 and 70. More specifically, if we take the middle 95% by using plus or
minus twice the standard deviation, we can estimate that 95% of the patients
show levels between 59 − 2 × 5.2 = 48.6 and 59 + 2 × 5.2 = 69.4.
Now let’s look at the levels after taking aspirin. The average headache level
is equal to 59 − 10 = 49. So 95% of the patients show headache levels between
49 − 2 × 5.2 = 38.6 and 49 + 2 × 5.2 = 59.4 after taking aspirin.
Figure 10.3: Distribution of headache scores before and after taking aspirin,
according to the linear mixed model.
Together these results are visualised in Figure 10.3. In this plot you see there
is variability in headache levels before taking aspirin, and there is variation in
headache levels after taking aspirin. We also see that these distributions have the
same spread (variance): in the model we assume that the variability in headache
before aspirin is equal to the variability after aspirin (homoscedasticity). The
distributions are equal, except for a horizontal shift: the distribution for headache
after aspirin is the same as the distribution before aspirin, except for a shift to
the left of about 10 points. This is of course the effect of aspirin in the model,
the b1 parameter in our model above.
The fact that the two distributions before and after aspirin show the same
spread (variance) was an inherent assumption in our model: we only have one
random effect for patient in our output with one variance (σp2 ). If the assumption
of equal variance (homoscedasticity) is not tenable, then one should consider
other linear mixed models. But this is beyond the scope of this book. The
assumption can be checked by plotting the residuals, using different colours for
residuals from before taking aspirin and for residuals from after taking aspirin.
library(modelr)
data %>%
add_residuals(out) %>%
add_predictions(out) %>%
ggplot(aes(x = pred, y = resid, colour = measure)) +
geom_point() +
xlab("Predicted headache") +
ylab("Residual") +
scale_colour_brewer(palette = "Set1") # use nice colours
data %>%
add_residuals(out) %>%
ggplot(aes(x = measure, y = resid)) +
geom_boxplot() +
xlab("Measure") +
ylab("Residual")
The plots show that the variation in the residuals is about the same for pre
and post aspirin headache levels (box plot) and for all predicted headache levels
(scatter plot). This satisfies the assumption of homogeneity of variance.
If all assumptions are satisfied, you are at liberty to make inferences regarding
the model parameters. We saw that the effect of aspirin was estimated at
about 10 points with a rough 95% confidence interval, that was based on the
rule of thumb of 2 standard errors around the estimate. For people that are
uncomfortable with such quick and dirty estimates, it is also possible to use
Satterthwaite’s approximation of the degrees of freedom. You can obtain these,
once you load the package lmerTest. After loading, your lmer() analysis will
yield (estimated) degrees of freedom and the p-values associated with the t-
statistics and those degrees of freedom.
library(lmerTest)
##
## Attaching package: ’lmerTest’
## The following object is masked from ’package:lme4’:
##
## lmer
## The following object is masked from ’package:stats’:
##
## step
out <- data %>%
lmer(headache ~ measure + (1|patient), data = .)
summary(out)
## Correlation of Fixed Effects:
## (Intr)
## measure2 -0.342
qt(0.975, df = 99)
## [1] 1.984217
If we calculate the 95% confidence interval using the new critical value, we
obtain 59 − 1.98 × 5.2 = 48.7 and 59 + 1.98 × 5.2 = 69.3. That’s only a minor
difference with what we saw before.
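The output below appears to come from refitting the model with maximum likelihood rather than REML estimation; a sketch of the call that could produce such output (the REML = FALSE argument is the assumption here):

out <- data %>%
  lmer(headache ~ measure + (1|patient), data = ., REML = FALSE)
summary(out)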
## Data: .
##
## AIC BIC logLik deviance df.resid
## 1198.5 1211.7 -595.3 1190.5 196
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.37429 -0.42643 -0.04998 0.49953 2.18189
##
## Random effects:
## Groups Name Variance Std.Dev.
## patient (Intercept) 26.872 5.184
## Residual 8.195 2.863
## Number of obs: 200, groups: patient, 100
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 59.6800 0.5922 126.0065 100.78 <0.0000000000000002 ***
## measure2 -10.3600 0.4049 100.0000 -25.59 <0.0000000000000002 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Correlation of Fixed Effects:
## (Intr)
## measure2 -0.342
In the output you can see ”Linear mixed model fit by maximum likelihood”.
In this case, this is the output that we would be most interested in: we were
interested in the effect of aspirin on headache, and this was modelled as a
fixed effect of variable measure on the dependent variable. We only included
the random patient variable in order to deal with the dependency in the data.
We were never interested in the size of $\sigma^2_p$. Maximum likelihood (ML) estimation is generally
preferred when the interest is in the fixed effects, whereas REML mainly gives better estimates of the
variance components. My guess is that in the behavioural,
management and social sciences, interest is practically always in the fixed effects.
Therefore it is generally advised to add REML = FALSE when you use the lmer()
function.
Chapter 11
So for each patient we have three measures: pre, post1 and post2. To see
whether there is some clustering, we can no longer simply compute
a single correlation. We could however compute 3 different correlations: pre-
post1, pre-post2, and post1-post2, but this is rather tedious, and moreover
does not give us a single measure of the extent of clustering of the data. But
there is an alternative: one could compute not a Pearson correlation, but an
intraclass correlation (ICC). To do this, we need to bring the data again into
long format, as opposed to wide format, see Chapter 1. This is done in Table
11.2.
Next, we can perform an analysis with the lmer() function from the lme4
package.
library(lme4)
model1 <- datalong %>%
lmer(headache ~ measure + (1|patient), data = ., REML = FALSE)
model1
In the output we see the fixed effects of two automatically created dummy
variables measurepost2 and measurepre, and the intercept. We also see the
standard deviations of the random effects: the standard deviation of the residuals
and the standard deviation of the random effects for the patients.
From this output, we can plug the values into the equation:

\[ Y_{ij} = 49.32 + \text{patient}_i + 2.36 \times \text{measurepost2} + 9.85 \times \text{measurepre} + e_{ij} \]
Based on this equation, the expected headache severity score in the population
three hours after aspirin intake is 49.32 (the first post measure is the reference
group). Dummy variable measurepost2 is coded 1 for the measurements 24
hours after aspirin intake. Therefore, the expected headache score 24 hours after
aspirin intake is equal to 49.32 + 2.36 = 51.68. Dummy variable measurepre
was coded 1 for the measurements before aspirin intake. Therefore, the expected
headache before aspirin intake is equal to 49.32 + 9.85 = 59.17. In sum, in this
sample we see that the average headache level decreases directly after aspirin
intake from 59.17 to 49.32, but then increases again to 51.68.
The F -statistic for the equality of the group means is quite large (remember
that the expected value is always 1 if the null-hypothesis is true, see Chapter 6).
We see no degrees of freedom. These need to be estimated, for example using
Satterthwaite’s method, provided by the lmerTest package:
Figure 11.1: Distributions of the three headache levels before aspirin intake,
3 hours after intake and 24 hours after intake, according to the linear mixed
model.
library(lmerTest)
model2 <- datalong %>%
lmer(headache ~ measure + (1|patient),
data = .,
REML = FALSE)
model2 %>% anova()
To find out which of the three means differ from each other, one could perform a post hoc analysis of the three means. See Chapter ?? on how to perform
planned comparisons and post hoc tests.
The intraclass correlation (ICC) is computed as

\[ ICC = \frac{\sigma^2_{\text{patient}}}{\sigma^2_{\text{patient}} + \sigma^2_e} \qquad (11.1) \]
Here, the variance of the patient random effects is equal to 5.29² = 27.98,
and the variance of the residuals e is equal to 2.91² = 8.47, so the intraclass
correlation for the headache severity scores is equal to

\[ ICC = \frac{27.98}{27.98 + 8.47} = 0.77 \qquad (11.2) \]
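The same number can be computed directly from the estimated variance components (a small illustration, not the book's code):

var_patient  <- 5.29^2   # 27.98
var_residual <- 2.91^2   # 8.47
var_patient / (var_patient + var_residual)  # approximately 0.77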
As this correlation is substantially higher than 0, we conclude there is quite
a lot of clustering. Therefore it’s a good thing that we used random effects for
the individual differences in headache scores among NY Times readers. Had
this correlation been 0 or very close to 0, however, then it would not have
mattered to include these random effects. In that case, we might as well use an
ordinary linear model, using the lm() function. Note from the formula that the
correlation becomes 0 when the variance of the random effects for patients is 0.
It approaches 0 as the variance of the random effects for patients grows small relative to the
residual variance, and it approaches 1 as the variance of the random effects for patients grows large
relative to the residual variance. Because variance cannot be negative, ICCs
always have values between 0 and 1.
Table 11.3: Headache measures in NY Times readers in long format with a new
variable time.
patient measure headache time
1 pre 52 0
1 post1 45 3
1 post2 47 24
2 pre 59 0
2 post1 50 3
2 post2 55 24
3 pre 65 0
3 post1 56 3
3 post2 58 24
4 pre 51 0
Suppose headache is measured at time 0, and patients then take an aspirin. Next we measure headache after 3 hours and 24
hours. Above, we wanted to know if there were differences in average headache
between before intake and 3 hrs and 24 hrs after intake. Another question we
might ask ourselves: is there a linear reduction in headache severity after taking
aspirin?
For this we can do a linear regression type of analysis. We want to take into
account individual differences in headache severity levels among patients, so we
perform an lmer() analysis, using the following code, replacing the categorical
variable measure by numerical variable time:
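A sketch of such a call (the model name is an assumption):

model3 <- datalong %>%
  lmer(headache ~ time + (1|patient), data = ., REML = FALSE)
model3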
In the output we see that the model for our data is equivalent to

\[ Y_{ij} = 54.79 + \text{patient}_i - 0.1557 \times \text{time} + e_{ij} \]
Figure 11.2: Headache levels before aspirin intake, 3 hours after intake and 24
hours after intake.
This model predicts that at time 0, the average headache severity score
equals 54.79, and that for every hour after intake, the headache level drops by
0.1557 points. So it predicts for example that after 10 hours, the headache has
dropped 1.557 points to 53.23.
Is this a good model for the data? Probably not. Look at the variance of the
residuals: with a standard deviation of 5.55 it is now a lot bigger than in the
previous analysis with the same data (see previous section). Larger variance of
residuals means that the model explains the data worse: predictions are worse,
so the residuals increase in size.
That the model is not appropriate for this data set is also obvious when we
plot the data, focusing on the relationship between time and headache levels,
see Figure 11.2.
The line shown is the fitted line based on the output. It can be seen that the
prediction for time = 0 is systematically too low, for time = 3 systematically
too high, and for time = 24 again too low. So for this particular data set on
headache, it would be better to use a categorical predictor for the effect of time
on headache, like we did in the previous section.
Figure 11.3: Alternative headache levels before aspirin intake, 3 hours after
intake and 24 hours after intake.
As an example of a data set where a linear effect would have been appropriate,
imagine that we measured headache 0 hours, 2 hours and 3 hours after aspirin
intake (and not after 24 hours). Suppose these data looked like those in
Figure 11.3. There we see a gradual decrease of headache levels right after
aspirin intake. Here, a numeric treatment of the time variable would be quite
appropriate.
Suppose we would then see the following output.
## Residual 8.732 2.955
## Number of obs: 300, groups: patient, 100
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 58.9721 0.6000 134.6807 98.29 <0.0000000000000002 ***
## time -3.3493 0.1368 200.0000 -24.48 <0.0000000000000002 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Correlation of Fixed Effects:
## (Intr)
## time -0.380
Because we are confident that this model is appropriate for our data, we can
interpret the statistical output. The Satterthwaite error degrees of freedom are
200, so we can construct a 95% confidence interval by finding the appropriate
t-value.
qt(0.025, df = 200)
## [1] -1.971896
The 95% confidence interval for the effect of time is then from −3.349 −
1.97 × 0.1368 to −3.349 + 1.97 × 0.1368, so from -3.62 to -3.08. We can report:
A linear mixed model was run on the headache levels, using the
numeric predictor variable time and random effects for the variable
patient. We saw a significant linear effect of time on headache level,
t(200) = −24.48, p < 0.001. The estimated effect of time based
on this analysis is negative, −3.3, so with every hour that elapses
after aspirin intake, the predicted headache score decreases by 3.3
points (95% CI: 3.08 to 3.62 points).
H0 : The effect of aspirin is the same for NY Times readers as for Wall Street
Journal readers.
Suppose we have the data set in Table 11.4 (we only show the first six patients),
and we only look at the measurements before aspirin intake and 3 hours after
aspirin intake (pre-post design).
Table 11.4: Headache measures in NY Times and Wall Street Journal readers
in wide format.
patient group pre post
1 NYTimes 55 45
2 WallStreetJ 63 50
3 NYTimes 66 56
4 WallStreetJ 50 37
5 NYTimes 63 50
6 WallStreetJ 65 53
In this part of the data set, patients 2, 4, and 6 read the Wall Street Journal,
and patients 1, 3 and 5 read the NY Times. We assume that people only read
one of these newspapers. We measure their headache before and after the intake
of aspirin (a pre-post design). The data are now in what we call wide format:
the dependent variable headache is spread over two columns, pre and post. In
order to analyse the data with linear models, we need them in long format, as
in Table 11.5.
Table 11.5: Headache measures in NY Times and Wall Street Journal readers
in long format.
patient group measure headache
1 NYTimes pre 55
1 NYTimes post 45
2 WallStreetJ pre 63
2 WallStreetJ post 50
3 NYTimes pre 66
3 NYTimes post 56
Note that this hypothesis states that there is no interaction effect of aspirin (measure)
and group. The null-hypothesis is that group is not a moderator of the effect
of aspirin on headache. There may be an effect of aspirin or there may not,
and there may be an effect of newspaper (group) or there may not, but we’re
interested in the interaction of aspirin and group membership. Is the effect of
aspirin different for NY Times readers than for Wall Street Journal readers?
In our model we therefore need to specify an interaction effect. Since the
data are clustered (2 measures per patient), we use a linear mixed model. First
we show how to analyse these data using dummy variables, later we will show
the results using a different approach.
We recode the data into two dummy variables, one for the aspirin intervention
(dummy1: 1 if measure = post, 0 otherwise), and one for group membership
(dummy2: 1 if group = NYTimes, 0 otherwise):
## # A tibble: 3 x 6
## patient group measure headache dummy1 dummy2
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 NYTimes pre 55 0 1
## 2 1 NYTimes post 45 1 1
## 3 2 WallStreetJ pre 63 0 0
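The recoding itself is not shown above; it can be done with mutate(), for instance like this (a sketch, assuming the long data are stored in the tibble datalong that is also used in the model code below):

datalong <- datalong %>%
  mutate(dummy1 = ifelse(measure == "post", 1, 0),   # 1 for post-aspirin measurements
         dummy2 = ifelse(group == "NYTimes", 1, 0))  # 1 for NY Times readers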
Next we need to compute the product of these two dummies to code a dummy
for the interaction effect. Since with the above dummy coding, all post measures
get a 1, and all NY Times readers get a 1, only the observations that are post
aspirin and that are from NY Times readers get a 1 for this product.
## # A tibble: 3 x 7
## patient group measure headache dummy1 dummy2 dummy_interact
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 NYTimes pre 55 0 1 0
## 2 1 NYTimes post 45 1 1 1
## 3 2 WallStreetJ pre 63 0 0 0
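The product dummy can be computed in the same way, for instance:

datalong <- datalong %>%
  mutate(dummy_interact = dummy1 * dummy2)  # 1 only for post measurements of NY Times readers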
With these three new dummy variables we can specify the linear mixed
model.
model5 <- datalong %>%
lmer(headache ~ dummy1 + dummy2 + dummy_interact + (1|patient),
data = .,
REML = FALSE)
model5
In the output, we recognise the three fixed effects for the three dummy
variables. Since we’re interested in the interaction effect, we look at the effect
of dummy interact. The effect is in the order of +0.6. What does this mean?
Remember that all headache measures before aspirin intake are given a 0
for the intervention dummy dummy1. A reader from the Wall Street Journal
gets a 0 for the group dummy dummy2. Since the product of 0 × 0 equals 0, all
measures before aspirin in Wall Street Journal readers get a 0 for the interaction
dummy dummy interact. Therefore, the intercept of 59.52 refers to the expected
headache severity of Wall Street Journal readers before they take their aspirin.
Furthermore, we see that the effect of dummy1 is -10.66. The variable dummy1
codes for post measurements. So, relative to Wall Street Journal readers prior
to aspirin intake, the level of post intake headache is 10.66 points lower.
If we look further in the output, we see that the effect of dummy2 equals
+0.32. This variable dummy2 codes for NY Times readers. So, relative to Wall
Street Journal readers and before aspirin intake (the reference group), NY Times
readers score on average 0.32 points higher on the headache scale.
However, we’re not interested in a general difference between those two
groups of readers, we’re interested in the effect of aspirin and whether it is
different in the two groups of readers. In the output we see the interaction
effect: for measurements that are both from NY Times readers and taken after
aspirin intake, the expected headache level is an extra +0.60. The effect
of aspirin is −10.66 in Wall Street Journal readers, as we saw above, but the
effect is −10.66 + 0.60 = −10.06 in NY Times readers. So in this sample the
effect of aspirin on headache is 0.60 points smaller in NY Times readers than in
Wall Street Journal readers (note that even though the interaction effect is
positive, it is positive on a scale where a high score means more headache).
Table 11.6: Expected headache levels in Wall Street Journal and NY Times
readers, before and after aspirin intake.
measure group dummy1 dummy2 dummy_interact expected mean
pre WallStreetJ 0 0 0 60
post WallStreetJ 1 0 0 60 + (−11) = 49
pre NYTimes 0 1 0 60 + 0.3 = 60.3
post NYTimes 1 1 1 60 + (−11) + 0.3 + 0.6 = 49.9
Let’s look at it in a different way, using a table with the dummy codes, see
Table 11.6. For each group of data, pre or post aspirin and NY Times readers
and Wall Street Journal readers, we note the dummy codes for the new dummy
variables. In the last column we use the output estimates and multiply them by
the respective dummy codes (1 and 0) to obtain the expected headache level
(using rounded numbers).
The exact numbers are displayed in Figure 11.4. We see that the specific
effect of aspirin in NY Times readers is 0.60 smaller than the effect of aspirin
in Wall Street Journal readers. This difference in the effect of aspirin between
the groups was not significantly different from 0, as we can see when we let R
print a summary of the results.
Figure 11.4: Expected headache levels in NY Times readers and Wall Street
Journal readers based on a linear mixed model with an interaction effect.
Note that we could have done the analysis in another way, not recoding
the variables into numeric dummy variables ourselves, but by letting R do it
automatically. R does that automatically for factor variables like our variable
group. The code is then:
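A sketch of that code, reconstructed from the model formula shown in the output below:

datalong %>%
  lmer(headache ~ measure + group + measure:group + (1|patient),
       data = .,
       REML = FALSE) %>%
  summary()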
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite’s
## method [lmerModLmerTest]
## Formula: headache ~ measure + group + measure:group + (1 | patient)
## Data: .
##
## AIC BIC logLik deviance df.resid
## 1201.7 1221.5 -594.8 1189.7 194
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.34085 -0.43678 -0.02942 0.50880 2.22534
##
## Random effects:
## Groups Name Variance Std.Dev.
## patient (Intercept) 26.80 5.177
## Residual 8.15 2.855
## Number of obs: 200, groups: patient, 100
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) 49.7800 0.8361 125.9463 59.542
## measurepre 10.0600 0.5710 100.0000 17.619
## groupWallStreetJ -0.9200 1.1824 125.9463 -0.778
## measurepre:groupWallStreetJ 0.6000 0.8075 100.0000 0.743
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## measurepre <0.0000000000000002 ***
## groupWallStreetJ 0.438
## measurepre:groupWallStreetJ 0.459
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Correlation of Fixed Effects:
## (Intr) mesrpr grpWSJ
## measurepre -0.341
## grpWllStrtJ -0.707 0.241
## msrpr:grWSJ 0.241 -0.707 -0.341
Although the parameter estimates look different from those of the analysis with our own dummy variables, the
significance level of the interaction effect is still the same. You are always free
to choose to either construct your own dummy variables and analyse them in
a quantitative way (using numeric variables), or to let R construct the dummy
variables for you (by using a factor variable): the p-value for the interaction
effect will always be the same (this is not true for the intercept and the main
effects).
Because the two analyses are equivalent (they end up with exactly the same
predictions, feel free to check!), we can safely report that we found a non-significant
group by measure interaction effect, t(100) = 0.74, p = 0.46. We therefore
conclude that we found no evidence that the short-term effect of aspirin on
headache differs between the populations of NY Times readers and Wall Street
Journal readers.
2. The second option is to have 50 participants drive a Porsche, with and
without alcohol, and to have the other 50 participants drive the Fiat,
with and without alcohol. In this case, the car is the between-participants
variable, and alcohol is the within-participant variable.
4. The fourth option is to have 25 participants drive the Porsche with alcohol,
25 other participants drive the Porsche without alcohol, 25 participants
drive the Fiat with alcohol, and the remaining 25 participants drive the
Fiat without alcohol. Now both the car variable and the alcohol variable
are between-participant variables: none of the participants is present in
more than 1 condition.
Only the second and the third design described here are mixed designs,
having at least one between-participants variable and at least one within-participant
variable.
Remember that when there is at least one within variable in your design,
you have to use a linear mixed model. If all variables are between variables, one
can use an ordinary linear model. Note that the term mixed in linear mixed
model refers to the effects in the model that can be both random and fixed. The
term mixed in mixed designs refers to the mix of two kinds of variables: within
variables and between variables.
Also note that the within and between distinction refers to the units of
analysis. If the unit of analysis is school, then the denomination of the school is
a between-school variable. An example of a within-school variable could be time:
before a major curriculum reform and after a major curriculum reform. Or it
could be teacher: classes taught by teacher A or by teacher B, both teaching at
the same school.
Let us now look at an example with a within variable with 20 levels. The example is about stress in athletes that are going
to partake in the 2018 Winter Olympics. Stress can be revealed in morning
cortisol levels. In the 20 days preceding the start of the Olympics, each athlete
was measured every morning after waking and before breakfast by letting them
chew on cotton. The cortisol level in the saliva was then measured in the lab.
Our research question is by how much cortisol levels rise in athletes that prepare
for the Olympics.
Three groups were studied. One group consisted of 50 athletes who were
selected to partake in the Olympics, one group consisted of 50 athletes that
were very good but were not selected to partake (Control group I) and one
group consisted of 50 non-athlete spectators that were going to watch the games
(Control group II). The research question was about what the differences are in
average cortisol increase in these three groups: the Olympians, Control group I
and Control group II.
In Table 11.7 you see part of the fictional data, the first 6 measurements on
person 1 that belongs to the group of Olympians.
When we plot the data, and use different colours for the three different
groups, we already notice that the Olympians show generally higher cortisol
levels, particularly at the end of the 20-day period (Figure 11.5).
We want to know to what extent the linear effect of time is moderated by
group. Since for every person we have 20 measurements, the data are clustered
so we use a linear mixed model. We’re looking for a linear effect of time, so
we use the measure variable numerically (i.e., it is numeric, and we do not
transform it into a factor). We also use the categorical variable group as a
predictor, but in a qualitative way. It is a factor variable with three levels, so
that R will automatically make two dummy variables. Because we’re interested
in the interaction effects, we include both main effects of group and measure
and their interaction in the model. Lastly, we control for individual differences
in cortisol levels by introducing random effects for person.
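The model-fitting code is not shown on this page; a sketch of what it could look like (the tibble name cortisol_data is an assumption, the variable names follow the text):

# requires lme4/lmerTest, loaded earlier in the book
model_cortisol <- cortisol_data %>%
  lmer(cortisol ~ measure + group + measure:group + (1|person),
       data = .,
       REML = FALSE)
summary(model_cortisol)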
[Figure 11.5: Cortisol levels by measurement day (1 to 20) for Control group I, Control group II and the Olympians.]
## (Intercept) <0.0000000000000002 ***
## measure <0.0000000000000002 ***
## groupControl group II 0.2978
## groupOlympian 0.0651 .
## measure:groupControl group II 0.4096
## measure:groupOlympian <0.0000000000000002 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Correlation of Fixed Effects:
## (Intr) measur grCgII grpOly m:CgII
## measure -0.374
## grpCntrlgII -0.707 0.264
## groupOlympn -0.707 0.264 0.500
## msr:grpCgII 0.264 -0.707 -0.374 -0.187
## msr:grpOlym 0.264 -0.707 -0.187 -0.374 0.500
In the output we see an intercept of 20.09, a slope of 0.60 for the effect
of measure, two main effects for the variable group (Control group I is the
reference group), and two interaction effects (one for Control group II and one
for the Olympian group). Let’s fill in the linear model equation based on this
output:
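Only the intercept (20.09) and the slope for measure (0.60) are visible in the part of the output reproduced above; writing the remaining estimates as b_2 to b_5, the equation has the form

expected cortisol = 20.09 + 0.60 × measure + b_2 × dummyControlII + b_3 × dummyOlympian + b_4 × measure × dummyControlII + b_5 × measure × dummyOlympian

where dummyControlII and dummyOlympian are the two dummy variables that R creates for the group factor (Control group I is the reference group).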
Figure 11.6: Cortisol levels over time in three groups with the group-specific
regression lines.
we see a significant measure by group interaction effect, F (2, 2850) = 1859.16, p <
.001. The null-hypothesis of the same cortisol change in three different populations
can be rejected, and we conclude that Olympian athletes, non-Olympian athletes
and spectators show a different change in cortisol levels in the weeks preceding
the games.
Chapter 12
Non-parametric alternatives
for linear mixed models
not acquainted with them), and your data show non-equal variance and/or non-
normally distributed residuals, there are non-parametric alternatives. Here we
discuss two: Friedman’s test and Wilcoxon’s signed rank test. We explain them
using an imaginary data set on speed skating.
set is too small to say anything about assumptions at the population level. Residuals for a
data set of 8 persons might show very non-normal residuals, or very different variances for two
subgroups of 4 persons each, but that might just be a coincidence, a random result because
of the small sample size. If in doubt, it is best to use non-parametric methods.
Figure 12.2: Residuals of the speedskating data with a linear mixed model.
Figure 12.3: Histogram of the residuals of the speedskating data with a linear
mixed model.
to another kind of test. Here we discuss Friedman's test, a non-parametric test
for testing the null-hypothesis that the medians of the three groups of data
are the same (see Chapter 1). This Friedman test can be used in all situations
where you have at least 2 levels of the within variable. In other words, you
can use this test when you have data from three occasions, but also when you
have data from 10 occasions or only 2. In a later section the Wilcoxon signed
ranks test is discussed. This test is often used in social and behavioural sciences.
The downside of this test is that it can only handle data sets with 2 levels of
the within variable. In other words, it can only be used when we have data
from two occasions. Friedman's test is therefore more generally applicable than
Wilcoxon's, so we advise using the Friedman test; for the sake of completeness,
however, we will also explain the Wilcoxon test.
We rank all of these time measures by determining the fastest time, then the
next to fastest time, etcetera, until the slowest time. But because the data in
each row belong together (we compare individuals with themselves), we do the
ranking row-wise. For each athlete separately, we determine the fastest time
(1), the next fastest time (2), and the slowest time (3) and put the ranks in
a new table, see Table 12.2. There we see for example that athlete 1 had the
fastest time at the European Championships (14.35, rank 1) and the slowest at
the Olympics (16.42, rank 3).
Next, we compute the sum of the ranks column-wise: the sum of the ranks
for the European Championships data is 31, for the World Championships data
it is 15 and for the Olympic data it is 26.
Table 12.2: Row-wise ranks of the speed skating data.
athlete EuropeanChampionships WorldChampionships Olympics
1 1 2 3
2 2 1 3
3 2 1 3
4 3 1 2
5 3 2 1
6 3 1 2
7 2 1 3
8 3 1 2
9 3 1 2
10 3 1 2
11 3 2 1
12 3 1 2
From these sums we can gather that in general, these athletes showed their
best times (many rank 1s) at the World Championships, as the sum of the ranks
is lowest. We also see that in general these athletes showed their worst times
(many rank 2s and 3s) at the European Championships, as the relevant column
showed the highest sum of ranks.
In order to know whether these sums of ranks are significantly different from
each other, we may compute an Fr -value based on the following formula:
F_r = \frac{12}{nk(k+1)} \sum_{j=1}^{k} S_j^2 - 3n(k+1) \qquad (12.1)
In this formula, n stands for the number of rows (12 athletes), k stands for
the number of columns (3 occasions), and S_j^2 stands for the squared sum of
the ranks in column j (31^2, 15^2 and 26^2). If we fill in these numbers, we get:
F_r = \frac{12}{12 \times 3 \times (3+1)} \times (31^2 + 15^2 + 26^2) - 3 \times 12 \times (3+1) = \frac{12}{144} \times 1862 - 144 = 11.167
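This calculation is easy to check in R:

S <- c(31, 15, 26)  # column sums of the ranks
n <- 12             # athletes (rows)
k <- 3              # occasions (columns)
12 / (n * k * (k + 1)) * sum(S^2) - 3 * n * (k + 1)   # 11.167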
What can we tell from this Fr -statistic? In order to say something about
significance, we have to know what values are to be expected under the null-
hypothesis that there are no differences across the three groups of data. Suppose
we randomly mixed up the data by taking all the speed skating times and
randomly assigning them to the three contests and the twelve athletes, until we
have a newly filled data matrix, for example the one in Table 12.3.
If we then compute Fr for this data matrix, we get a different value. If we do
this mixing up the data and computing Fr say 1000 times, we get 1000 values
for Fr , summarised in the histogram in Figure 12.4.
Table 12.3: The raw skating data in random order.
athlete EuropeanChampionships WorldChampionships Olympics
1 15.76 14.26 17.78
2 19.44 17.10 17.83
3 16.18 14.83 15.63
4 15.30 17.00 16.42
5 15.79 16.23 17.36
6 14.77 28.00 15.65
7 14.35 16.15 16.30
8 27.90 15.12 18.94
9 16.96 19.01 19.95
10 15.83 17.17 13.89
11 14.99 18.27 15.69
12 18.37 17.67 18.13
So if the data are just randomly distributed over the three columns (and
12 rows) in the data matrix, we expect no systematic differences across the
three columns and so the null-hypothesis is true. So now we know what the
distribution of Fr looks like when the null-hypothesis is true: more or less like
the one in Figure 12.4. Remember that for the true data that we actually
gathered (in the right order that is!), we found an Fr -value of 11.167 . From
the histogram, we see that only very few values of 11.167 or larger are observed
when the null-hypothesis is true. If we look more closely, we find that only a
handful of the 1000 values are larger than 11.167, so we have a p-value of 0.004.
The 95th percentile of these 1000 Fr-values is 6.17, meaning that 5% of the 1000
values for Fr are larger than 6.17. So if we use a significance level of 5%, our
observed value of 11.167 is larger than the critical value for Fr , and we conclude
that the null-hypothesis can be rejected.
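The jumbling-and-recomputing procedure can be sketched in R as follows (this is not the book's own code; it assumes the skating times are stored in a 12 × 3 matrix called times, athletes in rows and occasions in columns):

# Friedman's Fr for a matrix with one row per athlete and one column per occasion
friedman_Fr <- function(m) {
  ranks <- t(apply(m, 1, rank))   # rank the times within each row
  S <- colSums(ranks)             # column sums of the ranks
  n <- nrow(m); k <- ncol(m)
  12 / (n * k * (k + 1)) * sum(S^2) - 3 * n * (k + 1)
}

set.seed(123)
Fr_null <- replicate(1000, {
  shuffled <- matrix(sample(as.vector(times)), nrow = nrow(times))
  friedman_Fr(shuffled)
})
mean(Fr_null >= 11.167)   # proportion of shuffled data sets with an Fr at least as large
quantile(Fr_null, 0.95)   # simulated critical value at the 5% level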
Now this p-value of 0.004 and the critical value of 6.17 are based on our
own computations². Actually there are better ways. One is to look up critical
values of Fr in tables, for instance in Kendall M.G. (1970) Rank correlation
methods. (fourth edition). The p-value corresponding to this Fr -value depends
on k, the number of groups of data (here 3 columns) and n, the number of rows
(12 individuals). If we look up that table, we find that for k = 3 and n = 12
the critical value of Fr for a type I error rate of 0.05 equals 6.17. Our observed
Fr -value of 11.167 is larger than that, therefore we can reject the null-hypothesis
that the median skating times are the same at the three different championships.
So we have to tell your friend that there are general differences in skating times
at different contests, Fr = 11.167, p < 0.05, but it is not the case that the fastest
times were observed at the Olympics.
2 What we have actually done is a very simple form of bootstrapping: jumbling up the data
set many times and in that way determining the distribution of a test-statistic under the
null-hypothesis, in this case the distribution of Fr . For more on bootstrapping, see Davison,
A.C. & Hinkley, D.V. (1997). Bootstrap Methods and their Application. Cambridge, UK:
Cambridge.
Figure 12.4: Histogram of 1000 possible values for Fr given that the null-
hypothesis is true, for 12 speed skaters.
3 The χ2 -distribution is based on the normal distribution: the χ2 -distribution with k degrees
of freedom is the distribution of a sum of the squares of k independent standard normal random
variables.
Figure 12.5: Histogram of 1000 possible values for Fr given that the null-
hypothesis is true, for 120 speed skaters.
[Figure: density plotted against Fr.]
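The two values below are the 95% quantile of the χ²-distribution with 2 degrees of freedom and the corresponding approximate p-value for our Fr of 11.167; presumably they were obtained with calls like these:

qchisq(0.95, df = 2)
pchisq(11.167, df = 2, lower.tail = FALSE)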
## [1] 5.991465
## [1] 0.003759385
datalong <- datawide %>%
  pivot_longer(cols = -athlete, names_to = "occasion", values_to = "time")
We can then specify that we want Friedman’s test by using the friedman.test()
function and indicating which variables we want to use:
datalong %>%
friedman.test(time ~ occasion | athlete, data = .)
##
## Friedman rank sum test
##
## data: time and occasion and athlete
## Friedman chi-squared = 11.167, df = 2, p-value = 0.00376
In the output we see a chi-squared statistic, degrees of freedom, and an
asymptotic (approximated) p-value. Why don’t we see an Fr -statistic?
The reason is, as discussed in the previous section, that for a large number of
measurements (in wide format: columns) and a large number of individuals (in
wide format: rows), the Fr statistic tends to have the same distribution as a
chi-square, χ2 , with k − 1 degrees of freedom. So what we are looking at in this
output is really an Fr -value of 11.167 (exactly the same value as we computed
by hand in the previous section). In order to approximate the p-value, this value
of 11.167 is interpreted as a chi-square (χ2 ), which with 2 degrees of freedom
has a p-value of 0.004.
This asymptotic (approximated) p-value is the correct p-value if you have a
lot of rows (large n) and at least 6 variables (k > 5). If you do not have that,
as we have here, this asymptotic p-value is only what it is: an approximation.
However, this is only a problem when the approximate p-value is close to the
pre-selected significance level α. If α equals 0.05, an approximate p-value of
0.002 is much smaller than that, and we do not hesitate to call it significant,
whatever its true value may be. If a p-value is very close to α, it might be a
good idea to look up the exact critical values for Fr in online tables4 . If your
Fr is larger than the critical value for a certain combination of n, k and α, you
may reject the null-hypothesis.
In the above case we can report:
For each athlete, we take the difference in skating times (Olympics - WorldChampionships)
and call it d, see Table 12.5. Next we rank these d-values, irrespective of sign,
and call these ranks rank d. From Table 12.5 we see that athlete 12 shows
the smallest difference in skating times (d = 0.06, rank = 1) and athlete 2 the
largest difference (d = 3.87, rank = 12).
Next, we indicate for each rank whether it belongs to a positive or a negative
difference d and call that variable ranksign.
Under the null-hypothesis, we expect that some of the larger d-values are
positive and some of them negative, in a fairly equal amount. If we sum the
ranks having plus-signs and sum the ranks having minus-signs, we would expect
that these two sums are about equal, but only if the null-hypothesis is true. If
4 https://www.jstor.org/stable/3315656?seq=1
Table 12.5: The raw skating data and the computations for Wilcoxon signed
ranks test
athlete Olympics WorldChampionships d rank d ranksign
1 16.42 15.79 0.63 5.00 5.00
2 18.13 14.26 3.87 12.00 12.00
3 19.95 18.37 1.58 8.00 8.00
4 17.78 15.12 2.66 10.00 10.00
5 16.96 17.17 -0.21 3.00 -3.00
6 16.15 15.30 0.85 6.00 6.00
7 19.44 15.63 3.81 11.00 11.00
8 16.23 15.69 0.54 4.00 4.00
9 15.76 15.65 0.11 2.00 2.00
10 16.18 14.99 1.19 7.00 7.00
11 13.89 15.83 -1.94 9.00 -9.00
12 14.83 14.77 0.06 1.00 1.00
the sums are very different, then we should reject this null-hypothesis. In order
to see if the difference in sums is too large, we compute them as follows:
T+ = 5 + 12 + 8 + 10 + 6 + 11 + 4 + 2 + 7 + 1 = 66
T− = 3 + 9 = 12
To know whether T + is significantly larger than T − , the value of T + can be
looked up in a table, for instance in Siegel & Castellan (1988). There we see
that for T + , with 12 rows, the probability of obtaining a T + of at least 66 is
0.0171. For a two-sided test (if we would have switched the columns of the two
championships, we would have gotten a T − of 66 and a T + of 12!), we have to
double this probability. So we end up with a p-value of 2 × 0.0171 = 0.0342.
In the table we find no critical values for large sample size n, but fortunately,
similar to the Friedman test, we can use an approximation using the normal
distribution. It can be shown that for large sample sizes, the statistic T + is
approximately normally distributed with mean
\mu = \frac{n(n+1)}{4} \qquad (12.2)

and variance:

\sigma^2 = \frac{n(n+1)(2n+1)}{24} \qquad (12.3)

If we therefore standardise T^+ by subtracting \mu and then dividing by the standard deviation \sigma (the square root of the variance \sigma^2), we get a z-value with mean 0 and standard deviation 1. To do that, we use the following formula:

z = \frac{T^+ - \mu}{\sigma} = \frac{T^+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}} \qquad (12.4)
Here T + is 66 and n equals 12, so if we fill in the formula we get z = 2.118.
From the standard normal distribution we know that 5% of the observations lie
outside the interval from −1.96 to 1.96. So a value for z larger than 1.96 or smaller than
−1.96 is enough evidence to reject the null-hypothesis at the 5% level. Here our z-statistic is
larger than 1.96, therefore we reject the null-hypothesis that the median skating
times are the same at the World Championships and the Olympics. The p-value
associated with a z-score of 2.118 is 0.034.
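The same calculation can be done in R (a quick check, not the book's own code):

Tplus <- 66
n <- 12
mu <- n * (n + 1) / 4                          # 39
sigma <- sqrt(n * (n + 1) * (2 * n + 1) / 24)  # about 12.75
(z <- (Tplus - mu) / sigma)                    # 2.118
2 * pnorm(z, lower.tail = FALSE)               # two-sided p-value, about 0.034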
Next, you use the wilcox.test() function. You select the two variables
(occasions) that you would like to compare, and indicate that the Olympic and
World Championship data are paired within rows (i.e., the Olympic and World
championship data in one row belong to the same individual).
wilcox.test(datawide$Olympics, datawide$WorldChampionships,
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: datawide$Olympics and datawide$WorldChampionships
## V = 66, p-value = 0.03418
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(datawide$Olympics, datawide$WorldChampionships,
paired = TRUE,
exact = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: datawide$Olympics and datawide$WorldChampionships
## V = 66, p-value = 0.03418
## alternative hypothesis: true location shift is not equal to 0
In this case we see that the exact p-value is equal to the approximate p-value.
Note that we use a two-sided test, to allow for the fact that random sampling
could lead to a higher median for the Olympic Games or a higher median for
the World Championships. We just want to know whether the null-hypothesis
that the two medians differ can be rejected (in whatever direction) or not.
Let’s compare the output with the Friedman test, but then only use the
relevant variables in your code:
datalong %>%
filter(occasion != "EuropeanChampionships") %>%
mutate(occasion = as.integer(occasion)) %>%
friedman.test(time ~ occasion | athlete,
data = .)
##
## Friedman rank sum test
##
## data: time and occasion and athlete
## Friedman chi-squared = 5.3333, df = 1, p-value = 0.02092
Note that the friedman.test() function does not perform well if some
variables are factors and you make a selection of the levels. Here we have the
factor occasion with originally 3 levels. If the friedman.test() function only
finds two of these in the data it is supposed to analyse, it throws an error.
Therefore we turn factor variable occasion into an integer variable first. The
friedman.test() function then turns this integer variable into a new factor
variable with only 2 levels before analysis.
In the output we see that the null-hypothesis of equal medians at the World
Championships and the Olympic Games can be rejected, with an approximate
p-value of 0.02.
Note that both the Friedman and Wilcoxon tests come up with very similar
p-values. Their rationales are also similar: Friedman’s test is based on ranks and
Wilcoxon’s test is based on positive and negative differences between measures
1 and 2, so in fact ranks 1 and 2 for each row in the wide data matrix. Both
can therefore be used in the case you have two measures. We recommend to
use the Friedman test, since that test can be used in all situations where you
have 2 or more measures per row. Wilcoxon’s test can only be used if you have
2 measures per row.
In sum, we can report in two ways on our hypothesis regarding similar skating
times at the World Championships and at the Olympics:
How do we know whether the fastest times were at the World Championships
or at the Olympics? If we look at the raw data in Table 12.1, it is not
immediately obvious. We have to inspect the T + = 66 and T − = 12 and
consider what they represent: there is more positivity than negativity. The
positivity is due to positive ranksigns that are computed based on d = Olympics −
WorldChampionships, see Table 12.5. A positive difference d means that the
Olympics time was larger than the WorldChampionships time. A large value
for time stands for slower speed. A positive ranksign therefore means that the
Olympics time was larger (slower!) than the WorldChampionships time. A
large rank d means that the difference between the two times was relatively
large. Therefore, you get a large value of T + if the Olympic times are on the
whole slower than the World Championships times, and/or when these positive
differences are relatively large. When we look at the values of ranksign in Table
12.5, we notice that only two values are negative: one relatively large value and
one relatively small value. The rest of the values are positive, both small and
large, and these all contribute to the T + value. We can therefore state that the
pattern in the data is that for most athletes, the Olympic times are slower than
the times at the World Championships.
12.6 Ties
Many non-parametric tests are based on ranks. For example, if we have the data
sequence 0.1, 0.4, 0.5, 0.2, we give these values the ranks 1, 3, 4, 2, respectively.
But in many data cases, data sequences cannot be ranked unequivocally. Let’s
look at the sequence 0.1, 0.4, 0.4, 0.2. Here we have 2 values that are exactly
the same. We say then that we have ties. If we have ties in our data like the
0.4 in this case, one very often used option is to arbitrarily choose one of the
0.4 values as smaller than the other, and then average the ranks. Thus, we rank
the data into 1, 3, 4, 2 and then average the tied observations: 1, 3.5, 3.5, 2. As
another example, suppose we have the sequence 23, 54, 54, 54, 19, we turn this
into ranks 2, 3, 4, 5, 1 and take the average of the ranks of the tied observations
of 54: 2, 4, 4, 4, 1. These ranks corrected for ties can then be used to compute
the test statistic, for instance Friedman's Fr or Wilcoxon's z. However, because
of these corrections, a slightly different formula for the test statistic often has
to be used. This is all done in R automatically.
If you want to know more, see Siegel and Castellan (1988). Non-parametric
Statistics for the Behavioral Sciences. New York: McGraw-Hill.
Chapter 13
13.1 Introduction
In previous chapters we were introduced to the linear model, with its basic form
Y = b0 + b1 X1 + · · · + bn Xn + e (13.1)
e ∼ N (0, σe2 ) (13.2)
Two basic assumptions of this model are the additivity in the parameters,
and the normally distributed residual e. Additivity in the parameters means
that the effects of intercept and the independent variables X1 , X2 , . . . Xn are
additive: the assumption is that you can sum these effects to come to a predicted
value for Y . So that is also true when we include an interaction effect to account
for a moderation,
Y = b0 + b1 X1 + b2 X2 + b3 X1 X2 + e (13.3)
e ∼ N (0, σe2 ) (13.4)
Y = b0 + b1 X1 + b2 X1 X1 + e (13.5)
e ∼ N (0, σe2 ) (13.6)
In all these models, the assumption is that the effects of the parameters (b0 ,
b1 , b2 ) can be summed.
Figure 13.1: Density function of the normal distribution, with mean 0 and
variance 4 (standard deviation 2). Inflexion points are positioned at residual
values of minus 1 standard deviation and plus 1 standard deviation.
The other major assumption of linear (mixed) models is the normal distribution
of the residuals. As we have seen in for instance Chapter 7, sometimes the
residuals are not normally distributed. Remember that with a normal distribution
N (0, σ 2 ), in principle all values between −∞ and +∞ are possible, but they tend
to concentrate around the value of 0, in the shape of the bell-curve. Figure 13.1
shows the normal distribution N (0, σ 2 = 4): it is centred around 0 and has
variance 4. Note that the inflexion point, that is the point where the decrease
in density tends to decelerate, is exactly at the values -2 and +2. These are
equal to the square root of the variance, which is the standard deviation, −σ
and +σ.
A normal distribution is suitable for continuous dependent variables. For
most measured variables this is not true. Think for example of temperature
measures: if the thermometer gives degrees Celsius with a precision of only 1
decimal, we can never have values of say 10.07 or -56.789. Our actual data will
in fact be discrete, showing rounded values like 10.1, 10.2, 10.3, but never any
values in between.
Nevertheless, the normal distribution can still be used in many such cases.
Take for instance a data set where the temperature in Amsterdam in summer
was predicted on the basis of a linear model. Fig 13.2 shows the distribution
of the residuals for that model. The temperature measures were discrete with
a precision of one tenth of a degree Celsius, but the distribution seems well
approximated by a normal curve.
But let’s look at an example where the discreteness is more prominent. In
Figure 13.3 we see the residuals of an analysis of exam results. Students had
to do an assignment that had to meet 4 criteria: 1) originality, 2) language, 3)
Figure 13.2: Even if residuals are really discrete, the normal distribution can be
a good approximation of their distribution.
structure, and 4) literature review. Each criterion was scored as either fulfilled
(1) or not fulfilled (0). The score for the assignment was determined on the
basis of the number of criteria that were met, so the scores could be 0, 1, 2, 3
or 4. In an analysis, this score was predicted on the basis of the average exam
score on previous assignments, using a linear model.
Figure 13.3 shows that the residuals are very discrete, and that the continuous
normal distribution is a very bad approximation of the histogram. We often see
this phenomenon when our data consist of counts with a limited maximum
number.
An even more extreme case we observe when our dependent variable consists
of whether or not students passed the assignment: only those assignments that
fulfilled all 4 criteria are regarded as sufficient. If we score all students with a
sufficient assignment as passed (scored as a value of 1) and all students with
an insufficient assignment as failed (scored as a value of 0) and we predict this
score by the average exam score on previous assignments using a linear model,
we get the residuals displayed in Figure 13.4.
Here it is also evident that a normal approximation of the residuals will not
do. When the dependent variable has only 2 possible values, a linear model
will never work because the residuals can never have a distribution that is even
remotely looking normal.
In this chapter and the next we will discuss how generalised linear models
can be used to analyse data sets where the assumption of normally distributed
residuals is not tenable. First we discuss the case where the dependent variable
has only 2 possible values (dichotomous dependent variables like yes/no or
pass/fail, heads/tails, 1/0). In Chapter 14, we will discuss the case where the
dependent variable consists of counts (0, 1, 2, 3, 4, . . . ).
Figure 13.3: Count data example where the normal distribution is not a good
approximation of the distribution of the residuals.
Figure 13.4: Dichotomous data example where the normal distribution is not a
good approximation of the distribution of the residuals.
Figure 13.5: Data example: Exam outcome (score) as a function of age, where
1 means pass and 0 means fail.
[Figures: score against age in months, and residuals against age in months.]
probability of heads is 0.1, we can expect that if we flip the coin 100 times, on
average we expect to see 10 times heads and 90 times tails. Our best bet then
is that the outcome is tails. However, if we actually flip the coin, we might see
heads anyway. There is some randomness to be expected. Let Y be the outcome
of a coin flip: heads or tails. If we have a Bernoulli distribution for variable Y
with probability p for heads, we expect to see heads p times, but we actually
observe heads or tails (Y ).
Y ∼ Bern(p) (13.9)
The same is true for the normal distribution in the linear model case: we
expect the observed value of Y to be equal to its predicted value (b0 + b1 X),
but the Y that we actually observe is most often different.
Y ∼ N (µ = b0 + b1 X, σe2 ) (13.10)
In our example of passing the exam by the third graders, the pass rate could
also be conceived as the outcome of a coin flip: pass instead of heads and fail
instead of tails. So could we predict the probability of passing the exam on the
basis of age, and then, for every predicted probability, allow for the fact that
the actually observed outcome can differ? Our linear model could
then look like this:
pi = b0 + b1 agei (13.11)
scorei ∼ Bern(pi ) (13.12)
values of less than 0 and more than 1, and this is not possible for probabilities.
If we use the above values of b0 = −3.8 and b1 = 0.05, we predict a probability
of -.3 for a child of 70 months and a probability of 1.2 for a child of 100 months.
Those values are meaningless, since probabilities are always between 0 and 1!
If we summarise the odds by carrying out the division, we have just one number.
For example, if the odds are 4 to 5 (4:5), the odds are 4/5 = 0.8, and if the odds
are a thousand to one (1000:1), then we can also say the odds are 1000. Odds,
unlike probabilities, can have values that are larger than 1.

Figure 13.8: The relationship between a probability and the natural logarithm
of the corresponding odds.
However, note that odds can never be negative: a very small odds is one to
a thousand (1:1000). This can be summarised as an odds of 1/1000 = 0.001, but
that is still larger than 0. In summary: probabilities range from 0 to 1, and
odds from 0 to infinity.
Because odds can never be negative, mathematicians have proposed to use
the natural logarithm 1 of the odds as the preferred transformation of probabilities.
For example, suppose we have a probability of heads of 0.42. This can be
transformed into an odds by noting that in 100 coin tosses, we would expect
42 times heads and 58 times tails. So the odds are 42:58, which is equal to
42/58 = 0.7241379. The natural logarithm of 0.7241379 equals −0.3227734 (use the
ln button on your calculator!). If we have a value between 0 and 1 and we take
the logarithm of that value, we always get a value smaller than 0. In short: a
probability is never negative, but the corresponding logarithm of the odds can
be negative.
Figure 13.8 shows the relationship between a probability (with values between
0 and 1) and the natural logarithm of the corresponding odds (the logodds).
The result is a mirrored S-shaped curve on its side. For large probabilities close
to one, the equivalent logodds becomes infinitely positive, and for very small
probabilities close to zero, the equivalent logodds becomes infinitely negative.
1 The natural logarithm of a number is its logarithm to the base of the constant e, where e ≈ 2.71828.
A logodds of 0 is equal to a probability of 0.5. If a logodds is larger than 0,
it means the probability is larger than 0.5, and if a logodds is smaller than 0
(negative), the probability is smaller than 0.5.
Returning to our example of the children passing the exam, suppose we have
the following linear equation for the relationship between age and the logarithm
of the odds of passing the exam:

logodds = −33.15 + 0.42 × age

This equation predicts that a child aged 70 months has a logodds of −33.15 +
0.42 × 70 = −3.75. In order to transform that logodds back to a probability, we
first have to take the exponential of the logodds² to get the odds: exp(−3.75) = 0.02.
An odds of 0.02 means that the odds of passing the exam is 0.02 to 1 (0.02:1).
So out of 1 + 0.02 = 1.02 times, we expect 0.02 successes and 1 failure. The
probability of success is therefore 0.02/(1 + 0.02) = 0.02. Thus, based on this
equation, the expected probability of passing the exam for a child of 70 months
equals 0.02.
If you find that easier, you can also memorise the following formula for the
relationship between a logodds of x and the corresponding probability:
p_x = \frac{\exp(x)}{1 + \exp(x)} \qquad (13.13)

Thus, if you have a logodds x of −3.75, the odds equals exp(−3.75) = 0.02,
and the corresponding probability is 0.02/(1 + 0.02) = 0.02.
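In R, the logistic transformation of Equation 13.13 is available as plogis(), and the reverse transformation (from probability to logodds) as qlogis(). For example:

plogis(-3.75)  # a logodds of -3.75 corresponds to a probability of about 0.02
qlogis(0.5)    # a probability of 0.5 corresponds to a logodds of 0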
2 If we know ln(x) = 60, we have to infer that x equals e^60, because ln(e^60) = 60 by
definition of the natural logarithm, see previous footnote. Therefore, if we know that ln(x) = c,
we know that x equals e^c. The exponential e^c is often written as exp(c). So if we know
that the logarithm of the odds equals c, logodds = ln(odds) = c, then the odds is equal
to exp(c).
13.2.3 Logistic link function
In previous pages we have seen that logodds have the nice property of having
meaningful values between −∞ and +∞. This makes them suitable for linear
models. In essence, our linear model for our exam data in children might then
look like this:

logodds(p_pass) = b_0 + b_1 age
Y ∼ Bern(p_pass)

Note that we can write the odds as p/(1 − p), where p is a probability (or a
proportion). So the logodds that corresponds to the probability of passing the
exam, p_pass, can be written as ln(p_pass/(1 − p_pass)), so that we have

ln(p_pass/(1 − p_pass)) = b_0 + b_1 age   (13.16)
Y ∼ Bern(p_pass)   (13.17)
Note that we do not have a residual any more: the randomness around
the predicted values is no longer modelled using a residual e that is normally
distributed, but is now modelled by a Y -variable with a Bernoulli distribution.
Also note the strange relationship between the probability parameter ppass for
the Bernoulli distribution, and the dependent variable for the linear equation
b0 +b1 age. The linear model predicts the logodds, but for the Bernoulli distribution,
we use the probability. But it turns out that this model is very flexible and useful
in many real-life problems. This model is often called a logit model: one often
writes that the logit of the probability is predicted by a linear model.
Figure 13.9: Example of a linear model for the logit of probabilities of passing
an exam.
[Figure 13.10: probability of passing the exam against age in months.]
[Figure 13.11: probability of passing the exam against age in months.]
The curve gives predicted probabilities close to 0 for very young ages, and probabilities close to 1 for very old ages. There is a clear positive effect
of age on the probability of passing the exam. But note that the relationship is
not linear on the scale of the probabilities: it is linear on the scale of the logit of
the probabilities, see Figure 13.9, but non-linear on the scale of the probabilities
themselves, see Figure 13.10.
The curvilinear shape we see in Figure 13.10 is called a logistic curve. It is
based on the logistic function: here p is a logistic function of age (and note the
similarity with Equation 13.13):
p = \text{logistic}(b_0 + b_1 \, age) = \frac{\exp(b_0 + b_1 \, age)}{1 + \exp(b_0 + b_1 \, age)}

In summary, if we go from logodds to probabilities, we use the logistic
function, logistic(x) = exp(x)/(1 + exp(x)). If we go from probabilities to logodds, we
use the logit function, logit(p) = ln(p/(1 − p)). The logistic regression model is a
generalised linear model with a logit link function, because the linear equation
b0 +b1 X predicts the logit of a probability. It is also often said that we’re dealing
with a logistic link function, because the linear equation gives a value that we
have to subject to the logistic function to get the probability. Both terms, logit
link function and logistic link function are used.
If we go back to our data on the third-grade children that either passed or
failed the exam, we see that this curve gives a description of our data, see Figure
13.11. The model predicts that around the age of 78 months, the probability of
passing the exam is around 0.50. We indeed see in Figure 13.11 that around this
age some children pass the exam (score = 1) and some don’t (score = 0). On
the basis of this analysis there seems to be a positive relationship between age
in third-grade children and the probability of passing the exam in this sample.
What we have done here is a logistic regression of passing the exam on age. It
339
is called logistic because the curve in Figure 13.11 has a logistic shape. Logistic
regression is one specific form of a generalised linear model. Here we have
applied a generalised linear model with a so-called logit link function: instead
of modelling dependent variable Y directly, we have modelled the logit of the
probabilities of obtaining a Y -value of 1. There are many other link functions
possible. One of them we will see in the chapter on generalised linear models
for count data (Chapt. 14). But first, let’s see how logistic regression can be
performed in R, and how we should interpret the output.
for one coin flip, which is equivalent to a Bernoulli distribution. Actually, the
code can be a little bit shorter, because the logit link function is the default
option with the binomial distribution:
model.train <- data.train %>%
  glm(train ~ income,
      data = .,
      family = binomial)
Below, we see the parameter estimates from this generalised linear model
run on the train data.
model.train %>%
tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 90.0 32.5 2.77 0.00564
## 2 income -0.00817 0.00297 -2.75 0.00603
The parameter estimates table from a glm() analysis looks very much like
that of the ordinary linear model and the linear mixed model. An important
difference is that the statistics shown are no longer t-statistics, but Z-statistics.
This is because with logistic models, the ratio b1 /SE does not have a t-distribution.
In ordinary linear models, the ratio b1 /SE has a t-distribution because in linear
models, the variance of the residuals, σe2 , has to be estimated (as it is unknown).
If the residual variance were known, b1 /SE would have a standard normal
distribution. In logistic models, there is no σe2 that needs to be estimated (it is
by default 1), so the ratio b1 /SE has a standard normal distribution. One could
therefore calculate a Z-statistic Z = b1 /SE and see whether that value is smaller
than 1.96 or larger than 1.96, if you want to test with a Type I error rate of 0.05.
The interpretation of the slope parameters is very similar to other linear models.
Note that we have the following equation for the logistic model:
logit(ptrain ) = b0 + b1 income
train ∼ Bern(ptrain ) (13.20)
Imagine a traveller with a yearly income of 11,000. Then the predicted logodds
equals 90.0 − 0.00817 × 11000 = 0.13. When we transform this back to a
probability, we get exp(0.13)/(1 + exp(0.13)) = 0.53. So this model predicts that for people
with a yearly income of 11,000, about 53% of them take the train (if they travel
at all, that is!).
Now imagine a traveller with a yearly income of 100,000. Then the predicted
logodds equals 90.0 − 0.00817 × 100000 = −727. When we transform this back
to a probability, we get exp(−727)/(1 + exp(−727)) = 0.00. So this model predicts that for
people with a yearly income of 100,000, close to none of them take the train.
Going from 11,000 to 100,000 is a big difference. But the change in probabilities
is also huge: the probability goes down from 0.53 to 0.00.
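These predicted probabilities can also be obtained directly from the fitted model object (a sketch; new_travellers is a made-up name):

new_travellers <- tibble(income = c(11000, 100000))
predict(model.train, newdata = new_travellers, type = "response")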
We found a difference in probability of taking the train for people with
different incomes in this sample of 1000 travellers, but is there also an effect
of income in the entire population of travellers between Amsterdam and Paris?
The regression table shows us that the effect of income, −0.00817, is statistically
significant at an α of 5%, Z = −2.75, p < 0.01. We can therefore reject the null-
hypothesis that income is not related to whether people take the train or not.
We conclude that in the population of travellers to Paris, a higher income is
associated with a lower probability of travelling by train.
Note that similar to other linear models, the intercept can be interpreted as
the predicted logodds for people that have values 0 for all other variables in the
model. Therefore, 90.0 means in this case that the predicted logodds for people
with zero income equals 90.0. This is equivalent to a probability of very close
to 1.
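This can again be verified with plogis():

plogis(90)  # a logodds of 90 corresponds to a probability indistinguishable from 1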
Chapter 14
The normal distribution has two parameters, the mean and the variance. The
Bernoulli distribution has only 1 parameter (the probability), and the Poisson
distribution also has only one parameter, usually denoted λ (lambda), which is
the expected count.
Figure 14.1: Count data example where the normal distribution is not a good
approximation of the distribution of the residuals.
λ = b0 + b1 X (14.1)
y ∼ Poisson(λ) (14.2)
therefore we used the logarithm. Here we want to have positive values for our
dependent variable, so we can use the inverse of the logarithm function: the
exponential. Then we have the following model:
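The model equations themselves are not reproduced here; based on the description above, the model has the form (a sketch, equation numbers omitted):

ln(λ) = b_0 + b_1 X, so that λ = exp(b_0 + b_1 X)
y ∼ Poisson(λ)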
[Figures: Poisson probability distributions for counts 0 to 10.]
Figure 14.5: Three different Poisson distributions with lambdas 0.85, 1.17, and
1.60, for three different kinds of students.
14.2 Poisson regression in SPSS
Poisson regression is a form of a generalized linear model analysis, similar to
logistic regression. However, instead of using a Bernoulli distribution we use a
Poisson distribution. For a numeric predictor like the variable previous, the
syntax is as follows.
GENLIN scores WITH previous
/MODEL previous
DISTRIBUTION=POISSON LINK=LOG
/PRINT CPS DESCRIPTIVES SOLUTION.
The output with parameter values is shown in Figure 14.6.
[Figure 14.6: Parameter Estimates table (SPSS output).]
We see the same values for the intercept and the effect of previous as in the
previous section. We now also see 95% confidence intervals for these parameter
values. For both, the value 0 is included in the confidence intervals, therefore
we know that we cannot reject the null-hypotheses that these values are 0 in the
population of students. This is also reflected by the Wald statistics. Remember
that the Wald chi-square (X 2 ) statistic is computed by B 2 /SE 2 . For large
enough samples, these X 2 statistics follow a χ2 distribution with 1 degree of
freedom. From that distribution we know that a value of 0.372 is not significant
at the 5% level. It has an associated p-value of 0.542.
We can write:
Scores for the assignment (1-4) for 100 students were analysed
using a generalized linear model with a Poisson distribution (Poisson
regression). The scores were not significantly predicted by the average
score of previous assignments, B = −0.06, X 2 (1) = 0.37, p = 0.54.
Therefore we cannot reject the null-hypothesis that there is no relationship
between the average of previous assignments and the score on the
present assignment in the population of students.
Suppose we also have a categorical predictor, for example the degree that
the students are working for. Some do the assignment for a bachelor’s degree
(degree=1), some for a master’s degree (degree=2), and some for a PhD degree
(degree=3). The syntax would then look like:
GENLIN scores BY degree
/MODEL degree
DISTRIBUTION=POISSON LINK=LOG
/PRINT CPS DESCRIPTIVES SOLUTION.
Note that only the independent variable has changed and WITH is changed
into BY. The output is given in Figure 14.7.
[Figure 14.7: Parameter Estimates table (SPSS output).]
We see that the parameter for the degree=3 category is fixed to 0, meaning
that it is used as the reference category. If we make a prediction for this group
of students that is studying for a PhD degree, we have λ = exp(.354 + 0) =
exp(0.354) = 1.4. For the students studying for a Master’s degree we have
λ = exp(.354 − 0.089) = 1.3 and for students studying for their Bachelor’s
degree we have λ = exp(.354 − 0.584) = 0.8. These λ-values correspond to
the expected number in a Poisson distribution, so for Bachelor students we
expect a score of 0.8, for Master students we expect a score of 1.3 and for PhD
students a score of 1.4. Are these different scores also present in the population?
We see that the effect for degree=1 is significant, X 2 (1) = 5.85, p = 0.02, so
there is a difference in score between students studying for a Bachelor’s degree
and students studying for a PhD. The effect for degree=2 is not significant,
X 2 (1) = 0.18, p = 0.67, so there is no difference in assignment scores between
Master students and PhD students.
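The three expected scores above are simply exponentials of sums of estimates, which is easy to verify with a quick calculation (here in R):

exp(c(0.354, 0.354 - 0.089, 0.354 - 0.584))  # PhD, Master, Bachelor: 1.42, 1.30, 0.79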
Remember that for the linear model, when we wanted to compare more than
two groups at the same time, we used an F -test to test for an overall difference
in group means. Also for the generalized linear model, we might be interested
in whether there is an overall difference in scores between Bachelor, Master and
PhD students. For that we need to tweak the syntax a little bit, by stating that
we also want to see an overall test printed. The PRINT statement then also
needs the word SUMMARY. In other words, the syntax becomes
GENLIN scores BY degree
/MODEL degree
DISTRIBUTION=POISSON LINK=LOG
/PRINT CPS DESCRIPTIVES SOLUTION SUMMARY.
We then get the relevant output in Figure 14.8. There we see a Wald Chi-
Square statistic for the effect of degree. It has 2 degrees of freedom, since the
effect for the 3 categories is coded by 2 dummy variables. So this test tells us
that the null-hypothesis that the expected scores in each group of students are
the same can be rejected, X 2 (2) = 6.27, p = 0.04.
Table 14.2: Counts of adult survivors on the Titanic.
count
Male 338
Female 316
Let’s analyse this data set with SPSS. In SPSS we assign the value sex=1
to Females and sex=2 to Males. Our dependent variable is count, and the
independent variable is sex.
Parameter Estimates
Parameter        B       Std. Error   95% CI Lower   95% CI Upper   Wald Chi-Square   df   Sig.
(Intercept) 5.823 .0544 5.716 5.930 11460.858 1 .000
[sex=1.00] -.067 .0783 -.221 .086 .740 1 .390
[sex=2.00] 0a . . . . . .
(Scale) 1b
Dependent Variable: count
Model: (Intercept), sex
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
Figure 14.9: SPSS output of a generalized linear model for predicting numbers
of men and women on board the Titanic.
From the output in Figure 14.9 we see that the expected count for females
is exp(5.823 − 0.067) = 316.2 and the expected count for males is exp(5.823) =
338.0. These expected counts are essentially equal to the observed counts of
males and females; they differ only slightly because of rounding (SPSS shows
only the first three decimals).
statistic, we see that the difference in counts between males and females is not
significant, X 2 (1) = 0.74, p = 0.391 .
The difference in these counts is very small. But does this tell us that women
were as likely to survive as men? Note that we have only looked at those who
1 Note that a hypothesis test is a bit odd here: there is no clear population that we want
to generalize the results to: there was only one Titanic disaster. Also, here we have data on
the entire population of those people on board the Titanic, there is no random sample here.
survived. How about the people that perished: were there more men that died
than women? Table 14.3 shows the counts of male survivors, female survivors,
male non-survivors and female non-survivors. Then we see a different story:
on the whole there were many more men than women, and a relatively small
proportion of the men survived. Of the men, most of them perished: 1329
perished and only 338 survived, a survival rate of 20.3%. Of the women, most
of them survived: 109 perished and 316 survived, yielding a survival rate of
74%. Does this tell us that women are much more likely than men to survive
collisions with icebergs?
Let’s first run a multivariate Poisson regression analysis including the effects
of both sex and survival. The syntax is
Parameter Estimates
95% Wald Confidence Interval Hypothesis Test
Parameter B Std. Error Lower Upper Wald Chi-Square df Sig.
(Intercept) 7.044 .0286 6.988 7.100 60709.658 1 .000
[sex=1.00] -1.367 .0543 -1.473 -1.260 632.563 1 .000
[sex=2.00] 0a . . . . . .
survived -.788 .0472 -.880 -.695 279.073 1 .000
(Scale) 1b
Dependent Variable: count
Model: (Intercept), sex, survived
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
Figure 14.10: SPSS output of a generalized linear model for predicting numbers
of men and women that perished and survived on board the Titanic.
Figure 14.11: Observed counts and the counts predicted from the model with main effects of sex and survived, for survivors and non-survivors on board the Titanic.
The output is given in Figure 14.10. From the parameter values, we can calculate the predicted numbers of males (sex = 2) and females (sex = 1) that survived and perished. For female survivors we have exp(7.04 − 1.37 − 0.79) = 131.63, for female non-survivors we have exp(7.04 − 1.37) = 290.03, for male survivors we have exp(7.04 − 0.79) = 518.01, and for male non-survivors we have exp(7.04) = 1141.39.
These predicted numbers are displayed in Figure 14.11. It also shows the
observed counts. The observed pattern is clearly different from the pattern predicted by the generalized linear model: the model predicts fewer survivors than non-survivors irrespective of sex, whereas in the observed data there are more female survivors than female non-survivors. It seems
that sex is a moderator of the effect of survival on counts.
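The predicted counts just computed can also be obtained with a short R sketch; small differences from the hand calculations above come from rounding the reported coefficients to two decimals. The four counts are taken from the text (an illustration, not the book's SPSS run).

titanic <- data.frame(
  sex      = factor(c("Female", "Female", "Male", "Male")),
  survived = c(1, 0, 1, 0),
  count    = c(316, 109, 338, 1329)
)
fit_main <- glm(count ~ sex + survived, family = poisson, data = titanic)
round(fitted(fit_main), 1)
# 132.9 292.1 521.1 1145.9: the counts expected when the effects of sex and
# survival combine without an interaction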
In order to test this moderation effect, we run a new generalized linear
model for counts, including an interaction effect of sex by survived. This is done in the SPSS syntax by adding the sex by survived interaction to the MODEL subcommand:
GENLIN count BY sex WITH survived
/MODEL sex survived sex*survived
DISTRIBUTION=POISSON LINK=LOG
/PRINT CPS DESCRIPTIVES SOLUTION.
The output is displayed in Figure 14.12. When we plot the predicted counts from this new model with an interaction effect, we see that they are exactly equal to the counts that are actually observed in the data, see Figure 14.13. From the output we see that the interaction effect is significant, X²(1) = 368.98, p < 0.001. If we regard this data set as a random sample of all ships that
Parameter Estimates
95% Wald Confidence Interval Hypothesis Test
Parameter B Std. Error Lower Upper Wald Chi-Square df Sig.
(Intercept) 7.192 .0274 7.138 7.246 68745.825 1 .000
[sex=1.00] -2.501 .0996 -2.696 -2.306 630.032 1 .000
[sex=2.00] 0a . . . . . .
survived -1.369 .0609 -1.489 -1.250 505.126 1 .000
[sex=1.00] * survived 2.434 .1267 2.185 2.682 368.979 1 .000
[sex=2.00] * survived 0a . . . . . .
(Scale) 1b
Dependent Variable: count
Model: (Intercept), sex, survived, sex * survived
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
Figure 14.12: SPSS output of a generalized linear model for predicting numbers
of men and women that perished and survived on board the Titanic.
sink after collision with icebergs, we may conclude that in such situations, sex
is a significant moderator of the difference in the numbers of survivors and non-
survivors. One could also say: the proportion of people that survive a disaster
like this is different in females than it is in males.
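Continuing the R sketch from above, the interaction model is obtained by replacing the + with a *. The likelihood-ratio chi-square from anova() differs numerically from the Wald chi-square in Figure 14.12, but it tests the same interaction effect and leads to the same conclusion.

fit_int <- glm(count ~ sex * survived, family = poisson, data = titanic)
fitted(fit_int)                           # exactly the observed counts: 316, 109, 338, 1329
anova(fit_main, fit_int, test = "Chisq")  # likelihood-ratio test of the interaction, 1 df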
In the previous section these counts were analysed using a generalized linear model with a Poisson distribution and a log link function. We wanted to know whether there was a significant difference in the proportion of survivors for men and women. In this section we discuss an alternative method of analysing count data: an alternative chi-square (X²) statistic for the moderation effect of one variable on the effect of another variable.
Figure 14.13: Observed counts and the counts predicted from the model with a sex by survived interaction effect; the two sets of counts coincide.
First let’s have a look at the overall survival rate. In total there were 654
people that survived and 1438 people that did not survive. Table 14.5 shows
these column totals.
Suppose we only know that of the 2092 people, 1667 were men, and of all
people, 654 survived. Then suppose we pick a random person from these 2092
people. What is the probability that we get a male person that survived, given
that sex and survival have nothing to do with each other?
Well, from probability theory we know that if two events A and B are
independent, the probability of observing A and B at the same time, is equal
to the product of the probability of event A and the probability of event B.
If sex and survival are independent of each other, then the probability of observing a male survivor is equal to the probability of seeing a male times the probability of seeing a survivor. The probability of survival is 0.31, as we saw earlier, and the probability of seeing a male is equal to the proportion of males in the data, which is 1667/2092 = 0.80. Therefore, the probability of seeing a male survivor is 0.80 × 0.31 = 0.248. The expected number of male survivors is then that probability times the total number of people, 0.248 × 2092 ≈ 519.
Similarly we can calculate the expected number of non-surviving males, the
number of surviving females, and the number of non-surviving females.
These numbers, after rounding, are displayed in Table 14.7.
The expected numbers in Table 14.7 are quite different from the observed
numbers in Table 14.4. Are the differences large enough to think that the two
events of being male and being a survivor are NOT independent? If the expected
numbers on the assumption of independence are different enough from the
observed numbers, then we can reject the null-hypothesis that being male and
being a survivor have nothing to do with each other. To measure the difference
between expected and observed counts, we need a test statistic. Here we use
Pearson's chi-square statistic. It involves calculating the differences between the observed and expected numbers in the respective cells and standardizing them by the expected numbers.
Here’s how it goes:
For each cell, we take the expected count and subtract it from the observed count. For instance, for the male survivors, we expected 519 but observed 338. The difference is therefore 338 − 519 = −181. Then we take the square of this difference, 181² = 32761. Then we divide this number by the expected number, which gives 32761/519 = 63.12. We do exactly the same thing for the male non-survivors, the female survivors and the female non-survivors. Then we add these four numbers, and the result is the Pearson chi-square statistic. In
formula form:
X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \qquad (14.8)

where O_i is the observed count and E_i the expected count in cell i.
So for male survivors we get

\frac{(338 - 519)^2}{519} = 63.12 \qquad (14.9)

For male non-survivors we get

\frac{(1329 - 1155)^2}{1155} = 26.21 \qquad (14.10)

For female survivors we get

\frac{(316 - 130)^2}{130} = 266.12 \qquad (14.11)

and for female non-survivors we get

\frac{(109 - 289)^2}{289} = 112.11 \qquad (14.12)
If we add these four numbers we have the chi-square statistic: X² = 467.57. Note that here we used the rounded expected numbers; it would be better to use the non-rounded numbers. Had we used the non-rounded expected numbers, we would have gotten X² = 460.87.
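In R the whole calculation is available as chisq.test(); the argument correct = FALSE switches off the continuity correction for 2 × 2 tables, so that the result matches the hand calculation with the non-rounded expected counts. The table below is built from the counts reported in the text.

tab <- matrix(c(338, 1329, 316, 109), nrow = 2,
              dimnames = list(survived = c("yes", "no"), sex = c("Male", "Female")))
test <- chisq.test(tab, correct = FALSE)
test$expected   # the non-rounded expected counts
test            # X-squared = 460.87, df = 1, p < .001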
The Wald chi-square statistic for the sex*survived interaction effect was 368.98, see Figure 14.12. It tests exactly the same null-hypothesis as the Pearson chi-square: that of independence, or in other words, that the counts can be explained by only the two main effects of sex and survival.
If the data set is large enough and the counts are not too close to 0, the same conclusions will be drawn, whether from a Wald chi-square for an interaction effect in a generalized linear model or from a cross-tabulation and a Pearson chi-square. The advantage of the generalized linear model approach is that you can do much more with it, for instance include more than two predictors, and that it makes more explicit that the main effects of the variables are taken into account when computing the statistic. That also happens for the Pearson chi-square, but less obviously: we did it by first calculating the overall probability of survival and then the overall proportion of males.
There is yet a third way to analyse the sex and survived variables. Remember
that in the previous chapter we discussed logistic regression. In logistic regression,
a dichotomous variable (a variable with only two values, say 0 and 1) is the
dependent variable, with one or more quantitative or qualitative independent
variables. Both sex and survived are dichotomous variables: male and female,
and survived yes or survived no. In principle, therefore, we could do a logistic regression: for example, predicting whether a person is male or female on the basis of whether they survived or not, or the other way around, predicting whether people survived or not on the basis of whether a person is a woman or a man.
Which variable is used as the dependent variable depends on your research question. If your question is whether females are more likely to survive
than men, perhaps because of their body fat composition, or perhaps because of
male chivalry, then the most logical choice is to take survival as the dependent
variable and sex as the independent variable.
We could then run a logistic regression in SPSS with survival as the dependent variable. Note, however, that the data are in the wrong format for this. For the Poisson regression, the data were in the aggregated form that we see in Table 14.4. For a logistic regression, however, we need the data in the format of Table 14.8: for every person on board the ship, we need to know their sex and their survival status.
Table 14.8: Individual data of adult survivors and non-survivors on the Titanic.
ID sex survived
1004 Male 0
623 Male 0
934 Male 0
400 Male 0
1626 Male 1
1103 Male 0
270 Male 0
2052 Female 1
1685 Male 1
1236 Male 0
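A sketch of how such a person-level data set could be constructed in R from the aggregated counts, using uncount() from the tidyverse. The factor level ordering is chosen so that the men form the reference category, as in the SPSS output; the object names are illustrative.

library(tidyverse)
counts <- tibble(
  sex      = factor(c("Female", "Female", "Male", "Male"), levels = c("Male", "Female")),
  survived = c(1, 0, 1, 0),
  count    = c(316, 109, 338, 1329)
)
persons <- uncount(counts, count)   # one row per person: 2092 rows, as in Table 14.8
fit_logit <- glm(survived ~ sex, family = binomial, data = persons)
coef(fit_logit)  # intercept about -1.37 (men) and sexFemale about 2.43, as in Figure 14.14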
From the output in Figure 14.14 we see that the log odds of survival for men (sex = 2) are equal to the intercept, −1.37, and for women (sex = 1) the log odds of survival are −1.37 + 2.43 = 1.06. These log odds correspond to survival probabilities of 0.20 and 0.74, respectively. Thus, women are much more likely to survive than men.
Parameter Estimates
95% Wald Confidence Interval Hypothesis Test
Parameter B Std. Error Lower Upper Wald Chi-Square df Sig.
(Intercept) -1.369 .0609 -1.489 -1.250 505.126 1 .000
[sex=1.00] 2.434 .1267 2.185 2.682 368.979 1 .000
[sex=2.00] 0a . . . . . .
(Scale) 1b
Dependent Variable: survived
Model: (Intercept), sex
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
Figure 14.14: SPSS output of a logistic regression model predicting survival from sex for the people on board the Titanic.
However, suppose you are the relative of a passenger on board a ship that
shipwrecks. After two days, there is news that a person was found. The only
thing known about that person is that he or she is alive. Your relative is your niece, so given only that the person that was found is alive, you would like to know the probability that this person is a woman, because then it could be your beloved niece! You could therefore run a logistic regression on the Titanic data to see to what extent the survival of a person predicts the sex of that person. In the SPSS syntax for this model, we use WITH in order to treat the dummy variable survived as quantitative, and we use (REFERENCE=LAST) to indicate that the last (second) category of sex (2), the men, is the reference category, because we want to predict whether a person is a woman.
The output is given in Figure 14.15. From this output we conclude that survival is a significant predictor of sex, B = 2.434, X²(1) = 368.98, p < 0.001. From the observed counts, the log odds that a non-surviving person is a woman are ln(109/1329) = −2.50, and the log odds that a surviving person is a woman are ln(316/338) = −0.07; note that the difference between these two values, 2.43, is exactly the coefficient for survived, the log odds ratio. These log odds correspond to probabilities of 0.08 and 0.48, respectively. Thus, if you know only that a person survived the Titanic, the probability that this person is a woman is 0.48, slightly less than one half. If you find this counter-intuitive, remember that even though a
Parameter Estimates
95% Wald Confidence Interval Hypothesis Test
Parameter B Std. Error Lower Upper Wald Chi-Square df Sig.
(Intercept) -4.934 .2141 -5.354 -4.515 531.265 1 .000
survived 2.434 .1267 2.185 2.682 368.979 1 .000
(Scale) 1a
Dependent Variable: sex
Model: (Intercept), survived
a. Fixed at the displayed value.
Figure 14.15: SPSS output of a logistic regression model predicting the sex of a person on board the Titanic from whether that person survived.
large proportion of the women survived the Titanic, there were many more men
on-board than women.
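The reverse analysis can be sketched in R with the persons data frame constructed above; the predicted probabilities are simply the observed proportions of women among non-survivors and among survivors.

fit_sex <- glm(sex ~ survived, family = binomial, data = persons)
predict(fit_sex, newdata = data.frame(survived = c(0, 1)), type = "response")
# about 0.08 and 0.48: the probability that a person is a woman, for
# non-survivors and survivors respectively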
In summary, if you have count data and one of the variables is dichotomous, you can choose between a Poisson regression model and a logistic regression model. The choice depends on the research question: if your question involves the prediction of a dichotomous variable, or if you have a theory in which one or more independent variables explain a dichotomous outcome variable, logistic regression is the logical choice. If your theory does not involve a natural direction of prediction and you are simply interested in the associations among the variables, then Poisson regression is the obvious choice.
Chapter 15
15.1 Introduction
Previous chapters looked into traditional or classic data analysis: inference
about a population using a limited sample of data with a limited number of
variables. For instance, we might be interested in how large the effect is of a
new kind of therapy for clinical depression in the population of all patients,
based on a sample of 150 treated patients and 150 non-treated patients on a
waiting list.
In many contexts, we are not interested in the effect of some intervention
in a population, but in the prediction of events in the future. For example, we
would like to predict which patients are likely to relapse into depression after
an initially successful therapy. For such a prediction we might have a lot of
variables. In fact, more variables than we could use in a straightforward linear
model analysis.
In this age of big data, there are more often too many than too few
data about people. However, this wealth of data is often not nicely stored in
data matrices. Data on patients for example are stored in different types of files,
in different file formats, as text files, scans, X-rays, lab reports, perhaps even
videotaped interviews. They may be stored at different hospitals or medical
centres, so they need to be linked and combined without mixing them up. In
short: data can be really messy. Moreover, data are not variables yet. Data
science is about making data available for analysis. This field of research aims
to extract knowledge and insight from structured and unstructured data. To do
that, it draws from statistics, mathematics, computer science and information
science.
The patient data example is a typical case of a set of unstructured data.
From a large collection of pieces of texts (e.g., notes from psychiatric interviews
and counselling, notes on prescriptions and adverse effects of medication, lab
reports) one has to distil a set of variables that could predict whether or not an
individual patient would fall back into a second depressive period.
There are a couple of reasons why big data analytics is different from the data
analysis framework discussed in previous chapters. These relate to 1) different
types of questions, 2) the p > n problem and 3) the problem of over-fitting.
First, the type of questions are different. In classic data analysis, you have a
model with one or more model parameters, for example a regression coefficient,
and the question is what the value is of that parameter in the population.
Based on sample data, you draw inferences regarding the parameter value in
the population. In contrast, typical questions in big data situations are about
predictions for future data (e.g., how will the markets respond to the start of
the hurricane season), or how to classify certain events (e.g., is a Facebook
posting referring to a real event or is it ”fake news”). In big data situations,
such predictions or classifications are based on training data. In classic data
analysis, inference is based on sample data.
Second, the type of data in big data settings allows for a far larger number
of variables than in non-big data settings. In the patient data example, imagine
the endless ways in which we could think of predicting relapse on the basis
of the text data alone. We could take as predictor variables the number of
counselling sessions, whether or not a tricyclic antidepressant was prescribed,
whether or not a non-tricyclic antidepressant was prescribed, whether or not the
word ”mother” was mentioned in the sessions, the number of times the word
”mother” was used in the sessions, how often the word ”mother” was associated
with the word ”angry” or ”anger” in the same sentence, and so on. The types of
variables you could distil from such data are endless, so what to pick? And where to stop? So another way in which big data analytics differs from classic data analysis is that a variable selection method has to be used. The analyst has to
make a choice of what features of the raw data will be used in the analysis. Or,
during the analysis itself, an algorithm can be used that picks those features that
predict the outcome variable most efficiently. Usually there is a combination
of both methods: there is an informed choice of what features in the data are
likely to be most informative (e.g., the data analyst a priori believes that the
specific words used in the interviews will be more informative about relapse
than information contained in X-rays), and an algorithm that selects the most
informative features out of this selection (e.g., the words ”mother” and ”angry”).
One reason that variable selection is necessary is that statistical methods, like for example linear models, do not work when the number of variables is large
relative to the number of cases. This is known as the p > n problem, where p
refers to the number of variables and n to the number of cases. We will come
back to this problem below.
Third, because there is so much information available in big data situations,
there is the likely danger of over-fitting your model. Maybe you have enough
cases to include 1,000 predictor variables in your linear models, and they will run
and give meaningful output, but then the model will be too much focused on the
data that you have now, so that it will be very bad at predicting or classifying
new events correctly. Therefore, another reason for limiting the number of variables in a model is the danger of over-fitting. To illustrate over-fitting, consider a data set of 50 data points showing a linear relationship between two variables X and Y, displayed in Figure 15.1. We randomly split these data points into two halves, a training data set and a test data set, and we fit a very flexible model, a local polynomial regression model, to the training data.
Figure 15.1: Illustration of overfitting: the full data set of 50 data points, showing a linear relationship between variables X and Y.
Figure 15.2: Illustration of overfitting: half of the data points (the training data) and a local polynomial regression model fitted to them.
The training data and the model predictions, depicted as a blue line, are shown in Figure 15.2. It turns out that the correlation between the
observed Y -values and the predicted values based on the model is pretty high:
0.78.
Next, we apply this model to the test data. These are depicted in Figure
15.3. The blue line gives the predicted Y-values for the X-values in this data set,
based on the model. We see that the blue line is not a good description of the
pattern in the test data. This is also reflected in the much lower correlation
between the observed Y -values and the predicted values in the test data: 0.71.
Thus we see that the model is a good model for the training data, upon which it was based, but a terrible model for new data, even though both data sets have the same origin: the training data were just a random selection from the same data set. The model was simply focused too much on the details in the training data.
Had we used a much simpler model, a linear regression model for example, the
relative performance on the test data would be much better. The least squares
equation predicts the Y -values in the training data with a correlation of 0.48.
That is much worse than the local polynomial model, but we see that the simple model performs much
better in the test data: there we see a correlation of 0.39. In sum: a complex
model will always give better predictions than a simple model for training data.
However, what is important is that a model will also show good predictions in
test data. Then we see that often, a relatively simple model will perform better
than a very complex model. This is due to the problem of over-fitting. The
trade-off between the model complexity and over-fitting is also known as the
bias-variance trade-off, where bias refers to the error that we make when we
select the wrong model for the data, and variance refers to error that we make
because we are limited to seeing only the training data.
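The overfitting illustration can be mimicked in R with simulated data (not the data set used in the figures): a flexible local polynomial fit and a simple linear fit are both estimated on half of the data and then evaluated on the other half. Note that loess() does not extrapolate, so test points outside the training range are dropped from the first correlation.

set.seed(1)
n <- 50
d <- data.frame(x = runif(n, 80, 120))
d$y <- 0.5 * d$x + rnorm(n, sd = 10)
train    <- sample(n, n / 2)
flexible <- loess(y ~ x, data = d[train, ], span = 0.5)  # wiggly local polynomial model
simple   <- lm(y ~ x, data = d[train, ])                 # ordinary least squares line
test     <- d[-train, ]
cor(test$y, predict(flexible, newdata = test), use = "complete.obs")
cor(test$y, predict(simple, newdata = test))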
Figure 15.3: Illustration of overfitting: the other half of the data points (the test data) and the local polynomial regression model, which does not describe these data well because it was based on the training data.
In order to find the optimal level of complexity for our training data, one often uses cross-validation.
15.1.2 Cross-validation
Cross-validation is a form of a re-sampling method. In re-sampling methods,
different subsets of the training data are used to fit the same model or different
models, or different versions of a model. There are different forms of cross-
validation, but here we discuss k-fold cross-validation. In k-fold cross-validation,
the training data are split randomly into k groups (folds) of approximately equal
size. The model is then fit k times, each time leaving out the data from one of
the k groups. Each time, predictions are made for the data in the group that
is left out of the analysis. And each time we assess how good these predictions
are, for example by determining the residuals and computing the mean squared
error (MSE). With k groups, we then have k MSEs, and we can compute the
mean MSE. If we do this cross-validation for several models, we can see which
model has the lowest mean MSE. That is the model that on average shows the
best prediction. This should not lead to over-fitting, because by the random
sampling into k sub-samples, we are no longer dependent on one particular
subset of the data. Usually, a value of 5 or 10 is used for k.
Table 15.1: Small data set illustrating the p larger than n problem.
X Y
0.00 0.64
1.00 2.08
2.00 3.33
Figure 15.4: p < n: there is a unique solution that fits the least squares criterion. No problem whatsoever.
Therefore, you can say something about the intercept and slope in the sample,
but you cannot say anything about the intercept and slope in the population.
The situation will be even worse when you have only 1 data point. Suppose
we only have the second data point, which we plot in Figure 15.6. If we then
try to fit a regression line, we will see that the software will refuse to estimate
a slope parameter. It will be fixed to 0, so that an intercept only model will
be fitted (see Section 5.15). It simply is impossible for the software to decide
what regression line to pick that goes through this one data point: there is an
infinite number of regression lines that go through this data point! Therefore,
for a two-parameter model like a regression model (the two parameters being
the intercept and slope), you need at least two data points for the model to run,
and at least three data points to get standard errors and do inference.
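This can be seen directly in R with the three data points from Table 15.1: with all three points we get estimates and standard errors, with two points the line fits perfectly but no standard errors can be computed, and with one point the slope is not estimable (R reports it as NA rather than fixing it to 0).

d <- data.frame(x = c(0, 1, 2), y = c(0.64, 2.08, 3.33))
summary(lm(y ~ x, data = d))         # n = 3: estimates plus standard errors
summary(lm(y ~ x, data = d[1:2, ]))  # n = 2: perfect fit, zero residual degrees of freedom
coef(lm(y ~ x, data = d[2, ]))       # n = 1: the slope is NA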
The same is true for larger n and larger models. For example, a multiple
regression model with 10 predictor variables together with an intercept will have
11 parameters. Such a model will not give standard errors when you have 11
observations in your data matrix, and it will not run if you have fewer than 11
observations.
In sum, the number of data points, n, should always exceed the number of
parameters in your model. That means that if you have a lot of variables in
your data file, you cannot always use them in your analysis, because you simply
do not have enough rows in your data matrix to estimate the model parameters.
Figure 15.5: p = n (plot of Y against X).
Figure 15.6: p > n: there is an infinite number of lines through the single data point, but there is no criterion that determines which one is best. The least squares problem is not defined with only one data point.
Putting all of this together, a typical statistical learning project involves the following steps:
1. Problem identification. You need to know what the problem is: what do
you want to know? What do you want to predict? How good does the
prediction have to be? How fast does it have to be: real-time?
2. Data sources. Determine which sources of raw data are available and how they can be accessed, linked and combined, for example files in different formats stored at different hospitals or medical centres.
3. Feature selection. From the data sources you have access to, what features
are of interest? For example, from spoken interviews, are you mainly
interested in the words spoken by the patient? Or perhaps interested in
the length of periods of silence, or perhaps in changes in pitch? There are
so many features you could extract from data.
4. Construction of a data matrix. Once you have decided what features you
want to extract from the raw data, you have to put this information into
a data matrix. You have to decide what to put in the rows (your units of
observation), and in the columns (the features, now variables). So what is
now your variable: this could be the length of one period of silence within
one interview for a particular patient. But it could also be the average
pitch for a 1-minute interval in one interview for one particular patient.
5. Training and test (validation) data set. In order to check that we are not
over-fitting, and to make sure that our model will work for future data,
we divide our data set (our data matrix) into two parts: training data and
test data. This is done by taking a random sample of all the data that
we have. Usually, a random sample of 70% is used for training, and the
remaining 30% is used for testing (validating) the model; a short sketch of such a split is given in code after this list. We set the test data aside and will only look at the training data.
6. Model selection. Using only the training data, compare a number of candidate models, for example with the k-fold cross-validation described in Section 15.1.2, and determine which model predicts best.
7. Build the model. Once we know what model works best for our training data, we fit that model on the training data. This fitted model is our final model.
8. Validate the model. The final test is whether this final model will work
on new data. We don’t have new data, but we have put away some of
our data as test data (validation data). These data can now be used as
a substitute for new data to estimate how well our model will work with
future data.
9. Interpret the result and evaluate. There will always be some over-fitting,
so the performance on the test data will always be worse than on the
training data. But is the performance good enough to be satisfied with
the model? Is the model useful for daily practice? If not, maybe the
data sources and feature selection steps should be reconsidered. Another
important aspect of statistical learning is interpretability. There are some
very powerful models and methods around that are capable of very precise
predictions. However, the problem with these models and methods is
that they are hard to interpret: they are black boxes in that they make
predictions that cannot be explained by even the data analysts themselves.
Any decisions based on them are therefore hard to justify, which raises ethical issues. For instance, what would you say if an algorithm determined, on the basis of all your life's data, that you will not be a successful student? Of course you would want to know exactly what data that decision is based on. Your make-up? Your height? The colour of your skin?
Last year’s grades? Of course it would matter to you what variables are
used and how. Recent research has focused on how to make complicated
models and methods easier to interpret and help data analysts evaluate
the usefulness and applicability of their results and communicate them to
others.
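As referred to in step 5, a random 70/30 split into training and test data can be sketched in R as follows; the data frame d is a stand-in for a real data matrix.

set.seed(3)
d <- data.frame(x = rnorm(200), y = rnorm(200))      # stand-in data matrix
train_rows <- sample(nrow(d), size = round(0.7 * nrow(d)))
training <- d[train_rows, ]                          # 70% of the rows
test     <- d[-train_rows, ]                         # the remaining 30%
nrow(training)
nrow(test)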
Appendices
Appendix A
Density of the standard normal distribution; p is the cumulative proportion of the distribution below a value z.
Table A.1: Cumulative proportions (p) for the standard normal distribution.
z p z p z p z p z p z p
-4.00 0.0000 -1.43 0.0764 -0.71 0.2389 0.01 0.5040 0.73 0.7673 1.45 0.9265
-3.80 0.0001 -1.42 0.0778 -0.70 0.2420 0.02 0.5080 0.74 0.7704 1.46 0.9279
-3.60 0.0002 -1.41 0.0793 -0.69 0.2451 0.03 0.5120 0.75 0.7734 1.47 0.9292
-3.40 0.0003 -1.40 0.0808 -0.68 0.2483 0.04 0.5160 0.76 0.7764 1.48 0.9306
-3.20 0.0007 -1.39 0.0823 -0.67 0.2514 0.05 0.5199 0.77 0.7794 1.49 0.9319
-3.00 0.0013 -1.38 0.0838 -0.66 0.2546 0.06 0.5239 0.78 0.7823 1.50 0.9332
-2.90 0.0019 -1.37 0.0853 -0.65 0.2578 0.07 0.5279 0.79 0.7852 1.51 0.9345
-2.80 0.0026 -1.36 0.0869 -0.64 0.2611 0.08 0.5319 0.80 0.7881 1.52 0.9357
-2.70 0.0035 -1.35 0.0885 -0.63 0.2643 0.09 0.5359 0.81 0.7910 1.53 0.9370
-2.60 0.0047 -1.34 0.0901 -0.62 0.2676 0.10 0.5398 0.82 0.7939 1.54 0.9382
-2.50 0.0062 -1.33 0.0918 -0.61 0.2709 0.11 0.5438 0.83 0.7967 1.55 0.9394
-2.40 0.0082 -1.32 0.0934 -0.60 0.2743 0.12 0.5478 0.84 0.7995 1.56 0.9406
-2.30 0.0107 -1.31 0.0951 -0.59 0.2776 0.13 0.5517 0.85 0.8023 1.57 0.9418
-2.20 0.0139 -1.30 0.0968 -0.58 0.2810 0.14 0.5557 0.86 0.8051 1.58 0.9429
-2.10 0.0179 -1.29 0.0985 -0.57 0.2843 0.15 0.5596 0.87 0.8078 1.59 0.9441
-2.00 0.0228 -1.28 0.1003 -0.56 0.2877 0.16 0.5636 0.88 0.8106 1.60 0.9452
-1.99 0.0233 -1.27 0.1020 -0.55 0.2912 0.17 0.5675 0.89 0.8133 1.61 0.9463
-1.98 0.0239 -1.26 0.1038 -0.54 0.2946 0.18 0.5714 0.90 0.8159 1.62 0.9474
-1.97 0.0244 -1.25 0.1056 -0.53 0.2981 0.19 0.5753 0.91 0.8186 1.63 0.9484
-1.96 0.0250 -1.24 0.1075 -0.52 0.3015 0.20 0.5793 0.92 0.8212 1.64 0.9495
-1.95 0.0256 -1.23 0.1093 -0.51 0.3050 0.21 0.5832 0.93 0.8238 1.65 0.9505
-1.94 0.0262 -1.22 0.1112 -0.50 0.3085 0.22 0.5871 0.94 0.8264 1.66 0.9515
-1.93 0.0268 -1.21 0.1131 -0.49 0.3121 0.23 0.5910 0.95 0.8289 1.67 0.9525
-1.92 0.0274 -1.20 0.1151 -0.48 0.3156 0.24 0.5948 0.96 0.8315 1.68 0.9535
-1.91 0.0281 -1.19 0.1170 -0.47 0.3192 0.25 0.5987 0.97 0.8340 1.69 0.9545
-1.90 0.0287 -1.18 0.1190 -0.46 0.3228 0.26 0.6026 0.98 0.8365 1.70 0.9554
-1.89 0.0294 -1.17 0.1210 -0.45 0.3264 0.27 0.6064 0.99 0.8389 1.71 0.9564
-1.88 0.0301 -1.16 0.1230 -0.44 0.3300 0.28 0.6103 1.00 0.8413 1.72 0.9573
-1.87 0.0307 -1.15 0.1251 -0.43 0.3336 0.29 0.6141 1.01 0.8438 1.73 0.9582
-1.86 0.0314 -1.14 0.1271 -0.42 0.3372 0.30 0.6179 1.02 0.8461 1.74 0.9591
-1.85 0.0322 -1.13 0.1292 -0.41 0.3409 0.31 0.6217 1.03 0.8485 1.75 0.9599
-1.84 0.0329 -1.12 0.1314 -0.40 0.3446 0.32 0.6255 1.04 0.8508 1.76 0.9608
-1.83 0.0336 -1.11 0.1335 -0.39 0.3483 0.33 0.6293 1.05 0.8531 1.77 0.9616
-1.82 0.0344 -1.10 0.1357 -0.38 0.3520 0.34 0.6331 1.06 0.8554 1.78 0.9625
-1.81 0.0351 -1.09 0.1379 -0.37 0.3557 0.35 0.6368 1.07 0.8577 1.79 0.9633
-1.80 0.0359 -1.08 0.1401 -0.36 0.3594 0.36 0.6406 1.08 0.8599 1.80 0.9641
-1.79 0.0367 -1.07 0.1423 -0.35 0.3632 0.37 0.6443 1.09 0.8621 1.81 0.9649
-1.78 0.0375 -1.06 0.1446 -0.34 0.3669 0.38 0.6480 1.10 0.8643 1.82 0.9656
-1.77 0.0384 -1.05 0.1469 -0.33 0.3707 0.39 0.6517 1.11 0.8665 1.83 0.9664
-1.76 0.0392 -1.04 0.1492 -0.32 0.3745 0.40 0.6554 1.12 0.8686 1.84 0.9671
-1.75 0.0401 -1.03 0.1515 -0.31 0.3783 0.41 0.6591 1.13 0.8708 1.85 0.9678
-1.74 0.0409 -1.02 0.1539 -0.30 0.3821 0.42 0.6628 1.14 0.8729 1.86 0.9686
-1.73 0.0418 -1.01 0.1562 -0.29 0.3859 0.43 0.6664 1.15 0.8749 1.87 0.9693
-1.72 0.0427 -1.00 0.1587 -0.28 0.3897 0.44 0.6700 1.16 0.8770 1.88 0.9699
-1.71 0.0436 -0.99 0.1611 -0.27 0.3936 0.45 0.6736 1.17 0.8790 1.89 0.9706
-1.70 0.0446 -0.98 0.1635 -0.26 0.3974 0.46 0.6772 1.18 0.8810 1.90 0.9713
-1.69 0.0455 -0.97 0.1660 -0.25 0.4013 0.47 0.6808 1.19 0.8830 1.91 0.9719
-1.68 0.0465 -0.96 0.1685 -0.24 0.4052 0.48 0.6844 1.20 0.8849 1.92 0.9726
-1.67 0.0475 -0.95 0.1711 -0.23 0.4090 0.49 0.6879 1.21 0.8869 1.93 0.9732
-1.66 0.0485 -0.94 0.1736 -0.22 0.4129 0.50 0.6915 1.22 0.8888 1.94 0.9738
-1.65 0.0495 -0.93 0.1762 -0.21 0.4168 0.51 0.6950 1.23 0.8907 1.95 0.9744
-1.64 0.0505 -0.92 0.1788 -0.20 0.4207 0.52 0.6985 1.24 0.8925 1.96 0.9750
-1.63 0.0516 -0.91 0.1814 -0.19 0.4247 0.53 0.7019 1.25 0.8944 1.97 0.9756
-1.62 0.0526 -0.90 0.1841 -0.18 0.4286 0.54 0.7054 1.26 0.8962 1.98 0.9761
-1.61 0.0537 -0.89 0.1867 -0.17 0.4325 0.55 0.7088 1.27 0.8980 1.99 0.9767
-1.60 0.0548 -0.88 0.1894 -0.16 0.4364 0.56 0.7123 1.28 0.8997 2.00 0.9772
-1.59 0.0559 -0.87 0.1922 -0.15 0.4404 0.57 0.7157 1.29 0.9015 2.10 0.9821
-1.58 0.0571 -0.86 0.1949 -0.14 0.4443 0.58 0.7190 1.30 0.9032 2.20 0.9861
-1.57 0.0582 -0.85 0.1977 -0.13 0.4483 0.59 0.7224 1.31 0.9049 2.30 0.9893
-1.56 0.0594 -0.84 0.2005 -0.12 0.4522 0.60 0.7257 1.32 0.9066 2.40 0.9918
-1.55 0.0606 -0.83 0.2033 -0.11 0.4562 0.61 0.7291 1.33 0.9082 2.50 0.9938
-1.54 0.0618 -0.82 0.2061 -0.10 0.4602 0.62 0.7324 1.34 0.9099 2.60 0.9953
-1.53 0.0630 -0.81 0.2090 -0.09 0.4641 0.63 0.7357 1.35 0.9115 2.70 0.9965
-1.52 0.0643 -0.80 0.2119 -0.08 0.4681 0.64 0.7389 1.36 0.9131 2.80 0.9974
-1.51 0.0655 -0.79 0.2148 -0.07 0.4721 0.65 0.7422 1.37 0.9147 2.90 0.9981
-1.50 0.0668 -0.78 0.2177 -0.06 0.4761 0.66 0.7454 1.38 0.9162 3.00 0.9987
-1.49 0.0681 -0.77 0.2206 -0.05 0.4801 0.67 0.7486 1.39 0.9177 3.20 0.9993
-1.48 0.0694 -0.76 0.2236 -0.04 0.4840 0.68 0.7517 1.40 0.9192 3.40 0.9997
-1.47 0.0708 -0.75 0.2266 -0.03 0.4880 0.69 0.7549 1.41 0.9207 3.60 0.9998
-1.46 0.0721 -0.74 0.2296 -0.02 0.4920 0.70 0.7580 1.42 0.9222 3.80 0.9999
-1.45 0.0735 -0.73 0.2327 -0.01 0.4960 0.71 0.7611 1.43 0.9236 4.00 1.0000
-1.44 0.0749 -0.72 0.2358 0.00 0.5000 0.72 0.7642 1.44 0.9251 4.01 1.0000
Appendix B
Density of the t-distribution; p is the tail probability beyond a value t.
Table B.1: Values of the t-distribution, given the degrees of freedom (rows) and
tail probability p (columns). These can be used for critical values for a given
confidence level.
df 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
1 1.963 3.078 6.314 12.71 15.90 31.82 63.66 127.3 318.3 636.6
2 1.386 1.886 2.920 4.303 4.849 6.965 9.925 14.09 22.33 31.60
3 1.250 1.638 2.353 3.182 3.482 4.541 5.841 7.453 10.22 12.92
4 1.190 1.533 2.132 2.776 2.999 3.747 4.604 5.598 7.173 8.610
5 1.156 1.476 2.015 2.571 2.757 3.365 4.032 4.773 5.893 6.869
6 1.134 1.440 1.943 2.447 2.612 3.143 3.707 4.317 5.208 5.959
7 1.119 1.415 1.895 2.365 2.517 2.998 3.499 4.029 4.785 5.408
8 1.108 1.397 1.860 2.306 2.449 2.896 3.355 3.833 4.501 5.041
9 1.100 1.383 1.833 2.262 2.398 2.821 3.250 3.690 4.297 4.781
10 1.093 1.372 1.812 2.228 2.359 2.764 3.169 3.581 4.144 4.587
11 1.088 1.363 1.796 2.201 2.328 2.718 3.106 3.497 4.025 4.437
12 1.083 1.356 1.782 2.179 2.303 2.681 3.055 3.428 3.930 4.318
13 1.079 1.350 1.771 2.160 2.282 2.650 3.012 3.372 3.852 4.221
14 1.076 1.345 1.761 2.145 2.264 2.624 2.977 3.326 3.787 4.140
15 1.074 1.341 1.753 2.131 2.249 2.602 2.947 3.286 3.733 4.073
16 1.071 1.337 1.746 2.120 2.235 2.583 2.921 3.252 3.686 4.015
17 1.069 1.333 1.740 2.110 2.224 2.567 2.898 3.222 3.646 3.965
18 1.067 1.330 1.734 2.101 2.214 2.552 2.878 3.197 3.610 3.922
19 1.066 1.328 1.729 2.093 2.205 2.539 2.861 3.174 3.579 3.883
20 1.064 1.325 1.725 2.086 2.197 2.528 2.845 3.153 3.552 3.850
21 1.063 1.323 1.721 2.080 2.189 2.518 2.831 3.135 3.527 3.819
22 1.061 1.321 1.717 2.074 2.183 2.508 2.819 3.119 3.505 3.792
23 1.060 1.319 1.714 2.069 2.177 2.500 2.807 3.104 3.485 3.768
24 1.059 1.318 1.711 2.064 2.172 2.492 2.797 3.091 3.467 3.745
25 1.058 1.316 1.708 2.060 2.167 2.485 2.787 3.078 3.450 3.725
26 1.058 1.315 1.706 2.056 2.162 2.479 2.779 3.067 3.435 3.707
27 1.057 1.314 1.703 2.052 2.158 2.473 2.771 3.057 3.421 3.690
28 1.056 1.313 1.701 2.048 2.154 2.467 2.763 3.047 3.408 3.674
29 1.055 1.311 1.699 2.045 2.150 2.462 2.756 3.038 3.396 3.659
30 1.055 1.310 1.697 2.042 2.147 2.457 2.750 3.030 3.385 3.646
40 1.050 1.303 1.684 2.021 2.123 2.423 2.704 2.971 3.307 3.551
50 1.047 1.299 1.676 2.009 2.109 2.403 2.678 2.937 3.261 3.496
60 1.045 1.296 1.671 2.000 2.099 2.390 2.660 2.915 3.232 3.460
120 1.041 1.289 1.658 1.980 2.076 2.358 2.617 2.860 3.160 3.373
10000 1.036 1.282 1.645 1.960 2.054 2.327 2.576 2.808 3.091 3.291
70% 80% 90% 95% 96% 98% 99% 99.5% 99.8% 99.9%
Confidence level