Introducing The Linear Model
Preamble
In our prior sessions we have focused on the exploration, description, and visualization of variables. This will
be our first session to explore methods for understanding the associations between variables. Many questions
in life sciences research can be expressed as estimation of the associations between variables, so these
methods are a very important part of the methodological toolkit for researchers.
Analyses of associations between variables are a common task in R, so there are a variety of ways to do this.
Over the next few sessions we will explore the basic operations in R for conducting statistical analyses of
associations between different types of variables, and the interpretation of the results that you obtain from
these analyses.
The emphasis in this prac will be on the mechanics of linear regression models and the implementation of
these analyses in R. The overall goal of the session is to introduce you to the process of modelling
associations between variables, in preparation for the subsequent sessions which will focus in detail on
elements of these analyses and ways in which the basic framework can be modified for different research
problems.
By the end of this prac, you should be able to:
1. assess and prepare data for analysis with the linear regression model
2. define a model formula from the available variables and justify the structure of a linear regression model
3. generate a linear regression model in R and interrogate it using various utility functions
4. define ‘residual’, explain the importance of residuals in linear regression models, and assess
distributions of residuals
5. understand the concept of prediction from a linear model
Assessment
This prac is not assessed. So, kick back and relax! (But not too much, because this material will be in the next assessment and on the final exam.)
https://www.aeaweb.org/articles?id=10.1257/jep.22.4.199
You have conducted an experiment evaluating the relationship between the roasting temperature of malt and the colour of the resulting brew. Colour is an important element of quality assessment in the brewing industry. Before fermentation, malted barley, the source of sugars (especially maltose), is roasted in an oven to generate important colour and flavour compounds, especially for darker beer varieties such as porter or stout. These chemicals arise via Maillard reactions.
Optimizing the conditions of brewing was the major focus of Gosset’s scientific career. Several issues are
important in making a good brew. If the roasting temperature is too high it increases the overall cost of the
process and risks making a final brew that is too dark. Conversely, if the roasting temperature is too low it will produce a final brew that is too light and weakly flavoured to satisfy the consumer.
In this prac we are going to explore the relationship between final brew colour and roasting temperature.
Imagine an experiment in which you roasted malts of the same variety at several different temperatures. The
other roasting conditions, including the roasting time, were comparable, and the post-roasting processing was
the same. In short, the primary source of variation in the final brew colour is the roasting temperature because
other sources of variability have been controlled to be approximately the same.
After brewing, the color of the final product was assessed using the European Brewing Convention (EBC)
scale, which assigns a positive numeric value, with higher values representing darker color. This number is
derived from a measurement of the absorption of 430 nm light by a sample of the beer (absorption
spectroscopy).
Temperature is a classic continuous variable, and in practical applications its lower bound lies far enough from the measured range to be irrelevant, so it should behave very sensibly when modelled as a continuous variable. EBC, though it
includes only positive values, is derived directly from a real physical measurement, so it should be reasonable
to treat it as a continuous variable. Remember that this assessment of a variable has important implications for
the selection of appropriate analyses. In this case, the characteristics of the data suggest that a linear
regression model may be an appropriate technique for analysis of the data.
The data are contained in the file beer.Rdata which has been pre-loaded in the R Markdown setup chunk.
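Before we start modelling, it is worth a quick look at the data. A minimal sketch (this assumes only that beer_data contains the two columns used throughout this prac, temp and EBC):

# inspect the structure and ranges of the pre-loaded data
str(beer_data)
summary(beer_data)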
The linear model builds on the familiar equation for a straight line:

y = mx + b

This equation describes the relationship between two numeric variables, x and y, using two parameters: m, which is the
slope (or gradient) of the function, and b, which is the y-intercept of the function (the value of y when x equals
zero). The slope represents the amount of change in the value of y that is expected with a one unit change in
the value of x .
Note: depending on where you went to high school you may have also learned this equation in a
slightly different form. Some common alternatives include:
y = mx + c
y = ax + b
Throughout most of this class we will use a powerful modelling approach known as the general linear model. In this week’s prac, we will be working with a special case of the general linear model commonly referred to as linear regression. The form of this model is very close to the simple linear function:
yᵢ = mxᵢ + b + ϵᵢ,  where ϵᵢ ∼ N(0, σ²)
Like the equation for the line, there is the variable y, which we generally refer to as the response or dependent
variable, and the variable x , which we generally refer to as the predictor or independent variable. For linear
regression, both x and y are continuous numeric variables. The main difference between the linear model and
the equation for the line is the additional term ϵ, which is called the residual. The residuals represent the variability of the observed y-values relative to the y-values predicted by the deterministic part of the right-hand side of the equation.
You should also note that the ϵ values are assumed to be normally distributed with a mean of 0 and a variance of σ². This is a key assumption of linear models and one that we will spend some time confirming in our post-analysis diagnostics.
These mathematical characteristics demonstrate the importance of various assumptions of the linear
regression model. These can be inferred from the structure. The most important ones to remember are:
A. the observations are independent of one another
B. the residuals are normally distributed with a mean of zero
C. the residuals are homogeneous (ie, have constant variance) across the range of the predictor variable
The assumption of independence (A) is the one most often violated by the design of a study. For example:
If we take multiple samples of grass biomass and nutrient content from one corner of a paddock near a
stream, then the observations are not spatially independent with respect to the broader population.
If we measure tree growth in one year, it is likely that growth in the previous year was relatively similar
(because the forest hasn’t changed much in a year). As a consequence, the growth rates from one year
to the next are not temporally independent.
If we record bird responses to a stimulus, but the first bird’s behaviour influences the behaviour of other
birds, then subsequent measures of the other birds may be spatially and temporally dependent on the
first measurement.
Most of these problems are ones of experimental design. Using randomised or stratified random sampling will
help to reduce or eliminate potential lack of independence amongst the observations. Addressing these
problems during the analysis stage of a project is complicated, requires sophisticated autocorrelation modelling
tools, and is often not very satisfying or successful. We will not address them any further in this prac.
Let’s see how these concepts are implemented in R using our beer brewing example. The simplest call to
lm() requires two arguments: the formula describing the model and the data to be modelled.
The model formula follows the basic form of the model equation described above: y ~ x , where y is the
response variable and x is the predictor variable.
For the beer data, we could define the model for colour as a function of temperature using the following
formula: EBC ~ temp .
Let’s go ahead and ask R to make us a linear model for color as a function of temperature using the following
command. We’ll assign the linear model to an object that we call beer_mod1 :
If we want to look at a summary of the linear model analysis, we use the function summary() with the name of
the model that we are using in the brackets:
summary(beer_mod1)
(The summary output appears here: the model Call, a five-number summary of the Residuals, the Coefficients table with estimates, standard errors, t-values, p-values and significance codes, and the residual standard error, R², F-statistic and p-value.)
This gives us a lot of information about the linear model describing the relationship between temp and EBC .
Let’s step through the output and see what we can learn about our analysis. Some of these reported statistics
will be covered in detail over the next few practical sessions, so it’s not necessary to master all of them at
once. However, by the end of the semester you should be quite comfortable parsing all of this information. The
output is organised in four chunks:
Model structure This returns the model that we typed in. Its primary purpose is to serve as a record of the model associated with its results. This is particularly useful when we are comparing many models and want to quickly check the details of each model’s structure.
Distribution of residuals A core assumption of the linear model is that the residuals are normally distributed with a mean of 0 and a variance of σ². So, we should be looking for the median of the residuals to be close to 0 and the 1st and 3rd quartiles (1Q and 3Q) to have similar absolute values. This is a quick-and-dirty way to assess the residuals; we’ll discuss more sophisticated approaches later in today’s prac.
Model coefficients This chunk provides the best estimates of the model coefficients (ie, of m and b), as well as some statistics describing them. Notice that there are two coefficients. (Intercept) is the y-intercept (ie, b). While we did not request it in the model description, R assumes that you want to estimate it (if you don’t, you can add -1 to the model formula and it will not estimate the y-intercept). temp is the slope (ie, m). The first column ( Estimate ) gives the value of each parameter that best fits the data: given the data in the beer dataset, these estimates define the linear model that minimises the residual variability.
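If you just want these two estimates on their own, the utility function coef() extracts them from the model as a named vector:

coef(beer_mod1)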
These are the best estimates of the parameters given the data available. However, because these are
estimated parameters, we also want to know about the uncertainty associated with them. The second column
( Std. Error ) provides that. The standard error of the parameter estimates gives us an indication of how
clearly the data describe the parameters. If the standard error is large relative to the parameter estimate, then
we will not have much confidence in that parameter estimate. If it is small relative to the parameter estimate, it
suggests that we can have some confidence in the parameter estimates. The third and fourth columns provide
an indication of whether the parameter estimates are significantly different from 0. We won’t talk much about that this week, but we will return to it later in the semester.
Model strength The last chunk describes how much variability in the dataset the model explains and how much remains unexplained. The residual standard error (RSE) describes the unexplained variability in the model; it is an estimate of σ, the standard deviation of the residuals ϵ in the equation for the linear model. The coefficient of determination (R²) describes the proportion of the variability in y that is explained by its relationship with x. It takes values between 0 and 1, with higher values representing a better fit. We will discuss this statistic in detail in future pracs. Finally, the F-statistic and p-value can be used when comparing several models that are trying to predict the same response variable. This, too, we will discuss in detail later in the semester.
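As an aside, these statistics can also be extracted programmatically from the summary object, which is handy for reporting:

summary(beer_mod1)$r.squared # coefficient of determination (R-squared)
summary(beer_mod1)$sigma # residual standard error (RSE)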
Let’s explore some of these outputs in more depth. We’ll start by looking at the residuals in more detail using the utility function residuals():
residuals(beer_mod1)
##       ...          9         10
##       ...   1.482049  -7.323618
Notice that the vector returned is the same size as the data that we used to build the model:
length(residuals(beer_mod1))
## [1] 10
length(beer_data$EBC)
## [1] 10
This is not a coincidence. There is a residual for each observation. The model predicts a y-value for each x in
the dataset. The residual is the difference between the predicted (or fitted) y-value and the observed y-value
(ie, the one in the dataset). We can check this:
# The response data we used to fit the model i.e. the observations
beer_data$EBC
fitted(beer_mod1)
##       ...          9         10
##       ...   66.08828   72.84458
beer_data$EBC-fitted(beer_mod1)
##       ...          9         10
##       ...   1.482049  -7.323618
residuals(beer_mod1)
##       ...          9         10
##       ...   1.482049  -7.323618
(beer_data$EBC-fitted(beer_mod1)) / residuals(beer_mod1)
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
Recall that the linear model assumes that the residuals are normally distributed with mean zero. If this
assumption does not hold, our analysis is suspect, for two reasons:
A. the structure of the model that we are using might be a poor representation of the data and/or
B. inferential statistics generated from the model (more on these later!) may not be valid.
Model diagnostics
Base R graphics produces useful diagnostic plots simply by calling the plot() function on the model object:
plot(beer_mod1)
(This produces the four standard base-R diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.)
The performance package provides similar results to plot(lm) but with more graphical detail and a nicer
aesthetic:
library(performance)
check_model(beer_mod1)
In this class, we’ll use the performance package output because it is a bit easier to use in presenting your
results.
So, what do all of these figures mean? You should be familiar (from last week) with the upper two panels in the
performance output. The upper left-hand panel is a quantile-quantile plot and the upper right-hand panel is a
density plot of the residuals with a theoretical normal distribution superimposed on top of it. These two plots
provide a means to assess whether the model residuals are approximately normally distributed. For
beer_mod1 the residuals aren’t perfectly normal, but they are pretty good (particularly for a dataset with only
10 observations).
The two lower panels compare the fitted values (ie, the predicted y-values) against the corresponding residuals. This allows us to assess whether there are any patterns in the residuals that might be cause for concern. The two things we are checking for are a relatively even spread of points from left to right and any obvious evidence of non-linearity. This dataset does not have any obvious issues with spread (the points are as widely spread on the y-axis at low fitted values as they are at high fitted values), but it does have a bit of a curve to it. This is highlighted by the fifth panel, which shows the outliers. Notably, the first and last observations are considered outliers, which likely indicates some curvilinearity in these data. However, over the range of observations in the dataset, the R² is 0.9194, which means that the temperature variable explains 91.94% of the observed variation in EBC, which is extremely good!
It is worth noting that we can also use the performance package to check specific issues quantitatively. Here
we request specific statistical tests for normality and heteroscedasticity, as well as the identification of any
outliers in the dataset:
check_normality(beer_mod1)
check_heteroscedasticity(beer_mod1)
check_outliers(beer_mod1)
For comparison, let’s fit a second model that omits the y-intercept:

beer_mod2 <- lm(EBC ~ -1 + temp, data = beer_data) # -1 removes the y-intercept from the model
summary(beer_mod2)
(The summary output for beer_mod2 appears here, in the same format as the summary of beer_mod1 above.)
QUESTION 3: Compare the parameter estimates, residuals, goodness-of-fit statistics, and diagnostic graphics
for beer_mod1 and beer_mod2 . Which is a better model and why?
Prediction

A fitted model can also be used to predict the response for new values of the predictor, using the utility function predict(). Calling predict() on beer_mod1 with a single new temperature returns the predicted EBC at that temperature:
## 1
## 49.19755
This returns a single EBC value predicted by the beer_mod1 model. The newdata argument specifies that we
want the model to take the values for the predictor variable from the supplied data frame. Two critical points to note: 1) the new data frame must use the same variable name as the dataset used to fit the model, and 2) the new data needs to be wrapped in the data.frame() function.
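As an aside, if you omit the newdata argument entirely, predict() simply returns the model’s predictions for the original observations, identical to the output of fitted():

predict(beer_mod1) # identical to fitted(beer_mod1)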
If we wanted to predict a range of temperatures, we can specify a vector of values. For example, to get the EBC for each full degree between 140°C and 160°C, we would use:
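# predict EBC at each whole-degree temperature from 140 to 160
predict(beer_mod1, newdata = data.frame(temp = 140:160))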
## (21 predicted EBC values, one for each whole degree from 140°C to 160°C)
An additional consideration is how much certainty our model provides us about that prediction. For example, for a temperature of 172°C, we can additionally obtain the standard error of the prediction by setting se.fit = TRUE:
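predict(beer_mod1, newdata = data.frame(temp = 172), se.fit = TRUE)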
## $fit
## 1
## 55.95384
##
## $se.fit
## [1] 2.322588
##
## $df
## [1] 8
##
## $residual.scale
## [1] 6.027415
QUESTION 4: The maximum achievable temperature with a new piece of roasting equipment is 180°C. What value of EBC do we expect for it? How uncertain are we about this? If our target EBC is 58, is this equipment suitable for us?
QUESTION 5: What EBC do we expect for a roasting temperature of 100°C? How is this distinct from the values we predicted above? Is this prediction trustworthy?
Another useful application of the predict() function is to plot the behaviour of our model. For example, let’s
plot the predicted EBC from our model as a function of temperature, across a range of temperatures including
all of our data. Let’s start by defining a range of values for temperature and then predicting EBC for all of those
temperatures:
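A minimal sketch of this step (the object names new_temp and pred_data, and the exact temperature range, are assumptions chosen to match the axis limits used in the plots below):

# a grid of temperatures spanning slightly beyond the observed data
new_temp <- data.frame(temp = seq(125, 205, by = 1))
# predictions with standard errors: a list with elements $fit, $se.fit, $df and $residual.scale
new_EBC <- predict(beer_mod1, newdata = new_temp, se.fit = TRUE)
# collect the temperatures, predictions, and standard errors into one data frame for plotting
pred_data <- data.frame(new_temp, fit = new_EBC$fit, se.fit = new_EBC$se.fit)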
length(new_EBC)
## [1] 4
Now we have a dataframe containing a range of temperatures, the model-predicted EBC for each temperature
value, and the standard error of the prediction at each temperature. Let’s try plotting all of this in ggplot . We
use the predicted data to plot a line representing the model (assuming the predictions are in the pred_data data frame built above):

pred_EBC <- ggplot(pred_data, aes(x = temp, y = fit)) +
  geom_line() +
  xlim(125, 205)

pred_EBC
Note that we assigned this ggplot call to an object called pred_EBC . This will make it easy for us to add
layers of data on top of this graph. Now let’s add the original data points:
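One way to do this is with a geom_point() layer drawing the observations from beer_data over the fitted line (a sketch, building on the pred_EBC object defined above):

pred_EBC <- pred_EBC + geom_point(data = beer_data, aes(x = temp, y = EBC))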
pred_EBC
This is a very illustrative style of plot, as it displays both our fitted model and the observations upon which it is
based. This is often easier to use than residual plots to examine the appropriateness of the shape of the
model.
It is also useful to show the uncertainty in the predictions. Let’s add dashed lines spanning 2 standard errors (SE): one line at the predicted value plus 1 SE and one line at the predicted value minus 1 SE. Note that, when presenting this kind of plot, it is essential to describe in your caption what the lines represent, as this cannot be assumed (ie, there is no accepted standard for this).
pred_EBC <- pred_EBC +
  geom_line(aes(y = fit + se.fit), linetype = "dashed") +
  geom_line(aes(y = fit - se.fit), linetype = "dashed") + theme_bw()

pred_EBC
QUESTION 6: Why are the prediction intervals (ie, the dashed lines) not straight?
QUESTION 7: Repeat this analysis to make a figure showing the observed data, the fitted line from the model,
and the prediction interval using beer_mod2 .