
Introducing the Linear Model


Laura Brannelly, Saras Windecker, Patrick Baker, and Andrew Woodward
26 March 2022

Preamble
In our prior sessions we have focused on the exploration, description, and visualization of variables. This is
our first session exploring methods for understanding the associations between variables. Many questions
in life sciences research can be expressed as the estimation of associations between variables, so these
methods are an important part of the methodological toolkit for researchers.

Analysis of associations between variables is a common task in R, so there are a variety of ways to do it.
Over the next few sessions we will explore the basic operations in R for conducting statistical analyses of
associations between different types of variables, and the interpretation of the results that you obtain from
these analyses.

The emphasis in this prac will be on the mechanics of linear regression models and the implementation of
these analyses in R. The overall goal of the session is to introduce you to the process of modelling
associations between variables, in preparation for the subsequent sessions which will focus in detail on
elements of these analyses and ways in which the basic framework can be modified for different research
problems.

Following this session you should be able to:

1. assess and prepare data for analysis with the linear regression model
2. define a model formula from the available variables and justify the structure of a linear regression model
3. generate a linear regression model in R and interrogate it using various utility functions
4. define ‘residual’, explain the importance of residuals in linear regression models, and assess
distributions of residuals
5. understand the concept of prediction from a linear model

Assessment
This prac is not assessed. So, kick back and relax! (But not too much, because this material will appear in the
next assessment and on the final exam.)

Setting the Scene


Inspired by the work of brewing scientist and statistician William Gosset:

https://www.aeaweb.org/articles?id=10.1257/jep.22.4.199

You have conducted an experiment evaluating the relationship between roasting temperature of malt and the colour of
the resulting brew. Color is an important element of quality assessment in the brewing industry. Before
fermentation, malted barley, the source of sugars (especially maltose) is roasted in an oven to generate
important color and flavour compounds, especially for darker beer varieties such as porter or stout. These
chemicals arise via Maillard reactions.

Optimizing the conditions of brewing was the major focus of Gosset’s scientific career. Several issues are
important in making a good brew. If the roasting temperature is too high it increases the overall cost of the
process and risks making a final brew that is too dark. Conversely, if the roasting temperature is too low it will
produce a final brew that is too light and weakly flavoured to satisfy the consumer.

In this prac we are going to explore the relationship between final brew colour and roasting temperature.
Imagine an experiment in which you roasted malts of the same variety at several different temperatures. The
other roasting conditions, including the roasting time, were comparable, and the post-roasting processing was
the same. In short, the primary source of variation in the final brew colour is the roasting temperature because
other sources of variability have been controlled to be approximately the same.

After brewing, the color of the final product was assessed using the European Brewing Convention (EBC)
scale, which assigns a positive numeric value, with higher values representing darker color. This number is
derived from a measurement of the absorption of 430nm light by a sample of the beer (absorption
spectroscopy).

Temperature is a classical continuous variable, and in practical application its lower bound is often too distant
to be relevant, so it should behave very sensibly when modelled as a continuous variable. EBC, though it
includes only positive values, is derived directly from a real physical measurement, so it should be reasonable
to treat it as a continuous variable. Remember that this assessment of a variable has important implications for
the selection of appropriate analyses. In this case, the characteristics of the data suggest that a linear
regression model may be an appropriate technique for analysis of the data.

The data are contained in the file beer.Rdata which has been pre-loaded in the R Markdown setup chunk.
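
Before modelling, it is worth a quick look at the structure of the data. A minimal sketch, assuming the loaded
object is a data frame called beer_data with numeric columns temp and EBC (as used throughout this prac):

# Check the structure and the first few rows of the data
str(beer_data)
head(beer_data)

# Quick numeric summaries of the two variables
summary(beer_data$temp)
summary(beer_data$EBC)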

First things first


QUESTION 1: Use ggplot to plot the relationship between the roasting temperature ( temp ) and the beer
color ( EBC ). Describe (in words) the nature of the relationship between the two variables.
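
One possible starting point (a sketch, assuming ggplot2 has been loaded in the setup chunk):

# Scatterplot of beer colour against roasting temperature
ggplot(data = beer_data, aes(x = temp, y = EBC)) +
  geom_point() +
  labs(x = 'Temperature (Celsius)', y = 'EBC')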

Defining a Linear Model


You should all be familiar with the equation for a line:

y = mx + b

It describes the relationship between two numeric variables, x and y, using two parameters: m, which is the
slope (or gradient) of the function, and b, which is the y-intercept of the function (the value of y when x equals
zero). The slope represents the amount of change in the value of y that is expected with a one unit change in
the value of x .

Note: depending on where you went to high school you may have also learned this equation in a
slightly different form. Some common alternatives include:

y = mx + c

y = ax + b

Throughout most of this class we will use a powerful modelling approach known as the general linear model. In
this week’s prac, we will be working with a special case of the general linear model commonly referred to as
linear regression. The form of this model is very close to the simple linear function:

y_i = m x_i + b + ε_i,      ε_i ~ N(0, σ²)

Like the equation for the line, there is the variable y, which we generally refer to as the response or dependent
variable, and the variable x, which we generally refer to as the predictor or independent variable. For linear
regression, both x and y are continuous numeric variables. The main difference between the linear model and
the equation for the line is the additional term ε_i, which is called the residual (or error) term. This represents the
variability of the observed y-values relative to the y-values predicted by the right-hand side of the equation.


You should also note that the ε_i values are assumed to be normally distributed with a mean of 0 and a
variance of σ². This is a key assumption of linear models and one that we will spend some time confirming in
our post-analysis diagnostics.

These mathematical characteristics demonstrate the importance of various assumptions of the linear
regression model. These can be inferred from the structure. The most important ones to remember are:

A. the observations are independent of one another

B. the residuals are normally distributed

C. the residuals are homogeneous across the range of the predictor variable

D. the data are linear

Let’s take these one by one…

Individual observations are independent


Most statistical tests assume that each observation is independent of the others. This means that knowing
something about one observation tells us nothing about another observation. This assumption of
independence can be violated temporally or spatially. When data are not completely spatially or temporally
independent, we refer to them as spatially or temporally autocorrelated. The following examples illustrate
this:

If we take multiple samples of grass biomass and nutrient content from one corner of a paddock near a
stream, then the observations are not spatially independent and may not be representative of the broader population.
If we measure tree growth in one year, it is likely that growth in the previous year was relatively similar
(because the forest hasn’t changed much in a year). As a consequence, the growth rates from one year
to the next are not temporally independent.
If we record bird responses to a stimulus, but the first bird’s behaviour influences the behaviour of other
birds, then subsequent measures of the other birds may be spatially and temporally dependent on the
first measurement.

Most of these problems are ones of experimental design. Using randomised or stratified random sampling will
help to reduce or eliminate potential lack of independence amongst the observations. Addressing these
problems during the analysis stage of a project is complicated, requires sophisticated autocorrelation modelling
tools, and is often not very satisfying or successful. We will not address them any further in this prac.

Residuals are normally distributed


Linear regression assumes that the population of potential y-values (ie, the response data) is normally
distributed for each level of the predictor variable x. In practice, this means that the residuals (ie, the
differences between each observation and the predicted value for that observation) should be normally distributed. While
linear regression is relatively robust to departures from this assumption, heavily skewed values can cause
problems. The best way to assess this issue is to have experiments designed with replicated measurements
for a given x-value. However, this is rarely done. Instead we use graphical methods to evaluate the normality of
the model residuals. If the assumptions are not met, we can either transform the data (which we will do in this
prac) or use a non-normal distribution for the error term (which we will do in a later prac).
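
For example, for any fitted model object (here a hypothetical lm object called my_mod), two quick graphical
checks are a histogram and a quantile-quantile plot of the residuals:

# Histogram of residuals: look for a roughly symmetric, bell-shaped spread
hist(residuals(my_mod), xlab = 'Residual', main = 'Histogram of residuals')

# Q-Q plot: points should fall close to the reference line
qqnorm(residuals(my_mod))
qqline(residuals(my_mod))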

Variance is homogeneous across range of predictor


Linear regression assumes that the variability of the y-values for each level of x is constant across the range of
x values. That is, as x increases, there is no trend in the variance of the y-values. The assumption of homogeneity
of variance is an important one, and violating it is more of a problem than non-normality of the response data. As with
the assessment of normality, lacking replicated y-values for each level of x, it is not possible to formally test this assumption.
However, we can use graphical methods to qualitatively assess patterns of residual variance across the range
of x values. If the assumption of homogeneity of variance is not met, we can either transform the data or use
weighted regression (which is beyond the scope of this class).
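
For example, for a hypothetical fitted lm object my_mod, a simple residuals-versus-fitted plot lets us eyeball
any trend in the spread:

# Plot residuals against fitted values: look for even vertical scatter
# across the x-axis, with no fan or funnel shape
plot(fitted(my_mod), residuals(my_mod),
     xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0, lty = 2)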

Data are linear


Finally, linear regression assumes that the data are best described by a straight line. If the data do not meet
this assumption, then it is worth considering other types of models that can better account for the shape of the
data. This is typically an issue when data are curvilinear or have step changes. In some instances, this can be
addressed by transforming the data. In other cases, we may just require a non-linear model form.
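
When the data are clearly curvilinear, one option (a sketch, using a hypothetical data frame my_data with
variables x and y) is to add a squared term; this is still a linear model because it remains linear in its
coefficients:

# Quadratic regression: I() protects the squared term inside the formula
curved_mod <- lm(y ~ x + I(x^2), data = my_data)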

Conducting Linear Regression in R


The function lm() is the workhorse for linear modelling in R. Over the next few pracs we will explore how this
function may be applied flexibly to generate various canonical analyses in parametric statistics (eg, regression,
ANOVA, t-tests).

Let’s see how these concepts are implemented in R using our beer brewing example. The simplest call to
lm() requires two arguments: the formula describing the model and the data to be modelled.

The model formula follows the basic form of the model equation described above: y ~ x , where y is the
response variable and x is the predictor variable.

For the beer data, we could define the model for colour as a function of temperature using the following
formula: EBC ~ temp .

Let’s go ahead and ask R to make us a linear model for color as a function of temperature using the following
command. We’ll assign the linear model to an object that we call beer_mod1 :

beer_mod1 <- lm(EBC ~ temp, data = beer_data)

If we want to look at a summary of the linear model analysis, we use the function summary() with the name of
the model that we are using in the brackets:

summary(beer_mod1)

##

## Call:

## lm(formula = EBC ~ temp, data = beer_data)

##

## Residuals:

## Min 1Q Median 3Q Max

## -9.7555 -3.4773 0.9805 4.7751 6.8682

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -96.06281 13.73661 -6.993 0.000113 ***

## temp 0.84454 0.08295 10.181 7.42e-06 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 6.027 on 8 degrees of freedom

## Multiple R-squared: 0.9284, Adjusted R-squared: 0.9194

## F-statistic: 103.7 on 1 and 8 DF, p-value: 7.421e-06

This gives us a lot of information about the linear model describing the relationship between temp and EBC .
Let’s step through the output and see what we can learn about our analysis. Some of these reported statistics
will be covered in detail over the next few practical sessions, so it’s not necessary to master all of them at
once. However, by the end of the semester you should be quite comfortable parsing all of this information. The
output is organised in four chunks:

Model structure This returns to us the model that we typed in. The primary purpose of this is as a record of
the model associated with its results. This is particularly useful when we are comparing many models
and want to quickly check the details of the model structure.

Distribution of residuals A core assumption of the linear model is that the residuals are normally distributed
with a mean of 0 and a variance of σ². So, we should be looking for the median of the residuals to
be close to 0 and the 1st and 3rd quartiles to have similar absolute values. This is a quick-and-dirty way to
assess the residuals; we’ll discuss more sophisticated approaches later in today’s prac.
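
You can reproduce this five-number summary directly from the model residuals:

# Median should be near 0; 1Q and 3Q should be roughly symmetric about it
summary(residuals(beer_mod1))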

Model coefficients This chunk provides the best estimates of the model coefficients (ie, for m and b), as well
as some statistics describing them. Notice that there are two coefficients. (Intercept) is the y-intercept (ie,
b ). While we did not request it in the model description, R assumes that you want to know it (if you don’t, you
can add -1 to the model formula and it will not estimate the y-intercept). temp is the slope (ie, m). The first column
( Estimate ) is the value of the parameter that best fits the data. This tells us that, given the data from
the beer dataset, the linear model that minimises the residual variability is:

EBC = 0.844 × temp − 96.06
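
You can extract these estimates with the utility function coef() and reproduce a fitted value by hand (a quick
check on the fitted equation):

# Named vector of the parameter estimates: (Intercept) and temp
coef(beer_mod1)

# Manual prediction of EBC for a roast at 150 degrees Celsius
coef(beer_mod1)['temp'] * 150 + coef(beer_mod1)['(Intercept)']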

These are the best estimates of the parameters given the data available. However, because these are
estimated parameters, we also want to know about the uncertainty associated with them. The second column
( Std. Error ) provides that. The standard error of the parameter estimates gives us an indication of how
clearly the data describe the parameters. If the standard error is large relative to the parameter estimate, then
we will not have much confidence in that parameter estimate. If it is small relative to the parameter estimate, it
suggests that we can have some confidence in the parameter estimates. The third and fourth columns provide
an indication of whether the parameter estimates are significantly different from 0. We won’t talk much about
that this week, but will return to it later in the semester.
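
One common way to summarise this uncertainty is a confidence interval for each parameter, which combines
the estimate and its standard error:

# 95% confidence intervals for the intercept and slope (the default level)
confint(beer_mod1)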

Model strength The last chunk describes how much variability in the dataset the model explains and how
much remains unexplained. The residual standard error (RSE) describes the unexplained variability in the
model. The RSE is an estimate of σ from the equation for the linear model. The coefficient of determination
(R²) describes the proportion of the variability in y that is described by its relationship with x. It has a value
between 0 and 1, with higher values representing a better fit. We will discuss this statistic in detail in future
pracs. Finally, the F-statistic and p-value can be used when comparing several models that are trying to predict
the same response variable. This, too, we will discuss in detail later in the semester.

Let’s explore some of these outputs further. We’ll start by looking at the residuals in more detail
using the utility function residuals() :

residuals(beer_mod1)

## 1 2 3 4 5 6 7 8

## -9.755545 4.617167 -3.774739 5.164924 4.827781 6.868209 -2.585131 0.478904

## 9 10

## 1.482049 -7.323618

Notice that the vector returned is the same size as the data that we used to build the model:

length(residuals(beer_mod1))

## [1] 10


length(beer_data$EBC)

## [1] 10

This is not a coincidence. There is a residual for each observation. The model predicts a y-value for each x in
the dataset. The residual is the difference between the predicted (or fitted) y-value and the observed y-value
(ie, the one in the dataset). We can check this:

# The response data we used to fit the model i.e. the observations

beer_data$EBC

## [1] 2.282375 23.411383 21.775772 37.471731 43.890883 52.687606 49.990562

## [8] 59.810893 67.570333 65.520962

# The matching values fitted by our model

fitted(beer_mod1)

## 1 2 3 4 5 6 7 8

## 12.03792 18.79422 25.55051 32.30681 39.06310 45.81940 52.57569 59.33199

## 9 10

## 66.08828 72.84458

# The difference between the observations and the fitted values

beer_data$EBC-fitted(beer_mod1)

## 1 2 3 4 5 6 7 8

## -9.755545 4.617167 -3.774739 5.164924 4.827781 6.868209 -2.585131 0.478904

## 9 10

## 1.482049 -7.323618

# The residuals reported from the model object

residuals(beer_mod1)

## 1 2 3 4 5 6 7 8

## -9.755545 4.617167 -3.774739 5.164924 4.827781 6.868209 -2.585131 0.478904

## 9 10

## 1.482049 -7.323618

# How close are the model-reported residuals to our calculated ones?

(beer_data$EBC-fitted(beer_mod1)) / residuals(beer_mod1)

## 1 2 3 4 5 6 7 8 9 10

## 1 1 1 1 1 1 1 1 1 1

Recall that the linear model assumes that the residuals are normally distributed with mean zero. If this
assumption does not hold, our analysis is suspect, for two reasons:

A. the structure of the model that we are using might be a poor representation of the data and/or


B. inferential statistics generated from the model (more on these later!) may not be valid.

QUESTION 2: Make a histogram of the residuals from beer_mod1 .

Model diagnostics
Base R graphics produces useful diagnostic plots simply by calling the plot() function on the model
object:

plot(beer_mod1)

[Figure: base R diagnostic plots for beer_mod1 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)]

The performance package provides similar results to plot(lm) but with more graphical detail and a nicer
aesthetic:

library(performance)

check_model(beer_mod1)

[Figure: performance::check_model() diagnostic panels for beer_mod1]

In this class, we’ll use the performance package output because it is a bit easier to use in presenting your
results.

So, what do all of these figures mean? You should be familiar (from last week) with the upper two panels in the
performance output. The upper left-hand panel is a quantile-quantile plot and the upper right-hand panel is a
density plot of the residuals with a theoretical normal distribution superimposed on top of it. These two plots
provide a means to assess whether the model residuals are approximately normally distributed. For
beer_mod1 the residuals aren’t perfectly normal, but they are pretty good (particularly for a dataset with only
10 observations).

The two lower panels compare the fitted values (ie, the predicted y-values) against the corresponding residuals.
This allows us to assess whether there are any patterns in the residuals that might be cause for concern. The
two things we are looking for are a relatively even spread of points from left to right and no obvious evidence of
non-linearity. This dataset does not have any obvious issues with spread (the points are as widely spread on
the y-axis at low fitted values as they are at high fitted values), but it does have a bit of a curve to it. This is
highlighted by the fifth panel, which shows the outliers. Notably, the first and last observations are considered
outliers, which likely indicates some curvilinearity in these data. However, over the range of observations in the
dataset, the adjusted R² is 0.9194, which means that the temperature variable explains about 92% of the observed
variation in EBC , which is extremely good!

It is worth noting that we can also use the performance package to check specific issues quantitatively. Here
we request specific statistical tests for normality and heteroscedasticity, as well as the identification of any
outliers in the dataset:

check_normality(beer_mod1)

## OK: residuals appear as normally distributed (p = 0.299).


check_heteroscedasticity(beer_mod1)

## OK: Error variance appears to be homoscedastic (p = 0.411).

check_outliers(beer_mod1)

## Warning: 2 outliers detected (cases 1, 10).

Comparing two models


To illustrate the importance of selecting an appropriate model and give you the opportunity to compare models,
let’s make a second model that has no y-intercept (ie, the y-intercept is fixed at 0):

beer_mod2 <- lm(EBC ~ -1 + temp, data = beer_data) # -1 removes the y-intercept from the model

summary(beer_mod2)

##

## Call:

## lm(formula = EBC ~ -1 + temp, data = beer_data)

##

## Residuals:

## Min 1Q Median 3Q Max

## -32.286 -10.883 1.570 9.418 15.718

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## temp 0.27007 0.02894 9.331 6.35e-06 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 15.16 on 9 degrees of freedom

## Multiple R-squared: 0.9063, Adjusted R-squared: 0.8959

## F-statistic: 87.08 on 1 and 9 DF, p-value: 6.345e-06

QUESTION 3: Compare the parameter estimates, residuals, goodness-of-fit statistics, and diagnostic graphics
for beer_mod1 and beer_mod2 . Which is a better model and why?

Working with the Fitted Model


We can use the linear regression model to predict values of the response variable for new values of the
predictor variable (ie, values of the predictor variable that were not in the original dataset). To do this, we use
the function predict() , which requires the name of the linear model for making the prediction and the new
data that we want to predict from. For example, if we want to estimate the color of a brew that was roasted at
172°C, we would do this:

predict(beer_mod1, newdata = data.frame(temp = 172))


## 1

## 49.19755

This returns a single EBC value predicted by the beer_mod1 model. The newdata argument specifies that we
want the model to take the values for the predictor variable from the supplied data frame. Two critical points to
note: 1) this must have the same variable name as in the dataset used to fit the model and 2) the new data
needs to be wrapped in the data.frame() function.

If we want predictions over a range of temperatures, we can specify a vector of values. For example, to get the
EBC for each full degree between 140°C and 160°C, we would use:

predict(beer_mod1, newdata = data.frame(temp = seq(from = 140, to = 160, by = 1)))

## 1 2 3 4 5 6 7 8

## 22.17236 23.01690 23.86144 24.70597 25.55051 26.39505 27.23959 28.08412

## 9 10 11 12 13 14 15 16

## 28.92866 29.77320 30.61773 31.46227 32.30681 33.15134 33.99588 34.84042

## 17 18 19 20 21

## 35.68495 36.52949 37.37403 38.21857 39.06310

An additional consideration is how much certainty our model provides us about that prediction. For example,
for a temperature of 180°C, we can additionally obtain the standard error of the prediction:

predict(beer_mod1, newdata = data.frame(temp = 180), se.fit = TRUE)

## $fit

## 1

## 55.95384

##

## $se.fit

## [1] 2.322588

##

## $df

## [1] 8

##

## $residual.scale

## [1] 6.027415

QUESTION 4: The maximum achievable temperature with a new piece of roasting equipment is 180°C. What
value of EBC do we expect for it? How uncertain are we about this? If our target EBC is 58, is this equipment
suitable for us?

QUESTION 5: What EBC do we expect for a roasting temperature of 100°C? How is this distinct from the
values we predicted above? Is this prediction trustworthy?

Another useful application of the predict() function is to plot the behaviour of our model. For example, let’s
plot the predicted EBC from our model as a function of temperature, across a range of temperatures including
all of our data. Let’s start by defining a range of values for temperature and then predicting EBC for all of those
temperatures:


new_tempdata <- data.frame(temp = seq(from = 120, to = 210, by = 0.1))

new_EBC <- predict(beer_mod1, newdata = new_tempdata, se.fit = TRUE)

new_tempdata$EBC <- new_EBC$fit

new_tempdata$EBC_SE <- new_EBC$se.fit

length(new_EBC) # predict() returned a list of four elements: fit, se.fit, df, residual.scale

## [1] 4

Now we have a dataframe containing a range of temperatures, the model-predicted EBC for each temperature
value, and the standard error of the prediction at each temperature. Let’s try plotting all of this in ggplot . We
use the predicted data to plot a line representing the model:

pred_EBC <- ggplot(data = new_tempdata, aes(x = temp, y = EBC)) +

geom_line() +

labs(x ='Temperature (Celsius)', y = 'EBC') +

xlim(125,205)

pred_EBC

Note that we assigned this ggplot call to an object called pred_EBC . This will make it easy for us to add
layers of data on top of this graph. Now let’s add the original data points:

pred_EBC <- pred_EBC + geom_point(data = beer_data, aes(x = temp, y= EBC))

pred_EBC

[Figure: model-predicted EBC line with the original observations overlaid]

This is a very illustrative style of plot, as it displays both our fitted model and the observations upon which it is
based. This is often easier to use than residual plots to examine the appropriateness of the shape of the
model.

It is also useful to show the uncertainty in the predictions. Let’s add lines representing 2 times the standard
error (SE). To do this, we add one line at the predicted value plus 2 SE and one line at the predicted value
minus 2 SE. Note that, when presenting this kind of plot, it is essential to describe in your caption what the
lines represent, as this cannot be assumed (ie, there is no accepted standard for this).

pred_EBC <- pred_EBC +

geom_line(data = new_tempdata, aes(x = temp, y = (EBC+2*EBC_SE)), linetype = 'dashed') +

geom_line(data = new_tempdata, aes(x = temp, y = (EBC-2*EBC_SE)), linetype = 'dashed') +

theme_bw()

pred_EBC

[Figure: fitted line with dashed lines at ±2 SE and the observed data points]
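
An alternative way to present this uncertainty (a sketch using the same new_tempdata and beer_data objects
created above) is a shaded ribbon instead of dashed lines:

# Ribbon version of the +/- 2 SE band around the fitted line
ggplot(data = new_tempdata, aes(x = temp, y = EBC)) +
  geom_ribbon(aes(ymin = EBC - 2 * EBC_SE, ymax = EBC + 2 * EBC_SE),
              fill = 'grey80') +
  geom_line() +
  geom_point(data = beer_data, aes(x = temp, y = EBC)) +
  labs(x = 'Temperature (Celsius)', y = 'EBC') +
  theme_bw()

The ribbon draws the eye to the band as a single region rather than two separate curves, which can be easier
to read when several models are overlaid.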

QUESTION 6: Why are the prediction intervals (ie, the dashed lines) not straight?

QUESTION 7: Repeat this analysis to make a figure showing the observed data, the fitted line from the model,
and the prediction interval using beer_mod2 .

