Introducing The Linear Model
Preamble
In our prior sessions we have focused on the exploration, description, and visualization of variables. This will
be our first session to explore methods for understanding the associations between variables. Many questions
in life sciences research can be expressed as estimation of the associations between variables, so these
methods are a very important part of the methodological toolkit for researchers.
Analyses of associations between variables are a common task in R, so there are a variety of ways to do this.
Over the next few sessions we will explore the basic operations in R for conducting statistical analyses of
associations between different types of variables, and the interpretation of the results that you obtain from
these analyses.
The emphasis in this prac will be on the mechanics of linear regression models and the implementation of
these analyses in R. The overall goal of the session is to introduce you to the process of modelling
associations between variables, in preparation for the subsequent sessions which will focus in detail on
elements of these analyses and ways in which the basic framework can be modified for different research
problems.
By the end of this prac, you should be able to:
1. assess and prepare data for analysis with the linear regression model
2. define a model formula from the available variables and justify the structure of a linear regression model
3. generate a linear regression model in R and interrogate it using various utility functions
4. define ‘residual’, explain the importance of residuals in linear regression models, and assess
distributions of residuals
5. understand the concept of prediction from a linear model
Assessment
This prac is not assessed. So, kick back and relax! (But not too much, because this material will be in the next assessment and on the final exam.)
https://www.aeaweb.org/articles?id=10.1257/jep.22.4.199
You have conducted an experiment evaluating the relationship between the roasting temperature of malt and the colour of the resulting brew. Colour is an important element of quality assessment in the brewing industry. Before fermentation, malted barley, the source of sugars (especially maltose), is roasted in an oven to generate important colour and flavour compounds, especially for darker beer varieties such as porter or stout. These chemicals arise via Maillard reactions.
Optimizing the conditions of brewing was the major focus of Gosset’s scientific career. Several issues are
important in making a good brew. If the roasting temperature is too high it increases the overall cost of the
process and risks making a final brew that is too dark. Conversely, if the roasting temperature is too low it will produce a final brew that is too light and weakly flavoured to satisfy the consumer.
In this prac we are going to explore the relationship between final brew colour and roasting temperature.
Imagine an experiment in which you roasted malts of the same variety at several different temperatures. The
other roasting conditions, including the roasting time, were comparable, and the post-roasting processing was
the same. In short, the primary source of variation in the final brew colour is the roasting temperature because
other sources of variability have been controlled to be approximately the same.
After brewing, the color of the final product was assessed using the European Brewing Convention (EBC)
scale, which assigns a positive numeric value, with higher values representing darker color. This number is
derived from a measurement of the absorption of 430 nm light by a sample of the beer (absorption
spectroscopy).
Temperature is a classic continuous variable, and in practical applications its lower bound lies far enough from the measured range to be irrelevant, so it should behave very sensibly when modelled as a continuous variable. EBC, though it
includes only positive values, is derived directly from a real physical measurement, so it should be reasonable
to treat it as a continuous variable. Remember that this assessment of a variable has important implications for
the selection of appropriate analyses. In this case, the characteristics of the data suggest that a linear
regression model may be an appropriate technique for analysis of the data.
The data are contained in the file beer.Rdata which has been pre-loaded in the R Markdown setup chunk.
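Before we start modelling, it is worth a quick look at the data. A minimal sketch (this assumes only that beer_data contains the two columns used throughout this prac, temp and EBC):

# inspect the structure and ranges of the pre-loaded data
str(beer_data)
summary(beer_data)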
The linear model builds on the familiar equation for a straight line:

y = mx + b

This equation describes the relationship between two numeric variables, x and y, using two parameters: m, which is the
slope (or gradient) of the function, and b, which is the y-intercept of the function (the value of y when x equals
zero). The slope represents the amount of change in the value of y that is expected with a one unit change in
the value of x .
Note: depending on where you went to high school you may have also learned this equation in a
slightly different form. Some common alternatives include:
y = mx + c
y = ax + b
Throughout most of this class we will use a powerful modelling approach known as the general linear model. In this week’s prac, we will be working with a special case of the general linear model commonly referred to as linear regression. The form of this model is very close to the simple linear function:
yᵢ = mxᵢ + b + ϵᵢ,  where ϵᵢ ∼ N(0, σ²)
Like the equation for the line, there is the variable y, which we generally refer to as the response or dependent
variable, and the variable x , which we generally refer to as the predictor or independent variable. For linear
regression, both x and y are continuous numeric variables. The main difference between the linear model and
the equation for the line is the additional term ϵ, which is called the residual. The residuals represent the variability of the observed y-values relative to the y-values predicted by the deterministic part of the right-hand side of the equation.
You should also note that the ϵ values are assumed to be normally distributed with a mean of 0 and a variance of σ². This is a key assumption of linear models and one that we will spend some time confirming in our post-analysis diagnostics.
These mathematical characteristics demonstrate the importance of various assumptions of the linear
regression model. These can be inferred from the structure. The most important ones to remember are:
A. the observations are independent of one another
B. the residuals are normally distributed with a mean of zero
C. the residuals are homogeneous (ie, have constant variance) across the range of the predictor variable
The assumption of independence (A) is the one most often violated by the design of a study. For example:
If we take multiple samples of grass biomass and nutrient content from one corner of a paddock near a
stream, then the observations are not spatially independent with respect to the broader population.
If we measure tree growth in one year, it is likely that growth in the previous year was relatively similar
(because the forest hasn’t changed much in a year). As a consequence, the growth rates from one year
to the next are not temporally independent.
If we record bird responses to a stimulus, but the first bird’s behaviour influences the behaviour of other
birds, then subsequent measures of the other birds may be spatially and temporally dependent on the
first measurement.
Most of these problems are ones of experimental design. Using randomised or stratified random sampling will
help to reduce or eliminate potential lack of independence amongst the observations. Addressing these
problems during the analysis stage of a project is complicated, requires sophisticated autocorrelation modelling
tools, and is often not very satisfying or successful. We will not address them any further in this prac.
Let’s see how these concepts are implemented in R using our beer brewing example. The simplest call to
lm() requires two arguments: the formula describing the model and the data to be modelled.
The model formula follows the basic form of the model equation described above: y ~ x , where y is the
response variable and x is the predictor variable.
For the beer data, we could define the model for colour as a function of temperature using the following
formula: EBC ~ temp .
Let’s go ahead and ask R to make us a linear model for color as a function of temperature using the following
command. We’ll assign the linear model to an object that we call beer_mod1 :
If we want to look at a summary of the linear model analysis, we use the function summary() with the name of
the model that we are using in the brackets:
summary(beer_mod1)
(The summary output appears here: the model Call, a five-number summary of the Residuals, the Coefficients table with estimates, standard errors, t-values, p-values and significance codes, and the residual standard error, R², F-statistic and p-value.)
This gives us a lot of information about the linear model describing the relationship between temp and EBC .
Let’s step through the output and see what we can learn about our analysis. Some of these reported statistics
will be covered in detail over the next few practical sessions, so it’s not necessary to master all of them at
once. However, by the end of the semester you should be quite comfortable parsing all of this information. The
output is organised in four chunks:
Model structure This returns the model that we typed in. Its primary purpose is to serve as a record of the model associated with its results. This is particularly useful when we are comparing many models and want to quickly check the details of each model’s structure.
Distribution of residuals A core assumption of the linear model is that the residuals are normally distributed with a mean of 0 and a variance of σ². So, we should be looking for the median of the residuals to be close to 0 and the 1st and 3rd quartiles (1Q and 3Q) to have similar absolute values. This is a quick-and-dirty way to assess the residuals; we’ll discuss more sophisticated approaches later in today’s prac.
Model coefficients This chunk provides the best estimates of the model coefficients (ie, of m and b), as well as some statistics describing them. Notice that there are two coefficients. (Intercept) is the y-intercept (ie, b). While we did not request it in the model description, R assumes that you want to estimate it (if you don’t, you can add -1 to the model formula and it will not estimate the y-intercept). temp is the slope (ie, m). The first column ( Estimate ) gives the value of each parameter that best fits the data: given the data in the beer dataset, these estimates define the linear model that minimises the residual variability.
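If you just want these two estimates on their own, the utility function coef() extracts them from the model as a named vector:

coef(beer_mod1)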
These are the best estimates of the parameters given the data available. However, because these are
estimated parameters, we also want to know about the uncertainty associated with them. The second column
( Std. Error ) provides that. The standard error of the parameter estimates gives us an indication of how
clearly the data describe the parameters. If the standard error is large relative to the parameter estimate, then
we will not have much confidence in that parameter estimate. If it is small relative to the parameter estimate, it
suggests that we can have some confidence in the parameter estimates. The third and fourth columns provide
an indication of whether the parameter estimates are significantly different from 0. We won’t talk much about that this week, but we will return to it later in the semester.
Model strength The last chunk describes how much variability in the dataset the model explains and how much remains unexplained. The residual standard error (RSE) describes the unexplained variability in the model; it is an estimate of σ, the standard deviation of the residuals ϵ in the equation for the linear model. The coefficient of determination (R²) describes the proportion of the variability in y that is explained by its relationship with x. It takes values between 0 and 1, with higher values representing a better fit. We will discuss this statistic in detail in future pracs. Finally, the F-statistic and p-value can be used when comparing several models that are trying to predict the same response variable. This, too, we will discuss in detail later in the semester.
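As an aside, these statistics can also be extracted programmatically from the summary object, which is handy for reporting:

summary(beer_mod1)$r.squared # coefficient of determination (R-squared)
summary(beer_mod1)$sigma # residual standard error (RSE)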
Let’s explore some of these outputs in more depth. We’ll start by looking at the residuals in more detail using the utility function residuals():
residuals(beer_mod1)
##       ...          9         10
##       ...   1.482049  -7.323618
Notice that the vector returned is the same size as the data that we used to build the model:
length(residuals(beer_mod1))
## [1] 10
length(beer_data$EBC)
## [1] 10
This is not a coincidence. There is a residual for each observation. The model predicts a y-value for each x in
the dataset. The residual is the difference between the predicted (or fitted) y-value and the observed y-value
(ie, the one in the dataset). We can check this:
# The response data we used to fit the model i.e. the observations
beer_data$EBC
fitted(beer_mod1)
##       ...          9         10
##       ...   66.08828   72.84458
beer_data$EBC-fitted(beer_mod1)
##       ...          9         10
##       ...   1.482049  -7.323618
residuals(beer_mod1)
##       ...          9         10
##       ...   1.482049  -7.323618
(beer_data$EBC-fitted(beer_mod1)) / residuals(beer_mod1)
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
Recall that the linear model assumes that the residuals are normally distributed with mean zero. If this
assumption does not hold, our analysis is suspect, for two reasons:
A. the structure of the model that we are using might be a poor representation of the data and/or
B. inferential statistics generated from the model (more on these later!) may not be valid.
Model diagnostics
Base R graphics produces useful diagnostic plots simply by calling the plot() function on the model object:
plot(beer_mod1)
(This produces the four standard base-R diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.)
The performance package provides similar results to plot(lm) but with more graphical detail and a nicer
aesthetic:
library(performance)
check_model(beer_mod1)
In this class, we’ll use the performance package output because it is a bit easier to use in presenting your
results.
So, what do all of these figures mean? You should be familiar (from last week) with the upper two panels in the
performance output. The upper left-hand panel is a quantile-quantile plot and the upper right-hand panel is a
density plot of the residuals with a theoretical normal distribution superimposed on top of it. These two plots
provide a means to assess whether the model residuals are approximately normally distributed. For
beer_mod1 the residuals aren’t perfectly normal, but they are pretty good (particularly for a dataset with only
10 observations).
The two lower panels compare the fitted values (ie, the predicted y-values) against the corresponding residuals. This allows us to assess whether there are any patterns in the residuals that might be cause for concern. The two things we are checking for are a relatively even spread of points from left to right and any obvious evidence of non-linearity. This dataset does not have any obvious issues with spread (the points are as widely spread on the y-axis at low fitted values as they are at high fitted values), but it does have a bit of a curve to it. This is highlighted by the fifth panel, which shows the outliers. Notably, the first and last observations are considered outliers, which likely indicates some curvilinearity in these data. However, over the range of observations in the dataset, the R² is 0.9194, which means that the temperature variable explains 91.94% of the observed variation in EBC, which is extremely good!
It is worth noting that we can also use the performance package to check specific issues quantitatively. Here
we request specific statistical tests for normality and heteroscedasticity, as well as the identification of any
outliers in the dataset:
check_normality(beer_mod1)
check_heteroscedasticity(beer_mod1)
check_outliers(beer_mod1)
For comparison, let’s fit a second model that omits the y-intercept:

beer_mod2 <- lm(EBC ~ -1 + temp, data = beer_data) # -1 removes the y-intercept from the model
summary(beer_mod2)
(The summary output for beer_mod2 appears here, in the same format as the summary of beer_mod1 above.)
QUESTION 3: Compare the parameter estimates, residuals, goodness-of-fit statistics, and diagnostic graphics
for beer_mod1 and beer_mod2 . Which is a better model and why?
Prediction

A fitted model can also be used to predict the response for new values of the predictor, using the utility function predict(). Calling predict() on beer_mod1 with a single new temperature returns the predicted EBC at that temperature:
## 1
## 49.19755
This returns a single EBC value predicted by the beer_mod1 model. The newdata argument specifies that we
want the model to take the values for the predictor variable from the supplied data frame. Two critical points to note: 1) the new data frame must use the same variable name as the dataset used to fit the model, and 2) the new data needs to be wrapped in the data.frame() function.
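As an aside, if you omit the newdata argument entirely, predict() simply returns the model’s predictions for the original observations, identical to the output of fitted():

predict(beer_mod1) # identical to fitted(beer_mod1)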
If we wanted to predict a range of temperatures, we can specify a vector of values. For example, to get the EBC for each full degree between 140°C and 160°C, we would use:
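# predict EBC at each whole-degree temperature from 140 to 160
predict(beer_mod1, newdata = data.frame(temp = 140:160))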
## (21 predicted EBC values, one for each whole degree from 140°C to 160°C)
An additional consideration is how much certainty our model provides us about that prediction. For example, for a temperature of 172°C, we can additionally obtain the standard error of the prediction by setting se.fit = TRUE:
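predict(beer_mod1, newdata = data.frame(temp = 172), se.fit = TRUE)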
## $fit
## 1
## 55.95384
##
## $se.fit
## [1] 2.322588
##
## $df
## [1] 8
##
## $residual.scale
## [1] 6.027415
QUESTION 4: The maximum achievable temperature with a new piece of roasting equipment is 180°C. What value of EBC do we expect for it? How uncertain are we about this? If our target EBC is 58, is this equipment suitable for us?
QUESTION 5: What EBC do we expect for a roasting temperature of 100°C? How is this distinct from the values we predicted above? Is this prediction trustworthy?
Another useful application of the predict() function is to plot the behaviour of our model. For example, let’s
plot the predicted EBC from our model as a function of temperature, across a range of temperatures including
all of our data. Let’s start by defining a range of values for temperature and then predicting EBC for all of those
temperatures:
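A minimal sketch of this step (the object names new_temp and pred_data, and the exact temperature range, are assumptions chosen to match the axis limits used in the plots below):

# a grid of temperatures spanning slightly beyond the observed data
new_temp <- data.frame(temp = seq(125, 205, by = 1))
# predictions with standard errors: a list with elements $fit, $se.fit, $df and $residual.scale
new_EBC <- predict(beer_mod1, newdata = new_temp, se.fit = TRUE)
# collect the temperatures, predictions, and standard errors into one data frame for plotting
pred_data <- data.frame(new_temp, fit = new_EBC$fit, se.fit = new_EBC$se.fit)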
length(new_EBC)
## [1] 4
Now we have a dataframe containing a range of temperatures, the model-predicted EBC for each temperature
value, and the standard error of the prediction at each temperature. Let’s try plotting all of this in ggplot . We
use the predicted data to plot a line representing the model (assuming the predictions are in the pred_data data frame built above):

pred_EBC <- ggplot(pred_data, aes(x = temp, y = fit)) +
  geom_line() +
  xlim(125, 205)

pred_EBC
Note that we assigned this ggplot call to an object called pred_EBC . This will make it easy for us to add
layers of data on top of this graph. Now let’s add the original data points:
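One way to do this is with a geom_point() layer drawing the observations from beer_data over the fitted line (a sketch, building on the pred_EBC object defined above):

pred_EBC <- pred_EBC + geom_point(data = beer_data, aes(x = temp, y = EBC))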
pred_EBC
This is a very illustrative style of plot, as it displays both our fitted model and the observations upon which it is
based. This is often easier to use than residual plots to examine the appropriateness of the shape of the
model.
It is also useful to show the uncertainty in the predictions. Let’s add dashed lines spanning 2 standard errors (SE): one line at the predicted value plus 1 SE and one line at the predicted value minus 1 SE. Note that, when presenting this kind of plot, it is essential to describe in your caption what the lines represent, as this cannot be assumed (ie, there is no accepted standard for this).
pred_EBC <- pred_EBC +
  geom_line(aes(y = fit + se.fit), linetype = "dashed") +
  geom_line(aes(y = fit - se.fit), linetype = "dashed") + theme_bw()

pred_EBC
QUESTION 6: Why are the prediction intervals (ie, the dashed lines) not straight?
QUESTION 7: Repeat this analysis to make a figure showing the observed data, the fitted line from the model,
and the prediction interval using beer_mod2 .