MZB127 Topic 11 Lecture Notes (Unannotated Version)
Linear Regression
Preface
In this final week, we look at the case of two numerical variables and consider whether they may
have a linear (straight line) relationship. The associated analysis method is known as linear
regression; because it is based on some statistical assumptions about our observations, it also
allows us to test statistically whether there is evidence of such a relationship between these
variables. We will be heavily relying upon Microsoft Excel for this week’s content.
11.1 Exploring Linear Relationships
In practice, however, the random scatter in real data means that we cannot be completely confident in determining what value of the response variable would be observed in
association with a particular value of the explanatory variable. In other words, we may not be
completely sure what the true values of β0 or β1 are, and hence we are not completely sure
what value of y would result from a particular value of x. We might naturally then ask how
well we can estimate these values, which immediately reminds us of the questions we asked
in preceding chapters about using sample data to estimate true values of parameters. Thus,
it will be useful here to introduce a statistical approach to determining the (proposed linear)
relationship between the variables y and x. We motivate and illustrate this approach throughout
this chapter using the “Fishing expedition” dataset described in Example 11.1.1.
Examples
Figure 11.1: Data for first 30 of 57 fish caught on the fishing expedition.
1. Select all of the data (this will consist of two columns and multiple rows,
excluding the column headings).
2. Select “Insert”, then in the “Charts” section, select “Insert Scatter (X, Y) or
Bubble Chart”, then select the top-left option “Scatter”. This will produce
the scatterplot.
3. If you wish to change text shown on the scatterplot (e.g. modify the "Chart
Title"), click twice on the text of interest and then type your changes.
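The same plot can also be produced outside Excel. Below is a minimal Python sketch, assuming (purely for illustration) that the data have been exported to a hypothetical file fish.csv with columns named Length and Weight; neither the file name nor the column names come from the dataset itself:

```python
# Minimal sketch: scatterplot of fish weight vs length.
# Assumes a hypothetical file "fish.csv" with columns "Length" and "Weight".
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("fish.csv")
plt.scatter(data["Length"], data["Weight"])
plt.xlabel("Length")
plt.ylabel("Weight")
plt.title("Fish weight vs length")
plt.show()
```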
(a) Treating fish weight as the dependent variable y, and treating fish length as the
independent variable x, will it be possible to exactly fit the equation
$$y = \beta_0 + \beta_1 x$$
to the data shown in the scatterplot of fish weight vs fish length (Figure 11.2)?
(b) In your opinion, does it look like the proposed relationship between fish weight (y)
and fish length (x) shown in Figure 11.2 could be approximately linear?
Since real data will rarely fall exactly on a straight line, we account for the scatter by adding a random error term to the proposed linear relationship, giving the linear regression model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $\varepsilon_i$ is the error in the observation of $y_i$. (Note in Example 11.1.1 that $n = 57$, so $i$ can take values of 1, 2, 3, etc. up to 57.)
Equivalently, we can consider $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$ as the difference between the observed value $y_i$ and the value it should take according to the underlying linear model. This difference is called the residual (for that observation). Now, if we are considering $\varepsilon_i$ as a random error or scatter, it makes sense to assume a probability distribution for these quantities $\varepsilon_i$. In practice, we make the following (reasonable) assumptions about the residuals:

1. The errors $\varepsilon_i$ and $\varepsilon_j$ are identically distributed but independent of one another for all $i \neq j$.

2. The errors have mean zero: $E(\varepsilon_i) = 0$.

3. Each error is normally distributed with variance $\sigma^2$: $\varepsilon_i \sim N(0, \sigma^2)$.

Note that Assumption 1 implies that $\sigma^2$ in Assumption 3 is the same for all observations.
These assumptions can be summarised in the statement $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, where "iid" is short for independent and identically distributed. Having assumed a distribution for these errors, we can also make equivalent statements about the distribution of the observed response values $y_i$, as follows:
$$\begin{aligned}
E(y_i) &= \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i, \\
\mathrm{Var}(y_i) &= 0 + 0 + \mathrm{Var}(\varepsilon_i) = \sigma^2,
\end{aligned}$$

so $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$ for all $i$. (Recall that $\beta_0$ and $\beta_1$ are (unknown) constants and that we have assumed the $x_i$ values have been observed without any error, so they are effectively known constants here.)
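To make these statements concrete, here is a short simulation sketch: if we repeatedly generate $y$ values at a fixed $x$ according to the model, their sample mean and variance should match $\beta_0 + \beta_1 x$ and $\sigma^2$. All parameter values below are illustrative assumptions, not taken from the notes:

```python
# Sketch: simulate y = b0 + b1*x + e, with e ~ N(0, sigma^2) iid,
# at a fixed x, and check the implied mean and variance of y.
import numpy as np

rng = np.random.default_rng(1)
b0, b1, sigma = -190.0, 2.4, 68.0   # illustrative values only
x0 = 250.0                          # a fixed value of the explanatory variable
reps = 100_000

y = b0 + b1 * x0 + rng.normal(0.0, sigma, size=reps)
print(y.mean())   # should be close to b0 + b1*x0 = 410.0
print(y.var())    # should be close to sigma**2 = 4624.0
```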
Not assessed: It is possible to show that, for a set of data $(x_i, y_i)$, $i = 1, \ldots, n$, fitted to the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$

$$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2,$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Whilst we could calculate $\hat{\beta}_0$, $\hat{\beta}_1$ and $s$ by hand using the formulas above, in practice we would usually use statistical software packages (e.g. the "Trendline" option and/or the "Regression" Analysis Tool in Microsoft Excel) to do these calculations for us. Furthermore, these packages can give us several additional useful quantities, beyond just the sample estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $s$.
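As an illustration of what such software does internally, here is a minimal Python sketch applying the "by hand" formulas above. The x and y values are made up purely for illustration (they are not the fishing data), and the result is cross-checked against numpy's built-in line fitting:

```python
# Sketch: the "by hand" formulas for the slope, intercept and s,
# applied to illustrative (made-up) data.
import numpy as np

x = np.array([200.0, 220.0, 250.0, 270.0, 300.0])
y = np.array([290.0, 340.0, 420.0, 470.0, 540.0])
n = len(x)
xbar, ybar = x.mean(), y.mean()

b1_hat = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
b0_hat = ybar - b1_hat * xbar
s = np.sqrt(np.sum((y - b0_hat - b1_hat * x) ** 2) / (n - 2))

print(b1_hat, b0_hat, s)
print(np.polyfit(x, y, 1))  # cross-check: returns [slope, intercept]
```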
For example, we may be interested in how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to the true values $\beta_0$ and $\beta_1$. To assess this, we can obtain sample estimates for the standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$, denoted as $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$.
Not assessed: It is possible to show that, for a set of data $(x_i, y_i)$, $i = 1, \ldots, n$, fitted to the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$:

$$s_{\hat{\beta}_0}^2 = \frac{s^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_{\hat{\beta}_1}^2 = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
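These two formulas are also quick to evaluate directly. A sketch on the same made-up data as before (the fitting steps are repeated so the block runs on its own):

```python
# Sketch: sample standard deviations of the intercept and slope estimates,
# using the formulas above on illustrative (made-up) data.
import numpy as np

x = np.array([200.0, 220.0, 250.0, 270.0, 300.0])
y = np.array([290.0, 340.0, 420.0, 470.0, 540.0])
n = len(x)
b1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0_hat = y.mean() - b1_hat * x.mean()
s2 = np.sum((y - b0_hat - b1_hat * x) ** 2) / (n - 2)

Sxx = np.sum((x - x.mean()) ** 2)
s_b0 = np.sqrt(s2 * np.sum(x**2) / (n * Sxx))  # std deviation of intercept estimate
s_b1 = np.sqrt(s2 / Sxx)                       # std deviation of slope estimate
print(s_b0, s_b1)
```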
Furthermore, we may be interested in the proportion of variation in the response variable ($y$) that is explained by fitting the linear regression model. This quantity is labelled $R^2$, and takes a value between 0 and 1 inclusive. If $R^2$ is closer to 1, then a greater proportion of the variation in $y$ is explained by the regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. If $R^2$ is closer to 0, then a smaller proportion of the variation in $y$ is explained by this regression model.
Not assessed: It is possible to show that, for a set of data $(x_i, y_i)$, $i = 1, \ldots, n$, fitted to the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ is the fitted value for observation $i$.
All of these quantities can be calculated by hand but we usually use statistical software to
perform the calculations for us instead.
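A sketch of the $R^2$ formula on the same made-up data, with a cross-check against the squared correlation coefficient (which equals $R^2$ for simple linear regression):

```python
# Sketch: R^2 from its definition, on illustrative (made-up) data.
import numpy as np

x = np.array([200.0, 220.0, 250.0, 270.0, 300.0])
y = np.array([290.0, 340.0, 420.0, 470.0, 540.0])
n = len(x)
b1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0_hat = y.mean() - b1_hat * x.mean()
y_fit = b0_hat + b1_hat * x   # fitted values y_hat_i

R2 = 1 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)
print(R2)
print(np.corrcoef(x, y)[0, 1] ** 2)  # cross-check: squared correlation
```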
If you are only interested in obtaining the sample estimate of the intercept ($\hat{\beta}_0$), the sample estimate of the slope ($\hat{\beta}_1$), and/or the proportion of the variation in the response variable explained by the linear regression model ($R^2$), then:
1. Click on the scatterplot, then click the “+” icon that appears to the top-right of
the scatterplot, then check “Trendline”.
2. A linear trendline will appear on your scatterplot. Right-click on the trendline, then
click “Format Trendline...”.
3. In “Trendline Options”, ensure that “Linear” is checked. Check “Display Equation
on chart” and “Display R-squared value on chart”.
4. A textbox will appear on your scatterplot in the format

   $y = \hat{\beta}_1 x + \hat{\beta}_0$
   $R^2 = \ldots$

   with the calculated values of the linear regression outputs $\hat{\beta}_0$, $\hat{\beta}_1$ and $R^2$ shown in place of the symbols above (see Figure 11.3 for an example).
Figure 11.3: Scatterplot of the weight vs length of fish caught (data from Example 11.1.1),
including trendline and linear regression parameter estimates β̂0 = −190.13, β̂1 = 2.444 and
R2 = 0.8639.
1. Select the "Data" tab, then click "Data Analysis" (in the "Analysis" section of
the ribbon).
2. In the list of Analysis Tools, select "Regression", then click OK.
3. For "Input Y Range", select all cells where the data is stored for the dependent
variable (y).
4. For "Input X Range", select all cells where the data is stored for the independent
variable (x).
5. Check the "Line Fit Plots" option.
6. Select "Output Range" and choose a cell that has no data below or to the right of
it.
7. Click OK.
(a) Several tables of linear regression outputs will appear, positioned with their
top-left corner in the cell you chose in Step 6 (see Figure 11.5 for an example).
In these tables:

- The sample estimate of the intercept ($\hat{\beta}_0$) is given by the number in the
"Coefficients" column and "Intercept" row of the third table.
- The sample estimate of the slope ($\hat{\beta}_1$) is given by the number in the
"Coefficients" column and "X Variable 1" row of the third table.
- The standard error of the estimate ($s$) is given by the number next to
"Standard Error" in the first table.
- The sample standard deviation of the intercept ($s_{\hat{\beta}_0}$) is given by the number
in the "Standard Error" column and "Intercept" row of the third table.
- The sample standard deviation of the slope ($s_{\hat{\beta}_1}$) is given by the number in
the "Standard Error" column and "X Variable 1" row of the third table.
- The proportion of the variation in the response variable explained by the linear
regression model ($R^2$) is given by the number next to "R Square" in the
first table.
(b) A "line fit plot", which appears because of the option you checked in Step 5 (see
Figure 11.4 for an example). To change the fitted line from dots to a line, right-click on the
dots making up the line, and select “Format Data Series...”. There are many
options there to make the fitted line plot prettier!
Figure 11.4: Line fit plot obtained from a linear regression analysis applied to the data for
weight vs length of fish caught (data from Example 11.1.1).
Figure 11.5: Outputs of a linear regression analysis applied to the data for weight vs length of
fish caught (data from Example 11.1.1). In this example, the "Residual Output" table has rows
for all 57 observations (only the first 29 observations are shown here!). This output gives us
that $\hat{\beta}_0 \approx -190.13$, $\hat{\beta}_1 \approx 2.444$, $s \approx 67.944$, $s_{\hat{\beta}_0} \approx 35.52$, $s_{\hat{\beta}_1} \approx 0.131$, and $R^2 \approx 0.8639$.
Under the model assumptions described earlier, it can be shown that the following quantities follow a $t$-distribution with $d = n - 2$ degrees of freedom:

$$t = \frac{\hat{\beta}_0 - \beta_0}{s_{\hat{\beta}_0}}, \qquad d = n - 2,$$

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}, \qquad d = n - 2.$$
(Not assessed: The use of n − 2 degrees of freedom here is related to the fact that we have
used linear regression to estimate two quantities, β0 and β1 .)
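A sketch of these $t$ statistics in Python, again on the made-up data from the earlier sketches; the hypothesised true values below are set to zero purely for illustration:

```python
# Sketch: t statistics for the intercept and slope, with d = n - 2.
import numpy as np

x = np.array([200.0, 220.0, 250.0, 270.0, 300.0])
y = np.array([290.0, 340.0, 420.0, 470.0, 540.0])
n = len(x)
b1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0_hat = y.mean() - b1_hat * x.mean()
s2 = np.sum((y - b0_hat - b1_hat * x) ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)
s_b0 = np.sqrt(s2 * np.sum(x**2) / (n * Sxx))
s_b1 = np.sqrt(s2 / Sxx)

beta0_hyp, beta1_hyp = 0.0, 0.0      # hypothesised true values (illustrative)
t0 = (b0_hat - beta0_hyp) / s_b0     # compare to t-distribution with d = n - 2
t1 = (b1_hat - beta1_hyp) / s_b1
print(t0, t1, n - 2)
```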
Combining the formulas above with what we have learned in Chapter 10, this means that we
can:
1. Construct confidence intervals for the true value of the intercept ($\beta_0$) and the true value
of the slope ($\beta_1$) – see Section 10.1; and

2. Perform hypothesis tests to compare the sample values of the intercept ($\hat{\beta}_0$) and slope ($\hat{\beta}_1$)
obtained from our $(x, y)$ data to separate pre-existing hypotheses about the true values of
the intercept ($\beta_0 = $ some number) and slope ($\beta_1 = $ some number) – see Section 10.2.
Following on from the first point above, and using similar mathematical procedures to those
described in Sections 10.1.1 and 10.1.2, we can obtain that the confidence intervals, for a
confidence level of $(1 - \alpha)$, for the true values of the parameters $\beta_0$ and $\beta_1$, are given by:

$$\hat{\beta}_0 \pm t^* s_{\hat{\beta}_0} \qquad \text{and} \qquad \hat{\beta}_1 \pm t^* s_{\hat{\beta}_1},$$

where $t^*$ is the critical value of the $t$-distribution with $d = n - 2$ degrees of freedom, chosen as in Section 10.1 so that the interval has confidence level $(1 - \alpha)$.
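A sketch of these confidence intervals, using scipy to obtain the $t$ critical value (same made-up data as in the earlier sketches):

```python
# Sketch: 95% confidence intervals for beta0 and beta1 (illustrative data).
import numpy as np
from scipy import stats

x = np.array([200.0, 220.0, 250.0, 270.0, 300.0])
y = np.array([290.0, 340.0, 420.0, 470.0, 540.0])
n = len(x)
b1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0_hat = y.mean() - b1_hat * x.mean()
s2 = np.sum((y - b0_hat - b1_hat * x) ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)
s_b0 = np.sqrt(s2 * np.sum(x**2) / (n * Sxx))
s_b1 = np.sqrt(s2 / Sxx)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, n - 2)              # t critical value, d = n - 2
print((b0_hat - tcrit * s_b0, b0_hat + tcrit * s_b0))  # CI for beta0
print((b1_hat - tcrit * s_b1, b1_hat + tcrit * s_b1))  # CI for beta1
```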
Examples
Following on from the second point (about hypothesis tests), a natural test to perform – and
this answers our original question posed at the start of this section “what is the strength of the
evidence [from the data] that a linear relationship exists between variables x and y?” – is:
How much evidence is there that the slope parameter (β1 ) is different from zero?
A value of the slope parameter of β1 = 0 would imply a horizontal line when y is plotted against
x, which in turn implies that x does not explain any variation in y! So comparing the null
hypothesis H0 : β1 = 0 against the alternative hypothesis H1 : β1 ̸= 0 determines the evidence
for a linear relationship between variables x and y.
Because this is such an important test to perform, the p-value associated with this hypothesis
test (H0 : β1 = 0 versus H1 : β1 ̸= 0) is already pre-calculated in the output of most linear
regression analysis software packages.
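As a sketch of where that p-value comes from, the following recomputes it from the $t$ statistic and cross-checks against scipy's built-in simple linear regression (same made-up data as in the earlier sketches):

```python
# Sketch: two-sided p-value for H0: beta1 = 0 versus H1: beta1 != 0.
import numpy as np
from scipy import stats

x = np.array([200.0, 220.0, 250.0, 270.0, 300.0])
y = np.array([290.0, 340.0, 420.0, 470.0, 540.0])
n = len(x)

res = stats.linregress(x, y)        # slope, intercept, p-value, etc.
t1 = res.slope / res.stderr         # t statistic for H0: beta1 = 0
p = 2 * stats.t.sf(abs(t1), n - 2)  # two-sided p-value, d = n - 2
print(p, res.pvalue)                # the two values should agree
```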
Microsoft Excel also calculates the p-value for the evidence associated with the intercept of
the fitted line between y-data and x-data being different from zero (and thus the evidence
for whether or not the fitted line goes through the origin (x = 0, y = 0)). This p-value is
listed in the “P-value” column and “Intercept” row in the third table, and is associated
with the hypothesis test H0 : β0 = 0 versus H1 : β0 ̸= 0.
Examples
(c) Using the Microsoft Excel regression analysis output shown in Figure 11.5 for
the fishing expedition data described in Example 11.1.1, confirm that Microsoft
Excel is correctly calculating the p-value for the hypothesis test H0 : β0 = 0 versus
H1 : β0 ̸= 0.
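One way to approach part (c) is sketched below: take $\hat{\beta}_0 \approx -190.13$ and $s_{\hat{\beta}_0} \approx 35.52$ from Figure 11.5 (with $n = 57$), compute the $t$ statistic by hand, and convert it to a two-sided p-value to compare against the number in Excel's "P-value" column:

```python
# Sketch for part (c): recompute the intercept p-value from the
# Figure 11.5 estimates, for comparison with Excel's output.
from scipy import stats

b0_hat, s_b0, n = -190.13, 35.52, 57
t0 = (b0_hat - 0.0) / s_b0          # t statistic for H0: beta0 = 0
p = 2 * stats.t.sf(abs(t0), n - 2)  # two-sided p-value, d = 55
print(t0, p)                        # t is about -5.35; p is far below 0.05
```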