
MZB127_Topic_11_Lecture_Notes_(Unannotated_Version)

Chapter 11 focuses on linear regression, a statistical method used to analyze the relationship between two numerical variables, identifying one as independent and the other as dependent. It discusses how to explore linear relationships through scatterplots and introduces the statistical assumptions underlying linear regression, including error distribution and parameter estimation. The chapter also provides guidance on using Microsoft Excel to perform linear regression analysis and interpret the results.


Chapter 11

Linear Regression

Preface
In this final week, we look at the case of two numerical variables and consider whether they may
have a linear (straight line) relationship. The associated analysis method is known as linear
regression; because it is based on some statistical assumptions about our observations, it also
allows us to test statistically whether there is evidence of such a relationship between these
variables. We will be heavily relying upon Microsoft Excel for this week’s content.

11.1 Exploring Linear Relationships


In most of the cases where we look at possible relationships between two variables, it is natural
to consider one of the variables as being in some sense a response to the values of the other
variable. For example, when comparing the heights and weights of a group of people, we tend
to consider each person’s height as being more or less fixed and think of their weight as being
dependent to some extent on their height. In such a situation, we often call the variable that
explains the other one an explanatory, independent or predictor variable and the other variable
the response variable or dependent variable. (Note that these designations do not necessarily
imply a causal link between the variables, only that one is in some sense more primary compared
to the other.) Whenever we have clear explanatory and response variables, it is conventional
to plot the explanatory variable on the horizontal (x) axis and the response variable on the
vertical (y) axis.
A scatterplot of dependent variable (y) versus independent variable (x) can reveal trends of
various kinds – straight, curved, complicated – or, at times, it just shows a random scatter of
points that has no clear indication of a trend of any sort. While any knowledge of a relationship
between two variables can be useful, the most convenient form of relationship is when the two
variables are related linearly, that is, when the points in their plot fall on a perfect straight line.
We can express this situation mathematically by the equation
y = β0 + β1 x
where β0 is the intercept (specifically, the y-intercept; that is, the predicted value of y when
x = 0), and β1 is the slope (gradient) of the relationship between y and x. In other contexts,
you may have seen this equation written as y = ax + b or y = mx + c.
In practice, however, we often find that there is noticeable variation about whatever trend is
present. It may still be reasonable to characterise the trend as linear but we can no longer be
completely confident in determining what value of the response variable would be observed in
association with a particular value of the explanatory variable. In other words, we may not be
completely sure what the true values of β0 or β1 are, and hence we are not completely sure
what value of y would result from a particular value of x. We might naturally then ask how
well we can estimate these values, which immediately reminds us of the questions we asked
in preceding chapters about using sample data to estimate true values of parameters. Thus,
it will be useful here to introduce a statistical approach to determining the (proposed linear)
relationship between the variables y and x. We motivate and illustrate this approach throughout
this chapter using the “Fishing expedition” dataset described in Example 11.1.1.

Examples

11.1.1 Fishing expedition


A group of students on a fishing expedition observed the type, length (mm), and weight (g),
of 57 fish that they caught. Figure 11.1 shows the first 30 of these 57 observations (see
Canvas for a spreadsheet containing all the data). A scatterplot was created in Microsoft
Excel (Figure 11.2) to explore the possible relationship between length and weight of fish
in this example.

Figure 11.1: Data for first 30 of 57 fish caught on the fishing expedition.

How to calculate in Microsoft Excel?


To create a scatterplot in Microsoft Excel:

1. Select all of the data (this will consist of two columns and multiple rows,
excluding the column headings).

2. Select “Insert”, then in the “Charts” section, select “Insert Scatter (X, Y) or
Bubble Chart”, then select the top-left option “Scatter”. This will produce
the scatterplot.

3. If you wish to add/remove information to the scatterplot (e.g. to make it
prettier), click on the scatterplot, then click the "+" icon that appears to the
top-right of the scatterplot. (For example, you may want to add "Axis Titles"
or add a "Trendline".)

4. If you wish to change text shown on the scatterplot (e.g. modify the “Chart
Title”), click twice on the text of interest and then type your changes.

Figure 11.2: Scatterplot of the weight vs length of fish caught.

(a) Treating fish weight as the dependent variable y, and treating fish length as the
independent variable x, will it be possible to exactly fit the equation
y = β0 + β1 x
to the data shown in the scatterplot of fish weight vs fish length (Figure 11.2)?

(b) In your opinion, does it look like the proposed relationship between fish weight (y)
and fish length (x) shown in Figure 11.2 could be approximately linear?

11.2 Statistical assumptions of linear regression


We have just seen that it isn’t possible to exactly fit a straight line through the points for a real
dataset comparing two variables. How then can we model our observations so as to provide a
workable method of fitting a suitable straight line through these points?
We start by assuming an underlying linear relationship between the two variables, but now with
some sort of scatter or “error” superimposed on this underlying relationship. More specifically,
we assume that the explanatory (x) values have been observed precisely but that there is some
random error in each observed response (y) value. (This error could be due to measurement
inaccuracy, the presence of other influencing factors that haven’t been taken into account, or
some combination of these.) If there are n paired observations each of variables x and y, and we
denote the ith observations of x and y as xi and yi respectively, then the model for our observed
responses is:

yi = β0 + β1 xi + εi , i = 1, ..., n,

where εi is the error in the observation of yi . (Note in Example 11.1.1 that n = 57, so i can
take values of 1, 2, 3, etc. up to 57.)
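To make these modelling assumptions concrete, here is a short Python sketch (outside the unit's Excel workflow) that simulates observations from this model; the parameter values and x-range are invented purely for illustration, though loosely inspired by the fish example.

```python
import random

# Simulate the model y_i = beta0 + beta1*x_i + eps_i, where the errors
# eps_i are independent draws from N(0, sigma^2).
# All parameter values below are invented purely for illustration.
random.seed(1)
beta0, beta1, sigma = -190.0, 2.4, 68.0
n = 57

x = [random.uniform(100, 400) for _ in range(n)]    # x observed precisely
eps = [random.gauss(0, sigma) for _ in range(n)]    # iid Normal errors
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

# Assumption 2 says E(eps_i) = 0, so the sample mean of the errors
# should be close to (but not exactly) zero:
print(sum(eps) / n)
```

Re-running with a different seed gives a different scatter about the same underlying line, which is exactly the picture the model describes.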
Equivalently, we can consider εi = yi − β0 − β1 xi as the difference between the observed value yi
and the value it should take according to the underlying linear model. This difference is called
the residual (for that observation). Now, if we are considering εi as a random error or scatter,
it makes sense to assume a probability distribution for these quantities εi . In practice, we make
the following (reasonable) assumptions about the residuals:

1. The errors εi and εj are identically distributed but independent of one another for all
i ̸= j.

2. Their average is zero, that is, E (εi ) = 0 for all i.

3. They are Normally distributed, that is, εi ∼ N (0, σ 2 ) for all i.

Note that Assumption 1 implies that σ 2 in Assumption 3 is the same for all observations.
These assumptions can be summarised in the statement that the εi are iid N (0, σ 2 ), where "iid" is short for
independently and identically distributed. Having assumed a distribution for these errors, we
can also make equivalent statements about the distribution of the observed response values yi ,
as follows:

E (yi ) = β0 + β1 xi + E (εi )
= β0 + β1 xi ,
Var (yi ) = 0 + 0 + Var (εi )
= σ2,

so yi ∼ N (β0 + β1 xi , σ 2 ) for all i. (Recall that β0 and β1 are (unknown) constants and that we
have assumed the xi values have been observed without any error, so they are effectively known
constants here.)

11.3 Outputs of linear regression


When we perform linear regression on a dataset consisting of n paired x- and y-values, we aim
to estimate three unknown parameters characterising the proposed (and likely approximate)
linear relationship between x and y. The three unknown parameters we wish to estimate are β0
(the intercept), β1 (the slope) and σ (the standard deviation of the errors).
Now, we will not typically be able to estimate the true values of parameters β0 , β1 and σ;
instead we will obtain sample estimates of these parameters, denoted here as β̂0 , β̂1 and s. Also,
note here that the sample standard deviation of the errors, s, is sometimes also referred to as
the “standard error of the estimate”.

Not assessed: It is possible to show that, for a set of data (xi , yi ), i = 1, ..., n, fitted to the
linear model yi = β0 + β1 xi + εi , where the εi are iid N (0, σ 2 ), that:

β̂0 = ȳ − β̂1 x̄,

β̂1 = (Σ xi yi − n x̄ ȳ) / (Σ xi² − n x̄²),

s² = (1/(n − 2)) Σ (yi − β̂0 − β̂1 xi )²,

where all sums run over i = 1, ..., n, ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi .
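Although these formulas are not assessed, they are straightforward to implement directly. The Python sketch below (the function name and toy data are ours, not from the unit) applies them to four points lying exactly on y = 3 + 2x, so the fit should recover β̂0 = 3, β̂1 = 2 and s = 0.

```python
import math

def fit_line(x, y):
    """Compute beta0_hat, beta1_hat and s using the formulas above."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # beta1_hat = (sum x_i*y_i - n*xbar*ybar) / (sum x_i^2 - n*xbar^2)
    b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) \
         / (sum(xi ** 2 for xi in x) - n * xbar ** 2)
    # beta0_hat = ybar - beta1_hat * xbar
    b0 = ybar - b1 * xbar
    # s^2 = (1/(n-2)) * sum of squared residuals
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    return b0, b1, s

# Points lying exactly on y = 3 + 2x:
b0, b1, s = fit_line([1, 2, 3, 4], [5, 7, 9, 11])
print(b0, b1, s)   # 3.0 2.0 0.0
```

With real data the residuals will not be zero, and s measures their typical size.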

Whilst we could calculate β̂0 , β̂1 and s by hand using the formulas above, in practice we would
usually use statistical software packages (e.g. “Trendline” option and/or “Regression” Analysis
Tool in Microsoft Excel) to do these calculations for us. Furthermore, these packages can give
us several additional useful quantities, beyond just the sample estimates β̂0 , β̂1 and s.
For example, we may be interested in how close β̂0 and β̂1 are to the true values β0 and β1 . To
assess this, we can obtain sample estimates for the standard deviations of β̂0 and β̂1 , denoted as
sβb0 and sβb1 .

Not assessed: It is possible to show that, for a set of data (xi , yi ), i = 1, ..., n, fitted to the
linear model yi = β0 + β1 xi + εi , where the εi are iid N (0, σ 2 ), that:

s²β̂0 = s² Σ xi² / (n Σ (xi − x̄)²),

s²β̂1 = s² / Σ (xi − x̄)²,

where all sums run over i = 1, ..., n.

Furthermore, we may be interested in the proportion of variation in the response variable (y)
that is explained by fitting the linear regression model. This quantity is labelled R2 , and takes
a value between 0 and 1 inclusive. If R2 is closer to 1, then a greater proportion of the variation
in y is explained by the regression model yi = β0 + β1 xi + εi . If R2 is closer to 0, then a smaller
proportion of the variation in y is explained by this regression model.

Not assessed: It is possible to show that, for a set of data (xi , yi ), i = 1, ..., n, fitted to the
linear model yi = β0 + β1 xi + εi , where the εi are iid N (0, σ 2 ), that:

R² = 1 − Σ (yi − ŷi )² / Σ (yi − ȳ)²,

where ŷi = β̂0 + β̂1 xi and the sums run over i = 1, ..., n.
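The "not assessed" formulas for sβ̂0, sβ̂1 and R² can also be checked numerically. The Python sketch below (function name and toy data are ours) computes them for a small dataset that lies almost, but not exactly, on a straight line; for the slope it uses the centred form Σ(xi − x̄)(yi − ȳ)/Σ(xi − x̄)², which is algebraically equivalent to the formula given earlier.

```python
import math

def regression_summary(x, y):
    """Estimates plus their standard deviations and R^2, per the formulas above."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Centred form of the slope formula (algebraically equivalent):
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    s_b1 = math.sqrt(s2 / sxx)                                   # sd of the slope
    s_b0 = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))  # sd of the intercept
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    return b0, b1, math.sqrt(s2), s_b0, s_b1, r2

# Toy data, almost on a straight line:
b0, b1, s, s_b0, s_b1, r2 = regression_summary([1, 2, 3, 4], [5, 7, 9, 12])
print(round(b1, 3), round(r2, 3))   # 2.3 0.989
```

Here R² is close to 1, reflecting that the linear model explains nearly all of the variation in y for this toy dataset.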

All of these quantities can be calculated by hand but we usually use statistical software to
perform the calculations for us instead.

How to calculate in Microsoft Excel?

If you are only interested in obtaining the sample estimate of the intercept (β̂0 ), the
sample estimate of the slope (β̂1 ), and/or the proportion of the variation in response
variable explained by the linear regression model (R2 ), then:

1. Click on the scatterplot, then click the “+” icon that appears to the top-right of
the scatterplot, then check “Trendline”.
2. A linear trendline will appear on your scatterplot. Right-click on the trendline, then
click “Format Trendline...”.
3. In “Trendline Options”, ensure that “Linear” is checked. Check “Display Equation
on chart” and “Display R-squared value on chart”.
4. A textbox will appear on your scatterplot in the format

y = [β̂1 ]x + [β̂0 ]
R² = [R²]

with the calculated values of the linear regression outputs β̂0 , β̂1 , and R² shown in
place of the bracketed symbols (see Figure 11.3 for an example).

Figure 11.3: Scatterplot of the weight vs length of fish caught (data from Example 11.1.1),
including trendline and linear regression parameter estimates β̂0 = −190.13, β̂1 = 2.444 and
R2 = 0.8639.

How to calculate in Microsoft Excel?


If you are instead interested in performing a full linear regression analysis in Microsoft
Excel, which will yield values for the sample estimate of the intercept (β̂0 ), the sample
estimate of the slope (β̂1 ), the standard error of the estimate (s), the sample standard
deviation of the intercept (sβb0 ), the sample standard deviation of the slope (sβb1 ), the
proportion of the variation in response variable explained by the linear regression model
(R2 ), as well as many other useful quantities, then there are two steps:
Installing the Analysis ToolPak
To perform linear regression in Microsoft Excel, you first need to install the free add-in
called "Analysis ToolPak":

1. Go to “File” >> “Options” >> “Add-Ins”.

2. Next to “Manage: Excel Add-ins”, click “Go...”

3. Check “Analysis ToolPak” and then click OK.

Performing Regression Analysis


You only need to do the installation (described above) once. After you have installed the
"Analysis ToolPak", linear regression is performed as follows:

1. Click “Data”, then click “Data Analysis”.

2. Select “Regression” and then click OK.

3. For “Input Y Range”, select all cells where the data is stored for the dependent
variable (y).

4. For “Input X Range”, select all cells where the data is stored for the independent
variable (x).

5. Check “Line Fit Plots”.



6. Select “Output Range” and choose a cell that has no data above or to the right of
it.

7. Click OK.

8. This will produce two outputs:

(a) Several tables of linear regression outputs, positioned with their top-left corner in
the cell you chose in Step 6 (see Figure 11.5 for an example). In these tables:
• The sample estimate of the intercept (β̂0 ) is given by the number in the
"Coefficients" column and "Intercept" row of the third table.
• The sample estimate of the slope (β̂1 ) is given by the number in the
"Coefficients" column and "X Variable 1" row of the third table.
• The standard error of the estimate (s) is given by the number next to
"Standard Error" in the first table.
• The sample standard deviation of the intercept (sβ̂0 ) is given by the number
in the "Standard Error" column and "Intercept" row of the third table.
• The sample standard deviation of the slope (sβ̂1 ) is given by the number in
the "Standard Error" column and "X Variable 1" row of the third table.
• The proportion of the variation in the response variable explained by the linear
regression model (R²) is given by the number next to "R Square" in the
first table.
(b) A “line fit plot” which is an option you chose in Step 5 (see Figure 11.4 for
an example). To change the fitted line from dots to a line, right-click on the
dots making up the line, and select “Format Data Series...”. There are many
options there to make the fitted line plot prettier!

Figure 11.4: Line fit plot obtained from a linear regression analysis applied to the data for
weight vs length of fish caught (data from Example 11.1.1).

Figure 11.5: Outputs of a linear regression analysis applied to the data for weight vs length of
fish caught (data from Example 11.1.1). In this example, the “Residual Output” table has rows
for all 57 observations (only the first 29 observations are shown here!). This output gives us
that β̂0 ≈ −190.13, β̂1 ≈ 2.444, s ≈ 67.944, sβb0 ≈ 35.52, sβb1 ≈ 0.131, and R2 ≈ 0.8639.

11.4 Strength of evidence for the linear relationship


When we perform linear regression, a natural question to ask is “what is the strength of the
evidence [from the data] that a linear relationship exists between variables x and y?” In this
section, we combine hypothesis testing (from Section 10.2) with the linear regression output
parameters (from Section 11.3) to answer this question.
First of all, it can be shown (not here!) that differences between the true and sample estimates of
parameters β0 and β1 , scaled by their respective sample standard deviations, follow a Student’s
t-distribution with n − 2 degrees of freedom:

t = (β̂0 − β0 ) / sβ̂0 , d = n − 2,

t = (β̂1 − β1 ) / sβ̂1 , d = n − 2.
(Not assessed: The use of n − 2 degrees of freedom here is related to the fact that we have
used linear regression to estimate two quantities, β0 and β1 .)
Combining the formulas above with what we have learned in Chapter 10, this means that we
can:

1. Construct confidence intervals for the true value of the intercept (β0 ) and the true value
of the slope (β1 ) – see Section 10.1; and

2. Perform hypothesis tests to compare the sample values of the intercept (β̂0 ) and slope (β̂1 )
obtained from our (x, y) data to separate pre-existing hypotheses about the true values of
the intercept (β0 = some number) and slope (β1 = some number) – see Section 10.2.

Following on from the first point above, and using similar mathematical procedures to those
described in Sections 10.1.1 and 10.1.2, we can show that the confidence intervals, at a
confidence level of (1 − α), for the true values of parameters β0 and β1 are given by:

β̂0 ± a0 , where a0 = tn−2,α/2 · sβ̂0 ,

β̂1 ± a1 , where a1 = tn−2,α/2 · sβ̂1 ,


recalling here that td,p denotes the value of the t-distribution for d degrees of freedom that
satisfies Pr(T > td,p ) = p, and recalling that Fawcett and Kent Table 6 can be used to obtain
these values td,p .
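As a quick check of these formulas (in the spirit of Example 11.4.1), the 95% interval for the slope can be reproduced by hand. The Python sketch below uses β̂1 = 2.444 and sβ̂1 = 0.131 from Figure 11.5, with t55,0.025 ≈ 2.004 taken from t-tables; the final digits therefore depend on how precisely the table value is read.

```python
# Reproducing the 95% confidence interval for the slope in Figure 11.5.
b1_hat, s_b1, n = 2.444, 0.131, 57
t_crit = 2.004      # t_{n-2, alpha/2} = t_{55, 0.025}, read from t-tables
a1 = t_crit * s_b1
lower, upper = b1_hat - a1, b1_hat + a1
print(round(lower, 3), round(upper, 3))   # roughly 2.181 to 2.707
```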

How to calculate in Microsoft Excel?


Note here that the Microsoft Excel linear regression output actually pre-calculates the
95% confidence intervals (i.e. α = 0.05) for the true values of the intercept (β0 ) and slope
(β1 ) already! This information is stated in the “Lower 95%” and “Upper 95%” columns
in the third table of the regression output (see Figure 11.5). By default, Microsoft Excel
reports these confidence intervals twice.
If instead you want Excel to calculate a confidence interval at a level other than 95%, this
can be selected when you perform the regression analysis (it is an additional option listed
just below "Input X Range").

Examples

11.4.1 Fishing expedition: revisited


Using the Microsoft Excel regression analysis output shown in Figure 11.5 for the fishing
expedition data described in Example 11.1.1, and the formulas on the previous page,
confirm that Microsoft Excel is correctly calculating the 95% confidence intervals for the
true values of the intercept (β0 ) and slope (β1 ).

Following on from the second point (about hypothesis tests), a natural test to perform – and
this answers our original question posed at the start of this section “what is the strength of the
evidence [from the data] that a linear relationship exists between variables x and y?” – is:

How much evidence is there that the slope parameter (β1 ) is different from zero?

A value of the slope parameter of β1 = 0 would imply a horizontal line when y is plotted against
x, which in turn implies that x does not explain any variation in y! So comparing the null
hypothesis H0 : β1 = 0 against the alternative hypothesis H1 : β1 ̸= 0 determines the evidence
for a linear relationship between variables x and y.
Because this is such an important test to perform, the p-value associated with this hypothesis
test (H0 : β1 = 0 versus H1 : β1 ̸= 0) is already pre-calculated in the output of most linear
regression analysis software packages.
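The test statistic behind that p-value is simple to compute by hand. The sketch below uses the Figure 11.5 values β̂1 = 2.444 and sβ̂1 = 0.131; converting the statistic into an exact p-value still requires software or t-tables.

```python
# Test statistic for H0: beta1 = 0 versus H1: beta1 != 0,
# using the estimates from Figure 11.5.
b1_hat, s_b1 = 2.444, 0.131
t_stat = (b1_hat - 0) / s_b1   # hypothesised value of beta1 is 0
print(round(t_stat, 2))        # roughly 18.66

# With 55 degrees of freedom, the two-sided 5% critical value is about 2.004;
# |t| is vastly larger, so there is overwhelming evidence that beta1 != 0.
```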

How to calculate in Microsoft Excel?


For the hypothesis test H0 : β1 = 0 versus H1 : β1 ̸= 0, which determines the evidence
associated with the slope of the fitted line between y-data and x-data being different
from zero (and thus the evidence associated with a linear relationship existing between
variables x and y), the associated p-value is listed in Microsoft Excel regression analysis
output in two places:

• Under the "Significance F" column in the second table.

• In the "P-value" column and "X Variable 1" row in the third table.

Microsoft Excel also calculates the p-value for the evidence associated with the intercept of
the fitted line between y-data and x-data being different from zero (and thus the evidence
for whether or not the fitted line goes through the origin (x = 0, y = 0)). This p-value is
listed in the “P-value” column and “Intercept” row in the third table, and is associated
with the hypothesis test H0 : β0 = 0 versus H1 : β0 ̸= 0.

Examples

11.4.2 Fishing expedition: revisited


(a) Using the Microsoft Excel regression analysis output shown in Figure 11.5 for
the fishing expedition data described in Example 11.1.1, confirm that Microsoft
Excel is correctly calculating the p-value for the hypothesis test H0 : β1 = 0 versus
H1 : β1 ̸= 0.

(b) What does the p-value above tell us?



(c) Using the Microsoft Excel regression analysis output shown in Figure 11.5 for
the fishing expedition data described in Example 11.1.1, confirm that Microsoft
Excel is correctly calculating the p-value for the hypothesis test H0 : β0 = 0 versus
H1 : β0 ̸= 0.

(d) What does the p-value above tell us?
