Chapter 6 - 2020


Linear Regression and Correlation
Chapter 6
Contents
6.1 Introduction
6.2 Curve Fitting
6.3 Fitting a Simple Linear Regression Line
6.4 Linear Correlation Analysis
6.5 Spearman’s Rank Correlation
6.6 Multiple Regression and Correlation Analysis

6.1 Introduction
This chapter presents statistical techniques for analyzing the association between two variables and for developing a relationship that can be used for prediction.

6.2 Curve Fitting
• Very often in practice a relation is found to exist between two (or
more) variables.
• It is frequently desirable to express this relationship in
mathematical form by determining an equation connecting the
variables.
• To aid in determining an equation connecting variables, a first
step is the collection of data showing corresponding values of
the variables under consideration.

6.3 Fitting a Simple Linear Regression Line
The goal is to determine, from a set of data, a line of best fit that can be used to infer the relationship between two variables.

6.3.1 The Method of Least Squares
We determine the line of “best fit”

$\hat{y} = a + bx$

by minimizing $\sum E_i^2$, where $E_i = y_i - \hat{y}_i$ is the vertical deviation of the $i$-th observation from the line.

To minimize $\sum E_i^2$, we apply calculus and obtain the following “normal equations”:

$\sum y = na + b \sum x \qquad (1)$

$\sum xy = a \sum x + b \sum x^2 \qquad (2)$

Solving (1) and (2) simultaneously, we obtain:

$b = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$

$a = \dfrac{\sum y}{n} - b\,\dfrac{\sum x}{n}$
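As an illustration, here is a minimal sketch of these formulas in plain Python (no external libraries; the helper name fit_line is our own):

```python
def fit_line(x, y):
    """Least-squares estimates of a and b for the line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Advertising data from Example 1 below: a = -0.1, b = 0.7 (up to rounding)
a, b = fit_line([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(a, b)
print(a + b * 4.5)   # predicted sales revenue at x = 4.5: 3.05
```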

Notes:
1. The formula for calculating the slope $b$ is commonly written as

$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

in which the numerator and denominator reduce to

$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \bar{x} \sum y - \bar{y} \sum x + n\bar{x}\bar{y} = \sum xy - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} = \sum xy - n\bar{x}\bar{y}$

and

$\sum (x - \bar{x})^2 = \sum x^2 - 2\bar{x} \sum x + n\bar{x}^2 = \sum x^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum x^2 - n\bar{x}^2$

respectively; and $a = \bar{y} - b\bar{x}$ is the y-intercept of the regression line.


2. When the equation $\hat{y} = a + bx$ is calculated from a sample of observations rather than from a population, it is referred to as a sample regression line.

Example 1
Suppose an appliance store conducts a 5-month experiment to determine the effect of advertising on sales revenue and obtains the following results:

Month   Advertising Expenditure (in $1,000)   Sales Revenue (in $10,000)
1       1                                      1
2       2                                      1
3       3                                      2
4       4                                      2
5       5                                      4

Find the sample regression line and predict the sales revenue if the appliance store spends 4.5 thousand dollars on advertising in a month.

Solution:

$n = 5, \quad \sum x = 15, \quad \sum y = 10, \quad \sum xy = 37, \quad \sum x^2 = 55$

Hence, $\bar{x} = \dfrac{\sum x}{n} = \dfrac{15}{5} = 3$ and $\bar{y} = \dfrac{\sum y}{n} = \dfrac{10}{5} = 2$.

Then the slope of the sample regression line is

$b = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \dfrac{5(37) - (15)(10)}{5(55) - 15^2} = 0.7$

and the y-intercept is

$a = \bar{y} - b\bar{x} = \dfrac{10}{5} - 0.7\left(\dfrac{15}{5}\right) = -0.1$

The sample regression line is thus

$\hat{y} = -0.1 + 0.7x$

So if the appliance store spends 4.5 thousand dollars on advertising in a month, it can expect sales revenue of $\hat{y} = -0.1 + 0.7(4.5) = 3.05$ ten-thousand dollars (i.e. $30,500) during that month.
[Scatter plot of the data with the fitted line: x = advertising expenditure, y = sales revenue.]
Example 2
Obtain the least squares prediction line for the data below:
y_i    x_i    x_i²    x_i·y_i    y_i²
101 1.2 1.44 121.2 10201
92 0.8 0.64 73.6 8464
110 1.0 1.00 110.0 12100
120 1.3 1.69 156.0 14400
90 0.7 0.49 63.0 8100
82 0.8 0.64 65.6 6724
93 1.0 1.00 93.0 8649
75 0.6 0.36 45.0 5625
91 0.9 0.81 81.9 8281
105 1.1 1.21 115.5 11025
Sum 959 9.4 9.28 924.8 93569
$b = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \dfrac{10(924.8) - (9.4)(959)}{10(9.28) - (9.4)^2} = \dfrac{233.4}{4.44} = 52.568$

$a = \dfrac{\sum y}{n} - b\,\dfrac{\sum x}{n} = \dfrac{959}{10} - 52.568\left(\dfrac{9.4}{10}\right) = 46.486$

Therefore, $\hat{y} = 46.486 + 52.568x$.

[Scatter plot of the data with the fitted line.]
Example 3
Find a regression curve of the form $y = a + b \ln x$ for the following data:

x_i       1      2       3       4       5       6       7       8
y_i       9      13      14      17      18      19      19      20
ln x_i    0      0.693   1.099   1.386   1.609   1.792   1.946   2.079
$\sum \ln x_i = 10.604 \qquad \sum (\ln x_i)^2 = 17.518 \qquad \sum y_i = 129 \qquad \sum (\ln x_i) y_i = 189.521$

$b = \dfrac{n \sum (\ln x) y - \sum \ln x \sum y}{n \sum (\ln x)^2 - \left(\sum \ln x\right)^2} = \dfrac{8(189.521) - (10.604)(129)}{8(17.518) - (10.604)^2} = 5.35$

$a = \dfrac{\sum y}{n} - b\,\dfrac{\sum \ln x}{n} = \dfrac{129}{8} - 5.35\left(\dfrac{10.604}{8}\right) = 9.03$

Therefore, $\hat{y} = 9.03 + 5.35 \ln x$.

[Scatter plot of y against ln x with the fitted line.]
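Fitting this curve is the same least-squares machinery applied after transforming $x$ to $\ln x$. A short sketch, reusing the fit_line helper sketched in Section 6.3.1 (math.log is the natural logarithm):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [9, 13, 14, 17, 18, 19, 19, 20]

# Transform x, then fit a straight line in (ln x, y) coordinates.
ln_x = [math.log(xi) for xi in x]
a, b = fit_line(ln_x, y)          # fit_line as defined in Section 6.3.1
print(round(a, 2), round(b, 2))   # approximately 9.03 and 5.35
```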
Why is this called “Regression”?

Francis Galton (1822 – 1911)

If parents are taller (shorter) than the average, their children tend
to be shorter (taller) than their parents.
Galton called this phenomenon “regression towards mediocrity”.
6.4 Linear Correlation Analysis
Correlation analysis is the statistical tool that we can use to
determine the degree to which variables are related.

6.4.1 Coefficient of Determination, $r^2$

Problem: how well does a least-squares regression line fit a given set of paired data?

• Total variation of the $y$ values around their own mean $= \sum (y - \bar{y})^2$
• Variation of the $y$ values around the regression line $= \sum (y - \hat{y})^2$
• Regression sum of squares $= \sum (\hat{y} - \bar{y})^2$

We have:

$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$

• Dividing both sides of the equation by $\sum (y - \bar{y})^2$, we have

$1 = \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} + \dfrac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \;\Longrightarrow\; \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} = 1 - \dfrac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}.$

• Denoting $\dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$ by $r^2$, then

$r^2 = 1 - \dfrac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}.$

• $r^2$, the coefficient of determination, is the proportion of the variation in $y$ explained by a sample regression line.
• For example, $r^2 = 0.9797$ means that 97.97% of the variation in $y$ is explained by its linear relationship with $x$ (see the numerical sketch below).
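A minimal numerical check of this decomposition, using the Example 1 data and the line $\hat{y} = -0.1 + 0.7x$ fitted earlier (plain Python; the variable names are our own):

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7                  # coefficients fitted in Example 1

y_hat = [a + b * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual

print(sst, ssr + sse)   # both sides of the identity: 6.0 (up to rounding)
print(1 - sse / sst)    # r^2 = 0.81666..., as computed in Example 4 below
```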

6.4.2 Correlation Coefficient

$r = \dfrac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2}}$

and $-1 \le r \le 1$.

Notes:
The formulas for calculating $r^2$ (sample coefficient of determination) and $r$ (sample coefficient of correlation) can be simplified into a more common version as follows:

$r^2 = \dfrac{\left[\sum (x - \bar{x})(y - \bar{y})\right]^2}{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2} = \dfrac{\left(\sum xy - n\bar{x}\bar{y}\right)^2}{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}$

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$

Since the numerators used in calculating $r$ and $b$ are the same and both denominators are always positive, $r$ and $b$ will always be of the same sign. Moreover, if $r = 0$ then $b = 0$, and vice versa.
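A small sketch of the simplified formula in plain Python (the helper name corr is our own), which also illustrates that $r$ carries the same sign as the slope $b$:

```python
def corr(x, y):
    """Sample correlation coefficient r via the simplified sums formula."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    syy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return sxy / (sxx * syy) ** 0.5

# Example 1 data: r ~ 0.9037 (positive, like b = 0.7), and r^2 ~ 0.8167.
r = corr([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(r, r ** 2)
```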

Example 4
Calculate the sample coefficient of determination and the sample coefficient of correlation for Example 1. Interpret the results.
Solution:
From the data we get

$n = 5, \quad \sum x = 15, \quad \sum y = 10, \quad \sum xy = 37, \quad \sum x^2 = 55, \quad \sum y^2 = 26$

Then the coefficient of determination is given by

$r^2 = \dfrac{\left(\sum xy - n\bar{x}\bar{y}\right)^2}{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)} = \dfrac{\left[37 - 5\left(\frac{15}{5}\right)\left(\frac{10}{5}\right)\right]^2}{\left[55 - 5\left(\frac{15}{5}\right)^2\right]\left[26 - 5\left(\frac{10}{5}\right)^2\right]} \approx 0.81667$
$r = +\sqrt{r^2} = 0.9037$

Note: $r$ is positive because $b$ in Example 1 is positive.

$r^2 = 0.81667$ implies that 81.67% of the sample variability in sales revenue is explained by its linear dependence on the advertising expenditure. $r = 0.9037$ indicates a very strong positive linear relationship between sales revenue and advertising expenditure.
Example 5
Interest rates (x) provide an excellent leading indicator for
predicting housing starts (y). As interest rates decline, housing starts
increase, and vice versa. Suppose the data given in the
accompanying table represent the prevailing interest rates on first
mortgages and the recorded building permits in a certain region
over a 12-year span.
Year                 1985   1986   1987   1988   1989   1990
Interest rates (%)    6.5    6.0    6.5    7.5    8.5    9.5
Building permits     2165   2984   2780   1940   1750   1535

Year                 1991   1992   1993   1994   1995   1996
Interest rates (%)   10.0    9.0    7.5    9.0   11.5   15.0
Building permits      962   1310   2050   1695    856    510
(a) Find the least squares line to allow for the estimation of building
permits from interest rates.
(b) Calculate the correlation coefficient 𝑟 for these data.
(c) By what percentage is the sum of squares of deviations of
building permits reduced by using interest rates as a predictor
rather than using the average annual building permits $\bar{y}$ as a
predictor of 𝑦 for these data?

Example 5(a)
(a) Find the least squares line to allow for the estimation of building permits from interest rates.

Solution:

$n = 12, \quad \sum x = 106.5, \quad \sum y = 20537, \quad \sum xy = 163588, \quad \sum x^2 = 1014.75$

Then the slope of the sample regression line is

$b = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \dfrac{12(163588) - (106.5)(20537)}{12(1014.75) - (106.5)^2} = -268.5049416$

and the y-intercept is

$a = \bar{y} - b\bar{x} = \dfrac{20537}{12} - (-268.5049416)\left(\dfrac{106.5}{12}\right) = 4094.398023$

Therefore, the least squares line is

$\hat{y} = 4094.40 - 268.50x$

[Scatter plot: Interest Rates (x) against Building Permits (y), with the fitted line.]
Example 5(b)
(b) Calculate the correlation coefficient $r$ for these data.
Solution:
The correlation coefficient is given by

$r = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}} = \dfrac{163588 - 12\left(\frac{106.5}{12}\right)\left(\frac{20537}{12}\right)}{\sqrt{\left[1014.75 - 12\left(\frac{106.5}{12}\right)^2\right]\left[41212111 - 12\left(\frac{20537}{12}\right)^2\right]}} \approx -0.909355154$

(Here $\sum y^2 = 41212111$.)
Example 5(c)
(c) By what percentage is the sum of squares of deviations of building permits reduced by using interest rates as a predictor rather than using the average annual building permits $\bar{y}$ as a predictor of $y$ for these data?
Solution:
The coefficient of determination is

$r^2 = (-0.909355154)^2 \approx 0.826926796 = 82.6926796\%$

$r^2 = 0.827$ implies that 82.7% of the sample variability in building permits is explained by its linear dependence on the interest rates; equivalently, the sum of squares of deviations is reduced by about 82.7% when interest rates are used as a predictor instead of $\bar{y}$.
6.5 Spearman’s Rank Correlation
• Occasionally we may need to determine the correlation between two variables when suitable measures of one or both variables do not exist.
• However, the variables can be ranked, and the association between the two variables can be measured by $r_s$:

$r_s = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}$

where $d$ is the difference in rank between $x$ and $y$.

• $-1 \le r_s \le 1$
  – If $r_s$ is close to 1: strong positive association
  – If $r_s$ is close to −1: strong negative association
  – If $r_s$ is close to 0: no association
Notes:
1. The two variables must be ranked in the same order, giving rank 1 either to the largest (or smallest) value, rank 2 to the second largest (or smallest) value, and so forth.
2. If there are ties, we assign to each of the tied observations the mean of the ranks which they jointly occupy; thus, if the third and fourth ordered values are identical we assign each the rank $(3 + 4)/2 = 3.5$, and if the fifth, sixth and seventh ordered values are identical we assign each the rank $(5 + 6 + 7)/3 = 6$.
3. The ordinary sample correlation coefficient $r$ can also be used to calculate the rank correlation coefficient, where $x$ and $y$ represent the ranks of the observations instead of their actual numerical values (see the sketch below).
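A compact sketch of these notes in plain Python (the helper names average_ranks and spearman are our own). It assigns average ranks to ties, as in note 2, and then applies the $r_s$ formula:

```python
def average_ranks(values):
    """Rank from smallest (rank 1) upward, giving ties the mean of their ranks."""
    order = sorted(values)
    ranks = []
    for v in values:
        first = order.index(v)    # 0-based position of the first occurrence
        count = order.count(v)    # number of tied observations
        ranks.append(first + (count + 1) / 2)
    return ranks

def spearman(x, y):
    """Spearman's rank correlation r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Example 1 data (ties in y): r_s = 0.95, as worked out in Example 6 below.
print(spearman([1, 2, 3, 4, 5], [1, 1, 2, 2, 4]))
```

Applying the ordinary corr function from Section 6.4.2 to the same ranks gives 0.9487, the value obtained in the comparison worked out after Example 6.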

Example 6
Calculate the rank correlation coefficient $r_s$ for Example 1.

Month   Value x   rank(x)   Value y   rank(y)   d = (3)−(5)   d²
(1)     (2)       (3)       (4)       (5)       (6)           (7)
1       1         1         1         1.5       −0.5          0.25
2       2         2         1         1.5        0.5          0.25
3       3         3         2         3.5       −0.5          0.25
4       4         4         2         3.5        0.5          0.25
5       5         5         4         5          0            0

Solution:
By formula,

$r_s = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6(1)}{5(5^2 - 1)} = 0.95$

$r_s = 0.95$ indicates a very strong positive association between the rankings of advertising expenditure and sales revenue.

[Scatter plot of the ranked data.]
Note that if we apply the ordinary formula of the correlation coefficient $r$ to the rankings of the variables in Example 6, the result is slightly different. Since

$n = 5 \qquad \sum \mathrm{rank}(x) = 15 \qquad \sum \mathrm{rank}(y) = 15$
$\sum \mathrm{rank}(x)\,\mathrm{rank}(y) = 54 \qquad \sum \mathrm{rank}(x)^2 = 55 \qquad \sum \mathrm{rank}(y)^2 = 54$

then

$r = \dfrac{\sum \mathrm{rank}(x)\,\mathrm{rank}(y) - n\,\overline{\mathrm{rank}(x)}\;\overline{\mathrm{rank}(y)}}{\sqrt{\left[\sum \mathrm{rank}(x)^2 - n\,\overline{\mathrm{rank}(x)}^2\right]\left[\sum \mathrm{rank}(y)^2 - n\,\overline{\mathrm{rank}(y)}^2\right]}} = \dfrac{54 - 5\left(\frac{15}{5}\right)\left(\frac{15}{5}\right)}{\sqrt{\left[55 - 5\left(\frac{15}{5}\right)^2\right]\left[54 - 5\left(\frac{15}{5}\right)^2\right]}} \approx 0.9486833,$

which is very close to the result of $r_s$.
Example 7
Calculate the Spearman’s rank correlation, $r_s$, between $x$ and $y$ for the following data:

y_i    rank(y_i)   x_i   rank(x_i)   (rank(y_i) − rank(x_i))²
52                 10
54                 14
47                  6
42                  8
49                  6
38                  4
50                  8
49                  8
With the ranks filled in:

y_i    rank(y_i)   x_i   rank(x_i)   (rank(y_i) − rank(x_i))²
52     7           10    7           0
54     8           14    8           0
47     3            6    2.5         0.25
42     2            8    5           9
49     4.5          6    2.5         4
38     1            4    1           0
50     6            8    5           1
49     4.5          8    5           0.25

Solution:
By formula,

$r_s = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6(14.5)}{8(8^2 - 1)} \approx 0.8274$

$r_s = 0.8274$ indicates a fairly strong positive association between the rankings of $y$ and $x$.

[Scatter plot of y against x.]
Example 8
The data in the table represent the monthly sales and the
promotional expenses for a store that specializes in sportswear for
young women.
a) Calculate the coefficient of correlation between monthly sales
and promotional expenses.
b) Calculate the Spearman’s rank correlation between monthly
sales and promotional expenses.
c) Compare your results from part a and part b. What do these
results suggest about the linearity and association between the
two variables?

Month Sales (in $1,000) Promotional expenses (in $10,000)
1 62.4 3.9
2 68.5 4.8
3 70.2 5.5
4 79.6 6.0
5 80.1 6.8
6 88.7 7.7
7 98.6 7.9
8 104.3 9.0
9 106.5 9.2
10 107.3 9.7
11 115.8 10.9
12 120.1 11.0
[Scatter plot: x = Sales (in $1,000), y = Promotional expenses (in $10,000).]
i     x_i      x_i²        rank(x_i)   y_i    y_i²     rank(y_i)   x_i·y_i    d_i²
1     62.4     3893.8      1           3.9    15.21    1           243.36     0
2     68.5     4692.3      2           4.8    23.04    2           328.8      0
3     70.2     4928        3           5.5    30.25    3           386.1      0
4     79.6     6336.2      4           6      36       4           477.6      0
5     80.1     6416        5           6.8    46.24    5           544.68     0
6     88.7     7867.7      6           7.7    59.29    6           682.99     0
7     98.6     9722        7           7.9    62.41    7           778.94     0
8     104.3    10878       8           9      81       8           938.7      0
9     106.5    11342       9           9.2    84.64    9           979.8      0
10    107.3    11513       10          9.7    94.09    10          1040.81    0
11    115.8    13410       11          10.9   118.81   11          1262.22    0
12    120.1    14424       12          11     121      12          1321.1     0
Sum   1102.1   105423.55               92.4   771.98               8985.1     0
Example 8(a)
(a) Calculate the coefficient of correlation between monthly sales and promotional expenses.
Solution:
The correlation coefficient is given by

$r = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}} = \dfrac{8985.1 - 12\left(\frac{1102.1}{12}\right)\left(\frac{92.4}{12}\right)}{\sqrt{\left[105423.55 - 12\left(\frac{1102.1}{12}\right)^2\right]\left[771.98 - 12\left(\frac{92.4}{12}\right)^2\right]}} \approx 0.9892$
Example 8(b)
(b) Calculate the Spearman’s rank correlation between monthly
sales and promotional expenses.
Solution:
By formula,

$r_s = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6(0)}{12(12^2 - 1)} = 1$
Example 8(c)
(c) Compare your results from part a and part b. What do these
results suggest about the linearity and association between the
two variables?
Solution:
Results from (a) and (b) both suggest there is a strong positive
relationship between the two variables.
Specifically, 𝑟 = 0.9892 indicates a very strong positive linear
relationship between monthly sales and promotional expenses,
and 𝑟𝑠 = 1 indicates a very strong positive association between the
rankings of monthly sales and promotional expenses.

6.6 Multiple Regression and
Correlation Analysis
• We may use more than one independent variable to estimate the
dependent variable, and in this way, attempt to increase the
accuracy of the estimate. This process is called multiple regression and
correlation analysis.
• It is based on the same assumptions and procedures we have
encountered using simple regression.
• The principal advantage of multiple regression is that it allows us to use
more of the information available to us to estimate the dependent
variable.
• Sometimes the correlation between two variables may be insufficient
to determine a reliable estimating equation. Yet, if we add the data
from more independent variables, we may be able to determine an
estimating equation that describes the relationship with greater
accuracy.

Considering the problem of estimating or predicting the value of a dependent variable $y$ on the basis of a set of measurements taken on $p$ independent variables $x_1, \ldots, x_p$, we shall assume a theoretical equation of the form

$\mu_{y|x_1,\ldots,x_p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,$

where $\beta_0, \ldots, \beta_p$ are coefficient parameters to be estimated from the data. Denoting these estimates by $b_0, \ldots, b_p$, respectively, we can write the sample regression equation in the form

$\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p.$
The coefficients in the model are estimated by the least-squares method. For a random sample of size $n$ (i.e. $n$ data points), the least-squares estimates are obtained such that the residual sum of squares (SSE) is minimized, where

$SSE = \sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip}\right)^2.$
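In matrix form this is a standard least-squares problem that numerical libraries solve directly. A minimal sketch, assuming numpy is available (the helper name fit_multiple is our own; a leading column of ones supplies the intercept $b_0$):

```python
import numpy as np

def fit_multiple(X, y):
    """Least-squares estimates b0, b1, ..., bp minimizing SSE.

    X: n-by-p array of independent variables; y: length-n response vector.
    """
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    b, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return b   # array [b0, b1, ..., bp]
```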

With only two independent variables (i.e. $p = 2$) the sample regression equation reduces to the form

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$

The least-squares estimates $b_0$, $b_1$ and $b_2$ are obtained by solving the following normal equations simultaneously:

$n b_0 + b_1 \sum x_1 + b_2 \sum x_2 = \sum y$
$b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 = \sum x_1 y$
$b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 = \sum x_2 y$

Example 9
A placement agency would like to predict the salary of senior staff
(𝑦) by his years of experience (𝑥1 ) and the number of employees
he supervises (𝑥2 ). A random sample of 12 cases is selected and
the observations are shown in the following table. Set up the
normal equations.

Salary (‘000) Year of experience Number of employees supervised
62 10 175
65 12 150
72 18 135
70 15 175
81 20 150
77 18 200
72 19 180
77 22 225
75 20 175
90 21 275
82 19 225
95 23 300
$\sum x_1 = 217 \qquad \sum x_2 = 2365 \qquad \sum y = 918$
$\sum x_1^2 = 4093 \qquad \sum x_2^2 = 494375 \qquad \sum x_1 x_2 = 44025$
$\sum x_1 y = 16947 \qquad \sum x_2 y = 185230 \qquad \sum y^2 = 71230$

The normal equations are

$12 b_0 + 217 b_1 + 2365 b_2 = 918$
$217 b_0 + 4093 b_1 + 44025 b_2 = 16947$
$2365 b_0 + 44025 b_1 + 494375 b_2 = 185230$

When we solve these three equations simultaneously, we get the least-squares estimates of the regression coefficients.
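For instance, solving this 3×3 system numerically reproduces the SPSS coefficients quoted below. A sketch assuming numpy is available:

```python
import numpy as np

# Coefficient matrix and right-hand side of the normal equations above.
A = np.array([[12,    217,    2365],
              [217,   4093,   44025],
              [2365,  44025,  494375]], dtype=float)
rhs = np.array([918, 16947, 185230], dtype=float)

b0, b1, b2 = np.linalg.solve(A, rhs)
print(round(b0, 3), round(b1, 3), round(b2, 3))   # ~ 33.704  1.371  0.091
```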
Alternatively, SPSS can be used to analyze this set of sample data, giving the following results:

[SPSS regression output table.]
From the above results, the fitted regression equation is

$\hat{y} = 33.703 + 1.371 x_1 + 0.09 x_2$

Similar to the simple linear regression model, the sum of squares identity

$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

will also hold. Denote

total sum of squares: $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$
regression sum of squares: $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$
residual sum of squares: $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

The identity becomes $SST = SSR + SSE$.


The coefficient of determination, $R^2$, is evaluated by

$R^2 = 1 - \dfrac{SSE}{SST} = \dfrac{SSR}{SST},$

which states the percentage of variation of $y$ that can be explained by the multiple linear regression model.
Given a fixed sample size $n$, $R^2$ will generally increase as more independent variables are included in the multiple regression equation. However, the additional independent variables may not contribute significantly to the explanation of the dependent variable.

6.6.1 Inferences on the parameters
• The significance of individual regression coefficients can be tested. All the $b_i$'s are assumed normally distributed with mean $\beta_i$.
• The null hypothesis and alternative hypothesis are
  – $H_0$: $\beta_i = 0$ (i.e. $x_i$ is not a significant explanatory variable)
  – $H_1$: $\beta_i \ne 0$ (i.e. $x_i$ is a significant explanatory variable)

• We can test these hypotheses using the t-test. The test statistic

$t = \dfrac{b_i}{s.e.(b_i)}$

follows the t-distribution with $n - p - 1$ degrees of freedom. Note that $s.e.(b_i)$ is the standard error of $b_i$.
Using the SPSS results of Example 9 again, the standard errors of $b_1$ and $b_2$ are 0.364 and 0.028 respectively, and the corresponding test statistics are 3.770 and 3.250. The significance values (p-values) of $b_1$ and $b_2$ are 0.004 and 0.01 respectively, and hence we reject $H_0$ and conclude that both independent variables (i.e. $x_1$ and $x_2$) are significant explanatory variables of $y$ at the 5% level of significance.
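A sketch of this check, assuming scipy is available for the t-distribution (the coefficients and standard errors are the SPSS values quoted above):

```python
from scipy import stats

n, p = 12, 2        # Example 9: 12 cases, 2 independent variables
df = n - p - 1      # 9 degrees of freedom

for b, se in [(1.371, 0.364), (0.091, 0.028)]:
    t = b / se
    p_value = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
    print(round(t, 2), round(p_value, 3))  # ~3.77, 0.004 and ~3.25, 0.010
```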

6.6.2 Analysis of Variance (ANOVA) approach
• The analysis of variance approach is used to test for the significance of the multiple linear regression model. The null hypothesis and alternative hypothesis are
  – $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$ (i.e. $y$ does not depend on the $x_i$'s)
  – $H_1$: at least one $\beta_i \ne 0$ (i.e. $y$ depends on at least one of the $x_i$'s)

• After evaluating the sums of squares, the ANOVA table is constructed as follows:

Source       SS     df          MS                       F
Regression   SSR    p           MSR = SSR/p              MSR/MSE
Residual     SSE    n − p − 1   MSE = SSE/(n − p − 1)
Total        SST    n − 1
• The test statistic $F = MSR/MSE$ follows the F distribution with $p$ and $n - p - 1$ degrees of freedom under the null hypothesis. If $F > F_{\alpha,\,p,\,n-p-1}$, there is evidence to reject the null hypothesis.

• Example 9 has the F-statistic = 29.073 with significance 0.000; therefore the multiple regression equation is highly significant.
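A sketch tying the pieces together for Example 9, assuming numpy (the data arrays are transcribed from the table above). It refits the model, rebuilds the ANOVA quantities, and the resulting F statistic should land near the SPSS value of about 29:

```python
import numpy as np

x1 = [10, 12, 18, 15, 20, 18, 19, 22, 20, 21, 19, 23]
x2 = [175, 150, 135, 175, 150, 200, 180, 225, 175, 275, 225, 300]
y = np.array([62, 65, 72, 70, 81, 77, 72, 77, 75, 90, 82, 95], dtype=float)

n, p = len(y), 2
A = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ b
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
sse = np.sum((y - y_hat) ** 2)        # residual sum of squares
ssr = sst - sse                       # regression sum of squares

msr, mse = ssr / p, sse / (n - p - 1)
print(round(msr / mse, 2))            # F statistic, approximately 29
```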

6.6.3 Multicollinearity in Multiple
Regression
• In multiple-regression analysis, the regression coefficients often
become less reliable as the degree of correlation between the
independent variables increases.

• If there is a high level of correlation between some of the independent


variables, we have a problem that statisticians call multicollinearity.
• Multicollinearity might occur if we wished to estimate a firm’s sales
revenue and we used both the number of salespeople employed and
their total salaries.
– Because the values associated with these two independent variables are
highly correlated, we need to use only one set of them to make our estimate.
In fact, adding a second variable that is correlated with the first distorts the
values of the regression coefficients.

