
STAT1600B

Statistics: Ideas and Concepts


2017-2018 (2nd Semester)

Department of Statistics and Actuarial Science


The University of Hong Kong

Chapter 2:
Correlation and Regression

Outline

1 Introduction

2 Scatterplot

3 Correlation Coefficient r

4 Rank Correlation Coefficient rs

5 Cautions in the Use of Correlation

6 Simple Linear Regression

Introduction

Tools Used to Examine the Relationship between Two Quantitative Variables

In this chapter, we are going to examine the relationship between two quantitative variables, using three tools:
1 Scatterplot, which is a two-dimensional graph of data values.
2 Correlation
1 Correlation coefficient r , which is a statistic that measures the
strength and direction of a linear relationship between two
quantitative variables.
2 Rank correlation coefficient rs , which is the non-parametric
counterpart of r .
3 Regression equation, which is an equation that describes the
average relationship between a quantitative response variable
and a quantitative explanatory variable.
Scatterplot
A scatterplot is a two-dimensional graph of data values.
It is used to reveal graphically any relation between the two variables.
Suppose we have the following bivariate data in the table below.

TABLE 7.2 Bivariate Data: Scores for 10 Male College Students on Two Self-Report Measures

STUDENT   STRESS (X)   EATING DIFFICULTIES (Y)
A         17            9
B          8           13
C          8            7
D         20           18
E         14           11
F          7            2
G         21            5
H         22           15
I         19           26
J         30           28

Scatterplot
Plotting the scatterplot allows us to easily see the nature of the relationship, if any exists, between the two variables.
[FIGURE 7.1 Scatter diagram of the bivariate distribution of stress scores (x-axis) and eating difficulties scores (y-axis) for the 10 male college students, each point labelled A-J. Data from Table 7.2.]
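A minimal sketch (assuming matplotlib is available; not part of the original slides) of how a plot like Figure 7.1 can be produced from the Table 7.2 data:

```python
# Sketch: reproduce a plot like Figure 7.1 from the Table 7.2 data.
import matplotlib.pyplot as plt

stress = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]   # X
eating = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]    # Y
labels = list("ABCDEFGHIJ")

plt.scatter(stress, eating)
for s, e, lab in zip(stress, eating, labels):
    plt.annotate(lab, (s, e))                    # label each student's point
plt.xlabel("Stress")
plt.ylabel("Eating difficulties")
plt.show()
```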
Scatterplot
Again, suppose we have data on two variables: height and handspan.

Scatterplot
We can then plot a scatterplot to investigate the relationship between height and handspan.

Observations from the scatterplot:


Handspan tends to increase with height, implying a positive association.
The pattern of the relationship resembles a straight line, implying a linear relationship.
Positive/Negative Association and Linear Relationship

Two variables have a positive association when the values of
one variable tend to increase as the values of the other variable
increase.
Two variables have a negative association when the values of
one variable tend to decrease as the values of the other variable
increase.
Two variables have a linear relationship when the pattern of
their relationship resembles a straight line.

Curvilinear Pattern

A linear pattern is common, but it is not the only type of relationship.
Sometimes, a curve describes the pattern of a scatterplot better than a straight line does.
In this case, the relationship is called nonlinear or curvilinear.

Curvilinear Pattern
For instance, below is a scatterplot showing the relationship between song-specific age (age in the year the song was popular) and musical preference (positive score → above average, negative score → below average).

Observations from the scatterplot:


The association is curvilinear.
Musical preference is at its peak around a song-specific age of 23.5.
Scatterplot

Each point represents the (X, Y) value for one observation. A regression line is a line that fits the data well.

Scatterplot
Example: Stress and Eating Difficulties of 10 Male Students

Student   Stress (x)   Eating Difficulty (y)


A 17 9
B 8 13
C 8 7
D 20 18
E 14 11
F 7 2
G 21 5
H 22 15
I 19 26
J 30 28
Scatterplot
Example: Stress and Eating Difficulties of 10 Male Students

[Scatterplot of the data above: the pattern suggests a linear relationship. But how strong is it?]
Correlation
Example: Height and Handspan of n = 167 individuals

[Scatterplot divided into four quadrants at the point (x̄, ȳ).] For points above and to the right of (x̄, ȳ), and for points below and to the left,

$$\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) > 0;$$

for points in the other two quadrants,

$$\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) < 0.$$
Correlation
Example: Height and Handspan of n = 167 individuals

When the association is positive, most points fall in the quadrants where the product is positive, so

$$\sum \left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) > 0.$$
Correlation
To measure the strength of a linear relationship, use the Pearson correlation coefficient

$$r = \frac{1}{n-1} \sum \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right),$$

the average product of the standard scores of X and Y.

• It is unit free.
• Its sign (+ or −) indicates the direction of the association.
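A minimal sketch (not from the original slides) that applies this standard-score formula to the stress and eating-difficulties data; statistics.stdev is the sample standard deviation, matching the n − 1 in the formula:

```python
# Pearson r as the average product of standard scores.
from statistics import mean, stdev

x = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]   # stress
y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]    # eating difficulties
n = len(x)

zx = [(v - mean(x)) / stdev(x) for v in x]  # standard scores of X
zy = [(v - mean(y)) / stdev(y) for v in y]  # standard scores of Y

r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 3))  # 0.675
```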


Association

[Sketches of four point clouds: points sloping upward (r > 0, positive association); points sloping downward (r < 0, negative association); a patternless cloud (r = 0, not associated); and a clearly curved pattern that also gives r = 0 (not linear).]
Correlation coefficient
Sample statistics:

$$\bar{x} = \frac{1}{n}\sum x, \qquad \bar{y} = \frac{1}{n}\sum y$$

$$S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$

$$S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$$
Correlation
Example: Stress and Eating Difficulties

Stress (x)   Eating Difficulty (y)   Product (xy)
17            9                      153
 8           13                      104
 8            7                       56
20           18                      360
14           11                      154
 7            2                       14
21            5                      105
22           15                      330
19           26                      494
30           28                      840

∑x = 166, ∑y = 134, ∑xy = 2610, ∑x² = 3248, ∑y² = 2458

$$S_{xx} = 3248 - \frac{(166)^2}{10} = 492.4$$

$$S_{yy} = 2458 - \frac{(134)^2}{10} = 662.4$$

$$S_{xy} = 2610 - \frac{166 \times 134}{10} = 385.6$$

Correlation coefficient:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{385.6}{\sqrt{492.4 \times 662.4}} = 0.675$$

The sign indicates a positive association. But how strong is it?
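A minimal sketch (not from the original slides) that checks this computation with the shortcut formulas:

```python
# Pearson r via the S_xx, S_yy, S_xy shortcut formulas.
import math

x = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]   # stress
y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]    # eating difficulties
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n                  # 492.4
syy = sum(v * v for v in y) - sum(y) ** 2 / n                  # 662.4
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 385.6

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.675, as on the slide
```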
Correlation

−1 ≤ r ≤ 1

r = −1: perfect negative linear relationship; r = +1: perfect positive linear relationship; r = 0: no linear relationship (uncorrelated).
Correlation

The Pearson correlation coefficient is sensitive to outliers. Adding a student K with x = 60, y = 50 raises r from 0.675 to 0.899.
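A short sketch (assuming Python 3.10+, where statistics.correlation is available) of the outlier's effect:

```python
# Effect of one outlier on Pearson r.
from statistics import correlation

x = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]
y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]

print(round(correlation(x, y), 3))                # 0.675
print(round(correlation(x + [60], y + [50]), 3))  # 0.899 with student K added
```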
Rank Correlation
Rank Correlation Coefficient (Spearman’s Rho)

The Pearson correlation coefficient between the ranks of the data:

$$r_s = \frac{\sum (R_x - \bar{R}_x)(R_y - \bar{R}_y)}{\sqrt{\sum (R_x - \bar{R}_x)^2 \sum (R_y - \bar{R}_y)^2}}$$

• Robust (less sensitive to outliers)
• Can be applied to qualitative (ordinal) data


Rank Correlation
Step 1: Rank the observations for each variable.

[Table of x, Rx, y, Ry for the 11 students, A-K (including the added student K).]

• B and C have equal stress scores (a tie), so they share the same average rank: (9 + 10)/2 = 9.5.
Rank Correlation
Step 2: Calculate the Pearson correlation between the ranks.

∑Rx = 66, ∑Ry = 66, ∑RxRy = 473.5, ∑Rx² = 505.5, ∑Ry² = 506

$$S_{R_xR_x} = 505.5 - \frac{(66)^2}{11} = 109.5$$

$$S_{R_yR_y} = 506 - \frac{(66)^2}{11} = 110$$

$$S_{R_xR_y} = 473.5 - \frac{66 \times 66}{11} = 77.5$$

Spearman’s Rho:

$$r_s = \frac{77.5}{\sqrt{109.5 \times 110}} = 0.706$$
Rank Correlation
Computational formula:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}, \qquad d_i = R_{x_i} - R_{y_i}$$

• When there are no ties, it gives the exact value of the rank correlation coefficient.
• When there are not too many ties, it gives a good approximation to the rank correlation coefficient.
Rank Correlation
Example: Ranking of candidates by two vice-presidents

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 18}{8(8^2-1)} = 0.786$$

The rankings are strongly positively associated.
Correlation ≠ Causation
Example: Price and Demand for Town Gas

Year    1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
Price     30   31   37   42   43   45   50   54   54   57
Demand   134  112  136  109  105   87   56   43   77   35

Year    1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
Price     58   58   60   73   88   89   92   97  100  102
Demand    65   56   58   55   49   39   36   46   40   42

Pearson correlation coefficient: r = −0.79

Can we conclude that low demand is due to high price?


Correlation ≠ Causation

[Diagram: Time linked to both Price and Demand, with the periods 1960-1965, 1966-1973, and 1974-1979 marked, suggesting time as a third variable influencing both.]
Correlation ≠ Causation
• Do people with rare surnames live longer? A latent cause: inheritance.
• An observed correlation does not imply causation.

Source: “Surname Frequency and Lifespan”, Pablo A. Peña (2013)


Correlation for Combined Data
• Grouping may result in a deceiving correlation.

[Scatterplot of GPA against time spent for three groups of students (BBA, BEng, BSc): within each group r is positive, but for the combined data r is negative.]

Simpson’s Paradox – the direction of the relationship within subgroups is reversed compared to the direction of the relationship within the whole group.
Correlation Coefficient r

The correlation coefficient is a statistic that measures the strength and direction of a linear relationship between two quantitative variables.
Strength
It is determined by the closeness of the points to a straight line.
Direction
It is determined by whether one variable generally increases or generally decreases when the other variable increases.
Linear
When the pattern is nonlinear, the correlation coefficient is not an appropriate way to measure the strength of the relationship.
This measure is also called the Pearson product-moment correlation coefficient.

Formula of the Correlation Coefficient r

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$$

where

$$S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$

$$S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

Calculation of r
TABLE 7.5 Calculation of r from the Raw Scores of Table 7.2

STUDENT    X     Y     X²     Y²     XY
A         17     9    289     81    153
B          8    13     64    169    104
C          8     7     64     49     56
D         20    18    400    324    360
E         14    11    196    121    154
F          7     2     49      4     14
G         21     5    441     25    105
H         22    15    484    225    330
I         19    26    361    676    494
J         30    28    900    784    840
n = 10   Sum:  166   134  3,248  2,458  2,610

$$S_{xx} = \sum X^2 - \frac{(\sum X)^2}{n} = 3248 - \frac{166^2}{10} = 492.4$$

$$S_{yy} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 2458 - \frac{134^2}{10} = 662.4$$

$$S_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 2610 - \frac{(166)(134)}{10} = 385.6$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{385.6}{\sqrt{(492.4)(662.4)}} = \frac{385.6}{\sqrt{326{,}165.76}} = \frac{385.6}{571.11} = 0.675$$

Interpreting the Correlation Coefficient r

r is always between −1 and +1.
Magnitude indicates the strength of the linear relationship.
Sign indicates the direction of the association.
r > 0: the two variables tend to increase together (a positive association).
r < 0: when one variable increases, the other is likely to decrease (a negative association).
r = −1 or +1 indicates a perfect linear relationship (i.e., all data points lie on the same straight line).
r = 0 indicates that the best straight line through the data is exactly horizontal (i.e., with a slope of 0), so knowing x does not change the predicted value of y.

Interpreting the Correlation Coefficient r

[Six example scatterplots: strong positive correlation, strong negative correlation, very strong positive correlation, moderately strong positive correlation, weak connection, and a not-too-strong negative correlation.]
Rank Correlation Coefficient rs

Previously, we discussed the correlation coefficient r as a measure of the strength of a linear relationship for quantitative bivariate variables X and Y.
Rankings are qualitative (ordinal) data rather than quantitative data, even though they are numerical, so the sample correlation coefficient r cannot be used.
We now introduce the rank correlation coefficient rs (also called Spearman’s rho), which can be used to perform correlation analysis on a form of qualitative data: bivariate rankings.
The correlation coefficient r and the rank correlation coefficient rs are regarded as parametric and nonparametric counterparts.

Rank Table
As an example, consider two corporate vice-presidents who have just interviewed eight candidates for the position of personnel manager in the firm.
Each vice-president has separately contemplated the strengths and weaknesses of each candidate and has ranked the individuals from 1 = most promising to 8 = least promising. The orderings are shown in the following rank table:

Candidate   Ranking of Vice-President 1, X   Ranking of Vice-President 2, Y
Feldhoff    2                                4
Hancock     6                                6
Johnson     5                                7
Pringle     4                                3
Reilly      3                                1
Sayer       7                                5
Stephan     1                                2
Taylor      8                                8

Formula of the Rank Correlation Coefficient rs

If we wish to assess the strength of the relation between the two sets of ranks, we can compute the sample rank correlation coefficient rs.
The Spearman correlation coefficient rs is defined as the Pearson correlation coefficient between the ranks of the data:

$$r_s = \frac{\sum (R_x - \bar{R}_x)(R_y - \bar{R}_y)}{\sqrt{\sum (R_x - \bar{R}_x)^2 \sum (R_y - \bar{R}_y)^2}},$$

where Rx and Ry are the ranks of the two variables of interest, and R̄x and R̄y are the means of these ranks respectively.
Question
Find the Pearson correlation and the Spearman correlation between X and Y below:

X Y
160 26
158 24
180 19
198 58

Formula of the Rank Correlation Coefficient rs

If there are no tied ranks in the data, then the following formula also works.
Shortcut Formula:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

where

di = Rank(xi) − Rank(yi) = Rxi − Ryi (the difference between a pair of ranks)
n = the number of pairs of ranks
Calculation of rs
Candidate i   Ranking of VP 1, X   Ranking of VP 2, Y   di    di²
Feldhoff      2                    4                    −2    4
Hancock       6                    6                     0    0
Johnson       5                    7                    −2    4
Pringle       4                    3                     1    1
Reilly        3                    1                     2    4
Sayer         7                    5                     2    4
Stephan       1                    2                    −1    1
Taylor        8                    8                     0    0
n = 8                                                   ∑di² = 18

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6(18)}{8(63)} = 1 - 0.214 = 0.786$$
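A minimal sketch (not from the original slides) of this calculation:

```python
# Spearman's rho for the two vice-presidents' rankings (no ties),
# via the shortcut formula r_s = 1 - 6*sum(d_i^2) / (n(n^2 - 1)).
rank_vp1 = [2, 6, 5, 4, 3, 7, 1, 8]   # X: Feldhoff ... Taylor
rank_vp2 = [4, 6, 7, 3, 1, 5, 2, 8]   # Y

n = len(rank_vp1)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_vp1, rank_vp2))  # 18
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rs, 3))  # 0.786
```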
Interpreting Rank Correlation Coefficient rs

Like its parametric counterpart, rs is constrained to be between −1 and +1 inclusive.

rs = −1: perfectly negatively correlated. The two vice-presidents’ rankings are exactly opposite. (Refer to the following slides.)
−1 < rs < 0: negatively correlated. There is an overall disagreement between the two vice-presidents.
rs = 0: uncorrelated. The two vice-presidents’ rankings are not related.
0 < rs < 1: positively correlated. There is an overall agreement between the two vice-presidents.
rs = +1: perfectly positively correlated. The two vice-presidents agree exactly. (Refer to the following slides.)

Case of Perfectly Positive Correlation


Candidate i   Ranking of VP 1, X   Ranking of VP 2, Y   di   di²
Feldhoff      1                    1                    0    0
Hancock       2                    2                    0    0
Johnson       3                    3                    0    0
Pringle       4                    4                    0    0
Reilly        5                    5                    0    0
Sayer         6                    6                    0    0
Stephan       7                    7                    0    0
Taylor        8                    8                    0    0
n = 8                                                   ∑di² = 0

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6(0)}{8(63)} = 1 - 0 = 1$$
Case of Perfectly Negative Correlation


Candidate i   Ranking of VP 1, X   Ranking of VP 2, Y   di    di²
Feldhoff      1                    8                    −7    49
Hancock       2                    7                    −5    25
Johnson       3                    6                    −3     9
Pringle       4                    5                    −1     1
Reilly        5                    4                     1     1
Sayer         6                    3                     3     9
Stephan       7                    2                     5    25
Taylor        8                    1                     7    49
n = 8                                                   ∑di² = 168

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6(168)}{8(63)} = 1 - 2 = -1$$
When to Use rs instead of r?

Situation 1: Data are given in the form of ranks (just like the above example).
Situation 2: Data are given in the form of scores, but what matters is that one score is higher than another; how much higher is not really important. Then, translating the scores to ranks will be suitable (illustrated in the following example).

Calculation of rs from Data in the Form of Scores


Suppose an instructor is curious about the relation between the order in which the 15 members of her class completed an examination and the number of points earned on it.
She assigns a rank of 1 to the first paper turned in and succeeding ranks according to the order of completion.
After she has scored the tests, she records the order of turn-in X and the test score obtained Y, as shown in the table below.

TABLE 7.7 Calculation of rs

SUBJECT   ORDER OF TURN-IN, X   TEST SCORE, Y
A          1                    28
B          2                    21
C          3                    22
D          4                    22
E          5                    32
F          6                    36
G          7                    33
H          8                    39
I          9                    25
J         10                    30
K         11                    20
L         12                    28
M         13                    31
N         14                    38
O         15                    34
n = 15
Calculation of rs from Data in the Form of Scores


TABLE 7.7 Calculation of rs (continued)

SUBJECT   X    Y    RANK OF X (RX)   RANK OF Y (RY)   D = RX − RY   D²
A          1   28    1                6.5             −5.5           30.25
B          2   21    2                2                0.0            0.00
C          3   22    3                3.5             −0.5            0.25
D          4   22    4                3.5              0.5            0.25
E          5   32    5               10               −5.0           25.00
F          6   36    6               13               −7.0           49.00
G          7   33    7               11               −4.0           16.00
H          8   39    8               15               −7.0           49.00
I          9   25    9                5                4.0           16.00
J         10   30   10                8                2.0            4.00
K         11   20   11                1               10.0          100.00
L         12   28   12                6.5              5.5           30.25
M         13   31   13                9                4.0           16.00
N         14   38   14               14                0.0            0.00
O         15   34   15               12                3.0            9.00
n = 15                                                ∑D² = 345.00

Calculation:

$$r_s = 1 - \frac{6\sum D^2}{n(n^2-1)} = 1 - \frac{6(345)}{15(15^2-1)} = 0.38$$

She then converts the test scores to ranks, assigning a rank of 1 to the lowest score.
Where scores are tied (here 22 and 22, and 28 and 28), the instructor assigns each the average of the ranks available for them.
The set of paired ranks appears in the columns Rank of X (RX) and Rank of Y (RY).
The value of rs is then computed as above. Are there any problems here? (With ties present, the shortcut formula is only an approximation.)
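A sketch (not from the slides; assumes Python 3.10+ for statistics.correlation) that answers the question by comparing the exact rank correlation (Pearson on the tie-averaged ranks) with the shortcut-formula value:

```python
# Exact Spearman (Pearson on average ranks) vs. the shortcut formula
# when ties are present.
from statistics import correlation

def average_ranks(values):
    """Rank from 1 = smallest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                               # extend the block of ties
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1    # mean of the 1-based ranks
        i = j + 1
    return ranks

turn_in = list(range(1, 16))   # X is already a rank: 1..15
score = [28, 21, 22, 22, 32, 36, 33, 39, 25, 30, 20, 28, 31, 38, 34]

ry = average_ranks(score)
exact = correlation(turn_in, ry)                       # ~0.383
d2 = sum((rx - r) ** 2 for rx, r in zip(turn_in, ry))  # 345.0
n = 15
shortcut = 1 - 6 * d2 / (n * (n * n - 1))              # ~0.384
print(round(exact, 3), round(shortcut, 3))  # ties make the shortcut slightly off
```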
Cautions in the Use of Correlation

Bear in mind the following five cautions in the use of correlation.

1 Correlation Does Not Prove Causation
2 r and rs are Only for Linear Relationships
3 Effect of Variability
4 Effect of Discontinuity
5 Correlation for Combined Data

1. Correlation Does Not Prove Causation

If variation in X causes variation in Y, that causal connection will appear in some degree of correlation between X and Y.
However, we cannot reason backward from a correlation to a causal relationship.
We must always remember “correlation does not imply causation”.
There are at least four possible explanations of an observed correlation.

1. Correlation Does Not Prove Causation


Denote X as the explanatory variable, Y as the response variable.
[FIGURE 7.9 Possible relationships between variables X and Y that may underlie a correlation: four diagrams, (a)-(d), matching the list below.]

(a) Causation – X is a cause of Y .


(b) Reverse of causation – Y is a cause of X .
(c) A third variable influences both X and Y .
(d) A complex of interrelated variables influences X and Y .
Note: Two or more of these situations may occur simultaneously.
For example, X and Y may influence each other. (That is,
both (a) and (b).)
2. r and rs are Only for Linear Relationship

Remember that Pearson’s and Spearman’s correlation coefficients are appropriate only for linear relationships.

[FIGURE 7.10 A curvilinear relationship between X and Y to which a straight line has been fitted. Observations of age (X) and strength of grip (Y) would yield data like those plotted here.]

When the relationship is not linear, other measures of association are better.
3. Effect of Variability

The correlation coefficient is sensitive to the variability characterizing the measurements of the two variables.
For example, if a university had only minimal entrance requirements, the relationship between total SAT scores and freshman GPA might look like Fig 7.11(a).

[FIGURE 7.11 Relations between SAT scores and freshman GPA when the range is unrestricted (a) and when it is restricted (b).]

However, suppose that a more selective private university admitted students only with SAT scores of 1,200 or higher.
From the new scatterplot in Fig 7.11(b), the relationship is much weaker.
Therefore, restricting the range, whether in X, in Y, or in both, results in a lower correlation coefficient (in magnitude).
4. Effect of Discontinuity
The correlation tends to be an overestimate in discontinuous distributions.
Revisit the example of GPA vs SAT total. Suppose you made a mistake and lost the data records with GPA between 1.0 and 3.0, and you still want to compute the correlation coefficient using the remaining data. The data might look like this:

[FIGURE 7.12 Scatter diagram for discontinuous data: GPA (Y) against SAT score (X), with a region of discontinuity where the middle GPA values are missing.]

Most likely you will obtain a higher correlation than the previous one.
Usually, discontinuity, whether in X, in Y, or in both, results in a higher correlation coefficient.
5. Correlation for Combined Data

Suppose the correlation coefficient between the academic aptitude test score and the grade in a course conducted by Professor Haggerty is 0.5.
Another professor, Eagan, taught the same course. In his class, the correlation coefficient is also 0.5.
What do you think the correlation coefficient would be if we pooled the two samples? Also 0.5?
5. Correlation for Combined Data


Actually, the correlation coefficient for the pooled sample is not necessarily 0.5. It depends on where the sample values lie relative to one another in both the X and Y dimensions.

[FIGURE 7.13 Correlation resulting from the pooling of data from heterogeneous samples: two panels, (a) and (b), plotting course grade (Y) against aptitude (X) for Eagan’s class and Haggerty’s class.]

If the samples lie as in Fig 7.13(a), the correlation coefficient would be lower among the pooled data than among the separate samples.
If they lie as in Fig 7.13(b), the correlation coefficient would be higher among the pooled data than among the separate samples.
Examples of Deceiving Relationship


Outliers can substantially inflate or deflate correlations.
An outlier that is consistent with the trend of the rest of the
data will inflate the correlation.
An outlier that is not consistent with the rest of the data can
substantially decrease the correlation.

Examples of Deceiving Relationship


Example 1 (Rivkin, 1986): Highway Deaths and Speed Limits

The correlation between death rate and speed limit is 0.55.
If Italy is removed, the correlation drops to 0.098.
If Britain is then also removed, the correlation jumps to 0.70.
Examples of Deceiving Relationship


Example 2 (Utts, 2005): Ages of Husbands and Wives (r = 0.39)

A subset of data on the ages of husbands and wives, with one outlier added (82 entered instead of 28 for a husband’s age).
The correlation with the outlier removed is 0.964 – a very strong linear relationship.
Examples of Deceiving Relationship

Groups combined inappropriately may mask relationships.
The missing link is a third variable.
Simpson’s Paradox:
There are two or more groups.
The variables for each group may be strongly correlated.
When the groups are combined into one, there is very little correlation between the two variables.
Examples of Deceiving Relationship


Example 3 (Utts, 2005):
Pages versus Price for the Books on a Professor’s Shelf

The correlation is −0.312: more pages ⟹ less cost?
The scatterplot includes the book type: H = hardcover, S = softcover.
Correlation for H books: 0.64; correlation for S books: 0.35.
Combining the two types masked the positive correlations and produced an illogical negative association.
Simple Linear Regression

Regression analysis is the area of statistics that is used to examine the relationship between a quantitative response variable and one or more explanatory variables.
A key element of regression analysis is the estimation of a regression equation that describes how, on average, the response variable is related to the explanatory variables.
The simplest kind of relationship between two variables X and Y is a straight line, which is called a linear relationship.
The term simple linear regression refers to methods used to analyze a straight-line relationship with only one response variable Y (regressand) and only one explanatory variable X (regressor).

Response Variable and Explanatory Variable

In studying the relationship between two quantities, the value of the explanatory variable is thought to partially explain the value of the response variable for an individual.
Examples:
In the relationship between smoking and lung cancer, whether or not
an individual smokes is the explanatory variable, and whether or not
he or she develops lung cancer is the response variable.
If we note that people with higher education levels generally have
higher incomes, education level is the explanatory variable and income
is the response variable.
The identification of one variable as “explanatory” and the other as
“response” does not imply that there is a causal relationship. It simply
implies that knowledge of the value of the explanatory variable may help
provide knowledge about the value of the response variable for an individual.

Scatterplot with Regression Line


Revisit the example of handspan versus height.
From the scatterplot, the pattern resembles a linear relationship.
Thus, we want to plot a “best fit” line on the scatterplot to show the linear relationship.
This “best fit” line is called the regression line, which is obtained by estimating b0 and b1 in the regression equation Ŷ = b0 + b1 X.
The criterion determining which line is the “best fit” is least squares estimation.

Regression equation: Handspan = −3 + 0.35 × Height


Regression Equation

The equation for the regression line is

$$\hat{Y} = b_0 + b_1 X$$

(note that the predicted value Ŷ is not the same as the observed Y, which is unknown), where
b0 is the intercept, which is the value of Y when X = 0,
b1 is the slope, which is how much the variable Y changes for a one-unit increase in the variable X.
Purposes of the regression equation:
To estimate the average value of Y at any specified value of X.
To predict the unknown value of Y for an individual, given that individual’s value of X.

Criterion of Best Fit: Least Squares Criterion


How do we find the straight line of “best fit”?
One simple way is the least squares criterion.

[FIGURE 8.2 Discrepancies d1, ..., d7 between seven Y values and the line of regression of Y on X; each di runs from the actual value of Yi to the predicted value on the line.]

The least squares regression line has to minimize the SSE (Sum of Squared Errors) for the observed data set:

$$SSE = \sum_i (y_i - \hat{y}_i)^2 = \sum_i d_i^2$$

The term di is called the prediction error or residual, which is the difference between the observed value and the predicted value of observation i.
How to Estimate b0 and b1?


The slope is

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}.$$

The intercept is

$$b_0 = \bar{y} - b_1\bar{x}.$$

Example of Estimating b0 and b1


Suppose we are given the following data on residence size (X, in hundreds of square feet) and building material cost (Y, in thousand dollars).

Home X : Residence Size Y : Building Material Cost


1 17 46
2 29 60
3 18 42
4 19 43
5 21 50
6 21 47
7 14 39
8 24 58
9 26 53
10 28 58

Example of Estimating b0 and b1


We have to calculate six items, namely x̄, ȳ, ∑x, ∑y, ∑xy, and ∑x², in order to estimate b0 and b1.
Home   X: Residence Size   Y: Building Material Cost   XY      X²
1      17                  46                          782     289
2      29                  60                          1740    841
3      18                  42                          756     324
4      19                  43                          817     361
5      21                  50                          1050    441
6      21                  47                          987     441
7      14                  39                          546     196
8      24                  58                          1392    576
9      26                  53                          1378    676
10     28                  58                          1624    784
∑      217                 496                         11072   4929
∑/n    21.7                49.6                        —       —

Note: x̄ = 21.7, ȳ = 49.6, ∑x = 217, ∑y = 496, ∑xy = 11072, ∑x² = 4929.
The slope is

$$b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{11072 - \frac{(217)(496)}{10}}{4929 - \frac{(217)^2}{10}} = 1.4030.$$

The intercept is

$$b_0 = \bar{y} - b_1\bar{x} = 49.6 - 1.4030(21.7) = 19.1549.$$

Therefore, the regression equation is

$$\hat{Y} = 19.1549 + 1.4030X.$$

Note: To avoid large rounding errors in your final results, it is a good idea to keep the decimal values such as ȳ, x̄ and b1 in the memory of your calculator as you work.
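A minimal sketch (not from the original slides) of the same least-squares computation:

```python
# Least-squares slope and intercept for the residence-size data.
x = [17, 29, 18, 19, 21, 21, 14, 24, 26, 28]   # size (hundreds of sq ft)
y = [46, 60, 42, 43, 50, 47, 39, 58, 53, 58]   # cost (thousand dollars)
n = len(x)

sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 308.8
sxx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 220.1
b1 = sxy / sxx                        # slope ~ 1.4030
b0 = sum(y) / n - b1 * sum(x) / n     # intercept ~ 19.1549
print(round(b1, 4), round(b0, 4))

# Prediction at X = 25 (a 2500 sq ft house, as on the next slide):
print(round(b0 + b1 * 25, 2))  # 54.23 with full precision;
                               # the slide's rounded coefficients give 54.15
```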
Prediction of Y at a Particular Value of X

We can then use the regression equation to predict the building material cost Y at a particular residence size X.
For example, suppose the next contract that the builder signs calls for a house with 2500 square feet (X = 25).
The building material cost is predicted to be Ŷ = 19.15 + 1.40(25) = 54.15, or about $54,150.

Prediction Errors and Residuals

How good is the prediction of the regression?
Put the observed X values into the regression equation to get the predicted Ŷ values, and compare them with the observed Y values.
The prediction error is the difference:

prediction error = Y − Ŷ

The amount by which an individual value differs from the regression line value can be due to natural variation rather than “errors” in the measurements.
Thus, a more neutral term, the residual of an individual, is sometimes used instead of the prediction error.
Interpreting the Squared Correlation, r²


Recall that the correlation, r, measures the strength and direction of a linear relationship between two quantitative variables, and has a value between −1 and 1.
The squared correlation, r², has a value between 0 and 1; it retains information about the strength of the relationship but loses information about the direction.
However, it gives the proportion of variation explained by the regressor.
For example,

r = 0.6 ⟹ r² = 0.36 = 36%

That means the explanatory variable explains 36% of the variation among the observed values of the response variable.
This interpretation stems from the use of the least squares line as a prediction tool.
Explained Error and Unexplained Error

Note that at X = 20, Ŷ = 517, and at X = 63, Ŷ = 388. [Figure illustrating the explained and unexplained errors about the regression line.]
Sum of Squares in Regression

To analyse the explained and unexplained errors over the entire sample, consider their sums of squares, which get rid of the negative signs.

1 Total errors: total variation / sum of squares total (SST)

$$SST = \sum (y - \bar{y})^2$$

2 Unexplained residuals (prediction errors): sum of squared errors (SSE)

$$SSE = \sum (y - \hat{y})^2$$

3 Errors explained by regression: sum of squares due to regression (SSR)

$$SSR = \sum (\hat{y} - \bar{y})^2$$
Sum of Squares in Regression


Two important results:

1 SST = SSR + SSE

2 $$r^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST}$$
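A sketch (not from the slides) that verifies both results numerically on the residence-size data used above:

```python
# Check SST = SSR + SSE and r^2 = SSR/SST for the fitted line.
x = [17, 29, 18, 19, 21, 21, 14, 24, 26, 28]
y = [46, 60, 42, 43, 50, 47, 39, 58, 53, 58]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b1 = (sum(a * b for a, b in zip(x, y)) - n * xbar * ybar) / \
     (sum(a * a for a in x) - n * xbar ** 2)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * v for v in x]                    # fitted values

sst = sum((v - ybar) ** 2 for v in y)              # total variation
sse = sum((v - h) ** 2 for v, h in zip(y, yhat))   # unexplained
ssr = sum((h - ybar) ** 2 for h in yhat)           # explained by regression

print(round(sst, 2), round(ssr + sse, 2))  # 494.4 494.4 -- they match
print(round(ssr / sst, 3))                 # r^2 ~ 0.876
```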

Extrapolation
It is risky to use a regression equation to predict values outside the range of the observed data, a process called extrapolation, because there is no guarantee that the relationship will continue to hold beyond the range for which we have observed data.
Examples:
A regression equation relating weight to height:
Weight = −180 + 5 × Height
This equation should work well for adults, but not for children: the weight of a boy who is 36 inches tall would be estimated to be 0 pounds.
A straight-line relationship between y = winning time in the Olympic women’s 100 m backstroke swim and x = Olympic year:
This straight line could be used to predict the winning time in the near future, but should not be used to predict the time in the year 3000.
Extension of Simple Linear Regression

In simple linear regression, a response variable (Y) is regressed on only one explanatory variable (X).
We can also regress a response variable on many explanatory variables. This kind of regression is called multiple linear regression, and it yields a regression equation rather than a regression line. (It will be discussed in detail in the course STAT3600 – Linear Statistical Analysis.)
In addition, we can regress many response variables on many explanatory variables. This kind of regression is called multivariate linear regression and is much more complicated. (It will be discussed in detail in the course STAT4602 – Multivariate Data Analysis.)
