Chapter 2:
Correlation and Regression
Chung, LI (SAAS , HKU ) STAT1600B Statistics: Ideas and Concepts 2017-2018 (Sem 2) Ch 2 1 / 63
Introduction
Outline
1 Introduction
2 Scatterplot
3 Correlation Coefficient r
4 Rank Correlation Coefficient rs
5 Cautions in the Use of Correlation
6 Simple Linear Regression
Scatterplot
A scatterplot is a two-dimensional graph of data values, used to reveal graphically any relationship between two variables.
Suppose we have the bivariate data in the table below.
TABLE 7.2 Bivariate Data: Scores for 10 Male College Students on Two Self-Report Measures
Scatterplot
The scatterplot lets us see at a glance the nature of the relationship, if any exists, between the two variables.
[FIGURE 7.1 Scatter diagram of the bivariate distribution of stress scores (x-axis) and eating-difficulties scores (y-axis) for 10 male college students, points labelled A-J. Data from Table 7.2.]
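As a minimal sketch, a Figure 7.1-style plot of the Table 7.2 data can be produced along these lines (matplotlib is assumed to be available; the output file name is arbitrary):

```python
# Scatterplot of the stress / eating-difficulties scores from Table 7.2.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

stress = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]
eating = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]
labels = list("ABCDEFGHIJ")

fig, ax = plt.subplots()
ax.scatter(stress, eating)
for x, y, lab in zip(stress, eating, labels):
    ax.annotate(lab, (x, y))  # label each student's point, as in Figure 7.1
ax.set_xlabel("Stress")
ax.set_ylabel("Eating difficulties")
fig.savefig("figure_7_1.png")
```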
Scatterplot
Again, suppose we have data on two variables, height and handspan.
Scatterplot
We can then plot a scatterplot to investigate the relationship between height and handspan.
Scatterplot
Curvilinear Pattern
Scatterplot
Curvilinear Pattern
For instance, below is a scatterplot showing the relationship between song-specific age (age in the year the song was popular) and musical preference (positive score → above average, negative score → below average).
[Annotated scatterplot with a fitted regression line: the relationship is linear, but how strong is it?]
Correlation
Example: Height and Handspan of n = 167 individuals
For a point in the upper-right or lower-left quadrant relative to the mean point (x̄, ȳ):
$$\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) > 0$$
For a point in the upper-left or lower-right quadrant:
$$\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) < 0$$
Correlation
Example: Height and Handspan of n = 167 individuals
$$\sum \left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) > 0 \quad \Rightarrow \quad \text{positive association}$$
Correlation
The Pearson correlation coefficient measures the strength of the linear relationship:
$$r = \frac{1}{n-1} \sum \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
[Four scatterplots: r > 0 (positive association), r < 0 (negative association), r = 0 (not associated), r = 0 (related, but not linearly)]
Correlation coefficient
Sample statistics:
$$\bar{x} = \frac{1}{n}\sum x, \qquad \bar{y} = \frac{1}{n}\sum y$$
$$S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$$
Correlation
Stress (x)   Eat Difficulty (y)   Product (xy)
17           9                    153
8            13                   104
8            7                    56
20           18                   360
14           11                   154
7            2                    14
21           5                    105
22           15                   330
19           26                   494
30           28                   840
Σx = 166, Σy = 134, Σxy = 2610, Σx² = 3248, Σy² = 2458
$$S_{xx} = 3248 - \frac{166^2}{10} = 492.4$$
$$S_{yy} = 2458 - \frac{134^2}{10} = 662.4$$
$$S_{xy} = 2610 - \frac{166 \times 134}{10} = 385.6$$
Correlation coefficient:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{385.6}{\sqrt{492.4 \times 662.4}} = 0.675$$
Positive association. How strong?
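The worked example above can be checked in a few lines of Python (standard library only):

```python
# Verify S_xx, S_yy, S_xy and r for the stress (x) / eating-difficulty (y)
# scores of Table 7.2, using the computational formulas.
import math

x = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]
y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n                  # 3248 - 166^2/10 = 492.4
syy = sum(v * v for v in y) - sum(y) ** 2 / n                  # 2458 - 134^2/10 = 662.4
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 2610 - 166*134/10 = 385.6

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.675
```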
Correlation
$$-1 \le r \le 1$$
r = -1: perfect negative linear relationship.
r = +1: perfect positive linear relationship.
r = 0: no linear relationship (uncorrelated).
Correlation
The Pearson correlation coefficient is sensitive to outliers.
[Two scatterplots of the same data, without and with a single outlying point: r = 0.675 vs r = 0.899]
Rank Correlation
Rank Correlation Coefficient (Spearman's Rho)
$$r_s = \frac{\sum (R_x - \bar{R}_x)(R_y - \bar{R}_y)}{\sqrt{\sum (R_x - \bar{R}_x)^2 \sum (R_y - \bar{R}_y)^2}}$$
With n = 11 pairs of ranks and ΣRx = 66, ΣRy = 66, ΣRxRy = 473.5, ΣRx² = 505.5, ΣRy² = 506:
$$S_{R_x R_x} = 505.5 - \frac{66^2}{11} = 109.5$$
$$S_{R_y R_y} = 506 - \frac{66^2}{11} = 110$$
$$S_{R_x R_y} = 473.5 - \frac{66 \times 66}{11} = 77.5$$
Spearman's Rho:
$$r_s = \frac{77.5}{\sqrt{109.5 \times 110}} = 0.706$$
Rank Correlation
Computational formula (when there are no tied ranks):
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, \qquad d_i = R_x - R_y$$
For the example:
$$r_s = 1 - \frac{6 \times 18}{8(8^2 - 1)} = 0.786$$
Strongly positively associated.
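A quick check of the shortcut formula, using the two vice-presidents' candidate rankings from the rank-table example later in this chapter:

```python
# Spearman's rho via the shortcut formula r_s = 1 - 6*sum(d_i^2) / (n(n^2-1)).
rx = [2, 6, 5, 4, 3, 7, 1, 8]  # VP 1's ranking of the eight candidates
ry = [4, 6, 7, 3, 1, 5, 2, 8]  # VP 2's ranking

n = len(rx)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences = 18
rs = 1 - 6 * d2 / (n * (n * n - 1))
print(round(rs, 3))  # 0.786
```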
Correlation ≠ Causation
Example: Price and Demand for town gas
Year 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
Price 30 31 37 42 43 45 50 54 54 57
Demand 134 112 136 109 105 87 56 43 77 35
Year 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
Price 58 58 60 73 88 89 92 97 100 102
Demand 65 56 58 55 49 39 36 46 40 42
[Diagram: Time is a lurking variable driving both Price and Demand, with sub-periods 1960-1965, 1966-1973, and 1974-1979.]
Correlation ≠ Causation
• People with rare surnames live longer? Latent cause: inheritance.
[Diagram: GPA vs time spent, with BEng and BSc students forming separate groups.]
Correlation Coefficient r
Correlation Coefficient r
Formula:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$$
where
$$S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
Correlation Coefficient r
Calculation of r
TABLE 7.5 Calculation of r from the Raw Scores of Table 7.2
STUDENT   X    Y    X²    Y²     XY
A         17   9    289   81     153
B         8    13   64    169    104
C         8    7    64    49     56
D         20   18   400   324    360
E         14   11   196   121    154
F         7    2    49    4      14
G         21   5    441   25     105
H         22   15   484   225    330
I         19   26   361   676    494
J         30   28   900   784    840
n = 10    Sum: 166  134   3,248  2,458  2,610
$$SS_X = \sum X^2 - \frac{(\sum X)^2}{n} = 3{,}248 - \frac{166^2}{10} = 492.4$$
$$SS_Y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 2{,}458 - \frac{134^2}{10} = 662.4$$
$$\sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 2{,}610 - \frac{(166)(134)}{10} = 385.6$$
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{(SS_X)(SS_Y)}} = \frac{385.6}{\sqrt{(492.4)(662.4)}} = \frac{385.6}{571.11} = 0.675$$
Correlation Coefficient r
[Six example scatterplots: strong positive correlation, strong negative correlation, very strong positive correlation, moderately strong positive correlation, weak connection, not-too-strong negative correlation]
Rank Correlation Coefficient rs
Rank Correlation Coefficient rs
Rank Table
As an example, consider two corporate vice-presidents who have just interviewed eight candidates for the position of personnel manager in the firm.
Each vice-president has separately weighed the strengths and weaknesses of each candidate and has ranked the individuals from 1 = most promising to 8 = least promising. The orderings are shown in the following rank table:
Rank Correlation Coefficient rs
Question
Find the Pearson correlation and Spearman correlation between X
and Y below:
X Y
160 26
158 24
180 19
198 58
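One way to verify your answers afterwards is to compute both coefficients directly from the definitions. There are no tied ranks here, so the shortcut formula applies; the `ranks` helper below is not part of the notes, just an illustration:

```python
# Pearson and Spearman correlation for the four (X, Y) pairs in the question.
import math

x = [160, 158, 180, 198]
y = [26, 24, 19, 58]
n = len(x)

# Pearson: r = S_xy / sqrt(S_xx * S_yy), via the computational formulas
sxx = sum(v * v for v in x) - sum(x) ** 2 / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
r = sxy / math.sqrt(sxx * syy)

# Spearman: replace the values by their ranks, then apply the shortcut formula
def ranks(values):
    order = sorted(values)
    return [order.index(v) + 1 for v in values]  # rank 1 = smallest (no ties here)

d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
rs = 1 - 6 * d2 / (n * (n * n - 1))
print(round(r, 3), rs)
```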
Rank Correlation Coefficient rs
If there are no tied ranks in the data, then the following formula also works.
Shortcut Formula:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where
$$d_i = \text{Rank}(x_i) - \text{Rank}(y_i) = R_{x_i} - R_{y_i} \quad \text{(difference between a pair of ranks)}$$
n = the number of pairs of ranks
Rank Correlation Coefficient rs
Calculation of rs
Candidate i   Ranking of VP 1, X   Ranking of VP 2, Y   di    di²
Feldhoff      2                    4                    -2    4
Hancock       6                    6                    0     0
Johnson       5                    7                    -2    4
Pringle       4                    3                    1     1
Reilly        3                    1                    2     4
Sayer         7                    5                    2     4
Stephan       1                    2                    -1    1
Taylor        8                    8                    0     0
n = 8, Σdi² = 18
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(18)}{8(63)} = 1 - 0.214 = 0.786$$
Rank Correlation Coefficient rs
If the two rankings agree perfectly, every di = 0:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(0)}{8(63)} = 1 - 0 = 1$$
Rank Correlation Coefficient rs
If the two rankings are exactly reversed:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(168)}{8(63)} = 1 - 2 = -1$$
Rank Correlation Coefficient rs
A 1 28
B 2 21
C 3 22
D 4 22
E 5 32
F 6 36
G 7 33
H 8 39
I 9 25
J 10 30
K 11 20
L 12 28
M 13 31
N 14 38
O 15 34
n = 15
Rank Correlation Coefficient rs
Calculation:
$$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} = 1 - \frac{6(345)}{15(15^2 - 1)} = 0.38$$
She then converts the test scores to ranks, assigning a rank of 1 to the lowest score.
Since two scores are tied, the instructor assigns each of them the average of the ranks available for them.
The set of paired ranks appears in the columns Rank of X (RX) and Rank of Y (RY).
The value of rs is then computed as above. Are there any problems here?
Cautions in the Use of Correlation
Cautions in the Use of Correlation
[Two scatterplots, (a) and (b), showing nonlinear patterns]
When the data for one or both variables are not linear, other measures of association are better.
Cautions in the Use of Correlation
3. Effect of Variability
The correlation coefficient is sensitive to the variability characterizing the measurements of the two variables.
For example, if a university had only minimal entrance requirements, the relationship between total SAT scores and freshman GPA might look like Fig 7.11 (a).
[FIGURE 7.11 Relations between SAT scores and freshman GPA when range is unrestricted (a) and when it is restricted (b). Axes: SAT score vs Freshman GPA (1.0 to 4.0).]
However, suppose that a more selective private university admitted students only with SAT scores of 1,200 or higher.
From the new scatterplot in Fig (b), the relationship is much weaker.
Therefore, restricting the range, whether in X, in Y, or in both, results in a lower correlation coefficient (in magnitude).
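The range-restriction effect can be illustrated with a small deterministic example (synthetic data, purely for illustration; this is not the SAT/GPA data of Fig 7.11):

```python
# Range restriction lowers r: the same line-plus-noise data give a smaller
# correlation when x is restricted to a narrow band.
import math

def pearson(x, y):
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / math.sqrt(sxx * syy)

x = list(range(20))
y = [xi + 3 * (-1) ** i for i, xi in enumerate(x)]  # line plus alternating "noise"

r_full = pearson(x, y)

restricted = [(xi, yi) for xi, yi in zip(x, y) if xi >= 12]  # keep only large x
xr = [p[0] for p in restricted]
yr = [p[1] for p in restricted]
r_restricted = pearson(xr, yr)

print(round(r_full, 2), round(r_restricted, 2))  # 0.88 vs 0.49
```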
Cautions in the Use of Correlation
4. Effect of Discontinuity
The correlation tends to be an overestimate in discontinuous distributions.
Revisit the example of GPA vs SAT total. Suppose you made a mistake and lost the data records with GPA between 1.0 and 3.0, but you still want to compute the correlation coefficient using the remaining data. The data might look like this:
[FIGURE 7.12 Scatter diagram for discontinuous data: Y = GPA against X = SAT score, with a region of discontinuity in the middle.]
Most likely you will obtain a higher correlation than the previous one.
Usually, discontinuity, whether in X, in Y, or in both, results in a higher correlation coefficient.
Cautions in the Use of Correlation
[FIGURE 7.13 Correlation resulting from the pooling of data from heterogeneous samples: Y = course grade against X = aptitude for Eagan's class and Haggerty's class, panels (a) and (b).]
If the samples lie as in Fig 7.13 (a), the correlation coefficient would be lower among the pooled data than among the separate samples.
If the samples lie as in Fig 7.13 (b), the correlation coefficient would be higher among the pooled data than among the separate samples.
Simple Linear Regression
Simple Linear Regression
Regression Equation
$$\hat{Y} = b_0 + b_1 X$$
Note that Ŷ, the predicted value, is not the same as Y, which is unknown.
where
b0 is the intercept, the predicted value of Y when X = 0,
b1 is the slope, how much the predicted value of Y changes for a one-unit increase in X.
Purposes of the regression equation:
To estimate the average value of Y at any specified value of X.
To predict the unknown value of Y for an individual, given that individual's value of X.
Simple Linear Regression
[FIGURE 8.2 Discrepancies d1, ..., d7 between seven Y values and the line of regression of Y on X; each di runs from the actual value of Yi to the predicted value on the line.]
The least squares regression line minimizes the SSE (Sum of Squared Errors) for the observed data set:
$$SSE = \sum_i (y_i - \hat{y}_i)^2 = \sum_i d_i^2$$
The term di is called the prediction error or residual: the difference between the observed value and the predicted value of observation i.
Simple Linear Regression
The slope is
$$b_1 = \frac{S_{xy}}{S_{xx}}$$
and the intercept is
$$b_0 = \bar{y} - b_1 \bar{x}$$
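A sketch of these least squares formulas in Python, fitted to the stress / eating-difficulties data of Table 7.2 (an illustration only; this is not the dataset behind the fitted equation shown later in the slides):

```python
# Least squares fit: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar.
x = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]  # stress
y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]   # eating difficulties
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum(v * v for v in x) - sum(x) ** 2 / n                  # 492.4
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 385.6

b1 = sxy / sxx         # slope ≈ 0.783
b0 = ybar - b1 * xbar  # intercept ≈ 0.40

yhat = [b0 + b1 * xi for xi in x]                     # predicted values
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # sum of squared errors
print(round(b1, 3), round(b0, 3))
```

Any other line through the data would give a larger SSE; that is what "least squares" means.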
Simple Linear Regression
The fitted regression equation is
$$\hat{Y} = 19.1549 + 1.4030X$$
Simple Linear Regression
$$\text{prediction error} = Y - \hat{Y}$$
Simple Linear Regression
To analyse the explained and unexplained deviations over the entire sample, consider their sums of squares, which removes the negative signs.
Simple Linear Regression
Extrapolation
It is risky to use a regression equation to predict values outside the range of the observed data, a process called extrapolation, because there is no guarantee that the relationship will continue to hold beyond the range for which we have observed data.
Examples:
Regression equation relating weight to height:
Weight = -180 + 5 × (Height)
This equation should work well for adults, but not for children: the weight of a boy who is 36 inches tall would be estimated to be 0 pounds.
Straight-line relationship between y = winning time in the Olympic women's 100 m backstroke swim and x = Olympic year: this line could be used to predict the winning time in the near future, but should not be used to predict the time in the year 3000.
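The weight-height example can be made concrete in code (the intercept is written as -180 so that a 36-inch height predicts 0 pounds, consistent with the text):

```python
# Extrapolation demo: a fitted line that is sensible for adults but absurd
# for children, because 36 inches lies outside the fitted range.
def predicted_weight(height_in):
    """Weight = -180 + 5 * Height (pounds, inches)."""
    return -180 + 5 * height_in

print(predicted_weight(70))  # 170 pounds for a 70-inch adult: plausible
print(predicted_weight(36))  # 0 pounds for a 36-inch boy: nonsense
```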