Correlation and Regression: Fathers' and Daughters' Heights
[Figures: histogram of fathers’ heights (mean = 67.7 in, SD = 2.8 in), histogram of daughters’ heights (mean = 63.8 in, SD = 2.7 in), both on a 55–75 inch scale, and a scatterplot of daughter’s height against father’s height with corr = 0.52.]
Covariance and correlation

cov(X, Y) = E{ (X − µX) (Y − µY) }

cor(X, Y) = cov(X, Y) / (σX σY)
Estimated correlation
Consider n pairs of data: (x1, y1), (x2, y2), (x3, y3), . . . , (xn, yn)

−→ Covariance / correlation:

côv(x, y) = Σi (xi − x̄)(yi − ȳ) / (n − 1)

r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² × Σi (yi − ȳ)² ]

−→ Regression
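A minimal R sketch of these estimated quantities, using made-up paired data (the height data themselves are not reproduced here), comparing the formulas above with the built-in cov() and cor():

# Illustrative paired data (not the actual father/daughter heights)
x <- c(65.0, 67.2, 70.1, 66.5, 68.9, 71.3, 64.2, 69.0)
y <- c(62.1, 63.0, 66.2, 61.8, 64.5, 65.9, 60.7, 64.0)
n <- length(x)

# Sample covariance and correlation from the definitions
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cor_xy <- cov_xy / (sd(x) * sd(y))

c(cov_xy, cov(x, y))   # should agree
c(cor_xy, cor(x, y))   # should agree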
[Figures: OD versus H2O2 concentration (0, 10, 25, 50), for sample A alone and for samples A and B together; OD ranges from about 0.10 to 0.35.]
Linear regression
[Figure: example regression lines Y = 20 + 15X, Y = 40 + 8X, Y = 70 + 0X, and Y = 0 + 5X, plotted for X from 0 to 12.]
Linear regression
The model:

yi = β0 + β1 xi + εi,  with the εi independent, mean 0, and common variance σ².

[Figure: scatterplot of Y against X with the fitted regression line.]

This implies:

E[Y | X] = β0 + β1 X.

Interpretation:

For two subjects that differ by one unit in X, we expect the responses to differ by β1.
We can write

εi = yi − β0 − β1 xi

For a pair of estimates (β̂0, β̂1) of the parameters (β0, β1), we define the fitted values

ŷi = β̂0 + β̂1 xi

and the residuals

ε̂i = yi − ŷi = yi − β̂0 − β̂1 xi
Residuals
[Figure: scatterplot with the fitted line ŷ; a residual ε̂ is the vertical distance from an observed point to the line.]
Residual sum of squares
For every pair of values for β0 and β1 we get a different value for
the residual sum of squares.
RSS(β0, β1) = Σi (yi − β0 − β1 xi)²

[Figure: surface of RSS over a grid of candidate values b0 and b1.]
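As a quick sketch, the residual sum of squares can be written as an R function of the candidate intercept and slope and evaluated anywhere; the data used here are the treatment-A ODs from the example that follows:

# Treatment A data: OD at H2O2 concentrations 0, 10, 25, 50 (three replicates each)
x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)

# RSS as a function of a candidate intercept b0 and slope b1
rss <- function(b0, b1) sum((y - b0 - b1 * x)^2)

rss(0.35, -0.004)   # near the least-squares values: small RSS
rss(0.30,  0.000)   # a poorer choice: larger RSS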
Notation

SXX = Σi (xi − x̄)²,  SYY = Σi (yi − ȳ)²,  SXY = Σi (xi − x̄)(yi − ȳ)
Parameter estimates
The function

RSS(β0, β1) = Σi (yi − β0 − β1 xi)²

is minimized by

β̂1 = SXY / SXX  and  β̂0 = ȳ − β̂1 x̄
Useful to know
Using the parameter estimates, our best guess for y at any given x is

ŷ = β̂0 + β̂1 x

Hence, at x = x̄ we get ŷ = β̂0 + β̂1 x̄ = (ȳ − β̂1 x̄) + β̂1 x̄ = ȳ.

That means every regression line goes through the point (x̄, ȳ).
Variance estimates
We estimate the error variance σ² by

σ̂² = RSS(β̂0, β̂1) / (n − 2)

This quantity is called the residual mean square. It has the following properties:

(n − 2) σ̂² / σ² ∼ χ² with n − 2 d.f.

E(σ̂²) = σ²
Example
H2O2 concentration:    0        10       25       50
OD:                    0.3399   0.3168   0.2460   0.1535
                       0.3563   0.3054   0.2618   0.1613
                       0.3538   0.3174   0.2848   0.1525

We get

x̄ = 21.25,  ȳ = 0.27,  SXX = 4256.25,  SXY = −16.48,  RSS = 0.0013.

Therefore

β̂1 = −16.48 / 4256.25 = −0.0039

β̂0 = 0.27 − (−0.0039) × 21.25 = 0.353

σ̂ = √( 0.0013 / (12 − 2) ) = 0.0115
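The same calculation in R, first from the formulas above and then checked against lm():

# OD measurements at the four H2O2 concentrations (three replicates each)
x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)

n   <- length(x)
sxx <- sum((x - mean(x))^2)                # 4256.25
sxy <- sum((x - mean(x)) * (y - mean(y)))  # about -16.48

b1 <- sxy / sxx                   # about -0.0039
b0 <- mean(y) - b1 * mean(x)      # about  0.353
rss <- sum((y - b0 - b1 * x)^2)   # about  0.0013
sigma_hat <- sqrt(rss / (n - 2))  # about  0.0115

coef(lm(y ~ x))                   # same intercept and slope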
Example
[Figure: OD versus H2O2 concentration with the fitted regression line.]
Comparing models
H0 : yi = β0 + εi  versus  Ha : yi = β0 + β1 xi + εi
Example
[Figure: OD versus H2O2 concentration showing the fit under Ha (the regression line) and the fit under H0 (a horizontal line at ȳ).]
Sum of squares
Under Ha :

RSS = Σi (yi − ŷi)² = SYY − (SXY)² / SXX = SYY − β̂1² × SXX

Under H0 :

Σi (yi − β̂0)² = Σi (yi − ȳ)² = SYY

Hence

SSreg = SYY − RSS = (SXY)² / SXX
ANOVA
Source                      df      SS       MS                    F
regression on X             1       SSreg    MSreg = SSreg / 1     MSreg / MSE
residuals for full model    n − 2   RSS      MSE = RSS / (n − 2)

For the example:

Source              df    SS        MS        F
regression on X      1    0.0638    0.0638    ≈ 485
residuals           10    0.0013    0.00013
total               11    0.06509
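The example table can be obtained directly with anova() (a sketch using the data above):

x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)

anova(lm(y ~ x))   # df, SS, MS and F for the regression and residual rows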
Parameter estimates
E(β̂0) = β0  and  E(β̂1) = β1

Var(β̂0) = σ² ( 1/n + x̄²/SXX )  and  Var(β̂1) = σ² / SXX

Cov(β̂0, β̂1) = −σ² x̄ / SXX  and  Cor(β̂0, β̂1) = −x̄ / √( x̄² + SXX/n )

One can even show that the distribution of β̂0 and β̂1 is a bivariate normal distribution:

( β̂0 , β̂1 )ᵀ ∼ N( β , Σ )

where

β = ( β0 , β1 )ᵀ  and  Σ = σ² ( 1/n + x̄²/SXX    −x̄/SXX
                                −x̄/SXX          1/SXX )
Simulation: coefficients
[Figure: estimated slope (roughly −0.0034 to −0.0044) plotted against the estimated y-intercept for many simulated data sets.]
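A sketch of such a simulation, assuming the fitted line and error SD from the earlier example (the exact settings behind the figure are not stated):

set.seed(1)
x     <- rep(c(0, 10, 25, 50), each = 3)
beta0 <- 0.353; beta1 <- -0.0039; sigma <- 0.0115

# Simulate many data sets from the model and re-estimate (beta0, beta1) each time
sims <- replicate(1000, coef(lm(rnorm(length(x), beta0 + beta1 * x, sigma) ~ x)))

rowMeans(sims)             # close to (beta0, beta1): the estimates are unbiased
cor(sims[1, ], sims[2, ])  # strong negative correlation between the two estimates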
Possible outcomes
[Figure: OD versus H2O2 concentration (0 to 50) showing the regression lines fitted to many simulated data sets.]
Confidence intervals
We know that

β̂0 ∼ N( β0 , σ² ( 1/n + x̄²/SXX ) )

β̂1 ∼ N( β1 , σ² / SXX )

We use

t = ( β̂1 − β1* ) / se(β̂1) ∼ t with n − 2 d.f.,  where  se(β̂1) = √( σ̂² / SXX )

Also,

( β̂1 − t(1−α/2, n−2) × se(β̂1) ,  β̂1 + t(1−α/2, n−2) × se(β̂1) )

is a 1 − α confidence interval for β1.
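In R this interval is available from confint(), or can be computed by hand (a sketch with the example data):

x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)
fit <- lm(y ~ x)

# By hand: estimate +/- t quantile * standard error
b1  <- coef(fit)[["x"]]
se1 <- summary(fit)$coefficients["x", "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = length(x) - 2) * se1

confint(fit, "x", level = 0.95)   # the same interval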
Results
t = β̂1 / se(β̂1)

t² = ( β̂1 / se(β̂1) )² = β̂1² / ( σ̂²/SXX ) = β̂1² × SXX / σ̂² = ( (SYY − RSS) / 1 ) / ( RSS / (n − 2) ) = MSreg / MSE = F

A 95% joint confidence region for the two parameters is the set of all values (β0, β1) that fulfill

( Δβ0  Δβ1 ) [ n , Σi xi ; Σi xi , Σi xi² ] ( Δβ0 , Δβ1 )ᵀ / ( 2 σ̂² )  ≤  F(0.95; 2, n−2)

where Δβ0 = β̂0 − β0 and Δβ1 = β̂1 − β1.

[Figure: the resulting elliptical joint confidence region for (β̂0, β̂1).]
Notation
We previously defined

SXX = Σi (xi − x̄)² = Σi xi² − n x̄²

SYY = Σi (yi − ȳ)² = Σi yi² − n ȳ²

SXY = Σi (xi − x̄)(yi − ȳ) = Σi xi yi − n x̄ ȳ

We also define

rXY = SXY / ( √SXX √SYY )    (called the sample correlation)
Coefficient of determination
We previously wrote

SSreg = SYY − RSS = (SXY)² / SXX

Define

R² = SSreg / SYY = 1 − RSS / SYY

Then

R² = SSreg / SYY = (SXY)² / (SXX × SYY) = r²XY
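A quick check of this identity in R, using the example data once more:

x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)

summary(lm(y ~ x))$r.squared   # R^2 from the regression
cor(x, y)^2                    # squared sample correlation: the same number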
Fathers’ and daughters’ heights
[Figure: daughter’s height versus father’s height (inches); corr = 0.52.]
Linear regression
[Figure: daughter’s height versus father’s height with the fitted regression line.]
Regression line
[Figure: a sequence of scatterplots of daughter’s height versus father’s height (inches) building up to the regression line.]
[Figure: OD versus H2O2 concentration, using the fitted line to predict OD ≈ 0.218 at a new concentration of 35.]
−→ We can use the regression results to predict the expected response for a new
concentration of hydrogen peroxide. But what is its variability?
Variability of the mean response
Let ŷ = β̂0 + β̂1 x denote the estimated mean response at a value x. Then

E(ŷ) = β0 + β1 x

var(ŷ) = σ² ( 1/n + (x − x̄)² / SXX )

Why? Because

var(ŷ) = var(β̂0) + x² var(β̂1) + 2 x cov(β̂0, β̂1) = σ² ( 1/n + x̄²/SXX ) + x² σ²/SXX − 2 x σ² x̄/SXX = σ² ( 1/n + (x − x̄)²/SXX )

Hence a 1 − α confidence interval for the mean response at x is

ŷ ± t(1−α/2, n−2) × σ̂ × √( 1/n + (x − x̄)² / SXX )
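This is the interval that predict() returns with interval = "confidence"; a sketch for a new concentration of 35:

x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)
fit <- lm(y ~ x)

# 95% confidence interval for the mean response at x = 35
predict(fit, newdata = data.frame(x = 35), interval = "confidence")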
Confidence limits
[Figure: OD versus H2O2 concentration with the fitted line and pointwise 95% confidence limits for the mean response.]
Prediction
Now suppose we want to predict a new observation y* at a given x, using ŷ* = β̂0 + β̂1 x. The prediction error y* − ŷ* has variance

σ² + σ² ( 1/n + (x − x̄)²/SXX ) = σ² ( 1 + 1/n + (x − x̄)²/SXX )

Prediction intervals

Hence a 1 − α prediction interval for y* is

ŷ* ± t(1−α/2, n−2) × σ̂ × √( 1 + 1/n + (x − x̄)²/SXX )

which for large n is approximately

ŷ* ± t(1−α/2, n−2) × σ̂
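The corresponding R call uses interval = "prediction":

x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)
fit <- lm(y ~ x)

# 95% prediction interval for a single new OD measurement at x = 35
predict(fit, newdata = data.frame(x = 35), interval = "prediction")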
Prediction intervals
[Figure: OD versus H2O2 concentration with the fitted line and pointwise 95% prediction intervals.]
[Figure: height (inches) versus span (inches).]
With just 100 individuals
[Figure: height versus span for a sample of just 100 individuals.]
That prediction interval is for the case that the x’s are known without error while the y’s are measured with error. Now consider the reverse (calibration) problem:

◦ We obtain a new value, y*, and want to estimate the corresponding x*:

y* = β0 + β1 x* + ε
Example
[Figure: Y versus X — example data for the calibration problem.]
Another example
[Figure: Y versus X — a second example for the calibration problem.]
Regression for calibration
−→ Goal:

Estimate x* and give a 95% confidence interval.

−→ The estimate:

Obtain β̂0 and β̂1 by regressing the yi on the xi.

Let x̂* = ( ȳ* − β̂0 ) / β̂1, where ȳ* = Σj y*j / m.

−→ The 95% confidence interval:

Let T denote the 97.5th percentile of the t distr’n with n − 2 d.f.

Let g = T / [ |β̂1| / ( σ̂ / √SXX ) ] = ( T σ̂ ) / ( |β̂1| √SXX )

The interval is

x̂* + [ (x̂* − x̄) g² ± ( T σ̂ / |β̂1| ) √( (x̂* − x̄)²/SXX + (1 − g²)(1/m + 1/n) ) ] / ( 1 − g² )

−→ For very large n, this reduces to approximately x̂* ± ( T σ̂ ) / ( |β̂1| √m )
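A sketch of these calculations in R, applied to the earlier example fit with hypothetical new readings y* (m = 3); it follows the formula above and is not a vetted implementation:

x <- rep(c(0, 10, 25, 50), each = 3)
y <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
       0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525)

n     <- length(x)
fit   <- lm(y ~ x)
b0    <- coef(fit)[[1]]
b1    <- coef(fit)[[2]]
sigma <- summary(fit)$sigma
sxx   <- sum((x - mean(x))^2)

ystar <- c(0.25, 0.26, 0.24)        # hypothetical new OD readings
m     <- length(ystar)
xhat  <- (mean(ystar) - b0) / b1    # point estimate of x*

Tq <- qt(0.975, df = n - 2)
g  <- Tq * sigma / (abs(b1) * sqrt(sxx))

half  <- (Tq * sigma / abs(b1)) *
  sqrt((xhat - mean(x))^2 / sxx + (1 - g^2) * (1/m + 1/n))
lower <- xhat + ((xhat - mean(x)) * g^2 - half) / (1 - g^2)
upper <- xhat + ((xhat - mean(x)) * g^2 + half) / (1 - g^2)

c(estimate = xhat, lower = lower, upper = upper)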
Example

[Figure: Y versus X with the calibration estimate and its interval.]

Another example

[Figure: Y versus X — a second calibration example.]

Infinite m

[Figure: the calibration interval when m is taken to be effectively infinite.]

Infinite n

[Figure: the calibration interval when n is taken to be effectively infinite.]
Multiple linear regression
[Figure: OD versus H2O2 concentration for heme type A alone and for types A and B together.]

The two lines may be modeled as: general (different intercepts and slopes), parallel (common slope), concurrent (common intercept), or coincident (a single common line).
Multiple linear regression
A and B

The data:

 #     Y       X1   X2         #     Y       X1   X2
 1   0.3399     0    0        13   0.3332     0    1
 2   0.3563     0    0        14   0.3414     0    1
 3   0.3538     0    0        15   0.3299     0    1
 4   0.3168    10    0        16   0.2940    10    1
 5   0.3054    10    0        17   0.2948    10    1
 6   0.3174    10    0        18   0.2903    10    1
 7   0.2460    25    0        19   0.2089    25    1
 8   0.2618    25    0        20   0.2189    25    1
 9   0.2848    25    0        21   0.2102    25    1
10   0.1535    50    0        22   0.1006    50    1
11   0.1613    50    0        23   0.1031    50    1
12   0.1525    50    0        24   0.1452    50    1

The model with two parallel lines can be described as

Y = β0 + β1 X1 + β2 X2 + ε

In other words (or, equations):

Y = β0 + β1 X1 + ε             if X2 = 0
Y = (β0 + β2) + β1 X1 + ε      if X2 = 1
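The parallel-lines fit in R, with the data entered as listed above (a sketch):

# X1 = [H2O2]; X2 = 0/1 indicator for the type of heme
x1 <- rep(rep(c(0, 10, 25, 50), each = 3), 2)
x2 <- rep(c(0, 1), each = 12)
y  <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
        0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525,
        0.3332, 0.3414, 0.3299, 0.2940, 0.2948, 0.2903,
        0.2089, 0.2189, 0.2102, 0.1006, 0.1031, 0.1452)

# Two parallel lines: common slope beta1, intercepts beta0 and beta0 + beta2
fit_par <- lm(y ~ x1 + x2)
summary(fit_par)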
Multiple linear regression
Interpretation

E[Y] = β0 + β1 X1            when X2 = 0
E[Y] = β0 + β1 X1 + β2       when X2 = 1

−→ Comparing two subjects with the same value of X1 but from the two different treatment arms (X2 = 1 versus X2 = 0), we expect the responses to differ by β2.
Interpretation
E[Y] = β0 + β1 X1 + β2 X2 + β3 X1 X2

−→ With k predictors, we estimate σ by  σ̂ = √( RSS / (n − (k + 1)) )
FYI
x1 = [H2O2]
x2 = 0 or 1, indicating the type of heme
y = the OD measurement

With the interaction model y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε, i.e.,

y = β0 + β1 x1 + ε                      if x2 = 0
y = (β0 + β2) + (β1 + β3) x1 + ε        if x2 = 1

β2 = 0 −→ same intercepts.
β3 = 0 −→ same slopes.
β2 = β3 = 0 −→ same lines.
Results
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.35305 0.00544 64.9 < 2e-16
x1 -0.00387 0.00019 -20.2 8.86e-15
x2 -0.01992 0.00769 -2.6 0.0175
x1:x2 -0.00055 0.00027 -2.0 0.0563
What to do. . .
Calculate

F = [ ( RSSred − RSSfull ) / ( dfred − dffull ) ] / [ RSSfull / dffull ]

where dfred = n − r − 1 and dffull = n − k − 1 (r and k being the numbers of predictors in the reduced and full models).

−→ Reduced model: y = β0 + ε, so RSSred = Σi (yi − ȳ)²

−→ F = [ ( Σi (yi − ȳ)² − Σi (yi − ŷi)² ) / k ] / [ Σi (yi − ŷi)² / (n − k − 1) ]

Compare this to an F(k, n − k − 1) dist’n.
The example
To test β2 = β3 = 0
Model 1: y ~ x1
Model 2: y ~ x1 + x2 + x1:x2
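In R this comparison is an anova() of the two nested fits (using the data listed earlier):

x1 <- rep(rep(c(0, 10, 25, 50), each = 3), 2)
x2 <- rep(c(0, 1), each = 12)
y  <- c(0.3399, 0.3563, 0.3538, 0.3168, 0.3054, 0.3174,
        0.2460, 0.2618, 0.2848, 0.1535, 0.1613, 0.1525,
        0.3332, 0.3414, 0.3299, 0.2940, 0.2948, 0.2903,
        0.2089, 0.2189, 0.2102, 0.1006, 0.1031, 0.1452)

fit_red  <- lm(y ~ x1)                  # Model 1: a single line
fit_full <- lm(y ~ x1 + x2 + x1:x2)     # Model 2: two separate lines

anova(fit_red, fit_full)   # F test of beta2 = beta3 = 0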