[Figure: plot with axis label "Response"; horizontal axis from 60 to 80.]
Now do the same for fathers 59 inches tall, then
60 inches tall, and so on.
[Figure: scatterplot of Son's Height (vertical axis) against Father's Height (horizontal axis), with a straight line through the points.]
Notice the line: it is called a regression or
least-squares line.
$y = a + bx$
where
$a = \bar{y} - b\bar{x}$
and
$b = r \, \frac{s_y}{s_x}$
I prefer to write:
$\frac{y - \bar{y}}{s_y} = r \, \frac{x - \bar{x}}{s_x}$
In words: predict y in standard units to be x in
standard units times correlation coefficient.
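The formulas for a and b can be tried out on a small data set. The following is a minimal sketch in Python; the heights are made up for illustration and are not the data behind the scatterplot, and the helper names (`mean`, `sd`, `correlation`, `regression_line`) are chosen here, not taken from the notes.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sd(values):
    # SD with divisor n; using n or n - 1 consistently gives the same r and b
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    # Average of the products of x and y in standard units
    mx, my = mean(xs), mean(ys)
    sx, sy = sd(xs), sd(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)

def regression_line(xs, ys):
    # b = r * s_y / s_x and a = y_bar - b * x_bar
    b = correlation(xs, ys) * sd(ys) / sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical (father, son) heights in inches, made up for illustration
fathers = [62, 64, 66, 68, 70, 72, 74]
sons = [66, 65, 68, 69, 68, 71, 72]

a, b = regression_line(fathers, sons)
print("intercept a =", round(a, 2), "slope b =", round(b, 3))
print("predicted son's height for a 70 inch father:", round(a + b * 70, 2))
```

Because a = ȳ − bx̄, the fitted line always passes through the point of averages (x̄, ȳ).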
Jargon:
a is the intercept.
The least-squares line minimizes
$\sum_{i=1}^{n} (y_i - a - b x_i)^2$,
the sum of squared vertical deviations between the points $(x_i, y_i)$
and the straight line with slope b and intercept a.
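To see what "least squares" means in practice, the sketch below computes the sum of squared vertical deviations for the fitted line and for a few nearby lines. It is only an illustration: the data are made up, and the slope is computed as Sxy / Sxx, which is algebraically the same as r s_y / s_x.

```python
def sum_squared_deviations(a, b, xs, ys):
    # Sum over i of (y_i - a - b * x_i)^2
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # same value as r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, made up for illustration
xs = [62, 64, 66, 68, 70, 72, 74]
ys = [66, 65, 68, 69, 68, 71, 72]

a, b = least_squares_line(xs, ys)
best = sum_squared_deviations(a, b, xs, ys)

# Any other intercept or slope gives a larger sum of squares
for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    other = sum_squared_deviations(a + da, b + db, xs, ys)
    print(f"shift a by {da}, b by {db}: {other:.3f} versus best {best:.3f}")
```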
Correlation: r = 0.50.
Average weight vs height for STAT 201:
[Figure: scatterplot of Weight (vertical axis, 100 to 200) against Height (horizontal axis, 60 to 75).]
Correlation: r = 0.73.
Regression line:
DO NOT EXTRAPOLATE.
Issues:
Plot the residuals $y_i - a - b x_i = y_i - (a + b x_i)$
against $x_i$ to look for problems.
3) Residual variability: for oval-shaped scatterplots, the histogram of
y values for a given x value tends to follow the normal curve. The mean
is predicted by the regression line; the SD is roughly
$\sqrt{1 - r^2} \, s_y$
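As a quick numerical check of how much the factor shrinks the SD, here is a small sketch. The function name `residual_sd` is chosen here for illustration; the two r values are the ones quoted in these notes, and s_y is set to 1 so the result is in units of s_y.

```python
from math import sqrt

def residual_sd(r, s_y):
    """Rough SD of y values at a given x: sqrt(1 - r^2) * s_y."""
    return sqrt(1 - r ** 2) * s_y

# s_y = 1.0, so each result is the fraction of s_y that remains
for r in (0.50, 0.73):
    print("r =", r, "factor =", round(residual_sd(r, 1.0), 3))
```

With r = 0.50 the factor is about 0.87, so knowing x reduces the SD of y by only about 13%.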
Illustration of regression effect using height data
Correlation: r = 0.50.
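The regression effect can be sketched numerically by predicting in standard units. In the sketch below only r = 0.50 comes from the notes; the average of 68 inches and SD of 3 inches, used for both fathers and sons, are assumptions made up for illustration.

```python
R = 0.50            # correlation quoted in the notes
MEAN_HEIGHT = 68.0  # hypothetical average, assumed the same for fathers and sons
SD_HEIGHT = 3.0     # hypothetical SD, assumed the same for fathers and sons

def predict_son(father_height):
    # Convert to standard units, multiply by r, convert back to inches
    x_std = (father_height - MEAN_HEIGHT) / SD_HEIGHT
    y_std = R * x_std
    return MEAN_HEIGHT + y_std * SD_HEIGHT

print(predict_son(74))  # 71.0: father 2 SDs above average, son predicted 1 SD above
print(predict_son(62))  # 65.0: father 2 SDs below average, son predicted 1 SD below
```

In both cases the prediction for the son is closer to the average than the father is, which is the regression effect.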
OR work out a and b and use regression line:
Prediction is
Residual plots: a plot of $y_i - a - b x_i$ against $x_i$
should be flat, not wider at one end than the
other, not curved, and with no big outliers.
[Figure: residual plot; vertical axis "Residual (L)".]
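A curved residual pattern is easy to produce on purpose. The sketch below fits a straight line to made-up data that actually follow y = x² and lists the residuals; the data and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    # Least-squares intercept a and slope b
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data on a curve (y = x^2), made up for illustration
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

a, b = fit_line(xs, ys)
residuals = [y - a - b * x for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(x, e)
```

The residuals are positive at both ends and negative in the middle, so the residual plot is U-shaped rather than flat. They still sum to zero, as least-squares residuals always do.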
Notice that in the top plot the main body of dots
seems to slope down and to the right.
[Figure: plots with axis label "Distance Driven".]
Plot of SD of son’s heights for each different
father’s height.
[Figure: SD of Son's Height (vertical axis, 0.5 to 3.0) against Father's Height (horizontal axis, 60 to 75).]
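A plot like this can be built by grouping sons by their father's height and taking the SD within each group. The sketch below assumes fathers' heights are recorded to the nearest inch; the pairs are made up for illustration, and `sd_by_father_height` is a name chosen here.

```python
from collections import defaultdict
from statistics import pstdev

def sd_by_father_height(pairs):
    # Collect sons' heights for each father's height
    groups = defaultdict(list)
    for father, son in pairs:
        groups[father].append(son)
    # SD (divisor n) within each group; skip groups with a single son
    return {f: pstdev(s) for f, s in sorted(groups.items()) if len(s) > 1}

# Hypothetical (father, son) heights in inches, made up for illustration
pairs = [
    (64, 63), (64, 66), (64, 69),
    (68, 66), (68, 68), (68, 70),
    (72, 70), (72, 71), (72, 72),
]

sds = sd_by_father_height(pairs)
for father, s in sds.items():
    print(father, round(s, 2))
```

If the SDs are roughly the same across groups, a single residual SD describes the spread around the line at every x; in this made-up example they clearly are not.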
Ecological correlations: correlations computed
between averages.
[Figure: plot with axis label "Rating of TA".]
Now make up hypothetical data consistent with
known averages:
[Figure: four panels of Final against TA Rating, with r = −0.8, r = −0.5, r = 0, and r = 0.8.]
Look at a few TAs for the r = 0.8 example. Within
each section the correlation is high. The overall
correlation using the raw data is positive. The
correlation using the averages is negative!
[Figure: four panels of Final against TA Rating, for TA B, TA C, TA H, and TA I.]
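The same reversal can be reproduced with a tiny made-up data set. In the sketch below the three sections, their ratings, and their final marks are all hypothetical; they are arranged so that ratings and finals rise together within each section while the section averages move in opposite directions.

```python
from math import sqrt

def correlation(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical sections: (TA ratings given by students, final marks)
sections = {
    "A": ([0.9, 1.9, 2.9], [52, 62, 72]),
    "B": ([1.0, 2.0, 3.0], [50, 60, 70]),
    "C": ([1.1, 2.1, 3.1], [48, 58, 68]),
}

# Correlation within each section
within = {name: correlation(x, y) for name, (x, y) in sections.items()}

# Correlation using the raw data from all students
all_x = [x for xs, _ in sections.values() for x in xs]
all_y = [y for _, ys in sections.values() for y in ys]
overall = correlation(all_x, all_y)

# Ecological correlation: one point per section, using section averages
avg_x = [sum(xs) / len(xs) for xs, _ in sections.values()]
avg_y = [sum(ys) / len(ys) for _, ys in sections.values()]
ecological = correlation(avg_x, avg_y)

print("within sections:", within)
print("overall (raw data):", round(overall, 2))
print("using averages:", round(ecological, 2))
```

The correlation between averages describes sections, not individual students, so it should not be read as a statement about individuals.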