
Response

The document discusses the concepts of regression analysis, focusing on the relationship between dependent and independent variables, using examples like predicting a son's height based on a father's height. It explains the least-squares regression line, the importance of correlation coefficients, and the implications of regression effects, including regression to the mean and issues with residual analysis. Additionally, it highlights the limitations of regression, such as the dangers of extrapolation and the potential for confounding variables in observational data.

Uploaded by

0126ds201025
Copyright
© © All Rights Reserved

Fact: r does not depend on which variable you put on the x axis and which on the y axis.

Often: interest centres on whether or not changes in x cause changes in y, or on predicting y from x.

In this case: call y the dependent, outcome, response or endogenous variable. We say “response” in this course.

Call x the explanatory, exogenous, independent or predictor variable.

Example: predict son’s height from father’s height.

Suppose the father is 70 inches tall. How to guess the height of the son?

Simple method: use the average height of those sons whose fathers were 70 inches tall.
Pick out cases where father is between 69.5 and 70.5 inches tall. There were 115 such fathers.

Average son’s height in this group: 69.8 in

SD of son’s height in this group: 2.5 in


[Figure: histogram titled “Sons of 70 inch fathers”; x axis Son’s Height (In), y axis Frequency.]
Now do same for fathers 59 inches tall, then
60 inches tall and so on.

Get a ȳ for each x from 59 to 75.
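The averaging procedure above can be sketched in Python. The (father, son) pairs below are made-up values for illustration only, not the data behind these slides, and the helper name `graph_of_averages` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (father, son) heights in inches; NOT the data from the slides.
pairs = [
    (64.8, 66.0), (65.2, 67.5), (65.4, 66.3),
    (69.6, 69.0), (70.1, 70.5), (70.4, 69.9),
    (74.7, 71.0), (75.3, 73.0),
]

def graph_of_averages(pairs):
    """Average son's height for each father's height, rounded to the nearest inch."""
    groups = defaultdict(list)
    for father, son in pairs:
        # A father strictly between 69.5 and 70.5 inches counts as "70 inches tall".
        groups[round(father)].append(son)
    return {x: mean(ys) for x, ys in sorted(groups.items())}

averages = graph_of_averages(pairs)
for x, y_bar in averages.items():
    print(f"father {x} in: average son {y_bar:.1f} in")
```

Each key of `averages` is one x value and each value is the corresponding ȳ, so plotting the dictionary gives one point per father's height.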


[Figure: average Son’s Height (y axis) plotted against Father’s Height (x axis), one point per father’s height, with a straight line drawn through the points.]
Notice the line: it is called a regression or
least-squares line.

Formula for the least squares line?

$$y = a + bx$$

Where:

$$a = \bar{y} - b\bar{x}$$

and

$$b = r \, \frac{s_y}{s_x}$$

I prefer to write:

$$\frac{y - \bar{y}}{s_y} = r \, \frac{x - \bar{x}}{s_x}$$

In words: predict y in standard units to be x in standard units times the correlation coefficient.
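To see that the two ways of writing the line agree, multiply the standard-units form through by the SD of y and solve for y. A short derivation, using only the symbols already defined:

```latex
\begin{align*}
\frac{y - \bar{y}}{s_y} &= r \, \frac{x - \bar{x}}{s_x} \\
y &= \bar{y} + r \frac{s_y}{s_x} \, (x - \bar{x}) \\
  &= \underbrace{\left(\bar{y} - b\bar{x}\right)}_{a} + b x,
  \qquad b = r \frac{s_y}{s_x}.
\end{align*}
```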

Jargon:

a is the intercept.

b is the slope also called regression coefficient.

“Least squares” because we find formulae for a and b by using calculus to minimize the Error Sum of Squares:

$$\sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$$

Sum of squared vertical deviations between the points $(x_i, y_i)$ and the straight line with slope b and intercept a.
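A minimal sketch of the least-squares idea in Python. The data are made up for illustration, and the function names `fit_line` and `error_sum_of_squares` are hypothetical. Setting the derivatives of the Error Sum of Squares to zero gives the closed form used in `fit_line`; the loop then checks numerically that nudging a or b away from the fitted values only makes the sum larger.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # algebraically equal to r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def error_sum_of_squares(a, b, xs, ys):
    """Sum of squared vertical deviations from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
best = error_sum_of_squares(a, b, xs, ys)

# Any other line has a larger Error Sum of Squares.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05), (0.1, -0.05)]:
    assert error_sum_of_squares(a + da, b + db, xs, ys) > best

print(f"a = {a:.3f}, b = {b:.3f}, ESS = {best:.4f}")
```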

For the height data

Fathers: x̄ = 67.7, sx = 2.74 (inches)

Sons: ȳ = 68.7, sy = 2.81 (inches)

Correlation: r = 0.50.
Average weight vs height for STAT 201:
[Figure: average Weight (y axis) plotted against Height (x axis), one point per height.]

“Fit not as good”.

Scatterplot not too oval; mixing sexes in the same plot.
Numerical values:

Height: H̄ = 66.8, sH = 3.75 (inches).

Weight: W̄ = 140, sW = 25.3 (pounds).

Correlation: r = 0.73.

Regression line:

Slope: b = 0.73 × 25.3/3.75 = 4.93 (pounds per inch)

Intercept: a = 140 − 4.93 × 66.8 = −189 (pounds)

Meaning of intercept: NONE whatever. Not to be understood as the weight of a person 0 inches tall.

DO NOT USE the regression line outside the range of x values!

DO NOT EXTRAPOLATE.
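One way to enforce this rule in code is to have the prediction function refuse x values outside the observed range. A sketch in Python using the slope and intercept worked out above; the range limits of 60 to 75 inches are an assumption read off the axis of the plot, not values stated in these notes, and `predict_weight` is a hypothetical name.

```python
SLOPE = 4.93        # pounds per inch, from the notes
INTERCEPT = -189.0  # pounds, from the notes

# ASSUMPTION: heights in the data run from about 60 to 75 inches
# (read off the plot axis; the notes do not state the exact range).
X_MIN, X_MAX = 60.0, 75.0

def predict_weight(height_in):
    """Predicted average weight (pounds) at a given height (inches)."""
    if not X_MIN <= height_in <= X_MAX:
        raise ValueError(
            f"height {height_in} in is outside [{X_MIN}, {X_MAX}]: do not extrapolate"
        )
    return INTERCEPT + SLOPE * height_in

print(round(predict_weight(66.8), 1))  # close to the mean weight of 140 pounds
```

Calling `predict_weight(0)` raises `ValueError` instead of returning the meaningless intercept.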
Issues:

1) Regression effect: when r > 0, cases high in one variable are predicted to be high in the other BUT closer to the mean in standard deviation units. Called regression to the mean.

Cases low in one variable are predicted to be low in the other but not as low.

For r < 0: cases above the mean in one variable are predicted to be below the mean on the other but not as far below.

2) Residual analysis: straight line regression is not always appropriate. Watch for non-linearity, outliers, influential observations. Plot the residuals

$$y_i - a - b x_i = y_i - (a + b x_i)$$

against $x_i$ to look for problems.

3) Residual variability: for oval shaped scatterplots, the histogram of y values for a given x value tends to follow the normal curve. The mean is predicted by the regression line; the SD is roughly

$$\sqrt{1 - r^2} \, s_y$$

4) Cause and effect. Variables x and y can be highly correlated without changes in one causing changes in the other. Watch out for lurking or confounding variables. Do controlled experiments.

5) Ecological correlations: replacing groups of cases by averages can change the correlation dramatically.
Illustration of regression effect using height data

Fathers: x̄ = 67.7, sx = 2.74 (inches)

Sons: ȳ = 68.7, sy = 2.81 (inches)

Correlation: r = 0.50.

Predict average son’s height when father is 72 inches:

My way without remembering formulas:

Convert 72 to standard deviation units:

$$\frac{72 - 67.7}{2.74} = 1.57$$

Predict son’s height in standard units to be 0.50 × 1.57 = 0.78.

Convert back to original units:

$$68.7 + 2.81 \times 0.78 = 70.9$$
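The three steps above translate directly into a small function. A sketch in Python; the function name `predict_in_standard_units` is hypothetical, and the inputs are the summary statistics quoted above.

```python
def predict_in_standard_units(x, x_bar, s_x, y_bar, s_y, r):
    """Predict y from x: convert to standard units, multiply by r, convert back."""
    z_x = (x - x_bar) / s_x   # x in standard units
    z_y = r * z_x             # predicted y in standard units
    return y_bar + s_y * z_y  # back to original units

# Summary statistics from the height data.
son = predict_in_standard_units(
    72, x_bar=67.7, s_x=2.74, y_bar=68.7, s_y=2.81, r=0.50
)
print(f"{son:.1f} inches")
```

Carrying full precision instead of rounding at each step still gives 70.9 inches to one decimal place.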

OR work out a and b and use regression line:

Slope is $b = r s_y / s_x = 0.50 \times 2.81 / 2.74 = 0.513$.

Intercept is $a = \bar{y} - b\bar{x} = 68.7 - 0.513 \times 67.7 = 34.0$.

Prediction is

$$\hat{y} = a + 72b = 34.0 + 0.513 \times 72 = 70.9 \text{ inches.}$$
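The same arithmetic as a short Python sketch; `regression_line` is a hypothetical helper that takes the five summary statistics and returns the intercept and slope.

```python
def regression_line(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for the regression line y = a + b*x."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

a, b = regression_line(x_bar=67.7, s_x=2.74, y_bar=68.7, s_y=2.81, r=0.50)
prediction = a + b * 72
print(f"b = {b:.3f}, a = {a:.1f}, prediction = {prediction:.1f} inches")
```

Both routes give the same answer because the line is just the standard-units rule rewritten in original units.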

Now take sons who are 70.9 inches tall and predict father’s height?

NOT just going backwards to 72 inches!

Convert 70.9 to standard units: get back 0.78.

Multiply by r: predicted father’s height is 0.39 in standard units.

This is 67.7 + 0.39 × 2.74 = 68.8 inches.
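The asymmetry can be checked numerically. In the sketch below (Python; `predict` is a hypothetical helper), the same standard-units rule is applied twice, first from father to son and then from son to father with the roles of the two variables swapped.

```python
def predict(value, mean_from, sd_from, mean_to, sd_to, r):
    """Predict one variable from the other using standard units."""
    z = (value - mean_from) / sd_from
    return mean_to + sd_to * r * z

FATHER_MEAN, FATHER_SD = 67.7, 2.74
SON_MEAN, SON_SD = 68.7, 2.81
R = 0.50

son = predict(72, FATHER_MEAN, FATHER_SD, SON_MEAN, SON_SD, R)
father = predict(son, SON_MEAN, SON_SD, FATHER_MEAN, FATHER_SD, R)

print(f"72 inch father -> son {son:.1f} in")
print(f"{son:.1f} inch son -> father {father:.1f} in (not 72)")
```

Going there and back multiplies the deviation from the mean by r twice, so the father's predicted height is 67.7 + 0.25 × 4.3 = 68.775, about 68.8 inches.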


Explanation: picking out 72 inch fathers picks out a strip on the right side of the picture. Picking out 70.9 inch sons picks out a strip across the top. Different groups of people!

[Figure: scatterplot of Son’s Height (Inches) against Father’s Height (Inches), r = 0.5.]
Residual plots: a plot of $y_i - a - b x_i$ against $x_i$ should be flat, not wider at one end than the other, not curved, and with no big outliers.
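A sketch of computing residuals in Python, with made-up data and hypothetical helpers `fit_line` and `residuals`. For a least-squares line the residuals always sum to zero, so it is the shape of the plot (curvature, fanning, outliers) that carries the information.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def residuals(xs, ys):
    """Vertical deviations y_i - a - b*x_i from the fitted line."""
    a, b = fit_line(xs, ys)
    return [y - a - b * x for x, y in zip(xs, ys)]

# Hypothetical data with curvature (y is x squared), for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 2 for x in xs]

res = residuals(xs, ys)
for x, e in zip(xs, res):
    print(f"x = {x:.0f}  residual = {e:+.1f}")
```

Here the residuals are positive at both ends and negative in the middle, the curved pattern that signals a straight line is not appropriate.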
[Figure: two residual plots of Residual (L) against Distance Driven (km). Top: all observations. Bottom: “Two influential observations removed”.]
Notice that in the top plot the main body of dots seems to slope down and to the right.

The regression line is not too useful. Difference between the two plots: omission of two data points. This gives two different lines:

[Figure: scatterplot of Litres Used against Distance Driven, showing the two fitted lines.]
Plot of SD of son’s heights for each different
father’s height.
[Figure: SD of Son’s Height (y axis) plotted against Father’s Height (x axis), one point per father’s height.]

Note the line across at height

$$\sqrt{1 - r^2} \times s_y$$
Idea: the SD of y when x is held fixed is smaller than the overall SD of y by a factor of $\sqrt{1 - r^2}$.

Usually expressed in terms of variance: smaller by a factor of $1 - r^2$.

Jargon: the fraction $r^2$ of the variance of y is “explained by the variation in x”. The rest, the other $1 - r^2$, is “unexplained variation”.
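For the height data these quantities are quick to compute. A sketch in Python using the summary statistics quoted earlier (r = 0.50, SD of sons' heights 2.81 inches); the variable names are arbitrary.

```python
import math

r = 0.50     # correlation between father's and son's heights
s_y = 2.81   # overall SD of sons' heights, in inches

residual_sd = math.sqrt(1 - r ** 2) * s_y   # SD of y when x is held fixed
explained = r ** 2                          # fraction of variance "explained"
unexplained = 1 - r ** 2                    # fraction "unexplained"

print(f"SD of sons' heights for a fixed father's height: about {residual_sd:.2f} in")
print(f"explained: {explained:.0%}, unexplained: {unexplained:.0%}")
```

The result is about 2.4 inches, in line with the SD of 2.5 inches observed among the sons of 70 inch fathers.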

This brings up the 4th point.

Our use of regression is for prediction for a new value of x observed in the same way.

NOT to predict what would happen if you changed x.

Example: if you gave the father drugs to make him grow 2 inches the son would not get taller. Manipulating father’s height doesn’t impact son’s height.

But sometimes manipulating x changes y. We say changes in x cause changes in y.

Example: blood pressure (y) regressed on drug dose (x). We hope that changing x will cause a change in y.

If the data are collected in an observational study, not an experiment, we usually cannot tell if changes in x cause changes in y.

Standard example: weekly sales of soft drinks versus weekly cases of polio diagnosed in the US during the year 1950. Correlation positive.

Why? Both go up in the summer. The confounding or lurking variable is weather. Good weather brought increases in both.
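A lurking variable is easy to simulate. The sketch below (Python, standard library only) invents 52 weekly temperatures and lets both soft drink sales and polio cases depend on temperature plus independent noise. All numbers and coefficients are made up for illustration; they are not the 1950 data.

```python
import random
from math import sqrt
from statistics import mean

def corr(xs, ys):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(1950)  # fixed seed so the run is repeatable

# Hypothetical weekly temperatures (the lurking variable).
temperature = [random.uniform(0, 30) for _ in range(52)]

# Both responses depend on temperature; neither depends on the other.
sales = [100 + 5 * t + random.gauss(0, 15) for t in temperature]
polio = [20 + 2 * t + random.gauss(0, 6) for t in temperature]

# Subtract the simulated temperature effect, then correlate what is left.
sales_left = [s - (100 + 5 * t) for s, t in zip(sales, temperature)]
polio_left = [p - (20 + 2 * t) for p, t in zip(polio, temperature)]

print(f"r(sales, polio)              = {corr(sales, polio):.2f}")
print(f"r after removing temperature = {corr(sales_left, polio_left):.2f}")
```

The raw correlation is strongly positive even though, by construction, neither variable influences the other; once the temperature effect is removed only the independent noise remains.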

Ecological correlations: correlations computed
between averages.

Example: 11 TAs for STAT 2 in 1975.

Relation between average rating of TA in section and average Final exam mark: r = −0.57.

[Figure: scatterplot of average Final Exam score against Rating of TA, one point per section.]
Now make up hypothetical data consistent with
known averages:
[Figure: four scatterplots of Final against TA Rating for hypothetical raw data, with r = −0.8, r = −0.5, r = 0 and r = 0.8.]
Look at a few TAs for the r = 0.8 example. In each section the correlation is high. Overall correlation using raw data: positive. Correlation using averages: negative!
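The same reversal can be reproduced with a tiny made-up data set. A sketch in Python; the three sections and their (rating, final) pairs are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import mean

def corr(xs, ys):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical (TA rating, final exam mark) pairs, three students per section.
sections = {
    "Section 1": [(3, 30), (4, 50), (5, 70)],
    "Section 2": [(2, 40), (3, 60), (4, 80)],
    "Section 3": [(1, 50), (2, 70), (3, 90)],
}

# Correlation inside each section.
within = {name: corr(*zip(*pts)) for name, pts in sections.items()}

# Correlation using all raw data points together.
raw = [p for pts in sections.values() for p in pts]
overall = corr([r for r, _ in raw], [f for _, f in raw])

# Ecological correlation: one (average rating, average final) point per section.
avg_rating = [mean([r for r, _ in pts]) for pts in sections.values()]
avg_final = [mean([f for _, f in pts]) for pts in sections.values()]
ecological = corr(avg_rating, avg_final)

print(within)
print(f"overall r = {overall:.2f}, r of averages = {ecological:.2f}")
```

Within every section higher ratings go with higher marks and the pooled raw correlation is positive, yet the section averages line up with a negative correlation.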
[Figure: scatterplots of Final against TA Rating for four individual sections: TA B, TA C, TA H and TA I.]