Looking at Data: Relationships: Least-Squares Regression
IPS chapter 2.3
$$\hat{y} = \left(\bar{y} - r\,\frac{s_y}{s_x}\,\bar{x}\right) + r\,\frac{s_y}{s_x}\,x, \quad \text{or} \quad \hat{y} = a + bx$$
ŷ is the predicted y value (y hat)
b is the slope
a is the y-intercept
"a" is in units of y
"b" is in units of y / units of x
How to:
First we calculate the slope of the line, b; from statistics we already know:
$$b = r\,\frac{s_y}{s_x}$$
r is the correlation.
$s_y$ is the standard deviation of the response variable y.
$s_x$ is the standard deviation of the explanatory variable x.
Once we know the slope b, we can calculate the y-intercept a:
$$a = \bar{y} - b\,\bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables.
This means that we don't have to calculate a lot of squared distances to find the least-
squares regression line for a data set. We can instead rely on the equation.
$$\hat{y} = a + bx$$
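As a minimal sketch of that shortcut, the slope and intercept can be computed from the means, standard deviations, and correlation alone. The data values below are hypothetical, made up for illustration; only the formulas b = r s_y / s_x and a = ȳ − b x̄ come from the text.

```python
from statistics import mean, stdev

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# Correlation r: sum of products of standardized values, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

b = r * s_y / s_x      # slope, in units of y per unit of x
a = y_bar - b * x_bar  # y-intercept, in units of y

print(f"y-hat = {a:.3f} + {b:.3f} x")
```

No squared distances are computed anywhere; the five summary numbers are enough.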
Some use instead:
$$\hat{y} = ax + b$$
Make sure you know what YOUR
calculator gives you for a and b before
you answer homework or exam questions.
Software output
[Figures: software regression output, with the intercept, slope, r, and R2 labeled]
The equation completely describes the regression line.
To plot the regression line you only need to plug two x values into the
equation, get the predicted y for each, and draw the line that goes through those two points.
Hint: The regression line always passes through the mean of x and y.
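A small sketch of both ideas, using hypothetical data: two predicted points are enough to draw the line, and plugging the mean of x into the equation returns the mean of y. The helper name `predict` is an illustrative choice, not something from the slides.

```python
from statistics import mean

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope and intercept (algebraically equal to b = r * s_y / s_x).
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar


def predict(x_value):
    """Predicted y (y-hat) for a given x."""
    return a + b * x_value


# Two points are enough to draw the line: here, the smallest and largest x.
p1 = (min(x), predict(min(x)))
p2 = (max(x), predict(max(x)))
print("draw the line through", p1, "and", p2)

# The line passes through the point (mean of x, mean of y).
print("passes through the means:", abs(predict(x_bar) - y_bar) < 1e-12)
```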
Regression examines the distance of all points from the line in the y
direction only.
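To illustrate what "least squares" means here, the sketch below sums the squared vertical distances (observed y minus predicted y) and checks that nudging the fitted intercept or slope only makes that sum larger. The data and the nudge size of 0.1 are hypothetical choices for illustration.

```python
from statistics import mean

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sum_squared_vertical(a, b):
    """Sum of squared vertical distances from each point to the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

best = sum_squared_vertical(a, b)

# Every nearby line (intercept and/or slope shifted by 0.1) does worse.
nudged = [sum_squared_vertical(a + da, b + db)
          for da in (-0.1, 0.0, 0.1)
          for db in (-0.1, 0.0, 0.1)
          if (da, db) != (0.0, 0.0)]

print("least-squares line:", best)
print("best nudged line:  ", min(nudged))
```

Only the y direction enters the sum, which is why swapping the roles of x and y generally gives a different line.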
$$\hat{y} = 0.0144x + 0.0008$$
Nobody in the study drank 6.5 beers, but by finding the value
of ŷ from the regression line for x = 6.5 we would expect a blood
alcohol content of 0.094 mg/ml.
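The arithmetic behind that prediction can be sketched as follows. The coefficients are the ones quoted above; the function name `predict_bac` is a hypothetical label for illustration.

```python
# Coefficients quoted in the text for the beers / blood alcohol example.
a = 0.0008  # y-intercept, in units of blood alcohol content
b = 0.0144  # slope, in units of blood alcohol content per beer


def predict_bac(beers):
    """Predicted blood alcohol content (y-hat) for a given number of beers."""
    return a + b * beers


# 0.0008 + 0.0144 * 6.5 = 0.0008 + 0.0936 = 0.0944, which rounds to 0.094.
print(round(predict_bac(6.5), 3))
```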
[Figure: plot with axis "Height in Inches"]
If you only observed bacterial growth in a test tube during a small subset of the
time shown here, you could get almost any regression line imaginable.
Extrapolation = big mistake.
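A rough sketch of why this goes wrong, under a loudly stated assumption: the counts below come from a made-up growth model in which the population doubles every 30 minutes, not from the data in the slide. Lines fitted to an early window and a late window have very different slopes, and extending the early line to 240 minutes misses badly.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept a and slope b for the points (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b


def count(t):
    """Assumed growth model (hypothetical): 10 cells, doubling every 30 min."""
    return 10 * 2 ** (t / 30)


early_t = [0, 30, 60]
late_t = [180, 210, 240]

a_early, b_early = fit_line(early_t, [count(t) for t in early_t])
a_late, b_late = fit_line(late_t, [count(t) for t in late_t])

print("slope from early window:", b_early)
print("slope from late window: ", b_late)

# Extrapolating the early-window line far outside the observed range.
predicted_240 = a_early + b_early * 240
actual_240 = count(240)
print("extrapolated to 240 min:", predicted_240, "model value:", actual_240)
```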
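The y-intercept
The y-intercept shows negative blood alcohol, which is not physically possible.
But the negative value is still the correct intercept for the equation of the regression line.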
r = 0, r² = 0: Changes in x explain 0% of the variations in y. The value(s) y
takes is (are) entirely independent of what value x takes.
Here the change in x only explains 76% of the change in y. The rest of the
change in y (the vertical scatter, shown as red arrows) must be explained
by something other than x.
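The "percent of variation explained" reading of r² can be checked numerically. In this sketch, with hypothetical data, the squared correlation equals one minus the ratio of the leftover vertical scatter around the line to the total variation in y. The variable names are illustrative.

```python
from statistics import mean

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / (sxx * syy) ** 0.5
b = sxy / sxx
a = y_bar - b * x_bar

# Vertical scatter left over around the regression line.
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
fraction_explained = 1 - unexplained / syy

print("r squared:         ", r ** 2)
print("fraction explained:", fraction_explained)
```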
r = 0.7, r² = 0.49
There is quite some variation in BAC for the same number of beers drunk.
A person’s blood volume is a factor in the equation that was overlooked here.
We changed number of beers to number of beers/weight of person in lb.
r = 0.9, r² = 0.81
In the first plot, number of beers only explains
49% of the variation in blood alcohol content.
But number of beers / weight explains 81% of
the variation in blood alcohol content.
Additional factors contribute to variations in
BAC among individuals (like maybe some
genetic ability to process alcohol).
Grade performance
A weak correlation.
Transforming relationships
A scatterplot might show a clear relationship between two quantitative
variables, but influential points or nonlinearity can prevent us
from using correlation and regression tools.
Transforming the data – changing the scale in which one or both of the
variables are expressed – can make the shape of the relationship
linear in some cases.
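One common transformation is the logarithm. The sketch below assumes, purely for illustration, counts that double every 30 minutes (hypothetical values, not the data plotted in the slide), and compares the correlation with time before and after taking log10 of the counts.

```python
from math import log10
from statistics import mean


def correlation(x, y):
    """Correlation r between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical exponential growth: 10 cells, doubling every 30 minutes.
time_min = list(range(0, 241, 30))
counts = [10 * 2 ** (t / 30) for t in time_min]
log_counts = [log10(c) for c in counts]

r_raw = correlation(time_min, counts)
r_log = correlation(time_min, log_counts)

print("r, counts vs time:       ", round(r_raw, 3))
print("r, log10(counts) vs time:", round(r_log, 3))
```

On the original scale the curved pattern weakens the linear fit; on the log scale the same points fall on a straight line, so regression tools apply again.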
[Figure: two panels plotted against Time (min), 0 to 240; the left vertical axis runs from 0 to 5000, the right from 0 to 4]