
Looking at Data: Relationships: Least-Squares Regression

The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. The equation completely describes the regression line.

Uploaded by

crutili
Copyright
© Attribution Non-Commercial (BY-NC)

Looking at data: relationships
Least-squares regression
IPS chapter 2.3

© 2006 W. H. Freeman and Company


Objectives (IPS chapter 2.3)
Least-squares regression

• The regression line
• Making predictions: interpolation
• Coefficient of determination, r²
• Transforming relationships
Correlation tells us about
strength (scatter) and direction
of the linear relationship
between two quantitative
variables.

In addition, we would like to have a numerical description of how both
variables vary together. For instance, is one variable increasing faster
than the other one? And we would like to make predictions based on that
numerical description.
But which line best
describes our data?
The regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is the smallest possible.

Distances between the points and the line are squared so all are positive
values. This is done so that distances can be properly added (Pythagoras).
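To make the least-squares criterion concrete, here is a minimal Python sketch. The five data points and the helper name `sum_squared_residuals` are made up for illustration and do not come from the slides. It compares the sum of squared vertical distances for the least-squares line with that of a few nearby lines.

```python
from statistics import mean

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))

# Least-squares slope and intercept from the usual closed-form expressions.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
a = y_bar - b * x_bar

best = sum_squared_residuals(a, b, x, y)

# Nudging the intercept or slope in any direction increases the sum of squares.
shifts = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2), (0.3, -0.1)]
for da, db in shifts:
    other = sum_squared_residuals(a + da, b + db, x, y)
    print(f"shift a by {da:+.1f}, b by {db:+.1f}: {other:.3f} vs least-squares {best:.3f}")
```

Every shifted line reports a larger sum than the least-squares line, which is what "the smallest possible" means in the definition.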
Properties
The least-squares regression line can be shown to have this equation:

$\hat{y} = \left(\bar{y} - r\,\frac{s_y}{s_x}\,\bar{x}\right) + r\,\frac{s_y}{s_x}\,x$, or $\hat{y} = a + bx$

ŷ is the predicted y value (y hat)

b is the slope
a is the y-intercept

"a" is in units of y
"b" is in units of y / units of x
How to:
First we calculate the slope of the line, b;
from statistics we already know:

$b = r\,\frac{s_y}{s_x}$

r is the correlation.
s_y is the standard deviation of the response variable y.
s_x is the standard deviation of the explanatory variable x.

Once we know b, the slope, we can calculate a, the y-intercept:

$a = \bar{y} - b\,\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample
means of the x and y variables

This means that we don't have to calculate a lot of squared distances to find the least-
squares regression line for a data set. We can instead rely on the equation.

But typically, we use a 2-var stats calculator or stats software.
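As a rough illustration of the "How to" recipe, the following Python sketch computes b and a from r, s_x, s_y and the two means. It uses the powerboat and manatee table that appears later in these slides; the helper name `correlation` is an assumption made for this example, not something the slides define.

```python
from math import sqrt
from statistics import mean, stdev

# Powerboat registrations (in 1000's) and manatee deaths, 1977-1990,
# copied from the table in these slides.
powerboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(powerboats, manatees)
s_x = stdev(powerboats)   # sample standard deviation of x
s_y = stdev(manatees)     # sample standard deviation of y

b = r * s_y / s_x                          # slope
a = mean(manatees) - b * mean(powerboats)  # y-intercept

print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.1f}")
# prints: r = 0.941, b = 0.125, a = -41.4
```

The slope and intercept agree with the regression equation quoted on the manatee slide.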


BEWARE!!!
Not all calculators and software use the same convention:

$\hat{y} = a + bx$

Some use instead:

$\hat{y} = ax + b$
Make sure you know what YOUR
calculator gives you for a and b before
you answer homework or exam questions.
Software output

[Figure: two examples of software output, with the intercept, slope, r and R² values labeled]
The equation completely describes the regression line.

To plot the regression line you only need to plug two x values into the
equation, get y, and draw the line that goes through those points.
Hint: The regression line always passes through the mean of x and y.

The points you use for drawing the regression line are derived from the
equation.

They are NOT points from your sample data (except by pure coincidence).
The distinction between explanatory and response variables is crucial in
regression. If you exchange y for x in calculating the regression line, you
will get the wrong line.

Regression examines the distance of all points from the line in the y
direction only.
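A small Python sketch can show why the roles of x and y matter. The Hubble data from this slide are not reproduced in the text, so the sketch uses the powerboat and manatee table from the interpolation example instead; the helper name `least_squares` is an assumption made for illustration.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Powerboat registrations (in 1000's) and manatee deaths, 1977-1990.
powerboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]

# Correct roles: x = powerboats (explanatory), y = manatee deaths (response).
a, b = least_squares(powerboats, manatees)

# Swapped roles: regress powerboats on manatee deaths, then solve that
# equation for manatee deaths so both lines are drawn on the same axes.
a_swapped, b_swapped = least_squares(manatees, powerboats)
slope_from_swapped = 1 / b_swapped
intercept_from_swapped = -a_swapped / b_swapped

print(f"y on x:            y = {a:.1f} + {b:.3f} x")
print(f"x on y, rewritten: y = {intercept_from_swapped:.1f} + {slope_from_swapped:.3f} x")
```

The two lines differ because each minimizes squared distances in a different direction. They would coincide only if all the points fell exactly on one line (r = ±1).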

Hubble telescope data about galaxies moving away from earth:

These two lines are the two regression lines calculated either correctly
(x = distance, y = velocity, solid line) or incorrectly (x = velocity,
y = distance, dotted line).
Correlation versus regression

The correlation is a measure of spread (scatter) in both the x and y
directions in the linear relationship.

In regression we examine the variation in the response variable (y) given
change in the explanatory variable (x).
Making predictions:
interpolation
The equation of the least-squares regression line allows us to predict y for
any x within the range studied. This is called interpolating.

$\hat{y} = 0.0144x + 0.0008$

Nobody in the study drank 6.5 beers, but by finding the value of ŷ from the
regression line for x = 6.5 we would expect a blood alcohol content of
0.094 mg/ml.

$\hat{y} = 0.0144 \times 6.5 + 0.0008$
$\hat{y} = 0.0936 + 0.0008 = 0.0944$ mg/ml
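[Figure: scatterplot of dead manatees against powerboats registered (in 1000's), with the regression line $\hat{y} = 0.125x - 41.4$]

Year   Powerboats (in 1000's)   Dead manatees
1977   447                      13
1978   460                      21
1979   481                      24
1980   498                      16
1981   513                      24
1982   512                      20
1983   526                      15
1984   559                      34
1985   585                      33
1986   614                      33
1987   645                      39
1988   675                      43
1989   711                      50
1990   719                      47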

There is a positive linear relationship between the number of powerboats
registered and the number of manatee deaths.

The least-squares regression line has the equation: $\hat{y} = 0.125x - 41.4$
Thus if we were to limit the number of powerboat registrations to 500,000, what
could we expect for the number of manatee deaths?

$\hat{y} = 0.125(500) - 41.4 \Rightarrow \hat{y} = 62.5 - 41.4 = 21.1$

Roughly 21 manatees.
Extrapolation
Extrapolation is the use of a regression line for predictions outside the
range of x values used to obtain the line.

This can be a very stupid thing to do, as seen here.

[Figure: plots of height in inches, with the extrapolated predictions flagged "!!!"]
Example: Bacterial growth rate over time in closed cultures

If you only observed bacterial growth in a test tube during a small subset of the
time shown here, you could get almost any regression line imaginable.
Extrapolation = big mistake.
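One way to keep interpolation and extrapolation apart in practice is to record the observed x range alongside the equation. The sketch below does this for the powerboat and manatee line. The function name `predict_manatee_deaths` and the choice to raise an error are assumptions made for this example, not part of the slides.

```python
# Regression line and observed x range from the powerboat/manatee slides.
INTERCEPT = -41.4
SLOPE = 0.125
X_MIN, X_MAX = 447, 719   # smallest and largest powerboat counts (in 1000's) in the table

def predict_manatee_deaths(powerboats_thousands):
    """Predict manatee deaths, refusing to extrapolate beyond the observed x range."""
    if not X_MIN <= powerboats_thousands <= X_MAX:
        raise ValueError(
            f"x = {powerboats_thousands} is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; this would be extrapolation"
        )
    return INTERCEPT + SLOPE * powerboats_thousands

print(round(predict_manatee_deaths(500), 1))   # interpolation: prints 21.1

try:
    predict_manatee_deaths(1000)               # far outside the data
except ValueError as err:
    print(err)
```

Raising an error is deliberately strict. A warning would also work, as long as the reader is told that the prediction lies outside the data.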
The y intercept

Sometimes the y-intercept is not biologically possible. Here we have
negative blood alcohol content, which makes no sense…

[Figure annotation: y-intercept shows negative blood alcohol]

But the negative value is appropriate for the equation of the regression
line.

There is a lot of scatter in the data, and the line is just an estimate.
Coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation
coefficient.

r² represents the percentage of the variance in y (vertical scatter from
the regression line) that can be explained by changes in x.

$b = r\,\frac{s_y}{s_x}$

r = -1, r² = 1: Changes in x explain 100% of the variations in y. Y can be
entirely predicted for any given value of x.

r = 0, r² = 0: Changes in x explain 0% of the variations in y. The value(s)
y takes is (are) entirely independent of what value x takes.

r = 0.87, r² = 0.76: Here the change in x only explains 76% of the change
in y. The rest of the change in y (the vertical scatter, shown as red
arrows) must be explained by something other than x.
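The "percentage explained" reading of r² can be checked numerically. The sketch below, again using the powerboat and manatee table from these slides, compares r² with one minus the ratio of leftover vertical scatter to total variation in y. The variable names are choices made for this example.

```python
from math import sqrt
from statistics import mean

# Powerboat registrations (in 1000's) and manatee deaths, 1977-1990.
powerboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]

x_bar, y_bar = mean(powerboats), mean(manatees)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(powerboats, manatees))
sxx = sum((xi - x_bar) ** 2 for xi in powerboats)
syy = sum((yi - y_bar) ** 2 for yi in manatees)

b = sxy / sxx               # least-squares slope
a = y_bar - b * x_bar       # least-squares intercept
r = sxy / sqrt(sxx * syy)   # correlation

# Total variation in y, and the part left over as vertical scatter around the line.
total_ss = syy
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(powerboats, manatees))
explained_fraction = 1 - residual_ss / total_ss

print(f"r^2 = {r ** 2:.3f}, explained fraction = {explained_fraction:.3f}")
# prints: r^2 = 0.886, explained fraction = 0.886
```

For these data, powerboat registrations account for about 89% of the variation in manatee deaths.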
There is quite some variation in BAC for the same number of beers drunk.
A person's blood volume is a factor in the equation that was overlooked here.

r = 0.7, r² = 0.49

We changed number of beers to number of beers/weight of person in lb.

r = 0.9, r² = 0.81

• In the first plot, number of beers only explains 49% of the variation in
blood alcohol content.
• But number of beers / weight explains 81% of the variation in blood
alcohol content.
• Additional factors contribute to variations in BAC among individuals
(like maybe some genetic ability to process alcohol).
Grade performance

If class attendance explains 16% of the variation in grades, what is
the correlation between percent of classes attended and grade?

1. We need to make an assumption: attendance and grades are positively
correlated. So r will be positive too.

2. r² = 0.16, so r = +√0.16 = +0.4

A weak correlation.
Transforming relationships
A scatterplot might show a clear relationship between two quantitative
variables, but issues of influential points or nonlinearity prevent us
from using correlation and regression tools.

Transforming the data – changing the scale in which one or both of the
variables are expressed – can make the shape of the relationship
linear in some cases.

Example: Patterns of growth are often exponential, at least in their initial
phase. Changing the response variable y into log(y) or ln(y) will transform
the pattern from an upward-curved exponential to a straight line.
Exponential bacterial growth
In ideal environments, bacteria multiply through binary fission. The
number of bacteria can double every 20 minutes in that way.

[Figure: two panels plotted against Time (min), 0 to 240. Left: bacterial count (0 to 5000). Right: log of bacterial count (0 to 4).]

1 - 2 - 4 - 8 - 16 - 32 - 64 - …
Exponential growth $2^n$, not suitable for regression.

$\log(2^n) = n \log(2) \approx 0.3n$
Taking the log changes the growth pattern into a straight line.
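The effect of the log transformation can be sketched in a few lines of Python. The counts below are idealized, assuming a single cell that doubles exactly every 20 minutes for 240 minutes; they are not measured data.

```python
from math import log10

# Idealized counts: one cell doubling every 20 minutes (n = number of doublings).
times = [20 * n for n in range(13)]     # 0, 20, ..., 240 minutes
counts = [2 ** n for n in range(13)]    # 1, 2, 4, ..., 4096
log_counts = [log10(c) for c in counts]

# Change between successive time points: it keeps growing for the raw counts
# but is constant for the log counts, which is what a straight line looks like.
raw_steps = [counts[i + 1] - counts[i] for i in range(len(counts) - 1)]
log_steps = [log_counts[i + 1] - log_counts[i] for i in range(len(counts) - 1)]

print(raw_steps[:5])                         # [1, 2, 4, 8, 16]
print([round(s, 3) for s in log_steps[:5]])  # [0.301, 0.301, 0.301, 0.301, 0.301]
```

Each 20-minute step adds log(2) ≈ 0.3 to the log count, matching the relation log(2^n) = n·log(2) on the slide.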
Body weight and brain weight in 96 mammal species

r = 0.86, but this is misleading.

The elephant is an influential point. Most mammals are very small in
comparison. Without this point, r = 0.50 only.

Now we plot the log of brain weight against the log of body weight.

The pattern is linear, with r = 0.96.

The vertical scatter is homogeneous
→ good for predictions of brain weight
from body weight (in the log scale).
