Linear Regression
The Practice of Statistics in the Life Sciences
Objectives
Regression
What is Linear Regression
The least-squares regression line
Finding the least-squares regression line
The coefficient of determination, r²
Outliers and influential observations
Making predictions
Association does not imply causation
What is Linear Regression
Linear regression is a linear model, i.e. a model that assumes a
linear relationship between the input variables (x) and the single
output variable (y). More specifically, it assumes that y can be
calculated from a linear combination of the input variables (x).
When there is a single input variable (x), the method is referred to
as simple linear regression. When there are multiple input
variables, literature from statistics often refers to the method as
multiple linear regression.
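A minimal sketch of simple linear regression in Python, assuming made-up data (the x and y values below are purely illustrative, not from the slides):

```python
import numpy as np

# Illustrative data: one input variable x, one output variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 fits the least-squares line and returns
# the coefficients highest power first: (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x  # predicted values on the fitted line
# slope ≈ 1.93, intercept ≈ 0.27
```

With several input variables, the same idea extends to multiple linear regression: y is modeled as a linear combination of all the inputs.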
The least-squares regression line
The least-squares regression line is the unique line such that the sum
of the vertical distances between the data points and the line is zero,
and the sum of the squared vertical distances is the smallest possible.
Notation
ŷ is the predicted y value on the regression line
ŷ = a + bx, where a is the intercept and b is the slope
(the line may have slope < 0, slope = 0, or slope > 0)
Not all calculators/software use this convention. Other notations include:
ŷ = ax + b
ŷ = b0 + b1x
ŷ = variable_name · x + constant
Interpretation
The slope of the regression line describes how much we expect y to
change, on average, for every unit change in x.
The intercept is a necessary mathematical descriptor of the
regression line. It does not describe a specific property of the data.
Finding the least-squares regression line
The slope of the regression line, b, equals: b = r (sy / sx)
r is the correlation coefficient between x and y
sy is the standard deviation of the response variable y
sx is the standard deviation of the explanatory variable x
The intercept, a, equals: a = y̅ − b x̅
x̅ and y̅ are the respective means of the x and y variables
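These formulas can be checked directly in Python. A sketch with made-up data (the x and y values are illustrative only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # explanatory variable
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])  # response variable

r = np.corrcoef(x, y)[0, 1]                    # correlation coefficient
b = r * np.std(y, ddof=1) / np.std(x, ddof=1)  # slope: b = r * sy / sx
a = np.mean(y) - b * np.mean(x)                # intercept: a = y̅ - b x̅

# The least-squares line always passes through the point (x̅, y̅):
assert abs((a + b * np.mean(x)) - np.mean(y)) < 1e-12
```

Note the `ddof=1` argument: the slide's sx and sy are sample standard deviations, and the n − 1 denominators cancel in the ratio sy/sx either way.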
Plotting the least-square regression line
Use the regression equation to find the value of y for two distinct values
of x, and draw the line that goes through those two points.
Hint: The regression line always passes through the mean of x and y.
The points used for drawing the regression line are derived from the
equation. They are NOT actual points from the data set (except by
pure coincidence).
Least-squares regression is only for linear associations
Don’t compute the regression line until you have confirmed that there is
a linear relationship between x and y.
ALWAYS PLOT THE RAW DATA
These data sets all give a linear regression equation of about
ŷ = 3 + 0.5x. But don't report that until you have plotted the data.
[Four scatterplots, each with ŷ = 3 + 0.5x:]
1) Moderate linear association; regression OK.
2) Obvious nonlinear relationship; regression inappropriate.
3) One extreme outlier, requiring further examination.
4) Only two values for x; a redesign is due here…
The coefficient of determination, r²
r², the coefficient of determination, is the square of the
correlation coefficient.
r² represents the fraction of the variance in y that can be explained
by the regression model.
[Scatterplot illustrating the deviations ŷᵢ − y̅ and yᵢ − y̅.]
r = 0.87, so r² = 0.76. This model explains 76% of individual variations in BAC.
r = −0.3, r² = 0.09, or 9%. The regression model explains not even 10% of the variations in y.
r = −0.7, r² = 0.49, or 49%. The regression model explains nearly half of the variations in y.
r = −0.99, r² = 0.9801, or ~98%. The regression model explains almost all of the variations in y.
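The two views of r² — the squared correlation, and the explained fraction of variance — agree for a least-squares fit. A sketch with made-up data (the x and y values are illustrative only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2                      # square of the correlation

# Equivalent view: 1 - (residual variation / total variation)
b, a = np.polyfit(x, y, 1)              # slope b, intercept a
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)       # variation left unexplained
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation in y
assert abs(r_squared - (1 - ss_res / ss_tot)) < 1e-12
```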
Outliers and influential points
Outlier: An observation that lies outside the overall pattern.
“Influential individual”: An observation that markedly changes the
regression if removed. This is often an isolated point.
[Scatterplot of the child data with the regression line.]
Child 19 = outlier (large residual): Child 19 is an outlier of the
relationship (it is unusually far from the regression line, vertically).
Child 18 = potential influential individual: Child 18 is isolated from
the rest of the points, and might be an influential point.
Residuals
The vertical distances from each point to the least-squares regression
line are called residuals. The sum of all the residuals is by definition 0.
Outliers have unusually large residuals (in absolute value).
Points above the line have a positive residual (underestimation).
Points below the line have a negative residual (overestimation).
The residual is the vertical distance between the observed y and the
predicted ŷ: residual = y − ŷ
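The residual calculation, and the fact that least-squares residuals sum to zero, can be sketched as follows (the data values are made up for illustration; the point at x = 3 plays the role of an outlier with the largest residual):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)   # residual = observed - predicted

# Least-squares residuals sum to zero (up to floating-point error)
assert abs(residuals.sum()) < 1e-9

# An outlier of the relationship shows up as an unusually large
# residual in absolute value:
print(np.max(np.abs(residuals)))
```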
[Scatterplots comparing the regression line fitted to all data,
without Child 18, and without Child 19.]
Child 18 changes the regression line substantially when it is removed.
So, Child 18 is indeed an influential point.
Child 19 is an outlier of the relationship, but it is not influential
(the regression line changed very little by its removal).
Making predictions
Use the equation of the least-squares regression to predict y for any
value of x within the range studied.
Prediction outside the range is extrapolation. Avoid extrapolation.
ŷ = 0.0144x + 0.0008
What would we expect for the BAC after drinking 6.5 beers?
ŷ = 0.0144 × 6.5 + 0.0008
ŷ = 0.0936 + 0.0008 = 0.0944 mg/ml
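The BAC prediction above, as a one-line Python function using the slide's fitted equation (the function name is just for illustration):

```python
def predict_bac(beers):
    """Predicted BAC (mg/ml) from number of beers, using yhat = 0.0144x + 0.0008."""
    return 0.0144 * beers + 0.0008

bac = predict_bac(6.5)   # 0.0936 + 0.0008 = 0.0944 mg/ml
```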
Manatee deaths vs. powerboat registrations in Florida:

Powerboats (×1000)   Manatee deaths
447    13
460    21
481    24
498    16
513    24
512    20
526    15
559    34
585    33
614    33
645    39
675    43
711    50
719    47
681    55
679    38
678    35
696    49
713    42
732    60
755    54
809    66
830    82
880    78
944    81
962    95
978    73
983    69
1010   79
1024   92

The least-squares regression line is: ŷ = 0.1301x − 43.7 (R² = 0.9061).
[Scatterplot: manatee deaths vs. powerboats (×1000), with the fitted line.]

If Florida were to limit the number of powerboat registrations to 500,000,
what could we expect for the number of manatee deaths in a year?
ŷ = 0.1301(500) − 43.7 = 65.05 − 43.7 = 21.35
Roughly 21 manatee deaths.
Could we use this regression line to predict the number of manatee
deaths for a year with 200,000 powerboat registrations?
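The manatee prediction, with a guard against extrapolation, can be sketched as below. The function name and range check are illustrative; the x range comes from the data table above (roughly 447 to 1024 thousand registrations):

```python
X_MIN, X_MAX = 447, 1024   # observed range of x (thousands of powerboats)

def predict_deaths(powerboats_thousands):
    """Predicted manatee deaths, using yhat = 0.1301x - 43.7."""
    if not (X_MIN <= powerboats_thousands <= X_MAX):
        raise ValueError("extrapolation: x is outside the range studied")
    return 0.1301 * powerboats_thousands - 43.7

print(round(predict_deaths(500), 2))   # 21.35: roughly 21 deaths
# predict_deaths(200) raises ValueError: 200 (i.e. 200,000 boats) is far
# below the observed range, so the regression line should not be used there.
```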
Association does not imply causation
Association, however strong, does NOT imply causation.
The observed association could have an external cause.
A lurking variable is a variable that is not among the explanatory or
response variables in a study, and yet may influence the relationship
between the variables studied.
We say that two variables are confounded when their effects on a
response variable cannot be distinguished from each other.
In each example, what is most likely the lurking variable? Notice that some
cases are more obvious than others.
Strong positive association between the shoe size and reading skills
in young children.
[Scatterplot: reading index (0–1) vs. shoe size (0–7).]
Strong positive association between the number of firefighters
at a fire site and the amount of damage a fire does.
Negative association between moderate
amounts of wine-drinking and death rates
from heart disease in developed nations.
Establishing causation
Establishing causation from an observed association can be done if:
1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.
Lung cancer is clearly associated with smoking.
What if a genetic mutation (lurking variable) caused
people to both get lung cancer and become addicted to smoking?
It took years of research and accumulated indirect evidence to reach the
conclusion that smoking causes lung cancer.