Lecture 4: Simple Linear Regression Models, With Hints at Their Estimation
\[
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{1}
\]
where the noise variables εi all have the same expectation (0) and the same
variance (σ²), and Cov[εi, εj] = 0 (unless i = j, of course).
\[
\hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \tag{2}
\]
We also saw, in the notes to the last lecture, that so long as the law of large
numbers holds,
\[
\hat{\beta}_1 \rightarrow \beta_1 \tag{3}
\]
as n → ∞. It follows easily that
\[
\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X} \tag{4}
\]
will also converge on β0 .
These are often called the normal equations for least-squares estimation, or
the estimating equations: a system of two equations in two unknowns, whose
solution gives the estimate. Many people would, at this point, remove the factor
of 1/n, but I think keeping it makes the next steps easier to follow:
That is, the least-squares estimate of the slope is our old friend the plug-in
estimate of the slope, and thus the least-squares intercept is also the plug-in
intercept.
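As a quick sanity check, here is a minimal R sketch (with made-up simulated data and parameter values of my own, not anything from the notes) that computes the plug-in estimates (2) and (4) directly and compares them to R's built-in least-squares fit from lm():

# Simulate a small data set from the model (true beta0 = 1, beta1 = 2)
n <- 100
x <- runif(n, -3, 3)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

# Plug-in estimates: sample covariance over sample variance of X
# (cov() and var() use an n-1 denominator, but the factor cancels in the ratio)
beta1.hat <- cov(x, y) / var(x)
beta0.hat <- mean(y) - beta1.hat * mean(x)

# Compare with R's least-squares fit
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))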
Going forward The equivalence between the plug-in estimator and the least-
squares estimator is a bit of a special case for linear models. In some non-linear
models, least squares is quite feasible (though the optimum can only be found
numerically, not in closed form); in others, plug-in estimates are more useful
than optimization.
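For instance, here is a hedged sketch of numerical least squares for a non-linear model, using R's nls(); the exponential mean function, the starting values, and the simulated data are all assumptions made just for this illustration, not part of the notes:

# Simulate data from a non-linear model y = a * exp(b*x) + noise
n <- 100
x <- runif(n, 0, 2)
y <- 3 * exp(0.7 * x) + rnorm(n, sd = 0.3)

# Least squares has no closed form here; nls() minimizes the sum of
# squared residuals numerically, starting from a rough initial guess
fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.5))
coef(fit)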
To see how the estimator behaves, substitute the model equation (1) into the plug-in formula (2):
\[
\hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \tag{16}
\]
\[
= \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{s^2_X} \tag{17}
\]
\[
= \frac{\frac{1}{n}\sum_{i=1}^{n} x_i(\beta_0 + \beta_1 x_i + \epsilon_i) - \bar{x}(\beta_0 + \beta_1 \bar{x} + \bar{\epsilon})}{s^2_X} \tag{18}
\]
\[
= \frac{\beta_0 \bar{x} + \beta_1 \overline{x^2} + \frac{1}{n}\sum_{i=1}^{n} x_i \epsilon_i - \bar{x}\beta_0 - \beta_1 \bar{x}^2 - \bar{x}\bar{\epsilon}}{s^2_X} \tag{19}
\]
\[
= \frac{\beta_1 s^2_X + \frac{1}{n}\sum_{i=1}^{n} x_i \epsilon_i - \bar{x}\bar{\epsilon}}{s^2_X} \tag{20}
\]
\[
= \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n} x_i \epsilon_i - \bar{x}\bar{\epsilon}}{s^2_X} \tag{21}
\]
Since $\bar{x}\bar{\epsilon} = n^{-1}\sum_i \bar{x}\epsilon_i$,
\[
\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})\epsilon_i}{s^2_X} \tag{22}
\]
This representation of the slope estimate shows that it is equal to the true
slope (β1) plus something which depends on the noise terms (the εi, and their
sample average ε̄). We'll use this to find the expected value and the variance of
the estimator β̂1.
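Since a simulation lets us see both the true coefficients and the noise, we can check the representation in Eq. 22 numerically. A minimal R sketch (the parameter values are my own, chosen for illustration):

# One simulated data set with known coefficients and known noise
n <- 50
beta0 <- 1; beta1 <- 2; sigma <- 0.5
x <- runif(n, -3, 3)
eps <- rnorm(n, sd = sigma)
y <- beta0 + beta1 * x + eps

s2x <- mean(x^2) - mean(x)^2                           # s^2_X with a 1/n denominator
beta1.hat <- (mean(x * y) - mean(x) * mean(y)) / s2x   # Eqs. 16-17
# Eq. 22: true slope plus a weighted average of the noise terms
beta1.alt <- beta1 + mean((x - mean(x)) * eps) / s2x
c(beta1.hat, beta1.alt)                                # should agree up to rounding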
In the next couple of paragraphs, I am going to treat the xi as non-random
variables. This is appropriate in “designed” or “controlled” experiments, where
we get to choose their values. In randomized experiments or in observational stud-
ies, obviously the xi aren’t necessarily fixed; however, these expressions will be
correct for the conditional expectation E[β̂1 | x1, . . . , xn] and conditional vari-
ance Var[β̂1 | x1, . . . , xn], and I will come back to how we get the unconditional
expectation and variance.
Since E[εi] = 0 for every i, and we are treating the xi as fixed, the second term
in Eq. 22 has expectation zero. Thus,
\[
E[\hat{\beta}_1] = \beta_1 \tag{24}
\]
Since the bias of an estimator is the difference between its expected value
and the truth, β̂1 is an unbiased estimator of the optimal slope.
(To repeat what I’m sure you remember from mathematical statistics: “bias”
here is a technical term, meaning no more and no less than E[β̂1] − β1. An unbi-
ased estimator could still make systematic mistakes — for instance, it could un-
derestimate 99% of the time, provided that the 1% of the time it over-estimates,
it does so by much more than it under-estimates. Moreover, unbiased estimators
are not necessarily superior to biased ones: the total error depends on both the
bias of the estimator and its variance, and there are many situations where you
can remove lots of bias at the cost of adding a little variance. Least squares
for simple linear regression happens not to be one of them, but you shouldn’t
expect that as a general rule.)
Turning to the intercept,
\[
E[\hat{\beta}_0] = E[\overline{Y} - \hat{\beta}_1 \overline{X}] \tag{25}
\]
\[
= \beta_0 + \beta_1 \overline{X} - E[\hat{\beta}_1]\,\overline{X} \tag{26}
\]
\[
= \beta_0 + \beta_1 \overline{X} - \beta_1 \overline{X} \tag{27}
\]
\[
= \beta_0 \tag{28}
\]
so it, too, is unbiased.
Variance and Standard Error Using the formula for the variance of a sum
from lecture 1, and the model assumption that all the εi are uncorrelated with
each other,
\[
\mathrm{Var}\left[\hat{\beta}_1\right] = \mathrm{Var}\left[\beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\epsilon_i}{s^2_X}\right] \tag{29}
\]
\[
= \mathrm{Var}\left[\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\epsilon_i}{s^2_X}\right] \tag{30}
\]
\[
= \frac{\frac{1}{n^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\,\mathrm{Var}[\epsilon_i]}{(s^2_X)^2} \tag{31}
\]
\[
= \frac{\frac{\sigma^2}{n} s^2_X}{(s^2_X)^2} \tag{32}
\]
\[
= \frac{\sigma^2}{n s^2_X} \tag{33}
\]
In words, this says that the variance of the slope estimate goes up as the
noise around the regression line (σ²) gets bigger, and goes down as we have
more observations (n), which are further spread out along the horizontal axis
(s²X); it should not be surprising that it’s easier to work out the slope of a line
from many, well-separated points on the line than from a few points smushed
together.
The standard error of an estimator is just its standard deviation, or the
square root of its variance:
\[
\mathrm{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{n s^2_X}} \tag{34}
\]
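Both the unbiasedness claims and the variance formula are easy to check by simulation. Here is a minimal R sketch (parameter values invented for illustration) that holds the xi fixed, re-draws the noise many times, and compares the Monte Carlo results to β0, β1, and Eq. 34:

# Fixed design: treat the x's as non-random across replications
n <- 50
beta0 <- 1; beta1 <- 2; sigma <- 0.5
x <- runif(n, -3, 3)
s2x <- mean(x^2) - mean(x)^2

# Re-run the "experiment" many times, re-drawing only the noise
sims <- replicate(10000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  b1 <- cov(x, y) / var(x)
  b0 <- mean(y) - b1 * mean(x)
  c(b0, b1)
})

rowMeans(sims)          # should be close to (beta0, beta1): unbiasedness
sd(sims[2, ])           # Monte Carlo standard deviation of the slope estimate
sigma / sqrt(n * s2x)   # theoretical standard error, Eq. 34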
I will leave working out the variance of β̂0 as an exercise.
\[
\beta_1 = E[Y \mid X = x] - E[Y \mid X = x - 1] \tag{40}
\]
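As a quick illustration (a sketch of my own with simulated data, not from the notes), the fitted version of this relationship shows up as the difference between predictions at x and x − 1:

# The difference in fitted conditional means at x and x-1 equals the slope estimate
n <- 100
x <- runif(n, -3, 3)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = 2)) -
  predict(fit, newdata = data.frame(x = 1))   # equals coef(fit)["x"]
coef(fit)["x"]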
The tricky part is that we have a very strong, natural tendency to interpret
this as telling us something about causation — “If we change X by 1, then
on average Y will change by β1 ”. This interpretation is usually completely
unsupported by the analysis. If I use an old-fashioned mercury thermometer,
the height of mercury in the tube usually has a nice linear relationship with the
temperature of the room the thermometer is in. This linear relationship goes
both ways, so we could regress temperature (Y ) on mercury height (X). But if
I manipulate the height of the mercury (say, by changing the ambient pressure,
or shining a laser into the tube, etc.), changing the height X will not, in fact,
change the temperature outside.
The right way to interpret β1 is not as the result of a change, but as an
expected difference. The correct catch-phrase would be something like “If we
select two sets of cases from the un-manipulated distribution where X differs
by 1, we expect Y to differ by β1.” This covers the thermometer example, and
every other one I can think of. It is, I admit, much less elegant than “If X
changes by 1, Y changes by β1 on average”, but it has the advantage of being
true, which the other does not.
There are circumstances where regression can be a useful part of causal
inference, but we will need a lot more tools to grasp them; that will come
towards the end of 402.
“Gaussian noise”. I dislike this: “normal” is an over-loaded word in math, while “Gaussian”
is (comparatively) specific; “error” made sense in Gauss’s original context of modeling, specif-
ically, errors of observation, but is misleading generally; and calling Gaussian distributions
“normal” suggests they are much more common than they really are.
[Figure 1: six panels of noise density curves over ε ∈ [−3, 3]; see the caption and the code below.]
par(mfrow=c(2,3))
# Gaussian (standard normal)
curve(dnorm(x), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
# Double-exponential ("Laplacian")
curve(exp(-abs(x))/2, from=-3, to=3, xlab=expression(epsilon), ylab="",
      ylim=c(0,1))
# "Circular" (semicircle) distribution on [-1,1]
curve(sqrt(pmax(0,1-x^2))/(pi/2), from=-3, to=3, xlab=expression(epsilon),
      ylab="", ylim=c(0,1))
# t distribution with 3 degrees of freedom
curve(dt(x,3), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
# Gamma (shape 1.5, scale 1), shifted to have mean 0
curve(dgamma(x+1.5, shape=1.5, scale=1), from=-3, to=3,
      xlab=expression(epsilon), ylab="", ylim=c(0,1))
# 50/50 mixture of two gammas (shapes 1.5 and 0.5), each offset to have mean 0
curve(0.5*dgamma(x+1.5, shape=1.5, scale=1) +
      0.5*dgamma(0.5-x, shape=0.5, scale=1), from=-3,
      to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
par(mfrow=c(1,1))
Figure 1: Some possible noise distributions for the simple linear regression model;
all have E[ε] = 0, and could get any variance by scaling. (The model is even
compatible with each observation’s εi being drawn from a different distribution.) From top left
to bottom right: Gaussian; double-exponential (“Laplacian”); “circular” distribution;
t with 3 degrees of freedom; a gamma distribution (shape 1.5, scale 1) shifted to have
mean 0; mixture of two gammas with shape 1.5 and shape 0.5, each off-set to have
expectation 0. The first three were all used as error models in the 18th and 19th
centuries. (See the source file for how to get the code below the figure.)
4. ε is independent across observations.
You will notice that these assumptions are strictly stronger than those of the
simple linear regression model. More exactly, the first two assumptions are the
same, while the third and fourth assumptions of the Gaussian-noise model imply
the corresponding assumptions of the other model. This means that everything
we have done so far directly applies to the Gaussian-noise model. On the other
hand, the stronger assumptions let us say more. They tell us, exactly, the
probability distribution for Y given X, and so will let us get exact distributions
for predictions and for other inferential statistics.
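For concreteness, here is a short R sketch (with parameter values invented for illustration) of drawing from the Gaussian-noise model, which is what these exact-distribution claims refer to: given xi, each Yi is Normal with mean β0 + β1 xi and variance σ²:

# Simulate from the Gaussian-noise simple linear regression model
n <- 200
beta0 <- 1; beta1 <- 2; sigma <- 0.5
x <- runif(n, -1, 3)
y <- rnorm(n, mean = beta0 + beta1 * x, sd = sigma)  # Y | X=x ~ N(beta0 + beta1*x, sigma^2)
plot(x, y)
abline(beta0, beta1, col = "blue")                   # the true regression line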
Why the Gaussian noise model? Why should we think that the noise
around the regression line would follow a Gaussian distribution, independent of
X? There are two big reasons.
1. Central limit theorem The noise might be due to adding up the effects
of lots of little random causes, all nearly independent of each other and
of X, where each of the effects are of roughly similar magnitude. Then
the central limit theorem will take over, and the distribution of the sum
of effects will indeed be pretty Gaussian. For Gauss’s original context,
X was (simplifying) “Where is such-and-such-a-planet in space?”, Y was
“Where does an astronomer record the planet as appearing in the sky?”,
and noise came from defects in the telescope, eye-twitches, atmospheric
distortions, etc., etc., so this was pretty reasonable. It is clearly not a
universal truth of nature, however, or even something we should expect
to hold true as a general rule, as the name “normal” suggests.
2. Mathematical convenience Assuming Gaussian noise lets us work out a
very complete theory of inference and prediction for the model, with lots
of closed-form answers to questions like “What is the optimal estimate of
the variance?” or “What is the probability that we’d see a fit this good
from a line with a non-zero intercept if the true line goes through the
origin?”, etc., etc. Answering such questions without the Gaussian-noise
assumption needs somewhat more advanced techniques, and much more
advanced computing; we’ll get to it towards the end of the class.
[Figure: the conditional density p(y | x) of the Gaussian-noise model, plotted against x and y.]
those parameters. We could not work with the likelihood under the simple linear
regression model, because it didn’t specify enough about the distribution to let
us calculate a density. With the Gaussian-noise model, however, we can write
down a likelihood.² By the model’s assumptions, if we think the parameters are
b0, b1, s² (reserving the Greek letters for their true values), then
Y | X = x ∼ N(b0 + b1 x, s²), and Yi and Yj are independent given Xi and Xj,
so the over-all likelihood is
\[
\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s^2}} e^{-\frac{(y_i - (b_0 + b_1 x_i))^2}{2s^2}} \tag{42}
\]
As usual, we work with the log-likelihood, which gives us the same information³
but replaces products with sums:
\[
L(b_0, b_1, s^2) = -\frac{n}{2}\log 2\pi - n\log s - \frac{1}{2s^2}\sum_{i=1}^{n}(y_i - (b_0 + b_1 x_i))^2 \tag{43}
\]
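As an illustration (my own sketch, not part of the original notes), Eq. 43 translates directly into an R function, which we will use again below to check that the maximum-likelihood estimates match least squares:

# Log-likelihood of the Gaussian-noise simple linear regression model (Eq. 43)
# theta = c(b0, b1, s), with s the noise standard deviation
slr.loglike <- function(theta, x, y) {
  b0 <- theta[1]; b1 <- theta[2]; s <- theta[3]
  n <- length(y)
  -(n/2) * log(2*pi) - n * log(s) - sum((y - (b0 + b1*x))^2) / (2*s^2)
}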
We recall from mathematical statistics that when we’ve got a likelihood func-
tion, we generally want to maximize it. That is, we want to find the parameter
values which make the data we observed as likely, as probable, as the model
will allow. (This may not be very likely; that’s another issue.) We recall from
calculus that one way to maximize is to take derivatives and set them to zero.
\[
\frac{\partial L}{\partial b_0} = -\frac{1}{2s^2}\sum_{i=1}^{n} 2(y_i - (b_0 + b_1 x_i))(-1) \tag{44}
\]
\[
\frac{\partial L}{\partial b_1} = -\frac{1}{2s^2}\sum_{i=1}^{n} 2(y_i - (b_0 + b_1 x_i))(-x_i) \tag{45}
\]
Notice that when we set these derivatives to zero, all the multiplicative
constants — in particular, the prefactor of 1/(2s²) — go away. We are left with
\[
\sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right) = 0 \tag{46}
\]
\[
\sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right) x_i = 0 \tag{47}
\]
These are, up to a factor of 1/n, exactly the equations we got from the method
of least squares (Eq. 9). That means that the least squares solution is the
maximum likelihood estimate under the Gaussian noise model; this is no coincidence⁴.
2 Strictly speaking, this is a “conditional” (on X) likelihood; but only pedants use the
adjective in this context.
3 Why is this?
4 It’s no coincidence because, to put it somewhat anachronistically, what Gauss did was
ask himself “for what distribution of the noise would least squares maximize the likelihood?”.
Now let’s take the derivative with respect to s:
\[
\frac{\partial L}{\partial s} = -\frac{n}{s} + \frac{1}{s^3}\sum_{i=1}^{n}(y_i - (b_0 + b_1 x_i))^2 \tag{48}
\]
Setting this to 0 at the optimum, including multiplying through by σ̂³, we get
\[
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 \tag{49}
\]
Notice that the right-hand side is just the in-sample mean squared error.
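Putting the pieces together, here is a hedged sketch (using the slr.loglike function defined above and simulated data; none of this is from the original notes) that maximizes the log-likelihood numerically and checks that the results agree with lm()'s least-squares coefficients and with the in-sample MSE:

# Simulated data with known parameters (invented for illustration)
n <- 100
x <- runif(n, -3, 3)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

# Maximize the log-likelihood numerically; work with log(s) so the noise
# scale stays positive, and negate because optim() minimizes by default
neg.ll <- function(theta) {
  -slr.loglike(c(theta[1], theta[2], exp(theta[3])), x, y)
}
mle <- optim(par = c(0, 0, 0), fn = neg.ll)
c(b0 = mle$par[1], b1 = mle$par[2], s2 = exp(mle$par[3])^2)

fit <- lm(y ~ x)
coef(fit)                # should match the MLE intercept and slope
mean(residuals(fit)^2)   # in-sample MSE, which is sigma.hat^2 by Eq. 49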
Exercises
To think through, not to hand in.
1. Show that if E[ε|X = x] = 0 for all x, then Cov[X, ε] = 0. Would this
still be true if E[ε|X = x] = a for some other constant a?
2. Find the variance of β̂0. Hint: Do you need to worry about covariance
between Ȳ and β̂1?