Lecture 1: Optimal Prediction (With Refreshers) : 36-401, Fall 2017 Sunday 3 September, 2017
E[(Y − m)²]   (1)

We will call this the mean squared error of m, MSE(m).
From the definition of variance,

MSE(m) = E[(Y − m)²] = (E[Y − m])² + Var[Y − m]   (2)
1.1 Predicting a Random Variable from Its Distribution
The first term is the squared bias of estimating Y with m; the second term
is the variance of Y − m. Mean squared error is bias (squared) plus variance.
This is the simplest form of the bias-variance decomposition, which is one
of the central parts of statistics.
Now remember that Var [Y − m] = Var [Y ], so
dMSE(m)/dm = d/dm ( Var[Y] + (E[Y] − m)² )   (5)

is the derivative we need to work out and set to zero. So, using the chain rule,

dMSE(m)/dm = dVar[Y]/dm + 2(E[Y] − m)(dE[Y]/dm − dm/dm)   (6)
Changing the prediction we make, m, doesn’t do anything to the true dis-
tribution of Y , so dVar [Y ] /dm = dE [Y ] /dm = 0, and we’ve got
dMSE(m)/dm = −2(E[Y] − m)   (7)
Say this is zero at m = µ, and solve for µ:
−2(E [Y ] − µ) = 0 (8)
E [Y ] − µ = 0 (9)
E [Y ] = µ (10)
In other words, the best one-number guess we could make for Y is just its
expected value.
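To see this numerically, here is a small sketch (with made-up numbers, not part of the notes): take a sample of Y values, scan a grid of candidate predictions m, and the empirical MSE bottoms out at the sample mean.

```r
# Empirical check that the MSE of a constant prediction m is minimized
# at the mean of Y (hypothetical sample, matching the moments of Figure 1)
set.seed(1)
y <- rnorm(1e4, mean = 0.57, sd = sqrt(7))
mse.hat <- function(m) mean((y - m)^2)
m.grid <- seq(-2, 2, by = 0.001)
best.m <- m.grid[which.min(sapply(m.grid, mse.hat))]
c(best.m, mean(y))  # the grid minimizer lands at (the grid point nearest) the sample mean
```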
1 Or maximum; but here it's a minimum. (How could you check this, if you were worried about it?)
Figure 1: Mean squared error MSE(m) as a function of the value m we might predict, when E[Y] = 0.57, Var[Y] = 7. (The text below the plot shows the R command used to make it.)
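The R command referred to did not survive extraction; a minimal sketch that reproduces such a plot, assuming MSE(m) = Var[Y] + (E[Y] − m)² as in Eq. 2:

```r
# Sketch (assumed, not the original command) of the Figure 1 curve:
# MSE(m) = Var[Y] + (E[Y] - m)^2 with E[Y] = 0.57, Var[Y] = 7
mse <- function(m) 7 + (0.57 - m)^2
curve(mse, from = -2, to = 2, xlab = "m", ylab = "MSE(m)")
```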
For each possible value x, the optimal value µ(x) is just the conditional mean,
E [Y |X = x]. The optimal function just gives the optimal value at each point:
µ(x) = E [Y | X = x] (12)
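Operationally, E[Y | X = x] is what you would get by averaging Y over cases where X is (nearly) x. A sketch with made-up data, where the true conditional mean is x²:

```r
# Estimating mu(x) = E[Y | X = x] by averaging Y over a narrow window around x
# (hypothetical data; the true conditional mean here is mu(x) = x^2)
set.seed(3)
x <- runif(1e5, -1, 1)
y <- x^2 + rnorm(1e5, sd = 0.1)
mu.hat <- function(x0, h = 0.05) mean(y[abs(x - x0) < h])
mu.hat(0.5)   # close to the true value 0.25
```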
MSE(b0, b1) = E[(Y − (b0 + b1 X))²]   (13)
            = E[Y²] − 2 b0 E[Y] − 2 b1 E[XY] + E[(b0 + b1 X)²]   (14)
            = E[Y²] − 2 b0 E[Y] − 2 b1 (Cov[X, Y] + E[X] E[Y]) + b0² + 2 b0 b1 E[X] + b1² E[X²]   (15)
            = E[Y²] − 2 b0 E[Y] − 2 b1 Cov[X, Y] − 2 b1 E[X] E[Y] + b0²   (16)
              + 2 b0 b1 E[X] + b1² Var[X] + b1² (E[X])²
matter. (To appreciate this, think about how a straight line may seem like a simple function,
but so does a step function, and yet you need a lot of little steps to approximate a straight
line...) Resolving this leads to some very deep mathematics (??).
# MSE(b0, b1) from Eq. 16, written in terms of the moments of X and Y
mse <- function(b0, b1, E.Y.sq = 10, E.Y = 2, Cov.XY = -1, E.X = -0.5, Var.X = 3) {
    E.Y.sq - 2 * b0 * E.Y - 2 * b1 * Cov.XY - 2 * b1 * E.X * E.Y + b0^2 + 2 *
        b0 * b1 * E.X + Var.X * b1^2 + (E.X * b1)^2
}
# MSE as a function of the slope b1, one curve per intercept b0
curve(mse(b0 = -1, b1 = x), from = -1, to = 1, lty = "solid", ylim = c(0, 25),
    xlab = "Slope", ylab = expression(MSE(b[0], b[1])))
curve(mse(b0 = 0, b1 = x), add = TRUE, lty = "dashed")
curve(mse(b0 = 1, b1 = x), add = TRUE, lty = "dotted")
legend("topleft", legend = c("Intercept=-1", "Intercept=0", "Intercept=1"),
    lty = c("solid", "dashed", "dotted"))
Figure 2: Mean squared error of linear models with different slopes and intercepts, when E[X] = −0.5, Var[X] = 3, E[Y] = 2, E[Y²] = 10, Cov[X, Y] = −1. Each curve represents a different intercept b0 in the linear model b0 + b1 x for Y.
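Minimizing Eq. 16 over b0 and b1 gives the familiar formulas β1 = Cov[X, Y]/Var[X] and β0 = E[Y] − β1 E[X]. A quick numerical check (a sketch re-using the moments from Figure 2, not part of the notes):

```r
# MSE from Eq. 16, with the same moments as in Figure 2
mse <- function(b0, b1, E.Y.sq = 10, E.Y = 2, Cov.XY = -1, E.X = -0.5, Var.X = 3) {
  E.Y.sq - 2 * b0 * E.Y - 2 * b1 * Cov.XY - 2 * b1 * E.X * E.Y + b0^2 +
    2 * b0 * b1 * E.X + Var.X * b1^2 + (E.X * b1)^2
}
beta1 <- (-1) / 3            # Cov[X,Y]/Var[X] = -1/3
beta0 <- 2 - beta1 * (-0.5)  # E[Y] - beta1*E[X] = 11/6
# Numerical minimization should land on the same point
fit <- optim(c(0, 0), function(b) mse(b[1], b[2]))
fit$par   # numerically close to c(11/6, -1/3)
```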
Important Morals
1. At no time did we have to assume that the relationship between X and Y
really is linear. We have derived the optimal linear approximation to the
true relationship, whatever that might be.
Because it’s tiresome to keep writing out the derivatives in this form, I’ll abbreviate them as µ′, µ″, etc.
3 OK, to be pedantic, we had to assume that E[X], E[Y], Var[X] and Var[Y] were all well-defined and finite.
For x close enough to x0 , we can get away with truncating the series at first
order,
µ(x) ≈ µ(x0) + (x − x0)µ′(x0)   (25)
and so we could identify that first derivative with the optimal slope β1 . (The
optimal intercept β0 would depend on µ(x0 ) and the distribution of x − x0 .)
How close is enough? Close enough that all the other terms don't matter, so,
e.g., the quadratic term has to be negligible, meaning (x − x0)²|µ″(x0)|/2 ≪ |x − x0||µ′(x0)|, i.e., |x − x0| ≪ 2|µ′(x0)/µ″(x0)|.
Unless the function is really straight, therefore, any linear approximation is only
going to be good over very short ranges.
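This can be seen concretely with a hypothetical smooth regression function, say µ(x) = exp(x) expanded around x0 = 0, where the first-order Taylor approximation is 1 + x:

```r
# How fast a first-order Taylor (linear) approximation degrades,
# for the hypothetical function mu(x) = exp(x) around x0 = 0
mu  <- function(x) exp(x)
lin <- function(x) 1 + x          # mu(x0) + (x - x0) * mu'(x0)
err <- function(x) abs(mu(x) - lin(x))
c(err(0.1), err(0.5), err(1))     # error grows rapidly with distance from x0
```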
It is possible to do a lot of “local” linear approximations, and estimate µ(x)
successfully that way — in fact, we’ll see how to do that in 402 (or read ?). But
as a justification for a global linear model, this is weak.
A better justification for using linear models is simply that they are com-
putationally convenient, and there are many situations where computation is at
a premium. If you have huge amounts of data, or you need predictions very
quickly, or your computing hardware is very weak, getting a simple answer can
be better than getting the right answer. In particular, this is a rationale for
using linear models to make predictions, rather than for caring about their
parameters.
All of that said, we are going to spend most of this course talking about
doing inference on the parameters of linear models. There are a few reasons
this is not totally perverse.
• The theory of linear models is a special case of the more general theory
which covers more flexible and realistic models. But precisely because it
is such a special case, it allows for many simplifying short-cuts, which can
make it easier to learn, especially without advanced math. (We can talk
about points and lines, and not about reproducing-kernel Hilbert spaces.)
Learning linear models first is like learning to swim in a shallow pool,
rather than in the ocean with a gorgeous reef, deceptive currents, and the
occasional shark. (By the end of the year, you will know how to dive with
small sharks.)
• Because linear models are so simple, for most of the last two hundred odd
years they were the only sort of statistical model people could actually
use. This means that lots of applications of statistics, in science, in policy
and in industry, have been done on linear models. It also means that lots
of consumers of statisticians, in science, in policy and in industry, expect
linear models. It is therefore important that you understand thoroughly
both how they work and what their limitations are.
Throughout the rest of the course, we are going to tack back and forth be-
tween treating the linear model as exactly correct, and treating it as just a
Figure 3: Statistician (right) receiving population moments from the Oracle (left).
while the expectation of a discrete random variable with probability mass function p(x) is

E[X] = Σ_x x p(x)   (29)
5 Even the idea that the variables we see are randomly generated from a probability distribution is a usually-untestable assumption.
(Because everything is parallel for the discrete and continuous cases, I will not
keep writing out both forms; after tossing a coin, I will just write out the
integrals.)
The expectation of any function of a random variable f(X) is

E[f(X)] = ∫ f(x) p(x) dx   (30)
The covariance is positive when X and Y tend to be above or below their ex-
pected values together, and negative if one of them having a positive fluctuation
tends to go with the other having a negative fluctuation.
1. Linearity of expectations: E[aX + bY] = a E[X] + b E[Y]
2. Variance identity:

Var[X] = E[X²] − (E[X])² = E[(X − E[X])²]   (34)

3. Covariance identity: Cov[X, Y] = E[XY] − E[X] E[Y]
4. Covariance is symmetric: Cov[X, Y] = Cov[Y, X]
8. Variance of a sum: Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
12. Independence implies zero covariance If X and Y are independent, Cov [X, Y ] =
0. The reverse is not true; Cov [X, Y ] = 0 is even compatible with Y being
a function of X.
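The standard example of this (a sketch with made-up numbers, not from the notes) takes X symmetric about 0 and Y = X², so Y is completely determined by X yet uncorrelated with it:

```r
# Zero covariance without independence: Y is a deterministic function of X,
# but Cov[X, Y] = 0 because X is symmetric about 0
set.seed(1)
x <- rnorm(1e6)
y <- x^2            # Y is completely determined by X
cov(x, y)           # nonetheless very close to 0
```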
2.2 Convergence
The Law of Large Numbers Suppose that X1 , X2 , . . . Xn all have the same
expected value E[X], the same variance Var[X], and zero covariance with each other.
Then
(1/n) Σ_{i=1}^{n} X_i → E[X]   (44)
In particular, if the Xi all have the same distribution and are independent
(“independent and identically distributed”, IID) then this holds.
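A quick simulation of Eq. 44 (a hypothetical example, using IID Exponential(1) draws, for which E[X] = 1):

```r
# The law of large numbers in action: the running sample mean of IID
# Exponential(1) draws settles down to the expected value E[X] = 1
set.seed(42)
n <- 1e5
running.mean <- cumsum(rexp(n, rate = 1)) / (1:n)
running.mean[c(10, 1000, n)]   # wobbly early on, near 1 by the end
```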
Note: There are forms of the law of large numbers which don’t even require
a finite variance, but they are harder to state. There are also ones which do not
require constant means, or even a lack of covariance among the Xi , but they
are also harder to state.
√n (X̄n − E[X]) / √(Var[X]) ⇝ N(0, 1)   (45)
Note: There are versions of the central limit theorem which do not assume
independent or identically distributed variables being averaged, but they are
considerably more complicated to state.
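Eq. 45 is easy to watch in simulation (a made-up illustration with IID Uniform(0,1) draws, where E[X] = 1/2 and Var[X] = 1/12):

```r
# Simulating the central limit theorem: standardized sample means of
# IID Uniform(0,1) draws should look approximately standard-normal
set.seed(7)
n <- 1000
z <- replicate(5000, sqrt(n) * (mean(runif(n)) - 1/2) / sqrt(1/12))
c(mean(z), sd(z))   # approximately 0 and 1, as N(0,1) predicts
```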
a consistent estimator of E [X]. The central limit theorem tells us that the
sampling distribution is asymptotically Gaussian.
It is easy to prove (so do so!) that E[X̄n] = E[X], hence the mean is an unbiased estimator of the expected value. Notice that Var[X̄n] = Var[X1]/n, which, as promised above, goes to zero as n → ∞. The corresponding standard deviation is σ/√n, which is the “standard error of the mean”. (Again, every estimator of every quantity has its own standard error, which is not just this.)
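This too can be checked by simulation (a sketch with made-up numbers): with n = 100 draws from a normal distribution with σ = 2, the sample means should have standard deviation σ/√n = 0.2.

```r
# Checking the standard error of the mean, sigma / sqrt(n): sample means of
# n = 100 draws from N(0, sd = 2) should have standard deviation 2/10 = 0.2
set.seed(99)
xbars <- replicate(10000, mean(rnorm(100, sd = 2)))
sd(xbars)   # close to 0.2
```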