Lecture 1: Optimal Prediction (With Refreshers) : 36-401, Fall 2017 Sunday 3 September, 2017
E[(Y − m)²]   (1)

We will call this the mean squared error of m, MSE(m).
From the definition of variance,

MSE(m) = E[(Y − m)²] = (E[Y − m])² + Var[Y − m]   (2)
1.1 Predicting a Random Variable from Its Distribution
The first term is the squared bias of estimating Y with m; the second term
is the variance of Y − m. Mean squared error is bias (squared) plus variance.
This is the simplest form of the bias-variance decomposition, which is one
of the central parts of statistics.
Now remember that Var [Y − m] = Var [Y ], so
dMSE(m)/dm = d/dm ( Var[Y] + (E[Y] − m)² )   (5)

is the derivative we need to work out and set to zero. So, using the chain rule,

dMSE(m)/dm = dVar[Y]/dm + 2(E[Y] − m)(dE[Y]/dm − dm/dm)   (6)
Changing the prediction we make, m, doesn’t do anything to the true dis-
tribution of Y , so dVar [Y ] /dm = dE [Y ] /dm = 0, and we’ve got
dMSE(m)/dm = −2(E[Y] − m)   (7)
Say this is zero at m = µ, and solve for µ:
−2(E [Y ] − µ) = 0 (8)
E [Y ] − µ = 0 (9)
E [Y ] = µ (10)
In other words, the best one-number guess we could make for Y is just its
expected value.
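To see this numerically, here is a small sketch (with made-up numbers, not part of the notes): take a sample of Y values, scan a grid of candidate predictions m, and the empirical MSE bottoms out at the sample mean.

```r
# Empirical check that the MSE of a constant prediction m is minimized
# at the mean of Y (hypothetical sample, matching the moments of Figure 1)
set.seed(1)
y <- rnorm(1e4, mean = 0.57, sd = sqrt(7))
mse.hat <- function(m) mean((y - m)^2)
m.grid <- seq(-2, 2, by = 0.001)
best.m <- m.grid[which.min(sapply(m.grid, mse.hat))]
c(best.m, mean(y))  # the grid minimizer lands at (the grid point nearest) the sample mean
```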
1 Or maximum; but here it's a minimum. (How could you check this, if you were worried about it?)
Figure 1: Mean squared error MSE(m) as a function of the value m we might predict, when E[Y] = 0.57, Var[Y] = 7. (The text below the plot shows the R command used to make it.)
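The R command referred to did not survive extraction; a minimal sketch that reproduces such a plot, assuming MSE(m) = Var[Y] + (E[Y] − m)² as in Eq. 2:

```r
# Sketch (assumed, not the original command) of the Figure 1 curve:
# MSE(m) = Var[Y] + (E[Y] - m)^2 with E[Y] = 0.57, Var[Y] = 7
mse <- function(m) 7 + (0.57 - m)^2
curve(mse, from = -2, to = 2, xlab = "m", ylab = "MSE(m)")
```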
For each possible value x, the optimal value µ(x) is just the conditional mean,
E [Y |X = x]. The optimal function just gives the optimal value at each point:
µ(x) = E [Y | X = x] (12)
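Operationally, E[Y | X = x] is what you would get by averaging Y over cases where X is (nearly) x. A sketch with made-up data, where the true conditional mean is x²:

```r
# Estimating mu(x) = E[Y | X = x] by averaging Y over a narrow window around x
# (hypothetical data; the true conditional mean here is mu(x) = x^2)
set.seed(3)
x <- runif(1e5, -1, 1)
y <- x^2 + rnorm(1e5, sd = 0.1)
mu.hat <- function(x0, h = 0.05) mean(y[abs(x - x0) < h])
mu.hat(0.5)   # close to the true value 0.25
```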
MSE(b0, b1) = E[(Y − (b0 + b1 X))²]   (13)
            = E[Y²] − 2 b0 E[Y] − 2 b1 E[XY] + E[(b0 + b1 X)²]   (14)
            = E[Y²] − 2 b0 E[Y] − 2 b1 (Cov[X, Y] + E[X] E[Y]) + b0² + 2 b0 b1 E[X] + b1² E[X²]   (15)
            = E[Y²] − 2 b0 E[Y] − 2 b1 Cov[X, Y] − 2 b1 E[X] E[Y] + b0²   (16)
              + 2 b0 b1 E[X] + b1² Var[X] + b1² (E[X])²
matter. (To appreciate this, think about how a straight line may seem like a simple function,
but so does a step function, and yet you need a lot of little steps to approximate a straight
line...) Resolving this leads to some very deep mathematics (??).
# MSE(b0, b1) from Eq. 16, written in terms of the moments of X and Y
mse <- function(b0, b1, E.Y.sq = 10, E.Y = 2, Cov.XY = -1, E.X = -0.5, Var.X = 3) {
    E.Y.sq - 2 * b0 * E.Y - 2 * b1 * Cov.XY - 2 * b1 * E.X * E.Y + b0^2 + 2 *
        b0 * b1 * E.X + Var.X * b1^2 + (E.X * b1)^2
}
# MSE as a function of the slope b1, one curve per intercept b0
curve(mse(b0 = -1, b1 = x), from = -1, to = 1, lty = "solid", ylim = c(0, 25),
    xlab = "Slope", ylab = expression(MSE(b[0], b[1])))
curve(mse(b0 = 0, b1 = x), add = TRUE, lty = "dashed")
curve(mse(b0 = 1, b1 = x), add = TRUE, lty = "dotted")
legend("topleft", legend = c("Intercept=-1", "Intercept=0", "Intercept=1"),
    lty = c("solid", "dashed", "dotted"))
Figure 2: Mean squared error of linear models with different slopes and intercepts, when E[X] = −0.5, Var[X] = 3, E[Y] = 2, E[Y²] = 10, Cov[X, Y] = −1. Each curve represents a different intercept b0 in the linear model b0 + b1 x for Y.
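Minimizing Eq. 16 over b0 and b1 gives the familiar formulas β1 = Cov[X, Y]/Var[X] and β0 = E[Y] − β1 E[X]. A quick numerical check (a sketch re-using the moments from Figure 2, not part of the notes):

```r
# MSE from Eq. 16, with the same moments as in Figure 2
mse <- function(b0, b1, E.Y.sq = 10, E.Y = 2, Cov.XY = -1, E.X = -0.5, Var.X = 3) {
  E.Y.sq - 2 * b0 * E.Y - 2 * b1 * Cov.XY - 2 * b1 * E.X * E.Y + b0^2 +
    2 * b0 * b1 * E.X + Var.X * b1^2 + (E.X * b1)^2
}
beta1 <- (-1) / 3            # Cov[X,Y]/Var[X] = -1/3
beta0 <- 2 - beta1 * (-0.5)  # E[Y] - beta1*E[X] = 11/6
# Numerical minimization should land on the same point
fit <- optim(c(0, 0), function(b) mse(b[1], b[2]))
fit$par   # numerically close to c(11/6, -1/3)
```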
Important Morals
1. At no time did we have to assume that the relationship between X and Y
really is linear. We have derived the optimal linear approximation to the
true relationship, whatever that might be.
Because it’s tiresome to keep writing out the derivatives in this form, I’ll abbreviate them as µ′, µ″, etc.
3 OK, to be pedantic, we had to assume that E[X], E[Y], Var[X] and Var[Y] were all well-defined and finite.
For x close enough to x0 , we can get away with truncating the series at first
order,
µ(x) ≈ µ(x0) + (x − x0)µ′(x0)   (25)
and so we could identify that first derivative with the optimal slope β1 . (The
optimal intercept β0 would depend on µ(x0 ) and the distribution of x − x0 .)
How close is enough? Close enough that all the other terms don't matter, so,
e.g., the quadratic term has to be negligible, meaning (x − x0)²|µ″(x0)|/2 ≪ |x − x0||µ′(x0)|, i.e., |x − x0| ≪ 2|µ′(x0)/µ″(x0)|.
Unless the function is really straight, therefore, any linear approximation is only
going to be good over very short ranges.
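This can be seen concretely with a hypothetical smooth regression function, say µ(x) = exp(x) expanded around x0 = 0, where the first-order Taylor approximation is 1 + x:

```r
# How fast a first-order Taylor (linear) approximation degrades,
# for the hypothetical function mu(x) = exp(x) around x0 = 0
mu  <- function(x) exp(x)
lin <- function(x) 1 + x          # mu(x0) + (x - x0) * mu'(x0)
err <- function(x) abs(mu(x) - lin(x))
c(err(0.1), err(0.5), err(1))     # error grows rapidly with distance from x0
```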
It is possible to do a lot of “local” linear approximations, and estimate µ(x)
successfully that way — in fact, we’ll see how to do that in 402 (or read ?). But
as a justification for a global linear model, this is weak.
A better justification for using linear models is simply that they are com-
putationally convenient, and there are many situations where computation is at
a premium. If you have huge amounts of data, or you need predictions very
quickly, or your computing hardware is very weak, getting a simple answer can
be better than getting the right answer. In particular, this is a rationale for
using linear models to make predictions, rather than for caring about their
parameters.
All of that said, we are going to spend most of this course talking about
doing inference on the parameters of linear models. There are a few reasons
this is not totally perverse.
• The theory of linear models is a special case of the more general theory
which covers more flexible and realistic models. But precisely because it
is such a special case, it allows for many simplifying short-cuts, which can
make it easier to learn, especially without advanced math. (We can talk
about points and lines, and not about reproducing-kernel Hilbert spaces.)
Learning linear models first is like learning to swim in a shallow pool,
rather than in the ocean with a gorgeous reef, deceptive currents, and the
occasional shark. (By the end of the year, you will know how to dive with
small sharks.)
• Because linear models are so simple, for most of the last two hundred odd
years they were the only sort of statistical model people could actually
use. This means that lots of applications of statistics, in science, in policy
and in industry, have been done on linear models. It also means that lots
of consumers of statisticians, in science, in policy and in industry, expect
linear models. It is therefore important that you understand thoroughly
both how they work and what their limitations are.
Throughout the rest of the course, we are going to tack back and forth be-
tween treating the linear model as exactly correct, and treating it as just a
Figure 3: Statistician (right) receiving population moments from the Oracle (left).
while the expectation of a discrete random variable with probability mass function p(x) is

E[X] = Σ_x x p(x)   (29)
5 Even the idea that the variables we see are randomly generated from a probability distribution is a usually-untestable assumption.
(Because everything is parallel for the discrete and continuous cases, I will not
keep writing out both forms; after tossing a coin, I will just write out the
integrals.)
The expectation of any function of a random variable f(X) is

E[f(X)] = ∫ f(x) p(x) dx   (30)
The covariance is positive when X and Y tend to be above or below their ex-
pected values together, and negative if one of them having a positive fluctuation
tends to go with the other having a negative fluctuation.
1. Linearity of expectations: E[aX + bY] = a E[X] + b E[Y]
2. Variance identity:

Var[X] = E[X²] − (E[X])² = E[(X − E[X])²]   (34)

3. Covariance identity: Cov[X, Y] = E[XY] − E[X] E[Y]
4. Covariance is symmetric: Cov[X, Y] = Cov[Y, X]
8. Variance of a sum: Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
12. Independence implies zero covariance If X and Y are independent, Cov [X, Y ] =
0. The reverse is not true; Cov [X, Y ] = 0 is even compatible with Y being
a function of X.
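The standard example of this (a sketch with made-up numbers, not from the notes) takes X symmetric about 0 and Y = X², so Y is completely determined by X yet uncorrelated with it:

```r
# Zero covariance without independence: Y is a deterministic function of X,
# but Cov[X, Y] = 0 because X is symmetric about 0
set.seed(1)
x <- rnorm(1e6)
y <- x^2            # Y is completely determined by X
cov(x, y)           # nonetheless very close to 0
```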
2.2 Convergence
The Law of Large Numbers Suppose that X1 , X2 , . . . Xn all have the same
expected value E[X], the same variance Var[X], and zero covariance with each other.
Then
(1/n) Σ_{i=1}^{n} X_i → E[X]   (44)
In particular, if the Xi all have the same distribution and are independent
(“independent and identically distributed”, IID) then this holds.
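A quick simulation of Eq. 44 (a hypothetical example, using IID Exponential(1) draws, for which E[X] = 1):

```r
# The law of large numbers in action: the running sample mean of IID
# Exponential(1) draws settles down to the expected value E[X] = 1
set.seed(42)
n <- 1e5
running.mean <- cumsum(rexp(n, rate = 1)) / (1:n)
running.mean[c(10, 1000, n)]   # wobbly early on, near 1 by the end
```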
Note: There are forms of the law of large numbers which don’t even require
a finite variance, but they are harder to state. There are also ones which do not
require constant means, or even a lack of covariance among the Xi , but they
are also harder to state.
√n (X̄n − E[X]) / √(Var[X]) ⇝ N(0, 1)   (45)
Note: There are versions of the central limit theorem which do not assume
independent or identically distributed variables being averaged, but they are
considerably more complicated to state.
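Eq. 45 is easy to watch in simulation (a made-up illustration with IID Uniform(0,1) draws, where E[X] = 1/2 and Var[X] = 1/12):

```r
# Simulating the central limit theorem: standardized sample means of
# IID Uniform(0,1) draws should look approximately standard-normal
set.seed(7)
n <- 1000
z <- replicate(5000, sqrt(n) * (mean(runif(n)) - 1/2) / sqrt(1/12))
c(mean(z), sd(z))   # approximately 0 and 1, as N(0,1) predicts
```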
a consistent estimator of E [X]. The central limit theorem tells us that the
sampling distribution is asymptotically Gaussian.
It is easy to prove (so do so!) that E[X̄n] = E[X], hence the mean is an unbiased estimator of the expected value. Notice that Var[X̄n] = Var[X1]/n, which, as promised above, goes to zero as n → ∞. The corresponding standard deviation is σ/√n, which is the “standard error of the mean”. (Again, every estimator of every quantity has its own standard error, which is not just this.)
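This too can be checked by simulation (a sketch with made-up numbers): with n = 100 draws from a normal distribution with σ = 2, the sample means should have standard deviation σ/√n = 0.2.

```r
# Checking the standard error of the mean, sigma / sqrt(n): sample means of
# n = 100 draws from N(0, sd = 2) should have standard deviation 2/10 = 0.2
set.seed(99)
xbars <- replicate(10000, mean(rnorm(100, sd = 2)))
sd(xbars)   # close to 0.2
```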