This document discusses linear regression and related statistical concepts. It begins by introducing linear regression heuristically using examples from economics and physics. It then presents the theoretical linear regression of a random variable Y on another random variable X. It defines the least squares error estimator for the regression coefficients and discusses its properties. The document extends the linear regression model to multiple explanatory variables and deterministic designs with Gaussian noise. It introduces significance tests for individual regression coefficients and groups of coefficients. Finally, it discusses additional tests for linear hypotheses about the regression coefficients.

Statistics for Applications

Chapter 7: Regression

1/43
Heuristics of the linear regression (1)
Consider a cloud of i.i.d. random points (Xi , Yi ), i = 1, . . . , n :

2/43
Heuristics of the linear regression (2)

Idea: Fit the line that best fits the data.

Approximation: Yi ≈ a + bXi, i = 1, . . . , n, for some (unknown) a, b ∈ IR.

Find estimators â, b̂ that approach a and b.

More generally: Yi ∈ IR, Xi ∈ IR^d,

Yi ≈ a + Xi⊤ b,  a ∈ IR, b ∈ IR^d.

Goal: Write a rigorous model and estimate a and b.

3/43
Heuristics of the linear regression (3)

Examples:

Economics: Demand and price,

Di ≈ a + b pi , i = 1, . . . , n.

Ideal gas law: P V = nRT,

log Pi ≈ a + b log Vi + c log Ti , i = 1, . . . , n.

4/43
Linear regression of a r.v. Y on a r.v. X (1)

Let X and Y be two real r.v. (not necessarily independent)
with finite second moments and such that Var(X) ≠ 0.

The theoretical linear regression of Y on X is the best
approximation in quadratic mean of Y by a linear function of
X, i.e., the r.v. a + bX, where a and b are the two real
numbers minimizing IE[(Y − a − bX)²].

By some simple algebra:

b = cov(X, Y) / Var(X),

a = IE[Y] − b IE[X] = IE[Y] − (cov(X, Y) / Var(X)) IE[X].

5/43
Linear regression of a r.v. Y on a r.v. X (2)

If " = Y − (a + bX), then

Y = a + bX + ",

with IE["] = 0 and cov(X, ") = 0.

Conversely: Assume that Y = a + bX + " for some a, b 2 IR


and some centered r.v. " that satisfies cov(X, ") = 0.

E.g., if X ?? " or if IE["|X] = 0, then cov(X, ") = 0.

Then, a + bX is the theoretical linear regression of Y on X.

6/43
Linear regression of a r.v. Y on a r.v. X (3)
A sample of n i.i.d. random pairs (X1 , Y1 ), . . . , (Xn , Yn ) with
same distribution as (X, Y ) is available.

We want to estimate a and b.

11/43
Linear regression of a r.v. Y on a r.v. X (4)

Definition
The least squares error (LSE) estimator of (a, b) is the minimizer
of the sum of squared errors:

Σ_{i=1}^n (Yi − a − bXi)².

(â, b̂) is given by

b̂ = ( (1/n) Σ_{i=1}^n XiYi − X̄ Ȳ ) / ( (1/n) Σ_{i=1}^n Xi² − X̄² ),

â = Ȳ − b̂ X̄,

where X̄ and Ȳ denote the sample means of the Xi and the Yi.

12/43
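To make the closed-form formulas concrete, here is a minimal numpy sketch; the simulated data and the values a = 1, b = 2 are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulate i.i.d. pairs (X_i, Y_i) with Y = a + b X + noise, a = 1, b = 2 (assumed)
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

# LSE via the closed-form expressions above
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean()**2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)  # should be close to 1 and 2
```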
Linear regression of a r.v. Y on a r.v. X (5)

13/43
Multivariate case (1)

Yi = Xi⊤ β + εi ,  i = 1, . . . , n.

Vector of explanatory variables or covariates: Xi ∈ IR^p (w.l.o.g.,
assume its first coordinate is 1).

Dependent variable: Yi .

β = (a, b⊤)⊤ ; β1 (= a) is called the intercept.

{εi}i=1,...,n : noise terms satisfying cov(Xi , εi) = 0.

Definition
The least squares error (LSE) estimator of β is the minimizer of
the sum of squared errors:

β̂ ∈ argmin_{t ∈ IR^p} Σ_{i=1}^n (Yi − Xi⊤ t)².
14/43
Multivariate case (2)

LSE in matrix form


Let Y = (Y1, . . . , Yn)⊤ ∈ IRⁿ.

Let X be the n × p matrix whose rows are X1⊤, . . . , Xn⊤ (X is
called the design).

Let ε = (ε1, . . . , εn)⊤ ∈ IRⁿ (unobserved noise).

Y = Xβ + ε.

The LSE β̂ satisfies:

β̂ ∈ argmin_{t ∈ IR^p} ‖Y − Xt‖₂².

15/43
Multivariate case (3)

Assume that rank(X) = p.

Analytic computation of the LSE:

β̂ = (X⊤X)⁻¹ X⊤ Y.

Geometric interpretation of the LSE

Xβ̂ is the orthogonal projection of Y onto the subspace


spanned by the columns of X:

Xβ̂ = P Y,

where P = X(X⊤X)⁻¹ X⊤.

16/43
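A small numpy sketch of the matrix formula; the design X and response Y below are simulated for illustration, and np.linalg.lstsq is used rather than forming (X⊤X)⁻¹ explicitly (numerically preferable, same β̂ when rank(X) = p).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3

# Design with an intercept column of ones (first coordinate of each X_i is 1)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])          # assumed "true" coefficients
Y = X @ beta + rng.normal(scale=0.3, size=n)

# LSE: beta_hat = (X^T X)^{-1} X^T Y, computed stably via least squares
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Orthogonal projection of Y onto the column span of X
P = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = P @ Y                               # equals X @ beta_hat up to rounding

print(beta_hat)
print(np.allclose(Y_hat, X @ beta_hat))
```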
Linear regression with deterministic design and Gaussian
noise (1)

Assumptions:

The design matrix X is deterministic and rank(X) = p.

The model is homoscedastic: ε1, . . . , εn are i.i.d.

The noise vector ε is Gaussian:

ε ∼ Nn(0, σ² In),

for some known or unknown σ² > 0.

17/43
Linear regression with deterministic design and Gaussian
noise (2)
LSE = MLE:  β̂ ∼ Np(β, σ² (X⊤X)⁻¹).

Quadratic risk of β̂:  IE[‖β̂ − β‖₂²] = σ² tr((X⊤X)⁻¹).

Prediction error:  IE[‖Y − Xβ̂‖₂²] = σ² (n − p).

Unbiased estimator of σ²:  σ̂² = ‖Y − Xβ̂‖₂² / (n − p).

Theorem

(n − p) σ̂² / σ² ∼ χ²_{n−p}.

β̂ ⊥⊥ σ̂².
18/43
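Continuing the numpy sketch above (same X, Y, beta_hat, n, and p), the unbiased variance estimator would be computed as follows.

```python
# Unbiased estimator of sigma^2, with n - p degrees of freedom
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)
print(sigma2_hat)  # close to 0.3**2 = 0.09 for the simulated data above
```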
Significance tests (1)
Test whether the j-th explanatory variable is significant in the
linear regression (1 ≤ j ≤ p).

H0 : βj = 0  vs.  H1 : βj ≠ 0.

If γj is the j-th diagonal coefficient of (X⊤X)⁻¹ (γj > 0):

(β̂j − βj) / √(σ̂² γj) ∼ t_{n−p}.

Let Tn(j) = β̂j / √(σ̂² γj).

Test with non-asymptotic level α ∈ (0, 1):

δα(j) = 1{|Tn(j)| > q_{α/2}(t_{n−p})},

where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.


19/43
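A hedged sketch of this test for a single coefficient, reusing the simulated X, Y, beta_hat, sigma2_hat, n, p from the earlier blocks; the choice of j and alpha is ours, and scipy's t distribution supplies the quantile.

```python
from scipy import stats

j = 2                      # test H0: beta_j = 0 for the third coefficient (0-indexed)
alpha = 0.05

gamma = np.linalg.inv(X.T @ X)             # gamma_j = j-th diagonal entry
T_j = beta_hat[j] / np.sqrt(sigma2_hat * gamma[j, j])

q = stats.t.ppf(1 - alpha / 2, df=n - p)   # (1 - alpha/2)-quantile of t_{n-p}
reject = abs(T_j) > q
p_value = 2 * stats.t.sf(abs(T_j), df=n - p)
print(T_j, reject, p_value)
```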
Significance tests (2)

Test whether a group of explanatory variables is significant in


the linear regression.

H0 : βj = 0, ∀j ∈ S  vs.  H1 : ∃j ∈ S, βj ≠ 0, where
S ⊆ {1, . . . , p}.

Bonferroni's test: δα^B = max_{j ∈ S} δ_{α/k}^{(j)}, where k = |S|.

δα^B has non-asymptotic level at most α.

20/43
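A minimal sketch of the Bonferroni group test, reusing beta_hat, sigma2_hat, gamma, alpha, n, p from the blocks above; the group S is an arbitrary illustrative choice.

```python
S = [1, 2]                               # hypothetical group of coefficients to test
k = len(S)

# Individual t-statistics for the coefficients in S
T = beta_hat[S] / np.sqrt(sigma2_hat * np.diag(gamma)[S])

# Reject H0 if any individual test at level alpha / k rejects
q_bonf = stats.t.ppf(1 - (alpha / k) / 2, df=n - p)
print(np.any(np.abs(T) > q_bonf))
```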
More tests (1)

Let G be a k × p matrix with rank(G) = k (k ≤ p) and λ ∈ IR^k.

Consider the hypotheses:

H0 : Gβ = λ  vs.  H1 : Gβ ≠ λ.

The setup of the previous slide is a particular case.

If H0 is true, then:

Gβ̂ − λ ∼ Nk(0, σ² G(X⊤X)⁻¹G⊤),

and

σ⁻² (Gβ̂ − λ)⊤ (G(X⊤X)⁻¹G⊤)⁻¹ (Gβ̂ − λ) ∼ χ²_k.

21/43
More tests (2)
Let Sn = (1/(σ̂² k)) (Gβ̂ − λ)⊤ (G(X⊤X)⁻¹G⊤)⁻¹ (Gβ̂ − λ).

If H0 is true, then Sn ∼ F_{k,n−p}.

Test with non-asymptotic level α ∈ (0, 1):

δα = 1{Sn > qα(F_{k,n−p})},

where qα(F_{k,n−p}) is the (1 − α)-quantile of F_{k,n−p}.

Definition
The Fisher distribution with p and q degrees of freedom, denoted
by F_{p,q}, is the distribution of (U/p)/(V/q), where:

U ∼ χ²_p , V ∼ χ²_q ,

U ⊥⊥ V.
22/43
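A sketch of the F-test for a general linear hypothesis Gβ = λ, continuing the simulated example; the matrix G and vector lam below are illustrative choices, not from the slides.

```python
# Hypothetical linear hypothesis: beta_1 = 2 and beta_2 = 0, i.e. G @ beta = lam
G = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
lam = np.array([2.0, 0.0])
k = G.shape[0]

diff = G @ beta_hat - lam
M = G @ np.linalg.inv(X.T @ X) @ G.T
S_n = diff @ np.linalg.solve(M, diff) / (sigma2_hat * k)

q_F = stats.f.ppf(1 - alpha, dfn=k, dfd=n - p)   # (1 - alpha)-quantile of F_{k, n-p}
print(S_n, S_n > q_F)
```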
Concluding remarks

Linear regression exhibits correlations, NOT causality.

Normality of the noise: One can use goodness-of-fit tests to
test whether the residuals ε̂i = Yi − Xi⊤β̂ are Gaussian.

Deterministic design: If X is not deterministic, all the above


can be understood conditionally on X, if the noise is assumed
to be Gaussian, conditionally on X.

23/43
Linear regression and lack of identifiability (1)
Consider the following model:

Y = Xβ + ε,

with:
1. Y ∈ IRⁿ (dependent variables), X ∈ IR^{n×p} (deterministic
design);
2. β ∈ IR^p, unknown;
3. ε ∼ Nn(0, σ² In).

Previously, we assumed that X had rank p, so we could invert
X⊤X.

What if X is not of rank p? E.g., if p > n?

β would no longer be identified: estimation of β is hopeless
(unless we add more structure).
24/43
Linear regression and lack of identifiability (2)

What about prediction? Xβ is still identified.

Ŷ: the orthogonal projection of Y onto the linear span of the
columns of X.

Ŷ = Xβ̂ = X(X⊤X)† X⊤ Y, where A† stands for the
(Moore-Penrose) pseudo-inverse of a matrix A.

As before, if k = rank(X):

‖Ŷ − Y‖₂² / σ² ∼ χ²_{n−k},

‖Ŷ − Y‖₂² ⊥⊥ Ŷ.

25/43
Linear regression and lack of identifiability (3)

In particular:

IE[‖Ŷ − Y‖₂²] = (n − k) σ².

Unbiased estimator of the variance:

σ̂² = ‖Ŷ − Y‖₂² / (n − k).

26/43
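A numpy sketch of the last two slides: prediction via the pseudo-inverse when X is rank deficient, and the corresponding variance estimate. The design below, with a duplicated column, is an artificial example of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
Z = rng.normal(size=(n, 4))
X = np.column_stack([np.ones(n), Z, Z[:, 0]])   # duplicated column: rank(X) < p = 6
beta = rng.normal(size=X.shape[1])
Y = X @ beta + rng.normal(scale=0.2, size=n)

# Y_hat = X (X^T X)^+ X^T Y: orthogonal projection of Y onto the column span of X
Y_hat = X @ np.linalg.pinv(X.T @ X) @ X.T @ Y

k = np.linalg.matrix_rank(X)                    # here k = 5 < p
sigma2_hat = np.sum((Y - Y_hat) ** 2) / (n - k) # unbiased estimator of sigma^2
print(k, sigma2_hat)
```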
Linear regression in high dimension (1)
Consider again the following model:

Y = Xβ + ε,

with:
1. Y ∈ IRⁿ (dependent variables), X ∈ IR^{n×p} (deterministic
design);
2. β ∈ IR^p, unknown: to be estimated;
3. ε ∼ Nn(0, σ² In).
For each i, Xi ∈ IR^p is the vector of covariates of the i-th
individual.

If p is too large (p > n), there are too many parameters to be
estimated (the model overfits), even though some covariates may
be irrelevant.

Solution: Reduction of the dimension.


27/43
Linear regression in high dimension (2)

Idea: Assume that only a few coordinates of β are nonzero


(but we do not know which ones).

Based on the sample, select a subset of covariates and


estimate the corresponding coordinates of β.

For S ⊆ {1, . . . , p}, let

β̂_S ∈ argmin_{t ∈ IR^S} ‖Y − X_S t‖₂²,

where X_S is the submatrix of X obtained by keeping only the
covariates indexed in S.

28/43
Linear regression in high dimension (3)

Select a subset S that minimizes the prediction error


penalized by the complexity (or size) of the model:
‖Y − X_S β̂_S‖₂² + λ|S|,

where λ > 0 is a tuning parameter.

If λ = 2σ̂², this is Mallows' Cp or the AIC criterion.

If λ = σ̂² log n, this is the BIC criterion.

29/43
Linear regression in high dimension (4)
Each of these criteria is equivalent to finding b ∈ IR^p that
minimizes:
‖Y − Xb‖₂² + λ‖b‖₀,
where ‖b‖₀ is the number of nonzero coefficients of b.

This is a computationally hard problem: it is nonconvex and
requires computing 2^p estimators (all the β̂_S, for
S ⊆ {1, . . . , p}).

Lasso estimator: replace ‖b‖₀ = Σ_{j=1}^p 1{bj ≠ 0} with
‖b‖₁ = Σ_{j=1}^p |bj|, and the problem becomes convex:

β̂^L ∈ argmin_{b ∈ IR^p} ‖Y − Xb‖₂² + λ‖b‖₁,

where λ > 0 is a tuning parameter.
30/43
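A hedged sketch of the Lasso on simulated high-dimensional data. scikit-learn's Lasso solves the penalized problem above, with its alpha parameter playing the role of λ up to the 1/(2n) scaling sklearn applies to the squared-error term; the data and the value of alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 50, 200                                  # p > n: high-dimensional regime
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]          # sparse "truth", assumed for illustration
Y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1)                        # alpha is the tuning parameter
lasso.fit(X, Y)

support = np.flatnonzero(lasso.coef_)           # estimated set of nonzero coordinates
print(support, lasso.coef_[support])
```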
Linear regression in high dimension (5)

How to choose λ?

This is a difficult question (see the graduate course 18.657,
"High-dimensional statistics", in Spring 2017).

A good choice of λ will lead to an estimator β̂ that is very
close to β and will allow recovery, with high probability, of the
subset S* of all j ∈ {1, . . . , p} for which βj ≠ 0.

31/43
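One common practical answer is cross-validation; here is a sketch using scikit-learn's LassoCV on the data simulated in the previous block. This is our illustration, not a recommendation made in the slides.

```python
from sklearn.linear_model import LassoCV

# 5-fold cross-validation over a grid of candidate tuning parameters (alphas)
lasso_cv = LassoCV(cv=5).fit(X, Y)
print(lasso_cv.alpha_)                 # selected tuning parameter
print(np.flatnonzero(lasso_cv.coef_))  # estimated support
```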
Linear regression in high dimension (6)

32/43
Nonparametric regression (1)

In the linear setup, we assumed that Yi = Xi⊤β + εi, where the
Xi are deterministic.

This has to be understood as working conditionally on the
design.

This amounts to assuming that IE[Yi | Xi] is a linear function of
Xi, which is not true in general.

Let f(x) = IE[Yi | Xi = x], x ∈ IR^p: how can we estimate the
function f ?

33/43
Nonparametric regression (2)

Let p = 1 in the sequel.


One can make a parametric assumption on f.

E.g., f(x) = a + bx, f(x) = a + bx + cx², f(x) = e^{a+bx}, ...

The problem then reduces to the estimation of a finite number of
parameters.

LSE, MLE: all the previous theory for the linear case can be
adapted.

What if we do not make any such parametric assumption on f?

34/43
Nonparametric regression (3)

Assume f is smooth enough: f can be well approximated by a


piecewise constant function.

Idea: Local averages.

For x ∈ IR: f(t) ≈ f(x) for t close to x.

For all i such that Xi is close enough to x,

Yi ≈ f(x) + εi.

Estimate f (x) by the average of all Yi ’s for which Xi is close


enough to x.

35/43
Nonparametric regression (4)

Let h > 0 be the window size (or bandwidth).

Let Ix = {i = 1, . . . , n : |Xi − x| < h}.

Let f̂n,h(x) be the average of {Yi : i ∈ Ix}:

f̂n,h(x) = (1/|Ix|) Σ_{i ∈ Ix} Yi  if Ix ≠ ∅,  and 0 otherwise.

36/43
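A minimal Python sketch of this local-average estimator; the simulated data mimic the later example with f(x) = x(1 − x) and n = 100, and the names are ours.

```python
import numpy as np

def f_hat(x, X, Y, h):
    """Local average of the Y_i whose X_i lie within distance h of x (0 if none)."""
    mask = np.abs(X - x) < h
    return Y[mask].mean() if mask.any() else 0.0

rng = np.random.default_rng(4)
n = 100
X = rng.uniform(size=n)
Y = X * (1 - X) + rng.normal(scale=0.05, size=n)

print(f_hat(0.6, X, Y, h=0.1))   # estimate of f(0.6); the true value is 0.24
```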
Nonparametric regression (5)
[Figure: scatter plot of the sample (Xi, Yi), Y against X.]
37/43
Nonparametric regression (6)
[Figure: the same scatter plot; the points with |Xi − x| < h are highlighted for x = 0.6 and h = 0.1, giving f̂n,h(x) ≈ 0.27.]

38/43
Nonparametric regression (7)

How to choose h?

If h → 0: overfitting the data;

If h → ∞: underfitting, f̂n,h(x) = Ȳn.

39/43
Nonparametric regression (8)
Example:
n = 100, f (x) = x(1 − x),
h = 0.005.

[Figure: the data with the estimator f̂n,h overlaid, for h = 0.005.]
40/43
Nonparametric regression (9)
Example:
n = 100, f (x) = x(1 − x),
h = 1.
[Figure: the data with the estimator f̂n,h overlaid, for h = 1.]
41/43
Nonparametric regression (10)
Example:
n = 100, f (x) = x(1 − x),
h = 0.2.

[Figure: the data with the estimator f̂n,h overlaid, for h = 0.2.]
42/43
Nonparametric regression (11)

Choice of h?

If the smoothness of f is known (i.e., the quality of the local
approximation of f by piecewise constant functions): there is a
good choice of h depending on that smoothness.

If the smoothness of f is unknown: other techniques, e.g.,
cross-validation.

43/43
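As one concrete instance of the cross-validation idea, here is a sketch of selecting h by leave-one-out cross-validation, reusing the f_hat helper and the simulated (X, Y) from the local-average sketch earlier; the bandwidth grid is an arbitrary choice of ours.

```python
import numpy as np

def loo_cv_error(X, Y, h):
    """Leave-one-out squared prediction error of the local-average estimator at bandwidth h."""
    n = len(Y)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        errs.append((Y[i] - f_hat(X[i], X[keep], Y[keep], h)) ** 2)
    return np.mean(errs)

grid = [0.005, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
best_h = min(grid, key=lambda h: loo_cv_error(X, Y, h))
print(best_h)   # very small and very large h should give larger CV error
```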
MIT OpenCourseWare
https://ocw.mit.edu

Statistics for Applications
Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
