
18.650 – Fundamentals of Statistics

6. Linear regression
Goals

Consider two random variables X and Y. For example,
1. X is the amount of $ spent on Facebook ads and Y is the total conversion rate
2. X is the age of a person and Y is the number of clicks

Given two random variables (X, Y), we can ask the following questions:
- How do we predict Y from X?
- What are the error bars around this prediction?
- How many more conversions Y do we get for an additional dollar spent?
- Does the number of clicks even depend on age?
- What if X is a random vector? For example, X = (X_1, X_2), where X_1 is the amount of $ spent on Facebook ads and X_2 is the duration in days of the campaign.
Conversions vs. amount spent

[Figure]

Clicks vs. age

[Figure]
Modeling assumptions

- (X_i, Y_i), i = 1, ..., n are i.i.d. from some unknown joint distribution P.
- P can be described by (assuming all exist):
  - a joint PDF h(x, y), or
  - the marginal density of X, h(x) = ∫ h(x, y) dy, together with the conditional density h(y|x) = h(x, y) / h(x).
- h(y|x) answers all our questions: it contains all the information about Y given X = x.
Partial modeling

We can also describe the distribution only partially, e.g., using
- The expectation of Y: E[Y]
- The conditional expectation of Y given X = x: E[Y | X = x]

The function

    x ↦ f(x) := E[Y | X = x] = ∫ y h(y|x) dy

is called the regression function of Y on X.

- Other possibilities:
  - The conditional median m(x), such that ∫_{−∞}^{m(x)} h(y|x) dy = 1/2
  - Conditional quantiles
  - Conditional variance (not informative about location)
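To make the regression function concrete, here is a minimal sketch, assuming Python with NumPy and purely simulated data (none of it comes from the slides), that estimates f(x) = E[Y | X = x] by averaging Y within bins of X:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pairs (X_i, Y_i): a nonlinear conditional mean plus noise.
n = 5000
X = rng.uniform(0.0, 10.0, size=n)
Y = np.sin(X) + 0.5 * X + rng.normal(scale=0.3, size=n)

# Crude estimate of f(x) = E[Y | X = x]: average Y over bins of X.
edges = np.linspace(0.0, 10.0, 11)
which_bin = np.digitize(X, edges)
for k in range(1, len(edges)):
    in_bin = which_bin == k
    if in_bin.any():
        x_mid = 0.5 * (edges[k - 1] + edges[k])
        print(f"x ~ {x_mid:4.1f}   estimated E[Y | X ~ x] = {Y[in_bin].mean():6.2f}")
```

Binning is only one of many nonparametric choices; the slides below restrict instead to the linear family f(x) = a + bx.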
Conditional expectation and standard deviation

[Figure]

Conditional expectation

[Figure]

Conditional density and conditional quantiles

[Figure]

Conditional distribution: boxplots

[Figure]
Linear regression

We first focus on modeling the regression function

    f(x) = E[Y | X = x]

- Too many possible regression functions f (nonparametric)
- Useful to restrict to simple functions that are described by a few parameters
- Simplest:

    f(x) = a + bx

Under this assumption, we talk about linear regression.
Nonparametric regression

[Figure]

Linear regression

[Figure]
Probabilistic analysis

- Let X and Y be two real random variables (not necessarily independent) with two moments and such that var(X) > 0.
- The theoretical linear regression of Y on X is the line x ↦ a^* + b^* x, where

    (a^*, b^*) = argmin_{(a, b) ∈ R²} E[(Y − a − bX)²]

- Setting partial derivatives to zero gives

    b^* = cov(X, Y) / var(X),
    a^* = E[Y] − b^* E[X].
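As a quick numerical sanity check of these formulas, here is a sketch assuming Python with NumPy and a simulated joint distribution (the true intercept 3.0 and slope 0.8 are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate (X, Y) with Y = 3.0 + 0.8 X + noise.
n = 200_000
X = rng.normal(loc=2.0, scale=1.5, size=n)
Y = 3.0 + 0.8 * X + rng.normal(scale=1.0, size=n)

# Empirical counterparts of b* = cov(X, Y) / var(X) and a* = E[Y] - b* E[X].
b_star = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
a_star = Y.mean() - b_star * X.mean()
print(a_star, b_star)   # close to 3.0 and 0.8
```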
Noise

Clearly the points are not exactly on the line x ↦ a^* + b^* x if var(Y | X = x) > 0. The random variable ε = Y − (a^* + b^* X) is called noise and satisfies

    Y = a^* + b^* X + ε,

with
- E[ε] = 0, and
- cov(X, ε) = 0.
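Still in the simulated setting above (a sketch, not part of the slides), the two stated properties of the noise can be checked empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(loc=2.0, scale=1.5, size=n)
Y = 3.0 + 0.8 * X + rng.normal(scale=1.0, size=n)

# Coefficients of the theoretical regression line, estimated from the sample.
b_star = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
a_star = Y.mean() - b_star * X.mean()

# Noise relative to that line: eps = Y - (a* + b* X).
eps = Y - (a_star + b_star * X)
print(eps.mean())                     # ~ 0, i.e. E[eps] = 0
print(np.cov(X, eps, ddof=0)[0, 1])   # ~ 0, i.e. cov(X, eps) = 0
```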
Statistical problem

In practice, a^* and b^* need to be estimated from data.
- Assume that we observe n i.i.d. random pairs (X_1, Y_1), ..., (X_n, Y_n) with the same distribution as (X, Y):

    Y_i = a^* + b^* X_i + ε_i

- We want to estimate a^* and b^*.
Statistical problem

    Y_i = a^* + b^* X_i + ε_i

[Figures]
Least squares

Definition
The least squared error (LSE) estimator of (a, b) is the minimizer of the sum of squared errors:

    ∑_{i=1}^n (Y_i − a − b X_i)².

(â, b̂) is given by

    b̂ = ( (1/n) ∑_i X_i Y_i − X̄ Ȳ ) / ( (1/n) ∑_i X_i² − X̄² ),
    â = Ȳ − b̂ X̄,

where X̄ and Ȳ denote the sample means of the X_i and Y_i.
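A minimal sketch of these plug-in formulas, assuming Python with NumPy and simulated data; the cross-check with np.polyfit is an addition, not something the slides mention:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated sample from Y = a* + b* X + eps with a* = 3.0, b* = 0.8.
n = 500
X = rng.uniform(0.0, 10.0, size=n)
Y = 3.0 + 0.8 * X + rng.normal(scale=1.0, size=n)

# Least squares estimates from the formulas above (sample means for the bars).
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean() ** 2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)

# Cross-check: numpy's degree-1 polynomial fit returns [slope, intercept].
print(np.polyfit(X, Y, deg=1))
```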
Residuals

[Figure]
Multivariate regression

    Y_i = X_iᵀ β^* + ε_i,   i = 1, ..., n.

- Vector of explanatory variables or covariates: X_i ∈ R^p (w.l.o.g., assume its first coordinate is 1).
- Response / dependent variable: Y_i.
- β^* = (a^*, b^*ᵀ)ᵀ; β^*_1 (= a^*) is called the intercept.
- {ε_i}_{i=1,...,n}: noise terms satisfying cov(X_i, ε_i) = 0.

Definition
The least squares estimator (LSE) of β^* is the minimizer of the sum of squared errors:

    β̂ = argmin_{β ∈ R^p} ∑_{i=1}^n (Y_i − X_iᵀ β)².
LSE in matrix form

- Let Y = (Y_1, ..., Y_n)ᵀ ∈ R^n.
- Let X be the n × p matrix whose rows are X_1ᵀ, ..., X_nᵀ (X is called the design matrix).
- Let ε = (ε_1, ..., ε_n)ᵀ ∈ R^n (unobserved noise).
- Y = Xβ^* + ε, with β^* unknown.
- The LSE β̂ satisfies:

    β̂ = argmin_{β ∈ R^p} ‖Y − Xβ‖₂².
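As an illustration of the matrix formulation, a sketch assuming Python with NumPy and made-up dimensions (n = 200, p = 3); the design matrix gets a leading column of ones for the intercept and the minimizer of ‖Y − Xβ‖₂² is computed numerically:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated model Y = X beta* + eps with p = 3 (intercept plus two covariates).
n, p = 200, 3
beta_star = np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix, first column = 1
Y = X @ beta_star + rng.normal(scale=0.5, size=n)

# LSE: the beta minimizing ||Y - X beta||_2^2 (least squares solver).
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)   # close to beta_star
```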
Closed form solution

- Assume that rank(X) = p.
- Analytic computation of the LSE:

    β̂ = (XᵀX)⁻¹ Xᵀ Y.

- Geometric interpretation of the LSE: Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:

    Xβ̂ = P Y,   where P = X(XᵀX)⁻¹Xᵀ.
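The closed form and its geometric interpretation can be verified directly in the same sketch setting (NumPy, simulated data, rank(X) = p assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Closed form beta_hat = (X^T X)^{-1} X^T Y (X has full column rank here).
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# X beta_hat equals the orthogonal projection P Y of Y onto the column space of X.
P = X @ XtX_inv @ X.T
print(np.allclose(X @ beta_hat, P @ Y))   # True
```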
Statistical inference

To make inference (confidence regions, tests) we need more assumptions.

Assumptions:
- The design matrix X is deterministic and rank(X) = p.
- The model is homoscedastic: ε_1, ..., ε_n are i.i.d.
- The noise vector ε is Gaussian:

    ε ∼ N_n(0, σ² I_n),

for some known or unknown σ² > 0.
Properties of LSE

- LSE = MLE
- Distribution of β̂:  β̂ ∼ N_p(β^*, σ² (XᵀX)⁻¹).
- Quadratic risk of β̂:  E[‖β̂ − β^*‖₂²] = σ² tr((XᵀX)⁻¹).
- Prediction error:  E[‖Y − Xβ̂‖₂²] = σ² (n − p).
- Unbiased estimator of σ²:  σ̂² = ‖Y − Xβ̂‖₂² / (n − p).

Theorem
- (n − p) σ̂² / σ² ∼ χ²_{n−p}.
- β̂ ⊥⊥ σ̂².
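A sketch of the unbiased variance estimator in the same simulated Gaussian setting (NumPy; the true σ = 0.5 is made up so the estimate can be compared to σ² = 0.25):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 200, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta_hat

# Unbiased estimator of sigma^2: squared residual norm divided by (n - p).
sigma2_hat = residuals @ residuals / (n - p)
print(sigma2_hat)   # should be close to sigma**2 = 0.25
```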
Significance tests

- Test whether the j-th explanatory variable is significant in the linear regression (1 ≤ j ≤ p).
- H_0: β_j = 0  vs.  H_1: β_j ≠ 0.
- If γ_j is the j-th diagonal coefficient of (XᵀX)⁻¹ (γ_j > 0):

    (β̂_j − β_j) / √(σ̂² γ_j) ∼ t_{n−p}.

- Let T_n^{(j)} = β̂_j / √(σ̂² γ_j).
- Test with non-asymptotic level α ∈ (0, 1):

    R_{j,α} = { |T_n^{(j)}| > q_{α/2}(t_{n−p}) },

  where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.
- We can also compute p-values.
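A sketch of the coefficient test, assuming NumPy and SciPy's Student-t distribution; in the simulated model the third coefficient is truly zero, so H_0 holds for that coordinate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=n)  # last coefficient = 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

alpha = 0.05
for j in range(p):
    gamma_j = XtX_inv[j, j]                               # j-th diagonal coefficient
    T = beta_hat[j] / np.sqrt(sigma2_hat * gamma_j)       # T_n^{(j)}
    p_value = 2 * stats.t.sf(abs(T), df=n - p)            # two-sided p-value
    reject = abs(T) > stats.t.ppf(1 - alpha / 2, df=n - p)
    print(f"j={j}: T={T:7.2f}  p-value={p_value:.3f}  reject H0: {reject}")
```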
Bonferroni's test

- Test whether a group of explanatory variables is significant in the linear regression.
- H_0: β_j = 0 ∀ j ∈ S  vs.  H_1: ∃ j ∈ S, β_j ≠ 0, where S ⊆ {1, ..., p}.
- Bonferroni's test: R_{B,α} = ∪_{j∈S} R_{j, α/k}, where k = |S|.
- This test has non-asymptotic level at most α.
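A sketch of the Bonferroni correction for a group S (same simulated setup with NumPy and SciPy; the choice S = {1, 2} is only illustrative and uses 0-based indices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = (Y - X @ beta_hat) @ (Y - X @ beta_hat) / (n - p)

S = [1, 2]                         # group of coefficients under test (0-based)
alpha, k = 0.05, len(S)
threshold = stats.t.ppf(1 - alpha / (2 * k), df=n - p)   # each single test at level alpha / k
reject_group = any(
    abs(beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])) > threshold for j in S
)
print("reject H0 for the group S:", reject_group)
```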
Remarks

- Linear regression exhibits correlations, NOT causality.
- Normality of the noise: one can use goodness-of-fit tests to check whether the residuals ε̂_i = Y_i − X_iᵀ β̂ are Gaussian.
- Deterministic design: if X is not deterministic, all of the above can be understood conditionally on X, provided the noise is assumed to be Gaussian conditionally on X.
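For the normality remark, here is one possible goodness-of-fit check on the residuals, a sketch assuming SciPy; the Shapiro-Wilk test is just one choice and is not prescribed by the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta_hat            # eps_hat_i = Y_i - X_i^T beta_hat

# Shapiro-Wilk test of normality on the residuals (large p-value: no evidence against).
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")
```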
