
STAT 135: Linear Regression

Joan Bruna

Department of Statistics
UC Berkeley

May 1, 2015

Introduction: Example 1

We measure the brain weight W (grams) and head size S (cubic cm) for 237 adults.
(Source: R. J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105–123.)

A reasonable model is

W ≈ β0 + β1 S .

Is the data consistent with our model? I.e., is our model correct?
How can we estimate the parameters of our model, and with what precision?


Introduction: Example 2

Say we measure the flight time T (in seconds) of an egg of weight W grams thrown from H meters.
What is a reasonable model for T ≈ F(W, H)?

F(W, H) ≈ β √H , with β = 1/√(9.8/2) .

How can we combine/test different features of our measurements (e.g. H, √H, H², etc.)?

Introduction: Example 3

World population P as a function of time Y:

A linear model of the form

P = β0 + β1 Y

does not look very good. Alternative?

A more reasonable model is that of exponential (constant) growth:

P(Y) = γ0 e^{γ1 (Y − Y0)} ,

where γ0 is the population at year Y0.

How to estimate the growth rate from the data? Simple idea: transform the data to reveal the linear dependency:

log P = log γ0 + γ1 Y − γ1 Y0 = γ̃0 + β1 Y ,

with γ̃0 = log γ0 − γ1 Y0.

The current growth rate is estimated at 1.1%.
More sophisticated models have variable growth.
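
To make the transformation concrete, here is a minimal numerical sketch in Python (all numbers are synthetic, for illustration only): fitting a line to log P recovers the growth rate γ1.

import numpy as np

# Hypothetical yearly data: population following exponential growth
# with a little multiplicative noise (illustration only).
rng = np.random.default_rng(0)
years = np.arange(1960, 2016)
pop = 3.0 * np.exp(0.014 * (years - 1960)) * np.exp(0.01 * rng.normal(size=years.size))

# Transform the data to reveal the linear dependency: log P = gamma0~ + gamma1 * Y.
X = np.column_stack([np.ones(years.size), years])
coef, *_ = np.linalg.lstsq(X, np.log(pop), rcond=None)
print("estimated growth rate:", coef[1])   # close to 0.014 here
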


Introduction: Example 4

Often we need multiple factors to explain a given set of measurements.

Average salary y from X = {SAT, GPA, University, Major, Zip}.
In digital images, pixel intensity y = I(u, v) from the neighboring intensities
X = {I(u′, v′) : u′ ≠ u, v′ ≠ v, |u − u′|, |v − v′| ≤ k}.


Introduction

In these examples,

y ≈ f(x1, . . . , x_{p−1}, β0, . . . , β_{p−1}) = β0 + Σ_{k=1}^{p−1} βk xk .

We will attempt to fit a function f with p parameters on n observations.
Many models can be cast as linear via appropriate data transformations (see previous examples).
Different statistical regimes arise as a function of p and the number of observations n:

When p/n → 0, "easy" statistics.
When p/n → C > 0, much harder (but more interesting!).


Regression vs ANOVA

In the previous chapter, we were interested in the question

"Is factor A (or B) influencing measurement Y?"

Now we are much more ambitious:

"How are factors Xj influencing measurement Y?"


Linear Regression 101

The regression problem has three ingredients:

Observations (yi, xk,i), i = 1 … n, k = 1 … p.
A parametrized fitting model F(x, β), with β ∈ R^p.
A criterion to assess how good the approximation yi ≈ F(xi, β) is (also known as the loss/cost function).

We will restrict ourselves to linear models: F(x, β) is linear in x and β.
How to choose the cost function?


Least Squares

Given a cost function ℓ(x, y) satisfying ℓ(x, y) ≥ 0 and ℓ(x, x) = 0, we are interested in

min_{β ∈ R^p} E(β) = Σ_{i=1}^n ℓ(yi, F(xi, β)) .

Different choices of ℓ yield different statistical properties.

There exists one cost that greatly simplifies things: ℓ(x, y) = |x − y|².
In that case, the system of equations

∂E(β)/∂βk = 0 , k = 1 … p

is linear.


Linear Regression 101

Let us review the simple affine case (p = 2):

E(β0, β1) = Σ_{i=1}^n (yi − β0 − β1 xi)² .

Then

∂E/∂β0 = −2 Σ_{i=1}^n (yi − β0 − β1 xi) ,
∂E/∂β1 = −2 Σ_{i=1}^n xi (yi − β0 − β1 xi) .


Linear Regression 101

Setting the derivatives to zero and solving the 2 × 2 system, we obtain

β̂0 = [ (Σi xi²)(Σi yi) − (Σi xi)(Σi xi yi) ] / [ n Σi xi² − (Σi xi)² ] ,
β̂1 = [ n (Σi xi yi) − (Σi xi)(Σi yi) ] / [ n Σi xi² − (Σi xi)² ] .

Example of brain weight: we obtain

W ≈ 325 + 0.26 S .

We will see how to build confidence intervals for these parameters.
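
As a sanity check, the closed-form estimates are easy to evaluate numerically. A minimal sketch with synthetic data (the original brain-weight measurements are not reproduced here):

import numpy as np

# Synthetic data for illustration; the true line is y = 325 + 0.26 x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(2500, 4500, size=237)
y = 325 + 0.26 * x + rng.normal(0, 50, size=237)

n = x.size
den = n * np.sum(x**2) - np.sum(x)**2
b0 = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / den
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / den
print(b0, b1)   # should come out close to 325 and 0.26
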


Simple Linear Regression

We introduce a statistical model for our observations:

yi = β0 + β1 xi + εi , i = 1 … n ,

with εi ∼ N(0, σ²) iid.

How well can we recover the parameters β0 and β1 of the model from data?


Bias and Variance of Simple Linear Regression

Theorem
Under the previous assumptions, we have

E(β̂j) = βj , (j = 0, 1)

and

var(β̂0) = σ² Σi xi² / [ n Σi xi² − (Σi xi)² ] , var(β̂1) = n σ² / [ n Σi xi² − (Σi xi)² ] .

Unbiasedness only requires E(ε) = 0.
Variance decreases with n, and depends upon the signal (x) to noise (σ²) ratio.
Important: this theorem assumes iid errors.


Regression and Correlation

Recall the previous regression coefficients in the affine case:

β̂0 = [ (Σi xi²)(Σi yi) − (Σi xi)(Σi xi yi) ] / [ n Σi xi² − (Σi xi)² ] ,
β̂1 = [ n (Σi xi yi) − (Σi xi)(Σi yi) ] / [ n Σi xi² − (Σi xi)² ] .

We can re-write these solutions as

β̂0 = ȳ − β̂1 x̄ , β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² .

If we denote

sx = (1/n) Σi (xi − x̄)² , sy = (1/n) Σi (yi − ȳ)² , sxy = (1/n) Σi (xi − x̄)(yi − ȳ) ,

the correlation coefficient between x and y is

ρ = sxy / √(sx sy) .

It results that ρ = β̂1 √(sx/sy), or

β̂1 = ρ √(sy/sx) .

In particular, the slope is 0 if and only if x and y are uncorrelated.


Regression and Correlation

Let us rewrite the regression equation:

ŷ = β̂0 + β̂1 x

ŷ = ( ȳ − ρ √(sy/sx) x̄ ) + ρ √(sy/sx) x

(ŷ − ȳ)/√sy = ρ (x − x̄)/√sx .

Very Important Fact: |ρ| ≤ 1.


Galton Experiment (1885)

Study of the heights of fathers and their sons. He found that

the children of taller than average parents tend to be shorter than their parents,
the children of shorter than average parents tend to be taller than their parents.

Why?

Linear Regression in Matrix Form

We denoted the data as

yi, xi,1, xi,2, . . . , xi,p , i = 1 … n .

Our regression model was

ŷi = Σ_{k≤p} βk xi,k , i = 1 … n .

Let us write

β = (β1, . . . , βp) ∈ R^p , ŷ = (ŷ1, . . . , ŷn) ∈ R^n , (X)i,k = xi,k , X ∈ R^{n×p} .

Then

ŷ = X β .


Linear Regression in Matrix Form

Our loss function was

E(β) = (1/2) Σ_{i=1}^n (yi − Σk βk xi,k)² .

This is the squared Euclidean norm of the vector y − Xβ, therefore

E(β) = (1/2) ‖y − Xβ‖² .


Computing derivatives in R^n

If

E(x) : R^n → R ,

the derivative or gradient of E with respect to x is written

∇x E ∈ R^n , with (∇x E)i = ∂E/∂xi .

Example: Say

E(x) = (1/2) ‖x‖² = (1/2) Σi xi² .

Then

∇x E = x .


Vector Chain Rule

Say you have a function

E = F(y) , y = G(x) ,

with

E ∈ R , y ∈ R^m , x ∈ R^n .

Question: How to compute the gradient of E with respect to x?
A: The chain rule is defined as usual:

∇x E = (∂G/∂x)^T ∇y F , with ∇y F evaluated at y = G(x) .


Back to Regression

Recall that our objective function is

E(β) = (1/2) ‖y − Xβ‖² .

Let's compute the least squares solution painlessly:

E(β) = (1/2) ‖z‖² , with z = y − Xβ .

Now we apply the chain rule:

∇z E = z and ∂β z = −X .

Therefore

∇β E = −X^T z = −X^T (y − Xβ) .


The Normal Equations

It results that

∇β E = 0 ⇔ X^T X β̂ = X^T y ⇔ β̂ = (X^T X)^{-1} X^T y .

The matrix X† = (X^T X)^{-1} X^T is called the pseudoinverse of X.

Lemma
The normal equations have a unique solution if and only if rank(X) = p.
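
In code, the normal equations are one line. A minimal numpy sketch (synthetic data), which also checks the answer against the built-in least squares solver; in practice one solves the linear system rather than forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem more stably.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
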


Revisit Affine Case p = 2

When fitting a straight line, we had

X = ( 1  x1 )
    ( 1  x2 )
    ( …  …  )
    ( 1  xn )

Thus

X^T X = (   n       Σi xi  ) ,  X^T y = (  Σi yi   ) , and
        ( Σi xi    Σi xi²  )            ( Σi xi yi )

(X^T X)^{-1} = 1/[ n Σi xi² − (Σi xi)² ] · (  Σi xi²   −Σi xi )
                                           ( −Σi xi       n   ) .


Multivariate Random Variables

In many situations, it will be easier to study random variables jointly; in particular, in linear regression.

We define a random vector Y as Y = (Y1, . . . , Yn), where each Yi is a random variable, with

E(Yi) = μi , cov(Yi, Yj) = σi,j .

More compactly, we write

E(Y) = μ = (μ1, . . . , μn) ∈ R^n , Σ_{Y,Y} ∈ R^{n×n} , (Σ_{Y,Y})i,j = σi,j .

μ is the mean of Y and Σ is the covariance matrix of Y.

Q: What happens with E(Y) and Σ_Y under linear transformations?


Linearity of Mean and Covariance

Lemma
Let Y be a random vector with mean μ and covariance Σ. Then, if Z = b + AY is a linear transformation, we have

E(Z) = b + Aμ , and Σ_{Z,Z} = A Σ A^T .
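
A quick Monte Carlo check of the lemma (all quantities synthetic, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(2)
n, m, N = 3, 2, 200_000

mu = np.array([1.0, 2.0, 3.0])
L = rng.normal(size=(n, n))
Sigma = L @ L.T                     # a valid covariance matrix

A = rng.normal(size=(m, n))
b = np.array([0.5, -1.0])

Y = rng.multivariate_normal(mu, Sigma, size=N)   # samples in rows
Z = b + Y @ A.T

print(Z.mean(axis=0), b + A @ mu)   # empirical vs. theoretical mean
print(np.cov(Z.T))                  # empirical covariance ...
print(A @ Sigma @ A.T)              # ... vs. A Sigma A^T
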


Example

Suppose we have Y1, . . . , Yn modeling the temperature (in Celsius) at every second, with Yi = μi + εi and E(ε) = 0, Σ_ε = σ² Id.

Say we want the average temperature Z every minute, and in Fahrenheit!

First, every coordinate is transformed according to

Fahrenheit = 50 + (18/10)(Celsius − 10) .

Then, the minute averages can be computed with a matrix of the form

A = (1/60) ( 1 … 1  0 … 0  …         )
           ( 0 … 0  1 … 1  …         )
           ( …  …   …  …   …         )
           ( 0 … 0  …      1  …  1   ) .


Example

Thus

Z = b + AY ,

and

E(Z) = b + A E(Y) = b + A(μ + E(ε)) = b + Aμ .

Also,

Σ_Z = σ² A A^T .


Random Quadratic Forms

Another operation we have encountered often is sums of squares of random variables.
These are examples of what are called quadratic forms:

x^T A x = Σ_{i,i′} a_{i,i′} xi xi′ .

Lemma
Under the same assumptions as the previous theorem, let X = Y^T A Y ∈ R. Then

E(X) = Tr(AΣ) + μ^T A μ .


Important Example

We consider

s = Σi (Xi − X̄)² ,

where the Xi are uncorrelated with common mean μ.

Can we see s as the squared norm of a vector of the form AX?

Observe that X̄ = (1/n) 1^T X, with 1 = (1, . . . , 1). Thus

(X̄, . . . , X̄) = (1/n) 1 1^T X , and

A = Id − (1/n) 1 1^T .

We then have

Σi (Xi − X̄)² = ‖AX‖² = X^T A^T A X .


Example (continued)

The matrix A is called a projection matrix. It satisfies

A^T = A ,
A² = A .

Thus

X^T A^T A X = X^T A X

and

E(X^T A X) = σ² Tr(A) + μ^T A μ .

Since μ is a constant vector, Aμ = 0 and

E(s) = σ² Tr(A) = σ² (n − 1) .
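
A small numerical check of the claims above (synthetic data):

import numpy as np

n = 10
A = np.eye(n) - np.ones((n, n)) / n

# A is a symmetric, idempotent projection with trace n - 1.
assert np.allclose(A, A.T) and np.allclose(A @ A, A)
assert np.isclose(np.trace(A), n - 1)

# Monte Carlo check that E[s] = sigma^2 (n - 1) for uncorrelated Xi with common mean.
rng = np.random.default_rng(3)
sigma = 2.0
X = 5.0 + sigma * rng.normal(size=(100_000, n))
s = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
print(s.mean(), sigma**2 * (n - 1))   # the two numbers should be close
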


Cross-Covariance of Random Vectors

Given random vectors Y ∈ R^n and Z ∈ R^m, the cross-covariance Σ_{Y,Z} ∈ R^{n×m} is the matrix

(Σ_{Y,Z})i,j = cov(Yi, Zj) .

Lemma
Let X ∈ R^n be a random vector with covariance Σ. If Y = AX ∈ R^p and Z = BX ∈ R^m are linear transformations of X, where A and B are constant matrices, then

Σ_{Y,Z} = A Σ B^T .


Mean and Covariance of Least Squares Estimates

Recall our statistical model:

Y = Xβ + ε ,

with
X ∈ R^{n×p} known and fixed,
β ∈ R^p unknown parameters,
ε ∈ R^n with E(ε) = 0 and Σ_ε = σ² Id.

Estimation of β: Given observations y = (y1, . . . , yn), the least squares estimate is

β̂ = X† y = (X^T X)^{-1} X^T y .


Mean and Covariance of Least Squares Estimates

Theorem
If E(ε) = 0, then

E(β̂) = β .

Theorem
If E(ε) = 0 and Σ_ε = σ² Id, then

Σ_β̂ = σ² (X^T X)^{-1} .


Example: affine case

Recall that

(X^T X)^{-1} = 1/[ n Σi xi² − (Σi xi)² ] · (  Σi xi²   −Σi xi )
                                           ( −Σi xi       n   ) .

It results that

Σ_β̂ = (    var(β̂0)       cov(β̂0, β̂1) )
      ( cov(β̂0, β̂1)        var(β̂1)   )

    = 1/[ n Σi xi² − (Σi xi)² ] · (  σ² Σi xi²   −σ² Σi xi )
                                  ( −σ² Σi xi       σ² n   ) .


Linear Regression and Maximum-Likelihood

Theorem
Under the iid Gaussian statistical model Y ∼ N(Xβ, σ² Id),

β̂ = X† y

is the maximum likelihood estimator for β.


Estimation of σ²

In most situations, we do not know σ² in advance.

The MLE of σ² under the Gaussian model is

σ̂²_MLE = (1/n) ‖Y − Xβ̂‖² .

However, this is biased and only applies to Gaussian errors.
How to generalize/improve?


Estimation of σ²

Let us introduce the vector of residuals.

Definition
The residuals of the linear regression are

ê = Y − Ŷ = Y − Xβ̂ .

Theorem
Under the assumption that the errors are uncorrelated with constant variance σ²,

s² = ‖ê‖² / (n − p)

is an unbiased estimator of σ²: E(s²) = σ².
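
A quick simulation check that dividing by n − p (rather than n) makes s² unbiased (synthetic data):

import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 200, 4, 1.5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

# Average s^2 over many replications to check E[s^2] = sigma^2.
s2_values = []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2_values.append(resid @ resid / (n - p))

print(np.mean(s2_values), sigma**2)   # the two numbers should be close
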


Estimation of σ²

We first write the residuals as

Y − Ŷ = Y − Xβ̂ = (Id − X(X^T X)^{-1} X^T) Y .

The matrix P_{X⊥} = Id − X(X^T X)^{-1} X^T is a projection matrix onto the orthogonal complement of the column space of X.
In particular, we have P_{X⊥}^T = P_{X⊥} and P_{X⊥}² = P_{X⊥}.

It results that

‖ê‖² = Σi (Yi − Ŷi)² = ‖Y − Ŷ‖² = Y^T P_{X⊥} Y , so

E(‖ê‖²) = E(Y)^T P_{X⊥} E(Y) + σ² tr(P_{X⊥}) .

We have

P_{X⊥} E(Y) = P_{X⊥} Xβ = 0 .
tr(P_{X⊥}) = tr(Id) − tr(X(X^T X)^{-1} X^T) = n − tr((X^T X)^{-1} X^T X) = n − p .

So E(‖ê‖²) = σ² (n − p).


Assessing Fit with Residuals

How can we determine whether a regression model is "good"?

Using that ê = P_{X⊥} Y, we can compute how the residuals are correlated:

Σ_{ê,ê} = P_{X⊥} Σ_{Y,Y} P_{X⊥}^T = σ² P_{X⊥} .

The residuals are correlated and their variances are not uniform.

We can standardize the residuals with

(Yi − Ŷi) / ( s √(P_{X⊥}(i, i)) ) = (Yi − Ŷi) / ( s √(1 − P_X(i, i)) ) ,

where P_X = X(X^T X)^{-1} X^T.
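
A minimal sketch of this standardization in numpy (synthetic data); the diagonal of the hat matrix P_X is often called the leverage:

import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix P_X
resid = y - H @ y                        # residuals = P_{X perp} y
s = np.sqrt(resid @ resid / (n - p))

# Standardized residuals: divide by s * sqrt(1 - leverage).
leverage = np.diag(H)
std_resid = resid / (s * np.sqrt(1 - leverage))
print(std_resid[:5])
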


Assessing Fit with Residuals

We also have:

Lemma
If the errors have covariance matrix σ² Id, then the residuals ê are uncorrelated with the fitted values Ŷ.

Remarks:
we can check visually that the residuals and fitted values are not linearly related.
we can also check visually the assumption of constant variance.


Example: Brain Weight vs Head Size

Recall the data: brain weight vs head size. We perform affine regression to obtain

Ŵ = 325.6 + 0.26 S .

We plot the error residuals vs the predicted values:

No apparent remaining correlation.
Variance seems uniform across the predictor.
So the simple regression model appears valid for this dataset.

Inference about regression coefficients

We saw that if Y ∼ N(Xβ, σ² Id), then

β̂ = X† Y ∼ N(β, σ² (X^T X)^{-1}) .

Q: How to test hypotheses about β and construct confidence intervals?

Denote by C = (X^T X)^{-1} the inverse empirical covariance of X.

Under the normality assumption, we have that

for all k , (β̂k − βk) / (s √(Ck,k)) ∼ t_{n−p} .

If n is large enough, the CLT gives approximate normality for β̂ even if the errors are not Gaussian.
A 100(1 − α)% confidence interval for βi is β̂i ± s √(Ci,i) t_{n−p}(1 − α/2).
A test for the null hypothesis H0 : βi = β0 is performed using t = (β̂i − β0) / (s √(Ci,i)), whose null distribution is t_{n−p}.
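
A minimal sketch of these confidence intervals in Python, assuming scipy is available for the t quantiles (synthetic data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
C = np.linalg.inv(X.T @ X)

alpha = 0.05
tq = stats.t.ppf(1 - alpha / 2, df=n - p)
se = np.sqrt(s2 * np.diag(C))       # standard error s * sqrt(C_ii)
for k in range(p):
    print(f"beta_{k}: {beta_hat[k]:.3f} +/- {tq * se[k]:.3f}")
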


Important Remark: Signal vs Noise

The covariance of our estimators β̂ is

Σ_β̂ = σ² (X^T X)^{-1} .

σ² is the amount of noise; so the stronger the noise, the worse.
(X^T X)^{-1} is the inverse covariance of the signal X. So the stronger the signal, the better.


Example: Egg flight time

We can nevertheless attempt a linear regression...

We plot the error residuals vs the predicted values:

There is a (nonlinear) dependency between residuals and predicted values.
The regression model seems poorly adapted in this case.

Important Example

Q: When there is a relationship of the form Y = aX^b, but we do not know b, how to estimate it?
A: log-log data transformation: log Y = log a + b log X.

We estimate b̂ = 0.48 ≈ 1/2.
However, we observe that the variance of the error residuals is not constant.
What can we do to improve the fit?
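
A minimal sketch of the log-log fit (synthetic power-law data, with parameters chosen only for illustration):

import numpy as np

# Synthetic power-law data: Y = a * X^b with multiplicative noise.
rng = np.random.default_rng(7)
a, b = 0.45, 0.5
X = rng.uniform(1.0, 30.0, size=100)
Y = a * X**b * np.exp(0.05 * rng.normal(size=100))

# The log-log transformation turns the power law into a line: log Y = log a + b log X.
M = np.column_stack([np.ones_like(X), np.log(X)])
coef, *_ = np.linalg.lstsq(M, np.log(Y), rcond=None)
print("log a =", coef[0], "b =", coef[1])   # b should come out near 0.5
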
Transforming Data to Improve the Fit

Now that we have figured out that Y ∝ X^0.48 ≈ √X, let's try to figure out the gravitational constant.
First Idea: Regress Y² against X:

Ŷ² = β0 + β1 X , with β0 ∈ (−0.3 ± 0.55) , β1 ∈ (0.21 ± 0.01) .

Remember that Y = α √X, with α = √(2/9.8) = 0.452. We thus obtain a 100(1 − α)% CI for α via √β1 : (0.45, 0.4728). Can we do better?

Transforming Data to Improve the Fit

The previous transformation amplifies the noise for large X.
We can rather try to regress Y against √X:

Ŷ = β0 + β1 √X , with β0 ∈ (−0.03 ± 0.13) , β1 ∈ (0.459 ± 0.018) .

Now the variance of the residuals is more uniform across samples.
The corresponding CI for α now is (0.44, 0.47). Better than before?
Whereas the noise now behaves better, the signal does not.

Role of Outliers

GDP per capita and Internet usage (source: worldbank.org). Looks OK?

Outliers are extreme values that greatly influence the rest of the fitted model parameters. The quadratic loss is very sensitive to such extreme values:

Outliers in X (Monaco). How would β̂ change if we removed an "extreme" observation (x*, y*)?

(X^T X)_{k,l} = Σ_{i=1}^n xi,k xi,l = x*_k x*_l + Σ_{xi ≠ x*} xi,k xi,l ,
(X^T Y)_k = Σi xi,k yi = x*_k y* + Σ_{xi ≠ x*} xi,k yi .

Outliers are over-emphasized via the least squares criterion.


Role of Outliers

Model misfits. Suppose that one observation (e.g. Iceland) does not follow the specified model. Q: How much is it going to degrade the overall fit?
Suppose y* = x* β + ε + γ with |γ| ≫ |ε|. Then

β̂ = X† Y = X† Ỹ + γ X†_* ,

where Ỹ is the response without the misfit and X†_* is the column of X† corresponding to the observation *.

The influence of a model misfit depends upon where it happens.


Regression coefficients and Conditioning

We have seen that if the noise is uncorrelated, then

Σ_β̂ = σ² (X^T X)^{-1} .

Q: What happens when the features xk are correlated with each other?
A: The total uncertainty can be measured with the trace

σ² Σi Ci,i , C = (X^T X)^{-1} .

If the features xk are very correlated with each other, the matrix C is ill-conditioned, meaning that

Σi Ci,i

is large relative to the norm of X.

Interpretation of the regression coefficients is not always a good idea.
This does not mean regression is unreliable!


Prediction

So far, we have concentrated on modeling our observations, i.e., explaining the observed variability in Y through linear combinations of the X's.
Suppose now we have estimated a regression model

Ŷ = X β̂ ,

and we observe a new value of the features X = x* for a new sample.

Q: How to predict the outcome Y*?

A reasonable estimate for Y* is

Ŷ* = x* β̂ .

Its variance is

var(Ŷ*) = x* Σ_β̂ x*^T = σ² x* (X^T X)^{-1} x*^T ,

and we can estimate it by replacing σ² with s².


Prediction

It is instructive to look at the variance of the predicted outcome as a function of the number of features p.
The average variance across the data-points (x1, . . . , xn) is

(1/n) Σi var(Ŷ(xi)) = (σ²/n) Σ_{i=1}^n xi (X^T X)^{-1} xi^T
                    = (σ²/n) Tr(X (X^T X)^{-1} X^T) = (σ²/n) Tr(Id_{p×p})
                    = σ² p/n .

So the variance increases with the number of covariates of the model.
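
The trace identity above is easy to verify numerically; a minimal sketch:

import numpy as np

rng = np.random.default_rng(8)
n, p, sigma2 = 60, 5, 1.0
X = rng.normal(size=(n, p))

C = np.linalg.inv(X.T @ X)
avg_var = sigma2 * np.mean([x @ C @ x for x in X])   # average x_i (X^T X)^{-1} x_i^T

print(avg_var, sigma2 * p / n)   # equal up to floating point error
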


Prediction Confidence Intervals

Q: Can we construct a confidence interval for Y*?

Recall that the model is

Y = Xβ + ε = θ + ε .

We can construct a CI for θ with

Ŷ* ± s √( x* (X^T X)^{-1} x*^T ) t_{n−p}(1 − α/2) .

A CI for Y* needs to account for the extra uncertainty from ε ∼ N(0, σ²):

An approximate 100(1 − α)% prediction interval for Y* is

x* β̂ ± s √( x* (X^T X)^{-1} x*^T + 1 ) t_{n−p}(1 − α/2) .
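
A minimal sketch of the prediction interval (synthetic data; scipy used for the t quantile):

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
y = X @ np.array([2.0, 0.7]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
s = np.sqrt((y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p))
C = np.linalg.inv(X.T @ X)

x_star = np.array([1.0, 5.0])        # new sample, in the same feature encoding
tq = stats.t.ppf(0.975, df=n - p)

y_hat = x_star @ beta_hat
half = tq * s * np.sqrt(x_star @ C @ x_star + 1)   # "+1" accounts for the new noise
print(f"95% prediction interval: {y_hat:.2f} +/- {half:.2f}")
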


Example

Let us try to model CO2 emissions in France over the last 50 years from some economic indicators, such as GDP growth, GDP per capita and exports of goods and services (source: worldbank.org).

We attempt to predict CO2 at year t from all the indicators at year t − 1 and the economic indicators at year t:

CO2(t) = β1 CO2(t − 1) + β2 GDP(t − 1) + · · · + β7 Exports(t) .

We form the feature matrix X of size 50 × 7 and the response vector Y of size 50 × 1. Then

β̂ = (X^T X)^{-1} X^T Y , and

ĈO2(t + 1) = β̂1 CO2(t) + β̂2 GDP(t) + · · · + β̂7 Exports(t + 1) .

Also

s² = ‖CO2 − ĈO2‖² / (50 − 7) .

We construct the confidence interval of the predicted CO2 as

x* β̂ ± s √( x* (X^T X)^{-1} x*^T + 1 ) t_{n−p}(1 − α/2) .

According to our model, this year's CO2 emissions for France will be

5.57 ± 0.78 (metric tons per capita) .

Model Selection

Q: Given a dataset with many covariates, how many should we use to predict/model a given response with small risk?


Image Regression Example

Let us try to predict a pixel value from its neighbors:

Model images as locally smooth functions:

Yi = Σ_{j∈N(i,δ)} βj xj,i + εi ,

where N(i, δ) is the neighborhood of size δ centered at i.

We estimate the coefficients β using least squares on a given image x_tr and then we test it on a different image x_te:

Figure: Left: Estimated Ŷ using δ = 5. Right: Estimated regression coefficients β̂.

We evaluate the model using the prediction error:

R(δ) = Σi (Yi − Ŷi)² = Σi (Yi − Σ_{j∈N(i,δ)} β̂j xj,i)² .


Example

We see what happens as we vary δ:

So, as we make the model bigger (increase δ), the training error always decreases.
But the prediction error does NOT. Why?

Overfitting

Suppose a model

Y = Xβ + ε , β ∈ R^p , X ∈ R^{1×p} ,

with E(ε) = 0, Σ_ε = σ² Id, and observations (xi, yi), i = 1 … n.

Say we pick a subset S ⊂ {1 … p} of features:

Y = X_S β^S + X_{S^c} β^{S^c} + ε ,

and we perform linear regression only using features from S:

Ŷ_S(x) = x_S β̂_S , with β̂_S = X_S† Y .

Q: What is the bias and the variance of Ŷ_S(x)?


Bias-Variance Trade-off

The bias of Ŷ_S is

E(Ŷ_S(x)) − xβ = x_S E(β̂_S) − xβ
              = x_S X_S† (X_S β^S + X_{S^c} β^{S^c}) − xβ
              = x_S β^S + x_S (X_S† X_{S^c}) β^{S^c} − x_S β^S − x_{S^c} β^{S^c}
              = ( x_S (X_S† X_{S^c}) − x_{S^c} ) β^{S^c} .

The variance of Ŷ_S(x) is

var(Ŷ_S(x)) = σ² x_S (X_S^T X_S)^{-1} x_S^T .

So, as |S| increases,

the bias of Ŷ_S(x) decreases,
but its variance increases.

Bias-Variance Trade-off

So, how to pick a good trade-off?

Remember, we want to optimize the prediction error, or test error, of the model, evaluated at the observed data-points:

R(S) = Σ_{i=1}^n E[(Ŷ_S(xi) − Yi*)²] = E[‖Ŷ_S − Y*‖²] ,

where Yi* is a future observation at data-point xi.

We could think of looking at the expected residual error (i.e., training error) as a guide:

E[R̂_tr(S)] = Σ_{i=1}^n E[(Ŷ_S(xi) − Yi)²] = E[‖Ŷ_S − Y‖²] .


Bias-Variance Trade-off

It turns out that the training error is a biased estimator of the test error:

Theorem

E[R̂_tr(S)] = R(S) − 2 Tr(Σ_{Ŷ,Y}) .

Remarks:
The data is being used twice: to fit the model and then to estimate the risk.
The cross-covariance between Ŷ and Y increases as the model becomes more complex.
How to estimate the risk more reliably, i.e., how to choose the best model size?


Cross-Validation

Rather than using the data twice, we can organize it differently: hold out part of the data to estimate the risk (a validation set), and fit the model on the rest (the training set).

k-fold Cross-Validation

Why not repeat with different splittings to improve the risk estimate?

R̂ = (1/K) Σ_{k≤K} R̂_k .
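
A minimal sketch of K-fold cross-validation for least squares (the function name kfold_risk and the random splitting scheme are ours, for illustration):

import numpy as np

def kfold_risk(X, y, K=5, seed=0):
    """Estimate the prediction risk of least squares by K-fold cross-validation."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    risks = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        risks.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(risks)           # R-hat = (1/K) sum_k R-hat_k

# Tiny usage example with synthetic data.
rng = np.random.default_rng(10)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=100)
print(kfold_risk(X, y))
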


Logistic Regression

In many situations, we are naturally interested in predicting a binary (or categorical) outcome.

Patient has diabetes or not.
Handwritten digit is in {0, . . . , 9}.
Message is spam or not.
...

In the simple binary setting, observations (Xi, Yi) are modeled as Bernoulli trials:

Yi | Xi ∼ Bern(p(Xi)) .

Q: How can we model the dependency between Xi and p(Xi)?


Logistic Regression

We can use the logistic model:

p(x, β) = P(Y = 1 | X = x) = e^{β^T x} / (1 + e^{β^T x}) .

The function f(t) = e^t / (1 + e^t) is the logistic function.


MLE of Logistic Regression

Given observations (xi, yi), i = 1, . . . , n, the likelihood of the model is

lik(β) = Π_{i=1}^n p(xi, β)^{yi} (1 − p(xi, β))^{1−yi}
       = Π_{i=1}^n ( e^{β^T xi} / (1 + e^{β^T xi}) )^{yi} ( 1 / (1 + e^{β^T xi}) )^{1−yi}
       = Π_{i=1}^n e^{yi β^T xi} / (1 + e^{β^T xi}) ,

so the log-likelihood becomes

ℓ(β) = Σ_{i=1}^n yi β^T xi − log(1 + e^{β^T xi}) .


Solving Logistic Regression

Q: How to obtain the MLE β̂?

Let's start by computing the gradient of ℓ(β):

∇ℓ(β) = Σ_{i=1}^n [ xi yi − xi e^{β^T xi} / (1 + e^{β^T xi}) ]
      = Σ_{i=1}^n xi (yi − p(xi, β)) .

Setting ∇ℓ(β) = 0 results in a system of p non-linear equations.
There is no closed form solution for β̂.
We need to rely on iterative methods!


The Newton Algorithm

An iterative scheme from the 17th century. If f is a differentiable real function, we can find a solution of f(t) = 0 iteratively via

t_{n+1} = t_n − f(t_n) / f′(t_n) .

In our setting, we obtain

β^{n+1} = β^n − ( ∂²ℓ(β)/∂β∂β^T )^{-1} ∇ℓ(β) , evaluated at β = β^n .

The matrix ∂²ℓ(β)/∂β∂β^T is called the Hessian of ℓ.


Iterative Reweighted Least Squares

If we define the vector

p = (p(x1 , β), p(x2 , β), . . . , p(xn , β)) ,

we have
∇`(β) = X T (y − p) ,
and
∂ 2 `(β)
= −X T WX ,
∂β∂β T
with W a diagonal matrix Wi,i = p(xi , β)(1 − p(xi , β).

Joan Bruna STAT 135: Linear Regression


Iterative Reweighted Least Squares

The Newton step thus becomes

β n+1 = β n + (X T WX )−1 X T (y − p)
= (X T WX )−1 X T W (X β n + W −1 (y − p))
= (X T WX )−1 X T Wz ,

with z = X β n + W −1 (y − p).

Joan Bruna STAT 135: Linear Regression


Iterative Reweighted Least Squares

The Newton step thus becomes

    β^{n+1} = β^n + (X^T W X)^{−1} X^T (y − p)
            = (X^T W X)^{−1} X^T W (X β^n + W^{−1} (y − p))
            = (X^T W X)^{−1} X^T W z ,

with z = X β^n + W^{−1} (y − p).

Each Newton step is a reweighted least squares step.
At each iteration p changes, since it depends upon β.
We can initialize the algorithm with β = 0.
In R, glm with family = binomial fits logistic regression via this
algorithm (the glmnet package fits a penalized variant).

Joan Bruna STAT 135: Linear Regression
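Putting the pieces together, a self-contained R sketch of IRLS for logistic regression (an illustration under the stated assumptions, not the course's reference code); X is an n × p design matrix including an intercept column, y a 0/1 vector:

    irls_logistic <- function(X, y, n_iter = 25, tol = 1e-8) {
      beta <- rep(0, ncol(X))                      # initialize with β = 0
      for (iter in 1:n_iter) {
        p <- as.vector(plogis(X %*% beta))         # current probabilities
        w <- p * (1 - p)                           # diagonal of W
        z <- as.vector(X %*% beta) + (y - p) / w   # working response z
        # Weighted least squares step: β ← (X^T W X)^{-1} X^T W z
        beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
        if (max(abs(beta_new - beta)) < tol) return(beta_new)
        beta <- beta_new
      }
      beta
    }

For comparison, coef(glm(y ~ X - 1, family = binomial)) should agree with the converged β̂.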


Asymptotic Properties of regression coefficients

Since β̂ obtained by IRLS approximates the MLE, asymptotic MLE
theory tells us that if the model is correct, then β̂ is consistent:

    β̂ → β    (n → ∞).

Moreover, the distribution of β̂ converges to

    N(β, (X^T W X)^{−1}) .

We will use this approximation to do inference on β.

Joan Bruna STAT 135: Linear Regression
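A hedged sketch of the corresponding standard errors, assuming beta_hat is the converged estimate from the IRLS sketch above:

    p  <- as.vector(plogis(X %*% beta_hat))
    w  <- p * (1 - p)                        # diagonal of W at β̂
    cov_beta <- solve(t(X) %*% (w * X))      # (X^T W X)^{-1}
    se <- sqrt(diag(cov_beta))               # std(β̂_k) for each coefficient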


Example: South African Heart Disease

We consider an example from (Hastie & Tibshirani).

Aim of the study: establish the intensity of heart disease risk factors
in rural Western Cape, South Africa. The response variable is the
presence or absence of myocardial infarction.

Joan Bruna STAT 135: Linear Regression


Example: South African Heart Disease

[Figure: a scatterplot matrix of the South African heart disease data.
Each panel plots a pair of risk factors (sbp, tobacco, ldl, famhist,
obesity, alcohol, age); the cases and controls are color coded.]

Joan Bruna STAT 135: Linear Regression
Example: South African Heart Disease

We fit a logistic regression model using IRLS:

                  β̂       std(β̂)
    (intercept)  -4.13     0.964
    sbp           0.006    0.006
    tobacco       0.080    0.026
    ldl           0.185    0.057
    famhist       0.939    0.225
    obesity      -0.035    0.029
    alcohol       0.001    0.004
    age           0.043    0.010

(results from Hastie & Tibshirani)

Q: Are these numbers statistically significant?

Joan Bruna STAT 135: Linear Regression
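A table like this can be reproduced with R's built-in glm; a sketch assuming a data frame heart with these risk factors and a 0/1 response chd (the variable names are an assumption, not from the slides):

    fit <- glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
               family = binomial, data = heart)
    summary(fit)    # reports β̂, std. errors, and z-values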


Inference About Regression Coefficients

The Z-score is simply the ratio

    β̂_k / std(β̂_k) .

Asymptotic normality means that if n is large, then

    (β̂_k − β_k) / std(β̂_k) ∼ N(0, 1) .

The Wald Test tests the null hypothesis β_k = 0. Reject the null
hypothesis if

    Z_k = |β̂_k| / std(β̂_k) ≥ z(1 − α/2) .

Joan Bruna STAT 135: Linear Regression
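A short R sketch of the Wald test, assuming beta_hat and se from the earlier sketches:

    z_scores <- beta_hat / se
    alpha    <- 0.05
    reject   <- abs(z_scores) >= qnorm(1 - alpha / 2)   # z(1 − α/2) ≈ 1.96
    p_values <- 2 * pnorm(-abs(z_scores))               # two-sided p-values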


Back to Example

We compute the Z-scores:

                  β̂       std(β̂)   Z-score
    (intercept)  -4.13     0.964    -4.28
    sbp           0.006    0.006     1.023
    tobacco       0.080    0.026     3.034
    ldl           0.185    0.057     3.22
    famhist       0.939    0.225     4.178
    obesity      -0.035    0.029    -1.18
    alcohol       0.001    0.004     0.136
    age           0.043    0.010     4.184

(results from Hastie & Tibshirani)

Blood pressure and obesity are not significant. Why?

Moreover, does obesity really correlate negatively with heart disease?

Joan Bruna STAT 135: Linear Regression


Extension to Multi-Class Regression

What if we need to perform a multi-class categorization?

Examples:
    Different mutations.
    Image classification.

Replace the Bernoulli model with a Multinomial model with K
classes:

    P(Y_i = k | X_i) = θ_k(X_i) ,    (k = 1, . . . , K) ,

with θ_k ∈ [0, 1] and Σ_k θ_k = 1.

The Softmax function is the generalization of the logistic function:

    θ_k(x) = e^{β_k^T x} / Σ_{j=1}^K e^{β_j^T x} .

...but that is another story!

Joan Bruna STAT 135: Linear Regression
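A minimal R sketch of the softmax probabilities, assuming B is a p × K matrix whose k-th column is β_k and x is a length-p covariate vector:

    softmax_probs <- function(B, x) {
      s <- as.vector(t(B) %*% x)    # scores β_k^T x, k = 1, ..., K
      e <- exp(s - max(s))          # subtract max(s) for numerical stability
      e / sum(e)                    # θ_k(x), summing to 1
    }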


Joan Bruna STAT 135: Linear Regression
