
Linear Regression and Related Topics

MAST90083 Computational Statistics and Data Mining

Dr Karim Seghouane
School of Mathematics & Statistics
The University of Melbourne


Outline

§i. Introduction

§ii. Linear regression

§iii. Other Considerations

§iv. Selection and Regularization

§v. Dimension Reduction Methods

§vi. Multiple Outcome Shrinkage


Statistical Models

- What is the simplest mathematical model that describes the relationship between two variables? A straight line.
- Statistical models are fitted for a variety of reasons:
  - Explanation and prediction: uncover causes by studying the relationship between a variable of interest (the response) and a set of variables called the explanatory variables, and use the model for prediction.
  - Examine and test scientific hypotheses.


Linear Models

- Linear models have a long history in statistics, and even in today's computer era they remain important and widely used in supervised learning.
- They are simple and provide an interpretable picture of how the inputs affect the output.
- For prediction purposes they can sometimes outperform fancier nonlinear models, particularly with small samples, a low signal-to-noise ratio or sparse data.
- We will study some of the key questions associated with the linear regression model.


Linear regression
Given a vector of input variables x = (x1, ..., xp)^T ∈ R^p and a response variable y ∈ R,

    y ≈ f(x),   where   f(x) = β0 + Σ_{i=1}^p βi xi = β0 + β1 x1 + ... + βp xp

- The linear model assumes that the dependence of y on x1, x2, ..., xp is linear, or at least well approximated by a linear relationship.
- Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
- The βj's are the unknown parameters that need to be determined.

Parameter Estimation

- We have at our disposal a set of training data (x1, y1), ..., (xN, yN) from which to estimate the parameters β.
- The most popular method is least squares, where β is obtained by minimizing the residual sum of squares

    RSS(β) = Σ_{i=1}^N (yi − f(xi))² = Σ_{i=1}^N ( yi − β0 − Σ_{j=1}^p βj xij )²

- RSS(β) is a quadratic function of the parameters, so its minimum always exists but may not be unique.
- What is the statistical interpretation?
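As a minimal illustration (not from the slides), the least squares fit can be computed directly in Python with NumPy on simulated data; all variable names below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = rng.normal(size=(N, p))                              # predictors
    beta_true = np.array([2.0, -1.0, 0.5])
    y = 1.0 + X @ beta_true + rng.normal(scale=0.5, size=N)  # response with noise

    # Design matrix with a leading column of ones for the intercept beta_0
    X1 = np.column_stack([np.ones(N), X])

    # Least squares: minimizes RSS(beta) = ||y - X1 beta||^2
    beta_hat, rss, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
    print(np.round(beta_hat, 2))                             # close to [1.0, 2.0, -1.0, 0.5]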


Statistical Interpretation

- From a statistical point of view, this is the maximum likelihood estimate of β under the model

    yi = xi^T β + εi,   i = 1, ..., N

  where ε1, ..., εN are independent random draws from N(0, σ²), with σ > 0 an unknown parameter, so that ε ∼ N(0, σ² IN).
- Taking X as the N × (p + 1) matrix with each row an input vector (with a 1 in the first position) and y as the N × 1 vector of responses: E(y) = Xβ and Cov(y) = σ² IN, so that y ∼ N(Xβ, σ² IN).

Note 1


Matrix Form

- The residual sum of squares takes the matrix form

    RSS(β) = (y − Xβ)^T (y − Xβ)

- Assuming that X has full column rank, so that X^T X is positive definite, minimizing RSS(β) gives the unique solution

    β̂ = (X^T X)^{-1} X^T y

- The fitted values at the training inputs are ( Note 2 )

    ŷ = X β̂ = X (X^T X)^{-1} X^T y = H y


Geometric Interpretation
- The hat matrix H is square and satisfies H² = H and H^T = H ( Note 3 )
- H is the orthogonal projector onto V = Sp(X), the column space of X, i.e. the subspace of R^N spanned by the column vectors of X
- ŷ is the orthogonal projection of y onto Sp(X)
- The residual vector y − ŷ is orthogonal to this subspace
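These projection properties are easy to verify numerically; a small NumPy check (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
    y = rng.normal(size=50)

    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix H = X (X^T X)^{-1} X^T
    y_hat = H @ y

    print(np.allclose(H @ H, H))              # idempotent: H^2 = H
    print(np.allclose(H, H.T))                # symmetric: H^T = H
    print(np.allclose(X.T @ (y - y_hat), 0))  # residual orthogonal to Sp(X)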


Statistical Properties
- Assuming the model y ∼ N(Xβ, σ² IN) gives

    β̂ ∼ N( β, σ² (X^T X)^{-1} )   and   ŷ ∼ N( Xβ, σ² H )

- where σ² is estimated by

    σ̂² = (1 / (N − p − 1)) Σ_{i=1}^N (yi − ŷi)² = (1 / (N − p − 1)) (y − ŷ)^T (y − ŷ)

- and (N − p − 1) σ̂² ∼ σ² χ²_{N−p−1}
- Furthermore, σ̂² and β̂ are statistically independent. ( Why? )

Assessing the Accuracy of the coefficient estimates


- Approximate confidence set for the parameter vector β:

    Cβ = { β : (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)} }

- Test the null hypothesis H0: βj = 0 vs. H1: βj ≠ 0 using

    zj = β̂j / ( σ̂ √vj )

  where vj is the jth diagonal element of (X^T X)^{-1}. Under H0, zj is distributed as t_{N−p−1}.
- Testing for a group of variables, H0: the smaller model is correct:

    F = [ (RSS0 − RSS1) / (p1 − p0) ] / [ RSS1 / (N − p1 − 1) ] ∼ F_{p1−p0, N−p1−1}

Note 4
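A sketch of how these quantities can be computed by hand with NumPy and SciPy on simulated data (the variable names are illustrative, not from the lecture notes):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    N, p = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    beta = np.array([1.0, 2.0, 0.0, -0.5])
    y = X @ beta + rng.normal(size=N)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - p - 1)       # unbiased estimate of sigma^2

    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # standard errors sigma_hat * sqrt(v_j)
    z = beta_hat / se                              # z_j, distributed t_{N-p-1} under H0
    p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)
    print(np.round(z, 2))
    print(np.round(p_values, 3))                   # the coefficient simulated as 0 gets a large p-value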

Gauss-Markov Theorem

- The LSE of β has the smallest variance among all linear unbiased estimates.
- Consider estimating θ = a^T β; its LSE is

    θ̂ = a^T β̂ = a^T (X^T X)^{-1} X^T y = c0^T y

- Then E(a^T β̂) = a^T β, so θ̂ is unbiased, and

    Var(θ̂) ≤ Var(c^T y)

  for any other linear unbiased estimator θ̃ = c^T y.


Reducing the MSE

- The LSE has the smallest MSE of all linear estimators with no bias.
- A biased estimator can achieve a smaller MSE.
- Shrinking a set of coefficients towards zero may result in a biased estimate.
- MSE is related to the prediction accuracy of a new response y0 = f(x0) + ε0 at input x0:

    E[ y0 − f̃(x0) ]² = σ² + E[ x0^T β̃ − f(x0) ]² = σ² + MSE[ f̃(x0) ]

Note 5


Assessing the Overall Accuracy of the Model

- Quantify how well the model fits the observations.
- Two quantities are used. The residual standard error (RSE) measures the lack of fit:

    RSE = √( RSS(β̂) / (N − p − 1) ) = √( Σ_{i=1}^N (yi − ŷi)² / (N − p − 1) )

- The R² statistic

    R² = (TSS − RSS) / TSS,   with TSS = Σ_i (yi − ȳ)²,

  measures the proportion of the total variability explained by the model.


Other Considerations in the Regression Model

- Correlation of the error terms
- Interactions or collinearity
- Categorical predictors and their interpretation (two or more categories)
- Non-linear effects of predictors
- Outliers and high-leverage points
- Multiple outputs


Correlation of the error terms

- Use generalized least squares:

    RSS(β) = (y − Xβ)^T Σ^{-1} (y − Xβ)

- This is similar to assuming y ∼ N(Xβ, Σ).
- It is still least squares, but using the metric matrix Σ^{-1} instead of I.
Note 6
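A common way to compute the generalized least squares fit is to whiten the data with a Cholesky factor of Σ and then apply ordinary least squares; a sketch assuming Σ is known (here an AR(1)-style covariance chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    N = 60
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    # Assume the error covariance Sigma is known; AR(1)-style correlation for illustration
    rho = 0.7
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    L = np.linalg.cholesky(Sigma)
    y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=N)    # correlated errors

    # GLS: minimize (y - X b)^T Sigma^{-1} (y - X b)  <=>  OLS on the whitened data
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    beta_gls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    print(np.round(beta_gls, 2))                             # close to [1.0, 2.0]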


Interactions or collinearity

- Variables closely related to one another lead to linear dependence, or collinearity, among the columns of X.
- It is difficult to separate the individual effects of collinear variables on the response.

    Var(β̂) = σ² (X^T X)^{-1}

- Collinearity has a considerable effect on the precision of β̂ → large variances, wide confidence intervals and low power of the tests.
- It is important to identify and address potential collinearity problems.


Detection of collinearity

- Look at the correlation matrix of the variables to detect pairs of highly correlated variables.
- For collinearity among three or more variables, compute the variance inflation factor (VIF) for each variable:

    VIF(β̂j) = 1 / (1 − Rj²)

  where Rj² is the R² from regressing xj on the other predictors. Geometrically, 1 − Rj² measures how close xj is to the subspace spanned by X−j.

    VIF(β̂j) ≤ λmax / λmin = κ(X)²

- Examine the eigenvalues and eigenvectors. Note 7
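A sketch of a VIF computation in NumPy, following the definition above (the helper function vif is illustrative, not a standard library routine):

    import numpy as np

    def vif(X):
        """Variance inflation factors for the columns of X (a sketch; no intercept column)."""
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        out = np.empty(p)
        for j in range(p):
            xj = X[:, j]
            A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # regress x_j on the others
            coef, *_ = np.linalg.lstsq(A, xj, rcond=None)
            resid = xj - A @ coef
            r2 = 1.0 - resid @ resid / ((xj - xj.mean()) @ (xj - xj.mean()))
            out[j] = 1.0 / (1.0 - r2)
        return out

    # Example: x3 is nearly a linear combination of x1 and x2, so all three VIFs blow up
    rng = np.random.default_rng(3)
    x1, x2 = rng.normal(size=(2, 200))
    x3 = x1 + x2 + 0.01 * rng.normal(size=200)
    print(np.round(vif(np.column_stack([x1, x2, x3])), 1))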


Categorical predictors
- Also referred to as qualitative or discrete predictors (variables).
- The prediction task is called regression for a quantitative output and classification for a qualitative output.
- Qualitative variables are represented by numerical codes, e.g.

    xi = 1 if the ith experiment is a success, and xi = 0 if the ith experiment is a failure

- This results in the model

    yi = β0 + β1 xi + εi = β0 + β1 + εi if the ith experiment is a success, and β0 + εi if it is a failure

Note 8
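For concreteness, a tiny sketch of the 0/1 coding fitted by least squares (the data and labels are invented for illustration):

    import numpy as np

    # 0/1 dummy coding for a two-level categorical predictor (made-up data)
    status = np.array(["success", "failure", "success", "success", "failure"])
    x = (status == "success").astype(float)    # 1 = success, 0 = failure
    y = np.array([5.1, 3.0, 4.8, 5.4, 2.7])

    X = np.column_stack([np.ones_like(x), x])
    beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]
    print(round(beta0, 2))                     # fitted mean response for failures
    print(round(beta0 + beta1, 2))             # fitted mean response for successes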

Non-linear effects of predictors

- The linear model assumes a linear relationship between the response and the predictors.
- The true relationship between the response and the predictors may be non-linear.
- Polynomial regression is a simple way to extend linear models to accommodate non-linear relationships.
- In this case non-linearity is obtained by considering transformed versions of the predictors.
- The parameters can still be estimated using standard linear regression methods (see the sketch below).
Note 9
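A sketch of polynomial regression as a linear model in transformed predictors, fitting a cubic by ordinary least squares on simulated data:

    import numpy as np

    # Polynomial regression: still linear in the parameters, non-linear in x
    rng = np.random.default_rng(4)
    x = rng.uniform(-2, 2, size=200)
    y = np.sin(x) + 0.1 * rng.normal(size=200)              # truly non-linear relationship

    degree = 3
    X = np.column_stack([x**d for d in range(degree + 1)])  # columns: 1, x, x^2, x^3
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.round(beta_hat, 3))                            # coefficients of the cubic fit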


Outliers

- Different reasons can lead to outliers, for example the incorrect recording of an observation.
- The residuals, as estimates of the errors, can be used to identify outliers by examining them for extreme values.
- It is better to use the studentized residuals.

    E[e] = E[(IN − H) y] = (IN − H) δ

  where δ is a mean-shift vector capturing outlying observations (i.e. y = Xβ + δ + ε), since (IN − H) X = 0.
- If the diagonal entries of H are not close to 1 (i.e. are small), then e reflects the presence of outliers.
Note 10


High-leverage points

- Observations with high leverage have an unusual value of xi.
- They are difficult to identify when there are multiple predictors.
- To quantify an observation's leverage, use the leverage statistic

    hi = 1/N + (N − 1)^{-1} (xi − x̄)^T S^{-1} (xi − x̄)

  where S is the sample covariance matrix of the predictors, xi is the ith row of X and x̄ is the average row.
- The leverage statistic satisfies 1/N ≤ hi ≤ 1, and its average value is (p + 1)/N.
- If an observation has hi greatly exceeding (p + 1)/N, we may suspect that the corresponding point has high leverage.
Note 11
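When the model contains an intercept, hi equals the ith diagonal element of the hat matrix H, which gives a direct way to compute the leverages; an illustrative check:

    import numpy as np

    rng = np.random.default_rng(5)
    N, p = 100, 2
    X = rng.normal(size=(N, p))
    X[0] = [8.0, -8.0]                        # one unusual (high-leverage) observation

    X1 = np.column_stack([np.ones(N), X])     # design matrix including the intercept
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    h = np.diag(H)                            # leverage statistics h_i

    print(np.round(h[:3], 3))                 # the first point has much larger leverage
    print(np.isclose(h.mean(), (p + 1) / N))  # average leverage is (p + 1)/N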


Multiple outputs

- Multiple outputs y1, ..., yK need to be predicted from x1, ..., xp, where a linear model is assumed for each output:

    yk = β0k + Σ_{j=1}^p xj βjk + εk = fk(X) + εk

- In matrix notation Y = XB + E, where Y is N × K, X is N × (p + 1), B is the (p + 1) × K matrix of parameters and E is the N × K matrix of errors.

    RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (yik − fk(xi))² = tr[ (Y − XB)^T (Y − XB) ]


Multiple outputs

- The least squares estimates have exactly the same form:

    B̂ = (X^T X)^{-1} X^T Y

- In the case of correlated errors ε ∼ N(0, Σ), the multivariate criterion becomes

    RSS(B) = Σ_{i=1}^N (yi − f(xi))^T Σ^{-1} (yi − f(xi))

Note 12
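A quick numerical check (simulated data) that the multi-output least squares fit is just the K separate single-output fits stacked column by column:

    import numpy as np

    rng = np.random.default_rng(6)
    N, p, K = 100, 3, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), with intercept column
    B_true = rng.normal(size=(p + 1, K))
    Y = X @ B_true + 0.1 * rng.normal(size=(N, K))               # N x K matrix of responses

    # Multi-output least squares: B_hat = (X^T X)^{-1} X^T Y
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(B_hat.shape)                                           # (p + 1, K)
    # Each column of B_hat equals the separate regression of that output on X
    print(np.allclose(B_hat[:, 0], np.linalg.lstsq(X, Y[:, 0], rcond=None)[0]))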


Why Move Beyond Least Squares?

- The least squares estimates are in most cases not satisfactory when a large number of potential explanatory variables is available.
- Improving prediction accuracy: the LSE often has low bias but large variance; sacrificing a little bias can reduce the variance of the predicted values and improve overall prediction accuracy.
- Interpretation: do all the predictors help to explain y? Determine a smaller subset with the strongest effects, sacrificing the small details.


Linear Model Selection and Regularization


Deciding on the Important Variables

Subset selection
- All subsets or best subsets regression (examine all potential combinations).
- Forward selection: begin with the intercept and iteratively add one variable.
- Backward selection: begin with the full model and iteratively remove one variable.
- Which is best for cases where p > n?


Best Subset

- Retain a subset of the predictors and eliminate the rest.
- LSE is used to obtain the coefficients of the retained variables.
- For each k ∈ {0, 1, 2, ..., p}, find the subset of size k that gives the smallest residual sum of squares.
- The choice of k is made using a criterion and involves a tradeoff between bias and variance.
- Different criteria exist; each minimizes an estimate of the expected prediction error.
- Infeasible for large p.
Note 13


Forward Selection

- Sequential addition of predictors → forward stepwise selection.
- Start with the intercept and sequentially add the predictor that most improves the fit.
- Add the predictor producing the largest value of

    F = [ RSS(β̂^(k)) − RSS(β̂^(k+1)) ] / [ RSS(β̂^(k+1)) / (N − k − 2) ]

  where β̂^(k) denotes the fit with k predictors plus the intercept.
- Use the 90th or 95th percentile of F_{1, N−k−2} as the entry threshold Fe.
Note 14
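A greedy forward-selection sketch driven directly by RSS; for brevity it stops after a fixed number of variables rather than applying the F-based entry threshold described above (the function name and data are illustrative):

    import numpy as np

    def forward_selection(X, y, max_vars):
        """Greedy forward selection by residual sum of squares (a sketch)."""
        N, p = X.shape
        selected, remaining = [], list(range(p))

        def rss(cols):
            A = np.column_stack([np.ones(N)] + [X[:, c] for c in cols])
            resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
            return resid @ resid

        while remaining and len(selected) < max_vars:
            # Add the candidate giving the largest drop in RSS (equivalently the largest F)
            best = min(remaining, key=lambda j: rss(selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 5))
    y = 3 * X[:, 2] - 2 * X[:, 0] + 0.5 * rng.normal(size=200)
    print(forward_selection(X, y, max_vars=3))   # predictors 2 and 0 enter first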


Backward Elimination

- Start with the full model and sequentially remove predictors.
- Use a deletion threshold Fd to choose the predictor to delete (the one with the smallest value of F).
- Stop when every predictor remaining in the model produces F > Fd.
- Can be used only when N > p.
- Typically Fd ≈ Fe.
Note 15


Alternative: Shrinkage Methods

- Subset selection produces an interpretable model with possibly lower prediction error than the full model.
- The selection is discrete → it often exhibits high variance.
- Shrinkage methods are continuous and do not suffer as much from high variability.
- We fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
- Shrinking the coefficient estimates can significantly reduce their variance (not immediately obvious).

Ridge Regression

- Ridge regression shrinks the regression coefficients by constraining their size.
- This is the approach used in neural networks, where it is known as weight decay.
- The larger the value of λ, the greater the amount of shrinkage.
- β0 is left out of the penalty term.
Note 16


Ridge Regression

- Because of the penalty term, the ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant.
- It is best to apply ridge regression after standardizing the predictors, using the formula

    x̃ij = xij / √( (1/n) Σ_{i=1}^n (xij − x̄j)² )


Ridge Regression

    RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β

    β̂^ridge = (X^T X + λI)^{-1} X^T y

- The ridge solution is a linear function of y.
- It avoids singularity when X^T X is not of full rank by adding a positive constant to the diagonal of X^T X.
- For orthogonal predictors, β̂^ridge = γ β̂ where 0 ≤ γ ≤ 1.
- The effective degrees of freedom of the ridge regression fit is

    df(λ) = tr( X [X^T X + λI]^{-1} X^T )

Note 17
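A sketch of the closed-form ridge solution and its effective degrees of freedom, assuming the predictors have been centred and standardized and y has been centred so that the intercept is not penalized:

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge regression (a sketch); X standardized, y centred, no intercept."""
        p = X.shape[1]
        beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        # Effective degrees of freedom: df(lam) = tr(X (X^T X + lam I)^{-1} X^T)
        df = np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T))
        return beta, df

    rng = np.random.default_rng(8)
    X = rng.normal(size=(100, 5))
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.5]) + rng.normal(size=100)
    y = y - y.mean()

    for lam in [0.0, 10.0, 100.0]:
        beta, df = ridge_fit(X, y, lam)
        print(lam, np.round(beta, 2), round(df, 2))   # coefficients shrink and df(lam) decreases as lam grows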

Ridge regression - credit data example


balance ∼ age, cards, education, income, limit, rating, gender, student, status, ethnicity


Ridge regression vs. LS


Lasso

- Ridge regression disadvantage: it includes all p predictors (some of them with only minor influence).
- The lasso, in contrast, selects a subset.
- The lasso coefficients β̂λL minimize the quantity

    Σ_{i=1}^N ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p |βj|


The Variable Selection Property of the Lasso


Some Remarks on Lasso

- Making s sufficiently small will cause some of the coefficients to be exactly zero → continuous subset selection.
- If s = Σ_{j=1}^p |β̂j^ls| (or larger), then the lasso estimates are simply the least squares estimates β̂j^ls.
- s should be chosen adaptively to minimize an estimate of the expected prediction error.
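A small illustration of this selection property using scikit-learn's Lasso (not part of the slides; scikit-learn's alpha plays the role of the penalty parameter, up to scaling):

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    rng = np.random.default_rng(9)
    N, p = 200, 8
    X = rng.normal(size=(N, p))
    beta_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0.5, 0])    # sparse truth
    y = X @ beta_true + rng.normal(size=N)

    print(np.round(LinearRegression().fit(X, y).coef_, 2))   # least squares: all coefficients non-zero

    for alpha in [0.05, 0.2, 0.5]:
        lasso = Lasso(alpha=alpha).fit(X, y)
        # More coefficients are set exactly to zero as the penalty grows
        print(alpha, np.round(lasso.coef_, 2))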


The Variable Selection Property of the Lasso


Profile of Lasso Coefficients

- Profiles of the lasso coefficients as the tuning parameter is varied. The coefficients are plotted versus t = s / Σ_{j=1}^p |β̂j^ls|.


Dimension Reduction Methods

- When there is a large number of predictors, often correlated, we can select which variables (dimensions) to use.
- But why not transform the predictors to a lower dimension and then fit the least squares model using the transformed variables?
- We will refer to these techniques as dimension reduction methods.
- They use a small number M of linear combinations zm, m = 1, ..., M, of the xj.
- The methods differ in how the linear combinations are obtained.


Principal Components Regression

- The linear combinations zm are the principal components zm = X vm, where

    vm = arg max_{‖α‖=1, vℓ^T S α = 0, ℓ=1,...,m−1} Var(Xα)

- The constraints vℓ^T S α = 0 ensure that zm = X vm is uncorrelated with all previous linear combinations zℓ = X vℓ, ℓ = 1, ..., m − 1.
- y is regressed on z1, ..., zM for M ≤ p.


Principal Components Regression


- Since the zm are orthogonal, this regression is just a sum of univariate regressions:

    ŷ^pcr = ȳ + Σ_{m=1}^M θ̂m zm,   with   θ̂m = ⟨zm, y⟩ / ⟨zm, zm⟩

    β̂^pcr = Σ_{m=1}^M θ̂m vm

- If M = p, then ŷ^pcr = ŷ^LS, since the columns of Z = UD span the column space of X.
- PCR discards the p − M components with the smallest eigenvalues.
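A PCR sketch built directly on the SVD of the centred predictor matrix, following the formulas above (the function pcr_fit is illustrative):

    import numpy as np

    def pcr_fit(X, y, M):
        """Principal components regression (a sketch): regress y on the first M
        principal components of the centred predictor matrix."""
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean
        U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U D V^T, so Z = Xc V = U D
        V = Vt.T
        Z = Xc @ V[:, :M]                                   # first M principal components
        theta = Z.T @ yc / np.sum(Z**2, axis=0)             # univariate fits <z_m, y>/<z_m, z_m>
        beta = V[:, :M] @ theta                             # coefficients in the original coordinates
        intercept = y_mean - x_mean @ beta
        return intercept, beta

    rng = np.random.default_rng(10)
    X = rng.normal(size=(150, 6))
    y = X @ np.array([1.5, -1.0, 0.0, 0.0, 2.0, 0.0]) + rng.normal(size=150)
    b0, b = pcr_fit(X, y, M=3)       # with M = p this reproduces the least squares fit
    print(round(b0, 2), np.round(b, 2))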

Principal Components Regression


Dimension Reduction Methods


Principal Components

- Effectively, we change the orientation of the axes.
- The 1st principal component is the (normalized) linear combination of the variables with the largest variance.
- The 2nd principal component has the largest remaining variance, subject to being uncorrelated with the first.
- And so on.
- Often we can explain most of the variation with only a few principal components.
- More details in Chapter 10.2 of the book 'An Introduction to Statistical Learning'.


Principal Components


Principal Components


Principal Components Regression

- Now we can run a regression analysis on only a few principal components.
- We call this Principal Components Regression (PCR).
- Note that these directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.
- Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions for predicting the response.


Partial Least Squares (PLS)

- PLS is a dimension reduction method that first identifies a new set of features Z1, ..., ZM that are linear combinations of the original features.
- It then fits a linear model via OLS using these M new features.
- Up to this point it is very much like PCR.
- However, PLS identifies these new features using the response Y (in a supervised way).
- The PLS approach attempts to find directions that help explain both the response and the predictors.


Partial Least Squares (PLS)

- PLS uses y to construct linear combinations of the inputs.
- The inputs are weighted by the strength of their univariate effect on y.
- Regress y on zm to obtain θ̂m, then orthogonalize the inputs with respect to zm.
- Continue the process until M < p directions are obtained.
- PLS seeks directions that have high variance and high correlation with the response:

    max_{‖α‖=1, vℓ^T S α = 0, ℓ=1,...,m−1} Corr²(y, Xα) Var(Xα)


Partial Least Squares (PLS)


- The first component, say t1 = X α1, maximizes the squared covariance with y:

    α1 = arg max_{‖α‖=1} Cov²(y, Xα)

- Subsequent components t2, t3, ... are chosen so that they maximize the squared covariance with y while all components remain mutually orthogonal.
- Orthogonality is enforced by deflating the original variables X:

    Xi = X − P_{t1,...,ti−1} X

  where P_{t1,...,ti−1} denotes the orthogonal projection onto the space spanned by t1, ..., ti−1.
- The fit is ŷ = P_{t1,...,tm} y instead of ŷ = X β̂ = X (X^T X)^{-1} X^T y.
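As an illustration (not from the slides), scikit-learn's PLSRegression implements a deflation-based PLS of this kind; a minimal usage sketch:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(11)
    N, p = 200, 10
    X = rng.normal(size=(N, p))
    y = X @ np.concatenate([[2.0, -1.5], np.zeros(p - 2)]) + rng.normal(size=N)

    pls = PLSRegression(n_components=2)   # M = 2 supervised components
    pls.fit(X, y)

    T = pls.transform(X)                  # component scores t_1, t_2 (shape N x 2)
    y_hat = pls.predict(X).ravel()
    print(T.shape, np.round(np.corrcoef(y, y_hat)[0, 1], 2))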



Illustrating the connection


The connection between these methods can be seen through the optimisation criterion they use to define the projection directions.
- PCR extracts components that explain the variance of the predictor space:

    max_{‖α‖=1, vℓ^T S α = 0, ℓ=1,...,m−1} Var(Xα)

- PLS extracts components that have a high covariance with the response:

    max_{‖α‖=1, vℓ^T S α = 0, ℓ=1,...,m−1} Corr²(y, Xα) Var(Xα)

- Both methods are similar in their aim to extract m components from the predictor space X.

Illustrating the connection

Both methods aim
- to express the solution in a lower dimensional subspace, β = Vz, where V is a p × m matrix with orthonormal columns.
- Using this basis for the subspace, an alternative approximate minimization problem is considered:

    min_β ‖y − Xβ‖ ≈ min_z ‖y − XVz‖

- In PCR, V is obtained directly from X.
- In PLS, V depends on y in a complicated nonlinear way.


Illustrating the connection


Considering
- the (thin) singular value decomposition X = UDV^T, where U is N × p, D = diag(d1, ..., dp) is p × p and V is p × p,
- with the columns of U and V orthonormal, so that U^T U = Ip and V^T V = Ip,
- the least squares solution takes the form

    β̂ = Σ_{i=1}^p (ui^T y / di) vi = Σ_{i=1}^p βi

- The other estimators are shrinkage estimators and can be expressed as

    β̂ = Σ_{i=1}^p f(di) βi
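For example, ridge regression corresponds to the shrinkage factor f(di) = di²/(di² + λ), which shrinks the directions with small singular values the most; a numerical check of this representation (no intercept or centring, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(12)
    X = rng.normal(size=(80, 4))
    y = rng.normal(size=80)
    lam = 5.0

    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    coords = U.T @ y / d                     # u_i^T y / d_i for each direction
    components = Vt.T * coords               # column i is beta_i = (u_i^T y / d_i) v_i
    beta_ols = components.sum(axis=1)        # least squares: f(d_i) = 1

    f = d**2 / (d**2 + lam)                  # ridge shrinkage factors
    beta_ridge_svd = (components * f).sum(axis=1)
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

    print(np.allclose(beta_ols, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
    print(np.allclose(beta_ridge, beta_ridge_svd))                       # True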


Multiple Outcome Shrinkage


- When the outputs are not correlated → apply a univariate technique individually to each outcome, i.e. work with each output column separately.
- Other approaches exploit correlations among the different responses → canonical correlation analysis (CCA).
- CCA finds a sequence of linear combinations Xvm and Yum such that the correlations Corr²(Yum, Xvm) are maximized.
- Reduced rank regression:

    B̂^rr(m) = arg min_{rank(B)=m} Σ_{i=1}^N (yi − Bxi)^T Σ^{-1} (yi − Bxi)

Note 18

For further reading

- Summaries on the LMS.
- Chapters 3, 5 & 14.5 of 'The Elements of Statistical Learning'.
- Chapters 3, 6 & 10.2 of 'An Introduction to Statistical Learning'.
