L2 Linear Regression
Dr Karim Seghouane
School of Mathematics & Statistics
The University of Melbourne
Outline
Introduction
Statistical Models
Linear Models
Linear regression
Given a vector of input variables $x = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$ and a response variable $y \in \mathbb{R}$,
$$y \approx f(x)$$
where
$$f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$
Parameter Estimation
Statistical Interpretation
$$y_i = x_i^\top \beta + \epsilon_i, \quad i = 1, \ldots, N$$
Note 1
Matrix Form
$$\mathrm{RSS}(\beta) = (y - X\beta)^\top (y - X\beta)$$
▶ Assuming that $X$ has full column rank, or equivalently that $X^\top X$ is positive definite, minimizing $\mathrm{RSS}(\beta)$ gives the unique solution
$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
▶ and the fitted values at the training inputs are ( Note 2 )
$$\hat{y} = X\hat{\beta} = X (X^\top X)^{-1} X^\top y = Hy$$
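A minimal sketch of this closed-form fit in NumPy on simulated data (the sample size, coefficients, and noise level below are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3

# Design matrix with an intercept column prepended
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])          # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=N)    # y = X beta + noise

# beta_hat = (X^T X)^{-1} X^T y  (solve the normal equations rather than inverting)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values y_hat = X beta_hat = H y, with hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = X @ beta_hat
```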
Geometric Interpretation
▶ The hat matrix $H$ is square and satisfies $H^2 = H$ and $H^\top = H$ ( Note 3 )
▶ $H$ is the orthogonal projector onto $V = \mathrm{Sp}(X)$ (the column space of $X$, i.e. the subspace of $\mathbb{R}^N$ spanned by the column vectors of $X$)
▶ $\hat{y}$ is the orthogonal projection of $y$ onto $\mathrm{Sp}(X)$
▶ The residual vector $y - \hat{y}$ is orthogonal to this subspace
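A quick numerical check of these projector properties on small simulated data (an illustrative sketch only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
y_hat = H @ y
resid = y - y_hat

print(np.allclose(H @ H, H))        # idempotent: H^2 = H
print(np.allclose(H, H.T))          # symmetric: H^T = H
print(np.allclose(X.T @ resid, 0))  # residuals orthogonal to Sp(X)
```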
Statistical Properties
▶ Assuming the model $y \sim N(X\beta, \sigma^2 I_N)$ gives
$$\hat{\beta} \sim N\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$$
or
$$\hat{y} \sim N(X\beta,\ \sigma^2 H)$$
▶ where $\sigma^2$ is estimated by
$$\hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N - p - 1} (y - \hat{y})^\top (y - \hat{y})$$
▶ To test $H_0: \beta_j = 0$, form
$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}}$$
▶ where $v_j$ is the $j$th diagonal element of $(X^\top X)^{-1}$. Under $H_0$, $z_j$ is distributed as $t_{N-p-1}$.
▶ Testing for a group of variables: $H_0$: the smaller model is correct.
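A sketch of these standard errors and $t$ statistics on simulated data; SciPy's $t$ distribution is used for the p-values, and all numbers below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat

# Unbiased noise variance estimate with N - p - 1 degrees of freedom
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)

# z_j = beta_hat_j / (sigma_hat * sqrt(v_j)), v_j = jth diagonal of (X^T X)^{-1}
v = np.diag(XtX_inv)
z = beta_hat / np.sqrt(sigma2_hat * v)

# Two-sided p-values from the t distribution with N - p - 1 degrees of freedom
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)
```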
Gauss-Markov Theorem
▶ The least squares estimator has the smallest MSE among all linear unbiased estimators.
▶ A biased estimator can achieve a smaller MSE.
▶ Shrinking a set of coefficients towards zero may result in a biased estimate.
▶ MSE is related to the prediction accuracy of a new response $y_0 = f(x_0) + \epsilon_0$ at input $x_0$:
$$E\big[y_0 - \tilde{f}(x_0)\big]^2 = \sigma^2 + E\big[x_0^\top \tilde{\beta} - f(x_0)\big]^2 = \sigma^2 + \mathrm{MSE}\big[\tilde{f}(x_0)\big]$$
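A small Monte Carlo illustration of the second point above: a deliberately shrunk, biased estimator can beat least squares in MSE. The shrinkage factor, dimensions, and coefficient values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 30, 10, 2.0
beta = np.full(p, 0.3)                      # small true coefficients (illustrative)
X = rng.normal(size=(N, p))

mse_ls, mse_shrunk = 0.0, 0.0
reps = 2000
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    b_ls = np.linalg.solve(X.T @ X, X.T @ y)   # unbiased least squares estimate
    b_shrunk = 0.5 * b_ls                      # deliberately biased, shrunk estimate
    mse_ls += np.sum((b_ls - beta) ** 2) / reps
    mse_shrunk += np.sum((b_shrunk - beta) ** 2) / reps

print(mse_ls, mse_shrunk)   # the shrunk estimator typically has the smaller MSE here
```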
Note 5
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}}$$
▶ measures the amount of variability ($\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$) removed by the model
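A quick check of this definition on a small simulated fit (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=60)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

tss = np.sum((y - y.mean()) ** 2)   # total variability around the mean
rss = np.sum((y - y_hat) ** 2)      # variability left after the fit
r2 = (tss - rss) / tss              # proportion of variability removed by the model
```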
▶ Use generalized least squares:
$$y \sim N(X\beta, \Sigma)$$
▶ Still least squares, but minimizing $(y - X\beta)^\top \Sigma^{-1} (y - X\beta)$, i.e. using the metric matrix $\Sigma^{-1}$ instead of $I$
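A minimal sketch of the generalized least squares solution $\hat{\beta} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y$ under an assumed known covariance; the heteroscedastic $\Sigma$ below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 80
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])

# Illustrative covariance: independent errors with unequal variances
Sigma = np.diag(rng.uniform(0.5, 3.0, size=N))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(N), Sigma)

# Generalized least squares: minimize (y - X b)^T Sigma^{-1} (y - X b)
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```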
Note 6
Interactions or collinearity
Detection of collinearity
Categorical predictors
▶ Also referred to as categorical or discrete predictors (or variables).
▶ The prediction task is called regression for quantitative outputs and classification for qualitative outputs.
▶ Qualitative variables are represented by numerical codes:
$$x_i = \begin{cases} 1 & \text{if the } i\text{th experiment is a success} \\ 0 & \text{if the } i\text{th experiment is a failure} \end{cases}$$
▶ This results in the model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th exp. is a success} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th exp. is a failure} \end{cases}$$
Note 8
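This 0/1 coding amounts to adding a dummy column to the design matrix. A minimal sketch with NumPy; the labels and responses below are made up for illustration:

```python
import numpy as np

# Success/failure labels for a few illustrative experiments
labels = np.array(["success", "failure", "success", "success", "failure"])
x = (labels == "success").astype(float)     # 1 for success, 0 for failure

X = np.column_stack([np.ones(len(x)), x])   # intercept + dummy variable
y = np.array([5.1, 3.0, 4.8, 5.3, 2.7])     # illustrative responses

b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
# b0 estimates the failure mean; b0 + b1 estimates the success mean
```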
Outliers
High-leverage points
$$h_i = \frac{1}{N} + (N-1)^{-1} (x_i - \bar{x})^\top S^{-1} (x_i - \bar{x})$$
▶ $S$ is the sample covariance matrix, $x_i$ the $i$th row of $X$, and $\bar{x}$ the average row.
▶ The leverage statistic satisfies $1/N \le h_i \le 1$, and its average is $(p+1)/N$.
▶ If an observation has $h_i$ greatly exceeding $(p+1)/N$, we may suspect that the corresponding point has high leverage.
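The leverages are the diagonal entries of the hat matrix, and the formula above can be checked against them numerically. A sketch on simulated data (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 50, 2
Xraw = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), Xraw])          # add intercept

# Leverages are the diagonal of the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
print(h.mean(), (p + 1) / N)                     # average leverage equals (p+1)/N

# Equivalent Mahalanobis-distance form from the slide
xbar = Xraw.mean(axis=0)
S = np.cov(Xraw, rowvar=False)                   # sample covariance (divides by N-1)
d = Xraw - xbar
h_alt = 1 / N + np.einsum("ij,jk,ik->i", d, np.linalg.inv(S), d) / (N - 1)
print(np.allclose(h, h_alt))
```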
Note 11
Multiple outputs
$$\mathrm{RSS}(B) = \sum_{k=1}^{K} \sum_{i=1}^{N} \big(y_{ik} - f_k(x_i)\big)^2 = \mathrm{tr}\big[(Y - XB)^\top (Y - XB)\big]$$
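Minimizing this criterion decouples into $K$ separate least squares fits that all share the same $(X^\top X)^{-1} X^\top$. A sketch on simulated multi-output data (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, K = 100, 4, 3
X = rng.normal(size=(N, p))
B_true = rng.normal(size=(p, K))
Y = X @ B_true + rng.normal(scale=0.1, size=(N, K))

# One column of B_hat per output, all obtained from the same normal equations
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

rss = np.trace((Y - X @ B_hat).T @ (Y - X @ B_hat))
```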
Multiple outputs
Note 12
Why?
Subset selection
▶ All subsets or best subsets regression: examine all potential combinations of predictors.
▶ Forward selection: begin with the intercept and iteratively add one variable at a time (a sketch follows this list).
▶ Backward selection: begin with the full model and iteratively remove one variable at a time.
▶ What is best for cases where p > n?
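A bare-bones sketch of forward selection; the function name, stopping rule, and data below are my own illustrative choices (in practice the subset size would be chosen by a criterion such as cross-validation, AIC, or BIC):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedy forward selection: at each step add the predictor that most reduces RSS."""
    N = len(y)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            Xs = np.column_stack([np.ones(N), X[:, cols]])
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 10))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=80)
print(forward_selection(X, y, max_vars=3))   # usually picks columns 0 and 3 first
```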
Best Subset
Forward Selection
Backward Elimination
Ridge Regression
$$\hat{\beta}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
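A minimal sketch of this ridge solution, assuming centred/standardised predictors so that no intercept is penalised; the penalty $\lambda$ below is an arbitrary illustrative value (it would be chosen by cross-validation in practice):

```python
import numpy as np

rng = np.random.default_rng(8)
N, p = 50, 20
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

lam = 1.0                                   # illustrative penalty value
# Ridge solution: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```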
Note 17
Lasso
$$\max_{\|\alpha\|=1,\ v_\ell^\top S\alpha = 0,\ \ell = 1, \ldots, m-1} \mathrm{Var}(X\alpha)$$
$$\hat{\beta}^{\mathrm{pcr}} = \sum_{m=1}^{M} \hat{\theta}_m v_m$$
▶ If $M = p$, then $\hat{y}^{\mathrm{pcr}} = \hat{y}^{\mathrm{LS}}$, since the columns of $Z = UD$ span the column space of $X$.
▶ PCR discards the $p - M$ smallest-eigenvalue components.
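A sketch of principal components regression along these lines, on centred simulated data; the choice of $M$ and the data-generating values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
N, p, M = 100, 8, 3                      # keep the M largest principal components
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                   # centre the predictors
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = y - y.mean()

# Principal directions v_m from the SVD X = U D V^T; components z_m = X v_m
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Z = X @ V[:, :M]

# Regress y on each (orthogonal) component separately, then map back:
# beta_pcr = sum_m theta_hat_m v_m
theta = Z.T @ y / np.sum(Z ** 2, axis=0)
beta_pcr = V[:, :M] @ theta
# With M = p this reproduces the ordinary least squares fit
```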
Principal Components
$$X_i = X - P_{t_1, \ldots, t_{i-1}} X$$
$$\max_{\|\alpha\|=1,\ v_\ell^\top S\alpha = 0,\ \ell = 1, \ldots, m-1} \mathrm{Var}(X\alpha)$$
Note 18
▶ Summaries on LMS.
▶ Chapters 3, 5 & 14.5 from 'The Elements of Statistical Learning'.
▶ Chapters 3, 6 & 10.2 from 'An Introduction to Statistical Learning'.