L2 Linear Regression
Dr Karim Seghouane
School of Mathematics & Statistics
The University of Melbourne
Outline
Introduction
Statistical Models
Linear Models
Linear regression
Given a vector of input variables $x = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$ and a response variable $y \in \mathbb{R}$,
$$y \approx f(x)$$
where
$$f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$
Parameter Estimation
Statistical Interpretation
$$y_i = x_i^\top \beta + \epsilon_i, \quad i = 1, \ldots, N$$
Note 1
Matrix Form
$$\mathrm{RSS}(\beta) = (y - X\beta)^\top (y - X\beta)$$
▶ Assuming that $X$ has full column rank, or equivalently that $X^\top X$ is positive definite, minimizing $\mathrm{RSS}(\beta)$ gives the unique solution
$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
▶ and the fitted values at the training inputs are ( Note 2 )
$$\hat{y} = X\hat{\beta} = X (X^\top X)^{-1} X^\top y = Hy$$
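A minimal sketch of this closed-form fit in NumPy on simulated data (the sample size, coefficients, and noise level below are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3

# Design matrix with an intercept column prepended
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])          # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=N)    # y = X beta + noise

# beta_hat = (X^T X)^{-1} X^T y  (solve the normal equations rather than inverting)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values y_hat = X beta_hat = H y, with hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = X @ beta_hat
```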
Geometric Interpretation
▶ The hat matrix $H$ is square and satisfies $H^2 = H$ and $H^\top = H$ ( Note 3 )
▶ $H$ is the orthogonal projector onto $V = \mathrm{Sp}(X)$ (the column space of $X$, i.e. the subspace of $\mathbb{R}^N$ spanned by the column vectors of $X$)
▶ $\hat{y}$ is the orthogonal projection of $y$ onto $\mathrm{Sp}(X)$
▶ The residual vector $y - \hat{y}$ is orthogonal to this subspace
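A quick numerical check of these projector properties on small simulated data (an illustrative sketch only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
y_hat = H @ y
resid = y - y_hat

print(np.allclose(H @ H, H))        # idempotent: H^2 = H
print(np.allclose(H, H.T))          # symmetric: H^T = H
print(np.allclose(X.T @ resid, 0))  # residuals orthogonal to Sp(X)
```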
Statistical Properties
▶ Assuming the model $y \sim N(X\beta, \sigma^2 I_N)$ gives
$$\hat{\beta} \sim N\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$$
or
$$\hat{y} \sim N(X\beta,\ \sigma^2 H)$$
▶ where $\sigma^2$ is estimated by
$$\hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N - p - 1} (y - \hat{y})^\top (y - \hat{y})$$
▶ To test $H_0: \beta_j = 0$, form
$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}}$$
▶ where $v_j$ is the $j$th diagonal element of $(X^\top X)^{-1}$. Under $H_0$, $z_j$ is distributed as $t_{N-p-1}$.
▶ Testing for a group of variables: $H_0$: the smaller model is correct.
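A sketch of these standard errors and $t$ statistics on simulated data; SciPy's $t$ distribution is used for the p-values, and all numbers below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat

# Unbiased noise variance estimate with N - p - 1 degrees of freedom
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)

# z_j = beta_hat_j / (sigma_hat * sqrt(v_j)), v_j = jth diagonal of (X^T X)^{-1}
v = np.diag(XtX_inv)
z = beta_hat / np.sqrt(sigma2_hat * v)

# Two-sided p-values from the t distribution with N - p - 1 degrees of freedom
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)
```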
Gauss-Markov Theorem
▶ The least squares estimator has the smallest MSE among all linear unbiased estimators.
▶ A biased estimator can achieve a smaller MSE.
▶ Shrinking a set of coefficients towards zero may result in a biased estimate.
▶ MSE is related to the prediction accuracy of a new response $y_0 = f(x_0) + \epsilon_0$ at input $x_0$:
$$E\big[y_0 - \tilde{f}(x_0)\big]^2 = \sigma^2 + E\big[x_0^\top \tilde{\beta} - f(x_0)\big]^2 = \sigma^2 + \mathrm{MSE}\big[\tilde{f}(x_0)\big]$$
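A small Monte Carlo illustration of the second point above: a deliberately shrunk, biased estimator can beat least squares in MSE. The shrinkage factor, dimensions, and coefficient values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 30, 10, 2.0
beta = np.full(p, 0.3)                      # small true coefficients (illustrative)
X = rng.normal(size=(N, p))

mse_ls, mse_shrunk = 0.0, 0.0
reps = 2000
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    b_ls = np.linalg.solve(X.T @ X, X.T @ y)   # unbiased least squares estimate
    b_shrunk = 0.5 * b_ls                      # deliberately biased, shrunk estimate
    mse_ls += np.sum((b_ls - beta) ** 2) / reps
    mse_shrunk += np.sum((b_shrunk - beta) ** 2) / reps

print(mse_ls, mse_shrunk)   # the shrunk estimator typically has the smaller MSE here
```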
Note 5
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}}$$
▶ measures the amount of variability ($\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$) removed by the model
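A quick check of this definition on a small simulated fit (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=60)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

tss = np.sum((y - y.mean()) ** 2)   # total variability around the mean
rss = np.sum((y - y_hat) ** 2)      # variability left after the fit
r2 = (tss - rss) / tss              # proportion of variability removed by the model
```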
▶ Use generalized least squares:
$$y \sim N(X\beta, \Sigma)$$
▶ Still least squares, but minimizing $(y - X\beta)^\top \Sigma^{-1} (y - X\beta)$, i.e. using the metric matrix $\Sigma^{-1}$ instead of $I$
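A minimal sketch of the generalized least squares solution $\hat{\beta} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y$ under an assumed known covariance; the heteroscedastic $\Sigma$ below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 80
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])

# Illustrative covariance: independent errors with unequal variances
Sigma = np.diag(rng.uniform(0.5, 3.0, size=N))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(N), Sigma)

# Generalized least squares: minimize (y - X b)^T Sigma^{-1} (y - X b)
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```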
Note 6
Interactions or collinearity
Detection of collinearity
Categorical predictors
▶ Also referred to as categorical or discrete predictors (or variables).
▶ The prediction task is called regression for quantitative outputs and classification for qualitative outputs.
▶ Qualitative variables are represented by numerical codes:
$$x_i = \begin{cases} 1 & \text{if the } i\text{th experiment is a success} \\ 0 & \text{if the } i\text{th experiment is a failure} \end{cases}$$
▶ This results in the model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th exp. is a success} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th exp. is a failure} \end{cases}$$
Note 8
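This 0/1 coding amounts to adding a dummy column to the design matrix. A minimal sketch with NumPy; the labels and responses below are made up for illustration:

```python
import numpy as np

# Success/failure labels for a few illustrative experiments
labels = np.array(["success", "failure", "success", "success", "failure"])
x = (labels == "success").astype(float)     # 1 for success, 0 for failure

X = np.column_stack([np.ones(len(x)), x])   # intercept + dummy variable
y = np.array([5.1, 3.0, 4.8, 5.3, 2.7])     # illustrative responses

b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
# b0 estimates the failure mean; b0 + b1 estimates the success mean
```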
Outliers
High-leverage points
$$h_i = \frac{1}{N} + (N-1)^{-1} (x_i - \bar{x})^\top S^{-1} (x_i - \bar{x})$$
▶ $S$ is the sample covariance matrix, $x_i$ the $i$th row of $X$, and $\bar{x}$ the average row.
▶ The leverage statistic satisfies $1/N \le h_i \le 1$, and its average is $(p+1)/N$.
▶ If an observation has $h_i$ greatly exceeding $(p+1)/N$, we may suspect that the corresponding point has high leverage.
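The leverages are the diagonal entries of the hat matrix, and the formula above can be checked against them numerically. A sketch on simulated data (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 50, 2
Xraw = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), Xraw])          # add intercept

# Leverages are the diagonal of the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
print(h.mean(), (p + 1) / N)                     # average leverage equals (p+1)/N

# Equivalent Mahalanobis-distance form from the slide
xbar = Xraw.mean(axis=0)
S = np.cov(Xraw, rowvar=False)                   # sample covariance (divides by N-1)
d = Xraw - xbar
h_alt = 1 / N + np.einsum("ij,jk,ik->i", d, np.linalg.inv(S), d) / (N - 1)
print(np.allclose(h, h_alt))
```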
Note 11
Multiple outputs
$$\mathrm{RSS}(B) = \sum_{k=1}^{K} \sum_{i=1}^{N} \big(y_{ik} - f_k(x_i)\big)^2 = \mathrm{tr}\big[(Y - XB)^\top (Y - XB)\big]$$
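Minimizing this criterion decouples into $K$ separate least squares fits that all share the same $(X^\top X)^{-1} X^\top$. A sketch on simulated multi-output data (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, K = 100, 4, 3
X = rng.normal(size=(N, p))
B_true = rng.normal(size=(p, K))
Y = X @ B_true + rng.normal(scale=0.1, size=(N, K))

# One column of B_hat per output, all obtained from the same normal equations
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

rss = np.trace((Y - X @ B_hat).T @ (Y - X @ B_hat))
```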
Multiple outputs
Note 12
Why?
Subset selection
▶ All subsets or best subsets regression: examine all potential combinations of predictors.
▶ Forward selection: begin with the intercept and iteratively add one variable at a time (a sketch follows this list).
▶ Backward selection: begin with the full model and iteratively remove one variable at a time.
▶ What is best for cases where p > n?
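A bare-bones sketch of forward selection; the function name, stopping rule, and data below are my own illustrative choices (in practice the subset size would be chosen by a criterion such as cross-validation, AIC, or BIC):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedy forward selection: at each step add the predictor that most reduces RSS."""
    N = len(y)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            Xs = np.column_stack([np.ones(N), X[:, cols]])
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 10))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=80)
print(forward_selection(X, y, max_vars=3))   # usually picks columns 0 and 3 first
```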
Best Subset
Forward Selection
Backward Elimination
Ridge Regression
$$\hat{\beta}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
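A minimal sketch of this ridge solution, assuming centred/standardised predictors so that no intercept is penalised; the penalty $\lambda$ below is an arbitrary illustrative value (it would be chosen by cross-validation in practice):

```python
import numpy as np

rng = np.random.default_rng(8)
N, p = 50, 20
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

lam = 1.0                                   # illustrative penalty value
# Ridge solution: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```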
Note 17
Lasso
$$\max_{\|\alpha\|=1,\ v_\ell^\top S\alpha = 0,\ \ell = 1, \ldots, m-1} \mathrm{Var}(X\alpha)$$
$$\hat{\beta}^{\mathrm{pcr}} = \sum_{m=1}^{M} \hat{\theta}_m v_m$$
▶ If $M = p$, then $\hat{y}^{\mathrm{pcr}} = \hat{y}^{\mathrm{LS}}$, since the columns of $Z = UD$ span the column space of $X$.
▶ PCR discards the $p - M$ smallest-eigenvalue components.
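A sketch of principal components regression along these lines, on centred simulated data; the choice of $M$ and the data-generating values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
N, p, M = 100, 8, 3                      # keep the M largest principal components
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                   # centre the predictors
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = y - y.mean()

# Principal directions v_m from the SVD X = U D V^T; components z_m = X v_m
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Z = X @ V[:, :M]

# Regress y on each (orthogonal) component separately, then map back:
# beta_pcr = sum_m theta_hat_m v_m
theta = Z.T @ y / np.sum(Z ** 2, axis=0)
beta_pcr = V[:, :M] @ theta
# With M = p this reproduces the ordinary least squares fit
```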
Principal Components
$$X_i = X - P_{t_1, \ldots, t_{i-1}} X$$
$$\max_{\|\alpha\|=1,\ v_\ell^\top S\alpha = 0,\ \ell = 1, \ldots, m-1} \mathrm{Var}(X\alpha)$$
Note 18
▶ Summaries on LMS.
▶ Chapters 3, 5 & 14.5 from 'The Elements of Statistical Learning'.
▶ Chapters 3, 6 & 10.2 from 'An Introduction to Statistical Learning'.