Chapter 3. Linear Regression
Linear Regression
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression
𝑦 ≈ 𝑋𝛽
Normal equations: 𝑋ᵀ(𝑦 − 𝑋𝛽) = 0
Least-squares solution: 𝛽 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑦 = 𝐴𝑦
Ridge regression: (𝑋ᵀ𝑋 + 𝜆𝐼)𝛽 = 𝑋ᵀ𝑦
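A minimal sketch of the two closed-form solutions above, using NumPy on a small synthetic dataset (the data and the value of 𝜆 are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Ordinary least squares: solve X^T X beta = X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: solve (X^T X + lambda I) beta = X^T y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(beta_ols)
print(beta_ridge)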
We want to test ridge regression on datasets with a low effective rank
◦Highly correlated (or linearly dependent) features
Python: Comparing ridge with basic regression
[Figure: comparison of coefficient variances, linear regression vs. ridge regression]
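A rough sketch of this comparison, assuming scikit-learn's make_regression with a low effective_rank to create highly correlated features; the sizes, noise level, and alpha are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Low effective rank => nearly linearly dependent columns in X
X, y_clean = make_regression(n_samples=100, n_features=10,
                             effective_rank=2, noise=0.0, random_state=0)

rng = np.random.default_rng(0)
coefs_lr, coefs_ridge = [], []
for _ in range(50):                          # resample only the noise
    y = y_clean + rng.normal(scale=1.0, size=y_clean.shape)
    coefs_lr.append(LinearRegression().fit(X, y).coef_)
    coefs_ridge.append(Ridge(alpha=1.0).fit(X, y).coef_)

# Average variance of the fitted coefficients across the repetitions
print("LinearRegression coef variance:", np.var(coefs_lr, axis=0).mean())
print("Ridge coef variance:           ", np.var(coefs_ridge, axis=0).mean())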
Scikit: Ridge solvers
The ridge problem is inherently much better conditioned than the LinearRegression() case
Several choices for the solver provided by Scikit
◦SVD
◦ Used by the unregularized linear regression
◦Cholesky factorization
◦Conjugate gradients (CGLS)
◦ Iterative method, so we can target a desired quality of fit
◦LSQR
◦ Similar to CG but more stable and may need fewer iterations to converge
◦Stochastic Average Gradient (SAG) – Fairly new
◦ Use for big data sets
◦ Improvement over standard stochastic gradient
◦ Convergence rate is linear – same as full gradient descent
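A short sketch of switching between these solvers through Ridge's "solver" argument (dataset and alpha are illustrative; 'sparse_cg' is scikit-learn's conjugate-gradient option):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

# 'sag' works best when the features are on comparable scales
for solver in ["svd", "cholesky", "sparse_cg", "lsqr", "sag"]:
    model = Ridge(alpha=1.0, solver=solver).fit(X, y)
    print(solver, model.score(X, y))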
How to choose 𝜆: Cross validation
Choosing a smaller 𝜆 or adding more features will always result in lower error on the training dataset
◦Overfitting
◦How to identify a model that will work as a good predictor?
Break up the dataset
◦Training and validation set
Train the model over a subset of the data and test its predictive capability
◦Test predictions on an independent set of data
◦Compare various models and choose the model with the best prediction error
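A small sketch of this split, using scikit-learn's train_test_split (the split fraction, model, and data are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)        # train on one subset
print("train R^2:     ", model.score(X_train, y_train))
print("validation R^2:", model.score(X_val, y_val))   # test on the held-out set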
Cross validation: Training vs Test Error
Leave one out cross validation (LOOCV)
Leave one out CV
◦Leave one data point out as the validation point and train on the remaining dataset
◦Evaluate the model on the left-out data point
◦Repeat the modeling and validation test for all choices of the left-out data point
Each data point is one row of 𝑦 ≈ 𝑋𝛽, i.e. 𝑦𝑖 ≈ 𝛽0 + 𝛽1𝑥1𝑖 + 𝛽2𝑥2𝑖 for 𝑖 = 1, …, 𝑛
◦Generalizes to leave-p-out
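A sketch of leave-one-out CV with scikit-learn's LeaveOneOut splitter (model, alpha, and data are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

# One fit per data point; each fit is scored on the single left-out point
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("LOOCV mean squared error:", -scores.mean())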
K-Fold cross validation
2-fold CV
◦Divide the data set into two parts
◦Use each part once as the training set and once as the validation set
◦Generalizes to k-fold CV
◦May want to shuffle the data before partitioning
(The folds partition the rows of the same system 𝑦 ≈ 𝑋𝛽 shown above.)

Generally 3/5/10-fold cross validation is preferred
◦Leave-p-out requires several fits over similar sets of data
◦Also, it is computationally expensive compared to k-fold CV
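A matching sketch for k-fold CV; KFold(shuffle=True) reshuffles the rows before partitioning them into folds (parameter values are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("5-fold CV mean squared error:", -scores.mean())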
RidgeCV: Scikit’s Cross-validated Ridge Model
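A brief sketch of RidgeCV, which fits the model for each candidate 𝜆 (called alpha in scikit-learn) and keeps the one with the best cross-validated score; the alpha grid and cv=5 are illustrative choices (by default RidgeCV uses an efficient leave-one-out scheme instead):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("selected alpha:", model.alpha_)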
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression
Scikit options
LASSO
The penalty term for coefficient sizes is now the ℓ1 norm
Minimize ‖𝑦 − 𝑋𝛽‖₂² + 𝜆‖𝛽‖₁
Gaussian likelihood with a Laplacian prior distribution on the parameters (MAP estimation)
[Figure: the ℓ1 penalty function and the corresponding Laplacian prior]
Scikit LASSO: Coordinate descent
Minimize along coordinate axes iteratively
◦Does not work in general for non-differentiable functions
LASSO objective
Non-differentiable part is separable
Separable: ℎ(𝑥1, 𝑥2, …, 𝑥𝑛) = 𝑓1(𝑥1) + 𝑓2(𝑥2) + … + 𝑓𝑛(𝑥𝑛)
Scikit provides an option, called “selection”, to choose the update coordinate either cyclically or at random
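A sketch of scikit-learn's Lasso, which is solved by coordinate descent; "selection" switches between cyclic and random coordinate updates (alpha and the dataset are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

model = Lasso(alpha=0.5, selection="random", random_state=0, max_iter=10000)
model.fit(X, y)
print("non-zero coefficients:", (model.coef_ != 0).sum())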
Matching Pursuit (MP)
Select feature most correlated to the residual
[Figure: feature directions f1 and f2 and the current residual]
Orthogonal Matching Pursuit (OMP)
Keep residual orthogonal to the set of selected features
[Figure: the residual is kept orthogonal to the selected feature directions f1 and f2]
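A sketch of OMP in scikit-learn; n_nonzero_coefs caps how many features it may select (the value 5 and the dataset are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

model = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
print("selected features:", (model.coef_ != 0).sum())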
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression
Scikit options
Options
normalize (default False)
◦Scales each feature vector to have unit norm
◦Whether to use it is your choice
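Recent scikit-learn releases have deprecated (and later removed) this flag, so a rough equivalent is to center each feature and divide it by its ℓ2 norm yourself before fitting; the following is a sketch of that manual step, not the library's exact internals:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])   # mixed scales
y = X @ np.array([1.0, 0.1, 0.01]) + rng.normal(size=100)

Xc = X - X.mean(axis=0)                  # center the columns
Xn = Xc / np.linalg.norm(Xc, axis=0)     # scale each column to unit norm
model = Ridge(alpha=1.0).fit(Xn, y)
print(model.coef_)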