Chapter 3. Linear Regression

Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


Adding sparsity to the model/Feature selection
Scikit options
Regression
Modeling a quantity as a simple function of features
◦The predicted quantity should be well approximated as continuous
◦Prices, lifespan, physical measurements
◦As opposed to classification where we seek to predict discrete classes

Python example for today: Boston house prices


◦The model is a linear function of the features
◦House_price = a*age + b*House_size + …
◦Create nonlinear features to capture non-linearities (see the sketch below)
◦House_size² = House_size * House_size
◦House_price = a*age + b*House_size + c*House_size² + …
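A minimal sketch of building such a squared feature by hand; the feature names and values here are hypothetical, not the slides' Boston data:

```python
import numpy as np

# hypothetical raw features: age and house size
age = np.array([5.0, 12.0, 30.0])
house_size = np.array([120.0, 85.0, 200.0])

# new nonlinear feature: house size squared
house_size_sq = house_size ** 2

# design matrix with an intercept column; the model is still linear in the coefficients
X = np.column_stack([np.ones_like(age), age, house_size, house_size_sq])
```

The model remains linear in the coefficients even though it is nonlinear in the original feature.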
Case of two features

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
\approx
\begin{bmatrix}
1 & x_{11} & x_{21} \\
1 & x_{12} & x_{22} \\
1 & x_{13} & x_{23} \\
\vdots & \vdots & \vdots \\
1 & x_{1n} & x_{2n}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
$$

The residuals are the entries of $y - X\beta$.
Linear Regression
▶ Model a quantity as a linear function of some known features
▶ 𝑦 is the quantity to be modeled
▶ 𝑋 are the sample points with each row being one data point
▶ Columns are feature vectors

𝑦 ≈ 𝑋𝛽

▶ Goal: Estimate the model coefficients 𝛽
Least squares: Optimization perspective
Define the objective function using the 2-norm of the residuals
◦ $\text{residuals} = y - X\beta$
◦ Minimize: $f_{obj} = \|y - X\beta\|_2^2 = y^T y + \beta^T X^T X \beta - 2 y^T X \beta$
◦ Set the gradient to zero: $\dfrac{\partial f_{obj}}{\partial \beta} = 2 X^T X \beta - 2 X^T y = 0 \;\Rightarrow\; \beta = (X^T X)^{-1} X^T y$
◦ This is the normal equation (see the NumPy sketch below)
◦X is assumed to be thin and full rank so that $X^T X$ is invertible
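A small NumPy sketch of the normal equation on synthetic data (the data is made up for illustration); `np.linalg.lstsq` is the more numerically robust SVD-based alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # thin, full rank
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# normal equation: (X^T X) beta = X^T y  (solve the system, don't invert explicitly)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_ne, beta_ls)
```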
Geometrical perspective
We are trying to approximate y as a linear combination of the column vectors of X

Let's make the residual orthogonal to the column space of X:

$X^T (y - X\beta) = 0$

We get the same normal equation:

$\beta = (X^T X)^{-1} X^T y = A y$

A defines a left inverse of the rectangular matrix X


Python example
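The slides run this example on the Boston house prices data; `load_boston` has since been removed from scikit-learn (version 1.2+), so this hedged sketch uses the built-in diabetes dataset as a stand-in:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

model = LinearRegression()
model.fit(X, y)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2 on training data:", model.score(X, y))
```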
What is Scikit doing?
Least norm
Singular Value Decomposition (SVD)
◦ $X = U \Sigma V^T$
◦ Defines a general pseudo-inverse $X^\dagger = V \Sigma^\dagger U^T$
◦ Known as the Moore-Penrose inverse
◦ For a thin matrix it is the left inverse
◦ For a fat matrix it is the right inverse
◦ Provides a minimum norm solution of an underdetermined set of equations

In general, $X^T X$ may not be full rank

In that case we get the minimum norm solution among the set of all least squares solutions (those having the smallest residual norm)
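A sketch of the minimum-norm least squares solution via the SVD/pseudo-inverse, on a deliberately rank-deficient X (synthetic data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = 2.0 * x1                        # linearly dependent column -> X^T X is singular
X = np.column_stack([np.ones(n), x1, x2])
y = 3.0 + 1.5 * x1 + 0.1 * rng.normal(size=n)

# Moore-Penrose pseudo-inverse, computed via the SVD
beta_pinv = np.linalg.pinv(X) @ y

# lstsq also uses an SVD and returns the same minimum-norm solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_pinv, beta_lstsq)
```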
Stats perspective
$y \approx X\beta$, with $y = X\beta_{true} + \varepsilon$

Maximum Likelihood Estimator (MLE)
◦ Normally distributed error: $y - X\beta = \varepsilon \sim N(0, \sigma^2 I)$
◦ Consider the Gaussian pdf $(2\pi)^{-k/2} |\Sigma_\varepsilon|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\varepsilon-\mu_\varepsilon)^T \Sigma_\varepsilon^{-1} (\varepsilon-\mu_\varepsilon)\right)$
◦ With $\mu_\varepsilon = 0$ and $\Sigma_\varepsilon = \sigma^2 I$ the likelihood becomes $(2\pi)^{-k/2} |\Sigma_\varepsilon|^{-1/2} \exp\!\left(-\tfrac{1}{2\sigma^2}\|y - X\beta\|_2^2\right)$
◦ Maximizing the likelihood is therefore equivalent to L2 norm minimization of the residual
Problem I: Unstable results
Let's look at the distribution of our estimated model coefficients

$\hat{\beta} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X\beta_{true} + \varepsilon) = \beta_{true} + (X^T X)^{-1} X^T \varepsilon$

$E[\hat{\beta}] = \beta_{true}$  Yay!!!!! Unbiased estimator
◦We can show it is the best linear unbiased estimator (BLUE)

$Cov(\hat{\beta}) = E\!\left[(\hat{\beta} - \beta_{true})(\hat{\beta} - \beta_{true})^T\right] = (X^T X)^{-1} X^T E[\varepsilon\varepsilon^T] X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$

Even if $X^T X$ is only close to being non-invertible we are in trouble: the coefficient variance blows up

Estimate parameter variance: Bootstrapping (see the sketch below)
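A minimal bootstrapping sketch for the coefficient variance (synthetic data): resample the rows with replacement, refit, and look at the spread of the estimates, compared here with the analytic $\sigma^2 (X^T X)^{-1}$ formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def bootstrap_coefs(X, y, n_boot=1000):
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        estimates.append(beta)
    return np.array(estimates)

coefs = bootstrap_coefs(X, y)
print("bootstrap std of each coefficient:", coefs.std(axis=0))
print("analytic std from sigma^2 (X^T X)^-1:",
      np.sqrt(0.5**2 * np.diag(np.linalg.inv(X.T @ X))))
```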
Problem II: Overfitting
Model describes the training data very well
◦Actually “too” well
◦The model is adapting to any noise in the training data

Model is very bad at predicting at other points


Defeats the purpose of predictive modeling

How do we know that we have overfit?


What can we do to avoid overfitting?
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


◦Ridge regression
◦Python example: Bootstrapping to demonstrate reduction in variance
◦Optimizing the predictive capacity of the model through cross validation

Adding sparsity to the model/Feature selection


Scikit options
Ridge Regression / Tikhonov regularization

Minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$, which gives the modified normal equation

$(X^T X + \lambda I)\,\beta = X^T y$

A biased linear estimator that achieves lower variance
◦Least squares was BLUE, so we can't hope to get better variance while staying unbiased

Equivalent to a Gaussian MLE with a Gaussian prior on the model coefficients
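A sketch of the ridge normal equation next to scikit-learn's Ridge, on synthetic data; fit_intercept=False so both solve exactly the system above (the two results should agree closely):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + rng.normal(scale=0.3, size=100)

lam = 1.0
# closed form: (X^T X + lambda I) beta = X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn equivalent; fit_intercept=False so no separate intercept is estimated
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(beta_closed)
print(beta_sklearn)
```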


Python example: Creating testcases
make_regression in sklearn.datasets
◦Several parameters to control the “type” of dataset we want
◦Parameters:
◦ Size: n_samples and n_features
◦ Type: n_informative, effective_rank, tail_strength, noise

We want to test ridge regression on datasets with a low effective rank
◦Highly correlated (or linearly dependent) features
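A sketch of generating such a low effective rank test case (parameter values chosen arbitrarily for illustration):

```python
from sklearn.datasets import make_regression

# low effective_rank => the columns of X are nearly linearly dependent
X, y = make_regression(
    n_samples=200,
    n_features=30,
    n_informative=10,
    effective_rank=3,      # approximate rank of the singular value profile
    tail_strength=0.5,     # how quickly the remaining singular values decay
    noise=5.0,
    random_state=0,
)
print(X.shape, y.shape)
```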
Python: Comparing ridge with basic regression
Comparison of variances

[Figure: distributions of the estimated coefficients for linear regression vs. ridge regression]
Scikit: Ridge solvers
The regularized problem is inherently much better conditioned than the LinearRegression() case
Several choices for the solver provided by Scikit
◦SVD
◦ Used by the unregularized linear regression
◦Cholesky factorization
◦Conjugate gradients (CGLS)
◦ Iterative method, so we can target the quality of fit
◦lsqr
◦ Similar to CG but more stable and may need fewer iterations to converge
◦Stochastic Average Gradient (SAG) – fairly new
◦ Use for big data sets
◦ Improvement over standard stochastic gradient
◦ Linear convergence rate – same as gradient descent
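A sketch of selecting the solver in scikit-learn's Ridge; I take 'sparse_cg' to be scikit's conjugate gradient option corresponding to the CGLS item above, and solver availability can vary by version:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=50, noise=1.0, random_state=0)

for solver in ["svd", "cholesky", "sparse_cg", "lsqr", "sag"]:
    model = Ridge(alpha=1.0, solver=solver, random_state=0).fit(X, y)
    print(solver, model.coef_[:3])
```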
How to choose 𝜆: Cross validation
Choosing a smaller 𝜆 or adding more features will always result in lower error on the training dataset
◦Overfitting
◦How to identify a model that will work as a good predictor?
Break up the dataset
◦Training and validation set

Train the model over a subset of the data and test its predictive capability
◦Test predictions on an independent set of data
◦Compare various models and choose the model with the best prediction error
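A minimal sketch of a single train/validation split in scikit-learn (75/25 split and alpha grid chosen arbitrarily):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # score on the held-out validation data, not the training data
    print(alpha, model.score(X_val, y_val))
```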
Cross validation: Training vs Test Error
Leave one out cross validation (LOOCV)
Leave one out CV
◦Leave one data point out as the validation point and train on the remaining dataset
◦Evaluate the model on the left out data point
◦Repeat the modeling and validation for all choices of the left out data point
◦Generalizes to leave-p-out

[The same two-feature system 𝑦 ≈ 𝑋𝛽 as before, with one row held out as the validation point]
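A sketch of LOOCV with scikit-learn's LeaveOneOut splitter, averaging the squared error over the n single-point validation sets:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=10, noise=5.0, random_state=0)

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV mean squared error:", -np.mean(scores))
```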
K-Fold cross validation
2-fold CV
◦Divide the data set into two parts
◦Use each part once as training and once as validation dataset
◦Generalizes to k-fold CV
◦May want to shuffle the data before partitioning

Generally 3/5/10-fold cross validation is preferred
◦Leave-p-out requires several fits over similar sets of data
◦Also, computationally expensive compared to k-fold CV

[The same two-feature system 𝑦 ≈ 𝑋𝛽 as before, with its rows partitioned into folds]
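A sketch of k-fold CV with shuffling (5 folds, in line with the 3/5/10 recommendation above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle before partitioning
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("5-fold CV MSE per fold:", -scores)
```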
RidgeCV: Scikit’s Cross validated Ridge Model
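The slide presumably shows RidgeCV in use; a hedged sketch (alpha grid chosen arbitrarily; the default cv=None uses an efficient leave-one-out scheme):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

model = RidgeCV(alphas=np.logspace(-3, 3, 13))   # cv=None (default) -> efficient LOOCV
model.fit(X, y)
print("selected alpha:", model.alpha_)
```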
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


◦Ridge regression
◦Python example: Bootstrapping to demonstrate reduction in variance
◦Optimizing the predictive capacity of the model through cross validation

Adding sparsity to the model/Feature selection


◦LASSO
◦Basis Pursuit Methods: Matching Pursuit and Least Angle regression

Scikit options
LASSO
The penalty term for coefficient sizes is now the l1 norm

Minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

Equivalent to a Gaussian MLE with a Laplacian prior distribution on the parameters

Can result in many feature coefficients being zero/sparse solution


◦Can be used to select a subset of features – Feature selection
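A sketch showing the sparsity of the LASSO solution (alpha chosen arbitrarily; note that scikit-learn's Lasso scales the squared error term by 1/(2n)):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
n_nonzero = (model.coef_ != 0).sum()
print(f"{n_nonzero} non-zero coefficients out of {X.shape[1]}")
```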
How does this induce sparsity?

[Figure: the L1 penalty function and the corresponding Laplacian prior; the penalty's sharp corner at zero is what pushes coefficients to exactly zero]
Scikit LASSO: Coordinate descent
Minimize along coordinate axes iteratively
◦In general this does not work for non-differentiable functions
◦It does work for the LASSO objective, because the non-differentiable part is separable: $h(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$

Option in scikit called “selection” to choose the coordinate either cyclically or at random (see the sketch below)
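A minimal cyclic coordinate descent sketch for the objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; this is not scikit-learn's actual implementation (which also handles the intercept, feature scaling, and convergence checks), just the core update:

```python
import numpy as np

def soft_threshold(z, t):
    # shrink toward zero; this is what produces exact zeros
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_norm_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):                              # cyclic coordinate selection
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual without feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_norm_sq[j]
    return beta
```

Scikit's "selection" option corresponds to picking the coordinate j cyclically (as above) or at random.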
Matching Pursuit (MP)
Select the feature most correlated with the residual

[Figure: choosing between feature directions f1 and f2 by their correlation with the residual]
Orthogonal Matching Pursuit (OMP)
Keep the residual orthogonal to the set of selected features

[Figure: residual kept orthogonal to the span of the selected features f1 and f2]

(O)MP methods are greedy
◦ Correlated features are ignored and will not be considered again
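A sketch of scikit-learn's OMP estimator, asking for a fixed number of non-zero coefficients (the value 5 is arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
print("selected features:", (omp.coef_ != 0).nonzero()[0])
```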
LARS (Least Angle Regression)
Move along the most correlated feature until another feature becomes equally correlated

[Figure: the LARS path moving between feature directions f1 and f2]
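A sketch of LARS in scikit-learn, fitting the estimator and also tracing the full coefficient path with lars_path:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars, lars_path

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

lars = Lars(n_nonzero_coefs=5).fit(X, y)
print("active features:", (lars.coef_ != 0).nonzero()[0])

# full LARS path: coefficients as a function of the regularization level
alphas, active, coefs = lars_path(X, y, method="lar")
print(coefs.shape)   # (n_features, number of path steps)
```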
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


◦Ridge regression
◦Python example: Bootstrapping to demonstrate reduction in variance
◦Optimizing the predictive capacity of the model through cross validation

Adding sparsity to the model/Feature selection


◦LASSO
◦Basis Pursuit Methods: Matching Pursuit and Least Angle regression

Scikit options
Options
Normalize (default false)
◦Scale the feature vectors to have unit norm
◦Your choice

Fit intercept (default true)


◦False: implies that X and y are already centered
◦Basic linear regression will do this implicitly if X is not sparse and compute the intercept separately
◦ Centering can kill sparsity
◦ Center the data matrix in regularized regressions unless you really want a penalty on the bias term
◦ Issues with sparsity still being worked out in scikit (temporary bug fix for ridge in 0.17 using the sag solver)
RidgeCV options
CV - Control to choose type of cross validation
◦Default LOOCV
◦Integer value ‘n’ sets n-fold CV
◦You can provide your own data splits as well
Lasso(CV)/Lars(CV) options
Positive
◦Force coefficients to be positive

Other controls for iterations


◦Number of iterations (Lasso) / Number of non-zeros (Lars)
◦Tolerance to stop iterations (Lasso)
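A sketch of these options on the cross-validated estimators (parameter values chosen arbitrarily):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LarsCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

# positive forces non-negative coefficients; max_iter/tol control the iterations
lasso = LassoCV(positive=True, max_iter=5000, tol=1e-4, cv=5).fit(X, y)

# LarsCV limits the number of path steps rather than an iteration tolerance
lars = LarsCV(cv=5, max_n_alphas=100).fit(X, y)

print("LassoCV alpha:", lasso.alpha_, " LarsCV alpha:", lars.alpha_)
```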
Python example
Summary
Linear Models
◦Linear regression
◦Ridge – L2 penalty
◦Lasso – L1 penalty results in sparsity
◦LARS – Select a sparse set of features iteratively

Use Cross Validation (CV) to choose your models – Leverage scikit


◦RidgeCV, LarsCV, LassoCV

Not discussed – Explore scikit


◦Combining Ridge and Lasso: Elastic Nets
◦Random Sample Consensus (RANSAC)
◦ Fitting linear models where data has several outliers
◦LassoLars, lars_path
