Chapter 3. Linear Regression

Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


Adding sparsity to the model/Feature selection
Scikit options
Regression
Modeling a quantity as a simple function of features
◦The predicted quantity should be well approximated as continuous
◦Prices, lifespan, physical measurements
◦As opposed to classification where we seek to predict discrete classes

Python example for today: Boston house prices


◦The model is a linear function of the features
◦House_price = a*age + b*House_size + …
◦Create nonlinear features to capture non-linearities (see the sketch below)
◦House_size² = House_size * House_size
◦House_price = a*age + b*House_size + c*House_size² + …
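A minimal sketch of building such a squared feature by hand; the feature names and values here are hypothetical, not the slides' Boston data:

```python
import numpy as np

# hypothetical raw features: age and house size
age = np.array([5.0, 12.0, 30.0])
house_size = np.array([120.0, 85.0, 200.0])

# new nonlinear feature: house size squared
house_size_sq = house_size ** 2

# design matrix with an intercept column; the model is still linear in the coefficients
X = np.column_stack([np.ones_like(age), age, house_size, house_size_sq])
```

The model remains linear in the coefficients even though it is nonlinear in the original feature.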
Case of two features

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
\approx
\begin{bmatrix}
1 & x_{11} & x_{21} \\
1 & x_{12} & x_{22} \\
1 & x_{13} & x_{23} \\
\vdots & \vdots & \vdots \\
1 & x_{1n} & x_{2n}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
$$

The residuals are the entries of $y - X\beta$.
Linear Regression
▶ Model a quantity as a linear function of some known features
▶ 𝑦 is the quantity to be modeled
▶ 𝑋 are the sample points with each row being one data point
▶ Columns are feature vectors

𝑦 ≈ 𝑋𝛽

▶ Goal: Estimate the model coefficients 𝛽
Least squares: Optimization perspective
Define the objective function using the 2-norm of the residuals
◦ $\text{residuals} = y - X\beta$
◦ Minimize: $f_{obj} = \|y - X\beta\|_2^2 = y^T y + \beta^T X^T X \beta - 2 y^T X \beta$
◦ Set the gradient to zero: $\dfrac{\partial f_{obj}}{\partial \beta} = 2 X^T X \beta - 2 X^T y = 0 \;\Rightarrow\; \beta = (X^T X)^{-1} X^T y$
◦ This is the normal equation (see the NumPy sketch below)
◦X is assumed to be thin and full rank so that $X^T X$ is invertible
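A small NumPy sketch of the normal equation on synthetic data (the data is made up for illustration); `np.linalg.lstsq` is the more numerically robust SVD-based alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # thin, full rank
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# normal equation: (X^T X) beta = X^T y  (solve the system, don't invert explicitly)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_ne, beta_ls)
```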
Geometrical perspective
We are trying to approximate y as a linear combination of the column vectors of X

Let's make the residual orthogonal to the column space of X:

$X^T (y - X\beta) = 0$

We get the same normal equation:

$\beta = (X^T X)^{-1} X^T y = A y$

A defines a left inverse of the rectangular matrix X


Python example
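The slides run this example on the Boston house prices data; `load_boston` has since been removed from scikit-learn (version 1.2+), so this hedged sketch uses the built-in diabetes dataset as a stand-in:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

model = LinearRegression()
model.fit(X, y)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2 on training data:", model.score(X, y))
```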
What is Scikit doing?
Least norm
Singular Value Decomposition (SVD)
◦ $X = U \Sigma V^T$
◦ Defines a general pseudo-inverse $X^\dagger = V \Sigma^\dagger U^T$
◦ Known as the Moore-Penrose inverse
◦ For a thin matrix it is the left inverse
◦ For a fat matrix it is the right inverse
◦ Provides a minimum norm solution of an underdetermined set of equations

In general, $X^T X$ may not be full rank

In that case we get the minimum norm solution among the set of all least squares solutions (those having the smallest residual norm)
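A sketch of the minimum-norm least squares solution via the SVD/pseudo-inverse, on a deliberately rank-deficient X (synthetic data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = 2.0 * x1                        # linearly dependent column -> X^T X is singular
X = np.column_stack([np.ones(n), x1, x2])
y = 3.0 + 1.5 * x1 + 0.1 * rng.normal(size=n)

# Moore-Penrose pseudo-inverse, computed via the SVD
beta_pinv = np.linalg.pinv(X) @ y

# lstsq also uses an SVD and returns the same minimum-norm solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_pinv, beta_lstsq)
```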
Stats perspective
$y \approx X\beta$, with $y = X\beta_{true} + \varepsilon$

Maximum Likelihood Estimator (MLE)
◦ Normally distributed error: $y - X\beta = \varepsilon \sim N(0, \sigma^2 I)$
◦ Consider the Gaussian pdf $(2\pi)^{-k/2} |\Sigma_\varepsilon|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\varepsilon-\mu_\varepsilon)^T \Sigma_\varepsilon^{-1} (\varepsilon-\mu_\varepsilon)\right)$
◦ With $\mu_\varepsilon = 0$ and $\Sigma_\varepsilon = \sigma^2 I$ the likelihood becomes $(2\pi)^{-k/2} |\Sigma_\varepsilon|^{-1/2} \exp\!\left(-\tfrac{1}{2\sigma^2}\|y - X\beta\|_2^2\right)$
◦ Maximizing the likelihood is therefore equivalent to L2 norm minimization of the residual
Problem I: Unstable results
Let's look at the distribution of our estimated model coefficients

$\hat{\beta} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X\beta_{true} + \varepsilon) = \beta_{true} + (X^T X)^{-1} X^T \varepsilon$

$E[\hat{\beta}] = \beta_{true}$  Yay!!!!! Unbiased estimator
◦We can show it is the best linear unbiased estimator (BLUE)

$Cov(\hat{\beta}) = E\!\left[(\hat{\beta} - \beta_{true})(\hat{\beta} - \beta_{true})^T\right] = (X^T X)^{-1} X^T E[\varepsilon\varepsilon^T] X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$

Even if $X^T X$ is only close to being non-invertible we are in trouble: the coefficient variance blows up

Estimate parameter variance: Bootstrapping (see the sketch below)
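A minimal bootstrapping sketch for the coefficient variance (synthetic data): resample the rows with replacement, refit, and look at the spread of the estimates, compared here with the analytic $\sigma^2 (X^T X)^{-1}$ formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def bootstrap_coefs(X, y, n_boot=1000):
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        estimates.append(beta)
    return np.array(estimates)

coefs = bootstrap_coefs(X, y)
print("bootstrap std of each coefficient:", coefs.std(axis=0))
print("analytic std from sigma^2 (X^T X)^-1:",
      np.sqrt(0.5**2 * np.diag(np.linalg.inv(X.T @ X))))
```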
Problem II: Overfitting
Model describes the training data very well
◦Actually “too” well
◦The model is adapting to any noise in the training data

Model is very bad at predicting at other points


Defeats the purpose of predictive modeling

How do we know that we have overfit?


What can we do to avoid overfitting?
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


◦Ridge regression
◦Python example: Bootstrapping to demonstrate reduction in variance
◦Optimizing the predictive capacity of the model through cross validation

Adding sparsity to the model/Feature selection


Scikit options
Ridge Regression / Tikhonov regularization

Minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$, which gives the modified normal equation

$(X^T X + \lambda I)\,\beta = X^T y$

A biased linear estimator that achieves lower variance
◦Least squares was BLUE, so we can't hope to get better variance while staying unbiased

Equivalent to a Gaussian MLE with a Gaussian prior on the model coefficients
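A sketch of the ridge normal equation next to scikit-learn's Ridge, on synthetic data; fit_intercept=False so both solve exactly the system above (the two results should agree closely):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + rng.normal(scale=0.3, size=100)

lam = 1.0
# closed form: (X^T X + lambda I) beta = X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn equivalent; fit_intercept=False so no separate intercept is estimated
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(beta_closed)
print(beta_sklearn)
```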


Python example: Creating testcases
make_regression in sklearn.datasets
◦Several parameters to control the “type” of dataset we want
◦Parameters:
◦ Size: n_samples and n_features
◦ Type: n_informative, effective_rank, tail_strength, noise

We want to test ridge regression on datasets with a low effective rank
◦Highly correlated (or linearly dependent) features
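A sketch of generating such a low effective rank test case (parameter values chosen arbitrarily for illustration):

```python
from sklearn.datasets import make_regression

# low effective_rank => the columns of X are nearly linearly dependent
X, y = make_regression(
    n_samples=200,
    n_features=30,
    n_informative=10,
    effective_rank=3,      # approximate rank of the singular value profile
    tail_strength=0.5,     # how quickly the remaining singular values decay
    noise=5.0,
    random_state=0,
)
print(X.shape, y.shape)
```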
Python: Comparing ridge with basic regression
Comparison of variances

[Figure: distributions of the estimated coefficients for linear regression vs. ridge regression]
Scikit: Ridge solvers
The regularized problem is inherently much better conditioned than the LinearRegression() case
Several choices for the solver provided by Scikit
◦SVD
◦ Used by the unregularized linear regression
◦Cholesky factorization
◦Conjugate gradients (CGLS)
◦ Iterative method, so we can target the quality of fit
◦lsqr
◦ Similar to CG but more stable and may need fewer iterations to converge
◦Stochastic Average Gradient (SAG) – fairly new
◦ Use for big data sets
◦ Improvement over standard stochastic gradient
◦ Linear convergence rate – same as gradient descent
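A sketch of selecting the solver in scikit-learn's Ridge; I take 'sparse_cg' to be scikit's conjugate gradient option corresponding to the CGLS item above, and solver availability can vary by version:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=50, noise=1.0, random_state=0)

for solver in ["svd", "cholesky", "sparse_cg", "lsqr", "sag"]:
    model = Ridge(alpha=1.0, solver=solver, random_state=0).fit(X, y)
    print(solver, model.coef_[:3])
```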
How to choose 𝜆: Cross validation
Choosing a smaller 𝜆 or adding more features will always result in lower error on the training dataset
◦Overfitting
◦How to identify a model that will work as a good predictor?
Break up the dataset
◦Training and validation set

Train the model over a subset of the data and test its predictive capability
◦Test predictions on an independent set of data
◦Compare various models and choose the model with the best prediction error
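A minimal sketch of a single train/validation split in scikit-learn (75/25 split and alpha grid chosen arbitrarily):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # score on the held-out validation data, not the training data
    print(alpha, model.score(X_val, y_val))
```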
Cross validation: Training vs Test Error
Leave one out cross validation (LOOCV)
Leave one out CV
◦Leave one data point out as the validation point and train on the remaining dataset
◦Evaluate the model on the left out data point
◦Repeat the modeling and validation for all choices of the left out data point
◦Generalizes to leave-p-out

[The same two-feature system 𝑦 ≈ 𝑋𝛽 as before, with one row held out as the validation point]
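A sketch of LOOCV with scikit-learn's LeaveOneOut splitter, averaging the squared error over the n single-point validation sets:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=10, noise=5.0, random_state=0)

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV mean squared error:", -np.mean(scores))
```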
K-Fold cross validation
2-fold CV
◦Divide the data set into two parts
◦Use each part once as training and once as validation dataset
◦Generalizes to k-fold CV
◦May want to shuffle the data before partitioning

Generally 3/5/10-fold cross validation is preferred
◦Leave-p-out requires several fits over similar sets of data
◦Also, computationally expensive compared to k-fold CV

[The same two-feature system 𝑦 ≈ 𝑋𝛽 as before, with its rows partitioned into folds]
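A sketch of k-fold CV with shuffling (5 folds, in line with the 3/5/10 recommendation above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle before partitioning
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("5-fold CV MSE per fold:", -scores)
```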
RidgeCV: Scikit’s Cross validated Ridge Model
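The slide presumably shows RidgeCV in use; a hedged sketch (alpha grid chosen arbitrarily; the default cv=None uses an efficient leave-one-out scheme):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

model = RidgeCV(alphas=np.logspace(-3, 3, 13))   # cv=None (default) -> efficient LOOCV
model.fit(X, y)
print("selected alpha:", model.alpha_)
```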
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


◦Ridge regression
◦Python example: Bootstrapping to demonstrate reduction in variance
◦Optimizing the predictive capacity of the model through cross validation

Adding sparsity to the model/Feature selection


◦LASSO
◦Basis Pursuit Methods: Matching Pursuit and Least Angle regression

Scikit options
LASSO
The penalty term for coefficient sizes is now the l1 norm

Minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

Equivalent to a Gaussian MLE with a Laplacian prior distribution on the parameters

Can result in many feature coefficients being zero/sparse solution


◦Can be used to select a subset of features – Feature selection
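A sketch showing the sparsity of the LASSO solution (alpha chosen arbitrarily; note that scikit-learn's Lasso scales the squared error term by 1/(2n)):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
n_nonzero = (model.coef_ != 0).sum()
print(f"{n_nonzero} non-zero coefficients out of {X.shape[1]}")
```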
How does this induce sparsity?

[Figure: the L1 penalty function and the corresponding Laplacian prior; the penalty's sharp corner at zero is what pushes coefficients to exactly zero]
Scikit LASSO: Coordinate descent
Minimize along coordinate axes iteratively
◦In general this does not work for non-differentiable functions
◦It does work for the LASSO objective, because the non-differentiable part is separable: $h(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$

Option in scikit called “selection” to choose the coordinate either cyclically or at random (see the sketch below)
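A minimal cyclic coordinate descent sketch for the objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; this is not scikit-learn's actual implementation (which also handles the intercept, feature scaling, and convergence checks), just the core update:

```python
import numpy as np

def soft_threshold(z, t):
    # shrink toward zero; this is what produces exact zeros
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_norm_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):                              # cyclic coordinate selection
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual without feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_norm_sq[j]
    return beta
```

Scikit's "selection" option corresponds to picking the coordinate j cyclically (as above) or at random.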
Matching Pursuit (MP)
Select the feature most correlated with the residual

[Figure: choosing between feature directions f1 and f2 by their correlation with the residual]
Orthogonal Matching Pursuit (OMP)
Keep the residual orthogonal to the set of selected features

[Figure: residual kept orthogonal to the span of the selected features f1 and f2]

(O)MP methods are greedy
◦ Correlated features are ignored and will not be considered again
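A sketch of scikit-learn's OMP estimator, asking for a fixed number of non-zero coefficients (the value 5 is arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
print("selected features:", (omp.coef_ != 0).nonzero()[0])
```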
LARS (Least Angle Regression)
Move along the most correlated feature until another feature becomes equally correlated

[Figure: the LARS path moving between feature directions f1 and f2]
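A sketch of LARS in scikit-learn, fitting the estimator and also tracing the full coefficient path with lars_path:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars, lars_path

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

lars = Lars(n_nonzero_coefs=5).fit(X, y)
print("active features:", (lars.coef_ != 0).nonzero()[0])

# full LARS path: coefficients as a function of the regularization level
alphas, active, coefs = lars_path(X, y, method="lar")
print(coefs.shape)   # (n_features, number of path steps)
```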
Outline
Linear Regression
◦Different perspectives
◦Issues with linear regression

Addressing the issues through regularization


◦Ridge regression
◦Python example: Bootstrapping to demonstrate reduction in variance
◦Optimizing the predictive capacity of the model through cross validation

Adding sparsity to the model/Feature selection


◦LASSO
◦Basis Pursuit Methods: Matching Pursuit and Least Angle regression

Scikit options
Options
Normalize (default false)
◦Scale the feature vectors to have unit norm
◦Your choice

Fit intercept (default true)


◦False: implies that X and y are already centered
◦Basic linear regression will do this implicitly if X is not sparse and compute the intercept separately
◦ Centering can kill sparsity
◦ Center the data matrix in regularized regressions unless you really want a penalty on the bias term
◦ Issues with sparsity still being worked out in scikit (temporary bug fix for ridge in 0.17 using the sag solver)
RidgeCV options
CV - Control to choose type of cross validation
◦Default LOOCV
◦Integer value ‘n’ sets n-fold CV
◦You can provide your own data splits as well
Lasso(CV)/Lars(CV) options
Positive
◦Force coefficients to be positive

Other controls for iterations


◦Number of iterations (Lasso) / Number of non-zeros (Lars)
◦Tolerance to stop iterations (Lasso)
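A sketch of these options on the cross-validated estimators (parameter values chosen arbitrarily):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LarsCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

# positive forces non-negative coefficients; max_iter/tol control the iterations
lasso = LassoCV(positive=True, max_iter=5000, tol=1e-4, cv=5).fit(X, y)

# LarsCV limits the number of path steps rather than an iteration tolerance
lars = LarsCV(cv=5, max_n_alphas=100).fit(X, y)

print("LassoCV alpha:", lasso.alpha_, " LarsCV alpha:", lars.alpha_)
```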
Python example
Summary
Linear Models
◦Linear regression
◦Ridge – L2 penalty
◦Lasso – L1 penalty results in sparsity
◦LARS – Select a sparse set of features iteratively

Use Cross Validation (CV) to choose your models – Leverage scikit


◦RidgeCV, LarsCV, LassoCV

Not discussed – Explore scikit


◦Combining Ridge and Lasso: Elastic Nets
◦Random Sample Consensus (RANSAC)
◦ Fitting linear models where data has several outliers
◦LassoLars, lars_path
