
CSC380: Principles of Data Science

Linear Models

Prof. Jason Pacheco


TAs: Enfa Rose George, Saiful Islam Salim
Outline

• Linear Regression
• Least Squares Estimation
• Regularized Least Squares
• Logistic Regression
Linear Regression

Regression: Learn a function that predicts outputs from inputs,

    y = f(x)

Outputs y are real-valued.

Linear Regression: As the name suggests, uses a linear function:

    y = w x + b

We will add noise later…

[Figure: scatter of input X vs. output Y with a fitted line]
Linear Regression

Where is linear regression useful?

[Figures: trendlines, stock prediction, climate models (Massie and Rose, 1997)]

Used anywhere a linear relationship is assumed between continuous inputs / outputs.
Line Equation

Recall that the equation for a line has a slope and an intercept,

    y = w x + b    (slope w, intercept b)

• Intercept (b) indicates where the line crosses the y-axis
• Slope (w) controls the angle of the line
  • Positive slope: line goes up left-to-right
  • Negative slope: line goes down left-to-right
Moving to higher dimensions…

In higher dimensions: Line → Plane

There are multiple ways to define a plane; we will use:

• a normal vector (controls orientation)
• an in-plane vector (handles the offset)

The regression weights will take the place of the normal vector.

Source: http://www.songho.ca/math/plane/plane.html
Inner Products

Recall the definition of an inner product:

    w^T x = sum_d w_d x_d

Equivalently, the projection of one vector onto another,

    w^T x = ||w|| ||x|| cos(θ)

where the vector norm is ||w|| = sqrt(w^T w).
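A quick numerical check of these identities (a small NumPy sketch; the vectors are made up for illustration):

import numpy as np

w = np.array([1.0, 2.0, -0.5])
x = np.array([0.5, 1.0, 4.0])

dot = w @ x                          # inner product: sum_d w_d x_d
norm_w = np.linalg.norm(w)           # ||w|| = sqrt(w^T w)
norm_x = np.linalg.norm(x)
cos_theta = dot / (norm_w * norm_x)  # projection form: w^T x = ||w|| ||x|| cos(theta)

print(dot, norm_w * norm_x * cos_theta)  # both print the same value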
Linear Regression
[ Image: Murphy, K. (2012) ]

For a D-dimensional input vector x = (x_1, …, x_D), the plane equation is

    y = w_1 x_1 + … + w_D x_D + w_0 = w^T x + w_0

Often we simplify this by including the intercept in the weight vector: prepend a constant 1 to the input, so x = (1, x_1, …, x_D) and w = (w_0, w_1, …, w_D).

Since:  w^T x = w_0 + w_1 x_1 + … + w_D x_D
Linear Regression

The input-output mapping is not exact, so we add zero-mean Gaussian noise:

    y = w^T x + ε,   where ε ~ N(0, σ²)   (uncorrelated noise)

This is equivalent to the likelihood function,

    p(y | x, w) = N(y | w^T x, σ²)

because adding a constant to a Normal RV is still a Normal RV: if z ~ N(μ, σ²) then z + c ~ N(μ + c, σ²). In the case of linear regression, μ = 0 and c = w^T x.
Great, we're done right?

The model is y = w^T x + ε:

• Data (x, y) – we have this
• Noise ε – random; can't do anything about it
• Weights w – don't know these; need to learn them

We need to fit the model to data by learning the regression weights. How do we do this? What makes good weights?
Learning Linear Regression Models

There are several ways to think about fitting regression:

• Intuitive: Find a plane/line that is close to the data

• Functional: Find a line that minimizes the least squares loss

• Estimation: Find the maximum likelihood estimate of the parameters

They are all the same thing…


Fitting Linear Regression

Intuition: Find a line that is as close as possible to every training data point.

The distance from each point to the line is the residual.

[Figure: training outputs vs. predictions, with residuals shown]

https://www.activestate.com/resources/quick-reads/how-to-run-linear-regressions-in-python-scikit-learn/
Outline

• Linear Regression
• Least Squares Estimation
• Regularized Least Squares
• Logistic Regression
Least Squares Solution

Functional: Find a line that minimizes the sum of squared residuals.

Over all the training data,

    L(w) = sum_i ( y_i - w^T x_i )²

This is least squares regression.

https://www.activestate.com/resources/quick-reads/how-to-run-linear-regressions-in-python-scikit-learn/
Least Squares

This is just a quadratic function…

• Convex, unique minimum
• Minimum given by zero derivative
• Can find a closed-form solution

Let's see the scalar case with no bias:  L(w) = sum_i ( y_i - w x_i )²

Least Squares : Simple Case

Derivative (+ chain rule):   dL/dw = -2 sum_i x_i ( y_i - w x_i ) = 0

Distributive property:       sum_i x_i y_i = w sum_i x_i²

Algebra:                     w = ( sum_i x_i y_i ) / ( sum_i x_i² )
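A small numerical sketch of this scalar solution (NumPy; the data values are made up for illustration):

import numpy as np

# toy 1-D data with no bias term
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

# closed-form scalar least squares: w = sum(x*y) / sum(x^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # close to 1.0 for this toy data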
Least Squares in Higher Dimensions
[ Image: Murphy, K. (2012) ]

Things are a bit more complicated in higher dimensions and involve more linear algebra:

• Design matrix X (one training input per row)
• Vector of training labels y

We can write regression over all training data more compactly as an N x 1 vector of predictions,

    y_hat = X w
Least Squares in Higher Dimensions
[ Image: Murphy, K. (2012) ]

Least squares can also be written more compactly,

    L(w) = || y - X w ||²

Some slightly more advanced linear algebra gives us a solution,

    w_hat = ( X^T X )^{-1} X^T y

The derivation is a bit advanced for this class, but…
• We know it has a closed form and why
• We can evaluate it
• We generally know where it comes from

This is the Ordinary Least Squares (OLS) solution.
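A minimal sketch of the OLS solution in NumPy (synthetic data; np.linalg.solve is used rather than an explicit matrix inverse for numerical stability):

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # prepend a column of 1s for the intercept
w_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=N)

# OLS: w_hat = (X^T X)^{-1} X^T y, solved as a linear system
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true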
Learning Linear Regression Models

There are several ways to think about fitting regression:

• Intuitive: Find a plane/line that is close to the data

• Functional: Find a line that minimizes the least squares loss

• Estimation: Find the maximum likelihood estimate of the parameters

They are all the same thing…


MLE for Linear Regression

Given training data (x_1, y_1), …, (x_N, y_N), the likelihood function is given by

    p(y | X, w) = prod_i p(y_i | x_i, w)

Recall that the likelihood is Gaussian:  p(y_i | x_i, w) = N(y_i | w^T x_i, σ²)

So the MLE maximizes the log-likelihood over the whole data,

    w_MLE = argmax_w sum_i log N(y_i | w^T x_i, σ²)


Univariate Gaussian (Normal) Distribution

Gaussian (a.k.a. Normal) distribution with mean (location) μ and variance (scale) σ² parameters,

PDF:      N(x | μ, σ²) = (1 / sqrt(2π σ²)) exp( -(x - μ)² / (2σ²) )

The logarithm of the PDF is just a negative quadratic,

Log-PDF:  log N(x | μ, σ²) = -½ log(2π σ²) - (x - μ)² / (2σ²)
                             (constant in mean)  (quadratic function of mean)

Notation

The likelihood of the basic linear regression model is N(y | w^T x, σ²)…

…we will just look at learning the mean parameter for now.


MLE of Gaussian Mean

Assume the data y_1, …, y_N are i.i.d. univariate Gaussian with known variance,  y_i ~ N(μ, σ²).

Log-likelihood function:

    log p(y | μ) = -(N/2) log(2π σ²) - (1 / 2σ²) sum_i (y_i - μ)²
                   (the constant doesn't depend on the mean)

The MLE doesn't change when we:
1) Drop constant terms (in μ)
2) Minimize the negative log-likelihood

So the MLE estimate is the least squares estimator:  μ_hat = argmin_μ sum_i (y_i - μ)²
MLE of Linear Regression

Substitute the linear regression prediction μ_i = w^T x_i into the MLE solution and we have,

    w_MLE = argmin_w sum_i ( y_i - w^T x_i )²

So for Linear Regression,

    MLE = Least Squares Estimation

https://www.activestate.com/resources/quick-reads/how-to-run-linear-regressions-in-python-scikit-learn/
Multivariate Gaussian Distribution

We have only seen scalar (1-dimensional) X, but MLE is still least squares for higher-dimensional X…

Let x ∈ R^D with mean μ and positive semidefinite covariance matrix Σ; then the PDF is

    N(x | μ, Σ) = (2π)^{-D/2} |Σ|^{-1/2} exp( -½ (x - μ)^T Σ^{-1} (x - μ) )

Again, the logarithm is a negative quadratic form,

    log N(x | μ, Σ) = -½ log( (2π)^D |Σ| ) - ½ (x - μ)^T Σ^{-1} (x - μ)
                      (constant in mean)     (quadratic function of mean)


Multivariate Quadratic Form

The quadratic form for vectors is given by an inner product,

    (x - μ)^T Σ^{-1} (x - μ)

For iid data the MLE of the Gaussian mean is once again least squares,

    μ_hat = argmin_μ sum_i (x_i - μ)^T Σ^{-1} (x_i - μ)

• Strongly convex
• Differentiable
• Unique optimizer at zero gradient
Notation

Substituting the multi-dimensional linear regression mean μ_i = w^T x_i…

…brings us back to the least squares solution.


MLE of Linear Regression
[ Image: Murphy, K. (2012) ]

Using the previous results, MLE is equivalent to minimizing the squared residuals,

    w_MLE = argmin_w || y - X w ||²

Some slightly more advanced linear algebra gives us a solution,

    w_hat = ( X^T X )^{-1} X^T y

The derivation is a bit advanced for this class, but…
• We know it has a closed form and why
• We can evaluate it
• We generally know where it comes from

This is the Ordinary Least Squares (OLS) solution.
Linear Regression Summary

1. Definition of the linear regression model,

       y = w^T x + ε,   where ε ~ N(0, σ²)

2. For N iid training data, fit using least squares,

       w_hat = argmin_w sum_i ( y_i - w^T x_i )²

3. Equivalent to the maximum likelihood solution

Linear Regression Summary

The ordinary least squares solution is solved in closed form using the Normal equations,

    w_hat = ( X^T X )^{-1} X^T y

with design matrix X (one training input per row) and vector of training labels y.

QUESTIONS?
A word on matrix inverses…

The least squares solution requires inversion of the term X^T X.

What are some issues with this?

1. Requires O(D³) time for D input features

2. May be numerically unstable (or even non-invertible):
   small numerical errors in the input can lead to large errors in the solution
Pseudoinverse

The Moore-Penrose pseudoinverse of X is denoted X⁺.

• Generalization of the standard matrix inverse
• Exists even for non-invertible X^T X
• Directly computable in most libraries
• In NumPy it is: numpy.linalg.pinv
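A short sketch of using the pseudoinverse for least squares (NumPy; synthetic data for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)

# w_hat = X^+ y  -- works even when X^T X is (near-)singular
w_hat = np.linalg.pinv(X) @ y
print(w_hat)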
Linear Regression in Scikit-Learn

Load your libraries, load the data, and make a train / test split (the test split is held out for evaluation).
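The slide's code is not reproduced here; a minimal sketch of these steps (using the bundled diabetes dataset as a stand-in) might look like:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split   # for evaluation on held-out data

# Load data
X, y = load_diabetes(return_X_y=True)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)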


Linear Regression in Scikit-Learn

Train (fit) the model, predict, and plot the regression line against the test set.
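A self-contained sketch of this step (a single feature is used so the regression line is easy to plot; dataset and names are illustrative):

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X = X[:, [2]]                                     # single feature, so the line is easy to plot
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # train (fit)
y_pred = model.predict(X_test)                    # predict

plt.scatter(X_test, y_test, label="test data")    # plot regression line with the test set
plt.plot(X_test, y_pred, color="red", label="fitted line")
plt.legend()
plt.show()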


Outline

• Linear Regression
• Least Squares Estimation
• Regularized Least Squares
• Logistic Regression
Outliers

How does an outlier affect the estimator?

[Figure: squared error of each data point]
Outliers in Linear Regression

An outlier "pulls" the regression line away from the inlier data.

We need a way to ignore, or to down-weight, the impact of the outlier.

[Figure: regression line dragged toward a single outlier]

https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/mlr-residual-analysis-and-outliers.html
Dealing with Outliers

Too many outliers can indicate many things: non-Gaussian (heavy-tailed) data, corrupt data, bad data collection, …

A few ways to handle outliers:

1. Use a heavy-tailed noise distribution (Student's t)
   Fitting the regression becomes difficult
2. Identify outliers and discard them
   NP-hard, and throwing away data is generally bad
3. Penalize large weights to avoid overfitting (Regularization)


Regularization

Recall, regularization helps avoid overfitting the training data…

    min_w  L(w) + λ R(w)
           (λ: regularization strength, R(w): regularization penalty)

[Figure: red model is without regularization; green model includes regularization]
Regularized Least Squares

Ordinary least-squares estimation (no regularizer) — we already know how to solve this:

    min_w || y - X w ||²

L2-regularized least-squares (Ridge) — quadratic penalty:

    min_w || y - X w ||² + λ ||w||₂²

L1-regularized least-squares (LASSO) — absolute value (L1) penalty:

    min_w || y - X w ||² + λ ||w||₁


A word on vector norms…

The L2-norm (Euclidean norm) of a vector w is,

    ||w||₂ = sqrt( sum_d w_d² )

The L1-norm (absolute value) of a vector w is,

    ||w||₁ = sum_d |w_d|

They are not the same function…


Other Regularization Terms

A more general regularization penalty,

    R(w) = sum_d |w_d|^q

[Figure: contours of the penalty for several q — q < 1 is not a norm and thus not convex; L1 (q = 1) is non-differentiable; q = 2 is L2 regularization]


Administrative Items

• HW7 out Thursday (Due next Thursday)

• HW6 due tonight



Regularized Least Squares

A couple of regularizers are so common that they have specific names:

L2 Regularized Linear Regression


• Ridge Regression
• Tikhonov Regularization

L1 Regularized Linear Regression


• LASSO
• Stands for: Least Absolute Shrinkage and Selection Operator
L2 Regularized Least Squares

    min_w || y - X w ||² + λ ||w||₂²
          (quadratic)      (quadratic)

Quadratic + Quadratic = Quadratic

• Differentiable
• Convex
• Unique optimum
• Closed-form solution
L2 Regularized Least Squares : Simple Case

Scalar case with no bias:  L(w) = sum_i ( y_i - w x_i )² + λ w²

Derivative (+ chain rule):   dL/dw = -2 sum_i x_i ( y_i - w x_i ) + 2 λ w = 0

Distributive property:       sum_i x_i y_i = w ( sum_i x_i² + λ )

Algebra:                     w = ( sum_i x_i y_i ) / ( sum_i x_i² + λ )
L2 Regularized Linear Regression – Ridge Regression
Source: Kevin Murphy's Textbook

After some algebra…

    w_ridge = ( X^T X + λ I )^{-1} X^T y

Compare to ordinary least squares:

    w_OLS = ( X^T X )^{-1} X^T y

Regularized least squares includes a pseudocount λ in the weighting, similar to the Gaussian mean estimator.
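A minimal NumPy sketch of the ridge closed form (synthetic data; lam stands in for the regularization strength λ):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=60)

lam = 1.0
D = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)  # (X^T X + lam*I)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))  # ridge weights are shrunk toward zero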
Notes on L2 Regularization

• Feature weights are "shrunk" towards zero (and each other) – statisticians often call this a "shrinkage" method
• Typically we do not penalize the bias (y-intercept, w0) parameter
• Penalizing w0 would make the solution depend on the origin chosen for Y – adding a constant c to Y would not simply shift the predictions by the same constant
• Can fit the bias in a two-step procedure: center the features, then the bias estimate is w0 = mean(y)
• Solutions are not invariant to scaling, so typically we standardize (e.g. Z-score) features before fitting the model ( Sklearn StandardScaler )
Scikit-Learn : L2 Regularized Regression

The alpha parameter is what we have been calling λ.

Scikit-Learn : L2 Regularized Regression

Define and fit OLS and L2 (Ridge) regression, then plot the results.

L2 (Ridge) reduces the impact of any single data point.
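The slide's code is not reproduced; a sketch of defining and fitting OLS and Ridge side by side (synthetic data, illustrative values):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 1))
y = 2.0 * X[:, 0] + 0.3 * rng.normal(size=30)
y[0] += 10.0                        # inject a single outlier

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)  # alpha plays the role of lambda

print(ols.coef_, ridge.coef_)       # the ridge coefficient is shrunk relative to OLS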


Choosing Regularization Strength

We need to tune the regularization strength λ to avoid over/under fitting…

Recall the bias/variance tradeoff:

    Error = Irreducible error + Bias² + Variance

High regularization reduces model complexity: it increases bias / decreases variance.

How should we properly tune λ?


Cross-Validation

N-fold Cross Validation: Partition the training data into N "chunks" and for each run select one chunk to be the validation data.

For each run, fit to the training data (N-1 chunks) and measure accuracy on the validation set. Average the model error across all runs.

Drawback: Need to perform training N times.

Source: Bishop, C. PRML
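A sketch of tuning alpha by cross-validation with scikit-learn (the grid of alphas and the dataset are arbitrary stand-ins):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

alphas = np.logspace(-3, 3, 13)
mean_scores = []
for a in alphas:
    # 10-fold cross-validation score (R^2 by default for regressors)
    scores = cross_val_score(Ridge(alpha=a), X, y, cv=10)
    mean_scores.append(scores.mean())

best_alpha = alphas[int(np.argmax(mean_scores))]
print(best_alpha)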


Model Selection for Linear Regression

A couple of common metrics for model selection…

Residual Sum-of-Squared Errors: The total squared residual error on the held-out validation set,

    RSS = sum_i ( y_i - y_hat_i )²

Coefficient of Determination: Also called R-squared or R². The fraction of variation explained by the model.

Model selection metrics are known as "goodness of fit" measures.


Coefficient of Determination R²

    R² = 1 - RSS / TSS
       = 1 - sum_i ( y_i - y_hat_i )² / sum_i ( y_i - y_bar )²

The numerator is the residual sum-of-squares; the denominator is the total variance in the dataset (the variance when using the average prediction), where y_bar is the average output.
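A quick sketch checking the R² definition against scikit-learn's implementation (toy numbers chosen arbitrarily):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 1.0, 4.0, 1.5, 5.0])
y_pred = np.array([2.8, 1.2, 3.5, 1.9, 4.6])

rss = np.sum((y_true - y_pred) ** 2)
tss = np.sum((y_true - y_true.mean()) ** 2)
print(1 - rss / tss, r2_score(y_true, y_pred))  # the two values agree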


Coefficient of Determination R²

The maximum value R² = 1.0 means the model explains all variation in the data.

R² = 0 means the model is only as good as predicting the average response.

R² < 0 means the model is worse than predicting the average output.

[Figure: example fits with R² > 0 and R² = 0]
"Shrinkage" Feature Selection

Down-weight features that are not useful for prediction…

The quadratic penalty down-weights (shrinks) features that are not useful for prediction.

Example: The Prostate Cancer Dataset measures prostate-specific antigen, with features: age, log prostate weight (lweight), log benign prostate hyperplasia (lbph), Gleason score (gleason), seminal vesicle invasion (svi), etc.

L2 regularization learns a (near-)zero weight for log capsular penetration (lcp).

[ Source: Hastie et al. (2001) ]


Constrained Optimization Perspective

Intuition: Find the best model (lowest RSS) given a constraint on the total feature weight norm,

    min_w || y - X w ||²   subject to   ||w||₂² ≤ t

For some t(λ) this is a mathematically equivalent formulation of the penalized problem.

L2 penalized regression rarely learns feature weights that are exactly zero…

[Figure: squared-error contours around the optimal model, with the circular L2 constraint region]
[ Source: Hastie et al. (2001) ]
Regularized Least Squares

Ordinary least-squares estimation (no regularizer):

    min_w || y - X w ||²

L2-regularized least-squares (Ridge) — quadratic penalty:

    min_w || y - X w ||² + λ ||w||₂²

L1-regularized least-squares (LASSO) — absolute value (L1) penalty:

    min_w || y - X w ||² + λ ||w||₁

L1 Regularized Least-Squares

[Figure: squared-error contours around the optimal model with the diamond-shaped L1 constraint region; the constrained solution learns w2 = 0]

L1 is able to zero out weights that are not predictive…


Feature Weight Profiles

Varying the regularization parameter moderates the shrinkage factor.

For moderate regularization strength, the weights for many features go to zero.

• Induces feature sparsity
• Ideal for high-dimensional settings
• Gracefully handles the p > N case, for p features and N training data

Feature Weight Profiles

[Figure: feature weight profiles as a function of regularization strength, L1 penalty vs. L2 penalty]
Learning L1 Regularized Least-Squares

The absolute value penalty is not differentiable…

…its derivative doesn't exist at x = 0.

We can't set derivatives to zero as in the L2 case!
Learning L1 Regularized Least-Squares

• Not differentiable, no closed-form solution

• But it is convex! Can be solved by quadratic programming (beyond the scope of this class…)

• Efficient optimization algorithms exist

• Least Angle Regression (LAR) computes the full solution path for a range of λ values

• Can be solved as efficiently as L2 regression

Specialized methods for cross-validation exist:

• one computes the solution using coordinate descent
• another uses least angle regression (LARS) to compute the solution path


L1 Regression Cross-Validation

Perform L1 least squares (LASSO) 20-fold cross-validation, using either the coordinate-descent or the LARS-based estimator.

Plot the solution path for the range of alphas:

• all candidate values are stored in alphas_
• the learned value is alpha_ (no "s"… annoying…)
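The slide's code is not shown; a sketch with scikit-learn's cross-validated LASSO estimators (LassoCV uses coordinate descent, LassoLarsCV uses LARS; the dataset here is just a stand-in):

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, LassoLarsCV

X, y = load_diabetes(return_X_y=True)

lasso = LassoCV(cv=20).fit(X, y)          # coordinate descent
# lasso = LassoLarsCV(cv=20).fit(X, y)    # or: least angle regression (LARS)

print(lasso.alphas_.shape)                # all alphas_ tried
print(lasso.alpha_)                       # learned alpha_ (no "s")

# mean cross-validated error along the regularization path
plt.plot(lasso.alphas_, lasso.mse_path_.mean(axis=1))
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("mean CV MSE")
plt.show()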


Example: Prostate Cancer Dataset

The best LASSO model learns to ignore several features (age, lcp, gleason, pgg45).

Wait… is age really not a significant predictor of prostate cancer? What's going on here?

Age is highly correlated with other factors, and thus not significant in the presence of those factors.
Administrative Items

HW7 will be posted tonight


• Ordinary least squares regression
• Ridge regression
• Lasso
• Feature selection

Due next Thursday (11/11)


• A bit more is left up to the student compared to HW5 / HW6
Best-Subset Selection

L1 / L2 shrinkage offers approximate feature selection…

The optimal strategy for p features looks at models over all possible combinations of features (see the sketch after the pseudocode):

For k in 1,…,p:
    subsets = all subsets of k features (p-choose-k of them)
    For kfeat in subsets:
        model = Train model on the features in kfeat
        score = Evaluate model using cross-validation
Choose the model with the best cross-validation score
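A runnable sketch of this search using itertools and scikit-learn (only feasible for small p; the dataset and estimator are illustrative choices):

from itertools import combinations

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
p = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, p + 1):
    for subset in combinations(range(p), k):          # all p-choose-k subsets of size k
        cols = list(subset)
        score = cross_val_score(LinearRegression(), X[:, cols], y, cv=10).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)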
Best-Subset Selection : Prostate Cancer Dataset

Each marker is the cross-validation R² score of a trained model for one subset of features.

The data have 8 features; there are 8-choose-k subsets for each k = 1,…,8, for a total of 255 models.

Using 10-fold cross-validation requires 10 x 255 = 2,550 training runs!
Feature Selection: Prostate Cancer Dataset
Best subset has highest test accuracy (lowest
variance) with just 2 features

[ Source: Hastie et al. (2001) ]


Comparing Feature Selection Methods

Notation change: in the figures from Hastie et al., the least squares weights are written β rather than w.
rather than .
Forward Sequential Selection

An efficient method adds the most predictive feature one at a time:

featSel = empty
featUnsel = all features
For iter in 1,…,p:
    For kfeat in featUnsel:
        thisFeat = featSel + kfeat
        model = Train model on thisFeat features
        score = Evaluate model using cross-validation
    featSel = featSel + best scoring feature
    featUnsel = featUnsel - best scoring feature
Choose the model with the best cross-validation score
Backward Sequential Selection

The backwards approach starts with all features and removes them one at a time:

featSel = all features
For iter in 1,…,p:
    For kfeat in featSel:
        thisFeat = featSel - kfeat
        model = Train model on thisFeat features
        score = Evaluate model using cross-validation
    featSel = featSel - worst scoring feature (the one whose removal hurts the score least)
Choose the model with the best cross-validation score
Comparing Feature Selection Methods

Sequential selection is greedy, but often performs well…

Example: Feature selection on a synthetic model with p=30 features with pairwise correlations of 0.85. True feature weights are all zero except for 10 features, with weights drawn from N(0, 6.25).

Sequential selection with p features takes O(p²) time (model fits), compared to exponential time for best subset.

Sequential feature selection is available in Scikit-Learn under:

feature_selection.SequentialFeatureSelector
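A short sketch of forward selection with this class (the dataset and parameter values are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,   # how many features to keep
    direction="forward",      # or "backward"
    cv=10,
)
sfs.fit(X, y)
print(sfs.get_support())      # boolean mask of the selected features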
Outline

• Linear Regression
• Least Squares Estimation
• Regularized Least Squares
• Logistic Regression
Classification as Regression
Suppose our response variables are binary y={0,1}. How can we use
linear regression ideas to solve this classification problem?

https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
Classification as Regression

Idea: Fit a regression function to the data (red). Classify points based on whether they are above or below the midpoint (green).

• This is a discriminant function, since it discriminates between classes
• It is a linear function and so is a linear discriminant
• The green line is the decision boundary (also linear)

https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
Multiclass Classification as Regression

Suppose we have K classes. The training output for each example is an indicator vector, with Y_k = 1 if the example is in class k, e.g. Y = (0,0,…,1,0,0).

For N training inputs, create the N x K matrix of outputs Y and solve

    W_hat = ( X^T X )^{-1} X^T Y

W collects K linear regression models, one for each class. This is an instance of multi-output linear regression.

• Compute the fitted output, a K-vector:  f_hat(x) = W_hat^T x
• Identify the largest component and classify as:  y_hat(x) = argmax_k f_hat_k(x)

[ Image: Hastie et al. (2001) ]


Linear Probability Models

Binary Classification: The linear model approximates the probability of class assignment,

    p(y = 1 | x) ≈ w^T x

Multiclass Classification: Multiple decision boundaries, each approximated by a class-specific linear model,

    f_k(x) = w_k^T x,   where w_k is the weight vector for class k

which approximates the probability of class assignment,

    p(y = k | x) ≈ w_k^T x


What's the rationale?

Recall the linear regression model, y = w^T x + ε with E[ε] = 0.

So linear regression models the expected value,

    E[ y | x ] = w^T x

For discrete values y ∈ {0,1} we have that,

    E[ y | x ] = p(y = 1 | x)

We can call this approach least squares classification. One can easily verify that the class "probabilities" sum to 1, but they are not guaranteed to be positive!


Logistic Regression

Idea: Distort the response variable in some way to map it to [0,1] so that it is actually a probability.

Uses the logistic function,

    σ(a) = 1 / (1 + e^{-a}),    p(y = 1 | x) = σ( w^T x )

• The logistic function is a type of sigmoid or squashing function, since it maps any value to the range [0,1]

• The predictor now actually maps to a valid probability mass function (PMF), with p(y=0|x) = 1 - p(y=1|x)

https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
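A tiny sketch of squashing a linear score with the logistic function (NumPy; arbitrary weights):

import numpy as np

def sigmoid(a):
    # logistic function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])

p1 = sigmoid(w @ x)      # P(y = 1 | x)
print(p1, 1.0 - p1)      # the two class probabilities sum to 1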
Logistic Regression : Decision Boundary

Binary classification decisions are based on the posterior odds ratio,

    p(C=1 | x) / p(C=0 | x)

If this ratio is greater than 1.0 then classify as C=1, otherwise C=0.

In practice, we use the (natural) logarithm of the posterior odds ratio,

    log [ p(C=1 | x) / p(C=0 | x) ] = w^T x

This is a linear decision boundary, so logistic regression is a linear classifier.


Logistic vs. Logit Transformations

Logistic function: maps ℝ to [0,1].   Logit function: maps [0,1] to ℝ.

The logistic is also widely used in neural networks – for classification, the last layer is typically just a logistic regression.
Logistic vs. Logit Transformations

The logistic function maps the linear regression to the interval [0,1],

    p = σ( w^T x ) = 1 / (1 + e^{-w^T x})

The logit function is defined for probability values p in [0,1] as,

    logit(p) = log( p / (1 - p) )

Logit is the inverse of the logistic function. Logit is also the log posterior odds ratio, and thus the decision boundary for our binary classifier.
Multiclass Logistic Regression

The classification decision is based on the log-ratio compared to the final class,

    log [ p(y = k | x) / p(y = K | x) ] = w_k^T x,   k = 1, …, K-1

The K-1 log-odds (or logit) transformations ensure the probabilities sum to 1.

The choice of denominator class is arbitrary, but we use K by convention.


Least Squares vs. Logistic Regression

[Figure: decision boundaries from least squares vs. logistic regression]

• Both models learn a linear decision boundary
• Least squares can be solved in closed form (convex objective)
• Least squares is sensitive to outliers (need to do regularization)

[Source: Bishop "PRML"]
Least Squares vs. Logistic Regression

Similar results in 1-dimension

https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
Least Squares vs. Logistic Regression

[Figure: least squares (left) vs. logistic regression (right)]

[Source: Bishop "PRML"]


Fitting Logistic Regression

Fit by maximum likelihood—start with the binary case. The posterior probability of class assignment is Bernoulli,

    p(y | x, w) = σ(w^T x)^y ( 1 - σ(w^T x) )^{1-y}

Given N iid training data pairs (x_i, y_i), the log-likelihood function is,

    ℓ(w) = sum_i [ y_i log σ(w^T x_i) + (1 - y_i) log( 1 - σ(w^T x_i) ) ]


Fitting Logistic Regression

Computing the derivatives with respect to each element w_d,

    ∂ℓ/∂w_d = sum_i x_{id} ( y_i - σ(w^T x_i) )

• For D features this gives us D equations and D unknowns
• But the equations are nonlinear and can't be solved in closed form
• Need to use gradient-based optimization (e.g. Newton's method)
• Beyond the scope of this class; but know that it is an iterative process
Iteratively Reweighted Least Squares

• Given some estimate w_old of the weights, update by solving,

      w_new = ( X^T W X )^{-1} X^T W z

  with design matrix X (N x D) and N x N diagonal weight matrix W = diag( p_i (1 - p_i) ),
  where z = X w_old + W^{-1} ( y - p ) is the working response (a step in the gradient direction)
  and p_i = P(y = 1 | x_i) for each training point.

• Essentially solving a reweighted version of least squares; each iteration changes W and p, so we need to re-solve.
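A compact sketch of the IRLS update in NumPy (a didactic version, not a production solver; a small eps term is added for numerical stability):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(X, y, n_iter=20, eps=1e-6):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        p = sigmoid(X @ w)                         # P(y=1|x) for each training point
        W = np.diag(p * (1 - p))                   # N x N diagonal weight matrix
        z = X @ w + (y - p) / (p * (1 - p) + eps)  # working response z = Xw + W^{-1}(y - p)
        # reweighted least squares update: w = (X^T W X)^{-1} X^T W z
        w = np.linalg.solve(X.T @ W @ X + eps * np.eye(D), X.T @ W @ z)
    return w

# toy usage on synthetic binary data
rng = np.random.default_rng(4)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = (rng.uniform(size=200) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)
print(irls_logistic(X, y))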
Choice of Optimizer

Since logistic regression requires an optimizer, there are more parameters to consider.

The choice of optimizer and its parameters can affect the time to fit the model (especially if there are many features).

https://www.datasciencecentral.com/profiles/blogs/an-overview-of-gradient-descent-optimization-algorithms
Scikit-Learn Logistic Regression

The function predict_proba(X) returns the predicted class assignment probabilities (effectively a single probability per point in the binary case).

https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
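A minimal sketch (synthetic binary data; default solver settings):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X[:5])   # shape (5, 2): columns are P(y=0|x) and P(y=1|x)
labels = clf.predict(X[:5])        # thresholded class labels
print(probs, labels)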
Using Logistic Regression

The role of Logistic Regression differs in ML and Data Science,


• In Machine Learning we use Logistic Regression for building predictive classification models
• In Data Science we use it for understanding how features relate to data classes / categories

Example: South African Heart Disease (Hastie et al. 2001)

The data come from the Coronary Risk-Factor Study in 3 rural areas of South Africa. Data are from white men aged 15-64 years, and the response is presence/absence of myocardial infarction (MI). How predictive is each of the features?
Looking at Data

Each scatterplot shows a pair of risk factors, with cases with MI (red) and without (cyan).

Features:
• Systolic blood pressure
• Tobacco use
• Low density lipoprotein (ldl)
• Family history (discrete)
• Obesity
• Alcohol use
• Age

[Source: Hastie et al. (2001)]


Example: African Heart Disease

Fit logistic regression to the data using the MLE, estimated via iteratively reweighted least squares.

The standard error is the estimated standard deviation of the learned coefficients.

Recall, the Z-score of a weight (coefficient / standard error) is approximately standard Normal under the null hypothesis,

so anything with |Z-score| > 2 is significant at the 5% level.


Example: African Heart Disease

Finding: Systolic blood pressure (sbp) is not a significant predictor.

Obesity is not significant, and is negatively correlated with heart disease in the model.

Remember: All correlations / significance of features are conditional on the presence of the other features. We must always consider that features may be strongly correlated.
Example: African Heart Disease

Doing some feature selection, we find a model with 4 features: tobacco, ldl, family history, and age.

How do we interpret coefficients? (e.g. tobacco → 0.081)

• Tobacco is measured in total lifetime usage (in kg)
• Thus, an increase of 1 kg of lifetime tobacco multiplies the odds by exp(0.081) ≈ 1.084, or an 8.4% increase in the odds of coronary heart disease
• The 95% CI is 3% to 14%, since exp(0.081 ± 2 × SE) ≈ (1.03, 1.14)
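A quick check of this odds-ratio arithmetic (the standard error here is a placeholder chosen to reproduce the quoted interval, not a value taken from the slide):

import numpy as np

beta = 0.081          # fitted tobacco coefficient (log-odds per kg)
se = 0.026            # placeholder standard error, roughly consistent with the quoted CI

odds_ratio = np.exp(beta)
ci = np.exp([beta - 2 * se, beta + 2 * se])
print(odds_ratio)     # ~1.084 -> ~8.4% increase in odds per kg
print(ci)             # ~[1.03, 1.14] -> 3% to 14%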
