Linear Regression

Outline: Linear Models
• Linear Regression
• Logistic Regression

Linear Regression
Slope Intercept
A line in slope-intercept form: $y = mx + b$, with slope $m$ and intercept $b$.
Source: http://www.songho.ca/math/plane/plane.html
Inner Products
$\langle x, w \rangle = x^\top w = \sum_{i=1}^d x_i w_i$, where $x, w \in \mathbb{R}^d$.
Vector Norm
$\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^d x_i^2}$
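A quick numpy illustration of these two definitions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])

inner = x @ w               # <x, w> = sum_i x_i * w_i
norm = np.sqrt(x @ x)       # ||x||_2 = sqrt(x^T x)
print(inner, norm)
print(np.linalg.norm(x))    # same L2 norm via the library helper
```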
Linear Regression
[ Image: Murphy, K. (2012) ]
Model: the output $y$ is a linear function of the input $x$ plus noise,
$y = w^\top x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is (uncorrelated) Gaussian noise.
Since $\epsilon$ is Gaussian, the likelihood is $p(y \mid x, w) = \mathcal{N}(y \mid w^\top x, \sigma^2)$.
We don't know the weights $w$; we need to learn them. What makes good weights, and how do we learn them?
Learning Linear Regression Models
Least Squares Solution
Minimize the sum of squared errors: $J(w) = \sum_{i=1}^n (y_i - w x_i)^2$.
Expanding with the distributive property: $J(w) = \sum_i y_i^2 - 2w \sum_i x_i y_i + w^2 \sum_i x_i^2$.
Setting the derivative to zero and solving (algebra): $\frac{dJ}{dw} = -2\sum_i x_i y_i + 2w \sum_i x_i^2 = 0 \;\Rightarrow\; \hat{w} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$.
Least Squares in Higher Dimensions
In matrix form, with rows $x_i^\top$ stacked into $X$: $\hat{w} = \arg\min_w \|y - Xw\|^2 = (X^\top X)^{-1} X^\top y$.
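A minimal numpy sketch of the closed-form estimate (the synthetic data here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# w_hat = (X^T X)^{-1} X^T y, computed via solve() rather than an explicit inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to [2.0, -1.0]
```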
MLE of Linear Regression
Recall that the likelihood is Gaussian:
$p(y \mid x, w) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - w^\top x)^2}{2\sigma^2}\right)$
The logarithm of the PDF is just a negative quadratic:
$\log p(y \mid x, w) = -\frac{1}{2\sigma^2}(y - w^\top x)^2 - \frac{1}{2}\log(2\pi\sigma^2)$
Log-likelihood function:
$\ell(w) = \sum_{i=1}^n \log p(y_i \mid x_i, w) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \text{const}$
The constant doesn't depend on the mean. The MLE doesn't change when we:
1) Drop constant terms (in $w$)
2) Minimize the negative log-likelihood instead of maximizing the log-likelihood
So the MLE estimate is the least squares estimator: $\hat{w}_{\text{MLE}} = \arg\min_w \sum_i (y_i - w^\top x_i)^2$.
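As a numerical sanity check (synthetic data, illustrative only): minimizing the Gaussian negative log-likelihood recovers the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

sigma2 = 0.3 ** 2  # assume the noise variance is known

def neg_log_lik(w):
    resid = y - X @ w
    # negative Gaussian log-likelihood; the constant term is kept for clarity
    return 0.5 * np.sum(resid ** 2) / sigma2 + 0.5 * len(y) * np.log(2 * np.pi * sigma2)

w_mle = minimize(neg_log_lik, x0=np.zeros(3)).x
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_mle)  # essentially identical to the least-squares estimate below
print(w_ls)
```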
https://www.activestate.com/resources/quick-reads/how-to-run-linear-regressions-in-python-scikit-learn/
Multivariate Gaussian Distribution
We have only seen scalar (1-dimensional) X, but MLE is still least squares for higher-dimensional X…
For Evaluation
Steps: load your libraries, load the data, fit the model, and plot the results.
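A minimal sketch of these steps with scikit-learn (the synthetic data and variable names are illustrative assumptions, not the original demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Load (here: generate) data: output Y with Gaussian noise around a line
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 100)

# Fit the model and inspect the learned weights
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept
print(model.score(X, y))              # R^2 on the training data
```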
Outliers
How does an outlier affect the estimator?
The squared error grows quadratically with the residual, so a single outlier can dominate the total loss.
Outliers in Linear Regression
[ Figure: an outlier "pulls" the regression line away from the inlier data ]
We need a way to ignore or to down-weight the impact of the outlier.
https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/mlr-residual-analysis-and-outliers.html
Dealing with Outliers
Regularized Least Squares
Ordinary least-squares estimation (no regularizer): $\hat{w} = \arg\min_w \|y - Xw\|^2$. We already know how to solve this…
Adding a quadratic penalty gives L2-regularized least squares (ridge):
$\hat{w}_{\text{ridge}} = \arg\min_w \|y - Xw\|^2 + \lambda \|w\|^2$
Expanding with the distributive property and setting the gradient to zero (algebra) gives the closed form:
$\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$

L2 Regularized Linear Regression – Ridge Regression
Source: Kevin Murphy’s Textbook
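A hedged sketch (synthetic data assumed) checking the ridge closed form against scikit-learn's Ridge:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)

lam = 1.0
# Closed form: (X^T X + lambda I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_sklearn, atol=1e-6))  # True
```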
R² (Coefficient of Determination)
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
The denominator is the total variance in the dataset, i.e., the error of a model that always uses the average prediction $\bar{y}$.
Maximum value is $R^2 = 1$; $R^2 = 0$ means the model is only as good as predicting the average response.
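Computing R² by hand (toy numbers, illustrative):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total variance (average prediction)
print(1 - ss_res / ss_tot)            # R^2; 0 would mean "as good as the mean"
```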
Total Weight Norm Constraint
There exists a mathematically equivalent formulation: for some $t(\lambda)$, ridge solves $\min_w \|y - Xw\|^2$ subject to $\|w\|_2^2 \le t$, a budget on the total weight norm.
[ Figure: optimal model under the weight-norm constraint. Source: Hastie et al. (2001) ]
L2-penalized regression rarely learns feature weights that are exactly zero…
L1-regularized Least-Squares (Lasso)
Replacing the quadratic (L2) penalty on the squared error with an L1 penalty gives
$\hat{w}_{\text{lasso}} = \arg\min_w \|y - Xw\|^2 + \lambda \|w\|_1$
[ Figure: squared-error contours with the L1 penalty vs. L2 penalty constraint regions; under the L1 penalty the optimal model learns $w_2 = 0$ exactly ]
Varying the regularization parameter moderates the shrinkage factor.
Learning L1 Regularized Least-Squares
The L1 penalty is not differentiable at $w_i = 0$, so there is no closed-form solution; iterative solvers (e.g., coordinate descent over a path of regularization strengths) are used instead.
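In scikit-learn, LassoCV fits over a path of regularization strengths (exposed as alphas_); a sketch with assumed synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                    # sparse ground truth
y = X @ w_true + rng.normal(scale=0.1, size=100)

lasso = LassoCV(cv=5).fit(X, y)
print(lasso.alphas_.shape)  # the grid of alphas that was searched
print(lasso.coef_)          # spurious coefficients typically driven exactly to zero
```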
Best-Subset Selection
The optimal strategy for p features looks at models over all possible combinations of features:

For k in 1,…,p:
    subsets = compute all subsets of k features (p-choose-k)
    For kfeat in subsets:
        model = train model on features kfeat
        score = evaluate model using cross-validation
Choose the model with the best cross-validation score
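A runnable version of the pseudocode above, on an assumed synthetic dataset:

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))                       # p = 4 features
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=80)

best_score, best_feats = -np.inf, None
p = X.shape[1]
for k in range(1, p + 1):
    for feats in combinations(range(p), k):        # p-choose-k subsets
        feats = list(feats)
        score = cross_val_score(LinearRegression(), X[:, feats], y, cv=5).mean()
        if score > best_score:
            best_score, best_feats = score, feats
print(best_feats, best_score)                      # expect features [0, 2]
```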
Best-Subset Selection: Prostate Cancer Dataset
Forward Sequential Selection
The greedy approach starts with no features and adds the best-scoring one at each step:

featSel = empty
featUnsel = all features
For iter in 1,…,p:
    For kfeat in featUnsel:
        thisFeat = featSel + kfeat
        model = train model on features thisFeat
        score = evaluate model using cross-validation
    featSel = featSel + best scoring feature
    featUnsel = featUnsel − best scoring feature
Choose the model with the best cross-validation score
Backward Sequential Selection
The backward approach starts with all features and removes them one-by-one. Both directions are sketched below.
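Both greedy directions are available in scikit-learn's SequentialFeatureSelector; a sketch on synthetic data (standing in for the prostate dataset):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=80)

for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                    direction=direction, cv=5).fit(X, y)
    print(direction, np.flatnonzero(sfs.get_support()))  # selected feature indices
```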
Logistic Regression
Classification as Regression
Suppose our response variables are binary, $y \in \{0, 1\}$. How can we use linear regression ideas to solve this classification problem?
https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
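To illustrate the idea (synthetic 1-D data assumed): fit linear regression to the 0/1 labels and threshold the prediction at 0.5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

reg = LinearRegression().fit(X, y)
y_hat = (reg.predict(X) >= 0.5).astype(int)  # threshold the real-valued output
print((y_hat == y).mean())                   # accuracy; outliers would distort this
```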
Multiclass Classification as Regression
Suppose we have K classes. The training outputs for each class are a set of indicator vectors: $y \in \{0,1\}^K$, with $y_k = 1$ for the true class and zeros elsewhere.
• The predictor variable now actually maps to a valid probability mass function (PMF).
https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
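A minimal illustration of indicator (one-hot) target vectors:

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # class indices, K = 3
Y = np.eye(3)[labels]            # each row is an indicator vector summing to 1
print(Y)
```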
Logistic Regression: Decision Boundary
Binary classification decisions are based on the posterior odds ratio:
$\log \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = w^\top x$
The logit, $\mathrm{logit}(p) = \log \frac{p}{1-p}$, is the inverse of the logistic function $\sigma(a) = \frac{1}{1+e^{-a}}$. The logit is also the log-likelihood ratio, so the decision boundary for our binary classifier is $w^\top x = 0$.
Multiclass Logistic Regression
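For reference, the standard softmax formulation with one weight vector $w_k$ per class (a standard identity, stated here since the slide's equation did not survive extraction):

```latex
p(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)}, \qquad k = 1, \dots, K
```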
Least Squares vs. Logistic Regression
[ Figure: decision boundaries fit by least squares vs. logistic regression ]
https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
Unlike least squares, logistic regression has no closed-form solution; it is fit with iterative gradient-based optimization.
https://www.datasciencecentral.com/profiles/blogs/an-overview-of-gradient-descent-optimization-algorithms
Scikit-Learn Logistic Regression
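A minimal scikit-learn sketch (synthetic data assumed; not the original demo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)  # w and bias define the decision boundary
print(clf.predict_proba(X[:3]))   # posterior probabilities via the logistic function
```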