Linear Regression
Presented by:
Mr. Manpreet Singh
Punjab Engineering College
Chandigarh
Table of Contents
• Definition
• SLR (Simple Linear Regression)
• MLR (Multiple Linear Regression)
• Loss Function
• Optimization
• Closed Form
• Ridge Regression
• LASSO Regression
• Pros and Cons
Linear Regression
“Linear regression is a fundamental statistical method for modeling the
relationship between a dependent variable (also known as the target or
output variable) and one or more independent variables (also known as
features or predictors). The goal is to fit a linear equation to the
observed data”.
Setting the gradient of the squared-error loss $L(\beta) = \|y - X\beta\|^2$ to zero gives the closed-form (normal-equation) solution:
$$-2X^T y + 2X^T X\beta = 0$$
$$2X^T X\beta = 2X^T y$$
$$\beta = (X^T X)^{-1} X^T y$$
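• As a minimal sketch of this closed form (NumPy assumed; the toy X and y below are made up for illustration):

```python
import numpy as np

# Toy data: n = 5 examples, d = 2 features (made-up values for illustration)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Closed-form OLS solution: beta = (X^T X)^{-1} X^T y
# np.linalg.solve is preferred over explicitly inverting X^T X
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)      # fitted coefficients
print(X @ beta)  # in-sample predictions
```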
Ridge Regression Loss function
“We aim to find the vector of coefficients β that minimizes the sum of
squared residuals plus an L2 penalty on the coefficients”.
$$L(\beta) = \min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
• y is the vector of true labels
• $\|\beta\|_2^2$ is the squared L2-norm of the coefficients
• λ is the regularization parameter
• Vector form of the Ridge regression loss function
$$L(\beta) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$$
• Expanding the loss function
$$L(\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X\beta + \lambda \beta^T \beta$$
Solve ridge regression for β
• Compute the gradient and set it to zero:
$$\frac{\partial L(\beta)}{\partial \beta} = -2X^T y + 2X^T X\beta + 2\lambda\beta = 0$$
• Solving for β gives the ridge closed form:
$$\beta = (X^T X + \lambda I)^{-1} X^T y$$
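• A minimal NumPy sketch of the ridge closed form, reusing the illustrative toy data and an assumed λ = 0.5:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])
lam = 0.5  # regularization strength (illustrative value)

# Ridge closed form: beta = (X^T X + lambda * I)^{-1} X^T y
d = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(beta_ridge)  # coefficients shrunk toward zero relative to OLS
```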
LASSO Regression
• LASSO replaces the L2 penalty with an L1 penalty: $L(\beta) = \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. The squared-error term is differentiable, with gradient:
$$\frac{\partial}{\partial \beta}\|y - X\beta\|_2^2 = -2X^T(y - X\beta)$$
• Sub-gradient of the L1-norm
$$\frac{\partial}{\partial \beta_j}|\beta_j| = \begin{cases} -1, & \beta_j < 0 \\ +1, & \beta_j > 0 \\ \text{any value in } [-1, +1], & \beta_j = 0 \end{cases}$$
• Total (sub)gradient: away from zero the penalty term contributes
$$\frac{\partial}{\partial \beta_j}\lambda|\beta_j| = \lambda \cdot \mathrm{sign}(\beta_j)$$
so the full subgradient of the LASSO loss is $-2X^T(y - X\beta) + \lambda\,\mathrm{sign}(\beta)$.
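• A minimal subgradient-descent sketch for the LASSO loss above (the step size η = 0.005, iteration count, and toy data are illustrative assumptions; np.sign returns 0 at β_j = 0, which is a valid choice from [−1, +1]):

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam=0.5, eta=0.005, iters=1000):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by subgradient descent."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        # Total subgradient: -2 X^T (y - X beta) + lam * sign(beta)
        grad = -2 * X.T @ (y - X @ beta) + lam * np.sign(beta)
        beta -= eta * grad
    return beta

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])
print(lasso_subgradient_descent(X, y))
```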
Comparison of LASSO and Ridge
The sharp corners of the L1 penalty allow LASSO to shrink feature coefficients exactly to 0, and hence it performs feature selection; Ridge only shrinks coefficients toward 0 without eliminating them.
Logistic Regression
“Logistic Regression is a classification algorithm used to predict the
probability of a binary outcome (e.g., 0 or 1, True or False, Yes or No)
based on one or more independent variables (features). Despite its
name, logistic regression is used for classification problems rather than
regression problems”.
• The model predicts the probability that a given input belongs to the positive class (e.g., class 1) by applying the sigmoid function to a linear combination of the input features:
$$P(y = 1 \mid X) = \sigma(X\beta) = \frac{1}{1 + e^{-X\beta}}$$
• P(y=1 | X) is the probability of output 1 given X
• X is the matrix of input features
• β is the vector of coefficients
Logistic function
[Figure: S-shaped sigmoid curve mapping the score Xβ to a probability in (0, 1)]
Log-Loss Function
$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
Where
• y is the true label
• ŷ is the predicted label (probability)
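• A minimal sketch of the sigmoid and log-loss above (NumPy assumed; the labels and scores are made-up illustrations):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])                 # true labels (illustrative)
scores = np.array([2.0, -1.0, 0.5, 3.0])   # X @ beta for some beta
y_hat = sigmoid(scores)
print(y_hat, log_loss(y, y_hat))
```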
Gradient of Loss Function
•To minimize the log-loss function and find the optimal coefficients β,
we can use gradient descent. To do this, we need to compute the
gradient of the loss function with respect to the coefficients β.
•The gradient of the loss function J(β) with respect to β is the vector of
partial derivatives of J(β) with respect to each β_j. This gradient tells us
the direction in which the cost function increases most rapidly, and we
move in the opposite direction to minimize the cost.
$$\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{ij}, \quad \forall j = 1, 2, \dots, d$$
Where
• y is the true label
• ŷ is the predicted label (probability)
• x_ij is the element in the i-th row and j-th column of the dataset X
Vector Form of Gradient
$$\frac{\partial J(\beta)}{\partial \beta} = \frac{1}{n} X^T (\hat{y} - y)$$
Where
• y is the n×1 vector of true labels
• ŷ is the n×1 vector of predicted labels (probabilities)
• X is the n×d matrix of input features
• n is the total number of training examples (rows in the dataset)
• d is the total number of dimensions (columns in the dataset)
Gradient Descent Update Rule
$$\beta \leftarrow \beta - \eta \cdot \frac{1}{n} X^T (\hat{y} - y)$$
Where
• y is the n×1 vector of true labels
• ŷ is the n×1 vector of predicted labels (probabilities)
• X is the n×d matrix of input features
• n is the total number of training examples (rows in the dataset)
• d is the total number of dimensions (columns in the dataset)
• X^T is the transpose of the matrix X
• η is the learning rate, a small positive step size (typically much less than 1.0)
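• Putting the update rule to work, a minimal gradient-descent training sketch for logistic regression (η, the iteration count, and the toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, iters=5000):
    """Gradient descent on the log-loss: beta <- beta - eta * (1/n) X^T (y_hat - y)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        y_hat = sigmoid(X @ beta)          # predicted probabilities
        beta -= eta * (X.T @ (y_hat - y)) / n
    return beta

# Toy binary dataset (illustrative): second column separates the classes
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.5], [1.0, 2.5]])  # first column = intercept
y = np.array([0, 0, 1, 1])

beta = fit_logistic(X, y)
print(sigmoid(X @ beta))  # predicted probabilities; threshold at 0.5 to classify
```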
Pros and Cons
Time Complexity
• Training: O(T·n·d), where T is the number of gradient-descent iterations
• Prediction: O(d)
Pros
• Logistic regression is easy to implement and understand.
• Efficient and Fast
• No Need for Feature Scaling
• Logistic regression provides probabilistic predictions, which are useful for tasks where
uncertainty matters.
• Regularization Support
Cons
• Assumption of Linearity
• Limited to Binary Classification (in its basic form; see the One-vs-Rest and One-vs-One strategies below)
• Sensitive to Outliers
• Poor Performance with Highly Correlated Features (Multicollinearity)
One vs. Rest Strategy
“In the One-vs-Rest (OvR) strategy, also known as One-vs-All (OvA),
the multiclass classification problem is decomposed into multiple
binary classification problems. For each class, a separate binary
classifier is trained to distinguish that class from all other classes
combined”.
• Number of Classifiers: For a problem with m classes, OvR trains m
binary classifiers, one for each class.
• Training: Each classifier is trained with the original dataset, but the
labels are modified so that the target class is labeled as 1 (positive
class), and all other classes are labeled as 0 (negative class).
• Prediction: When making predictions, the model runs all m
classifiers, and each classifier outputs a score or probability for its
respective class. The final prediction is the class corresponding to the
classifier with the highest score or probability.
One vs. Rest Strategy example
Example
• Suppose you have a dataset with three classes: A, B, and C. The OvR
strategy will create three binary classifiers:
• Classifier 1: Class A vs. (Classes B and C)
• Classifier 2: Class B vs. (Classes A and C)
• Classifier 3: Class C vs. (Classes A and B)
• During prediction, each of these classifiers will output a score or
probability, and the class with the highest score will be chosen as the
final prediction.
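• A minimal OvR sketch of this A/B/C example, assuming scikit-learn's OneVsRestClassifier with logistic regression as the base classifier (the data points are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class data for classes A, B, C (made-up points for illustration)
X = np.array([[0.0, 0.1], [0.2, 0.0],    # class A cluster
              [2.0, 2.1], [2.2, 1.9],    # class B cluster
              [4.0, 0.2], [4.1, 0.0]])   # class C cluster
y = np.array(["A", "A", "B", "B", "C", "C"])

# OvR trains one "class vs. everything else" logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(ovr.predict([[2.1, 2.0]]))            # highest-scoring classifier wins
print(ovr.decision_function([[2.1, 2.0]]))  # one score per class
```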
One vs. One Strategy
“In the One-vs-One (OvO) strategy, the multiclass classification
problem is broken down into multiple binary classification problems,
where each classifier is trained to distinguish between two specific
classes”.
• Number of Classifiers: For a problem with m classes, OvO trains
m(m−1)/2 binary classifiers. Each classifier is trained to differentiate
between two classes at a time, ignoring all other classes.
• Training: Each binary classifier is trained on a subset of the original
dataset, consisting only of the examples from the two classes it is
supposed to distinguish. All other classes are excluded from training
that specific classifier.
• Prediction: During prediction, each classifier casts a "vote" for one of
the two classes it was trained to distinguish. The class that receives
the most votes across all binary classifiers is chosen as the final
prediction.
One vs. One Strategy Example
• For a dataset with three classes: A, B, and C, the OvO strategy will
create three binary classifiers:
• Classifier 1: Class A vs. Class B
• Classifier 2: Class A vs. Class C
• Classifier 3: Class B vs. Class C
• During prediction, each classifier will vote for one of the two classes,
and the class with the most votes will be the final prediction.
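• A minimal OvO sketch of the same A/B/C setup, assuming scikit-learn's OneVsOneClassifier (made-up data as before):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X = np.array([[0.0, 0.1], [0.2, 0.0],    # class A cluster
              [2.0, 2.1], [2.2, 1.9],    # class B cluster
              [4.0, 0.2], [4.1, 0.0]])   # class C cluster
y = np.array(["A", "A", "B", "B", "C", "C"])

# OvO trains m(m-1)/2 = 3 pairwise classifiers: A-vs-B, A-vs-C, B-vs-C
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)
print(len(ovo.estimators_))       # 3 pairwise classifiers
print(ovo.predict([[4.0, 0.1]]))  # class with the most pairwise votes
```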
Thank you