
Linear Regression

Presented by:
Mr. Manpreet Singh
Punjab Engineering College
Chandigarh
Table of Contents
• Definition
• SLR
• MLR
• Loss Function
• Optimization
• Closed Form
• Ridge Regression
• LASSO Regression
• Pros and Cons
Linear Regression
“Linear regression is a fundamental statistical method for modeling the
relationship between a dependent variable (also known as the target or
output variable) and one or more independent variables (also known as
features or predictors). The goal is to fit a linear equation to the
observed data”.

Fig. 1: Linear Regression Example for Happiness Score based on Income


Simple Linear Regression (SLR)
• Only one independent variable (x) and one dependent variable (y).
• Equation of the line for SLR:
y = β₀ + β₁x = c + mx
• c = β₀ (intercept)
• m = β₁ (slope of the line)
• x is the independent variable
• y is the dependent variable

Fig. 2: Example of SLR
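A minimal sketch of fitting an SLR line in Python (assuming NumPy; the x and y values below are made-up illustrative data). np.polyfit returns the slope m and intercept c of the least-squares line.

```python
import numpy as np

# Hypothetical data, e.g. income (x) vs. happiness score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.1, 4.8])

# Fit y = m*x + c by least squares (degree-1 polynomial)
m, c = np.polyfit(x, y, 1)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
print("prediction at x = 6:", m * 6 + c)
```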


Multiple Linear Regression (MLR)
• Suppose X is a dataset with n training examples and d independent
variables (features).
• Training dataset: X ∈ ℝ^(n×d)
• Training labels: y ∈ ℝⁿ
• Equation of a hyperplane in d-dimensional space:
y = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_d x_d + ϵ
• ϵ is the error (difference between the true value and the predicted value)
• In matrix form, the equation can be written as:
y = Xβ + ϵ
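A small sketch (assuming NumPy; the numbers are made up) of how the matrix form y = Xβ + ϵ is typically set up: a column of ones is prepended to X so the intercept β₀ is absorbed into the coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                                  # n training examples, d features
X = rng.normal(size=(n, d))                    # design matrix, shape (n, d)
beta_true = np.array([2.0, 0.5, -1.0, 3.0])    # [β0, β1, β2, β3]

# Prepend a column of ones so β0 becomes part of the coefficient vector
X_aug = np.hstack([np.ones((n, 1)), X])        # shape (n, d+1)
eps = rng.normal(scale=0.1, size=n)            # noise term ϵ
y = X_aug @ beta_true + eps                    # y = Xβ + ϵ
```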
Loss Function (MSE)
“To measure how well the linear model fits the data, the most commonly
used cost function is the Mean Squared Error (MSE). It measures the
average squared difference between the predicted and actual values”.
J(β) = (1/2n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

• yᵢ is the true label
• ŷᵢ is the predicted label
• n is the total number of training examples

ŷ = β₀ + β₁x₁ + ⋯ + β_d x_d
• Gradient of the loss function:
∂J(β)/∂βⱼ = (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) · xᵢⱼ ,  ∀ j = 1, 2, …, d
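A hedged sketch of batch gradient descent on the MSE loss (assuming NumPy and, for the commented usage line, the X_aug and y arrays from the previous sketch; learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent_mse(X, y, lr=0.05, n_iters=2000):
    """Minimize J(β) = (1/2n) Σ (yᵢ − ŷᵢ)² by batch gradient descent."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        y_hat = X @ beta
        grad = X.T @ (y_hat - y) / n      # ∂J/∂β = (1/n) Xᵀ(ŷ − y)
        beta -= lr * grad
    return beta

# beta_hat = gradient_descent_mse(X_aug, y)   # hypothetical usage
```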
Derivation of the Closed Form of Linear Regression
Least Squares Loss Function
"We aim to find the vector of coefficients β that minimizes the sum of
squared residuals".

L(β) = min_β ||y − Xβ||²

• y is the true label
• ŷ = Xβ is the predicted label
• β is a vector of coefficients
• Vector form of the least squares loss function (sum of squares loss
function):
L(β) = (y − Xβ)ᵀ(y − Xβ)
• Expanding the loss function:
L(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ
Solve for β
• Compute the gradient:
∂L(β)/∂β = −2Xᵀy + 2XᵀXβ

• Equate it to zero to obtain the solution for β:

−2Xᵀy + 2XᵀXβ = 0
2XᵀXβ = 2Xᵀy
β = (XᵀX)⁻¹Xᵀy

• Closed form of linear regression:

β = (XᵀX)⁻¹Xᵀy
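The closed form can be computed directly in NumPy; a minimal sketch (np.linalg.solve is used rather than an explicit matrix inverse, which is the numerically preferred way to apply the same formula):

```python
import numpy as np

def fit_ols_closed_form(X, y):
    """β = (XᵀX)⁻¹Xᵀy, solved as the linear system (XᵀX)β = Xᵀy."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# beta_hat = fit_ols_closed_form(X_aug, y)   # should be close to beta_true
```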
Ridge Regression Loss Function
"We aim to find the vector of coefficients β that minimizes the sum of
squared residuals plus an L2 penalty on the coefficients".

L(β) = min_β ||y − Xβ||² + λ||β||₂²

• y is the true label
• ||β||₂ is the L2-norm
• λ is the regularization parameter
• Vector form of the ridge regression loss function:
L(β) = (y − Xβ)ᵀ(y − Xβ) + λβᵀβ
• Expanding the loss function:
L(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ + λβᵀβ
Solve ridge regression for β
• Compute the gradient:
∂L(β)/∂β = −2Xᵀy + 2XᵀXβ + 2λβ

• Equate it to zero to obtain the solution for β:

−2Xᵀy + 2XᵀXβ + 2λβ = 0
2XᵀXβ + 2λβ = 2Xᵀy
(XᵀX + λI)β = Xᵀy
β = (XᵀX + λI)⁻¹Xᵀy

• Closed form of ridge regression:

β = (XᵀX + λI)⁻¹Xᵀy
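A corresponding sketch for the ridge closed form (λ is a tuning parameter chosen by the user; for simplicity this version also penalizes the intercept column, which is a modeling choice rather than a requirement):

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam=1.0):
    """β = (XᵀX + λI)⁻¹Xᵀy."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```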


LASSO Regression Loss Function
"We aim to find the vector of coefficients β that minimizes the sum of
squared residuals plus an L1 penalty on the coefficients".

L(β) = min_β ||y − Xβ||² + λ||β||₁

L(β) = min_β ||y − Xβ||² + λ · Σᵢ₌₁ᵈ |βᵢ|

• y is the true label
• ||β||₁ is the L1-norm
• λ is the regularization parameter
Gradient of LASSO
• Gradient of the least squares part of the loss:

L(β) = min_β ||y − Xβ||²

∂L(β)/∂β = −2Xᵀ(y − Xβ)

• Sub-gradient of the L1-norm:

∂|βⱼ|/∂βⱼ =  −1                      if βⱼ < 0
             +1                      if βⱼ > 0
             any value in [−1, +1]   if βⱼ = 0

so that, for βⱼ ≠ 0,
(∂/∂βⱼ) λ|βⱼ| = λ · sign(βⱼ)

• Total (sub)gradient of the LASSO loss:
−2Xᵀ(y − Xβ) + λ · sign(β)
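Because the L1 term is not differentiable at 0, plain gradient descent is usually replaced by subgradient descent (or, in practice, coordinate descent or proximal/soft-thresholding methods, which produce exact zeros). A minimal subgradient-descent sketch following the expressions above; the step size and iteration count are arbitrary and may need tuning for a given X:

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam=0.1, lr=1e-3, n_iters=5000):
    """Minimize ||y − Xβ||² + λ||β||₁ by subgradient descent."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad_ls = -2 * X.T @ (y - X @ beta)   # gradient of the squared-error term
        subgrad_l1 = lam * np.sign(beta)      # subgradient of λ||β||₁ (0 at βⱼ = 0)
        beta -= lr * (grad_ls + subgrad_l1)
    return beta
```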
Comparison of LASSO and Ridge

The sharp corners of the L1 penalty cause LASSO to shrink some feature coefficients exactly to 0 and hence perform feature selection, whereas ridge only shrinks coefficients toward 0.
Logistic Regression
“Logistic Regression is a classification algorithm used to predict the
probability of a binary outcome (e.g., 0 or 1, True or False, Yes or No)
based on one or more independent variables (features). Despite its
name, logistic regression is used for classification problems rather than
regression problems”.
• The model predicts the probability that a given input belongs to the
positive class (e.g., class 1) using the sigmoid function applied to a
linear combination of the input features.

P(y = 1 | X) = σ(Xβ) = 1 / (1 + e^(−Xβ))

• P(y = 1 | X) is the probability of output 1 given X
• X is the matrix of input features
• β is the vector of coefficients
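A tiny sketch of the sigmoid and the resulting probability predictions (assuming NumPy; X is assumed to include an intercept column as in the earlier sketches):

```python
import numpy as np

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(−z)); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    """P(y = 1 | X) = σ(Xβ)."""
    return sigmoid(X @ beta)
```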
Logistic function

Fig. 3: Logistic function range


Loss Function
“To estimate the coefficients β in logistic regression, we need to define a
loss function that we want to minimize. The commonly used loss
function for logistic regression is the Log-Loss or Binary Cross-
Entropy Loss. It measures the difference between the actual labels and
the predicted probabilities”.
J(β) = −(1/n) · Σᵢ₌₁ⁿ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]

Where
• yᵢ is the true label
• ŷᵢ is the predicted label (probability)
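A short sketch of the log-loss computation (the predicted probabilities are clipped away from exactly 0 and 1 to avoid log(0); the clipping constant is an arbitrary choice):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy: −(1/n) Σ [y·log(ŷ) + (1 − y)·log(1 − ŷ)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```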
Gradient of Loss Function
•To minimize the log-loss function and find the optimal coefficients β,
we can use gradient descent. To do this, we need to compute the
gradient of the loss function with respect to the coefficients β.
• The gradient of the loss function J(β) with respect to β is the vector of
partial derivatives of J(β) with respect to each βⱼ. This gradient tells us
the direction in which the cost function increases most rapidly, and we
move in the opposite direction to minimize the cost.

∂J(β)/∂βⱼ = (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) · xᵢⱼ ,  ∀ j = 1, 2, …, d

Where
• yᵢ is the true label
• ŷᵢ is the predicted label (probability)
• xᵢⱼ is the element in the i-th row and j-th column of the dataset X
Vector Form of the Gradient

∂J(β)/∂β = (1/n) · Xᵀ(ŷ − y)

Where
• y is the vector of true labels (n×1)
• ŷ is the vector of predicted labels/probabilities (n×1)
• X is the matrix of input features (n×d)
• n is the total number of training examples (rows of the dataset)
• d is the total number of dimensions (columns of the dataset)
Gradient Descent Update Rule

β = β − η · (1/n) · Xᵀ(ŷ − y)

Where
• y is the vector of true labels (n×1)
• ŷ is the vector of predicted labels/probabilities (n×1)
• X is the matrix of input features (n×d)
• Xᵀ is the transpose of X
• n is the total number of training examples (rows of the dataset)
• d is the total number of dimensions (columns of the dataset)
• η is the learning rate, typically less than 1.0
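Putting the pieces together, a hedged sketch of logistic regression trained with this update rule (it reuses the sigmoid idea from the earlier sketch; the learning rate, iteration count, and the commented usage data are arbitrary/hypothetical):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the log-loss: β ← β − η · (1/n) · Xᵀ(ŷ − y)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))   # ŷ = σ(Xβ)
        beta -= lr * X.T @ (y_hat - y) / n
    return beta

# Hypothetical usage with made-up data:
# rng = np.random.default_rng(0)
# X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
# y = (X @ np.array([0.5, 2.0, -1.0]) + rng.normal(size=200) > 0).astype(float)
# beta_hat = fit_logistic(X, y)
```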
Pros and Cons
Time Complexity
• Training: O(T·n·d), where T is the number of gradient descent iterations
• Prediction: O(d)
Pros
• Logistic regression is easy to implement and understand.
• Efficient and Fast
• No Need for Feature Scaling
• Logistic regression provides probabilistic predictions, which are useful for tasks where
uncertainty matters.
• Regularization Support
Cons
• Assumption of Linearity
• Limited to Binary Classification
• Sensitive to Outliers
• Poor Performance with Highly Correlated Features (Multicollinearity)
One vs. Rest Strategy
“In the One-vs-Rest (OvR) strategy, also known as One-vs-All (OvA),
the multiclass classification problem is decomposed into multiple
binary classification problems. For each class, a separate binary
classifier is trained to distinguish that class from all other classes
combined”.
• Number of Classifiers: For a problem with m classes, OvR trains m
binary classifiers, one for each class.
• Training: Each classifier is trained with the original dataset, but the
labels are modified so that the target class is labeled as 1 (positive
class), and all other classes are labeled as 0 (negative class).
• Prediction: When making predictions, the model runs all m
classifiers, and each classifier outputs a score or probability for its
respective class. The final prediction is the class corresponding to the
classifier with the highest score or probability.
One vs. Rest Strategy example
Example
• Suppose you have a dataset with three classes: A, B, and C. The OvR
strategy will create three binary classifiers:
• Classifier 1: Class A vs. (Classes B and C)
• Classifier 2: Class B vs. (Classes A and C)
• Classifier 3: Class C vs. (Classes A and B)
• During prediction, each of these classifiers will output a score or
probability, and the class with the highest score will be chosen as the
final prediction.
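A minimal sketch of the relabeling described above, built on the hypothetical fit_logistic helper from the earlier sketch (the class whose classifier outputs the highest probability wins):

```python
import numpy as np

def fit_ovr(X, y, classes):
    """Train one binary logistic classifier per class (that class vs. the rest)."""
    return {c: fit_logistic(X, (y == c).astype(float)) for c in classes}

def predict_ovr(X, models):
    """Pick, for each row of X, the class with the highest predicted probability."""
    classes = list(models)
    probs = np.column_stack(
        [1.0 / (1.0 + np.exp(-(X @ models[c]))) for c in classes]
    )
    return np.array(classes)[np.argmax(probs, axis=1)]
```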
One vs. One Strategy
“In the One-vs-One (OvO) strategy, the multiclass classification
problem is broken down into multiple binary classification problems,
where each classifier is trained to distinguish between two specific
classes”.
• Number of Classifiers: For a problem with m classes, OvO trains
m(m−1)/2​ binary classifiers. Each classifier is trained to differentiate
between two classes at a time, ignoring all other classes.
• Training: Each binary classifier is trained on a subset of the original
dataset, consisting only of the examples from the two classes it is
supposed to distinguish. All other classes are excluded from training
that specific classifier.
• Prediction: During prediction, each classifier casts a "vote" for one of
the two classes it was trained to distinguish. The class that receives
the most votes across all binary classifiers is chosen as the final
prediction.
One vs. One Strategy Example
• For a dataset with three classes: A, B, and C, the OvO strategy will
create three binary classifiers:
• Classifier 1: Class A vs. Class B
• Classifier 2: Class A vs. Class C
• Classifier 3: Class B vs. Class C
• During prediction, each classifier will vote for one of the two classes,
and the class with the most votes will be the final prediction.
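For OvO, scikit-learn provides a ready-made wrapper; a sketch assuming scikit-learn is installed and that X_train, y_train, X_test are hypothetical multiclass data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

# m classes -> m(m-1)/2 binary classifiers; prediction is decided by voting
ovo = OneVsOneClassifier(LogisticRegression())
ovo.fit(X_train, y_train)          # X_train, y_train: hypothetical dataset
print(ovo.predict(X_test[:5]))     # the class receiving the most votes wins
```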
Thank you
