Logistic Regression
Minati Rath
Classification
Email: Spam / Not Spam?
Online Transactions: Fraudulent / Genuine?
Tumor: Malignant / Benign?
[Figure: Malignant? (1 = Yes, 0 = No) plotted against Tumor Size]
Can we solve the problem using linear regression? E.g., fit a straight line,
written here as $h_\Theta(x)$, and threshold the classifier output at 0.5:
If $h_\Theta(x) \geq 0.5$, predict “y = 1”
If $h_\Theta(x) < 0.5$, predict “y = 0”
[Figure: Malignant? (1 = Yes, 0 = No) plotted against Tumor Size, with one additional point]
Failure due to adding a new point: with the extra point, the straight line
fitted by linear regression changes, and thresholding its output at 0.5 no
longer classifies the examples correctly.
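The failure can be reproduced with a small sketch. The tumor sizes and labels below are hypothetical values chosen for illustration, and `fit_line` and `threshold_predict` are helper names introduced here, not part of the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def threshold_predict(intercept, slope, xs, threshold=0.5):
    """Predict 1 when the fitted line is at or above the threshold, else 0."""
    return [1 if intercept + slope * x >= threshold else 0 for x in xs]


# Hypothetical data: tumor size and label (1 = malignant, 0 = benign).
sizes = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_line(sizes, labels)
before = threshold_predict(b0, b1, sizes)

# Add one more malignant example with a very large tumor size and refit.
sizes_new = sizes + [40]
labels_new = labels + [1]
c0, c1 = fit_line(sizes_new, labels_new)
after = threshold_predict(c0, c1, sizes_new)

print("all correct before:", before == labels)
print("all correct after: ", after == labels_new)
print("line output at size 40:", c0 + c1 * 40)
```

In this sketch the refitted line is flatter, so the malignant example at size 6 falls below the 0.5 threshold, and the line's output at size 40 exceeds 1.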
Another drawback of using linear regression for this problem
Classification: y = 0 or 1
but the linear regression output $h_\Theta(x)$ can be > 1 or < 0
The log odds transform the range of probabilities (which are between 0
and 1) into a range between −∞ and +∞ , making it easier to model
probabilities in a regression context.
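A minimal sketch of this transform, assuming the standard definitions log odds = log(p / (1 − p)) and its inverse, the sigmoid. The function names `log_odds` and `sigmoid` are chosen here for illustration.

```python
import math


def log_odds(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of log_odds: map any real number back to (0, 1)."""
    return 1 / (1 + math.exp(-z))


for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    z = log_odds(p)
    print(f"p = {p:<5} log odds = {z:+.3f}  back to p = {sigmoid(z):.2f}")
```

Probabilities near 0 map to large negative values and probabilities near 1 to large positive values, with p = 0.5 mapping to 0.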
Why Use Log Odds in Logistic Regression?
In logistic regression, the relationship between the predictor
variables (such as age, income, etc.) and the binary outcome
(like yes/no or 0/1) is non-linear. To make it linear and fit into
the regression framework, the model predicts the log odds of
the outcome. The coefficients from a logistic regression model
are associated with the log odds, meaning each coefficient
represents how a one-unit change in a predictor variable affects
the log odds of the outcome.
Example:
If you have a logistic regression model predicting whether a
person buys a product based on their income, the coefficient for
income shows how a one-unit increase in income affects the log
odds of purchasing the product.
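The income example can be sketched as follows. The intercept and coefficient are hypothetical numbers picked for illustration, not estimates from any data.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted to data).
INTERCEPT = -4.0
INCOME_COEF = 0.05  # change in log odds per one-unit increase in income


def log_odds_of_purchase(income):
    """Linear in income: this is the quantity the model predicts."""
    return INTERCEPT + INCOME_COEF * income


def probability_of_purchase(income):
    """Convert the log odds back to a probability with the sigmoid."""
    return 1 / (1 + math.exp(-log_odds_of_purchase(income)))


for income in (50, 51, 100, 101):
    print(
        income,
        round(log_odds_of_purchase(income), 3),
        round(probability_of_purchase(income), 4),
    )
```

A one-unit increase in income always adds `INCOME_COEF` to the log odds, while the resulting change in probability depends on the starting income.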
Odds Ratio
The odds ratio (OR) is a measure of association between a particular
predictor variable and an outcome, commonly used in logistic
regression. It tells us how much the odds of the outcome change with
a one-unit increase in the predictor variable.
The odds ratio compares the odds of the outcome for different levels
of the predictor variable. Specifically, it is the ratio of the odds of the
outcome occurring in one group to the odds of it occurring in another
group (often a baseline or reference group).
Interpretation of Odds Ratio:
OR = 1: The odds of the outcome are the same for both groups,
meaning the predictor has no effect on the outcome.
OR > 1: The odds of the outcome increase as the predictor increases.
For instance, an odds ratio of 2 means the odds of the outcome are
twice as high in group 1 compared to group 2.
OR < 1: The odds of the outcome decrease as the predictor increases.
For example, an odds ratio of 0.5 means the odds of the outcome are
half as high in group 1 compared to group 2.
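As a rough illustration, the odds ratio can be computed from group counts, or obtained from a logistic regression coefficient as exp(coefficient). The counts below are hypothetical, and the function names are introduced here for the sketch.

```python
import math


def odds(events, total):
    """Odds of the outcome: events divided by non-events."""
    return events / (total - events)


def odds_ratio(events_1, total_1, events_2, total_2):
    """Odds in group 1 divided by odds in group 2 (the reference group)."""
    return odds(events_1, total_1) / odds(events_2, total_2)


def odds_ratio_from_coefficient(coef):
    """Odds ratio for a one-unit increase in a predictor with this coefficient."""
    return math.exp(coef)


# Hypothetical counts: 40 of 100 in group 1, 25 of 100 in group 2.
print(odds_ratio(40, 100, 25, 100))      # approximately 2
print(odds_ratio_from_coefficient(0.0))  # a zero coefficient gives OR = 1
```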
Assumptions of Logistic Regression
Logistic regression has several assumptions:
1) The dependent variable must be binary;
2) The observations are independent;
3) There is little to no multicollinearity among the predictors;
4) The independent variables are linearly related to the log odds
of the outcome;
5) The model assumes a large sample size for reliable results.
Decision Boundary
Separating two classes of points.
We are attempting to separate two given sets / classes of points
Separate two regions of the feature space
Concept of Decision Boundary
Finding a good decision boundary => learn appropriate values
for the parameters 𝛩
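[Figure: decision boundaries in the (x1, x2) plane; the first panel has axis ticks at 1, 2, 3 and the second at -1 and 1. Label: "Predict if …"]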
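A minimal sketch of a linear decision boundary, assuming a model of the form sigmoid(θ0 + θ1·x1 + θ2·x2) with a 0.5 threshold. The parameter values in `THETA` are hypothetical, chosen only so that the boundary is the line x1 + x2 = 3.

```python
import math

# Hypothetical parameters (theta0, theta1, theta2); not taken from the slides.
THETA = (-3.0, 1.0, 1.0)


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def predict(x1, x2, theta=THETA):
    """Predict 1 when the estimated probability is at least 0.5.

    This is equivalent to theta0 + theta1*x1 + theta2*x2 >= 0, so the
    decision boundary is the line where that expression equals zero.
    """
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if sigmoid(z) >= 0.5 else 0


for point in [(1, 1), (2, 2), (3, 0), (0, 2.5)]:
    print(point, "->", predict(*point))
```

Learning different parameter values moves or rotates this line, which is why finding a good boundary amounts to learning the parameters.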
Multi-class classification: one-vs-all
News article tagging: Politics, Sports, Movies,
Religion, …
Medical diagnosis: Not ill, Cold, Flu, Fever
[Figure: two example datasets in the (x1, x2) plane]
One-vs-all (one-vs-rest):
[Figure: a dataset with three classes (Class 1, Class 2, Class 3), shown in several (x1, x2) panels]
Multinomial / Multi-class Logistic Regression
Multinomial logistic regression is an extension of binary logistic
regression used when the dependent variable has more than two
categories. The method models the probability of each category
relative to a reference category: one category serves as the baseline,
and a separate set of coefficients is estimated for each remaining
category. It is commonly applied in scenarios with more than two possible
outcomes, such as predicting the type of disease from symptoms.
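One common way to turn per-category linear scores into probabilities is the softmax with the reference category's score fixed at 0. The sketch below assumes that formulation; the scores passed in are hypothetical, and `multinomial_probs` is a name introduced here.

```python
import math


def multinomial_probs(scores_vs_reference):
    """Category probabilities from log odds relative to a reference category.

    The reference category has score 0 and is returned first.
    """
    scores = [0.0] + list(scores_vs_reference)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical log odds of two categories relative to the reference.
probs = multinomial_probs([1.2, -0.4])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

With this construction, log(probs[k] / probs[0]) recovers the score of category k, which is the sense in which each category is modelled against the reference.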
One-vs-all
Train a logistic regression classifier $h_\Theta^{(i)}(x)$ for each class $i$ to
predict the probability that $y = i$. On a new input $x$, to make a
prediction, pick the class $i$ that maximizes $h_\Theta^{(i)}(x)$.
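The prediction step can be sketched as below, assuming three binary classifiers have already been trained. The parameter values in `CLASSIFIERS` are hypothetical placeholders standing in for trained values.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Hypothetical (theta0, theta1, theta2) for each one-vs-all classifier.
CLASSIFIERS = {
    "Class 1": (0.0, -1.0, -1.0),
    "Class 2": (-2.0, 1.0, 0.0),
    "Class 3": (-2.0, 0.0, 1.0),
}


def predict_one_vs_all(x1, x2):
    """Score the input with every classifier and return the best class."""
    probs = {
        label: sigmoid(t0 + t1 * x1 + t2 * x2)
        for label, (t0, t1, t2) in CLASSIFIERS.items()
    }
    best = max(probs, key=probs.get)
    return best, probs


for point in [(-1, -1), (3, 0), (0, 3)]:
    label, probs = predict_one_vs_all(*point)
    print(point, "->", label, {k: round(v, 3) for k, v in probs.items()})
```

The per-class probabilities come from separate classifiers, so they need not sum to 1; only their ranking is used.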
How to evaluate a model?
• Regression
– Some measure of how close the values predicted by a model
are to the actual values
• Classification
– Whether predicted classes match the actual
classes
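As one possible instance of each idea, the sketch below uses mean squared error for regression and accuracy for classification. These two measures are examples picked here, and the values are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average squared gap between predicted and actual values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual, predicted):
    """Fraction of examples whose predicted class matches the actual class."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


# Hypothetical regression targets and predictions.
print(mean_squared_error([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))

# Hypothetical class labels and predictions.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```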
Evaluation metrics for Regression
• Application 1: Supermarket verifies customers for giving a discount
• Application 2: For entering into RAW, GoI
On what data should we measure precision, recall, error rate, ...?
Option 1: training set
Option 2: some other set of examples that was unknown
at the time of training (test set)
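A sketch of Option 2 under simple assumptions: the data are hypothetical (feature, label) pairs, the "model" is a single cut-off on the feature chosen on the training set, and the helper names are introduced here for illustration.

```python
import random


def precision_recall_error(y_true, y_pred):
    """Return (precision, recall, error rate) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, errors / len(y_true)


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_cutoff(train):
    """Pick the cut-off on the feature with the fewest training errors."""
    def training_errors(cutoff):
        return sum(1 for x, y in train if (1 if x >= cutoff else 0) != y)
    return min(sorted(x for x, _ in train), key=training_errors)


# Hypothetical (feature, label) pairs.
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1),
        (7, 1), (8, 1), (9, 1), (10, 0), (11, 1), (12, 1)]

train, test = train_test_split(data)
cutoff = fit_cutoff(train)  # the test set plays no part in training

for name, subset in (("train", train), ("test", test)):
    y_true = [y for _, y in subset]
    y_pred = [1 if x >= cutoff else 0 for x, _ in subset]
    print(name, precision_recall_error(y_true, y_pred))
```

Metrics on the training set describe examples the model has already seen, so the test-set numbers are the ones that indicate how it may behave on new examples.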