Logistic Regression
Logistic Regression
Classification
• Classification is a very important area of supervised machine learning.
A large number of important machine learning problems fall within
this area. There are many classification methods, and logistic
regression is one of them.
• Supervised machine learning algorithms define models that capture
relationships among data. Classification is an area of supervised
machine learning that tries to predict which class or category some
entity belongs to, based on its features.
• There are two main types of classification problems:
• Binary or binomial classification: exactly two classes to choose
between (usually 0 and 1, true and false, or positive and negative)
• Multiclass or multinomial classification: three or more classes of the
outputs to choose from
When Do You Need Classification?
• You can apply classification in many fields of science and technology.
Example:
• text classification algorithms are used to separate legitimate and
spam emails, as well as positive and negative comments.
• Image recognition tasks are often represented as classification
problems.
What is Logistic Regression?
• Logistic Regression is a classification algorithm. It is used to predict a
binary outcome (1 / 0, Yes / No, True / False) given a set of
independent variables.
• You can also think of logistic regression as a special case of linear
regression when the outcome variable is categorical, where we are
using log of odds as dependent variable.
• In simple words, it predicts the probability of occurrence of an event
by fitting data to a logit function.
• In many situations, the response variable is qualitative or, in other
words, categorical. For example, gender is qualitative, taking on
values male or female.
• Prediciting a qualitative response for an observation can be referred
to as classifying that observation, since it involves assigning the
observation to a category, or class. On the other hand, the methods
that are often used for classification first predict the probability of
each of the categories of a qualitative variable, as the basis for making
the classification.
• Linear regression is not capable of predicting probability.
Logistic Regression Assumptions
The logistic regression method assumes that:
• The outcome is a binary or dichotomous variable like yes vs no, positive vs negative, 1 vs
0.
• There is a linear relationship between the logit of the outcome and each predictor
variables. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the
probabilities of the outcome.
• There is no influential values (extreme values or outliers) in the continuous predictors
• There is no high intercorrelations (i.e. multicollinearity) among the predictors.
To improve the accuracy of your model, you should make sure that these assumptions hold
true for your data.
Types Of Logistic Regression
Models
• One of the plus points of Logistic Regression is that it can be used to
solve multi-class classification problems by using the Multinomial and
Ordinal Logistic models.
Multinomial Logistic Regression:
• Multinomial Regression is an extension of binary logistic regression,
that is used when the response variable has more than 2 classes.
Multinomial regression is used to handle multi-class classification
problems.
• Let’s assume that our response variable has K = 3 classes, then the
Multinomial logistic model will fit K-1 independent binary logistic
models in order to compute the final outcome.
Ordinal Logistic Regression:
• Ordinal Logistic Regression also known as Ordinal classification is a
predictive modeling technique used when the response variable is
ordinal in nature.
• An ordinal variable is one where the order of the values is significant,
but not the difference between values. For example, you might ask a
person to rate a movie on a scale of 1 to 5. A score of 4 is much better
than 3, because it means that the person liked the movie. But the
difference between a rating of 4 and the 3 may not be the same as
that between 4 and 1. The values simply express an order.
What is the Sigmoid Function?
• It is a mathematical function having a characteristic that can take any
real value and map it to between 0 to 1 shaped like the letter “S”. The
sigmoid function also called a logistic function.
Derivation of Logistic
Regression Equation
• g(E(y)) = α + βx1 + γx2
• Here, g() is the link function, E(y) is the expectation of target variable and α + βx1 + γx2 is the linear
predictor ( α,β,γ to be predicted). The role of link function is to ‘link’ the expectation of y to linear
predictor.
Important Points
• GLM does not assume a linear relationship between dependent and independent variables. However, it
assumes a linear relationship between link function and independent variables in logit model.
• The dependent variable need not to be normally distributed.
• It does not uses OLS (Ordinary Least Square) for parameter estimation. Instead, it uses maximum
likelihood estimation (MLE).
• Errors need to be independent but not normally distributed.
• In logistic regression, we are only concerned about the probability of
outcome dependent variable ( success or failure). As described above,
g() is the link function. This function is established using two things:
Probability of Success(p) and Probability of Failure(1-p). p should
meet following criteria:
• It must always be positive (since p >= 0)
• It must always be less than equals to 1 (since p <= 1)
• g(y) = βo + β(Age) ---- (a)
• p = exp(βo + β(Age)) / exp(βo + β(Age)) + 1 = e^(βo + β(Age)) / e^(βo + β(Age)) + 1 ----- (c)
• Using (a), (b) and (c), we can redefine the probability as:
# load dataset
df = pd.read_csv('Student-Pass-Fail-Data.csv')
df.head()
#predict values
y_pred = logistic_regression.predict(x_test)
Code…
#check model accuracy