
Classification

(Introduction, Logistic Regression)


Classification- Introduction
Classification is a supervised learning technique used to identify the category of new observations on the basis of training data.
In classification, a program learns from a given dataset of labelled observations and then assigns each new observation to one of a number of classes or groups.
Mathematically, classification analysis uses an algorithm to learn the mapping function from the input variables (X) to the output variable (Y), i.e., Y = f(X), where Y takes discrete values.

[Figure: classification of vegetables and groceries]


Types of Classification

1) Binary Classification
It is a type of classification problem in which the output variable takes only two values (True/False, 0/1, Yes/No).
Examples of binary classification include email spam detection (spam/ham), medical testing (patient has the disease or not), and customer risk analysis (fraudulent/non-fraudulent).
2) Multi-Class Classification
It is a type of classification problem in which the output variable takes more than two discrete values.
For example, risk evaluation of customers (low risk, medium risk, high risk), text classification into different categories (sports, politics, entertainment), etc.
Types of Classification (Contd…)
3) Multi-Label Classification
It is a type of multi-class classification in which an example can be labelled with multiple categories.

For instance, in text classification a text may belong to both the sports and the politics categories ("Virat Kohli joined politics").
Why Regression models are not
used for Classification ?
Regression models are not useful for classification for the following reasons:

1) Regression models give continuous values of the output variable and do not give probabilistic values.

2) Linear regression models are sensitive to imbalanced data.

Example for point 2: Let's say we create a perfectly balanced dataset containing a list of customers and a label indicating whether or not each customer made a purchase. In the dataset there are 20 customers: 10 customers aged 10 to 19 who purchased, and 10 customers aged 20 to 29 who did not purchase.
Why Regression models are not
used for Classification ? (Contd….)
 According to the linear regression model, the line of best fit is shown in Figure 1(a).

 Using this model for prediction is straightforward: given any age, we can predict the value along the Y-axis. If Y is greater than 0.5 (above the green line), we predict that this customer will make a purchase; otherwise, we predict that they will not.

[Figure 1(a): line of best fit for the 20 balanced customers]
Why Regression models are not
used for Classification ? (Contd….)
 Let's add 10 more customers aged between 60 and 70, and train our linear regression model again to find the new best-fit line.

 Our linear regression model manages to fit a new line (Figure 1(b)), but if you look closer, the outcomes for some customers (ages 20 to 22) are now predicted wrongly.

[Figure 1(b): refitted line after adding the 10 older customers]
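To make this concrete, here is a minimal sketch (Python/NumPy; not part of the original slides) that fits a least-squares line to the toy customer data and thresholds it at 0.5. On the balanced 20-customer data every label is recovered; after adding the 10 older customers, the refitted line mislabels the customers aged 20 to 22:

```python
import numpy as np

# Toy data from the example: ages 10-19 purchased (1), ages 20-29 did not (0).
ages = np.arange(10, 30, dtype=float)
purchased = (ages < 20).astype(float)

def fit_line(x, y):
    """Least-squares line of best fit: returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def predict(x, slope, intercept):
    """Threshold the regression output at 0.5 to get 0/1 labels."""
    return (slope * x + intercept >= 0.5).astype(float)

# Balanced case: the cutoff falls at age 19.5, so all 20 labels are correct.
m, b = fit_line(ages, purchased)
print((predict(ages, m, b) == purchased).all())        # True

# Imbalanced case: add 10 non-purchasing customers aged 60-69 and refit.
ages2 = np.concatenate([ages, np.arange(60, 70, dtype=float)])
purchased2 = np.concatenate([purchased, np.zeros(10)])
m2, b2 = fit_line(ages2, purchased2)

# The refitted line now predicts some customers wrongly.
print(ages2[predict(ages2, m2, b2) != purchased2])     # [20. 21. 22.]
```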
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms; it comes under the supervised learning technique.
It is used for predicting a categorical dependent variable using a given set of independent variables.
Logistic regression is very similar to linear regression except in how it is used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
Logistic regression uses the same concept of predictive modeling as regression; therefore, it is called logistic regression.
Logistic Regression-Hypothesis Function
The hypothesis function that maps the given values of the input variables to the output variable is the sigmoid (logistic) function:

$$\hat{y} = f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}$$

where $x_1, x_2, \dots, x_k$ are the $k$ independent features on which the output variable depends, and $\beta_0, \beta_1, \dots, \beta_k$ are the intercept and the coefficients of the independent features.

In other words, the hypothesis function is given by:

$$\hat{y} = f(x) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$
Hypothesis function- Characteristics
The hypothesis function is:

$$\hat{y} = f(x) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

 If $z = 0$: $\hat{y} = \frac{1}{1 + e^{0}} = \frac{1}{1 + 1} = 0.5$

 If $z \to \infty$: $\hat{y} = \frac{1}{1 + e^{-\infty}} = \frac{1}{1 + 0} = 1$

 If $z \to -\infty$: $\hat{y} = \frac{1}{1 + e^{\infty}} = \frac{1}{1 + \infty} = 0$

Therefore, the value of the logistic regression output must lie between 0 and 1 and cannot go beyond this limit, so its graph forms an "S"-shaped curve.
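These three limiting cases are easy to check numerically. A minimal sketch in Python/NumPy (added for illustration; large finite values stand in for ±∞):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))     # 0.5   (z = 0)
print(sigmoid(35.0))    # ~1.0  (z -> +infinity)
print(sigmoid(-35.0))   # ~0.0  (z -> -infinity)
```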
Interpretation of Hypothesis Function
 The sigmoid (logistic) hypothesis function gives a value between 0 and 1.

 So, the output of the sigmoid function is interpreted as the probability of the label being 1 given the values of the input variables, i.e.,

$$\hat{y} = f(x) = P(y = 1 \mid x_1, x_2, \dots, x_k)$$

 Therefore, if the probability is greater than or equal to 0.5, we assign label 1; otherwise, we assign label 0.
Decision Boundary
 If $\hat{y} \ge 0.5$, we assign label 1.
This is possible iff $z \ge 0$ (because if $z \ge 0$ then $\frac{1}{1 + e^{-z}} \ge 0.5$),
i.e., $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \ge 0$

 If $\hat{y} < 0.5$, we assign label 0.
This is possible iff $z < 0$ (because if $z < 0$ then $\frac{1}{1 + e^{-z}} < 0.5$),
i.e., $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k < 0$
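The equivalence between "probability ≥ 0.5" and "z ≥ 0" means the label can be computed from the sign of z alone. A small illustrative sketch (the coefficient values here are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(beta, x):
    """beta = [b0, b1, ..., bk]; x = [x1, ..., xk]."""
    z = beta[0] + np.dot(beta[1:], x)
    # sigmoid(z) >= 0.5 holds exactly when z >= 0, so both tests agree:
    assert (sigmoid(z) >= 0.5) == (z >= 0)
    return int(z >= 0)

print(predict_label(np.array([-1.0, 2.0]), np.array([0.8])))  # z = 0.6 -> label 1
```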
Decision Boundary Contd…
 For example, in the figures shown below, depending upon the age (x₁) and the length of hair (x₂), a person is classified as male (1) or female (0).
 So, the hypothesis function will be:

$$\hat{y} = f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$$

Now, examples are labelled as male (1) if $\beta_0 + \beta_1 x_1 + \beta_2 x_2 \ge 0$, i.e., all points on or above the line $\beta_1 x_1 + \beta_2 x_2 = -\beta_0$.

And examples are labelled as female (0) if $\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0$, i.e., all points below the line $\beta_1 x_1 + \beta_2 x_2 = -\beta_0$.
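With illustrative (made-up) coefficient values, the boundary line and the labeling rule for this two-feature example look as follows:

```python
# Made-up coefficients for the age (x1) / hair-length (x2) example.
b0, b1, b2 = -2.0, 0.5, -0.8

def boundary_x2(x1):
    """Points on the decision boundary satisfy b0 + b1*x1 + b2*x2 = 0."""
    return -(b0 + b1 * x1) / b2

def label(x1, x2):
    """Male (1) where b0 + b1*x1 + b2*x2 >= 0, female (0) otherwise."""
    return int(b0 + b1 * x1 + b2 * x2 >= 0)

print(label(20, 8))    # 1 (male):   -2 + 0.5*20 - 0.8*8  =  1.6 >= 0
print(label(20, 12))   # 0 (female): -2 + 0.5*20 - 0.8*12 = -1.6 <  0
```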
Decision Boundary Contd…
 The decision boundary may be of any shape, linear or non-linear (depending upon the z function we choose inside the sigmoid function).

 The decision boundary is insensitive to balanced or imbalanced data and is a characteristic of the hypothesis function.

 For example, for the purchase-labeling problem discussed in the earlier slides, logistic regression will classify correctly in both cases (as shown in the figures on the next slide).
Decision Boundary Contd…
[Figure: logistic regression model for the 20 customers (balanced data) and for the 30 customers (imbalanced data)]
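As a quick check (not from the original slides), here is a sketch using scikit-learn's LogisticRegression as a stand-in classifier on the same toy data; it is expected to label both the balanced and the imbalanced dataset correctly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

ages = np.arange(10, 30).reshape(-1, 1)
purchased = (ages.ravel() < 20).astype(int)

# Balanced data: 20 customers.
clf = LogisticRegression().fit(ages, purchased)
print((clf.predict(ages) == purchased).all())        # expected: True

# Imbalanced data: add 10 non-purchasing customers aged 60-69.
ages2 = np.vstack([ages, np.arange(60, 70).reshape(-1, 1)])
purchased2 = np.concatenate([purchased, np.zeros(10, dtype=int)])
clf2 = LogisticRegression().fit(ages2, purchased2)
print((clf2.predict(ages2) == purchased2).all())     # expected: True
```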

Logistic Regression- Cost Function
 Logistic regression uses the same concept of predictive modeling as regression, i.e., it finds the optimal values of the coefficients (β's) by minimizing the error/cost of labeling each training example.

 But in the case of logistic regression, we do not use the mean square error (MSE) cost function given by the equation below:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik})}} \right)^2$$

 This is because, if we use the mean square error cost function with the logistic function, it gives a non-convex curve with many local minima (as shown below).

[Figure: non-convex cost curve with many local minima]
Cost Function Contd……..
Thus, for logistic regression, we use the maximum likelihood cost function (cross-entropy function), which is computed as follows for every labeled example:

$$Cost = \begin{cases} -\log(f(x)) & \text{if } y = 1 \\ -\log(1 - f(x)) & \text{if } y = 0 \end{cases}$$

where y is the actual value of the training example and f(x) gives the corresponding predicted value from the sigmoid function.

The cross-entropy cost function with the logistic function gives a convex curve with a single global minimum.

It adds zero cost if the actual and the predicted values are the same (i.e., both zero or both one); otherwise, it adds a positive cost proportional to the difference between the actual and predicted values (shown in the figure on the next slide).
Cost Function Contd…..

[Figure: cost curves $-\log(f(x))$ for $y = 1$ and $-\log(1 - f(x))$ for $y = 0$]

Cost Function Contd…..
The two separate equations for y = 1 and y = 0 can be combined into a single equation as follows:

$$Cost = -\big[\, y \log(f(x)) + (1 - y) \log(1 - f(x)) \,\big]$$

 When y = 1, this reduces to $-\log(f(x))$; and when y = 0, it reduces to $-\log(1 - f(x))$.

 The total error over all n training examples is thus computed as:

$$J = -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i \log(f(x_i)) + (1 - y_i) \log(1 - f(x_i)) \,\big]$$

 This cost function is a function of the coefficients of the input variables (β's), whose optimal values are computed using optimization techniques like gradient descent.
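A minimal NumPy sketch of this total cost (illustrative; X is assumed to carry a leading column of ones for the intercept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(beta, X, y):
    """J(beta) = -(1/n) * sum( y*log(f(x)) + (1-y)*log(1-f(x)) )."""
    f = sigmoid(X @ beta)
    eps = 1e-12                      # guard against log(0)
    return -np.mean(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))

# Confident correct predictions give near-zero cost:
X = np.array([[1.0, 2.0], [1.0, -2.0]])   # first column is the intercept term
y = np.array([1.0, 0.0])
print(cross_entropy_cost(np.array([0.0, 3.0]), X, y))   # ~0.0025
```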
Gradient Descent Optimization for
Logistic Regression
 In logistic regression also, we use gradient descent optimization to find the optimal values of the β's by minimizing the total cost over the training examples.
 Gradient descent optimization uses the gradient (slope/derivative) of the cost function.
 First, let's find the partial derivative of the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$ with respect to some variable z on which x depends:

$$\frac{\partial f(x)}{\partial z} = -1 \times (1 + e^{-x})^{-2} \times \frac{\partial (1 + e^{-x})}{\partial z} = (1 + e^{-x})^{-2} \times e^{-x} \times \frac{\partial x}{\partial z}$$

$$= \frac{1}{1 + e^{-x}} \times \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} \times \frac{\partial x}{\partial z} = \frac{1}{1 + e^{-x}} \times \left(1 - \frac{1}{1 + e^{-x}}\right) \times \frac{\partial x}{\partial z}$$

$$= f(x)\,(1 - f(x))\,\frac{\partial x}{\partial z}$$

 Thus, the partial derivative of the sigmoid function f(x) with respect to some variable z is the product of f(x), (1 − f(x)), and the derivative of the exponent with respect to z.
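This identity is easy to verify numerically with a central finite difference (a small sanity-check sketch, not part of the original slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check d(sigmoid)/dz = f(z) * (1 - f(z)) at an arbitrary point.
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-8)   # True
```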
Gradient Descent Optimization for
Logistic Regression (Contd….)
 For logistic regression, the cost function is given by:

$$J = -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i \log(f(x_i)) + (1 - y_i) \log(1 - f(x_i)) \,\big]$$

where $f(x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik})}}$

 The gradient of the cost function with respect to any jth coefficient is given by:

$$\frac{\partial J}{\partial \beta_j} = -\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial\, y_i \log(f(x_i))}{\partial \beta_j} + \frac{\partial\, (1 - y_i) \log(1 - f(x_i))}{\partial \beta_j} \right]$$

$$= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \times \frac{1}{f(x_i)} \times \frac{\partial f(x_i)}{\partial \beta_j} + (1 - y_i) \times \frac{1}{1 - f(x_i)} \times \frac{\partial (1 - f(x_i))}{\partial \beta_j} \right]$$

$$= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \times \frac{1}{f(x_i)} \times f(x_i)(1 - f(x_i)) \times x_{ij} + (1 - y_i) \times \frac{1}{1 - f(x_i)} \times \big(0 - f(x_i)(1 - f(x_i)) \times x_{ij}\big) \right]$$

(using the derivative of the sigmoid function computed on the previous slide, together with the fact that the derivative of the exponent $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$ with respect to $\beta_j$ is the input variable value $x_{ij}$)

$$= -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i (1 - f(x_i))\, x_{ij} - (1 - y_i)\, f(x_i)\, x_{ij} \,\big]$$

$$= -\frac{1}{n} \sum_{i=1}^{n} x_{ij} \big[\, y_i - y_i f(x_i) - f(x_i) + y_i f(x_i) \,\big] = \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i) - y_i\big)\, x_{ij}$$
Gradient Descent Optimization for
Logistic Regression (Contd….)
 Thus, the derivative of the cost function with respect to $\beta_j$ is the same as in the case of linear regression.

 The only difference is that in linear regression the hypothesis function is a linear function of the input variables, whereas in logistic regression the hypothesis function is a sigmoid function of the input variables.

 The gradient descent optimization for logistic regression is summarized below:

1. Initialize $\beta_0 = 0, \beta_1 = 0, \beta_2 = 0, \dots, \beta_k = 0$

2. Update the parameters until convergence, or for a fixed number of iterations, using the following equation:

$$\beta_j = \beta_j - \frac{\alpha}{n} \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik})}} - y_i \right) x_{ij}$$

for $j = 0, 1, 2, \dots, k$, where $x_{i0} = 1$, $\alpha$ is the learning rate, and $k$ is the number of input features.
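Putting the update rule together, here is a minimal from-scratch sketch of the procedure (Python/NumPy; the helper names and the feature scaling are my own, and X is assumed to carry a leading column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    X: n-by-(k+1) matrix whose first column is all ones (x_i0 = 1).
    y: length-n vector of 0/1 labels."""
    n, d = X.shape
    beta = np.zeros(d)                                 # step 1: initialize all betas to 0
    for _ in range(n_iters):                           # step 2: repeated updates
        gradient = X.T @ (sigmoid(X @ beta) - y) / n   # (1/n) * sum (f(x_i) - y_i) * x_ij
        beta -= alpha * gradient
    return beta

# Usage on the purchase example (ages 10-19 purchased, ages 20-29 did not):
ages = np.arange(10, 30)
X = np.column_stack([np.ones(20), (ages - ages.mean()) / ages.std()])
y = (ages < 20).astype(float)
beta = fit_logistic(X, y)
print(((sigmoid(X @ beta) >= 0.5) == (y == 1)).all())   # True
```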
Logistic Regression for Multi-Class
Classification
 We use a strategy called one-vs-all (one-vs-rest) classification, where we train a binary classifier for each distinct class and choose the class that has the largest value returned by the sigmoid function.

 For instance, consider a classification problem in which there are two input variables on the basis of which the examples are classified into three classes (marked as triangles, crosses, and squares in the figure).
Logistic Regression for Multi-Class
Classification (Contd…..)
 For each binary classifier that we train, we need to relabel the data such that the output for our class of interest is set to 1 and all other labels are set to 0.

 As an example, if we have 3 groups A (0), B (1), and C (2), we must build three binary classifiers:
(1) A set to 1, B and C set to 0
(2) B set to 1, A and C set to 0
(3) C set to 1, A and B set to 0

 Each binary classifier gives the probability of the ith label given the input feature values:

$$f_i(x) = P(y = i \mid x_1, x_2)$$

 After training, for each test case we choose the class for which this probability is maximum, i.e., the class with the largest value returned by the sigmoid function (as shown in the figure):

$$\hat{i} = \underset{i}{\operatorname{argmax}}\, f_i(x)$$
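A compact sketch of this strategy (illustrative; `fit_binary` is any binary trainer, e.g., the `fit_logistic` gradient-descent sketch shown earlier):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_rest_fit(X, y, classes, fit_binary):
    """Train one binary classifier per class: class c -> 1, all other classes -> 0."""
    return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

def one_vs_rest_predict(X, betas):
    """Pick, for each row of X, the class whose classifier gives the largest sigmoid value."""
    classes = list(betas)
    scores = np.column_stack([sigmoid(X @ betas[c]) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

# Usage:
# betas = one_vs_rest_fit(X, y, classes=[0, 1, 2], fit_binary=fit_logistic)
# y_hat = one_vs_rest_predict(X, betas)
```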
Logistic Regression for Multi-Class
Classification (Contd…..)
[Figure: one-vs-rest decision regions for the three classes]
Regularization for Logistic Regression
 Overfitting is also a problem for classification models, as we may fit a very complex decision boundary (with lots of curves and angles) that accounts for every training example but does not generalize well.

 The problem of overfitting can be handled using regularization, which shrinks the coefficients of the input variables, thereby smoothing the decision boundary so that it generalizes well.
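One common form is L2 regularization, which adds a penalty (λ/2n)·Σⱼβⱼ² to the cost, so the gradient for each non-intercept coefficient gains a (λ/n)·βⱼ term. A sketch extending the earlier gradient-descent trainer (conventions assumed as before):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_l2(X, y, alpha=0.1, lam=1.0, n_iters=1000):
    """Gradient descent with an L2 penalty that shrinks the coefficients
    and thereby smooths the decision boundary."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        gradient = X.T @ (sigmoid(X @ beta) - y) / n
        gradient[1:] += (lam / n) * beta[1:]   # do not penalize the intercept beta_0
        beta -= alpha * gradient
    return beta
```

Larger values of lam shrink the coefficients more aggressively, trading some training accuracy for a smoother, better-generalizing boundary.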
