Classification-Introduction, Logistic Regression
1) Binary Classification
It is a type of classification problem in which the output variable has only two values (True/False, 0/1, Yes/No).
Examples of binary classification are email spam detection (spam/ham), medical testing (patient having a disease or not), and customer risk analysis (fraudulent/non-fraudulent).
2) Multi-Class Classification
It is a type of classification problem in which the output variable
has more than two discrete values.
For example, risk evaluation of customers (low risk, medium
risk, high risk), text classification into different categories
(sports, politics, entertainment), etc.
Types of Classification (Contd…)
3) Multi-Label Classification
It is a generalization of multi-class classification in which each example can be labelled with multiple categories simultaneously (for example, a news article tagged with both sports and politics).
Hypothesis Function for Logistic Regression
The hypothesis function that maps the given values of the input variables to the output variable is the sigmoid (logistic) function, given by:
$$\hat{y} = f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$

where $x_1, x_2, \ldots, x_k$ are the $k$ independent features on which the output variable depends and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the independent features.
In other words, the hypothesis function is given by:

$$\hat{y} = f(x) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
Hypothesis Function - Characteristics

If $z = 0$: $\hat{y} = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{0}} = \frac{1}{1+1} = 0.5$

If $z = \infty$: $\hat{y} = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-\infty}} = \frac{1}{1+0} = 1$

If $z = -\infty$: $\hat{y} = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{\infty}} = \frac{1}{1+\infty} = 0$
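For concreteness, here is a minimal Python sketch of this hypothesis function; the function names, coefficient values, and feature values are illustrative, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(x, beta):
    """y_hat = f(x) for one example.

    x    : feature vector [x1, ..., xk]
    beta : coefficients   [b0, b1, ..., bk] (b0 is the intercept)
    """
    z = beta[0] + np.dot(beta[1:], x)   # z = b0 + b1*x1 + ... + bk*xk
    return sigmoid(z)

# Illustrative values only
beta = np.array([0.5, -1.2, 0.8])
x = np.array([2.0, 1.0])
print(hypothesis(x, beta))                           # a probability in (0, 1)
print(sigmoid(0.0), sigmoid(50.0), sigmoid(-50.0))   # ~0.5, ~1, ~0 (the limiting values above)
```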
Interpretation of Hypothesis Function
If $\hat{y} \geq 0.5$, we predict $y = 1$.
This is possible iff $z \geq 0$ (because if $z \geq 0$ then $\frac{1}{1+e^{-z}} \geq 0.5$),
i.e., $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \geq 0$.

If $\hat{y} < 0.5$, we predict $y = 0$.
This is possible iff $z < 0$ (because if $z < 0$ then $\frac{1}{1+e^{-z}} < 0.5$),
i.e., $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k < 0$.
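A small sketch of this decision rule; the names, threshold default, and example values are illustrative assumptions:

```python
import numpy as np

def predict_class(x, beta, threshold=0.5):
    """Predict 1 if f(x) >= threshold, else 0.

    Equivalent to checking the sign of z = b0 + b1*x1 + ... + bk*xk,
    since f(x) >= 0.5 exactly when z >= 0.
    """
    z = beta[0] + np.dot(beta[1:], x)
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return 1 if y_hat >= threshold else 0

beta = np.array([0.5, -1.2, 0.8])               # illustrative coefficients
print(predict_class(np.array([2.0, 1.0]), beta))
print(predict_class(np.array([0.0, 0.0]), beta))
```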
Decision Boundary Contd…
Logistic regression uses the same concept of predictive modeling as regression, i.e., it finds the optimal values of the coefficients (β's) by minimizing the error/cost in labeling each training example.
However, the mean square error cost function cannot be used here: with the logistic function it gives a non-convex cost, which results in many local minima (as shown below).
Thus, for logistic regression, we use maximum likelihood cost function (cross entropy function)
which is computed as follows for every labeled example:
$$\text{Cost or Error} = \begin{cases} -\log(f(x)) & \text{if } y = 1 \\ -\log(1 - f(x)) & \text{if } y = 0 \end{cases}$$
where, y is the actual value of the training example and f(x) gives the corresponding predicted
value given by the sigmoid function.
The cross entropy cost function with the logistic function gives a convex curve with a single global minimum.
It adds zero cost if the actual and predicted values are the same (i.e., both zero or both one); otherwise, it adds a positive cost that grows with the difference between the actual and predicted values (shown in the figure on the next slide).
Cost Function Contd…..
The two separate equations for y = 1 and y = 0 can be combined into a single equation as follows:

$$\text{Cost} = -\left[\, y \log(f(x)) + (1 - y)\log(1 - f(x)) \,\right]$$
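A minimal sketch of this combined cross-entropy cost, averaged over a set of labeled examples; the function name, the small clipping constant, and the sample numbers are illustrative assumptions:

```python
import numpy as np

def cross_entropy_cost(y_true, y_pred, eps=1e-12):
    """J = -(1/n) * sum[ y*log(f(x)) + (1-y)*log(1-f(x)) ].

    y_true : actual labels (0 or 1)
    y_pred : predicted probabilities f(x) from the sigmoid
    eps    : small constant to avoid log(0)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Accurate predictions add (almost) zero cost; wrong ones add a large cost.
print(cross_entropy_cost(np.array([1, 0]), np.array([0.99, 0.01])))  # small
print(cross_entropy_cost(np.array([1, 0]), np.array([0.01, 0.99])))  # large
```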
Gradient Descent Optimization for Logistic Regression
In logistic regression also, we use gradient descent optimization for finding the optimal values of the β's by minimizing the total cost over the training examples.
Gradient descent optimization uses the gradient (slope/derivative) of the cost function.
First, let's find the partial derivative of the sigmoid function $f(x) = \frac{1}{1+e^{-x}}$ w.r.t. some variable $z$:

$$\frac{\partial f(x)}{\partial z} = -1 \times (1+e^{-x})^{-1-1} \times \frac{\partial (1+e^{-x})}{\partial z} = -(1+e^{-x})^{-2} \times \left(0 + e^{-x} \times \frac{\partial (-x)}{\partial z}\right)$$

$$= \frac{e^{-x}}{(1+e^{-x})^{2}} \times \frac{\partial x}{\partial z} = \frac{1}{1+e^{-x}} \times \frac{(1+e^{-x}) - 1}{1+e^{-x}} \times \frac{\partial x}{\partial z} = \frac{1}{1+e^{-x}} \times \left(1 - \frac{1}{1+e^{-x}}\right) \times \frac{\partial x}{\partial z}$$

$$= f(x)\,(1 - f(x))\,\frac{\partial x}{\partial z}$$

Thus, the partial derivative of the sigmoid function f(x) w.r.t. some variable z is the product of f(x), (1 − f(x)), and the derivative of the power (the exponent x) w.r.t. z.
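As a quick sanity check of this result, here is a small sketch comparing the analytic derivative f(x)(1 − f(x)) with a numerical finite-difference estimate, for the simple case where we differentiate w.r.t. x itself (so the power derivative is 1); the values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative: f(x) * (1 - f(x))."""
    f = sigmoid(x)
    return f * (1.0 - f)

# Finite-difference check at a few points; the two columns should agree closely.
for x in [-2.0, 0.0, 1.5]:
    h = 1e-6
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_grad(x), numeric)
```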
Gradient Descent Optimization for
Logistic Regression (Contd….)
For logistic regression, the cost function is given by:

$$J = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log(f(x_i)) + (1-y_i)\log(1-f(x_i)) \,\right]$$

where $f(x_i) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})}}$

The gradient of the cost function w.r.t. any jth coefficient is given by:

$$\frac{\partial J}{\partial \beta_j} = -\frac{1}{n}\sum_{i=1}^{n}\left[ \frac{\partial\, y_i\log(f(x_i))}{\partial \beta_j} + \frac{\partial\,(1-y_i)\log(1-f(x_i))}{\partial \beta_j} \right]$$

$$= -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i\,\frac{\partial \log(f(x_i))}{\partial \beta_j} + (1-y_i)\,\frac{\partial \log(1-f(x_i))}{\partial \beta_j} \right]$$

$$= -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \times \frac{1}{f(x_i)} \times \frac{\partial f(x_i)}{\partial \beta_j} + (1-y_i) \times \frac{1}{1-f(x_i)} \times \frac{\partial (1-f(x_i))}{\partial \beta_j} \right]$$

$$= -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \times \frac{1}{f(x_i)} \times f(x_i)(1-f(x_i)) \times x_{ij} + (1-y_i) \times \frac{1}{1-f(x_i)} \times \big(0 - f(x_i)(1-f(x_i)) \times x_{ij}\big) \right]$$

(using the derivative of the sigmoid function computed on the previous slide, and the fact that the derivative of the power $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$ w.r.t. $\beta_j$ is the input variable value $x_{ij}$)

$$= -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i (1 - f(x_i))\, x_{ij} - (1-y_i)\, f(x_i)\, x_{ij} \right]$$

$$= -\frac{1}{n}\sum_{i=1}^{n} x_{ij}\left[ y_i - f(x_i)\, y_i - f(x_i) + y_i f(x_i) \right]$$

$$= \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i) \times x_{ij}$$
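A vectorized sketch of this final gradient expression, $(1/n)\sum_i (f(x_i) - y_i)\,x_{ij}$, over a whole design matrix; the array names and toy numbers are illustrative, and a column of ones is prepended so that the intercept β0 is handled uniformly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, beta):
    """dJ/dbeta_j = (1/n) * sum_i (f(x_i) - y_i) * x_ij for every j.

    X    : (n, k+1) matrix with a leading column of ones (x_i0 = 1)
    y    : (n,) vector of 0/1 labels
    beta : (k+1,) coefficient vector
    """
    n = X.shape[0]
    f = sigmoid(X @ beta)          # predicted probabilities f(x_i)
    return (X.T @ (f - y)) / n     # one partial derivative per coefficient

# Tiny illustrative example
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])  # first column is x_i0 = 1
y = np.array([0, 1, 0])
print(gradient(X, y, np.zeros(2)))
```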
Gradient Descent Optimization for
Logistic Regression (Contd….)
This gradient has the same form as the gradient in linear regression. The only difference is that in the case of linear regression the hypothesis function is a linear function of the input variables, whereas in logistic regression the hypothesis function is a sigmoid function of the input variables.
The gradient descent optimization for logistic regression is summarized below:
1. Initialize $\beta_0 = 0,\ \beta_1 = 0,\ \beta_2 = 0, \ldots, \beta_k = 0$
2. Update the parameters until convergence, or for a fixed number of iterations, using the following equation:

$$\beta_j = \beta_j - \frac{\alpha}{n}\sum_{i=1}^{n}\left( \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})}} - y_i \right) \times x_{ij}$$

for $j = 0, 1, 2, \ldots, k$, where $x_{i0} = 1$, $\alpha$ is the learning rate, and $k$ is the total number of features.
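Putting the pieces together, here is a minimal sketch of this update rule as a training loop; the learning rate, iteration count, and toy data are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X : (n, k) feature matrix (without the column of ones)
    y : (n,) vector of 0/1 labels
    Returns the (k+1,) coefficient vector [b0, b1, ..., bk].
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])    # prepend x_i0 = 1 for the intercept
    beta = np.zeros(Xb.shape[1])            # step 1: initialize all betas to 0
    for _ in range(n_iters):                # step 2: repeat the update
        f = sigmoid(Xb @ beta)
        beta -= (alpha / n) * (Xb.T @ (f - y))
    return beta

# Toy illustrative data: the label is 1 when the single feature is large
X = np.array([[0.2], [0.8], [1.5], [2.5]])
y = np.array([0, 0, 1, 1])
print(fit_logistic_regression(X, y))
```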
Logistic Regression for Multi-Class
Classification
To apply logistic regression to a multi-class problem, we train one binary classifier per class (one-vs-all).
For each binary classifier that we train, we relabel the data such that the outputs for the class of interest are set to 1 and all other labels are set to 0.

As an example, if we have 3 groups A (0), B (1), and C (2), we must build three binary classifiers:
(1) A set to 1, B and C set to 0
(2) B set to 1, A and C set to 0
(3) C set to 1, A and B set to 0

Each binary classifier gives the probability of the ith label given the input feature values:

$$f_i(x) = P(y = i \mid x_1, x_2)$$

We then choose the label for which this probability is maximum:

$$i = \arg\max_i f_i(x)$$

That is, after training, for each test case we choose the class that has the largest value returned by the sigmoid function (as shown in the figure).
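A minimal one-vs-all sketch under the same assumptions; it reuses a binary trainer like the one sketched earlier, and the class labels and toy data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, n_iters=1000):
    """Binary logistic regression by gradient descent (as sketched earlier)."""
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        beta -= (alpha / n) * (Xb.T @ (sigmoid(Xb @ beta) - y))
    return beta

def fit_one_vs_all(X, y, classes):
    """Train one binary classifier per class: class of interest -> 1, all others -> 0."""
    return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(X, models):
    """Pick, for each test case, the class whose classifier returns the largest sigmoid value."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    classes = list(models)
    probs = np.vstack([sigmoid(Xb @ models[c]) for c in classes])  # (num_classes, n)
    return np.array([classes[j] for j in probs.argmax(axis=0)])

# Illustrative 3-class data: groups A (0), B (1), C (2) spread along one feature
X = np.array([[0.1], [0.3], [1.0], [1.2], [2.4], [2.6]])
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_one_vs_all(X, y, classes=[0, 1, 2])
print(predict_one_vs_all(X, models))
```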
Logistic Regression for Multi-Class
Classification (Contd…..)
Regularization for Logistic Regression