Logistic Regression
Suppose we want to do binary classification, in which every example belongs to one of two
classes. Alternatively, there may be multiple classes, but we are interested only in predicting
whether an example belongs to a specified class or not. The latter problem is that of class
learning or concept learning and has been extensively studied in the ML literature.
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be our training data, where $y_i$ takes only two values, say 0
and 1. The logistic regression model models $p(x) = P(Y = 1 \mid X = x)$ instead of directly
modeling the value of $Y$, as in the case of linear regression.
Since we are modeling a probability, it is necessary to choose a function that takes values
between 0 and 1 for any possible value of the feature $X$. A sigmoid function can serve this purpose.
Sigmoid function: a bounded, differentiable, real function that is defined for all real input
values and has a non-negative derivative at each point.
Logistic regression uses one such sigmoid function, called the logistic function. The logistic function
is defined as
$$g(y) = \frac{1}{1 + e^{-y}}$$
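As a quick illustration, here is a minimal sketch in R (the function name g is just an example) showing that the logistic function's values stay strictly between 0 and 1:

g <- function(y) 1 / (1 + exp(-y))   # the logistic function
g(c(-Inf, -2, 0, 2, Inf))
# [1] 0.0000000 0.1192029 0.5000000 0.8807971 1.0000000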
To be more specific, logistic regression models
$$P(Y = 1 \mid X = x) = p(x) = \frac{1}{1 + e^{-\beta' x}}$$
where $\beta' x$ denotes the linear combination $\beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d$ (with an intercept term $\beta_0$).
It may be noted that the above relationship between $p$ and $x$ can also be represented as
$$\log \frac{p(x)}{1 - p(x)} = \beta' x$$
The left-hand side of the above expression is the log of the odds, also called the logit. Thus,
logistic regression is essentially a linear regression model for the logit.
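To see why the two forms are equivalent, a one-line rearrangement suffices:
$$\frac{p(x)}{1 - p(x)} = \frac{1/(1 + e^{-\beta' x})}{e^{-\beta' x}/(1 + e^{-\beta' x})} = e^{\beta' x}, \qquad \text{so} \quad \log \frac{p(x)}{1 - p(x)} = \beta' x.$$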
Remark: The logistic regression model can also be viewed as a particular type of Generalized
Linear Model (GLM), which we will briefly describe later.
Learning the logistic regression model
The parameters of the logistic regression model are typically learned from the training
examples using the maximum likelihood approach. The likelihood function of $\beta$, based on the
observed values $(y_1, y_2, \ldots, y_n)$ of $Y$, is given by
$$L(\beta \mid y) = \prod_{i=1}^{n} P(Y_i = y_i \mid X_i = x_i) = \prod_{i=1}^{n} p(x_i)^{y_i} \,(1 - p(x_i))^{1 - y_i}$$
Taking logarithms, the log-likelihood is
$$\ell(\beta \mid y) = \sum_{i=1}^{n} \left\{ y_i \log \frac{p(x_i)}{1 - p(x_i)} + \log\!\left(1 - p(x_i)\right) \right\} = \sum_{i=1}^{n} \left\{ y_i \,(\beta' x_i) - \log\!\left(1 + e^{\beta' x_i}\right) \right\}$$
In order to maximize the above expression with respect to $\beta$, we can differentiate with respect to $\beta$,
equate to 0, and solve the resulting equation for $\beta$. Differentiating gives the score equation
$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{n} \left( y_i - p(x_i) \right) x_i = 0$$
However, a closed-form expression for $\beta$ is not available. Thus, one has to apply numerical
iterative methods such as the Newton-Raphson method or the gradient descent method to find the MLE of $\beta$.
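The following is a minimal sketch in R of Newton-Raphson for this problem; the function and variable names, starting value, and stopping rule are illustrative assumptions, not a fixed recipe:

sigmoid <- function(z) 1 / (1 + exp(-z))

# Newton-Raphson for the logistic regression MLE.
# X: n x d design matrix whose first column is all 1s (intercept); y: 0/1 vector.
logistic_mle <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                    # start from beta = 0
  for (iter in 1:max_iter) {
    p    <- as.vector(sigmoid(X %*% beta))   # fitted probabilities p(x_i)
    grad <- t(X) %*% (y - p)                 # score: sum_i (y_i - p(x_i)) x_i
    H    <- t(X) %*% (X * (p * (1 - p)))     # observed information: X' W X
    step <- solve(H, grad)                   # Newton step: H^{-1} grad
    beta <- beta + as.vector(step)
    if (max(abs(step)) < tol) break          # stop when updates are negligible
  }
  beta
}

On the same data, the result should agree (up to convergence tolerance) with R's built-in glm(y ~ ., family = binomial), which uses the closely related iteratively reweighted least squares.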
Making predictions
Once the logistic regression model is learned from the training data, it gives a predicted
value of $P(Y = 1 \mid X = x)$ for a new example $x$. Typically, if the predicted probability is greater
than or equal to 0.5, $Y$ is predicted as 1; otherwise $Y$ is predicted as 0.
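In R, assuming fit is a fitted model returned by glm(..., family = binomial) and new_data holds the new examples (both names are illustrative), this rule is a two-liner:

p_hat <- predict(fit, newdata = new_data, type = "response")  # P(Y = 1 | X = x)
y_hat <- as.integer(p_hat >= 0.5)                             # threshold at 0.5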
Example
Let us consider a data set that contains the characteristics of breast tumors along
with the diagnosis of whether each tumor is malignant (cancerous) or benign (non-cancerous). We want to
use this data as the training data to develop a classifier that predicts whether a new
sample with given characteristics is malignant or not.
The dataset contains several features. However, we will use only one feature, ‘mean area’, to
build a logistic regression model.
First, let's have a look at the box plots of ‘mean area’ for malignant and benign examples.
The plots show that there is a significant difference between the values of ‘mean area’ for the
two classes of examples. This indicates that ‘mean area’ may be a useful feature for predicting
the type of tumor.
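Such box plots can be produced with one call in R; for illustration, assume a data frame dat with columns area_mean and diagnosis (these names mirror the model output below but are not fixed by the text):

boxplot(area_mean ~ diagnosis, data = dat,
        xlab = "diagnosis", ylab = "mean area",
        main = "Mean area by tumor type")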
When the simple logistic regression model is trained, the details of the learned model are as
given below.
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.098550   0.769236 -10.528   <2e-16 ***
area_mean    0.011969   0.001222   9.793   <2e-16 ***
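For completeness, here is a hedged sketch of how such a model could have been fit in R; the data frame name dat and the 0/1 encoding of the diagnosis column are assumptions, and only the variable name area_mean is taken from the output above:

# diagnosis assumed coded as 1 = malignant, 0 = benign
fit <- glm(diagnosis ~ area_mean, data = dat, family = binomial)
summary(fit)   # prints a coefficient table like the one shown above

Reading the output: the fitted model is $\text{logit}(p) = -8.0986 + 0.011969 \times \text{area\_mean}$, and both coefficients are highly significant. Each unit increase in mean area multiplies the odds of malignancy by $e^{0.011969} \approx 1.012$.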