Binary Logistic Regression From Scratch
In [2]:
# required imports for displaying the reference images
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams

# figure size in inches (optional)
rcParams['figure.figsize'] = 11, 8
# read images
img_A = mpimg.imread('Classification_Using_Linear_Regression.png')
img_B = mpimg.imread('Classification_Using_Linear_Regression_Issue.png')
# display images side by side
fig, ax = plt.subplots(1, 2)
ax[0].imshow(img_A);
ax[1].imshow(img_B);
Hypothesis Function
Since our objective is to get a discrete value (0 or 1), we will create a hypothesis function that returns values between 0 and 1.
The sigmoid function does exactly that: it maps the whole real number line into the range 0 to 1. It is also called the logistic function.

$$g(z) = \frac{1}{1 + e^{-z}}$$

The term sigmoid means 'S-shaped', and when plotted this function gives an S-shaped curve.

$$z = \theta^T x$$

$$h_\theta(x) = g(z) = \frac{1}{1 + e^{-\theta^T x}}$$

Basically, we are feeding the linear function into the sigmoid function in order to map its output into the range 0 to 1. The way our sigmoid function g(z) behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5.
Since a positive input results in the positive class and a negative input results in the negative class, we can separate the two classes by setting the weighted sum of inputs to 0, i.e.

$$z = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = 0$$
Decision Boundary
The decision boundary separates the positive class from the negative class.
The decision boundary is the line that separates the area where y = 0 from the area where y = 1. It is created by our hypothesis function.
As explained earlier, the decision boundary can be found by setting the weighted sum of inputs to 0.
Let's derive the decision boundary formula for a two-feature ($x_1$ and $x_2$) dataset:

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}$$
Cost Function
To find the optimum values of the theta parameters we have to try multiple values and then choose the best ones based on how well the predicted classes match the given data. To do this we will create a cost function (J). The inner working of the cost function is as follows:
We execute the hypothesis function with the current theta values to get the predicted value for every training example.
We then compare the predicted values with the actual target values from the training data.
If a predicted value matches the actual value the cost is 0; otherwise the cost is heavily penalized.
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)} \log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$

In vectorized form, with $h = g(X\theta)$:

$$J(\theta) = \frac{1}{m}\Big(-y^T \log(h) - (1 - y)^T \log(1 - h)\Big)$$
Just like in linear regression, the logistic cost function is a convex function, so the optimum theta values are the ones for which the cost is minimum.
The gradient of the cost with respect to the parameters, in vectorized form, is

$$\nabla_\theta J(\theta) = \frac{1}{m} X^T \big(g(X\theta) - y\big)$$

Note that while this gradient looks identical to the linear regression gradient, the formula is actually different, because linear and logistic regression have different definitions of $h_\theta(x)$.
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Notations used
m = number of training examples (number of rows of the feature matrix)
n = number of features (number of columns of the feature matrix)
X's = input variables / independent variables / features
y's = output variables / dependent variables / target / labels
The data consists of marks from two exams for 100 applicants. The target value takes the binary values 1 and 0, where 1 means the applicant was admitted to the university and 0 means the applicant did not get admission. The objective is to build a classifier that can predict whether an applicant will be admitted to the university or not.
In [4]:
df = pd.read_csv('admission_basedon_exam_scores.csv')
m, n = df.shape
print('Number of training examples m = ', m)
print('Number of features n = ', n - 1) # Not counting the 'Label: Admission status'
df.sample(15) # Show 15 random training examples
Out [4]:
Exam 1 marks Exam 2 marks Admission status
57 32.577200 95.598548 0
24 77.924091 68.972360 1
88 78.635424 96.647427 1
8 76.098787 87.420570 1
28 61.830206 50.256108 0
38 74.789253 41.573415 0
15 53.971052 89.207350 1
7 75.024746 46.554014 1
79 82.226662 42.719879 0
99 74.775893 89.529813 1
39 34.183640 75.237720 0
64 44.668262 66.450086 0
81 94.834507 45.694307 1
5 45.083277 56.316372 0
84 80.366756 90.960148 1
Data Understanding
There are 100 training examples in total (m = 100, i.e. 100 rows)
There are two features: Exam 1 marks and Exam 2 marks
The label column contains the admission status, where '1' means admitted and '0' means not admitted
Total number of features (n) = 2 (later we will add a column of ones (x_0) to make it 3)
In [5]:
df_admitted = df[df['Admission status'] == 1]
print('Dimension of df_admitted= ', df_admitted.shape)
df_admitted.sample(10)
Out [5]:
Exam 1 marks Exam 2 marks Admission status
42 94.443368 65.568922 1
94 89.845807 45.358284 1
82 67.319257 66.589353 1
88 78.635424 96.647427 1
93 74.492692 84.845137 1
30 61.379289 72.807887 1
8 76.098787 87.420570 1
21 89.676776 65.799366 1
84 80.366756 90.960148 1
77 50.458160 75.809860 1
In [6]:
df_notadmitted = df[df['Admission status'] == 0]
print('Dimension of df_notadmitted= ', df_notadmitted.shape)
df_notadmitted.sample(5)
Out [6]:
Exam 1 marks Exam 2 marks Admission status
23 34.212061 44.209529 0
70 32.722833 43.307173 0
10 95.861555 38.225278 0
17 67.946855 46.678574 0
92 55.482161 35.570703 0
Data Visualization
To plot the data of admitted and not-admitted applicants, we first need to create a separate dataframe for each class (admitted / not admitted).
In [8]:
plt.figure(figsize = (5,5))
plt.scatter(df_admitted['Exam 1 marks'], df_admitted['Exam 2 marks'], color='green', label='Admitted')
plt.scatter(df_notadmitted['Exam 1 marks'], df_notadmitted['Exam 2 marks'], color='red', label='Not Admitted')
plt.xlabel('Exam 1 Marks')
plt.ylabel('Exam 2 Marks')
plt.legend()
plt.title('Admitted Vs Not Admitted Applicants')
In [9]:
# Get feature columns from dataframe
X = df.iloc[:, 0:2].values
# Add a column of ones (intercept term)
X = np.hstack((np.ones((m, 1)), X))
# Now X is a 2-dimensional numpy array
print("Dimension of feature matrix X = ", X.shape, '\n')
y = df.iloc[:, -1].values
# First 5 training examples with labels
for i in range(5):
    print('x =', X[i, :], ', y =', y[i])
x = [ 1. 34.62365962 78.02469282] , y = 0
x = [ 1. 30.28671077 43.89499752] , y = 0
x = [ 1. 35.84740877 72.90219803] , y = 0
x = [ 1. 60.18259939 86.3085521 ] , y = 1
x = [ 1. 79.03273605 75.34437644] , y = 1
In [11]:
# Initialize theta with zeros: one parameter per column of X (intercept + 2 features), so n = 3
theta = np.zeros(n)
theta
In [12]:
def sigmoid(z):
    """
    To convert a continuous value into the range 0 to 1
    I/P
    ----------
    z : Continuous value (scalar or numpy array)
    O/P
    -------
    Value in the range 0 to 1.
    """
    g = 1 / (1 + np.exp(-z))
    return g
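A quick sanity check of the threshold behaviour described earlier (a minimal illustrative sketch; the input values are arbitrary):
# g(0) = 0.5; positive inputs map above 0.5, negative inputs below 0.5
print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # approximately [0.0067 0.5 0.9933]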
A vectorized implementation of the cost and gradient formulas, for better performance
In [14]:
def cost_function(theta, X, y):
    """
    Compute cost for logistic regression.
    I/P
    ----------
    theta : 1D array of parameters. dimension (1 x n)
    X : 2D array where each row represents a training example and each column represents a feature.
        m = number of training examples
        n = number of features (including the X_0 column of ones)
    y : 1D array of labels/target values for each training example. dimension (1 x m)
    O/P
    -------
    J : The cost of using theta as the parameter for logistic regression to fit the data points
    """
    m, n = X.shape
    x_dot_theta = X.dot(theta)
    h = sigmoid(x_dot_theta)
    # Vectorized cost: J = (1/m) * (-y' log(h) - (1 - y)' log(1 - h))
    J = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h)))
    return J

def gradient(theta, X, y):
    """
    Compute the gradient of the cost with respect to the parameters theta.
    I/P
    ----------
    theta : 1D array of parameters. dimension (1 x n)
    X : 2D array where each row represents a training example and each column represents a feature.
        m = number of training examples
        n = number of features (including the X_0 column of ones)
    y : 1D array of labels/target values for each training example. dimension (1 x m)
    O/P
    -------
    grad : (numpy array) The gradient of the cost with respect to the parameters theta
    """
    m, n = X.shape
    x_dot_theta = X.dot(theta)
    # Vectorized gradient: grad = (1/m) * X' (g(X theta) - y)
    grad = (1 / m) * X.T.dot(sigmoid(x_dot_theta) - y)
    return grad
Testing the cost_function() and gradient() using the initial theta values
cost = cost_function(theta, X, y)
print('Cost at initial theta (zeros):', cost)
grad = gradient(theta, X, y)
print('Gradient at initial theta (zeros):', grad)
But here we are going to use the fmin_tnc function from the scipy library.
This process is the same as using the 'fit' method from the sklearn library, because here we are trying to optimize our cost function in order to find the best possible parameter (theta) values.
fmin_tnc function takes four arguments:
func: Cost function to minimize
fprime: Gradient for the function defined by ‘func’
x0 : initial values for the parameters(theta) that we want to find
args: feature and label values
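Putting this together, a call along these lines finds the optimal theta values (a minimal sketch; unpacking result[0] relies on fmin_tnc returning the solution array as its first element):
from scipy.optimize import fmin_tnc
# Minimize the cost function, supplying the analytic gradient
result = fmin_tnc(func=cost_function, x0=theta, fprime=gradient, args=(X, y))
theta = result[0]  # optimized parameter values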
cost = cost_function(theta, X, y)
print('Cost at theta found by fmin_tnc:', cost)
print('theta:', theta)
NIT NF F GTG
0 1 2.034977015894748E-01 2.47309204E-13
tnc: |pg| = 4.97302e-07 -> local minimum
0 1 2.034977015894748E-01 2.47309204E-13
tnc: Local minima reach (|pg| ~= 0)
Visualization
In [23]:
# Let's calculate the x and y values of the decision boundary line using the formula derived earlier
# For plotting a line we just need 2 points; here I am taking the 'min' and 'max' of Exam 1 marks as my two x values
x_values = [np.min(X[:, 1]), np.max(X[:, 1])]
y_values = - (theta[0] + np.dot(theta[1], x_values)) / theta[2]
plt.figure(figsize = (6,5))
plt.scatter(df_admitted['Exam 1 marks'], df_admitted['Exam 2 marks'], color='green', label='Admitted')
plt.scatter(df_notadmitted['Exam 1 marks'], df_notadmitted['Exam 2 marks'], color='red', label='Not Admitted')
plt.plot(x_values, y_values, color='blue', label='Decision Boundary')
plt.xlabel('Exam 1 Marks')
plt.ylabel('Exam 2 Marks')
plt.legend()
Question: Predict an admission probability for applicant with scores 45 in Exam 1 and 85 in Exam 2
We can use our hypothesis function for prediction h(x) = g(z) = g(Xθ)
In [24]:
input_data = np.array([1, 45, 85]) # Note the intercept term '1' in array
prob = sigmoid(np.dot(input_data, theta))
print('Admission probability for applicant with scores 45 in Exam 1 and 85 in Exam 2 is =', prob)
Admission probability for applicant with scores 45 in Exam 1 and 85 in Exam 2 is = 0.7762906222622858
Next we create a prediction function for our logistic model. Instead of returning the probability between 0 and 1, this function uses a threshold value of 0.5 to predict the discrete class: 1 when the probability is ≥ 0.5, else 0.
In [25]:
def predict(theta, X):
    """
    Predict the class (0 or 1) using the learned logistic regression parameters theta.
    Uses a threshold value of 0.5 to convert the probability value into a class value.
    I/P
    ----------
    theta : 1D array of learned parameters. dimension (1 x n)
    X : 2D array where each row represents a training example and each column represents a feature.
        m = number of training examples
        n = number of features (including the X_0 column of ones)
    O/P
    -------
    Class value (0 or 1) based on the threshold
    """
    p = sigmoid(X.dot(theta)) >= 0.5
    return p.astype(int)
Accuracy Of Model
p = predict(theta, X)
print ('Accuracy:', np.mean(p == y) * 100 )
Accuracy: 89.0
Confusion Matrix
In [28]:
from sklearn import metrics
In [29]:
actualAdmissionStatus = y   # actual labels from the training data
predictedValue = p          # classes predicted by our model
confusion_matrix = metrics.confusion_matrix(actualAdmissionStatus, predictedValue)
In [30]:
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = ['Not Admitted', 'Admitted'])  # display label names are assumed
In [31]:
cm_display.plot()
plt.show()
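To relate the plotted matrix to the metric formulas below, the four counts can be read straight out of the matrix (a small sketch; sklearn places true negatives at position [0, 0] when the labels are ordered 0, 1):
# confusion_matrix is a 2x2 array: rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix.ravel()
print('TN =', tn, ', FP =', fp, ', FN =', fn, ', TP =', tp)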
Precision
Out of all the predicted positives, what percentage are truly positive?
Precision = TP/(TP+FP)
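The same value can be obtained with sklearn (a one-line sketch using the y and p arrays from above):
print('Precision:', metrics.precision_score(y, p))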
Accuracy
In [34]:
Accuracy = (34+56)/(34+56+5+6)
Accuracy
Recall
Out of all the actual positives, what percentage are predicted positive? It is the same as the TPR (true positive rate).
Recall = TP/(TP+FN)
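Again, sklearn provides this directly (a one-line sketch):
print('Recall:', metrics.recall_score(y, p))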
F1 Score:
It is the harmonic mean of precision and recall. It takes both false positives and false negatives into account, and therefore performs well on an imbalanced dataset.
F1 = 2/(1/Precision + 1/Recall)
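And the sklearn equivalent of the harmonic-mean formula above (a one-line sketch):
print('F1 score:', metrics.f1_score(y, p))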