A Research Project on Applying Logistic Regression to Predict Results of Binary Classification Problems
ABSTRACT
In the scope of machine learning and data analysis, Logistic Regression is one of the most common models, with a wide range
of applications in fields such as agriculture, economics, and medical treatment. Diabetes prediction is one such application in
medical diagnosis: it aims to predict, from a patient's medical report, whether or not the patient has type 1 diabetes. This
research project applies a Logistic Regression model to a binary classification problem, the diabetes prediction task above,
using Python, a high-level programming language, to build the model from scratch. Advanced Python libraries such as
scikit-learn or PyTorch provide many convenient functions for programmers; however, understanding the underlying concepts
is a good way to learn and acquire deeper knowledge. This project not only introduces a way to build a Logistic Regression
model step by step but also provides the core mathematical foundation behind the model.
1 | Introduction
The rise of Logistic Regression in machine learning stemmed from the need for a model capable of predicting categorical
outcomes1. Typically, these are binary values such as 0 and 1, representing opposing states like true/false, benign/malignant,
or pregnant/not pregnant. Mastering Logistic Regression and its implementation paves the way for accurate solutions to
classification problems, particularly those with binary outputs.
Diabetes mellitus encompasses a group of diseases disrupting the body’s blood sugar utilization. Chronic diabetes, including
types 1 and 2, often presents with similar symptoms that can go undiagnosed for years2. These symptoms, which may appear
suddenly, can include frequent urination, weight loss, thirst, fatigue, and vision changes. Regardless of the type, all diabetic
conditions lead to an excess of blood glucose due to a deficiency of insulin, a hormone crucial for converting glucose into
energy. High blood sugar poses a significant risk for various serious health complications.
Early diagnosis is crucial, yet traditional methods such as the Fasting Plasma Glucose (FPG) test, the Oral Glucose Tolerance
Test (OGTT), or the Glycated Hemoglobin (HbA1c) test often prove cumbersome. This research delves into the potential of
Logistic Regression to accurately predict diabetes based on readily available data, potentially paving the way for more
accessible and reliable diagnoses. Here, type 1 diabetes prediction is cast as a Logistic Regression problem that estimates the
probability of a patient receiving a diagnosis. The problem demands a binary output, where 1 indicates a type 1 diabetes
diagnosis and 0 signifies its absence.
The model returns values between 0 and 1 that represent the probability that each input belongs to a particular class. For
example, if the returned output is 0.9888, the input is predicted to belong to class 1 with high confidence.
• X: the input matrix with m rows (observations) and n columns (features).
• Y: the target labels.
• Ŷ: the predicted output.
f(x) = L / (1 + e^(-k(x - x0)))    (1)

• x0: the x value of the function’s midpoint.
• L: the supremum of the values of the function.
• k: the steepness of the curve.
Figure 1 shows the plot of the Logistic function with L = 1, x0 = 0, and k = 1. The plot has two horizontal asymptotes, y = 0
and y = L (in this case y = 1). As the Logistic curve in Figure 1 and equation (1) show, the domain of the Logistic function is
D = (−∞, +∞).
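To make the behavior of equation (1) concrete, the following is a minimal NumPy sketch of the Logistic function (NumPy is an implementation choice assumed here; the section above defines only the mathematics):

```python
import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General Logistic function of equation (1): L / (1 + e^(-k(x - x0)))."""
    return L / (1.0 + np.exp(-k * (x - x0)))

# With L = 1, x0 = 0, k = 1 (the curve in Figure 1), the output approaches
# the asymptotes y = 0 and y = L at the extremes and equals 0.5 at x = x0.
print(logistic(-10.0))  # ~0.0000454
print(logistic(0.0))    # 0.5
print(logistic(10.0))   # ~0.9999546
```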
As the name "Logistic Regression" suggests, the Logistic function lies at the core of the model. Logistic Regression is more
suitable than Linear Regression as a predictor because it works best on categorical outputs that can be denoted as 0 and 1, for
example: benign or malignant, buy or not buy, true or false.
Figure 2. Difference between Linear Regression and Logistic Regression
Linear Regression and Logistic Regression are similar in that both try to predict an output from a set of inputs, but Linear
Regression predicts continuous values by finding the best linear equation to fit the data. In binary classification, and in
classification problems in general, the output is not continuous, so Linear Regression is unsuitable: it would likely produce
undesired results. For instance, on the left graph of Figure 2, if the value on the x-axis goes larger than 3 or smaller than -3
(approximately), the output falls outside the range [0, 1].
The formula of Linear Regression:
Ŷ_linear = θ^T X + b    (2)

• θ: a vector of coefficients for the input features.
• b: the intercept of the linear equation.
Taking formula (2) as the input of formula (1), we obtain the equation of Logistic Regression:
Ŷ_logistic = f(Ŷ_linear) = f(θ^T X + b) = L / (1 + e^(-k(θ^T X + b - x0)))    (3)
Note that Ŷ_linear, the predicted output of Linear Regression, is continuous; when it is passed into Ŷ_logistic, the returned
output is the probability that an observation belongs to a class. In binary classification, the desired output must be between 0
and 1, so the sigmoid function (a special case of the logistic function with L = 1, k = 1, and x0 = 0) is the most suitable choice
for this problem.
σ(x) = 1 / (1 + e^(-x))    (4)
The sigmoid function’s graph is shown in Figure 1. We use the sigmoid function as the logistic function to solve binary
classification problems. Combining equations (2) and (4), we have the following equation:
Ŷ_logistic = σ(Ŷ_linear) = σ(θ^T X + b) = 1 / (1 + e^(-(θ^T X + b)))    (5)
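As an illustration of equations (4) and (5), here is a minimal sketch of the model’s forward pass in NumPy; the toy values below are illustrative only, not data from the project:

```python
import numpy as np

def sigmoid(z):
    # Equation (4): maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta, b):
    # Equation (5): the linear combination theta^T X + b passed through sigmoid.
    return sigmoid(X @ theta + b)

# Toy data: m = 3 observations, n = 2 features.
X = np.array([[0.5, 1.2], [-0.3, 0.8], [2.0, -1.5]])
theta = np.array([0.7, -0.4])
b = 0.1
print(predict_proba(X, theta, b))  # three probabilities, each in (0, 1)
```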
2.4 | Loss Function: Cross-Entropy
In Logistic Regression, Cross-Entropy is used to measure the difference between the predicted probabilities ŷ from the logistic
regression model and the actual labels y. In other words, Cross-Entropy measures how much information the model is missing
about the true probability of the outcome3.
Cross-Entropy formula (the per-observation loss for binary classification):

L(y, ŷ) = -[y log(ŷ) + (1 - y) log(1 - ŷ)]    (6)
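A minimal NumPy sketch of equation (6); the small eps clip is an implementation detail assumed here to keep log() finite for saturated predictions:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-15):
    # Equation (6); clipping avoids log(0) when y_hat is exactly 0 or 1.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Confident and correct: small loss. Confident and wrong: large loss.
print(cross_entropy(1, 0.9888))  # ~0.011
print(cross_entropy(0, 0.9888))  # ~4.49
```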
2.5 | Cost Function
J(θ, b) = (1/m) Σ_{i=1}^{m} L(y^(i), ŷ^(i))    (7)
A good predictor model should predict outputs as close to the truth as possible, so optimization is the process by which the
model finds the parameters that minimize the loss between predicted values and true values4. In this case, we want to find the
parameters θ and b; through the cost function, we can compute the parameters with which the model fits the data best.
Cross-Entropy lays the foundation for a Logistic Regression model to optimize the loss over all observations in the dataset it is
trained on.
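For equation (7), the gradients with respect to the parameters take the well-known closed form ∂J/∂θ = (1/m) X^T (ŷ - y) and ∂J/∂b = (1/m) Σ (ŷ - y). The following is a minimal gradient-descent sketch of this optimization, not the project’s exact implementation (the epoch count is an assumption):

```python
import numpy as np

def cost(y, y_hat, eps=1e-15):
    # Equation (7): average cross-entropy over all m observations.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def fit(X, y, lr=0.001, epochs=1000):
    """Gradient descent on J(theta, b) for a from-scratch Logistic Regression."""
    m, n = X.shape
    theta, b = np.zeros(n), 0.0
    for _ in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ theta + b)))  # equation (5)
        error = y_hat - y
        theta -= lr * (X.T @ error) / m  # dJ/dtheta
        b -= lr * error.mean()           # dJ/db
    return theta, b
```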
3.2.3 | Variables
3.2.4 | Outcome
3.3 | Data Preprocessing
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age
6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0
1.0 | 85.0 | 66.0 | 29.0 | 0.0 | 26.6 | 0.351 | 31.0
8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0
1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0
0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0
... | ... | ... | ... | ... | ... | ... | ...
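The extracted text does not show the preprocessing code itself, so the sketch below is an assumption-laden reconstruction: the file name diabetes.csv and the Outcome label column follow the commonly distributed Pima Indians Diabetes dataset, features are standardized, and the 231-row test set is inferred from the "x/231" counts in the results table below:

```python
import numpy as np
import pandas as pd

# "diabetes.csv" and the "Outcome" column are assumptions based on the
# commonly distributed Pima Indians Diabetes dataset.
df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["Outcome"]).to_numpy(dtype=float)
y = df["Outcome"].to_numpy(dtype=float)

# Standardize each feature so one learning rate suits all columns
# (Insulin and DiabetesPedigreeFunction live on very different scales).
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Hold out 231 rows for testing, matching the counts reported below.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
X_test, y_test = X[idx[:231]], y[idx[:231]]
X_train, y_train = X[idx[231:]], y[idx[231:]]
```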
Learning rate | Correct predictions | Accuracy | Cost
0.001 | 186/231 | 80.52% | 0.53
0.012 | 179/231 | 77.49% | 0.47
0.023 | 176/231 | 76.19% | 0.37
0.034 | 173/231 | 74.89% | 0.33
0.045 | 171/231 | 74.03% | 0.43
0.056 | 173/231 | 74.89% | 0.48
0.067 | 163/231 | 70.56% | 0.31
0.078 | 167/231 | 72.29% | 0.32
0.089 | 167/231 | 72.29% | 0.23
0.1 | 167/231 | 72.29% | 0.26
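To reproduce this kind of sweep, one can train at each learning rate, threshold the predicted probabilities at 0.5, and measure accuracy on the held-out set. The sketch below reuses the fit and cost helpers from the gradient-descent sketch above and is illustrative rather than the project’s exact procedure:

```python
# Sweep the same learning rates as the table above.
for lr in [0.001, 0.012, 0.023, 0.034, 0.045, 0.056, 0.067, 0.078, 0.089, 0.1]:
    theta, b = fit(X_train, y_train, lr=lr, epochs=1000)
    proba = 1.0 / (1.0 + np.exp(-(X_test @ theta + b)))  # equation (5)
    pred = (proba >= 0.5).astype(float)  # threshold at 0.5 -> class 0 or 1
    correct = int((pred == y_test).sum())
    print(f"lr={lr:.3f}  correct: {correct}/231  "
          f"accuracy={correct / 231:.2%}  cost={cost(y_test, proba):.2f}")
```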
4 | Conclusions
References
1. Acito, F. Logistic Regression, 125–167 (Springer Nature Switzerland, Cham, 2023).
2. Shackelford, T. K. & Weekes-Shackelford, V. A. (eds.). Diabetes Mellitus, 1987–1987 (Springer International Publishing,
Cham, 2021).
3. Dezert, J. & Dambreville, F. Cross-entropy and relative entropy of basic belief assignments. In 2023 26th International
Conference on Information Fusion (FUSION), 1–8, DOI: 10.23919/FUSION52260.2023.10224207 (2023).
4. Zou, X., Hu, Y., Tian, Z. & Shen, K. Logistic regression model optimization and case analysis. In 2019 IEEE 7th International
Conference on Computer Science and Network Technology (ICCSNT), 135–139, DOI: 10.1109/ICCSNT47585.2019.8962457
(2019).