
A Research Project on Applying Logistic Regression to Predict the Result of Diabetes Diagnosis


DINH TRAC DUC ANH¹,²
¹ Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
² Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
¹ E-mail: [email protected]

ABSTRACT

In the scope of machine learning and data analysis, Logistic Regression is one of the most common models, with a wide range of applications in fields such as agriculture, economics, and medical treatment. Diabetes prediction is an application of Logistic Regression in medical diagnosis that aims to predict whether or not a patient has type 1 diabetes based on their medical report. This research project applies a logistic regression model to a binary classification problem, the diabetes prediction task above, using Python, a high-level programming language, to build a Logistic Regression model from scratch. Advanced libraries for Python such as scikit-learn or PyTorch provide many convenient functions for programmers; however, understanding the underlying concepts is a good way to learn and acquire deeper knowledge. This project not only introduces a way to build a Logistic Regression model step by step but also provides the core mathematical foundation behind the model.

Keywords: Logistic Regression, Binary Classification, Diabetes Prediction.

1 | Introduction
The rise of Logistic Regression in machine learning stemmed from the need for a model capable of predicting categorical outcomes¹. Typically, these involve binary values like 0 and 1, representing opposing states like true/false, benign/malignant,
or pregnant/not pregnant. Mastering Logistic Regression and its implementation paves the way for accurate solutions in
classification problems, particularly those with binary outputs.
Diabetes mellitus encompasses a group of diseases disrupting the body’s blood sugar utilization. Chronic diabetes, including types 1 and 2, often presents with similar symptoms that can go undiagnosed for years². These symptoms, which may appear
suddenly, can include frequent urination, weight loss, thirst, fatigue, and vision changes. Regardless of the type, all diabetic
conditions lead to an excess of blood glucose due to insulin deficiency, a hormone crucial for converting glucose into energy.
High blood sugar poses a significant risk for various serious health complications.
Early diagnosis is crucial, yet traditional methods such as Fasting Plasma Glucose (FPG), Oral Glucose Tolerance Test
(OGTT), or Glycated Hemoglobin (HbA1c) often prove cumbersome. This research delves into the potential of Logistic
Regression to accurately predict diabetes based on readily available data, potentially paving the way for more accessible and
reliable diagnoses. Here, predicting type 1 diabetes relies on the Logistic Regression algorithm to estimate the probability of a patient receiving a diagnosis. This problem demands a binary output, where 1 indicates a type 1 diabetes diagnosis and 0 signifies its absence.

2 | Mathematical Foundation of the Research Project


2.1 | Input and Output of Logistic Regression
As a machine learning model, Logistic Regression often works with data stored in tabular formats such as .csv and .xlsx files.
We represent the input, which has n features and m samples, as a matrix with m rows and n columns, denoted as X.
Alongside the input sample is the target Y , which is the true output. It is a matrix with 1 column (a column vector), and it
must have the same number of rows (m) as the input sample. Note that in Logistic Regression, the target Y contains categorical
values, not continuous ones. In binary classification problems, the target contains only two values: 0 and 1.
We aim to create a model that predicts output as close as possible to the target Y. The model takes X as input and, after its computation, returns Ŷ, a matrix or column vector with the same size as Y. Ŷ contains values that represent the probability that each input belongs to a particular class. For example, if the returned output is 0.9888, the observation is predicted to belong to class 1 with high confidence.
X: input matrix with m rows and n columns.
Y: the target (true labels).
Ŷ: predicted output (probabilities).
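As a minimal illustration (toy data, not from the paper), these shapes look like the following in NumPy:

```python
import numpy as np

# Toy data: m = 4 samples, n = 3 features (values are made up).
X = np.array([[1.0, 2.0, 0.5],
              [0.2, 1.5, 3.0],
              [2.2, 0.1, 1.0],
              [0.9, 0.9, 0.9]])      # shape (m, n) = (4, 3)
Y = np.array([[1], [0], [1], [0]])   # target: column vector, shape (4, 1)

assert Y.shape == (X.shape[0], 1)    # Y has the same number of rows as X
```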

2.2 | Logistic Function


Between 1838 and 1847, the Logistic Function was first introduced and developed by Pierre François Verhulst in a series of three papers modeling the population growth he studied in the mid-1830s. Over the course of history, the Logistic Function has found applications in various fields, including Data Science, Statistics, and artificial neural networks.
A logistic function or logistic curve is a common S-shaped curve with the equation:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}} \tag{1}$$
x0 : The x value of the function’s midpoint.
L: The supremum of the values of the function.
k: The steepness of the curve.
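A minimal Python sketch of equation (1), with parameter defaults chosen to match the curve in Figure 1:

```python
import numpy as np

def logistic(x, L=1.0, x0=0.0, k=1.0):
    """General logistic function, equation (1): L / (1 + exp(-k*(x - x0)))."""
    return L / (1.0 + np.exp(-k * (x - x0)))

print(logistic(0.0))    # 0.5  -- the midpoint value L/2 at x = x0
print(logistic(10.0))   # ~1.0 -- approaching the upper asymptote y = L
print(logistic(-10.0))  # ~0.0 -- approaching the lower asymptote y = 0
```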

Figure 1. Logistic graph

Figure 1 shows the plot of the Logistic function with L = 1, x0 = 0, and k = 1. The plot has two horizontal asymptotes, y = 0 and y = L (in this case y = 1). From the Logistic curve in Figure 1 and equation (1), the domain of the Logistic function is D = (−∞, +∞).
As the name "Logistic Regression" suggests, the Logistic function plays a significant role as the core of the model. Logistic Regression is more beneficial as a predictor than Linear Regression because it works best at predicting categorical output, which can be denoted as 0 and 1, for example: benign or malignant, buy or not buy, true or false.

Figure 2. Difference between Linear Regression and Logistic Regression
Linear Regression and Logistic Regression are similar in that both try to predict output from a set of inputs, but Linear Regression predicts continuous values and finds the best linear equation to fit the data. In binary classification, and classification problems in general, the output is not continuous, so Linear Regression is not suitable: it would likely predict undesired results when applied to classification problems. For instance, on the left graph of Figure 2, if the value on the x-axis grows larger than about 3 or smaller than about -3, the output falls outside the range [0, 1].
The formula of Linear Regression:

$$\hat{Y}_{\text{linear}} = \theta^T X + b \tag{2}$$
• θ : a vector of coefficients for input features.
• b: Intercept of linear equation.
Taking formula (2) as input for formula (1), we obtain the equation of Logistic Regression:

$$\hat{Y}_{\text{logistic}} = f(\hat{Y}_{\text{linear}}) = f(\theta^T X + b) = \frac{L}{1 + e^{-k(\theta^T X + b - x_0)}} \tag{3}$$
Note that Ŷ_linear, the predicted output of linear regression, represents continuous output; when it is taken as input for Ŷ_logistic, the returned output is the probability that an observation belongs to a class. In binary classification, the desired output must be between 0 and 1, so the sigmoid function (a special case of the logistic function) is the most suitable choice for this problem.

2.3 | Sigmoid Function


The Sigmoid¹ function is a special case of the Logistic function where the supremum L = 1, the midpoint x0 = 0, and the steepness k = 1. The Sigmoid function is denoted by the Greek letter sigma (σ).

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
The sigmoid function’s graph is shown in Figure 1. We use the sigmoid function as the logistic function to solve binary classification problems. Combining equations (2) and (4), we have the following equation:

$$\hat{Y}_{\text{logistic}} = \sigma(\hat{Y}_{\text{linear}}) = \sigma(\theta^T X + b) = \frac{1}{1 + e^{-(\theta^T X + b)}} \tag{5}$$
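A sketch of equations (4) and (5), written row-wise so that an input matrix X of shape (m, n) maps to an (m, 1) vector of probabilities (the θᵀX notation in the text is the per-sample form):

```python
import numpy as np

def sigmoid(z):
    """Equation (4): sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta, b):
    """Equation (5) applied to every sample: Y_hat = sigma(X @ theta + b).

    X: (m, n) inputs, theta: (n, 1) coefficients, b: scalar intercept.
    Returns an (m, 1) vector of probabilities in (0, 1).
    """
    return sigmoid(X @ theta + b)
```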
2.4 | Loss Function: Cross-Entropy
In Logistic Regression, Cross-Entropy is used to measure the difference between the predicted probabilities ŷ from the logistic regression model and the actual labels y. In other words, Cross-Entropy measures how much information the model is missing about the true probability of the outcome³.
Cross-Entropy formula:

$$\mathcal{L}(y, \hat{y}) = -y \ln(\hat{y}) - (1 - y) \ln(1 - \hat{y}) \tag{6}$$


• y: true label of a training observation.
• ŷ: predicted output for a training observation.
• L(y, ŷ): the loss of a single training observation.
Optimizing the model using Cross-Entropy has several beneficial features:
• Cross-Entropy is a convex function, which means that there is a unique global minimum. This makes it easy to find the optimal parameters for the logistic regression model using gradient descent.
• Cross-Entropy is differentiable, a suitable feature for applying a gradient descent algorithm.
• In binary classification, the value set of y is {0, 1}. When y = 0, equation (6) becomes L = − ln(1 − ŷ): if ŷ is close to y, the loss will be relatively small, but in contrast, the loss will be large if ŷ is far from y. In the case y = 1, equation (6) becomes L = − ln(ŷ), with the same behavior: the loss is small if ŷ is close to y and large if ŷ is far from y.
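A minimal sketch of equation (6); the small epsilon clip is an implementation detail to guard against ln(0), not part of the formula:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-15):
    """Equation (6): per-observation loss -y*ln(y_hat) - (1-y)*ln(1-y_hat)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid ln(0)
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)

print(cross_entropy(1, 0.9888))  # small loss: prediction close to the label
print(cross_entropy(1, 0.0112))  # large loss: prediction far from the label
```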

3/6
2.5 | Cost Function
$$J(\theta, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(y^{(i)}, \hat{y}^{(i)}\right) \tag{7}$$

A good predictor model should predict output as close to the truth as possible, so optimization is the process in which the model finds the best parameters to minimize the loss between the predicted value and the true value⁴. In this case, we want to find the parameters θ and b; through the cost function, we calculate the parameters with which the model fits the data well. Cross-Entropy lays the foundation for a Logistic Regression model to optimize the loss over all observations in the dataset it is trained on.
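A sketch of equation (7) as a vectorized mean over all m training samples:

```python
import numpy as np

def cost(Y, Y_hat, eps=1e-15):
    """Equation (7): J(theta, b) = (1/m) * sum of per-sample losses (6)."""
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)
    return np.mean(-Y * np.log(Y_hat) - (1.0 - Y) * np.log(1.0 - Y_hat))
```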

2.6 | Gradient Descent
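The body of this section is missing from the extraction. As a hedged sketch: gradient descent repeatedly updates θ and b in the direction that decreases the cost (7); for logistic regression with cross-entropy, the standard gradients are ∂J/∂θ = (1/m)Xᵀ(Ŷ − Y) and ∂J/∂b = (1/m)Σ(ŷ − y):

```python
import numpy as np

def gradient_step(X, Y, theta, b, lr):
    """One gradient-descent update on the cost (7).

    Uses the standard logistic-regression gradients:
      dtheta = (1/m) * X.T @ (Y_hat - Y)
      db     = (1/m) * sum(Y_hat - Y)
    lr is the learning rate (the 'a' swept in Table 3).
    """
    m = X.shape[0]
    Y_hat = 1.0 / (1.0 + np.exp(-(X @ theta + b)))  # forward pass, eq. (5)
    error = Y_hat - Y
    theta = theta - lr * (X.T @ error) / m
    b = b - lr * np.sum(error) / m
    return theta, b
```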


2.7 | Pipeline for Building a Logistic Regression Model

Figure 3. Pipeline for a Logistic Regression model.
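Figure 3's pipeline is not reproduced in this extraction; a minimal end-to-end sketch consistent with the sections above (initialize parameters, iterate forward passes and gradient updates, then threshold the probabilities) might look like:

```python
import numpy as np

def train_logistic_regression(X, Y, lr=0.001, epochs=10_000):
    """Minimal training pipeline: zero-initialized parameters, then
    repeated forward pass (eq. 5) and gradient-descent updates."""
    m, n = X.shape
    theta, b = np.zeros((n, 1)), 0.0
    for _ in range(epochs):
        Y_hat = 1.0 / (1.0 + np.exp(-(X @ theta + b)))
        error = Y_hat - Y
        theta -= lr * (X.T @ error) / m
        b -= lr * np.sum(error) / m
    return theta, b

def predict(X, theta, b, threshold=0.5):
    """Convert predicted probabilities into class labels 0/1."""
    proba = 1.0 / (1.0 + np.exp(-(X @ theta + b)))
    return (proba >= threshold).astype(int)
```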

3 | Diabetes Prediction Using a Logistic Regression Model


3.1 | Libraries and Packages Requirement
• NumPy - a library for vector and matrix computation.

• Pandas - a powerful library for working with data.

• Matplotlib - a useful library for creating interactive visualizations.

• train_test_split from sklearn.model_selection - a function of scikit-learn useful for splitting datasets into train samples and test samples.

• ydata_profiling - a library that provides one-line Exploratory Data Analysis (EDA).
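Gathering these imports in one place (package availability assumed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from ydata_profiling import ProfileReport
```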

3.2 | About Dataset


3.2.1 | Overview
The dataset was collected by the National Institute of Diabetes and Digestive and Kidney Diseases and is part of the larger Pima Indian Diabetes Database. It focuses only on female patients of Pima Indian heritage (a subgroup of Native Americans) who are above 21 years old.

3.2.2 | The Code Used for Exploratory Data Analysis

Figure 4. The code used for exploratory data analysis.
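The code in Figure 4 is not reproduced in this extraction. A plausible sketch using ydata_profiling's one-line EDA (the file names are hypothetical; Figure 4's actual paths are not shown):

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("diabetes.csv")  # hypothetical path
report = ProfileReport(df, title="Pima Indians Diabetes EDA")
report.to_file("diabetes_eda.html")
```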

3.2.3 | Variables
3.2.4 | Outcome
3.3 | Data Preprocessing

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
6.0 148.0 72.0 35.0 0.0 33.6 0.627 50.0
1.0 85.0 66.0 29.0 0.0 26.6 0.351 31.0
8.0 183.0 64.0 0.0 0.0 23.3 0.672 32.0
1.0 89.0 66.0 23.0 94.0 28.1 0.167 21.0
0.0 137.0 40.0 35.0 168.0 43.1 2.288 33.0
... ... ... ... ... ... ... ...

Table 1. Data before normalization.

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0.64 0.848 0.15 0.907 -0.692 0.204 0.468 1.425
-0.844 -1.123 -0.16 0.531 -0.692 -0.684 -0.365 -0.191
1.233 1.942 -0.264 -1.287 -0.692 -1.103 0.604 -0.106
-0.844 -0.998 -0.16 0.154 0.123 -0.494 -0.92 -1.041
-1.141 0.504 -1.504 0.907 0.765 1.409 5.481 -0.02
... ... ... ... ... ... ... ...

Table 2. Data after normalization.

Figure 5. Correlation among variables before and after normalization.
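The before/after values in Tables 1 and 2 are consistent with z-score standardization (subtract each column's mean, divide by its standard deviation); a sketch assuming that method, since the paper's exact preprocessing code is not shown:

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")         # hypothetical path
features = df.drop(columns=["Outcome"])  # keep the 8 input variables
# z-score per column: mean 0, standard deviation 1
normalized = (features - features.mean()) / features.std()
```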

3.4 | Split Train and Test Sample
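This section's body is missing from the extraction. Table 3 reports 231 test samples; on the 768-row Pima dataset that corresponds to roughly a 70/30 split, so a hedged sketch (split ratio and random seed assumed) continuing from the preprocessing code above:

```python
from sklearn.model_selection import train_test_split

# test_size=0.3 on 768 samples yields the 231 test rows seen in Table 3.
X = normalized.values
y = df["Outcome"].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # seed assumed
```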


3.5 | Result

Learning rate (a)   Predicted as true   Accuracy   Cost
0.001               186/231             80.52%     0.53
0.012               179/231             77.49%     0.47
0.023               176/231             76.19%     0.37
0.034               173/231             74.89%     0.33
0.045               171/231             74.03%     0.43
0.056               173/231             74.89%     0.48
0.067               163/231             70.56%     0.31
0.078               167/231             72.29%     0.32
0.089               167/231             72.29%     0.23
0.1                 167/231             72.29%     0.26

Table 3. Learning rate sweep
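A sketch of the sweep behind Table 3, reusing the training helpers from the Section 2.7 sketch and the split from Section 3.4 (epoch count and other training details assumed); note that np.linspace(0.001, 0.1, 10) reproduces exactly the ten learning rates in the table:

```python
import numpy as np

for lr in np.linspace(0.001, 0.1, 10):   # 0.001, 0.012, ..., 0.089, 0.1
    theta, b = train_logistic_regression(X_train, y_train, lr=lr)
    preds = predict(X_test, theta, b)
    correct = int((preds == y_test).sum())
    print(f"a={lr:.3f}  correct={correct}/{len(y_test)}  "
          f"accuracy={correct / len(y_test):.2%}")
```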

4 | Conclusions
References
1. Acito, F. Logistic Regression, 125–167 (Springer Nature Switzerland, Cham, 2023).
2. Shackelford, T. K. & Weekes-Shackelford, V. A. (eds.). Diabetes Mellitus, 1987–1987 (Springer International Publishing,
Cham, 2021).
3. Dezert, J. & Dambreville, F. Cross-entropy and relative entropy of basic belief assignments. In 2023 26th International
Conference on Information Fusion (FUSION), 1–8, DOI: 10.23919/FUSION52260.2023.10224207 (2023).
4. Zou, X., Hu, Y., Tian, Z. & Shen, K. Logistic regression model optimization and case analysis. In 2019 IEEE 7th International
Conference on Computer Science and Network Technology (ICCSNT), 135–139, DOI: 10.1109/ICCSNT47585.2019.8962457
(2019).
