
A Research Project on Applying Logistic Regression to Predict the Result of Diabetes Diagnosis


DINH TRAC DUC ANH¹,²
¹ Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
² Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
¹ E-mail: [email protected]

ABSTRACT

In the scope of machine learning and data analysis, Logistic Regression is one of the most common models, with a wide range of applications in fields such as agriculture, economics, and medical treatment. Diabetes prediction is an application of Logistic Regression in medical diagnosis that aims to predict whether or not a patient has type 1 diabetes based on their medical report. This research project applies a logistic regression model to a binary classification problem, the diabetes prediction task above, using Python, a high-level programming language, to build a Logistic Regression model from scratch. Advanced libraries for Python such as scikit-learn or PyTorch provide many convenient functions for programmers; however, understanding the underlying concepts is a good way to learn and acquire deeper knowledge. This project not only introduces a way to build a Logistic Regression model step by step but also provides the core mathematical foundation behind the model.

Keywords: Logistic Regression, Binary Classification, Diabetes Prediction.

1 | Introduction
The rise of Logistic Regression in machine learning stemmed from the need for a model capable of predicting categorical outcomes¹. Typically, these involve binary values like 0 and 1, representing opposing states like true/false, benign/malignant,
or pregnant/not pregnant. Mastering Logistic Regression and its implementation paves the way for accurate solutions in
classification problems, particularly those with binary outputs.
Diabetes mellitus encompasses a group of diseases disrupting the body’s blood sugar utilization. Chronic diabetes, including types 1 and 2, often presents with similar symptoms that can go undiagnosed for years². These symptoms, which may appear
suddenly, can include frequent urination, weight loss, thirst, fatigue, and vision changes. Regardless of the type, all diabetic
conditions lead to an excess of blood glucose due to insulin deficiency, a hormone crucial for converting glucose into energy.
High blood sugar poses a significant risk for various serious health complications.
Early diagnosis is crucial, yet traditional methods such as Fasting Plasma Glucose (FPG), Oral Glucose Tolerance Test
(OGTT), or Glycated Hemoglobin (HbA1c) often prove cumbersome. This research delves into the potential of Logistic
Regression to accurately predict diabetes based on readily available data, potentially paving the way for more accessible and
reliable diagnoses. Here, predicting type 1 diabetes relies on the Logistic Regression algorithm to estimate the probability of a patient receiving a diagnosis. This problem demands a binary output, where 1 indicates a type 1 diabetes diagnosis and 0 signifies its absence.

2 | Mathematical Foundation of the Research Project


2.1 | Input and Output of Logistic Regression
As a machine learning model, Logistic Regression often works with data stored in tabular formats such as .csv and .xlsx files.
We represent the input, which has n features and m samples, as a matrix with m rows and n columns, denoted as X.
Alongside the input sample is the target Y , which is the true output. It is a matrix with 1 column (a column vector), and it
must have the same number of rows (m) as the input sample. Note that in Logistic Regression, the target Y contains categorical
values, not continuous ones. In binary classification problems, the target contains only two values: 0 and 1.
We aim to create a model that predicts output as close as possible to the target Y. The model takes X as input and, after its computation, returns Ŷ, a matrix or column vector with the same size as Y. Ŷ contains values that represent the probability that each input belongs to a particular class. For example, if the returned output is 0.9888, the observation is predicted to belong to class 1 with high confidence.
X: input matrix with m rows and n columns.
Y: the target (true labels).
Ŷ: predicted output (probabilities).
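As a minimal illustration (toy data, not from the paper), these shapes look like the following in NumPy:

```python
import numpy as np

# Toy data: m = 4 samples, n = 3 features (values are made up).
X = np.array([[1.0, 2.0, 0.5],
              [0.2, 1.5, 3.0],
              [2.2, 0.1, 1.0],
              [0.9, 0.9, 0.9]])      # shape (m, n) = (4, 3)
Y = np.array([[1], [0], [1], [0]])   # target: column vector, shape (4, 1)

assert Y.shape == (X.shape[0], 1)    # Y has the same number of rows as X
```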

2.2 | Logistic Function


Between 1838 and 1847, the Logistic Function was first introduced and developed by Pierre François Verhulst in a series of three papers modeling the population growth he studied in the mid-1830s. Over the course of history, the Logistic Function has found applications in various fields, including Data Science, Statistics, and artificial neural networks.
A logistic function or logistic curve is a common S-shaped curve with the equation:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}} \tag{1}$$
x0 : The x value of the function’s midpoint.
L: The supremum of the values of the function.
k: The steepness of the curve.
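A minimal Python sketch of equation (1), with parameter defaults chosen to match the curve in Figure 1:

```python
import numpy as np

def logistic(x, L=1.0, x0=0.0, k=1.0):
    """General logistic function, equation (1): L / (1 + exp(-k*(x - x0)))."""
    return L / (1.0 + np.exp(-k * (x - x0)))

print(logistic(0.0))    # 0.5  -- the midpoint value L/2 at x = x0
print(logistic(10.0))   # ~1.0 -- approaching the upper asymptote y = L
print(logistic(-10.0))  # ~0.0 -- approaching the lower asymptote y = 0
```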

Figure 1. Logistic graph

Figure 1 shows the plot of the Logistic function with L = 1, x0 = 0, and k = 1. The plot has two horizontal asymptotes, y = 0 and y = L (in this case y = 1). From the Logistic curve in Figure 1 and equation (1), the domain of the Logistic function is D = (−∞, +∞).
As the name "Logistic Regression" suggests, the Logistic function plays a significant role as the core of the model. Logistic Regression is more beneficial as a predictor than Linear Regression because it works best at predicting categorical output, which can be denoted as 0 and 1, for example: benign or malignant, buy or not buy, true or false.

Figure 2. Difference between Linear Regression and Logistic Regression
Linear Regression and Logistic Regression are similar in that both try to predict output from a set of inputs, but Linear Regression predicts continuous values and finds the best linear equation to fit the data. In binary classification, and classification problems in general, the output is not continuous, so Linear Regression is not suitable: it would likely predict undesired results when applied to classification problems. For instance, on the left graph of Figure 2, if the value on the x-axis grows larger than about 3 or smaller than about -3, the output falls outside the range [0, 1].
The formula of Linear Regression:

$$\hat{Y}_{\text{linear}} = \theta^T X + b \tag{2}$$
• θ : a vector of coefficients for input features.
• b: Intercept of linear equation.
Taking formula (2) as input for formula (1), we obtain the equation of Logistic Regression:

$$\hat{Y}_{\text{logistic}} = f(\hat{Y}_{\text{linear}}) = f(\theta^T X + b) = \frac{L}{1 + e^{-k(\theta^T X + b - x_0)}} \tag{3}$$
Note that Ŷ_linear, the predicted output of linear regression, represents continuous output; when it is taken as input for Ŷ_logistic, the returned output is the probability that an observation belongs to a class. In binary classification, the desired output must be between 0 and 1, so the sigmoid function (a special case of the logistic function) is the most suitable choice for this problem.

2.3 | Sigmoid Function


The Sigmoid¹ function is a special case of the Logistic function where the supremum L = 1, the midpoint x0 = 0, and the steepness k = 1. The Sigmoid function is denoted by the Greek letter sigma (σ).

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
The sigmoid function’s graph is shown in Figure 1. We use the sigmoid function as the logistic function to solve binary classification problems. Combining equations (2) and (4), we have the following equation:

$$\hat{Y}_{\text{logistic}} = \sigma(\hat{Y}_{\text{linear}}) = \sigma(\theta^T X + b) = \frac{1}{1 + e^{-(\theta^T X + b)}} \tag{5}$$
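A sketch of equations (4) and (5), written row-wise so that an input matrix X of shape (m, n) maps to an (m, 1) vector of probabilities (the θᵀX notation in the text is the per-sample form):

```python
import numpy as np

def sigmoid(z):
    """Equation (4): sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta, b):
    """Equation (5) applied to every sample: Y_hat = sigma(X @ theta + b).

    X: (m, n) inputs, theta: (n, 1) coefficients, b: scalar intercept.
    Returns an (m, 1) vector of probabilities in (0, 1).
    """
    return sigmoid(X @ theta + b)
```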
2.4 | Loss Function: Cross-Entropy
In Logistic Regression, Cross-Entropy is used to measure the difference between the predicted probabilities ŷ from the logistic regression model and the actual labels y. In other words, Cross-Entropy measures how much information the model is missing about the true probability of the outcome³.
Cross-Entropy formula:

$$\mathcal{L}(y, \hat{y}) = -y \ln(\hat{y}) - (1 - y) \ln(1 - \hat{y}) \tag{6}$$


• y: true label of a training observation.
• ŷ: predicted output for a training observation.
• L(y, ŷ): the loss of a single training observation.
Optimizing the model using Cross-Entropy has several beneficial features:
• Cross-Entropy is a convex function, which means that there is a unique global minimum. This makes it easy to find the optimal parameters for the logistic regression model using gradient descent.
• Cross-Entropy is differentiable, a suitable feature for applying a gradient descent algorithm.
• In binary classification, the value set of y is {0, 1}. When y = 0, equation (6) becomes L = − ln(1 − ŷ): if ŷ is close to y, the loss will be relatively small, but in contrast, the loss will be large if ŷ is far from y. In the case y = 1, equation (6) becomes L = − ln(ŷ), with the same behavior: the loss is small if ŷ is close to y and large if ŷ is far from y.
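A minimal sketch of equation (6); the small epsilon clip is an implementation detail to guard against ln(0), not part of the formula:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-15):
    """Equation (6): per-observation loss -y*ln(y_hat) - (1-y)*ln(1-y_hat)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid ln(0)
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)

print(cross_entropy(1, 0.9888))  # small loss: prediction close to the label
print(cross_entropy(1, 0.0112))  # large loss: prediction far from the label
```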

3/6
2.5 | Cost Function
$$J(\theta, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(y^{(i)}, \hat{y}^{(i)}\right) \tag{7}$$

A good predictor model should predict output as close to the truth as possible, so optimization is the process in which the model finds the best parameters to minimize the loss between the predicted value and the true value⁴. In this case, we want to find the parameters θ and b; through the cost function, we calculate the parameters with which the model fits the data well. Cross-Entropy lays the foundation for a Logistic Regression model to optimize the loss over all observations in the dataset it is trained on.
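A sketch of equation (7) as a vectorized mean over all m training samples:

```python
import numpy as np

def cost(Y, Y_hat, eps=1e-15):
    """Equation (7): J(theta, b) = (1/m) * sum of per-sample losses (6)."""
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)
    return np.mean(-Y * np.log(Y_hat) - (1.0 - Y) * np.log(1.0 - Y_hat))
```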

2.6 | Gradient Descent
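The body of this section is missing from the extraction. As a hedged sketch: gradient descent repeatedly updates θ and b in the direction that decreases the cost (7); for logistic regression with cross-entropy, the standard gradients are ∂J/∂θ = (1/m)Xᵀ(Ŷ − Y) and ∂J/∂b = (1/m)Σ(ŷ − y):

```python
import numpy as np

def gradient_step(X, Y, theta, b, lr):
    """One gradient-descent update on the cost (7).

    Uses the standard logistic-regression gradients:
      dtheta = (1/m) * X.T @ (Y_hat - Y)
      db     = (1/m) * sum(Y_hat - Y)
    lr is the learning rate (the 'a' swept in Table 3).
    """
    m = X.shape[0]
    Y_hat = 1.0 / (1.0 + np.exp(-(X @ theta + b)))  # forward pass, eq. (5)
    error = Y_hat - Y
    theta = theta - lr * (X.T @ error) / m
    b = b - lr * np.sum(error) / m
    return theta, b
```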


2.7 | Pipeline for Building a Logistic Regression Model

Figure 3. Pipeline for a Logistic Regression model.
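Figure 3's pipeline is not reproduced in this extraction; a minimal end-to-end sketch consistent with the sections above (initialize parameters, iterate forward passes and gradient updates, then threshold the probabilities) might look like:

```python
import numpy as np

def train_logistic_regression(X, Y, lr=0.001, epochs=10_000):
    """Minimal training pipeline: zero-initialized parameters, then
    repeated forward pass (eq. 5) and gradient-descent updates."""
    m, n = X.shape
    theta, b = np.zeros((n, 1)), 0.0
    for _ in range(epochs):
        Y_hat = 1.0 / (1.0 + np.exp(-(X @ theta + b)))
        error = Y_hat - Y
        theta -= lr * (X.T @ error) / m
        b -= lr * np.sum(error) / m
    return theta, b

def predict(X, theta, b, threshold=0.5):
    """Convert predicted probabilities into class labels 0/1."""
    proba = 1.0 / (1.0 + np.exp(-(X @ theta + b)))
    return (proba >= threshold).astype(int)
```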

3 | Diabetes Prediction Using a Logistic Regression Model


3.1 | Libraries and Packages Requirement
• NumPy - a library for vector and matrix computation.

• Pandas - a powerful library for working with data.

• Matplotlib - a useful library for creating interactive visualizations.

• train_test_split from sklearn.model_selection - a function of scikit-learn useful for splitting datasets into train samples and test samples.

• ydata_profiling - a library that provides one-line Exploratory Data Analysis (EDA).
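Gathering these imports in one place (package availability assumed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from ydata_profiling import ProfileReport
```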

3.2 | About Dataset


3.2.1 | Overview
The dataset was collected by the National Institute of Diabetes and Digestive and Kidney Diseases and is part of the larger Pima Indian Diabetes Database. It focuses only on female patients of Pima Indian heritage (a subgroup of Native Americans) who are above 21 years old.

3.2.2 | The Code Used for Exploratory Data Analysis

Figure 4. The code used for exploratory data analysis.
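The code in Figure 4 is not reproduced in this extraction. A plausible sketch using ydata_profiling's one-line EDA (the file names are hypothetical; Figure 4's actual paths are not shown):

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("diabetes.csv")  # hypothetical path
report = ProfileReport(df, title="Pima Indians Diabetes EDA")
report.to_file("diabetes_eda.html")
```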

3.2.3 | Variables
3.2.4 | Outcome
3.3 | Data Preprocessing

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
6.0 148.0 72.0 35.0 0.0 33.6 0.627 50.0
1.0 85.0 66.0 29.0 0.0 26.6 0.351 31.0
8.0 183.0 64.0 0.0 0.0 23.3 0.672 32.0
1.0 89.0 66.0 23.0 94.0 28.1 0.167 21.0
0.0 137.0 40.0 35.0 168.0 43.1 2.288 33.0
... ... ... ... ... ... ... ...

Table 1. Data before normalization.

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0.64 0.848 0.15 0.907 -0.692 0.204 0.468 1.425
-0.844 -1.123 -0.16 0.531 -0.692 -0.684 -0.365 -0.191
1.233 1.942 -0.264 -1.287 -0.692 -1.103 0.604 -0.106
-0.844 -0.998 -0.16 0.154 0.123 -0.494 -0.92 -1.041
-1.141 0.504 -1.504 0.907 0.765 1.409 5.481 -0.02
... ... ... ... ... ... ... ...

Table 2. Data after normalization.

Figure 5. Correlation among variables before and after normalization.
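The before/after values in Tables 1 and 2 are consistent with z-score standardization (subtract each column's mean, divide by its standard deviation); a sketch assuming that method, since the paper's exact preprocessing code is not shown:

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")         # hypothetical path
features = df.drop(columns=["Outcome"])  # keep the 8 input variables
# z-score per column: mean 0, standard deviation 1
normalized = (features - features.mean()) / features.std()
```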

3.4 | Split Train and Test Sample
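This section's body is missing from the extraction. Table 3 reports 231 test samples; on the 768-row Pima dataset that corresponds to roughly a 70/30 split, so a hedged sketch (split ratio and random seed assumed) continuing from the preprocessing code above:

```python
from sklearn.model_selection import train_test_split

# test_size=0.3 on 768 samples yields the 231 test rows seen in Table 3.
X = normalized.values
y = df["Outcome"].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # seed assumed
```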


3.5 | Result

Learning rate (a)   Predicted as true   Accuracy   Cost
0.001               186/231             80.52%     0.53
0.012               179/231             77.49%     0.47
0.023               176/231             76.19%     0.37
0.034               173/231             74.89%     0.33
0.045               171/231             74.03%     0.43
0.056               173/231             74.89%     0.48
0.067               163/231             70.56%     0.31
0.078               167/231             72.29%     0.32
0.089               167/231             72.29%     0.23
0.1                 167/231             72.29%     0.26

Table 3. Learning rate sweep
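A sketch of the sweep behind Table 3, reusing the training helpers from the Section 2.7 sketch and the split from Section 3.4 (epoch count and other training details assumed); note that np.linspace(0.001, 0.1, 10) reproduces exactly the ten learning rates in the table:

```python
import numpy as np

for lr in np.linspace(0.001, 0.1, 10):   # 0.001, 0.012, ..., 0.089, 0.1
    theta, b = train_logistic_regression(X_train, y_train, lr=lr)
    preds = predict(X_test, theta, b)
    correct = int((preds == y_test).sum())
    print(f"a={lr:.3f}  correct={correct}/{len(y_test)}  "
          f"accuracy={correct / len(y_test):.2%}")
```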

4 | Conclusions
References
1. Acito, F. Logistic Regression, 125–167 (Springer Nature Switzerland, Cham, 2023).
2. Shackelford, T. K. & Weekes-Shackelford, V. A. (eds.). Diabetes Mellitus, 1987–1987 (Springer International Publishing,
Cham, 2021).
3. Dezert, J. & Dambreville, F. Cross-entropy and relative entropy of basic belief assignments. In 2023 26th International
Conference on Information Fusion (FUSION), 1–8, DOI: 10.23919/FUSION52260.2023.10224207 (2023).
4. Zou, X., Hu, Y., Tian, Z. & Shen, K. Logistic regression model optimization and case analysis. In 2019 IEEE 7th International
Conference on Computer Science and Network Technology (ICCSNT), 135–139, DOI: 10.1109/ICCSNT47585.2019.8962457
(2019).
