
A Transformer On Tabular Data Comparative Analysis With Linear and Tree Base Machine Learning Algorithm On Diabetic Dataset

This document discusses using machine learning techniques such as transformers and linear/tree-based algorithms to analyze a diabetic patient dataset and predict the likelihood of diabetes. Specifically, it evaluates a Tabpfn transformer on tabular data against algorithms such as Random Forest, Decision Tree, SVM, KNN, and Gradient Boosting. The Tabpfn transformer achieved the highest F1-scores, ranging from 91.541% to 98.46%, outperforming other existing approaches. Early and accurate prediction of diabetes using AI can help patients better manage their condition through preventative measures.

Volume 8, Issue 5, May – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

A Transformer on Tabular Data Comparative Analysis with Linear and Tree Base Machine Learning Algorithm on Diabetic Dataset

Kamin Gorettie Precody (1), Komiwe Faith Phiri (2), Dr. Ashish Kumar Chakraverti (3)
(1, 2, 3) Department of Computer Science & Engineering, School of Engineering and Technology, Sharda University, Greater Noida.

Abstract:- Lifestyle diseases are rated at 80% among the top causes of death. They claim about 41 million lives, which is over 70% of all deaths around the world. Of these, roughly 15 million deaths occur among people aged 30 to 69 years. Lifestyle diseases originate primarily from the day-to-day habits of an individual. Habits that detract from activity and push people towards a sedentary routine can cause numerous health issues that may lead to harmful, nearly life-threatening diseases. Two common complex diseases of this kind are heart disease and diabetes. Researchers have found diabetes to be a silent but deadly disease, and many use machine learning methods to help medical professionals diagnose lifestyle diseases. This paper reviews the literature on the prediction and diagnosis of lifestyle diseases using transformers and machine learning techniques, and applies these techniques to diabetes patient data. It highlights the importance of transformers and machine learning in analyzing huge patient datasets to predict all kinds of diabetes, and how they can be treated and prevented. Further, we have utilized Transformers on tabular data (Tabpfn), Random Forest, Decision Tree, Support Vector Machine, K-Nearest Neighbors, Gradient Boosting, Histogram Gradient Boosting, and Adaptive Boosting for predicting how likely a person is to have diabetes. The stratified holdout cross-validation method has been used to split the training dataset randomly into 90% train and 10% test sets. The result was collected and further compared with some existing approaches, which indicates that using transformers on tabular data (Tabpfn) outperforms the existing state-of-the-art approach. The Tabpfn transformer on tabular data was optimal among adapted models based on F1-score, which are 98.46%, 98.0694%, 91.736%, and 91.541% respectively.

Keywords:- Transformer, Lifestyle Diseases, Machine Learning Techniques, Prediction.

I. INTRODUCTION

Diabetes is a condition caused by high glucose levels in the human body. It should not be ignored: if left untreated, diabetes can cause major problems such as heart-related issues, kidney problems, blood pressure problems, and eye damage, and it can also affect other organs of the human body. Diabetes can be controlled if it is predicted early. To achieve this goal, this project performs early prediction of diabetes in a patient with higher accuracy by applying different AI methods. AI methods give improved prediction results by building models from datasets collected from patients.

According to the World Health Organization (WHO), around 422 million people suffer from diabetes, particularly in low- and middle-income countries. This number could rise to 490 million by the year 2030. The prevalence of diabetes is also observed in various countries such as Canada, China, and India.

AI methods can play a fundamental role in predicting diabetes at an early stage. These methods use algorithms and statistical models to analyze large datasets and predict outcomes based on the data. By applying AI algorithms to data collected from patients, it is possible to identify patterns and predict the probability of developing diabetes. This can be done by analyzing various factors such as age, family history, lifestyle habits, and medical history. AI techniques can give higher accuracy in predicting diabetes, which can help patients take early preventive measures to manage their condition.

The use of AI methods for predicting diabetes is not a new idea. However, with advances in technology, it has become easier to collect and analyze large amounts of data from patients. AI models can be trained to identify patterns and trends in the data, which can help predict the probability of developing diabetes. These models can also be updated regularly to improve their accuracy as new data becomes available.

IJISRT23MAY1179 www.ijisrt.com 1664


In conclusion, diabetes is a serious health problem that can lead to major complications if it is not managed properly. Early detection is key to preventing complications and improving the quality of life of those living with diabetes. AI techniques can give a more accurate prediction of diabetes, which can help patients take early preventive measures. With further research and development, AI could play a significant role in improving diabetes management and reducing the burden of this chronic disease on individuals and healthcare systems.

II. RELATED WORK

A review of other published work shows that researchers have used various datasets, drawn from patient records and other sources such as Kaggle. Across these studies, the highest accuracies were obtained using Artificial Neural Networks or SVM. Many of these disease-prediction studies use machine learning algorithms, ensemble learning approaches, and association rules to achieve the best possible classification precision. There is a close connection between data mining and machine learning, such that machine learning techniques are also known as data mining techniques. For this paper we obtained diabetes datasets from Kaggle containing medical records of both genders and of all ages. Many researchers have used ML and DM based algorithms. A few of them are listed and explained below:

K.VijiyaKumar et al. [1] proposed the Random Forest algorithm for the prediction of diabetes, developing a system that can perform early prediction of diabetes for a patient with higher accuracy by using the Random Forest algorithm as the machine learning technique. The proposed model gives the best results for diabetic prediction, and the results showed that the prediction system is capable of predicting diabetes effectively, efficiently and, most importantly, instantly.

Nonso Nnamoko et al. [2] presented an ensemble supervised learning approach for predicting diabetes onset. Five widely used classifiers are employed for the ensembles, and a meta-classifier is used to aggregate their outputs. The results are presented and compared with similar studies in the literature that used the same dataset. It is shown that, by using the proposed method, diabetes onset prediction can be done with higher accuracy.

Tejas N. Joshi et al. [3] presented Diabetes Prediction Using Machine Learning Techniques, which aims to predict diabetes via three different supervised machine learning methods: SVM, Logistic Regression, and ANN. This project proposes an effective technique for earlier detection of diabetes.

Deeraj Shetty et al. [4] proposed diabetes disease prediction using data mining, assembling an Intelligent Diabetes Disease Prediction System that gives an analysis of diabetes using a diabetes patient database. In this system, they propose applying algorithms such as Bayesian and KNN (K-Nearest Neighbor) to the diabetes patient database and analyzing it by taking various attributes of diabetes for prediction of the disease.

Muhammad Azeem Sarwar et al. [5] proposed a study on prediction of diabetes using machine learning algorithms in healthcare. They applied six different machine learning algorithms; the performance and accuracy of the applied algorithms are discussed and compared.

A comparison of the different machine learning techniques used in this study reveals which algorithm is best suited for the prediction of diabetes. Diabetes prediction is becoming an area of interest for researchers, who train programs to identify whether a patient is diabetic or not by applying proper classifiers to the dataset.

Based on previous research work, it has been observed that the classification process has not improved much. Hence, since diabetes prediction is an important area in computing, a system is required to handle the issues identified in previous research.

III. MACHINE LEARNING ALGORITHM

Arthur Samuel coined the term ML in 1959. It is a branch of computer science (CS) that helps computers to learn independently without explicit programming. In ML, an algorithm manipulates a dataset, enabling it to make predictions by learning patterns from previous data.

ML can be categorized as supervised, unsupervised and reinforcement learning [12]. Supervised learning uses classified data that has labels. When such data is supplied to an algorithm, it can predict a test case's outcome. Classification and regression are the main methods of supervised learning. Supervised ML can be achieved using various algorithms such as Naive Bayes classification, decision trees, and SVMs.

In this paper, we have utilized Transformers on tabular data (Tabpfn), Random Forest, Decision Tree, Support Vector Machine, K-Nearest Neighbors, Gradient Boosting, Histogram Gradient Boosting, and Adaptive Boosting for predicting how likely a person is to have diabetes. The stratified holdout cross-validation method has been used to split the training dataset randomly into 90% train and 10% test sets. The result was collected and further compared with some existing approaches, which indicates that using transformers on tabular data (Tabpfn) outperforms the existing state-of-the-art approach. The Tabpfn transformer on tabular data was optimal among adapted models based on F1-score, which are 98.46%, 98.0694%, 91.736%, and 91.541% respectively.
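The stratified 90%/10% holdout split described in Section III can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stratified_holdout`, the seed, and the toy label list are assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Each class contributes roughly test_fraction of its rows to the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (illustrative): 768 rows, 500 negative (0) and 268 positive (1).
labels = [0] * 500 + [1] * 268
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

Splitting each class separately keeps the share of diabetic and non-diabetic cases nearly identical in the train and test sets, which matters when the classes are imbalanced.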

IV. PROPOSED SYSTEM

Fig 1 Training Phase and Testing Phase

We are going to build a system that will be able to efficiently predict whether a patient is diabetic or not. The system uses a recent technique known as transformers, together with a technique which we call Active Learning. Active Learning is a new technique with the aim of

• Data Collection
The dataset used for this project was obtained from Kaggle, a popular platform for sharing datasets and conducting data-driven research. The dataset is an updated version of the Pima Indians Diabetes Database, which includes demographic, diagnostic, and historical medical data of patients. The updated dataset consists of 768 instances with eight features: age, number of pregnancies, glucose concentration, blood pressure, skin thickness, insulin, BMI, and diabetes pedigree function.

To explore the dataset and prepare it for analysis, relational views were created for the features. The age feature represents the age of the patient and is a continuous variable. The number of pregnancies is a discrete variable that represents the number of times a patient has been pregnant. The glucose concentration, blood pressure, skin thickness, and insulin features are continuous variables that represent different diagnostic measurements. The BMI feature is a continuous variable that represents the body mass index of the patient, and the diabetes pedigree function is a continuous variable that represents the genetic predisposition to diabetes.

• Data Processing
The data was preprocessed before being used for the prediction task. The preprocessing steps included handling missing values, normalizing the features, and encoding categorical variables. The dataset was split into training and testing sets using a stratified holdout cross-validation method. This ensured that the distribution of classes in the training and testing sets was similar, which is important for building an accurate prediction model.

Overall, the data used for this project was obtained from a reliable source and was preprocessed to ensure its quality and suitability for the prediction task. The relational views created for the features helped in understanding the dataset and preparing it for analysis.

• Training Data
Training data needs to be collected alongside the testing data, and further preprocessing is needed to understand the predictors better. Training data also helps in preparing a budget request at some point, and it is a proper document for building a business case and justifying budget requests.

• Predictive Features
To understand what drives the target outcome, some research or investigation is needed to get ideas about the data points. Once it is well understood which data points fit the target outcome, further data requests can help build a business case. The main predictive features taken into the feasibility criteria are:

• Age
• Gender
• Polyuria
• Polydipsia
• Sudden Weight Loss
• Weakness
• Polyphagia
• Genital Thrush
• Visual Blurring
• Itching
• Irritability
• Delayed Healing
• Partial Paresis
• Muscle Stiffness
• Alopecia
• Obesity
• Class

• Working of the Model
The first task of the project would be to gather and clean the dataset. This would involve finding a reliable source of data and performing data cleaning and preprocessing to ensure that the data is ready for analysis. The duration of this task could be 2 weeks.

The next task would be to perform exploratory data analysis on the dataset to identify trends and patterns. This could take 3 weeks, and the output would be a report on the findings.

The third task would be to apply feature engineering techniques to the dataset to improve the accuracy of the models. This could take 2 weeks, and the output would be a dataset that is ready for modeling.
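The three preprocessing steps named under Data Processing (handling missing values, normalizing the features, and encoding categorical variables) can be sketched as follows. The paper does not state which imputation or scaling method was used, so median imputation and min-max scaling are assumptions here, as are the helper names and the toy column values.

```python
from statistics import median


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def min_max_normalize(column):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def encode_binary(column, positive="Yes"):
    """Map a two-valued categorical column to 1 (positive) or 0."""
    return [1 if v == positive else 0 for v in column]


# Hypothetical toy columns; names and values are illustrative only.
glucose = [148, None, 183, 89, 137]
polyuria = ["Yes", "No", "No", "Yes", "No"]

glucose_clean = min_max_normalize(impute_median(glucose))
polyuria_encoded = encode_binary(polyuria)
print(glucose_clean)
print(polyuria_encoded)
```

In practice the median, minimum, and maximum would be computed on the training split only and then reused for the test split, so that no information from the test set leaks into training.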

The fourth task would be to build and train machine learning models using both classical algorithms and transformers. This could take 6 weeks, and the output would be a set of models with their respective accuracy and performance metrics.

Finally, the last task would be to analyze and compare the performance of the models and draw conclusions on the suitability of transformers in early disease prediction. This could take 2 weeks, and the output would be a report summarizing the findings and conclusions.

A. Algorithms
• Input: Data set from Kaggle
• Output: A prediction model
• Variables:
• Required Accuracy: minimum threshold accuracy of the model (%)
• Current Accuracy: accuracy of the model after training (%)
• X train: training data for the model (predictor)
• Y train: training data for the model (target)
• X test: testing data for the model (predictor)
• Y test: testing data for the model (target)
• Model: lifestyle disease prediction model
• SVM Parameters: Kernel, C, Gamma, and Degree

BEGIN
• STEP 1: Determine the value of the required Accuracy
• STEP 2: Prepare the dataset from the questionnaire
• STEP 3: Note the predictor and target values
• STEP 4: Preprocess the dataset:
• STEP 4.1: Data integration
• STEP 4.2: Data transformation
• STEP 4.3: Data reduction
• STEP 4.4: Data cleaning
• STEP 5: X train, Y train: 70% of data collected
• STEP 6: X test, Y test: 30% of data collected
• STEP 7: current Accuracy: 0
• STEP 8: while (current Accuracy < required Accuracy)
• STEP 9: Deployment
• STOP

B. Algorithm Part 2
• Input: Predictor values of a web user
• Output: Yes, if a user suffers from a lifestyle disease (with his/her name). No, if a user does not suffer from a lifestyle disease.
• Variables:
• user Input: a web user's values
• Model: trained model from Algorithm I
• Prediction: output from the model

BEGIN
• STEP 1: user Input ← Store user input in an appropriate format
• STEP 2: prediction ← Predict if an individual suffers from any lifestyle disease using user Input and model
• STEP 3: Display the prediction to the user in an appropriate format.
• STOP

V. SIMULATION RESULTS

Fig 2 Scores

• Required Accuracy = 90%

Type of testing adapted: To test the accuracy or the performance of each algorithm that was adapted to this project, we used different classification metrics, since the data was somewhat imbalanced. Besides that, we took into account that this is medical data and is therefore prone to bias. We used the F1 score metric to evaluate how well or how badly the model performs.

For medical data, a false positive is generally more tolerable than a missed case, since a wrongly flagged patient can be cleared by a follow-up test. We therefore care more about false positives than about true negatives.

Test results of various stages: We used accuracy, F1 score, recall, and precision to test the performance of each machine learning algorithm that was adapted to this project.

We used transformers on tabular data, and the performance of the algorithm was quite good. We collected the result and compared it with the results obtained on the same sample data by the other algorithms, such as decision tree, random forest, gradient boosting and logistic regression.

We notice that the performance of transformers on the tabular data was quite impressive and optimal compared to many others.

VI. CONCLUSION

The result of the proposed model has shown us a very important step towards using the concept of transformers and self-attention to solve tabular data classification tasks for a small data set.

It shows us that in the future we can use a transformer to train on a huge amount of data for different classification problems.
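The evaluation described under Simulation Results relies on precision, recall, and the F1 score, compared against the 90% required threshold. As a minimal sketch of how these metrics are derived from predictions, assuming binary labels with 1 as the positive (diabetic) class, the following uses hypothetical helper names and toy predictions rather than the paper's actual outputs.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """F1 is the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy predictions (illustrative only, not results from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
REQUIRED = 0.90  # the 90% threshold, applied here to F1 as an assumption
meets_threshold = f1 >= REQUIRED
print(precision, recall, f1, meets_threshold)
```

Because F1 ignores true negatives, it is less flattered by a large majority class than plain accuracy, which is why it suits an imbalanced dataset.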

We got the following result after our first run of the comparative analysis on the diabetes dataset.

The stratified holdout cross-validation method has been used to split the training dataset randomly into 90% train and 10% test sets. The result was collected and further compared with some existing approaches, which indicates that using transformers on tabular data (Tabpfn) outperforms the existing state-of-the-art approach. The Tabpfn transformer on tabular data was optimal among adapted models based on F1-score, which are 98.46%, 98.0694%, 91.736%, and 91.541% respectively.

• Scope of Improvement
The scope of this project is to use machine learning techniques to predict, diagnose, and prevent lifestyle diseases, with a particular focus on diabetes. With lifestyle diseases responsible for over 80% of deaths worldwide, this project aims to develop a system that can accurately identify individuals at risk of developing diabetes at an early stage, enabling timely intervention and treatment.

The project involves a comprehensive review of the existing literature on the prediction and diagnosis of lifestyle diseases using machine learning techniques. The results of this review will inform the development of a system that is reliable, efficient, and easy to use for medical professionals of different levels of expertise.

The project also involves the collection and analysis of patient data to train the machine learning algorithms. The data will be used to develop models that can accurately predict the likelihood of an individual developing diabetes based on their lifestyle habits and other risk factors.

Another scope of the project is to compare different machine learning techniques to determine which algorithm is best suited for predicting diabetes. The project will evaluate various algorithms, including Transformers on tabular data (Tabpfn), Random Forest, Decision Tree, Support Vector Machine, K-Nearest Neighbors, Gradient Boosting, Histogram Gradient Boosting, and Adaptive Boosting. The project will then identify the most effective algorithm for predicting diabetes.

Finally, the project aims to provide insights into the lifestyle habits of patients, which could be used to design targeted interventions that promote healthy habits and prevent the onset of lifestyle diseases. This will improve the overall health of the patient population and reduce the burden on healthcare systems, resulting in significant cost savings.

In conclusion, lifestyle diseases are a major health concern worldwide, and efforts must be made to prevent and treat them. Machine learning techniques have been used in this study to predict the likelihood of diabetes, a common lifestyle disease. The results of this study show that using transformers on tabular data (Tabpfn) outperformed other existing approaches, indicating that this technique could be used to improve the accuracy of diabetes prediction.

The use of machine learning in healthcare is rapidly growing, and this study is an example of how it can be used to improve diagnosis and treatment. Machine learning can help doctors and other healthcare professionals make more accurate predictions about disease outcomes, and can also help identify patients who are at high risk of developing lifestyle diseases. This information can then be used to develop targeted prevention and treatment strategies.

Furthermore, the findings of this study suggest that machine learning techniques can be used to improve the accuracy of diabetes prediction. This is important because early detection of diabetes can lead to better management and prevention of complications. Machine learning can help identify patients who are at high risk of developing diabetes, enabling healthcare professionals to develop targeted prevention strategies.

In conclusion, this study provides evidence that machine learning techniques can be used to improve the accuracy of diabetes prediction. This has significant implications for healthcare, as it can help healthcare professionals identify patients who are at high risk of developing lifestyle diseases such as diabetes. This information can then be used to develop targeted prevention and treatment strategies, ultimately leading to improved health outcomes. However, it is important to note that further research is needed to fully understand the potential of machine learning in healthcare and to develop effective strategies for implementing these techniques in clinical settings.

REFERENCES

[1]. International Diabetes Federation. Diabetic Atlas Fifth Edition 2011.
[2]. American Diabetes Association. Diagnosis and classification of diabetes mellitus. Diabetes Care 2009;32(Suppl. 1): S62–
[3]. Krentz AJ, Bailey CJ. Oral antidiabetic agents: current role in type 2 diabetes mellitus. Drugs 2005;65(3):385–411.
[4]. Tsave O, Halevas E, Yavropoulou MP, Kosmidis Papadimitriou A, Yovos JG, Hatzidimitriou A, et al. Structure-specific adipogenic capacity of novel, well-defined ternary Zn(II)-Schiff base materials. Biomolecular correlations in zinc-induced differentiation of 3T3-L1 pre-adipocytes to adipocytes. J Inorg Biochem Nov 2015; 152:123–37.
[5]. Halevas E, Tsave O, Yavropoulou MP, Hatzidimitriou A, Yovos JG, Psycharis V, et al. Design, synthesis and characterization of novel binary V(V)-Schiff base materials linked with insulin-mimetic vanadium-induced differentiation of 3T3-L1 fibroblasts to adipocytes. Structure–function correlations at the molecular level. J Inorg Biochem Jun 2015; 147:99–115.

[6]. Tsave O, Yavropoulou MP, Kafantari M, Gabriel C, Yovos JG, Salifoglou A. The adipogenic potential of Cr(III). A molecular approach exemplifying metal-induced enhancement of insulin mimesis in diabetes mellitus II. J Inorg Biochem Oct 2016; 163:323–31.
[7]. Sakurai H, Kojima Y, Yoshikawa Y, Kawabe K, Yasui H. Antidiabetic vanadium(IV) and zinc(II) complexes. Coord Chem Rev March 2002; 226(1–2):187–98.
[8]. Nongyao Nai-arun, Rungruttikarn Moungmai. Comparison of classifiers for the risk of diabetes. Elsevier Procedia Computer Science 69 (2015) 132–142.
[9]. Pima Indian diabetes datasets from UCI Repository.
[10]. Çalisir D, Dogantekin E. An automatic diabetes diagnosis system based on LDA Wavelet Support Vector Machine Classifier. Expert Syst Appl 2011;38(7):8311–5.
[11]. Hian Chye Koh and Gerald Tan: Data Mining Applications in Healthcare. Journal of Healthcare Information Management, Vol 19, No 2.
[12]. P. Giudici: Applied Data Mining Statistical Methods for Business and Industry. Wiley & Sons, 2003.
[13]. G. Piatetsky-Shapiro, U. Fayyad and P. Smyth: From data mining to Knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining, pages 1–35, MIT Press, 1996.
[14]. S. Vijiyarani, S. Sudha: Disease Prediction In Data Mining Technique – A Survey. International Journal of Computer Applications & Information Technology, Vol. II, Issue I, January 2013.
[15]. Huy Nguyen Anh Pham and Evangelos Triantaphyllou: Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization.
[16]. Oğuz Karan, Canan Bayraktar, Haluk Gümüşkaya, Bekir Karlık: Diagnosing diabetes using neural networks on small mobile devices. Expert Systems with Applications 39 (2012) 54–60.
