A Transformer On Tabular Data Comparative Analysis With Linear and Tree Base Machine Learning Algorithm On Diabetic Dataset
ISSN No:-2456-2165
Nonso Nnamoko et al. [2] presented "Predicting diabetes onset: an ensemble supervised learning approach", in which five widely used classifiers are employed for the ensembles and a meta-classifier is used to aggregate their outputs. The results are presented and compared with similar studies in the literature that used the same dataset. It is shown that the proposed method predicts diabetes onset with higher accuracy.

Tejas N. Joshi et al. [3] presented "Diabetes Prediction Using Machine Learning Techniques", which aims to predict diabetes via three supervised machine learning methods: SVM, logistic regression, and ANN. The project proposes an effective technique for earlier detection of diabetes.

In this paper, we have utilized a transformer on tabular data (TabPFN), Random Forest, Decision Tree, Support Vector Machine, K-Nearest Neighbors, Gradient Boosting, Histogram Gradient Boosting, and Adaptive Boosting to predict how likely a person is to have diabetes. The stratified holdout method was used to split the dataset randomly into 90% train and 10% test sets. The results were collected and compared with some existing approaches, which indicates that using a transformer on tabular data (TabPFN) outperforms the existing state-of-the-art approach. TabPFN was the best of the adopted models based on F1-score, with reported scores of 98.46%, 98.0694%, 91.736%, and 91.541% respectively.
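The stratified 90%/10% holdout split and the F1-score comparison described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `stratified_holdout` and `f1_score`, the seed, and the toy labels are all hypothetical, and a real experiment would more likely rely on a library implementation.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.1, seed=0):
    """Split row indices into train/test so each class keeps its share.

    Hypothetical helper: shuffles the indices of every class separately
    and moves `test_fraction` of each class into the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): 60 non-diabetic, 40 diabetic.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.1, seed=42)
```

With these toy labels the test set keeps the 60/40 class ratio of the full data, which is the point of stratifying before comparing models on F1-score.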
Training Data
Training data needs to be collected alongside the testing data, and further preprocessing is needed to understand the predictors better. Training data also helps in preparing a budget request at some point, and it serves as a proper document for building a business case and justifying budget requests.
Predictive Features
To understand what drives the target outcome, some research or investigation is needed to get ideas about the data points. Once a sound understanding of what fits the target outcome well is achieved, the further process of data requests can help build a business case. The main predictive features considered under the feasibility criteria are:
Age
Gender
Polyuria
Polydipsia
Sudden Weight Loss
Weakness
Polyphagia
Genital Thrush
Visual Blurring
Itching
Irritability
Delayed Healing
Partial Paresis
Muscle Stiffness
Alopecia
Obesity
Class

Fig 1 Training Phase and Testing Phase

We are going to build a system that can efficiently predict whether a patient is diabetic or not. The system uses a new technique known as transformers, together with a technique which we call Active Learning. Active Learning is a new technique with the aim of

Data Collection
The dataset used for this project was obtained from Kaggle, a popular platform for sharing datasets and conducting data-driven research. The dataset is an updated version of the Pima Indians Diabetes Database, which includes demographic, diagnostic, and historical medical data of patients. The updated dataset consists of 768 instances with eight features: age, number of pregnancies, glucose concentration, blood pressure, skin thickness, insulin, BMI, and diabetes pedigree function.
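As a rough sketch of how the eight features of the Pima-style dataset described above might be represented and loaded, the snippet below parses CSV text into typed records. The column names, the extra `Outcome` label column, the `PatientRecord` class, and the two sample rows are assumptions made for illustration; they are not taken from the paper.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """One patient row; field names are hypothetical."""
    pregnancies: int          # discrete count
    glucose: float
    blood_pressure: float
    skin_thickness: float
    insulin: float
    bmi: float
    diabetes_pedigree: float
    age: float
    outcome: int              # assumed label column: 1 = diabetic, 0 = not


def load_records(text):
    """Parse CSV text (assumed header names) into PatientRecord objects."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append(PatientRecord(
            pregnancies=int(row["Pregnancies"]),
            glucose=float(row["Glucose"]),
            blood_pressure=float(row["BloodPressure"]),
            skin_thickness=float(row["SkinThickness"]),
            insulin=float(row["Insulin"]),
            bmi=float(row["BMI"]),
            diabetes_pedigree=float(row["DiabetesPedigreeFunction"]),
            age=float(row["Age"]),
            outcome=int(row["Outcome"]),
        ))
    return records


# Two made-up rows, used only to exercise the parser.
SAMPLE = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,140,70,30,90,32.5,0.5,45,1\n"
    "0,95,64,22,60,24.0,0.25,28,0\n"
)
records = load_records(SAMPLE)
```

Typing the fields up front makes a malformed or missing value fail at load time rather than inside a model.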
To explore the dataset and prepare it for analysis, relational views were created for the features. The age feature represents the age of the patient, which is a continuous variable. The number of pregnancies is a discrete variable that represents the number of times a patient has been pregnant. The glucose concentration, blood pressure, skin thickness, and insulin features are continuous variables that represent different diagnostic measurements. The BMI feature is a continuous variable that represents the body mass index of the patient, and the diabetes pedigree function is a continuous variable that represents the genetic predisposition to diabetes.

Working of the Model
The first task of the project would be to gather and clean the dataset. This would involve finding a reliable source of data and performing data cleaning and preprocessing to ensure that the data is ready for analysis. The duration of this task could be 2 weeks.

The next task would be to perform exploratory data analysis on the dataset to identify trends and patterns. This could take 3 weeks, and the output would be a report on the findings.
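The cleaning and exploratory analysis tasks described above could begin with something like the following sketch, which drops incomplete rows and then summarizes one feature per class. The helper names `drop_incomplete` and `summarize_by_class`, the dictionary keys, and the toy rows are hypothetical; the paper does not specify its cleaning rules.

```python
from statistics import mean, median


def drop_incomplete(rows, required):
    """Keep only rows where every required field is present (not None)."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def summarize_by_class(rows, feature, label="outcome"):
    """Return count, mean, and median of `feature` for each class label."""
    groups = {}
    for r in rows:
        groups.setdefault(r[label], []).append(r[feature])
    return {
        cls: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for cls, vals in groups.items()
    }


# Made-up rows for illustration; one row has a missing glucose value.
rows = [
    {"glucose": 80, "outcome": 0},
    {"glucose": 90, "outcome": 0},
    {"glucose": 100, "outcome": 0},
    {"glucose": None, "outcome": 0},
    {"glucose": 150, "outcome": 1},
    {"glucose": 170, "outcome": 1},
]
clean = drop_incomplete(rows, required=["glucose", "outcome"])
summary = summarize_by_class(clean, "glucose")
```

A per-class summary like this is one simple way to surface the trends and patterns that the exploratory report is meant to capture.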
A. Algorithms
Input: Data set from Kaggle
Output: a prediction model
Variables: