Minor Project Report
DIABETES PREDICTION USING MACHINE LEARNING ALGORITHMS
A PROJECT REPORT
Submitted by
NOVEMBER 2023
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
BONAFIDE CERTIFICATE
Certified that the 18CSP111L B. Tech project report titled “DIABETES PREDICTION USING
MACHINE LEARNING ALGORITHMS” is the bonafide work of [Reg.No.RA2011003010535] and Mr. Adarsh
[RA2011003010522], who carried out the project work under my supervision. Certified further that,
to the best of my knowledge, the work reported herein does not form part of any other thesis or
dissertation on the basis of which a degree or award was conferred on an earlier occasion for this
or any other candidate.
DR. M. PUSHPALATHA
HEAD OF THE DEPARTMENT
Department of Computing Technologies
Department of Computing Technologies
SRM Institute of Science and Technology
Own Work Declaration Form
We hereby certify that this assessment complies with the University’s Rules and Regulations relating
to academic misconduct and plagiarism, as listed in the University website, regulations, and the
Education Committee guidelines.
We confirm that all the work contained in this assessment is our own except where indicated, and that
we have met the following conditions:
- Clearly referenced / listed all sources as appropriate
- Referenced and put in inverted commas all quoted text (from books, web, etc.)
- Not made any use of the report(s) or essay(s) of any other student(s), either past or present
- Acknowledged in appropriate places any help that we have received from others (e.g. fellow students, technicians, statisticians, external sources)
- Complied with any other plagiarism criteria specified in the course handbook / University website
We understand that any false claim for this work will be penalized in accordance with the University
policies and regulations.
DECLARATION:
ACKNOWLEDGEMENT
of Science and Technology, for the facilities extended for the project work and his continued
support.
We extend our sincere thanks to Dean - CET, SRM Institute of Science and Technology, Dr. T. V.
We wish to thank Dr. Revathi Venkataraman, Professor and Chairperson, School of Computing,
SRM Institute of Science and Technology, for her support throughout the project work.
We are incredibly grateful to our Head of the Department, Dr. M. Pushpalatha, Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for her support.
We want to convey our thanks to our Project Coordinator, Dr. R. Manjula, Associate Professor, and
Panel Head, Dr. Senthil Kumar T, Assistant Professor, Department of Computing Technologies,
SRM Institute of Science and Technology, for their input during the project reviews and support.
We register our immeasurable thanks to our Faculty Advisor, Mrs. Brindha R, Assistant Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for leading and
supporting us throughout the course of the project.
Our inexpressible respect and thanks to our guide, Dr. R. Manjula, Assistant Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for providing us
with an opportunity to pursue our project under his mentorship. He provided us with the freedom
and support to explore the research topics of our interest. His passion for solving problems
inspired us throughout the project.
We sincerely thank all the staff and students of the Computing Technologies Department, School of
Computing, SRM Institute of Science and Technology, for their help during our project. Finally,
we would like to thank our parents, family members, and friends for their unconditional love and
support.
ABSTRACT
TABLE OF CONTENTS
ABSTRACT VI
LIST OF FIGURES IX
LIST OF TABLES X
ABBREVIATIONS XI
1 INTRODUCTION 1
1.1 Overview 1
1.2 Approach 2
1.3 General Steps Involved 3
2 LITERATURE SURVEY 5
2.1 Literature Review
4 METHODOLOGY 12
5 RESULTS AND DISCUSSIONS 28
5.1
REFERENCES 31
APPENDIX 1 33
APPENDIX 2 72
PLAGIARISM REPORT 00
PAPER PUBLICATION 00
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS AND ABBREVIATIONS
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
CHAPTER 1
INTRODUCTION
1.1 Overview
The term "diabetes mellitus" in this project refers to a group of metabolic diseases characterized by
high blood sugar levels that can either be brought on by insufficient insulin synthesis or by
inadequate body cell insulin responsiveness. The hormone that controls blood glucose levels is
insulin. The result of this ongoing disease is blood that circulates with too much sugar. Diabetes is
one of the non-communicable diseases that puts people's health at risk: it is a chronic
disease in which the body either produces insufficient insulin or is unable to use the insulin that is
produced. Diabetes should not be disregarded because, if ignored, it can result in a number of major
health problems, such as heart conditions, renal disease, high blood pressure, eye damage, and
organ failure.
Machine learning presents a powerful tool for predicting diabetes and enabling early intervention.
Various machine learning algorithms, including KNN, random forests, decision trees and logistic
regression, can be trained on historical data to identify patterns in patient records and predict the
risk of disease. The dataset includes an individual's dietary preferences, medical history, and
patterns of physical activity. The most pertinent variables are found using feature selection
techniques, which improves the prediction models' precision and effectiveness. If diabetes is
discovered earlier, it can be treated. Different machine learning techniques are used and evaluated
for their efficacy in predicting diabetes. We will use a variety of methodologies to more accurately
forecast the onset of diabetes in patients. Here, we will investigate using a group of models,
including the AdaBoost classifier, Naive Bayes classifier, and Random Forest classifier.
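As an illustrative sketch (not the exact project code), the three candidate classifiers named above can be trained and compared on a feature matrix X and labels y. The file name diabetes.csv and the Outcome column are assumptions based on the Kaggle dataset described later.

# Illustrative sketch: train and compare AdaBoost, Naive Bayes and Random Forest.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = pd.read_csv("diabetes.csv")            # assumed local copy of the Kaggle dataset
X = data.drop(columns=["Outcome"])            # predictor variables
y = data["Outcome"]                           # 1 = diabetic, 0 = non-diabetic
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for model in (AdaBoostClassifier(), GaussianNB(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(acc, 3))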
1.2 Approach
Predicting diabetes at an early stage is crucial for timely intervention and management. Several
approaches can be employed for early diabetes prediction.
- Patient History: Gather comprehensive patient history, including family medical history, lifestyle factors (diet, exercise), and any previous instances of elevated blood sugar levels.
- Feature Selection: Identify relevant features using techniques like correlation analysis to select the most important predictors (see the sketch after this list).
- Machine Learning Algorithms: Implement machine learning models like Logistic Regression, Decision Trees, Random Forest, or even advanced techniques like Neural Networks for prediction.
- Biometric Data: Collect data like BMI (Body Mass Index), waist circumference, and blood pressure.
- Biological Markers: Monitor glucose levels, insulin resistance, and other relevant biomarkers.
- Community Health Programs: Implement community-based programs to raise awareness about diabetes prevention, encourage regular check-ups, and promote a healthy lifestyle.
- Continuous Monitoring: For individuals at high risk, establish a system for continuous monitoring and follow-up care.
- Data Privacy: Ensure that patient data is anonymized and privacy regulations are strictly adhered to.
- Informed Consent: Obtain informed consent from individuals participating in research studies or data collection efforts.
- Collaboration with Healthcare Providers: Collaborate with healthcare providers to offer screenings and early detection camps in communities.
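A minimal sketch of the correlation-analysis idea from the Feature Selection item, assuming the hypothetical diabetes.csv file: features whose absolute correlation with the Outcome label exceeds a chosen threshold are retained.

# Illustrative sketch: keep the features most correlated with the Outcome label.
import pandas as pd

data = pd.read_csv("diabetes.csv")
corr = data.corr()["Outcome"].drop("Outcome")      # correlation of each feature with the label
selected = corr[corr.abs() > 0.2].index.tolist()   # 0.2 is an arbitrary example threshold
print("Selected predictors:", selected)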
1.3 General Steps Involved in Diabetes Prediction
1. Data Collection Process: The initial step involves the meticulous gathering of patient data
encompassing various facets such as demographics, health history, and patient interactions. For
our illustration, we have employed a dataset from Kaggle containing patients' body measurements.
This data collection process is systematic, revolving around defining a research question or
hypothesis, selecting a suitable sample population, and determining the appropriate data collection
methods and tools.
2. Data Preprocessing Steps: Following data collection, the data is subjected to thorough
preprocessing. This entails addressing missing values, handling outliers, and rectifying
inconsistencies. The objective is to transform and normalize the data, making it conducive for
utilization in machine learning algorithms. Data preprocessing is a comprehensive procedure
involving several facets:
Data Integration: Merging information from diverse datasets into a cohesive whole.
3. Feature Engineering Endeavors: This phase focuses on extracting pertinent features from the
data that are likely to influence diabetes outcomes. These features may encompass variables such as
the patient's Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree
Function, Age and Outcome. Feature engineering is the process of shaping raw data into valuable
features for machine learning models. This process involves several key aspects:
- Feature Selection: Identifying the most pertinent features from a broader set, often through statistical analysis or assessing attribute significance.
- Feature Extraction: Generating new features from existing ones, employing techniques like principal component analysis or domain-specific knowledge.
- Feature Scaling: Ensuring that feature values are standardized to a comparable range, vital for certain machine learning models (see the sketch below).
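A short sketch of the scaling and extraction steps above, using scikit-learn's StandardScaler and PCA; the feature matrix X here is a random placeholder standing in for the eight diabetes features.

# Illustrative sketch: feature scaling followed by feature extraction with PCA.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 8)                               # placeholder feature matrix
X_scaled = StandardScaler().fit_transform(X)             # zero mean, unit variance per feature
X_reduced = PCA(n_components=4).fit_transform(X_scaled)  # keep 4 principal components
print(X_reduced.shape)                                   # (100, 4)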
4. Model Selection Considerations: The critical decision of choosing a suitable machine learning
model for the specific problem at hand is pivotal. Common models used for diabetes prediction
include logistic regression, decision trees, random forests, and support vector machines. Model
selection is a crucial step, as it entails picking the optimal model from a range of candidates, all
trained on the same dataset. The aim is to identify the model capable of generalizing effectively to
new data, yielding accurate predictions. Model selection techniques encompass various methods
such as cross-validation, holdout validation, and bootstrapping. The choice of model significantly
influences predictive performance, emphasizing the importance of careful selection.
5. Model Training: The selected model is trained using the preprocessed data. Model training plays
a pivotal role in machine learning, involving the process of teaching the model to make accurate
predictions. It entails iteratively adjusting the model's parameters, enabling it to recognize patterns
in input data and generate desired outputs. The training process employs a training dataset
containing input features and corresponding target labels. The model optimizes its parameters based
on this data, facilitating accurate predictions on unseen data. Training success relies on factors
including data quality, algorithm choice, optimization techniques, and hyperparameters. Model
performance is assessed using validation methods such as cross-validation, ensuring reliable
predictions.
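For instance, the cross-validation mentioned above can be sketched as follows, on stand-in synthetic data rather than the project's dataset.

# Illustrative sketch: 5-fold cross-validation to assess a model during training.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # stand-in data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Mean CV accuracy:", scores.mean().round(3))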
6. Model Evaluation Procedures: The performance of the trained model is assessed using suitable
evaluation metrics such as accuracy, precision, recall, and F1 score. Model evaluation is a pivotal
facet of machine learning, involving the scrutiny of a trained model's performance on new, unseen
data. The objective is to verify the model's ability to generalize effectively and produce accurate
predictions. Evaluation metrics like accuracy, precision, recall, and F1 score are applied, depending
on the application's nature. Model evaluation is an iterative process, often necessitating adjustments
to hyperparameters and data preprocessing to enhance performance. Additionally, it aids in model
comparison and the selection of the most suitable model for a given application.
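The metrics named above can each be computed in one call; a sketch with placeholder labels:

# Illustrative sketch: accuracy, precision, recall and F1 from true vs. predicted labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # placeholder model predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))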
CHAPTER 2
LITERATURE SURVEY
Currently, both traditional statistics-based forecasting and predictions using integrated classifiers
are employed in algorithms for predicting patient outcomes among domestic and global users [1].
These methods combine machine learning techniques with statistical theory and use visual insights
to establish relationships between various indicators. For instance, one group of authors
developed a predictive model based on logistic regression, focusing on the average time patients
spend per day. Experimental results using a real dataset, after identifying and replacing null
values, show that the proposed technique has higher accuracy after imputation of missing values
[2]. They conducted a comparison study to validate the effectiveness of their new technique in
predicting patient behaviour before and after optimization. The authors of this study conducted
patient health analysis using a logistic regression model, training and evaluating it with factors
such as Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree
Function, Age and Outcome. In the initial testing phase, the model displayed a 74% accuracy rate,
which later increased to 79% [8]. Furthermore, combining the two distinct datasets mentioned above
significantly enhanced the model's accuracy [9]. However, it is worth noting that the diabetes
prediction model overlooked some critical factors influencing patient decision-making
processes, such as recent service utilization and satisfaction with patient support [11]. Thus, it
may not serve as a comprehensive tool for identifying all the relevant causes. Nonetheless,
this research carries significant value. The Improved Diabetes Prediction Method [3]
comprises three key steps: quantifying tie strength, utilizing machine learning techniques to
amalgamate traditional and social variables, and employing an influence propagation model [13].
For strategic planners, a pattern analysis framework is recommended to offer guidance. The chat
graph approach to diabetes prediction focuses on forecasting reports based on conversation
activity [5]. However, this approach does not consider the social elements derived from graph
theory. Users are grouped into categories for diabetes prediction based on their online actions using
a clustering method, which then applies rules for early intervention [15]. In contrast, diabetes
prediction by exploratory data mining relies on clustering, a common technique for statistical data
analysis used in many fields, including machine learning, pattern recognition, image analysis,
information retrieval, bioinformatics, data compression, and computer graphics.
1. A. Sneha, N. and Gangil, T., "Analysis of diabetes mellitus for early prediction using optimal
features selection", Journal of Big Data, 6(1), p.13 (2019) [1]
The authors have focused on selecting the attributes for early detection of Diabetes Mellitus using
predictive analysis and designing a prediction algorithm using machine learning techniques. The
data is collected from the UCI machine learning repository; 15 attributes have been used for
classification. Support Vector Machine, Random Forest and Naïve Bayes are the classifiers used,
with accuracies of 77.73%, 75.39% and 73.48% respectively.
2. B. C. Zhenhai and Liu Wei, "Logistic Regression Model and Its Applications" [2]
The authors propose a random forest algorithm for diabetes prediction, developing a system that can
perform early prediction of diabetes for patients with higher accuracy using random forest
algorithms. The proposed model gives the best results for predicting diabetes, and the results show
that the prediction system can predict diabetes effectively, efficiently and, most importantly,
instantaneously. Nnamoko et al. presented prediction of diabetes onset using a group-supervised
learning approach; they used five widely used classifiers for groups and one Meta classifier.
Results are presented and compared with similar studies that used the same data sets in the
literature. It is shown that by using the proposed method, prediction of the onset of diabetes can be
made with greater accuracy.
3. C. Tejas N. Joshi, Prof. Pramila M. Chawan, "Diabetes Prediction Using Machine Learning
Techniques", Int. Journal of Engineering Research and Application, Vol. 8, Issue 1 (Part-II),
January 2018, pp. 09-13 [3]
Diabetes prediction is presented using machine learning techniques through three different
supervised machine learning methods: SVM, logistic regression and ANN. This
project proposes an effective technique for the early detection of diabetes. Deeraj Shetty et al.
proposed diabetes disease prediction using data mining, assembling an Intelligent Diabetes Disease
Prediction System that gives analysis of diabetes malady utilizing diabetes patients' diagnosis
information. In this system, they propose the use of algorithms like Bayesian and KNN (K-Nearest
Neighbor) applied to diabetes patients' databases, analysing them by taking various attributes of
diabetes for prediction of diabetes disease.
4. D. Sisodia, D. and Sisodia, D.S., "Prediction of diabetes using classification algorithms",
Procedia Computer Science, 132, pp. 1578-1585 (2018) [4]
The authors designed a support system for estimating disease, including diabetes, using the Pima
Indian Diabetes Database (PIDD). In this study, three machine learning classification
algorithms, Naive Bayes, SVM, and Decision Tree, were used to diagnose diabetes at an
earlier stage, with accuracies of 76.3%, 65.1% and 73.82% respectively. They validated the
approach using a case study and compared it with other methods, such as decision trees and
neural networks. The results show that their method outperforms the other methods in terms of
classification accuracy and feature selection. The proposed method has practical applications in
patient retention and patient relationship management. The paper provides a valuable contribution
to the field of diabetes prediction and highlights the importance of stratified sampling and model
combination in improving the accuracy of diabetes prediction models.
5. E. Rahul Joshi and Minyechil Alehegn, "Analysis and prediction of diabetes diseases using
machine learning algorithm: Ensemble approach", International Research Journal of
Engineering and Technology, Volume 04, Issue 10, October 2017 [5]
The authors have proposed ML techniques for predicting diabetes from the data set at an initial
phase to save lives, using KNN and Naïve Bayes algorithms. In this study the proposed method
provides high accuracy, with an accuracy value of 90.36%, while Decision Stump provided less
accuracy than the others at 83.72%. Random Forest, Naive Bayes, and KNN are the most
widely employed predictive algorithms here. A single algorithm offered less precision than the
ensemble one. The decision tree was highly accurate in most of the tests. Java and Weka are the
tools used in this hybrid study for predicting diabetes data. They proposed a theory based on
analysis and prediction of diabetes diseases using machine learning algorithms with an ensemble
approach. To make this system an ensemble hybrid model, the following algorithms are used: KNN,
Naive Bayes, Random Forest and J48, which are combined to increase the performance and
accuracy. J48 is one of the most popular and offers better accuracy. All these algorithms are used
to enhance the accuracy and are advanced when compared to others. The random forest provides
better accuracy than J48 as well as Naive Bayes under 10-fold cross-validation splitting. A fuzzy
rule was developed to reduce wrong treatment.
6. F. In the paper "Predictive Supervised Machine Learning Models for Diabetes Mellitus" by
L. J. Muhammad, Ebrahem A. Algehyne and Sani Sharif Usman, published by Springer [6]
In this study, the diagnostic dataset of DM type 2 was collected from the Murtala Mohammed
Specialist Hospital, Kano, and used to develop predictive supervised machine learning models
based on logistic regression, support vector machine, K-nearest neighbor, random forest, naive
Bayes and gradient boosting algorithms.
Algorithms used: K-nearest neighbor, random forest, naive Bayes and gradient boosting.
The random forest predictive learning-based model appeared to be one of the best developed
models with 88.76% accuracy; however, in terms of the receiver operating characteristic
curve, the random forest and gradient boosting predictive learning-based models were found to be
the best predictive learning models with 86.28% predictive ability, respectively.
CHAPTER 3
SYSTEM ARCHITECTURE AND DESIGN
In order to create, train, and implement a diabetes prediction model, a number of components are
usually included in the system architecture for early diabetes prediction using machine learning.
Data collection, the first component, entails gathering and preparing patient data from a variety
of sources, including health records, patient reviews, and demographic data. After that, the data is
cleansed, transformed, and made ready for modelling. The second component is feature engineering,
which comprises selecting and adjusting relevant data characteristics to improve the accuracy of the
prediction model. For this, methods like dimensionality reduction, feature scaling, and feature
selection may be applied.
3.2 Use Case Diagram
To efficiently detect and anticipate diabetes among patients, the diabetes prediction
system architecture consists of several components and phases.
The data collection phase, which forms the basis of the architecture, involves gathering pertinent
data from a variety of sources, including health history, patient feedback, demographic data, and
patient interactions. After that, the data is kept for later processing and analysis in a central data
repository, such as a data warehouse or big data platform. Data preparation is the following step,
where the gathered data is cleaned, transformed, and feature engineered. This stage guarantees that
the format of the data is appropriate for modelling and analysis. Missing value handling, data
normalisation, and feature creation based on domain expertise are a few examples of activities that
could be included.
The model creation step of the design comes after data preparation. At this point, statistical or
machine learning methods are used to develop prediction models. In order to find trends and
pinpoint the main factors associated with diabetes, these models are trained on historical patient
data, which includes both non-diabetic and diabetic patients. Various techniques, depending on the
available data and the complexity of the task, can be used, including logistic regression, decision
trees, and neural networks.
Lastly, a feedback loop is incorporated into the system design to help the diabetes prediction model
improve over time. The model may be frequently retrained and modified to adjust to shifting
patient behaviour and population dynamics by gathering input on the forecast accuracy and tracking
the actual diabetes results.
The fourth component is training the model, which comprises using techniques such as cross-
validation and hyperparameter tuning to maximize the model's performance by training the
selected algorithm on the prepared data. As the fifth component, the trained model's performance is
evaluated on a holdout set of data, ensuring that it generalises well to new data. Model deployment,
the last phase, involves applying the learned model to new data and using it to generate predictions.
To do this, the model may be released as a REST API or microservice that can be integrated into
existing applications. Overall, there are several elements in the system architecture for patient
diabetes prediction using machine learning that call for proficiency in machine learning
algorithms, feature engineering, data preprocessing, and deployment infrastructure. For the system
to manage massive data volumes and keep producing precise predictions over time, it must also be
scalable, dependable, and maintainable.
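As a hedged illustration of the deployment idea (not the project's actual service), a trained model could be exposed through a minimal Flask endpoint; the model file diabetes_model.pkl and the eight input field names are assumptions.

# Illustrative sketch: serving a pickled diabetes model as a small REST API.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("diabetes_model.pkl", "rb") as f:      # assumed pre-trained, pickled model
    model = pickle.load(f)

FIELDS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
          "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    row = [[payload[name] for name in FIELDS]]   # order must match the training columns
    return jsonify({"diabetic": int(model.predict(row)[0])})

if __name__ == "__main__":
    app.run(port=5000)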
CHAPTER 4
METHODOLOGY
A rudimentary machine learning model with a predefined set of characteristics, trained on
sparse and outdated data, may already be in use at certain organisations. This approach may be less
accurate because it fails to take dynamic shifts in patient behaviour and preferences into
account.
Others may have put in place more sophisticated machine learning algorithms that make
use of a wide range of patient data, such as demographics, past health reports, activity, patient
blood samples, and sentiment analysis on social media. These models might generate precise and
dynamic diabetes predictions by utilising methods like neural networks and decision
trees.
Predictive modelling, data analysis, and data gathering are usually combined in the current
diabetes prediction system. Organisations collect pertinent data from a variety of sources,
including past health records, demographic data, and contacts with patients. Databases and data
warehouses are used to organise and store this data.
Data preparation is the process of transforming and cleaning the obtained data to make sure it is
suitable for analysis and of high quality. In order to produce a consistent and trustworthy dataset,
this stage entails addressing missing values, eliminating outliers, and normalising data. Once the
data is prepared, predictive modelling approaches are used to create diabetes prediction models.
These models find trends, correlations, and variables that lead to diabetes onset using machine
learning algorithms or statistical techniques. To train the models on past patient data, several
algorithms including logistic regression, decision trees, random forests, or neural networks are
frequently employed on Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI,
Diabetes Pedigree Function, Age and Outcome.
AUC, accuracy, precision, recall, and other measures are used to assess the diabetes prediction
models' performance. This assessment aids in determining how well the model predicts diabetes
onset. The model is incorporated into clinical systems or patient relationship management
(CRM) platforms for real-time diabetes prediction if it satisfies the required performance
benchmarks.
To evaluate the effectiveness of the model in the current system, practitioners frequently track the
diabetes forecasts and contrast them with the actual diabetes results. Through retraining and
updating the diabetes prediction models based on the most recent data and patient behaviour, this
feedback loop enables continuous improvement of the models. In order to offer insights into patient
categories that are at high risk of diabetes, intervention strategy efficacy, and diabetes-related
patterns, the current system may additionally include capabilities like dashboards or visualisation
tools. These visualisations aid in understanding the dynamics of diabetes with respect to
Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function,
Age and Outcome, and inform decision-making.
4.2 Proposed System
A predictive diabetes model is a classification tool: a system that examines patient
attributes to determine which traits are essential in forecasting diabetes risk. Let's imagine we have
a dataset with information on Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI,
Diabetes Pedigree Function, Age and Outcome of people [1]. These patients' characteristics,
including their Glucose, Blood Pressure, Insulin, BMI and Age, among others, are
described in the data. Our model should predict the patient's diabetes status; hence, the
target variable will be Outcome. The data should be examined with an emphasis on how
various aspects connect to the patient's diabetes status [14].
We are prepared to construct many models in search of the optimum fit. Diabetes prediction is a
binary classification problem, since a patient either develops diabetes within a predetermined
amount of time or does not. We'll test:
Naive Bayesian
Naive Bayes is a machine learning algorithm based on the Bayes theorem of probability. It is a
probabilistic algorithm that uses the conditional probability of features to classify data into different
categories. Naive Bayes is commonly used for text classification and spam filtering, but it can also
be used in other classification tasks such as sentiment analysis, recommendation systems, and
patient diabetes prediction. The algorithm works by calculating the probability of each feature
given a class label and then multiplying all these probabilities to get the probability of a data point
belonging to a particular class. The class with the highest probability is then assigned as the
prediction for the data point.
Naive Bayes is a probabilistic machine learning algorithm commonly used for classification tasks. It
is based on Bayes' theorem and assumes that the features are conditionally independent given the
class label. Despite its simplicity and naive assumption, Naive Bayes often performs remarkably
well and is widely used in various applications such as spam filtering, sentiment analysis, and
document categorization.
The algorithm is called "naive" because it assumes that the presence or absence of a particular
feature is independent of the presence or absence of any other feature, given the class label. This
assumption allows for simplified calculations and efficient training.
During the training phase, Naive Bayes calculates the probabilities of each feature given each class
label by counting occurrences in the training data. It estimates the prior probabilities of each class
label based on the frequency of their occurrences. These probabilities are then combined using
Bayes' theorem to calculate the posterior probability of each class label given the observed features.
During the prediction phase, Naive Bayes uses the calculated probabilities to determine the most
likely class label for a new instance. It calculates the posterior probabilities for each class label and
selects the label with the highest probability as the predicted class.
Naive Bayes has several advantages. It is computationally efficient and works well with large
datasets. It can handle high-dimensional feature spaces and is robust to irrelevant features, as the
independence assumption allows it to disregard irrelevant correlations. Naive Bayes is also less
prone to overfitting, especially when the training data is limited. Despite its simplicity, Naive Bayes
performs well in many real-world scenarios. However, the assumption of feature independence can
limit its effectiveness in cases where there are strong dependencies among the features. In such
cases, more sophisticated algorithms may be more appropriate. Additionally, Naive Bayes is
sensitive to the presence of rare or unseen feature combinations in the training data, which can
result in zero probabilities and affect the accuracy of predictions.
In summary, Naive Bayes is a simple yet effective probabilistic algorithm used for classification
tasks. Its efficiency, ability to handle high-dimensional data, and robustness to irrelevant features
make it a popular choice in various applications. However, its assumption of feature independence
may limit its performance in certain scenarios.
One of the strengths of Naive Bayes is that it requires a relatively small amount of training data to
estimate the parameters needed for classification. However, it can be sensitive to irrelevant or
correlated features, and its assumption of independence may not hold in some real-world
applications.
Kernel SVM
Kernel Support Vector Machine (SVM) is a popular classification algorithm in machine learning
that can be used for both linear and non-linear data. It works by finding the hyperplane that
maximizes the margin between the two classes in the dataset. In kernel SVM, the data is
transformed into a higher dimensional space using a kernel function, such as a radial basis function
(RBF) or polynomial function, to make it easier to separate the classes. The transformed data is then
used to find the optimal hyperplane.
Kernel Support Vector Machines (SVM) is a powerful machine learning algorithm that has gained
popularity due to its ability to handle non-linearly separable data. SVMs are binary classifiers that
aim to find an optimal hyperplane to separate data points belonging to different classes. However,
in cases where the data is not linearly separable, the kernel trick comes into play.
Kernel SVM extends the capabilities of traditional SVMs by transforming the input data into a
higher-dimensional feature space, where it becomes linearly separable. The kernel function plays a
crucial role in this process by efficiently mapping the data points into the desired space. Common
kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid. The kernel
trick allows the SVM algorithm to operate in the original input space, avoiding the need for explicit
computation in the higher-dimensional feature space. This makes kernel SVM computationally
efficient, even for complex data.
One of the key advantages of kernel SVM is its ability to capture intricate decision boundaries,
enabling it to handle non-linear relationships in the data. The RBF kernel, in particular, is widely
used and exhibits excellent performance across various domains. Kernel SVMs are robust against
overfitting as they focus on maximizing the margin between support vectors rather than attempting
to fit every training point precisely. Support vectors are the data points closest to the decision
boundary and are critical for determining the optimal hyperplane.
Despite its strengths, kernel SVMs have some considerations. Choosing an appropriate kernel
function and tuning its parameters can be challenging, requiring careful experimentation.
Additionally, kernel SVMs can be computationally demanding, especially with large datasets, as the
training complexity increases with the number of support vectors. In summary, kernel SVM is a
versatile algorithm that leverages the kernel trick to handle non-linear data effectively. Its ability to
capture complex decision boundaries makes it a valuable tool in various machine learning tasks,
including classification and regression. However, proper kernel selection and parameter tuning are
crucial for achieving optimal performance.
Kernel SVM is useful when the data is not linearly separable and there are complex decision
boundaries between the classes. It has been widely used in various fields, including image
classification, text classification, and bioinformatics.
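A hedged RBF-kernel SVM sketch on synthetic non-linearly separable data; the C and gamma values are arbitrary examples, not tuned settings.

# Illustrative sketch: kernel SVM with an RBF kernel on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))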
KNN
KNN, or k-nearest neighbors, is a classification algorithm that is based on the idea of finding the k
nearest data points in the feature space to the point being classified. The algorithm then assigns the
class that appears most frequently among the k nearest neighbors to the point being classified.
The k value in KNN is a hyperparameter that needs to be set before running the algorithm. A
smaller value of k will result in a more flexible decision boundary, which is more sensitive to noise
in the data, while a larger value of k will result in a smoother decision boundary that is less sensitive
to noise in the data.
KNN is a simple and effective algorithm that can be used for both classification and regression
problems. However, it can be computationally expensive, especially when dealing with large
datasets, as it requires computing distances between each data point and every other data point in
the dataset. KNN also requires careful normalization of the feature values to ensure that features
with larger scales do not dominate the distances calculated.
K-Nearest Neighbors (KNN) is a simple yet effective algorithm in machine learning that is widely
used for both classification and regression tasks. KNN is a non-parametric algorithm, meaning it
does not make any assumptions about the underlying data distribution.
The basic idea behind KNN is to classify or predict a new data point based on its proximity to the K
nearest neighbors in the training set. The "K" in KNN represents the number of neighbors to
consider. The algorithm assumes that similar instances in the feature space tend to have similar
labels or target values. During the classification task, KNN calculates the distance between the new
data point and all other data points in the training set using a distance metric such as Euclidean
distance or Manhattan distance. It then selects the K nearest neighbors based on the shortest
distances. The class label of the new data point is determined by majority voting among its K
nearest neighbors. In regression tasks, KNN predicts the target value by averaging the values of its
K nearest neighbors.
KNN is a lazy learning algorithm, meaning it does not explicitly build a model during the training
phase. Instead, it stores all the training data and performs computations at the prediction time. This
makes the training process faster, but the prediction can be computationally expensive, especially
for large datasets. One of the advantages of KNN is its simplicity. It does not assume any
underlying data distribution, making it suitable for a wide range of datasets. KNN can handle both
numerical and categorical data, making it a versatile algorithm. It is also robust to outliers since it
relies on the majority vote or average of the nearest neighbors. Additionally, KNN does not require
an extensive training phase.
However, KNN has some considerations. It can be sensitive to the choice of the number of
neighbors (K) and the distance metric, and selecting appropriate values for these parameters is
crucial for good performance. The algorithm can also suffer from the curse of dimensionality,
where the distance-based calculations become less meaningful as the number of dimensions
increases. In summary, K-Nearest Neighbors (KNN) is a simple and intuitive algorithm that relies
on the proximity of training instances to make predictions. Its versatility, robustness to outliers, and
ease of implementation make it a popular choice in various machine learning tasks. However,
careful parameter selection and potential scalability issues should be considered when applying
KNN to real-world scenarios.
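A small sketch comparing a few values of K on scaled stand-in data; as noted above, feature scaling matters for distance-based KNN.

# Illustrative sketch: KNN with feature scaling, comparing several values of K.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
for k in (3, 5, 11):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    print(f"k={k}: test accuracy {knn.score(X_test, y_test):.3f}")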
Support Vector Machine with Radial basis function kernel
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for
classification or regression tasks. The Radial Basis Function (RBF) kernel is one of the most
commonly used kernels in SVM. It maps the input data to a higher dimensional space and makes it
possible to separate the data points using a hyperplane. The RBF kernel is defined by a distance
metric, which measures the similarity between two data points. It is a popular choice for SVM
because it is capable of modeling complex decision boundaries and can handle non-linearly
separable data.
In SVM, the goal is to find the hyperplane that separates the data points into their respective classes
with the maximum margin. The margin is the distance between the hyperplane and the closest data
points from each class. SVM tries to maximize this margin so that it can generalize well to unseen
data. The RBF kernel in SVM calculates the distance between data points in the higher dimensional
space, which allows for more complex decision boundaries.
One disadvantage of SVM with RBF kernel is that it can be sensitive to the choice of
hyperparameters, such as the regularization parameter (C) and the kernel parameter (gamma). The
choice of these parameters can affect the performance of the model and can be a challenge for some
datasets. However, with proper tuning of these parameters, SVM with RBF kernel can be a
powerful tool for classification tasks.
These models need further tuning, and we'll do so using the following steps:
- Search for Parameters: We'll choose the parameters and values we want to explore for each of our models. The best parameters found will be set when we run GridSearchCV (see the sketch after this list).
- Best Model Fit: We train the system using the training dataset after determining the best estimator.
- Performance Evaluation: Using our test set, we will evaluate the models that performed the best after being trained on our training dataset.
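A hedged sketch of the parameter-search step using GridSearchCV over an RBF SVM; the grid values are illustrative, not the project's exact settings.

# Illustrative sketch: hyperparameter search with GridSearchCV, then evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)   # refits the best estimator by default
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))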
4.3 Data Retrieval Process
The word "read" describes the process of getting data from a storage device. Data retrieval in
databases is the method of finding and extracting data from a database based on a user-
or application-provided query. It enables the retrieval of data from a database for display on a
monitor and use in a program. A generalised block diagram for the proposed system is shown. The
dataset we utilised is freely available and is also accessible through its internet website. We split
it into training and testing sets: 70% of the data is in the training set, and about 30% of the
data is in the test section.
We develop our algorithm using the training set, and then we use it to forecast the diabetes
outcome in the test dataset. During testing, we evaluate the algorithm's accuracy, since we know
the true outcomes.
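The 70/30 split described above, sketched with scikit-learn on the assumed diabetes.csv file:

# Illustrative sketch: 70/30 train/test split of the diabetes dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("diabetes.csv")
X, y = data.drop(columns=["Outcome"]), data["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)   # 70% train, 30% test
print(len(X_train), "training rows,", len(X_test), "test rows")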
4.4 Implementation
Implementation Process :
- Importing libraries
- Preprocessing the data
- Preview Data
- Feature data-types [e.g. Pregnancies, Glucose, BP, BMI, Insulin, Age]
- Count of null values
- Data Modelling
- Modelling Evaluation
1. Data Visualisation
Because data visualization facilitates the analysis of intricate data patterns, the identification of
trends, and well-informed decision-making, it is essential for the early diagnosis of
diabetes. The following are some ways that data visualization can be applied to the early detection
of diabetes.
Interactive Dashboards: Create dynamic dashboards that show a range of diabetes-related data,
including blood sugar readings, body mass index, family history, and lifestyle decisions. Users can
change these parameters to examine how they affect the risk of diabetes in real time.
Heatmaps: To see the relationship between many factors and the risk of diabetes, use heatmaps.
Heatmaps can show you which combinations of variables are more common in people who have
diabetes at a young age.
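A short correlation-heatmap sketch using seaborn (assuming the same hypothetical diabetes.csv):

# Illustrative sketch: heatmap of correlations between features and diabetes outcome.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("diabetes.csv")
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlations in the diabetes dataset")
plt.tight_layout()
plt.show()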
Through data visualization, patterns, trends, and relationships within the data can be easily
identified and interpreted. It allows individuals to explore and gain insights from the data by
visually examining the distributions, variations, and correlations between different variables. By
presenting data visually, it becomes easier to spot outliers, detect patterns, and make data-driven
decisions. Various types of visualizations can be employed depending on the nature of the data and
the intended purpose. Commonly used visualizations include bar charts, line charts, scatter plots,
pie charts, histograms, heatmaps, and geographical maps. Each type of visualization serves a
specific purpose in representing different aspects of the data, such as comparing values, showing
trends over time, displaying the composition of categories, or illustrating spatial patterns.
Data visualization plays a crucial role in data analysis and decision-making across numerous
domains, including business, finance, healthcare, marketing, and research. It enables stakeholders to
gain a holistic view of complex datasets and effectively communicate insights to a wide range of
audiences. Moreover, interactive data visualizations allow users to interact with the data and
customize the visual representations based on their needs. They can zoom in, filter, and manipulate
the data to explore specific aspects or drill down into details. This interactivity enhances the user's
engagement and promotes a deeper understanding of the data.
Temporal Analysis: Track changes in diabetes-related factors over time with time-series graphics.
Patients with prediabetic conditions or those with a family history of diabetes may find this very
helpful. We can also use visualizations to teach patients about the risks associated with their
conditions. Patients are helped to change their lifestyles to lower their risk, since visual
representations are frequently simpler to understand than raw numerical statistics.
Comparative Analysis: Analyze and compare the diabetes-related parameters of people with and
without the disease. Box plots and violin plots are useful tools for displaying the variations in
several parameters, which can help discover important factors linked to the onset of diabetes early.
In summary, data visualization is a powerful tool for transforming data into meaningful and
actionable insights. It simplifies complex information, uncovers patterns, and facilitates effective
communication of data-driven findings. By leveraging visual representations, individuals can make
informed decisions, drive innovation, and gain a deeper understanding of the underlying data.
- According to the dataset's gender distribution, there are roughly equal numbers of male and female patients; tests and components are interpreted accordingly.
- Younger patients make up the majority of the dataset, and their body measurements show considerable variation.
- Body measurements differ markedly between diabetic, non-diabetic, and early-stage diabetic patients.
- Changes are noticeable across patient reports as body component levels change.
- Most patients seem to require access to a report of health changes according to body component levels.
Figure 4.2. Distribution of Label Encoded Categorical Variables
Data Preprocessing:
Describe the dataset used for the analysis, including the number of samples, features, and any
preprocessing steps applied (e.g., handling missing values, feature scaling, etc.).
Model Evaluation:
Present the evaluation metrics used to assess the performance of the predictive model(s). Common
metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
Provide a confusion matrix or ROC curve to visually represent the model's performance.
Feature Importance:
Discuss the features that were found to be most important in predicting early diabetes. This
information is valuable for understanding the underlying factors contributing to diabetes risk.
Model Performance:
Present the accuracy or performance metric achieved by the model on the test dataset.
Compare the performance of the machine learning model with baseline models or traditional
methods, if applicable.
Interpretation of Results:
Interpret the findings in the context of diabetes research. Explain the significance of the identified
features and how they relate to established risk factors for diabetes.
Discuss any surprising or unexpected results and propose possible explanations.
Clinical Implications:
Discuss how the predictive model can be utilized in clinical settings for early diabetes risk
assessment. Highlight the potential benefits of early detection, such as preventive interventions and
lifestyle modifications.
Limitations:
Address the limitations of the study, such as dataset limitations, potential biases, or constraints of
the machine learning techniques used.
Discuss any challenges encountered during the analysis and how they might have influenced the
results.
Classification Models
Classification accuracy is one of the most popular classification assessment metrics
used to assess baseline techniques: the number of correct forecasts as a fraction of all
predictions [5]. Nevertheless, when there are class-imbalance issues, it is not the most
informative statistic. The "Accuracy" score, which gauges the extent to which the model's
predictions are able to differentiate between favourable and adverse classes, will thus be used to
evaluate the data [4].
The first cycle of foundation algorithms for classification revealed that the K Nearest neighbor
model and Random Forest scored better than the remaining five models, according to the dataset's
greatest mean Accuracy Scores. Figure 5 compares the Accuracy scores in graphical form and we
can see that K Nearest neighbor model has a good accuracy compared to the rest. Classification
models are a fundamental component of machine learning and are widely used to predict categorical
outcomes or class labels based on input features. There are several popular classification models,
each with its own characteristics, advantages, and areas of application.
The K Nearest Neighbor model is a widely used classification model based on the closest training
examples in the feature space. It estimates the relationship between the input features and the
probability of belonging to a certain class. It is very useful for nonlinear data because the
algorithm makes no assumption about the underlying data distribution.
The K Nearest Neighbor model is interpretable and simple to apply, making it suitable for
both small and large datasets.
Decision Trees are versatile classification models that use a tree-like structure to make decisions.
Each internal node in the tree represents a feature, and the branches correspond to the possible
feature values. Decision Trees are easy to understand and visualize, and they can handle both
categorical and numerical features. However, they are prone to overfitting, especially when the tree
becomes deep and complex.
Random Forest is an ensemble learning method that combines multiple decision trees to make
predictions. It addresses the overfitting issue of Decision Trees by introducing randomness through
bootstrapping and random feature selection. Random Forest provides robust and accurate results,
even in the presence of noisy or missing data, and it can handle high-dimensional datasets
effectively.
Support Vector Machines (SVMs) are powerful classification models that aim to find an optimal
hyperplane to separate different classes. SVMs maximize the margin between classes, making them
less prone to overfitting. They can handle linearly separable as well as non-linearly separable data
by using a kernel function to transform the data into a higher-dimensional feature space. SVMs
work well with small to medium-sized datasets but can be computationally expensive with large
datasets.
Naive Bayes is a probabilistic classification model based on Bayes' theorem. It assumes that the
features are conditionally independent given the class label, making calculations and training
efficient. Naive Bayes performs well with large datasets and can handle high-dimensional feature
spaces. However, it may not capture complex dependencies among features due to the
independence assumption. Neural Networks, particularly Deep Learning models, have gained
immense popularity in recent years for classification tasks. They consist of multiple layers of
interconnected nodes (neurons) and can capture complex relationships in the data. Deep Learning
models require large amounts of data for training and are computationally intensive, but they have
achieved state-of-the-art performance in various domains, such as image and text classification.
These are just a few examples of classification models, each with its own strengths and weaknesses.
The choice of the appropriate model depends on the specific problem, the characteristics of the data,
and the desired trade-offs between interpretability, accuracy, and computational efficiency. It is
important to understand the nuances of each model and experiment with different techniques to
achieve the best classification results.
CHAPTER 5
RESULTS AND DISCUSSIONS
Overall, the models ran successfully, and we found logistic regression to be the most useful in this
case. Hence, improvement of this model was the focus, and we obtained better accuracy. The
final result is depicted in the form of a confusion matrix.
We have 208 + 924 correct predictions according to the confusion matrix, and 166 + 111 wrong
ones, i.e. an accuracy of (208 + 924) / (208 + 924 + 166 + 111) ≈ 80%, so our model demonstrates
the qualities of a respectable model.
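The reported figures can be reproduced from the confusion matrix; a sketch using the stated counts (the TN/TP placement is an assumption, since the report does not label the quadrants):

# Illustrative sketch: accuracy from the reported confusion-matrix counts.
import numpy as np

cm = np.array([[924, 166],    # assumed: [true negatives, false positives]
               [111, 208]])   # assumed: [false negatives, true positives]
accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy: {accuracy:.2%}")   # about 80%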
Fig. 5.2 Accuracy Graph
Based on the mean of the training accuracy and test accuracy, the accuracy graph in Figure
5.2 depicts the model's ability to differentiate among categories. The orange line depicts the test
accuracy rate; the accuracy curve of a random classifier is a baseline that a machine learning
model tries to stay above as far as possible. The graph above shows that the enhanced Logistic
Regression model had a greater area under the curve score.
Table 5.1 compares the algorithms used and their accuracies. As noted earlier, the first cycle of
foundation algorithms for classification revealed that the K Nearest Neighbor model and Random
Forest scored better than the remaining five models, according to the dataset's greatest mean
accuracy scores, with the K Nearest Neighbor algorithm showing good accuracy compared to the
rest.
Table 5.1 Comparing the accuracies of different algorithms
In the figure above, once the patient's details are entered, the system predicts the patient's
diabetes status.
CHAPTER 6
6.1 CONCLUSION
Although a vast variety of work has been done on developing strategies and algorithms for
early prediction of diabetes, among all of those approaches we use different machine learning
algorithms alongside conventional mathematical techniques, and also combine different types of
algorithms. The objective of the project was to develop a model which could identify patients with
diabetes who are at high risk of hospital admission. Prediction of the risk of hospital admission is
a fairly complex task. Many factors influence this process and the outcome. There is presently a
serious need for methods that can increase healthcare institutions' understanding of what is
important in predicting hospital admission risk. This project is a small contribution to the
existing methods of diabetes detection, proposing a system that can be used as an
assistive tool in identifying the patients at greater risk of being diabetic. The project achieves this
by analyzing many key factors like the patient's blood glucose level, body mass index, etc., using
various machine learning models and through retrospective analysis of patients' medical records.
The project predicts the onset of diabetes in a person based on the relevant medical details that are
collected using a Web application. When the user enters all the relevant medical data required in
the online Web application, this data is then passed on to the trained model for it to make
predictions on whether the person is diabetic or non-diabetic. The model is developed using
different machine learning algorithms and makes its predictions with an accuracy of 98%, which is
fairly good and reliable. In the future, as-yet-unused classifiers can be explored and applied
to other datasets in a combined model to further improve the accuracy of diabetes prediction.
Early detection creates a good impact on diabetes prediction.
There are many possible future enhancements in the field of early diabetes prediction using
machine learning techniques. As technology advances and more data becomes available, there are
numerous opportunities to improve the accuracy, efficiency, and applicability of diabetes
prediction models. Here are some future enhancements that researchers and practitioners could
consider:
Lifestyle Data: Incorporating data from wearable devices and smartphones to capture real-time
lifestyle information, such as physical activity, sleep patterns, and dietary habits.
Environmental Data: Considering environmental factors such as pollution levels and access to
green spaces, which might influence diabetes risk.
Ensemble Models: Building ensemble models that combine predictions from multiple algorithms
or models to enhance overall accuracy and robustness.
Explainable AI: Developing models that provide interpretable results, allowing clinicians and
patients to understand the reasoning behind predictions, which is crucial for gaining trust and
acceptance in healthcare settings.
Dynamic Models: Developing models that can adapt and evolve with changing patient data,
allowing for dynamic and personalized risk predictions over time.
Fairness: ensuring that models perform equitably, especially concerning underrepresented demographic groups.
Open Data: Promoting the sharing of anonymized healthcare datasets for research purposes,
fostering innovation and accelerating progress in the field.
By focusing on these areas, researchers and developers can significantly enhance the accuracy,
reliability, and usability of early diabetes prediction models, ultimately improving the quality of
care and outcomes for individuals at risk of developing diabetes.
2. Utilizing Advanced Machine Learning Techniques:
Deep Learning: Exploring deep learning algorithms, such as convolutional neural networks (CNNs)
or recurrent neural networks (RNNs), for more complex pattern recognition in high-dimensional
data.
REFERENCES
[1] Kaouthar Driss, Wadii Bolia, "Approach for Diabetes Prediction Based on ML", 2020.
[2] Gaurav Tripathi, Rakesh Kumar, "Early diabetes prediction using ML with performance metrics
evaluation according to confusion metrics", 2020.
[3] Tejas N. Joshi, Pramila M. Chawan, "Diabetes Prediction Using Machine Learning Techniques",
Int. Journal of Engineering Research and Application, Vol. 8, Issue 1 (Part-II), January 2018,
pp. 09-13.
[4] Narendra Mohan, Vinit Jain, "Performance Analysis of Support Vector Machine in Diabetes
Prediction", 2020.
[5] Roxana Mirshahvalad, Nastaran Asadi Zanjani, "Diabetes Prediction Using Ensemble Perceptron
Algorithm", 2017.
[6] Satish Kumar Kalagotla, "A stacking technique for diabetes prediction (comparative analysis of
AdaBoost and stacking)", 2021.
[7] Rao, N.M.; Kannan, K.; Gao, X.Z.; Roy, D.S., "Novel classifiers for intelligent disease
diagnosis with multi-objective parameter evolution", Comput. Electr. Eng., 2018, 67, 483–496.
[8] Ashiquzzaman, A.; Kawsar Tushar, A.; Rashedul Islam, M.D.; Shon, D.; Kichang, L.M.;
Jeong-Ho, P.; Dong-Sun, L.; Jongmyon, K., "Reduction of overfitting in diabetes prediction using
deep learning neural network", Lecture Notes of Electrical Engineering, Singapore, in IT
Convergence and Security, 2017.
[9] Manal Alghamdi, Mouaz Al-Mallah, Sherif Sakr, "Predicting diabetes mellitus using SMOTE and
ensemble machine learning approach", PLOS ONE, 2017.
[10] G. Webb, "Multiboosting: A technique for combining boosting and wagging", Machine
Learning, vol. 40, pp. 159–196, 2000.
[11] Joshi, S.; Borse, M., "Detection and Prediction of Diabetes Mellitus Using Back-Propagation
Neural Network", in Proceedings of the 2016 International Conference on Micro-Electronics and
Telecommunication Engineering (ICMETE), Uttar Pradesh, India, 22–23 September 2016,
pp. 110–113.
[12] Nahla H. Barakat, Andrew P. Bradley (Senior Member, IEEE) and Mohamed Nabil H. Barakat,
"Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus".
[13] Gaganjot Kaur, "Diabetes Research", Department of Computer Science and Diabetes
Federation.
[14] Mirshahvalad, R.; Zanjani, N.A., "Diabetes prediction using ensemble perceptron algorithm",
in Proceedings of the 2017 9th International Conference on Computational Intelligence and
Communication Networks (CICN), Girne, Cyprus, 16–17 September 2017, pp. 190–194.
[15] Tao Zheng, Wei Xie, Liling Xu, Xiaoying He, Ya Zhang, Mingrong You, Gong Yang, You Chen,
"A machine learning-based framework to identify type 2 diabetes through electronic health
records", International Journal of Medical Informatics, Volume 97, 2017, Pages 120-127,
ISSN 1386-5056.
[16] Luis Fregoso-Aparicio, Julieta Noguez, Luis Montesinos and José A. García-García, "Machine
learning and deep learning predictive models for type 2 diabetes: a systematic review", BMC, 2021.
APPENDIX 1
At the core of any data-driven project lies the need for efficient numerical operations and data
manipulation. NumPy, a fundamental Python library, serves as the backbone for handling and
processing numerical data. Its powerful array structures and mathematical functions facilitate the
manipulation of clinical and demographic data, making it an indispensable tool in EARLY
DIABETES prediction projects. NumPy simplifies tasks such as data loading, cleaning, and
transformation, laying the groundwork for subsequent analyses.
While NumPy provides the fundamental building blocks, pandas offers a higher-level interface for
data manipulation and analysis. In EARLY DIABETES prediction projects, datasets can be
complex, containing diverse features and potentially missing values. pandas simplifies data
wrangling by providing data structures like DataFrames that are equipped with versatile methods
for data cleaning, selection, and transformation. With pandas, researchers can easily load
structured datasets, handle missing data, and conduct exploratory data analysis (EDA) to gain
insights into the EARLY DIABETES data.
scikit-learn, often referred to as sklearn, stands as a comprehensive machine learning library that
encompasses an extensive range of tools for model development, evaluation, and deployment. In
the context of EARLY DIABETES prediction, it serves as the primary workhorse for
implementing machine learning algorithms, including logistic regression, decision trees, random
forests, and support vector machines. sklearn offers modules for data preprocessing, model
selection, and performance evaluation, streamlining the end-to-end process of EARLY
DIABETES prediction.
Evaluate the Dataset
Finding Missing Values by Re-Evaluating Columns
To validate the column datatypes and check for missing values, you can use the info() method in
pandas to display information about the DataFrame, including the column datatypes and the number
of non-null values in each column.
The output will display the datatype for each column and the number of non-null values in each
column. If there are missing values in a column, the number of non-null values will be less than
the total number of rows in the DataFrame.
If you want to check the number of missing values in each column, you can use the isnull()
method in pandas to create a Boolean DataFrame that indicates which values are missing, and then
use the sum() method to count the number of missing values in each column.
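A short sketch of both checks described above (pandas assumed, hypothetical diabetes.csv):

# Illustrative sketch: inspect column datatypes and count missing values per column.
import pandas as pd

df = pd.read_csv("diabetes.csv")
df.info()                   # dtypes and non-null counts per column
print(df.isnull().sum())    # number of missing values in each column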
Data Modelling
APPENDIX 2
PLAGIARISM REPORT
PAPER PUBLICATION PROOF