Paper 4
3 authors, including:
Pramila M. Chawan
Veermata Jijabai Technological Institute
All content following this page was uploaded by Pramila M. Chawan on 16 July 2018.
Abstract: Diabetes mellitus, or simply diabetes, is a disease caused by an
increased level of blood glucose. Diabetes is a chronic disease with the
potential to cause a worldwide health care crisis. According to the
International Diabetes Federation, 382 million people are living with
diabetes across the world. By 2035, this number is projected to rise to
592 million. Various traditional methods, based on physical and chemical
tests, are available for diagnosing diabetes. However, early prediction of
diabetes is quite a challenging task for medical practitioners due to the
complex interdependence of various factors, as diabetes affects human organs
such as the kidneys, eyes, heart, nerves and feet. Data science methods have
the potential to benefit other scientific fields by shedding new light on
common questions. One such task is to help make predictions on medical data.
Machine learning is an emerging scientific field in data science dealing
with the ways in which machines learn from experience. The aim of this
project is to develop a system which can perform early prediction of
diabetes for a patient with higher accuracy by combining the results of
different machine learning techniques. This project aims to predict diabetes
via three different supervised machine learning methods, including SVM and
logistic regression. This project also aims to propose an effective
technique for earlier detection of diabetes.

Index Terms: Diabetes, Machine Learning, Supervised, SVM, Logistic Regression.

I. INTRODUCTION
Diabetes Mellitus
Diabetes is one of the deadliest diseases in the world. It is not only a
disease in itself but also a cause of other conditions such as heart attack,
blindness and kidney disease. In the usual diagnostic process, patients must
visit a diagnostic center, consult their doctor, and wait a day or more for
their reports. Moreover, they have to pay each time they want a diagnosis
report.
Diabetes Mellitus (DM) is defined as a group of metabolic disorders mainly
caused by abnormal insulin secretion and/or action. Insulin deficiency
results in elevated blood glucose levels (hyperglycemia) and impaired
metabolism of carbohydrates, fat and proteins. DM is one of the most common
endocrine disorders, affecting more than 200 million people worldwide. The
onset of diabetes is estimated to rise dramatically in the upcoming years.
DM can be divided into several distinct types. However, there are two major
clinical types, type 1 diabetes (T1D) and type 2 diabetes (T2D), according
to the etiopathology of the disorder. T2D appears to be the most common form
of diabetes (90% of all diabetic patients), mainly characterized by insulin
resistance. The main causes of T2D include lifestyle, physical activity,
dietary habits and heredity, whereas T1D is thought to be due to
autoimmunological destruction of the Langerhans islets hosting pancreatic
β-cells. T1D affects almost 10% of all diabetic patients worldwide, with 10%
of them ultimately developing idiopathic diabetes. Other forms of DM,
classified on the basis of insulin secretion profile and/or onset, include
Gestational Diabetes, endocrinopathies, MODY (Maturity Onset Diabetes of the
Young), neonatal, mitochondrial, and pregnancy diabetes. The symptoms of DM
include polyuria, polydipsia, and significant weight loss, among others.
Diagnosis depends on blood glucose levels (fasting plasma glucose
≥ 7.0 mmol/L).

Machine Learning
Machine learning is the scientific field dealing with the ways in which
machines learn from experience. For many scientists, the term “machine
learning” is identical to the term “artificial intelligence”, given that the
possibility of learning is the main characteristic of an entity called
intelligent in the broadest sense of the word. The purpose of machine
learning is the construction of computer systems that can adapt and learn
from their experience. A more detailed and formal definition of machine
learning is given by Mitchell: A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure
P, if its performance at tasks in T, as measured by P, improves with
experience E.
With the rise of machine learning approaches, we have the ability to find a
solution to this issue. We have developed a system using data mining which
can predict whether a patient has diabetes or not. Furthermore, predicting
the disease early allows patients to be treated before it becomes critical.
Data mining can extract hidden knowledge from a huge amount of
diabetes-related data. Because of that, it has a significant role in
diabetes research, now more than ever. The aim of this research is to
develop a system which can predict the diabetic risk level of a patient
with higher accuracy. This research has focused on developing a system based
on three classification methods, namely Support Vector Machine, Logistic
Regression and Artificial Neural Network algorithms.

Supervised Learning
In supervised learning, the system must “learn” inductively a function,
called the target function, which is an expression of a model describing the
data. The objective function is used to
predict the value of a variable, called the dependent variable or output
variable, from a set of variables, called independent variables, input
variables, characteristics or features. The possible input values of the
function, i.e. the elements of its domain, are called instances. Each case
is described by a set of characteristics (attributes or features). A subset
of all cases, for which the output variable value is known, is called
training data or examples. In order to infer the best target function, the
learning system, given a training set, takes into consideration alternative
functions, called hypotheses and denoted by h. In supervised learning, there
are two kinds of learning tasks: classification and regression.
Classification models try to predict distinct classes, such as blood groups,
while regression models predict numerical values. Some of the most common
techniques are Decision Trees (DT), Rule Learning, Instance Based Learning
(IBL) such as k-Nearest Neighbours (k-NN), Genetic Algorithms (GA),
Artificial Neural Networks (ANN), and Support Vector Machines (SVM).

Unsupervised Learning
In unsupervised learning, the system tries to discover the hidden structure
of data or associations between variables. In that case, training data
consists of instances without any corresponding labels. Association Rule
Mining appeared much later than machine learning and is subject to greater
influence from the research area of databases. Cluster analysis or
clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense
or another) to each other than to those in other groups (clusters). It is a
main task of exploratory data mining, and a common technique for statistical
data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics.

Reinforcement Learning
Reinforcement Learning is a general term given to a family of techniques in
which the system attempts to learn through direct interaction with the
environment so as to maximize some notion of cumulative reward. It is
important to mention that the system has no prior knowledge about the
behaviour of the environment, and the only way to find out is through trial
and error. Reinforcement learning is mainly applied to autonomous systems,
due to its independence in relation to its environment.

II. PROBLEM STATEMENT
Diabetes is a chronic disease with the potential to cause a worldwide health
care crisis. According to the International Diabetes Federation, 382 million
people are living with diabetes across the world. By 2035, this number is
projected to rise to 592 million. Diabetes mellitus, or simply diabetes, is
a disease caused by an increased level of blood glucose. Various traditional
methods, based on physical and chemical tests, are available for diagnosing
diabetes. However, early prediction of diabetes is quite a challenging task
for medical practitioners due to the complex interdependence of various
factors, as diabetes affects human organs such as the kidneys, eyes, heart,
nerves and feet.
Data science methods have the potential to benefit other scientific fields
by shedding new light on common questions. One such task is to help make
predictions on medical data. Machine learning is an emerging scientific
field in data science dealing with the ways in which machines learn from
experience. The aim of this project is to develop a system which can perform
early prediction of diabetes for a patient with higher accuracy using
different machine learning techniques. This project aims to predict diabetes
via three different supervised machine learning methods, including SVM and
logistic regression. This project also aims to propose an effective
technique for earlier detection of diabetes.

III. METHODOLOGY
3.1 System Flow
The methodology consists of 6 phases, as shown in Figure 1: Data Extraction,
Data Pre-processing, SVM based processing, Logistic Regression based
processing, Post processing, and Analysis of Results.

3.2 Algorithms
Classification is one of the most important decision making techniques in
many real world problems. In this work, the main objective is to classify
the data as diabetic or non-diabetic and improve the classification
accuracy. For many classification problems, choosing a larger number of
samples does not necessarily lead to higher classification accuracy. In many
cases, an algorithm performs well in terms of speed but its classification
accuracy is low. The main objective of our model is to achieve high
accuracy. Classification accuracy can be increased if most of the data set
is used for training and a small portion is used for testing. This survey
has analyzed various classification techniques for classification of
diabetic and non-diabetic data. Thus, it is
observed that techniques like Support Vector Machine, Logistic Regression,
and Artificial Neural Network are most suitable for implementing the
diabetes prediction system.

3.2.1 Support Vector Machine
The Support Vector Machine (SVM) was first proposed by Vapnik. SVMs are a
set of related supervised learning methods widely used in medical diagnosis
for classification and regression. An SVM simultaneously minimizes the
empirical classification error and maximizes the geometric margin, so SVMs
are also called maximum margin classifiers. SVM is a general algorithm based
on the guaranteed risk bounds of statistical learning theory, the so-called
structural risk minimization principle.
SVMs can efficiently perform nonlinear classification using what is called
the kernel trick, implicitly mapping their inputs into high-dimensional
feature spaces. The kernel trick allows constructing the classifier without
explicitly knowing the feature space.

3.2.2 Logistic Regression
In statistics, logistic regression is a regression model where the dependent
variable is categorical, specifically a binary dependent variable, that is,
one that can take only two values, "0" and "1", which represent outcomes
such as pass/fail, win/lose, alive/dead or healthy/sick. Logistic regression
is used in various fields, including machine learning, most medical fields,
and social sciences. For example, the Trauma and Injury Severity Score
(TRISS), which is widely used to predict mortality in injured patients, was
originally developed using logistic regression. Many other medical scales
used to assess the severity of a patient's condition have been developed
using logistic regression.
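
The two classifiers described in Section 3.2 can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, trains a linear SVM (no kernel) by sub-gradient descent on the hinge loss and a logistic regression by gradient descent on the log-loss, and evaluates both on a held-out split. The data are synthetic two-feature points generated for illustration, not a diabetes dataset, and the function names, the 80/20 split and all hyperparameters are assumptions.

```python
import math
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.1, epochs=300):
    """Batch gradient descent on the mean log-loss; labels are 0/1."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errs = [sigmoid(dot(w, xi) + b) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Batch sub-gradient descent on the L2-regularised hinge loss."""
    t = [1 if yi == 1 else -1 for yi in y]  # SVM convention: labels are -1/+1
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        # Only points with t * score < 1 contribute to the hinge sub-gradient.
        active = [(xi, ti) for xi, ti in zip(X, t) if ti * (dot(w, xi) + b) < 1]
        w = [wj - lr * (lam * wj - sum(ti * xi[j] for xi, ti in active) / n)
             for j, wj in enumerate(w)]
        b += lr * sum(ti for _, ti in active) / n
    return w, b

def predict(w, b, x):
    # Both models are linear: class 1 when the score w.x + b is positive
    # (for logistic regression this equals a predicted probability > 0.5).
    return 1 if dot(w, x) + b > 0 else 0

def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)

# Synthetic, already-standardised data (assumption): class 1 ("diabetic")
# is centred at (1.5, 1.0) and class 0 ("non-diabetic") at (-1.5, -1.0).
random.seed(0)

def make_patient(label):
    centre = (1.5, 1.0) if label == 1 else (-1.5, -1.0)
    return [random.gauss(c, 0.6) for c in centre], label

data = [make_patient(i % 2) for i in range(200)]
random.shuffle(data)
X = [x for x, _ in data]
y = [label for _, label in data]

split = int(0.8 * len(data))  # most of the data for training, the rest for testing
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

w_lr, b_lr = train_logistic(X_train, y_train)
w_svm, b_svm = train_linear_svm(X_train, y_train)
acc_lr = accuracy(w_lr, b_lr, X_test, y_test)
acc_svm = accuracy(w_svm, b_svm, X_test, y_test)
print(f"logistic regression accuracy: {acc_lr:.2f}")
print(f"linear SVM accuracy:          {acc_svm:.2f}")
```

On real patient data the features would first need the pre-processing phase of Section 3.1, and a kernel SVM from an established library would normally replace the linear sub-gradient version sketched here.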
IV. CONCLUSION
Machine learning has great potential to revolutionize diabetes risk
prediction with the help of advanced computational methods and the
availability of large epidemiological and genetic diabetes risk datasets.
Detection of diabetes in its early stages is the key to treatment. This
work has described a machine learning approach to predicting diabetes. The
technique may also help researchers develop an accurate and effective tool
that reaches clinicians and helps them make better decisions about disease
status.