An Optimized Approach For Prediction of Heart Diseases Using Gradient Boosting Classifier
1
Master of Engineering Student, Computer Science and Engineering, Guru Nanak Dev Engineering College, Ludhiana,
INDIA
2
Assistant Professor, Computer Science and Engineering, Ludhiana, INDIA
3
Assistant Professor, Computer Science and Engineering, Ludhiana, INDIA
ABSTRACT
Data mining in the medical domain helps to discover and retrieve valuable patterns and knowledge that may prove
useful in clinical diagnosis. A disease prediction system based on predictive modeling takes the record of a patient as
input and provides the predicted diagnosis as output. In this work, disease prediction is implemented with data mining
techniques, namely the support vector machine (SVM), logistic regression and an optimized gradient boosting
classifier (GBC). Fourteen medical attributes, such as age, sex, blood pressure, cholesterol, fasting blood sugar (FBS)
and thalach (maximum heart rate achieved), are used for prediction. The dataset contains 284 patient records. The
techniques predict the risk level of each person based on age, gender, blood pressure, fasting blood sugar (FBS),
cholesterol, pulse rate, etc. This work examines the different data mining algorithms and compares their results with
and without cross validation using several performance measures, i.e. accuracy, precision, recall and F1-score. The
accuracy of the GBC model is calculated both with and without cross validation, and the variant with the higher
accuracy is used to produce the prediction result.
Keywords: Gradient boosting classifier (GBC), Logistic regression, Optimization techniques, Predictive Modeling.
1. INTRODUCTION
Data mining is the process of sifting through massive volumes of data to find structures and establish relationships
in order to solve problems through data analysis. Data mining tools allow enterprises to predict future trends. In the
medical field, many important and valuable applications depend on data mining. The massive volume of information
gathered from hospitals has no guiding value unless it is converted into meaningful data and information. The patient
data collected from many hospitals is suitable for risk analysis of heart disease and many other diseases. Data mining
also helps to analyze information from many databases, categorize it and summarize the identified relationships. It is
a form of analysis whose aim is to discover comprehensible information from datasets, and it combines various
approaches such as machine learning, database management systems (DBMS), artificial intelligence and statistics.
Data mining plays a significant role because it improves parameters such as accuracy and uncovers unknown causes
that cannot be established manually. Other data mining techniques include sequence or path analysis, classification,
clustering and forecasting. Sequence or path analysis looks for patterns in which one event leads to another.
Classification looks for new patterns and may change the way information is organized. Classification algorithms
predict variables based on other factors within the database [11].
Data mining techniques are helpful in several fields of analysis, including mathematics, information processing,
biological science and marketing. At present, most patients complain about the tests conducted by hospitals for
diagnosis, which cost them both money and time. Sometimes, after numerous tests have been conducted, the
patient's results are negative, and the loss of money and time is caused by the doctor's wrong intuition and
inexperience.
2. LITERATURE SURVEY
Poornima Singh et al. [8] present an effective heart disease prediction system (EHDPS) built using a neural network
to predict the risk level of heart disease. The system uses 15 medical parameters, such as age, sex, blood pressure,
cholesterol and obesity, for prediction. The EHDPS predicts the likelihood of a patient getting heart disease. It
enables significant knowledge, e.g. relationships between medical factors associated with heart disease and patterns,
to be established. The authors employ a multilayer perceptron neural network with backpropagation as the training
algorithm. The results obtained show that the diagnostic system can effectively predict heart disease.
Rania Salah El-Sayed [3] studied an intelligent system that diagnoses heart disease and classifies its severity. The
method uses attribute filtering based on a genetic algorithm, which is known to be a very adaptive and efficient
feature selection method, to reduce the number of attributes. This indirectly reduces the number of diagnostic tests
that a patient is required to take. Classification techniques such as support vector machines, naive Bayes, nearest
neighbor and linear discriminant analysis are applied in that paper to assess their classification accuracy in the
prediction of heart disease. The proposed system is applied to the Cleveland heart disease database, and the results
are compared with those of other techniques that use the same data.
R. Bhuvaneeswari et al. [5] note that heart disease (HD) contributes to the increasing death rate all over the world.
Data classification is a crucial task in the medical field that assists physicians in predicting diseases. Recently,
machine learning (ML) algorithms have been employed to classify data in the medical field. The complexity and
quantity of the data must be examined and managed to achieve efficient and accurate HD diagnosis. The paper
presents a classifier model based on the gradient boosting tree (GBT) to predict the disease efficiently. A
comprehensive set of experiments was carried out using the Statlog and Cleveland heart disease datasets. The
experimental values confirmed the superiority of the GBT classifier on several performance measures.
Anagha Sridhar and Anagha S. Kapardhi [4] observe that machine learning is used across many fields around the
world, and the healthcare industry is no exception. Machine learning can play an essential role in predicting the
presence or absence of disorders such as heart disease. Such information, if predicted well in advance, can provide
important insights to doctors, who can then adapt their diagnosis and treatment on a per-patient basis. The paper
describes a project on predicting potential heart disease in people using machine learning algorithms. The Naïve
Bayes and decision tree classifier algorithms are used, and the dataset is taken from Kaggle.
Dr. S. Anitha and Dr. N. Sridevi [1] note that heart disease has received a great deal of attention in medical research
because of its impact on human health, and that it is among the leading causes of death. Data mining has developed
into a significant approach for computing applications in the medical domain. Numerous data mining algorithms
have significantly helped to understand medical data more clearly. In that work, supervised machine learning
algorithms, namely SVM, KNN and Naïve Bayes, are used to predict heart disease. The machine learning algorithms
are implemented in the R programming language. The performance of the algorithms is measured in terms of
accuracy, their practicality is examined and the outcomes are discussed.
Reddy Prasad et al. [2] discuss the prediction of heart disease using machine learning techniques by summarizing a
few recent studies. In that paper, the logistic regression algorithm is applied to healthcare data to classify patients
as having heart disease or not according to the information in their records. The authors also use this data to build a
model that predicts whether or not a patient has heart disease.
Senthilkumar Mohan et al. [6] propose a model that applies machine learning techniques to improve the accuracy of
heart disease prediction. Various data mining techniques help to determine the severity of heart disease. The severity
of the disease is classified using methods such as the K-Nearest Neighbor algorithm (KNN), decision trees (DT),
logistic regression, Naive Bayes (NB) and the support vector machine (SVM). The prediction model is introduced
with different combinations of features and several known classification techniques. The authors report an accuracy
of 88.7% with the hybrid random forest with a linear model (HRFLM) for heart disease prediction.
3. METHODOLOGY
In this work, the performance of three techniques, namely the support vector machine (SVM), logistic regression
and the gradient boosting classifier (GBC), is analyzed for the prediction of heart disease. The analysis found that
the GBC is more accurate than the other algorithms. Accordingly, the disease predictor also uses various algorithms
for the prediction of different diseases.
SVM (SUPPORT VECTOR MACHINE):
A support vector machine separates the data into two classes by finding the hyperplane with the maximum margin
between them. SVM algorithms use a kernel that transforms the input data into the required form. Support vector
machines can be used for classification, regression and outlier detection. Training is posed as a quadratic
optimization problem, and the method is easily extensible. It also helps to reduce classification errors.
Linear kernel: The linear kernel uses the dot product between any two observations.
Polynomial kernel: It is a more general form of the linear kernel and can distinguish curved and nonlinear input
spaces.
Radial basis function (RBF) kernel: It maps the input space into an infinite-dimensional space. In the radial basis
function, gamma ranges from zero to one. The gamma value most commonly used in support vector machine
classification is 0.1.
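The three kernels can be written down directly. The following is a minimal sketch using only the Python standard library; it is not the implementation used in this work. The function names, the polynomial parameters (degree, coef0) and the two example vectors are assumptions chosen for illustration, while gamma = 0.1 is the value mentioned above.

```python
import math

def linear_kernel(x, z):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """(x . z + coef0) ** degree; degree and coef0 are assumed defaults."""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.1):
    """exp(-gamma * ||x - z||^2); gamma = 0.1 as mentioned in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two hypothetical, already scaled feature vectors (not taken from the dataset).
a = [1.0, 2.0]
b = [2.0, 0.0]
print(linear_kernel(a, b))      # 2.0
print(polynomial_kernel(a, b))  # 27.0
print(rbf_kernel(a, b))         # exp(-0.5), about 0.6065
```

The RBF kernel equals 1 for identical vectors and decays towards 0 as the vectors move apart; a larger gamma makes this decay faster.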
LOGISTIC REGRESSION:
Logistic regression uses a sigmoid function to predict values. It is a discriminative classification technique that
works on real-valued input vectors. The dimensions of the input vector to be classified are known as features or
predictors. Logistic regression can also be applied to multi-class classification. It is a way of fitting the best line to
the attributes present, and every attribute in the dataset can be used to predict the target attribute. Logistic
regression is an alternative to the simpler linear regression. It can also discard improper data and predict the
probability of a specific class. In this work, logistic regression is used to predict whether or not an individual is
diagnosed with the disease. The ROC (receiver operating characteristic) curve is used with logistic regression. The
ROC curve represents the information in graphical form for each parameter and supports the diagnosis of the
disease [10].
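How the sigmoid turns a weighted sum of features into a class probability can be sketched as follows. This is an illustration only: the weights, bias, feature values and the 0.5 decision threshold are hypothetical and are not fitted to the dataset used in this work.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1) in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one patient record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Class label: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights and one hypothetical, already scaled record.
weights, bias = [0.8, -0.5], 0.1
record = [1.0, 2.0]
print(predict_proba(record, weights, bias))  # slightly below 0.5
print(predict(record, weights, bias))        # 0
```

Lowering the threshold trades precision for recall, which is the trade-off that an ROC curve visualizes.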
GRADIENT BOOSTING CLASSIFIER:
The gradient boosting classifier is a type of boosting method in machine learning. It relies on the intuition that the
best possible next model, when combined with the previous models, minimizes the overall prediction error. The key
idea is to set the target outcomes for this next model so as to minimize the error. The target outcome for each case in
the data depends on how much changing that case's prediction affects the overall prediction error:
If a small change in the prediction for a case causes a large drop in error, then the next target outcome of
the case is a high value. Predictions from the new model that are close to its targets will reduce the error.
If a small change in the prediction for a case causes no change in error, then the next target outcome of the
case is zero. Changing this prediction does not decrease the error.
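The target-setting idea above can be sketched for binary classification with the log loss, where the target for each case is the negative gradient y - p (true label minus predicted probability). The code below is a simplified illustration, not the optimized GBC used in this work: it assumes a single numeric feature with at least two distinct values, uses one-threshold regression stumps as the weak learners, and the number of rounds, learning rate and toy data are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pseudo_residuals(y, scores):
    """Targets for the next model: negative gradient of the log loss, y - p."""
    return [yi - sigmoid(s) for yi, s in zip(y, scores)]

def fit_stump(x, targets):
    """Least-squares regression stump with a single threshold on one feature."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, targets) if xi <= t]
        right = [r for xi, r in zip(x, targets) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current pseudo-residuals."""
    scores = [0.0] * len(x)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, pseudo_residuals(y, scores))
        stumps.append(stump)
        scores = [s + learning_rate * stump(xi) for s, xi in zip(scores, x)]
    return lambda xi: sigmoid(sum(learning_rate * st(xi) for st in stumps))

# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
model = gradient_boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(model(2), model(5))  # a probability below 0.5, then one above 0.5
```

A case whose prediction is already close to its label gets a target near zero, while a badly predicted case gets a target of large magnitude, which matches the two rules stated above.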
The working of the system is defined as follows:
Various performance parameters, such as accuracy, precision and recall, are used to evaluate the classification of
heart disease. These parameters help to measure the prediction results and are defined in terms of the following
quantities.
TP (True Positive): a case that the model predicts as positive and that is actually positive.
TN (True Negative): a case that the model predicts as negative and that is actually negative.
FP (False Positive): a case that the model predicts as positive but that is actually negative.
FN (False Negative): a case that the model predicts as negative but that is actually positive.
The three parameters used in this work are as follows:
Accuracy: the proportion of the sum of true positives and true negatives to the total number of samples in the dataset.
Accuracy = (True Positive + True Negative) / Total (1)
Precision: the proportion of true positives to the sum of true positives and false positives.
Precision = True Positive / (True Positive + False Positive) (2)
Recall: the proportion of true positives to the sum of true positives and false negatives.
Recall = True Positive / (True Positive + False Negative) (3)
                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN
A confusion matrix is generated to measure the performance metrics for each technique, and it is represented in
two-dimensional form.
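Equations (1) to (3) can be computed directly from the four cells of the confusion matrix. The sketch below also includes the F1-score, which is reported in the results but not defined above; the standard definition, the harmonic mean of precision and recall, is assumed. The counts in the example are made up for illustration and are not results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these example counts the accuracy is 0.85, the precision about 0.889, the recall 0.8 and the F1-score about 0.842.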
Figure 3: Visualization
Figure 3 above shows the distribution of the target value for heart disease, in which the value 0 counts the patients
diagnosed with heart disease and the value 1 counts the patients not diagnosed with heart disease. Figure 4 compares
the three algorithms on the performance metrics and shows the accuracy, precision, recall and F1-score of each
algorithm in a bar chart.
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: [email protected]
Volume 9, Issue 7, July 2020 ISSN 2319 - 4847
Figure 5 shows the performance of the GBC algorithm with cross validation, plotting the accuracy, precision and
recall in a bar chart. The gradient boosting classifier gives good results with cross validation.
Table 3 shows the values of accuracy, precision, recall and F1-score for the GBC algorithm using cross validation.
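The cross validation procedure itself can be sketched as follows. The number of folds is not stated in this work, so k = 5 is an assumption, as are the function names; the folds here are contiguous and unshuffled for simplicity, whereas in practice the records are usually shuffled or stratified first. The sample count of 284 is the dataset size given in the abstract.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(fit, X, y, k=5):
    """Mean test accuracy over k folds; fit(X, y) must return a predict function."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Fold sizes for a dataset of 284 records and an assumed k of 5.
folds = list(k_fold_indices(284, 5))
print([len(test) for _, test in folds])  # [57, 57, 57, 57, 56]
```

Every record is used for testing exactly once, so the averaged score depends less on one particular train/test split than a single hold-out evaluation does.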
5. CONCLUSION
This work aims to predict heart disease using data mining algorithms. It is difficult to predict the disease manually
from risk factors and attributes. Data mining techniques are helpful for predicting the outcome from an existing
dataset. Three algorithms, namely the gradient boosting classifier, the support vector machine and logistic
regression, are used, and a different level of accuracy is obtained for each technique. The algorithms are compared
on accuracy, the other performance parameters and the speed of the model. The gradient boosting classifier
obtained the best results for the prediction of heart disease.
References
[1] S. Anitha and N. Sridevi, “Heart disease prediction using data mining techniques,” Journal of Analysis and
Computation (JAC), vol. 8, February 2019.
[2] S. N. Reddy Prasad, Pidaparthi Anjali, “Heart disease prediction using logistic regression algorithm using machine
learning,” International Journal of Engineering and Advanced Technology (IJEAT), Volume-8, Issue-3S, February
2019.
[3] R. S. El-Sayed, “Linear discriminant analysis for an efficient diagnosis of heart disease via attribute filtering based
on genetic algorithm,” Journal of Computers, vol. 13, no. 11, pp. 1290-1299, 2018.
[4] A. S. K. Anagha Sridhar, “Predicting heart disease using machine learning algorithm,” International Research
Journal of Engineering and Technology (IRJET), Volume: 06 Issue: 04 Apr 2019.
[5] G. P. R. Bhuvaneeswari, P. Sudhakar, “Heart disease prediction model based on gradient boosting tree (GBT)
classification algorithm,” International Journal of Recent Technology and Engineering (IJRTE), Volume-8, Issue-
11, September 2019.
[6] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid machine learning
techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[7] S. A. Chaitrali Dangare, “A data mining approach for prediction of heart disease using neural networks,”
International Journal of Computer Engineering and Technology (IJCET), Volume 3, Issue 3, October-December
2012.
[8] P. S. S. S. G. S. Pandit Jain, “Effective heart disease prediction system using data mining techniques,” International
Journal of Nanomedicine, 06-Dec-2019.
[9] J. Thomas and R. T. Princy, “Human heart disease prediction system using data mining techniques,” in 2016
International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2016.
[10] S. Manikandan, “Heart attack prediction system,” in 2017 International Conference on Energy, Communication,
Data Analytics and Soft Computing (ICECDS), pp. 817–820, Aug 2017.
[11] M. Abdar, S. Rostam Niakan Kalhori, T. Sutikno, I. Subroto, and G. Arji, “Comparing performance of data
mining algorithms in prediction heart diseases,” International Journal of Electrical and Computer Engineering
(IJECE), vol. 5, December 2015.