Prediction of Heart Disease Using Machine Learning Algorithms
https://doi.org/10.22214/ijraset.2022.40768
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue III Mar 2022- Available at www.ijraset.com
Abstract: The heart is a vital organ in living organisms. Diagnosing and predicting heart disease demands high precision
and accuracy, because even a minor error can result in fatigue or death. Heart-related deaths are numerous, and their
number is growing rapidly. The scope of this study is restricted to discovering associations in CHD data using three
supervised learning techniques: Logistic Regression, K-Nearest Neighbour, and Random Forest, in order to improve the
prediction rate. To that end, this paper conducts a comparative analysis of the results of these machine learning
algorithms. The experimental results show that the Logistic Regression algorithm achieved the highest accuracy, 89%,
compared to the other ML algorithms implemented.
Keywords: Machine Learning, Logistic Regression, K-Nearest Neighbour, Random Forest, Python, Heart Disease, Prediction
model, Healthcare
I. INTRODUCTION
Heart disease has risen to become one of the leading causes of death all over the world. According to the World Health
Organization, cardiac illnesses claim the lives of 17.7 million people each year, accounting for 31% of all fatalities worldwide.
Heart disease has become the top cause of death in India as well. It is therefore essential to be able to forecast heart-related
disorders reliably and precisely. Medical institutions all over the world compile data on various health-related concerns.
A variety of machine learning techniques can extract significant information from these data. However, the amount of data
collected is enormous, and it is frequently noisy.
II. PROBLEM-STATEMENT
We analyze various machine learning algorithms to find the one that best predicts the presence or absence of heart disease. The
target is a binary class label: 0 indicates the absence of heart disease and 1 indicates the presence of heart disease.
III. PROPOSED METHOD
We use several machine learning algorithms to predict the target from a number of features describing a person. The dependent
variable is whether or not a patient has heart disease, while the independent variables are the patient's medical characteristics.
The machine learning algorithms used for our model are Logistic Regression, K-Nearest Neighbours, and Random Forest. We compare
the scores of these models after splitting our data into training and testing sets in an approximate 80:20 ratio. We also tune
the hyperparameters of each model to yield the best results, and finally identify the best prediction model for our heart
disease dataset.
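The 80:20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code; the function name `train_test_split_indices`, the seed, and the row count of 303 are assumptions made for the example.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 303 rows is an assumed dataset size, used only for illustration.
train_idx, test_idx = train_test_split_indices(303, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 242 61
```

Fixing the seed keeps the split reproducible, so that all three models are scored on the same held-out rows.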
Flow Diagram
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 1275
2) K-nearest Neighbours: This is a supervised machine learning algorithm. The idea behind nearest neighbour methods is to
find a predetermined number of training samples that are closest in distance to the new point and use them to predict its label.
It makes no assumptions about the data and is typically used for classification tasks where little to no prior knowledge of the
data distribution is available. The algorithm finds the k data points in the training set that are closest to the data point whose
target value is unknown, and assigns to it the majority class of those neighbours (or, for regression, their average value).
Fig 4-KNN
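The nearest-neighbour idea can be sketched in a few lines. The example below is an illustrative sketch rather than the implementation used in this paper: it assumes Euclidean distance and majority voting, and the function name `knn_predict` and the toy feature values are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest neighbours."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two made-up features per patient; label 1 = disease present.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # 0
print(knn_predict(train_X, train_y, (2.9, 3.1), k=3))  # 1
```

Because the prediction depends on raw distances, features measured on very different scales would normally be standardised first; that step is omitted here.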
3) Random Forest: Random forest is a supervised machine learning algorithm that can be used to solve problems in both
classification and regression. It builds many decision trees from random samples of the data, obtains a prediction from each
tree, and selects the final answer by voting.
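The sample-then-vote idea can be illustrated with a deliberately simplified sketch. As an assumption made for brevity, each "tree" below is a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, whereas a real random forest grows deeper trees; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Pick the threshold on one feature that minimises training errors."""
    values = sorted({row[feature] for row in X})
    # Candidate thresholds are midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for threshold in thresholds:
        for left_label in (0, 1):
            errors = sum(
                (left_label if row[feature] <= threshold else 1 - left_label) != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    return feature, best[1], best[2]

def predict_stump(stump, row):
    feature, threshold, left_label = stump
    return left_label if row[feature] <= threshold else 1 - left_label

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw as many rows as the dataset has, with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        feature = rng.randrange(len(X[0]))  # one random feature per stump
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data: two made-up features per patient; label 1 = disease present.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, seed=0)
print(predict_forest(forest, (1.0, 1.0)), predict_forest(forest, (3.0, 3.0)))
```

An individual stump trained on an unlucky bootstrap sample can be wrong; the vote across many stumps is what makes the ensemble stable.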
The best parameters found for logistic regression are {'solver': 'liblinear', 'C': 0.23357214690901212} with an accuracy score of
0.8852459016393442
The best parameters found for random forest are {'n_estimators': 210, 'min_samples_split': 4, 'min_samples_leaf': 19, 'max_depth': 3}
with an accuracy score of 0.8688524590163934
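The text does not show how these parameters were searched for. One common approach is a randomized search, sketched below under stated assumptions: the names `random_search`, `param_space`, and `toy_score` are hypothetical, the log-uniform range for `C` is an arbitrary choice, and `toy_score` is only a stand-in for the cross-validated accuracy a real search would compute.

```python
import math
import random

def random_search(param_space, score_fn, n_iter=20, seed=0):
    """Sample n_iter random configurations and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_iter):
        # Draw one value for every hyperparameter from its sampler.
        params = {name: sampler(rng) for name, sampler in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Assumed search space: C drawn log-uniformly from [1e-4, 1e4].
param_space = {
    "C": lambda rng: 10 ** rng.uniform(-4, 4),
    "solver": lambda rng: rng.choice(["liblinear"]),
}

def toy_score(params):
    """Stand-in for validation accuracy; peaks when log10(C) is -0.5."""
    return -abs(math.log10(params["C"]) + 0.5)

best_params, best_score = random_search(param_space, toy_score, n_iter=50, seed=1)
print(best_params, best_score)
```

In a real experiment `score_fn` would train the model with the sampled parameters and return its validation accuracy.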
In our base research (Paper 1) we found that the machine learning algorithms used were KNN, SVM, and Decision Tree, and the highest
accuracy achieved was 87%. The hyperparameters were also not tuned. In our research paper we worked with Random Forest (an
ensemble learning algorithm), Logistic Regression, and KNN. After tuning the hyperparameters, we found that the highest accuracy
is achieved by Logistic Regression, with an accuracy rate of 89%.
XI. RESULTS
After tuning the hyperparameters for KNN, Logistic Regression, and Random Forest and selecting the best ones, we found the
following results for accuracy:
KNN: 0.6885245901639344
Logistic Regression: 0.8852459016393442
Random Forest: 0.8360655737704918
Among these, we can see that Logistic Regression with its tuned set of hyperparameters performs the best.
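As a quick arithmetic check, the reported accuracies can be converted back into fractions of correctly classified test samples. The short sketch below uses Python's `fractions` module; the denominator bound of 100 is an assumption chosen only to keep the fractions small.

```python
from fractions import Fraction

reported = {
    "KNN": 0.6885245901639344,
    "Logistic Regression": 0.8852459016393442,
    "Random Forest": 0.8360655737704918,
}

for name, accuracy in reported.items():
    # Closest fraction whose denominator does not exceed 100.
    frac = Fraction(accuracy).limit_denominator(100)
    print(f"{name}: {frac.numerator}/{frac.denominator} = {accuracy:.1%}")
# KNN: 42/61 = 68.9%
# Logistic Regression: 54/61 = 88.5%
# Random Forest: 51/61 = 83.6%
```

All three values are consistent with a test set of 61 samples, and 54/61 is the figure that rounds to the 89% quoted for Logistic Regression.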
Now we will find the other metrics for the logistic regression model.
A. ROC Curve
The metric compares the true positive rate with the false positive rate.
The True Positive Rate (TPR) and the False Positive Rate (FPR) are defined as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
It also provides us with the AUC score, which denotes the area under the ROC curve.
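How an ROC curve and its AUC are obtained from predicted scores can be sketched as below. This is an illustrative standard-library sketch, not the code used for the results in this paper; it assumes all scores are distinct, and the function names and the four-sample example are hypothetical.

```python
def roc_curve_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one sample at a time, highest score first.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # FPR = FP / (FP + TN), TPR = TP / (TP + FN)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1
points = roc_curve_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

An AUC of 1.0 corresponds to a perfect ranking of positives above negatives, while 0.5 corresponds to random guessing.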
B. Confusion Matrix
A confusion matrix is a table that describes the output of a classification model by comparing the predicted values with the
true values. It is divided into four parts, each of which is defined as follows:
1) True positives (TP): We predicted yes (they have the disease), and they do.
2) True negatives (TN): We predicted that they would not have the disease, and they don't.
3) False positives (FP): We predicted that they would have the disease, but they don't. (This is often referred to as a "Type I error.")
4) False negatives (FN): We predicted that they would not have the disease, but they do. (This is often referred to as a "Type II
error.")
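The four cells can be counted directly from paired true and predicted labels. The sketch below is a minimal illustration; the function name `confusion_counts` and the eight example labels are made up for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(t == 1 and p == 1 for t, p in pairs),
        "TN": sum(t == 0 and p == 0 for t, p in pairs),
        "FP": sum(t == 0 and p == 1 for t, p in pairs),
        "FN": sum(t == 1 and p == 0 for t, p in pairs),
    }

# Hypothetical labels for eight patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

In a medical setting a false negative (a missed case of disease) is usually the costlier error, which is why the cells are worth inspecting separately rather than relying on accuracy alone.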
C. Classification Report
The classification report is used to assess the quality of predictions from a classification algorithm. It shows how many
predictions are correct and how many are wrong. More specifically, it uses the counts of True Negatives, False Negatives,
True Positives, and False Positives to compute the metrics of a classifier.
The main metrics found by the classification report are accuracy, precision, recall, and F1-score.
Accuracy - The fraction of all predictions that are correct, expressed in decimal form.
Precision - The classifier's ability to avoid labelling a negative occurrence as positive.
Recall - The percentage of actual positives that were successfully classified.
F1 score - The harmonic mean of precision and recall, with 1.0 being the best and 0.0 being the poorest.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Support - The number of samples of each class used to calculate each metric.
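These metrics follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic for the positive class; the function name and the counts are hypothetical, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are real
    recall = tp / (tp + fn)     # of the real positives, how many were found
    f1 = 2 * (recall * precision) / (recall + precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = classification_metrics(tp=8, tn=5, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8125
# precision: 0.8000
# recall: 0.8889
# f1: 0.8421
```

The example shows that precision and recall can differ noticeably even when accuracy looks reasonable, and that F1 lies between the two.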
XII. CONCLUSION
With the rising number of deaths due to heart disease, it is becoming increasingly important to build a system that can effectively
and accurately forecast heart disease. The motivation for the study was to find the most effective ML algorithm for the detection of
heart disease. This study compares the accuracy scores of KNN, Logistic Regression, and Random Forest for predicting heart disease
using a dataset from the UCI machine learning repository. The results indicate that Logistic Regression is the most accurate of the
three algorithms, with an accuracy score of 89% for the prediction of heart disease. The accuracy of machine learning algorithms
depends on the dataset that is used for training and testing.
REFERENCES
[1] Singh, A., & Kumar, R. (2020, February). Heart disease prediction using machine learning algorithms. In 2020 international conference on electrical and
electronics engineering (ICE3) (pp. 452-457). IEEE.
[2] Patel, J., TejalUpadhyay, D., & Patel, S. (2015). Heart disease prediction using machine learning and data mining technique. Heart Disease, 7(1), 129-137.
[3] Rajesh, N., T, M., Hafeez, S., & Krishna, H. (2018). Prediction of Heart Disease Using Machine Learning Algorithms. International Journal of Engineering &
Technology, 7(2.32), 363-366. doi:http://dx.doi.org/10.14419/ijet.v7i2.32.15714
[4] Ramalingam, V. V., Dandapath, A., & Raja, M. K. (2018). Heart disease prediction using machine learning techniques: a survey. International Journal of
Engineering & Technology, 7(2.8), 684-687
[5] Kaur, A., & Arora, J. (2018). HEART DISEASE PREDICTION USING DATA MINING TECHNIQUES: A SURVEY. International Journal of Advanced
Research in Computer Science, 9(2).
[6] Sultana, M., Haider, A., & Uddin, M. (2016). Analysis of data mining techniques for heart disease prediction. 2016 3rd International Conference on Electrical
Engineering and Information Communication Technology (ICEEICT), 1-5.
[7] Deekshatulu, B. L., & Chandra, P. (2013). Classification of heart disease using k-nearest neighbor and genetic algorithm. Procedia technology, 10, 85-94.
[8] Learning, M. (2017). Heart disease diagnosis and prediction using machine learning and data mining techniques: a review. Advances in Computational
Sciences and Technology, 10(7), 2137-2159.