Project Report
Submitted by:
Jigeishu Srivastava
(2201100220016)
PRAYAGRAJ
Session: 2024-25
Abstract
In this digital world, data is an asset, and enormous amounts of data are generated in every field. Data in the healthcare
industry consists of all the information related to patients. Nowadays, cardiovascular diseases are growing rapidly
due to busy and stressful lifestyles. All age groups are under threat from these chronic diseases, so there
is a need to detect them from symptoms or reports. Early identification and treatment are the
best available options for affected people. The main objective behind developing this system is to help doctors
cross-verify their diagnostic results, which offers a promising way to reduce existing death rates.
Recent developments in medical supportive technologies based on data mining and deep learning play an
important role in detecting cardiovascular diseases by taking many factors into consideration,
such as age, type of chest pain, blood pressure, cholesterol levels, etc.
The proposed work tries to implement a promising solution for the detection of heart disease. The given
heart disease prediction system enhances medical care and reduces its cost. This project gives us significant
knowledge that can help us predict patients with heart disease.
Acknowledgement
I would like to express my deepest gratitude to Dr. Alka Verma, who provided invaluable
guidance and support throughout the development of this mini project. Her expertise and
encouragement have been instrumental in the successful completion of this work.
I am also grateful to my peers and family for their constant support and motivation during this
project. Their belief in my abilities has been a source of strength and inspiration.
Finally, I extend my thanks to the Institute of Engineering and Rural Technology, Prayagraj, for
providing the resources and environment necessary for learning and experimentation, which
significantly contributed to the completion of this project.
Contents
1. Introduction
1.1 Preface
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope and Limitations
1.6 Organization of Project
2. Literature Review
3. Methodology
3.1 System Architecture
3.2 Dataset Details
3.3 Machine Learning
3.3.1 Supervised Machine Learning
3.3.2 Unsupervised Machine Learning
3.4 Supervised Algorithms
3.4.1 Random Forest
3.4.2 K-Nearest Neighbour
3.4.3 Logistic Regression
3.4.4 XGBoost
4. Implementation
4.1 Existing System
4.2 Proposed System
4.2.1 Data Collection
4.2.2 Data Pre-Processing
4.2.3 Feature Selection
4.2.4 Model Selection
5. Results
5.1 Hardware Platform Used
5.2 Libraries and Software Platform Used
5.3 Visualization Results
6. Conclusion
1. INTRODUCTION
1.1. PREFACE
Machine Learning is a powerful tool that enables us to extract valuable information from data that
was previously unknown or implicit. The domain of machine learning is extensive and multifaceted;
it encompasses supervised, unsupervised, and ensemble learning approaches that can be employed to
make predictions and assess their precision on a particular dataset. The adoption of machine
learning is increasing day by day, and it has the potential to revolutionize many fields,
including healthcare. Cardiovascular disease (CVD) is an area of healthcare that can gain significantly
from the application of machine learning techniques. With 17.9 million fatalities globally, as per
the World Health Organization, CVD is currently the primary cause of death in adults. To help
address this problem, our project aims to predict which patients are likely to be diagnosed with CVD
based on their medical history. By recognizing patients who exhibit symptoms, for example chest pain
or elevated blood pressure, we can help diagnose the illness with fewer medical examinations and
provide more efficient treatment. Our project focuses on three data mining techniques: XGBoost,
KNN, and the Random Forest Classifier. By using these techniques in combination, we are able to
achieve an accuracy rate of above 95%, which is better than previous systems that relied on only one
data mining technique. The objective of our project is to examine patients' medical
characteristics, such as age, gender, fasting sugar levels, chest pain, and more, and predict
whether a person is likely to have heart disease or not.
To accomplish this, we selected a dataset from the Kaggle repository. This dataset was created by
combining different datasets that were already available independently but had not been combined before,
and it contains the medical history and characteristics of each patient. We trained our algorithms using the 12 medical
attributes of each patient and used XGBoost, Random Forest and KNN to classify the patients based
on their medical history. We found that XGBoost was the most efficient algorithm.
1.2 MOTIVATION
The main motivation for this work is to present a heart disease prediction model for predicting the
occurrence of heart disease. Further, this work aims to identify the best classification algorithm
for detecting the possibility of heart disease in a patient. This is done through a comparative study
and analysis of classification algorithms, namely XGBoost, Logistic Regression, KNN, and Random Forest,
at different levels of evaluation. Although these are commonly used machine learning algorithms,
heart disease prediction is a vital task that demands the highest possible accuracy. Hence, these
algorithms are evaluated using numerous levels and types of evaluation strategies.
1.3 PROBLEM STATEMENT
The major challenge in heart disease is its detection. Instruments that can predict heart disease
are available, but they are either expensive or not efficient at estimating the chance of heart
disease in a person. Early detection of cardiac disease can decrease the mortality rate and overall
complications. However, it is not possible to monitor patients accurately every day in all cases,
and round-the-clock consultation by a doctor is not available, since it requires more attention, time
and expertise. Since a large amount of data is available today, various machine learning algorithms
can be used to analyse this data for hidden patterns, and these hidden patterns can be used for
health diagnosis from medical data.
1.4 OBJECTIVES
3. To determine significant risk factors, based on the medical dataset, which may lead to heart disease.
1.5 SCOPE AND LIMITATIONS
1. The system will help identify the important factors that lead to heart disease.
3. It will help patients obtain results quickly and be diagnosed as early as possible.
2. LITERATURE REVIEW
Bo Jin, Chao Che et al. (2018) proposed a “Predicting the Risk of Heart Failure with EHR Sequential Data
Modeling” model designed by applying neural networks. The paper used electronic health record (EHR) data
from real-world datasets related to congestive heart disease to perform the experiment and to predict heart
disease in advance. The authors used one-hot encoding and word vectors to model the diagnosis events and
predicted heart failure events using the basic principles of a long short-term memory (LSTM) network model.
By analysing the results, they reveal the importance of respecting the sequential nature of clinical records.
Aakash Chauhan et al. (2018) presented “Heart Disease Prediction using Evolutionary Rule Learning”. This
study eliminates the manual task and also helps in extracting the information directly from the
electronic records. To generate strong association rules, frequent pattern growth association mining was
applied to the patients' dataset. This helps in decreasing the number of services and shows that the
overwhelming majority of the rules help in the best prediction of coronary disease.
Ashir Javeed, Shijie Zhou et al. (2017) designed “An Intelligent Learning System based on Random Search
Algorithm and Optimized Random Forest Model for Improved Heart Disease Detection”. This paper uses a
random search algorithm (RSA) for feature selection and a random forest model for diagnosing cardiovascular
disease. The model is optimized using a grid search algorithm. Two forms of experiments are used for
cardiovascular disease prediction: in the first, only the random forest model is developed, and in the
second, the proposed random search algorithm based random forest model is developed. This methodology is
efficient and less complex than the conventional random forest model, producing 3.3% higher accuracy than
the conventional random forest. The proposed learning system can help physicians improve the quality of
heart failure detection.
“Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques”, proposed by Senthilkumar
Mohan, Chandrasegar Thirumalai et al. (2019), presented an efficient technique using a hybrid machine
learning methodology. The hybrid approach is a combination of random forest and a linear method. The
dataset and subsets of attributes were collected for prediction, and a subset of attributes was chosen
from the pre-processed cardiovascular disease dataset. After pre-processing, the hybrid technique was
applied to diagnose cardiovascular disease.
K. Prasanna Lakshmi and Dr. C. R. K. Reddy (2015) designed “Fast Rule-Based Heart Disease Prediction using
Associative Classification Mining”. The proposed Stream Associative Classification Heart Disease Prediction
(SACHDP) system uses associative classification mining over a landmark window of data streams. The approach
has two phases: generating rules with associative classification mining, and then pruning the rules using
chi-square testing and arranging them in order to form a classifier.
M. Satish et al. (2015) used different data mining techniques such as rule-based methods, Decision Tree,
Naive Bayes, and Artificial Neural Networks. An efficient approach called pruning classification association
rule (PCAR) was used to generate association rules from a cardiovascular disease warehouse for the
prediction of heart disease. The heart disease data warehouse was pre-processed for mining, and all of the
above data mining techniques were described.
Lokanath Sarangi, Mihir Narayan Mohanty and Srikanta Pattnaik (2015), in “An Intelligent Decision Support
System for Cardiac Disease Detection”, designed a cost-efficient model using a genetic algorithm optimizer.
The weights were optimized and fed as input to the network. The hybrid technique of GA and neural networks
achieved an accuracy of 90%.
“Prediction and Diagnosis of Heart Disease by Data Mining Techniques”, by Boshra Bahrami and
Mirsaeid Hosseini Shirvani, uses various classification methodologies for diagnosing cardiovascular
disease. Classifiers such as KNN, SVM and Decision Tree are applied to the dataset. After classification
and performance evaluation, the Decision Tree is found to be the best classifier for cardiovascular
disease prediction from the dataset.
Mamatha Alex P and Shaicy P Shaji (2019) designed “Prediction and Diagnosis of Heart Disease Patients using
Data Mining Technique”. This paper uses Artificial Neural Networks, KNN, Random Forest and Support Vector
Machines, and among these classification techniques the Artificial Neural Network gives the highest
accuracy for diagnosing heart disease.
3. METHODOLOGY
3.1 SYSTEM ARCHITECTURE
The system architecture gives an overview of the working of the system: the data is collected and
pre-processed, the important attributes are selected, the classification models are trained on the
training data, and their accuracy is evaluated on the testing data.
3.2 DATASET DETAILS
Dataset Attributes
1. Age: age of the patient [years]
2. Sex: sex of the patient [M: Male, F: Female]
3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-
Anginal Pain, ASY: Asymptomatic]
4. RestingBP: resting blood pressure [mm Hg]
5. Cholesterol: serum cholesterol [mg/dl]
6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave
abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH:
showing probable or definite left ventricular hypertrophy by Estes' criteria]
8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
10. Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down:
downsloping]
12. HeartDisease: output class [1: heart disease, 0: Normal]
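A minimal loading sketch in Python (pandas) for the dataset described above is given below; the file name heart.csv and the exact column names are assumptions based on the attribute list and should be adjusted to match the downloaded Kaggle file.

import pandas as pd

# Load the Kaggle heart-disease dataset (918 rows, 12 columns expected).
df = pd.read_csv("heart.csv")
print(df.shape)                           # e.g. (918, 12)
print(df["HeartDisease"].value_counts())  # class balance of the output label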
3.3 MACHINE LEARNING
In machine learning, classification refers to a predictive modelling problem where a class label is
predicted for a given example of input data.
3.3.1 SUPERVISED MACHINE LEARNING
Supervised machine learning can be divided into two types of problems, which are given below:
a) Classification
b) Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the output variable is
categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. Classification algorithms
predict the categories present in the dataset.
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is a
continuous numeric value; they model the relationship between the input variables and the output.
They are used to predict continuous quantities such as market trends, weather conditions, etc.
Some popular regression algorithms are given below:
• Lasso Regression
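As a brief illustration of the distinction, the hedged Python sketch below instantiates one classifier (categorical output) and one regressor (continuous output) from scikit-learn; the estimators and parameter values are generic examples rather than the project's final configuration.

from sklearn.linear_model import LogisticRegression, Lasso

clf = LogisticRegression(max_iter=1000)  # classification: predicts a class label such as 0/1
reg = Lasso(alpha=0.1)                   # regression: predicts a continuous numeric value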
3.3.2 UNSUPERVISED MACHINE LEARNING
Unsupervised learning is different from the supervised learning technique; as its name suggests,
there is no need for supervision. In unsupervised machine learning, the machine is trained
using an unlabeled dataset, and it predicts the output without any supervision.
In unsupervised learning, the models are trained with data that is neither classified nor
labeled, and the model acts on that data without any supervision. The main aim of an
unsupervised learning algorithm is to group or categorize the unsorted dataset according to
similarities, patterns, and differences; the machine is instructed to find the hidden patterns in
the input dataset.
3.4 SUPERVISED ALGORITHMS
3.4.1 RANDOM FOREST
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on
the concept of ensemble learning, which is the process of combining multiple classifiers to solve a
complex problem and improve the performance of the model.
As the name suggests, a random forest is a classifier that builds a number of decision trees on
various subsets of the given dataset and combines their outputs to improve the predictive accuracy
on that dataset. Instead of relying on one decision tree, the random forest takes the prediction from
each tree and predicts the final output based on the majority vote of those predictions. A greater
number of trees in the forest generally leads to higher accuracy and reduces the problem of
overfitting.
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees predict the correct output while others do not; together, however, the
trees predict the correct output.
Therefore, two assumptions lead to a better Random Forest classifier:
1. There should be some actual values in the feature variables of the dataset so that the
classifier can predict accurate results rather than guessed results.
2. The predictions from the individual trees must have very low correlations.
Advantages:
• It enhances the accuracy of the model and reduces the overfitting issue.
Disadvantages:
• Although Random Forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.
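The following is a minimal sketch of a Random Forest classifier using scikit-learn; the variables X_train, X_test, y_train and y_test are assumed to come from the pre-processing and splitting step described in Section 4.2.2, and the hyper-parameter values are illustrative rather than tuned.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Ensemble of decision trees; the final class is the majority vote of the trees.
rf = RandomForestClassifier(
    n_estimators=100,   # number of decision trees in the forest
    max_depth=None,     # let the trees grow fully; tune to control overfitting
    random_state=42,
)
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))   # accuracy on the held-out test data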
3.4.2 K-NEAREST NEIGHBOUR
K-Nearest Neighbour is one of the simplest machine learning algorithms and is based on the supervised
learning technique. The K-NN algorithm assumes similarity between the new case and the available
cases and puts the new case into the category that is most similar to the available categories.
It stores all the available data and classifies a new data point based on similarity, which means
that when new data appears it can easily be classified into a well-suited category. KNN can be used
for regression as well as classification, but it is mostly used for classification problems. K-NN is
a non-parametric algorithm, which means it does not make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, performs the
computation on the stored data.
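A hedged sketch of K-NN with scikit-learn is shown below. Because K-NN is distance based, a scaler is included so that all attributes contribute comparably; X_train, X_test, y_train and y_test are again assumed from the earlier split, and k = 5 is only an illustrative choice.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Lazy learner: stores the training data and classifies new points by their
# k nearest neighbours in the scaled feature space.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # mean accuracy on the test data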
3.4.3 LOGISTIC REGRESSION
Logistic regression is one of the most popular machine learning algorithms and comes under the
supervised learning technique. It is used for predicting a categorical dependent variable from a
given set of independent variables. Because logistic regression predicts a categorical dependent
variable, the outcome must be a categorical or discrete value such as Yes or No, 0 or 1, True or
False; however, instead of giving the exact values 0 and 1, it gives probabilistic values that lie
between 0 and 1. Logistic regression is similar to linear regression except in how it is used:
linear regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems. In logistic regression, instead of fitting a regression line, we
fit an "S"-shaped logistic (sigmoid) function, which maps any input to a value between the two
extremes 0 and 1. The curve from the logistic function indicates the likelihood of something, for
example whether cells are cancerous, or whether a mouse is obese based on its weight. Logistic
regression is a significant machine learning algorithm because it can provide probabilities and
classify new data using both continuous and discrete datasets.
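A minimal scikit-learn sketch of logistic regression is given below; the 0.5 decision threshold and the variable names X_train, X_test and y_train are assumptions for illustration.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

probs = logreg.predict_proba(X_test)[:, 1]   # probability of heart disease for each patient
labels = (probs >= 0.5).astype(int)          # map the S-curve output to a 0/1 class label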
3.4.4 XGBOOST
One of the strengths of XGBoost is its built-in L1 and L2 regularization, which helps prevent
overfitting and makes it a regularized form of GBM. When using the Scikit Learn library, the alpha
and lambda hyper-parameters related to regularization are passed to XGBoost: alpha is used for
L1 regularization, while lambda is used for L2 regularization.
Another strength of XGBoost is its ability to leverage parallel processing to execute models
much faster than GBM. When using the Scikit Learn library, the nthread hyper-parameter is used
for parallel processing, representing the number of CPU cores to be used. If you want to use all
available cores, don't specify a value for nthread, and the algorithm will detect them
automatically.
XGBoost also has built-in capabilities to handle missing values. When the algorithm encounters
a missing value at a node, it tries both the left-hand and right-hand splits and learns the
direction that leads to the lower loss for that node. It then applies the same learned default
direction when working on the testing data.
Cross-validation is another feature of XGBoost: it allows the user to run cross-validation at
each iteration of the boosting process, making it easy to obtain the exact optimum number of
boosting iterations in a single run. This is unlike GBM, where a grid search must be run and only
a limited number of values can be tested.
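The hedged sketch below illustrates these XGBoost features with the xgboost Python package: L1/L2 regularization, parallel training, native handling of missing values, and built-in cross-validation. The hyper-parameter values are illustrative, and X_train and y_train are assumed from the earlier split.

import xgboost as xgb

# Regularized, parallel gradient-boosted trees.
model = xgb.XGBClassifier(
    n_estimators=200,
    reg_alpha=0.1,    # L1 regularization (the "alpha" discussed above)
    reg_lambda=1.0,   # L2 regularization (the "lambda" discussed above)
    n_jobs=-1,        # use all available CPU cores
)
model.fit(X_train, y_train)   # missing feature values are handled natively

# Built-in cross-validation to choose the number of boosting rounds.
params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "reg_alpha": 0.1, "reg_lambda": 1.0}
dtrain = xgb.DMatrix(X_train, label=y_train)
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    early_stopping_rounds=10)
print(len(cv_results))        # boosting rounds kept after early stopping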
4. IMPLEMENTATION
Heart disease is often described as a silent killer, since it can lead to death without obvious
symptoms. The nature of the disease is the cause of growing anxiety about the disease and its
consequences. Hence, continued efforts are being made to predict the possibility of this deadly
disease in advance, and various tools and techniques are regularly being experimented with to suit
present-day health needs. Machine learning techniques can be a boon in this regard. Even though
heart disease can occur in different forms, there is a common set of core risk factors that
influence whether someone will ultimately be at risk for heart disease or not. By collecting data
from various sources, classifying it under suitable headings, and finally analysing it to extract
the desired information, conclusions can be drawn. This approach can be adapted very well to the
prediction of heart disease. As the well-known quote says, "Prevention is better than cure":
early prediction and control can help prevent and decrease the death rates due to heart disease.
4.2 PROPOSED SYSTEM
The working of the system starts with the collection of data and the selection of the important
attributes. The required data is then pre-processed into the required format and divided into two
parts, training and testing data. The algorithms are applied and the model is trained using the
training data; the accuracy of the system is obtained by testing it on the testing data. The system
is implemented using the following modules:
1. Data Collection
2. Data Pre-Processing
3. Feature Selection
4. Model Selection
4.2.1 DATA COLLECTION
Data collection is the primary and most crucial step when applying machine learning and
analytics. The data required in this project is the patient's medical data. We collected the
dataset from Kaggle, which includes all the information required for prediction. The features the
dataset includes are medical information such as age, sex, chest pain type, resting blood
pressure, cholesterol, fasting blood sugar, oldpeak, etc. The dataset consists of 918 observations
with 12 attributes.
4.2.2 DATA PRE-PROCESSING
Data pre-processing is one of the most crucial tasks in the analytics process; it is often observed
that more than half of the total time of the analytics process is taken by the pre-processing
phase. It is an important step in the creation of a machine learning model. Initially, the data may
not be clean or in the format required by the model, which can cause misleading outcomes. In data
pre-processing we transform the data into the required format and deal with noise, duplicates, and
missing values in the dataset. Data pre-processing includes activities such as importing the
dataset, splitting the dataset, attribute scaling, etc. Pre-processing of data is required to
improve the accuracy of the model.
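A brief sketch of these pre-processing activities in Python is given below; the column names follow the dataset attributes listed in Section 3.2, and the 80/20 split ratio and random seed are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")
df = df.drop_duplicates()   # remove duplicate records

# One-hot encode the categorical attributes so the models receive numeric input.
categorical = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
df = pd.get_dummies(df, columns=categorical, drop_first=True)

# Split into training and testing data.
X = df.drop(columns=["HeartDisease"])
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)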
4.2.3 FEATURE SELECTION
Once we have the required data, the next step is feature selection. It often happens that some
features do not contribute to the evaluation or even have a negative impact on the accuracy.
Feature selection is the step where we try to reduce the number of features and, where useful,
create new features from the existing ones; these new features should summarize the information
contained in the existing features. The final features to be considered for prediction can be
identified using a correlation matrix of the attributes, as sketched below:
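One common way to inspect the correlation matrix is a heat-map; the sketch below assumes the encoded DataFrame df from the pre-processing step and uses seaborn purely for illustration.

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)            # pairwise correlations of all attributes
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of dataset attributes")
plt.show()

# Attributes with near-zero correlation to HeartDisease are candidates to drop.
print(corr["HeartDisease"].sort_values(ascending=False))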
4.2.4 MODEL SELECTION
Model selection is the process of selecting one final algorithm for the given purpose. It is
decided by observing the accuracy obtained by applying multiple algorithms; we can use logistic
regression, XGBoost, KNN, random forest, etc. The final accuracy depends on the type of model we
select. A comparative analysis is performed among the algorithms, and the algorithm that gives the
highest accuracy is used for heart disease prediction.
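A comparative-analysis sketch is shown below: each candidate model is trained, its test-set accuracy is reported, and the best one is kept. The model settings are illustrative, and X_train, X_test, y_train and y_test are assumed from the pre-processing step.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")

best = max(scores, key=scores.get)   # algorithm selected for heart disease prediction
print("Selected model:", best)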
5. RESULTS
5.1 HARDWARE PLATFORM USED
The hardware requirement may serve as the basis for a contract for the implementation of the
system and should therefore be complete and consistent in its specification.
The hardware used for the system is mentioned below.
It should be noted that the better the hardware facilities available, the better the response time
of the system.
5.2 LIBRARIES AND SOFTWARE PLATFORM USED
The software requirement document is the specification of the system; it provides a basis for
creating the software requirements specification.
OPERATING SYSTEM: Windows
Based on the findings obtained from the various algorithms used for identifying patients diagnosed
with heart disease, it is observed that Logistic Regression, Random Forest Classifier, and KNN
provided better results compared to other techniques such as XGBoost, SVM, and Decision Tree, as
well as compared to the algorithms used in previous research studies. The highest accuracy achieved
by Random Forest and Logistic Regression is greater than or nearly equal to the accuracy obtained
in earlier research studies. It can be inferred that the improvement in accuracy is due to the
increased number of attributes used from the medical dataset in this project.
Additionally, the study revealed that Logistic Regression and Random Forest outperform KNN in
detecting patients who may have heart disease, indicating that Logistic Regression and the Random
Forest Classifier are more effective in diagnosing heart disease.
In this project, the data was prepared in different formulations and the model was trained using
the Logistic Regression algorithm, achieving above 87% accuracy.
6. CONCLUSION
Cardiovascular disease (CVD) is one of the leading causes of death worldwide, making early
detection and intervention crucial for improving patient outcomes. To address this need, machine
learning techniques were used to develop a model that uses patient medical history data to predict
the probability of heart disease. The dataset includes variables such as chest pain, blood sugar
levels, and blood pressure, which are important indicators of heart health.
The classification algorithms Logistic Regression, Random Forest Classifier, and KNN were utilized
to develop the model, which achieved an accuracy rate of over 87%. The accuracy of the model was
further improved by increasing the size of the dataset, enabling the identification of more subtle
patterns and risk factors.
The application of machine learning techniques in medical diagnosis has several benefits, including
increased speed and accuracy of diagnoses, reduced costs, and improved patient outcomes. By
analyzing large amounts of data and identifying complex patterns, machine learning algorithms can
provide valuable insights into patient health that may not be immediately apparent to human
clinicians.