Mini Project 2024
Heart Disease Prediction Using Machine Learning
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Bandi Chandu
Jakkani Prabhas
Dyavarashetty Shailnath
SUPERVISOR
Dr. D. Raghavendra Gowda
Associate Professor
May, 2024
Department of Computer Science and Engineering
CERTIFICATE
This is to certify that the project titled Heart Disease Prediction Using
Machine Learning is carried out by Bandi Chandu, Jakkani Prabhas, and
Dyavarashetty Shailnath.
We avail this opportunity to express our deep sense of gratitude and heartfelt
thanks to Dr. Teegala Vijender Reddy, Chairman, and Sri Teegala
Upender Reddy, Secretary of VCE, for providing a congenial atmosphere to
complete this project successfully.
Bandi Chandu
Jakkani Prabhas
Dyavarashetty Shailnath
Abstract
Our system aims to predict heart disease using the decision tree algorithm,
striving for accurate and reliable predictions while addressing the limitations
of other methods. The decision tree algorithm splits data features based on
information gain, entropy, and the Gini index, using the feature with the
highest information gain for predictions.
Table of Contents
4.2.6 Model Feature Splitting
CHAPTER 5 RESULT AND ANALYSIS
5.1 DATASET
5.2 IMPLEMENTATION DETAILS
5.2.1 Python
5.2.2 LIBRARIES (MODULES)
5.2.3 NumPy
5.2.4 Pandas
5.2.5 Scikit-Learn
5.2.6 Google Colab
5.3 RESULTS
5.3.1 Source Code
5.3.2 OUTPUT
5.4 COMPARISON
CHAPTER 6 CONCLUSION
REFERENCES
List of Figures
5.1 OUTPUT
5.2 k-means clustering
5.3 decision tree
Introduction
Heart disease is a major global health issue, responsible for over 15 million
deaths annually. It includes various conditions like heart failure, coronary
artery disease, and arrhythmias, all of which impair the heart’s ability to
function properly. Heart failure occurs when the heart can’t pump enough
blood, coronary artery disease involves the narrowing of the arteries supplying
blood to the heart, and arrhythmias are irregular heartbeats that can lead to
severe complications, such as stroke or cardiac arrest. These conditions not
only reduce the quality of life but also pose significant risks to life itself.
One of the biggest challenges in treating heart disease is that it often goes
undetected until it reaches an advanced stage. Many people do not experience
symptoms until significant damage has already occurred, making the disease
more difficult to treat effectively. Late detection limits treatment options
and reduces the chances of successful intervention, contributing to the high
mortality rate associated with heart conditions. This delay in diagnosis often
means that by the time patients seek help, their condition has progressed to
a point where fewer and less effective treatment options are available.
Early detection and prediction of heart disease can drastically improve survival
rates and patient outcomes. By identifying individuals at risk before they show
symptoms, healthcare providers can implement preventive measures such as
lifestyle changes, medications, and sometimes surgical procedures to manage
risk factors like high blood pressure and cholesterol. Advances in medical
technology and predictive analytics are enhancing our ability to detect heart
disease earlier and more accurately. Widespread adoption of early detection
programs could significantly reduce the number of deaths from heart disease,
enabling timely medical interventions that save lives and improve health outcomes.
Key indicators for predicting heart disease include a range of clinical and
lifestyle factors such as blood pressure, cholesterol levels, oldpeak (ST depression
induced by exercise relative to rest), exercise habits, blood sugar levels,
and heart rate. By analyzing these variables, machine learning models can
predict the likelihood of heart disease with significant accuracy.
Machine learning and data mining play a major role in predicting, analyzing,
and processing medical data. Data mining techniques help gather and
structure data, making it easier to understand and use in disease prediction.
Several machine learning algorithms are utilized in the prediction of heart
disease, including k-Nearest Neighbors (KNN) and k-means clustering. KNN
is a simple, yet effective, algorithm that classifies data points based on their
proximity to other data points. However, it may not perform well with large
datasets. K-means clustering, on the other hand, partitions data into k pre-
defined clusters, but it also has limitations such as the need for a predefined
number of clusters, high computational cost, and implementation complexity.
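To make this contrast concrete, the following sketch compares the two algorithms using scikit-learn on synthetic data (a stand-in for the clinical features; the project's actual dataset is introduced in Chapter 5):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Synthetic stand-in for clinical features (illustrative only).
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN is supervised: it classifies a point by the labels of its nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('KNN test accuracy:', knn.score(X_test, y_test))

# k-means is unsupervised and needs the number of clusters (k) fixed in advance.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
print('k-means cluster sizes:', np.bincount(km.labels_))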
In our proposed system, we aim to predict heart disease using the deci-
sion tree algorithm. The main goal of our system is to achieve accurate
and reliable predictions while overcoming the limitations of other algorithms.
The decision tree algorithm splits data features using information gain, en-
tropy, and the Gini index. The core concept of the decision tree algorithm
is to consider the feature with the highest information gain to make predictions.
In a decision tree, there are three primary components: the root node,
decision nodes, and leaf nodes. These components work together to classify
data and make predictions.
Root Node:
The root node is the topmost node and the starting point of the decision
tree. To select the root node, the algorithm calculates the information gain
and entropy for each attribute in the dataset. Information gain assesses how
well an attribute separates the data into distinct classes, while entropy mea-
sures the uncertainty or disorder in the data. The attribute with the highest
information gain is chosen as the root node, as it provides the best split to
reduce uncertainty and separate the data effectively.
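As a minimal, self-contained sketch of this selection criterion (the toy arrays below are hypothetical, not drawn from the project dataset), entropy and information gain can be computed as follows:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # Reduction in entropy achieved by splitting on a categorical feature.
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Toy example: exercise-induced angina (0/1) against the disease label.
angina = np.array([1, 1, 0, 0, 1, 0, 0, 1])
target = np.array([1, 1, 0, 0, 1, 0, 1, 0])
print(information_gain(angina, target))

The attribute scoring the highest information gain across all candidates would become the root node.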
Decision Nodes:
Once the root node is established, the algorithm identifies the decision nodes.
Each decision node tests a further attribute, splitting the remaining records into
progressively purer subsets, until a leaf node assigns the final class label.
Literature Survey
PLS-DA outperformed the other techniques across all performance metrics.
However, PLS-DA is noted for its complexity, making it challenging to
implement and use [4].
Mohammad M. Ghiasi, Sohrab Zendehboudi, and Ali Asghar Mohsenipour
noted that traditional methods for diagnosing coronary artery disease
(CAD) can be costly and carry health risks. Decision trees (DTs) offer an
interpretable and cost-effective alternative. Recent advances in DT algorithms
and PI-explanations enhance their accuracy and simplicity, making them viable
tools for CAD diagnosis. Studies show that DTs can effectively complement
traditional diagnostic approaches [5].
Lior Rokach and Oded Maimon conducted a comprehensive survey on decision
trees, a popular method for classifying data. Their research delves into various
techniques for constructing decision trees, emphasizing key concepts like
information gain, entropy, and the Gini index. The survey offers valuable
insights into the methods used to effectively split data and simplify decision
tree structures [6].
Ibomoiye Domor Mienye, Yanxia Sun, and Zenghui Wang reviewed the
prediction performance of improved decision tree-based algorithms. Their
review examines techniques for refining decision tree construction and
evaluates how these refinements affect predictive accuracy [7].
Yacine Izza, Alexey Ignatiev, and Joao Marques-Silva showed that decision
trees (DTs) can be less interpretable than previously thought, as their
decision paths can be unnecessarily long. To address this,
researchers propose PI-explanations, which identify the minimal set of features
necessary for a prediction. This new model allows for polynomial-time compu-
tation of one PI-explanation and can enumerate all PI-explanations by reducing
the task to finding minimal hitting sets. Experiments demonstrate that DTs
often include extraneous features in their paths, validating the efficiency of
PI-explanations [8].
According to recent studies, machine learning has notably advanced the
diagnosis of heart disease.
CHALLENGES
3.0.6 Noise and Outliers:
Decision trees can be sensitive to noisy or outlier data, which can nega-
tively impact the accuracy of the model.
PROPOSED METHODOLOGY
Decision trees make predictions easy to understand and interpret.
In our proposed system for enhancing heart disease prediction, we leverage the
decision tree algorithm due to its inherent advantages over traditional clus-
tering methods like k-means. Decision trees offer flexibility by not requiring
a predefined number of clusters, unlike k-means, which necessitates specifying
the number of groups beforehand. Instead, decision trees utilize metrics such
as information gain, entropy, or Gini index to autonomously determine optimal
features for data partitioning and prediction. This adaptability is particularly
advantageous in medical contexts, where the relationships between various
health indicators and heart disease can be complex and multifaceted.
4.2 Methodology:
In this section, we outline the methodology employed to achieve the
objectives of this project; Figure 4.1 shows the flow of the data and the
workflow of the decision tree.
4.2.1 Model:
In this section, we outline the methodology employed to achieve the
objectives of this project. The primary aim of our approach is to enhance
heart disease prediction accuracy using the decision tree algorithm. A decision
tree is a popular machine learning algorithm used for both classification and
regression tasks. It’s called a “tree” because it’s structured like a flowchart,
with branches representing decisions and leaves representing outcomes. We
will detail each step involved, from data collection and preprocessing to model
training, evaluation, and deployment.
Finally, the test dataset remains entirely separate from both training and
validation phases, serving as a final checkpoint to assess the model’s ability
to generalize. By evaluating the model on unseen data, practitioners gain
confidence in its ability to perform accurately in real-world scenarios. This
rigorous evaluation process ensures that the developed model is robust, reliable,
and capable of making meaningful predictions beyond the specific dataset used
for training, thereby fostering trust and applicability in practical applications
such as healthcare, finance, and engineering.
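A minimal sketch of this three-way partition with scikit-learn, assuming X (features) and y (labels) have already been loaded as in Section 5.3 and taking 60/20/20 as an illustrative proportion:

from sklearn.model_selection import train_test_split

# Hold out the final test set first, then split the remainder into
# training and validation sets (0.25 of 80% gives a 60/20/20 split).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)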
4.2.5 Model
A decision tree is a versatile machine learning algorithm widely applied in
both classification and regression tasks due to its intuitive structure resembling
a flowchart. Each node in the tree represents a decision based on a feature,
with branches leading to subsequent nodes or leaves that signify outcomes or
predictions. This hierarchical structure allows for clear visualization and in-
terpretation of the decision-making process, making decision trees particularly
valuable in domains where transparency and explainability are crucial, such
as healthcare.
In decision tree algorithms, the selection of the best split at each node
is pivotal in optimizing predictive accuracy. Metrics like information gain,
entropy, and the Gini index serve as criteria for evaluating potential splits
based on how well they separate the data into distinct classes or reduce
uncertainty. Information gain measures the reduction in entropy or disorder
achieved by a split, making it essential for identifying the most informative
features.
Entropy quantifies the randomness or impurity of a dataset, while the Gini
index assesses how often a randomly chosen element from the set would be
incorrectly labeled if it were labeled at random according to the distribution
of class labels in the subset.

$$\text{Information Gain} = \text{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\text{Entropy}(S_i) \tag{4.1}$$

$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{4.2}$$

$$\text{Gini Index} = 1 - \sum_{i=1}^{c} p_i^2 \tag{4.3}$$

where $S$ is the set of records at a node, $S_1, \dots, S_n$ are the subsets
produced by a candidate split, and $p_i$ is the proportion of records belonging
to class $i$.
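For reference, scikit-learn's DecisionTreeClassifier exposes these same criteria directly through its criterion parameter; a brief sketch (the max_depth value here is an arbitrary illustration):

from sklearn.tree import DecisionTreeClassifier

# 'gini' uses the Gini index of equation (4.3); 'entropy' drives splits
# by information gain, as in equations (4.1) and (4.2).
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)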
5.1 DATASET
As shown in Table 5.2, datasets designed for heart disease prediction commonly
include several essential attributes that facilitate accurate machine learning
models. These attributes encompass critical physiological and demographic
factors that influence cardiovascular health assessment. For instance, age serves
as a fundamental indicator, with older individuals generally at higher risk.
Gender (sex) is another significant factor, with males typically having a higher
susceptibility to heart disease compared to females.
Other crucial attributes include resting blood pressure, which provides in-
sights into cardiovascular workload, and cholesterol levels, particularly LDL
cholesterol, which correlates with arterial plaque buildup. Fasting blood sugar
levels indicate diabetes risk, impacting heart health. Parameters like resting
heart rate and maximum heart rate during exercise reflect cardiovascular fitness
and the heart’s response to physical stress. The presence of exercise-induced
angina and the type of chest pain provide further diagnostic clues regarding
coronary artery disease.
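These attributes can be inspected directly with Pandas; a brief sketch, assuming a UCI-style file with the conventional column names (age, sex, cp, trestbps, chol, fbs, thalach, exang, target) — the actual file and column names may differ:

import pandas as pd

# Hypothetical filename; Section 5.3 loads a similar CSV in Colab.
df = pd.read_csv('heart_disease_data.csv')

# Summary statistics for the clinical attributes discussed above.
print(df[['age', 'sex', 'trestbps', 'chol', 'fbs', 'thalach', 'exang', 'cp']].describe())
print(df['target'].value_counts())   # class balance of the disease label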
5.2.1 Python
Python is a high-level, interpreted programming language known for its read-
ability, simplicity, and versatility. It supports multiple programming paradigms,
including procedural, object-oriented, and functional programming.
Moreover, Python embraces functional programming paradigms, enabling de-
velopers to leverage concepts such as higher-order functions and immutable
data structures. This approach facilitates concise and expressive code, par-
ticularly for tasks involving data manipulation, filtering, and transformation.
Python’s flexibility in accommodating diverse programming styles makes it
an ideal choice for a wide range of applications, from web development and
scientific computing to artificial intelligence and data analysis. Its extensive
libraries and frameworks further extend its capabilities, empowering developers
to tackle complex problems efficiently and effectively.
Version: Python 3.10
5.2.2 LIBRARIES (MODULES)
5.2.3 NumPy
NumPy, short for Numerical Python, is a fundamental library for scientific
computing in Python. It provides support for large, multi-dimensional arrays
and matrices, along with a collection of mathematical functions to operate on
these arrays. NumPy is a key component of the scientific Python stack and
is widely used in data science, machine learning, and engineering.
• The ndarray is the core data structure of NumPy, allowing for the creation
of fast, memory-efficient, multi-dimensional arrays.
Version: 1.25.2
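A brief illustration of the ndarray in use (toy values, not project data):

import numpy as np

ages = np.array([52, 61, 45, 58])        # 1-D ndarray of patient ages
chol = np.array([212, 289, 204, 248])    # matching cholesterol readings

print(ages.mean())        # vectorised mean, no explicit loop
print(chol[ages > 50])    # boolean-mask indexing selects by condition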
5.2.4 Pandas
Pandas is a powerful and flexible open-source data analysis and manip-
ulation library for Python. It is built on top of NumPy and provides
high-level data structures and functions designed to make data analysis fast
and easy. Pandas is widely used in data science, machine learning, and other
data-intensive fields for its ability to handle and manipulate large datasets
efficiently.
Moreover, Pandas’ extensive set of built-in functions for data manipulation,
aggregation, and visualization further enhances its utility. From handling miss-
ing data to merging datasets and performing group-based operations, Pandas
simplifies intricate data tasks into concise and readable Python code. Its inte-
gration with other Python libraries and tools extends its functionality, making
Pandas a cornerstone in the toolkit of data professionals seeking efficient data
analysis solutions.
Version: 2.0.3
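As one small example of the group-based operations mentioned above (illustrative values only):

import pandas as pd

# Toy frame mirroring two dataset columns.
df = pd.DataFrame({'sex': [1, 0, 1, 0], 'chol': [212, 289, 204, 248]})

# Group-based aggregation in a single expression: mean cholesterol by sex.
print(df.groupby('sex')['chol'].mean())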
5.2.5 Scikit-Learn
Scikit-learn (often abbreviated as sklearn) is one of the most popular and
powerful machine learning libraries for Python. It provides simple and efficient
tools for data mining and data analysis, and it is built on NumPy, SciPy, and
matplotlib. Scikit-learn is widely used for building machine learning models
due to its simplicity, flexibility, and comprehensive range of algorithms.
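A minimal sketch of that simplicity, assuming X and y have been prepared as in Section 5.3:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())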
5.2.6 Google Colab
Google Colab is a cloud-hosted Jupyter notebook environment. It integrates
smoothly with Google Drive, allowing for easy importation of datasets and
storage of model checkpoints and experimental results. This integration fosters
collaboration by enabling teams to work on shared notebooks together.
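For example, a Drive-backed dataset can be accessed with the standard mount call:

from google.colab import drive

# Mount Google Drive inside the Colab runtime; files then appear
# under /content/drive/MyDrive/.
drive.mount('/content/drive')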
5.3 RESULTS
5.3.1 Source Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Path as in the original Colab listing.
df = pd.read_csv('/content/heart_disease_data(2).csv')

print(df.head())
X = df.iloc[:, :-1].values   # all columns except the last are features
y = df.iloc[:, -1].values    # the last column is the disease label

# DECISION TREE CLASS
class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=float('inf')):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    # ... (tree-growing helpers, including split_dataset, omitted in the
    # original listing) ...

    def find_best_split(self, X, y, num_features):
        best_split = {}
        best_impurity = float('inf')
        for feature in range(num_features):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_X, left_y, right_X, right_y = split_dataset(X, y, feature, threshold)
                # ... (impurity evaluation and best-split update omitted) ...

    # ... (per-sample traversal helper, assumed here to be named _predict,
    # omitted in the original listing) ...

    def predict(self, X):
        # Walk each row down the fitted tree; the original listing recursed
        # into predict itself, which would never terminate.
        return np.array([self._predict(inputs) for inputs in X])

# ... (tree construction and fitting omitted in the original listing) ...

predictions = tree.predict(X)
print(predictions)
accuracy = np.sum(predictions == y) / len(y)
print(f'Accuracy: {accuracy:.4f}')
Listing 5.1: Python code for heart disease prediction using a decision tree
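Note that the listing reports accuracy on the same data used to build the tree, which tends to overstate performance. A sketch of a held-out evaluation with scikit-learn (the 80/20 split ratio is an assumption), using the same X and y:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# X and y as prepared at the top of the listing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print('Held-out accuracy:', accuracy_score(y_test, clf.predict(X_test)))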
5.3.2 OUTPUT:
5.4 COMPARISON
Table 5.2 shows that the k-means algorithm often yields lower accuracy than
the decision tree due to inherent drawbacks. These include the need for
predefined clusters and computationally expensive processes. Decision trees,
on the other hand, overcome these limitations by offering greater flexibility
and interpretability. They can adapt to the data’s structure more effectively,
resulting in improved accuracy and performance in many cases.
As shown in Figure 5.3, decision trees, a staple of supervised learning, construct
hierarchical decision rules based on dataset features to optimize accuracy, often
through maximizing information gain, reducing entropy, or minimizing the Gini
index at each split. While decision trees provide interpretability, their accuracy
may vary, especially with complex datasets, continuous variables, or outliers.
Comparatively, k-means clustering, an unsupervised algorithm, partitions data
into clusters based on similarity to enhance accuracy, but its performance
hinges on selecting an appropriate number of clusters (k) and initial centroids.
In Figure 5.2, k-means clustering is an unsupervised algorithm for grouping
data based on similarity. It’s useful for exploring data and finding patterns,
especially in large datasets. However, determining the optimal number of
clusters beforehand can be tricky, and the algorithm’s performance depends
on the initial centroid selection.
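Because k-means assigns arbitrary cluster ids rather than class labels, any accuracy comparison against the supervised labels first requires mapping each cluster to its majority class. A sketch of that mapping (assuming X and y as in Section 5.3, with integer labels):

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Map each cluster id to the majority true class within that cluster.
mapped = np.zeros_like(km.labels_)
for c in range(2):
    mask = km.labels_ == c
    mapped[mask] = np.bincount(y[mask].astype(int)).argmax()

print('k-means accuracy:', (mapped == y).mean())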
CONCLUSION
REFERENCES
[1] Aristidis Likas, Nikos Vlassis, and Jakob J Verbeek. “The global k-means
clustering algorithm”. In: Pattern recognition 36.2 (2003), pp. 451–461.
[2] Devansh Shah, Samir Patel, and Santosh Kumar Bharti. “Heart disease
prediction using machine learning techniques”. In: SN Computer Science
1.6 (2020), p. 345.
[3] Bahzad Charbuty and Adnan Abdulazeez. “Classification based on deci-
sion tree algorithm for machine learning”. In: Journal of Applied Science
and Technology Trends 2.01 (2021), pp. 20–28.
[4] K.R. Lakshmi, M. Veera Krishna, and S. Prem Kumar. “Performance
Comparison of Data Mining Techniques for Predicting of Heart Disease
Survivability”. In: International Journal of Scientific Research Publications
3.6 (2018), pp. 476–487.
[5] Mohammad M Ghiasi, Sohrab Zendehboudi, and Ali Asghar Mohsenipour.
“Decision tree-based diagnosis of coronary artery disease: CART model”.
In: Computer methods and programs in biomedicine 192 (2020), p. 105400.
[6] Lior Rokach and Oded Maimon. “Top-Down Induction of Decision Trees
Classifiers—A Survey”. In: IEEE Transactions on Systems, Man, and
Cybernetics—Part C 35.4 (2005), pp. 476–487.
[7] Ibomoiye Domor Mienye, Yanxia Sun, and Zenghui Wang. “Prediction
performance of improved decision tree-based algorithms: a review”. In:
Procedia Manufacturing 35 (2019), pp. 698–703.
[8] Yacine Izza, Alexey Ignatiev, and Joao Marques-Silva. “On explaining
decision trees”. In: arXiv preprint arXiv:2010.11034 (2020).
[9] Chintan M. Bhatt, Parth Patel, Tarang Ghetia, and Pier Luigi Mazzeo.
“Effective Heart Disease Prediction Using Machine Learning Techniques”.
In: Journal of Medical Systems 45.12 (2021), pp. 1–12.
[10] Anonymous. “Heart Disease Prediction Using Machine Learning”. In:
Journal of Medical Research 9 (2020), pp. 659–662.
[11] K Usha Rani. “Analysis of heart diseases dataset using neural network
approach”. In: arXiv preprint arXiv:1110.2626 (2011).
[12] Oded Z Maimon and Lior Rokach. Data mining with decision trees: theory
and applications. Vol. 81. World Scientific, 2014.
[13] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. “KNN
model-based approach in classification”. In: On The Move to Meaningful
Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated
International Conferences, CoopIS, DOA, and ODBASE 2003, Catania,
Sicily, Italy, November 3-7, 2003. Proceedings. Springer. 2003, pp. 986–
996.
[14] GM Cramer, RA Ford, and RL Hall. “Estimation of toxic hazard—a
decision tree approach”. In: Food and cosmetics toxicology 16.3 (1976),
pp. 255–276.
[15] K Krishna and M Narasimha Murty. “Genetic K-means algorithm”. In:
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cyber-
netics) 29.3 (1999), pp. 433–439.