
Heart Disease Prediction Using Machine Learning

A Mini-Project Report Submitted in Partial Fulfillment of the Requirements
for the Award of the Degree of

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE AND ENGINEERING

Submitted by

Bandi Chandu 22885A0523


Jakkani Prabhas 21881A05M6
Dyavarashetty Shailnath 21881A05L9

SUPERVISOR
Dr. D. Raghavendra Gowda
Associate Professor

Department of Computer Science and Engineering

May, 2024
Department of Computer Science and Engineering

CERTIFICATE

This is to certify that the project titled Heart Disease Prediction Using
Machine Learning is carried out by

Bandi Chandu 22885A0523


Jakkani Prabhas 21881A05M6
Dyavarashetty Shailnath 21881A05L9

in partial fulfillment of the requirements for the award of the degree of


Bachelor of Technology in Computer Science and Engineering during
the year 2022-23.

Signature of the Supervisor Signature of the HOD


Dr. D. Raghavendra Gowda          Dr. K. Ramesh
Associate Professor Professor and Head, CSE

Kacharam (V), Shamshabad (M), Ranga Reddy (Dist.)–501218, Hyderabad, T.S.


Ph: 08413-253335, 253201, Fax: 08413-253482, www.vardhaman.org
Acknowledgement

The satisfaction that accompanies the successful completion of this task
would be incomplete without the mention of the people who made it possible,
whose constant guidance and encouragement crowned all the efforts with
success.

We wish to express our deep sense of gratitude to Dr. D. Raghavendra
Gowda, Associate Professor and Project Supervisor, Department of Computer
Science and Engineering, Vardhaman College of Engineering, for his able
guidance and useful suggestions, which helped us in completing the project
in time.

We are particularly thankful to Dr. K. Ramesh, Head of the Department
of Computer Science and Engineering, for his guidance, intense support and
encouragement, which helped us to mould our project into a successful one.

We are grateful to our honorable Principal, Dr. J.V.R. Ravindra, for
providing all facilities and support.

We avail this opportunity to express our deep sense of gratitude and
heartfelt thanks to Dr. Teegala Vijender Reddy, Chairman, and Sri Teegala
Upender Reddy, Secretary, of VCE for providing a congenial atmosphere to
complete this project successfully.

We also thank all the staff members of the Computer Science and Engineering
department for their valuable support and generous advice. Finally, thanks
to all our friends and family members for their continuous support and
enthusiastic help.

Bandi Chandu
Jakkani Prabhas
Dyavarashetty Shailnath

Abstract

Currently, a large number of individuals are affected by heart disease, resulting in over 15 million deaths each year. Heart disease can cause conditions
like heart failure, coronary artery disease, and arrhythmias, which all impact
the heart’s functionality. Often, heart disease is diagnosed too late, leading
to a higher mortality rate. Predicting the disease or potential heart failure
early can lower the death rate and save lives by allowing for timely medical
intervention. Machine learning and data mining are essential for predicting,
analyzing, and processing medical data. Data mining techniques organize and
structure data, making it easier to understand and use for disease prediction.
Various machine learning algorithms, like k-Nearest Neighbors (KNN) and
k-means clustering, are used to predict heart disease. KNN classifies data
points based on their closeness to other points, though it can struggle with
large datasets. K-means clustering groups data into k predefined clusters but
has challenges, including the need for a predefined number of clusters, high
computational costs, and complexity in implementation.

Our system aims to predict heart disease using the decision tree algorithm,
striving for accurate and reliable predictions while addressing the limitations
of other methods. The decision tree algorithm splits data features based on
information gain, entropy, and the Gini index, using the feature with the
highest information gain for predictions.

Keywords: decision tree algorithm; information gain; entropy; Gini index

Table of Contents

Acknowledgement
Abstract
List of Tables
List of Figures
Abbreviations
CHAPTER 1 Introduction
    1.0.1 Decision Tree
CHAPTER 2 Literature Survey
CHAPTER 3 CHALLENGES
    3.0.1 Missing Information
    3.0.2 Overcomplicating Things
    3.0.3 Non-Linear Relationships
    3.0.4 Understanding the Results
    3.0.5 Imbalanced Data
    3.0.6 Noise and Outliers
    3.0.7 Correlated Features
    3.0.8 Lack of Domain Knowledge
    3.0.9 Handling Rare Events
CHAPTER 4 PROPOSED METHODOLOGY
    4.1 Proposed System
        4.1.1 No Predefined Number of Clusters
        4.1.2 Computational Efficiency
        4.1.3 No Initialization of Centroids
    4.2 Methodology
        4.2.1 Model
        4.2.2 Data Collection
        4.2.3 Data Preprocessing
        4.2.4 Data Splitting
        4.2.5 Model
        4.2.6 Model Feature Splitting
CHAPTER 5 RESULT AND ANALYSIS
    5.1 DATASET
    5.2 IMPLEMENTATION DETAILS
        5.2.1 Python
        5.2.2 LIBRARIES (MODULES)
        5.2.3 NumPy
        5.2.4 Pandas
        5.2.5 Scikit-Learn
        5.2.6 Google Colab
    5.3 RESULTS
        5.3.1 Source Code
        5.3.2 OUTPUT
    5.4 COMPARISON
CHAPTER 6 CONCLUSION
REFERENCES
List of Tables

5.1 Description of Dataset
5.2 Performance of Different Algorithms
List of Figures

1.1 Decision tree structure
4.1 Flow of decision tree algorithm
4.2 Flow of decision tree algorithm
5.1 Output
5.2 k-means clustering
5.3 Decision tree
Abbreviations

Abbreviation Description

PLS-DA Partial Least Squares Discriminant Analysis

DTs Decision Trees

KNN K-Nearest Neighbours


CHAPTER 1

Introduction

Heart disease is a major global health issue, responsible for over 15 million
deaths annually. It includes various conditions like heart failure, coronary
artery disease, and arrhythmias, all of which impair the heart’s ability to
function properly. Heart failure occurs when the heart can’t pump enough
blood, coronary artery disease involves the narrowing of the arteries supplying
blood to the heart, and arrhythmias are irregular heartbeats that can lead to
severe complications, such as stroke or cardiac arrest. These conditions not
only reduce the quality of life but also pose significant risks to life itself.

One of the biggest challenges in treating heart disease is that it often goes
undetected until it reaches an advanced stage. Many people do not experience
symptoms until significant damage has already occurred, making the disease
more difficult to treat effectively. Late detection limits treatment options
and reduces the chances of successful intervention, contributing to the high
mortality rate associated with heart conditions. This delay in diagnosis often
means that by the time patients seek help, their condition has progressed to
a point where fewer and less effective treatment options are available.

Early detection and prediction of heart disease can drastically improve survival
rates and patient outcomes. By identifying individuals at risk before they show
symptoms, healthcare providers can implement preventive measures such as
lifestyle changes, medications, and sometimes surgical procedures to manage
risk factors like high blood pressure and cholesterol. Advances in medical
technology and predictive analytics are enhancing our ability to detect heart
disease earlier and more accurately. Widespread adoption of early detection
programs could significantly reduce the number of deaths from heart disease,
enabling timely medical interventions that save lives and improve health outcomes.

Key indicators for predicting heart disease include a range of clinical and
lifestyle factors such as blood pressure, cholesterol levels, old peak (ST depres-
sion induced by exercise relative to rest), exercise habits, blood sugar levels,
and heart rate. By analyzing these variables, machine learning models can
predict the likelihood of heart disease with significant accuracy.
Machine learning and data mining play a major role in predicting, analyzing,
and processing medical data. Data mining techniques help gather and
structure data, making it easier to understand and use in disease prediction.
Several machine learning algorithms are utilized in the prediction of heart
disease, including k-Nearest Neighbors (KNN) and k-means clustering. KNN
is a simple, yet effective, algorithm that classifies data points based on their
proximity to other data points. However, it may not perform well with large
datasets. K-means clustering, on the other hand, partitions data into k pre-
defined clusters, but it also has limitations such as the need for a predefined
number of clusters, high computational cost, and implementation complexity.

In our proposed system, we aim to predict heart disease using the deci-
sion tree algorithm. The main goal of our system is to achieve accurate
and reliable predictions while overcoming the limitations of other algorithms.
The decision tree algorithm splits data features using information gain, en-
tropy, and the Gini index. The core concept of the decision tree algorithm
is to consider the feature with the highest information gain to make predictions.

1.0.1 Decision Tree:


Decision trees are hierarchical structures used for decision-making in clas-
sification and regression tasks, constructed recursively based on criteria like
information gain, entropy, or Gini index. Information gain measures the re-
duction in uncertainty about the outcome when splitting data, while entropy
quantifies the randomness or impurity of a dataset. The Gini index measures
the impurity of a dataset by evaluating the probability of misclassifying a
randomly chosen element. These metrics guide the tree-building process, resulting
in interpretable models widely used in machine learning for their simplicity
and effectiveness.

Figure 1.1: Decision tree structure

In a decision tree, there are three primary components: the root node,
decision nodes, and leaf nodes. These components work together to classify
data and make predictions.
Root Node:
The root node is the topmost node and the starting point of the decision
tree. To select the root node, the algorithm calculates the information gain
and entropy for each attribute in the dataset. Information gain assesses how
well an attribute separates the data into distinct classes, while entropy mea-
sures the uncertainty or disorder in the data. The attribute with the highest
information gain is chosen as the root node, as it provides the best split to
reduce uncertainty and separate the data effectively.
Decision Nodes:
Once the root node is established, the algorithm identifies the decision nodes.
These nodes further split the data into subsets. For each subset created by
the root node, the algorithm calculates the information gain for the remaining
attributes. The attribute with the highest information gain in each subset is
selected as the next decision node. This process repeats recursively, with each
decision node creating new branches based on the attribute that best separates
the data at that point. The branching continues until the data is adequately
split into homogeneous subsets or another stopping criterion is met.
Leaf Nodes:
Leaf nodes are the endpoints of the decision tree and represent the final
predictions or classifications. Unlike decision nodes, leaf nodes do not split
further. In a classification problem, each leaf node corresponds to a specific
class label, determined by the majority class of the data points that reach
that node. In a regression problem, each leaf node represents a continuous
value, typically calculated as the average of the data points within that node.
The leaf nodes provide the ultimate output of the decision tree, concluding
the decision-making process.

The decision tree algorithm is an effective method used in machine learning
for classification and regression. The process begins by selecting the root
node, which is the feature that provides the highest information gain or the
most significant reduction in uncertainty. Information gain measures how well
a feature separates the data into target classes, and by selecting the feature
with the highest gain, the algorithm ensures the initial split of the data is as
informative as possible.
After establishing the root node, the algorithm continues by recursively se-
lecting decision nodes. At each step, it evaluates the remaining features to
find the one that offers the highest information gain for the current subset of
data. This process repeats, with each decision node splitting the data further
into more specific subsets. The repeated splitting creates branches of the tree,
with each branch representing a series of decisions based on different features.
This method ensures that the tree becomes more refined and precise in its
predictions as it grows.



The final stage of the process involves forming leaf nodes, which provide the
ultimate classifications or predictions. Each leaf node represents an outcome,
whether it is a class label in classification tasks or a numerical value in
regression tasks. The structure of decision trees—comprising the root node,
decision nodes, and leaf nodes—makes them straightforward and easy to in-
terpret. Each path from the root to a leaf node shows a clear sequence of
decisions, enabling users to understand how a particular prediction is reached.
This transparency is one of the main advantages of decision trees, as it allows
users to see and trust the logic behind the model’s predictions.
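
To make the split-selection step concrete, here is a brief sketch in Python of how entropy and the information gain of a candidate binary split can be computed. The function names and the toy labels are illustrative, not taken from the project's source code:

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum over classes of p_i * log2(p_i).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting `parent` into two subsets.
    w_left, w_right = len(left) / len(parent), len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

# Toy labels: 1 = heart disease, 0 = healthy.
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 1]), np.array([0, 0, 0, 0])
print(information_gain(parent, left, right))  # 1.0: a perfectly informative split

A split like this one, which yields two pure subsets, achieves the maximum possible gain and would therefore be chosen over any less informative split.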



CHAPTER 2

Literature Survey

Reetu Singh and E. Rajesh proposed an approach to the prediction of heart
disease by clustering and classification techniques [1]. This algorithm divides similar
data into clusters by initializing centroids, assigning data points to the nearest
centroid, and then updating the centroids iteratively. While the accuracy of
this method is generally good, it tends to decrease as the number of clusters
increases.
Limitations: the k-means algorithm requires predefining the number of clusters,
involves an expensive computational process, and has lower accuracy compared
to other algorithms.
Chintan M. Bhatt, Parth Patel, Tarang Ghetia, and Pierluigi Mazzeo pro-
posed a system for effective heart disease prediction using machine learning
techniques[2]. The main goal of their system was to compare various machine
learning algorithms with and without cross-validation. Their findings indicated
that all algorithms performed well, with multilayer perceptron combined with
cross-validation achieving the highest accuracy.
Bahzad Taha Jijo and Adnan Mohsin Abdulazeez proposed "Classification
Based on Decision Tree Algorithm for Machine Learning" [3]. Decision tree
classifiers are widely used for data classification across various fields, including
medical diagnosis and text classification. This paper thoroughly examines de-
cision tree algorithms, evaluating their methodologies, datasets, and results. It
also discusses the findings to identify the most accurate classifiers and analyzes
the use of different datasets.
M. Veera Krishna and S. Prem Kumar proposed various data mining tech-
niques, including KNN, K-means, Apriori, and PLS-DA. They compared these
methods based on performance metrics such as computational time, positive
precision values, and negative precision values. The results indicated that

6
PLS-DA outperformed the other techniques across all performance metrics.
However, PLS-DA is noted for its complexity, making it challenging to imple-
ment and use[4].
Mohammad M. Ghiasi, Sohrab Zendehboudi, and Ali Asghar Mohsenipour
noted that traditional methods for diagnosing coronary artery disease
(CAD) can be costly and carry health risks. Decision trees (DTs) offer an
interpretable and cost-effective alternative. Recent advances in DT algorithms
and PI-explanations enhance their accuracy and simplicity, making them viable
tools for CAD diagnosis. Studies show that DTs can effectively complement
traditional diagnostic approaches[5].
Lior Rokach and Oded Maimon conducted a
comprehensive survey on decision trees, a popular method for classifying data.
Their research delves into various techniques for constructing decision trees,
emphasizing key concepts like information gain, entropy, and the Gini index.
Their survey offers valuable insights into the methods used to effectively split
data and simplify decision tree structures[6].
Ibomoiye Domor Mienye, Yanxia Sun, and Zenghui Wang reviewed the
prediction performance of improved decision tree-based algorithms. Their
review surveys enhanced decision tree variants and compares how these
improvements affect predictive performance across different datasets [7].
Yacine Izza, Alexey Ignatiev, and Joao Marques-Silva showed that decision
trees (DTs) can be less interpretable than previously thought, as their
decision paths can be unnecessarily long. To address this,
researchers propose PI-explanations, which identify the minimal set of features
necessary for a prediction. This new model allows for polynomial-time compu-
tation of one PI-explanation and can enumerate all PI-explanations by reducing
the task to finding minimal hitting sets. Experiments demonstrate that DTs
often include extraneous features in their paths, validating the efficiency of
PI-explanations[8].
According to recent studies, machine learning has notably advanced the diagnosis and prognosis of cardiovascular disease. Authors Chintan M. Bhatt,
Parth Patel, Tarang Ghetia, and Pier Luigi Mazzeo have demonstrated the
efficacy of random forest, decision tree classifiers, multilayer perceptron, and
XGBoost in accurately predicting heart disease. For instance, the decision tree
classifier achieved an impressive 84.37% accuracy with cross-validation. These
findings underscore the interpretability and practicality of decision tree models
in clinical contexts, suggesting their valuable role in improving cardiovascular
disease diagnostics[9].

Heart disease prediction is crucial in healthcare due to its complexity and
mortality rates. Authors Apurb Rajdhan, Avi Agarwal, and Dundigalla Ravi
employ the UCI heart disease dataset to predict risk levels using Naive Bayes,
Decision Tree, Logistic Regression, and Random Forest. The Decision Tree
algorithm is particularly noted for its effectiveness alongside other methods.
emphasizing the utility of decision tree-based models in improving diagnostic
outcomes [10].
Heart disease prediction is crucial, with various studies exploring classification
techniques for better accuracy. K. Usha Rani’s research on neural networks for
heart disease classification shows their effectiveness due to their ability to model
complex data and their efficiency with parallel training. Our project uses the
decision tree algorithm, valued for its intuitive structure, interpretability, and
ability to handle different data types. By comparing decision trees with neural
networks, we aim to analyze their effectiveness in heart disease prediction.
This comparison emphasizes the decision tree’s transparency and practical use
in medical decision-making[11].
The book ”Data Mining with Decision Trees: Theory and Applications” delves
into the principles and practical uses of decision tree algorithms in data
analysis. It illustrates their effectiveness in tasks like medical diagnosis, which
aligns closely with our project focused on predicting heart disease using decision
trees. This resource offers valuable insights and methodologies that inform
our approach to leveraging decision tree models for accurate classification in
healthcare applications[12].



The K-Nearest Neighbors (KNN) model-based approach in classification is
widely studied for its effectiveness in pattern recognition and classification
tasks. This method is pertinent to our project on heart disease prediction,
as KNN algorithms excel in identifying similarities between data points and
making predictions based on neighboring instances in the dataset[13].
The study titled 'Estimation of toxic hazard—a decision tree approach'
explores the use of decision trees to evaluate and predict toxic hazards,
integrating environmental and chemical data. This approach resonates with
our project on heart disease prediction using decision trees, highlighting the
method’s utility in analyzing diverse datasets to anticipate health-related risks
and outcomes [14].
The Genetic K-means algorithm integrates genetic algorithm techniques with
the K-means clustering method to enhance the optimization of cluster centroids.
This hybrid approach is particularly relevant to our project on heart disease
prediction, offering a method to refine clustering outcomes iteratively by
adapting centroids based on genetic-inspired processes[15].



CHAPTER 3

CHALLENGES

3.0.1 Missing Information:


Decision trees can struggle when there are gaps in the data, which is
common in medical records. This can lead to inaccurate predictions and
reduced performance.

3.0.2 Overcomplicating Things:


Decision trees can become too complex and detailed, which can make them
less accurate when applied to new, unseen data.

3.0.3 Non-Linear Relationships:


Decision trees have limitations when it comes to understanding complex,
non-linear relationships between different factors and heart disease. This can
lead to reduced accuracy and performance.

3.0.4 Understanding the Results:


While decision trees provide a clear and transparent way of making deci-
sions, it can be difficult to interpret the results, especially for those without
a technical background.

3.0.5 Imbalanced Data:


Heart disease datasets often have an unequal number of healthy and
unhealthy patients, which can cause decision trees to favor the majority group
and lead to biased results.

3.0.6 Noise and Outliers:
Decision trees can be sensitive to noisy or outlier data, which can nega-
tively impact the accuracy of the model.

3.0.7 Correlated Features:


Decision trees can struggle with highly correlated features, which can lead
to reduced accuracy and increased complexity.

3.0.8 Lack of Domain Knowledge:


Decision trees may not incorporate domain knowledge or expertise, which
can lead to suboptimal decision-making.


3.0.9 Handling Rare Events:


Decision trees can struggle with rare events or outcomes, which can lead
to inaccurate predictions and reduced performance.

In our project report focused on predicting heart disease using decision trees,
several key challenges have been identified and addressed. Firstly, dealing
with missing data in medical records poses a significant obstacle. Medical
datasets often contain gaps or incomplete entries, which can adversely affect
the decision tree’s ability to accurately classify patients based on their health
status. To mitigate this issue, rigorous preprocessing steps such as imputation
techniques or data augmentation methods have been employed to ensure robust
model performance and minimize the impact of missing information.
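
As a minimal sketch of the imputation step mentioned above (the column names here are hypothetical placeholders, not the project's actual schema), missing values can be filled with scikit-learn's SimpleImputer:

import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps in two clinical columns.
records = pd.DataFrame({"chol": [233.0, None, 250.0], "trtbps": [145.0, 130.0, None]})

# Replace each missing entry with the column mean before training the tree.
imputer = SimpleImputer(strategy="mean")
records[["chol", "trtbps"]] = imputer.fit_transform(records[["chol", "trtbps"]])
print(records)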



Another critical challenge lies in managing the complexity of decision trees.
While decision trees are renowned for their interpretability, they can become
overly complex when attempting to capture intricate relationships within the
data. This complexity can lead to overfitting, where the model performs excep-
tionally well on training data but fails to generalize effectively to new, unseen
patient data. Strategies such as pruning techniques and parameter tuning have
been implemented to optimize model complexity, striking a balance between
predictive accuracy and generalizability. Additionally, efforts have been made
to explore ensemble methods that combine multiple decision trees to improve
overall performance and address the inherent limitations of single decision tree
models in capturing non-linear relationships and interpreting results effectively
for stakeholders across various domains.



CHAPTER 4

PROPOSED METHODOLOGY

4.1 Proposed System:


Our proposed system enhances heart disease prediction by leveraging a
decision tree algorithm, which addresses the limitations and builds upon
the strengths of existing methods. Decision trees present several significant
advantages over clustering algorithms such as k-means, including:

4.1.1 No Predefined Number of Clusters:


The decision tree algorithm does not require the predefined number of
groups or clusters for prediction or data splitting. Instead, it utilizes metrics
like information gain, entropy, or Gini index to determine the best features
for accurate data partitioning and prediction.

4.1.2 Computational Efficiency:


The decision tree algorithm is computationally efficient, particularly with
large datasets, and demonstrates enhanced efficacy in medical prediction tasks.
Its streamlined computational requirements make it well-suited for handling
extensive datasets, while its predictive capabilities remain robust and accurate
in medical scenarios.

4.1.3 No Initialization of Centroids:

The decision tree algorithm operates differently from k-means and doesn't
rely on centroids. Instead, it automatically identifies the most important fea-
tures for dividing the data. This approach results in a model that’s not only
more precise but also highlights crucial health indicators for heart disease.

This makes the predictions easier to understand and interpret.
In our proposed system for enhancing heart disease prediction, we leverage the
decision tree algorithm due to its inherent advantages over traditional clus-
tering methods like k-means. Decision trees offer flexibility by not requiring
a predefined number of clusters, unlike k-means, which necessitates specifying
the number of groups beforehand. Instead, decision trees utilize metrics such
as information gain, entropy, or Gini index to autonomously determine optimal
features for data partitioning and prediction. This adaptability is particularly
advantageous in medical contexts, where the relationships between various
health indicators and heart disease can be complex and multifaceted.

Furthermore, the decision tree algorithm excels in computational efficiency,
making it suitable for handling large-scale datasets commonly found in medical
research. Its streamlined computational requirements ensure efficient process-
ing of extensive data while maintaining robust predictive accuracy. Unlike
k-means, which relies on centroid initialization and iterative adjustments, deci-
sion trees identify critical features directly from the data, enhancing precision
in predicting heart disease outcomes and providing actionable insights into im-
portant health indicators. By leveraging these strengths, our project aims to
develop a reliable and efficient tool for early detection and prediction of heart
disease, contributing to improved clinical decision-making and patient outcomes.

4.2 Methodology:
In this section, we outline the methodology employed to achieve the objectives
of this project. Figure 4.1 shows the data flow and workflow of the decision
tree algorithm. We will detail each step involved, from data collection and
preprocessing to model training, evaluation, and deployment.

Figure 4.1: Flow of decision tree algorithm

4.2.1 Model

The primary aim of our approach is to enhance heart disease prediction
accuracy using the decision tree algorithm. A decision tree is a popular
machine learning algorithm used for both classification and regression tasks.
It's called a "tree" because it's structured like a flowchart, with branches
representing decisions and leaves representing outcomes.



Figure 4.2: Flow of decision tree algorithm

4.2.2 Data Collection


Gathering pertinent information about heart disease from diverse sources is
crucial for enabling the algorithm to effectively predict the occurrence of heart
disease. This information encompasses a wide array of medical data, including
but not limited to demographic details, lifestyle factors, medical history, and
physiological indicators. By compiling comprehensive datasets from reputable
sources such as medical journals, research studies, and healthcare databases,
we aim to incorporate a rich set of features that are known to influence heart
health. These features may include age, gender, blood pressure, cholesterol
levels, body mass index (BMI), smoking status, family history of heart disease,
exercise habits, dietary patterns, and symptoms indicative of cardiovascular
issues.



4.2.3 Data Preprocessing
Cleaning and structuring the data involves several steps to ensure its quality
and integrity for use in predictive modeling. This process includes identifying
and handling duplicates and missing values effectively.
Duplicate Removal: Identifying and removing duplicate records from the
dataset ensures that each observation is unique and contributes meaningfully
to the analysis. Duplicates make the dataset impure and lead to incorrect
predictions.
Proper Structure: To convert raw data into a meaningful format for heart
disease prediction, we organize it by removing duplicates, handling missing
values, and standardizing variables. We engineer new features and encode
categorical data into numerical format. After normalization and scaling, we
organize the dataset into a structured format, ensuring clear documentation
and conducting exploratory data analysis to understand relationships between
variables.
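
A minimal sketch of these preprocessing steps is shown below, assuming a pandas DataFrame with columns like those in Table 5.1; the file name and the scaler choice are illustrative assumptions, not taken from the project's source code:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart_disease_data.csv")  # assumed file name

# Remove duplicate records and fill numeric gaps with column medians.
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Scale the continuous features so no attribute dominates by magnitude.
continuous = ["age", "trtbps", "chol", "thalach", "oldpeak"]
df[continuous] = StandardScaler().fit_transform(df[continuous])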

4.2.4 Data Splitting


Data splitting in machine learning involves dividing a dataset into training,
validation, and test sets, each serving distinct purposes in model development
and evaluation. The training dataset constitutes the largest portion, used to
train the model by exposing it to labeled examples and adjusting its pa-
rameters through iterative learning algorithms. This phase enables the model
to discern patterns and relationships within the data, essential for making
accurate predictions on new inputs.

Following training, the validation dataset plays a crucial role in fine-tuning
the model's hyperparameters and optimizing its performance. By evaluating
the model’s predictions on validation data that it hasn’t seen during training,
practitioners can adjust parameters such as learning rate or regularization
strength to enhance predictive accuracy and prevent overfitting—where the
model performs well on training data but fails to generalize to unseen exam-
ples. This iterative process of training and validation allows for refinement
of the model until satisfactory performance metrics are achieved.

Finally, the test dataset remains entirely separate from both training and
validation phases, serving as a final checkpoint to assess the model’s ability
to generalize. By evaluating the model on unseen data, practitioners gain
confidence in its ability to perform accurately in real-world scenarios. This
rigorous evaluation process ensures that the developed model is robust, reliable,
and capable of making meaningful predictions beyond the specific dataset used
for training, thereby fostering trust and applicability in practical applications
such as healthcare, finance, and engineering.
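
One conventional way to realize this three-way split (the 60/20/20 proportions are an assumption for illustration, not a figure taken from this report) is to call scikit-learn's train_test_split twice:

from sklearn.model_selection import train_test_split

# X, y: the feature matrix and labels prepared in the preprocessing step.
# First carve out a 20% test set, then split the remainder 75/25 to obtain
# roughly 60% training, 20% validation, and 20% test data.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)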

4.2.5 Model
A decision tree is a versatile machine learning algorithm widely applied in
both classification and regression tasks due to its intuitive structure resembling
a flowchart. Each node in the tree represents a decision based on a feature,
with branches leading to subsequent nodes or leaves that signify outcomes or
predictions. This hierarchical structure allows for clear visualization and in-
terpretation of the decision-making process, making decision trees particularly
valuable in domains where transparency and explainability are crucial, such
as healthcare.

In decision tree algorithms, the selection of the best split at each node
is pivotal in optimizing predictive accuracy. Metrics like information gain,
entropy, and the Gini index serve as criteria for evaluating potential splits
based on how well they separate the data into distinct classes or reduce
uncertainty. Information gain measures the reduction in entropy or disorder
achieved by a split, making it essential for identifying the most informative
features.
Entropy quantifies the randomness or impurity of a dataset, while the Gini
index assesses how often a randomly chosen element from the set would be
incorrectly labeled if it were randomly labeled according to the distribution
of labels in the subset. These metrics guide the construction of decision trees
that not only effectively model relationships within data but also facilitate
straightforward interpretation of results, fostering confidence in the algorithm’s
predictions.

4.2.6 Model Feature Splitting:


Figure 4.2 illustrates the model flow of feature splitting in decision tree
construction, where Information Gain, Entropy, and the Gini Index play crucial
roles in determining optimal criteria for partitioning data. Information Gain
quantifies the reduction in uncertainty about the target variable when dividing
data based on a specific feature. It aids in selecting features that maximize
the distinction between classes within the dataset.
Entropy, on the other hand, measures the disorder or randomness present in
data, helping prioritize features that minimize entropy and enhance predictive
accuracy. Meanwhile, the Gini Index evaluates the impurity of data subsets
resulting from feature splits, aiming to minimize misclassification errors in
the constructed decision tree model by calculating the probability of incorrect
classifications based on the distribution of class labels.

Information Gain is calculated as the difference between the entropy of


the original dataset and the weighted average of entropies of its subsets after
splitting. It serves as a metric to assess how effectively a feature separates
data into distinct categories.
Entropy itself is computed based on the probability distribution of class labels,
providing a numerical measure of uncertainty in the dataset. Similarly, the
Gini Index quantifies dataset impurity by evaluating the squared probabilities
of misclassification, guiding the decision tree algorithm in creating splits that
maximize purity and accuracy. These metrics collectively empower decision
trees to efficiently organize and interpret complex data relationships, making
them valuable tools for both classification and regression tasks in various
domains, including healthcare and engineering.



Information Gain: Information gain measures the reduction in entropy
or uncertainty achieved by splitting a dataset based on a particular attribute.

\[ \text{Information Gain} = \text{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\text{Entropy}(S_i) \tag{4.1} \]

Entropy: Entropy is a measure of impurity or disorder in a dataset. In the
context of decision trees, it represents the uncertainty associated with class
labels in a dataset.
\[ \text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2(p_i) \tag{4.2} \]

Gini Index: The Gini index measures the impurity of a dataset by calculating
the probability that a randomly chosen element will be incorrectly classified
according to the distribution of class labels in the dataset.

\[ \text{Gini Index} = 1 - \sum_{i=1}^{c} p_i^2 \tag{4.3} \]
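
As a brief worked example with illustrative numbers (not drawn from the project's dataset), consider a node S holding 10 patients, 6 with heart disease and 4 without:

\[ \text{Entropy}(S) = -0.6\log_2 0.6 - 0.4\log_2 0.4 \approx 0.971, \qquad \text{Gini}(S) = 1 - (0.6^2 + 0.4^2) = 0.48. \]

If a candidate split sends the 6 diseased patients to one pure subset and the 4 healthy patients to another, both child entropies are 0, so the information gain equals the full 0.971 bits and this split would be preferred over any less informative one.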

In our project aimed at predicting heart disease, thorough data collection
involves gathering a wide range of medical information from credible sources
such as medical journals, research studies, and healthcare databases. This
dataset encompasses crucial factors like demographic details, lifestyle habits,
medical history, and physiological indicators known to impact heart health,
including age, gender, blood pressure, cholesterol levels, BMI, smoking status,
family history of heart disease, and symptoms indicative of cardiovascular
issues. Data preprocessing is then essential to ensure dataset integrity, involv-
ing steps like removing duplicates, handling missing values, and standardizing
variables. Categorical data is converted into numerical format, and features
are normalized and scaled to prepare for model training.

Following data preprocessing, the dataset is split into training, validation,
and test sets to facilitate effective model development and evaluation. The
training set is utilized to train our decision tree algorithm, leveraging its ability
to recursively split data based on metrics such as Information Gain, Entropy,
and the Gini Index. These metrics guide the algorithm in selecting optimal
feature splits that maximize predictive accuracy while minimizing uncertainty
and data impurity. The validation set plays a critical role in fine-tuning
model hyperparameters, optimizing performance, and preventing overfitting. It
ensures that the model generalizes well to unseen data, enhancing its reliability
in practical healthcare applications.



CHAPTER 5

RESULT AND ANALYSIS

5.1 DATASET

Feature   Description                                                       Range of Values
Age       age of the patient                                                29-77
sex       sex of the patient                                                0, 1
exang     exercise-induced angina: 1 = yes; 0 = no                          0, 1
cp        chest pain type: 1 = typical angina, 2 = atypical angina,         1, 2, 3, 4
          3 = non-anginal pain, 4 = asymptomatic
trtbps    resting blood pressure (in mm Hg)                                 94-200
chol      cholesterol in mg/dl                                              126-564
fbs       fasting blood sugar: 1 = fbs > 120 mg/dl, 0 = fbs <= 120 mg/dl    0, 1
restecg   resting electrocardiographic results: 0 = normal,                 0, 1, 2
          1 = having ST-T wave abnormality
thalach   maximum heart rate achieved                                       71-202
oldpeak   ST depression induced by exercise relative to rest                0-6.2
slp       slope of the peak exercise ST segment                             0, 1, 2
thall     thalassemia: 0 = null, 1 = fixed defect, 2 = normal,              0, 1, 2, 3
          3 = reversible defect
target    0 = no heart disease, 1 = heart disease                           0, 1

Table 5.1: Description of Dataset

As Table 5.1 shows, datasets designed for heart disease prediction commonly
include several essential attributes that facilitate accurate machine learning
models. These attributes encompass critical physiological and demographic
factors that influence cardiovascular health assessment. For instance, age serves
as a fundamental indicator, with older individuals generally at higher risk.
Gender (sex) is another significant factor, with males typically having a higher
susceptibility to heart disease compared to females.

Other crucial attributes include resting blood pressure, which provides in-
sights into cardiovascular workload, and cholesterol levels, particularly LDL
cholesterol, which correlates with arterial plaque buildup. Fasting blood sugar
levels indicate diabetes risk, impacting heart health. Parameters like resting
heart rate and maximum heart rate during exercise reflect cardiovascular fitness
and the heart’s response to physical stress. The presence of exercise-induced
angina and the type of chest pain provide further diagnostic clues regarding
coronary artery disease.

Additionally, electrocardiogram (ECG) features like ST depression during
exercise and the slope of the peak exercise ST segment are indicative of myocardial
ischemia severity. The presence of thalassemia, a genetic blood disorder af-
fecting oxygen transport, can also influence heart disease risk. Understanding
the ranges and distributions of these attributes within the dataset is crucial
for preprocessing tasks such as normalization, ensuring that each attribute
contributes effectively to model training without bias from varying scales or
outliers.

In machine learning applications, these attributes serve as inputs for algorithms
that predict the presence or absence of heart disease based on historical data.
By analyzing these features, models can identify patterns and relationships
that aid in accurate diagnosis and prognosis. Furthermore, feature selection
techniques help prioritize the most informative attributes, enhancing model
performance and interpretability. This approach empowers healthcare professionals and researchers to develop robust predictive tools that support early
detection and personalized treatment strategies for cardiovascular diseases.

5.2 IMPLEMENTATION DETAILS

5.2.1 Python
Python is a high-level, interpreted programming language known for its read-
ability, simplicity, and versatility. It supports multiple programming paradigms,
including procedural, object-oriented, and functional programming.
Moreover, Python embraces functional programming paradigms, enabling de-
velopers to leverage concepts such as higher-order functions and immutable
data structures. This approach facilitates concise and expressive code, par-
ticularly for tasks involving data manipulation, filtering, and transformation.
Python’s flexibility in accommodating diverse programming styles makes it
an ideal choice for a wide range of applications, from web development and
scientific computing to artificial intelligence and data analysis. Its extensive
libraries and frameworks further extend its capabilities, empowering developers
to tackle complex problems efficiently and effectively.
VERSION: Python 3.10

5.2.2 LIBRARIES (MODULES)

5.2.3 NumPy
NumPy, short for Numerical Python, is a fundamental library for scientific
computing in Python. It provides support for large, multi-dimensional arrays
and matrices, along with a collection of mathematical functions to operate on
these arrays. NumPy is a key component of the scientific Python stack and
is widely used in data science, machine learning, and engineering.

• The ndarray is the core data structure of NumPy, allowing for the creation
of arrays of arbitrary dimensions. These arrays are more efficient and
convenient than Python's built-in lists for numerical operations.

• NumPy includes a wide range of mathematical functions such as
trigonometric, statistical, and algebraic operations that can be performed
on arrays.

VERSION: 1.25.2
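
A small illustrative snippet of the array operations described above:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D ndarray
print(a.mean(axis=0))                 # column means: [2.5 3.5 4.5]
print(np.sin(a))                      # elementwise trigonometric function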

5.2.4 Pandas
Pandas is a powerful and flexible open-source data analysis and manip-
ulation library for Python. It is built on top of NumPy and provides
high-level data structures and functions designed to make data analysis fast
and easy. Pandas is widely used in data science, machine learning, and other
data-intensive fields for its ability to handle and manipulate large datasets
efficiently.
Moreover, Pandas’ extensive set of built-in functions for data manipulation,
aggregation, and visualization further enhances its utility. From handling miss-
ing data to merging datasets and performing group-based operations, Pandas
simplifies intricate data tasks into concise and readable Python code. Its inte-
gration with other Python libraries and tools extends its functionality, making
Pandas a cornerstone in the toolkit of data professionals seeking efficient data
analysis solutions.
VERSION: 2.0.3
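
For instance, on a toy frame (not the project data), typical Pandas operations look like this:

import pandas as pd

toy = pd.DataFrame({"age": [52, 61, 45], "target": [1, 0, 1]})
print(toy.describe())                       # summary statistics per column
print(toy.groupby("target")["age"].mean())  # group-based aggregation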

5.2.5 Scikit-Learn
Scikit-learn (often abbreviated as sklearn) is one of the most popular and
powerful machine learning libraries for Python. It provides simple and efficient
tools for data mining and data analysis, and it is built on NumPy, SciPy, and
matplotlib. Scikit-learn is widely used for building machine learning models
due to its simplicity, flexibility, and comprehensive range of algorithms.



Moreover, Scikit-learn simplifies data preprocessing tasks by offering utilities for
normalization, scaling, and feature extraction. Techniques like StandardScaler
and TF-IDF Vectorization prepare data for model training by standardizing
numerical features and transforming text data into numerical representations
suitable for machine learning algorithms. These preprocessing steps are inte-
gral in ensuring the quality and compatibility of data inputs across different
models and pipelines, enhancing the reliability and interpretability of machine
learning solutions deployed in diverse domains.
VERSION: 1.2.2
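
As a small sketch of the preprocessing utilities mentioned above (toy numbers, illustrative only):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[120.0, 230.0], [145.0, 180.0], [130.0, 250.0]])
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))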

5.2.6 Google Colab


Google Colaboratory, also known as Google Colab, is a free, cloud-based
platform where users can write and run Python code interactively. It’s espe-
cially popular with data scientists, machine learning experts, and educators
because it’s easy to use and packed with powerful features. Built on Jupyter
Notebooks, Colab offers a familiar interface with the added benefits of cloud
computing.
Google Colab, or Google Colaboratory, is a versatile platform that signifi-
cantly aids in the implementation of machine learning projects. It operates
as a cloud-based environment where Python code can be written, executed,
and shared seamlessly through Jupyter notebooks directly in a web browser.
This accessibility is particularly advantageous for machine learning practition-
ers as it eliminates the need for local setup of development environments
and hardware, providing ready access to high-performance GPUs and TPUs
at no cost. These accelerators are crucial for training deep learning models
efficiently, which is often a resource-intensive task.

Furthermore, Colab integrates smoothly with Google Drive, allowing for easy
importation of datasets and storage of model checkpoints and experimental
results. This integration fosters collaboration by enabling teams to work on
the same notebooks simultaneously and share them effortlessly across different
platforms. The platform comes equipped with pre-installed libraries such as
NumPy, Pandas, Matplotlib, and TensorFlow, essential for data manipulation,
visualization, and deep learning tasks. This feature reduces setup time and
enables practitioners to focus more on experimenting with algorithms and
refining models.

5.3 RESULTS

5.3.1 Source Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('/content/heart_disease_data(2).csv')

print(df.head())
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# The helper class and functions below are referenced by the listing but were
# not part of the extracted excerpt; minimal reconstructions are given here
# so the code runs end to end.

class TreeNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature this node splits on
        self.threshold = threshold  # threshold used for the split
        self.left = left            # subtree for samples with feature < threshold
        self.right = right          # subtree for samples with feature >= threshold
        self.value = value          # class label at a leaf; None for internal nodes

def split_dataset(X, y, feature, threshold):
    # Partition the samples by comparing one feature against a threshold.
    left_mask = X[:, feature] < threshold
    return X[left_mask], y[left_mask], X[~left_mask], y[~left_mask]

def gini_impurity(y):
    # Gini index = 1 - sum of squared class probabilities (Equation 4.3).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# DECISION TREE CLASS

class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=float('inf')):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    def fit(self, X, y):
        self.root = self.build_tree(X, y)

    def build_tree(self, X, y, depth=0):
        num_samples, num_features = X.shape
        if num_samples >= self.min_samples_split and depth <= self.max_depth:
            best_split = self.find_best_split(X, y, num_features)
            if best_split:
                left_X, left_y, right_X, right_y = split_dataset(
                    X, y, best_split['feature'], best_split['threshold'])
                left_subtree = self.build_tree(left_X, left_y, depth + 1)
                right_subtree = self.build_tree(right_X, right_y, depth + 1)
                return TreeNode(best_split['feature'], best_split['threshold'],
                                left_subtree, right_subtree)
        leaf_value = self.calculate_leaf_value(y)
        return TreeNode(value=leaf_value)

    def find_best_split(self, X, y, num_features):
        best_split = {}
        best_impurity = float('inf')
        for feature in range(num_features):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_X, left_y, right_X, right_y = split_dataset(X, y, feature, threshold)
                if len(left_y) > 0 and len(right_y) > 0:
                    impurity = self.calculate_impurity(left_y, right_y)
                    if impurity < best_impurity:
                        best_impurity = impurity
                        best_split = {"feature": feature, "threshold": threshold}
        return best_split if best_split else None

    def calculate_impurity(self, left_y, right_y):
        # Weighted Gini impurity of the two children of a candidate split.
        m = len(left_y) + len(right_y)
        left_impurity = gini_impurity(left_y)
        right_impurity = gini_impurity(right_y)
        return (len(left_y) / m) * left_impurity + (len(right_y) / m) * right_impurity

    def calculate_leaf_value(self, y):
        # Majority class of the samples that reach this leaf.
        return np.bincount(y).argmax()

    def predict(self, X):
        return np.array([self._predict(inputs) for inputs in X])

    def _predict(self, inputs):
        # Walk from the root to a leaf, following the learned thresholds.
        node = self.root
        while node.value is None:
            if inputs[node.feature] < node.threshold:
                node = node.left
            else:
                node = node.right
        return node.value

tree = DecisionTree(min_samples_split=3, max_depth=3)
tree.fit(X, y)

# Note: the listing evaluates on the full training data.
predictions = tree.predict(X)
print(predictions)
accuracy = np.sum(predictions == y) / len(y)
print(f'Accuracy: {accuracy:.4f}')

Listing 5.1: Python code for heart disease prediction using a decision tree

5.3.2 OUTPUT:

Figure 5.1: OUTPUT



Algorithm Accuracy
Decision Tree 0.868
k-means 0.701

Table 5.2: Performance of Different Algorithms

5.4 COMPARISON
Table 5.2 shows that the k-means algorithm often yields lower accuracy compared
to the decision tree due to inherent drawbacks. These include the need for
predefined clusters and computationally expensive processes. Decision trees,
on the other hand, overcome these limitations by offering greater flexibility
and interpretability. They can adapt to the data’s structure more effectively,
resulting in improved accuracy and performance in many cases.
As shown in Figure 5.3, decision trees, a staple of supervised learning, construct
hierarchical decision rules based on dataset features to optimize accuracy, often
through maximizing information gain, reducing entropy, or minimizing the Gini
index at each split. While decision trees provide interpretability, their accuracy
may vary, especially with complex datasets, continuous variables, or outliers.
Comparatively, k-means clustering, an unsupervised algorithm, partitions data
into clusters based on similarity to enhance accuracy, but its performance
hinges on selecting an appropriate number of clusters (k) and initial centroids.
As shown in Figure 5.2, k-means clustering is an unsupervised algorithm for grouping
data based on similarity. It’s useful for exploring data and finding patterns,
especially in large datasets. However, determining the optimal number of
clusters beforehand can be tricky, and the algorithm’s performance depends
on the initial centroid selection.
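
A hedged sketch of how such a comparison can be reproduced with scikit-learn is shown below. It assumes X and y are the feature matrix and labels prepared as in Section 4.2, and it maps each k-means cluster to the majority training label it contains so that cluster assignments can be scored as classifications:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("decision tree:", accuracy_score(y_test, tree.predict(X_test)))

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
# Map each cluster to the majority training label it contains.
mapping = {c: np.bincount(y_train[km.labels_ == c]).argmax() for c in range(2)}
km_pred = np.array([mapping[c] for c in km.predict(X_test)])
print("k-means:", accuracy_score(y_test, km_pred))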



Figure 5.2: k-means clustering

Figure 5.3: Decision tree



CHAPTER 6

CONCLUSION

In conclusion, the decision tree algorithm emerges as the favored approach
for predicting heart disease when compared to the k-means algorithm. Its
proficiency in handling intricate relationships and accommodating missing data
lends valuable insights to healthcare practitioners, facilitating a deeper under-
standing of the factors influencing heart disease. With a notable accuracy
rate of about 87 percent demonstrated in the study using the CART algorithm,
the decision tree proves its efficacy in heart disease prediction, distinguishing
itself from the k-means algorithm, which is better suited for clustering tasks
rather than classification.

Furthermore, the decision tree algorithm demonstrates superior efficiency
in managing large datasets and robustness against outliers and noisy data,
commonly found in medical datasets. Its transparent decision-making process
simplifies result interpretation, while its scalability and cost-effectiveness make
it a practical choice for integration into existing healthcare systems. Addition-
ally, its capability to identify high-risk patients for early intervention, pinpoint
significant risk factors, and inform personalized treatment plans underscores its
pivotal role in enhancing patient care and shaping public health policies. This
highlights its importance in driving advancements in treatments and therapies
for heart disease while overcoming the limitations associated with the k-means
technique, such as the need for predefined clusters, costly computations, and
slower processing speeds, all while achieving greater accuracy.

REFERENCES

[1] Aristidis Likas, Nikos Vlassis, and Jakob J Verbeek. “The global k-means
clustering algorithm”. In: Pattern recognition 36.2 (2003), pp. 451–461.
[2] Devansh Shah, Samir Patel, and Santosh Kumar Bharti. “Heart disease
prediction using machine learning techniques”. In: SN Computer Science
1.6 (2020), p. 345.
[3] Bahzad Charbuty and Adnan Abdulazeez. “Classification based on deci-
sion tree algorithm for machine learning”. In: Journal of Applied Science
and Technology Trends 2.01 (2021), pp. 20–28.
[4] K.R. Lakshmi, M. Veera Krishna, and S. Prem Kumar. “Performance
Comparison of Data Mining Techniques for Predicting of Heart Disease
Survivability”. In: International Journal of Scientific Research Publications
3.6 (2018), pp. 476–487.
[5] Mohammad M Ghiasi, Sohrab Zendehboudi, and Ali Asghar Mohsenipour.
“Decision tree-based diagnosis of coronary artery disease: CART model”.
In: Computer methods and programs in biomedicine 192 (2020), p. 105400.
[6] Lior Rokach and Oded Maimon. “Top-Down Induction of Decision Trees
Classifiers—A Survey”. In: IEEE Transactions on Systems, Man, and
Cybernetics—Part C 35.4 (2005), pp. 476–487.
[7] Ibomoiye Domor Mienye, Yanxia Sun, and Zenghui Wang. “Prediction
performance of improved decision tree-based algorithms: a review”. In:
Procedia Manufacturing 35 (2019), pp. 698–703.
[8] Yacine Izza, Alexey Ignatiev, and Joao Marques-Silva. “On explaining
decision trees”. In: arXiv preprint arXiv:2010.11034 (2020).
[9] Chintan M. Bhatt, Parth Patel, Tarang Ghetia, and Pier Luigi Mazzeo.
“Effective Heart Disease Prediction Using Machine Learning Techniques”.
In: Journal of Medical Systems 45.12 (2021), pp. 1–12.
[10] Apurb Rajdhan, Avi Agarwal, and Dundigalla Ravi. “Heart Disease
Prediction Using Machine Learning”. In: Journal of Medical Research 9
(2020), pp. 659–662.
[11] K Usha Rani. “Analysis of heart diseases dataset using neural network
approach”. In: arXiv preprint arXiv:1110.2626 (2011).
[12] Oded Z Maimon and Lior Rokach. Data mining with decision trees: theory
and applications. Vol. 81. World scientific, 2014.
[13] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. “KNN
model-based approach in classification”. In: On The Move to Meaningful
Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated
International Conferences, CoopIS, DOA, and ODBASE 2003, Catania,
Sicily, Italy, November 3-7, 2003. Proceedings. Springer. 2003, pp. 986–
996.
[14] GM Cramer, RA Ford, and RL Hall. “Estimation of toxic hazard—a
decision tree approach”. In: Food and cosmetics toxicology 16.3 (1976),
pp. 255–276.

[15] K Krishna and M Narasimha Murty. “Genetic K-means algorithm”. In:
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cyber-
netics) 29.3 (1999), pp. 433–439.

