Project Report
on
Heart Attack Prediction Using Machine Learning
for the degree of
Master of Computer Application
School of IT
AIMETC Campus, Jalandhar
CERTIFICATE
This is to certify that the project report titled “Heart Attack Prediction Using Machine
Learning” submitted by Mr. Abhishek Chauhan (University Roll Number 2229826) is a
bona fide piece of work conducted under my direct supervision and guidance. No part of this
work has been submitted for any other degree of any other university.
It may be considered for evaluation in partial fulfillment of the requirements for the award of
the degree of Master of Computer Application.
Date:
AIMETC, Jalandhar
DECLARATION
This is to certify that the above statement made by the candidate is correct to the best of my
knowledge.
ABSTRACT
In the medical field, the diagnosis of heart disease is one of the most difficult tasks,
because the decision relies on the analysis of a large amount of clinical and
pathological data. Due to this complexity, interest among researchers and clinical
professionals in efficient and accurate heart disease prediction has increased
significantly. In the case of heart disease, a correct diagnosis at an early stage is
essential, since time is a critical factor. Heart disease is the principal cause of death
worldwide, so predicting it at an early phase is significant. In recent years, machine
learning has evolved into a reliable and supportive tool in the medical domain and,
with proper training and testing, has provided strong support for disease prediction.
The main idea behind this work is to study diverse prediction models for heart
disease and to select important heart disease features using the Random Forest
algorithm. Random Forest is a supervised machine learning algorithm that achieves
higher accuracy than other supervised machine learning algorithms such as logistic
regression. Using the Random Forest algorithm, we predict whether or not a person
has heart disease.
Abhishek Chauhan
ACKNOWLEDGEMENT
Apart from my own efforts, the success of any work depends largely on the encouragement and
guidance of many others. I take this opportunity to express my gratitude to the people who have
been instrumental in the progress of this report. First of all, I sincerely acknowledge my
gratitude to Almighty God for His compassion and bountiful blessings, which allowed me to see
this wonderful moment. I would like to express my deep sense of gratitude to Dr. Avnip Deora,
Dean (School of I.T.), AIMETC's School of IT, for his continuous support of my study and research;
his motivation, enthusiasm and immense knowledge during this work are deeply appreciated.
I would also like to express my deep sense of gratitude for his suggestions. I am highly indebted to
him, as he did not spare any effort to review and audit my report linguistically and technically.
Last but not least, I would like to thank my family for encouraging me not only in this project
work but throughout my life.
TABLE OF CONTENTS

CERTIFICATE                                                          I
DECLARATION                                                          II
ABSTRACT                                                             III
ACKNOWLEDGEMENT                                                      IV

S.No.   CHAPTER TITLE                                                PAGE NO.
1       CHAPTER-1 INTRODUCTION                                       1
        1.1 Introduction
        1.2 Background of study
2       CHAPTER-2 LITERATURE SURVEY                                  3
        2.1 Data source
3       CHAPTER-3 AIM AND SCOPE OF PRESENT INVESTIGATION             6
        3.1 Existing System
        3.2 Proposed System
        3.3 Feasibility Study
            3.3.1 Economic Feasibility
            3.3.2 Technical Feasibility
            3.3.3 Operational Feasibility
4       CHAPTER-4 REQUIREMENT ANALYSIS                               9
5       CHAPTER-5 DATA FLOW DIAGRAM                                  12
6       CHAPTER-6 SCREENSHOTS                                        15
7       CHAPTER-7 INTRODUCTION OF TECHNOLOGIES AND LIBRARIES         24
        USED IN PROJECT
        7.1 Technologies
        7.2 Libraries
        7.3 Algorithms
        7.4 System Architecture
        7.5 Modules
8       CHAPTER-8 IMPLEMENTATION                                     35
9       BIBLIOGRAPHY                                                 37
Chapter 1
INTRODUCTION
1.1 Introduction
The heart is a muscular organ that pumps blood through the body and is the central part
of the body's cardiovascular system, which also includes the lungs. The cardiovascular system
also comprises a network of blood vessels, for example, veins, arteries, and capillaries. These
blood vessels deliver blood all over the body. Abnormalities in the normal blood flow from the
heart cause several types of heart diseases, which are commonly known as cardiovascular
diseases (CVD). Heart diseases are the main cause of death worldwide. According to a
survey of the World Health Organization (WHO), 17.5 million global deaths occur
because of heart attacks and strokes. More than 75% of deaths from cardiovascular diseases
occur in middle-income and low-income countries. Also, 80% of the deaths that occur
due to CVDs are because of stroke and heart attack. Therefore, prediction of cardiac
abnormalities at an early stage, and tools for the prediction of heart diseases, can save many
lives and help doctors design effective treatment plans, which ultimately reduces the
mortality rate due to cardiovascular diseases.
Due to the development of advanced healthcare systems, large amounts of patient data are nowadays
available (i.e., Big Data in Electronic Health Record systems), which can be used for
designing predictive models for cardiovascular diseases. Data mining or machine learning is
a discovery method for analyzing big data from assorted perspectives and encapsulating it
into useful information. "Data mining is the non-trivial extraction of implicit, previously
unknown and potentially useful information from data."
Nowadays, a huge amount of data pertaining to disease diagnosis, patients, etc. is generated
by healthcare industries. Data mining provides a number of techniques that discover
hidden patterns or similarities in data.
Therefore, in this project, a machine learning algorithm is proposed for the implementation of
a heart disease prediction system, which was validated on two open-access heart disease
prediction datasets. Data mining is the computer-based process of extracting useful
information from enormous sets of databases. Data mining is most helpful in exploratory
analysis because it can surface non-trivial information from large volumes of evidence. Medical data
mining has great potential for exploring the hidden patterns in the data sets of the clinical
domain. These patterns can be utilized for healthcare diagnosis. However, the available raw
medical data are widely distributed, voluminous and heterogeneous in nature. These data need
to be collected in an organized form. The collected data can then be integrated to form a
medical information system. Data mining provides a user-oriented approach to novel and
hidden patterns in the data. Data mining tools are useful for answering business questions
and provide techniques for predicting various diseases in the healthcare field. Disease prediction
plays a significant role in data mining. This report analyzes heart disease prediction using
classification algorithms. These invisible patterns can be utilized for health diagnosis in
healthcare data. Data mining technology affords an efficient approach to new and
previously unknown patterns in the data. The information identified can be used by
healthcare administrators to improve their services. Heart disease is a most crucial cause of
death in countries like India and the United States. In this project we predict heart
disease using classification algorithms. Machine learning classification
algorithms such as Random Forest and Logistic Regression are used to explore different kinds of
heart-based problems.
The heart disease predictor is an offline platform designed and developed to explore the path of
machine learning. The goal is to predict the health of the patient from collected data in order to
detect configurations that put the patient at risk and, in cases requiring
emergency medical assistance, to alert the appropriate medical staff to the patient's situation.
We initially have a dataset collecting information on many patients, from which we can
consolidate the results into a complete form and make precise predictions. The results of the
predictions, derived from the predictive models generated by machine learning, will be
presented through several distinct graphical interfaces according to the datasets considered.
We will then assess the scope of our results. The data have been collected from
Kaggle. Data collection is the process of gathering and measuring information from many
different sources for further use.
Chapter 2
LITERATURE SURVEY
Machine learning techniques are used to analyze and predict from medical data information
resources. Diagnosis of heart disease is a significant and tedious task in medicine. The term
heart disease encompasses the various diseases that affect the heart. Detecting heart
disease from various factors or symptoms is a problem that is not free from false
presumptions and is often accompanied by unpredictable effects. The data classification here is based on
a supervised machine learning algorithm, which results in better accuracy. We use
Random Forest as the training algorithm to train on the heart disease dataset and to predict
heart disease. The results showed that the designed prediction
system is capable of predicting heart attacks successfully. Machine learning techniques
have been used to indicate early mortality by analyzing heart disease patients and their clinical
records (Richards, G. et al., 2001). Sung, S.F. et al. (2015) applied two
machine learning techniques, a k-nearest neighbor model and an existing multiple linear regression, to
predict the stroke severity index (SSI) of patients. Their study showed that k-nearest neighbor
performed better than the multiple linear regression model. Arslan, A. K. et al. (2016)
evaluated various machine learning techniques, such as support vector machine (SVM) and
penalized logistic regression (PLR), to predict heart stroke. Their results show that SVM
produced the best performance in prediction when compared to the other models.
Boshra Brahmi et al. [20] developed different machine learning techniques to evaluate the
prediction and diagnosis of heart disease. The main objective is to evaluate different
classification techniques such as J48, Decision Tree, KNN and Naive Bayes, and then to
evaluate performance measures such as accuracy, precision, sensitivity and specificity.
2.1 Data source:
Clinical databases have collected a significant amount of information about patients and their
medical conditions. Records with medical attributes were obtained from the Cleveland Heart
Disease database. With the help of this dataset, the patterns significant to heart attack
diagnosis are extracted.
The records were split equally into two datasets: a training dataset and a testing dataset. A total of
303 records with 76 medical attributes were obtained. All the attributes are numeric-valued. We
are working on a reduced set of attributes, i.e., only 14 attributes.
The following restrictions were imposed to reduce the number of patterns; they are as follows:
1. The features should appear on a single side of the rule.
2. The rule should separate the various features into different groups.
3. The number of features available from the rule is determined by the medical history of
people having heart disease only.
The following list shows the attributes on which we are working.
Variable definitions in the dataset:
Age: Age of the patient
Sex: Sex of the patient
exang: exercise induced angina (1 = yes; 0 = no)
ca: number of major vessels (0-3)
cp: chest pain type
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic
trtbps: resting blood pressure (in mm Hg)
chol: cholesterol in mg/dl fetched via BMI sensor
fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
rest_ecg: resting electrocardiographic results
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or
depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
thalach: maximum heart rate achieved
target: 0= less chance of heart attack 1= more chance of heart attack
Additional variable descriptions to help us:
age - age in years
sex - sex (1 = male; 0 = female)
cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 0 =
asymptomatic)
trestbps - resting blood pressure (in mm Hg on admission to the hospital)
chol - serum cholesterol in mg/dl
fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg - resting electrocardiographic results (1 = normal; 2 = having ST-T wave abnormality;
0 = hypertrophy)
thalach - maximum heart rate achieved
exang - exercise induced angina (1 = yes; 0 = no)
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 =
downsloping)
ca - number of major vessels (0-3) colored by flourosopy
thal - 2 = normal; 1 = fixed defect; 3 = reversible defect
num - the predicted attribute - diagnosis of heart disease (angiographic disease status)
(Value 0: < 50% diameter narrowing; Value 1: > 50% diameter narrowing)
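As a quick, hedged illustration of the attribute list above, the following Pandas sketch loads such a dataset and inspects the 14 working attributes. The file name heart.csv and the column name target are assumptions about the Kaggle download, not confirmed by this report.

import pandas as pd

# Hypothetical file name; the report only states that the data were downloaded from Kaggle.
df = pd.read_csv("heart.csv")

print(df.shape)                      # expected to be (303, 14) for the reduced attribute set
print(df.columns.tolist())           # the 14 working attributes listed above
print(df["target"].value_counts())   # assumes the label column is named "target"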
Chapter 3
AIM AND SCOPE OF PRESENT INVESTIGATION
The proposed system can further be implemented as a smartphone-based application for detecting and
predicting heart disease risk levels.
3.3.3 Operational Feasibility:
Operational feasibility is defined as the process of assessing the degree to which a
proposed system solves business problems or takes advantage of business opportunities.
The system is self-explanatory and does not need any extra sophisticated training. The
system has built-in methods and classes which are
required to produce the result. The application can be handled very easily by a novice
user. The overall time that a user needs to get trained is less than one hour, and the
software used for developing this application is very economical and readily
available in the market. Therefore, the proposed system is operationally feasible.
CHAPTER 4
REQUIREMENT ANALYSIS
The Software Requirement Specification (SRS) is the starting point of the software development
activity. As systems grew more complex, it became evident that the goals of the entire system
could not be easily comprehended; hence the need for the requirement phase arose. The
software project is initiated by the client's needs. The SRS is the means of translating the
ideas in the minds of the clients (the input) into a formal document (the output of the
requirement phase). Under requirement specification, the focus is on specifying what has
been found during analysis; concerns such as representation, specification languages and tools, and
checking of the specifications are addressed during this activity. The requirement phase
terminates with the production of the validated SRS document. Producing the SRS
document is the basic goal of this phase. The purpose of the Software Requirement
Specification is to reduce the communication gap between the clients and the developers.
The Software Requirement Specification is the medium through which the client and user needs
are accurately specified. It forms the basis of software development. A good SRS should
satisfy all the parties involved in the system.
4.1.1 Product Perspective:
The application is developed in such a way that any future enhancement can be easily
implemented. The project is developed in such a way that it requires minimal
maintenance. The software used is open source and easy to install. The application
developed should be easy to install and use. This is an independent application which can
be easily run on any system that has Python and Jupyter Notebook installed.
4.1.2 Product Features:
The application is developed in a way that heart disease accuracy is predicted using
Random Forest, and we can compare the accuracy of the implemented algorithms.
User characteristics: the application is developed in such a way that it is
• easy to use
• error free
• usable with minimal or no training
• suitable for regular patient monitoring
Assumptions and dependencies: it is assumed that the dataset taken fulfils all the requirements.
4.1.3 Domain Requirements:
This document is the only one that describes the requirements of the system. It is meant for
use by the developers and will also be the basis for validating the final heart disease
system. Any changes made to the requirements in the future will have to go through a
formal change approval process.
User Requirements: the user can use the prediction accuracy to decide which algorithm
should be used in real-time predictions.
Non-Functional Requirements:
• The dataset collected should be in CSV format.
• The column values should be numerical values.
• The training set and test set are stored as CSV files.
• Error rates can be calculated for the prediction algorithms.
4.1.4 Requirements Efficiency:
Efficiency: less time is needed for predicting heart disease.
Reliability: maturity, fault tolerance and recoverability.
Portability: the software can easily be transferred to another environment, including installability.
4.1.5 Usability:
How easy it is to understand, learn and operate the software system.
Organizational Requirements: do not block the required ports through the Windows firewall; an
internet connection should be available.
Implementation Requirements: dataset collection and an internet connection to install the related libraries.
Engineering Standard Requirements: the user interface is developed in Python, which takes
inputs such as the patient's medical attributes.
4.1.6 Hardware Interfaces:
Ethernet on the AS/400 supports TCP/IP, Advanced Peer-to-Peer Networking (APPN) and
Advanced Program-to-Program Communications (APPC). ISDN can connect the AS/400 to an
Integrated Services Digital Network (ISDN) for faster, more accurate data transmission.
An ISDN is a public or private digital communications network that can support data, fax,
image, and other services over the same physical interface. Other protocols, such as IDLC
and X.25, can also be used on ISDN.
Software Interfaces: Anaconda Navigator and Jupyter Notebook are used.
4.1.7 Operational Requirements:
a. Economic: The developed product is economical, as it does not require any special hardware
interface. Environmental requirements are statements of fact and assumptions that define the
expectations of the system in terms of mission objectives, environment, constraints, and
measures of effectiveness and suitability (MOE/MOS). The customers are those who
perform the eight primary functions of systems engineering, with special emphasis on the
operator as the key customer.
b. Health and Safety: The software may be safety-critical. If so, there are issues associated
with its integrity level. The software may not be safety-critical although it forms part of a
safety-critical system. There is little point in producing 'perfect' code in some language if the
hardware and system software (in the widest sense) are not reliable. If a computer system is to
run software of a high integrity level, then that system should not at the same time
accommodate software of a lower integrity level. Systems with different requirements for
safety levels must be separated. Otherwise, the highest level of integrity required must be
applied to all systems in the same environment.
CHAPTER 5
DATA FLOW DIAGRAM
The data flow diagram (DFD) is one of the most important tools used in system analysis.
Data flow diagrams are made up of a number of symbols which represent system
components. Most data flow modeling methods use four kinds of symbols: processes, data
stores, data flows and external entities. These symbols are used to represent the four kinds of
system components. Circles in a DFD represent processes, a data flow is represented by a thin
line, each data store has a unique name, and a square or rectangle represents an
external entity.
5.2. Types of DFD:
• Logical DFD - This type of DFD concentrates on the system process and the flow of
data in the system. In this project, the logical DFD describes how data moves from the
backend to the frontend: only after the required authentication and authorization checks
pass and the database connection is established does the data flow start.
• Physical DFD - This type of DFD shows how the data flow is actually
implemented in the system. It is more specific and close to the implementation.
• Entities - Entities are the sources and destinations of information data. Entities are
represented by rectangles with their respective names.
• Process - Activities and actions taken on the data are represented by circles or
round-edged rectangles.
• Data Storage - There are two variants of data storage: it can either be represented
as a rectangle with both smaller sides absent or as an open-sided rectangle with
only one side missing.
• Data Flow - Movement of data is shown by pointed arrows. Data movement is
shown from the base of the arrow as its source towards the head of the arrow as its
destination.
LEVEL 0: context-level DFD (data collection).
LEVEL 1: detailed DFD.
CHAPTER 6
SCREENSHOTS
1.
In the above screenshot, we import and print the modules and libraries and
also read the dataset from the system.
2.
Here we print the dataset and its values.
3.
4.
5.
Age Variable
6.
Sex Variable
7.
Cp Variable
Almost half of the patients have an observation value of 0; in other words, there is
asymptomatic (silent) angina. These patients do not report the classic anginal
pain symptoms.
If we examine the other half of the pie chart, about 1 out of 4 patients has an observation
value of 2. In other words, atypical angina is seen in 29% of the patients.
This observation value covers patients with shortness of breath or non-classical
pain.
The other two observation values occur less often than the others.
16.5% of patients have a value of 1; in other words, typical angina is seen. Typical
angina is the classic exertional pain that comes on during physical activity.
The remaining 8% have the value for non-anginal pain, the third type of angina.
Non-anginal pain is the term used to describe chest pain that is not caused by heart
disease or a heart attack.
8.
Sex - Target Variable
Among women, patients at high risk of heart attack account for more than half of the
observations.
The situation is different for those with an observation value of 1, that is, for men:
the blue-colored (low-risk) bar has more observations.
So men in this dataset are more likely not to be at high risk of a heart attack.
In summary, female patients are at higher risk of heart attack.
The correlation between the two variables is -0.280937; in other words, we can say
that there is a weak negative correlation.
9.
11.
12.
13.
14.
CHAPTER 7
7.1 TECHNOLOGIES
7.1.1 Python:
Python is an interpreted high-level programming language for general-purpose
programming. Created by Guido van Rossum and first released in 1991, Python has a
design philosophy that emphasizes code readability, notably using significant whitespace.
It provides constructs that enable clear programming on both small and large scales.
Python features a dynamic type system and automatic memory management. It supports
multiple programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library. Python interpreters are
available for many operating systems. CPython, the reference implementation of Python,
is open-source software and has a community-based development model, as do nearly all its
variant implementations. CPython is managed by the non-profit Python Software
Foundation.
7.2 Libraries
7.2.1 Pandas
Pandas is an open-source Python library providing high-performance data manipulation
and analysis tools through its powerful data structures. The name Pandas is derived from
"Panel Data", an econometrics term for multidimensional data. In 2008, developer Wes
McKinney started developing Pandas when he needed a high-performance, flexible tool for
data analysis. Prior to Pandas, Python was mostly
used for data munging and preparation and contributed very little to data analysis;
Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the
processing and analysis of data, regardless of the origin of the data: load, prepare,
manipulate, model, and analyze. Python with Pandas is used in a wide range of
academic and commercial domains, including finance, economics, statistics and
analytics.
Key Features of Pandas:
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns can be deleted from or inserted into a data structure.
• Group-by functionality for aggregation and transformations.
• High-performance merging and joining of data.
• Time series functionality.
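As a brief illustration of these features, the sketch below shows a few typical Pandas operations on the project's dataset. The file name heart.csv and the column names age, chol and target are assumptions used only for illustration.

import pandas as pd

df = pd.read_csv("heart.csv")     # hypothetical file name for the Kaggle dataset

print(df.head())                  # first five records
print(df.describe())              # summary statistics for the numeric attributes
print(df.isnull().sum())          # integrated missing-data handling: count NaNs per column

# Label-based subsetting and a simple group-by aggregation (column names assumed).
older = df.loc[df["age"] > 50, ["age", "chol", "target"]]
print(older.groupby("target")["chol"].mean())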
7.2.2 NumPy:
NumPy is the fundamental Python library for numerical computing. It provides the
N-dimensional array object and fast vectorized operations on it, and libraries such as
Pandas and scikit-learn are built on top of it.
7.2.3 Scikit-Learn:
• Simple and efficient tools for data mining and data analysis
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable - BSD license
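The snippet below is a minimal sketch of scikit-learn's common estimator interface (fit, predict, score) on the heart data; the file name, the target column and the 50/50 train/test split follow the assumptions stated earlier in this report.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000)   # any scikit-learn estimator exposes the same interface
model.fit(X_train, y_train)                 # learn from the training half
print("test accuracy:", model.score(X_test, y_test))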
7.3 ALGORITHMS
7.3.1 Logistic Regression
A popular statistical technique to predict binomial outcomes (y = 0 or 1) is Logistic
Regression. Logistic regression predicts categorical outcomes (binomial/multinomial
values of y). The predictions of Logistic Regression (henceforth, LogR) are
in the form of probabilities of an event occurring, i.e., the probability of y = 1 given certain
values of the input variables x. Thus, the results of LogR range between 0 and 1.
LogR models the data points using the standard logistic function, which is an S-shaped
curve, also called the sigmoid curve, given by the equation below.
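For reference, the standard logistic (sigmoid) function referred to here maps a linear combination z of the input variables x to a probability between 0 and 1:

\[
  \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
  z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n, \qquad
  P(y = 1 \mid x) = \sigma(z)
\]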
7.3.2 Decision Tree:
A decision tree is a hierarchical model used in decision support that depicts decisions
and their potential outcomes, incorporating chance events, resource costs, and
utility. It is a non-parametric, supervised learning method that uses conditional control
statements and is useful for both classification and regression tasks. The
tree structure is comprised of a root node, branches, internal nodes, and leaf nodes,
forming a hierarchical, tree-like structure.
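A minimal sketch of fitting a shallow decision tree on the heart data and printing its learned root, internal and leaf nodes as if/else rules is shown below; the file name, the target column and the depth limit are illustrative assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)   # shallow tree for readability
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=list(X.columns)))       # the tree's nodes as nested rules
print("test accuracy:", tree.score(X_test, y_test))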
7.3.3 Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
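A hedged sketch of how an SVM classifier could be applied to the heart data follows; because SVMs are sensitive to feature scales, a StandardScaler is placed before the classifier in a pipeline. The file name, the target column and the RBF kernel choice are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# The fitted SVC learns the hyperplane (decision boundary) described above.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))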
7.3.4 Random Forest:
Random Forest is a supervised learning algorithm which is used for both classification
and regression, though it is mainly used for classification problems. Just as a forest is
made up of trees, and more trees make a more robust forest,
the random forest algorithm creates decision trees on data samples, gets a prediction from each
of them, and finally selects the best solution by means of voting. It is an ensemble method
which is better than a single decision tree because it reduces
over-fitting by averaging the results.
Random Forest works through the following steps:
• First, start with the selection of random samples from the given dataset.
• Next, the algorithm constructs a decision tree for every sample and obtains a
prediction result from every decision tree.
• In this step, voting is performed over the predicted results.
• At last, the most voted prediction is selected as the final prediction result.
The following diagram illustrates its working, and a code sketch is given below.
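The sketch below shows the same idea in code: many trees are trained and vote, and the forest's feature importances indicate which heart disease attributes matter most, in line with the feature-selection goal stated in the abstract. The file name and the target column are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

forest = RandomForestClassifier(n_estimators=200, random_state=42)   # 200 trees vote on each prediction
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
# Rank the attributes by how much each one contributes to the trees' decisions.
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)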
7.4 SYSTEM ARCHITECTURE
The figure below shows the process flow diagram of the proposed work. First, we collected the
Cleveland Heart Disease database from the UCI website, then pre-processed the dataset and
selected 16 important features.
After that, we applied the ANN and Logistic Regression algorithms individually and computed their
accuracy. Finally, we used the proposed ensemble voting method and determined the best method for
diagnosis of heart disease.
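A minimal sketch of such an ensemble voting step is given below; LogisticRegression and an MLPClassifier (standing in for the ANN) are combined with soft voting. The specific models, the file name and the column names are assumptions, not the project's actual code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

logr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
ann = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42))

# Soft voting averages the predicted probabilities of the two models.
voter = VotingClassifier(estimators=[("logr", logr), ("ann", ann)], voting="soft")
voter.fit(X_train, y_train)
print("ensemble test accuracy:", voter.score(X_test, y_test))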
7.5 MODULES:
The entire work of this project is divided into 4 modules.
They are:
a. Data Pre-Processing
b. Feature Extraction
c. Classification
d. Prediction
a. Data Pre-processing:
This module contains all the pre-processing functions needed to process the input data.
First, we read the train, test and validation data files and then performed some
preprocessing such as tokenizing and stemming. Some exploratory data analysis was also
performed, such as examining the response variable distribution, together with data quality
checks for null or missing values. Data preprocessing is the process of transforming raw data
into an understandable format. It is also an important step in data mining, as we cannot work with
raw data. The quality of the data should be checked before applying machine learning or
data mining algorithms. Preprocessing of data is mainly about checking data quality, which
covers the following aspects (a quality-check sketch follows the list below):
Accuracy: to check whether the data entered is correct or not.
Completeness: to check whether all required data is available and recorded.
Consistency: to check whether the same data is kept consistently in all the places where it appears.
Timeliness: the data should be updated correctly.
Believability: the data should be trustworthy.
Interpretability: the understandability of the data.
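The sketch below shows minimal Pandas checks of the kind listed above (completeness, consistency, accuracy, interpretability); the file name and the age range check are illustrative assumptions.

import pandas as pd

df = pd.read_csv("heart.csv")

print(df.isnull().sum())                   # completeness: missing values per column
print(df.duplicated().sum())               # consistency: duplicated records
print(df.dtypes)                           # interpretability: are all columns numeric as required?
print(df["age"].between(0, 120).all())     # accuracy: an example range check (column name assumed)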
b. Feature Extraction:
In this module we have performed feature extraction and selection using methods from the
scikit-learn Python library. For feature selection, we have used methods like a simple
bag-of-words and n-grams and then term-frequency weighting such as TF-IDF. We have also used
word2vec and POS tagging to extract features, though POS tagging and word2vec have
not been used at this point in the project.
Bag of Words:
It is an algorithm that transforms text into fixed-length vectors by
counting the number of times each word is present in a document. The word occurrences
allow us to compare different documents and evaluate their similarities for applications such
as search, document classification, and topic modeling.
N-grams:
N-grams are contiguous sequences of words, symbols or tokens in a document. In
technical terms, they can be defined as neighbouring sequences of items in a document.
They come into play when we deal with text data in NLP (Natural Language Processing)
tasks.
TF-IDF Weighting:
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in
the fields of information retrieval (IR) and machine learning, that can quantify the
importance or relevance of string representations (words, phrases, lemmas, etc) in a
document amongst a collection of documents (also known as a corpus).
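For reference, the usual textbook formulation of this weighting (not taken from this project's code) is:

\[
  \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
\]

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents in the corpus.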
c. Classification:
Here we have built all the classifiers for heart disease detection. The extracted
features are fed into different classifiers. We have used Naive Bayes, Logistic Regression,
Linear SVM, Stochastic Gradient Descent and Random Forest classifiers from sklearn. Each
of the extracted features was used in all the classifiers. After fitting each model, we
compared the F1 scores and checked the confusion matrices.
After fitting all the classifiers, the two best performing models were selected as candidate models
for heart disease classification.
d. Prediction:
Our finally selected, best performing classifier was then saved on
disk with the name final_model.sav. Once you clone this repository, this model will be copied
to the user's machine and will be used by the prediction.py file to classify heart disease. It
takes a patient's attribute values as input, and the model produces the final classification output,
which is shown to the user along with the predicted probability.
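A hedged sketch of how the selected model could be saved as final_model.sav and reloaded by a prediction script is shown below; the training data handling and the Random Forest choice are assumptions carried over from earlier sections, and only the file name final_model.sav comes from the text above.

import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train (or reuse) a classifier, then persist it under the name mentioned in the text.
df = pd.read_csv("heart.csv")                      # hypothetical file name
X, y = df.drop(columns=["target"]), df["target"]   # "target" column name is assumed
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

with open("final_model.sav", "wb") as f:
    pickle.dump(clf, f)

# Later, e.g. inside a prediction.py script, reload the model and classify one record.
with open("final_model.sav", "rb") as f:
    model = pickle.load(f)

patient = X.iloc[[0]]                              # one example patient record
print("prediction:", model.predict(patient)[0])
print("probability of heart disease:", model.predict_proba(patient)[0, 1])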
CHAPTER 8
IMPLEMENTATION
1. Environment Setup
2. Import Libraries
Import essential libraries for data handling, preprocessing, modeling, and evaluation.
3. Load Data
Load the heart disease dataset (e.g., the UCI Heart Disease dataset) using Pandas.
4. Data Exploration
5. Data Preprocessing
Handle missing values, if any, by filling them with appropriate statistics (mean,
median).
Encode categorical variables using one-hot encoding or label encoding.
Split the dataset into features (X) and target (y).
Further split the features and target into training and testing sets.
Scale the features using StandardScaler for better model performance.
6. Model Training
7. Model Evaluation
Generate and display the confusion matrix.
Generate and display the classification report, which includes precision, recall, and
F1-score.
8. Visualization
Plot the confusion matrix using Seaborn for better visualization and understanding of
the model's performance.
This process allows you to implement a machine learning model for heart attack
prediction, from loading the data to evaluating the model's performance. A consolidated
sketch of these steps is given below.
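The following is a compact, hedged sketch that strings the eight steps together; the file name heart.csv, the target column and the choice of Random Forest are assumptions consistent with the rest of the report rather than a transcript of the project's actual code.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Steps 3-4: load the data and take a quick look at it.
df = pd.read_csv("heart.csv")
print(df.head())
print(df.describe())

# Step 5: preprocessing - fill missing values, split X/y, train/test split, scale.
df = df.fillna(df.median(numeric_only=True))
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: train the model.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train_scaled, y_train)

# Step 7: evaluate with a confusion matrix and a classification report.
y_pred = model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))

# Step 8: visualize the confusion matrix with Seaborn.
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()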
BIBLIOGRAPHY
American Heart Association. (2022). Heart Attack. Retrieved from
https://www.heart.org/en/health-topics/heart-attack
Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. University of
California, Irvine, School of Information and Computer Sciences. Retrieved from
http://archive.ics.uci.edu/ml
Mathews, S. C., McShea, M. J., Hanley, C. L., & Ravitz, A. D. (2019). Lab Values:
Interpreting Chemistry and Hematology for Adult Patients. Hoboken, NJ: Wiley.
World Health Organization. (2022). Cardiovascular Diseases (CVDs). Retrieved from
https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1
Rasheed, J., Haroon, S., & Ali, A. (2020). Machine Learning Techniques for Predicting
Heart Diseases: A Comprehensive Review. Journal of Healthcare Engineering, 2020, 1-15.
doi:10.1155/2020/8518034
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction (2nd ed.). New York, NY: Springer.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ...
Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine
Learning Research, 12, 2825-2830. Retrieved from
http://jmlr.org/papers/v12/pedregosa11a.html