PythonHeartDisease FinalDocumentByMS
ABSTRACT
In day-to-day life, many factors affect the human heart. Problems are occurring at a
rapid pace and new heart diseases are being identified just as quickly. In today's stressful world the
heart, the organ that pumps blood through the body, is essential for circulation, and its health must
be preserved for healthy living. The main motivation of this project is to present a heart disease
prediction model for predicting the occurrence of heart disease. Further, this research work is aimed
at identifying the best classification algorithm for identifying the possibility of heart disease in a patient.
The identification of the possibility of heart disease in a person is a complicated task for
medical practitioners because it requires years of experience and intensive medical tests. The main
objective of this research work is to identify the classification algorithms best suited to providing
maximum accuracy when classifying normal and abnormal persons.
A convolutional neural network (CNN) architecture is used to map the relationship between
the input attributes and the predicted values. The proposed method is compared with
state-of-the-art deep neural network (DNN) based techniques in terms of the root mean square error and
mean absolute error accuracy measures. In addition, support vector machine (SVM) based classification
and K-Nearest Neighbor (KNN) based classification are also carried out and their accuracy is computed. The
applied SVM, KNN and CNN classifiers help to predict heart disease with higher accuracy on
new data. The coding language used is Python 3.7.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 INTRODUCTION
1.2 OBJECTIVES
2 LITERATURE REVIEW
3 SYSTEM ANALYSIS
4 SYSTEM SPECIFICATION
5 SOFTWARE DESCRIPTION
6 PROJECT DESCRIPTION
7 EXPERIMENT AND RESULTS
8 CONCLUSION
A. SOURCE CODE
B. SAMPLE SCREENS
REFERENCES
CHAPTER 1
INTRODUCTION
1.1. INTRODUCTION
In day-to-day life, many factors affect the human heart. Problems are occurring at a
rapid pace and new heart diseases are being identified just as quickly. In today's stressful world the
heart, the organ that pumps blood through the body, is essential for circulation, and its health must
be preserved for healthy living. The health of a human heart depends on the experiences in a
person's life and on the professional and personal behaviour of that person.
There may also be several genetic factors through which a type of heart disease is passed
down through generations. According to the World Health Organization, every year more than 12
million deaths occur worldwide due to the various types of heart disease, which is also known
collectively as cardiovascular disease. The term heart disease covers many diverse conditions that
specifically affect the heart and the arteries of a human being. Even young people in their twenties
and thirties are being affected by heart disease.
The increase in the possibility of heart disease among the young may be due to bad eating
habits, lack of sleep, restlessness, depression and numerous other factors such as obesity, poor
diet, family history, high blood pressure, high blood cholesterol, idle behaviour and smoking.
The diagnosis of heart disease is very important and is itself one of the most complicated tasks in
the medical field.
All the mentioned factors are taken into consideration when the doctor analyses and assesses
patients through manual check-ups at regular intervals. The symptoms of heart disease depend
greatly upon the kind of discomfort felt by an individual, and some symptoms are not usually
recognized by the common people. However, common symptoms include chest pain,
breathlessness, and heart palpitations. The chest pain common to many types of heart disease is
known as angina, or angina pectoris, and occurs when a part of the heart does not receive enough
oxygen. Angina may be triggered by stressful events or physical exertion and normally lasts under
10 minutes. Heart attacks can also occur as a result of different types of heart disease. The signs of a
heart attack are similar to angina except that they can occur during rest and tend to be more severe.
The symptoms of a heart attack can sometimes resemble indigestion.
Heartburn and a stomach ache can occur, as well as a heavy feeling in the chest. Other
symptoms of a heart attack include pain that travels through the body, for example from the chest to
the arms, neck, back, abdomen, or jaw, lightheadedness and dizzy sensations, profuse sweating,
nausea and vomiting.
Heart failure is also an outcome of heart disease, and breathlessness can occur when the
heart becomes too weak to circulate blood. Some heart conditions occur with no symptoms at all,
especially in older adults and individuals with diabetes. The term 'congenital heart disease' covers a
range of conditions, but the general symptoms include sweating, high levels of fatigue, fast
heartbeat and breathing, breathlessness, chest pain. However, these symptoms might not develop
until a person is older than 13 years.
In these types of cases, the diagnosis becomes an intricate task requiring great experience
and high skill. If the risk of a heart attack or the possibility of heart disease is identified early,
patients can take precautions and regulate their lifestyle. Recently, the healthcare industry
has been generating huge amounts of data about patients, and their disease diagnosis reports are
increasingly being used for the prediction of heart attacks worldwide. When the data about heart
disease is this large, machine learning techniques can be applied for the analysis.
Data mining is the task of extracting vital decision-making information from a collection
of past records for future analysis or prediction. The information may be hidden and is not
identifiable without the use of data mining. Classification is one data mining technique through
which future outcomes or predictions can be made based on the available historical data.
Medical data mining makes it possible to integrate classification techniques and
provide computerized training on the dataset, which in turn leads to the discovery of hidden patterns in
medical data sets that can be used to predict a patient's future state. Thus, by using
medical data mining it is possible to gain insights into a patient's history and to provide
clinical support through the analysis. For clinical analysis of patients, these patterns are
essential. In simple terms, medical data mining uses classification algorithms that are a
vital part of identifying the possibility of a heart attack before it occurs.
The classification algorithms can be trained and tested to make predictions that
determine whether a person is likely to be affected by heart disease. In this research work, the
supervised machine learning concept is utilized for making the predictions.
A comparative analysis of three data mining classification algorithms, namely Random
Forest, Decision Tree and Naïve Bayes, is used to make the predictions. The analysis is done at several
levels of cross validation and at several percentage-split evaluation settings respectively.
The StatLog (Heart) dataset from the UCI machine learning repository is utilized for making heart
disease predictions in this research work. The predictions are made using the classification model
built from the classification algorithms when the heart disease dataset is used for training.
The final model can then be used to predict the presence of heart disease in new records.
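As an illustration of this evaluation setup, the following sketch shows how the three classifiers could be compared in scikit-learn under both k-fold cross validation and a percentage split. It is an assumption for illustration only, not the exact experimental code of this work; the file name heart.csv and the label column name target are placeholders for the actual StatLog attributes.

# Hedged sketch: comparing Random Forest, Decision Tree and Naive Bayes on a
# UCI-style heart disease table. File and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("heart.csv")                       # assumed CSV export of the StatLog data
X, y = df.drop("target", axis=1), df["target"]      # "target" is a placeholder label column

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Evaluation with 10-fold cross validation and with an 80/20 percentage split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for name, model in models.items():
    cv_acc = cross_val_score(model, X, y, cv=10).mean()
    split_acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(name, "CV accuracy:", round(cv_acc, 3), "split accuracy:", round(split_acc, 3))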
CHAPTER 2
LITERATURE REVIEW
2.1 RELATED WORK
In this paper [1], the authors performed classical statistical analysis and data mining analysis using mainly Bayesian
networks. RESULTS: The mean age of the 533 patients was 63 (± 17) and the sample was
composed of 390 (73 %) men and 143 (27 %) women. Cardiac arrest was observed at home for 411
(77 %) patients, in a public place for 62 (12 %) patients and on a public highway for 60 (11 %)
patients. The belief network of the variables showed that the probability of remaining alive after
heart failure is directly associated with five variables: age, sex, the initial cardiac rhythm, the origin of
the heart failure and specialized resuscitation techniques employed.
CONCLUSIONS: Data mining methods could help clinicians to predict the survival of
patients and then adapt their practices accordingly. This work could be carried out for each medical
procedure or medical problem and it would become possible to build a decision tree rapidly with the
data of a service or a physician. The comparison between classic analysis and data mining analysis
showed the contribution of the data mining method for sorting variables and for quickly drawing conclusions
about the importance or the impact of the data and variables on the criterion of the study. The main limit
of the method is knowledge acquisition and the necessity to gather sufficient data to produce a
relevant model.
Ninety five percent of sudden cardiac arrest victims die before reaching the hospital and
heart disease claims more lives each year than the following six leading causes of death combined
(cancer, chronic lower respiratory diseases, accidents, diabetes mellitus, influenza and pneumonia).
Almost 150,000 people in the U.S. who die from heart disease each year are under the age of 65.
These data show the interest for predicting the risk of death after heart failure and the need to
analyze the events that occurred during care to provide prognostic information. Classic statistical
analyses have already been done and provide some information about epidemiology of the heart
failure and the causes of the failure. This paper presents the use of a probabilistic statistical approach
to profile heart failure in a sample of patients and to predict the impact of some events in the
care process.
They concluded that the use of Bayesian networks in medical analysis could
be useful to explore data and to find hidden relationships between events or characteristics of the
sample. It is a first approach for discussing hypotheses between clinicians and statistical experts.
The main limit of these tools is the necessity to have enough data to find regularity in the
relationships.
In this work [2], the authors observed that some research methods are already well enough developed to have been made part of
commercially available software. Several expert system shells use variations of ID3 for inducing
rules from examples. Other systems use inductive, neural net, or genetic learning approaches to
discover patterns in personal computer databases. Many forward-looking companies are using these
and other tools to analyze their databases for interesting and useful patterns.
American Airlines searches its frequent flyer database to find its better customers, targeting
them for specific marketing promotions. Farm Journal analyzes its subscriber database and uses
advanced printing technology to custom-build hundreds of editions tailored to particular groups.
Several banks, using patterns discovered in loan and credit histories, have derived better loan
approval and bankruptcy prediction methods. General Motors is using a database of automobile
trouble reports to derive diagnostic expert systems for various models. Packaged-goods
manufacturers are searching the supermarket scanner data to measure the effects of their promotions
and to look for shopping patterns.
A combination of business and research interests has produced increasing demands for, as
well as increased activity to provide, tools and techniques for discovery in databases. This book is
the first to bring together leading-edge research from around the world on this topic. It spans many
different approaches to discovery, including inductive learning, Bayesian statistics, semantic query
optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets.
The book is aimed at those interested or involved in computer science and the management
of data, to both inform and inspire further research and applications. It will be of particular interest
to professionals working with databases and management information systems and to those
applying machine learning to real-world problems.
In this paper [3] the authors stated that ECG is a test that measures a heart’s electrical
activity, which provides valuable clinical information about the heart’s status. In this paper, they
proposed a classification method for extracting multi-parametric features by analyzing HRV from
ECG, data preprocessing and heart disease pattern. The proposed method is an associative classifier
based on the efficient FP-growth method.
Since the volume of patterns produced can be large, they offered a rule cohesion measure
that allows a strong push of pruning patterns in the pattern generating process. They conducted an
experiment for the associative classifier, which utilizes multiple rules, pruning and biased
confidence (or a cohesion measure), on a dataset consisting of 670 participants distributed into two
groups, namely normal people and patients with coronary artery disease.
The most widely used signal in clinical practice is ECG (Electrocardiogram), which is
frequently recorded and widely used for the assessment of cardiac function. ECG processing
techniques have been proposed to support pattern recognition, parameter extraction, spectro-
temporal techniques for the assessment of the heart’s status, denoising, baseline correction and
arrhythmia detection. Control of the heart rate is known to be affected by the sympathetic and
parasympathetic nervous system.
It is reported that Heart Rate Variability (HRV) is related to autonomic nerve activity and is
used as a clinical tool to diagnose cardiac autonomic function in both health and disease. This paper
provides a classification technique that could automatically diagnose Coronary Artery Disease
(CAD) under the framework of ECG patterns and clinical investigations. Through the ECG pattern
we are able to recognize the features that could well reflect either the existence or non-existence of
CAD. Such features can be perceived through HRV analysis based on following knowledge:
1. In patients with CAD, reduction of the cardiac vagal activity evaluated by spectral HRV
analysis was found to correlate with the angiographic severity.
2. The reduction of variance (standard deviation of all normal RR intervals) and low-
frequency of HRV seem related to an increase in chronic heart failure.
Their classification method uses multiple rules to predict the highest probability classes for
each record. The proposed associative classifier can also relax the independence assumption of
some classifiers, such as NB (Naive Bayesian) and DT (Decision Tree). For example, the NB makes
the assumption of conditional independence, that is, given the class label of a sample, the values of
the attributes are conditionally independent of one another. When the assumption holds true, then
the NB is the most accurate in comparison with all other classifiers. In practice, however,
dependences can exist between variables of the real data. Their classifier can consider the
dependences of linear characteristics of HRV and clinical information. Finally, they implemented
their classifier and several different classifiers to validate their accuracy in diagnosing heart disease.
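To make the independence point concrete, the short sketch below (illustrative only, and not the authors' associative classifier) trains Naive Bayes and a Decision Tree on synthetic data in which two features are strongly correlated, the situation in which the conditional-independence assumption is violated, and prints their cross-validated accuracies side by side.

# Illustrative sketch: Naive Bayes vs Decision Tree when two features are correlated.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)              # x2 is almost a copy of x1: strong dependence
y = (x1 + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([x1, x2])

for name, clf in [("Naive Bayes", GaussianNB()), ("Decision Tree", DecisionTreeClassifier())]:
    print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))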
In [4], the performance of the CANFIS model was evaluated in terms of training performance
and classification accuracy, and the results showed that the proposed CANFIS model has great
potential in predicting heart disease.
Poor clinical decisions can lead to disastrous consequences which are therefore
unacceptable. Clinical decisions are often made based on doctors’ intuition and experience rather
than on the knowledge rich data hidden in the database. This practice leads to unwanted biases,
errors and excessive medical costs which affect the quality of service provided to patients. Wu et
al proposed that integration of clinical decision support with computer-based patient records could
reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve
patient outcome [12].
Most hospitals today employ some sort of hospital information systems to manage their
healthcare or patient data [13]. Unfortunately, these data are rarely used to support clinical decision
making. The main objective of this research is to develop a prototype Intelligent Heart Disease
Prediction System with CANFIS and genetic algorithm using historical heart disease databases to
make intelligent clinical decisions which traditional decision support systems cannot.
They concluded from their studies that they had managed to achieve their research
objectives. The available heart disease dataset from the UCI Machine Learning Repository was
studied, preprocessed and cleaned to prepare it for the classification process. Coactive neuro-
fuzzy modeling was proposed as a dependable and robust method for identifying a nonlinear
relationship and mapping between the different attributes. It was shown that the GA is a very
useful technique for auto-tuning the CANFIS parameters and selecting an optimal feature set.
The fact is that computers cannot replace humans and by comparing the computer-aided detection
results with the pathologic findings, doctors can learn more about the best way to evaluate areas that
computer aided detection highlights.
In this paper [5], results show that each technique has its unique strength in realizing the objectives of the
defined mining goals. IHDPS can answer complex “what if” queries which traditional decision
support systems cannot. Using medical profiles such as age, sex, blood pressure and blood sugar it
can predict the likelihood of patients getting a heart disease. It enables significant knowledge, e.g.
patterns, relationships between medical factors related to heart disease, to be established. IHDPS is
Web-based, user-friendly, scalable, reliable and expandable. It is implemented on the .NET
platform.
Most hospitals today employ some sort of hospital information systems to manage their
healthcare or patient data [15]. These systems typically generate huge amounts of data which take
the form of numbers, text, charts and images. Unfortunately, these data are rarely used to support
clinical decision making. There is a wealth of hidden information in these data that is largely
untapped. This raises an important question: “How can we turn data into useful information that can
enable healthcare practitioners to make intelligent clinical decisions?” This is the main motivation
for this research.
Many hospital information systems are designed to support patient billing, inventory
management and generation of simple statistics. Some hospitals use decision support systems, but
they are largely limited. They can answer simple queries like “What is the average age of patients
who have heart disease?”, “How many surgeries had resulted in hospital stays longer than 10
days?”, “Identify the female patients who are single, above 30 years old, and who have been treated
for cancer.” However, they cannot answer complex queries like “Identify the important preoperative
predictors that increase the length of hospital stay”, “Given patient records on cancer, should
treatment include chemotherapy alone, radiation alone, or both chemotherapy and radiation?”, and
“Given patient records, predict the probability of patients getting a heart disease.”
Clinical decisions are often made based on doctors’ intuition and experience rather than on
the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and
excessive medical costs which affect the quality of service provided to patients. Wu et al. proposed
that integration of clinical decision support with computer-based patient records could reduce
medical errors, enhance patient safety, decrease unwanted practice variation, and improve patient
outcome [16]. This suggestion is promising as data modelling and analysis tools, e.g., data mining,
have the potential to generate a knowledge-rich environment which can help to significantly
improve the quality of clinical decisions.
The main objective of this research is to develop a prototype Intelligent Heart Disease
Prediction System (IHDPS) using three data mining modeling techniques, namely, Decision Trees,
Naïve Bayes and Neural Network. IHDPS can discover and extract hidden knowledge (patterns and
relationships) associated with heart disease from a historical heart disease database.
It can answer complex queries for diagnosing heart disease and thus assist healthcare
practitioners to make intelligent clinical decisions which traditional decision support systems
cannot. By providing effective treatments, it also helps to reduce treatment costs. To enhance
visualization and ease of interpretation, it displays the results both in tabular and graphical forms.
IHDPS uses the CRISP-DM methodology to build the mining models. It consists of six major
phases: business understanding, data understanding, data preparation, modeling, evaluation, and
deployment. Business understanding phase focuses on understanding the objectives and
requirements from a business perspective, converting this knowledge into a data mining problem
definition, and designing a preliminary plan to achieve the objectives.
The data understanding phase uses the raw data and proceeds to understand the data, identify
its quality, gain preliminary insights, and detect interesting subsets to form hypotheses for hidden
information. Data preparation phase constructs the final dataset that will be fed into the modeling
tools. This includes table, record, and attribute selection as well as data cleaning and transformation.
The modeling phase selects and applies various techniques, and calibrates their parameters to
optimal values. The evaluation phase evaluates the model to ensure that it achieves the business
objectives. The deployment phase specifies the tasks that are needed to use the models [17]. Data
Mining Extension (DMX), a SQL-style query language for data mining, is used for building and
accessing the models’ contents. Tabular and graphical visualizations are incorporated to enhance
analysis and interpretation of results.
A total of 909 records with 15 medical attributes (factors) were obtained from the Cleveland
Heart Disease database [18]. Figure 2.1 lists the attributes. The records were split equally into two
datasets: training dataset (455 records) and testing dataset (454 records). To avoid bias, the records
for each set were selected randomly. For the sake of consistency, only categorical attributes were
used for all the three models. All the non-categorical medical attributes were transformed to
categorical data. The attribute “Diagnosis” was identified as the predictable attribute with value “1”
for patients with heart disease and value “0” for patients with no heart disease. The attribute
“PatientID” was used as the key; the rest are input attributes. It is assumed that problems such as
missing data, inconsistent data, and duplicate data have all been resolved.
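A rough pandas analogue of this preparation step is sketched below; it is not the original DMX workflow, the file name is assumed, and the column names follow Figure 2.1.

# Rough analogue of the data preparation described above (not the original DMX workflow).
import pandas as pd

df = pd.read_csv("cleveland_heart.csv")              # assumed export of the Cleveland database
y = df["Diagnosis"]                                  # predictable attribute: 1 = disease, 0 = none
X = df.drop(columns=["Diagnosis", "PatientID"])      # PatientID is only a key, not an input

# Transform one non-categorical attribute into categorical bins, as the study did for all such columns.
X["Age"] = pd.cut(df["Age"], bins=[0, 40, 55, 70, 120], labels=False)

# Random split into two roughly equal datasets to avoid ordering bias.
train_idx = X.sample(frac=0.5, random_state=0).index
X_train, X_test = X.loc[train_idx], X.drop(train_idx)
y_train, y_test = y.loc[train_idx], y.drop(train_idx)
print(len(X_train), "training records,", len(X_test), "testing records")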
Mining models:
Data Mining Extension (DMX) query language was used for model creation, model
training, model prediction and model content access. All parameters were set to the default setting
except for parameters “Minimum Support = 1” for Decision Tree and “Minimum Dependency
Probability = 0.005” for Naïve Bayes [19]. The trained models were evaluated against the test
datasets for accuracy and effectiveness before they were deployed in IHDPS. The models were
validated using Lift Chart and Classification Matrix.
They concluded that a prototype heart disease prediction system is developed using three
data mining classification modeling techniques. The system extracts hidden knowledge from a
historical heart disease database. DMX query language and functions are used to build and access
the models.
Predictable attribute
1. Diagnosis (value 0: < 50% diameter narrowing (no heart disease); value 1: > 50%
diameter narrowing (has heart disease))
Key attribute
1. PatientID – Patient’s identification number
Input attributes
1. Sex (value 1: Male; value 0 : Female)
2. Chest Pain Type (value 1: typical angina; value 2: atypical angina; value 3:
non-anginal pain; value 4: asymptomatic)
3. Fasting Blood Sugar (value 1: > 120 mg/dl; value 0: < 120 mg/dl)
4. Restecg – resting electrocardiographic results (value 0: normal; value 1: having ST-T
wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
5. Exang – exercise induced angina (value 1: yes; value 0: no)
6. Slope – the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat;
value 3: downsloping)
7. CA – number of major vessels colored by fluoroscopy (value 0 – 3)
8. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
9. Trest Blood Pressure (mm Hg on admission to the hospital)
10. Serum Cholesterol (mg/dl)
11. Thalach – maximum heart rate achieved
12. Oldpeak – ST depression induced by exercise relative to rest
13. Age in years
The models are trained and validated against a test dataset. Lift Chart and Classification
Matrix methods are used to evaluate the effectiveness of the models. All three models are able to
extract patterns in response to the predictable state. The most effective model to predict patients
with heart disease appears to be Naïve Bayes followed by Neural Network and Decision Trees.
Five mining goals are defined based on business intelligence and data exploration. The goals
are evaluated against the trained models. All three models could answer complex queries, each with
its own strength with respect to ease of model interpretation, access to detailed information and
accuracy. Naïve Bayes could answer four out of the five goals; Decision Trees, three; and Neural
Network, two. Although not the most effective model, Decision Tree results are easier to read and
interpret. The drill-through feature to access detailed patient profiles is only available in Decision
Trees.
Naïve Bayes fared better than Decision Trees as it could identify all the significant medical
predictors. The relationship between attributes produced by Neural Network is more difficult to
understand. IHDPS can be further enhanced and expanded. For example, it can incorporate other
medical attributes besides the 15 listed in Figure 2.1. It can also incorporate other data mining
techniques, e.g., Time Series, Clustering and Association Rules. Continuous data can also be used
instead of just categorical data. Another area is to use Text Mining to mine the vast amount of
unstructured data available in healthcare databases. Another challenge would be to integrate data
mining and text mining [8].
In this paper [6] the authors stated that the diagnosis of diseases is a vital and intricate job in
medicine. The recognition of heart disease from diverse features or signs is a multi-layered problem
that is not free from false assumptions and is frequently accompanied by impulsive effects. Thus the
attempt to exploit knowledge and experience of several specialists and clinical screening data of
patients composed in databases to assist the diagnosis procedure is regarded as a valuable option.
This research work is an extension of their previous research on an intelligent and effective heart
attack prediction system using a neural network. A proficient methodology for the extraction of
significant patterns from the heart disease warehouses for heart attack prediction has been
presented.
Initially, the data warehouse is pre-processed in order to make it suitable for the mining
process. Once the preprocessing gets over, the heart disease warehouse is clustered with the aid of
the K-means clustering algorithm, which will extract the data appropriate to heart attack from the
warehouse. Consequently the frequent patterns applicable to heart disease are mined with the aid of
the MAFIA algorithm from the data extracted. In addition, the patterns vital to heart attack
prediction are selected on the basis of the computed significant weightage.
The neural network is trained with the selected significant patterns for the effective
prediction of heart attack. They have employed the Multi-layer Perceptron Neural Network with
Back-propagation as the training algorithm. The results thus obtained have illustrated that the
designed prediction system is capable of predicting the heart attack effectively.
Clustering algorithms are divided into two categories: partition and hierarchical clustering
algorithms. This paper discusses one partition clustering algorithm (k-means) and one hierarchical
clustering algorithm (agglomerative). The k-means algorithm has higher efficiency and scalability and
converges fast when dealing with large data sets. Hierarchical clustering constructs a hierarchy of
clusters by either repeatedly merging two smaller clusters into a larger one or splitting a larger
cluster into smaller ones. Using the WEKA data mining tool, the authors calculated the performance of the
k-means and hierarchical clustering algorithms on the basis of accuracy and running time.
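As a generic stand-in for that comparison (the original experiment used WEKA; this sketch uses scikit-learn on synthetic data and the adjusted Rand index instead of raw accuracy), one partition method and one agglomerative method can be timed on the same data as follows.

# Generic sketch: one partition method (k-means) and one hierarchical method (agglomerative),
# timed on the same synthetic data. Not the WEKA experiment described above.
import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=5000, centers=4, random_state=0)

for name, model in [("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
                    ("agglomerative", AgglomerativeClustering(n_clusters=4))]:
    start = time.time()
    labels = model.fit_predict(X)
    print(name, "time:", round(time.time() - start, 2), "s,",
          "agreement with true grouping:", round(adjusted_rand_score(y_true, labels), 3))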
As a result of modern methods for scientific data collection, huge quantities of data are
getting accumulated at various databases. Cluster analysis [9] is one of the major data analysis
methods which helps to identify the natural grouping in a set of data items. The K-Means clustering
algorithm was proposed by MacQueen in 1967 and is a partition-based cluster analysis method.
Clustering is a way that classifies the raw data reasonably and searches the hidden patterns that may
exist in datasets [10]. It is a process of grouping data objects into disjointed clusters so that the
data in the same cluster are similar, while data belonging to different clusters differ. K-means is a
numerical, unsupervised, non-deterministic, iterative method. It is simple and very fast, so in many
practical applications, the method is proved to be a very effective way that can produce good
clustering results.
The demand for organizing sharply increasing volumes of data and learning valuable information
from data has made clustering techniques widely applied in many application areas such as
artificial intelligence, biology, customer relationship management, data compression, data mining,
information retrieval, image processing, machine learning, marketing, medicine, pattern
recognition, psychology, statistics and so on.
Clustering methods can be divided into two general classes, designated supervised and
unsupervised clustering. In this paper, we focus on unsupervised clustering which may again be
separated into two major categories: partition clustering and hierarchical clustering. There are many
algorithms in the partition clustering category, such as k-means clustering (MacQueen 1967), k-medoid
clustering, genetic k-means algorithm (GKA), Self-Organizing Map (SOM) and also graph-
theoretical methods (CLICK, CAST).
Among those methods, K-means clustering is the most popular one because of simple
algorithm and fast execution speed. Hierarchical clustering methods are among the first methods
developed and analyzed for clustering problems. There are two main approaches. (i) The
agglomerative approach, which builds a larger cluster by merging two smaller clusters in a bottom-
up fashion. The clusters so constructed form a binary tree; individual objects are the leaf nodes and
the root node is the cluster that has all data objects. (ii) The divisive approach, which splits a cluster
into two smaller ones in a top-down fashion. All clusters so constructed also form a binary tree.
The process of the k-means algorithm: this part briefly describes the standard k-means
algorithm. K-means is a typical clustering algorithm in data mining and is widely used for
clustering large sets of data. It is one of the simplest unsupervised learning algorithms and is
applied to solve the well-known clustering problem. It is a partitioning clustering algorithm: the
method classifies the given data objects into k different clusters iteratively, converging to a local
minimum, so the resulting clusters are compact and independent. The algorithm consists of two
separate phases.
The first phase selects k centers randomly, where the value k is fixed in advance. The next
phase assigns each data object to the nearest center. Euclidean distance is generally used to
determine the distance between each data object and the cluster centers. When all the data objects
have been included in some cluster, the first step is completed and an early grouping is done. The
averages of the early-formed clusters are then recalculated. This iterative process continues until
the criterion function reaches its minimum. Supposing that the target object is x and x̄i denotes the
mean of cluster Ci, the criterion function is defined as follows (eq. 1):

E = Σ_{i=1}^{k} Σ_{x ∈ Ci} ||x − x̄i||²    (1)

E is the sum of the squared error of all objects in the database.
The distance used in the criterion function is the Euclidean distance, which determines the
nearest distance between each data object and the cluster centers. The Euclidean distance d(x, y)
between one vector x = (x1, x2, …, xn) and another vector y = (y1, y2, …, yn) is obtained as follows:

d(x, y) = √( Σ_{i=1}^{n} (xi − yi)² )    (2)
The k-means clustering algorithm always converges to a local minimum. Before the k-means
algorithm converges, calculations of distances and cluster centers are done while the loops are executed
a number of times, where the positive integer t is known as the number of k-means iterations. The
precise value of t varies depending on the initial starting cluster centers. The distribution of data
points has a relationship with the new clustering centers, so the computational time complexity of the
k-means algorithm is O(nkt), where n is the number of data objects, k is the number of clusters and
t is the number of iterations of the algorithm. Usually k << n and t << n [1]. The reasons for
choosing the k-means algorithm for study are its popularity and the following properties:
Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters and l is
the number of iterations taken by the algorithm to converge.
It is order independent; for a given initial seed set of cluster centers, it generates the same
partition of the data irrespective of the order in which the patterns are presented to the algorithm.
Its space complexity is O(n+k); it requires additional space to store the data matrix.
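The two-phase process and the O(nkt) behaviour described above can be summarised in a minimal NumPy sketch; it is for illustration only, and a library implementation such as scikit-learn's KMeans would normally be used in practice.

# Minimal k-means sketch following the two phases described above:
# (1) choose k centers at random; (2) assign each object to its nearest center by
# Euclidean distance, then recompute the cluster means until the criterion converges.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # phase 1: random initial centers
    for t in range(max_iter):                                   # t iterations -> O(nkt) overall
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # Euclidean distances
        labels = dists.argmin(axis=1)                            # phase 2: nearest-center assignment
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])              # recompute cluster means
        if np.allclose(new_centers, centers):                    # criterion function E has converged
            break
        centers = new_centers
    return labels, centers

# Example usage on random two-dimensional data
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centers = kmeans(X, k=3)
print(centers)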
CHAPTER 3
SYSTEM ANALYSIS
3.1.1. DRAWBACKS
KNN classification is not considered, so estimating the probability of disease (yes/no) for
records in the given new test data is not possible.
Feature reduction before disease identification is not carried out.
Only data columns with numeric values are taken for KNN classification.
3.2.2 ADVANTAGES
SVM classification is considered, so estimating the probability of the safe/risk category for
the given new test data is possible.
Feature reduction before disease identification is carried out.
SVM performs well even when the dataset size is big.
A Convolutional Neural Network based prediction model is worked out to assess algorithm
efficiency.
CHAPTER - 4
SYSTEM SPECIFICATION
CHAPTER - 5
SOFTWARE DESCRIPTION
Another attractive feature that Google offers to developers is the use of a GPU. Colab
supports GPU and it is totally free. The reason for making it free for the public could be to make its
software a standard in academia for teaching machine learning and data science. It may also
have the long-term perspective of building a customer base for Google Cloud APIs, which are sold
on a per-use basis. Irrespective of the reasons, the introduction of Colab has eased the learning and
development of machine learning applications. Google Colab is a powerful platform for learning
and quickly developing machine learning models in Python. It is based on Jupyter notebook and
supports collaborative development.
The team members can share and concurrently edit the notebooks, even remotely. The
notebooks can also be published on GitHub and shared with the general public. Colab supports
many popular ML libraries such as PyTorch, TensorFlow, Keras and OpenCV. The restriction as of
today is that it does not support R or Scala yet. There is also a limitation on session length and size.
Considering the benefits, these are small sacrifices one needs to make.
Packages
Google Colab comes pre-loaded with the most popular Python libraries and tools used in data
science. Some of the biggest Python libraries included are:
• Pandas - It is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool.
• NumPy - It is the core library for scientific computing, which contains a powerful
N-dimensional array object.
• Matplotlib - Matplotlib is a plotting library for the Python programming language.
• Scikit-learn - Machine learning library for Python. It has classification, regression and
clustering algorithms.
CHAPTER 6
PROJECT DESCRIPTION
The new system should sort out the current drawbacks, and thus the proposed
convolutional neural network (CNN) architecture is considered, with a matrix input, for
predicting the risk of heart disease. The proposed CNN architecture has 4
hidden layers comprising 2 convolutional layers and 2 fully connected layers. The same network
has been used in the IoT implementation and the experimental evaluations. The input layer has 4 features
and the size of the input layer is 4 × 1. The first and the second convolutional layers use 64 feature
maps to enable better learning of the input features.
The kernel for both convolution layers has a size of 1×1. The convolution is done with one
stride on both the first and the second convolution layers. The fully-connected layers have 128
neurons. All the activation functions are ReLU and the output layer with the linear activation
function has one neuron. The resulting network has around 1.5 million learnable parameters. The
CNN loss function is the mean squared error and is minimized using the Adam optimization
algorithm.
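A Keras sketch of the architecture as described is given below. It is an interpretation of the text rather than the verbatim project code, and the input shape of 4 × 1 and the layer sizes are taken directly from the description; they may need adjusting to the real feature count.

# Keras sketch of the CNN described above: two convolutional layers (64 feature maps,
# 1x1 kernels, stride 1), two fully connected layers of 128 neurons, ReLU activations,
# one linear output neuron, mean squared error loss and the Adam optimizer.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense

model = Sequential([
    Conv1D(64, kernel_size=1, strides=1, activation="relu", input_shape=(4, 1)),
    Conv1D(64, kernel_size=1, strides=1, activation="relu"),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()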
It should achieve high detection accuracy and efficiency while analyzing the minimal
number of categories. It should eliminate the need to analyze records that have little or no
significant influence on risk detection effectiveness.
3. SVM/KNN CLASSIFICATION
In this module, 80% of the data in the given data set is taken as training data and 20% of the
data is taken as test data. The text (categorical) columns are converted into numerical values.
Then the model is trained with the training data and predictions are made on the test data, in which
most of the records are classified as disease present or not present.
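The flow of this module can be sketched as follows; it is a simplified illustration of the steps above, the file name heart.csv is an assumption, and the label column classfactor is the one used in the full listing in Appendix A.

# Simplified sketch of the SVM/KNN module: encode categorical columns, split 80/20,
# train, and predict on the held-out test data. The full listing is in Appendix A.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("heart.csv")                            # assumed dataset file
for col in df.select_dtypes(include="object"):           # text (categorical) columns -> numeric codes
    df[col] = df[col].astype("category").cat.codes

X, y = df.drop("classfactor", axis=1), df["classfactor"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("SVM", SVC()), ("KNN", KNeighborsClassifier(n_neighbors=3))]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))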
4. CNN CLASSIFICATION
Here the dataset is taken first. It can be seen that the data is stored in the form of CSV
(Comma Separated Values) records. Each record contains the attribute values for one heart disease
instance. Data Encoding: it converts the categorical column (the label, in our case) into numerical
values. These are some of the variables required for model training. The tokenization process divides
a large piece of continuous text into distinct units, or tokens. Here we use the columns
separately on a temporal basis as a pipeline, simply for good accuracy.
Generating Word Embedding: It allows words with similar meanings to have a similar
representation. Here each individual word is represented as real-valued vectors in a predefined
vector space. For that we will use glove.6B.50d.txt. It has the predefined vector space for
words. Creating Model Architecture: Here TensorFlow is used to create the model. Here we
use the TensorFlow embedding technique with Keras Embedding Layer where we map original
input data into some set of real-valued dimensions. Model Evaluation and Prediction: Now, the
detection model is built using TensorFlow. We then test the model on held-out records by
predicting whether heart disease is present or not. Thus a heart disease detection model is created
with TensorFlow in Python.
Dropout, Batch-normalization and Flatten layers are also used in addition to the layers
above. The Flatten layer converts the output of the convolutional layers into a one-dimensional feature
vector. It is important to flatten the outputs because Dense (Fully connected) layers only accept
a feature vector as input. Dropout and Batch-normalization layers are for preventing the model
from overfitting. Once the model is created, it can be imported and then compiled using
‘model.compile’. The model is trained for just ten epochs but we can increase the number of
epochs. After the training process is completed we can make predictions on the test set. The
accuracy value is displayed during iterations.
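Those layers and the training call can be pictured with the hedged sketch below. It is not the literal project file: the input shape of 13 attributes and the sigmoid/binary cross-entropy output are assumptions made for a two-class disease label, and the fit call is shown commented out because the encoded training arrays are prepared elsewhere.

# Hedged sketch of the training step described above: Dropout and Batch-normalization
# to limit overfitting, Flatten before the Dense layers, 'model.compile', and training
# for ten epochs with the accuracy reported at each iteration.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, BatchNormalization, Dropout, Flatten, Dense

model = Sequential([
    Conv1D(64, kernel_size=1, activation="relu", input_shape=(13, 1)),   # 13 attributes (assumed)
    BatchNormalization(),
    Dropout(0.3),
    Flatten(),                       # convolutional output -> one-dimensional feature vector
    Dense(128, activation="relu"),
    Dense(1, activation="sigmoid"),  # probability of heart disease (assumed two-class label)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, validation_split=0.2)   # accuracy printed per epoch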
System analysis decides the following input design details: what data to input, what
medium to use, how the data should be arranged or coded, which data items and transactions need
validation to detect errors and, finally, the dialogue to guide the user in providing input. The input
data of a system need not necessarily be raw data captured in the system from scratch.
These can also be the output of another system or subsystem. The design of input covers
all the phases of input from the creation of initial data to actual entering of the data to the
system for processing. The design of inputs involves identifying the data needed, specifying the
characteristics of each data item, capturing and preparing data for computer processing and
ensuring correctness of data. Any Ambiguity in input leads to a total fault in output. The goal of
designing the input data is to make data entry as easy and error free as possible.
The output is designed in such a way that it is attractive, convenient and informative.
The code is written with various features that make the console output more pleasing.
As the outputs are the most important sources of information for the users, a better design should
improve the system's relationship with the user and will also help in decision-making. Form
design elaborates the way output is presented and the layout available for capturing
information.
Figure: system flow showing the dataset downloaded from the UCI repository, followed by CNN classification on the reduced feature set.
CHAPTER 7
EXPERIMENT AND RESULTS
CHAPTER 8
8.1 CONCLUSION
The project focuses on SVM classification algorithms to effectively detect heart disease
types. The dataset is taken from the UCI repository. Preprocessing such as zero-value, N/A-value
and Unicode-character elimination is carried out. Important features are extracted for
better classification. A confusion matrix is prepared along with the accuracy score calculation. In addition,
KNN classification as well as a neural network to effectively detect risk types are also
carried out. The dataset is taken and preprocessed, for example by Unicode removal. Important features
are extracted for better classification.
Furthermore, the algorithm consistently outperformed all the tested classification
methods under different conditions. Future enhancements can be made with still more
attribute sets. SVM and KNN classification give better accuracy in prediction. More datasets
can be taken and checked for accuracy with the same algorithm parameters. These algorithms
can also be evaluated for other diseases.
APPENDICES
A. SOURCE CODE
SVM CODING
#!/usr/bin/env python
#python 3.7 32 bit
# coding: utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#get_ipython().run_line_magic('matplotlib', 'inline')
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the heart disease dataset; the file name is assumed, adjust the path to the actual CSV
df = pd.read_csv('heart.csv')
#X = X.filter(['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 'Loan_Status'])
# In[2]:
#X
print(df.columns.values)
#df=df.iloc[:, [2,3,4,5,6,7,8,9,53]]#3,6,
df=df.iloc[:, [1,2,3,4,5,6,7,8,9,10,11,12,13]]
# As we can see above, not much can be done in the current form of the dataset. We need to view the data in a better format.
# In[3]:
X = df          # working frame for the analysis (X was undefined in the original listing)
# In[4]:
print(X.shape)
# In[6]:
# The above plots show the relationship between our features. But the only
# problem with them is that they do not show us which of the "dots" is Malignant
# and which is Benign.
#
# This issue will be addressed below by using the "target" variable as the "hue" for
# the plots.
# In[7]:
# **Note:**
#
# 1.0 (Orange) = Benign (No Cancer)
#
# 0.0 (Blue) = Malignant (Cancer)
# In[8]:
print(X['classfactor'])
print(X['classfactor'].value_counts())
# In[9]:
# In[10]:
plt.figure(figsize=(20,12))
sns.heatmap(df.corr(), annot=True)   # df_Sales was undefined; use the working dataframe
#X = X.drop(['Malware'], axis = 1) # We drop our "target" feature and use all the remaining features in our dataframe to train the model.
#print(X.head())
# In[12]:
y = X['classfactor']
X = X.drop(['classfactor'], axis = 1)
print(y.head())
# Let's split our data, holding out 15% for testing and using the rest for training.
# In[14]:
indices =range(len(X))
X_train, X_test, y_train, y_test, tr, te = train_test_split(X, y, indices, test_size=0.15, random_state=20)
# Let's now check the size of our training and testing data.
# In[15]:
print ('The size of our training "X" (input features) is', X_train.shape)
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)
# In[16]:
# In[17]:
svc_model = SVC()
# # Now, let's train our SVM model with our "training" dataset.
# In[18]:
svc_model.fit(X_train, y_train)
# # Let's use our trained model to make a prediction using our testing data
# In[19]:
y_predict = svc_model.predict(X_test)
print(y_predict)
f=open('output.html',"w")
f.write('<html><head><title>Output</title>');
s="<link rel='stylesheet' href='css/bootstrap.min.css'><script
src='css/jquery.min.js'></script> <script src='css/bootstrap.min.js'></script><link
rel='stylesheet' href='css/all.css'>"
f.write(s)
f.write("</head><body>")
f.write("<center><h2>TRAINING/TEST DATASET SIZE</h2></center><font
color='red' size='4' face='Tahoma'>")
print ('The size of our training "X" (input features) is', X_train.shape)
f.write('The size of our training "X" (input features) is' + str(X_train.shape))
f.write("<br/>")
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
# In[21]:
#sns.heatmap(confusion, annot=True)
#print(classification_report(y_test, y_predict))
X_train_min = X_train.min()
print(X_train_min)
# In[25]:
X_train_max = X_train.max()
print(X_train_max)
# In[26]:
# Min-max scale the training data (these lines were missing from the listing but are
# required by the svc_model.fit(X_train_scaled, ...) call below)
X_train_range = (X_train - X_train_min).max()
X_train_scaled = (X_train - X_train_min) / X_train_range
X_test_min = X_test.min()
X_test_range = (X_test - X_test_min).max()
X_test_scaled = (X_test - X_test_min)/X_test_range
print('X_test_scaled:')
print(X_test_scaled.head())
# In[29]:
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
# In[30]:
y_predict = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:')
print(cm)
# In[31]:
confusion = pd.DataFrame(cm, columns=['predicted_Low','predicted_High'])  # reconstructed from a truncated line
leng=len(te)
print("------");
for i in range(0, leng):
    print(te[i], ":", y_predict[i])
ac = accuracy_score(y_test,y_predict)
print('Accuracy:')
print(round(ac,3))
f.write( 'Accuracy Score<br/>')
f.write(str(round(ac,3)) + "<br/>")
print('Confusion Matrix')
print(confusion_matrix(y_test, y_predict))
f.write('Confusion Matrix [SVM]<br/>')
f.write('----------------------<br/>')
#print (page)
KNN CODING
#!/usr/bin/env python
#python 3.7 32 bit
#python 3.9 32 bit if sns function plot require check 99th line
# coding: utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#get_ipython().run_line_magic('matplotlib', 'inline')
from sklearn.model_selection import train_test_split
from sklearn import preprocessing, neighbors
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the heart disease dataset; the file name is assumed, adjust the path to the actual CSV
dataset = pd.read_csv('heart.csv')
df = pd.DataFrame(data = dataset)
#X = X.filter(['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 'classfactor'])
# In[2]:
#X
print(df.columns.values)
#The below line is correct. For speed some columns are eliminated
df=df.iloc[:, [1,2,3,4,5,6,7,8,9,10,11,12,13]] #15,16,17,18,19,20
#The below line is used for speed and so some columns are eliminated
#df=df.iloc[:, [2,3,4,9,53]]
#df=df.iloc[1:10001,:]
#for i in range(1001,15000):
# df.at[i,'Malware']=0
#for i in range(15001,19501):
# df.at[i,'Malware']=1
#print(df['Malware'])
#for i in range(15001,16501):
# df[i,'Malware']=1
# As we can see above, not much can be done in the current form of the dataset. We need to view the data in a better format.
# In[3]:
# In[4]:
X = df          # working frame for the analysis (X was undefined in the original listing)
print(X.shape)
# As we can see,we have 596 rows (Instances) and 31 columns(Features)
# In[5]:
print(X.columns)
# In[6]:
# The above plots show the relationship between our features. But the only problem with them
# is that they do not show us which of the "dots" is Malignant and which is Benign.
#
# This issue will be addressed below by using the "target" variable as the "hue" for the plots.
# In[7]:
# **Note:**
#
# 1.0 (Orange) = Benign (No Cancer)
#
# 0.0 (Blue) = Malignant (Cancer)
# In[8]:
#X['classfactor']=X['classfactor']
print(X['classfactor'])
print(X['classfactor'].value_counts())
# In[9]:
try:
    sns.countplot(X['classfactor'], label = "Count")
except:
    tmp = 1
# In[10]:
try:
    plt.figure(figsize=(20,12))
    sns.heatmap(df.corr(), annot=True)   # df_Sales was undefined; use the working dataframe
except:
    tmp = 1
#X = X.drop(['classfactor'], axis = 1) # We drop our "target" feature and use all the remaining features in our dataframe to train the model.
#print(X.head())
# In[12]:
y = X['classfactor']
X = X.drop(['classfactor'], axis = 1)
print(y.head())
# Let's split our data using 75% for training and the remaining 25% for testing.
# In[14]:
indices =range(len(X))
X_train, X_test, y_train, y_test, tr, te = train_test_split(X, y, indices, test_size=0.25, random_state=16)
# Let's now check the size of our training and testing data.
#iris = datasets.load_iris()
#X, y = iris.data[:, :], iris.target
indices =range(len(X))
Xtrain, Xtest, y_train, y_test, tr, te = train_test_split(X, y, indices, stratify=y, random_state=16, train_size=0.75)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)
# In[16]:
leng=len(te)
print("------");
for i in range(0, leng):
    print(te[i], ":", y_pred[i])
# In[15]:
f=open('output.html',"w")
f.write('<html><head><title>Output</title>');
s="<link rel='stylesheet' href='css/bootstrap.min.css'><script src='css/jquery.min.js'></script>
<script src='css/bootstrap.min.js'></script><link rel='stylesheet' href='css/all.css'>"
f.write(s)
f.write("</head><body>")
f.write("<center><h2>TRAINING/TEST DATASET SIZE</h2></center><font color='red'
size='4' face='Tahoma'>")
print ('The size of our training "X" (input features) is', X_train.shape)
f.write('The size of our training "X" (input features) is' + str(X_train.shape))
f.write("<br/>")
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
f.write('The size of our testing "X" (input features) is' + str(X_test.shape))
f.write("<br/>")
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
f.write('The size of our training "y" (output feature) is' + str( y_train.shape))
f.write("<br/>")
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)
f.write('The size of our testing "y" (output features) is'+ str(y_test.shape))
f.write("<br/><br/></font>")
print('Accuracy Score')
f.write( 'Accuracy Score<br/>')
print(round(accuracy_score(y_test, y_pred),3))
f.write(str(round(accuracy_score(y_test, y_pred),3)) +"<br/>")
print(classification_report(y_test, y_pred))
print('Confusion Matrix [KNN]')
f.write('Confusion Matrix [KNN]<br/>')
f.write('----------------------<br/>')
print('----------------------')
print(confusion_matrix(y_test, y_pred))
f.write(str(confusion_matrix(y_test, y_pred)) + "<br/>")
f.write("<center><table class='table table-bordered table-striped table-hover' border='1'
style='border-radius:5px;width:50%'><tr><th>Index</th><th>Category</th></tr>")
for i in range(0, leng):
    s = "<tr><td>" + str(te[i]) + "</td><td>" + str(y_pred[i]) + "</td></tr>"
    f.write(s)
f.write("</table></center></body></html>")
f.close()
import webbrowser
import os
filename='file:///'+os.getcwd()+'/' + 'output.html'
webbrowser.open_new_tab(filename)
#import urllib.request
#page = urllib.request.urlopen('output.html').read()
#print (page)
exit()
B. SAMPLE SCREENS
In this output, all the column names are displayed. Sample five records from the dataset
is printed using df.head(). The test records category after classification is made and printed as
zero and one. The accuracy score is also displayed. The confusion matrix is printed below.
In this output, the test records category after classification is made and printed as zero
and one. The accuracy score is also displayed. The confusion matrix is printed below. The
prepared output is saved as a web page and, using the webbrowser module in Python, the HTML file
saved during the classification is opened in the browser.
In this output, all the column names are displayed. Sample five records from the dataset
is printed using df.head(). The test records category after classification is made and printed as
zero and one. The accuracy score is also displayed. The confusion matrix is printed below.
In this output, the test records category after classification is made and printed as zero
and one. The accuracy score is also displayed. The confusion matrix is printed below. The
prepared output is saved as a web page and, using the webbrowser module in Python, the HTML file
saved during the classification is opened in the browser.
Once the CNN model is created, it is imported and then compiled using
‘model.compile’. The model is trained for just ten epochs for faster output. During the training
process, accuracy calculation is made and predictions are printed out. The accuracy value is
displayed during iterations. The results of first epoch are printed out.
The accuracy value is displayed during iterations. The results of second epoch are
printed out. And it is noted that accuracy is increasing.
The accuracy value is displayed during iterations. The results of third epoch are printed
out. And it is noted that accuracy is increasing.
The accuracy value is displayed during iterations. The results of fourth epoch are printed
out. And it is noted that accuracy is increasing.
The accuracy value is displayed during iterations. The results of fifth epoch are printed
out. And it is noted that accuracy is increasing.
The accuracy value is displayed during iterations. The results of tenth epoch are printed
out. And it is noted that accuracy is increasing and the program ends.
BIBLIOGRAPHY
JOURNAL REFERENCES
[1] Franck Le Duff, Cristian Munteanu, Marc Cuggia and Philippe Mabo, “Predicting
Survival Causes After Out of Hospital Cardiac Arrest using Data Mining Method”, Studies in
Health Technology and Informatics, Vol. 107, No. 2, pp. 1256-1259, 2004.
[2] W.J. Frawley and G. Piatetsky-Shapiro, “Knowledge Discovery in Databases: An
Overview”, AI Magazine, Vol. 13, No. 3, pp. 57-70, 1996.
[3] Kiyong Noh, HeonGyu Lee, Ho-Sun Shon, Bum Ju Lee and Keun Ho Ryu, “Associative
Classification Approach for Diagnosing Cardiovascular Disease”, Intelligent Computing in
Signal Processing and Pattern Recognition, Vol. 345, pp. 721-727, 2006.
[4] Latha Parthiban and R. Subramanian, “Intelligent Heart Disease Prediction System using
CANFIS and Genetic Algorithm”, International Journal of Biological, Biomedical and Medical
Sciences, Vol. 3, No. 3, pp. 1-8, 2008.
[5] Sellappan Palaniappan and Rafiah Awang, “Intelligent Heart Disease Prediction System
using Data Mining Techniques”, International Journal of Computer Science and Network
Security, Vol. 8, No. 8, pp. 1-6, 2008.
[6] Shantakumar B. Patil and Y.S. Kumaraswamy, “Intelligent and Effective Heart Attack
Prediction System using Data Mining and Artificial Neural Network”, European Journal of
Scientific Research, Vol. 31, No. 4, pp. 642-656, 2009.
[7] Nidhi Singh and Divakar Singh, “Performance Evaluation of K-Means and Hierarchal
Clustering in Terms of Accuracy and Running Time”, Ph.D Dissertation, Department of
Computer Science and Engineering, Barkatullah University Institute of Technology, 2012.
[8] Weiguo Fan, Linda Wallace, Stephanie Rich and Zhongju Zhang, “Tapping the Power of Text Mining”,
Communications of the ACM, Vol. 49, No. 9, pp. 77-82, 2006.
[9] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann
Publishers, an imprint of Elsevier, 2006.
[10] Z. Huang, “Extensions to the k-means Algorithm for Clustering Large Data Sets with
Categorical Values”, Data Mining and Knowledge Discovery, Vol. 2, pp. 283-304, 1998.