

PREDICTION OF HEART DISEASE USING VARIOUS DATA MINING TECHNIQUES

ABSTRACT

In day-to-day life, many factors affect the human heart. Problems are occurring at a rapid pace, and new heart diseases are being identified just as quickly. In today's stressful world, the heart, the essential organ that pumps blood through the body, must be kept healthy for healthy living. The main motivation of this project is to present a heart disease prediction model for predicting the occurrence of heart disease. Further, this research work aims to identify the best classification algorithm for detecting the possibility of heart disease in a patient.

Identifying the possibility of heart disease in a person is a complicated task for medical practitioners because it requires years of experience and intensive medical tests. The main objective of this research work is to identify the classification algorithm that provides maximum accuracy when classifying normal and abnormal persons.

A convolutional neural network (CNN) architecture is used to map the relationship between the indoor PM and weather data and the target values. The proposed method is compared with state-of-the-art deep neural network (DNN)-based techniques in terms of the root mean square and mean absolute error accuracy measures. In addition, support vector machine (SVM) based classification and K-Nearest Neighbor (KNN) based classification are also carried out and their accuracies are computed. The applied SVM, KNN and CNN classification helps to predict heart disease with greater accuracy on the new data set. The coding language used is Python 3.7.

TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE

ABSTRACT iv
LIST OF FIGURES viii
LIST OF ABBREVIATIONS ix
1 INTRODUCTION 1
1.1 INTRODUCTION 1
1.2 OBJECTIVES 2
2 LITERATURE REVIEW 6
3 PROPOSED WORK 10
3.1 EXISTING SYSTEM
3.2 DRAWBACKS OF EXISTING SYSTEM 10
3.3 PROPOSED SYSTEM 19
3.4 ADVANTAGES OF PROPOSED SYSTEM 19
3.5 FEASIBILITY STUDY 19
3.5.1 ECONOMIC FEASIBILITY 19
3.5.2 OPERATIONAL FEASIBILITY 20
3.5.3 TECHNICAL FEASIBILITY 20
4 SYSTEM SPECIFICATION 22
4.1 HARDWARE REQUIREMENTS 22
4.2 SOFTWARE REQUIREMENTS 24
5 SOFTWARE DESCRIPTION
5.1 GOOGLE COLAB
6 PROJECT DESCRIPTION
6.1 PROBLEM DEFINITION
6.2 OVERVIEW OF THE PROJECT
6.3 MODULE DESCRIPTION
6.4 INPUT DESIGN
6.5 OUTPUT DESIGN
6.6 SYSTEM FLOW DIAGRAM
7 EXPERIMENT AND RESULTS
8 CONCLUSION AND FUTURE ENHANCEMENTS 32
APPENDIX 36
A. SOURCE CODE
B. SCREEN SHOTS
REFERENCES 41

LIST OF FIGURES

FIGURE NO. TITLE PAGE NO.

6.6.1 SYSTEM FLOW DIAGRAM 3

6.7.1 USE CASE DIAGRAM 11

6.8.1 SEQUENCE DIAGRAM 11

6.9.1 DATA FLOW DIAGRAM 12



LIST OF ABBREVIATIONS

ABBREVIATION EXPANSION

CNN CONVOLUTIONAL NEURAL NETWORK

KNN K NEAREST NEIGHBOR

SVM SUPPORT VECTOR MACHINE

TPR TRUE POSITIVE RATE

FPR FALSE POSITIVE RATE



CHAPTER 1
INTRODUCTION

1.1. INTRODUCTION
In day-to-day life, many factors affect the human heart. Problems are occurring at a rapid pace, and new heart diseases are being identified just as quickly. In today's stressful world, the heart, the essential organ that pumps blood through the body, must be kept healthy for healthy living. The health of a human heart depends on the experiences in a person's life and on the professional and personal behavior of that person.

There may also be several genetic factors through which a type of heart disease is passed down through generations. According to the World Health Organization, more than 12 million deaths occur worldwide every year due to the various types of heart disease, collectively known as cardiovascular disease. The term heart disease covers many diverse conditions that specifically affect the heart and the arteries of a human being. Even young people in their twenties and thirties are now being affected by heart disease.

The increase in the possibility of heart disease among the young may be due to bad eating habits, lack of sleep, restlessness, depression and numerous other factors such as obesity, poor diet, family history, high blood pressure, high blood cholesterol, idle behavior, smoking and hypertension. The diagnosis of heart disease is very important and is itself one of the most complicated tasks in the medical field.

All the mentioned factors are taken into consideration when the doctor analyzes and examines a patient through manual check-ups at regular intervals. The symptoms of heart disease depend greatly on the kind of discomfort felt by an individual, and some symptoms are not easily recognized by the common person. However, common symptoms include chest pain, breathlessness, and heart palpitations. The chest pain common to many types of heart disease is known as angina, or angina pectoris, and occurs when a part of the heart does not receive enough oxygen. Angina may be triggered by stressful events or physical exertion and normally lasts under 10 minutes. Heart attacks can also occur as a result of different types of heart disease. The signs of a heart attack are similar to angina, except that they can occur during rest and tend to be more severe. The symptoms of a heart attack can sometimes resemble indigestion.
Heartburn and a stomach ache can occur, as well as a heavy feeling in the chest. Other
symptoms of a heart attack include pain that travels through the body, for example from the chest to
the arms, neck, back, abdomen, or jaw, lightheadedness and dizzy sensations, profuse sweating,
nausea and vomiting.

Heart failure is also an outcome of heart disease, and breathlessness can occur when the
heart becomes too weak to circulate blood. Some heart conditions occur with no symptoms at all,
especially in older adults and individuals with diabetes. The term 'congenital heart disease' covers a
range of conditions, but the general symptoms include sweating, high levels of fatigue, fast
heartbeat and breathing, breathlessness, chest pain. However, these symptoms might not develop
until a person is older than 13 years.

In such cases, the diagnosis becomes an intricate task requiring great experience and high skill. If the risk of a heart attack or the possibility of heart disease is identified early, patients can take precautions and regulatory measures. Recently, the healthcare industry has been generating huge amounts of data about patients, and their disease diagnosis reports are being used in particular for the prediction of heart attacks worldwide. When the data about heart disease is large, machine learning techniques can be applied for its analysis.

Data mining is the task of extracting vital decision-making information from a collection of past records for future analysis or prediction. This information may be hidden and is not identifiable without the use of data mining. Classification is one data mining technique through which future outcomes or predictions can be made based on the available historical data. Medical data mining makes it possible to integrate classification techniques and to provide computerized training on the dataset, which in turn leads to exploring hidden patterns in medical data sets that can be used to predict a patient's future state. Thus, medical data mining can provide insights into a patient's history and offer clinical support through the analysis. These patterns are essential for the clinical analysis of patients. In simple terms, medical data mining uses classification algorithms that are vital for identifying the possibility of a heart attack before it occurs.

The classification algorithms can be trained and tested to make predictions that determine whether a person is likely to be affected by heart disease. In this research work, the supervised machine learning concept is utilized for making the predictions. A comparative analysis of three data mining classification algorithms, namely Random Forest, Decision Tree and Naïve Bayes, is used to make predictions. The analysis is done at several levels of cross-validation and at several percentage-split evaluation settings respectively.

The StatLog (Heart) dataset from the UCI machine learning repository is utilized for making heart disease predictions in this research work. The predictions are made using the classification model that is built from the classification algorithms when the heart disease dataset is used for training. This final model can then be used for the prediction of heart disease.
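A minimal sketch of such a comparison, assuming the Statlog heart data have been exported to a local CSV file named heart.csv with a binary target column (the file name and column name are illustrative, and scikit-learn stands in for whichever tool the original evaluation used):

# Sketch: comparing Random Forest, Decision Tree and Naive Bayes with
# 10-fold cross-validation, as described above. "heart.csv" and the
# "target" column are assumed names, not part of the original work.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

data = pd.read_csv("heart.csv")
X = data.drop(columns=["target"])
y = data["target"]

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")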

1.2. OBJECTIVES OF PROJECT

• Identification of the safe or risk category of heart records.

• Feature reduction of the data set for the convolutional neural network.

• To apply SVM/KNN classification so that the probability of safe/risk can be estimated for the given new test data.

• To reduce features before risk classification using SVM/KNN is carried out (a minimal sketch of such feature reduction follows this list).
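A hedged sketch of the feature-reduction step referred to above, assuming scikit-learn's SelectKBest as the reduction technique and a small synthetic dataset so that the example runs on its own (neither is prescribed by the original text):

# Sketch: feature reduction before SVM/KNN or CNN classification.
# SelectKBest with an ANOVA F-score is an assumed choice of technique;
# the synthetic data below merely stands in for the heart dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=13, random_state=0)
selector = SelectKBest(score_func=f_classif, k=4)   # keep the 4 strongest features
X_reduced = selector.fit_transform(X, y)
print("Reduced shape:", X_reduced.shape)            # (200, 4)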

CHAPTER 2
LITERATURE REVIEW
2.1 RELATED WORK

2.1.1 PREDICTING SURVIVAL CAUSES AFTER OUT OF HOSPITAL CARDIAC ARREST USING DATA MINING METHOD

AUTHORS
FRANCK LE DUFF
CRISTIAN MUNTEAN
MARC CUGGIA
PHILIPPE MABO
In this paper [1] the authors stated that the prognosis of life for patients with heart failure
remains poor. By using data mining methods, the purpose of this study was to evaluate the most
important criteria for predicting patient survival and to profile patients to estimate their survival
chances together with the most appropriate technique for health care. METHODS: Five hundred
and thirty three patients who had suffered from cardiac arrest were included in the analysis.

They performed classical statistical analysis and data mining analysis using mainly Bayesian
networks. RESULTS: The mean age of the 533 patients was 63 (± 17) and the sample was
composed of 390 (73 %) men and 143 (27 %) women. Cardiac arrest was observed at home for 411
(77 %) patients, in a public place for 62 (12 %) patients and on a public highway for 60 (11 %)
patients. The belief network of the variables showed that the probability of remaining alive after
heart failure is directly associated to five variables: age, sex, the initial cardiac rhythm, the origin of
the heart failure and specialized resuscitation techniques employed.

CONCLUSIONS: Data mining methods could help clinicians to predict the survival of
patients and then adapt their practices accordingly. This work could be carried out for each medical
procedure or medical problem and it would become possible to build a decision tree rapidly with the
data of a service or a physician. The comparison between classic analysis and data mining analysis
showed us the contribution of the data mining method for sorting variables and quickly conclude on
the importance or the impact of the data and variables on the criterion of the study. The main limit
of the method is knowledge acquisition and the necessity to gather sufficient data to produce a
relevant model.

Cardiac arrest is defined as a spontaneous, irreversible arrest of the general circulation caused by cardiac inefficacy. It is recognized by the absence of the femoral pulse for more than 5 seconds.
Without resuscitation, cardiac arrest leads to sudden cardiac death. The public health impact of
sudden cardiac death is heavy since the survival rate is estimated at between 1 and 20 % for cardiac
arrest patients. This represents 300,000 to 400,000 deaths a year in the United States and 20,000 to
30,000 deaths in France. The profile of the patient is now well known since it generally concerns
men from about 45 to 75 years. Hospitalization must be optimal and fast. According to the type of
cardiac attack, the care procedure can vary, and some studies [11] show the benefit of various
techniques over others, according to the cause of the cardiac arrest. Heart
disease is the number one cause of death in the U.S. According to the American Heart Association,
an estimated 1,100,000 people in the U.S. will have a coronary attack each year.

Ninety five percent of sudden cardiac arrest victims die before reaching the hospital and
heart disease claims more lives each year than the following six leading causes of death combined
(cancer, chronic lower respiratory diseases, accidents, diabetes mellitus, influenza and pneumonia).
Almost 150,000 people in the U.S. who die from heart disease each year are under the age of 65.
These data show the interest for predicting the risk of death after heart failure and the need to
analyze the events that occurred during care to provide prognostic information. Classic statistical
analyses have already been done and provide some information about epidemiology of the heart
failure and causes of the failure. This paper presents the use of a probability in a statistical approach
in order to profile heart failure in a sample of patients and predict the impact of some events in the
care process.

They concluded that the use of the Bayesian network in medical analysis could be useful to explore data and to find hidden relationships between events or characteristics of the sample. It is a first approach for discussing hypotheses between clinicians and statistical experts. The main limit of these tools is the necessity to have enough data to find regularity in the relationships.

2.1.2 KNOWLEDGE DISCOVERY IN DATABASES: AN OVERVIEW


AUTHORS
WILLIAM J. FRAWLEY
GREGORY PIATETSKY-SHAPIRO
CHRISTOPHER J. MATHEUS
In this paper [2] the authors stated that after a decade of fundamental interdisciplinary
research in machine learning, the spadework in this field has been done; the 1990s should see the
widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The
contributors to the AAAI Press book Knowledge Discovery in Databases were excited at the
potential benefits of this research. The editors hope that some of this excitement will communicate
itself to AI Magazine readers of this article.
It has been estimated that the amount of information in the world doubles every 20 months.
The size and number of databases probably increases even faster. In 1989, the total number of
databases in the world was estimated at five million, although most of them are small DBASE III
databases. The automation of business activities produces an ever-increasing stream of data because
even simple transactions, such as a telephone call, the use of a credit card, or a medical test, are
typically recorded in a computer. Scientific and government databases are also rapidly growing.
The National Aeronautics and Space Administration already has much more data than it can
analyze. Earth observation satellites, planned for the 1990s, are expected to generate one terabyte
(10^15 bytes) of data every day — more than all previous missions combined. At a rate of one picture
each second, it would take a person several years (working nights and weekends) just to look at the
pictures generated in one day. In biology, the federally funded Human Genome project will store
thousands of bytes for each of the several billion genetic bases.
Closer to everyday lives, the 1990 U.S. census data of a million million bytes encode
patterns that in hidden ways describe the lifestyles and subcultures of today’s United States. What
are we supposed to do with this flood of raw data? Clearly, little of it will ever be seen by human
eyes. If it will be understood at all, it will have to be analyzed by computers. Although simple
statistical techniques for data analysis were developed long ago, advanced techniques for intelligent
data analysis are not yet mature. As a result, there is a growing gap between data generation and
data understanding. At the same time, there is a growing realization and expectation that data,
intelligently analyzed and presented, will be a valuable resource to be used for a competitive
advantage. The computer science community is responding to both the scientific and practical
challenges presented by the need to find the knowledge adrift in the flood of data.

In assessing the potential of AI technologies, Michie (1990), a leading European expert on machine learning, predicted that “the next area that is going to explode is the use of machine
learning tools as a component of large-scale data analysis.” A recent National Science Foundation
workshop on the future of database research ranked data mining among the most promising research
topics for the 1990s.

Some research methods are already well enough developed to have been made part of
commercially available software. Several expert system shells use variations of ID3 for inducing
rules from examples. Other systems use inductive, neural net, or genetic learning approaches to
discover patterns in personal computer databases. Many forward-looking companies are using these
and other tools to analyze their databases for interesting and useful patterns.

American Airlines searches its frequent flyer database to find its better customers, targeting
them for specific marketing promotions. Farm Journal analyzes its subscriber database and uses
advanced printing technology to custom-build hundreds of editions tailored to particular groups.
Several banks, using patterns discovered in loan and credit histories, have derived better loan
approval and bankruptcy prediction methods. General Motors is using a database of automobile
trouble reports to derive diagnostic expert systems for various models. Packaged-goods
manufacturers are searching the supermarket scanner data to measure the effects of their promotions
and to look for shopping patterns.

A combination of business and research interests has produced increasing demands for, as
well as increased activity to provide, tools and techniques for discovery in databases. This book is
the first to bring together leading-edge research from around the world on this topic. It spans many
different approaches to discovery, including inductive learning, Bayesian statistics, semantic query
optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets.

The book is aimed at those interested or involved in computer science and the management
of data, to both inform and inspire further research and applications. It will be of particular interest
to professionals working with databases and management information systems and to those
applying machine learning to real-world problems.

What Is Knowledge Discovery? There is an immense diversity of current research on knowledge discovery in databases. To provide a point of reference for this research, we begin here
by defining and explaining relevant terms.

Definition of Knowledge Discovery:

Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset F_S of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in F_S.

A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge. The output of a program that
monitors the set of facts in a database and produces patterns in this sense is discovered knowledge.
These definitions about the language, the certainty, and the simplicity and interestingness measures
are intentionally vague to cover a wide variety of approaches. Collectively, these terms capture our
view of the fundamental characteristics of discovery in databases. In the following paragraphs, we
summarize the connotations of these terms and suggest their relevance to the problem of knowledge
discovery in databases.
They noted that one of the earliest discovery processes was encountered by Jonathan Swift’s
Gulliver in his visit to the Academy of Lagado. The “Project for improving speculative Knowledge
by practical and mechanical operations” was generating sequences of words by random
permutations and “where they found three or four Words that might make Part of a Sentence, they
dictated them to . . . Scribes.” This process, although promising to produce many interesting
sentences in the (very) long run, is rather inefficient and was recently proved to be NP-hard.

2.1.3 ASSOCIATIVE CLASSIFICATION APPROACH FOR DIAGNOSING CARDIOVASCULAR DISEASE

AUTHORS
KIYONG NOH
HEON GYU LEE
HO-SUN SHON
BUM JU LEE
KEUN HO RYU

In this paper [3] the authors stated that ECG is a test that measures a heart’s electrical
activity, which provides valuable clinical information about the heart’s status. In this paper, they
proposed a classification method for extracting multi-parametric features by analyzing HRV from
ECG, data preprocessing and heart disease pattern. The proposed method is an associative classifier
based on the efficient FP-growth method.

Since the volume of patterns produced can be large, they offered a rule cohesion measure
that allows a strong push of pruning patterns in the pattern generating process. They conducted an
experiment for the associative classifier, which utilizes multiple rules and pruning, and biased
confidence (or cohesion measure) and dataset consisting of 670 participants distributed into two
groups, namely normal people and patients with coronary artery disease.

The most widely used signal in clinical practice is ECG (Electrocardiogram), which is
frequently recorded and widely used for the assessment of cardiac function. ECG processing
techniques have been proposed to support pattern recognition, parameter extraction, spectro-
temporal techniques for the assessment of the heart’s status, denoising, baseline correction and
arrhythmia detection. Control of the heart rate is known to be affected by the sympathetic and
parasympathetic nervous system.

It is reported that Heart Rate Variability (HRV) is related to autonomic nerve activity and is
used as a clinical tool to diagnose cardiac autonomic function in both health and disease. This paper
provides a classification technique that could automatically diagnose Coronary Artery Disease
(CAD) under the framework of ECG patterns and clinical investigations. Through the ECG pattern
we are able to recognize the features that could well reflect either the existence or non-existence of
CAD. Such features can be perceived through HRV analysis based on following knowledge:
1. In patients with CAD, reduction of the cardiac vagal activity evaluated by spectral HRV
analysis was found to correlate with the angiographic severity.
2. The reduction of variance (standard deviation of all normal RR intervals) and low-
frequency of HRV seem related to an increase in chronic heart failure.

They concluded that they had investigated the effectiveness and accuracy of a classification method for diagnosing ECG patterns. To achieve this purpose, they introduced an associative classifier that is further extended from CMAR by using a cohesion measure for pruning redundant rules.

Their classification method uses multiple rules to predict the highest probability classes for
each record. The proposed associative classifier can also relax the independence assumption of
some classifiers, such as NB (Naive Bayesian) and DT (Decision Tree). For example, the NB makes
the assumption of conditional independence, that is, given the class label of a sample, the values of
the attributes are conditionally independent of one another. When the assumption holds true, then
the NB is the most accurate in comparison with all other classifiers. In practice, however,
dependences can exist between variables of the real data. Their classifier can consider the
dependences of linear characteristics of HRV and clinical information. Finally, they implemented
their classifier and several different classifiers to validate their accuracy in diagnosing heart disease.

2.1.4 INTELLIGENT HEART DISEASE PREDICTION SYSTEM USING CANFIS AND GENETIC ALGORITHM

AUTHORS
LATHA PARTHIBAN
R. SUBRAMANIAN
In this paper [4] the authors stated that Heart disease (HD) is a major cause of morbidity and
mortality in the modern society. Medical diagnosis is an important but complicated task that should
be performed accurately and efficiently and its automation would be very useful. All doctors are
unfortunately not equally skilled in every sub specialty and they are in many places a scarce
resource. A system for automated medical diagnosis would enhance medical care and reduce costs.
In this paper, a new approach based on coactive neuro-fuzzy inference system (CANFIS) was
presented for prediction of heart disease. The proposed CANFIS model combined the neural
network adaptive capabilities and the fuzzy logic qualitative approach which is then integrated with
genetic algorithm to diagnose the presence of the disease.

The performances of the CANFIS model were evaluated in terms of training performances
and classification accuracies and the results showed that the proposed CANFIS model has great
potential in predicting the heart disease.

A major challenge facing healthcare organizations (hospitals, medical centers) is the provision of quality services at affordable costs. Quality service implies diagnosing patients
correctly and administering treatments that are effective.

Poor clinical decisions can lead to disastrous consequences which are therefore
unacceptable. Clinical decisions are often made based on doctors’ intuition and experience rather
than on the knowledge rich data hidden in the database. This practice leads to unwanted biases,
errors and excessive medical costs which affects the quality of service provided to patients. Wu, et
al proposed that integration of clinical decision support with computer-based patient records could
reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve
patient outcome [12].

Most hospitals today employ some sort of hospital information systems to manage their
healthcare or patient data [13]. Unfortunately, these data are rarely used to support clinical decision
making. The main objective of this research is to develop a prototype Intelligent Heart Disease
Prediction System with CANFIS and genetic algorithm using historical heart disease databases to
make intelligent clinical decisions which traditional decision support systems cannot.

The cost of management of HD is a significant economic burden, and so prevention of heart disease is a very important step in its management. Prevention of HD can be approached in many
ways including health promotion campaigns, specific protection strategies, life style modification
programs, early detection and good control of risk factors and constant vigilance of emerging risk
factors. The CANFIS model integrates adaptable fuzzy inputs with a modular neural network to
rapidly and accurately approximate complex functions. Fuzzy inference systems are also valuable,
as they combine the explanatory nature of rules (MFs) with the power of neural networks. These
kinds of networks solve problems more efficiently than neural networks when the underlying
function to model is highly variable or locally extreme [14].

They concluded that, from their studies, they had managed to achieve their research objectives. The available heart disease dataset from the UCI Machine Learning Repository was studied, preprocessed and cleaned to prepare it for the classification process. Coactive neuro-fuzzy modeling was proposed as a dependable and robust method developed to identify a nonlinear relationship and mapping between the different attributes. It has been shown that GA is a very useful technique for auto-tuning of the CANFIS parameters and selection of the optimal feature set. The fact remains that computers cannot replace humans; by comparing the computer-aided detection results with the pathologic findings, doctors can learn more about the best way to evaluate the areas that computer-aided detection highlights.

2.1.5 INTELLIGENT HEART DISEASE PREDICTION SYSTEM USING DATA MINING TECHNIQUES

AUTHORS
SELLAPPAN PALANIAPPAN
RAFIAH AWANG
In this paper [5] the authors stated that the healthcare industry collects huge amounts of
healthcare data which, unfortunately, are not “mined” to discover hidden information for effective
decision making. Discovery of hidden patterns and relationships often goes unexploited. Advanced
data mining techniques can help remedy this situation. This research has developed a prototype
Intelligent Heart Disease Prediction System (IHDPS) using data mining techniques, namely,
Decision Trees, Naïve Bayes and Neural Network.

Results show that each technique has its unique strength in realizing the objectives of the
defined mining goals. IHDPS can answer complex “what if” queries which traditional decision
support systems cannot. Using medical profiles such as age, sex, blood pressure and blood sugar it
can predict the likelihood of patients getting a heart disease. It enables significant knowledge, e.g.
patterns, relationships between medical factors related to heart disease, to be established. IHDPS is
Web-based, user-friendly, scalable, reliable and expandable. It is implemented on the .NET
platform.

A major challenge facing healthcare organizations (hospitals, medical centers) is the provision of quality services at affordable costs. Quality service implies diagnosing patients
correctly and administering treatments that are effective. Poor clinical decisions can lead to
disastrous consequences which are therefore unacceptable. Hospitals must also minimize the cost of
clinical tests. They can achieve these results by employing appropriate computer-based information
and/or decision support systems.

Most hospitals today employ some sort of hospital information systems to manage their
healthcare or patient data [15]. These systems typically generate huge amounts of data which take
the form of numbers, text, charts and images. Unfortunately, these data are rarely used to support
clinical decision making. There is a wealth of hidden information in these data that is largely
untapped. This raises an important question: “How can we turn data into useful information that can
enable healthcare practitioners to make intelligent clinical decisions?” This is the main motivation
for this research.

Many hospital information systems are designed to support patient billing, inventory
management and generation of simple statistics. Some hospitals use decision support systems, but
they are largely limited. They can answer simple queries like “What is the average age of patients
who have heart disease?”, “How many surgeries had resulted in hospital stays longer than 10
days?”, “Identify the female patients who are single, above 30 years old, and who have been treated
for cancer.” However, they cannot answer complex queries like “Identify the important preoperative
predictors that increase the length of hospital stay”, “Given patient records on cancer, should
treatment include chemotherapy alone, radiation alone, or both chemotherapy and radiation?”, and
“Given patient records, predict the probability of patients getting a heart disease.”

Clinical decisions are often made based on doctors’ intuition and experience rather than on
the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and
excessive medical costs which affects the quality of service provided to patients. Wu, et al proposed
that integration of clinical decision support with computer-based patient records could reduce
medical errors, enhance patient safety, decrease unwanted practice variation, and improve patient
outcome [16]. This suggestion is promising as data modelling and analysis tools, e.g., data mining,
have the potential to generate a knowledge-rich environment which can help to significantly
improve the quality of clinical decisions.

The main objective of this research is to develop a prototype Intelligent Heart Disease
Prediction System (IHDPS) using three data mining modeling techniques, namely, Decision Trees,
Naïve Bayes and Neural Network. IHDPS can discover and extract hidden knowledge (patterns and
relationships) associated with heart disease from a historical heart disease database.

It can answer complex queries for diagnosing heart disease and thus assist healthcare
practitioners to make intelligent clinical decisions which traditional decision support systems
cannot. By providing effective treatments, it also helps to reduce treatment costs. To enhance
visualization and ease of interpretation, it displays the results both in tabular and graphical forms.
IHDPS uses the CRISP-DM methodology to build the mining models. It consists of six major
phases: business understanding, data understanding, data preparation, modeling, evaluation, and
deployment. Business understanding phase focuses on understanding the objectives and
requirements from a business perspective, converting this knowledge into a data mining problem
definition, and designing a preliminary plan to achieve the objectives.

Data understanding phase uses the raw data and proceeds to understand the data, identify
its quality, gain preliminary insights, and detect interesting subsets to form hypotheses for hidden
information. Data preparation phase constructs the final dataset that will be fed into the modeling
tools. This includes table, record, and attribute selection as well as data cleaning and transformation.
The modeling phase selects and applies various techniques, and calibrates their parameters to
optimal values. The evaluation phase evaluates the model to ensure that it achieves the business
objectives. The deployment phase specifies the tasks that are needed to use the models [17]. Data
Mining Extension (DMX), a SQL-style query language for data mining, is used for building and
accessing the models’ contents. Tabular and graphical visualizations are incorporated to enhance
analysis and interpretation of results.

A total of 909 records with 15 medical attributes (factors) were obtained from the Cleveland
Heart Disease database [18]. Figure 2.1 lists the attributes. The records were split equally into two
datasets: training dataset (455 records) and testing dataset (454 records). To avoid bias, the records
for each set were selected randomly. For the sake of consistency, only categorical attributes were
used for all the three models. All the non-categorical medical attributes were transformed to
categorical data. The attribute “Diagnosis” was identified as the predictable attribute with value “1”
for patients with heart disease and value “0” for patients with no heart disease. The attribute
“PatientID” was used as the key; the rest are input attributes. It is assumed that problems such as
missing data, inconsistent data, and duplicate data have all been resolved.
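A hedged illustration of the kind of categorical transformation described above (the attribute, bin edges and labels here are illustrative, not the paper's actual discretization):

# Sketch: turning a non-categorical medical attribute into categorical data,
# as done for the IHDPS models. Bin edges and labels are illustrative only.
import pandas as pd

ages = pd.Series([29, 41, 54, 63, 70], name="Age")
age_band = pd.cut(ages, bins=[0, 40, 55, 120], labels=["young", "middle", "older"])
print(age_band)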
Mining models:
Data Mining Extension (DMX) query language was used for model creation, model
training, model prediction and model content access. All parameters were set to the default setting
except for parameters “Minimum Support = 1” for Decision Tree and “Minimum Dependency
Probability = 0.005” for Naïve Bayes [19]. The trained models were evaluated against the test
datasets for accuracy and effectiveness before they were deployed in IHDPS. The models were
validated using Lift Chart and Classification Matrix.

They concluded that a prototype heart disease prediction system is developed using three
data mining classification modeling techniques. The system extracts hidden knowledge from a
historical heart disease database. DMX query language and functions are used to build and access
the models.

Predictable attribute
1. Diagnosis (value 0: < 50% diameter narrowing (no heart disease); value 1: > 50%
diameter narrowing (has heart disease))

Key attribute
1. PatientID – Patient’s identification number

Input attributes
1. Sex (value 1: male; value 0: female)
2. Chest Pain Type (value 1: typical angina; value 2: atypical angina; value 3: non-anginal pain; value 4: asymptomatic)
3. Fasting Blood Sugar (value 1: > 120 mg/dl; value 0: < 120 mg/dl)
4. Restecg – resting electrocardiographic results (value 0: normal; value 1: having ST-T wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
5. Exang – exercise induced angina (value 1: yes; value 0: no)
6. Slope – the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat; value 3: downsloping)
7. CA – number of major vessels colored by fluoroscopy (value 0 – 3)
8. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
9. Trest Blood Pressure (mm Hg on admission to the hospital)
10. Serum Cholesterol (mg/dl)
11. Thalach – maximum heart rate achieved
12. Oldpeak – ST depression induced by exercise relative to rest
13. Age in years

FIGURE 2.1 DESCRIPTIONS OF ATTRIBUTES

The models are trained and validated against a test dataset. Lift Chart and Classification
Matrix methods are used to evaluate the effectiveness of the models. All three models are able to
extract patterns in response to the predictable state. The most effective model to predict patients
with heart disease appears to be Naïve Bayes followed by Neural Network and Decision Trees.

Five mining goals are defined based on business intelligence and data exploration. The goals
are evaluated against the trained models. All three models could answer complex queries, each with

its own strength with respect to ease of model interpretation, access to detailed information and
accuracy. Naïve Bayes could answer four out of the five goals; Decision Trees, three; and Neural
Network, two. Although not the most effective model, Decision Trees results are easier to read and
interpret. The drill through feature to access detailed patients’ profiles is only available in Decision
Trees.

Naïve Bayes fared better than Decision Trees as it could identify all the significant medical
predictors. The relationship between attributes produced by Neural Network is more difficult to
understand. IHDPS can be further enhanced and expanded. For example, it can incorporate other
medical attributes besides the 15 listed in Figure 2.1. It can also incorporate other data mining
techniques, e.g., Time Series, Clustering and Association Rules. Continuous data can also be used
instead of just categorical data. Another area is to use Text Mining to mine the vast amount of
unstructured data available in healthcare databases. Another challenge would be to integrate data
mining and text mining [8].

2.1.6 INTELLIGENT AND EFFECTIVE HEART ATTACK PREDICTION SYSTEM USING DATA MINING AND ARTIFICIAL NEURAL NETWORK

AUTHORS
B.P. SHANTHAKUMAR
Y.S. KUMARASWAMY

In this paper [6] the authors stated that the diagnosis of diseases is a vital and intricate job in
medicine. The recognition of heart disease from diverse features or signs is a multi-layered problem
that is not free from false assumptions and is frequently accompanied by impulsive effects. Thus the
attempt to exploit knowledge and experience of several specialists and clinical screening data of
patients composed in databases to assist the diagnosis procedure is regarded as a valuable option.
This research work is the extension of our previous research with intelligent and effective heart
attack prediction system using neural network. A proficient methodology for the extraction of
significant patterns from the heart disease warehouses for heart attack prediction has been
presented.

Initially, the data warehouse is pre-processed in order to make it suitable for the mining
process. Once the preprocessing gets over, the heart disease warehouse is clustered with the aid of
the K-means clustering algorithm, which will extract the data appropriate to heart attack from the

warehouse. Consequently the frequent patterns applicable to heart disease are mined with the aid of
the MAFIA algorithm from the data extracted. In addition, the patterns vital to heart attack
prediction are selected on the basis of the computed significant weightage.
The neural network is trained with the selected significant patterns for the effective
prediction of heart attack. They have employed the Multi-layer Perceptron Neural Network with
Back-propagation as the training algorithm. The results thus obtained have illustrated that the
designed prediction system is capable of predicting the heart attack effectively.

2.1.7 PERFORMANCE EVALUATION OF K-MEANS AND HIERARCHICAL CLUSTERING IN TERMS OF ACCURACY AND RUNNING TIME

AUTHORS
NIDHI SINGH
DIVAKAR SINGH
In this paper [7] the authors stated that, as a large number of modern techniques for scientific data collection are evolving in today's world, large amounts of data are accumulating in various databases. Systematic data analysis methods are needed to gain and extract useful information from rapidly growing databanks. Cluster analysis is one of the main analytical methods in data mining; among clustering methods, the k-means algorithm is the most widely used for many applications.

Clustering algorithms are divided into two categories: partition and hierarchical clustering algorithms. This paper discusses one partition clustering algorithm (k-means) and one hierarchical clustering algorithm (agglomerative). The k-means algorithm has high efficiency and scalability and converges quickly when dealing with large data sets. Hierarchical clustering constructs a hierarchy of clusters by either repeatedly merging two smaller clusters into a larger one or splitting a larger cluster into smaller ones. Using the WEKA data mining tool, the authors calculated the performance of the k-means and hierarchical clustering algorithms on the basis of accuracy and running time.

As a result of modern methods for scientific data collection, huge quantities of data are accumulating in various databases. Cluster analysis [9] is one of the major data analysis methods; it helps to identify the natural groupings in a set of data items. The k-means clustering algorithm, proposed by MacQueen in 1967, is a partition-based cluster analysis method. Clustering is a way of classifying raw data reasonably and searching for hidden patterns that may exist in datasets [10]. It is a process of grouping data objects into disjoint clusters so that data in the same cluster are similar, while data belonging to different clusters differ. K-means is a numerical, unsupervised, non-deterministic, iterative method. It is simple and very fast, so in many practical applications the method has proved to be a very effective way to produce good clustering results.
The demand for organizing sharply increasing amounts of data and learning valuable information from them has made clustering techniques widely applied in many application areas such as artificial intelligence, biology, customer relationship management, data compression, data mining, information retrieval, image processing, machine learning, marketing, medicine, pattern recognition, psychology, statistics and so on.

Clustering methods can be divided into two general classes, designated supervised and
unsupervised clustering. In this paper, we focus on unsupervised clustering which may again be
separated into two major categories: partition clustering and hierarchical clustering. There are many
algorithms for partition clustering category, such as kmeans clustering (MacQueen 1967), k-medoid
clustering, genetic k-means algorithm (GKA), Self-Organizing Map (SOM) and also graph-
theoretical methods (CLICK, CAST).

Among those methods, K-means clustering is the most popular one because of simple
algorithm and fast execution speed. Hierarchical clustering methods are among the first methods
developed and analyzed for clustering problems. There are two main approaches. (i) The
agglomerative approach, which builds a larger cluster by merging two smaller clusters in a bottom-
up fashion. The clusters so constructed form a binary tree; individual objects are the leaf nodes and
the root node is the cluster that has all data objects. (ii) The divisive approach, which splits a cluster
into two smaller ones in a top-down fashion. All clusters so constructed also form a binary tree.

The process of the k-means algorithm: this part briefly describes the standard k-means algorithm. K-means is a typical clustering algorithm in data mining and is widely used for clustering large sets of data. It is one of the simplest unsupervised learning algorithms, applied to solve the well-known clustering problem. It is a partitioning clustering algorithm; the method classifies the given data objects into k different clusters through an iterative process that converges to a local minimum, so the resulting clusters are compact and independent. The algorithm consists of two separate phases. The first phase selects k centers randomly, where the value of k is fixed in advance. The next phase assigns each data object to the nearest center; Euclidean distance is generally used to determine the distance between each data object and the cluster centers. When all the data objects have been assigned to some cluster, the first step is completed and an early grouping is done. The averages of the early formed clusters are then recalculated. This iterative process continues repeatedly
until the criterion function reaches its minimum. Supposing that the target object is x and \bar{x}_i denotes the average (centroid) of cluster C_i, the criterion function is defined as follows (eq. 1):

E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2        (1)

E is the sum of the squared error over all objects in the database. The distance used in the criterion function is the Euclidean distance, which determines the nearest distance between each data object and a cluster center. The Euclidean distance d(x, y) between one vector x = (x_1, x_2, …, x_n) and another vector y = (y_1, y_2, …, y_n) is obtained as follows (eq. 2):

d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}        (2)
The k-means clustering algorithm always converges to a local minimum. Before the k-means algorithm converges, the distance and cluster-center calculations are executed in loops a number of times, where the positive integer t is known as the number of k-means iterations. The precise value of t varies depending on the initial starting cluster centers. The distribution of data points has a relationship with the new clustering center, so the computational time complexity of the k-means algorithm is O(nkt), where n is the number of data objects, k is the number of clusters and t is the number of iterations of the algorithm, usually with k << n and t << n [1]. The reason for choosing the k-means algorithm for study is its popularity, which rests on the following properties:

Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters and l is the number of iterations taken by the algorithm to converge.
It is order independent: for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.
Its space complexity is O(n + k); it requires additional space to store the data matrix.
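As a minimal illustration of the two-phase procedure described above (a sketch using scikit-learn on synthetic data; the number of clusters and the data themselves are assumptions made only so the example runs):

# Sketch: standard k-means on synthetic data, illustrating random center
# initialization followed by iterative assignment and centroid recalculation
# until the criterion function E stops improving.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated groups of 2-D points (illustrative only).
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("Cluster centers:", kmeans.cluster_centers_)
print("Sum of squared errors (criterion E):", kmeans.inertia_)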

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM


The existing system focuses on the KNN classification algorithm to effectively detect heart categories in the dataset. The dataset is taken from the UCI repository. Preprocessing steps such as zero-value, N/A-value and Unicode-character elimination are not carried out here. Important features are extracted for better classification. A confusion matrix is prepared along with the accuracy score calculation.

3.1.1. DRAWBACKS
• KNN classification is not considered in a way that makes it possible to estimate the probability of (disease) yes/no for records in the given new test data.
• Feature reduction before disease identification is not carried out.
• Only data columns with numeric values are taken for KNN classification.

3.2 PROPOSED SYSTEM AND ITS ADVANTAGES


The proposed system focuses on the SVM classification algorithm as well as a neural network to effectively detect risk categories in the dataset records. The dataset is taken from the UCI repository and preprocessed, for example by Unicode removal. Important features are extracted for better classification. A confusion matrix is prepared along with the accuracy score calculation, and accuracy prediction is also carried out. A convolutional neural network based prediction model is worked out to assess algorithm efficiency; 540 training records and 210 test records are used for the convolutional neural network training.
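A minimal sketch of the SVM/KNN part of this pipeline, assuming scikit-learn and a small synthetic dataset in place of the preprocessed heart records (both are assumptions made only so the snippet runs on its own):

# Sketch: SVM and KNN classification with confusion matrix and accuracy score,
# as described above. In the real pipeline X and y would come from the
# preprocessed heart dataset; here they are synthetic stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = make_classification(n_samples=750, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for clf in (SVC(kernel="rbf"), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(type(clf).__name__)
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))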

3.2.2 ADVANTAGES
• SVM classification is considered, so the probability of the safe/risk category can be estimated for the given new test data.
• Feature reduction before disease identification is carried out.
• SVM performs well even if the dataset is large.
• A convolutional neural network based prediction model is worked out to assess algorithm efficiency.

3.3 FEASIBILITY STUDY


The feasibility study deals with all the analysis undertaken in developing the project. Each structure has to be thought through during development, as it has to serve the end user in a user-friendly manner. One must know the type of information to be gathered, and the system analysis consists of collecting, organizing and evaluating facts about a system and its environment. The main objective of the system analysis is to study the existing operation and to learn and accomplish the processing activities. The disease presence records need to be analyzed well. The details are processed through the code itself and are controlled by the programs alone.

3.3.1 ECONOMIC FEASIBILITY


The organization has to buy a personal computer with a keyboard and a mouse; this is a direct cost. There are many direct benefits of converting the manual system to a computerized system. The user can be given responses on asking questions. The justification of any capital outlay is that it will reduce expenditure or improve the quality of service or goods, which in turn may be expected to provide increased profits.

3.3.2 OPERATIONAL FEASIBILITY


The proposed system's processes solve the problems that occurred in the existing system. The current day-to-day operations of the organization can fit into this system. Operational feasibility mainly includes an analysis of how the proposed system will affect the organizational structures and procedures.

3.3.3 TECHNICAL FEASIBILITY


From the cost and benefit analysis it may be concluded that a computerized system is favorable in today's fast-moving world. The assessment of technical feasibility must be based on an outline design of the system requirements in terms of input, output, files, programs and procedures. The project aims to identify whether disease is present or not. Feature reduction and SVM/KNN classification of the heart dataset are to be processed. The current system aims to overcome the problems of the existing system and to reduce the technical skill requirements so that more users can access the application.

CHAPTER - 4
SYSTEM SPECIFICATION

4.1 HARDWARE REQUIREMENTS


This section gives the details and specification of the hardware on which the system is
expected to work.
Processor : Intel Dual Core Processor
RAM : 4 GB SD RAM
Monitor : 17” Color
Hard disk : 500 GB
Keyboard : Standard 102 keys
Mouse : Optical mouse

4.2 SOFTWARE REQUIREMENTS


This section gives the details of the software that are used for the development.
Operating System : Windows 10 Pro
Environment : IDLE/CoLabs
Language : Python 3.7

CHAPTER - 5
SOFTWARE DESCRIPTION

5.1 GOOGLE COLAB


If you have used Jupyter notebook previously, you would quickly learn to use Google
Colab. To be precise, Colab is a free Jupyter notebook environment that runs entirely in the cloud.
Most importantly, it does not require a setup and the notebooks that you create can be
simultaneously edited by your team members just the way you edit documents in Google Docs.
Colab supports many popular machine learning libraries which can be easily loaded in your
notebook.

As a programmer, you can perform the following using Google Colab.


Write and execute code in Python
Document your code that supports mathematical equations
Create/Upload/Share notebooks
Import/Save notebooks from/to Google Drive (see the example after this list)
Import/Publish notebooks from GitHub
Import external datasets e.g. from Kaggle
Integrate PyTorch, TensorFlow, Keras, OpenCV
Free Cloud service with free GPU
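For instance, pulling a dataset from Google Drive into a Colab notebook typically takes just a few lines (the CSV path below is an assumed example, not part of this project's code):

# Mount Google Drive inside a Colab notebook and read a dataset from it.
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/heart.csv')
print(df.head())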

Google is quite aggressive in AI research. Over many years, Google developed the AI framework called TensorFlow and a development tool called Colaboratory. Today TensorFlow is open-sourced, and since 2017 Google has made Colaboratory free for public use. Colaboratory is now known as Google Colab or simply Colab.

Another attractive feature that Google offers to developers is the use of GPUs. Colab supports GPU and it is totally free. The reason for making it free for the public could be to make its software a standard in academia for teaching machine learning and data science. It may also have the long-term perspective of building a customer base for Google Cloud APIs, which are sold on a per-use basis. Irrespective of the reasons, the introduction of Colab has eased the learning and development of machine learning applications. Google Colab is a powerful platform for learning and quickly developing machine learning models in Python. It is based on the Jupyter notebook and supports collaborative development.

Team members can share and concurrently edit the notebooks, even remotely. The notebooks can also be published on GitHub and shared with the general public. Colab supports many popular ML libraries such as PyTorch, TensorFlow, Keras and OpenCV. The restriction as of today is that it does not yet support R or Scala. There are also limits on session length and size. Considering the benefits, these are small sacrifices one needs to make.

Packages
Google Colab comes pre-loaded with the most popular Python libraries and tools for data science. Some of the biggest libraries included are listed below, followed by a short example of using them together.

• Pandas - a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool.
• NumPy - the core library for scientific computing, which contains a powerful N-dimensional array object.
• Matplotlib - a plotting library for the Python programming language.
• Scikit-learn - a machine learning library for Python with classification, regression and clustering algorithms.
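A small illustrative snippet (not project code) that touches each of the libraries above:

# Illustrative use of the pre-installed libraries listed above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})
LogisticRegression().fit(df[["x"]], df["y"])   # scikit-learn estimator
df.plot(x="x", y="y", kind="scatter")          # plotting via pandas/Matplotlib
plt.show()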

CHAPTER 6
PROJECT DESCRIPTION

6.1. PROBLEM DEFINITION


Despite extensive research showing a correlation between indoor air pollution and the aggravation of heart conditions, the present methods do not provide a personalized risk assessment in real time based on indoor air quality; such assessment is still at an infant state.

The new system should sort out the current drawbacks, and thus the proposed convolutional neural network (CNN) architecture should be considered with a matrix input for predicting heart risk. The proposed CNN architecture has 4 hidden layers comprising 2 convolutional layers and 2 fully connected layers. The same network has been used in the IoT implementation and experimental evaluations. The input layer has 4 features and the size of the input layer is 4 × 1. The first and the second convolutional layers use 64 feature maps to allow superior learning of the input features.

The kernel for both convolutional layers has a size of 1 × 1, and the convolution is performed with a stride of one in both layers. The fully connected layers have 128 neurons each. All the activation functions are ReLU, and the output layer, with a linear activation function, has one neuron. The resulting network has around 1.5 million learnable parameters. The CNN loss function is the mean squared error, minimized using the Adam optimizer.
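A minimal Keras sketch of the architecture described above is given here. The layer sizes follow the text (4 × 1 input, two 64-filter convolutions with 1 × 1 kernels and stride one, two fully connected layers of 128 neurons, ReLU activations, one linear output neuron, MSE loss with the Adam optimizer); everything else, such as the function name, is an illustrative assumption rather than the project's exact code.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_features=4):
    model = models.Sequential([
        layers.Input(shape=(input_features, 1)),                  # 4 x 1 matrix input
        layers.Conv1D(64, kernel_size=1, strides=1, activation='relu'),
        layers.Conv1D(64, kernel_size=1, strides=1, activation='relu'),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(1, activation='linear'),                     # single linear output neuron
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])  # MSE loss, Adam optimizer
    return model

model = build_cnn()
model.summary()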

Hence, an application that identifies an essential subset of categories (risk/safe) and uses it to effectively classify heart records would be satisfactory.

It should achieve high detection accuracy and efficiency while analyzing a minimal number of categories, and it should eliminate the need to analyze records that have little or no significant influence on risk-detection effectiveness.

6.2 OVERVIEW OF PROJECT

In day-to-day life, many factors affect the human heart. Problems are occurring at a rapid pace and new heart diseases are being identified just as rapidly. In today's stressful world, the heart, an essential organ that pumps blood through the body for circulation, must be kept healthy for healthy living. The main motivation for this project is to present a heart disease prediction model for predicting the occurrence of heart disease. Further, this research work is aimed at identifying the best classification algorithm for identifying the possibility of heart disease in a patient.

Identifying the possibility of heart disease in a person is a complicated task for medical practitioners because it requires years of experience and intense medical tests. The main objective of this research work is to identify the classification algorithm that provides maximum accuracy when classifying normal and abnormal persons.

A convolutional neural network (CNN) architecture is used to map the relationship between the indoor PM and weather data and the target values. The proposed method is compared with state-of-the-art deep neural network (DNN) based techniques in terms of the root mean square and mean absolute error accuracy measures. In addition, support vector machine based classification and K-Nearest Neighbor based classification are also carried out and their accuracy is measured. The applied SVM, KNN and CNN classification helps to predict heart disease with higher accuracy on new data. The coding language used is Python 3.7.

6.3 MODULE DESCRIPTION


The following modules are present in the proposed application.
1. DATA SET COLLECTION
2. DATA SET SUBSETTING BASED ON HEART DISEASE ATTRIBUTES
3. SVM/KNN CLASSIFICATION
4. CNN CLASSIFICATION

1. DATA SET COLLECTION


The dataset, which contains the columns (Age, Sex, Cp_Chestpaintype, t_restingbloodpressure_d, serumcholestrolinmg, fastingbloodsugarlevel, restecg, thalach, exang, oldpeak, peakslope, numvessels, thal, classfactor), is saved as records in a single Excel workbook. This is the input for the project.
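For illustration, the records can be loaded into a pandas DataFrame as sketched below; the file name heartdisease.csv is taken from the appendix code, and the column order is assumed to follow the list above.

import pandas as pd

# Load the collected heart disease records
df = pd.read_csv('heartdisease.csv')
print(df.columns.values)   # Age, Sex, Cp_Chestpaintype, ..., classfactor
print(df.shape)            # number of records and attributes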

2. DATA SET SUBSETTING BASED ON RISK TYPES


The dataset, which contains the columns (Age, Sex, Cp_Chestpaintype, t_restingbloodpressure_d, serumcholestrolinmg, fastingbloodsugarlevel, restecg, thalach, exang, oldpeak, peakslope, numvessels, thal, classfactor), is saved as records in a single Excel workbook. This is the input for the project, in which 540 records (Safe 0 and Risk 1 collectively) are split off for training and 210 records (Safe 0 and Risk 1 collectively) for testing before being given to the neural network.
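A rough sketch of this split is shown below; the sizes (540 training, 210 testing) follow the text, while the column names, the stratification and the random seed are assumptions for illustration.

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv('heartdisease.csv')
X = df.drop(columns=['classfactor'])   # attribute columns
y = df['classfactor']                  # 0 = Safe, 1 = Risk

# 540 training records and 210 testing records, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=540, test_size=210, stratify=y, random_state=42)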

3. SVM/KNN CLASSIFICATION
In this module, 80% of the given data set is taken as training data and 20% as test data. The text (categorical) columns are converted into numerical values. The model is then trained on the training data and used to predict the test data, classifying each record as disease present or absent.
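A minimal sketch of this module is given below, assuming X and y have been prepared as in the earlier sketches; the 80/20 split follows the text, while the scaling step and parameter values (for example n_neighbors=3) are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

# Scale features so SVM and KNN are not dominated by large-valued columns
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

for name, clf in [('SVM', SVC()), ('KNN', KNeighborsClassifier(n_neighbors=3))]:
    clf.fit(X_train_s, y_train)
    y_pred = clf.predict(X_test_s)
    print(name, 'accuracy:', round(accuracy_score(y_test, y_pred), 3))
    print(confusion_matrix(y_test, y_pred))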

4. CNN CLASSIFICATION
Here the dataset is taken first. The data is stored as CSV (Comma Separated Values), and each record contains the attribute values for one heart disease case. Data Encoding: the categorical column (the label in our case) is converted into numerical values, giving the variables required for model training. Tokenization divides a large piece of continuous text into distinct units, or tokens. Here the columns are handled separately on a temporal basis as a pipeline to obtain good accuracy.

Generating Word Embedding: word embedding allows tokens with similar meanings to have a similar representation; each individual token is represented as a real-valued vector in a predefined vector space, for which glove.6B.50d.txt is used. Creating Model Architecture: TensorFlow is used to create the model, employing the Keras Embedding layer to map the original input data into a set of real-valued dimensions. Model Evaluation and Prediction: the detection model is built using TensorFlow and then tested on unseen records, predicting whether the risk is present or not. Thus the detection model is created with TensorFlow in Python.
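For reference, a hedged sketch of the embedding step described above is shown below. The file name glove.6B.50d.txt comes from the text; the vocabulary size, sequence length and the texts variable are illustrative assumptions, not the project's exact code.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

# 'texts' is assumed to hold the raw string values to be tokenized
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=20)

# Load the predefined GloVe vector space
embeddings = {}
with open('glove.6B.50d.txt', encoding='utf8') as fh:
    for line in fh:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

# Build the matrix used to initialise the Keras Embedding layer
embedding_matrix = np.zeros((5000, 50))
for word, idx in tokenizer.word_index.items():
    if idx < 5000 and word in embeddings:
        embedding_matrix[idx] = embeddings[word]

embedding_layer = Embedding(5000, 50, weights=[embedding_matrix], trainable=False)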

Dropout, Batch-normalization and Flatten layers are also used in addition to the convolutional and dense layers. The Flatten layer converts the output of the convolutional layers into a one-dimensional feature vector; this is important because Dense (fully connected) layers only accept a feature vector as input. The Dropout and Batch-normalization layers help prevent the model from overfitting. Once the model is created, it is compiled using ‘model.compile’. The model is trained for just ten epochs, but the number of epochs can be increased. After the training process is completed, predictions are made on the test set, and the accuracy value is displayed during the iterations.
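A minimal sketch of the compile/train/evaluate step follows; model, X_train, X_test, y_train and y_test are assumed to come from the earlier steps, the ten epochs follow the text, and the remaining settings are assumptions.

# Compile and train the model for ten epochs, then evaluate on the test set
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

loss, acc = model.evaluate(X_test, y_test)
print('Test accuracy:', round(acc, 3))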

6.4 INPUT DESIGN


Input design is the process of converting user-originated inputs to a computer-understandable format. It is one of the most expensive phases of operating a computerized system and is often the major problem area of a system. A large number of problems with a system can usually be traced back to faulty input design and methods. Every aspect of input design should therefore be analyzed and designed with utmost care. The system takes input from the users, processes it and produces an output. Input design is the link that ties the information system into the world of its users, so the system should be user-friendly in gathering the appropriate information from the user. The decisions made during input design are:
• To provide a cost-effective method of input.
• To achieve the highest possible level of accuracy.
• To ensure that the input is understood by the user.

System analysis decides the input design details such as what data to input, what medium to use, how the data should be arranged or coded, which data items and transactions need validation to detect errors, and finally the dialogue to guide the user in providing input. The input data of a system is not necessarily raw data captured in the system from scratch.

These can also be the output of another system or subsystem. The design of input covers all the phases of input, from the creation of the initial data to the actual entry of the data into the system for processing. It involves identifying the data needed, specifying the characteristics of each data item, capturing and preparing data for computer processing and ensuring the correctness of the data. Any ambiguity in input leads to a total fault in output. The goal of designing the input data is to make data entry as easy and error-free as possible.

6.5 OUTPUT DESIGN


Output design generally refers to the results and information generated by the system. For many end users, output is the main reason for developing the system and the basis on which they evaluate the usefulness of the application.

The output is designed in such a way that it is attractive, convenient and informative. The code is written with various features that make the console output more pleasing.

As the outputs are the most important source of information for the users, a better design improves the system's relationship with the user and also helps in decision-making. Form design elaborates the way output is presented and the layout available for capturing information.

6.6 SYSTEM FLOW DIAGRAM

MACHINE LEARNING-BASED HEART DISEASE PREDICTION

The flow is as follows: the dataset is downloaded from the UCI repository and preprocessed (N/A values are removed). The preprocessed data is given to SVM/KNN classification. The records are also subset into training and testing sets for the CNN, feature reduction is applied, and CNN classification is carried out on the reduced feature set.

FIG 6.6.1 SYSTEM FLOW DIAGRAM



CHAPTER 7
EXPERIMENT AND RESULTS

• A Convolutional Neural Network based prediction model is worked out to measure algorithm efficiency.
• The Convolutional Neural Network reduces features automatically; 540 records are taken as training data and 210 records as testing data.
• KNN classification is applied so that the probability of the safe/risk category can be estimated for new test data.
• When compared to SVM, KNN yields higher accuracy on the same dataset.
• Feature reduction is carried out before risk identification.
• KNN performs well even when the dataset is large.

CHAPTER 8

8.1 CONCLUSION

The project focuses on the SVM classification algorithm to effectively detect heart disease types. The dataset is taken from the UCI repository. Preprocessing such as zero-value, N/A-value and Unicode-character elimination is carried out. Important features are extracted for better classification, and a confusion matrix is prepared along with an accuracy score. In addition, KNN classification as well as a neural network are applied to effectively detect risk types; the dataset is again preprocessed (Unicode removal) and important features are extracted for better classification.

A confusion matrix is prepared with an accuracy score calculation, and accuracy prediction is also carried out. A Convolutional Neural Network based prediction model is worked out to measure algorithm efficiency, with 540 training records and 210 test records used for training the convolutional neural network. There are several directions for future research; the current investigation of classification is still preliminary.

8.2 FUTURE WORKS

Furthermore, the algorithm consistently outperformed all the tested classification methods under different conditions. Future enhancements can be made with still larger feature sets. SVM and KNN classification give better accuracy in prediction. More datasets can be taken and checked for accuracy with the same algorithm parameters, and these algorithms can also be evaluated for other diseases.

APPENDICES
A. SOURCE CODE

SVM CODING

#!/usr/bin/env python
#python 3.7 32 bit
# coding: utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#get_ipython().run_line_magic('matplotlib', 'inline')

#Import Cancer data from the Sklearn library


# Dataset can also be found here:
# https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29

#from sklearn.datasets import load_breast_cancer


#cancer = load_breast_cancer()
#dataset = pd.read_csv('BigMartSales1000Records.csv')

#from google.colab import files


#uploaded = files.upload()
dataset=pd.read_csv('heartdisease.csv')
#dataset = pd.read_csv('dataset_malwares.csv')
df = pd.DataFrame(data = dataset)

#X = X.filter(['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 'Loan_Status'])
# In[2]:
#X
print(df.columns.values)
#df=df.iloc[:, [2,3,4,5,6,7,8,9,53]]#3,6,
df=df.iloc[:, [1,2,3,4,5,6,7,8,9,10,11,12,13]]
# As we can see above, not much can be done in the current form of the dataset.
# We need to view the data in a better format.

# # Let's view the data in a dataframe.

# In[3]:

#df_Sales = pd.DataFrame(np.c_[X['Item_Outlet_Sales'], X['Loan_Status']],
#                        columns=np.append(X['Item_Outlet_Sales'], ['Loan_Status']))
df_Sales = df
#df_Sales = pd.DataFrame(np.c_[X['Item_Weight'], X['Item_Visibility'],
#                              X['Outlet_Establishment_Year'], X['Item_Outlet_Sales'], X['Loan_Status']],
#                        columns=['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 'Loan_Status'])
X=df_Sales
print(X.head())
print(X.columns.values)

# # Let's Explore Our Dataset

# In[4]:
print(X.shape)

# As we can see,we have 596 rows (Instances) and 31 columns(Features)


# In[5]:
print(X.columns)

# Above is the name of each columns in our dataframe.

# # The next step is to Visualize our data

# In[6]:

# Let's plot out just the first 5 variables (features)


#sns.pairplot(df_Sales)  #, vars = ['Item Weight', 'Item Visibility', 'Outlet Establishment Year', 'Item Outlet Sales'])

# The above plots show the relationship between our features. But the only problem
# with them is that they do not show us which of the "dots" is Malignant and which is Benign.
#
# This issue will be addressed below by using the "target" variable as the "hue" for the plots.

# In[7]:

# Let's plot out just the first 5 variables (features)


#sns.pairplot(df_Sales, hue='Loan_Status', vars=['Item_Weight', 'Item_Fat_Content', 'Outlet_Establishment_Year', 'Item_Outlet_Sales'])

# **Note:**
#
# 1.0 (Orange) = Benign (No Cancer)
#
# 0.0 (Blue) = Malignant (Cancer)

# # How many Benign and Malignant do we have in our dataset?

# In[8]:

print(X['classfactor'])

print(X['classfactor'].value_counts())

# As we can see, we have 212 - Malignant, and 357 - Benign

# Let's visulaize our counts

# In[9]:

sns.countplot(X['classfactor'], label = "Count")

# # Let's check the correlation between our features

# In[10]:

plt.figure(figsize=(20,12))
sns.heatmap(df_Sales.corr(), annot=True)

#X = X.drop(['Malware'], axis=1)  # We drop our "target" feature and use all the remaining features in our dataframe to train the model.
#print(X.head())

# In[12]:

y = X['classfactor']
X = X.drop(['classfactor'], axis = 1)
print(y.head())

from sklearn.model_selection import train_test_split

# Let's split our data using 80% for training and the remaining 20% for testing.

# In[14]:

indices =range(len(X))
X_train, X_test, y_train, y_test,tr,te = train_test_split(X, y, indices,test_size =
0.15, random_state = 20)

# Let now check the size our training and testing data.

# In[15]:

print ('The size of our training "X" (input features) is', X_train.shape)
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)

# # Import Support Vector Machine (SVM) Model

# In[16]:

from sklearn.svm import SVC

# In[17]:

svc_model = SVC()

# # Now, let's train our SVM model with our "training" dataset.

# In[18]:

svc_model.fit(X_train, y_train)

# # Let's use our trained model to make a prediction using our testing data

# In[19]:

y_predict = svc_model.predict(X_test)
print(y_predict)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

f = open('output.html', "w")
f.write('<html><head><title>Output</title>')
s = "<link rel='stylesheet' href='css/bootstrap.min.css'><script src='css/jquery.min.js'></script> <script src='css/bootstrap.min.js'></script><link rel='stylesheet' href='css/all.css'>"
f.write(s)
f.write("</head><body>")
f.write("<center><h2>TRAINING/TEST DATASET SIZE</h2></center><font color='red' size='4' face='Tahoma'>")

print ('The size of our training "X" (input features) is', X_train.shape)
f.write('The size of our training "X" (input features) is' + str(X_train.shape))
f.write("<br/>")
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)

f.write('The size of our testing "X" (input features) is' + str(X_test.shape))


f.write("<br/>")
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
f.write('The size of our training "y" (output feature) is' + str( y_train.shape))
f.write("<br/>")
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)
f.write('The size of our testing "y" (output features) is'+ str(y_test.shape))
f.write("<br/><br/></font>")
print('Accuracy Score')

# In[21]:

cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))


confusion = pd.DataFrame(cm, index=['is_Low', 'is_High'],
columns=['predicted_Low','predicted_High'])
print(confusion)

#sns.heatmap(confusion, annot=True)
#print(classification_report(y_test, y_predict))
X_train_min = X_train.min()
print(X_train_min)
# In[25]:
X_train_max = X_train.max()
print(X_train_max)
# In[26]:

X_train_range = (X_train_max- X_train_min)


print(X_train_range)
# In[27]:
X_train_scaled = (X_train - X_train_min)/(X_train_range)
print('X_train_scaled:')
print(X_train_scaled.head())
# # Normalize Training Data
# In[28]:

X_test_min = X_test.min()
X_test_range = (X_test - X_test_min).max()
X_test_scaled = (X_test - X_test_min)/X_test_range
print('X_test_scaled:')
print(X_test_scaled.head())
# In[29]:
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)

# In[30]:

y_predict = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:')
print(cm)

# # SVM with Normalized data

# In[31]:

cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))


confusion = pd.DataFrame(cm, index=['is_Low', 'is_High'],
                         columns=['predicted_Low', 'predicted_High'])
leng = len(te)
print("------")
for i in range(0, leng):
    print(te[i], ":", y_predict[i])
ac = accuracy_score(y_test,y_predict)
print('Accuracy:')
print(round(ac,3))
f.write( 'Accuracy Score<br/>')
f.write(str(round(ac,3)) + "<br/>")
print('Confusion Matrix')
print(confusion_matrix(y_test, y_predict))
f.write('Confusion Matrix [SVM]<br/>')
f.write('----------------------<br/>')

f.write(str(confusion_matrix(y_test, y_predict)) + "<br/>")


f.write("<center><table class='table table-bordered table-striped table-hover'
border='1'
style='border-radius:5px;width:50%'><tr><th>Index</th><th>Category</th></
tr>")
for i in range(0,leng):
s= "<tr><td>" + str(te[i]) + "</td><td>"+ str(y_predict[i]) + "</td>"
f.write(s)
f.write("</table></center></body></html>")
f.close()
import webbrowser
import os
filename='file:///'+os.getcwd()+'/' + 'output.html'
webbrowser.open_new_tab(filename)
#import urllib.request
#page = urllib.request.urlopen('output.html').read()

#print (page)

KNN CODING
#!/usr/bin/env python
#python 3.7 32 bit
#python 3.9 32 bit if sns function plot require check 99th line
# coding: utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#get_ipython().run_line_magic('matplotlib', 'inline')

#Import Cancer data from the Sklearn library


# Dataset can also be found here:
# https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29

#from sklearn.datasets import load_breast_cancer


#cancer = load_breast_cancer()
#dataset = pd.read_csv('BigMartSales1000Records.csv')
#dataset = pd.read_csv('dataset_malwares.csv')
#from google.colab import files
#uploaded = files.upload()
dataset=pd.read_csv('heartdisease.csv')

df = pd.DataFrame(data = dataset)
#X = X.filter(['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 'classfactor'])
# In[2]:
#X
print(df.columns.values)
#The below line is correct. For speed some columns are eliminated
df=df.iloc[:, [1,2,3,4,5,6,7,8,9,10,11,12,13]] #15,16,17,18,19,20
#The below line is used for speed and so some columns are eliminated
#df=df.iloc[:, [2,3,4,9,53]]
#df=df.iloc[1:10001,:]
#for i in range(1001,15000):
# df.at[i,'Malware']=0

#for i in range(15001,19501):
# df.at[i,'Malware']=1
#print(df['Malware'])

#for i in range(15001,16501):
# df[i,'Malware']=1

# As we can see above, not much can be done in the current form of the dataset.
# We need to view the data in a better format.

# # Let's view the data in a dataframe.

# In[3]:

#df_Sales = pd.DataFrame(np.c_[X['Item_Outlet_Sales'], X['classfactor']],
#                        columns=np.append(X['Item_Outlet_Sales'], ['classfactor']))
df_Sales = df
#df_Sales = pd.DataFrame(np.c_[X['Item_Weight'], X['Item_Visibility'],
#                              X['Outlet_Establishment_Year'], X['Item_Outlet_Sales'], X['classfactor']],
#                        columns=['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 'classfactor'])
X=df_Sales
print(X.head())
print(X.columns.values)

# # Let's Explore Our Dataset

# In[4]:
print(X.shape)
# As we can see,we have 596 rows (Instances) and 31 columns(Features)
# In[5]:
print(X.columns)

# Above is the name of each columns in our dataframe.

# # The next step is to Visualize our data

# In[6]:

# Let's plot out just the first 5 variables (features)


#sns.pairplot(df_Sales)  #, vars = ['Item Weight', 'Item Visibility', 'Outlet Establishment Year', 'Item Outlet Sales'])

# The above plots show the relationship between our features. But the only problem with them
# is that they do not show us which of the "dots" is Malignant and which is Benign.
#
# This issue will be addressed below by using "target" variable as the "hue" for the plots.

# In[7]:

# Let's plot out just the first 5 variables (features)


#sns.pairplot(df_Sales, hue='classfactor', vars=['Item_Weight', 'Item_Fat_Content', 'Outlet_Establishment_Year', 'Item_Outlet_Sales'])

# **Note:**
#
# 1.0 (Orange) = Benign (No Cancer)
#
# 0.0 (Blue) = Malignant (Cancer)

# # How many Benign and Malignant do we have in our dataset?



# In[8]:
#X['classfactor']=X['classfactor']
print(X['classfactor'])

print(X['classfactor'].value_counts())

# As we can see, we have 212 - Malignant, and 357 - Benign

# Let's visulaize our counts

# In[9]:

try:
    sns.countplot(X['classfactor'], label="Count")
except:
    tmp = 1

# # Let's check the correlation between our features

# In[10]:

try:
    plt.figure(figsize=(20, 12))
    sns.heatmap(df_Sales.corr(), annot=True)
except:
    tmp = 1

#X = X.drop(['classfactor'], axis=1)  # We drop our "target" feature and use all the remaining features in our dataframe to train the model.
#print(X.head())

# In[12]:

y = X['classfactor']
X = X.drop(['classfactor'], axis = 1)
print(y.head())

from sklearn.model_selection import train_test_split

# Let's split our data using 80% for training and the remaining 20% for testing.

# In[14]:

indices =range(len(X))
X_train, X_test, y_train, y_test,tr,te = train_test_split(X, y, indices,test_size = 0.25,
random_state = 16)

# Let now check the size our training and testing data.

# # Import Support Vector Machine (SVM) Model

from sklearn import neighbors, datasets, preprocessing


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

#iris = datasets.load_iris()
#X, y = iris.data[:, :], iris.target
indices =range(len(X))
Xtrain, Xtest, y_train, y_test,tr,te = train_test_split(X, y, indices,stratify = y, random_state = 16,
train_size = 0.75)

scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)
# In[16]:
leng = len(te)
print("------")
for i in range(0, leng):
    print(te[i], ":", y_pred[i])

# In[15]:
f=open('output.html',"w")
f.write('<html><head><title>Output</title>');
s="<link rel='stylesheet' href='css/bootstrap.min.css'><script src='css/jquery.min.js'></script>
<script src='css/bootstrap.min.js'></script><link rel='stylesheet' href='css/all.css'>"
f.write(s)
f.write("</head><body>")
f.write("<center><h2>TRAINING/TEST DATASET SIZE</h2></center><font color='red'
size='4' face='Tahoma'>")

print ('The size of our training "X" (input features) is', X_train.shape)
f.write('The size of our training "X" (input features) is' + str(X_train.shape))
f.write("<br/>")
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
f.write('The size of our testing "X" (input features) is' + str(X_test.shape))
f.write("<br/>")
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
f.write('The size of our training "y" (output feature) is' + str( y_train.shape))

f.write("<br/>")
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)
f.write('The size of our testing "y" (output features) is'+ str(y_test.shape))
f.write("<br/><br/></font>")
print('Accuracy Score')
f.write( 'Accuracy Score<br/>')

print(round(accuracy_score(y_test, y_pred),3))
f.write(str(round(accuracy_score(y_test, y_pred),3)) +"<br/>")
print(classification_report(y_test, y_pred))
print('Confusion Matrix [KNN]')
f.write('Confusion Matrix [KNN]<br/>')
f.write('----------------------<br/>')
print('----------------------')
print(confusion_matrix(y_test, y_pred))
f.write(str(confusion_matrix(y_test, y_pred)) + "<br/>")
f.write("<center><table class='table table-bordered table-striped table-hover' border='1'
style='border-radius:5px;width:50%'><tr><th>Index</th><th>Category</th></tr>")
for i in range(0,leng):
s= "<tr><td>" + str(te[i]) + "</td><td>"+ str(y_pred[i]) + "</td>"
f.write(s)

f.write("</table></center></body></html>")
f.close()
import webbrowser
import os
filename='file:///'+os.getcwd()+'/' + 'output.html'
webbrowser.open_new_tab(filename)
#import urllib.request
#page = urllib.request.urlopen('output.html').read()
#print (page)
exit()

B. SAMPLE SCREENS

SVM BASED CLASSIFICATION

In this output, all the column names are displayed. Five sample records from the dataset are printed using df.head(). The test records are classified and their categories are printed as zero and one. The accuracy score is also displayed, and the confusion matrix is printed below it.

In this output, the test records are classified and their categories are printed as zero and one. The accuracy score is also displayed, and the confusion matrix is printed below it. The prepared output is saved as a web page, and using the webbrowser module in Python, the HTML file saved during classification is opened in the browser.

KNN BASED CLASSIFICATION

In this output, all the column names are displayed. Five sample records from the dataset are printed using df.head(). The test records are classified and their categories are printed as zero and one. The accuracy score is also displayed, and the confusion matrix is printed below it.

In this output, the test records are classified and their categories are printed as zero and one. The accuracy score is also displayed, and the confusion matrix is printed below it. The prepared output is saved as a web page, and using the webbrowser module in Python, the HTML file saved during classification is opened in the browser.

NEURAL NETWORK ITERATION 1

Once the CNN model is created, it is imported and then compiled using ‘model.compile’. The model is trained for just ten epochs for faster output. During the training process, the accuracy is calculated and the predictions are printed out. The accuracy value is displayed during the iterations, and the results of the first epoch are printed out.

NEURAL NETWORK ITERATION 2

The accuracy value is displayed during iterations. The results of second epoch are
printed out. And it is noted that accuracy is increasing.

NEURAL NETWORK ITERATION 3

The accuracy value is displayed during iterations. The results of third epoch are printed
out. And it is noted that accuracy is increasing.

NEURAL NETWORK ITERATION 4

The accuracy value is displayed during iterations. The results of fourth epoch are printed
out. And it is noted that accuracy is increasing.

NEURAL NETWORK ITERATION 5

The accuracy value is displayed during iterations. The results of fifth epoch are printed
out. And it is noted that accuracy is increasing.

NEURAL NETWORK ITERATION 10

The accuracy value is displayed during iterations. The results of tenth epoch are printed
out. And it is noted that accuracy is increasing and the program ends.

BIBLIOGRAPHY
JOURNAL REFERENCES
[1] Franck Le Duff, CristianMunteanb, Marc Cuggiaa and Philippe Mabob, “Predicting
Survival Causes After Out of Hospital Cardiac Arrest using Data Mining Method”, Studies in
Health Technology and Informatics, Vol. 107, No. 2, pp. 1256-1259, 2004.
[2] W.J. Frawley and G. Piatetsky-Shapiro, “Knowledge Discovery in Databases: An
Overview”, AI Magazine, Vol. 13, No. 3, pp. 57-70, 1996.
[3] Kiyong Noh, HeonGyu Lee, Ho-Sun Shon, Bum Ju Lee and Keun Ho Ryu, “Associative
Classification Approach for Diagnosing Cardiovascular Disease”, Intelligent Computing in
Signal Processing and Pattern Recognition, Vol. 345, pp. 721-727, 2006.
[4] Latha Parthiban and R. Subramanian, “Intelligent Heart Disease Prediction System using
CANFIS and Genetic Algorithm”, International Journal of Biological, Biomedical and Medical
Sciences, Vol. 3, No. 3, pp. 1-8, 2008.
[5] Sellappan Palaniappan and Rafiah Awang, “Intelligent Heart Disease Prediction System
using Data Mining Techniques”, International Journal of Computer Science and Network
Security, Vol. 8, No. 8, pp. 1-6, 2008
[6] Shantakumar B. Patil and Y.S. Kumaraswamy, “Intelligent and Effective Heart Attack
Prediction System using Data Mining and Artificial Neural Network”, European Journal of
Scientific Research, Vol. 31, No. 4, pp. 642-656, 2009.
[7] Nidhi Singh and Divakar Singh, “Performance Evaluation of K-Means and Hierarchal
Clustering in Terms of Accuracy and Running Time”, Ph.D Dissertation, Department of
Computer Science and Engineering, Barkatullah University Institute of Technology, 2012.
[8] Weiguo, F., Wallace, L., Rich, S., Zhongju, Z.: “Tapping the Power of Text Mining”,
Communication of the ACM. 49(9), 77-82, 2006.
[9] Jiawei Han M. K, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers,
An Imprint of Elsevier, 2006
[10] Huang Z, “Extensions to the k-means algorithm for clustering large data sets with
categorical values,” Data Mining and Knowledge Discovery, Vol.2, pp:283–304, 1998

[11] A comparison of antiarrhythmic-drug therapy with implantable defibrillators in patients resuscitated from near-fatal ventricular arrhythmias. The Antiarrhythmics versus Implantable Defibrillators (AVID) Investigators. N Engl J Med. 1997 Nov 27;337(22):1576-83.
[12] R.Wu, W.Peters, M.W.Morgan, “The Next Generation Clinical Decision Support: Linking
Evidence to Best Practice”, Journal of Healthcare Information Management. 16(4), pp. 50-55,
2002.

[13] Mary K.Obenshain, “Application of Data Mining Techniques to Healthcare Data”,


Infection Control and Hospital Epidemiology, vol. 25, no.8, pp. 690–695, Aug. 2004.
[14] G.Camps-Valls, L.Gomez-Chova, J.Calpe-Maravilla, J.D.MartinGuerrero, E.Soria-Olivas,
L.Alonso-Chorda, J.Moreno, “Robust support vector method for hyperspectral data
classification and knowledge discovery.” Trans. Geosci. Rem. Sens. vol.42, no.7, pp.1530–
1542, July.2004.
[15] Obenshain, M.K: “Application of Data Mining Techniques to Healthcare Data”, Infection
Control and Hospital Epidemiology, 25(8), 690–695, 2004.
[16] Wu, R., Peters, W., Morgan, M.W.: “The Next Generation Clinical Decision Support:
Linking Evidence to Best Practice”, Journal Healthcare Information Management. 16(4), 50-55,
2002.
[17] Charly, K.: “Data Mining for the Enterprise”, 31st Annual Hawaii Int. Conf. on System
Sciences, IEEE Computer, 7, 295-304, 1998.
[18] Blake, C.L., Mertz, C.J.: “UCI Machine Learning Databases”, http://mlearn.ics.uci.edu/databases/heart-disease/, 2004.
[19] Mohd, H., Mohamed, S. H. S.: “Acceptance Model of Electronic Medical Record”, Journal
of Advancing Information and Management Studies. 2(1), 75-92, 2005.
