
Heart Disease Prediction Using Machine Learning Algorithms
ABSTRACT

The heart plays a significant role in living organisms. Diagnosis and prediction of heart-related diseases require high precision and correctness, because even a small mistake can cause serious illness or the death of the person, and the number of heart-related deaths is increasing rapidly day by day. To deal with this problem, a prediction system that creates awareness about the disease is essential. Machine learning, a branch of Artificial Intelligence (AI), provides valuable support for predicting any kind of event by training on data from natural events. In this paper we calculate the accuracy of machine learning algorithms for predicting heart disease; the algorithms used are k-nearest neighbour, decision tree, linear regression and support vector machine (SVM), with the UCI repository dataset used for training and testing. For the Python implementation, the Anaconda (Jupyter) notebook is a convenient tool, as it provides the many libraries that make the work more accurate and precise.
Keywords—supervised; unsupervised; reinforced; linear regression; decision tree; Python programming; Jupyter Notebook; confusion matrix
CHAPTER 1
INTRODUCTION
The heart is one of the most vital organs of the human body, so its care is essential. Many diseases are related to the heart, which makes heart disease prediction necessary; a comparative study is therefore needed in this field. Today many patients die because their disease is recognized only at a late stage, owing to the limited accuracy of instruments, so there is a need to identify more efficient algorithms for disease prediction. Machine learning is an efficient technology for this purpose, based on training and testing. It is a branch of Artificial Intelligence (AI), the broad area of study in which machines emulate human abilities; machine learning is the specific branch of AI in which systems are trained to learn how to process and make use of data, and the combination of the two technologies is also called machine intelligence. Since machine learning learns from natural phenomena, in this project we use biological parameters as test data, such as cholesterol, blood pressure, sex and age, and on this basis compare the accuracy of four algorithms: decision tree, linear regression, k-nearest neighbour and SVM. In this paper we calculate the accuracy of these four machine learning approaches and, on the basis of that calculation, conclude which one is best among them. Section I of this paper introduces machine learning and heart disease. Section II describes the machine learning classification. Section III illustrates the related work of researchers. Section IV covers the methodology used for this prediction system. Section V covers the algorithms used in this project. Section VI briefly describes the dataset and its analysis together with the results of this project. Finally, Section VII concludes with a summary of this paper and a brief view of its future scope.

Heart disease (HD) is considered one of the most complex and deadliest human diseases in the world. In this disease the heart is usually unable to push the required amount of blood to the other parts of the body to fulfil their normal functions, and due to this, heart failure ultimately occurs [1]. The rate of heart disease in the United States is
very high [2]. The symptoms of heart disease include shortness of breath, physical weakness, swollen feet, and fatigue, with related signs such as elevated jugular venous pressure and peripheral edema caused by functional cardiac or non-cardiac
abnormalities [3]. The investigative techniques used in the early stages to identify heart disease are complicated, and the resulting complexity is one of the major factors that affect the standard of life [4]. Heart disease diagnosis and treatment are very complex, especially in the
developing countries, due to the rare availability of diagnostic apparatus and shortage of
physicians and other resources, which affects proper prediction and treatment of heart patients
[5]. Accurate and proper diagnosis of heart disease risk in patients is necessary for reducing the associated risk of severe heart issues and improving the safety of the heart [6]. The European Society of Cardiology (ESC) reported that 26 million adults worldwide have been diagnosed with heart disease and that 3.6 million are diagnosed every year. Approximately 50% of people suffering from HD die within the first 1-2 years, and the costs of heart disease management are approximately 3% of the health-care budget [7]. The
invasive techniques for diagnosing heart disease are based on analysis of the patient's medical history, physical examination reports and analysis of the relevant symptoms by medical experts. These techniques often produce imprecise diagnoses and delay the diagnostic results due to human error; moreover, they are expensive, computationally complex and time-consuming [8]. In order to resolve these
complexities in invasive-based diagnosing of heart disease, a non-invasive medical decision
support system based on machine learning predictive models such as support vector machine
(SVM), k-nearest neighbour (K-NN), artificial neural network (ANN), decision tree (DT),
logistic regression (LR), AdaBoost (AB), Naïve Bayes (NB), fuzzy logic (FL), and rough set
[9, 10] have been developed by various researchers and are widely used for heart disease diagnosis; owing to these machine-learning-based expert medical decision systems, the rate of death from heart disease has decreased [11]. Heart disease diagnosis through machine-learning-based systems has been reported in various research studies. The classification performance of
different machine learning algorithms on Cleveland heart disease dataset has been reported in
the literature. The Cleveland heart disease dataset is available online in the University of California Irvine (UCI) data mining repository and has been used by various researchers to investigate different classification problems related to heart disease with different machine learning classification algorithms [12, 13]. Detrano et al. [13] proposed a logistic regression classifier based
decision support system for heart disease classification and obtained a classification accuracy
of 77%. The authors of [14] used the Cleveland dataset with global evolutionary approaches and achieved high prediction accuracy. That study relied on feature selection methods, so the classification performance of the approach depends on the selected features. Gudadhe et al. [15] proposed a classification system for heart disease using multilayer perceptron (MLP) and support vector machine algorithms and obtained an accuracy of 80.41%. Kahramanli and Allahverdi [16] designed a heart disease classification
system using a hybrid technique in which a fuzzy neural network is integrated with an artificial neural network; the proposed system achieved a classification accuracy of 87.4%. Palaniappan and Awang [17] designed an expert medical system for diagnosing heart disease, applying machine learning techniques such as Naïve Bayes, decision tree and ANN. The Naïve Bayes predictive model obtained an accuracy of 86.12%, the ANN obtained an accuracy of 88.12%, and the decision tree classifier achieved 80.4% correct predictions. Olaniyi and
Oyedotun [18] proposed a three-phase model based on the ANN to diagnose heart disease in
angina and achieved a classification accuracy of 88.89%. Moreover, the proposed system
could be easily deployed in healthcare information systems. Das et al. [19] proposed an ANN-ensemble-based predictive model for diagnosing heart disease, built with the Statistical Analysis System (SAS) Enterprise Miner 5.2, and achieved 89.01% accuracy, 80.09% sensitivity and 95.91% specificity. Jabbar et al. [20] designed a diagnostic system
for heart disease using a multilayer perceptron ANN trained with the back-propagation learning algorithm together with a feature selection algorithm. The proposed system gives excellent performance in terms of accuracy. In order to diagnose heart disease, an
integrated medical decision support system based on an artificial neural network and fuzzy analytical hierarchy processing (Fuzzy AHP) was designed by the authors in [12]; their classification system achieved an accuracy of 91.10%. The contribution of the present research is to design a machine-learning-based intelligent medical decision support system for the diagnosis of heart disease.
In the present study, various machine learning predictive models such as logistic regression, k-nearest neighbour, ANN, SVM, decision tree, Naïve Bayes and random forest have been used to classify people with heart disease and healthy people. The feature selection algorithms Relief, minimal-redundancy-maximal-relevance (mRMR) and the Least Absolute Shrinkage and Selection Operator (LASSO) were used to select the most important and highly correlated features, which greatly influence the predicted target value. Cross-validation methods such as k-fold were also used. To evaluate classifier performance, various evaluation metrics such as classification accuracy, classification error, specificity, sensitivity, Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC) curves were used. Additionally, model execution time has been computed.
Moreover, data pre-processing techniques were applied to the heart disease dataset. The proposed system has been trained and tested on the Cleveland heart disease dataset (2016), which is available online in the UCI data-mining repository. All the computations were performed in Python on an Intel(R) Core™ i5-2400 CPU @ 3.10 GHz PC.
Major contributions of the proposed research work are as follows:
(a) All classifiers’ performances have been checked on full features in terms of classification
accuracy and execution time.
(b) The classifiers' performances have been checked on the features selected by the feature selection (FS) algorithms Relief, mRMR and LASSO, with k-fold cross-validation.
(c) The study suggests which feature selection algorithm is feasible with which classifier for designing a high-level intelligent system for heart disease that accurately distinguishes heart disease patients from healthy people.
The remaining parts of the paper are structured as follows:
Section 2 provides background information on the heart disease dataset and briefly reviews the theoretical and mathematical background of the feature selection and machine learning classification algorithms; it additionally discusses the cross-validation method and the performance evaluation metrics. In Section 3, the experimental results are discussed in detail. The final Section 4 concludes the paper.
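As a concrete illustration, the following minimal Python sketch mirrors the evaluation pipeline described above: several classifiers scored with k-fold cross-validation. The file name cleveland.csv and its target column are assumptions for the example, not the study's actual artefacts.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('cleveland.csv')                 # hypothetical local copy of the dataset
X, y = data.drop('target', axis=1), data['target']  # 'target' column assumed

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'K-NN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
}
for name, model in models.items():
    # 10-fold cross-validated accuracy, with feature scaling fitted inside each fold
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=10)
    print(name, 'accuracy: %.3f' % scores.mean())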

Figure 1: A hybrid intelligent system framework for predicting heart disease.


Some risk factors are controllable. Apart from the factors above, lifestyle habits such as eating habits, physical inactivity and obesity are also considered major risk factors [5, 8, 15]. There are different types of heart disease, such as coronary heart disease, angina pectoris, congestive heart failure, cardiomyopathy, congenital heart disease, arrhythmias and myocarditis. It is difficult to determine the odds of getting heart disease manually from the risk factors [1]. However, machine learning techniques are useful for predicting outcomes from existing data. Hence, this paper applies one such machine learning technique, classification, to predict heart disease risk from the risk factors. It also tries to improve the accuracy of predicting heart disease risk using a strategy termed ensemble.
CHAPTER 2
LITERATURE REVIEW

[1] A. S. Abdullah and R. R. Rajalaxmi, “A data mining model for predicting the
coronary heart disease using random forest classifier,” 2012.
The proposed work is mainly concerned with the development of a data mining model with
the Random Forest classification algorithm. The developed model will have the
functionalities such as predicting the occurrence of various events related to each patient
record, prevention of risk factors with its associated cost metrics and an improvement in
overall prediction accuracy. As a result, the causes and the symptoms related to each event
will be made in accordance with the record related to each patient and thereby CHD can be
reduced to a great extent. Coronary Heart Disease (CHD) is a common form of disease
affecting the heart and an important cause for premature death. From the point of view of
medical sciences, data mining is involved in discovering various sorts of metabolic
syndromes. Classification techniques in data mining play a significant role in prediction and
data exploration. Classification technique such as Decision Trees has been used in predicting
the accuracy and events related to CHD. In this paper, a Data mining model has been
developed using Random Forest classifier to improve the prediction accuracy and to
investigate various events related to CHD. This model can help medical practitioners predict CHD with its various events and how they might be related to different segments
of the population. The events investigated are Angina, Acute Myocardial Infarction (AMI),
Percutaneous Coronary Intervention (PCI), and Coronary Artery Bypass Graft surgery
(CABG). Experimental results have shown that classification using Random Forest
Classification algorithm can be successfully used in predicting the events and risk factors
related to CHD.
The effects produced due to CHD are constant fatigue, physical disability, mental
stress and depression. This paper focuses on the creation of a data mining model using the
Random forest classification algorithm for evaluating and predicting various events related to
CHD. Some earlier studies were made with the implementation of data mining algorithms such as K-NN, Naïve Bayes, K-means, ID3 and Apriori. The growing
healthcare burden and suffering due to life threatening diseases such as heart disease and the
escalating cost of drug development can be significantly reduced by design and development
of novel methods in data mining technologies and allied medical informatics disciplines. In
CHD, if the risk factors are predicted in advance, two sorts of problems can be solved. First,
various surgical treatments such as angioplasty, coronary stents, coronary artery bypass and
heart transplant can be avoided to a great extent. Second, the associated cost with each risk
factor can be reduced.
[2] A. H. Alkeshuosh, M. Z. Moghadam, I. Al Mansoori, and M. Abdar, “Using PSO
algorithm for producing best rules in diagnosis of heart disease,” 2017.
The experimental results show that the PSO algorithm achieved higher predictive
accuracy and a much smaller rule list than C4.5. In this paper the PSO algorithm is proposed for producing the best rules for predicting heart disease. The experiments show that the rules
discovered for the dataset by PSO are generally with higher accuracy, generalization and
comprehensibility. Based on the average accuracy, the accuracy of the PSO method is 87%
and the accuracy of C4.5 is 63%. By using the PSO, one can extract effective classification
rules with acceptable accuracy. Furthermore, we conclude that PSO algorithm in rule
production has good performance for rule discovery on continuous data. For future work we
consider using improved PSO algorithm for producing the best rules in heart disease data set.
Heart disease is still a growing global health issue. In the health care system, limiting human
experience and expertise in manual diagnosis leads to inaccurate diagnosis, and the
information about various illnesses is either inadequate or lacking in accuracy as they are
collected from various types of medical equipment. Since the correct prediction of a person's
condition is of great importance, equipping medical science with intelligent tools for
diagnosing and treating illness can reduce doctors' mistakes and financial losses. In this
paper, the Particle Swarm Optimization (PSO) algorithm, which is one of the most powerful
evolutionary algorithms, is used to generate rules for heart disease. First the random rules are
encoded and then they are optimized based on their accuracy using PSO algorithm. Finally
we compare our results with the C4.5 algorithm.
The task of classification becomes very difficult when the number of possible combinations of parameters is high. The self-adaptability of population-based evolutionary algorithms is very useful in rule extraction and selection for data mining.
[3] N. Al-milli, “Back propagation neural network for prediction of heart disease'' 2013
In this work, an approach based on a back-propagation neural network is presented to model heart disease diagnosis. In this research paper, a heart disease prediction system is developed using a neural network. The proposed system uses 13 medical attributes for heart disease prediction. The experiments conducted in this work have shown the good performance of the proposed algorithm compared to similar state-of-the-art approaches. Moreover, new algorithms and new tools continue to be developed day by day.
Diagnosing heart disease is an important issue, and many researchers have investigated the development of intelligent medical decision support systems to improve the abilities of physicians. The neural network is a widely used tool for heart disease diagnosis.
[4] C. A. Devi, S. P. Rajamhoana, K. Umamaheswari, R. Kiruba, K. Karunya, and R.
Deepika, ``Analysis of neural networks based heart disease prediction system,'' 2018.
In this research paper, a heart disease prediction system (HDPS) using data mining and artificial neural network (ANN) techniques is presented. A multilayer perceptron neural network with the back-propagation algorithm is used to develop the system, because the MLPNN model gives better results and helps domain experts, and even people related to the field, to plan for a better diagnosis and provide the patient with early diagnosis results, as it performs realistically well even without retraining. The experimental results show that using neural networks the system predicts heart disease with nearly 100% accuracy.
This hidden information is useful for making effective decisions. Computer based
information along with advanced Data mining techniques are used for appropriate results.
The neural network is a widely used tool for heart disease diagnosis. In this research paper, a Heart Disease Prediction System (HDPS) is developed using a neural network. The HDPS predicts the likelihood of a patient getting heart disease using 13 medical parameters such as sex, blood pressure and cholesterol. Two more parameters, obesity and smoking, are added for better accuracy. From the results, it has been seen that the neural network predicts heart disease with nearly 100% accuracy. Prediction should be done to reduce the risk of heart disease. Diagnosis is usually based on signs,
symptoms and physical examination of a patient. Almost all doctors predict heart disease from learning and experience. The diagnosis of disease is a difficult and tedious task in the medical field. Predicting heart disease from various factors or symptoms is a multi-layered problem that may lead to false presumptions and unpredictable effects. The healthcare industry today generates large amounts of complex data about patients, hospital resources, disease diagnoses, electronic patient records, medical devices, etc. This large amount of data is a key resource to be processed and analyzed for knowledge extraction, enabling cost savings and decision support. Human intelligence alone is not enough for proper diagnosis.
[5] P. K. Anooj, “Clinical decision support system: Risk level prediction of heart disease
using weighted fuzzy rules,'' 2012.
This process is time consuming and really depends on medical experts’ opinions
which may be subjective. To handle this problem, machine learning techniques have been
developed to gain knowledge automatically from examples or raw data. Here, a weighted
fuzzy rule-based clinical decision support system (CDSS) is presented for the diagnosis of
heart disease, automatically obtaining knowledge from the patient’s clinical data. The
proposed clinical decision support system for the risk prediction of heart patients consists of
two phases: (1) automated approach for the generation of weighted fuzzy rules and (2)
developing a fuzzy rule-based decision support system. In the first phase, we have used the
mining technique, attribute selection and attribute weightage method to obtain the weighted
fuzzy rules. Then, the fuzzy system is constructed in accordance with the weighted fuzzy
rules and chosen attributes. Finally, the experimentation is carried out on the proposed system
using the datasets obtained from the UCI repository and the performance of the system is
compared with the neural network-based system utilizing accuracy, sensitivity and
specificity.
In the proposed work, we have proposed an effective clinical decision support system
using fuzzy logic in which automatically generated weighted fuzzy rules are used. At first,
data pre-processing is applied on the heart disease dataset for removing the missing values
and other noisy information. Then, using the class label, the input database is divided into
two subsets of data that are then used for mining the frequent attribute category individually.
Subsequently, the deviation range is computed using these frequent attribute categories so as
to compute the relevant attributes. Based on the deviation range, the attributes are selected
whether any deviation exists or not. Using this deviation range, the decision rules are
constructed and these rules are scanned in the learning database to find its frequency.
According to its frequency, the weightage is calculated for every decision rule obtained, and
the weighted fuzzy rules are obtained with the help of fuzzy membership function. Finally,
the weighted fuzzy rules are given to the Mamdani fuzzy inference system so that the system
can learn these rules and the risk prediction can be carried out on the designed fuzzy system.
[6] L. Baccour, “Amended fused TOPSIS-VIKOR for classification (ATOVIC) applied
to some UCI data sets,'' 2016
The classification procedure is an important task of expert and intelligent systems. Developing new classification algorithms that improve accuracy or true positive rates could influence real-life problems such as diagnosis prediction in the medical domain. Multi-criteria decision making (MCDM) methods are expected to search for the best
There are only two sets: a set of criteria and a set of alternatives. This work merges MCDM
methods TOPSIS and VIKOR and modifies them to be used for classification where the used
sets are three: the classes, the objects and the attributes (features) describing the objects.
Hence ATOVIC, a new classification algorithm, is proposed. In ATOVIC, criteria are replaced by features and alternatives are replaced by objects; the latter belong to corresponding classes. Two sets are employed: one serves as the reference set and the other as the test set. An object from the test set is assigned to the relevant class based on the reference set.
ATOVIC is applied on a benchmark (UCI) CLEVELAND data set to predict heart disease.
Given the complexity and importance of the data set, ATOVIC is applied to different test sets of CLEVELAND using binary classification and multi-class classification.
Moreover, ATOVIC is applied to thyroid data set to detect hyperthyroidism and
hypothyroidism diseases. The obtained results show the efficiency of ATOVIC in medical
domain. In addition, ATOVIC is applied to three other data sets: chess, nursery and titanic,
from UCI and KEEL websites. The obtained results are compared to those of some classifiers
from the literature. The experimental results demonstrate that the ATOVIC method improves accuracy and true positive rates compared to most classifiers considered from the literature. Hence ATOVIC is promising for use in prediction and classification.
CHAPTER 3
MACHINE LEARNING

Machine Learning is an efficient technology based on two phases, namely training and testing: the system takes its training directly from data and experience, and based on this training, testing is applied to different types of need as required by the algorithm.
There are three types of machine learning algorithms:

Fig 2: Classification of machine learning

A. Supervised Learning
Supervised learning can be defined as learning with a proper guide, or learning in the presence of a teacher: the training dataset acts as the teacher for making predictions on the given dataset, so for every set of test data there is always a training dataset. Supervised learning is based on the "train me" concept. Supervised learning involves the following processes:
• Classification
• Random Forest
• Decision tree
• Regression
Regression is the process of recognizing patterns and measuring the probability of continuous outcomes. The system gains the ability to identify numbers, their values and groupings, for example widths and heights. The following are supervised machine learning algorithms (a minimal training/testing example is sketched after the list below):
• Linear Regression
• Logistical Regression
• Support Vector Machines (SVM)
• Neural Networks
• Random Forest
• Gradient Boosted Trees
• Decision Trees
• Naive Bayes
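As a minimal sketch of the "train me" idea, using scikit-learn and its built-in iris data purely for illustration, a decision tree is fitted on a labelled training set and then evaluated on held-out test data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                     # features and known labels (the "teacher")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # training phase
print('Test accuracy:', clf.score(X_test, y_test))    # testing phase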
B. Unsupervised Learning
Unsupervised learning can be defined as learning without guidance: there is no teacher. In unsupervised learning, when a dataset is given, the system automatically works on it, finds the patterns and relationships within it and, according to the relationships it has created, classifies new data into one of those relations. Unsupervised learning is based on the "self-sufficient" concept. For example, suppose there is a mixture of fruits: mangoes, bananas and apples. When unsupervised learning is applied, it classifies them into three different clusters on the basis of their relations with each other, and when a new sample is given it automatically sends it to one of the clusters. Supervised learning would say "these are mangoes, bananas and apples", whereas unsupervised learning only says "there are three different clusters" (a clustering sketch follows the list below). Unsupervised algorithms involve the following processes:
• Dimensionality
• Clustering
There are following unsupervised machine learning algorithms:
• t-SNE
• k-means clustering
• PCA
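The fruit example above can be sketched with k-means from scikit-learn; the three synthetic 2-D groups below stand in for the unlabelled mangoes, bananas and apples (all numbers are made up for the demonstration):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three unlabelled groups of 2-D points (think weight and size of the fruits)
data = np.vstack([rng.normal(loc, 0.5, (30, 2)) for loc in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10).fit(data)
print(kmeans.labels_[:10])            # cluster id assigned to each sample
print(kmeans.predict([[5.2, 4.8]]))   # a new sample is routed to one cluster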
C. Reinforcement
Reinforced learning is the agent's ability to interact with the environment and find out the outcome. It is based on the "hit and trial" concept. In reinforced learning the agent is awarded positive or negative points for its actions; it is trained on the basis of the positive rewards, and on the basis of this training it performs testing on the datasets (a minimal sketch follows).
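A toy sketch of the hit-and-trial idea, assuming a made-up environment with two actions whose reward probabilities are hidden from the agent:

import random

reward_prob = {'A': 0.8, 'B': 0.3}   # hidden environment (assumed for this demo)
values = {'A': 0.0, 'B': 0.0}        # the agent's running reward estimates
counts = {'A': 0, 'B': 0}

for step in range(1000):
    # explore 10% of the time, otherwise exploit the best estimate so far
    if random.random() < 0.1:
        action = random.choice(['A', 'B'])
    else:
        action = max(values, key=values.get)
    reward = 1 if random.random() < reward_prob[action] else -1   # positive or negative point
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # running mean update

print(values)   # the agent ends up preferring the action that earns positive points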
Machine learning algorithm
Machines are by nature not intelligent. Initially, machines were designed to perform specific
tasks, such as running on the railway, controlling the traffic flow, digging deep holes, travelling into space, and shooting at moving objects. Machines do their tasks much faster
with a higher level of precision compared to humans. They have made our lives easy and
smooth. The fundamental difference between humans and machines in performing their work
is intelligence. The human brain receives data gathered by the five senses: vision, hearing,
smell, taste, and tactility. These gathered data are sent to the human brain via the neural
system for perception and taking action. In the perception process, the data is organized,
recognized by comparing it to previous experiences that were stored in the memory, and
interpreted. Accordingly, the brain takes the decision and directs the body parts to react
against that action. At the end of the experience, it might be stored in the memory for future
benefits. A machine cannot deal with the gathered data in an intelligent way. It does not have
the ability to analyze data for classification, benefit from previous experiences, and store the
new experiences to the memory units; that is, machines do not learn from experience.
Although machines are expected to do mechanical jobs much faster than humans, it is not
expected from a machine to: understand the play Romeo and Juliet, jump over a hole in the
street, form friendships, interact with other machines through a common language, recognize
dangers and the ways to avoid them, decide about a disease from its symptoms and laboratory
tests, recognize the face of the criminal, and so on. The challenge is to make dumb machines
learn to cope correctly with such situations. Because machines have been originally created to
help humans in their daily lives, it is necessary for the machines to think, understand to solve
problems, and take suitable decisions akin to humans. In other words, we need smart
machines. In fact, the term smart machine is symbolic to machine learning success stories and
its future targets. We will discuss the issue of smart machines in Section 1.4. The question of
whether a machine can think was first asked by the British mathematician Alan Turing in
1955, which was the start of the artificial intelligence history. He was the one who proposed a
test to measure the performance of a machine in terms of intelligence. Section 1.4 also
discusses the progress that has been achieved in determining whether our machines can pass
the Turing test. Computers are machines that follow programming instructions to accomplish
the required tasks and help us in solving problems. Our brain is similar to a CPU that solves
problems for us. Suppose that we want to find the smallest number in a list of unordered
numbers. We can perform this job easily. Different persons can have different methods to do
the same job. In other words, different persons can use different algorithms to perform the
same task. These methods or algorithms are basically a sequence of instructions that are
executed to reach from one state to another in order to produce output from input. If there are
different algorithms that can perform the same task, then one is right in questioning which
algorithm is better. For example, if two programs are made based on two different algorithms to find the smallest number in an unordered list, then for the same list of unordered numbers (the same input) and on the same machine, one measure of efficiency can be the speed of the program and another can be its memory usage. Thus, time and space are
the usual measures to test the efficiency of an algorithm. In some situations, time and space can be interrelated: a reduction in memory usage can lead to faster execution of the algorithm. For example, an efficient algorithm that lets a program keep its full input data in cache memory will consequently also execute faster.
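The smallest-number task mentioned above can be written as an explicit sequence of instructions; this one-pass version touches every element once, so its running time grows linearly with the length of the list:

def smallest(numbers):
    current_min = numbers[0]          # start with the first element
    for n in numbers[1:]:             # compare each remaining number once
        if n < current_min:
            current_min = n
    return current_min

print(smallest([7, 3, 9, 1, 4]))      # prints 1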
INTRODUCTION TO THE DEEP LEARNING

Deep learning

Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human such as digits, letters or faces.

Deep learning (also known as deep structured learning or differential programming) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.

Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog.

Deep learning is a subset of machine learning in artificial intelligence (AI) that has networks capable of learning unsupervised from data that is unstructured or unlabelled. It is also known as deep neural learning or deep neural networks.

A CNN is a feed-forward neural network that is generally used for image recognition and object classification. In an RNN, the previous state is fed as input to the current state of the network; RNNs can be used in NLP, time series prediction, machine translation, etc.
Convolutional Neural Network (cnn)

The convolutional neural network is one of the main approaches to image classification and image recognition in neural networks. Scene labeling, object detection and face recognition are some of the areas where convolutional neural networks are widely used.

A CNN takes an image as input, which is classified and processed under a certain category such as dog, cat, lion, tiger, etc. The computer sees the image as an array of pixels whose size depends on the resolution of the image: it sees h * w * d, where h = height, w = width and d = depth (the number of channels). For example, a 6 * 6 RGB image is a 6 * 6 * 3 array, while a 4 * 4 grayscale image is a 4 * 4 * 1 array.

In a CNN, each input image passes through a sequence of convolution layers along with pooling, fully connected layers and filters (also known as kernels). After that, we apply the softmax function to classify the object with probabilistic values between 0 and 1.

Convolution Layer

The convolution layer is the first layer used to extract features from an input image. By learning image features using small squares of input data, the convolution layer preserves the relationship between pixels. It is a mathematical operation that takes two inputs: an image matrix and a kernel or filter.

Strides

Stride is the number of pixels by which the filter shifts over the input matrix. When the stride equals 1, the filter moves 1 pixel at a time; similarly, when the stride equals 2, the filter moves 2 pixels at a time.

Padding

Padding plays a crucial role in building a convolutional neural network. Without padding the image shrinks at every convolution, so a network with hundreds of layers would be left with a tiny image after the final filter; padding adds extra pixels around the border so the output can keep the input size.

Pooling Layer
The pooling layer plays an important role in the pre-processing of an image. It reduces the number of parameters when the images are too large. Pooling is a "downscaling" of the image obtained from the previous layers; it can be compared to shrinking an image to reduce its pixel density. Spatial pooling, also called downsampling or subsampling, reduces the dimensionality of each feature map but retains the important information. Common types of spatial pooling are:

max pooling

average pooling

sum pooling

Fully Connected Layer

The fully connected layer is a layer in which the input from the other layers is flattened into a vector and passed on. It transforms the output into the number of classes required by the network.
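A minimal Keras sketch of the layer sequence described in this section (convolution, pooling, flattening, a fully connected layer and a softmax output), assuming 28 * 28 grayscale inputs and 10 classes purely for illustration:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, kernel_size=3, strides=1, padding='same',
                  activation='relu', input_shape=(28, 28, 1)),   # convolution layer
    layers.MaxPooling2D(pool_size=2),       # max pooling downscales the feature map
    layers.Flatten(),                       # flatten the feature maps into a vector
    layers.Dense(64, activation='relu'),    # fully connected layer
    layers.Dense(10, activation='softmax')  # probabilistic class scores between 0 and 1
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()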

Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is a kind of artificial neural network mainly used in speech recognition and natural language processing (NLP). RNNs are used in deep learning and in the development of models that imitate the activity of neurons in the human brain.

Recurrent Networks are designed to recognize patterns in sequences of data, such as text,
genomes, handwriting, the spoken word, and numerical time series data emanating from
sensors, stock markets, and government agencies.

A recurrent neural network looks similar to a traditional neural network except that a memory state is added to the neurons, so the computation includes a simple memory.

The recurrent neural network is a type of deep-learning-oriented algorithm that follows a sequential approach. In a traditional neural network the inputs and outputs are treated as independent of each other, whereas a recurrent network performs its mathematical computations sequentially, so each step can depend on the previous ones; this is why these networks are called recurrent. A minimal sketch follows.
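The sketch below (Keras, with an assumed sequence length of 20 scalar values) shows the memory state in action: the SimpleRNN layer feeds its previous state into the current step:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.SimpleRNN(16, input_shape=(20, 1)),  # previous state feeds the current state
    layers.Dense(1)                             # e.g. the next value of a time series
])
model.compile(optimizer='adam', loss='mse')
model.summary()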

Applications:
There are many applications for deep learning

 Automatic speech recognition


 Image recognition
 Visual art processing
 Natural language processing
 Drug discovery and toxicology
 Customer relationship management
 Recommendation systems
 Bioinformatics
 Medical Image Analysis
 Mobile advertising
 Image restoration
 Financial fraud detection
 Military

INTRODUCTION TO PYTHON
Python:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0,
released in 2000, introduced features like list comprehensions and a garbage
collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a
major revision of the language that is not completely backward-compatible, and much
Python 2 code does not run unmodified on Python 3.

The Python 2 language, i.e. Python 2.7.x, was officially discontinued on 1 January 2020 (first planned for 2015), after which security patches and other improvements are no longer released for it. With Python 2's end-of-life, only Python 3.5.x and later are supported.

Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.

Python is used for:

 web development (server-side),


 software development,
 mathematics,
 system scripting.

What can Python do?:

 Python can be used on a server to create web applications.


 Python can be used alongside software to create workflows.
 Python can connect to database systems. It can also read and modify files.
 Python can be used to handle big data and perform complex mathematics.
 Python can be used for rapid prototyping, or for production-ready software
development.

Why Python?:

 Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
 Python has a simple syntax similar to the English language.
 Python has syntax that allows developers to write programs with fewer lines than
some other programming languages.
 Python runs on an interpreter system, meaning that code can be executed as soon as it
is written. This means that prototyping can be very quick.
 Python can be treated in a procedural way, an object-oriented way or a functional way.

Python compared to other programming languages

 Python was designed for readability, and has some similarities to the English
language with influence from mathematics.
 Python uses new lines to complete a command, as opposed to other programming
languages which often use semicolons or parentheses.
 Python relies on indentation, using whitespace, to define scope, such as the scope of loops, functions and classes. Other programming languages often use curly brackets for this purpose.
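For example, indentation alone marks what belongs to the loop below; no brackets are needed:

for i in range(3):
    print('inside the loop:', i)   # indented, so part of the loop
print('outside the loop')          # not indented, runs once after the loop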

Python installation procedure:

Windows Based

It is highly unlikely that your Windows system shipped with Python already installed.
Windows systems typically do not. Fortunately, installing does not involve much more than
downloading the Python installer from the python.org website and running it. Let’s take a
look at how to install Python 3 on Windows:

Step 1: Download the Python 3 Installer


1. Open a browser window and navigate to the Download page for
Windows at python.org.
2. Underneath the heading at the top that says Python Releases for Windows, click on
the link for the Latest Python 3 Release - Python 3.x.x. (As of this writing, the latest
is Python 3.6.5.)
3. Scroll to the bottom and select either Windows x86-64 executable installer for 64-bit
or Windows x86 executable installer for 32-bit. (See below.)

Sidebar: 32-bit or 64-bit Python?


For Windows, you can choose either the 32-bit or 64-bit installer. Here’s what the difference
between the two comes down to:

 If your system has a 32-bit processor, then you should choose the 32-bit installer.
 On a 64-bit system, either installer will actually work for most purposes. The 32-bit
version will generally use less memory, but the 64-bit version performs better for
applications with intensive computation.
 If you’re unsure which version to pick, go with the 64-bit version.

Note: Remember that if you get this choice “wrong” and would like to switch to another
version of Python, you can just uninstall Python and then re-install it by downloading another
installer from python.org.

Step 2: Run the Installer

Once you have chosen and downloaded an installer, simply run it by double-clicking on the
downloaded file. A dialog should appear that looks something like this:
Important: You want to be sure to check the box that says Add Python 3.x to PATH as
shown to ensure that the interpreter will be placed in your execution path.
Then just click Install Now. That should be all there is to it. A few minutes later you should
have a working Python 3 installation on your system.

Mac OS based

While current versions of macOS (previously known as “Mac OS X”) include a version of
Python 2, it is likely out of date by a few months. Also, this tutorial series uses Python 3, so
let’s get you upgraded to that.
The best way we found to install Python 3 on macOS is through the Homebrew package
manager. This approach is also recommended by community guides like The Hitchhiker’s
Guide to Python.

Step 1: Install Homebrew (Part 1)

To get started, you first want to install Homebrew:

1. Open a browser and navigate to http://brew.sh/. After the page has finished
loading, select the Homebrew bootstrap code under “Install Homebrew”. Then hit
cmd+c  to copy it to the clipboard. Make sure you’ve captured the text of the
complete command because otherwise the installation will fail.
2. Now you need to open a Terminal app window, paste the Homebrew bootstrap
code, and then hit Enter. This will begin the Homebrew installation.
3. If you’re doing this on a fresh install of macOS, you may get a pop up alert asking
you to install Apple’s “command line developer tools”. You’ll need those to
continue with the installation, so please confirm the dialog box by clicking on
“Install”.

At this point, you’re likely waiting for the command line developer tools to finish installing,
and that’s going to take a few minutes. Time to grab a coffee or tea!

Step 2: Install Homebrew (Part 2)

You can continue installing Homebrew and then Python after the command line developer
tools installation is complete:

1. Confirm the “The software was installed” dialog from the developer tools installer.
2. Back in the terminal, hit Enter to continue with the Homebrew installation.
3. Homebrew asks you to enter your password so it can finalize the installation. Enter
your user account password and hit Enter to continue.
4. Depending on your internet connection, Homebrew will take a few minutes to
download its required files. Once the installation is complete, you’ll end up back at
the command prompt in your terminal window.

Whew! Now that the Homebrew package manager is set up, let’s continue on with installing
Python 3 on your system.

Step 3: Install Python

Once Homebrew has finished installing, return to your terminal and run the following
command:

$ brew install python3


Note: When you copy this command, be sure you don’t include the $ character at the
beginning. That’s just an indicator that this is a console command.
This will download and install the latest version of Python. After the Homebrew brew
install command finishes, Python 3 should be installed on your system.
You can make sure everything went correctly by testing if Python can be accessed from the
terminal:

1. Open the terminal by launching Terminal app.


2. Type pip3 and hit Enter.
3. You should see the help text from Python’s “Pip” package manager. If you get an
error message running pip3, go through the Python install steps again to make sure
you have a working Python installation.

Assuming everything went well and you saw the output from Pip in your command prompt
window…congratulations! You just installed Python on your system, and you’re all set to
continue with the next section in this tutorial.

Packages need for python based programming:

 Numpy
NumPy is a Python package which stands for 'Numerical Python'. It is the core library for scientific computing, which contains a powerful n-dimensional array object and provides tools for integrating C, C++, etc. It is also useful for linear algebra, random number generation, etc.
 Pandas
Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on
the Numpy package and its key data structure is called the DataFrame. DataFrames allow
you to store and manipulate tabular data in rows of observations and columns of
variables.
 Keras
Keras is a high-level neural networks API, written in Python and capable of running on
top of TensorFlow, CNTK, or Theano. Use Keras if you need a deep learning library that:
Allows for easy and fast prototyping (through user friendliness, modularity, and
extensibility).
 Sklearn
Scikit-learn is a free machine learning library for Python. It features various algorithms
like support vector machine, random forests, and k-neighbours, and it also supports
Python numerical and scientific libraries like NumPy and SciPy.
 Scipy
SciPy is an open-source Python library which is used to solve scientific and mathematical
problems. It is built on the NumPy extension and allows the user to manipulate and
visualize data with a wide range of high-level commands.
 Tensorflow
TensorFlow is a Python library for fast numerical computing created and released by
Google. It is a foundation library that can be used to create Deep Learning models
directly or by using wrapper libraries that simplify the process built on top
of TensorFlow.
 Django
Django is a high-level Python Web framework that encourages rapid development and
clean, pragmatic design. Built by experienced developers, it takes care of much of the
hassle of Web development, so you can focus on writing your app without needing to
reinvent the wheel. It's free and open source.
 Pyodbc
pyodbc is an open source Python module that makes accessing ODBC databases simple.
It implements the DB API 2.0 specification but is packed with even more Pythonic
convenience. Precompiled binary wheels are provided for most Python versions on
Windows and macOS. On other operating systems this will build from source.
 Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of
arrays. Matplotlib is a multi-platform data visualization library built on NumPy arrays
and designed to work with the broader SciPy stack. It was introduced by John Hunter in
the year 2002.
 Opencv
OpenCV-Python is a library of Python bindings designed to solve computer vision
problems. Python is a general purpose programming language started by Guido van
Rossum that became very popular very quickly, mainly because of its simplicity and code
readability.
 Nltk
NLTK is one of the leading platforms for working with human language data in Python; the NLTK module is used for natural language processing. NLTK is literally an acronym for Natural Language Toolkit, and it provides tools to tokenize data by words and by sentences.
 SQLAlchemy
SQLAlchemy is a library that facilitates the communication between Python programs
and databases. Most of the times, this library is used as an Object Relational Mapper
(ORM) tool that translates Python classes to tables on relational databases and
automatically converts function calls to SQL statements.
 Urllib
urllib is a Python module that can be used for opening URLs. It defines functions and
classes to help in URL actions. With Python you can also access and retrieve data from
the internet like XML, HTML, JSON, etc. You can also use Python to work with this data
directly.
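As a quick taste of the first two packages in the list, the sketch below builds a small table of made-up patient measurements with NumPy and summarizes it with Pandas:

import numpy as np
import pandas as pd

values = np.array([[63, 145], [67, 160], [37, 130]])         # made-up ages and pressures
df = pd.DataFrame(values, columns=['age', 'blood_pressure'])
print(df.describe())                                         # summary statistics per column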

Installation of packages:

Syntax for installing packages via the cmd terminal:

Step 1: pip — first check that the pip command works.

Step 2: pip list — check the list of packages already installed, then install the required ones with the following command.

Step 3: pip install <package name>

The package name should match the requirement.
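For example, the packages used in this project could be installed in one go (the exact package list here is only illustrative):

pip --version
pip list
pip install numpy pandas scikit-learn matplotlib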

INTRODUCTION TO OPENCV

Open cv:
OpenCV was started at Intel in 1999 by Gary Bradsky, and the first release came out in 2000. Vadim Pisarevsky joined Gary Bradsky to manage Intel's Russian software OpenCV team. In 2005, OpenCV was used on Stanley, the vehicle that won the 2005 DARPA Grand Challenge. Later its active development continued under the support of Willow Garage, with Gary Bradsky and Vadim Pisarevsky leading the project. Right now, OpenCV supports a lot of algorithms related to computer vision and machine learning, and it is expanding day by day. Currently OpenCV supports a wide variety of programming languages like C++, Python and Java, and is available on different platforms including Windows, Linux, OS X, Android and iOS. Interfaces based on CUDA and OpenCL are also under active development for high-speed GPU operations. OpenCV-Python is the Python API of OpenCV. It combines the best qualities of the OpenCV C++ API and the Python language.

Python is a general-purpose programming language started by Guido van Rossum, which became very popular in a short time mainly because of its simplicity and code readability. It enables the programmer to express ideas in fewer lines of code without reducing readability. Compared to languages like C/C++, Python is slower. But another important feature of Python is that it can easily be extended with C/C++. This feature helps us write computationally intensive code in C/C++ and create a Python wrapper for it, so that we can use these wrappers as Python modules. This gives us two advantages: first, our code is as fast as the original C/C++ code (since it is the actual C++ code working in the background), and second, it is very easy to code in Python. This is how OpenCV-Python works: it is a Python wrapper around the original C++ implementation. The support of NumPy makes the task easier. NumPy is a highly optimized library for numerical operations with a MATLAB-style syntax. All the OpenCV array structures are converted to and from NumPy arrays. So whatever operations you can do in NumPy, you can combine with OpenCV, which increases the number of weapons in your arsenal. Besides that, several other libraries which support NumPy, like SciPy and Matplotlib, can be used with it. So OpenCV-Python is an appropriate tool for fast prototyping of computer vision problems.

Since OpenCV is an open source initiative, all are welcome to make contributions to this library, and the same goes for this tutorial. So, if you find any mistake in this tutorial (whether a small spelling mistake or a big error in code or concepts), feel free to correct it. That is a good task for freshers who are beginning to contribute to open source projects. Just fork OpenCV on GitHub, make the necessary corrections and send a pull request to OpenCV.

OpenCV developers will check your pull request, give you important feedback and, once it passes the approval of the reviewer, merge it into OpenCV. Then you become an open source contributor. The same applies to other tutorials, documentation, etc. As new modules are added to OpenCV-Python, this tutorial will have to be expanded, so those who know a particular algorithm can write up a tutorial that includes the basic theory of the algorithm and code showing its basic usage, and submit it to OpenCV. Remember, together we can make this project a great success! Contributors: below is the list of contributors who submitted tutorials to OpenCV-Python.

1. Alexander Mordvintsev (GSoC-2013 mentor)

2. Abid Rahman K. (GSoC-2013 intern)

Additional Resources

1. A Quick guide to Python - A Byte of Python

2. Basic Numpy Tutorials

3. Numpy Examples List

4. OpenCV Documentation

5. OpenCV Forum

Install OpenCV-Python in Windows

Goals In this tutorial

We will learn to set up OpenCV-Python on your Windows system. The steps below were tested on a Windows 7 64-bit machine with Visual Studio 2010 and Visual Studio 2012. The screenshots show VS2012.

Installing Open CV from prebuilt binaries


1. Below Python packages are to be downloaded and installed to their default locations.

1.1. Python-2.7.x.

1.2. Numpy.

1.3. Matplotlib (Matplotlib is optional, but recommended since we use it a lot in our
tutorials).

2. Install all packages into their default locations. Python will be installed to C:/Python27/.

3. After installation, open Python IDLE. Enter import numpy and make sure Numpy is
working fine.

4. Download latest OpenCV release from source forge site and double-click to extract it.

5. Goto opencv/build/python/2.7 folder.

6. Copy cv2.pyd to C:/Python27/lib/site-packages.

7. Open Python IDLE and type following codes in Python terminal.

>>> import cv2

>>> print cv2.__version__

If the results are printed out without any errors, congratulations! You have installed OpenCV-Python successfully.

Download and install necessary Python packages to their default locations

1. Python 3.6.8.x

2. Numpy

3. Matplotlib (Matplotlib is optional, but recommended since we use it a lot in our tutorials.)

Make sure Python and Numpy are working fine.


4. Download OpenCV source. It can be from Source forge (for official release version) or
from Github (for latest source).

5. Extract it to a folder, opencv, and create a new folder build inside it.
6. Open CMake-gui (Start > All Programs > CMake-gui).
7. Fill the fields as follows (see the image below):
7.1. Click on Browse Source... and locate the opencv folder.
7.2. Click on Browse Build... and locate the build folder we created.
7.3. Click on Configure.
7.4. A new window will open to select the compiler. Choose the appropriate compiler (here, Visual Studio 11) and click Finish.
7.5. Wait until the analysis is finished.
8. You will see all the fields marked in red. Click on the WITH field to expand it. It decides what extra features you need, so mark the appropriate fields. See the image below:
9. Now click on the BUILD field to expand it. The first few fields configure the build method. See the image below:
10. The remaining fields specify which modules are to be built. Since GPU modules are not yet supported by OpenCV-Python, you can avoid them completely to save time (but keep them if you work with them). See the image below:
11. Now click on the ENABLE field to expand it. Make sure ENABLE_SOLUTION_FOLDERS is unchecked (solution folders are not supported by Visual Studio Express editions). See the image below:
12. Also make sure that in the PYTHON field everything is filled in. (Ignore PYTHON_DEBUG_LIBRARY.) See the image below:
13. Finally, click the Generate button.
14. Now go to the opencv/build folder. There you will find the OpenCV.sln file. Open it with Visual Studio.
15. Set the build mode to Release instead of Debug.
16. In the Solution Explorer, right-click on the Solution (or ALL_BUILD) and build it. It will take some time to finish.
17. Again, right-click on INSTALL and build it. Now OpenCV-Python will be installed.
18. Open Python IDLE and enter import cv2. If there is no error, it is installed correctly.
Using OpenCV

Read an image

Use the function cv2.imread() to read an image. The image should be in the working directory, or a full path to the image should be given. The second argument is a flag which specifies the way the image should be read:

cv2.IMREAD_COLOR: Loads a color image. Any transparency of the image will be neglected. It is the default flag.
cv2.IMREAD_GRAYSCALE: Loads the image in grayscale mode.
cv2.IMREAD_UNCHANGED: Loads the image as such, including the alpha channel.

See the code below:
import numpy as np
import cv2

# Load a color image in grayscale
img = cv2.imread('messi5.jpg', 0)

Warning: Even if the image path is wrong, it won't throw any error, but print img will give you None.
Display an image

Use the function cv2.imshow() to display an image in a window. The window automatically fits to the image size. The first argument is a window name, which is a string; the second argument is our image. You can create as many windows as you wish, but with different window names.

cv2.imshow('image', img)
cv2.waitKey(0)

cv2.waitKey() is a keyboard binding function. Its argument is the time in milliseconds; if 0 is passed, it waits indefinitely for a key stroke.

cv2.destroyAllWindows()

cv2.destroyAllWindows() simply destroys all the windows we created.
Write an image

Use the function cv2.imwrite() to save an image. The first argument is the file name, the second argument is the image you want to save.

cv2.imwrite('messigray.png', img)

This will save the image in PNG format in the working directory.

The program below loads an image in grayscale, displays it, saves the image if you press 's' and exits, or simply exits without saving if you press the ESC key.

import numpy as np
import cv2

img = cv2.imread('messi5.jpg', 0)
cv2.imshow('image', img)
k = cv2.waitKey(0)
if k == 27:          # wait for ESC key to exit
    cv2.destroyAllWindows()
elif k == ord('s'):  # wait for 's' key to save and exit
    cv2.imwrite('messigray.png', img)
    cv2.destroyAllWindows()
CHAPTER 4
SOFTWARE REQUIREMENTS
Fig.3 Architecture of Prediction System

TABLE.1 Attributes of the Dataset
C. Preprocessing of data
Preprocessing is needed to achieve good results from the machine learning algorithms. For example, the random forest algorithm does not support datasets with null values, so we have to manage the null values in the original raw data. For our project we also have to convert some categorical values into dummy values, i.e. into the form of "0" and "1", using code along the lines of the sketch below.
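The report does not reproduce the conversion code at this point, so the following is a minimal sketch of the step, assuming pandas is used and that the categorical attributes carry the usual UCI column names (cp, thal and slope are illustrative assumptions):

import pandas as pd

# Assumes the raw data has already been read into a DataFrame, e.g.
# dataset = pd.read_csv('heart.csv')   # file name is an assumption
# get_dummies() replaces each categorical column with "0"/"1" columns.
dataset = pd.get_dummies(dataset, columns=['cp', 'thal', 'slope'])
# Drop any remaining null values, since algorithms such as random
# forest do not accept them.
dataset = dataset.dropna()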
D. Data Balancing
Data balancing is essential for accurate results; from the data-balancing graph we can see whether both target classes are equal in size. Fig.4 represents the target classes, where "0" represents patients with heart disease and "1" represents patients without heart disease.
Fig.4 Target class view
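A minimal sketch of how the target-class graph of Fig.4 can be produced, assuming the class column is named target (an assumption):

import matplotlib.pyplot as plt

# Count the samples per target class ("0" = heart disease,
# "1" = no heart disease, following the labelling of Fig.4).
counts = dataset['target'].value_counts()
print(counts)

# Bar graph of the two classes to check that they are roughly equal.
counts.plot(kind='bar', rot=0)
plt.xlabel('Target class')
plt.ylabel('Count')
plt.show()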
E. Histogram of attributes
The histogram of attributes shows the range of each dataset attribute. The code used to create it is: dataset.hist()
Fig.5 Histogram of attributes
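Run in context, the one-liner expands to something like the following sketch:

import matplotlib.pyplot as plt

# One histogram per attribute, showing the range of each column.
dataset.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()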
MACHINE LEARNING ALGORITHMS
A. Linear regression
Linear regression is a supervised learning technique. It is based on the relationship between an independent variable and a dependent variable: as seen in Fig.6, "x" is the independent variable and "y" the dependent variable, and the relation between them is shown by the equation of a line. Since this relation is linear in nature, the approach is called linear regression.

Fig.6 Relation between x and y

It gives a relation equation to predict the dependent variable value "y" from the independent variable value "x", as we can see in Fig.6; linear regression therefore gives the linear relationship between x (input) and y (output).
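A minimal sketch of fitting such a line with scikit-learn; the data points are purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: x is the independent variable, y the dependent one.
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Fit the line y = a*x + b and inspect its slope and intercept.
model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)
print(model.predict([[6]]))  # predicted y for a new x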
B. Decision tree
A decision tree, on the other hand, is a graphical representation of the data; it is also a kind of supervised machine learning algorithm.
Fig.7 Decision tree
For the construction of the tree we use the entropy of the data attributes, and on the basis of these attributes the root and the other nodes are drawn. The entropy of equation (1) is

E = - Σ pij log2(pij) (1)

where pij is the probability of class j at node i; according to it the entropy of each node is calculated. The attribute giving the highest information gain (i.e. the largest reduction in entropy) is selected for the root node, and this process is repeated until all the nodes of the tree are calculated, i.e. until the tree is constructed. When the nodes are imbalanced, the tree creates an overfitting problem, which is bad for the calculation, and this is one reason why the decision tree has lower accuracy compared to linear regression.
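A minimal sketch of an entropy-based tree on the project data, assuming the preprocessed DataFrame of the previous section and a class column named target (both assumptions):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the attributes and the class label, then hold out a test set.
X = dataset.drop('target', axis=1)
y = dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# criterion='entropy' makes the tree split by information gain, i.e.
# the reduction of the entropy in equation (1); max_depth limits the
# growth of the tree to reduce overfitting.
tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))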
C. Support Vector Machine
The support vector machine is a category of machine learning technique which works on the concept of a hyperplane: it classifies the data by creating a hyperplane between the classes. The training sample dataset is (Yi, Xi), where i = 1, 2, 3, ..., n, Xi is the ith input vector and Yi is the target value. The kind of hyperplane decides the type of support vector machine; for example, if a line is used as the hyperplane then the method is called a linear support vector machine.
Fig.8 Linear support vector machine
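A minimal sketch of the linear support vector case, reusing the train/test split from the decision-tree sketch above:

from sklearn.svm import SVC

# kernel='linear' separates the two classes with a linear hyperplane,
# i.e. the "linear support vector" case described above.
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))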
D. K-nearest Neighbour
The k-nearest neighbour method works on the basis of the distance between the locations of data points, and on the basis of these distances distinct data points are classified. The nearest points of a data point are called its neighbours, and the number of neighbours k is decided by the user, which plays a very crucial role in the analysis of the dataset.

Fig.9 KNN where k=3

In Fig.9, k = 3 means that a new data point is assigned to the class that is most common among its three nearest neighbours. Each point is represented in a two-dimensional space whose coordinates are (Xi, Yi), where Xi is the x-axis value, Yi is the y-axis value and i = 1, 2, 3, ..., n.
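A minimal sketch with k = 3, again reusing the split from the earlier sketch:

from sklearn.neighbors import KNeighborsClassifier

# k = 3: each test sample is assigned the class most common among its
# three nearest neighbours in the attribute space.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))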
Result Analysis
A. About Jupyter Notebook
Jupyter Notebook is used as the simulation tool, and it is comfortable for Python programming projects. A Jupyter notebook contains rich text elements as well as code: figures, equations, links and many more. Because of this mix of rich text elements and code, these documents are a perfect place to bring together an analysis, its description and its results, and the analysis can be executed in real time. Jupyter Notebook is an open-source, web-based interactive environment that supports graphics, maps, plots, visualizations and narrative text.
Fig.10 Jupyter Notebook
B. Accuracy calculation
The accuracy of the algorithms depends on four values, namely true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):

Accuracy = (TP + TN) / (TP + FP + TN + FN) (2)

The values TP, FP, TN and FN are defined as:
TP = number of persons with heart disease correctly predicted as having heart disease
TN = number of persons without heart disease correctly predicted as not having heart disease
FP = number of persons without heart disease incorrectly predicted as having heart disease
FN = number of persons with heart disease incorrectly predicted as not having heart disease
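A minimal sketch of how the confusion matrix and equation (2) can be evaluated for any of the trained models, here the KNN sketch from above:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = knn.predict(X_test)

# scikit-learn lays out the binary confusion matrix as
# [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, y_pred))

# accuracy_score computes (TP + TN) / (TP + FP + TN + FN),
# i.e. equation (2).
print(accuracy_score(y_test, y_pred))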
Fig.11 Confusion matrix for decision tree

Fig.12 Confusion matrix for linear regression
C. Result
After performing training and testing with each machine learning approach, we find that the accuracy of KNN is better than that of the other algorithms. Accuracy is calculated with the support of the confusion matrix of each algorithm, as shown in Fig.11 and Fig.12, where the counts of TP, TN, FP and FN are given; using equation (2) the accuracy values have been calculated, and we conclude that KNN is the best among them with 87% accuracy. The comparison is shown in TABLE.2.

TABLE.2 Accuracy comparison
4.1 HEART DISEASE
A key challenge confronting healthcare organizations (hospitals, medical centers) is the provision of quality services at reasonable prices. Quality service implies diagnosing patients accurately and administering treatments that are effective. Poor clinical choices can lead to disastrous results, which are therefore unacceptable. Hospitals should also limit the cost of clinical tests, and they can achieve these outcomes by employing appropriate computer-based information and decision support systems [4][6].
The heart is an essential part of our body, and life itself is dependent on the efficient working of the heart. If the operation of the heart is not proper, it will affect other parts of the human body, such as the brain, the kidneys and so on. Heart disease is a disease that affects the operation of the heart. There are a number of factors which increase the risk of heart disease.
Some of them are listed below:
• The family history of heart disease
• Smoking
• Cholesterol
• High blood pressure
• Obesity
• Lack of physical exercise
Heart disease describes a range of conditions that affect the heart. The term covers a number of diseases, such as blood vessel diseases (for example coronary artery disease), heart rhythm problems (arrhythmias) and heart defects present at birth (congenital heart defects), among others. The term heart disease is sometimes used interchangeably with the term cardiovascular disease. Cardiovascular disease (CVD) generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack (myocardial infarction), chest pain (angina) or stroke. Other heart conditions, such as those that affect the heart's muscle, valves or rhythm, are also considered forms of heart disease [3]. An estimated 17.9 million people die each year from CVDs, 31% of all deaths worldwide [4].
Nowadays the healthcare sector produces a large amount of information about patients, disease diagnoses and so on, but this data is not used efficiently by researchers and practitioners. A major challenge faced by the healthcare industry today is quality of service (QoS). QoS implies diagnosing disease correctly and providing effective treatments to patients; poor diagnosis can lead to disastrous consequences, which are unacceptable [2]. There are various heart disease risk factors. Family history, increasing age, ethnicity and being male are risk factors that cannot be controlled, whereas smoking, diabetes, high cholesterol, high blood pressure, physical inactivity and being overweight or obese are factors that can be controlled or prevented.
HYBRID RANDOM FOREST WITH A LINEAR MODEL (HRFLM)
Random forests are a popular classification method based on an ensemble of a single type of decision tree. In the literature there are many different types of decision tree algorithms, including C4.5, CART and CHAID, and each type may capture different information and structures. The hybrid random forest ensembles multiple types of decision trees into one random forest and exploits the diversity of the trees to enhance the resulting model. In the original study, a series of experiments on six text classification datasets compared this method with traditional random forest methods and some other text categorization methods, and the results showed that it consistently outperformed the compared methods.
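The HRFLM implementation itself is not reproduced here; as a loose, illustrative analogue only, a voting ensemble can mix differently configured trees (a stand-in for the C4.5/CART/CHAID variety described above), reusing the earlier train/test split:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# Not the authors' HRFLM: a rough analogue that combines trees grown
# with different split criteria in one majority-voting ensemble.
hybrid = VotingClassifier(estimators=[
    ('gini_tree', DecisionTreeClassifier(criterion='gini', random_state=0)),
    ('entropy_tree', DecisionTreeClassifier(criterion='entropy', random_state=0)),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=0)),
], voting='hard')
hybrid.fit(X_train, y_train)
print(hybrid.score(X_test, y_test))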
Three traditional model selection approaches for the GLM were used to select predictive models: (1) stepAIC, (2) dropterm and (3) anova. We used a backward direction with k = log(n) for stepAIC, a chi-square test with k = log(n) for dropterm, and a chi-square test for anova. First, we used stepAIC to choose a model (GLM1) from a full model containing all 49 numerical predictors. We then simplified GLM1 using dropterm and anova to remove non-significant predictors, which produced a further model (GLM2). We then considered possible two-way interactions of the remaining predictors in the model with the lowest AIC (GLM1) and simplified this newly formed model using stepAIC; finally, we added a few second-order terms, based on the relationships of species richness with the relevant predictors, and further simplified the model using stepAIC, dropterm and anova, which led to the third model.
Description of the dataset
The Cleveland heart dataset from the UCI machine learning repository has been used for the experiments. The dataset consists of 14 attributes and 303 instances; there are 8 categorical attributes and 6 numeric attributes. The description of the dataset is shown in Table 1. Patients from age 29 to 79 have been selected in this dataset. Male patients are denoted by a gender value of 1 and female patients by a gender value of 0. Four types of chest pain can be considered as indicative of heart disease. Type 1, typical angina, is caused by reduced blood flow to the heart muscles because of narrowed coronary arteries. Type 2, atypical angina, is chest pain that occurs, for example, during mental or emotional stress. Type 3, non-anginal chest pain, may be caused by various reasons and may often not be due to actual heart disease. The fourth type, asymptomatic, may not be a symptom of heart disease. The next attribute, trestbps, is the reading of the resting blood pressure. Chol is the cholesterol level. Fbs is the fasting blood sugar indicator; the value is assigned as 1 if the fasting blood sugar is above 120 mg/dl and 0 if it is below. Restecg is the resting electrocardiographic result, thalach is the maximum heart rate achieved, exang is exercise-induced angina (recorded as 1 if there is pain and 0 if there is not), oldpeak is the ST depression induced by exercise, slope is the slope of the peak exercise ST segment, ca is the number of major vessels colored by fluoroscopy, thal is the thalassemia test result, and num is the class attribute. The class attribute has a value of 0 for normal and 1 for patients diagnosed with heart disease.
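A minimal sketch of loading this dataset with pandas, using the attribute names just described (the local file name is an assumption; in the UCI file missing values are encoded as '?'):

import pandas as pd

names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
         'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
dataset = pd.read_csv('processed.cleveland.data', names=names, na_values='?')
print(dataset.shape)                   # expected: (303, 14)
print(dataset['num'].value_counts())   # class attribute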
Classification
• Apply a number of classification techniques on the output of the first phase.
• Classification accuracy, precision, recall and F-measure are used to evaluate the efficiency of the techniques; Figure 2 shows the classification results on the original data.
• Eliminate low-efficiency algorithms based on the evaluations from the previous step. This is done by comparing the values of accuracy, precision, recall and F-measure for each feature to determine the consistency of the classification on the dataset. We notice that Naïve Bayes and SVM always perform better than the others and are never eliminated, the decision tree is eliminated a couple of times, and KNN is eliminated most of the time.
• Apply hybridization, combining the results from the chosen classifiers.
Feature selection
In some medical datasets the number of features can reach tens of thousands; the heart disease dataset used here has 14 attributes. Since irrelevant and redundant attributes are involved in such data, the heart disease classification task is made more complex. If the complete data are used to perform heart disease classification, accuracy suffers, and calculation time and cost are high. Therefore feature selection, as a pre-processing step for machine learning, reduces dimensionality, eliminates irrelevant data, increases learning accuracy and improves the comprehensibility of results. The recent increase in the dimensionality of data poses a serious problem for feature selection methods with regard to efficiency and effectiveness. The reliable FCBF method [8] is adopted to select a subset of discriminatory features prior to classification; by eliminating attributes with little or no effect, FCBF provides good performance with full consideration of feature correlation and redundancy. Here the data were first standardized and the features were then selected by FCBF in WEKA, which reduced the number of heart disease attributes from 14 to 7.
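FCBF itself was run in WEKA and has no standard scikit-learn implementation; as an illustrative substitute, the attributes can be ranked by mutual information with the class and the top 7 kept, matching the reduced attribute count reported above:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Not FCBF: a simple filter that keeps the 7 attributes sharing the
# most mutual information with the class label.
selector = SelectKBest(mutual_info_classif, k=7)
X_reduced = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])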
Effectiveness
In this section we evaluate the effectiveness of all classifiers in terms of the time taken to build the model, correctly classified instances, incorrectly classified instances and accuracy. The results are shown in Table 3 (without optimization), Table 4 (optimized by FCBF) and Table 5 (optimized by FCBF, PSO and ACO). In order to improve the measurement of classifier performance, the simulation error is also taken into account in this study. To do this, we evaluate the effectiveness of each classifier in terms of: Kappa, as a chance-corrected measure of agreement between the classifications and the actual classes; Mean Absolute Error, as a measure of how closely predictions approximate the eventual outcomes; Root Mean Squared Error; Relative Absolute Error; and Root Relative Squared Error.
MACHINE LEARNING
Machine learning is a branch of artificial intelligence that aims at enabling machines to perform their jobs skillfully by using intelligent software. Statistical learning methods constitute the backbone of the intelligent software that is used to develop machine intelligence. Because machine learning algorithms require data to learn, the discipline is closely connected to the discipline of databases, and there are related familiar terms such as Knowledge Discovery from Data (KDD), data mining and pattern recognition. One may wonder how to view the big picture in which these connections are illustrated. SAS Institute Inc., North Carolina, is the developer of the well-known analytical software Statistical Analysis System (SAS); to show the connections of the discipline of machine learning with different related disciplines, we use the illustration from SAS.
Machine learning algorithms are helpful in bridging this gap of understanding. The idea is very simple: we are not aiming to understand the underlying processes that help us learn; instead, we write computer programs that make machines learn and enable them to perform tasks such as prediction. The goal of learning is to construct a model that takes the input and produces the desired result. Sometimes we can understand the model, whereas at other times it can be like a black box for us, the working of which cannot be intuitively explained.
Figure 13. Different machine learning techniques and their required data.

There are some tasks that humans perform effortlessly or with some effort, but we are unable to explain how we perform them. For example, we can recognize the speech of our friends without much difficulty, but if we are asked how we recognize the voices, the answer is very difficult to explain.
CARDIOVASCULAR DISEASE
Globally, cardiovascular diseases are the number one cause of death, and they are projected to remain so. An estimated 17 million people died from cardiovascular disease in 2005, representing 30% of all global deaths. Of these deaths, 7.2 million were due to heart attacks and 5.7 million due to stroke. About 80% of these deaths occurred in low- and middle-income countries. If current trends are allowed to continue, by 2030 an estimated 23.6 million people will die from cardiovascular disease.
Cardiovascular diseases include:
• Coronary heart disease (heart attacks),
• Cerebrovascular disease,
• Raised blood pressure (hypertension),
• Peripheral artery disease,
• Rheumatic heart disease,
• Congenital heart disease, and
• Heart failure.
Reducing this burden requires combining approaches that reduce risk throughout the entire population with strategies that target individuals at high risk or with established disease.
Examples of population-wide interventions that can be implemented include:
• Comprehensive tobacco control policies,
• Taxation to reduce the intake of foods that are high in fat, sugar and salt,
• Building walking and cycle ways to increase physical activity, and
• Providing healthy school meals to children.
In addition, effective and inexpensive medication is available to treat nearly all cardiovascular diseases. After a heart attack or stroke, the risk of a recurrence or death can be substantially lowered with a combination of lifestyle changes and drugs: statins to lower cholesterol, drugs to lower blood pressure, and aspirin. There is a need for increased government investment through national programmes aimed at the prevention and control of CVDs and other chronic diseases.
PREDICTION MODEL
The outcome variable of a prediction model can be anything, e.g., the risk of getting a side effect, the chance of surviving to a certain time point, or the probability of having a tumour recurrence. Outcome variables can be divided into continuous variables and categorical variables. Continuous variables are described by numerical values, and regression models, e.g. linear regression, are used to predict them. Categorical variables are restricted to a limited number of classes or categories, and classification models are used for their prediction. If the outcome has two categories this is referred to as binary classification, and typical techniques are decision trees and logistic regression.
The variance is the error due to the amount of overfitting done during model generation. If you use a very flexible algorithm, e.g., an advanced machine learning algorithm with lots of freedom to follow the data points in the training set very closely, it is more likely to overfit the data: the error on the training set will be small, but the error on the test set will be large. Another way to look at this is that high variance will result in very different models during training if the model is fitted using different training sets. Bias relates to the error due to the assumptions made by the algorithm chosen for model generation. If a linear algorithm is chosen, i.e. a linear relation between the inputs and the outcome is assumed, this may cause large errors (large bias) if the underlying true relation is far from linear.
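A small illustration of this trade-off on the project data, comparing a fully grown tree (flexible, high variance) with a constrained linear model (higher bias), reusing the earlier split:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# The unpruned tree tends to fit the training set almost perfectly but
# generalizes worse; the linear model is more stable across sets.
for model in (DecisionTreeClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          model.score(X_train, y_train),   # training accuracy
          model.score(X_test, y_test))     # test accuracy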
CHAPTER 5
RELATED WORK
The heart is one of the core organs of the human body; it plays the crucial role of pumping blood through the body, which is as essential as oxygen, so there is always a need to protect it, and this is one of the big reasons for researchers to work in this area. There is always a need for analysis of heart-related problems, whether diagnosis, prediction or, more generally, protection against heart disease, and various fields such as artificial intelligence, machine learning and data mining have contributed to this work. The performance of any algorithm depends on the variance and bias of the dataset [4]. According to the survey on machine learning for heart disease prediction by Sharma et al. [4], naïve Bayes performs well with low variance and high bias, whereas KNN has high variance and low bias. With low bias and high variance, KNN suffers from the problem of overfitting, and this is the reason why its performance decreases. There are advantages to using a low-variance, high-bias method: with a small dataset it takes less time for training as well as testing of the algorithm; but there are also disadvantages to a small dataset. When the dataset size increases, asymptotic errors are introduced, and low-bias, low-variance algorithms perform well in such cases. The decision tree is a nonparametric machine learning algorithm, but, as we know, it suffers from the problem of overfitting, which can be addressed by overfitting-removal techniques. The support vector machine has an algebraic and statistical background; it constructs a linearly separable n-dimensional hyperplane for the classification of datasets. The nature of the heart is complex and it needs careful handling, otherwise disease can cause the death of the person. The severity of heart disease has been classified with various methods such as KNN, decision tree, genetic algorithm and naïve Bayes [3]. Mohan et al. [3] show how two different approaches can be combined into a single, hybrid approach with an accuracy of 88.4%, which is higher than that of all the others. Some researchers have worked on data mining for the prediction of heart diseases. Kaur et al. [6] describe how interesting patterns and knowledge are derived from large datasets; they compared the accuracy of various machine learning and data mining approaches to find the best one and obtained the result in favor of SVM. Kumar et al. [5] have worked on various machine learning and data mining algorithms, trained on the UCI machine learning dataset of 303 samples with 14 input features, and found SVM to be the best among them; the other algorithms considered were naïve Bayes, KNN and decision tree. Gavhane et al. [1] have worked on the multi-layer perceptron model for the prediction of heart disease in human beings and on the accuracy of the algorithm using CAD technology. If more people use a prediction system for disease prediction, awareness about the disease increases, which reduces the death rate of heart patients. Some researchers have worked on one or two algorithms for disease prediction. Krishnan et al. [2] proved that the decision tree is more accurate compared to the naïve Bayes classification algorithm in their project. Machine learning algorithms are used for the prediction of various types of diseases, and many researchers have worked on this; for example, Kohli et al. [7] worked on heart disease prediction using logistic regression, diabetes prediction using support vector machines and breast cancer prediction using the AdaBoost classifier, and concluded that logistic regression gives an accuracy of 87.1%, the support vector machine gives an accuracy of 85.71% and the AdaBoost classifier gives an accuracy of up to 98.57%, which is good from a prediction point of view. A survey paper on heart disease prediction has shown that older machine learning algorithms alone do not achieve good prediction accuracy, while hybridization performs well and gives better prediction accuracy [8].
CODING AND ALGORITHM
The proposed system can be used in hospitals to help doctors make a quick diagnosis or to test new approaches on some cases. Students in medical colleges may also wish to use this system to learn and to test their learning.
The main goal of this work is to investigate the available data mining techniques to predict heart disease, compare them, and then combine their results to obtain the most accurate outcome. The focus is on classification and prediction methods. The accuracy of the algorithms can be improved by hybridization, i.e. by combining algorithms into a single, more powerful algorithm. The new algorithm can be used as an expert system in hospitals to help doctors diagnose heart disease quickly and save lives, and it can also be used for educational purposes in medical schools.
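The full source listing is not reproduced in this report; the following is a minimal end-to-end sketch of the comparison described in the preceding chapters. The file name and the target column name are assumptions, and the paper's "linear regression" classifier is approximated here with logistic regression, since a regression line must be thresholded to yield a class label:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the preprocessed dataset (file and column names are assumptions).
dataset = pd.read_csv('heart.csv')
X = dataset.drop('target', axis=1)
y = dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scaling helps the distance-based (KNN) and margin-based (SVM) models.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'KNN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Decision tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'SVM (linear)': SVC(kernel='linear'),
    'Linear model (logistic)': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))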
CONCLUSION AND FUTURE SCOPE

The heart is one of the essential and vital organs of the human body, and the prediction of heart disease is an important concern for human beings; the accuracy of an algorithm is therefore one of the parameters for analysing its performance. The accuracy of machine learning algorithms depends upon the dataset used for training and testing. Performing the analysis of the algorithms on the dataset whose attributes are shown in TABLE.1, and on the basis of the confusion matrices, we find that KNN is the best one. As future scope, more machine learning approaches will be used for better analysis and earlier prediction of heart disease, so that the rate of death cases can be minimized through awareness about the disease.
REFERENCES
[1] Aditi Gavhane, Gouthami Kokkula, Isha Panday and Kailash Devadkar, "Prediction of Heart Disease using Machine Learning", Proceedings of the 2nd International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018.
[2] Santhana Krishnan J and Geetha S, "Prediction of Heart Disease using Machine Learning Algorithms", ICIICT, 2019.
[3] Senthilkumar Mohan, Chandrasegar Thirumalai and Gautam Srivastava, "Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques", IEEE Access, 2019.
[4] Himanshu Sharma and M. A. Rizvi, "Prediction of Heart Disease using Machine Learning Algorithms: A Survey", International Journal on Recent and Innovation Trends in Computing and Communication (IJRITCC), vol. 5, no. 8, August 2017.
[5] M. Nikhil Kumar, K. V. S. Koushik and K. Deepak, "Prediction of Heart Diseases Using Data Mining and Machine Learning Algorithms and Tools", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), 2019.
[6] Amandeep Kaur and Jyoti Arora, "Heart Diseases Prediction using Data Mining Techniques: A Survey", International Journal of Advanced Research in Computer Science (IJARCS), 2015-2019.
[7] Pahulpreet Singh Kohli and Shriya Arora, "Application of Machine Learning in Diseases Prediction", 4th International Conference on Computing, Communication and Automation (ICCCA), 2018.
[8] M. Akhil, B. L. Deekshatulu and P. Chandra, "Classification of Heart Disease Using K-Nearest Neighbor and Genetic Algorithm", Procedia Technology, vol. 10, pp. 85-94, 2013.
[9] S. Kumra, R. Saxena and S. Mehta, "An Extensive Review on Swarm Robotics", pp. 140-145, 2009.
[10] A. Hazra, S. Mandal, A. Gupta and Mukherjee, "Heart Disease Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review", Advances in Computational Sciences and Technology, 2017.
[11] J. Patel, P. Upadhyay and Patel, "Heart Disease Prediction Using Machine Learning and Data Mining Technique", Journals of Computer Science & Electronics, 2016.
[12] A. B. Chavan Patil and P. Sonawane, "To Predict Heart Disease Risk and Medications Using Data Mining Techniques with an IoT Based Monitoring System for Post-Operative Heart Disease Patients", International Journal on Emerging Trends in Technology, 2017.
[13] V. Kirubha and S. M. Priya, "Survey on Data Mining Algorithms in Disease Prediction", vol. 38, no. 3, pp. 124-128, 2016.
[14] M. A. Jabbar, P. Chandra and B. L. Deekshatulu, "Prediction of Risk Score for Heart Disease Using Associative Classification and Hybrid Feature Subset Selection", Int. Conf. Intell. Syst. Des. Appl. (ISDA), pp. 628-634, 2012.
[15] https://archive.ics.uci.edu/ml/datasets/Heart+Disease