Project Report First Phase @8 Suhana
MACHINE LEARNING
PROJECT REPORT
(Submitted in Partial Fulfillment of the Requirement for B-Tech Degree Course in Electronics
and Communication Engineering of APJ Abdul Kalam Technological University)
Submitted By
SUHANA SAINAB.S (AME19EC013)
SUDHARSANA.N (AME19EC012)
I.SAJEED (AME19EC006)
Dr.RAMANI.K
Principal
It is with great enthusiasm and learning spirit that we are bringing out this project report.
Here we would like to mark our token of gratitude to all those who influenced us during
the period of our work. We would like to express our sincere thanks to the Management
of Rajadhani Institute of Science and Technology, Palakkad, and to Dr. RAMANI.K, the
Principal, Rajadhani Institute of Science and Technology, for the facilities provided here.
We express our heartfelt gratitude to the Head of the Department, SITHARA KRISHNAN,
Assistant Professor, Department of Electronics and Communication Engineering, for allowing
us to take up this work.
With immense pleasure and gratitude, we express sincere thanks to our guide Ms. ASHA
ARVIND and Co-ordinator Ms. BLESSY RAPHAEL A, Assistant Professor, for their committed
guidance, valuable suggestions and constructive criticism. Their stimulating suggestions
and encouragement helped us through our project work. We extend our gratitude to all
teachers in the Department of Electronics and Communication Engineering, Rajadhani
Institute of Science and Technology, Palakkad, for their support and inspiration.
Above all, we praise and thank the Almighty God, who showered her abundant grace on us to
make this project a success. We also express our special thanks and gratitude to our families
and all our friends for their support and encouragement.
ABSTRACT
Heart attack is one of the serious causes of morbidity in the world's
population. Clinical data analysis treats cardiovascular disease as one of the most
important areas for prediction. Data science and machine learning (ML) can be very
helpful in the prediction of heart attacks, in which different risk factors such as high
blood pressure, high cholesterol, abnormal pulse rate and diabetes can be
considered. The objective of this study is to optimize the prediction of heart disease using
ML. This prediction can help clinically in analyzing the risk factors of the disease and
in interpreting the patient scenario. Boosting the algorithm provided promising results in
predicting symptoms of heart disease. It can be optimized further by working on the risk
factors associated with this condition.
HEART DISEASE PREDICTION USING MACHINE LEARNING
Contents
1 INTRODUCTION
2 LITERATURE SURVEY
4 METHODOLOGY
5 TECHNOLOGY
6 BLOCK DIAGRAM
7 LOGISTIC REGRESSION
8 TESTING
10 CONCLUSION
BIBLIOGRAPHY
11 SAMPLE CODE
12 SCREENSHOTS
List of Figures
4.1 Proposed model
12.7 Features
12.8 Target
Chapter 1
INTRODUCTION
A heart attack, also known as acute myocardial infarction (AMI), is one
of the most serious diseases in the segment of cardiovascular disease. It occurs when the
blood circulation to the muscle of the heart is interrupted, which damages the heart muscle.
Diagnosing heart disease is also a crucial task. The symptoms, a physical examination, and
an understanding of the different signs of this disease are required to diagnose heart disease.
Different factors including cholesterol, genetic heart disease, high blood pressure, low
physical activity, obesity, and smoking can be reasons for the occurrence of heart disease.
The major reason for heart attacks is the stoppage of blood flow in the coronary arteries. When
blood flow is reduced, fewer red blood cells (RBC) are delivered; due to this the human body
stops getting the necessary oxygen and the patient may lose consciousness. Early diagnosis
through symptoms and signs can help protect patients from heart attacks if the prediction is
accurate enough. The accompanying figure shows different symptoms of a heart attack. The work
presented takes 14 numeric features/attributes as input. It has been stated that small
modifications in lifestyle, including quitting smoking/alcohol/tobacco, having healthy food
habits, and routine exercise, can help in the prevention of heart attacks. For any person,
living a healthy lifestyle with early treatment after diagnosis can greatly improve the
outcome. However, it is difficult to identify a high risk of heart disease where different
risks like diabetes, high blood pressure, and cholesterol problems are present together. In
these types of scenarios, ML can help in the early diagnosis of disease.
Chapter 2
LITERATURE SURVEY
A thorough search has been made of previous work in the domain of heart disease
prediction using different algorithms. Work from the previous 21 years has been considered
for study, and its shortcomings were noted to extend our research. A total of 50 papers from
Web of Science, ScienceDirect, and Scopus were collected, from which 27 were selected for
final study after removal of duplicates and same-domain papers. A number of works have
been done on disease prediction systems using different machine learning
algorithms in medical centres.
Senthil Kumar Mohan et al. proposed Effective Heart Disease Prediction Using Hybrid
Machine Learning Techniques, a method whose objective is to find significant features
by applying machine learning, resulting in improved accuracy in the prediction of
cardiovascular disease. The prediction model is created with various combinations of features
and several known classification techniques. They produce an improved performance level, with
an accuracy of 88.7 percent, through the prediction model for heart disease with
hybrid random forest with a linear model (HRFLM). They also note that diverse data
mining approaches and prediction techniques, such as KNN, LR, SVM, NN, and Vote, have
recently been popular for identifying and predicting heart disease.
Sonam Nikhar et al. wrote the paper titled Prediction of Heart Disease Using
Machine Learning Algorithms. This research aims to give a detailed description
of the Naïve Bayes and decision tree classifiers that are applied in their research, especially
in the prediction of heart disease. Some experiments were conducted to compare the performance
of predictive data mining techniques on the same dataset, and the results reveal that the
decision tree outperforms Bayesian classification.
Aditi Gavhane, Gouthami Kokkula, Isha Pandya and Prof. Kailas Devadkar (PhD),
Prediction of Heart Disease Using Machine Learning: in the system proposed in this paper,
the neural network algorithm multi-layer perceptron (MLP) is used to train and test the dataset.
In this algorithm there are multiple layers: one for input, a second for output, and one
or more hidden layers between the input and output layers. Each node in the
input layer is connected to the output nodes through these hidden layers. Each connection is
assigned a weight. There is another identity input called bias, which has its own weight
and is added to the node to balance the perceptron. The connection between the nodes can be
V.V. Ramalingam et al. proposed Heart disease prediction using machine learning
techniques, in which machine learning algorithms and techniques have been applied to various
medical datasets to automate the analysis of large and complex data. Many researchers, in
recent times, have been using several machine learning techniques to help the health care
industry and its professionals in the diagnosis of heart-related diseases. The paper
presents a survey of various models based on such algorithms and techniques and analyses
their performance. Models based on supervised learning algorithms such as Support Vector
Machines (SVM), K-Nearest Neighbour (KNN), Naïve Bayes, Decision Trees (DT),
Random Forest (RF) and ensemble models are found to be very popular among researchers.
Chapter 3
Chapter 4
METHODOLOGY
This paper analyses various machine learning algorithms. The algorithms
used are K-nearest neighbors (KNN), Logistic Regression and Random
Forest classifiers, which can help practitioners or medical analysts diagnose
heart disease accurately. The work includes examining journals, published papers and
recent data on cardiovascular disease. The methodology gives a framework for the
proposed model: it is a process whose steps transform the given
data into recognized data patterns for the knowledge of the users. The proposed methodology
includes the following steps. The first step is the collection of the data, the second
stage extracts significant values, and the third is the preprocessing stage, where we explore
the data. Data preprocessing deals with missing values, cleaning of data and
normalization, depending on the algorithms used. After pre-processing, a classifier is used
to classify the pre-processed data; the classifiers used in the proposed model are KNN,
Logistic Regression and Random Forest. Finally, the proposed model is evaluated
on the basis of accuracy and performance using various
performance metrics. In this model, an effective Heart Disease Prediction System
(EHDPS) has been developed using different classifiers. The model uses 14 medical
parameters such as chest pain, fasting sugar, blood pressure, cholesterol, age and sex for
prediction.
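As a rough illustration of the classification step, the sketch below implements a
k-nearest-neighbours vote in plain Python. The two-feature toy records, the value k = 3 and
the function name knn_predict are assumptions made for illustration; they are not taken from
the project's dataset or code.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    # Euclidean distance from the query to every training row.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already normalized records: (age, cholesterol) scaled to [0, 1].
train_rows = [(0.20, 0.30), (0.25, 0.20), (0.30, 0.35),
              (0.80, 0.70), (0.75, 0.90), (0.90, 0.80)]
train_labels = [0, 0, 0, 1, 1, 1]  # 0 = no disease, 1 = disease

print(knn_predict(train_rows, train_labels, (0.85, 0.75)))
print(knn_predict(train_rows, train_labels, (0.22, 0.28)))
```

Because KNN relies on distances, features on very different scales would need the
normalization mentioned above before a vote like this is meaningful.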
Chapter 5
TECHNOLOGY
Machine learning is an application of AI that enables systems to learn and improve from
experience without being explicitly programmed. Machine learning focuses on developing
computer programs that can access data and use it to learn for themselves. Similar to how the
human brain gains knowledge and understanding, machine learning relies on input, such as
training data or knowledge graphs, to understand entities, domains and the connections
between them. With entities defined, deep learning can begin.
The machine learning process begins with observations or data, such as examples, direct
experience or instruction. It looks for patterns in data so it can later make inferences based on
the examples provided. The primary aim of ML is to allow computers to learn autonomously
without human intervention or assistance and adjust actions accordingly. Machine learning
as a concept has been around for quite some time. The term “machine learning” was coined by
Arthur Samuel, a computer scientist at IBM and a pioneer in AI and computer gaming. Samuel
designed a computer program for playing checkers. The more the program played, the more
it learned from experience, using algorithms to make predictions.
ML has proven valuable because it can solve problems at a speed and scale that cannot
be duplicated by the human mind alone. With massive amounts of computational ability behind
a single task or multiple specific tasks, machines can be trained to identify patterns in and
relationships between input data and automate routine processes.
Supervised Learning: More Control, Less Bias. Supervised machine learning algorithms
apply what has been learned in the past to new data using labeled examples to predict future
events. By analyzing a known training dataset, the learning algorithm produces an inferred
function to predict output values. The system can provide targets for any new input after
sufficient training. It can also compare its output with the correct, intended output to find
errors and modify the model accordingly.
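The idea of comparing the output with the correct output and modifying the model can be
sketched with a classic perceptron update. This is only an illustrative stand-in: the logical
OR data, the learning rate and the function names are assumptions, and the report itself does
not use a perceptron.

```python
def predict(weights, bias, x):
    """Linear threshold unit: 1 if the weighted sum is positive, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Adjust weights whenever the prediction differs from the labeled answer."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # 0 when the output is correct
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Labeled training examples for logical OR (hypothetical toy data).
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```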
Unsupervised Learning: Speed and Scale. Unsupervised machine learning algorithms are
used when the information used to train is neither classified nor labeled. Unsupervised learning
studies how systems can infer a function to describe a hidden structure from unlabeled data.
At no point does the system know the correct output with certainty. Instead, it draws inferences
from datasets as to what the output should be.
Reinforcement Learning: Reinforcement learning is a feedback-based learning method,
in which a learning agent gets a reward for each right action and a penalty for each
wrong action. The agent learns automatically from this feedback and improves its
performance. In reinforcement learning, the agent interacts with the environment and
explores it. The goal of an agent is to get the most reward points, and hence, it improves its
performance. A robotic dog that automatically learns the movement of its limbs is an
example of reinforcement learning.
Chapter 6
BLOCK DIAGRAM
Data acquisition: The cardiac disease dataset was obtained from the UCI ML repository. It
contains 14 features and 303 records.
Data pre-processing: The cardiovascular disease UCI dataset is first loaded, and then data
cleaning and a search for missing values are performed on all records. The dataset contains
complete information. The attributes of the dataset are multiclass in character, and the
classification is binary.
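A minimal sketch of the two checks named above, counting missing values and normalizing a
column, is given here in plain Python. The records are hypothetical, with one missing value
injected on purpose, and min-max scaling is only one assumed choice of normalization.

```python
def count_missing(rows):
    """Count None entries per column in a list of dict records."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical records; the None is injected to show what the check reports.
records = [
    {"age": 63, "chol": 233},
    {"age": 37, "chol": 250},
    {"age": 41, "chol": None},
    {"age": 56, "chol": 236},
]

missing = count_missing(records)
ages_scaled = min_max_normalize([r["age"] for r in records])
print(missing)
print(ages_scaled)
```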
Feature selection: Each patient record is identified by two of the 14 attributes of the
dataset, sex and age, and is assigned an individual id. The rest of
the features consist of medical information. The medical information holds the vital attributes
for predicting heart disease. Correlation is computed between all 14 attributes and the target
value to select the features with high and positive correlation.
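The correlation-based selection can be sketched as follows, assuming Pearson correlation and
a cut-off of 0.3; both the cut-off and the four toy records are assumptions, since the report
does not state the exact rule or values used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature columns and binary target.
features = {
    "age": [40, 50, 60, 70],
    "chol": [200, 240, 210, 250],
    "max_heart_rate": [180, 170, 150, 140],
}
target = [0, 0, 1, 1]

correlations = {name: pearson(values, target) for name, values in features.items()}
ranked = sorted(correlations, key=correlations.get, reverse=True)
# Keep only features whose correlation with the target is high and positive.
selected = [name for name in ranked if correlations[name] > 0.3]

print(ranked)
print(selected)
```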
Splitting dataset: The dataset is split into training and testing sets in several
percentage ratios.
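A small sketch of such a split is shown below for a 303-record dataset. The list of test
fractions, the fixed seed and the rule of rounding the test size up are assumptions made for
illustration.

```python
import math
import random


def split_rows(rows, test_fraction, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)  # test size rounded up
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(303))  # stand-in for the 303 patient records
for test_fraction in (0.5, 0.4, 0.3, 0.2, 0.1):
    train, test = split_rows(rows, test_fraction)
    print(f"test fraction {test_fraction}: train {len(train)} / test {len(test)}")
```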
Classification: One of the simplest and best ML classification algorithms is Logistic
Regression (LR). LR is a supervised ML binary classification algorithm widely used in most
applications. It works on a categorical dependent variable; the result is a discrete or binary
categorical variable, 0 or 1. The sigmoid function turns the model's linear output into a
probability: it maps a predicted real value to a probabilistic value between 0 and 1.
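The sigmoid mapping can be written in a few lines. The sketch also includes the log loss
(binary cross-entropy), which is the cost usually minimized when fitting logistic regression;
the sample labels and probabilities are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true, y_prob):
    """Average binary cross-entropy between labels and predicted probabilities."""
    eps = 1e-15  # keeps log() away from 0
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


print(sigmoid(0.0))                       # linear output 0 maps to probability 0.5
print(sigmoid(2.0) >= 0.5)                # class 1 at the default 0.5 threshold
print(log_loss([1, 0], [0.9, 0.1]))       # confident, correct predictions: low loss
print(log_loss([1, 0], [0.1, 0.9]))       # confident, wrong predictions: high loss
```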
Model building: In this phase, we build our machine learning model for heart
disease detection.
Chapter 7
LOGISTIC REGRESSION
The World Health Organization has estimated that 12 million deaths occur worldwide every year
due to heart diseases. Half the deaths in the United States and other developed countries are
due to cardiovascular diseases. The early prognosis of cardiovascular diseases can aid in making
decisions on lifestyle changes in high-risk patients and in turn reduce the complications.
This research intends to pinpoint the most relevant risk factors of heart disease as well as
predict the overall risk using logistic regression.
Data Preparation
Source: The dataset is publicly available on the Kaggle website, and it is from an
ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The
classification goal is to predict whether the patient has a 10-year risk of future coronary heart
disease (CHD). The dataset provides the patients' information. It includes over 4,000 records
and 15 attributes.
Variables: Each attribute is a potential risk factor. There are demographic,
behavioral and medical risk factors.
Demographic:
• Sex: male or female (Nominal)
• Age: age of the patient (Continuous; although the recorded ages have been truncated to
whole numbers, the concept of age is continuous)
Behavioral:
• Current Smoker: whether or not the patient is a current smoker (Nominal)
• Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be
considered continuous, as one can have any number of cigarettes, even half a cigarette)
Medical (history):
• BP Meds: whether or not the patient was on blood pressure medication (Nominal)
• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
• Diabetes: whether or not the patient had diabetes (Nominal)
Medical (current):
• Tot Chol: total cholesterol level (Continuous)
• Sys BP: systolic blood pressure (Continuous)
• Dia BP: diastolic blood pressure (Continuous)
Interpreting the results: Odds Ratio, Confidence Intervals and P-values
• This fitted model shows that, holding all other features constant, the odds of getting
diagnosed with heart disease for males (sexmale = 1) over that of females (sexmale = 0) is
exp(0.5815) = 1.788687. In terms of percent change, the odds for males are about 78.9 percent
higher than the odds for females.
• The coefficient for age says that, holding all others constant, we will see about a
7 percent increase in the odds of getting diagnosed with CHD for a one-year increase
in age, since exp(0.0655) = 1.067644.
• Similarly, with every extra cigarette one smokes there is a 2 percent increase in the
odds of CHD.
• For total cholesterol level and glucose level there is no significant change.
• There is a 1.7 percent increase in odds for every unit increase in systolic blood
pressure.
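The odds ratios quoted above follow from exponentiating the coefficients. The short sketch
below repeats that arithmetic for the two coefficients given in the text; the dictionary keys
are assumed names, and small differences in the last digits relative to the quoted values come
from the coefficients being rounded to four decimals.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a feature with coefficient beta."""
    return math.exp(beta)


# Coefficients as quoted in the text (rounded).
coefficients = {"sexmale": 0.5815, "age": 0.0655}

for name, beta in coefficients.items():
    ratio = odds_ratio(beta)
    percent_change = (ratio - 1) * 100
    print(f"{name}: odds ratio {ratio:.4f}, {percent_change:.1f}% change in odds")
```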
Model Evaluation - Statistics: From the above statistics it is clear that the model is more
specific than sensitive. The negative values are predicted more accurately than the positives.
Predicted probabilities of 0 (No Coronary Heart Disease) and 1 (Coronary Heart Disease: Yes)
for the test data are obtained with a default classification threshold of 0.5.
Lower the threshold: Since the model is predicting heart disease, too many type II errors are
not advisable. A False Negative (ignoring the probability of disease when there actually is one)
is more dangerous than a False Positive in this case. Hence, in order to increase the
sensitivity, the threshold can be lowered.
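The effect of lowering the threshold can be seen on a small, entirely hypothetical set of
predicted probabilities; the labels, probabilities and the alternative threshold of 0.3 are
assumptions chosen only to show the trade-off.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) when predicting 1 for prob >= threshold."""
    tp = fn = tn = fp = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1  # missed case: type II error
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test labels and predicted probabilities of CHD.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.80, 0.45, 0.30, 0.15, 0.40, 0.25, 0.20, 0.10, 0.05, 0.35]

for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

On this toy data the lower threshold catches more true cases at the cost of more false
alarms, which is the trade-off described above.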
• All attributes selected after the elimination process show P-values lower than 5 percent,
thereby suggesting a significant role in heart disease prediction.
• Men seem to be more susceptible to heart disease than women. Increases in age, number of
cigarettes smoked per day and systolic blood pressure also show increasing odds of having
heart disease.
• Total cholesterol shows no significant change in the odds of CHD. This could be due to the
presence of 'good cholesterol' (HDL) in the total cholesterol reading. Glucose too causes a very
negligible change in odds (0.2 percent).
• The model predicted with 0.88 accuracy. The model is more specific than sensitive. The overall
model could be improved with more data.
Chapter 8
TESTING
The testing would be carried out on the Hospital Management System while logging
into the system as a customer or a normal user of the system. The Unit Testing is a test
that tests each single module of the software to check for errors. This is mainly done to discover
errors in the code of the Hospital Management System. The main goal of the unit testing
would be to isolate each part of the program and to check the correctness of the code. In
the case of Hospital Management System, all the web forms and the classes will be tested.
In Integration Testing, the individual software modules are combined and tested as a whole
unit. The integration testing generally follows unit testing where each module is tested as a
separate unit. The main purpose of the integration testing is to test the functional and
performance requirements on the major items of the project. Acceptance testing is generally
performed when the project is nearing its end. This test mainly qualifies the project and decides
if it will be accepted by the users of the system. The users or the customers of the project are
responsible for the test. The system testing is mainly done on the whole integrated system to
make sure that the project that has been developed meets all the requirements.
Chapter 9
The accuracy of Logistic Regression increases as the training share grows from 50 percent
to 90 percent, and the split with 90 percent training and 10 percent testing provides the
highest accuracy, 87.10 percent. The classification report gives the precision, recall,
F1-score and accuracy of the LR classifier for the UCI dataset with 90 percent training and
10 percent testing. The model has a precision of 0.857, recall of 0.857, F1-score of 0.857 and
accuracy of 87.10 percent. The ROC (Receiver Operating Characteristic) curve was used to
investigate the model further. It visualizes the performance of the model as the tradeoff
between TPR (True Positive Rate) and FPR (False Positive Rate). Both rates range from 0 to 1,
and the area under the curve signifies the capability of the ML model to distinguish between
the classes: the nearer the area is to one, the more capable the model is of classifying. The
comparison covers the various previous research work carried out on Logistic Regression using
RapidMiner and Python on the UCI dataset from the year 2019 to 2021, with the accuracy of
prediction.
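The reported metrics can be reproduced from a confusion matrix. The counts below are
hypothetical: the report does not list them, and they were chosen only because a 31-record
test set with these counts is consistent with the reported precision, recall, F1-score and
accuracy.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the true positive rate (TPR)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)             # false positive rate, the x-axis of a ROC curve
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "fpr": fpr}


# Hypothetical counts for a 31-record test set (assumption, not from the report).
metrics = classification_metrics(tp=12, fp=2, fn=2, tn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```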
Chapter 10
CONCLUSION
The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in
high-risk patients and in turn reduce the complications, which can be a great milestone in the field
of medicine. This project applied feature selection, i.e. backward elimination and
RFECV, behind the models and successfully predicted heart disease with 85% accuracy. The
model used was Logistic Regression. For further enhancement, we can train models to
predict the types of cardiovascular disease and provide recommendations to the users through an
enhanced model.
Bibliography
[1] World Health Organization and J. Dostupno, Cardiovascular diseases: key facts, vol. 13, no. 2016,
p. 6, 2016. [Online]. https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[2] K. Uyar, A. Ilhan, Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy
neural networks, Proced. Comput. Sci., 120 (2017), pp. 588-593.
[3] N. Kausar, S. Palaniappan, B.B. Samir, A. Abdullah, N. Dey, Systematic analysis of applied
data mining based optimization algorithms in clinical attribute extraction and
classification for diagnosis of cardiac patients, in Applications of Intelligent Optimization
in Biology and Medicine, Cham, Switzerland: Springer (2016), pp. 217-231.
[4] M. Shouman, T. Turner, R. Stocker, Integrating clustering with different data mining
techniques in the diagnosis of heart disease, J. Comput. Sci. Eng., 20 (1) (2013), pp. 1-10.
[5] M.S. Amin, Y.K. Chiam, K.D. Varathan, Identification of significant features and data
mining techniques in predicting heart disease, Telemat. Inf., 36 (Mar. 2019), pp. 82-93.
[6] Z. Khan, D.K. Mishra, V. Sharma, A. Sharma, Empirical study of various classification
techniques for heart disease prediction, Proceedings of the IEEE 5th International
Conference on Computing Communication and Automation (ICCCA) (2020), pp. 57-62,
10.1109/ICCCA49541.2020.9250852.
Chapter 11
SAMPLE CODE
FIRST PHASE
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data Collection and Processing
# Load the CSV data into a pandas DataFrame
heart_data = pd.read_csv('/content/data.csv')
# Count how many records fall into each target class
heart_data['target'].value_counts()

# Splitting the Features and Target
X = heart_data.drop(columns='target', axis=1)
Y = heart_data['target']
print(X)
print(Y)
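The imports above suggest that the next steps are a train/test split, fitting the model and
measuring accuracy. The sketch below shows one way this could look. It is not the project's
code: a synthetic table stands in for /content/data.csv, and the column names, the 80/20
split, the random seeds and max_iter are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease table (assumption: not the real CSV).
rng = np.random.default_rng(0)
n = 303
age = rng.integers(29, 78, size=n)
chol = rng.normal(245, 50, size=n)
# Toy rule: risk rises with age and cholesterol, plus random noise.
score = 0.08 * (age - 54) + 0.01 * (chol - 245) + rng.normal(0, 1, size=n)
heart_data = pd.DataFrame({"age": age, "chol": chol, "target": (score > 0).astype(int)})

X = heart_data.drop(columns="target")
Y = heart_data["target"]

# Hold out 20% of the records for testing, keeping the class balance in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

train_acc = accuracy_score(Y_train, model.predict(X_train))
test_acc = accuracy_score(Y_test, model.predict(X_test))
print(f"Accuracy on training data: {train_acc:.3f}")
print(f"Accuracy on test data: {test_acc:.3f}")
```

Comparing the training and test accuracy gives a quick check for overfitting; the numbers
printed here describe only the synthetic data.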
Chapter 12
SCREENSHOTS
Data Collection and Processing
loaded the csv data to a Pandas DataFrame