1822 B.E Cse Batchno 114
LEARNING CLASSIFICATION
Sathyabama Institute of Science and
Technology (Deemed to be University)
By
GOPALAKRISHNAN S (Reg. No. 38110169)
KAMESVAR G (Reg. No. 38110228)
MARCH - 2022
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
(Established under Section 3 of UGC Act, 1956)
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI– 600119
www.sathyabamauniversity.ac.in
BONAFIDE CERTIFICATE
Internal Guide
Dr. A. Pravin
requirements for the award of Bachelor of Engineering degree in Computer Science and
Engineering.
DATE:
I convey my thanks to Dr. T. Sasikala M.E., Ph.D., Dean, School of Computing, Dr.
S. Vigneshwari M.E., Ph.D., and Dr. L. Lakshmanan M.E., Ph.D., Heads of the
Department of Computer Science and Engineering, for providing me the necessary
support and details at the right time during the progressive reviews.
I would like to express my sincere and deep sense of gratitude to my Project Guide, Dr. A.
Pravin, whose valuable guidance, suggestions and constant encouragement paved the way
for the successful completion of my project work.
I wish to express my thanks to all teaching and non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many
ways for the completion of the project.
ABSTRACT
5.3.4 MODEL TRAINING
5.3.5 TESTING MODEL
5.3.6 PERFORMANCE EVALUATION
5.3.7 PREDICTION
6 RESULT AND DISCUSSION
7 CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
7.2 FUTURE WORK
8 REFERENCES
9 APPENDIX
A. SOURCE CODE
B. OUTPUT SCREENSHOTS
C. PLAGIARISM REPORT
LIST OF FIGURES
LIST OF ABBREVIATIONS
ML – Machine Learning
RF – Random Forest
CHAPTER – 1
INTRODUCTION
1.1 Introduction
Heart disease (HD) is a critical health issue, and numerous people around the world
suffer from it. HD presents with common symptoms such as shortness of breath,
physical weakness and swollen feet. Researchers are trying to find an efficient
technique for the detection of heart disease, because current diagnosis techniques are
not very effective at early identification for several reasons, such as limited accuracy
and long execution time. The diagnosis and treatment of heart disease is extremely
difficult when modern technology and medical experts are not available. Effective
diagnosis and proper treatment can save the lives of many people. According to the
European Society of Cardiology, approximately 26 million people have been diagnosed
with HD, and 3.6 million more are diagnosed annually. A large number of people in the
United States also suffer from heart disease. Diagnosis of HD is traditionally done by a
physician through analysis of the patient's medical history, the physical examination
report and the relevant symptoms. However, the results obtained from this diagnosis
method are not accurate in identifying patients with HD. Moreover, it is expensive and
computationally difficult to analyze. A non-invasive diagnosis system based on machine
learning classifiers is therefore needed to resolve these issues. An expert decision
system based on machine learning classifiers and fuzzy logic can diagnose HD
effectively and, as a result, reduce the death rate.
CHAPTER -2
LITERATURE REVIEW
Dr. M. Kavitha (2021): Heart disease causes a significant mortality rate around the
world, and it has become a health threat for many people. Early prediction of heart
disease may save many lives; detecting cardiovascular diseases such as heart attacks
and coronary artery disease is a critical challenge for regular clinical data
analysis. Machine learning (ML) can bring an effective solution for decision making
and accurate predictions. The medical industry is showing enormous development in
using machine learning techniques. In this work, a novel machine learning approach is
proposed to predict heart disease. The study uses the Cleveland heart disease dataset
and applies data mining techniques such as regression and classification. Three
machine learning algorithms are implemented: 1. Random Forest, 2. Decision Tree and
3. a hybrid model (a hybrid of Random Forest and Decision Tree).
Experimental results show an accuracy of 88.7% for the heart disease prediction
model with the hybrid model. An interface is designed to take the user's input
parameters and predict heart disease using the hybrid model of Decision Tree and
Random Forest.
Abderrahmane Ed-daoudy (2019): Over the last few decades, heart disease has been the
most common cause of death globally, so early detection of heart disease and continuous
monitoring can reduce the mortality rate. Sources such as wearable sensor devices used
in Internet of Things health monitoring and streaming systems generate an enormous
amount of data on a continuous basis. The combination of streaming big data analytics
and machine learning is a breakthrough technology that can have a significant impact in
the healthcare field, especially for early detection of heart disease, and it can be both
more powerful and less expensive. This paper proposes a real-time heart disease
prediction system based on Apache Spark, a large-scale distributed computing platform
that can apply machine learning to streaming data events through in-memory
computations. The system consists of two main parts, namely stream processing, and
data storage and visualization. The first uses Spark MLlib with Spark Streaming and
applies a classification model to data events to predict heart disease. The second uses
Apache Cassandra for storing the large volume of generated data.
Rahma Atallah (2019): This paper presents a majority voting ensemble method that is
able to predict the possible presence of heart disease in humans. The prediction is
based on simple, affordable medical tests that can be conducted in any local clinic.
Moreover, the aim of the project is to give more confidence and accuracy to the doctor's
diagnosis, since the model is trained using real-life data of healthy and ill patients. The
model classifies the patient based on the majority vote of several machine learning
models in order to provide more accurate results than a single model.
This approach produced an accuracy of 90% with the hard voting
ensemble model.
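Hard (majority) voting as described above can be sketched in a few lines. This is only an illustration: the three models and their outputs below are hypothetical and are not taken from the cited paper.

```python
from collections import Counter

def hard_vote(votes):
    """Return the label that most models predicted for one patient."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical outputs of three models for four patients
# (1 = heart disease, 0 = no heart disease).
model_outputs = [
    [1, 0, 1, 0],  # model A
    [1, 1, 0, 0],  # model B
    [0, 0, 1, 0],  # model C
]

# zip(*...) regroups the outputs per patient, so each patient gets one vote per model.
ensemble = [hard_vote(votes) for votes in zip(*model_outputs)]
print(ensemble)
```

With an odd number of models and two classes there can be no tie, which is one reason small voting ensembles often use three or five members.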
Noor Basha (2019): Analysis and prediction of diseases are two of the most demanding
tasks faced by doctors and data scientists, and data analytics plays an important role
in them. Many health industries work on a variety of human syndromes and generate
huge amounts of data. Heart disease, cancer, tumors and Alzheimer's disease are among
the chronic human diseases for which data scientists and doctors are carrying out rapid
and efficient analysis using many machine learning techniques, studying and predicting
these diseases in order to reduce human deaths.
CHAPTER - 3
Negative result = 0: the patient will not be diagnosed with heart disease.
In the proposed work, the user searches for a heart disease diagnosis (heart
disease and treatment-related information) by giving symptoms as a query in the
search engine.
These symptoms are pre-processed to make it easier to find
the symptom keywords, which help to identify the heart disease quickly.
CFS+PSO is a type of instance-based learning, or lazy learning, where the
function is only approximated locally and all computation is deferred until
classification.
This feature has been identified as the most suitable for the present system.
3.2.2 Advantages
1. It is easy to extract signatures from individual data instances because of their
structure; collecting the symptoms is enough to scale up the samples.
2. The heart disease level and severity can be predicted easily using range-level
queries.
3. The probability of a vocabulary gap between diverse health seekers makes the
data more consistent compared to other formats of health data.
3.2.3 Disadvantages
1. Existing systems have failed to understand and make use of the importance of
misdiagnosis, a very important attribute that interconnects and addresses all
these issues.
2. It varies with the patient's medical history, climatic conditions, neighborhood and
various other factors.
CHAPTER 4
Introduction:
In this chapter, we discuss the workflow of a machine learning project, which includes
all the steps required to build a proper machine learning project from scratch.
We also go over data pre-processing, data cleaning, feature exploration and
feature engineering, and show the impact they have on machine learning model
performance. We also cover a couple of pre-modeling steps that can help to
improve model performance.
1. Gathering data
2. Data pre-processing
3. Researching the model that will be best for the type of data
4. Training and testing the model
5. Evaluation
The machine learning model is nothing but a piece of code; an engineer or data
scientist makes it smart through training with data. So, if you give garbage to the
model, you will get garbage in return, i.e. the trained model will produce false or wrong
predictions.
4.4 Researching the model that will be best for the type of data
Our main goal is to train the best performing model possible, using the pre-processed
data.
As shown in the above representation, we have two classes plotted on the graph, red
and blue, which can be taken to represent 'setosa flower' and 'versicolor flower'. We can
imagine the X-axis as the 'Sepal Width' and the Y-axis as the 'Sepal Length', and we try
to create the best-fit line that separates the two classes of flowers.
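The idea of a line that separates two classes can be illustrated with a perceptron, one simple way of learning such a line. The report does not state how the line in the figure was obtained, so this is only a sketch: the points are hypothetical stand-ins for the two flower classes, and the two features play the roles of sepal width and sepal length.

```python
# Hypothetical, linearly separable points: ((x1, x2), label),
# where label -1 stands for one flower class and +1 for the other.
samples = [
    ((1.0, 1.0), -1), ((1.5, 0.5), -1), ((0.5, 1.5), -1), ((1.0, 2.0), -1),
    ((4.0, 4.0), +1), ((4.5, 3.5), +1), ((3.5, 4.5), +1), ((4.0, 3.0), +1),
]

def train_perceptron(data, epochs=1000):
    """Learn (w1, w2, b) of a separating line w1*x1 + w2*x2 + b = 0."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), label in data:
            # A point is misclassified if it lies on the wrong side of (or on) the line.
            if label * (w1 * x1 + w2 * x2 + b) <= 0:
                w1 += label * x1
                w2 += label * x2
                b += label
                mistakes += 1
        if mistakes == 0:  # every point is on its correct side: stop early
            break
    return w1, w2, b

def predict(weights, point):
    w1, w2, b = weights
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else -1

weights = train_perceptron(samples)
predictions = [predict(weights, point) for point, _ in samples]
print(weights, predictions)
```

The perceptron only finds some separating line, not necessarily the one with the widest margin; it is used here because it fits in a few lines.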
Regression:
A regression problem is one in which the target variable is continuous (i.e. the output is
numeric).
Figure 4.3: Regression
As shown in the above representation, we can imagine that the graph's X-axis is the
'Test scores' and the Y-axis represents 'IQ'. We try to create the best-fit line in the
given graph so that we can use that line to predict an approximate IQ that isn't
present in the given data.
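A best-fit line of this kind is commonly obtained by ordinary least squares. The sketch below assumes that method and uses hypothetical test scores and IQ values chosen to lie exactly on a line, so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, not taken from the report.
test_scores = [50, 60, 70, 80, 90]
iq = [90, 95, 100, 105, 110]

slope, intercept = fit_line(test_scores, iq)
# Predict the IQ for a test score that is not in the data.
predicted_iq = slope * 75 + intercept
print(slope, intercept, predicted_iq)
```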
Unsupervised Learning:
Unsupervised learning is divided into two categories, "Clustering" and "Association".
Clustering:
A set of inputs is to be divided into groups. Unlike in classification, the groups are not
known beforehand, making this typically an unsupervised task.
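As an illustration of clustering, the sketch below uses k-means, which is one common algorithm and not necessarily the one intended by the report. The points and the two starting centroids are hypothetical; the starting centroids are passed in explicitly so that the run is deterministic.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Group points around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

No class labels are supplied anywhere; the two groups emerge from the distances alone, which is what makes the task unsupervised.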
Figure 4.5: Overview of model
Validation set:
Cross-validation is primarily used in applied machine learning to estimate the skill of a
machine learning model on unseen data. A set of unseen data is used from the
training data to tune the parameters of a classifier.
Once the data is divided into the three segments (training, validation and test) we can
start the training process. The training set is used to build the model, while the test (or
validation) set is used to validate the model built. Data points in the training set are
excluded from the test (validation) set. Usually, in each iteration a dataset is divided
either into a training set and a validation set (some people use 'test set' instead), or into
a training set, a validation set and a test set. The model is any one of
the models chosen in step 3. Once the model is trained, we can
use the same trained model to predict on the testing data, i.e. the unseen data.
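The three-way split can be sketched as follows. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions made for this example; the report does not state the ratios it used.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the rows and cut them into disjoint train/validation/test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]      # whatever remains is the test set
    return train, val, test

records = list(range(100))  # stand-in for 100 patient records
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))
```

Because the three slices come from one shuffled list, no record can appear in more than one set, which is the exclusion property described above.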
Once this is done we can build a confusion matrix, which tells us how well our model
has been trained. For a two-class problem a confusion matrix has four entries: 'True
Positives', 'True Negatives', 'False Positives' and 'False Negatives'. The more predictions
fall into the true positives and true negatives, the more accurate the model. The size of
the confusion matrix depends on the number of classes.
True positives: We predicted TRUE and the actual output is TRUE.
True negatives: We predicted FALSE and the actual output is FALSE.
False positives: We predicted TRUE, but the actual output is FALSE.
False negatives: We predicted FALSE, but the actual output is TRUE.
We can also find the accuracy of the model using the confusion matrix. Accuracy =
(True Positives + True Negatives) / (Total number of predictions), i.e. for the above
example: Accuracy = (100 + 50) / 165 = 0.909 (90.9% accuracy)
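The accuracy calculation can be reproduced with a short sketch. The text gives 100 true positives, 50 true negatives and 165 predictions in total; the split of the remaining 15 errors into 10 false positives and 5 false negatives is an assumption made here so that precision, recall and F1 can also be shown.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 0 and a == 0:
            tn += 1
        elif p == 1 and a == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

# 100 TP, 50 TN, then 10 FP and 5 FN (the last two counts are assumed).
actual = [1] * 100 + [0] * 50 + [0] * 10 + [1] * 5
predicted = [1] * 100 + [0] * 50 + [1] * 10 + [0] * 5

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also called sensitivity
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```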
4.5 Evaluation
Model evaluation is an integral part of the model development process. It helps to find
the best model that represents our data and to estimate how well the chosen model will
work in the future. To improve the model we might tune its hyper-parameters to
improve the accuracy, and look at the confusion matrix to try to
increase the number of true positives and true negatives.
CHAPTER 5
5.3.2 Pre-Processing:
The Cleveland heart disease training dataset is downloaded from the UCI
Machine Learning Repository website and saved as a text file. This file is then
imported into an Excel spreadsheet and the values are saved with the corresponding
attributes as column headers. The missing values are replaced with appropriate
values.
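The report does not say which values were used to replace the missing entries. As one possible approach, the sketch below fills missing entries of a numeric column with the column median. The "?" marker, the sample column and the function name `impute_column` are assumptions made for this example.

```python
import statistics

def impute_column(values, missing_marker="?"):
    """Replace missing entries of a numeric column with the column median."""
    known = [float(v) for v in values if v != missing_marker]
    fill = statistics.median(known)
    return [fill if v == missing_marker else float(v) for v in values]

# Hypothetical raw values of one attribute as read from the text file.
ca_column = ["0", "3", "?", "1", "0", "?", "2"]
cleaned = impute_column(ca_column)
print(cleaned)
```

The median is used rather than the mean because it is less affected by extreme values; for categorical attributes the most frequent value would be the analogous choice.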
5.3.7 Prediction:
"Prediction" refers to the output of an algorithm after it has been trained on a historical
dataset and applied to new data when forecasting the likelihood of a particular
outcome, such as whether or not a customer will churn in 30 days.
The algorithm will generate probable values for an unknown variable for each record in
the new data, allowing the model builder to identify what that value will most likely be.
The word "prediction" can be misleading. In some cases, it really does mean that you
are predicting a future outcome, such as when you're using machine learning to
determine the next best action in a marketing campaign.
Other times, though, the "prediction" has to do with, for example, whether or not a
transaction that already occurred was fraudulent.
In that case, the transaction already happened, but you're making an educated guess
about whether or not it was legitimate, allowing you to take the appropriate action. In
this module, we use the trained and optimized machine learning model to predict whether
the patient has heart disease, based on the patient's answers to a set of questions.
CHAPTER – 6
Different machine learning algorithms were applied, followed by deep learning to see
what difference it makes on the data, using three approaches. In
the first approach, the dataset as acquired is used directly for classification.
In the second approach, feature selection is applied to the data but
there is no outlier detection; the results achieved are quite promising.
In the third approach, the dataset is normalized, with outlier handling and
feature selection; the results achieved are much better than with the previous
techniques and compare well with the accuracies reported in other research.
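The report does not specify how outliers were handled or how the data was normalized. The sketch below shows one common combination, the interquartile-range (IQR) rule followed by min-max scaling, on a hypothetical cholesterol column; both techniques are assumptions made for illustration.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep only the values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical cholesterol readings; 600 is an obvious outlier.
cholesterol = [180, 200, 210, 220, 230, 240, 250, 600]
kept = remove_outliers_iqr(cholesterol)
scaled = min_max_scale(kept)
print(kept, scaled)
```

Removing the outlier before scaling matters: if 600 were kept, min-max scaling would squeeze all the other readings into a narrow band near zero.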
Figure 6.1: Test Accuracy of Random Forest
Figure 6.1: Comparison of three algorithms
Here we compare the accuracy of three models, namely Logistic
Regression, KNN and Random Forest. Among these, the Logistic Regression model has
the best overall accuracy and F1 score. Therefore, we use the Logistic Regression
algorithm to predict heart disease.
Figure 6.4: Web page interface
CHAPTER – 7
7.1 CONCLUSION
Clinical diagnosis is a significant area of research that helps to identify the
occurrence of heart disease. The system, using the different methods mentioned, will
reveal the root heart disease along with the set of most probable heart diseases that
have similar symptoms. The database used is a descriptive dataset, so tokenization,
filtering and stemming are carried out to reduce it. The project presents a novel hybrid
model to identify and confirm CAD cases at low cost by using clinical data that can be
easily collected at hospitals. The complexity of the system is reduced by reducing the
dimensionality of the dataset with PSO. It provides reproducible and objective
diagnosis, and can therefore be a valuable additional tool in clinical practice. The
results are comparatively encouraging, and thus the proposed hybrid method will be
helpful in heart disease diagnostics. Experimental results demonstrate the superiority
of the proposed hybrid method with respect to the prediction accuracy of CAD.
In this work, a comparative analysis of three methods was carried out and promising
results were achieved. The conclusion we reached is that machine learning
algorithms performed better in this analysis. Many researchers have previously
suggested using ML where the dataset is not very large, which is supported by
this work. The measures used for comparison are the confusion matrix,
precision, specificity, sensitivity and F1 score. For the 13 features in the
dataset, the K-Neighbors classifier performed better in the ML approach when data
preprocessing was applied.
REFERENCES
APPENDICES
A. SOURCE CODE
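try:
    #imports required by the code shown in this appendix
    import pickle
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
except Exception as e:
    print("Unable to import the libraries",e)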
#==============================Data Preprocessing================================
#loading the dataset
dataset=pd.read_csv('heart_data.csv', sep='\t' )
#columns to be encoded: cp(2), restecg(6), slope(10), ca(11), thal(12)
ct=ColumnTransformer([('encoder', OneHotEncoder(drop='first'), [2,6,10,11,12])],
remainder='passthrough')
dataset_X = ct.fit_transform(dataset_X)
#==============================Evaluation=======================================
#scores
def scores(pred,test,model):
    #scikit-learn metrics expect the true labels first, then the predictions
    print(('\n==========Scores for {} ==========\n').format(model))
    print(f"Accuracy Score : {accuracy_score(test,pred) * 100:.2f}% ")
    print(f"Precision Score : {precision_score(test,pred) * 100:.2f}% ")
    print(f"Recall Score : {recall_score(test,pred) * 100:.2f}% ")
    print("Confusion Matrix :\n",confusion_matrix(test,pred))
#====================================LR_Tuned====================================
#logistic regression
lr = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid_lr={"solver":solvers,
         "penalty":penalty,
         "C":c_values}
#===================================KNN==========================================
#K Nearest Neighbour
best_k=accuracy.index(max(accuracy))
#===================================KNN_Tuned====================================
#K Nearest Neighbour with hyperparameter tuning
knn_t = KNeighborsClassifier()
n_neighbors = range(1,20)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
#====================================RF==========================================
#Random Forest
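n_estimators = [int(x) for x in np.linspace(start=10, stop=200, num=10)]
#'auto' was removed in scikit-learn 1.3 (for classifiers it meant 'sqrt')
max_features = ['sqrt', 'log2']
max_depth = [int(x) for x in np.linspace(10, 1000, 10)]
#None is not a valid value for min_samples_split or min_samples_leaf
min_samples_split = [2, 5, 10, 14]
min_samples_leaf = [1, 2, 4, 6, 8]
#search space built from the lists above (assumed; the original dictionary is not shown)
random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
rf=RandomForestClassifier()
randomcv=RandomizedSearchCV(estimator=rf,param_distributions=random_grid,n_iter=300,
                            cv=10,random_state=100,n_jobs=-1)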
#=============================Saving the models==================================
#saving model to disk
#==============================Testing model response============================
#test the pickle file
def test_model(row_number):
    #load the saved model and compare its prediction with the real label of one row
    with open('ml_model.pkl', 'rb') as f:
        model=pickle.load(f)
    value,real=dataset_X[row_number,:].reshape(1,-1),dataset_Y[row_number,:]
    print(("\n The value predicted is : {} and the real value is : {}").format(model.predict(value), real))
test_model(102)
print('\nCompleted')
B. OUTPUT SCREENSHOTS
[Screenshot: prediction result "Safe"]
[Screenshot: prediction result "Need Medical Attention"]
C. PLAGIARISM REPORT