
Visvesvaraya Technological University: HMS Institute of Technology

This document is a Machine Learning Internship Report by Shalini M, focusing on predicting heart disease using logistic regression as part of her Bachelor of Engineering in Computer Science and Engineering. The report outlines the significance of early detection of cardiovascular diseases, the methodology employed including data preprocessing and model evaluation, and the implementation of the logistic regression model. It highlights the project's aim to improve prediction accuracy through machine learning techniques, ultimately addressing the high mortality rates associated with heart diseases.

Uploaded by

Shalini M

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

JnanaSangama, Belgaum-590014

A Machine Learning Internship Report On

“Heart disease prediction using logistic regression”


Submitted in Partial fulfillment of the Requirements for the VII Semester of the
Degree of
Bachelor of Engineering
In
Computer Science & Engineering
By
SHALINI M (1HM20CS034)

Under the Guidance of


Mr.Reetesh

HMS INSTITUTE OF TECHNOLOGY


NH-4, Kesaramadu Post Kyathasandra Karnataka 572104
2022-2023
HMS INSTITUTE OF TECHNOLOGY
NH-4, Kesaramadu Post Kyathasandra Karnataka 572104

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
Certified that the Machine Learning Project work entitled “HEART DISEASE PREDICTION USING
LOGISTIC REGRESSION” has been carried out by SHALINI M (1HM20CS034), a bona fide student
of City Engineering College, in partial fulfilment for the award of Bachelor of Engineering in Computer
Science and Engineering of the Visvesvaraya Technological University, Belgaum during the year
2022-2023. It is certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the Report deposited in the departmental library. The Machine Learning Mini Project
Report has been approved as it satisfies the academic requirements in respect of project work prescribed
for the said Degree.

________________ ___________________ _________________


Signature of Guide Signature of HOD Signature of Principal
Mr. Dr. A. Vijayaraghavan BE, M Tech, Ph. D Dr. Kavitha.A.S Ph.D
Asst. Professor, Dept. of CS&E HOD, Dept. of CS&E Principal
H.M.S.I.T, TUMAKUR H.M.S.I. T, TUMAKUR H.M.S.I. T, TUMAKUR

Name of Examiners Signature with date


1.______________________ ____________________
2. _____________________ ____________________
Abstract

The heart is the primary organ of the circulatory system; it keeps oxygen-rich blood
circulating throughout the body. For the past two decades, heart disease has remained a leading
cause of death worldwide. Statistics on the share of deaths caused by heart attacks illustrate
the lethality of cardiovascular disease. It is therefore crucial to predict the condition as
early as possible. Cardiologists face limitations and cannot always predict heart disease risk
with a high degree of accuracy, so a reliable, accurate and feasible system is required to
predict such diseases in time for proper treatment. Machine learning algorithms and techniques
have been applied to automate the analysis of large and complex medical datasets, and
researchers and health care professionals increasingly use them to diagnose heart-related
conditions. A quick and efficient detection technique is needed to reduce the high death rate
caused by heart disease, and machine learning algorithms and data mining techniques play a
crucial role here. Using machine learning algorithms, this project aims to predict the
occurrence of heart disease in a patient.
Keywords: Machine Learning, Supervised Learning, Unsupervised Learning, Logistic
Regression, Cardiovascular Diseases
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would
be incomplete without mention of the people who made it possible and whose constant
encouragement and guidance have been a source of inspiration throughout the project.

I am extremely grateful to our beloved Principal, Dr. Kavitha A.S,
for providing me with a congenial atmosphere and the necessary facilities for achieving the
cherished goal of this project.

I feel delighted to have this page to express my sincere thanks and deep
appreciation to Dr. A. Vijayaraghavan, Head of the Department, Computer Science and
Engineering, for his valuable guidance, keen interest and constant encouragement throughout the
entire period of this project.

I would like to thank my guide, Mr. Ranganatha H R, Asst. Professor, Computer
Science and Engineering, for his valuable guidance and constant support throughout the project.

I would like to thank my parents for their valuable guidance and moral support
throughout the project.
Finally, I thank all the teaching and non-teaching staff for allowing me to
successfully carry out the project. I also thank my friends, who provided a lot of support in this
project.

SHALINI M (1HM20CS034)
TABLE OF CONTENTS

ACKNOWLEDGEMENT
Chapter 1
INTRODUCTION
Chapter 2
LOGISTIC REGRESSION
Chapter 3
METHODOLOGY
Data Preprocessing
Data Source
Exploratory Data Analysis
Data Visualization
The Correlation between Variables Analysis
Data Validation
Chapter 4
MODEL EVALUATION
Train-Test Split
Model Accuracy
Model Summary
Sensitivity and Specificity
Threshold Adjustment
Threshold Tuning
AUC (Area Under the Curve)
ROC Curve and AUC
ROC Curve
Chapter 5
IMPLEMENTATION
Code:
Logistic Regression
Feature Selection: Backward elimination (P-value approach)
Logistic regression equation
Splitting data to train and test split
Model Evaluation
Model accuracy
Confusion matrix
ROC curve
Chapter 6
CONCLUSION
Appendix
Data Source References
Chapter 1
INTRODUCTION
The number of people suffering from cardiovascular disease is on the rise. Numerous
factors carry the risk of developing this disease, such as age, high blood pressure, high
cholesterol, diabetes, genes, obesity, and unhealthy lifestyles. A variety of symptoms can be
identified by observing physical signs such as chest pain, shortness of breath,
dizziness, and tiring easily. Even though these diseases are the
leading cause of death, they have been classified as among the most manageable and preventable
illnesses. Identifying cardiovascular disease is a difficult process, and early detection is
crucial because its complications can affect a person's life as a
whole.

The signs of a heart attack in women are often much less noticeable than in
men. In women, a heart attack may feel like uncomfortable squeezing, pressure, fullness, or pain in
the center of the chest. It may also cause pain in one or both arms, the back, neck, jaw or
stomach, shortness of breath, nausea and other symptoms. Men experience typical symptoms of
heart attack, such as chest pain, discomfort, and stress. They may also experience pain in other
areas, such as the arms, neck, back, and jaw, as well as shortness of breath, sweating, and discomfort that
mimics heartburn.

Cardiovascular disease diagnosis and treatment are very complex. Invasive
techniques are still employed, while diagnoses based on the patient's medical history and the
physician's physical examination reports tend to be less accurate and take a long time to prepare.
For this reason, a support system is implemented to predict cardiovascular disease through a
machine learning model. A machine learning approach may improve accuracy by leveraging the
complex interactions between risk factors.
Chapter 2
LOGISTIC REGRESSION

Logistic regression is a process of modeling the probability of a discrete outcome given
an input variable. The most common form of logistic regression models a binary outcome: something
that can take two values, such as true/false or yes/no. Multinomial logistic regression can
model scenarios where there are more than two possible discrete outcomes. Logistic regression is
a useful analysis method for classification problems, where you are trying to determine which
category a new sample fits best. Because many aspects of cyber security, such as
attack detection, are classification problems, logistic regression is a useful analytic technique there too. Logistic regression, despite its
name, is a classification model rather than a regression model. It is a simple and
efficient method for binary and linear classification problems. It is
very easy to implement and achieves very good performance with linearly separable classes,
and it is extensively employed for classification in industry. The logistic regression
model, like Adaline and the perceptron, is a method for binary classification that can be
generalized to multiclass classification. Scikit-learn has a highly optimized
logistic regression implementation, which supports multiclass classification tasks.

Logistic regression is a powerful supervised ML algorithm used for binary classification
problems (when the target is categorical). The best way to think about logistic regression is as
linear regression adapted to classification problems. Logistic regression uses the logistic
function defined below to model a binary output variable. The primary difference between linear
regression and logistic regression is that the output of logistic regression is bounded between 0 and 1.
In addition, as opposed to linear regression, logistic regression does not require a linear
relationship between the inputs and the output variable. This is because a nonlinear log
transformation is applied to the odds, so that the log-odds, rather than the output itself, are
modeled as a linear function of the inputs.

Formula of the logistic regression sigmoid function:

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

In the logistic function equation, z is the input variable. Feeding values from −20 to 20 into the
logistic function maps every input to a value between 0 and 1.

Figure 1: Sigmoid Function
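
As a rough illustration of the sigmoid mapping described above, the following minimal sketch evaluates the function on a few inputs between −20 and 20. It uses only the Python standard library; the function name `sigmoid` and the chosen sample inputs are assumptions made for this example and are not part of the report's own code.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    # Numerically stable form: avoids overflow in exp() for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# A few sample inputs spanning the -20..20 range mentioned in the text.
inputs = [-20, -5, 0, 5, 20]
outputs = [sigmoid(z) for z in inputs]

for z, p in zip(inputs, outputs):
    print(f"sigmoid({z:>3}) = {p:.6f}")
```

Inputs far below zero map close to 0, inputs far above zero map close to 1, and an input of 0 maps to exactly 0.5, which is why 0.5 is the usual default classification threshold.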


Chapter 3
METHODOLOGY
Several factors that affect the human cardiovascular system are examined in this study. The
process begins with data retrieval, followed by analysis of the correlation between variables, splitting of the data, and
prediction with the logistic regression algorithm, and ends with data validation.

Data Preprocessing
Data Source
The dataset used for this analysis is sourced from Kaggle and contains information
collected as part of the Framingham Heart Study. It includes the following variables:

• Sex: Gender of the patient (male or female)


• Age: Age of the patient (continuous)
• Current Smoker: Whether the patient is a current smoker (yes or no)
• Cigarettes per Day: The number of cigarettes smoked per day (continuous)
• Blood Pressure Medication (BPMeds): Whether the patient is on blood pressure
medication (yes or no)
• Prevalent Stroke: Whether the patient had a previous stroke (yes or no)
• Prevalent Hypertension: Whether the patient is hypertensive (yes or no)
• Diabetes: Whether the patient has diabetes (yes or no)
• Total Cholesterol (totChol): Total cholesterol level (continuous)
• Systolic Blood Pressure (sysBP): Systolic blood pressure (continuous)
• Diastolic Blood Pressure (diaBP): Diastolic blood pressure (continuous)
• Body Mass Index (BMI): Body Mass Index (continuous)
• Heart Rate: Heart rate (continuous)
• Glucose: Glucose level (continuous)
• Ten-Year CHD: Ten-year risk of coronary heart disease (binary: "1" for yes, "0" for no)
Data Preprocessing Steps
The dataset was initially loaded, and missing values were handled by removing rows with
missing data.
The target variable, "Ten-Year CHD," was identified as the variable to predict.

Exploratory Data Analysis


Data Visualization
Exploratory data analysis was performed to gain insights into the dataset's characteristics.
Key visualizations included histograms and pair plots, which allowed us to understand the
distribution of variables and relationships between them.

The Correlation between Variables Analysis


To facilitate data analysis, all variables in the imported dataset are visualized as
histograms, which makes the data easier to read at a glance. The correlation between
variables is then examined to confirm that logistic
regression is an appropriate model. Relationships between variables in the dataset are
plotted in the form of a matrix, which also serves to check whether there is multicollinearity between
variables in the dataset.

Figure 2: Variables in Data
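
The multicollinearity check described above can be sketched as a pairwise Pearson correlation matrix with a cutoff. This is only an illustrative sketch: the toy values below are hypothetical and are NOT taken from the Framingham data, the 0.8 cutoff is an assumed rule of thumb, and the helper name `pearson` is introduced for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical toy values for illustration only (not real patient data).
columns = {
    "sysBP": [110.0, 120.0, 130.0, 140.0, 150.0, 160.0],
    "diaBP": [70.0, 78.0, 82.0, 90.0, 94.0, 102.0],
    "heartRate": [80.0, 62.0, 75.0, 68.0, 90.0, 71.0],
}
names = list(columns)

# Full correlation matrix as a nested dict: corr[a][b].
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Flag variable pairs whose absolute correlation exceeds the assumed cutoff.
CUTOFF = 0.8
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(corr[a][b]) > CUTOFF
]
print("Highly correlated pairs:", flagged)
```

With a pandas DataFrame, `DataFrame.corr()` produces the same kind of matrix in one call; the hand-written version is shown here only to make the calculation explicit.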


Data Validation
The results are validated with a confusion matrix and K-fold cross-validation with
10 folds. The confusion matrix shows the accuracy of the logistic regression
model, while K-fold cross-validation estimates the error
that may occur when the logistic regression model is applied.

Figure 3: Confusion Matrix
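
To make the 10-fold procedure concrete, the sketch below generates the train/test index sets for each fold. It is a simplified illustration under stated assumptions: the helper `k_fold_indices` is hypothetical, the folds are contiguous and unshuffled, and the sample count of 3751 is the number of patients reported in the implementation chapter (3179 + 572). In practice the rows would normally be shuffled first, for example with scikit-learn's `KFold` or `cross_val_score`.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=3751, k=10))

for fold_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_no}: {len(train_idx)} training rows, {len(test_idx)} test rows")
```

Each row appears in exactly one test fold, so averaging the error over the ten folds gives an estimate that does not depend on a single train-test split.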


Chapter 4
MODEL EVALUATION
Model Accuracy
The model's accuracy was evaluated using the testing dataset, and it achieved an accuracy of
approximately 0.88.

Train-Test Split
The dataset was divided into training and testing sets to assess the model's performance.

Model Summary
The final logistic regression model was summarized, including information such as coefficients,
standard errors, z-scores, p-values, and confidence intervals. The model's pseudo R-squared value
was 0.1148, which indicates only a modest fit.

Sensitivity and Specificity


Sensitivity (true positive rate) and specificity (true negative rate) were calculated to understand
how well the model detects true positive cases and true negative cases, respectively.
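
As a worked example, the sketch below computes these rates from the confusion-matrix counts reported in the implementation chapter (4 true positives, 658 true negatives, 1 false positive, 88 false negatives). The variable names are chosen for this illustration.

```python
# Confusion-matrix counts as reported in the implementation chapter.
TP, TN, FP, FN = 4, 658, 1, 88

total = TP + TN + FP + FN
accuracy = (TP + TN) / total      # share of all predictions that are correct
sensitivity = TP / (TP + FN)      # true positive rate
specificity = TN / (TN + FP)      # true negative rate

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
```

The accuracy of roughly 0.88 is driven almost entirely by the negative class: specificity is close to 1, while sensitivity is below 0.05, so most patients who do develop CHD are missed at the default threshold.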

Threshold Adjustment
Threshold Tuning
To balance sensitivity and specificity, threshold tuning was performed. Different threshold values
were tested to find an optimal balance that minimized false negatives (Type II errors) while
maintaining acceptable specificity.
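
The effect of lowering the threshold can be sketched with a handful of predicted probabilities. The labels and probabilities below are hypothetical values made up for illustration, and `confusion_counts` is a helper name introduced here; a probability strictly greater than the threshold is treated as a positive prediction.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for probabilities above threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.12, 0.18, 0.22, 0.35, 0.55, 0.15, 0.28, 0.42, 0.70]

results = []
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5]:
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    results.append((threshold, sensitivity, specificity))
    print(f"threshold={threshold:.1f}  FN={fn}  "
          f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```

Lower thresholds reduce false negatives at the cost of more false positives, which is the trade-off the tuning step has to balance.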

AUC (Area Under the Curve)


The Area Under the ROC Curve (AUC) was calculated to quantify the model's classification
accuracy. The AUC was approximately 0.773, indicating reasonably good performance.

ROC Curve and AUC


ROC Curve
The Receiver Operating Characteristic (ROC) curve was plotted to visualize the trade-off
between true positive rate and false positive rate across various thresholds.
Figure 4: ROC curve for Heart disease classifier
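
The area under the ROC curve has an equivalent interpretation that is easy to compute by hand: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below uses that pairwise definition on hypothetical scores (not model output from this report); `roc_auc` is a helper name introduced for the example.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.12, 0.18, 0.22, 0.35, 0.55, 0.15, 0.28, 0.42, 0.70]

auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.4f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. For real datasets, scikit-learn's `roc_auc_score` computes the same quantity far more efficiently than this quadratic loop.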
Chapter 5
IMPLEMENTATION

Code:
#!/usr/bin/env python
# coding: utf-8
# In[1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.metrics import confusion_matrix
# Notebook-only magic: renders plots inline; valid only inside IPython/Jupyter.
get_ipython().run_line_magic('matplotlib', 'inline')
# In[4]:
# Load the Framingham dataset and drop the unused 'education' column.
heart_df=pd.read_csv(r"C:\Users\shali\Downloads\archive (3)\framingham.csv")
heart_df.drop(['education'],axis=1,inplace=True)
heart_df.head()
# Variables:
# Each attribute is a potential risk factor. There are demographic, behavioural and medical risk factors.
# Demographic:
# - sex: male or female (Nominal)
# - age: age of the patient (Continuous - although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
# Behavioural:
# - currentSmoker: whether or not the patient is a current smoker (Nominal)
# - cigsPerDay: the number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can have any number of cigarettes, even half a cigarette)
# Medical (history):
# - BPMeds: whether or not the patient was on blood pressure medication (Nominal)
# - prevalentStroke: whether or not the patient had previously had a stroke (Nominal)
# - prevalentHyp: whether or not the patient was hypertensive (Nominal)
# - diabetes: whether or not the patient had diabetes (Nominal)
# Medical (current):
# - totChol: total cholesterol level (Continuous)
# - sysBP: systolic blood pressure (Continuous)
# - diaBP: diastolic blood pressure (Continuous)
# - BMI: Body Mass Index (Continuous)
# - heartRate: heart rate (Continuous - in medical research, variables such as heart rate, though in fact discrete, are considered continuous because of the large number of possible values)
# - glucose: glucose level (Continuous)
# Predict variable (desired target):
# - 10 year risk of coronary heart disease CHD (binary: "1" means "Yes", "0" means "No")
# In[9]:
heart_df.rename(columns={'male':'Sex_male'},inplace=True)
# In[6]:
heart_df.isnull().sum()
# In[7]:
# Count the rows that contain at least one missing value.
count=0
for i in heart_df.isnull().sum(axis=1):
    if i>0:
        count=count+1
print('Total number of rows with missing values is ', count)
print('since it is only',round((count/len(heart_df.index))*100), 'percent of the entire dataset the rows with missing values are excluded.')
# In[8]:
heart_df.dropna(axis=0,inplace=True)
# Exploratory Analysis
# In[10]:
def draw_histograms(dataframe, features, rows, cols):
    """Plot one histogram per feature on a rows x cols grid."""
    fig=plt.figure(figsize=(20,20))
    for i, feature in enumerate(features):
        ax=fig.add_subplot(rows,cols,i+1)
        dataframe[feature].hist(bins=20,ax=ax,facecolor='midnightblue')
        ax.set_title(feature+" Distribution",color='DarkRed')

    fig.tight_layout()
    plt.show()
draw_histograms(heart_df,heart_df.columns,6,3)
# In[11]:
heart_df.TenYearCHD.value_counts()
# In[12]:
sn.countplot(x='TenYearCHD',data=heart_df)
# There are 3179 patients with no heart disease and 572 patients with risk of heart disease.
# In[13]:
sn.pairplot(data=heart_df)
# In[14]:
heart_df.describe()

Logistic Regression
# Logistic regression is a type of regression analysis in statistics used to predict the outcome of a categorical dependent variable from a set of predictor or independent variables. In binary logistic regression the dependent variable is always binary. Logistic regression is mainly used for prediction and for calculating the probability of success.
# In[15]:
from statsmodels.tools import add_constant as add_constant
# Add an intercept column ('const') so the model can fit the β0 term.
heart_df_constant = add_constant(heart_df)
heart_df_constant.head()
# In[16]:
# Compatibility shim: older statsmodels releases expect scipy.stats.chisqprob, which SciPy removed.
st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)
# Use every column except the last one (the target, TenYearCHD) as predictors.
cols=heart_df_constant.columns[:-1]
model=sm.Logit(heart_df.TenYearCHD,heart_df_constant[cols])
result=model.fit()
result.summary()
# The results above show some attributes with a P-value higher than the preferred alpha (5%), indicating a statistically weak relationship with the probability of heart disease. Backward elimination is used here to remove the attribute with the highest P-value one at a time, re-running the regression until all attributes have P-values below 0.05.
Feature Selection: Backward elimination (P-value approach)
# In[17]:
def back_feature_elem (data_frame,dep_var,col_list):
    """Takes in the dataframe, the dependent variable and a list of column names, runs the
    regression repeatedly, eliminating the feature with the highest P-value above alpha one
    at a time, and returns the fitted result with all P-values below alpha."""
    while len(col_list)>0 :
        model=sm.Logit(dep_var,data_frame[col_list])
        result=model.fit(disp=0)
        largest_pvalue=round(result.pvalues,3).nlargest(1)
        # .iloc[0] reads the value by position; the Series is indexed by column name.
        if largest_pvalue.iloc[0]<(0.05):
            return result
        else:
            col_list=col_list.drop(largest_pvalue.index)

result=back_feature_elem(heart_df_constant,heart_df.TenYearCHD,cols)
# In[18]:
result.summary()

Logistic regression equation


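# - $P = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$
#
# When all features are plugged in:
#
# $\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{Sex\_male} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{cigsPerDay} + \beta_4 \cdot \text{totChol} + \beta_5 \cdot \text{sysBP} + \beta_6 \cdot \text{glucose}$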

# #### Interpreting the results: Odds Ratio, Confidence Intervals and Pvalues
# In[19]:
params = np.exp(result.params)
conf = np.exp(result.conf_int())
conf['OR'] = params
pvalue=round(result.pvalues,3)
conf['pvalue']=pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio','pvalue']
print ((conf))
# - This fitted model shows that, holding all other features constant, the odds of being diagnosed with heart disease for males (Sex_male = 1) over that of females (Sex_male = 0) is exp(0.5815) = 1.788687. In terms of percent change, the odds for males are 78.8% higher than the odds for females.
#
# - The coefficient for age says that, holding all others constant, we will see a 7% increase in the odds of being diagnosed with CHD for a one-year increase in age, since exp(0.0655) = 1.067644.
#
# - Similarly, with every extra cigarette one smokes there is a 2% increase in the odds of CHD.
#
# - For total cholesterol level and glucose level there is no significant change.
#
# - There is a 1.7% increase in odds for every unit increase in systolic blood pressure.
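
The conversion from a fitted coefficient to an odds ratio and a percent change in odds, as used in the interpretation above, can be sketched as follows. The two coefficients are the rounded values quoted in the text (0.5815 for Sex_male and 0.0655 for age), so the results may differ from the quoted figures in the last decimal places; the helper names are assumptions made for this example.

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in a feature: exp(coefficient)."""
    return math.exp(coefficient)

def percent_change_in_odds(coefficient: float) -> float:
    """Percent change in the odds for a one-unit increase in a feature."""
    return (math.exp(coefficient) - 1.0) * 100.0

# Rounded coefficients as quoted in the interpretation above.
coefficients = {"Sex_male": 0.5815, "age": 0.0655}

for name, beta in coefficients.items():
    print(f"{name}: odds ratio = {odds_ratio(beta):.4f}, "
          f"change in odds = {percent_change_in_odds(beta):.1f}%")
```

A coefficient of 0 gives an odds ratio of exactly 1, meaning no change in the odds, which is why features with coefficients near zero are read as having a negligible effect.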

Splitting data to train and test split


# In[23]:
import sklearn
new_features=heart_df[['age','Sex_male','cigsPerDay','totChol','sysBP','glucose','TenYearCHD']]
x=new_features.iloc[:,:-1]
y=new_features.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=5)
# In[24]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(x_train,y_train)
y_pred=logreg.predict(x_test)

Model Evaluation
Confusion matrix

# In[26]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
plt.figure(figsize = (8,5))
sn.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")
# The confusion matrix shows 658+4 = 662 correct predictions and 88+1= 89 incorrect ones.
#
# - True Positives: 4
#
# - True Negatives: 658
#
# - False Positives: 1 (Type I error)
#
# - False Negatives: 88 ( Type II error)
#
#

# In[27]:
TN=cm[0,0]
TP=cm[1,1]
FN=cm[1,0]
FP=cm[0,1]
sensitivity=TP/float(TP+FN)
specificity=TN/float(TN+FP)
# #### Model Evaluation - Statistics
# In[28]:
print('The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) = ',(TP+TN)/float(TP+TN+FP+FN),'\n',
      'The Misclassification = 1-Accuracy = ',1-((TP+TN)/float(TP+TN+FP+FN)),'\n',
      'Sensitivity or True Positive Rate = TP/(TP+FN) = ',TP/float(TP+FN),'\n',
      'Specificity or True Negative Rate = TN/(TN+FP) = ',TN/float(TN+FP),'\n',
      'Positive Predictive value = TP/(TP+FP) = ',TP/float(TP+FP),'\n',
      'Negative predictive Value = TN/(TN+FN) = ',TN/float(TN+FN),'\n',
      'Positive Likelihood Ratio = Sensitivity/(1-Specificity) = ',sensitivity/(1-specificity),'\n',
      'Negative likelihood Ratio = (1-Sensitivity)/Specificity = ',(1-sensitivity)/specificity)

# ##### From the above statistics it is clear that the model is more specific than sensitive. The negative values are predicted more accurately than the positives.

# ###### Predicted probabilities of 0 (No Coronary Heart Disease) and 1 (Coronary Heart Disease: Yes) for the test data with a default classification threshold of 0.5

# In[29]:
y_pred_prob=logreg.predict_proba(x_test)[:,:]
y_pred_prob_df=pd.DataFrame(data=y_pred_prob, columns=['Prob of no heart disease (0)','Prob of Heart Disease (1)'])
y_pred_prob_df.head()

# ##### Lower the threshold

# ###### Since the model is predicting heart disease, too many Type II errors are not advisable. A False Negative (ignoring the probability of disease when there actually is one) is more dangerous than a False Positive in this case. Hence, in order to increase the sensitivity, the threshold can be lowered.

# In[37]:
from sklearn.preprocessing import binarize
# predict_proba does not depend on the threshold, so compute it once.
y_pred_prob_yes=logreg.predict_proba(x_test)
for i in range(1,5):
    # Values above the threshold become 1; column 1 holds the positive class.
    y_pred2 = binarize(y_pred_prob_yes, threshold=i/10)[:,1]
    cm2=confusion_matrix(y_test,y_pred2)
    print ('With',i/10,'threshold the Confusion Matrix is ','\n',cm2,'\n',
           'with',cm2[0,0]+cm2[1,1],'correct predictions and',cm2[1,0],'Type II errors( False Negatives)','\n\n',
           'Sensitivity: ',cm2[1,1]/(float(cm2[1,1]+cm2[1,0])),'Specificity: ',cm2[0,0]/(float(cm2[0,0]+cm2[0,1])),'\n\n\n')

ROC curve
# In[34]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_yes[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Heart disease classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)

# A common way to visualize the trade-offs of different thresholds is by using an ROC curve, a plot of the true positive rate (# true positives / total # positives) versus the false positive rate (# false positives / total # negatives) for all possible choices of thresholds. A model with good classification accuracy should have significantly more true positives than false positives at all thresholds.
#
# The optimum position for the ROC curve is towards the top left corner, where the specificity and sensitivity are at optimum levels.
#

# Area Under The Curve (AUC)

# The area under the ROC curve quantifies model classification accuracy; the higher the area, the greater the disparity between true and false positives, and the stronger the model in classifying members of the training dataset. An area of 0.5 corresponds to a model that performs no better than random classification, and a good classifier stays as far away from that as possible. An area of 1 is ideal. The closer the AUC is to 1, the better.

# In[35]:
sklearn.metrics.roc_auc_score(y_test,y_pred_prob_yes[:,1])

Model accuracy

# In[25]:
sklearn.metrics.accuracy_score(y_test,y_pred)
# Accuracy of the model is 0.88
Chapter 6
CONCLUSION

• All attributes selected after the elimination process show P-values lower than 5%,
suggesting a significant role in heart disease prediction.
• Men seem to be more susceptible to heart disease than women. Increases in age, number of
cigarettes smoked per day and systolic blood pressure also show increasing odds of
having heart disease.
• Total cholesterol shows no significant change in the odds of CHD. This could be due to
the presence of 'good' cholesterol (HDL) in the total cholesterol reading. Glucose also
causes a negligible change in odds (0.2%).
• The model predicted with 0.88 accuracy. The model is more specific than sensitive.
• The area under the ROC curve is 73.5%, which is somewhat satisfactory.
• The overall model could be improved with more data.

Appendix

▪ http://www.who.int/mediacentre/factsheets/fs317/en/

Data Source References

▪ https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset/data
