Visvesvaraya Technological University: HMS Institute of Technology
JnanaSangama, Belgaum-590014
CERTIFICATE
Certified that the Machine Learning Project work entitled “HEART DISEASE PREDICTION USING
LOGISTIC REGRESSION” has been carried out by SHALINI M (1HM20CS034), a bona fide student
of City Engineering College, in partial fulfilment of the requirements for the award of Bachelor of
Engineering in Computer Science and Engineering of the Visvesvaraya Technological University,
Belgaum, during the year 2022-2023. It is certified that all corrections/suggestions indicated for
Internal Assessment have been incorporated in the Report deposited in the departmental library.
The Machine Learning Mini Project Report has been approved as it satisfies the academic
requirements in respect of project work prescribed for the said Degree.
The heart is the primary organ of the circulatory system; it keeps oxygen-rich blood
circulating throughout the body. For the past two decades, heart disease has remained a leading
cause of death worldwide. Statistics on the share of deaths caused by heart attacks illustrate
the lethality of cardiovascular disease. It is therefore crucial to predict the condition as
early as possible. Cardiologists cannot always predict heart disease risk with a high degree of
accuracy, so a reliable, accurate and feasible system is required to predict such diseases in
time for proper treatment. Machine learning algorithms and techniques have been applied to
automate the analysis of large and complex medical datasets, and they are increasingly used by
researchers and professionals in the health care industry to diagnose heart-related conditions.
A quick and efficient detection technique is needed to reduce the high death rate caused by
heart disease, and machine learning algorithms and data mining techniques play a crucial role
here. This project aims to use machine learning algorithms to predict the occurrence of heart
disease in a patient.
Keywords: Machine Learning, Supervised Learning, Unsupervised Learning, Logistic
Regression, Cardiovascular Diseases
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would
be incomplete without the mention of the people who made it possible and whose constant
encouragement and guidance have been a source of inspiration throughout the project.
I am extremely grateful to our beloved Principal, Dr. Kavitha A.S,
for providing me a congenial atmosphere and the necessary facilities for achieving the
cherished goal of this Project.
I feel delighted to have this page to express my sincere thanks and deep
appreciation to Dr. A. Vijayaraghavan, Head of the Department, Computer Science and
Engineering, for his valuable guidance, keen interest and constant encouragement throughout the
entire period of this project.
I would like to thank my parents for their valuable guidance and moral support
throughout the Project.
Finally, I thank all the teaching and non-teaching staff for allowing me to
successfully carry out the Project. I also thank my friends, who provided a lot of support in this
Project.
SHALINI M (1HM20CS034)
TABLE OF CONTENTS
ACKNOWLEDGEMENT
Chapter 1
INTRODUCTION
Chapter 2
LOGISTIC REGRESSION
Chapter 3
METHODOLOGY
Data Preprocessing
Data Source
Exploratory Data Analysis
Data Visualization
The Correlation between Variables Analysis
Data Validation
Chapter 4
MODEL EVALUATION
Train-Test Split
Model Accuracy
Model Summary
Sensitivity and Specificity
Threshold Adjustment
Threshold Tuning
AUC (Area Under the Curve)
ROC Curve and AUC
ROC Curve
Chapter 5
IMPLEMENTATION
Code:
Logistic Regression
Feature Selection: Backward elimination (P-value approach)
Logistic regression equation
Splitting data to train and test split
Model Evaluation
Model accuracy
Confusion matrix
ROC curve
Chapter 6
CONCLUSION
Appendix
Data Source References
Chapter 1
INTRODUCTION
The number of people suffering from cardiovascular disease is on the rise. Numerous
factors increase the risk of developing this disease, such as age, high blood pressure, high
cholesterol, diabetes, genetics, obesity, and unhealthy lifestyles. The disease can often be
recognized through physical symptoms such as chest pain, shortness of breath, dizziness, and
fatigue. Even though cardiovascular diseases are a leading cause of death, they are also
classified among the most manageable and preventable illnesses. Identifying cardiovascular
disease is a difficult process, and early detection is crucial because its complications can
affect a person's life as a whole.
The signs of a heart attack are often much less noticeable in women than in
men. In women, a heart attack may feel like uncomfortable squeezing, pressure, fullness, or pain in
the center of the chest. It may also cause pain in one or both arms, the back, neck, jaw or
stomach, as well as shortness of breath, nausea and other symptoms. Men typically experience the
classic symptoms of a heart attack, such as chest pain, discomfort, and stress. They may also
experience pain in other areas, such as the arms, neck, back, and jaw, along with shortness of
breath, sweating, and discomfort that mimics heartburn.
The diagnosis and treatment of cardiovascular disease are very complex. Invasive
techniques are still employed, while diagnoses based on the patient's medical history and the
physician's physical examination reports tend to be less accurate and take a long time to prepare.
For this reason, a support system is implemented to predict cardiovascular disease through a
machine learning model. A machine learning approach may improve accuracy by leveraging the
complex interactions between risk factors.
Chapter 2
LOGISTIC REGRESSION
Logistic regression is a powerful supervised ML algorithm used for binary classification
problems (when the target is categorical). The best way to think about logistic regression is that
it is linear regression adapted for classification problems. Logistic regression uses the logistic
function defined below to model a binary output variable. The primary difference between linear
regression and logistic regression is that the output of logistic regression is bounded between 0
and 1. In addition, as opposed to linear regression, logistic regression does not require a linear
relationship between the input and output variables. This is because a nonlinear log
transformation is applied to the odds of the outcome.
$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
In the logistic function, z is the input variable. Feeding values from −20 to 20 into the
logistic function maps every input to an output between 0 and 1.
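The behaviour described above can be checked with a short sketch. It is an illustration rather than part of the project code; the function name `sigmoid` and the chosen sample inputs are assumptions made here.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Evaluate the function on a few inputs between -20 and 20.
inputs = [-20, -5, -1, 0, 1, 5, 20]
outputs = [sigmoid(z) for z in inputs]

for z, p in zip(inputs, outputs):
    print(f"z = {z:>3}  ->  sigmoid(z) = {p:.6f}")
```

Every output lies strictly between 0 and 1, and an input of 0 maps to exactly 0.5, which is why 0.5 is the natural default classification threshold.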
Chapter 3
METHODOLOGY
Data Preprocessing
Data Source
The dataset used for this analysis is sourced from Kaggle and contains information
collected as part of the Framingham Heart Study. It includes the following variables:
Chapter 4
MODEL EVALUATION
Train-Test Split
The dataset was divided into training and testing sets to assess the model's performance.
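A hold-out split of this kind can be sketched as follows. The report does not state the split ratio; the 751 test predictions out of 3,751 patients reported later suggest roughly a 20% test set, but that figure is an inference. The helper `train_test_split_rows`, its parameters and the sample records are all hypothetical.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=5):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (age, systolic blood pressure, ten-year CHD label).
records = [(40 + i, 110 + 2 * i, i % 5 == 0) for i in range(20)]
train_rows, test_rows = train_test_split_rows(records, test_fraction=0.2)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Fixing the seed keeps the split reproducible, so the reported metrics can be regenerated on the same test rows.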
Model Summary
The final logistic regression model was summarized, including coefficients, standard errors,
z-scores, p-values, and confidence intervals. The model's pseudo R-squared value was 0.1148,
which indicates a modest improvement in fit over an intercept-only model.
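For reference, the pseudo R-squared printed in a statsmodels Logit summary is McFadden's measure. Assuming that is the statistic quoted here, it compares the log-likelihood of the fitted model with that of an intercept-only (null) model. The symbols below are notation introduced for this illustration.

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}
```

A value near 0 means the predictors add little over the null model, so this statistic should not be read on the same scale as the R-squared of a linear regression.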
Threshold Adjustment
Threshold Tuning
To balance sensitivity and specificity, threshold tuning was performed. Different threshold values
were tested to find an optimal balance that minimized false negatives (Type II errors) while
maintaining acceptable specificity.
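The trade-off can be illustrated with a minimal sketch on made-up data. The labels, probabilities and the helper `sensitivity_specificity` are hypothetical; the strict greater-than comparison mirrors the rule that scikit-learn's `binarize` applies.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when probability > threshold, then score the result."""
    tp = fn = tn = fp = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.22, 0.35, 0.45, 0.18, 0.32, 0.42, 0.60]

for threshold in (0.1, 0.2, 0.3, 0.4, 0.5):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Lowering the threshold turns false negatives into true positives at the cost of more false positives, which is the balance the tuning step has to strike.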
Chapter 5
IMPLEMENTATION
Code:
#!/usr/bin/env python
# coding: utf-8
# In[1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sn
fig.tight_layout()
plt.show()
draw_histograms(heart_df,heart_df.columns,6,3)
# In[11]:
heart_df.TenYearCHD.value_counts()
# In[12]:
sn.countplot(x='TenYearCHD',data=heart_df)
# There are 3179 patients with no heart disease and 572 patients at risk of heart disease.
# In[13]:
sn.pairplot(data=heart_df)
# In[14]:
heart_df.describe()
Logistic Regression
# Logistic regression is a type of regression analysis in statistics used to predict the outcome
# of a categorical dependent variable from a set of predictor or independent variables. In logistic
# regression the dependent variable is always binary. Logistic regression is mainly used for
# prediction and for calculating the probability of success.
# In[15]:
from statsmodels.tools import add_constant
heart_df_constant = add_constant(heart_df)
heart_df_constant.head()
# In[16]:
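# Compatibility shim: older statsmodels releases call scipy.stats.chisqprob,
# which was removed from newer SciPy versions.
st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)
# Use every column except the last one (assumed to be the target, TenYearCHD).
cols = heart_df_constant.columns[:-1]
model = sm.Logit(heart_df.TenYearCHD, heart_df_constant[cols])
result = model.fit()
result.summary()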
# The results above show that some attributes have a p-value higher than the preferred alpha (5%),
# indicating a statistically weak relationship with the probability of heart
# disease. Backward elimination is used here to remove the attribute with the highest
# p-value one at a time, rerunning the regression until all attributes have p-values
# less than 0.05.
Feature Selection: Backward elimination (P-value approach)
# In[17]:
def back_feature_elem(data_frame, dep_var, col_list):
    """Takes in the dataframe, the dependent variable and a list of column names, runs the
    regression repeatedly, eliminating the feature with the highest p-value above alpha one
    at a time, and returns the fitted result once all p-values are below alpha."""
    while len(col_list) > 0:
        model = sm.Logit(dep_var, data_frame[col_list])
        result = model.fit(disp=0)
        largest_pvalue = round(result.pvalues, 3).nlargest(1)
        # .iloc[0] reads the value by position; the Series is indexed by column name.
        if largest_pvalue.iloc[0] < 0.05:
            return result
        # Drop the least significant column and refit.
        col_list = col_list.drop(largest_pvalue.index)
result=back_feature_elem(heart_df_constant,heart_df.TenYearCHD,cols)
# In[18]:
result.summary()
# #### Interpreting the results: Odds Ratio, Confidence Intervals and Pvalues
# In[19]:
# Exponentiate the coefficients and their confidence bounds to get odds ratios.
params = np.exp(result.params)
conf = np.exp(result.conf_int())
conf['OR'] = params
pvalue = round(result.pvalues, 3)
conf['pvalue'] = pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio', 'pvalue']
print(conf)
# - This fitted model shows that, holding all other features constant, the odds of being
# diagnosed with heart disease for males (sex_male = 1) relative to females (sex_male = 0) are
# exp(0.5815) = 1.788687. In terms of percent change, the odds for males are
# about 78.9% higher than the odds for females.
#
# - The coefficient for age says that, holding all other features constant, the odds of being
# diagnosed with CHD increase by about 7% for each additional year of age, since exp(0.0655) =
# 1.067644.
#
# - Similarly, with every extra cigarette smoked per day there is a 2% increase in the odds of CHD.
#
# - For total cholesterol level and glucose level there is no significant change.
#
# - There is a 1.7% increase in odds for every unit increase in systolic blood pressure.
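The arithmetic behind these interpretations can be reproduced with a small sketch. It uses only the two coefficients quoted in the text (0.5815 for sex_male and 0.0655 for age); the helper names are assumptions introduced here.

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in a predictor: exp(coefficient)."""
    return math.exp(coefficient)

def percent_change_in_odds(coefficient: float) -> float:
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio(coefficient) - 1.0) * 100.0

# Coefficients as quoted (rounded) in the interpretation of the fitted model.
coefficients = {"sex_male": 0.5815, "age": 0.0655}

for name, beta in coefficients.items():
    print(f"{name}: odds ratio = {odds_ratio(beta):.3f}, "
          f"change in odds = {percent_change_in_odds(beta):+.1f}%")
```

Because the coefficients are quoted to four decimal places, the recomputed ratios can differ from the reported values in the later decimals.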
Model Evaluation
Confusion matrix
# In[26]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'],
                           index=['Actual:0', 'Actual:1'])
plt.figure(figsize=(8, 5))
sn.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
# The confusion matrix shows 658+4 = 662 correct predictions and 88+1= 89 incorrect ones.
#
# - True Positives: 4
#
# - True Negatives: 658
#
# - False Positives: 1 (Type I error)
#
# - False Negatives: 88 ( Type II error)
#
#
# In[27]:
TN=cm[0,0]
TP=cm[1,1]
FN=cm[1,0]
FP=cm[0,1]
sensitivity=TP/float(TP+FN)
specificity=TN/float(TN+FP)
# #### Model Evaluation - Statistics
# In[28]:
print('The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) =', (TP+TN)/float(TP+TN+FP+FN))
# ##### From the above statistics it is clear that the model is more specific than sensitive: the
# negative cases are predicted more accurately than the positive ones.
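As a worked check, the three statistics can be recomputed directly from the confusion-matrix counts reported for the test set (4 true positives, 658 true negatives, 1 false positive, 88 false negatives). This is a standalone sketch, not one of the original notebook cells.

```python
# Counts as reported for the test set in the confusion matrix.
TP, TN, FP, FN = 4, 658, 1, 88

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)   # true negative rate

print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
```

With these counts the model identifies only 4 of the 92 actual positives, so the high accuracy mostly reflects the large share of negative cases in the data.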
# ###### Predicted probabilities of 0 (No Coronary Heart Disease) and 1 (Coronary Heart
# Disease: Yes) for the test data with a default classification threshold of 0.5
# In[29]:
y_pred_prob = logreg.predict_proba(x_test)
y_pred_prob_df = pd.DataFrame(data=y_pred_prob,
                              columns=['Prob of no heart disease (0)', 'Prob of Heart Disease (1)'])
y_pred_prob_df.head()
# In[37]:
from sklearn.preprocessing import binarize
# Class probabilities do not depend on the threshold, so compute them once.
y_pred_prob_yes = logreg.predict_proba(x_test)
for i in range(1, 5):
    # Label a patient positive when the predicted probability of CHD exceeds i/10.
    y_pred2 = binarize(y_pred_prob_yes, threshold=i/10)[:, 1]
    cm2 = confusion_matrix(y_test, y_pred2)
    print('With', i/10, 'threshold the Confusion Matrix is ', '\n', cm2, '\n',
          'with', cm2[0, 0] + cm2[1, 1], 'correct predictions and', cm2[1, 0],
          'Type II errors (False Negatives)', '\n\n',
          'Sensitivity: ', cm2[1, 1] / float(cm2[1, 1] + cm2[1, 0]),
          'Specificity: ', cm2[0, 0] / float(cm2[0, 0] + cm2[0, 1]), '\n\n\n')
ROC curve
# In[34]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_yes[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Heart disease classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)
# A common way to visualize the trade-offs of different thresholds is by using an ROC curve, a
# plot of the true positive rate (# true positives / total # positives) versus the false positive
# rate (# false positives / total # negatives) for all possible choices of thresholds. A model with
# good classification accuracy should have significantly more true positives than false positives
# at all thresholds.
#
# The optimum position for the ROC curve is towards the top left corner, where the specificity and
# sensitivity are at optimum levels.
#
# In[35]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_prob_yes[:, 1])
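To make the AUC figure easier to interpret, the sketch below computes it from its ranking definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. For binary labels this matches the area under the ROC curve. The function `roc_auc` and the sample data are hypothetical and are not part of the project code.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_score = [0.10, 0.20, 0.35, 0.60, 0.30, 0.55, 0.80]
print(f"AUC = {roc_auc(y_true, y_score):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value in between describes how often the model ranks a patient with CHD above one without.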
Model accuracy
# In[25]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
# Accuracy of the model is 0.88
Chapter 6
CONCLUSION
• All attributes selected after the elimination process show p-values lower than 5%,
suggesting a significant role in heart disease prediction.
• Men seem to be more susceptible to heart disease than women. Increases in age, number of
cigarettes smoked per day and systolic blood pressure are also associated with higher odds of
having heart disease.
• Total cholesterol shows no significant change in the odds of CHD. This could be due to
the presence of 'good' cholesterol (HDL) in the total cholesterol reading. Glucose too
causes a negligible change in odds (0.2%).
• The model predicted with 0.88 accuracy. The model is more specific than sensitive.
• The area under the ROC curve is 0.735 (73.5%), which is somewhat satisfactory.
• The overall model could be improved with more data.
Appendix
▪ http://www.who.int/mediacentre/factsheets/fs317/en/
▪ https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset/data