Visvesvaraya Technological University: HMS Institute of Technology

This document is a Machine Learning Internship Report by Shalini M, focusing on predicting heart disease using logistic regression as part of her Bachelor of Engineering in Computer Science and Engineering. The report outlines the significance of early detection of cardiovascular diseases, the methodology employed including data preprocessing and model evaluation, and the implementation of the logistic regression model. It highlights the project's aim to improve prediction accuracy through machine learning techniques, ultimately addressing the high mortality rates associated with heart diseases.

Uploaded by

Shalini M

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

JnanaSangama, Belgaum-590014

A Machine Learning Internship Report On

“Heart disease prediction using logistic regression”

Submitted in Partial fulfillment of the Requirements for the VII Semester of the
Degree of
Bachelor of Engineering
In
Computer Science & Engineering
By
SHALINI M (1HM20CS034)

Under the Guidance of

Mr. Reetesh

HMS INSTITUTE OF TECHNOLOGY

NH-4, Kesaramadu Post Kyathasandra Karnataka 572104
2022-2023
HMS INSTITUTE OF TECHNOLOGY
NH-4, Kesaramadu Post Kyathasandra Karnataka 572104

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
Certified that the Machine Learning Project work entitled “HEART DISEASE PREDICTION USING
LOGISTIC REGRESSION” has been carried out by SHALINI M (1HM20CS034), a bona fide student
of City Engineering College, in partial fulfilment for the award of Bachelor of Engineering in Computer
Science and Engineering of the Visvesvaraya Technological University, Belgaum during the year
2022-2023. It is certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the Report deposited in the departmental library. The Machine Learning Mini Project
Report has been approved as it satisfies the academic requirements in respect of project work prescribed
for the said Degree.

Signature of Guide: ____________
Mr.
Asst. Professor, Dept. of CS&E
H.M.S.I.T, TUMAKUR

Signature of HOD: ____________
Dr. A. Vijayaraghavan BE, M Tech, Ph. D
HOD, Dept. of CS&E
H.M.S.I.T, TUMAKUR

Signature of Principal: ____________
Dr. Kavitha.A.S Ph.D
Principal
H.M.S.I.T, TUMAKUR

Name of Examiners            Signature with date

1. ______________________    ____________________
2. ______________________    ____________________
Abstract

The heart is the primary organ of the circulatory system; it keeps oxygen-rich blood
circulating throughout the body. Over the past two decades, heart disease has remained a leading
cause of death worldwide. Statistics on the share of deaths caused by heart attacks illustrate
the lethality of cardiovascular disease. It is therefore crucial to predict the condition as
early as possible. Cardiologists face limitations and cannot always predict heart disease risk
with a high degree of accuracy, so a reliable, accurate and feasible system is required to
predict such diseases in time for proper treatment. Machine learning algorithms and techniques
have been applied to automate the analysis of large and complex medical datasets, and
researchers and professionals in the health care industry increasingly use them to diagnose
conditions related to the heart. A quick and efficient detection technique is needed to reduce
the high death rate caused by heart disease, and machine learning algorithms and data mining
techniques play a crucial role here.
Using machine learning algorithms, this work aims to predict the occurrence of heart disease
in a patient.
Keywords: Machine Learning, Supervised Learning, Unsupervised Learning, Logistic
Regression, Cardiovascular Diseases
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would
be incomplete without mention of the people who made it possible and whose constant
encouragement and guidance have been a source of inspiration throughout the project.

I am extremely grateful to our beloved Principal, Dr. Kavitha A.S,
for providing a congenial atmosphere and the necessary facilities for achieving the
cherished goal of this project.

I feel delighted to have this page to express my sincere thanks and deep
appreciation to Dr. A. Vijayaraghavan, Head of the Department, Computer Science and
Engineering, for his valuable guidance, keen interest and constant encouragement throughout the
entire period of this project.

I would like to thank my guide, Mr. Ranganatha H R, Asst. Professor, Computer
Science and Engineering, for his valuable guidance and constant support throughout the project.

I would like to thank my parents for their valuable guidance and moral support
throughout the project.
Finally, I thank all the teaching and non-teaching staff for allowing me to
successfully carry out the project. I also thank my friends, who provided a lot of support in this
project.

SHALINI M (1HM20CS034)
TABLE OF CONTENTS

ACKNOWLEDGEMENT
Chapter 1
INTRODUCTION
Chapter 2
LOGISTIC REGRESSION
Chapter 3
METHODOLOGY
Data Preprocessing
Data Source
Exploratory Data Analysis
Data Visualization
The Correlation between Variables Analysis
Data Validation
Chapter 4
MODEL EVALUATION
Train-Test Split
Model Accuracy
Model Summary
Sensitivity and Specificity
Threshold Adjustment
Threshold Tuning
AUC (Area Under the Curve)
ROC Curve and AUC
ROC Curve
Chapter 5
IMPLEMENTATION
Code
Logistic Regression
Feature Selection: Backward elimination (P-value approach)
Logistic regression equation
Splitting data to train and test split
Model Evaluation
Confusion matrix
ROC curve
Model accuracy
Chapter 6
CONCLUSION
Appendix
Data Source References
Chapter 1
INTRODUCTION
The number of people suffering from cardiovascular disease is on the rise. Numerous
factors raise the risk of developing this disease, such as age, high blood pressure, high
cholesterol, diabetes, genes, obesity, and unhealthy lifestyles. A variety of symptoms can be
identified by observing physical signs like chest pain, shortness of breath, dizziness, and
tiring easily. Even though these diseases are the leading cause of death, they are also
classified among the most manageable and preventable illnesses. Identifying cardiovascular
disease is a difficult process, and early detection is crucial because its complications can
affect a person's life as a whole.

The signs of a heart attack in women are much less noticeable than in men. In women, a
heart attack may feel like uncomfortable squeezing, pressure, fullness, or pain in
the center of the chest. It may also cause pain in one or both arms, the back, neck, jaw or
stomach, as well as shortness of breath, nausea and other symptoms. Men experience typical
symptoms of heart attack, such as chest pain, discomfort, and stress. They may also experience
pain in other areas, such as the arms, neck, back, and jaw, along with shortness of breath,
sweating, and discomfort that mimics heartburn.

Cardiovascular disease diagnosis and treatment are very complex. While invasive
techniques are still employed, diagnoses based on the patient's medical history and the
physician's physical examination reports tend to be less accurate and take a long time to prepare.
For this reason, a support system is implemented to predict cardiovascular disease through a
machine learning model. A machine learning approach may improve accuracy by leveraging the
complex interactions between risk factors.
Chapter 2
LOGISTIC REGRESSION

Logistic regression is a process of modeling the probability of a discrete outcome given
an input variable. The most common logistic regression models a binary outcome: something
that can take two values such as true/false or yes/no. Multinomial logistic regression can
model scenarios where there are more than two possible discrete outcomes. Logistic regression is
a useful analysis method for classification problems, where you are trying to determine which
category a new sample fits best. Because some aspects of cyber security, such as attack
detection, are classification problems, logistic regression is a useful analytic technique there
as well. Despite its name, logistic regression is a classification model rather than a regression
model. It is a simple and efficient method for binary and linear classification problems, is
easy to implement, and achieves very good performance with linearly separable classes.
It is an extensively employed algorithm for classification in industry. The logistic regression
model, like Adaline and the perceptron, is a method for binary classification that can be
generalized to multiclass classification. Scikit-learn has a highly optimized implementation of
logistic regression, which supports multiclass classification tasks.

Logistic regression is a powerful supervised ML algorithm used for binary classification
problems (when the target is categorical). The best way to think about logistic regression is as
linear regression adapted for classification problems. Logistic regression uses the logistic
function defined below to model a binary output variable. The primary difference between linear
regression and logistic regression is that logistic regression's output is bounded between 0 and 1.
In addition, as opposed to linear regression, logistic regression does not require a linear
relationship between the inputs and the output variable; instead, the log of the odds is modeled
as a linear function of the inputs.

Formula of the logistic regression sigmoid function:

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
In the logistic function equation, z is the input variable. Feeding values from −20 to 20 into the
logistic function maps every input to a value between 0 and 1.

Figure 1: Sigmoid Function
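
As a minimal sketch of the behaviour described above, the following Python snippet evaluates the logistic function on inputs from −20 to 20 and checks that every output lies between 0 and 1. The function name `sigmoid` and the choice of nine evenly spaced inputs are illustrative assumptions and are not part of the report's code.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


# Nine evenly spaced inputs: -20, -15, ..., 15, 20 (illustrative choice)
z_values = np.linspace(-20, 20, 9)
probabilities = sigmoid(z_values)

for z, p in zip(z_values, probabilities):
    print(f"z = {z:6.1f}  ->  sigma(z) = {p:.6f}")
```

Large negative inputs map close to 0, large positive inputs map close to 1, and z = 0 maps to exactly 0.5, which is why 0.5 is the usual default classification threshold.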

Chapter 3
METHODOLOGY
Several factors that affect the human cardiovascular system are examined in this study. The
process begins with data retrieval, followed by analysis of the correlation between variables,
splitting of the data, and prediction with the logistic regression algorithm, and ends with data validation.

Data Preprocessing
Data Source
The dataset used for this analysis is sourced from Kaggle and contains information
collected as part of the Framingham Heart Study. It includes the following variables:

• Sex: Gender of the patient (male or female)

• Age: Age of the patient (continuous)
• Current Smoker: Whether the patient is a current smoker (yes or no)
• Cigarettes per Day: The number of cigarettes smoked per day (continuous)
• Blood Pressure Medication (BPMeds): Whether the patient is on blood pressure
medication (yes or no)
• Prevalent Stroke: Whether the patient had a previous stroke (yes or no)
• Prevalent Hypertension: Whether the patient is hypertensive (yes or no)
• Diabetes: Whether the patient has diabetes (yes or no)
• Total Cholesterol (totChol): Total cholesterol level (continuous)
• Systolic Blood Pressure (sysBP): Systolic blood pressure (continuous)
• Diastolic Blood Pressure (diaBP): Diastolic blood pressure (continuous)
• Body Mass Index (BMI): Body Mass Index (continuous)
• Heart Rate: Heart rate (continuous)
• Glucose: Glucose level (continuous)
• Ten-Year CHD: Ten-year risk of coronary heart disease (binary: "1" for yes, "0" for no)
Data Preprocessing Steps
The dataset was initially loaded, and missing values were handled by removing rows with
missing data.
The target variable, "Ten-Year CHD," was identified as the variable to predict.

Exploratory Data Analysis

Data Visualization
Exploratory data analysis was performed to gain insights into the dataset's characteristics.
Key visualizations included histograms and pair plots, which allowed us to understand the
distribution of variables and relationships between them.

The Correlation between Variables Analysis

To facilitate data analysis, all variables in the imported dataset are visualized as histograms so that
the data can be read at a glance. The correlation between variables is then examined to support the
choice of logistic regression as the model. Relationships between the variables in the dataset are
plotted in the form of a matrix, which also makes it possible to check for multicollinearity between
variables.

Figure 2: Variables in Data
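
The correlation check described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (generated so that two blood pressure columns are strongly related), the column names merely mirror those of the dataset, and the cut-off of 0.7 for flagging a pair is a hypothetical choice, not a value taken from the report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): diaBP is built from sysBP on purpose
rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 8, n)
sys_bp = 100 + 0.6 * age + rng.normal(0, 10, n)
dia_bp = 0.6 * sys_bp + rng.normal(0, 4, n)
df = pd.DataFrame({"age": age, "sysBP": sys_bp, "diaBP": dia_bp})

# Pairwise Pearson correlation matrix
corr = df.corr()
print(corr.round(2))

# Flag variable pairs whose absolute correlation exceeds the assumed cut-off
threshold = 0.7
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
for a, b, r in high_pairs:
    print(f"Possible multicollinearity: {a} vs {b} (r = {r:.2f})")
```

Pairs flagged this way are candidates for closer inspection, for example by dropping one of the two variables or by checking whether the coefficient estimates become unstable.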

Data Validation
The results are validated using a confusion matrix and 10-fold cross-validation. The confusion
matrix shows the accuracy of the logistic regression model, while K-fold cross-validation
provides an estimate of the errors that may occur when using the model.

Figure 3: Confusion Matrix
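
The 10-fold cross-validation mentioned under Data Validation could look like the sketch below. It is not the report's own code: the data come from scikit-learn's `make_classification` as a stand-in for the Framingham features, and the class weights, `max_iter=1000` and the stratified, shuffled folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary data standing in for the real dataset (assumption)
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    weights=[0.85, 0.15],
    random_state=0,
)

model = LogisticRegression(max_iter=1000)

# 10 folds; stratification keeps the class ratio similar in every fold
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread of the per-fold scores gives a rough idea of how much the accuracy estimate depends on which rows end up in the test fold.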

Chapter 4
MODEL EVALUATION
Model Accuracy
The model's accuracy was evaluated using the testing dataset, and it achieved an accuracy of
approximately 0.88.

Train-Test Split
The dataset was divided into training and testing sets to assess the model's performance.

Model Summary
The final logistic regression model was summarized, including coefficients, standard errors,
z-scores, p-values, and confidence intervals. The model's pseudo R-squared value was 0.1148,
which indicates a modest improvement in fit over an intercept-only model.

Sensitivity and Specificity

Sensitivity (true positive rate) and specificity (true negative rate) were calculated to understand
how well the model detects true positive cases and true negative cases, respectively.
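
As a worked example, the snippet below computes these rates from the confusion matrix counts reported in Chapter 5 (4 true positives, 658 true negatives, 1 false positive, 88 false negatives). The variable names are illustrative.

```python
# Confusion matrix counts as reported in Chapter 5 of this report
tp, tn, fp, fn = 4, 658, 1, 88

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions

print(f"Sensitivity: {sensitivity:.3f}")  # 0.043
print(f"Specificity: {specificity:.3f}")  # 0.998
print(f"Accuracy:    {accuracy:.3f}")     # 0.881
```

The numbers show why accuracy alone is misleading on this imbalanced dataset: the model is right about 88% of the time while detecting only about 4% of the actual positive cases at the default threshold.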

Threshold Adjustment
Threshold Tuning
To balance sensitivity and specificity, threshold tuning was performed. Different threshold values
were tested to find an optimal balance that minimized false negatives (Type II errors) while
maintaining acceptable specificity.
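
The effect of threshold tuning can be sketched on synthetic data as follows. The dataset, the helper name `sens_spec` and the list of thresholds are assumptions for illustration; the report's own implementation appears in Chapter 5.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption)
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob_pos = model.predict_proba(X_test)[:, 1]  # probability of class 1


def sens_spec(y_true, prob, threshold):
    """Return (sensitivity, specificity) when predicting 1 if prob >= threshold."""
    y_hat = (prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
results = [sens_spec(y_test, prob_pos, t) for t in thresholds]
for t, (sens, spec) in zip(thresholds, results):
    print(f"threshold={t:.1f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

Lowering the threshold can only move cases from predicted negative to predicted positive, so sensitivity never decreases and specificity never increases as the threshold goes down; the tuning step is a matter of choosing where on that trade-off to stop.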

AUC (Area Under the Curve)

The Area Under the ROC Curve (AUC) was calculated to quantify the model's classification
accuracy. The AUC was approximately 0.773, indicating reasonably good performance.
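
One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this on synthetic scores by comparing a direct pairwise count with scikit-learn's `roc_auc_score`; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and scores (assumption): positives score higher on average
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = rng.normal(loc=y_true * 1.0, scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
diff = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_sklearn = roc_auc_score(y_true, scores)

print(f"Pairwise AUC: {auc_pairwise:.4f}")
print(f"sklearn AUC:  {auc_sklearn:.4f}")
```

Under this reading, an AUC of 0.5 means the scores rank positives above negatives no more often than chance.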

ROC Curve and AUC

ROC Curve
The Receiver Operating Characteristic (ROC) curve was plotted to visualize the trade-off
between true positive rate and false positive rate across various thresholds.
Figure 4: ROC curve for Heart disease classifier
Chapter 5
IMPLEMENTATION

Code:
#!/usr/bin/env python
# coding: utf-8
# In[1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.metrics import confusion_matrix
# The IPython magic below only works when the script is run inside Jupyter/IPython.
get_ipython().run_line_magic('matplotlib', 'inline')
# In[4]:
heart_df=pd.read_csv(r"C:\Users\shali\Downloads\archive (3)\framingham.csv")
heart_df.drop(['education'],axis=1,inplace=True)
heart_df.head()
# Variables:
# Each attribute is a potential risk factor. There are demographic, behavioural and medical
# risk factors.
# Demographic:
# - sex: male or female (Nominal)
# - age: age of the patient (Continuous - although the recorded ages have been truncated to
#   whole numbers, the concept of age is continuous)
# Behavioural:
# - currentSmoker: whether or not the patient is a current smoker (Nominal)
# - cigsPerDay: the number of cigarettes that the person smoked on average in one day (can be
#   considered continuous as one can have any number of cigarettes, even half a cigarette)
# Medical (history):
# - BPMeds: whether or not the patient was on blood pressure medication (Nominal)
# - prevalentStroke: whether or not the patient had previously had a stroke (Nominal)
# - prevalentHyp: whether or not the patient was hypertensive (Nominal)
# - diabetes: whether or not the patient had diabetes (Nominal)
# Medical (current):
# - totChol: total cholesterol level (Continuous)
# - sysBP: systolic blood pressure (Continuous)
# - diaBP: diastolic blood pressure (Continuous)
# - BMI: Body Mass Index (Continuous)
# - heartRate: heart rate (Continuous - in medical research, variables such as heart rate, though
#   in fact discrete, are considered continuous because of the large number of possible values)
# - glucose: glucose level (Continuous)
# Predict variable (desired target):
# - 10 year risk of coronary heart disease CHD (binary: "1" means "Yes", "0" means "No")
# In[9]:
heart_df.rename(columns={'male':'Sex_male'},inplace=True)
# In[6]:
heart_df.isnull().sum()
# In[7]:
count=0
for i in heart_df.isnull().sum(axis=1):
    if i>0:
        count=count+1
print('Total number of rows with missing values is ', count)
print('since it is only',round((count/len(heart_df.index))*100), 'percent of the entire dataset the rows with missing values are excluded.')
# In[8]:
heart_df.dropna(axis=0,inplace=True)
# Exploratory Analysis
# In[10]:
def draw_histograms(dataframe, features, rows, cols):
    # Draw one histogram per feature on a rows x cols grid
    fig=plt.figure(figsize=(20,20))
    for i, feature in enumerate(features):
        ax=fig.add_subplot(rows,cols,i+1)
        dataframe[feature].hist(bins=20,ax=ax,facecolor='midnightblue')
        ax.set_title(feature+" Distribution",color='DarkRed')

    fig.tight_layout()
    plt.show()
draw_histograms(heart_df,heart_df.columns,6,3)
# In[11]:
heart_df.TenYearCHD.value_counts()
# In[12]:
sn.countplot(x='TenYearCHD',data=heart_df)
# There are 3179 patients with no heart disease and 572 patients with risk of heart disease.
# In[13]:
sn.pairplot(data=heart_df)
# In[14]:
heart_df.describe()

Logistic Regression
# Logistic regression is a type of regression analysis in statistics used to predict the outcome
# of a categorical dependent variable from a set of predictor or independent variables. In logistic
# regression the dependent variable is always binary. Logistic regression is mainly used for
# prediction and for calculating the probability of success.
# In[15]:
from statsmodels.tools import add_constant as add_constant
heart_df_constant = add_constant(heart_df)
heart_df_constant.head()
# In[16]:
st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)
cols=heart_df_constant.columns[:-1]
model=sm.Logit(heart_df.TenYearCHD,heart_df_constant[cols])
result=model.fit()
result.summary()
# The results above show some of the attributes with a P-value higher than the preferred alpha (5%),
# and thereby a statistically weak relationship with the probability of heart disease. The backward
# elimination approach is used here to remove the attribute with the highest P-value one at a time,
# re-running the regression until all attributes have P-values less than 0.05.
Feature Selection: Backward elimination (P-value approach)
# In[17]:
def back_feature_elem (data_frame,dep_var,col_list):
    """ Takes in the dataframe, the dependent variable and a list of column names, runs the
    regression repeatedly, eliminating the feature with the highest P-value above alpha one at a
    time, and returns the regression result with all p-values below alpha"""

    while len(col_list)>0 :
        model=sm.Logit(dep_var,data_frame[col_list])
        result=model.fit(disp=0)
        largest_pvalue=round(result.pvalues,3).nlargest(1)
        # Positional access: the Series is indexed by column name, not by integer
        if largest_pvalue.iloc[0]<(0.05):
            return result
        else:
            col_list=col_list.drop(largest_pvalue.index)


result=back_feature_elem(heart_df_constant,heart_df.TenYearCHD,cols)
# In[18]:
result.summary()

Logistic regression equation

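# $$P = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$$
#
# When all features are plugged in:
#
# $$\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{Sex\_male} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{cigsPerDay} + \beta_4 \cdot \text{totChol} + \beta_5 \cdot \text{sysBP} + \beta_6 \cdot \text{glucose}$$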

# #### Interpreting the results: Odds Ratio, Confidence Intervals and Pvalues
# In[19]:
params = np.exp(result.params)
conf = np.exp(result.conf_int())
conf['OR'] = params
pvalue=round(result.pvalues,3)
conf['pvalue']=pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio','pvalue']
print ((conf))
# - This fitted model shows that, holding all other features constant, the odds of getting
#   diagnosed with heart disease for males (Sex_male = 1) over that of females (Sex_male = 0) is
#   exp(0.5815) ≈ 1.79. In terms of percent change, the odds for males are about 79% higher
#   than the odds for females.
#
# - The coefficient for age says that, holding all others constant, we will see a 7% increase in
#   the odds of getting diagnosed with CHD for a one-year increase in age, since
#   exp(0.0655) ≈ 1.07.
#
# - Similarly, with every extra cigarette one smokes there is a 2% increase in the odds of CHD.
#
# - For total cholesterol level and glucose level there is no significant change.
#
# - There is a 1.7% increase in odds for every unit increase in systolic blood pressure.
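
The conversion from a logistic regression coefficient to an odds ratio used in the interpretation above can be reproduced with a few lines. The two coefficients (0.5815 for Sex_male and 0.0655 for age) are the values quoted in the report; the helper name `odds_ratio` is an illustrative assumption.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a feature with coefficient beta."""
    return math.exp(beta)


# Coefficients as quoted in the interpretation above
coefficients = {"Sex_male": 0.5815, "age": 0.0655}

for name, beta in coefficients.items():
    ratio = odds_ratio(beta)
    percent_change = (ratio - 1) * 100
    print(f"{name}: odds ratio = {ratio:.3f}, change in odds = {percent_change:+.1f}%")
```

An odds ratio above 1 means the odds of CHD rise with the feature, and the percent change is simply the odds ratio minus one, expressed as a percentage.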

Splitting data to train and test split

# In[23]:
import sklearn
new_features=heart_df[['age','Sex_male','cigsPerDay','totChol','sysBP','glucose','TenYearCHD']]
x=new_features.iloc[:,:-1]
y=new_features.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=5)
# In[24]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(x_train,y_train)
y_pred=logreg.predict(x_test)

Model Evaluation
Confusion matrix

# In[26]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
plt.figure(figsize = (8,5))
sn.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")
# The confusion matrix shows 658+4 = 662 correct predictions and 88+1= 89 incorrect ones.
#
# - True Positives: 4
#
# - True Negatives: 658
#
# - False Positives: 1 (Type I error)
#
# - False Negatives: 88 ( Type II error)
#
#

# In[27]:
TN=cm[0,0]
TP=cm[1,1]
FN=cm[1,0]
FP=cm[0,1]
sensitivity=TP/float(TP+FN)
specificity=TN/float(TN+FP)
# #### Model Evaluation - Statistics
# In[28]:
print('The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) = ',(TP+TN)/float(TP+TN+FP+FN),'\n',

      'The misclassification = 1-Accuracy = ',1-((TP+TN)/float(TP+TN+FP+FN)),'\n',

      'Sensitivity or True Positive Rate = TP/(TP+FN) = ',TP/float(TP+FN),'\n',

      'Specificity or True Negative Rate = TN/(TN+FP) = ',TN/float(TN+FP),'\n',

      'Positive Predictive value = TP/(TP+FP) = ',TP/float(TP+FP),'\n',

      'Negative predictive Value = TN/(TN+FN) = ',TN/float(TN+FN),'\n',

      'Positive Likelihood Ratio = Sensitivity/(1-Specificity) = ',sensitivity/(1-specificity),'\n',

      'Negative likelihood Ratio = (1-Sensitivity)/Specificity = ',(1-sensitivity)/specificity)

# ##### From the above statistics it is clear that the model is more specific than sensitive. The
# negative values are predicted more accurately than the positives.

# ###### Predicted probabilities of 0 (No Coronary Heart Disease) and 1 (Coronary Heart
# Disease: Yes) for the test data with a default classification threshold of 0.5

# In[29]:
y_pred_prob=logreg.predict_proba(x_test)[:,:]
y_pred_prob_df=pd.DataFrame(data=y_pred_prob, columns=['Prob of no heart disease (0)','Prob of Heart Disease (1)'])
y_pred_prob_df.head()

# ##### Lower the threshold

# ###### Since the model is predicting heart disease, too many Type II errors are not advisable. A
# False Negative (ignoring the probability of disease when there actually is one) is more dangerous
# than a False Positive in this case. Hence, in order to increase the sensitivity, the threshold
# can be lowered.

# In[37]:
from sklearn.preprocessing import binarize
for i in range(1,5):
    cm2=0
    y_pred_prob_yes=logreg.predict_proba(x_test)
    # Predict class 1 whenever its probability exceeds the threshold i/10
    y_pred2 = binarize(y_pred_prob_yes, threshold=i/10)[:,1]

    cm2=confusion_matrix(y_test,y_pred2)
    print ('With',i/10,'threshold the Confusion Matrix is ','\n',cm2,'\n',
           'with',cm2[0,0]+cm2[1,1],'correct predictions and',cm2[1,0],'Type II errors( False Negatives)','\n\n',
           'Sensitivity: ',cm2[1,1]/(float(cm2[1,1]+cm2[1,0])),'Specificity: ',cm2[0,0]/(float(cm2[0,0]+cm2[0,1])),'\n\n\n')

ROC curve
# In[34]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_yes[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Heart disease classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)

# A common way to visualize the trade-offs of different thresholds is by using an ROC curve, a
# plot of the true positive rate (# true positives / total # positives) versus the false positive
# rate (# false positives / total # negatives) for all possible choices of thresholds. A model with
# good classification accuracy should have significantly more true positives than false positives
# at all thresholds.
#
# The optimum position for the ROC curve is towards the top left corner, where the specificity and
# sensitivity are at optimum levels.
#

# Area Under The Curve (AUC)

# The area under the ROC curve quantifies model classification accuracy; the higher the area, the
# greater the disparity between true and false positives, and the stronger the model in classifying
# members of the dataset. An area of 0.5 corresponds to a model that performs no better
# than random classification, and a good classifier stays as far away from that as possible. An area
# of 1 is ideal. The closer the AUC is to 1, the better.

# In[35]:
sklearn.metrics.roc_auc_score(y_test,y_pred_prob_yes[:,1])

Model accuracy

# In[25]:
sklearn.metrics.accuracy_score(y_test,y_pred)
# Accuracy of the model is 0.88
Chapter 6
CONCLUSION

• All attributes selected after the elimination process show p-values lower than 5%,
thereby suggesting a significant role in heart disease prediction.
• Men seem to be more susceptible to heart disease than women. Increases in age, number of
cigarettes smoked per day and systolic blood pressure also show increasing odds of
having heart disease.
• Total cholesterol shows no significant change in the odds of CHD. This could be due to
the presence of "good" cholesterol (HDL) in the total cholesterol reading. Glucose too
causes a negligible change in odds (0.2%).
• The model predicted with 0.88 accuracy. The model is more specific than sensitive.
• The area under the ROC curve is 73.5%, which is somewhat satisfactory.
• The overall model could be improved with more data.

Appendix

▪ http://www.who.int/mediacentre/factsheets/fs317/en/

Data Source References

▪ https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset/data

