
205 Vaishnavi Nilawar

Logistic Regression for Heart Disease Classification

1. Importing required Libraries

In [1]:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.metrics import confusion_matrix
import matplotlib.mlab as mlab
%matplotlib inline

2. Importing the Dataset

In [2]:

df = pd.read_csv('heart_disease.csv')
df.drop(['education'], axis=1, inplace=True)
df.rename(columns={'male': 'sex_male'}, inplace=True)
df.head()

Out[2]:

sex_male age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes …

0 1 39 0 0.0 0.0 0 0

1 0 46 0 0.0 0.0 0 0

2 1 48 1 20.0 0.0 0 0

3 0 61 1 30.0 0.0 0 1

4 0 46 1 23.0 0.0 0 0

3. Checking for Null values


In [3]:

df.isnull().sum()

Out[3]:

sex_male 0

age 0

currentSmoker 0

cigsPerDay 29

BPMeds 53

prevalentStroke 0

prevalentHyp 0

diabetes 0

totChol 50

sysBP 0

diaBP 0

BMI 19

heartRate 1

glucose 388

TenYearCHD 0

dtype: int64

In [4]:

count = 0
for i in df.isnull().sum(axis=1):
    if i > 0:
        count = count + 1
print('Total number of rows with missing values is', count)

Total number of rows with missing values is 489
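A quick sanity check (a sketch, not part of the original notebook) shows these 489 rows are a modest share of the dataset, so dropping them outright is a defensible choice:

In [ ]:

# Sketch: fraction of rows that dropna() will discard.
# Assumes df has not yet been modified, so len(df) is the full row count.
print('Fraction of rows with missing values: {:.1%}'.format(count / len(df)))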

4. Removing Null values


In [5]:

df.dropna(axis=0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 3749 entries, 0 to 4237

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 sex_male 3749 non-null int64

1 age 3749 non-null int64

2 currentSmoker 3749 non-null int64

3 cigsPerDay 3749 non-null float64

4 BPMeds 3749 non-null float64

5 prevalentStroke 3749 non-null int64

6 prevalentHyp 3749 non-null int64

7 diabetes 3749 non-null int64

8 totChol 3749 non-null float64

9 sysBP 3749 non-null float64

10 diaBP 3749 non-null float64

11 BMI 3749 non-null float64

12 heartRate 3749 non-null float64

13 glucose 3749 non-null float64

14 TenYearCHD 3749 non-null int64

dtypes: float64(8), int64(7)

memory usage: 468.6 KB

5. Data Visualization
In [6]:

def draw_histograms(dataframe, features, rows, cols):
    fig = plt.figure(figsize=(20, 20))
    for i, feature in enumerate(features):
        ax = fig.add_subplot(rows, cols, i + 1)
        dataframe[feature].hist(bins=20, ax=ax, facecolor='red')
        ax.set_title(feature + " Distribution", color='blue')
    fig.tight_layout()
    plt.show()

draw_histograms(df, df.columns, 6, 3)

In [7]:

sn.countplot(x='TenYearCHD', data=df)

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x224600f92e0>
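The countplot shows that TenYearCHD is heavily imbalanced, with most patients in the negative class. A small sketch (not in the original notebook) quantifies this, which is useful later when judging the model's accuracy:

In [ ]:

# Sketch: class proportions behind the countplot above.
df.TenYearCHD.value_counts(normalize=True)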

6. Inserting a constant value

In [8]:

from statsmodels.tools import add_constant as add_constant

df_constant = add_constant(df)
df_constant.head()

Out[8]:

const sex_male age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp …

0 1.0 1 39 0 0.0 0.0 0 0

1 1.0 0 46 0 0.0 0.0 0 0

2 1.0 1 48 1 20.0 0.0 0 0

3 1.0 0 61 1 30.0 0.0 0 1

4 1.0 0 46 1 23.0 0.0 0 0
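For context (standard logistic-regression theory, not stated in the original notebook): sm.Logit does not add an intercept on its own, so the column of ones supplies the intercept term $\beta_0$ in

$$P(\mathrm{TenYearCHD} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$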


In [9]:

# Shim: chisqprob was removed from scipy.stats, but older statsmodels
# versions still call it when building the summary.
st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)

cols = df_constant.columns[:-1]
model = sm.Logit(df.TenYearCHD, df_constant[cols])
result = model.fit()
result.summary()

Optimization terminated successfully.

Current function value: 0.377199

Iterations 7

Out[9]:

Logit Regression Results

Dep. Variable: TenYearCHD No. Observations: 3749

Model: Logit Df Residuals: 3734

Method: MLE Df Model: 14

Date: Sat, 11 Sep 2021 Pseudo R-squ.: 0.1169

Time: 11:39:42 Log-Likelihood: -1414.1

converged: True LL-Null: -1601.4

Covariance Type: nonrobust LLR p-value: 2.922e-71

coef std err z P>|z| [0.025 0.975]

const -8.6463 0.687 -12.577 0.000 -9.994 -7.299

sex_male 0.5740 0.107 5.343 0.000 0.363 0.785

age 0.0640 0.007 9.787 0.000 0.051 0.077

currentSmoker 0.0732 0.155 0.473 0.636 -0.230 0.376

cigsPerDay 0.0184 0.006 3.003 0.003 0.006 0.030

BPMeds 0.1446 0.232 0.622 0.534 -0.311 0.600

prevalentStroke 0.7191 0.489 1.471 0.141 -0.239 1.677

prevalentHyp 0.2146 0.136 1.574 0.116 -0.053 0.482

diabetes 0.0025 0.312 0.008 0.994 -0.609 0.614

totChol 0.0022 0.001 2.074 0.038 0.000 0.004

sysBP 0.0153 0.004 4.080 0.000 0.008 0.023

diaBP -0.0039 0.006 -0.619 0.536 -0.016 0.009

BMI 0.0103 0.013 0.820 0.412 -0.014 0.035

heartRate -0.0023 0.004 -0.550 0.583 -0.010 0.006

glucose 0.0076 0.002 3.408 0.001 0.003 0.012


Several predictors in the full model (currentSmoker, BPMeds, prevalentStroke, prevalentHyp, diabetes, diaBP, BMI and heartRate) have p-values above 0.05, so backward elimination is used to drop the least significant feature one at a time:

In [10]:

def back_feature_elem(data_frame, dep_var, col_list):
    # Repeatedly refit the model and drop the predictor with the largest
    # p-value until every remaining p-value is below 0.05.
    while len(col_list) > 0:
        model = sm.Logit(dep_var, data_frame[col_list])
        result = model.fit(disp=0)
        largest_pvalue = round(result.pvalues, 3).nlargest(1)
        if largest_pvalue[0] < 0.05:
            return result
        else:
            col_list = col_list.drop(largest_pvalue.index)

result = back_feature_elem(df_constant, df.TenYearCHD, cols)

result.summary()

Out[10]:

Logit Regression Results

Dep. Variable: TenYearCHD No. Observations: 3749

Model: Logit Df Residuals: 3742

Method: MLE Df Model: 6

Date: Sat, 11 Sep 2021 Pseudo R-squ.: 0.1148

Time: 11:40:26 Log-Likelihood: -1417.6

converged: True LL-Null: -1601.4

Covariance Type: nonrobust LLR p-value: 2.548e-76

coef std err z P>|z| [0.025 0.975]

const -9.1211 0.468 -19.491 0.000 -10.038 -8.204

sex_male 0.5813 0.105 5.521 0.000 0.375 0.788

age 0.0654 0.006 10.330 0.000 0.053 0.078

cigsPerDay 0.0197 0.004 4.803 0.000 0.012 0.028

totChol 0.0023 0.001 2.099 0.036 0.000 0.004

sysBP 0.0174 0.002 8.166 0.000 0.013 0.022

glucose 0.0076 0.002 4.573 0.000 0.004 0.011


In [11]:

params = np.exp(result.params)
conf = np.exp(result.conf_int())
conf['OR'] = params
pvalue = round(result.pvalues, 3)
conf['pvalue'] = pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio', 'pvalue']
print(conf)

CI 95%(2.5%) CI 95%(97.5%) Odds Ratio pvalue

const 0.000044 0.000274 0.000109 0.000

sex_male 1.454877 2.198166 1.788313 0.000

age 1.054409 1.080897 1.067571 0.000

cigsPerDay 1.011730 1.028128 1.019896 0.000

totChol 1.000150 1.004386 1.002266 0.036

sysBP 1.013299 1.021791 1.017536 0.000

glucose 1.004343 1.010895 1.007614 0.000
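A worked reading of this table (a sketch; the interpretation is standard, and the numbers come from the output above): each coefficient exponentiates into an odds ratio, so for age,

In [ ]:

# exp(coef) for age from the summary above: each additional year of age
# multiplies the odds of ten-year CHD by about 1.068, i.e. roughly +6.8%.
np.exp(0.0654)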

7. Splitting the data into training and test sets

In [15]:

import sklearn
from sklearn.model_selection import train_test_split

new_features = df[['age', 'sex_male', 'cigsPerDay', 'totChol', 'sysBP', 'glucose', 'TenYearCHD']]
x = new_features.iloc[:, :-1]
y = new_features.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=5)
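A small check (not in the original notebook) to confirm the 80/20 split:

In [ ]:

# Sketch: verify the split sizes; with 3749 rows and test_size=0.20,
# expect roughly 2999 training rows and 750 test rows.
print(x_train.shape, x_test.shape)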

8. Fitting the Model to the training data

In [ ]:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
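Depending on the scikit-learn version, the default lbfgs solver may emit a ConvergenceWarning on these unscaled features; raising max_iter is a common workaround (an assumption on my part, not something the original notebook does):

In [ ]:

# Alternative fit (sketch): allow more iterations in case lbfgs
# does not converge on the unscaled features.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)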

9. Calculating the accuracy

In [18]:

print("Model Accuracy:")

sklearn.metrics.accuracy_score(y_test,y_pred)

Model Accuracy:

Out[18]:

0.8706666666666667
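Since the countplot in section 5 showed TenYearCHD is heavily skewed toward the negative class, an accuracy near 0.87 is only somewhat better than always predicting the majority class. A confusion matrix (confusion_matrix was already imported in cell [1]) gives a fuller picture; a minimal sketch:

In [ ]:

# Sketch: per-class errors; rows are true classes, columns are predictions.
print(confusion_matrix(y_test, y_pred))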
