
Merging_Scaled_1D_&_Trying_Different_Classification_ML_Models.ipynb - Colaboratory

The document discusses preprocessing ECG data from multiple datasets and combining them. It imports libraries, lists the file paths, sorts the files, reads and combines the CSVs for each lead across all data. It performs dimensionality reduction via PCA on the combined data, explaining over 75% of variance with 100 components. It then trains and tests different ML models like KNN on the reduced data for a single lead, achieving 78% accuracy.


Open in Colab

IMPORTING LIBRARIES

import pandas as pd
import numpy as np
import os
from natsort import natsorted
import joblib

WORKING ON COMBINING MULTIPLE LEAD FILES

#creating lists to store file names


NORMAL_=[]
MI_=[]
PMI_=[]
HB_=[]

normal = '/content/drive/MyDrive/CMPE255_PROJECT/NORMAL'
abnormal = '/content/drive/MyDrive/CMPE255_PROJECT/AHB'
MI = '/content/drive/MyDrive/CMPE255_PROJECT/MI'
MI_history = '/content/drive/MyDrive/CMPE255_PROJECT/PM'

Types_ECG = {'normal':normal,'Abnormal_hear_beat':abnormal,'MI':MI,'History_MI':MI_history}

for types,folder in Types_ECG.items():
    for files in os.listdir(folder):
        if types=='normal':
            NORMAL_.append(files)
        elif types=='Abnormal_hear_beat':
            HB_.append(files)
        elif types=='MI':
            MI_.append(files)
        elif types=='History_MI':
            PMI_.append(files)

NORMAL_ = natsorted(NORMAL_)
NORMAL_

['scaled_data_1D_1.csv',
'scaled_data_1D_2.csv',
'scaled_data_1D_3.csv',
'scaled_data_1D_4.csv',
'scaled_data_1D_5.csv',
'scaled_data_1D_6.csv',
'scaled_data_1D_7.csv',
'scaled_data_1D_8.csv',
'scaled_data_1D_9.csv',
'scaled_data_1D_10.csv',
'scaled_data_1D_11.csv',
'scaled_data_1D_12.csv',
'scaled_data_1D_13.csv']


MI_ = natsorted(MI_)
MI_

['scaled_data_1D_1.csv',
'scaled_data_1D_2.csv',
'scaled_data_1D_3.csv',
'scaled_data_1D_4.csv',
'scaled_data_1D_5.csv',
'scaled_data_1D_6.csv',
'scaled_data_1D_7.csv',
'scaled_data_1D_8.csv',
'scaled_data_1D_9.csv',
'scaled_data_1D_10.csv',
'scaled_data_1D_11.csv',
'scaled_data_1D_12.csv',
'scaled_data_1D_13.csv']

PMI_ = natsorted(PMI_)
PMI_

['scaled_data_1D_1.csv',
'scaled_data_1D_2.csv',
'scaled_data_1D_3.csv',
'scaled_data_1D_4.csv',
'scaled_data_1D_5.csv',
'scaled_data_1D_6.csv',
'scaled_data_1D_7.csv',
'scaled_data_1D_8.csv',
'scaled_data_1D_9.csv',
'scaled_data_1D_10.csv',
'scaled_data_1D_11.csv',
'scaled_data_1D_12.csv',
'scaled_data_1D_13.csv']

HB_ = natsorted(HB_)
HB_

['scaled_data_1D_1.csv',
'scaled_data_1D_2.csv',
'scaled_data_1D_3.csv',
'scaled_data_1D_4.csv',
'scaled_data_1D_5.csv',
'scaled_data_1D_6.csv',
'scaled_data_1D_7.csv',
'scaled_data_1D_8.csv',
'scaled_data_1D_9.csv',
'scaled_data_1D_10.csv',
'scaled_data_1D_11.csv',
'scaled_data_1D_12.csv',
'scaled_data_1D_13.csv']

COMBINED CSV OF EACH LEAD (1-12) FROM ALL IMAGES

#loop over and create a combined csv file for each lead
for x in range(len(MI_)):
    df1=pd.read_csv('/content/drive/MyDrive/CMPE255_PROJECT/NORMAL/{}'.format(NORMAL_[x]))
    df2=pd.read_csv('/content/drive/MyDrive/CMPE255_PROJECT/AHB/{}'.format(HB_[x]))
    df3=pd.read_csv('/content/drive/MyDrive/CMPE255_PROJECT/MI/{}'.format(MI_[x]))
    df4=pd.read_csv('/content/drive/MyDrive/CMPE255_PROJECT/PM/{}'.format(PMI_[x]))
    final_df = pd.concat([df1,df2,df3,df4],ignore_index=True)
    final_df.to_csv('/content/drive/MyDrive/CMPE255_PROJECT/Combined_IDLead_{}.csv'.format(x+1))

#now reading just lead1


df=pd.read_csv('/content/drive/MyDrive/CMPE255_PROJECT/Combined_IDLead_1.csv')
df['Target'].unique()

array(['No', 'HB', 'MI', 'PM'], dtype=object)

df.drop(columns=['Unnamed: 0'],inplace=True)

#convert Target column values to numeric codes using ngroup


encode_target_label = df.groupby('Target').ngroup().rename("target").to_frame()
test_final = df.merge(encode_target_label, left_index=True, right_index=True)
test_final.drop(columns=['Target'],inplace=True)
test_final

[output truncated: preview of test_final, 928 rows × 256 columns — 255 scaled lead-1 sample values per row plus the numeric target column]
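
ngroup() numbers the groups in the sorted order of the keys, so the mapping here is HB→0, MI→1, No→2, PM→3. A small sanity check along these lines (not in the original notebook) recovers the mapping explicitly:

# illustrative only: recover the label -> numeric-code mapping produced by ngroup()
mapping = (
    df[['Target']]
    .assign(code=df.groupby('Target').ngroup())
    .drop_duplicates()
    .sort_values('code')
)
print(mapping)   # expected: HB->0, MI->1, No->2, PM->3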

PERFORM DIMENSIONALITY REDUCTION JUST FOR CHECKING/UNDERSTANDING

#just for testing


# Now Perform Dimensionality reduction (PCA) on that Dataframe and check
from sklearn.decomposition import PCA

#do PCA and choose components as 100


pca = PCA(n_components=100)
x_pca = pca.fit_transform(test_final.iloc[:,0:-1])
x_pca = pd.DataFrame(x_pca)

# Calculate the variance explained by the principal components


explained_variance = pca.explained_variance_ratio_
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_))*100,2))

#store the new pca generated dimensions in a dataframe


pca_df = pd.DataFrame(data = x_pca)
target = pd.Series(test_final['target'], name='target')
result_df = pd.concat([pca_df, target], axis=1)
result_df
Variance of each component: [1.76145888e-01 9.50265614e-02 6.99060614e-02 6.1
5.34876630e-02 4.23664893e-02 3.68320213e-02 3.38541791e-02
3.00884979e-02 2.90396728e-02 2.64962509e-02 2.42272738e-02
2.10221030e-02 1.99751559e-02 1.77321042e-02 1.63016802e-02
1.53898622e-02 1.48412074e-02 1.33644825e-02 1.19674074e-02
1.16813409e-02 1.05807650e-02 9.68875480e-03 9.47385060e-03
8.65347748e-03 8.47506998e-03 7.93382172e-03 7.30163338e-03
6.76380665e-03 6.36886390e-03 6.02004791e-03 5.46823032e-03
... (output truncated)
result_df

[output truncated: preview of result_df, 928 rows × 101 columns — 100 principal components plus the target column]
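
If the goal is a variance target rather than a fixed component count, one option (not in the original notebook) is to look at the cumulative explained-variance curve, or to pass a float to PCA so scikit-learn picks the number of components itself:

import numpy as np
from sklearn.decomposition import PCA

# cumulative explained variance of the 100-component fit above
cumulative = np.cumsum(pca.explained_variance_ratio_)
# assumes the 75% threshold is reached within the 100 fitted components
print('components needed for 75% variance:', int(np.argmax(cumulative >= 0.75)) + 1)

# alternatively, let PCA pick the component count for a target variance fraction
pca_75 = PCA(n_components=0.75, svd_solver='full')
x_pca_75 = pca_75.fit_transform(test_final.iloc[:, 0:-1])
print(x_pca_75.shape)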

TRYING DIFFERENT ML MODELS ON A SINGLE LEAD (EX: 1) POST DIMENSIONALITY REDUCTION

KNN

# Import the necessary modules for ML model


from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

# Setup the pipeline steps:


steps = [('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline


pipeline = Pipeline(steps)

# passed a small hyperparameter range since I'm using the free-tier version of Google Colab
k_range = list(range(1, 9))
parameters = dict(knn__n_neighbors=k_range)

#input
X = result_df.iloc[:,0:-1]

#target
y=result_df.iloc[:,-1]

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42)

#increasing the number of CV folds takes a lot of time in Google Colab, so kept it at just 2.


cv = GridSearchCV(pipeline,parameters,cv=2)

cv.fit(X_train,y_train)

# Predict the labels of the test set: y_pred


y_pred = cv.predict(X_test)

Knn_Accuracy = cv.score(X_test, y_test)

# Compute and print metrics


print("Accuracy: {}".format(Knn_Accuracy))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.782258064516129
precision recall f1-score support

0 0.87 0.63 0.73 105


1 0.91 0.91 0.91 94
2 0.72 0.88 0.79 112
3 0.63 0.67 0.65 61

accuracy 0.78 372


macro avg 0.78 0.77 0.77 372
weighted avg 0.80 0.78 0.78 372

Tuned Model Parameters: {'knn__n_neighbors': 1}
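
The grid search settles on n_neighbors=1, which can be a sign of near-duplicate samples across folds; one optional check (not in the original notebook) is to look at the mean cross-validated accuracy for every k that was tried:

# illustrative only: mean CV accuracy for each k searched above
cv_results = pd.DataFrame(cv.cv_results_)
print(cv_results[['param_knn__n_neighbors', 'mean_test_score']])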


LOGISTIC REGRESSION

from sklearn.pipeline import Pipeline


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

# Setup the pipeline steps:


steps = [('lr', LogisticRegression())]

# Create the pipeline: pipeline


pipeline = Pipeline(steps)

#input
X = result_df.iloc[:,0:-1]

#target
y=result_df.iloc[:,-1]

#parameters for GridSearchCV; searching more values of C takes longer in Colab, so only a few are used


c_space = np.logspace(-4, 4, 3)
parameters = {'lr__C': c_space,'lr__penalty': ['l2']}

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42)

#call GridSearchCV and set the number of CV folds to 2


cv = GridSearchCV(pipeline,parameters,cv=2)

cv.fit(X_train,y_train)

# Predict the labels of the test set: y_pred


y_pred = cv.predict(X_test)
LR_Accuracy = cv.score(X_test, y_test)

# Compute and print metrics


print("Accuracy: {}".format(LR_Accuracy))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.543010752688172
precision recall f1-score support

0 0.36 0.33 0.35 105


1 0.73 0.91 0.81 94
2 0.56 0.58 0.57 112
3 0.38 0.26 0.31 61

accuracy 0.54 372


macro avg 0.51 0.52 0.51 372
weighted avg 0.52 0.54 0.53 372

Tuned Model Parameters: {'lr__C': 10000.0, 'lr__penalty': 'l2'}
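
The selected C sits at the upper edge of the searched grid, and logistic regression on unscaled PCA scores can also stop at the default iteration limit; a possible variant (an assumption, not part of the original run) is to standardize the components and raise max_iter:

from sklearn.preprocessing import StandardScaler

# illustrative variant: scale the PCA scores and allow more iterations
steps = [('scaler', StandardScaler()),
         ('lr', LogisticRegression(max_iter=1000))]
pipeline = Pipeline(steps)

parameters = {'lr__C': np.logspace(-4, 4, 5), 'lr__penalty': ['l2']}
cv = GridSearchCV(pipeline, parameters, cv=2)
cv.fit(X_train, y_train)
print(cv.score(X_test, y_test), cv.best_params_)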

SVM

# Import the necessary modules for ML model


from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

# Setup the pipeline


steps = [('SVM', SVC())]

pipeline = Pipeline(steps)

#input
X = result_df.iloc[:,0:-1]

#target
y=result_df.iloc[:,-1]

# Specify the hyperparameter space; increasing the penalty (C) and gamma values affects the accuracy
#since it takes a lot of time in Google Colab, only a single value is provided
parameters = {'SVM__C':[10],'SVM__gamma':[1]}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=21)

cv = GridSearchCV(pipeline,parameters,cv=3)
cv.fit(X_train,y_train)

y_pred = cv.predict(X_test)
SVM_Accuracy = cv.score(X_test, y_test)

# Compute and print metrics


SVM_Accuracy=cv.score(X_test, y_test)

print("Accuracy: {}".format(SVM_Accuracy))
print(classification_report(y_test, y_pred))

Accuracy: 0.8225806451612904
precision recall f1-score support

0 0.58 1.00 0.74 93


1 1.00 1.00 1.00 99
2 1.00 0.61 0.76 117
3 1.00 0.68 0.81 63

accuracy 0.82 372


macro avg 0.90 0.82 0.83 372
weighted avg 0.90 0.82 0.83 372

NOW COMBINING ALL 12 LEADS INTO A SINGLE CSV FILE AND THEN PERFORM MODEL ANALYSIS

#let's try combining all 12 leads into a single csv


location= '/content/drive/MyDrive/CMPE255_PROJECT/'
for files in natsorted(os.listdir(location)):
    if files.endswith(".csv") and not files.endswith("13.csv"):
        if files!='Combined_IDLead_1.csv':
            df=pd.read_csv('/content/drive/MyDrive/CMPE255_PROJECT/{}'.format(files))
            df.drop(columns=['Unnamed: 0'],inplace=True)
            test_final=pd.concat([test_final,df],axis=1,ignore_index=True)
            test_final.drop(columns=test_final.columns[-1],axis=1,inplace=True)

#drop the target column


test_final.drop(columns=[255],axis=1,inplace=True)
test_final
[output truncated: preview of test_final after concatenating all 12 leads column-wise (first five rows shown)]

#write the final file to csv


test_final.to_csv('final_1D.csv',header=False,index=False)

TEST DIMENSIONALITY REDUCTION EXPLAINED VARIANCE ON THE DATA

# Now Perform Dimensionality reduction (PCA) on that Dataframe and check


from sklearn.decomposition import PCA

#do PCA and choose components as 400


pca = PCA(n_components=400)
x_pca = pca.fit_transform(test_final)
x_pca = pd.DataFrame(x_pca)

# Calculate the variance explained by the principal components


explained_variance = pca.explained_variance_ratio_
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_))*100,2))

#store the new pca generated dimensions in a dataframe


pca_df = pd.DataFrame(data = x_pca)
target = pd.Series(result_df.iloc[:,-1], name='target')
final_result_df = pd.concat([pca_df, target], axis=1)
final_result_df
Variance of each component: [8.04649534e-02 4.68818003e-02 3.76212504e-02 2.9
2.57031130e-02 2.32574514e-02 2.14376788e-02 2.04315151e-02
1.94482863e-02 1.79877408e-02 1.64766264e-02 1.53241665e-02
1.50689862e-02 1.41398267e-02 1.36330466e-02 1.33375324e-02
1.26355566e-02 1.25577001e-02 1.16968257e-02 1.11671338e-02
1.07975552e-02 1.06183806e-02 1.03402122e-02 1.01248410e-02
9.73197948e-03 9.25504395e-03 9.16367637e-03 8.76267060e-03
8.54270112e-03 8.20665462e-03 8.07642149e-03 7.90742343e-03
7.54929819e-03 7.21938018e-03 7.07604659e-03 6.89135251e-03
6.80575532e-03 6.71875790e-03 6.38252148e-03 6.33951897e-03
6.10254734e-03 5.94560955e-03 5.76371295e-03 5.71788829e-03
5.55354810e-03 5.42316932e-03 5.35640711e-03 5.08429353e-03
5.03302777e-03 4.96811576e-03 4.87696491e-03 4.63686128e-03
4.55349933e-03 4.45390625e-03 4.31579996e-03 4.28316592e-03
4.17213140e-03 4.12346241e-03 4.09072049e-03 3.99349122e-03
3.92129459e-03 3.81982060e-03 3.78116652e-03 3.73307150e-03
3.68894307e-03 3.55238746e-03 3.49148625e-03 3.40490507e-03
3.33593814e-03 3.25467389e-03 3.20023474e-03 3.14871964e-03
3.09091665e-03 3.07180393e-03 3.05651457e-03 2.95447952e-03
2.90507083e-03 2.84618700e-03 2.80939396e-03 2.76324718e-03
2.71487874e-03 2.68959207e-03 2.67378836e-03 2.62085254e-03
2.55991613e-03 2.53614502e-03 2.47015404e-03 2.45768102e-03
2.41851536e-03 2.39477316e-03 2.35560704e-03 2.29236345e-03
2.26928539e-03 2.24965527e-03 2.22764534e-03 2.19258829e-03
2.14654982e-03 2.09081474e-03 2.08656961e-03 2.04315332e-03
2.01191187e-03 1.99715030e-03 1.98092986e-03 1.93183566e-03
1.90133601e-03 1.86628808e-03 1.85847904e-03 1.79040117e-03
1.77318190e-03 1.76278440e-03 1.73682193e-03 1.70177712e-03
1.69142157e-03 1.66289246e-03 1.64192361e-03 1.62455779e-03
1.59836820e-03 1.57166872e-03 1.56017874e-03 1.55193712e-03
1.52130395e-03 1.50860404e-03 1.48563216e-03 1.45667689e-03
1.44862677e-03 1.43014707e-03 1.42443426e-03 1.39341888e-03
1.38941740e-03 1.38032166e-03 1.35292505e-03 1.33403513e-03
1.33300728e-03 1.31774024e-03 1.29238722e-03 1.24574072e-03
1.23408862e-03 1.21598644e-03 1.20568485e-03 1.19391143e-03
1.18690274e-03 1.16630751e-03 1.16159095e-03 1.14539199e-03
... (remaining explained-variance output truncated)

#save the dimensionally reduced data to a csv file
final_result_df.to_csv("pca_final.csv")

import joblib

#save the fitted PCA model
joblib_file='PCA_ECG.pkl'
joblib.dump(pca,joblib_file)

['PCA_ECG.pkl']
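
A brief sketch (not from the original notebook) of how the saved PCA model could be reloaded later, for example in the real-time pipeline, to project new scaled rows into the same 400-dimensional space:

import joblib
import numpy as np

pca_loaded = joblib.load('PCA_ECG.pkl')

# dummy rows purely for illustration: real input must have the same number of
# columns and the same scaling as the data the PCA was fitted on
new_rows = np.random.rand(2, pca_loaded.n_features_in_)
print(pca_loaded.transform(new_rows).shape)   # (2, 400)
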
TRYING DIFFERENT ML MODELS ON ALL 12 LEADS COMBINED FILE WITHOUT DIMENSIONALITY REDUCTION

KNN

# Import the necessary modules for ML model


from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

# Setup the pipeline steps:


steps = [('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline


pipeline = Pipeline(steps)

# passed a small hyperparameter range since I'm using the free-tier version of Google Colab
k_range = list(range(1, 30))
parameters = dict(knn__n_neighbors=k_range)

#input
X = final_result_df.iloc[:,:-1]

#target
y=final_result_df.iloc[:,-1]

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42)

#increasing the number of CV folds takes a lot of time in Google Colab, so kept it at just 2.


cv = GridSearchCV(pipeline,parameters,cv=2)

cv.fit(X_train,y_train)

# Predict the labels of the test set: y_pred


y_pred = cv.predict(X_test)

Knn_Accuracy = cv.score(X_test, y_test)

# Compute and print metrics


print("Accuracy: {}".format(Knn_Accuracy))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.793010752688172
precision recall f1-score support

0 0.92 0.65 0.76 105


1 0.95 0.91 0.93 94
2 0.70 0.86 0.77 112
3 0.65 0.74 0.69 61

accuracy 0.79 372


macro avg 0.80 0.79 0.79 372
weighted avg 0.81 0.79 0.79 372

Tuned Model Parameters: {'knn__n_neighbors': 1}

LOGISTIC REGRESSION

from sklearn.pipeline import Pipeline


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

# Setup the pipeline steps:


steps = [('lr', LogisticRegression())]

# Create the pipeline: pipeline


pipeline = Pipeline(steps)

#input
X = final_result_df.iloc[:,:-1]

#target
y=final_result_df.iloc[:,-1]

#parameters for gridsearchcv


c_space = np.logspace(-4, 4, 10)
parameters = {'lr__C': c_space,'lr__penalty': ['l2']}

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42)

#call GridSearchCV and set the number of CV folds to 2


cv = GridSearchCV(pipeline,parameters,cv=2)

cv.fit(X_train,y_train)

# Predict the labels of the test set: y_pred


y_pred = cv.predict(X_test)
LR_Accuracy = cv.score(X_test, y_test)

# Compute and print metrics


print("Accuracy: {}".format(LR_Accuracy))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.7768817204301075
precision recall f1-score support

0 0.83 0.57 0.68 105


1 0.83 0.91 0.87 94
2 0.82 0.86 0.84 112
3 0.59 0.77 0.67 61

accuracy 0.78 372


macro avg 0.77 0.78 0.76 372
weighted avg 0.79 0.78 0.77 372

Tuned Model Parameters: {'lr__C': 0.3593813663804626, 'lr__penalty': 'l2'}

SVM

# Import the necessary modules for ML model


from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

# Setup the pipeline


steps = [('SVM', SVC())]

pipeline = Pipeline(steps)

#input
X = final_result_df.iloc[:,:-1]

#target
y=final_result_df.iloc[:,-1]

# Specify the hyperparameter space; increasing the penalty (C) and gamma values affects the accuracy
#since it takes a lot of time in Google Colab, only a few values are provided
parameters = {'SVM__C':[1, 10, 100],
'SVM__gamma':[0.1, 0.01]}

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5,random_state=21)

cv = GridSearchCV(pipeline,parameters,cv=3)
cv.fit(X_train,y_train)

y_pred = cv.predict(X_test)
SVM_Accuracy = cv.score(X_test, y_test)

# Compute and print metrics


SVM_Accuracy=cv.score(X_test, y_test)

print("Accuracy: {}".format(SVM_Accuracy))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.9051724137931034
precision recall f1-score support

0 0.81 0.92 0.86 119


1 1.00 1.00 1.00 125
2 0.91 0.89 0.90 140
3 0.93 0.78 0.84 80

accuracy 0.91 464


macro avg 0.91 0.89 0.90 464
weighted avg 0.91 0.91 0.91 464

Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.01}

XGBOOST

from xgboost import XGBClassifier


from sklearn.metrics import accuracy_score

model = XGBClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {}".format(accuracy))
print(classification_report(y_test, y_pred))

Accuracy: 0.853448275862069
precision recall f1-score support

0 0.79 0.70 0.74 119


1 0.98 1.00 0.99 125
2 0.82 0.87 0.84 140
3 0.80 0.82 0.81 80

accuracy 0.85 464


macro avg 0.85 0.85 0.85 464
weighted avg 0.85 0.85 0.85 464

SAVING A VERY BASIC ML MODEL AND USING IT IN A REAL-TIME PIPELINE TO CHECK IT WORKS

# Import the necessary modules for ML model


from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
import joblib
#input
X = final_result_df.iloc[:,:-1]
#target
y=final_result_df.iloc[:,-1]

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42)

knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train,y_train)

joblib_file='knn_model_test.pkl'
joblib.dump(knn,joblib_file)

['knn_model_test.pkl']

# Import the necessary modules for ML model


from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

#input
X = pd.read_csv('final_1D.csv',header=None)

#target
y=final_result_df.iloc[:,-1]

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5,random_state=21)

svm=SVC(C=10,gamma=0.01)

svm.fit(X_train,y_train)

joblib_file='svm_model_test.pkl'
joblib.dump(svm,joblib_file)

['svm_model_test.pkl']
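
Before wiring the saved files into the real-time pipeline, a quick reload-and-score check along these lines (illustrative only, not part of the original notebook) can confirm the pickles behave as expected:

import joblib

# reload the models saved above and sanity-check them
svm_loaded = joblib.load('svm_model_test.pkl')
print('reloaded SVM test accuracy:', svm_loaded.score(X_test, y_test))

knn_loaded = joblib.load('knn_model_test.pkl')
# the KNN model expects PCA-reduced rows (the feature columns of final_result_df)
print('reloaded KNN prediction:', knn_loaded.predict(final_result_df.iloc[[0], :-1]))
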
ENSEMBLE

# Importing required modules


from sklearn import linear_model, tree, ensemble
from sklearn.naive_bayes import GaussianNB
import xgboost
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
import pickle

#input
X = final_result_df.iloc[:,0:-1]

#target
y=final_result_df.iloc[:,-1]

# Create train and test sets


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

# Voting ensemble of ML models
eclf = VotingClassifier(estimators=[
('SVM', SVC(probability=True)),
('knn', KNeighborsClassifier()),
('rf', ensemble.RandomForestClassifier()),
('bayes',GaussianNB()),
('logistic',LogisticRegression()),
], voting='soft')

# Hyperparameter Tuning using gridSearch


params = {'SVM__C':[1, 10, 100],
'SVM__gamma':[0.1, 0.01],
'knn__n_neighbors': [1,3,5],
'rf__n_estimators':[300, 400],
}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)


voting_clf = grid.fit(X_train, y_train)

print(grid.best_params_)
y_pred = voting_clf.predict(X_test)

# Compute and print metrics


Voting_Accuracy=voting_clf.score(X_test, y_test)

print("Accuracy: {}".format(Voting_Accuracy))
print(classification_report(y_test, y_pred))
print(voting_clf.best_params_)

{'SVM__C': 1, 'SVM__gamma': 0.1, 'knn__n_neighbors': 1, 'rf__n_estimators': 3


Accuracy: 0.9247311827956989
precision recall f1-score support

0 0.89 0.96 0.92 80


1 1.00 1.00 1.00 72
2 0.92 0.92 0.92 79
3 0.88 0.75 0.81 48
accuracy 0.92 279
macro avg 0.92 0.91 0.91 279
weighted avg 0.92 0.92 0.92 279

{'SVM__C': 1, 'SVM__gamma': 0.1, 'knn__n_neighbors': 1, 'rf__n_estimators': 3
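
For a quick side-by-side view of the models tried on the 12-lead data, the accuracy variables collected in the cells above can be printed together (Knn_Accuracy, LR_Accuracy and SVM_Accuracy hold the values from their most recent runs):

# illustrative summary of the accuracies computed above
summary = {
    'KNN': Knn_Accuracy,
    'Logistic Regression': LR_Accuracy,
    'SVM': SVM_Accuracy,
    'XGBoost': accuracy,
    'Voting ensemble': Voting_Accuracy,
}
for name, acc in summary.items():
    print('{:<20s} {:.3f}'.format(name, acc))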

# open a file where you want to store the data


file = open('Heart_Disease_Prediction_using_ECG.pkl', 'wb')
# dump information to that file
pickle.dump(voting_clf, file)

SAVE AND USE THE ABOVE MODEL IN THE STREAMLIT APP :


https://colab.research.google.com/drive/139YVmcUBCiP52J2sX3QE_eiu2sukVgpn?usp=sharing
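
As a rough illustration of the Streamlit side (hypothetical code, assuming a recent Streamlit version; the image-to-1D preprocessing is assumed to live elsewhere in the app), the pickled ensemble and the saved PCA model could be loaded once and reused for every uploaded ECG:

import pickle
import joblib
import streamlit as st

@st.cache_resource
def load_models():
    # artifacts saved by this notebook
    pca = joblib.load('PCA_ECG.pkl')
    with open('Heart_Disease_Prediction_using_ECG.pkl', 'rb') as f:
        clf = pickle.load(f)
    return pca, clf

pca, clf = load_models()
st.write('Loaded model:', type(clf).__name__)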
