My Code

Uploaded by

oyelekeayomide1

Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

For displaying all features of the dataset

pd.set_option('display.max_columns', None)

# Reading Dataset:
dataset = pd.read_csv("/content/drive/MyDrive/Project Work/Kidney_data.csv")
# Top 5 records:
dataset.head()

Data Set Information

The dataset uses the following abbreviations for its attributes:

age - age

bp - blood pressure

sg - specific gravity

al - albumin

su - sugar

rbc - red blood cells

pc - pus cell

pcc - pus cell clumps

ba - bacteria

bgr - blood glucose random

bu - blood urea

sc - serum creatinine

sod - sodium

pot - potassium

hemo - hemoglobin

pcv - packed cell volume

wc - white blood cell count

rc - red blood cell count

htn - hypertension

dm - diabetes mellitus

cad - coronary artery disease

appet - appetite

pe - pedal edema

ane - anemia

class - class (the column is named classification in the CSV used below)

Attribute Information:

We use 24 features + class = 25 attributes (11 numeric, 14 nominal)

Age(numerical) age in years

Blood Pressure(numerical) bp in mmHg

Specific Gravity(nominal) sg - (1.005,1.010,1.015,1.020,1.025)

Albumin(nominal) al - (0,1,2,3,4,5)

Sugar(nominal) su - (0,1,2,3,4,5)

Red Blood Cells(nominal) rbc - (normal,abnormal)

Pus Cell (nominal) pc - (normal,abnormal)

Pus Cell clumps(nominal) pcc - (present,notpresent)

Bacteria(nominal) ba - (present,notpresent)

Blood Glucose Random(numerical) bgr in mg/dl

Blood Urea(numerical) bu in mg/dl

Serum Creatinine(numerical) sc in mg/dl

Sodium(numerical) sod in mEq/L

Potassium(numerical) pot in mEq/L

Hemoglobin(numerical) hemo in gms

Packed Cell Volume(numerical) pcv

White Blood Cell Count(numerical) wc in cells/cumm

Red Blood Cell Count(numerical) rc in millions/cmm

Hypertension(nominal) htn - (yes,no)

Diabetes Mellitus(nominal) dm - (yes,no)

Coronary Artery Disease(nominal) cad - (yes,no)

Appetite(nominal) appet - (good,poor)

Pedal Edema(nominal) pe - (yes,no)

Anemia(nominal) ane - (yes,no)

Class (nominal) class - (ckd,notckd)
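
The attribute list above can be written down as a small lookup table, which makes the 11 numeric / 14 nominal split easy to verify. This is only a sketch: the names ATTRIBUTE_TYPES and count_types are made up for illustration and do not appear in the original notebook.

```python
from collections import Counter

# Attribute abbreviation -> type, transcribed from the list above.
ATTRIBUTE_TYPES = {
    "age": "numerical", "bp": "numerical", "sg": "nominal", "al": "nominal",
    "su": "nominal", "rbc": "nominal", "pc": "nominal", "pcc": "nominal",
    "ba": "nominal", "bgr": "numerical", "bu": "numerical", "sc": "numerical",
    "sod": "numerical", "pot": "numerical", "hemo": "numerical",
    "pcv": "numerical", "wc": "numerical", "rc": "numerical",
    "htn": "nominal", "dm": "nominal", "cad": "nominal", "appet": "nominal",
    "pe": "nominal", "ane": "nominal", "class": "nominal",
}


def count_types(attribute_types):
    """Return how many attributes fall into each type (hypothetical helper)."""
    return Counter(attribute_types.values())


counts = count_types(ATTRIBUTE_TYPES)
print(len(ATTRIBUTE_TYPES), counts["numerical"], counts["nominal"])
```

The same table could later be used to decide which columns to convert with pd.to_numeric and which to encode as categories.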

# Dropping unnecessary feature:
dataset = dataset.drop('id', axis=1)

# Shape of dataset:
dataset.shape

# Checking Missing (NaN) Values:


dataset.isnull().sum()

# Description:
dataset.describe()

dataset.columns

dataset.dtypes

Replacing Categorical values with numbers:

1. rbc

dataset['rbc'].value_counts()

dataset['rbc'] = dataset['rbc'].replace(to_replace = {'normal' : 0, 'abnormal' : 1})

2. pc

dataset['pc'].value_counts()
dataset['pc'] = dataset['pc'].replace(to_replace = {'normal' : 0, 'abnormal' : 1})

3. pcc

dataset['pcc'].value_counts()

dataset['pcc'] = dataset['pcc'].replace(to_replace = {'notpresent':0,'present':1})

4. ba

dataset['ba'].value_counts()

dataset['ba'] = dataset['ba'].replace(to_replace = {'notpresent':0,'present':1})

5. htn

dataset['htn'].value_counts()

dataset['htn'] = dataset['htn'].replace(to_replace = {'yes' : 1, 'no' : 0})

6. dm

dataset['dm'].value_counts()

dataset['dm'] = dataset['dm'].replace(to_replace = {'\tyes':'yes', ' yes':'yes', '\tno':'no'})

dataset['dm'] = dataset['dm'].replace(to_replace = {'yes' : 1, 'no' : 0})

7. cad

dataset['cad'].value_counts()

dataset['cad'] = dataset['cad'].replace(to_replace = {'\tno':'no'})

dataset['cad'] = dataset['cad'].replace(to_replace = {'yes' : 1, 'no' : 0})

8. appet

dataset['appet'].unique()

dataset['appet'] = dataset['appet'].replace(to_replace={'good':1,'poor':0,'no':np.nan})

9. pe

dataset['pe'].value_counts()

dataset['pe'] = dataset['pe'].replace(to_replace = {'yes' : 1, 'no' : 0})

10. ane

dataset['ane'].value_counts()
dataset['ane'] = dataset['ane'].replace(to_replace = {'yes' : 1, 'no' : 0})

11. classification

dataset['classification'].value_counts()

dataset['classification'] = dataset['classification'].replace(to_replace={'ckd\t':'ckd'})

dataset["classification"] = [1 if i == "ckd" else 0 for i in dataset["classification"]]
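
The stray tabs and spaces handled above ('\tyes', ' yes', '\tno', 'ckd\t') could also be removed in one step by stripping whitespace before mapping the labels to numbers. A minimal sketch, assuming a small made-up Series rather than the real Kidney_data.csv; the helper name encode_binary is hypothetical.

```python
import numpy as np
import pandas as pd


def encode_binary(series, mapping):
    """Strip surrounding whitespace from string labels, then map them to numbers.

    Labels that are not keys of `mapping` (and missing values) become NaN.
    """
    cleaned = series.str.strip()
    return cleaned.map(mapping)


# Made-up sample imitating the messy 'dm' column; not taken from the dataset.
dm = pd.Series(["yes", "\tyes", " yes", "no", "\tno", np.nan])
dm_encoded = encode_binary(dm, {"yes": 1, "no": 0})
print(dm_encoded.tolist())
```

Because unmapped labels turn into NaN, unexpected spellings show up in the later isnull() check instead of surviving as strings.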

dataset.head()

# Datatypes:
dataset.dtypes

Converting Object values into Numeric values:

dataset['pcv'] = pd.to_numeric(dataset['pcv'], errors='coerce')


dataset['wc'] = pd.to_numeric(dataset['wc'], errors='coerce')
dataset['rc'] = pd.to_numeric(dataset['rc'], errors='coerce')

# Datatypes:
dataset.dtypes

# Description:
dataset.describe()

# Checking Missing (NaN) Values:


dataset.isnull().sum().sort_values(ascending=False)

Handling Null Values:

Outliers are present in our dataset, so we fill NaN values with the median of each column, which is less sensitive to extreme values than the mean.
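
To see why the median is the safer fill value when outliers are present, compare it with the mean on a small sample. The numbers below are invented for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Invented readings with one extreme outlier; None marks a missing value.
readings = [0.9, 1.1, 1.2, 1.0, 48.0, None]

observed = [value for value in readings if value is not None]
fill_mean = mean(observed)      # pulled upwards by the single outlier
fill_median = median(observed)  # stays close to the typical readings

# Fill the gap with the median, mirroring fillna(...median()) in pandas.
filled = [fill_median if value is None else value for value in readings]
print(fill_mean, fill_median)
print(filled)
```

Here the mean is above 10 although four of the five observed readings are close to 1, while the median is 1.1.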

dataset.columns

features = ['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
'appet', 'pe', 'ane']

for feature in features:
    dataset[feature] = dataset[feature].fillna(dataset[feature].median())

dataset.isnull().any().sum()

Heatmap

plt.figure(figsize=(24,14))
sns.heatmap(dataset.corr(), annot=True, cmap='YlGnBu')
plt.show()

1. The heatmap shows that the 'pcv' and 'hemo' features are strongly correlated (a coefficient of about 0.85), which indicates multicollinearity.
2. So we remove one of the two features, namely pcv.

dataset.drop('pcv', axis=1, inplace=True)
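
Reading strongly correlated pairs off a large heatmap by eye is error-prone, so the same check can be done programmatically. The sketch below is an illustration only: the helper highly_correlated_pairs, the 0.8 threshold, and the synthetic data are all assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = frame.corr()
    columns = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(columns):
        for col_b in columns[i + 1:]:
            r = corr.loc[col_a, col_b]
            if abs(r) >= threshold:
                pairs.append((col_a, col_b, float(r)))
    return pairs


# Synthetic stand-in data: 'pcv' is built from 'hemo' plus noise, 'age' is independent.
rng = np.random.default_rng(0)
hemo = rng.normal(12.5, 2.5, size=200)
demo = pd.DataFrame({
    "hemo": hemo,
    "pcv": 3.0 * hemo + rng.normal(0.0, 2.0, size=200),
    "age": rng.normal(50.0, 15.0, size=200),
})

pairs = highly_correlated_pairs(demo, threshold=0.8)
print(pairs)
```

On the real data, calling highly_correlated_pairs(dataset) before the drop would list every pair above the chosen threshold, which is then a matter of judgment to resolve.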

dataset.head()
# Target feature:
sns.countplot(x='classification', data=dataset)

# Independent and Dependent Feature:


X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

X.head()

# Feature Importance:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)

plt.figure(figsize=(8,6))
ranked_features=pd.Series(model.feature_importances_,index=X.columns)
ranked_features.nlargest(24).plot(kind='barh')
plt.show()

We take the top 8 features only.

ranked_features.nlargest(8).index

X = dataset[['sg', 'htn', 'hemo', 'dm', 'al', 'appet', 'rc', 'pc']]
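
The column list above is typed in by hand from the output of ranked_features.nlargest(8).index. ExtraTreesClassifier is randomized, so the ranking can change between runs unless random_state is fixed, and it may be safer to derive the list programmatically. A minimal sketch with made-up importance scores (the values are invented, not the notebook's results):

```python
import pandas as pd

# Invented importance scores standing in for model.feature_importances_.
ranked_features = pd.Series({
    "sg": 0.16, "htn": 0.14, "hemo": 0.13, "dm": 0.11, "al": 0.10,
    "appet": 0.07, "rc": 0.06, "pc": 0.05, "bgr": 0.04, "sc": 0.03,
})

top_k = 8  # number of features to keep, as in the text above
selected_columns = ranked_features.nlargest(top_k).index.tolist()
print(selected_columns)

# With the real data this would be followed by:
# X = dataset[selected_columns]
```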


X.head()

X.tail()

y.head()

# Train Test Split:


from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.3, random_state=33)

print(X_train.shape)
print(X_test.shape)

Random Forest Algorithm


# Importing Performance Metrics:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import metrics

# Initializing empty lists to collect each model's name and its test accuracy
acc = []
model = []
# RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
RandomForest = RandomForestClassifier()
RandomForest = RandomForest.fit(X_train,y_train)

# Predictions:
y_pred_rf = RandomForest.predict(X_test)

# Performance:
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print('Accuracy:', accuracy_score(y_test,y_pred_rf))
print(confusion_matrix(y_test,y_pred_rf))
print(classification_report(y_test, y_pred_rf))

# Summary row for Random Forest (macro-averaged precision, recall and F1 on the test set).
# Note: rf_score is the accuracy on the training data, not on the test set.
rf_score = RandomForest.score(X_train, y_train)
report = classification_report(y_test, y_pred_rf, output_dict=True)
df = pd.DataFrame(report).transpose()
df = df.drop(['0', '1', 'accuracy', 'weighted avg'], axis=0)
df = df.drop('support', axis=1)
df.rename(index={"macro avg": "Random Forest"}, inplace=True)
df['accuracy'] = round((rf_score * 100), 2)

x = metrics.accuracy_score(y_test, y_pred_rf)

acc.append(x)
model.append('RF')

# Confusion Matrix
print(confusion_matrix(y_test, y_pred_rf))
# Use a separate DataFrame so the summary tables built earlier are not overwritten.
cm_data_rf = pd.DataFrame({'y_Actual': y_test.values, 'y_Predicted': y_pred_rf})
clf_confusion_matrix = pd.crosstab(cm_data_rf['y_Predicted'], cm_data_rf['y_Actual'], rownames=['Predicted'], colnames=['Actual'])
sns.heatmap(clf_confusion_matrix, annot=True)
plt.show()

Support Vector Machine


from sklearn.svm import SVC
svm= SVC(kernel = 'linear', random_state = 0)
svm=svm.fit(X_train, y_train)

# Predictions:
y_pred_svm = svm.predict(X_test)

# Performance:
accuracy_svm = accuracy_score(y_test, y_pred_svm)

print('Accuracy:', accuracy_score(y_test,y_pred_svm))
print(confusion_matrix(y_test,y_pred_svm))
print(classification_report(y_test,y_pred_svm))

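# Summary row for the SVM (macro-averaged precision, recall and F1 on the test set).
# Note: svm_score is the accuracy on the training data, not on the test set.
svm_score = svm.score(X_train, y_train)
report1 = classification_report(y_test, y_pred_svm, output_dict=True)
df1 = pd.DataFrame(report1).transpose()  # build the table from the SVM report (report1)
df1 = df1.drop(['0', '1', 'accuracy', 'weighted avg'], axis=0)
df1 = df1.drop('support', axis=1)
df1.rename(index={"macro avg": "Support Vector Machine"}, inplace=True)
df1['accuracy'] = round((svm_score * 100), 2)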

x = metrics.accuracy_score(y_test, y_pred_svm)
acc.append(x)
model.append('SVM')

# Confusion Matrix
print(confusion_matrix(y_test, y_pred_svm))
# Use a separate DataFrame so the summary tables built earlier are not overwritten.
cm_data_svm = pd.DataFrame({'y_Actual': y_test.values, 'y_Predicted': y_pred_svm})
clf_confusion_matrix = pd.crosstab(cm_data_svm['y_Predicted'], cm_data_svm['y_Actual'], rownames=['Predicted'], colnames=['Actual'])
sns.heatmap(clf_confusion_matrix, annot=True)
plt.show()

Accuracy Comparison

plt.figure(figsize=[10,5],dpi = 100)
plt.title('Accuracy Comparison')
plt.xlabel('Accuracy')
plt.ylabel('Algorithm')
sns.barplot(x = acc,y = model,palette='dark')

from sklearn.metrics import f1_score, precision_score, recall_score

# Calculate F1-score for SVM and Random Forest


f1_svm = f1_score(y_test, y_pred_svm)
f1_rf = f1_score(y_test, y_pred_rf)

# Calculate Recall for SVM and Random Forest


recall_svm = recall_score(y_test, y_pred_svm)
recall_rf = recall_score(y_test, y_pred_rf)

# Calculate Precision for SVM and Random Forest


precision_svm = precision_score(y_test, y_pred_svm)
precision_rf = precision_score(y_test, y_pred_rf)
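
For a binary problem these three scores follow directly from the confusion-matrix counts. The sketch below uses invented counts (not the notebook's results) and a hypothetical helper binary_scores to show the arithmetic behind precision, recall and F1 for the positive class (ckd = 1).

```python
def binary_scores(tp, fp, fn):
    """Precision, recall and F1 for the positive class from raw counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Invented counts: 70 true positives, 2 false positives, 5 false negatives.
precision, recall, f1 = binary_scores(tp=70, fp=2, fn=5)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In a screening setting such as kidney disease detection, recall is often the score to watch, since a false negative corresponds to a missed case.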

# Create a DataFrame for better visualization


metrics_data = {
    'Model': ['SVM', 'Random Forest'],
    'Accuracy': [accuracy_svm, accuracy_rf],
    'F1-Score': [f1_svm, f1_rf],
    'Recall': [recall_svm, recall_rf],
    'Precision': [precision_svm, precision_rf]
}

metrics_df = pd.DataFrame(metrics_data)

# Plotting
sns.set(style="whitegrid")
plt.figure(figsize=(8, 6))

# Bar plot for Accuracy


plt.subplot(2, 2, 1)
sns.barplot(x='Model', y='Accuracy', data=metrics_df, palette='viridis')
plt.title('Accuracy Comparison')

# Bar plot for F1-Score


plt.subplot(2, 2, 2)
sns.barplot(x='Model', y='F1-Score', data=metrics_df, palette='magma')
plt.title('F1-Score Comparison')

# Bar plot for Recall-Score


plt.subplot(2, 2, 3)
sns.barplot(x='Model', y='Recall', data=metrics_df, palette='mako')
plt.title('Recall-Score Comparison')

# Bar plot for Precision-Score


plt.subplot(2, 2, 4)
sns.barplot(x='Model', y='Precision', data=metrics_df, palette='inferno')
plt.title('Precision-Score Comparison')

plt.tight_layout()
plt.show()
