Razi AML Assignment2

Uploaded by

ahmed.razi98989
Razi_AML_Assignment2

August 23, 2021

1 ASSIGNMENT 2 - AML

2 Step 1: Download the Liver patient data from the following source:

https://www.kaggle.com/uciml/indian-liver-patient-records

3 Step 2: Use the following 07 features from this dataset

Age, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase, Total_Protiens, Albumin

4 Step 3: Your task is to predict whether a patient suffers from a liver disease using the above features.

Split your data into train and test sets. First use a random forest algorithm to perform this task. Then use an AdaBoost classifier to perform the same task. Compare the accuracy of the two algorithms.
[35]: #Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

[36]: #Read the data from csv

df = pd.read_csv('indian_liver_patient.csv')
df.head()

[36]: Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 Female 0.7 0.1 187
1 62 Male 10.9 5.5 699
2 62 Male 7.3 4.1 490
3 58 Male 1.0 0.4 182
4 72 Male 3.9 2.0 195

Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens \


0 16 18 6.8
1 64 100 7.5
2 60 68 7.0
3 14 20 6.8
4 27 59 7.3

Albumin Albumin_and_Globulin_Ratio Dataset


0 3.3 0.90 1
1 3.2 0.74 1
2 3.3 0.89 1
3 3.4 1.00 1
4 2.4 0.40 1

[37]: # Drop the feature columns that are not needed, as mentioned

df.drop(columns=['Gender', 'Aspartate_Aminotransferase', 'Albumin_and_Globulin_Ratio'], inplace=True)
df.head()

[37]: Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 0.7 0.1 187
1 62 10.9 5.5 699
2 62 7.3 4.1 490
3 58 1.0 0.4 182
4 72 3.9 2.0 195

Alamine_Aminotransferase Total_Protiens Albumin Dataset

0 16 6.8 3.3 1
1 64 7.5 3.2 1
2 60 7.0 3.3 1
3 14 6.8 3.4 1
4 27 7.3 2.4 1
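
The same seven-feature table can also be obtained by naming the columns to keep rather than the ones to drop, which makes the feature list from Step 2 explicit. The sketch below uses a two-row stand-in DataFrame (values copied from the first `df.head()` output) so that it runs without the CSV file; `raw`, `FEATURES`, `TARGET` and `selected` are illustrative names and are not part of the original notebook.

```python
import pandas as pd

# Stand-in for pd.read_csv('indian_liver_patient.csv'):
# the first two rows shown in the df.head() output (assumption for illustration).
raw = pd.DataFrame({
    "Age": [65, 62],
    "Gender": ["Female", "Male"],
    "Total_Bilirubin": [0.7, 10.9],
    "Direct_Bilirubin": [0.1, 5.5],
    "Alkaline_Phosphotase": [187, 699],
    "Alamine_Aminotransferase": [16, 64],
    "Aspartate_Aminotransferase": [18, 100],
    "Total_Protiens": [6.8, 7.5],
    "Albumin": [3.3, 3.2],
    "Albumin_and_Globulin_Ratio": [0.90, 0.74],
    "Dataset": [1, 1],
})

# The seven features named in Step 2, plus the label column.
FEATURES = [
    "Age", "Total_Bilirubin", "Direct_Bilirubin", "Alkaline_Phosphotase",
    "Alamine_Aminotransferase", "Total_Protiens", "Albumin",
]
TARGET = "Dataset"

# Select by name instead of dropping; .copy() avoids modifying `raw` later on.
selected = raw[FEATURES + [TARGET]].copy()
print(selected.columns.tolist())
```

Selecting by name fails loudly with a `KeyError` if a column is misspelled, which can be easier to debug than a silent leftover column.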

[38]: print(df)

value = ['Age', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase',
         'Alamine_Aminotransferase',
         'Aspartate_Aminotransferase', 'Albumin_and_Globulin_Ratio', 'Total_Protiens',
         'Albumin']

Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 0.7 0.1 187
1 62 10.9 5.5 699
2 62 7.3 4.1 490
3 58 1.0 0.4 182
4 72 3.9 2.0 195
.. … … … …
578 60 0.5 0.1 500
579 40 0.6 0.1 98
580 52 0.8 0.2 245
581 31 1.3 0.5 184
582 38 1.0 0.3 216

Alamine_Aminotransferase Total_Protiens Albumin Dataset


0 16 6.8 3.3 1
1 64 7.5 3.2 1
2 60 7.0 3.3 1
3 14 6.8 3.4 1
4 27 7.3 2.4 1
.. … … … …
578 20 5.9 1.6 2
579 35 6.0 3.2 1
580 48 6.4 3.2 1
581 29 6.8 3.4 1
582 21 7.3 4.4 2

[583 rows x 8 columns]

[39]: # Looking for missing values in the dataset


df.isna().sum()

[39]: Age 0
Total_Bilirubin 0
Direct_Bilirubin 0
Alkaline_Phosphotase 0

Alamine_Aminotransferase 0
Total_Protiens 0
Albumin 0
Dataset 0
dtype: int64

5 Analyze data frame shape, data types etc.

[40]: def analyze_dataframe(dataframe):
    print("\n Shape of df :: \n", dataframe.shape)
    print("\n Data types of df columns :: \n", dataframe.dtypes)
    print("\n Description of df :: \n", dataframe.describe())

analyze_dataframe(df)

Shape of df ::
(583, 8)

Data types of df columns ::


Age int64
Total_Bilirubin float64
Direct_Bilirubin float64
Alkaline_Phosphotase int64
Alamine_Aminotransferase int64
Total_Protiens float64
Albumin float64
Dataset int64
dtype: object

Description of df ::
Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \
count 583.000000 583.000000 583.000000 583.000000
mean 44.746141 3.298799 1.486106 290.576329
std 16.189833 6.209522 2.808498 242.937989
min 4.000000 0.400000 0.100000 63.000000
25% 33.000000 0.800000 0.200000 175.500000
50% 45.000000 1.000000 0.300000 208.000000
75% 58.000000 2.600000 1.300000 298.000000
max 90.000000 75.000000 19.700000 2110.000000

Alamine_Aminotransferase Total_Protiens Albumin Dataset


count 583.000000 583.000000 583.000000 583.000000
mean 80.713551 6.483190 3.141852 1.286449

std 182.620356 1.085451 0.795519 0.452490
min 10.000000 2.700000 0.900000 1.000000
25% 23.000000 5.800000 2.600000 1.000000
50% 35.000000 6.600000 3.100000 1.000000
75% 60.500000 7.200000 3.800000 2.000000
max 2000.000000 9.600000 5.500000 2.000000

[41]: # Rename the dataset column to result


df.rename(columns={'Dataset':'Result'}, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 583 non-null int64
1 Total_Bilirubin 583 non-null float64
2 Direct_Bilirubin 583 non-null float64
3 Alkaline_Phosphotase 583 non-null int64
4 Alamine_Aminotransferase 583 non-null int64
5 Total_Protiens 583 non-null float64
6 Albumin 583 non-null float64
7 Result 583 non-null int64
dtypes: float64(4), int64(4)
memory usage: 36.6 KB

[42]: # Having a look at the dataset after renaming the target column


df.head()

[42]: Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 0.7 0.1 187
1 62 10.9 5.5 699
2 62 7.3 4.1 490
3 58 1.0 0.4 182
4 72 3.9 2.0 195

Alamine_Aminotransferase Total_Protiens Albumin Result


0 16 6.8 3.3 1
1 64 7.5 3.2 1
2 60 7.0 3.3 1
3 14 6.8 3.4 1
4 27 7.3 2.4 1

6 Data Pre Processing and Visualization

[43]: # Dropping the missing values


df = df.dropna()

[44]: # Having a look at the correlation matrix

fig, ax = plt.subplots(figsize=(9,7))
sns.heatmap(df.corr(), annot=True, fmt='.1g', cmap="Greens", cbar=False);

[45]: sns.pairplot(df, hue='Result')

[45]: <seaborn.axisgrid.PairGrid at 0x7f33f0d994f0>

[46]: def count_healthy_unhealthy_livers_and_plot(dataframe):
    # Result == 1 marks a liver patient, Result == 2 a patient without liver disease
    print('Count of Unhealthy Livers : {} '.format(dataframe.Result.value_counts()[1]))
    print('Count of Healthy Livers : {} '.format(dataframe.Result.value_counts()[2]))

    # visualize the number of patients diagnosed with liver disease
    sns.countplot(data=dataframe, x='Result')

count_healthy_unhealthy_livers_and_plot(df)

Count of Unhealthy Livers : 416


Count of Healthy Livers : 167
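
The two classes are imbalanced, so it helps to know what accuracy a trivial model would reach before reading any classifier score. As a small worked calculation, the snippet below takes the two counts printed above and computes the accuracy of always predicting the majority class; the variable names are illustrative.

```python
# Class counts printed by the cell above.
n_unhealthy = 416  # Result == 1
n_healthy = 167    # Result == 2
n_total = n_unhealthy + n_healthy

# Accuracy obtained by always predicting the more frequent class.
baseline_accuracy = max(n_unhealthy, n_healthy) / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

On the full dataset this baseline is about 71%, so an accuracy figure is only informative to the extent that it exceeds that level.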

[47]: def pie_plot_draw():
    plt.pie(x=df["Result"].value_counts(),
            colors=["red", "green"],
            labels=["UnHealthy Liver", "Healthy Liver"],
            shadow=True)
    plt.show()

plt.style.use("seaborn")  # named "seaborn-v0_8" in Matplotlib 3.6 and later
fig, ax = plt.subplots(figsize=(7,7))
pie_plot_draw()

[48]: def histogram_plot_draw(col_name):
    sns.histplot(x=df[col_name], kde=True, color="blue")

histogram_plot_draw("Age")

[49]: # Feature matrix X (all columns except the target)
X = df.drop("Result", axis=1)
X.head()

[49]: Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 0.7 0.1 187
1 62 10.9 5.5 699
2 62 7.3 4.1 490
3 58 1.0 0.4 182
4 72 3.9 2.0 195

Alamine_Aminotransferase Total_Protiens Albumin


0 16 6.8 3.3
1 64 7.5 3.2
2 60 7.0 3.3
3 14 6.8 3.4
4 27 7.3 2.4

[50]: # Target vector y


y = df["Result"]
y.head()
X

[50]: Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \
0 65 0.7 0.1 187
1 62 10.9 5.5 699
2 62 7.3 4.1 490
3 58 1.0 0.4 182
4 72 3.9 2.0 195
.. … … … …
578 60 0.5 0.1 500
579 40 0.6 0.1 98
580 52 0.8 0.2 245
581 31 1.3 0.5 184
582 38 1.0 0.3 216

Alamine_Aminotransferase Total_Protiens Albumin


0 16 6.8 3.3
1 64 7.5 3.2
2 60 7.0 3.3
3 14 6.8 3.4
4 27 7.3 2.4
.. … … …
578 20 5.9 1.6
579 35 6.0 3.2
580 48 6.4 3.2
581 29 6.8 3.4
582 21 7.3 4.4

[583 rows x 7 columns]

7 Feature Selection

[51]: from sklearn.ensemble import ExtraTreesClassifier


from sklearn.feature_selection import SelectFromModel
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
print("Showing feature importance values")
print(clf.feature_importances_)

Showing feature importance values


[0.15332017 0.14923714 0.11684129 0.1573351 0.16197282 0.12395798
0.13733548]

[52]: model = SelectFromModel(clf, prefit=True)  # select features from the above classifier according to their importances
cols = X.columns.to_list()  # list of column names
tf = model.get_support()  # boolean mask of the features that were kept

selectedcols = []
for i in range(len(cols)):
    if tf[i]:
        selectedcols.append(cols[i])
print("showing selected columns")
print(selectedcols)
# keep only the selected columns
X_new = model.transform(X)
X_new.shape

showing selected columns


['Age', 'Total_Bilirubin', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase']

[52]: (583, 4)
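
Why exactly four columns survive can be checked by hand. When no `threshold` is passed, `SelectFromModel` keeps the features whose importance is at least the mean importance (this holds for estimators without an L1 penalty, such as `ExtraTreesClassifier`). The sketch below replays that rule on the importance values printed above; since the classifier was fitted without a `random_state`, those values, and possibly the selected set, can differ from run to run.

```python
# Column order of X and the importances printed by the feature-importance cell.
cols = [
    "Age", "Total_Bilirubin", "Direct_Bilirubin", "Alkaline_Phosphotase",
    "Alamine_Aminotransferase", "Total_Protiens", "Albumin",
]
importances = [0.15332017, 0.14923714, 0.11684129, 0.1573351,
               0.16197282, 0.12395798, 0.13733548]

# Default rule: keep a feature if its importance >= mean importance.
threshold = sum(importances) / len(importances)
selected = [c for c, imp in zip(cols, importances) if imp >= threshold]

print(f"threshold = {threshold:.4f}")
print(selected)
```

Because the importances sum to 1, the mean is 1/7 (about 0.1429), and the four features above that value are the ones listed in the output of the selection cell.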

8 Splitting the data into training and test datasets

Here, we are trying to predict whether the patient has an Unhealthy Liver or not using the given
data.
[53]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
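
With imbalanced classes, a plain random split can leave the test set with a somewhat different class mix than the training set. One common option, shown here only as a sketch and not as what the notebook does, is to pass `stratify` to `train_test_split`. The data below is synthetic: `y_demo` reproduces the 416/167 class counts and `X_demo` is random noise standing in for the four selected features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same class counts as the notebook,
# random numbers in place of the real feature values.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 416 + [2] * 167)
X_demo = rng.normal(size=(len(y_demo), 4))

# stratify=y_demo keeps the class proportions (almost) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print("share of class 1 in train:", round(float(np.mean(y_tr == 1)), 3))
print("share of class 1 in test: ", round(float(np.mean(y_te == 1)), 3))
```

The split sizes are the same as in the notebook (466 and 117); only the assignment of rows to the two parts changes.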

[54]: # Scaling the data

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

[55]: len(X_train), len(X_test)

[55]: (466, 117)

9 Random Forest Classifier

[56]: from sklearn.ensemble import RandomForestClassifier


rfc = RandomForestClassifier(n_estimators = 100)
rfc.fit(X_train,y_train)

[56]: RandomForestClassifier()
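
The forest above is built without a `random_state`, so its trees, and therefore its accuracy, can change slightly each time the cell is run. A minimal sketch of fixing the seed is given below; it uses synthetic data from `make_classification` (an assumption, so that it runs without the CSV), and the names `X_demo`, `y_demo`, `rf_a` and `rf_b` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with four features (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Two forests with the same seed are built from the same bootstrap samples
# and the same random feature choices.
rf_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

same = bool((rf_a.predict_proba(X_demo) == rf_b.predict_proba(X_demo)).all())
print("identical predictions:", same)
```

Fixing the seed makes a reported score repeatable; it does not make the score more representative of other splits.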

[57]: RandomForestClassifierScore = rfc.score(X_test, y_test)
print("Accuracy obtained by Random Forest Classifier model:", RandomForestClassifierScore*100)

Accuracy obtained by Random Forest Classifier model: 74.35897435897436

[58]: # Having a look at the confusion matrix


from sklearn.metrics import classification_report,confusion_matrix
y_pred_rfc = rfc.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred_rfc)
sns.heatmap(cf_matrix, annot=True, cmap="Spectral")
plt.title("Confusion Matrix for Random Forest Classifier", fontsize=14, fontname="DejaVu Sans", y=1.03);

[59]: # Classification report of Random Forest Classifier

print(metrics.classification_report(y_test, y_pred_rfc))

              precision    recall  f1-score   support

           1       0.82      0.84      0.83        87
           2       0.50      0.47      0.48        30

    accuracy                           0.74       117
   macro avg       0.66      0.65      0.66       117
weighted avg       0.74      0.74      0.74       117
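
To see how the numbers in this report relate to a confusion matrix, the sketch below recomputes precision, recall and F1 from four counts. The heatmap itself is not reproduced in this text, so the counts are an assumption: they are the integer values consistent with the rounded report and the 74.36% accuracy, not figures read from the notebook. The helper `report_row` is hypothetical.

```python
# Confusion-matrix counts inferred from the report above (assumption).
# Keys are (true class, predicted class).
cm = {
    (1, 1): 73, (1, 2): 14,   # true class 1, support 87
    (2, 1): 16, (2, 2): 14,   # true class 2, support 30
}

def report_row(cls, other):
    """Return (precision, recall, f1) for class `cls`, rounded to 2 decimals."""
    tp = cm[(cls, cls)]      # correctly predicted as cls
    fn = cm[(cls, other)]    # cls predicted as the other class
    fp = cm[(other, cls)]    # other class predicted as cls
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

accuracy = (cm[(1, 1)] + cm[(2, 2)]) / sum(cm.values())
print("class 1:", report_row(1, 2))
print("class 2:", report_row(2, 1))
print("accuracy:", round(accuracy, 4))
```

Under these assumed counts, fewer than half of the healthy patients (class 2) are recognised, which the overall accuracy alone does not show.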

10 Ada Boost Classifier

[60]: bdt = AdaBoostClassifier()


bdt.fit(X_train, y_train)

[60]: AdaBoostClassifier()

[61]: AdaBoostClassifierScore = bdt.score(X_test, y_test)
print("Accuracy obtained by Ada Boost Classifier model:", AdaBoostClassifierScore*100)

Accuracy obtained by Ada Boost Classifier model: 77.77777777777779

[62]: # Confusion matrix


y_pred_bdt = bdt.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred_bdt)
sns.heatmap(cf_matrix, annot=True, cmap="Spectral")
plt.title("Confusion Matrix for Ada Boost Classifier", fontsize=14, fontname="DejaVu Sans", y=1.03);

[63]: # Classification Report of Ada Boost Classifier

print(metrics.classification_report(y_test, y_pred_bdt))

              precision    recall  f1-score   support

           1       0.85      0.85      0.85        87
           2       0.57      0.57      0.57        30

    accuracy                           0.78       117
   macro avg       0.71      0.71      0.71       117
weighted avg       0.78      0.78      0.78       117
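
`AdaBoostClassifier()` is called above with no arguments, so it may be worth spelling out what that implies. In scikit-learn the defaults are 50 boosting rounds, a learning rate of 1.0 and depth-1 decision trees (stumps) as weak learners. The sketch below writes the first two defaults out explicitly, adds a `random_state`, and inspects the fitted stumps; the data is synthetic (`make_classification`), so any scores it would give say nothing about the liver dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data with some label noise (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.1, random_state=0,
)

# Same settings as AdaBoostClassifier(), written out, plus a fixed seed.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_demo, y_demo)

# Each weak learner is a decision tree; with the defaults they are stumps.
depths = {est.get_depth() for est in ada.estimators_}
print("number of weak learners:", len(ada.estimators_))
print("tree depths used:", depths)
```

`n_estimators` and `learning_rate` are the usual first parameters to vary when tuning this model.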

11 K-Nearest Neighbors (KNN) Classifier

[64]: from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

[64]: KNeighborsClassifier(n_neighbors=3)

[65]: KNeighborsClassifierScore = knn.score(X_test, y_test)
print("Accuracy obtained by KNN Classifier model:", KNeighborsClassifierScore*100)

Accuracy obtained by KNN Classifier model: 70.94017094017094

[66]: y_pred_knn = knn.predict(X_test)


cf_matrix = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cf_matrix, annot=True, cmap="Spectral")
plt.title("Confusion Matrix for KNN Classifier", fontsize=14, fontname="DejaVu Sans", y=1.03);

[67]: # Classification Report of KNN Classifier

print(metrics.classification_report(y_test, y_pred_knn))

              precision    recall  f1-score   support

           1       0.80      0.80      0.80        87
           2       0.43      0.43      0.43        30

    accuracy                           0.71       117
   macro avg       0.62      0.62      0.62       117
weighted avg       0.71      0.71      0.71       117
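
The value `n_neighbors=3` is fixed by hand above. One possible way to choose it, sketched here under stated assumptions rather than taken from the notebook, is to compare a few values by cross-validation, with the scaler placed inside a pipeline so that it is refitted on each training fold. The data is synthetic (`make_classification`), and `cv_accuracy` and `best_k` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Mean 5-fold cross-validated accuracy for several neighbourhood sizes.
cv_accuracy = {}
for k in (1, 3, 5, 7, 9):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_accuracy[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy)
print("best k:", best_k)
```

Scaling matters for KNN because it relies on distances; the tree-based models in the earlier sections are largely insensitive to it.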

12 Plot and Compare Results of Classifiers

[68]: def plot_and_compare_results():
    plt.style.use("seaborn")  # named "seaborn-v0_8" in Matplotlib 3.6 and later
    classifiers = ["Ada Boost Classifier",
                   "KNN Classifier",
                   "Random Forest Classifier"]

    scores = [AdaBoostClassifierScore,
              KNeighborsClassifierScore,
              RandomForestClassifierScore]

    fig, ax = plt.subplots(figsize=(8,6))
    sns.barplot(x=classifiers, y=scores);
    plt.ylabel("Model Accuracy")
    plt.xticks(rotation=60)
    plt.title("Model Comparison - Model Accuracy", fontsize=15, fontname="DejaVu Sans", y=1.04);

plot_and_compare_results()
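
Since the bar chart is not reproduced in this text, the comparison can also be read directly from the printed accuracies. The sketch below converts each accuracy back into a number of correctly classified test patients and compares it with what always predicting class 1 would achieve on this test set (87 of 117, the class-1 support in the reports above). Variable names are illustrative.

```python
n_test = 117
majority_in_test = 87  # support of class 1 in the classification reports

# Accuracies (in percent) as printed by the three models above.
accuracies = {
    "Random Forest": 74.35897435897436,
    "AdaBoost": 77.77777777777779,
    "KNN": 70.94017094017094,
}

# Convert each accuracy into a count of correct predictions on the test set.
correct_counts = {name: round(acc * n_test / 100) for name, acc in accuracies.items()}

for name, correct in correct_counts.items():
    diff = correct - majority_in_test
    print(f"{name}: {correct}/{n_test} correct ({diff:+d} vs. always predicting class 1)")
```

On this particular split AdaBoost classifies 4 more patients correctly than the random forest, which matches the majority-class count, and KNN 4 fewer. With a test set of 117 patients these gaps are small, so the ranking may not hold on a different split.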
