Razi AML Assignment2
1 ASSIGNMENT 2 - AML
https://www.kaggle.com/uciml/indian-liver-patient-records
Step 2: Use the following 7 features from this dataset:
Age, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase,
Total_Protiens, Albumin
Step 3: Your task is to predict whether a patient suffers from liver disease using the above features.
Split your data into train and test sets. First, use a random forest algorithm to perform this task.
Then, use an AdaBoost classifier to perform the same task. Compare the accuracy of the two
algorithms.
[35]: #Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv('indian_liver_patient.csv')
df.head()
,→,inplace=True)
df.head()
0 16 6.8 3.3 1
1 64 7.5 3.2 1
2 60 7.0 3.3 1
3 14 6.8 3.4 1
4 27 7.3 2.4 1
,→'Aspartate_Aminotransferase','Albumin_and_Globulin_Ratio','Total_Protiens',␣
,→'Albumin']
[39]: Age 0
Total_Bilirubin 0
Direct_Bilirubin 0
Alkaline_Phosphotase 0
Alamine_Aminotransferase 0
Total_Protiens 0
Albumin 0
Dataset 0
dtype: int64
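The cells that select the seven features and rename the target column did not survive in this export. Below is a minimal sketch of that preparation step, assuming the Kaggle column names and assuming the `Dataset` column was renamed to `Result` (as the later `df.info()` output suggests). `prepare_liver_frame` and `FEATURES` are hypothetical names, and the two-row frame is only there to show the call; its values for columns not displayed in this notebook are placeholders.

```python
import pandas as pd

# The seven features named in Step 2 (spellings follow the Kaggle CSV).
FEATURES = [
    "Age", "Total_Bilirubin", "Direct_Bilirubin", "Alkaline_Phosphotase",
    "Alamine_Aminotransferase", "Total_Protiens", "Albumin",
]

def prepare_liver_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the seven assignment features plus the target, renamed to 'Result'."""
    kept = raw[FEATURES + ["Dataset"]].copy()
    return kept.rename(columns={"Dataset": "Result"})

# Tiny illustrative frame with the Kaggle column layout (placeholder values).
demo_raw = pd.DataFrame({
    "Age": [65, 62], "Gender": ["Female", "Male"],
    "Total_Bilirubin": [0.7, 10.9], "Direct_Bilirubin": [0.1, 5.5],
    "Alkaline_Phosphotase": [187, 699], "Alamine_Aminotransferase": [16, 64],
    "Aspartate_Aminotransferase": [18, 100], "Total_Protiens": [6.8, 7.5],
    "Albumin": [3.3, 3.2], "Albumin_and_Globulin_Ratio": [0.9, 0.74],
    "Dataset": [1, 1],
})
demo = prepare_liver_frame(demo_raw)

# Same check as the cell above: number of missing values per column.
print(demo.isnull().sum())
```

With the real file, the call would presumably be `df = prepare_liver_frame(pd.read_csv('indian_liver_patient.csv'))`.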
analyze_dataframe(df)
Shape of df ::
(583, 8)
Description of df ::
Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \
count 583.000000 583.000000 583.000000 583.000000
mean 44.746141 3.298799 1.486106 290.576329
std 16.189833 6.209522 2.808498 242.937989
min 4.000000 0.400000 0.100000 63.000000
25% 33.000000 0.800000 0.200000 175.500000
50% 45.000000 1.000000 0.300000 208.000000
75% 58.000000 2.600000 1.300000 298.000000
max 90.000000 75.000000 19.700000 2110.000000
std 182.620356 1.085451 0.795519 0.452490
min 10.000000 2.700000 0.900000 1.000000
25% 23.000000 5.800000 2.600000 1.000000
50% 35.000000 6.600000 3.100000 1.000000
75% 60.500000 7.200000 3.800000 2.000000
max 2000.000000 9.600000 5.500000 2.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 583 non-null int64
1 Total_Bilirubin 583 non-null float64
2 Direct_Bilirubin 583 non-null float64
3 Alkaline_Phosphotase 583 non-null int64
4 Alamine_Aminotransferase 583 non-null int64
5 Total_Protiens 583 non-null float64
6 Albumin 583 non-null float64
7 Result 583 non-null int64
dtypes: float64(4), int64(4)
memory usage: 36.6 KB
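The definition of `analyze_dataframe` is not included in this export. Judging by the output above, it prints the shape, the summary statistics, and the column info. A minimal sketch under that assumption (the body is a guess, not the original cell; the small frame is illustrative only):

```python
import pandas as pd

def analyze_dataframe(df: pd.DataFrame) -> None:
    """Print the shape, summary statistics, and column info of a DataFrame."""
    print("Shape of df ::")
    print(df.shape)
    print("Description of df ::")
    print(df.describe())
    df.info()  # info() writes to stdout itself and returns None

# Small illustrative frame, only to show the call.
demo = pd.DataFrame({"Age": [65, 62, 58], "Albumin": [3.3, 3.2, 3.4]})
analyze_dataframe(demo)
```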
6 Data Preprocessing and Visualization
fig, ax = plt.subplots(figsize=(9,7))
sns.heatmap(df.corr(), annot=True, fmt='.1g', cmap="Greens", cbar=False);
[46]: def count_healthy_unhealthy_livers_and_plot(dataframe):
          print('Count of Unhealthy Livers : {} '.format(dataframe.Result.value_counts()[1]))
      count_healthy_unhealthy_livers_and_plot(df)
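[47]: def pie_plot_draw():
          # value_counts() sorts by frequency, so the larger class (1) comes first
          plt.pie(x=df["Result"].value_counts(),
                  colors=["red","green"],
                  labels=["UnHealthy Liver","Healthy Liver"],
                  shadow = True)
          plt.show()
      # On matplotlib >= 3.6 this style is named "seaborn-v0_8"
      plt.style.use("seaborn")
      fig, ax = plt.subplots(figsize=(7,7))
      pie_plot_draw()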
[48]: def histogram_plot_draw(col_name):
          sns.histplot(x=df[col_name], kde=True, color="blue")
      histogram_plot_draw("Age")
[49]: # X axis data
X = df.drop("Result", axis=1)
X.head()
[50]: Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \
0 65 0.7 0.1 187
1 62 10.9 5.5 699
2 62 7.3 4.1 490
3 58 1.0 0.4 182
4 72 3.9 2.0 195
.. … … … …
578 60 0.5 0.1 500
579 40 0.6 0.1 98
580 52 0.8 0.2 245
581 31 1.3 0.5 184
582 38 1.0 0.3 216
7 Feature Selection
selectedcols=[]
for i in range(len(cols)):
    if tf[i]:
        selectedcols.append(cols[i])
print("showing selected columns")
print(selectedcols)
#converting the data
X_new = model.transform(X)
X_new.shape
[52]: (583, 4)
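The start of the feature-selection cell, where `cols`, `tf`, and `model` are defined, is missing from this export, so the selector actually used is unknown. The sketch below shows one way those names could be set up, assuming scikit-learn's `SelectFromModel` wrapped around a random forest. That choice is an assumption, not the notebook's confirmed method, and the data is synthetic rather than the liver dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the seven liver features (not the real data).
X_arr, y = make_classification(n_samples=200, n_features=7,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(7)])

# Assumed selector: keeps features whose importance is at least the mean importance.
model = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X, y)

cols = X.columns
tf = model.get_support()  # boolean mask, True for every kept column
selectedcols = [c for c, keep in zip(cols, tf) if keep]
print("showing selected columns")
print(selectedcols)

# Reduce X to the selected columns only.
X_new = model.transform(X)
print(X_new.shape)
```

With the default mean-importance threshold, the number of kept columns depends on the data; the notebook's run kept 4 of the 7 features.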
Here, we are trying to predict whether the patient has an Unhealthy Liver or not using the given
data.
[53]: from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
[56]: RandomForestClassifier()
[57]: RandomForestClassifierScore = rfc.score(X_test, y_test)
      print("Accuracy obtained by Random Forest Classifier model: ", RandomForestClassifierScore*100)
      print(metrics.classification_report(y_test, y_pred_rfc))
   precision  recall  f1-score  support
1       0.82    0.84      0.83       87
2       0.50    0.47      0.48       30
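The cells that fit `rfc` and compute `y_pred_rfc` are not part of this export. A minimal sketch of the train, predict, and score sequence, assuming default hyperparameters (as the `RandomForestClassifier()` output suggests) and using synthetic data in place of `X_new` and `y`. The `random_state` on the classifier is an added assumption for reproducibility.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's X_new: (583, 4).
X_new, y = make_classification(n_samples=583, n_features=4, n_informative=3,
                               n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(random_state=42)  # random_state is an assumption
rfc.fit(X_train, y_train)
y_pred_rfc = rfc.predict(X_test)

# score() returns the mean accuracy on the held-out test set.
RandomForestClassifierScore = rfc.score(X_test, y_test)
print("Accuracy obtained by Random Forest Classifier model: ",
      RandomForestClassifierScore * 100)
print(metrics.classification_report(y_test, y_pred_rfc))
```

A 20% split of 583 rows gives 117 test rows, which matches the support column in the report above (87 + 30).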
[60]: AdaBoostClassifier()
[63]: # Classification Report of Ada Boost Classifier
print(metrics.classification_report(y_test, y_pred_bdt))
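The AdaBoost fitting cells are also missing here. The name `y_pred_bdt` and the `DecisionTreeClassifier` import hint at a boosted decision tree, so the sketch below assumes depth-1 trees (stumps) as the base learner, which is also scikit-learn's default. The data is synthetic, and `n_estimators` and `random_state` are illustrative choices.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's X_new and y.
X_new, y = make_classification(n_samples=583, n_features=4, n_informative=3,
                               n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y, test_size=0.2, random_state=42)

# The base learner is passed positionally because its keyword name differs
# between scikit-learn releases (base_estimator in older ones, estimator later).
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=42)
bdt.fit(X_train, y_train)
y_pred_bdt = bdt.predict(X_test)

AdaBoostClassifierScore = bdt.score(X_test, y_test)
print("Accuracy obtained by AdaBoost Classifier model: ",
      AdaBoostClassifierScore * 100)
print(metrics.classification_report(y_test, y_pred_bdt))
```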
11 KNeighborsClassifier
[64]: KNeighborsClassifier(n_neighbors=3)
[67]: # Classification Report of KNN Classifier
print(metrics.classification_report(y_test, y_pred_knn))
scores = [AdaBoostClassifierScore,
KNeighborsClassifierScore,
RandomForestClassifierScore]
fig, ax = plt.subplots(figsize=(8,6))
sns.barplot(x=classifiers,y=scores);
plt.ylabel("Model Accuracy")
plt.xticks(rotation=60)
plt.title("Model Comparison - Model Accuracy", fontsize=15, fontname="DejaVu Sans", y=1.04);
plot_and_compare_results()
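The `def` line of `plot_and_compare_results` and the `classifiers` label list are cut off in this export. The sketch below shows the comparison end to end under stated assumptions: the three models are fitted on synthetic data, the label strings are made up, and plain matplotlib bars stand in for `sns.barplot` so the example needs no seaborn install.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's X_new and y.
X_new, y = make_classification(n_samples=583, n_features=4, n_informative=3,
                               n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y, test_size=0.2, random_state=42)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "KNeighbors": KNeighborsClassifier(n_neighbors=3),
    "RandomForest": RandomForestClassifier(random_state=42),
}
# Fit each model and record its test-set accuracy.
scores_by_model = {name: m.fit(X_train, y_train).score(X_test, y_test)
                   for name, m in models.items()}

def plot_and_compare_results(scores_by_model):
    """Draw one bar per classifier showing its test accuracy."""
    classifiers = list(scores_by_model)
    scores = [scores_by_model[name] for name in classifiers]
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.bar(classifiers, scores)
    ax.set_ylabel("Model Accuracy")
    ax.tick_params(axis="x", labelrotation=60)
    ax.set_title("Model Comparison - Model Accuracy", fontsize=15, y=1.04)
    return fig

fig = plot_and_compare_results(scores_by_model)
print(scores_by_model)
```

Accuracy alone can flatter a model on this dataset, since the classes are imbalanced (87 versus 30 test rows in the report above), so the per-class recall in each classification report is worth reading alongside the bar chart.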