3 - Analysis of Default - Ipynb - Colab
import pandas as pd
import matplotlib.pyplot as plt
gc=pd.read_csv("/Users/nitinsaraswat/Documents/AON/decision trees/data/german_credit_data.csv")
gc.head()
5 rows × 23 columns
gc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 23 columns):
Customer_ID 5000 non-null int64
Status_Checking_Acc 5000 non-null object
Duration_in_Months 5000 non-null int64
Credit_History 5000 non-null object
Purposre_Credit_Taken 5000 non-null object
Credit_Amount 5000 non-null int64
Savings_Acc 5000 non-null object
Years_At_Present_Employment 5000 non-null object
Inst_Rt_Income 5000 non-null int64
Marital_Status_Gender 5000 non-null object
Other_Debtors_Guarantors 5000 non-null object
Current_Address_Yrs 5000 non-null int64
Property 5000 non-null object
Age 5000 non-null int64
Other_Inst_Plans 5000 non-null object
Housing 5000 non-null object
Num_CC 5000 non-null int64
Job 5000 non-null object
Dependents 5000 non-null int64
Telephone 5000 non-null object
Foreign_Worker 5000 non-null object
Default_On_Payment 5000 non-null int64
Count 5000 non-null int64
dtypes: int64(10), object(13)
memory usage: 898.5+ KB
gc.describe()
https://colab.research.google.com/drive/1q4mVLgoQySfROIe0pubvF9IVBxM7XLLo#printMode=true 1/16
1/4/25, 11:39 AM 3_Analysis of Default.ipynb - Colab
gc['Default_On_Payment'].value_counts().plot(kind="bar")
<matplotlib.axes._subplots.AxesSubplot at 0x102bdeb00>
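The bar chart above shows the class balance only visually. A minimal sketch of reading the same information as numbers follows; the toy series stands in for gc['Default_On_Payment'] and its values are made up for illustration.

```python
import pandas as pd

# Toy target column standing in for gc['Default_On_Payment'] (values are made up)
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Default_On_Payment")

counts = target.value_counts()                # absolute number of rows per class
shares = target.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

Knowing the share of each class helps when reading an accuracy figure later, since a model that always predicts the majority class already reaches that share.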
import plotnine as pn
pd.read
pd.__version__
'0.23.4'
# The graph below does not interpret Inst_Rt_Income properly. What should be done?
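# gc2 is another name for gc, not a copy, so the conversions below also change gc
gc2=gc
# Cast low-cardinality numeric columns to strings so that they are treated as categories
gc2['Inst_Rt_Income']=gc2['Inst_Rt_Income'].astype(str)
gc2['Current_Address_Yrs']=gc2['Current_Address_Yrs'].astype(str)
gc2['Dependents']=gc2['Dependents'].astype(str)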
gc2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 23 columns):
Customer_ID 5000 non-null int64
Status_Checking_Acc 5000 non-null object
Duration_in_Months 5000 non-null int64
Credit_History 5000 non-null object
Purposre_Credit_Taken 5000 non-null object
Credit_Amount 5000 non-null int64
Savings_Acc 5000 non-null object
Years_At_Present_Employment 5000 non-null object
Inst_Rt_Income 5000 non-null object
Marital_Status_Gender 5000 non-null object
Other_Debtors_Guarantors 5000 non-null object
Current_Address_Yrs 5000 non-null object
Property 5000 non-null object
Age 5000 non-null int64
Other_Inst_Plans 5000 non-null object
Housing 5000 non-null object
Num_CC 5000 non-null int64
Job 5000 non-null object
Dependents 5000 non-null object
Telephone 5000 non-null object
Foreign_Worker 5000 non-null object
Default_On_Payment 5000 non-null int64
Count 5000 non-null int64
dtypes: int64(7), object(16)
memory usage: 898.5+ KB
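Because gc2=gc binds a second name to the same DataFrame, the string conversions above also show up in gc, which is why later cells that use gc see these columns as object. The sketch below contrasts that with an explicit .copy(); the toy frame and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame; the column name mirrors the notebook, the values are made up
df = pd.DataFrame({"Dependents": [1, 2, 1]})

alias = df  # same object as df, like gc2 = gc
alias["Dependents"] = alias["Dependents"].astype(str)
# The original is no longer an integer column
alias_changed_original = not pd.api.types.is_integer_dtype(df["Dependents"])

df2 = pd.DataFrame({"Dependents": [1, 2, 1]})
independent = df2.copy()  # separate object with its own data
independent["Dependents"] = independent["Dependents"].astype(str)
# The original keeps its integer column
copy_changed_original = not pd.api.types.is_integer_dtype(df2["Dependents"])

print(alias is df, alias_changed_original)        # True True
print(independent is df2, copy_changed_original)  # False False
```

Whether the aliasing is wanted depends on the analysis; here the rest of the notebook relies on gc carrying the converted columns.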
#gc2=gc
#gc2['Num_CC']=gc2['Num_CC'].apply(str)
plt.rcParams['figure.figsize'] = [10, 7]
gc.boxplot(column=['Credit_Amount'], by=['Default_On_Payment'])
<matplotlib.axes._subplots.AxesSubplot at 0x120c29c18>
from IPython.display import Image
Image("/Users/nitinsaraswat/Desktop/bp.png")
## Points below the bottom whisker are outliers (less than Q1 - 1.5 * IQR, where IQR = Q3 - Q1)
## Points above the top whisker are outliers (more than Q3 + 1.5 * IQR)
gc.boxplot(column=['Duration_in_Months'], by=['Default_On_Payment'])
<matplotlib.axes._subplots.AxesSubplot at 0x120c53588>
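The outlier rule that a box plot draws by default can also be checked numerically. The sketch below computes the 1.5 * IQR fences on a toy series; the values are made up and merely stand in for a column such as Credit_Amount.

```python
import pandas as pd

# Toy values standing in for a numeric column (made up for illustration)
values = pd.Series([250, 900, 1200, 1500, 1800, 2100, 2600, 3000, 3900, 18000])

q1 = values.quantile(0.25)  # 25th percentile
q3 = values.quantile(0.75)  # 75th percentile
iqr = q3 - q1               # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a box plot draws individually
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```

The whiskers themselves end at the most extreme data points that still lie inside the fences, so they are usually shorter than the fences.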
X=gc.drop(columns=['Customer_ID','Default_On_Payment'],axis=1)
y=gc['Default_On_Payment']
X.head()
5 rows × 21 columns
y.head()
0 0
1 0
2 0
3 0
4 1
Name: Default_On_Payment, dtype: int64
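The cell that creates X_train and y_train does not appear in this printout. The 3750 training rows reported below, out of 5000, are consistent with a 75/25 split, so a sketch of such a split on toy data is given here. The toy frame, the variable names and the random_state value are assumptions, not the notebook's own settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with made-up values
X_toy = pd.DataFrame({
    "Age": range(20, 40),
    "Credit_Amount": range(1000, 3000, 100),
})
y_toy = pd.Series([0, 0, 0, 1] * 5, name="Default_On_Payment")

# Hold out 25% of the rows for testing; random_state only makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)
```

The split shuffles the rows, which is why the index of a training frame no longer runs from 0 upwards.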
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3750 entries, 3186 to 235
Data columns (total 21 columns):
Status_Checking_Acc 3750 non-null object
Duration_in_Months 3750 non-null int64
Credit_History 3750 non-null object
Purposre_Credit_Taken 3750 non-null object
Credit_Amount 3750 non-null int64
Savings_Acc 3750 non-null object
Years_At_Present_Employment 3750 non-null object
Inst_Rt_Income 3750 non-null object
Marital_Status_Gender 3750 non-null object
Other_Debtors_Guarantors 3750 non-null object
Current_Address_Yrs 3750 non-null object
Property 3750 non-null object
Age 3750 non-null int64
Other_Inst_Plans 3750 non-null object
Housing 3750 non-null object
Num_CC 3750 non-null int64
Job 3750 non-null object
Dependents 3750 non-null object
Telephone 3750 non-null object
Foreign_Worker 3750 non-null object
Count 3750 non-null int64
dtypes: int64(5), object(16)
memory usage: 644.5+ KB
from sklearn import tree
model = tree.DecisionTreeClassifier()
model
model.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-d768f88d541e> in <module>()
----> 1 model.fit(X_train, y_train)
2 frames
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy,
force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
gc4=pd.get_dummies(gc)
# The error above shows that DecisionTreeClassifier cannot use string columns; categorical variables must be encoded as numbers first
# Create dummy (one-hot) variables with get_dummies
gc3=pd.get_dummies(gc,columns=['Status_Checking_Acc','Credit_History','Purposre_Credit_Taken','Savings_Acc', \
'Years_At_Present_Employment','Marital_Status_Gender','Other_Debtors_Guarantors', \
'Property','Other_Inst_Plans ','Housing','Job','Telephone','Foreign_Worker',
'Inst_Rt_Income','Num_CC','Dependents','Current_Address_Yrs'],drop_first=True)
gc3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 57 columns):
Customer_ID 5000 non-null int64
Duration_in_Months 5000 non-null int64
Credit_Amount 5000 non-null int64
Age 5000 non-null int64
Default_On_Payment 5000 non-null int64
Count 5000 non-null int64
Status_Checking_Acc_A12 5000 non-null uint8
Status_Checking_Acc_A13 5000 non-null uint8
Status_Checking_Acc_A14 5000 non-null uint8
Credit_History_A31 5000 non-null uint8
Credit_History_A32 5000 non-null uint8
Credit_History_A33 5000 non-null uint8
Credit_History_A34 5000 non-null uint8
Purposre_Credit_Taken_A41 5000 non-null uint8
Purposre_Credit_Taken_A410 5000 non-null uint8
Purposre_Credit_Taken_A42 5000 non-null uint8
Purposre_Credit_Taken_A43 5000 non-null uint8
Purposre_Credit_Taken_A44 5000 non-null uint8
Purposre_Credit_Taken_A45 5000 non-null uint8
Purposre_Credit_Taken_A46 5000 non-null uint8
Purposre_Credit_Taken_A48 5000 non-null uint8
Purposre_Credit_Taken_A49 5000 non-null uint8
Savings_Acc_A62 5000 non-null uint8
Savings_Acc_A63 5000 non-null uint8
Savings_Acc_A64 5000 non-null uint8
Savings_Acc_A65 5000 non-null uint8
Years_At_Present_Employment_A72 5000 non-null uint8
Years_At_Present_Employment_A73 5000 non-null uint8
Years_At_Present_Employment_A74 5000 non-null uint8
Years_At_Present_Employment_A75 5000 non-null uint8
Marital_Status_Gender_A92 5000 non-null uint8
Marital_Status_Gender_A93 5000 non-null uint8
Marital_Status_Gender_A94 5000 non-null uint8
Other_Debtors_Guarantors_A102 5000 non-null uint8
Other_Debtors_Guarantors_A103 5000 non-null uint8
Property_A122 5000 non-null uint8
Property_A123 5000 non-null uint8
Property_A124 5000 non-null uint8
Other_Inst_Plans _A142 5000 non-null uint8
Other_Inst_Plans _A143 5000 non-null uint8
Housing_A152 5000 non-null uint8
Housing_A153 5000 non-null uint8
Job_A172 5000 non-null uint8
Job_A173 5000 non-null uint8
Job_A174 5000 non-null uint8
Telephone_A192 5000 non-null uint8
Foreign_Worker_A202 5000 non-null uint8
Inst_Rt_Income_2 5000 non-null uint8
Inst_Rt_Income_3 5000 non-null uint8
Inst_Rt_Income_4 5000 non-null uint8
Num_CC_2 5000 non-null uint8
Num_CC_3 5000 non-null uint8
Num_CC_4 5000 non-null uint8
Dependents_2 5000 non-null uint8
Current_Address_Yrs_2 5000 non-null uint8
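The failure and the fix can be reproduced on a small scale. The sketch below fits a tree on a raw string column, which fails, and then on the get_dummies output, which works. The toy frame is made up; only the Housing codes mirror the notebook's data.

```python
import pandas as pd
from sklearn import tree

# Toy frame with one categorical and one numeric feature (values are made up)
toy = pd.DataFrame({
    "Housing": ["A151", "A152", "A153", "A152", "A151", "A152"],
    "Age": [25, 40, 33, 51, 29, 46],
    "Default_On_Payment": [1, 0, 0, 0, 1, 0],
})
X_raw = toy.drop(columns=["Default_On_Payment"])
y_toy = toy["Default_On_Payment"]

# Fitting on the raw string column fails because the values cannot become floats
try:
    tree.DecisionTreeClassifier().fit(X_raw, y_toy)
    raw_fit_failed = False
except (ValueError, TypeError):
    raw_fit_failed = True

# One-hot encode; drop_first=True drops the first level (A151) as the baseline
X_enc = pd.get_dummies(X_raw, columns=["Housing"], drop_first=True)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_enc, y_toy)

print(raw_fit_failed, list(X_enc.columns))
```

With drop_first=True a column with k levels produces k - 1 indicator columns, which matches the pattern in the gc3.info() listing above, where the first code of each variable is absent.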
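# Rebuild the feature matrix and target from the one-hot encoded frame
X=gc3.drop(columns=['Customer_ID','Default_On_Payment'])
y=gc3['Default_On_Payment']
# The split cell is not shown in this printout; a 75/25 split is assumed here
# because the earlier training frame had 3750 of the 5000 rows
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = tree.DecisionTreeClassifier(max_depth=10,max_features=7)
# fit() returns the fitted estimator, so model1 and model are the same object
model1=model.fit(X_train, y_train)
y_predict = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)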
0.82
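# confusion_matrix orders rows and columns by sorted label (0, then 1);
# assuming Default_On_Payment == 1 marks a default, non-default comes first
from sklearn.metrics import confusion_matrix
pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Predicted Non-Default', 'Predicted Default'],
    index=['Actual Non-Default', 'Actual Default']
)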
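The layout of the matrix returned by confusion_matrix is easy to mislabel, so a small check on toy labels may help. The lists below are made up; passing labels=[0, 1] makes the ordering explicit, and treating 1 as "default" is an assumption about the coding of the target.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels (made up): 1 is assumed to mean default, 0 non-default
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, both in the order given by labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

table = pd.DataFrame(
    cm,
    index=["Actual Non-Default", "Actual Default"],
    columns=["Predicted Non-Default", "Predicted Default"],
)
print(table)
```

Here the top-left cell counts correctly predicted non-defaults and the bottom-right cell counts correctly predicted defaults.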
roc_auc
0.7636675581731855
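The cell that computes roc_auc is not part of this printout, so it is unknown whether the value above comes from hard predictions or from predicted probabilities. One common way to obtain an area under the ROC curve is sketched below on toy scores; the data and variable names are made up.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy ground truth and predicted scores for the positive class (made up)
y_true_toy = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold and returns the resulting rates
fpr, tpr, thresholds = roc_curve(y_true_toy, scores)

# auc integrates the curve with the trapezoidal rule
area = auc(fpr, tpr)

# roc_auc_score computes the same quantity in one call
area_direct = roc_auc_score(y_true_toy, scores)

print(area, area_direct)
```

With a fitted classifier, the scores would typically be the second column of predict_proba, that is, the predicted probability of class 1.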
y_true=y_test
y_probas=model.predict_proba(X_test)
import scikitplot as skplt
skplt.metrics.plot_roc(y_true, y_probas)
plt.show()