
Machine Learning using Python

Chapter 5: Classification Problems

5.1 Classification Overview

5.3 Credit Classification


import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

credit_df = pd.read_csv( "German Credit Data.csv" )


credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
checkin_acc 1000 non-null object
duration 1000 non-null int64
credit_history 1000 non-null object
amount 1000 non-null int64
savings_acc 1000 non-null object
present_emp_since 1000 non-null object
inst_rate 1000 non-null int64
personal_status 1000 non-null object
residing_since 1000 non-null int64
age 1000 non-null int64
inst_plans 1000 non-null object
num_credits 1000 non-null int64
job 1000 non-null object
status 1000 non-null int64
dtypes: int64(7), object(7)
memory usage: 109.5+ KB

credit_df.iloc[0:5,1:7]

duration credit_history amount savings_acc present_emp_since inst_rate

0 6 A34 1169 A65 A75 4

1 48 A32 5951 A61 A73 2

2 12 A34 2096 A61 A74 2

3 42 A32 7882 A61 A74 2

4 24 A33 4870 A61 A73 3

credit_df.iloc[0:5,7:]

personal_status residing_since age inst_plans num_credits job status

0 A93 4 67 A143 2 A173 0

1 A92 2 22 A143 1 A173 1

2 A93 3 49 A143 1 A172 0

3 A93 4 45 A143 1 A173 0

4 A93 4 53 A143 2 A173 1

credit_df.status.value_counts()

0 700
1 300
Name: status, dtype: int64

X_features = list( credit_df.columns )


X_features.remove( 'status' )
X_features

['checkin_acc',
'duration',
'credit_history',
'amount',
'savings_acc',
'present_emp_since',
'inst_rate',
'personal_status',
'residing_since',
'age',
'inst_plans',
'num_credits',
'job']

5.3.1 Encoding Categorical Features

encoded_credit_df = pd.get_dummies( credit_df[X_features],
                                    drop_first = True )

list(encoded_credit_df.columns)

['duration',
'amount',
'inst_rate',
'residing_since',
'age',
'num_credits',
'checkin_acc_A12',
'checkin_acc_A13',
'checkin_acc_A14',
'credit_history_A31',
'credit_history_A32',
'credit_history_A33',
'credit_history_A34',
'savings_acc_A62',
'savings_acc_A63',
'savings_acc_A64',
'savings_acc_A65',
'present_emp_since_A72',
'present_emp_since_A73',
'present_emp_since_A74',
'present_emp_since_A75',
'personal_status_A92',
'personal_status_A93',
'personal_status_A94',
'inst_plans_A142',
'inst_plans_A143',
'job_A172',
'job_A173',
'job_A174']

encoded_credit_df[['checkin_acc_A12',
'checkin_acc_A13',
'checkin_acc_A14']].head(5)

checkin_acc_A12 checkin_acc_A13 checkin_acc_A14

0 0 0 0

1 1 0 0

2 0 0 1

3 0 0 0

4 0 0 0
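Note that get_dummies with drop_first=True turns a categorical feature with k levels into k-1 dummy columns, so a row of all zeros encodes the dropped base level (A11 here). A minimal sketch, using a hypothetical three-row frame that is not part of the dataset, illustrates this:

demo_df = pd.DataFrame( { 'checkin_acc': ['A11', 'A12', 'A14'] } )
# A11 becomes the base level, so its row is all zeros in the dummies
print( pd.get_dummies( demo_df, drop_first = True ) )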

import statsmodels.api as sm

Y = credit_df.status
X = sm.add_constant( encoded_credit_df )

5.3.2 Splitting into Train and Validation Sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y,
                                                     test_size = 0.3,
                                                     random_state = 42)
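Because only 30% of the records are bad credits, one may also want the 70/30 class ratio of status preserved exactly in both partitions. A variant using the stratify argument (our suggestion; the outputs below come from the unstratified split above):

# stratify=Y keeps the good/bad credit ratio identical in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split( X, Y,
                                                             test_size = 0.3,
                                                             random_state = 42,
                                                             stratify = Y )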

5.3.3 Building Logistic Regression Model

import statsmodels.api as sm

logit = sm.Logit(y_train, X_train)


logit_model = logit.fit()

Optimization terminated successfully.


Current function value: 0.488938
Iterations 6

5.3.4 Printing Model Summary

logit_model.summary2()

Model: Logit Pseudo R-squared: 0.198

Dependent Variable: status AIC: 744.5132

Date: 2019-04-23 21:07 BIC: 881.0456

No. Observations: 700 Log-Likelihood: -342.26

Df Model: 29 LL-Null: -426.75

Df Residuals: 670 LLR p-value: 1.0630e-21

Converged: 1.0000 Scale: 1.0000

No. Iterations: 6.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

const -0.1511 1.1349 -0.1331 0.8941 -2.3754 2.0733

duration 0.0206 0.0104 1.9927 0.0463 0.0003 0.0409

amount 0.0001 0.0000 2.3765 0.0175 0.0000 0.0002

inst_rate 0.3064 0.0986 3.1083 0.0019 0.1132 0.4996

residing_since 0.0967 0.0920 1.0511 0.2932 -0.0836 0.2771

age -0.0227 0.0103 -2.2131 0.0269 -0.0428 -0.0026

num_credits 0.2854 0.2139 1.3342 0.1821 -0.1338 0.7045

checkin_acc_A12 -0.4126 0.2391 -1.7260 0.0843 -0.8812 0.0559

checkin_acc_A13 -0.9053 0.4338 -2.0868 0.0369 -1.7556 -0.0550

checkin_acc_A14 -1.6052 0.2586 -6.2073 0.0000 -2.1120 -1.0983

credit_history_A31 0.1532 0.5795 0.2643 0.7916 -0.9827 1.2890

credit_history_A32 -0.4960 0.4411 -1.1245 0.2608 -1.3604 0.3685

credit_history_A33 -0.8881 0.5022 -1.7683 0.0770 -1.8724 0.0962

credit_history_A34 -1.4124 0.4528 -3.1190 0.0018 -2.2999 -0.5249

savings_acc_A62 -0.0496 0.3208 -0.1545 0.8772 -0.6782 0.5791

savings_acc_A63 -0.6640 0.4818 -1.3779 0.1682 -1.6084 0.2804

savings_acc_A64 -1.1099 0.6019 -1.8439 0.0652 -2.2896 0.0699

savings_acc_A65 -0.6061 0.2745 -2.2080 0.0272 -1.1441 -0.0681

present_emp_since_A72 0.0855 0.4722 0.1810 0.8564 -0.8401 1.0110

present_emp_since_A73 -0.0339 0.4492 -0.0754 0.9399 -0.9142 0.8465

present_emp_since_A74 -0.3789 0.4790 -0.7910 0.4289 -1.3178 0.5600

present_emp_since_A75 -0.2605 0.4554 -0.5721 0.5673 -1.1532 0.6321

personal_status_A92 -0.0069 0.4841 -0.0142 0.9887 -0.9557 0.9419

personal_status_A93 -0.4426 0.4764 -0.9291 0.3528 -1.3762 0.4911

personal_status_A94 -0.3080 0.5554 -0.5546 0.5792 -1.3967 0.7806

inst_plans_A142 -0.2976 0.5157 -0.5772 0.5638 -1.3084 0.7131

inst_plans_A143 -0.4458 0.2771 -1.6086 0.1077 -0.9889 0.0974

job_A172 -0.0955 0.7681 -0.1243 0.9011 -1.6009 1.4100

job_A173 -0.0198 0.7378 -0.0269 0.9786 -1.4658 1.4262

job_A174 -0.0428 0.7371 -0.0581 0.9537 -1.4876 1.4019

5.3.5 Model Diagnostics

def get_significant_vars( lm ):
var_p_vals_df = pd.DataFrame( lm.pvalues )
var_p_vals_df['vars'] = var_p_vals_df.index
var_p_vals_df.columns = ['pvals', 'vars']
return list( var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'] )

significant_vars = get_significant_vars( logit_model )

significant_vars

['duration',
'amount',
'inst_rate',
'age',
'checkin_acc_A13',
'checkin_acc_A14',
'credit_history_A34',
'savings_acc_A65']

final_logit = sm.Logit( y_train,
                        sm.add_constant( X_train[significant_vars] ) ).fit()

Optimization terminated successfully.


Current function value: 0.511350
Iterations 6

final_logit.summary2()

Model: Logit Pseudo R-squared: 0.161

Dependent Variable: status AIC: 733.8898

Date: 2019-04-23 21:07 BIC: 774.8495

No. Observations: 700 Log-Likelihood: -357.94

Df Model: 8 LL-Null: -426.75

Df Residuals: 691 LLR p-value: 7.4185e-26

Converged: 1.0000 Scale: 1.0000

No. Iterations: 6.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

const -0.8969 0.4364 -2.0551 0.0399 -1.7523 -0.0415

duration 0.0197 0.0098 2.0033 0.0451 0.0004 0.0390

amount 0.0001 0.0000 2.3205 0.0203 0.0000 0.0002

inst_rate 0.2811 0.0929 3.0264 0.0025 0.0991 0.4632

age -0.0216 0.0089 -2.4207 0.0155 -0.0392 -0.0041

checkin_acc_A13 -0.8038 0.4081 -1.9697 0.0489 -1.6037 -0.0040

checkin_acc_A14 -1.5452 0.2187 -7.0649 0.0000 -1.9738 -1.1165

credit_history_A34 -0.8781 0.2319 -3.7858 0.0002 -1.3327 -0.4235

savings_acc_A65 -0.5448 0.2581 -2.1108 0.0348 -1.0507 -0.0389
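Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. A quick follow-up sketch (not shown in the text) using the fitted model's params attribute:

# exp(coef) is the multiplicative change in the odds of bad credit
# for a one-unit increase in the corresponding feature
odds_ratios = np.exp( final_logit.params )
print( odds_ratios.round( 3 ) )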

5.3.6 Predicting on Test Data

y_pred_df = pd.DataFrame( { "actual": y_test,
                            "predicted_prob": final_logit.predict(
                                sm.add_constant( X_test[significant_vars] ) ) } )

y_pred_df.sample(10, random_state = 42)

actual predicted_prob

557 1 0.080493

798 0 0.076653

977 0 0.345979

136 0 0.249919

575 0 0.062264

544 0 0.040768

332 1 0.833093

917 1 0.370667

678 0 0.388392

363 0 0.088952

y_pred_df['predicted'] = y_pred_df.predicted_prob.map(
lambda x: 1 if x > 0.5 else 0)

y_pred_df.sample(10, random_state = 42)

actual predicted_prob predicted

557 1 0.080493 0

798 0 0.076653 0

977 0 0.345979 0

136 0 0.249919 0

575 0 0.062264 0

544 0 0.040768 0

332 1 0.833093 1

917 1 0.370667 0

678 0 0.388392 0

363 0 0.088952 0

5.3.7 Creating a Confusion Matrix

import matplotlib.pyplot as plt


import seaborn as sn
%matplotlib inline
from sklearn import metrics

def draw_cm( actual, predicted ):
    ## Create the confusion matrix with class 1 (bad credit) listed first
    cm = metrics.confusion_matrix( actual, predicted, labels = [1, 0] )
    sn.heatmap( cm, annot=True, fmt='.2f',
                xticklabels = ["Bad credit", "Good Credit"],
                yticklabels = ["Bad credit", "Good Credit"] )
    plt.ylabel( 'True label' )
    plt.xlabel( 'Predicted label' )
    plt.show()

draw_cm( y_pred_df.actual,
y_pred_df.predicted )

5.3.8 Measuring Accuracies

print( metrics.classification_report( y_pred_df.actual,
                                      y_pred_df.predicted ) )

precision recall f1-score support

0 0.76 0.90 0.82 209


1 0.59 0.33 0.42 91

micro avg 0.73 0.73 0.73 300


macro avg 0.67 0.61 0.62 300
weighted avg 0.70 0.73 0.70 300
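The overall accuracy at the 0.5 cutoff can also be computed directly; it should agree with the micro average above (a quick consistency check, not in the text):

print( round( metrics.accuracy_score( y_pred_df.actual,
                                      y_pred_df.predicted ), 2 ) )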

plt.figure( figsize = (8,6) )
sn.distplot( y_pred_df[y_pred_df.actual == 1]["predicted_prob"],
kde=False, color = 'b',
label = 'Bad Credit' )
sn.distplot( y_pred_df[y_pred_df.actual == 0]["predicted_prob"],
kde=False, color = 'g',
label = 'Good Credit' )
plt.legend()
plt.show()

5.3.9 ROC & AUC

def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual,
                                              probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure( figsize=(8, 6) )
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot( [0, 1], [0, 1], 'k--' )
    plt.xlim( [0.0, 1.0] )
    plt.ylim( [0.0, 1.05] )
    plt.xlabel( 'False Positive Rate or [1 - True Negative Rate]' )
    plt.ylabel( 'True Positive Rate' )
    plt.legend( loc="lower right" )
    plt.show()

    return fpr, tpr, thresholds

fpr, tpr, thresholds = draw_roc( y_pred_df.actual,
y_pred_df.predicted_prob)

auc_score = metrics.roc_auc_score( y_pred_df.actual,
                                   y_pred_df.predicted_prob )
round( float( auc_score ), 2 )

0.78

5.3.10 Finding Optimal Cutoff

5.3.10.1 Youden's Index

tpr_fpr = pd.DataFrame( { 'tpr': tpr,
                          'fpr': fpr,
                          'thresholds': thresholds } )

tpr_fpr['diff'] = tpr_fpr.tpr - tpr_fpr.fpr


tpr_fpr.sort_values( 'diff', ascending = False )[0:5]

fpr thresholds tpr diff

160 0.397129 0.221534 0.846154 0.449025

161 0.401914 0.216531 0.846154 0.444240

162 0.406699 0.215591 0.846154 0.439455

159 0.397129 0.223980 0.835165 0.438036

166 0.421053 0.207107 0.857143 0.436090
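Youden's index is J = TPR - FPR (equivalently, sensitivity + specificity - 1), so the optimal cutoff is the threshold at the row where diff is largest. An equivalent one-liner (a sketch, not in the text):

# idxmax() picks the row maximizing Youden's index directly
print( tpr_fpr.loc[ tpr_fpr['diff'].idxmax() ] )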

y_pred_df['predicted_new'] = y_pred_df.predicted_prob.map(
lambda x: 1 if x > 0.22 else 0)

draw_cm( y_pred_df.actual,
y_pred_df.predicted_new)

print(metrics.classification_report( y_pred_df.actual,
y_pred_df.predicted_new ))

precision recall f1-score support

0 0.90 0.60 0.72 209


1 0.48 0.85 0.61 91

micro avg 0.68 0.68 0.68 300


macro avg 0.69 0.72 0.67 300
weighted avg 0.77 0.68 0.69 300

5.3.10.2 Cost-Based Approach

def get_total_cost( actual, predicted, cost_FPs, cost_FNs ):
    cm = metrics.confusion_matrix( actual, predicted, labels = [1, 0] )
    cm_mat = np.array( cm )
    return cm_mat[0,1] * cost_FNs + cm_mat[1,0] * cost_FPs

cost_df = pd.DataFrame( columns = ['prob', 'cost'])

idx = 0

## iterate cut-off probability values between 0.1 and 0.5
for each_prob in range( 10, 50 ):
    cost = get_total_cost( y_pred_df.actual,
                           y_pred_df.predicted_prob.map(
                               lambda x: 1 if x > (each_prob/100) else 0 ), 1, 5 )
    cost_df.loc[idx] = [(each_prob/100), cost]
    idx += 1

cost_df.sort_values( 'cost', ascending = True )[0:5]

prob cost

4 0.14 150.0

12 0.22 153.0

2 0.12 154.0

10 0.20 154.0

9 0.19 156.0

y_pred_df['predicted_using_cost'] = y_pred_df.predicted_prob.map(
lambda x: 1 if x > 0.14 else 0)

draw_cm( y_pred_df.actual,
y_pred_df.predicted_using_cost )
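Plotting the total cost against the cutoff makes the minimum near 0.14 easy to see. A sketch using the cost_df built above (the plot itself is not in the text):

plt.figure( figsize = (8,4) )
plt.plot( cost_df.prob, cost_df.cost, '-' )
plt.xlabel( 'Cut-off probability' )
plt.ylabel( 'Total cost' )
plt.show()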

5.4 Gain Chart and Lift Chart

5.4.1 Loading and Preparing the Dataset

import pandas as pd
bank_df = pd.read_csv( 'bank.csv' )
bank_df.head( 5 )

   age          job   marital  education default  balance housing-loan personal-loan  current-campaign ...

0   30   unemployed   married    primary      no     1787           no            no                 1
1   33     services   married  secondary      no     4789          yes           yes                 1
2   35   management    single   tertiary      no     1350          yes            no                 1
3   30   management   married   tertiary      no     1476          yes           yes                 4
4   59  blue-collar   married  secondary      no        0          yes            no                 1

bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 11 columns):
age 4521 non-null int64
job 4521 non-null object
marital 4521 non-null object
education 4521 non-null object
default 4521 non-null object
balance 4521 non-null int64
housing-loan 4521 non-null object
personal-loan 4521 non-null object
current-campaign 4521 non-null int64
previous-campaign 4521 non-null int64
subscribed 4521 non-null object
dtypes: int64(4), object(7)
memory usage: 388.6+ KB

X_features = list( bank_df.columns )


X_features.remove( 'subscribed' )
X_features

['age',
'job',
'marital',
'education',
'default',
'balance',
'housing-loan',
'personal-loan',
'current-campaign',
'previous-campaign']

encoded_bank_df = pd.get_dummies( bank_df[X_features],
                                  drop_first = True )

Y = bank_df.subscribed.map( lambda x: int( x == 'yes') )


X = encoded_bank_df


5.4.2 Building the Logistic Regression Model

logit_model = sm.Logit( Y, sm.add_constant( X ) ).fit()

Optimization terminated successfully.


Current function value: 0.335572
Iterations 7

logit_model.summary2()

Model: Logit Pseudo R-squared: 0.061

Dependent Variable: subscribed AIC: 3082.2384

Date: 2019-04-23 21:07 BIC: 3236.2341

No. Observations: 4521 Log-Likelihood: -1517.1

Df Model: 23 LL-Null: -1615.5

Df Residuals: 4497 LLR p-value: 1.4866e-29

Converged: 1.0000 Scale: 1.0000

No. Iterations: 7.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

const -1.7573 0.3799 -4.6251 0.0000 -2.5019 -1.0126

age 0.0078 0.0058 1.3395 0.1804 -0.0036 0.0191

balance -0.0000 0.0000 -0.2236 0.8231 -0.0000 0.0000

current-campaign -0.0905 0.0238 -3.8042 0.0001 -0.1371 -0.0439

previous-campaign 0.1414 0.0212 6.6569 0.0000 0.0998 0.1830

job_blue-collar -0.3412 0.2000 -1.7060 0.0880 -0.7331 0.0508

job_entrepreneur -0.2900 0.3161 -0.9175 0.3589 -0.9096 0.3295

job_housemaid -0.0166 0.3339 -0.0497 0.9603 -0.6711 0.6379

job_management -0.0487 0.1984 -0.2455 0.8061 -0.4375 0.3401

job_retired 0.5454 0.2503 2.1794 0.0293 0.0549 1.0360

job_self-employed -0.2234 0.2895 -0.7715 0.4404 -0.7909 0.3441

job_services -0.2248 0.2245 -1.0012 0.3167 -0.6648 0.2152

job_student 0.3888 0.3181 1.2223 0.2216 -0.2346 1.0122

job_technician -0.2101 0.1874 -1.1213 0.2622 -0.5773 0.1571

job_unemployed -0.3723 0.3336 -1.1162 0.2643 -1.0261 0.2815

job_unknown 0.3193 0.4620 0.6913 0.4894 -0.5861 1.2248

marital_married -0.4012 0.1440 -2.7857 0.0053 -0.6835 -0.1189

marital_single -0.0463 0.1676 -0.2763 0.7823 -0.3749 0.2822

education_secondary 0.2128 0.1680 1.2670 0.2052 -0.1164 0.5420

education_tertiary 0.3891 0.1935 2.0103 0.0444 0.0098 0.7684

education_unknown -0.1956 0.2927 -0.6682 0.5040 -0.7693 0.3781

default_yes 0.2286 0.3670 0.6228 0.5334 -0.4908 0.9479

housing-loan_yes -0.5355 0.1024 -5.2273 0.0000 -0.7362 -0.3347

personal-loan_yes -0.7139 0.1689 -4.2268 0.0000 -1.0449 -0.3829

significant_vars = get_significant_vars( logit_model )

significant_vars

['const',
'current-campaign',
'previous-campaign',
'job_retired',
'marital_married',
'education_tertiary',
'housing-loan_yes',
'personal-loan_yes']

X_features = ['current-campaign',
'previous-campaign',
'job_retired',
'marital_married',
'education_tertiary',
'housing-loan_yes',
'personal-loan_yes']

logit_model_2 = sm.Logit( Y, sm.add_constant( X[X_features] ) ).fit()

Optimization terminated successfully.


Current function value: 0.337228
Iterations 7

logit_model_2.summary2()

Model: Logit Pseudo R-squared: 0.056

Dependent Variable: subscribed AIC: 3065.2182

Date: 2019-04-23 21:07 BIC: 3116.5501

No. Observations: 4521 Log-Likelihood: -1524.6

Df Model: 7 LL-Null: -1615.5

Df Residuals: 4513 LLR p-value: 8.1892e-36

Converged: 1.0000 Scale: 1.0000

No. Iterations: 7.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

const -1.4754 0.1133 -13.0260 0.0000 -1.6974 -1.2534

current-campaign -0.0893 0.0236 -3.7925 0.0001 -0.1355 -0.0432

previous-campaign 0.1419 0.0211 6.7097 0.0000 0.1004 0.1833

job_retired 0.8246 0.1731 4.7628 0.0000 0.4853 1.1639

marital_married -0.3767 0.0969 -3.8878 0.0001 -0.5667 -0.1868

education_tertiary 0.2991 0.1014 2.9500 0.0032 0.1004 0.4978

housing-loan_yes -0.5834 0.0986 -5.9179 0.0000 -0.7767 -0.3902

personal-loan_yes -0.7025 0.1672 -4.2012 0.0000 -1.0302 -0.3748

y_pred_df = pd.DataFrame( { 'actual': Y,
'predicted_prob': logit_model_2.predict(
sm.add_constant( X[X_features] ) ) } )

sorted_predict_df = y_pred_df[['predicted_prob',
'actual']].sort_values( 'predicted_prob',
ascending = False )

num_per_decile = int( len( sorted_predict_df ) / 10 )


print( "Number of observations per decile: ", num_per_decile)

Number of observations per decile: 452

def get_deciles( df ):
    df['decile'] = 1
    idx = 0
    for each_d in range( 0, 10 ):
        df.iloc[idx:idx+num_per_decile, df.columns.get_loc('decile')] = each_d
        idx += num_per_decile
    df['decile'] = df['decile'] + 1
    return df

deciles_predict_df = get_deciles( sorted_predict_df )

deciles_predict_df[0:10]

predicted_prob actual decile

3682 0.864769 0 1

97 0.828031 0 1

3426 0.706809 0 1

1312 0.642337 1 1

3930 0.631032 1 1

4397 0.619146 0 1

2070 0.609129 0 1

3023 0.573199 0 1

4080 0.572364 0 1

804 0.559350 0 1
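The same decile assignment can be obtained without an explicit loop. A non-mutating sketch (an alternative, not the book's method) using pandas.qcut on the rank of the probabilities, so that ties cannot straddle bin boundaries:

# rank 1 = highest predicted_prob; qcut splits the ranks into ten equal bins
decile_alt = pd.qcut( sorted_predict_df['predicted_prob'].rank( ascending = False,
                                                                method = 'first' ),
                      10, labels = range( 1, 11 ) )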

gain_lift_df = pd.DataFrame(
deciles_predict_df.groupby(
'decile')['actual'].sum() ).reset_index()
gain_lift_df.columns = ['decile', 'gain']

gain_lift_df['gain_percentage'] = (100 *
gain_lift_df.gain.cumsum()/gain_lift_df.gain.sum())

gain_lift_df

decile gain gain_percentage

0 1 125 23.992322

1 2 83 39.923225

2 3 73 53.934741

3 4 53 64.107486

4 5 31 70.057582

5 6 46 78.886756

6 7 37 85.988484

7 8 28 91.362764

8 9 25 96.161228

9 10 20 100.000000

import matplotlib.pyplot as plt


import seaborn as sn
%matplotlib inline

plt.figure( figsize = (8,4))


plt.plot( gain_lift_df['decile'],
gain_lift_df['gain_percentage'], '-' )

plt.show()

Calculating Lift

gain_lift_df['lift'] = ( gain_lift_df.gain_percentage
/ ( gain_lift_df.decile * 10) )
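Each decile holds 10% of the customers, so a random model would reach 10 x decile percent of the subscribers by decile d; lift is the ratio of the model's cumulative gain to that baseline. A sketch making the baseline explicit (our check, not the book's):

# cumulative % of subscribers a random model captures by each decile
random_gain = gain_lift_df.decile * 10
print( (gain_lift_df.gain_percentage / random_gain).round( 6 ) )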

gain_lift_df

decile gain gain_percentage lift

0 1 125 23.992322 2.399232

1 2 83 39.923225 1.996161

2 3 73 53.934741 1.797825

3 4 53 64.107486 1.602687

4 5 31 70.057582 1.401152

5 6 46 78.886756 1.314779

6 7 37 85.988484 1.228407

7 8 28 91.362764 1.142035

8 9 25 96.161228 1.068458

9 10 20 100.000000 1.000000

plt.figure( figsize = (8,4))


plt.plot( gain_lift_df['decile'], gain_lift_df['lift'], '-' )
plt.show()

5.5 Decision Trees

5.5.1 Splitting the Dataset

Y = credit_df.status
X = encoded_credit_df

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y,
                                                     test_size = 0.3,
                                                     random_state = 42)


5.5.2 Building Decision Tree Classifier Using Gini Criterion

from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier( criterion = 'gini',
                                   max_depth = 3 )

clf_tree.fit( X_train, y_train )

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=None, splitter='best')

5.5.3 Measuring Test Accuracy

tree_predict = clf_tree.predict( X_test )


metrics.roc_auc_score( y_test, tree_predict )

0.5835743204164258

5.5.4 Displaying the Tree

from sklearn.tree import export_graphviz
import pydotplus as pdot
from IPython.display import Image

# Export the tree into odt file


export_graphviz( clf_tree,
out_file = "chd_tree.odt",
feature_names = X_train.columns,
class_names= ['Good Credit', 'Bad Credit'],
filled = True)

# Read the dot file and create the image file
chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'chd_tree.odt' )
chd_tree_graph.write_png( 'chd_tree.png' )
# Render the png file
Image(filename='chd_tree.png')

5.5.5 Understanding Gini Impurity

gini_node_1 = 1 - pow(491/700, 2) - pow(209/700, 2)

print( round( gini_node_1, 4) )

0.4189
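The same calculation generalizes to any class counts. A small helper (our sketch, not from the text) that reproduces the value above:

def gini_impurity( counts ):
    total = sum( counts )
    # Gini = 1 - sum of squared class proportions
    return 1 - sum( (c / total) ** 2 for c in counts )

print( round( gini_impurity( [491, 209] ), 4 ) )   # 0.4189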

X_test.shape

(300, 29)

5.5.6 Building Decision Tree Using Entropy Criterion

clf_tree_entropy = DecisionTreeClassifier( criterion = 'entropy',
max_depth = 3 )
clf_tree_entropy.fit( X_train, y_train )

# Export the tree into odt file


export_graphviz( clf_tree_entropy,
out_file = "chd_tree_entropy.odt",
feature_names = X_train.columns,
filled = True )

# Read the dot file and create the image file
chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'chd_tree_entropy.odt' )
chd_tree_graph.write_png( 'chd_tree_entropy.png' )
# Render the png file
Image(filename='chd_tree_entropy.png')

Calculating entropy impurity

import math

entropy_node_1 = - (491/700) * math.log2(491/700) - (209/700) * math.log2(209/700)

print( round( entropy_node_1, 2) )

0.88
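As with Gini, the entropy calculation generalizes to arbitrary class counts. A companion helper (our sketch, not from the text):

def entropy_impurity( counts ):
    total = sum( counts )
    # entropy = -sum p * log2(p), skipping empty classes
    return -sum( (c / total) * math.log2( c / total )
                 for c in counts if c > 0 )

print( round( entropy_impurity( [491, 209] ), 2 ) )   # 0.88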

Measuring test accuracy

tree_predict = clf_tree_entropy.predict( X_test )


metrics.roc_auc_score( y_test, tree_predict )

0.5763972869236027

5.5.7 Finding Optimal Criteria and max_depth

from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'criterion': ['gini','entropy'],
                     'max_depth': range(2,10)}]

clf_tree = DecisionTreeClassifier()

clf = GridSearchCV(clf_tree,
tuned_parameters,
cv=10,
scoring='roc_auc')

clf.fit(X_train, y_train )

/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:841:
DeprecationWarning: The default of the `iid` parameter will change from True
to False in version 0.22 and will be removed in 0.24. This will change numeric
results when test-set sizes are unequal.
  DeprecationWarning)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=None, splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'max_depth': range(2, 10),
                    'criterion': ['gini', 'entropy']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

clf.best_score_

0.6824299319727891

clf.best_params_

{'criterion': 'gini', 'max_depth': 2}
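With refit=True (the default), the grid search refits the best configuration on the full training set and exposes it as best_estimator_. A natural next step (a sketch, not shown in the text) is to evaluate it on the held-out data:

best_tree = clf.best_estimator_
best_predict = best_tree.predict( X_test )
print( metrics.roc_auc_score( y_test, best_predict ) )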
