3 - Modeling.ipynb - Colaboratory

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from google.colab import drive


drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Intro

In this notebook we cover two modeling techniques:

1) Logistic Regression

2) Random Forest

as well as the data preparation for the modeling phase.

Read data


random_seed = 42

Load the data for each set: train, validation, and OOT (out-of-time).

base_path = '/content/drive/MyDrive/Colab Notebooks/Riesgo de crédito/Data/'


df_train = pd.read_excel(base_path + 'prosperLoanData_train_cleanFQA.xlsx')
df_val = pd.read_excel(base_path + 'prosperLoanData_val_cleanFQA.xlsx')
df_oot = pd.read_excel(base_path + 'prosperLoanData_oot_cleanFQA.xlsx')

In the next step we load the objects we saved in the previous notebook: IV, PSI, and correlation for each of the features.

#get the serialized data from previous session


input_d2 = pd.read_pickle("/content/drive/MyDrive/Colab Notebooks/Riesgo de crédito/output_Feature_QA.pkl")
corr_data, features = input_d2['corr_data'], input_d2['features']
df_iv, df_psi = input_d2['iv_df'], input_d2['psi_df']

Feature Selection

In practice, the curse of dimensionality implies that, given a fixed number of examples, there is a maximum number of attributes beyond which
the performance of our classifier degrades rather than improves.
Techniques to reduce dimensionality:

- Selection methods: choose a subset of features from the original set.

Example: Mutual Information criteria.
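As a quick, hedged illustration of a selection criterion (not used in this exercise), mutual information can rank features against the target; the sketch below uses the breast cancer dataset that also appears later in this notebook:

# Illustrative sketch: rank features by mutual information with the target
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer()
mi = mutual_info_classif(data.data, data.target, random_state=42)
mi_rank = pd.Series(mi, index=data.feature_names).sort_values(ascending=False)
print(mi_rank.head())  # a selection rule could keep, e.g., the top-k features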

For this exercise: Information Value, Population Stability Index & Correlation criteria.

- Filtering methods: the new features come from a transformation of the original ones.

Find a transformation y=f(x) that preserves the information about the problem, minimizing the number of components.

Examples: PCA, LDA

The goal of LDA is to reduce dimensionality while preserving as much discriminatory information as possible and maximizing the separation between
classes. LDA reduces the dimensional space to C-1 dimensions, where C is the number of classes (target).

PCA, by contrast, seeks to compress the information in the data regardless of class (target), building the most relevant components or factors from
the original variables.
PCA is a technique that makes sense to apply when there are high correlations between the variables (an indication that there is
redundant information); as a consequence, a few factors will explain much of the total variability.

Beware: PCA is sensitive to the scale on which the variables are expressed, so variables may need to be normalized.

# To put it into practice:

from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
breast_data = breast.data

breast_labels = breast.target
labels = np.reshape(breast_labels,(569,1))

final_breast_data = np.concatenate([breast_data,labels],axis=1)
breast_dataset = pd.DataFrame(final_breast_data)

features = breast.feature_names
features_labels = np.append(features,'label')
breast_dataset.columns = features_labels
breast_dataset.head()

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points ...
0        17.99         10.38          122.80     1001.0          0.11840           0.27760          0.3001               0.147...
1        20.57         17.77          132.90     1326.0          0.08474           0.07864          0.0869               0.070...
2        19.69         21.25          130.00     1203.0          0.10960           0.15990          0.1974               0.127...
3        11.42         20.38           77.58      386.1          0.14250           0.28390          0.2414               0.105...
4        20.29         14.34          135.10     1297.0          0.10030           0.13280          0.1980               0.104...

5 rows × 31 columns

from sklearn.decomposition import PCA


from sklearn.preprocessing import StandardScaler

x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x) # normalizing the features
feat_cols = ['feature'+str(i) for i in range(x.shape[1])]
normalised_breast = pd.DataFrame(x,columns=feat_cols)

# The original labels are in 0/1 format; map them to names using .replace (in this dataset 0 = malignant, 1 = benign)
breast_dataset['label'].replace(0, 'Malignant', inplace=True)
breast_dataset['label'].replace(1, 'Benign', inplace=True)

pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)

principal_breast_Df = pd.DataFrame(data=principalComponents_breast,
                                   columns=['principal component 1', 'principal component 2'])
principal_breast_Df.tail()

principal component 1 principal component 2

564 6.439315 -3.576817

565 3.793382 -3.584048

566 1.256179 -1.902297

567 10.374794 1.672010

568 -5.475243 -0.670637

print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

Explained variation per principal component: [0.44272026 0.18971182]


plt.figure(figsize=(10, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1'],
                principal_breast_Df.loc[indicesToKeep, 'principal component 2'], c=color, s=50)

plt.legend(targets,prop={'size': 15})

<matplotlib.legend.Legend at 0x7e25b3e5ae00>

[Scatter plot of the data on the first two principal components, colored by Benign / Malignant]
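For contrast with PCA, here is a minimal LDA sketch on the same standardized data (an illustration, not part of the original exercise). LDA is supervised, so it needs the labels, and with two classes it yields at most C-1 = 1 component:

# Hedged sketch: supervised dimensionality reduction with LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
lda_component = lda.fit_transform(x, breast.target)  # uses the class labels, unlike PCA
print(lda_component.shape)  # (569, 1): a single discriminant axis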

For this exercise: Feature Selection Criteria based on IV, PSI & Correlation

1. Drop highly correlated features


Take a look at the features with the highest correlation, and drop those that are highly correlated.

corr_data[0][2].values[0]

1.0
#run through the iv ranking, and drop features if they are correlated with any feature with better ranking
feats_sorted = df_iv.feature.values.tolist()

def get_uncorr_feats(corr_data, feats_sorted):
    """Handles the corr_data structure, to drop highly correlated features

    Args:
        corr_data: List of tuples containing the correlation info
        feats_sorted: List with the features to be sorted / dropped

    Returns:
        List with the features that show no high correlation
    """
    features_keep = feats_sorted[:1]
    for feat in feats_sorted[1:]:
        # capture the correlation tuple for this feature, if any
        crr_data = [crr for crr in corr_data if crr[0] == feat]

        if len(crr_data):
            # if there is a 'hit' with a feature in features_keep, do not include it
            hit = len(set(crr_data[0][2].index.tolist()) & set(features_keep)) > 0
            if hit:
                print('Drop: ' + feat)
            else:
                features_keep.append(feat)
        else:
            features_keep.append(feat)
    return features_keep

print ('We are dropping the following features due to the high correlation with others:\n')
features_keep = get_uncorr_feats(corr_data, feats_sorted)

We are dropping the following features due to the high correlation with others:

Drop: CreditScoreRangeUpper
Drop: TotalProsperPaymentsBilled
Drop: LoanOriginalAmount
Drop: CurrentCreditLines
Drop: OpenRevolvingAccounts
Drop: TotalCreditLinespast7years

2. Drop features with low IV

As we said before, a feature with an IV below 0.02 does not give us enough information, so in the next step we drop those features.

# IV filtering with this threshold


TH_IV = 0.02

# capture low IV features


low_iv_feats = df_iv.loc[df_iv.IV < TH_IV, 'feature'].values.tolist()
features_keep_iv = list(set(features_keep) - set(low_iv_feats))

print ('We are dropping the following features due to the poor IV (<{}):'.format(TH_IV))

low_iv_feats

We are dropping the following features due to the poor IV (<0.02):


['AvailableBankcardCredit',
'ProsperPaymentsOneMonthPlusLate',
'EmploymentStatusDuration',
'ListingCategory (numeric)',
'TradesNeverDelinquent (percentage)',
'TotalInquiries',
'Term',
'RevolvingCreditBalance',
'PercentFunded',
'PublicRecordsLast10Years',
'DelinquenciesLast7Years',
'PublicRecordsLast12Months',
'Recommendations']

3. Drop unstable features


# PSI filtering with this threshold
TH_PSI = 0.25

# capture features with high PSI (unstable)


high_psi_features = df_psi.loc[df_psi.PSI > TH_PSI, 'feature'].values.tolist()
features_keep_psi = list(set(features_keep_iv) - set(high_psi_features))

print ('We are dropping the following features due to the high PSI (>{}):'.format(TH_PSI))
high_psi_features

We are dropping the following features due to the high PSI (>0.25):
['LoanOriginalAmount',
'MonthlyLoanPayment',
'Term',
'ListingCategory (numeric)']

Note: final_features will contain our final set of features, in order to model the target

final_features = features_keep_psi

print("Number of final features: {}".format(len(final_features)))


final_features

Number of final features: 21


['ProsperPaymentsLessThanOneMonthLate',
'BorrowerState',
'OpenCreditLines',
'TotalTrades',
'DebtToIncomeRatio',
'EmploymentStatus',
'InquiriesLast6Months',
'IsBorrowerHomeowner',
'IncomeRange',
'IncomeVerifiable',
'BankcardUtilization',
'ScorexChangeAtTimeOfListing',
'OnTimeProsperPayments',
'ProsperPrincipalOutstanding',
'OpenRevolvingMonthlyPayment',
'Occupation',
'TradesOpenedLast6Months',
'StatedMonthlyIncome',
'CreditScoreRangeLower',
'TotalProsperLoans',
'ProsperPrincipalBorrowed']

Data Preparation - Bucketing


Data binning (also called Discrete binning or bucketing) is a data pre-processing technique used to reduce the effects of minor observation
errors (outliers, for example).

Statistical data binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have
data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping every five
years together).

In order to group data into bins you can:

- Manually type a series of values to serve as the bin boundaries.
- Control the number of values in each bin.
- Force an even distribution of values into the bins.

In our particular case:

we have created a function that, given a number n_bins (function parameter), calculates as many buckets according to the percentiles for
each of the continuous features
we have created a function that, given a number n_bins (function parameter), calculates as many buckets according to the population
distribution for each of the categorical features

Our recommendation is that bucketization should remain, to some extent, a manual task, without ever losing sight of the business sense.
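As a toy illustration (separate from the functions defined below), pandas covers both strategies out of the box: pd.cut for manually chosen boundaries and pd.qcut for percentile-balanced buckets:

# Toy example: two ways of binning an 'age' variable
ages = pd.Series([22, 25, 31, 34, 41, 47, 52, 58, 63, 70])
print(pd.cut(ages, bins=[20, 40, 60, 80]))  # manual boundaries
print(pd.qcut(ages, q=4))                   # 4 buckets balanced by percentiles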

For numerical features

Function that defines buckets for numerical features, with the number of buckets we want; a companion function below also makes dummies.

def get_bucket_numfeature(df, feat_col, n_bins, input_slider=(0., 100.)):
    """Cuts a numeric feature in 'n_bins', balancing data in percentiles

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        n_bins: Number of cuts expected
        input_slider: Percentile range considered for the bucketing

    Returns:
        List with the cuts corresponding to this feature
    """
    # get the numeric input from the dual slider
    perc_sliders = [v / 100. for v in input_slider]
    var_lims = df[feat_col].quantile([perc_sliders[0], perc_sliders[1]]).values
    v_min, v_max = var_lims[0], var_lims[1]
    # filter the dataset using the slider input
    df_cut = df.loc[(df[feat_col] <= v_max) & (df[feat_col] >= v_min)][[feat_col]]
    cuts = df_cut[feat_col].quantile(np.linspace(perc_sliders[0], perc_sliders[1], n_bins + 1)).values.tolist()
    cuts = sorted(list(set(cuts)))
    return cuts

def format_dummy_col(feat_col, dummy_col):
    """Handles column names for dummy data

    Args:
        feat_col: Name of the column with the input feature
        dummy_col: String of the dummy column

    Returns:
        Dummy column name with better formatting
    """
    out = dummy_col.replace("(", "")\
                   .replace("]", "")\
                   .replace(".0", "")\
                   .replace(", ", "|")

    return feat_col + '_' + out

def apply_bucketing_num(df, feat_col, cuts):
    """Applies bucketing to a numerical feature

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        cuts: Cuts that will be applied to the input data

    Returns:
        Pandas DataFrame with dummy columns
    """
    cut_col = '{}_cut'.format(feat_col)
    if len(cuts) == 2:
        cuts = [cuts[0], np.mean(cuts), cuts[1]]

    df[cut_col] = pd.cut(df[feat_col], cuts, include_lowest=True, precision=0)

    if df[cut_col].isna().any():
        df[cut_col] = df[cut_col].cat.add_categories(["NA"])
        df[cut_col] = df[cut_col].fillna("NA")

    dummies_df = pd.get_dummies(df[cut_col], drop_first=True)

    dummies_df.columns = [format_dummy_col(feat_col, str(col)) for col in dummies_df.columns.values.tolist()]

    return dummies_df

For categorical features

Function that defines buckets for categorical features, with the number of buckets we want; a companion function below also makes dummies.

def get_bucket_catfeature(df, feat_col, n_bins):
    """Cuts a categorical feature in 'n_bins', keeping the categories with highest volume

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        n_bins: Number of cuts expected

    Returns:
        List with the cuts corresponding to this feature
    """
    cuts = df.groupby(feat_col)[feat_col].count().sort_values(ascending=False)[:int(n_bins)].index.values.tolist()

    return cuts

def apply_bucketing_cat(df, feat_col, cuts):
    """Applies bucketing to a categorical feature

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        cuts: Cuts that will be applied to the input data

    Returns:
        Pandas DataFrame with dummy columns
    """
    cut_col = '{}_cut'.format(feat_col)
    df[cut_col] = df[feat_col]
    df.loc[~df[cut_col].isin(cuts), cut_col] = 'Other'
    if df[cut_col].isna().any():
        df[cut_col] = df[cut_col].fillna("NA")

    dummies_df = pd.get_dummies(df[cut_col], prefix=feat_col, drop_first=True)

    return dummies_df

For all features - Define Buckets

We define one function to bucket all the features in a DataFrame, whether the feature is numerical or categorical.

def get_bucket_feature(df, feat_col, n_bins=6):
    """Trains bucketing on a feature, whether it is numerical
    or categorical

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        n_bins: Number of cuts expected

    Returns:
        List with the cuts learned from the data
    """
    if (df[feat_col].dtypes == object) | (df[feat_col].dtypes == bool):
        cuts = get_bucket_catfeature(df, feat_col, n_bins)
    else:
        cuts = get_bucket_numfeature(df, feat_col, n_bins)
    return cuts

def get_bucketing_allfeatures(df, features, n_bins=4):
    """Trains bucketing on all given features of a dataset

    Args:
        df: Pandas DataFrame with the input data
        features: Features whose bucketing will be learnt
        n_bins: Number of cuts expected

    Returns:
        Dict, containing all features and their corresponding
        bucketing. For example:
        {'feature1': cuts1,
         'feature2': cuts2}
    """
    out_dict = {}
    for feature in features:
        cuts = get_bucket_feature(df, feature, n_bins)
        out_dict[feature] = cuts
    return out_dict

We execute the function that bins each of the variables, and store the bucket definitions in a dictionary:

dict_bucketing = get_bucketing_allfeatures(df_train, final_features, n_bins=4)

The bucket definitions learned from the train df are saved in a dictionary variable:

dict_bucketing

{'ProsperPaymentsLessThanOneMonthLate': [0.0, 42.0],


'BorrowerState': ['CA', 'FL', 'NY', 'TX'],
'OpenCreditLines': [0.0, 5.0, 8.0, 11.0, 40.0],
'TotalTrades': [1.0, 14.0, 21.0, 30.0, 114.0],
'DebtToIncomeRatio': [0.0, 0.13, 0.2, 0.31, 10.01],
'EmploymentStatus': ['Employed', 'Self-employed', 'Other', 'Not employed'],
'InquiriesLast6Months': [0.0, 1.0, 2.0, 27.0],
'IsBorrowerHomeowner': [True, False],
'IncomeRange': ['$25,000-49,999',
'$50,000-74,999',
'$100,000+',
'$75,000-99,999'],
'IncomeVerifiable': [True, False],
'BankcardUtilization': [0.0, 0.24, 0.55, 0.81, 1.82],
'ScorexChangeAtTimeOfListing': [-209.0, -43.0, -13.0, 16.0, 286.0],
'OnTimeProsperPayments': [0.0, 9.0, 16.0, 34.0, 110.0],
'ProsperPrincipalOutstanding': [0.0, 1494.84, 3978.69, 22894.59],
'OpenRevolvingMonthlyPayment': [0.0, 99.0, 238.0, 484.0, 5770.0],
'Occupation': ['Other',
'Professional',
'Computer Programmer',
'Administrative Assistant'],
'TradesOpenedLast6Months': [0.0, 1.0, 20.0],
'StatedMonthlyIncome': [0.0, 3166.666667, 4604.166667, 6875.0, 466666.666667],
'CreditScoreRangeLower': [600.0, 660.0, 700.0, 740.0, 880.0],
'TotalProsperLoans': [1.0, 2.0, 7.0],
'ProsperPrincipalBorrowed': [1000.0, 3500.0, 5950.0, 10000.0, 72499.0]}

For all features - Apply Bucket Definitions

Define a function that applies a bucketing schema:

def apply_bucketing(df, feat_col, cuts):
    """Applies a bucketing schema

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        cuts: Cuts that will be applied to the input data

    Returns:
        Pandas DataFrame with dummy columns
    """
    if (df[feat_col].dtypes == object) | (df[feat_col].dtypes == bool):
        df_buck = apply_bucketing_cat(df, feat_col, cuts)
    else:
        df_buck = apply_bucketing_num(df, feat_col, cuts)
    return df_buck

We apply bucketing to all features of each df (train, validation and OOT).

# Apply the bucketing

# Keep each feature's dummy columns in independent lists


list_df_tr, list_df_val, list_df_oot = [], [], []
for feat in final_features:
    list_df_tr.append(apply_bucketing(df_train, feat, dict_bucketing[feat]))
    list_df_val.append(apply_bucketing(df_val, feat, dict_bucketing[feat]))
    list_df_oot.append(apply_bucketing(df_oot, feat, dict_bucketing[feat]))

# Then combine them side by side (column-wise)


df_tr_preproc = pd.concat(list_df_tr, axis=1)
df_val_preproc = pd.concat(list_df_val, axis=1)
df_oot_preproc = pd.concat(list_df_oot, axis=1)

# Capture the name of all buckets in our dataset


keep_cols_buck = df_tr_preproc.columns
df_tr_preproc
       ProsperPaymentsLessThanOneMonthLate_21|42  ProsperPaymentsLessThanOneMonthLate_NA  BorrowerState_FL  ...
0                                              0                                       1                 0  ...
1                                              0                                       1                 0  ...
2                                              0                                       1                 0  ...
3                                              0                                       0                 0  ...
4                                              0                                       1                 0  ...
...                                          ...                                     ...               ...  ...
20700                                          0                                       1                 0  ...
20701                                          0                                       1                 0  ...
20702                                          0                                       1                 0  ...
20703                                          0                                       1                 0  ...
20704                                          0                                       1                 0  ...

20705 rows × 60 columns

Drop highly correlated buckets

We can find buckets (of different features) that are correlated with each other, so we are going to drop them.

#check buckets correlations


corr = df_tr_preproc[keep_cols_buck].corr()
orig_features = keep_cols_buck.values.tolist()
corr_TH = 0.75
n_corr_list=[]
corr_feats_list=[]
for f in orig_features:
    # get correlation entries for the feature
    corr_f = corr[f][[col for col in orig_features if col != f]]
    # work with absolute values
    corr_f_abs = corr_f.abs()
    # get features above the corr TH
    corr_ht_th = corr_f_abs[corr_f_abs > corr_TH]
    n_corr_list.append(corr_ht_th.shape[0])
    corr_feats_list.append(corr_ht_th)

corr_buckets = [(feat, n, feats_corr) for n, feats_corr, feat in zip(n_corr_list, corr_feats_list, orig_features) if n>0]

print ('We are dropping the following buckets due to the high correlation with others:\n')
glm_cols = get_uncorr_feats(corr_buckets, orig_features)

We are dropping the following buckets due to the high correlation with others:

Drop: IncomeVerifiable_True
Drop: ScorexChangeAtTimeOfListing_NA
Drop: OnTimeProsperPayments_NA
Drop: ProsperPrincipalOutstanding_NA
Drop: Occupation_Professional
Drop: TotalProsperLoans_NA
Drop: ProsperPrincipalBorrowed_NA

Data Preparation / Preprocessing Summary:

1. Drop features that are highly correlated with others.
2. Drop features with low IV (poor predictive capacity).
3. Drop features with a high PSI (highly unstable).
4. Bucketize the variables: although we recommend doing this manually, to improve the predictive capacity of the variables and not to lose the business sense, we will (due to lack of time in this course) do it automatically: numerical variables are broken into n equally populated buckets (percentiles), and for categorical variables we keep the n-1 most populated categories, grouping the rest into an 'Other' category.
5. Convert each of the buckets into a dummy variable.
6. Drop buckets (each is actually no longer a bucket, but a dummy variable created from a bucket) that are highly correlated with others.

Modeling

Logistic Regression


Logistic regression is a type of regression analysis used to predict the outcome of a categorical variable (a variable that can adopt a limited
number of categories) based on independent or predictive variables. It is useful for modeling the probability of an event occurring as a function
of other factors. Logistic regression analysis is framed within the set of Generalized Linear Models (GLM) that uses the logit function as a link
function. The probabilities that describe the possible outcome of a single test are modeled, as a function of explanatory variables, using a
logistic function.
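As a quick numeric illustration (hypothetical coefficients, not the model trained below), the logit link turns a linear combination of predictors into a probability:

# p(y=1|x) = 1 / (1 + exp(-(b0 + b1*x1))) -- illustrative coefficients only
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -2.0, 0.8  # hypothetical intercept and slope
for x1 in [0.0, 1.0, 3.0]:
    print(x1, logistic(b0 + b1 * x1))  # probabilities strictly between 0 and 1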

from sklearn.linear_model import LogisticRegression


from sklearn import metrics
import matplotlib.pyplot as plt
def get_auc_to_plot(y_true, y_pred):
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
return fpr, tpr, metrics.auc(fpr, tpr)

def get_auc(y_true, y_pred):


fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
return metrics.auc(fpr, tpr)

def plot_roc_curve(fpr, tpr, roc_auc):


plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Capture the target in each subset


y_tr, y_val, y_oot = df_train['bad'], df_val['bad'], df_oot['bad']

Train LR with ALL Features


lr = LogisticRegression(random_state=random_seed)
lr.fit(df_tr_preproc[glm_cols], y_tr)

LogisticRegression(random_state=42)

lr.predict_proba(df_tr_preproc[glm_cols])[:,1]

array([0.08094888, 0.06894575, 0.02447654, ..., 0.0703194 , 0.12645319,


0.12606883])

y_tr

0 False
1 False
2 False
3 False
4 False
...
20700 False
20701 False
20702 False
20703 False
20704 False
Name: bad, Length: 20705, dtype: bool

# Predict for all subsets


pred_tr = lr.predict_proba(df_tr_preproc[glm_cols])[:, 1]
pred_val = lr.predict_proba(df_val_preproc[glm_cols])[:, 1]
pred_oot = lr.predict_proba(df_oot_preproc[glm_cols])[:, 1]

get_auc(y_tr, pred_tr), get_auc(y_val, pred_val), get_auc(y_oot, pred_oot)

(0.7024905971954076, 0.701436940407556, 0.6677899477840702)

fpr_tr, tpr_tr, roc_auc_tr = get_auc_to_plot(y_tr, pred_tr)


fpr_val, tpr_val, roc_auc_val = get_auc_to_plot(y_val, pred_val)
fpr_oot, tpr_oot, roc_auc_oot = get_auc_to_plot(y_oot, pred_oot)

print("Plot curve for train")


plot_roc_curve(fpr_tr, tpr_tr, roc_auc_tr)
print("Plot curve for validation")
plot_roc_curve(fpr_val, tpr_val, roc_auc_val)
print("Plot curve for OOT")
plot_roc_curve(fpr_oot, tpr_oot, roc_auc_oot)
Plot curve for train

Plot curve for validation

Plot curve for OOT

print ('Number of features: {}'.format(len(glm_cols)))

Number of features: 53
Conclusions:

Considering:

This is an application model, with a limited number of variables, with self-declared information in many cases.
We had to drop variables because they are Prosper's own engineered variables, and we have not engineered any of our own.

The results of the model are somewhat tight, but sufficient. If we wanted to improve the model, we could start by improving the binning (we don't
do it here due to lack of time, but it is a good tool to improve the predictive capacity of the model).

In any case, we have too many variables... in the next steps we will try to reduce them.

Feature Selection:

P-value based feature selection

A smaller p-value means stronger evidence against the null hypothesis, which here is the hypothesis that the coefficient β is 0.

Steps:

1. Train the model with all the features
2. Drop those with the highest p-value (lowest significance)
3. Re-train the model

# Get p-values
# Mostly from: https://gist.github.com/rspeare/77061e6e317896be29c6de9a85db301d

import scipy.stats as stat

def get_p_vals(lr, X):
    denom = (2.0 * (1.0 + np.cosh(lr.decision_function(X))))
    denom = np.tile(denom, (X.shape[1], 1)).T
    F_ij = np.dot((X / denom).T, X)           # Fisher Information Matrix
    Cramer_Rao = np.linalg.inv(F_ij)          # Inverse Information Matrix
    sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
    z_scores = lr.coef_[0] / sigma_estimates  # z-score for each model coefficient
    p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]  # two-tailed test for p-values
    return p_values

def show_lr_summary(p_values, features, lr):
    df_ret = pd.DataFrame({'feature': features,
                           'p_val': p_values,
                           'betas': lr.coef_.tolist()[0]})[['feature', 'betas', 'p_val']]
    return df_ret

p_values = get_p_vals(lr, df_tr_preproc[glm_cols])


#show_lr_summary(p_values, glm_cols, lr)

# remove each bucket with the highest p-value N times


# assess how the AUC changes
N_iterations = len(glm_cols) - 3
glm_cols_pvals = [c for c in glm_cols]

# keep the AUCs in each iteration


auc_train, auc_val, auc_oot = [], [], []
# List of tuples where all selected features status will
# be stored per iteration
features_it_pval = []

# for each iteration


for it in range(N_iterations):
    # capture the feature to be dropped
    # in the first iteration we use the p_values from the model trained 'outside'
    feat_drop = [feat for feat, p in zip(glm_cols_pvals, p_values) if p == max(p_values)][0]
    glm_cols_pvals.remove(feat_drop)
    # re-train the model
    lr_it = LogisticRegression(random_state=random_seed)
    lr_it.fit(df_tr_preproc[glm_cols_pvals], y_tr)
    # predict for all subsets
    pred_tr = lr_it.predict_proba(df_tr_preproc[glm_cols_pvals])[:, 1]
    pred_val = lr_it.predict_proba(df_val_preproc[glm_cols_pvals])[:, 1]
    pred_oot = lr_it.predict_proba(df_oot_preproc[glm_cols_pvals])[:, 1]
    # get AUCs
    auc_train.append(get_auc(y_tr, pred_tr))
    auc_val.append(get_auc(y_val, pred_val))
    auc_oot.append(get_auc(y_oot, pred_oot))
    # get p-values
    p_values = get_p_vals(lr_it, df_tr_preproc[glm_cols_pvals])
    # keep the features status at this iteration
    features_it_pval.append((it, [c for c in glm_cols_pvals]))

Analyze the results of the model on the 3 samples (train/val/oot) in each of the iterations:

# Plot AUCs in each iteration


iterations = [i for i in range(N_iterations)]
plt.plot(iterations, auc_train, label='Train')
plt.plot(iterations, auc_val, label='Val')
plt.plot(iterations, auc_oot, label='OOT')
plt.legend()
<matplotlib.legend.Legend at 0x7e25b3909720>

Note that at around iteration 30 the curves become stable.

final_features_bucketing = [feats for it, feats in features_it_pval if it==30][0]


final_features_bucketing

['ProsperPaymentsLessThanOneMonthLate_NA',
'OpenCreditLines_5|8',
'OpenCreditLines_8|11',
'OpenCreditLines_11|40',
'DebtToIncomeRatio_0.2|0.3',
'DebtToIncomeRatio_0.3|10',
'DebtToIncomeRatio_NA',
'InquiriesLast6Months_1|2',
'InquiriesLast6Months_2|27',
'BankcardUtilization_0.2|0.6',
'BankcardUtilization_0.6|0.8',
'OnTimeProsperPayments_34|110',
'ProsperPrincipalOutstanding_1495|3979',
'ProsperPrincipalOutstanding_3979|22895',
'OpenRevolvingMonthlyPayment_99|238',
'TradesOpenedLast6Months_1|20',
'StatedMonthlyIncome_3167|4604',
'StatedMonthlyIncome_4604|6875',
'StatedMonthlyIncome_6875|466667',
'CreditScoreRangeLower_660|700',
'CreditScoreRangeLower_700|740',
'CreditScoreRangeLower_740|880']

Train the final model with the interesting features

lr_final = LogisticRegression(random_state = random_seed)


lr_final.fit(df_tr_preproc[final_features_bucketing], y_tr)

LogisticRegression(random_state=42)

pred_tr = lr_final.predict_proba(df_tr_preproc[final_features_bucketing])[:, 1]
pred_val = lr_final.predict_proba(df_val_preproc[final_features_bucketing])[:, 1]
pred_oot = lr_final.predict_proba(df_oot_preproc[final_features_bucketing])[:, 1]

get_auc(y_tr, pred_tr), get_auc(y_val, pred_val), get_auc(y_oot, pred_oot)

(0.6957180909358355, 0.6937105794521641, 0.6725019529945298)


fpr_tr, tpr_tr, roc_auc_tr = get_auc_to_plot(y_tr, pred_tr)
fpr_val, tpr_val, roc_auc_val = get_auc_to_plot(y_val, pred_val)
fpr_oot, tpr_oot, roc_auc_oot = get_auc_to_plot(y_oot, pred_oot)

print("Plot curve for train")


plot_roc_curve(fpr_tr, tpr_tr, roc_auc_tr)
print("Plot curve for validation")
plot_roc_curve(fpr_val, tpr_val, roc_auc_val)
print("Plot curve for OOT")
plot_roc_curve(fpr_oot, tpr_oot, roc_auc_oot)
Plot curve for train

Plot curve for validation

Plot curve for OOT

We are reducing the gap with the OOT sample. Great!

print ('Number of features: {}'.format(len(final_features_bucketing)))


Number of features: 22

p_values = get_p_vals(lr_final, df_tr_preproc[final_features_bucketing])


show_lr_summary(p_values, final_features_bucketing, lr_final)

feature betas p_val

0 ProsperPaymentsLessThanOneMonthLate_NA 0.723889 7.685526e-37

1 OpenCreditLines_5|8 -0.227113 3.426465e-04

2 OpenCreditLines_8|11 -0.301663 3.650761e-05

3 OpenCreditLines_11|40 -0.411733 1.064796e-07

4 DebtToIncomeRatio_0.2|0.3 0.353113 4.317393e-08

5 DebtToIncomeRatio_0.3|10 0.646093 3.019707e-21

6 DebtToIncomeRatio_NA 0.884044 8.358175e-37

7 InquiriesLast6Months_1|2 0.329650 6.961723e-07

8 InquiriesLast6Months_2|27 0.720581 3.334563e-32

9 BankcardUtilization_0.2|0.6 -0.304643 2.282354e-07

10 BankcardUtilization_0.6|0.8 -0.294148 5.597200e-07

11 OnTimeProsperPayments_34|110 -0.606785 7.088786e-07

12 ProsperPrincipalOutstanding_1495|3979 0.481250 9.917150e-07

13 ProsperPrincipalOutstanding_3979|22895 0.702348 1.533503e-12

14 OpenRevolvingMonthlyPayment_99|238 -0.271053 1.819088e-06

15 TradesOpenedLast6Months_1|20 0.382327 3.182780e-11

16 StatedMonthlyIncome_3167|4604 -0.257462 1.478375e-05

17 StatedMonthlyIncome_4604|6875 -0.370974 6.823177e-09

18 StatedMonthlyIncome_6875|466667 -0.772104 3.192630e-23

19 CreditScoreRangeLower_660|700 -0.314857 3.988308e-08

20 CreditScoreRangeLower_700|740 -0.466533 2.597968e-12

21 CreditScoreRangeLower_740|880 -0.820161 5.355714e-23

AUC based feature selection

We are now going to prune the model based on the AUC instead of the p-value.
# Try to remove all available features.
# Remove the feature that has the lowest impact in AUC
N_iterations = len(glm_cols) - 3
glm_cols_auc = [c for c in glm_cols]

auc_train, auc_val, auc_oot = [], [], []


features_it_auc = []
for it in range(N_iterations):
    print('Working for iteration: {}'.format(str(it)))
    # re-train the model
    lr_it = LogisticRegression(random_state=random_seed)
    lr_it.fit(df_tr_preproc[glm_cols_auc], y_tr)
    # predict for all subsets
    pred_tr = lr_it.predict_proba(df_tr_preproc[glm_cols_auc])[:, 1]
    pred_val = lr_it.predict_proba(df_val_preproc[glm_cols_auc])[:, 1]
    pred_oot = lr_it.predict_proba(df_oot_preproc[glm_cols_auc])[:, 1]
    # get base AUCs
    auc_train_it, auc_val_it, auc_oot_it = get_auc(y_tr, pred_tr), get_auc(y_val, pred_val), get_auc(y_oot, pred_oot)
    auc_train.append(auc_train_it)
    auc_val.append(auc_val_it)
    auc_oot.append(auc_oot_it)
    # initialize the minimum gap with a large value
    min_gap = 500
    for feat_eval in glm_cols_auc:
        # use the validation AUC only as evaluation metric
        # keep the features of iteration 'it', except the feature under evaluation
        glm_cols_auc_ev = [c for c in glm_cols_auc if c != feat_eval]
        lr_it_ev = LogisticRegression()
        lr_it_ev.fit(df_tr_preproc[glm_cols_auc_ev], y_tr)
        # predict @ val data
        pred_val = lr_it_ev.predict_proba(df_val_preproc[glm_cols_auc_ev])[:, 1]
        # get AUC @ val
        auc_val_it_ev = get_auc(y_val, pred_val)
        # check the gap
        gap_val_auc = auc_val_it - auc_val_it_ev
        # capture the feature that has the lowest AUC impact
        if gap_val_auc < min_gap:
            candidate_drop = feat_eval
            min_gap = gap_val_auc
    # remove the selected feature from the feature set
    glm_cols_auc.remove(candidate_drop)
    # keep the features status at this iteration
    features_it_auc.append((it, [c for c in glm_cols_auc]))

Working for iteration: 0
Working for iteration: 1
...
Working for iteration: 49
Plot AUCs in each iteration

iterations = [i for i in range(N_iterations)]


plt.plot(iterations, auc_train, label='Train')
plt.plot(iterations, auc_val, label='Val')
plt.plot(iterations, auc_oot, label='OOT')
plt.legend()

<matplotlib.legend.Legend at 0x7e25b48577c0>

# get the iteration where it stays stable


final_features_bucketing_auc = [feats for it, feats in features_it_auc if it==30][0]
final_features_bucketing_auc

['ProsperPaymentsLessThanOneMonthLate_NA',
'BorrowerState_Other',
'DebtToIncomeRatio_0.3|10',
'DebtToIncomeRatio_NA',
'InquiriesLast6Months_1|2',
'InquiriesLast6Months_2|27',
'IsBorrowerHomeowner_True',
'IncomeRange_$25,000-49,999',
'IncomeRange_$50,000-74,999',
'IncomeRange_$75,000-99,999',
'IncomeRange_Other',
'BankcardUtilization_0.2|0.6',
'BankcardUtilization_0.6|0.8',
'BankcardUtilization_0.8|1.8',
'ScorexChangeAtTimeOfListing_-43|-13',
'ScorexChangeAtTimeOfListing_-13|16',
'ScorexChangeAtTimeOfListing_16|286',
'ProsperPrincipalOutstanding_1495|3979',
'ProsperPrincipalOutstanding_3979|22895',
'TradesOpenedLast6Months_1|20',
'CreditScoreRangeLower_700|740',
'CreditScoreRangeLower_740|880']

Train the final model with the interesting features

lr_final_auc = LogisticRegression(random_state = random_seed)


lr_final_auc.fit(df_tr_preproc[final_features_bucketing_auc], y_tr)

LogisticRegression(random_state=42)

pred_tr = lr_final_auc.predict_proba(df_tr_preproc[final_features_bucketing_auc])[:, 1]
pred_val = lr_final_auc.predict_proba(df_val_preproc[final_features_bucketing_auc])[:, 1]
pred_oot = lr_final_auc.predict_proba(df_oot_preproc[final_features_bucketing_auc])[:, 1]
get_auc(y_tr, pred_tr), get_auc(y_val, pred_val), get_auc(y_oot, pred_oot)

(0.687070513312058, 0.6992535003586704, 0.653453761810746)

fpr_tr, tpr_tr, roc_auc_tr = get_auc_to_plot(y_tr, pred_tr)


fpr_val, tpr_val, roc_auc_val = get_auc_to_plot(y_val, pred_val)
fpr_oot, tpr_oot, roc_auc_oot = get_auc_to_plot(y_oot, pred_oot)

print("Plot curve for train")


plot_roc_curve(fpr_tr, tpr_tr, roc_auc_tr)
print("Plot curve for validation")
plot_roc_curve(fpr_val, tpr_val, roc_auc_val)
print("Plot curve for OOT")
plot_roc_curve(fpr_oot, tpr_oot, roc_auc_oot)
Plot curve for train

Plot curve for validation

Plot curve for OOT

Gain Table

def get_gain_table_cami(pred, df, col_target='bad', n_buckets=10):
    """Generate the gain table given a population and its predictions

    Args:
        pred: np.array / pd.Series containing predictions
        df: Pandas DataFrame containing the population to be assessed
        col_target: Name of the target column
        n_buckets: Number of buckets for the gain table

    Returns:
        Pandas DataFrame representing the gain table
    """
    df['pred'] = pred
    df['pred_cut'] = pd.cut(df['pred'], df['pred'].quantile(np.linspace(0, 1, num=n_buckets + 1)), include_lowest=True)

    # NOTE: this groupby call was truncated in the source; a sum() per prediction bucket is assumed here
    gain_table = df[['pred_cut', 'population', col_target]].rename(columns={'population': 'N_population', col_target: 'N_bad'}).groupby('pred_cut').sum()

    gain_table['N_goods'] = gain_table['N_population'] - gain_table['N_bad']
    gain_table['BR'] = gain_table['N_bad'] / gain_table['N_population']
    gain_table['pct_bad_acum'] = 100. * gain_table['N_bad'].cumsum() / gain_table['N_bad'].sum()
    gain_table['pct_approv_acum'] = 100. * gain_table['N_population'].cumsum() / gain_table['N_population'].sum()

    return gain_table

print("Let's check the Gain Table on the OOT population:")


get_gain_table_cami(pred_oot, df_oot)

Let's check the Gain Table on the OOT population:


N_population N_bad N_goods BR pct_bad_acum pct_approv_acum

pred_cut

(0.00717, 0.0305] 770 12 758 0.015584 1.973684 10.001299

(0.0305, 0.0421] 788 30 758 0.038071 6.907895 20.236394

(0.0421, 0.0535] 753 45 708 0.059761 14.309211 30.016885

(0.0535, 0.0638] 769 47 722 0.061118 22.039474 40.005195

(0.0638, 0.0777] 791 52 739 0.065740 30.592105 50.279257

(0.0777, 0.0914] 761 53 708 0.069645 39.309211 60.163658

(0.0914, 0.108] 759 77 682 0.101449 51.973684 70.022081

(0.108, 0.133] 770 78 692 0.101299 64.802632 80.023380

(0.133, 0.172] 772 95 677 0.123057 80.427632 90.050656

(0.172, 0.516] 766 119 647 0.155352 100.000000 100.000000

Business Case:

Once we have established the cut-off:


# gain table for the training sample
get_gain_table_cami(pred_tr, df_train)

ENSEMBLE METHODS

Ensemble methods combine several estimators to produce better predictive performance: they group weak learners into a powerful learner. There are different techniques to build ensemble methods:

- Bagging
- Boosting

BAGGING VS BOOSTING

BOOSTING

In each iteration the algorithm assigns a higher weight to the instances that were wrongly classified previously.

In the first iteration all instance weights are the same.

ALGORITHMS:

- AdaBoost
- Gradient Boosted Tree
- Extreme Gradient Boosting (XGBoost)

BAGGING

- Create k bootstrap samples D1, ..., Dk
- Train an estimator on each Di
- Classify new instances by majority vote
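A minimal sketch contrasting the two families on the breast cancer data loaded earlier (illustrative scikit-learn defaults, not the model trained below):

# Hedged sketch: bagging (bootstrap + majority vote) vs boosting (reweighting errors)
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=42)
boost = AdaBoostClassifier(n_estimators=50, random_state=42)
bag.fit(breast.data, breast.target)
boost.fit(breast.data, breast.target)
print(bag.score(breast.data, breast.target), boost.score(breast.data, breast.target))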

Random Forest


Random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest
spits out a class prediction and the class with the most votes becomes our model’s prediction.
The fundamental concept behind random forest is:

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models

The main advantages of random forests are:

To be one of the most accurate learning algorithms available


Running efficiently in large databases
Handle hundreds of input variables without excluding any
Give estimates of which variables are important in classification

Note: RF also requires lighter preprocessing.

-> we have decided to fill NAs with a value lower than the feature's minimum

dtypes = df_train[final_features].dtypes
num_feats = dtypes[dtypes != object].index.values.tolist()
cat_feats = dtypes[dtypes == object].index.values.tolist()

def get_nafill_rf_num(df, num_features, gap_min=1e6):
    """Get a dictionary that stores the value to be used
    to fill NAs in numeric data.

    Args:
        df: Pandas DataFrame with the input data
        num_features: List with the names of the numeric features
        gap_min: Gap between the minimum value and the filling value

    Returns:
        Dictionary with the following structure:
        {feature1: fill_val1,
         feature2: fill_val2}
    """
    dict_fill = {}
    for num_feat in num_features:
        dict_fill[num_feat] = df[num_feat].min() - gap_min
    return dict_fill

def apply_nafill_rf_num(df, dict_fillrf):
    """Given a dictionary with the values to be used in NA filling,
    use it to fill NAs.

    Args:
        df: Pandas DataFrame with the input data
        dict_fillrf: Dictionary that stores the filling values

    Returns:
        Pandas DataFrame with NAs filled
    """
    df_out = df.copy()
    for num_feat in dict_fillrf.keys():
        df_out.loc[df_out[num_feat].isna(), num_feat] = dict_fillrf[num_feat]
    return df_out

#get dictionary to fill values from train


dict_nafill = get_nafill_rf_num(df_train, num_feats)
#apply it to train, val and oot
df_train_fill = apply_nafill_rf_num(df_train, dict_nafill)
df_val_fill = apply_nafill_rf_num(df_val, dict_nafill)
df_oot_fill = apply_nafill_rf_num(df_oot, dict_nafill)

# Keep each feature's dummy columns in independent lists


# Initialize with the numeric data already filled
list_df_tr, list_df_val, list_df_oot = [df_train_fill[num_feats]], [df_val_fill[num_feats]], [df_oot_fill[num_feats]]
# Add the bucketing results of categorical data
for feat in cat_feats:
    list_df_tr.append(apply_bucketing(df_train, feat, dict_bucketing[feat]))
    list_df_val.append(apply_bucketing(df_val, feat, dict_bucketing[feat]))
    list_df_oot.append(apply_bucketing(df_oot, feat, dict_bucketing[feat]))

# Then combine them column-wise


df_tr_preproc_rf = pd.concat(list_df_tr, axis=1)
df_val_preproc_rf = pd.concat(list_df_val, axis=1)
df_oot_preproc_rf = pd.concat(list_df_oot, axis=1)

# Keep the final column names


keep_cols_rf = df_tr_preproc_rf.columns

Train RF

# Train a RF-Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500,
                            min_samples_leaf=300,
                            max_depth=4,
                            n_jobs=4,
                            random_state=random_seed)
rf.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)

RandomForestClassifier(max_depth=4, min_samples_leaf=300, n_estimators=500, n_jobs=4, random_state=42)

pred_tr = rf.predict_proba(df_tr_preproc_rf[keep_cols_rf])[:, 1]
pred_val = rf.predict_proba(df_val_preproc_rf[keep_cols_rf])[:, 1]
pred_oot = rf.predict_proba(df_oot_preproc_rf[keep_cols_rf])[:, 1]

get_auc(y_tr, pred_tr), get_auc(y_val, pred_val), get_auc(y_oot, pred_oot)

(0.6980729493418506, 0.6942214856930312, 0.6529228349130477)

Show Feature importance


imp_df = pd.DataFrame({'feature': keep_cols_rf,
'importance': rf.feature_importances_})
print('Number of features: {}'.format(len(imp_df)))
imp_df.sort_values(by='importance', ascending=False)

Number of features: 31
feature importance

13 StatedMonthlyIncome 0.217714

4 InquiriesLast6Months 0.100930

14 CreditScoreRangeLower 0.088638

11 OpenRevolvingMonthlyPayment 0.085783

3 DebtToIncomeRatio 0.080745

6 IncomeVerifiable 0.057421

8 ScorexChangeAtTimeOfListing 0.055299

1 OpenCreditLines 0.053102

9 OnTimeProsperPayments 0.044490

27 IncomeRange_Other 0.043362

12 TradesOpenedLast6Months 0.040326

7 BankcardUtilization 0.035877

2 TotalTrades 0.025807

24 IncomeRange_$25,000-49,999 0.021331

10 ProsperPrincipalOutstanding 0.009299

16 ProsperPrincipalBorrowed 0.007930

0 ProsperPaymentsLessThanOneMonthLate 0.007754

22 EmploymentStatus_Other 0.006109

26 IncomeRange_$75,000-99,999 0.004662

15 TotalProsperLoans 0.003512

23 EmploymentStatus_Self-employed 0.003022

5 IsBorrowerHomeowner 0.002891

25 IncomeRange_$50,000-74,999 0.001782

29 Occupation_Other 0.000935

19 BorrowerState_Other 0.000732

30 Occupation_Professional 0.000272

17 BorrowerState_FL 0.000159

28 Occupation_Computer Programmer 0.000091

20 BorrowerState_TX 0.000024

21 EmploymentStatus_Not employed 0.000000

18 BorrowerState_NY 0.000000

Ways to find the best parameters for a RF Classifier:

- Manual search
- Grid search
- Random search

GridSearchCV is a scikit-learn class that allows us to systematically evaluate and select the parameters of a model. By indicating a model and
the parameters to be tested, we can evaluate the performance of the model as a function of the parameters by means of cross validation. In
case we want to evaluate models with random parameters, there is the RandomizedSearchCV method (see the sketch after the grid search results below).

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

cv_params = {'n_estimators': [400, 450, 500, 550, 600],  # the number of trees in the forest
             'max_depth': [3, 4, 5],                     # the maximum depth of the tree
             'min_samples_leaf': [200, 300, 400],        # the minimum number of samples required at a leaf node
             'max_features': [15, 20]                    # the number of features to consider when looking for the best split
             }

optimized_RF_GS = GridSearchCV(RandomForestClassifier(random_state=random_seed),  # estimator: the model to be evaluated
                               cv_params,          # param_grid: a dictionary indicating the parameters to be evaluated
                               scoring="roc_auc",
                               cv=5,               # cv: the number of folds into which the data are divided for cross validation
                               n_jobs=8)           # n_jobs: number of jobs to run in parallel; -1 means using all processors
optimized_RF_GS.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)

### Parameters for best model: ###


print('BEST PARAMETERS for RF Classifier:')
optimized_RF_GS.best_params_

pred_tr_GS = optimized_RF_GS.predict_proba(df_tr_preproc_rf[keep_cols_rf])[:, 1]
pred_val_GS = optimized_RF_GS.predict_proba(df_val_preproc_rf[keep_cols_rf])[:, 1]
pred_oot_GS = optimized_RF_GS.predict_proba(df_oot_preproc_rf[keep_cols_rf])[:, 1]

get_auc(y_tr, pred_tr_GS), get_auc(y_val, pred_val_GS), get_auc(y_oot, pred_oot_GS)
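Since RandomizedSearchCV was mentioned above, here is a hedged sketch of the random-search alternative (hypothetical distributions; it samples n_iter parameter combinations instead of exhausting the grid):

# Illustrative sketch: random search over parameter distributions
from scipy.stats import randint

rand_params = {'n_estimators': randint(400, 601),
               'max_depth': randint(3, 6),
               'min_samples_leaf': randint(200, 401)}
optimized_RF_RS = RandomizedSearchCV(RandomForestClassifier(random_state=random_seed),
                                     rand_params, n_iter=10, scoring='roc_auc',
                                     cv=5, n_jobs=8, random_state=random_seed)
optimized_RF_RS.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)
print(optimized_RF_RS.best_params_)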

Other ways to measure model performance

- Precision: proportion of correct positive classifications (true positives) among the cases predicted as positive.
- Recall: proportion of correct positive classifications (true positives) among the cases that are actually positive.

Especially important with highly imbalanced samples!
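A toy sketch of how both metrics fall out of the confusion matrix:

# Toy example: precision = TP/(TP+FP), recall = TP/(TP+FN)
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print('precision:', tp / (tp + fp))  # 2/3: of the 3 predicted positives, 2 are real
print('recall:   ', tp / (tp + fn))  # 2/3: of the 3 actual positives, 2 are caught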


lr_final_auc.predict_proba(df_oot_preproc[final_features_bucketing_auc])

array([[0.84215984, 0.15784016],
[0.98686507, 0.01313493],
[0.89972572, 0.10027428],
...,
[0.86458347, 0.13541653],
[0.76225236, 0.23774764],
[0.97579959, 0.02420041]])

lr_final_auc.predict_proba(df_oot_preproc[final_features_bucketing_auc])[:, 1].mean()

0.09165386756419946

from sklearn.metrics import recall_score, precision_score, accuracy_score

model_tx = 'LR' #'RF'


model = lr_final_auc.predict_proba(df_oot_preproc[final_features_bucketing_auc])[:, 1]
#model = rf.predict_proba(df_oot_preproc_rf[keep_cols_rf])[:, 1]
thresholds = [0.0913]

for threshold in thresholds:
    y_pred_OOT = model > threshold

    print(model)
    print(y_pred_OOT)
    print(y_oot)
    print('The precision for the {} model with {} threshold is: {}'.format(model_tx, threshold, precision_score(y_oot, y_pred_OOT)))
    print('The recall for the {} model with {} threshold is: {}'.format(model_tx, threshold, recall_score(y_oot, y_pred_OOT)))

[0.15784016 0.01313493 0.10027428 ... 0.13541653 0.23774764 0.02420041]


[ True False True ... True True False]
0 False
1 False
2 False
3 False
4 False
...
7694 False
7695 False
7696 False
7697 False
7698 False
Name: bad, Length: 7699, dtype: bool
The precision for the LR model with 0.0913 threshold is: 0.12005149662053428
The recall for the LR model with 0.0913 threshold is: 0.6134868421052632

Selecting the best model: Explainability

The difference between explainability and interpretability:

- Interpretability is about the extent to which a cause and effect can be observed within a system, that is, the extent to which you can predict what will happen given a change in input or parameters.
- Explainability is the extent to which the internal mechanics of a machine / deep learning system can be explained in human terms, that is, explaining why something is happening.
