3 - Modeling.ipynb - Colaboratory
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Intro:
In this notebook we are going to walk through two different modeling techniques:
1) Logistic Regression
2) Random Forest
In the next step we load the objects we saved in the previous notebook: the IV, PSI and correlation information for each of the features.
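The loading cell itself is not shown in this export; a minimal sketch, assuming the objects were pickled to Drive in the previous notebook (the path, file name and dictionary keys below are hypothetical):

import pickle

# hypothetical location; adjust to wherever the previous notebook saved its outputs
with open('/content/drive/MyDrive/credit_modeling/feature_stats.pkl', 'rb') as f:
    saved = pickle.load(f)

df_iv = saved['iv']        # Information Value per feature
df_psi = saved['psi']      # Population Stability Index per feature
corr_data = saved['corr']  # correlation tuples per feature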
In practice, the curse of dimensionality implies that, given a fixed number of examples, there is a maximum number of attributes beyond which the performance of our classifier degrades rather than improves.
Techniques to reduce dimensionality:
For this exercise: Information Value, Population Stability Index & Correlation criteria.
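As a reminder of how these two statistics are usually computed (a minimal sketch, not the exact implementation from the previous notebook; psi_sketch and iv_sketch are illustrative helper names, and inputs are assumed to be pandas Series from the same DataFrame):

import numpy as np
import pandas as pd

def psi_sketch(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a current sample of one feature."""
    cuts = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    e = pd.cut(pd.Series(expected), cuts, include_lowest=True).value_counts(normalize=True, sort=False)
    a = pd.cut(pd.Series(actual), cuts, include_lowest=True).value_counts(normalize=True, sort=False)
    e, a = e.clip(lower=1e-6), a.clip(lower=1e-6)   # avoid log(0) in empty buckets
    return float(((a - e) * np.log(a / e)).sum())

def iv_sketch(feature_bucket, target):
    """Information Value of a bucketed feature against a binary target."""
    tab = pd.crosstab(feature_bucket, pd.Series(target).astype(int))
    dist_good = (tab[0] / tab[0].sum()).clip(lower=1e-6)
    dist_bad = (tab[1] / tab[1].sum()).clip(lower=1e-6)
    return float(((dist_good - dist_bad) * np.log(dist_good / dist_bad)).sum())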
-Feature extraction (projection) methods: the new features come from a transformation of the original ones.
The goal is to find a transformation y = f(x) that preserves the information about the problem while minimizing the number of components.
The goal of LDA is to reduce dimensionality, preserving as much discriminatory information as possible while maximizing separation between
classes. LDA reduces the dimensional space to at most C-1 dimensions, where C is the number of classes of the target.
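LDA is not used later in this notebook, but a minimal scikit-learn sketch (on the same breast-cancer data used for the PCA demo below) illustrates the C-1 reduction:

from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# binary target -> LDA keeps at most C-1 = 1 discriminant component
X, y = load_breast_cancer(return_X_y=True)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print(X_lda.shape)  # (569, 1)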
PCA, however, seeks to compress the information in the data regardless of the class (target). It builds the most relevant components (factors) from the original variables.
PCA makes sense to apply when there are high correlations between the variables (an indication that there is redundant information); as a consequence, a few factors will explain much of the total variability.
Beware: PCA is sensitive to the scale on which the variables are expressed, so variables may need to be normalized.
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
breast_data = breast.data
breast_labels = breast.target
labels = np.reshape(breast_labels,(569,1))
final_breast_data = np.concatenate([breast_data,labels],axis=1)
breast_dataset = pd.DataFrame(final_breast_data)
features = breast.feature_names
features_labels = np.append(features,'label')
breast_dataset.columns = features_labels
breast_dataset.head()
[breast_dataset.head() output: columns 'mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', ...; 5 rows × 31 columns]
from sklearn.preprocessing import StandardScaler

x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x) # normalizing the features
feat_cols = ['feature'+str(i) for i in range(x.shape[1])]
normalised_breast = pd.DataFrame(x,columns=feat_cols)
# In load_breast_cancer, target 0 corresponds to 'malignant' and 1 to 'benign' (see breast.target_names),
# so we map the 0/1 labels to readable names with the .replace function
breast_dataset['label'].replace(0, 'Malignant', inplace=True)
breast_dataset['label'].replace(1, 'Benign', inplace=True)
from sklearn.decomposition import PCA

pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)
plt.legend(targets,prop={'size': 15})
For this exercise: Feature Selection Criteria based on IV, PSI & Correlation
corr_data[0][2].values[0]
1.0
#run through the iv ranking, and drop features if they are correlated with any feature with better ranking
feats_sorted = df_iv.feature.values.tolist()

def get_uncorr_feats(corr_data, feats_sorted):
    """Keeps only the features that are not correlated with a better-ranked feature.

    Args:
        corr_data: List of tuples containing the correlation info
        feats_sorted: List with the candidate features, sorted by IV ranking
    Returns:
        List with the features that are not correlated with a better-ranked feature
    """
    features_keep = feats_sorted[:1]
    for feat in feats_sorted[1:]:
        # capture the correlation tuple
        crr_data = [crr for crr in corr_data if crr[0] == feat]  # if feat has correlation
        if len(crr_data):
            # if there is a 'hit' with a feature in features_keep, do not include it
            hit = len(set(crr_data[0][2].index.tolist()) & set(features_keep)) > 0
            if hit:
                print('Drop: ' + feat)
            else:
                features_keep.append(feat)
        else:
            features_keep.append(feat)
    return features_keep
print ('We are dropping the following features due to the high correlation with others:\n')
features_keep = get_uncorr_feats(corr_data, feats_sorted)
We are dropping the following features due to the high correlation with others:
Drop: CreditScoreRangeUpper
Drop: TotalProsperPaymentsBilled
Drop: LoanOriginalAmount
Drop: CurrentCreditLines
Drop: OpenRevolvingAccounts
Drop: TotalCreditLinespast7years
print ('We are dropping the following features due to the poor IV (<{}):'.format(TH_IV))
low_iv_feats
print ('We are dropping the following features due to the high PSI (>{}):'.format(TH_PSI))
high_psi_features
We are dropping the following features due to the high PSI (>0.25):
['LoanOriginalAmount',
'MonthlyLoanPayment',
'Term',
'ListingCategory (numeric)']
Note: final_features will contain our final set of features used to model the target.
final_features = features_keep_psi
Statistical data binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have
data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping every five
years together).
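For instance, the age grouping mentioned above can be done directly with pandas (a toy illustration, unrelated to the credit dataset):

import pandas as pd

ages = pd.Series([23, 27, 31, 34, 38, 42, 45])
# five-year bins: (20, 25], (25, 30], ..., (45, 50]
age_buckets = pd.cut(ages, bins=range(20, 51, 5))
print(age_buckets.value_counts(sort=False))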
We have created a function that, given a number n_bins (function parameter), computes that many buckets based on percentiles for each of the continuous features.
We have created a function that, given a number n_bins (function parameter), computes that many buckets based on the population distribution for each of the categorical features.
Our recommendation is that bucketization should remain, to some extent, a manual task, without ever losing sight of the business meaning of each feature.
Function that defines buckets for numerical features, with the number of buckets we want; it also creates the dummies.
def get_bucket_numfeature(df, feat_col, n_bins, input_slider=(0., 100.)):
    """Cuts a numeric feature in 'n_bins', balancing data in percentiles.

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        n_bins: Number of cuts expected
        input_slider: Range (in percentiles) considered for the bucketing
    Returns:
        List with the cuts corresponding to this feature
    """
    # get the numeric input from the dual slider
    perc_sliders = [v / 100. for v in input_slider]
    var_lims = df[feat_col].quantile([perc_sliders[0], perc_sliders[1]]).values
    v_min, v_max = var_lims[0], var_lims[1]
    # filter the dataset using the slider input
    df_cut = df.loc[(df[feat_col] <= v_max) & (df[feat_col] >= v_min)][[feat_col]]
    cuts = df_cut[feat_col].quantile(np.linspace(perc_sliders[0], perc_sliders[1], n_bins + 1)).values.tolist()
    cuts = sorted(list(set(cuts)))
    return cuts
def format_dummy_col(feat_col, dummy_col):
    """Formats the name of a dummy column.

    Note: the 'def' line and the return were truncated in the export; the
    function name and the final 'return out' are assumed here.

    Args:
        feat_col: Name of the column with the input feature
        dummy_col: String of the dummy column
    Returns:
        Dummy column with better formatting
    """
    out = dummy_col.replace("(", "")\
                   .replace("]", "")\
                   .replace(".0", "")\
                   .replace(", ", "|")
    return out
def apply_bucketing_num(df, feat_col, cuts):
    """Applies the learnt numerical bucketing, returning dummy columns.

    Args:
        df: Pandas Dataframe with the input data
        feat_col: Name of the column with the input feature
        cuts: Cuts that will be applied to the input data
    Returns:
        Pandas dataframe with dummy columns
    """
    cut_col = '{}_cut'.format(feat_col)
    if len(cuts) == 2:
        cuts = [cuts[0], np.mean(cuts), cuts[1]]
    # the step that builds 'dummies_df' was truncated in the export;
    # a minimal reconstruction assuming standard pandas calls:
    df[cut_col] = pd.cut(df[feat_col], bins=cuts, include_lowest=True).astype(str)
    dummies_df = pd.get_dummies(df[cut_col], prefix=feat_col)
    return dummies_df
Function that defines buckets for categorical features, with the number of buckets we want; it also creates the dummies.
def get_bucket_catfeature(df, feat_col, n_bins):
    """Cuts a categorical feature in 'n_bins', keeping the categories with the highest volume.

    Args:
        df: Pandas DataFrame with the input data
        feat_col: Name of the column with the input feature
        n_bins: Number of cuts expected
    Returns:
        List with the cuts corresponding to this feature
    """
    cuts = df.groupby(feat_col)[feat_col].count().sort_values(ascending=False)[:int(n_bins)].index.values.tolist()
    return cuts
def apply_bucketing_cat(df, feat_col, cuts):
    """Applies the learnt categorical bucketing, returning dummy columns.

    Args:
        df: Pandas Dataframe with the input data
        feat_col: Name of the column with the input feature
        cuts: Cuts that will be applied to the input data
    Returns:
        Pandas dataframe with dummy columns
    """
    cut_col = '{}_cut'.format(feat_col)
    df[cut_col] = df[feat_col]
    df.loc[~df[cut_col].isin(cuts), cut_col] = 'Other'
    if df[cut_col].isna().any():
        df[cut_col] = df[cut_col].fillna("NA")
    # the step that builds 'dummies_df' was truncated in the export;
    # a minimal reconstruction assuming pd.get_dummies:
    dummies_df = pd.get_dummies(df[cut_col], prefix=feat_col)
    return dummies_df
We define one function to bucket every feature in a DataFrame, whether it is a numerical feature or a categorical one.
def get_bucket_feature(df, feat_col, n_bins):
    """Learns the bucketing of a feature, numerical or categorical.

    Args:
        df: Pandas Dataframe with the input data
        feat_col: Name of the column with the input feature
        n_bins: Number of cuts that will be applied to the input data
    Returns:
        List with the cuts learned from the data
    """
    if (df[feat_col].dtypes == object) | (df[feat_col].dtypes == bool):
        cuts = get_bucket_catfeature(df, feat_col, n_bins)
    else:
        cuts = get_bucket_numfeature(df, feat_col, n_bins)
    return cuts
def get_bucketing_dict(df, features, n_bins):
    """Learns the bucketing for a list of features.

    Note: the 'def' line was truncated in the export; the function name is assumed here.

    Args:
        df: Pandas Dataframe with the input data
        features: Features whose bucketing will be learnt
        n_bins: Number of cuts that will be applied to the input data
    Returns:
        Dict containing all features and their corresponding
        bucketing. For example:
            {'feature1': cuts1,
             'feature2': cuts2}
    """
    out_dict = {}
    for feature in features:
        cuts = get_bucket_feature(df, feature, n_bins)
        out_dict[feature] = cuts
    return out_dict
We execute the function that bins each of the variables and store the bucket definitions in a dictionary.
dict_bucketing
def apply_bucketing(df, feat_col, cuts):
    """Applies the learnt bucketing to a feature, numerical or categorical.

    Note: the 'def' line was truncated in the export; the function name is assumed here.

    Args:
        df: Pandas Dataframe with the input data
        feat_col: Name of the column with the input feature
        cuts: Cuts that will be applied to the input data
    Returns:
        Pandas DataFrame with dummy columns
    """
    if (df[feat_col].dtypes == object) | (df[feat_col].dtypes == bool):
        df_buck = apply_bucketing_cat(df, feat_col, cuts)
    else:
        df_buck = apply_bucketing_num(df, feat_col, cuts)
    return df_buck
We apply the bucketing to all features of every DataFrame (train, validation and OOT).
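The cell that actually builds df_tr_preproc, df_val_preproc and df_oot_preproc is not fully visible in this export; a minimal sketch of what it presumably does (preprocess_df and the df_val / df_oot names are assumptions), concatenating the per-feature dummy blocks from apply_bucketing as reconstructed above:

def preprocess_df(df, dict_bucketing):
    # apply the learnt bucketing feature by feature and join the resulting dummy blocks
    dummy_blocks = [apply_bucketing(df, feat, cuts) for feat, cuts in dict_bucketing.items()]
    return pd.concat(dummy_blocks, axis=1)

df_tr_preproc = preprocess_df(df_train, dict_bucketing)
df_val_preproc = preprocess_df(df_val, dict_bucketing)
df_oot_preproc = preprocess_df(df_oot, dict_bucketing)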
[Preview of the resulting dummy DataFrame for the training sample: rows 0-4 and 20700-20704 shown, with 0/1 indicator columns; the column headers were lost in the export.]
corr_buckets = [(feat, n, feats_corr) for n, feats_corr, feat in zip(n_corr_list, corr_feats_list, orig_features) if n>0]
print ('We are dropping the following buckets due to the high correlation with others:\n')
glm_cols = get_uncorr_feats(corr_buckets, orig_features)
We are dropping the following buckets due to the high correlation with others:
Drop: IncomeVerifiable_True
Drop: ScorexChangeAtTimeOfListing_NA
Drop: OnTimeProsperPayments_NA
Drop: ProsperPrincipalOutstanding_NA
Drop: Occupation_Professional
Drop: TotalProsperLoans_NA
Drop: ProsperPrincipalBorrowed_NA
LogisticRegression(random_state=42)
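The training cell itself is not shown in the export; from the estimator repr above and the call below, it presumably looks like this:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)
lr.fit(df_tr_preproc[glm_cols], y_tr)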
lr.predict_proba(df_tr_preproc[glm_cols])[:,1]
y_tr
0 False
1 False
2 False
3 False
4 False
...
20700 False
20701 False
20702 False
20703 False
20704 False
Name: bad, Length: 20705, dtype: bool
Number of features: 53
Conclusions:
Considering:
This is an application model, with a limited number of variables, much of it information declared by the applicant.
We have had to drop variables because they are Prosper's own engineered ("pre-cooked") variables, and we have not built any engineered variables of our own.
The results of the model are somewhat modest, but sufficient. If we wanted to improve the model, we could start by improving the binning (we do not do it here due to lack of time, but it is a good tool to improve the predictive capacity of the model).
In any case, we still have too many variables; in the next steps we will try to reduce them.
Feature Selection:
Steps:
Analyze the results of the model on the 3 samples (train/oos/oot) in each of the iterations:
['ProsperPaymentsLessThanOneMonthLate_NA',
'OpenCreditLines_5|8',
'OpenCreditLines_8|11',
'OpenCreditLines_11|40',
'DebtToIncomeRatio_0.2|0.3',
'DebtToIncomeRatio_0.3|10',
'DebtToIncomeRatio_NA',
'InquiriesLast6Months_1|2',
'InquiriesLast6Months_2|27',
'BankcardUtilization_0.2|0.6',
'BankcardUtilization_0.6|0.8',
'OnTimeProsperPayments_34|110',
'ProsperPrincipalOutstanding_1495|3979',
'ProsperPrincipalOutstanding_3979|22895',
'OpenRevolvingMonthlyPayment_99|238',
'TradesOpenedLast6Months_1|20',
'StatedMonthlyIncome_3167|4604',
'StatedMonthlyIncome_4604|6875',
'StatedMonthlyIncome_6875|466667',
'CreditScoreRangeLower_660|700',
'CreditScoreRangeLower_700|740',
'CreditScoreRangeLower_740|880']
LogisticRegression(random_state=42)
pred_tr = lr_final.predict_proba(df_tr_preproc[final_features_bucketing])[:, 1]
pred_val = lr_final.predict_proba(df_val_preproc[final_features_bucketing])[:, 1]
pred_oot = lr_final.predict_proba(df_oot_preproc[final_features_bucketing])[:, 1]
['ProsperPaymentsLessThanOneMonthLate_NA',
'BorrowerState_Other',
'DebtToIncomeRatio_0.3|10',
'DebtToIncomeRatio_NA',
'InquiriesLast6Months_1|2',
'InquiriesLast6Months_2|27',
'IsBorrowerHomeowner_True',
'IncomeRange_$25,000-49,999',
'IncomeRange_$50,000-74,999',
'IncomeRange_$75,000-99,999',
'IncomeRange_Other',
'BankcardUtilization_0.2|0.6',
'BankcardUtilization_0.6|0.8',
'BankcardUtilization_0.8|1.8',
'ScorexChangeAtTimeOfListing_-43|-13',
'ScorexChangeAtTimeOfListing_-13|16',
'ScorexChangeAtTimeOfListing_16|286',
'ProsperPrincipalOutstanding_1495|3979',
'ProsperPrincipalOutstanding_3979|22895',
'TradesOpenedLast6Months_1|20',
'CreditScoreRangeLower_700|740',
'CreditScoreRangeLower_740|880']
LogisticRegression(random_state=42)
pred_tr = lr_final_auc.predict_proba(df_tr_preproc[final_features_bucketing_auc])[:, 1]
pred_val = lr_final_auc.predict_proba(df_val_preproc[final_features_bucketing_auc])[:, 1]
pred_oot = lr_final_auc.predict_proba(df_oot_preproc[final_features_bucketing_auc])[:, 1]
get_auc(y_tr, pred_tr), get_auc(y_val, pred_val), get_auc(y_oot, pred_oot)
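The helper get_auc is not defined in this export; a minimal sketch, assuming it is a thin wrapper around scikit-learn's ROC AUC:

from sklearn.metrics import roc_auc_score

def get_auc(y_true, y_pred):
    """AUC of the ROC curve for a binary target and predicted probabilities."""
    return roc_auc_score(y_true, y_pred)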
def get_gain_table(pred, df, col_target='bad', n_buckets=10):
    """Builds a gain table from the predictions and the target.

    Note: the 'def' line was truncated in the export; the default values are assumed here.

    Args:
        pred: np.array / pd.Series containing predictions
        df: Pandas DataFrame containing the population to be assessed
        col_target: Name of the target column
        n_buckets: Number of buckets for the gain table
    Returns:
        Pandas DataFrame representing the gain table
    """
    df['pred'] = pred
    df['pred_cut'] = pd.cut(df['pred'], df['pred'].quantile(np.linspace(0, 1, num=n_buckets + 1)), include_lowest=True)
    # the aggregation that builds 'gain_table' was truncated in the export;
    # a minimal reconstruction assuming a volume / bad-rate summary per bucket:
    gain_table = df.groupby('pred_cut')[col_target].agg(['count', 'mean'])
    return gain_table
Business Case:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-51-27db69059413> in <cell line: 2>()
1 # gain table for training sample
----> 2 get_gain_table(pred_tr, df_train)
- Bagging
- Boosting
BOOSTING
In each iteration the algorithm assigns a higher weight to the instances that were wrongly classified in the previous iterations.
AdaBoost
Gradient Boosted Tree
Extreme Gradient Boosting (XGBoost)
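Boosting is not trained in this notebook; as a hedged sketch, an AdaBoost model on the same preprocessed data used for the Random Forest below might look like this (hyperparameters are illustrative):

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
ada.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)   # reusing the RF preprocessing defined below
pred_val_ada = ada.predict_proba(df_val_preproc_rf[keep_cols_rf])[:, 1]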
BAGGING
- Create k bootstrap samples D1, ..., Dk and train one model on each.
A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
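A minimal bagging sketch with scikit-learn; each of the k estimators is fitted on its own bootstrap sample, and the Random Forest trained below is itself a bagging ensemble of decorrelated trees:

from sklearn.ensemble import BaggingClassifier

# k = 100 bootstrap samples D1, ..., Dk, one decision tree per sample (the default base estimator)
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42)
bag.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)   # reusing the RF preprocessing defined below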
-> We have decided to fill the NAs of each numerical feature with a value lower than its minimum.
dtypes = df_train[final_features].dtypes
num_feats = dtypes[dtypes!=object].index.values.tolist()
cat_feats = dtypes[dtypes==object].index.values.tolist()
def get_fill_dict(df, num_features, gap_min):
    """Computes a filling value for each numerical feature, below its minimum.

    Note: the 'def' line was truncated in the export; the function name is assumed here.

    Args:
        df: Pandas Dataframe with the input data
        num_features: List with the names of the numerical features
        gap_min: Gap between the minimum value and the filling value
    Returns:
        Dictionary, with the following structure:
            {feature1: fill_val1,
             feature2: fill_val2}
    """
    dict_fill = {}
    for num_feat in num_features:
        dict_fill[num_feat] = df[num_feat].min() - gap_min
    return dict_fill
def fill_nas_rf(df, dict_fillrf):
    """Fills NAs using the pre-computed filling values.

    Note: the 'def' line was truncated in the export; the function name is assumed here.

    Args:
        df: Pandas Dataframe with the input data
        dict_fillrf: Dictionary that stores the filling values
    Returns:
        Pandas Dataframe with NAs filled
    """
    df_out = df.copy()
    for num_feat in dict_fillrf.keys():
        df_out.loc[df_out[num_feat].isna(), num_feat] = dict_fillrf[num_feat]
    return df_out
Train RF
# Train a RF-Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500,
                            min_samples_leaf=300,
                            max_depth=4,
                            n_jobs=4,
                            random_state=random_seed)
rf.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)
RandomForestClassifier(max_depth=4, min_samples_leaf=300, n_estimators=500,
n_jobs=4, random_state=42)
pred_tr = rf.predict_proba(df_tr_preproc_rf[keep_cols_rf])[:, 1]
pred_val = rf.predict_proba(df_val_preproc_rf[keep_cols_rf])[:, 1]
pred_oot = rf.predict_proba(df_oot_preproc_rf[keep_cols_rf])[:, 1]
Number of features: 31
feature importance
13 StatedMonthlyIncome 0.217714
4 InquiriesLast6Months 0.100930
14 CreditScoreRangeLower 0.088638
11 OpenRevolvingMonthlyPayment 0.085783
3 DebtToIncomeRatio 0.080745
6 IncomeVerifiable 0.057421
8 ScorexChangeAtTimeOfListing 0.055299
1 OpenCreditLines 0.053102
9 OnTimeProsperPayments 0.044490
27 IncomeRange_Other 0.043362
12 TradesOpenedLast6Months 0.040326
7 BankcardUtilization 0.035877
2 TotalTrades 0.025807
24 IncomeRange_$25,000-49,999 0.021331
10 ProsperPrincipalOutstanding 0.009299
16 ProsperPrincipalBorrowed 0.007930
0 ProsperPaymentsLessThanOneMonthLate 0.007754
22 EmploymentStatus_Other 0.006109
26 IncomeRange_$75,000-99,999 0.004662
15 TotalProsperLoans 0.003512
23 EmploymentStatus_Self-employed 0.003022
5 IsBorrowerHomeowner 0.002891
25 IncomeRange_$50,000-74,999 0.001782
29 Occupation_Other 0.000935
19 BorrowerState_Other 0.000732
30 Occupation_Professional 0.000272
17 BorrowerState_FL 0.000159
20 BorrowerState_TX 0.000024
18 BorrowerState_NY 0.000000
GridSearchCV is a scikit-learn class that allows us to systematically evaluate and select the parameters of a model. By specifying a model and the parameters to be tested, we can evaluate the model's performance as a function of those parameters via cross-validation. If we want to evaluate models with randomly sampled parameters, there is the RandomizedSearchCV method.
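The cell that builds optimized_RF_GS is not visible in this export; a minimal sketch, with an illustrative (assumed) parameter grid:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'max_depth': [3, 4, 6],
              'min_samples_leaf': [100, 300, 500]}   # illustrative grid, not the original one
grid_search = GridSearchCV(RandomForestClassifier(n_estimators=500, n_jobs=4, random_state=random_seed),
                           param_grid, scoring='roc_auc', cv=3)
grid_search.fit(df_tr_preproc_rf[keep_cols_rf], y_tr)
optimized_RF_GS = grid_search.best_estimator_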
pred_tr_GS = optimized_RF_GS.predict_proba(df_tr_preproc_rf[keep_cols_rf])[:, 1]
pred_val_GS = optimized_RF_GS.predict_proba(df_val_preproc_rf[keep_cols_rf])[:, 1]
pred_oot_GS = optimized_RF_GS.predict_proba(df_oot_preproc_rf[keep_cols_rf])[:, 1]
array([[0.84215984, 0.15784016],
[0.98686507, 0.01313493],
[0.89972572, 0.10027428],
...,
[0.86458347, 0.13541653],
[0.76225236, 0.23774764],
[0.97579959, 0.02420041]])
lr_final_auc.predict_proba(df_oot_preproc[final_features_bucketing_auc])[:, 1].mean()
0.09165386756419946
Interpretability is about the extent to which a cause and effect can be observed within a system. That is, the extent to which you can
predict what will happen given a change in input or parameters.
Explainability is the extent to which the internal mechanics of a machine / deep learning system can be explained in human terms, that is,
explaining why something is happening.
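For example, a model-agnostic way to probe the Random Forest trained above is permutation importance (a sketch; SHAP values or partial dependence plots are common alternatives):

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, df_val_preproc_rf[keep_cols_rf], y_val,
                                scoring='roc_auc', n_repeats=5, random_state=42)
# features whose shuffling hurts validation AUC the most
for feat, imp in sorted(zip(keep_cols_rf, result.importances_mean), key=lambda t: -t[1])[:10]:
    print('{}: {:.4f}'.format(feat, imp))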