2_Data_Analysis.ipynb - Colaboratory

Uploaded by Manuel Gonzales

keyboard_arrow_down Import Libraries to be used

!pip install ydata_profiling --upgrade

Successfully installed dacite-1.8.1 htmlmin-0.1.12 imagehash-4.3.1 multimethod-1.11.1 phik-0.12.4 seaborn-0.12.2 tangled

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from google.colab import drive


drive.mount('/content/drive')

Mounted at /content/drive

keyboard_arrow_down Read data


base_path = ''
df = pd.read_excel(base_path + 'prosperLoanData_train.xlsx')
df_val = pd.read_excel(base_path + 'prosperLoanData_val.xlsx')
df_oot = pd.read_excel(base_path + 'prosperLoanData_oot.xlsx')

keyboard_arrow_down Differentiate between usable features and other columns


list(df.columns)

'OpenCreditLines',
'TotalCreditLinespast7years',
'OpenRevolvingAccounts',
'OpenRevolvingMonthlyPayment',
'InquiriesLast6Months',
'TotalInquiries',
'CurrentDelinquencies',
'AmountDelinquent',
'DelinquenciesLast7Years',
'PublicRecordsLast10Years',
'PublicRecordsLast12Months',
'RevolvingCreditBalance',
'BankcardUtilization',
'AvailableBankcardCredit',
'TotalTrades',
'TradesNeverDelinquent (percentage)',
'TradesOpenedLast6Months',
'DebtToIncomeRatio',
'IncomeRange',
'IncomeVerifiable',
'StatedMonthlyIncome',
'LoanKey',
'TotalProsperLoans',
'TotalProsperPaymentsBilled',
'OnTimeProsperPayments',
'ProsperPaymentsLessThanOneMonthLate',
'ProsperPaymentsOneMonthPlusLate',
'ProsperPrincipalBorrowed',
'ProsperPrincipalOutstanding',
'ScorexChangeAtTimeOfListing',
'LoanCurrentDaysDelinquent',
'LoanFirstDefaultedCycleNumber',
'LoanMonthsSinceOrigination',
'LoanNumber',
'LoanOriginalAmount',
'LoanOriginationDate',
'LoanOriginationQuarter',
'MemberKey',
'MonthlyLoanPayment',
'LP_CustomerPayments',
'LP_CustomerPrincipalPayments',
'LP_InterestandFees',
'LP_ServiceFees',
'LP_CollectionFees',
'LP_GrossPrincipalLoss',
'LP_NetPrincipalLoss',
'LP_NonPrincipalRecoverypayments',
'PercentFunded',
'Recommendations',
'InvestmentFromFriendsCount',
'InvestmentFromFriendsAmount',
'Investors',
'LoanOriginationYear',
'LoanMonthsSinceOriginationY',
'LoanFirstDefaultedCycleNumberQ',
'bad_aux',
'population',
'bad']

As we have already commented, different types of information can be observed:

on the one hand, information/characteristics of the application or the applicant at the time of the application
on the other hand, information on their subsequent behaviour

The first is used to model the target, and the second to define the target.

Taking a look at the dataset, we can observe:

columns that describe the application or the applicant at the time of application, which can be used as features
others that talk about the behavior of the loan (post-grant), so we cannot use them for modeling purposes
and identifiers, which in no case are characteristics of the applicant/application

keyboard_arrow_down Drop columns that cannot be used for modeling purposes
drop = ['CreditGrade', #data not filled post 2009
'BorrowerRate', #RATE: price is set after risk assessment
'LenderYield', #Yield: int rate less servicing fee
'EstimatedEffectiveYield', #prosper model generated data
'EstimatedLoss', #prosper model generated data
'EstimatedReturn', #prosper model generated data
'ProsperRating (numeric)', #prosper model generated data
'CurrentlyInGroup', #post origination date
'GroupKey', #ID
'DateCreditPulled', #drop - date when credit score was pulled
'FirstRecordedCreditLine', #need to transform from date to numeric
'LoanKey', #ID
'LoanCurrentDaysDelinquent', #part of target definition, post origination date
'LoanFirstDefaultedCycleNumber', #part of target definition, post origination date
'LoanMonthsSinceOrigination', #part of target definition, post origination date
'LoanNumber', #ID
'LoanOriginationDate', #RAW date
'LoanOriginationQuarter', #REVIEW
'MemberKey', #ID
'LP_CustomerPayments', #post origination date
'LP_CustomerPrincipalPayments', #post origination date
'LP_InterestandFees', #post origination date
'LP_ServiceFees', #post origination date
'LP_CollectionFees', #post origination date
'LP_GrossPrincipalLoss', #post origination date
'LP_NetPrincipalLoss', #post origination date
'LP_NonPrincipalRecoverypayments',#post origination date
'ListingKey', #ID
'ListingNumber', #ID
'ListingCreationDate', #Creation date
'LoanStatus', #post origination date
'ClosedDate', #post origination date
'BorrowerAPR', #price is set after risk assessment
'ProsperRating (Alpha)', #price is set after risk assessment
'ProsperScore', #prosper model generated data
'CurrentDelinquencies', #post origination date
'AmountDelinquent', #post origination date
'InvestmentFromFriendsCount', #post listing date
'InvestmentFromFriendsAmount', #post listing date
'Investors', #post listing date
'PW', #PW flag - drop
'fraud', #fraud flag - drop
'bad', #bad flag - drop
'indeterm', #indeterm flag - drop
'LoanOriginationYear', #origination date - drop
'LoanMonthsSinceOriginationY', #post origination date
'LoanFirstDefaultedCycleNumberQ', #post origination date
'bad_aux', #bad flag - drop
'population'
]

features = [c for c in df.columns if c not in drop]


col_target = 'bad'

features

['Term',
'ListingCategory (numeric)',
'BorrowerState',
'Occupation',
'EmploymentStatus',
'EmploymentStatusDuration',
'IsBorrowerHomeowner',
'CreditScoreRangeLower',
'CreditScoreRangeUpper',
'CurrentCreditLines',
'OpenCreditLines',
'TotalCreditLinespast7years',
'OpenRevolvingAccounts',
'OpenRevolvingMonthlyPayment',
'InquiriesLast6Months',
'TotalInquiries',
'DelinquenciesLast7Years',
'PublicRecordsLast10Years',
'PublicRecordsLast12Months',
'RevolvingCreditBalance',
'BankcardUtilization',
'AvailableBankcardCredit',
'TotalTrades',
'TradesNeverDelinquent (percentage)',
'TradesOpenedLast6Months',
'DebtToIncomeRatio',
'IncomeRange',
'IncomeVerifiable',
'StatedMonthlyIncome',
'TotalProsperLoans',
'TotalProsperPaymentsBilled',
'OnTimeProsperPayments',
'ProsperPaymentsLessThanOneMonthLate',
'ProsperPaymentsOneMonthPlusLate',
'ProsperPrincipalBorrowed',
'ProsperPrincipalOutstanding',
'ScorexChangeAtTimeOfListing',
'LoanOriginalAmount',
'MonthlyLoanPayment',
'PercentFunded',
'Recommendations']

keyboard_arrow_down Data exploration


We consider it very important to know and explore the data:

Type of variables (categorical/numeric). [Beware, sometimes a numerical variable may be categorical]

Feature Quality Analysis
Stability of the variables throughout the In-Time period, but also in the OOT one
Predictive power of the variables
Correlation between variables
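As an illustration of the "numerical variable that is actually categorical" caveat, low-cardinality integer columns can be flagged by counting their distinct values. This is a minimal sketch on hypothetical toy values (in this dataset, ListingCategory (numeric) encodes the loan purpose as a code, despite its int64 dtype):

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking two integer columns of df
df_demo = pd.DataFrame({
    'ListingCategory (numeric)': [7, 1, 1, 7, 3],           # purpose codes: categorical despite the int dtype
    'LoanOriginalAmount': [4000, 12500, 2000, 3000, 3500],  # a genuinely numeric amount
})
# Few distinct values relative to the number of rows hints at a categorical variable
n_unique = df_demo.select_dtypes('int').nunique()
print(n_unique)
```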

keyboard_arrow_down Let's explore first the type and meaning of the variables

Examples:

EmploymentStatusDuration -> How long the borrower has been employed (float64)

BankcardUtilization -> The % of available revolving credit that is utilized at the time of application (float64)

DebtToIncomeRatio -> The debt to income ratio of the borrower at the time of application (float64)

TotalProsperLoans -> Total number of Prosper loans prior to application (float64)

BorrowerState -> The two letter abbreviation of the state of the address of the borrower at the time of application (object)

EmploymentStatus -> current type of employment (at time of application) (object)

IncomeRange -> The income range of the borrower at the time of application (object)

print(df[features].dtypes)

Term int64
ListingCategory (numeric) int64
BorrowerState object
Occupation object
EmploymentStatus object
EmploymentStatusDuration float64
IsBorrowerHomeowner bool
CreditScoreRangeLower int64
CreditScoreRangeUpper int64
CurrentCreditLines int64
OpenCreditLines int64
TotalCreditLinespast7years int64
OpenRevolvingAccounts int64
OpenRevolvingMonthlyPayment int64
InquiriesLast6Months int64
TotalInquiries int64
DelinquenciesLast7Years int64
PublicRecordsLast10Years int64
PublicRecordsLast12Months int64
RevolvingCreditBalance int64
BankcardUtilization float64
AvailableBankcardCredit int64
TotalTrades int64
TradesNeverDelinquent (percentage) float64
TradesOpenedLast6Months int64
DebtToIncomeRatio float64
IncomeRange object
IncomeVerifiable bool
StatedMonthlyIncome float64
TotalProsperLoans float64
TotalProsperPaymentsBilled float64
OnTimeProsperPayments float64
ProsperPaymentsLessThanOneMonthLate float64
ProsperPaymentsOneMonthPlusLate float64
ProsperPrincipalBorrowed float64
ProsperPrincipalOutstanding float64
ScorexChangeAtTimeOfListing float64
LoanOriginalAmount int64
MonthlyLoanPayment float64
PercentFunded float64
Recommendations int64
dtype: object

df[df['BankcardUtilization']>0.0]['BankcardUtilization'].head(5)

0 0.49
1 0.35
2 0.43
3 0.07
4 0.21
Name: BankcardUtilization, dtype: float64

df[df['TotalProsperLoans']>0.0]['TotalProsperLoans'].head(5)

3 1.0
10 1.0
12 1.0
15 1.0
17 2.0
Name: TotalProsperLoans, dtype: float64

#capture categorical and numerical features
cat_features = [f for f in features if df[f].dtype == object]
num_features = [f for f in features if f not in cat_features]

df[num_features].describe().transpose()
count mean std min
Term 20705.0 40.521806 11.704259 12.00
ListingCategory (numeric) 20705.0 4.190292 4.793123 0.00
EmploymentStatusDuration 20704.0 95.972711 93.076051 0.00
CreditScoreRangeLower 20705.0 701.139821 50.734320 600.00
CreditScoreRangeUpper 20705.0 720.139821 50.734320 619.00
CurrentCreditLines 20705.0 9.442840 5.353991 0.00
OpenCreditLines 20705.0 8.376817 4.830963 0.00
TotalCreditLinespast7years 20705.0 26.414248 13.774322 2.00
OpenRevolvingAccounts 20705.0 6.412799 4.310276 0.00
OpenRevolvingMonthlyPayment 20705.0 368.111712 420.757114 0.00
InquiriesLast6Months 20705.0 1.143347 1.590528 0.00
TotalInquiries 20705.0 4.181067 3.820159 0.00
DelinquenciesLast7Years 20705.0 3.892055 9.393200 0.00
PublicRecordsLast10Years 20705.0 0.281816 0.639556 0.00
PublicRecordsLast12Months 20705.0 0.013620 0.132623 0.00
RevolvingCreditBalance 20705.0 16575.570635 34314.938533 0.00
BankcardUtilization 20705.0 0.523270 0.322111 0.00
AvailableBankcardCredit 20705.0 10557.018788 18451.342828 0.00
TotalTrades 20705.0 22.739483 12.056456 1.00
TradesNeverDelinquent (percentage) 20705.0 0.890583 0.129383 0.08
TradesOpenedLast6Months 20705.0 0.782130 1.088975 0.00
DebtToIncomeRatio 18456.0 0.257200 0.427301 0.00
StatedMonthlyIncome 20705.0 5657.364421 6302.240785 0.00
TotalProsperLoans 6239.0 1.416894 0.731583 1.00
TotalProsperPaymentsBilled 6239.0 23.930758 19.049833 0.00
OnTimeProsperPayments 6239.0 23.099856 18.596028 0.00
ProsperPaymentsLessThanOneMonthLate 6239.0 0.757974 2.775348 0.00
ProsperPaymentsOneMonthPlusLate 6239.0 0.072928 0.693234 0.00
ProsperPrincipalBorrowed 6239.0 7812.921452 6671.324869 1000.00
ProsperPrincipalOutstanding 6239.0 2593.053545 3305.284528 0.00
ScorexChangeAtTimeOfListing 6221.0 -11.360714 51.977507 -209.00
LoanOriginalAmount 20705.0 7396.216325 5127.312064 2000.00
MonthlyLoanPayment 20705.0 257.635214 160.282675 0.00
PercentFunded 20705.0 0.995700 0.031072 0.70
Recommendations 20705.0 0.022893 0.240864 0.00

df[cat_features].describe().transpose()

count unique top freq
BorrowerState 20705 48 CA 2449
Occupation 20705 66 Other 5728
EmploymentStatus 20705 7 Employed 16668
IncomeRange 20705 7 $25,000-49,999 6511

keyboard_arrow_down Feature Quality Analysis (QA)


Recommended tool: ydata-profiling (formerly pandas-profiling)

features_report = ['Occupation','OpenCreditLines','CurrentCreditLines','DebtToIncomeRatio','TotalProsperLoans','RevolvingCreditBalance']
profile = ProfileReport(df[features_report])
profile.to_file("output.html")

Summarize dataset: 100% 41/41 [00:06<00:00, 4.92it/s, Completed]

Generate report structure: 100% 1/1 [00:02<00:00, 2.95s/it]

Render HTML: 100% 1/1 [00:00<00:00, 1.19it/s]

Export report to file: 100% 1/1 [00:00<00:00, 58.41it/s]

profile

Overview

Dataset statistics
Number of variables            6
Number of observations         20705
Missing cells                  16715
Missing cells (%)              13.5%
Duplicate rows                 64
Duplicate rows (%)             0.3%
Total size in memory           970.7 KiB
Average record size in memory  48.0 B

Variable types
Text     1
Numeric  5

Alerts
Dataset has 64 (0.3%) duplicate rows   Duplicates
CurrentCreditLines is highly overall correlated with OpenCreditLines and 1 other fields (OpenCreditLines, RevolvingCreditBalance)   High correlation
OpenCreditLines is highly overall correlated with   High correlation

In the above analysis we can observe:

variables with too much granularity (e.g. Occupation) -> we could think of grouping categories according to the observed BR, without losing sight of the business sense

correlations between variables -> depending on the modeling methodology we are going to use, we should be concerned about this issue to a greater or lesser extent

variables with a high percentage of missing/zero values -> it is important to ask whether it makes sense to record such a percentage of missings/zeros, and what treatment we want to apply to them
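A quick way to quantify the missing rate per feature is a column-wise mean of the null mask; on the real data this would be df[features].isna().mean(). A minimal sketch on a hypothetical mini-frame:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for df[features]
df_demo = pd.DataFrame({
    'DebtToIncomeRatio': [0.2, np.nan, 0.4, np.nan],
    'Occupation': ['Teacher', 'Other', 'Other', 'Analyst'],
})
# Share of missing values per column, worst first
missing_pct = df_demo.isna().mean().sort_values(ascending=False)
print(missing_pct)
```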

EXERCISE!!!

keyboard_arrow_down Feature QA Consequences


keyboard_arrow_down
Consider the possibility of deleting applications with missing DTIs (Debt-to-Income ratio)

DTI_informed = df.loc[~df.DebtToIncomeRatio.isna()]
DTI_missing = df.loc[df.DebtToIncomeRatio.isna()]

print('Population with informed DTI: {}'.format(DTI_informed[(DTI_informed.population==True)].shape[0]))
print('Bad Rate within informed DTI population: {}'.format(100*(DTI_informed[(DTI_informed.population==True)&(DTI_informed.bad==True)].shape[0]/DTI_informed[(DTI_informed.population==True)].shape[0])))
print('Population with missing DTI: {}'.format(DTI_missing[(DTI_missing.population==True)].shape[0]))
print('Bad Rate within missing DTI population: {}'.format(100*(DTI_missing[(DTI_missing.population==True)&(DTI_missing.bad==True)].shape[0]/DTI_missing[(DTI_missing.population==True)].shape[0])))

Population with informed DTI: 18456


Bad Rate within informed DTI population: 9.774599046380581
Population with missing DTI: 2249
Bad Rate within missing DTI population: 17.11871943085816

Large differences can be observed in terms of BR between the population with informed DTI and the population with missing DTI

Does DTI being missing mean that other basic information is also missing?

DTI_missing[['DebtToIncomeRatio','StatedMonthlyIncome','LoanOriginalAmount','MonthlyLoanPayment']].head(5)

DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount MonthlyLoanPay

6 NaN 4516.666667 4000 17

12 NaN 2583.333333 12500 37

54 NaN 2500.000000 2000 8

81 NaN 1666.666667 2000 5

87 NaN 3750.000000 2000 7

DTI_missing.loc[(DTI_missing.StatedMonthlyIncome.isna()) | (DTI_missing.LoanOriginalAmount.isna()) | (DTI_missing.MonthlyLoanPayment.isna())].shape

(0, 87)

Given the differences in terms of BR, and given that a missing DTI does not imply that other basic variables are missing, we keep these clients within the population.
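Since these applications stay in the population, one common treatment (a sketch of a standard option, not necessarily what is done later in this notebook) is to keep the missing value and add an explicit indicator, so a model can exploit the higher BR of the missing-DTI group:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for df
df_demo = pd.DataFrame({'DebtToIncomeRatio': [0.2, np.nan, 0.4]})
# Explicit missing-indicator: lets a model use "DTI not informed" as a signal
df_demo['DTI_missing'] = df_demo['DebtToIncomeRatio'].isna().astype(int)
print(df_demo)
```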

keyboard_arrow_down Let's explore the predictive power (IV)


The weight of evidence (WoE) measures the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers: "bad customers" refers to the customers who defaulted on a loan, and "good customers" refers to the customers who paid back the loan.

So, when we talk about the discriminatory power of a variable, we are talking about its capacity to give information about the characteristics of the bad payer, in contrast to those of the good payer.
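To make this concrete, here is a minimal sketch on hypothetical bin counts: the WoE of a bin is log(dist_goods / dist_bads), and the information value (IV) of the feature sums (dist_goods - dist_bads) * WoE over the bins, which is the same computation get_iv performs further down:

```python
import numpy as np

# Hypothetical good/bad counts in three bins of some feature
goods = np.array([900., 800., 300.])
bads = np.array([50., 80., 120.])

dist_goods = goods / goods.sum()   # share of all goods falling in each bin
dist_bads = bads / bads.sum()      # share of all bads falling in each bin
woe = np.log(dist_goods / dist_bads)           # weight of evidence per bin
iv = ((dist_goods - dist_bads) * woe).sum()    # information value of the feature
print(round(iv, 3))  # ~0.604: the bins separate goods from bads noticeably
```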
These functions handle a basic bucketing schema. For numeric features, they keep the data between the "input_slider" percentiles and split it into "n_bins" bins. For categorical data, they keep the top "n_bins" categories.
#visualization functions
def capture_df(feat_col, input_slider, n_bins, df, target_col):
    """
    Handles the type of the data to generate the intermediate dataframe
    """
    if df[feat_col].dtype in [int, float, np.number]:
        return df_vol_br_num(feat_col, input_slider, n_bins, df, target_col)
    else:
        return df_vol_br_cat(feat_col, input_slider, n_bins, df, target_col)

#capture volume / BR df for numerical variables
def df_vol_br_num(feat_col, input_slider, n_bins, df, obj_col):
    """
    Generate the intermediate dataframe with number of observations and
    number of bads per bin. Specific for numerical features.
    """
    #get the numeric input from the dual slider
    perc_sliders = [v/100. for v in input_slider]
    var_lims = df[feat_col].quantile([perc_sliders[0], perc_sliders[1]]).values # e.g. percentiles 0 and 97.5
    v_min, v_max = var_lims[0], var_lims[1]
    #filter the dataset using the slider input
    df_cut = df.loc[(df[feat_col] <= v_max) & (df[feat_col] >= v_min)][[obj_col, feat_col]]
    #number of cuts = minimum of n_bins and the number of unique values of the variable
    n_cuts = min(int(n_bins), df_cut[feat_col].nunique())
    cuts = [c for c in np.linspace(v_min, v_max, n_cuts + 1)] # vector of cut points
    if cuts[-1] < v_max: # add if linspace not performed perfectly
        cuts.append(v_max)
    cut_col = feat_col + '_'
    df_cut[cut_col] = pd.cut(df_cut[feat_col], cuts, include_lowest=True) # bin the variable
    print(df_cut)
    #generate aggregated values
    N = df_cut.groupby(cut_col)[feat_col].count().values
    BR = df_cut.groupby(cut_col)[obj_col].mean().values
    cuts = df_cut.groupby(cut_col)[feat_col].count().index.astype(str).values
    #handle NA entries -> add a first bin for NaNs
    if df[feat_col].isna().sum() > 0:
        N = np.append(([df[feat_col].isna().sum()]), N)
        BR = np.append(([df.loc[df[feat_col].isna()][obj_col].mean()]), BR)
        cuts = np.append(['NA'], cuts)
    #return the aggregated dataframe and the global bad rate
    return (pd.DataFrame({'cuts': cuts,
                          'N': N,
                          'BR': BR}), df_cut[obj_col].mean())

#capture volume / BR df for categorical variables
def df_vol_br_cat(feat_col, input_slider, n_bins, df, target_col):
    """
    Generate the intermediate dataframe with number of observations and
    number of bads per bin. Specific for categorical features.
    """
    #pick top n_bins levels by volume
    cut_levels = df.groupby(feat_col)[feat_col].count().sort_values(ascending=False)[:int(n_bins)].index.values.tolist()
    df_cut = df.loc[df[feat_col].isin(cut_levels)]
    #capture volumes
    N = df_cut.groupby(feat_col)[feat_col].count().values
    #capture bad rates
    BR = df_cut.groupby(feat_col)[target_col].mean().values
    return (pd.DataFrame({'cuts': df_cut.groupby(feat_col)[feat_col].count().index.astype(str).values,
                          'N': N,
                          'BR': BR}), df_cut[target_col].mean())

import plotly.graph_objs as go # needed for the interactive plot

def output_graph_update(feat_col, input_slider, n_bins, df, obj_col):
    """
    Generate the plotly plot showing the visualization of the intermediate
    dataframe with volume and bad rate per bin.
    """
    #get the df with volume and bad rate
    df_tr, avg_tr = capture_df(feat_col, input_slider, n_bins, df, obj_col)
    #line represents the bad rate
    tr_line = go.Scatter(x = df_tr.cuts,
                         y = df_tr.BR,
                         yaxis = 'y2',
                         name = 'BR')
    #bar represents volume @ cut
    vol_bars = go.Bar(x = df_tr.cuts,
                      y = df_tr.N,
                      name = 'Volume')
    #average BR line
    avg_line = go.Scatter(x = df_tr.cuts,
                          y = np.repeat(avg_tr, df_tr.shape[0]),
                          yaxis = 'y2',
                          name = 'AVG BR',
                          line = dict(color = ('rgb(205, 0, 0)')))
    #small layout
    layout = go.Layout(
        title = 'BR for ' + feat_col,
        yaxis = dict(title = 'Volume',
                     range = [0, max(df_tr.N)]),
        yaxis2 = dict(title = 'BR',
                      overlaying = 'y',
                      side = 'right',
                      range = [0, max(df_tr.BR) + 0.05*max(df_tr.BR)])
    )
    return {'data': [vol_bars, tr_line, avg_line],
            'layout': layout}

A variable with high predictive power is a variable that is capable of giving information about the bad payer, in contrast to the good payer. In other words, it is capable of separating the two behaviors. In the case of a categorical variable, a variable with high predictive power is one for which certain categories concentrate a high volume of bad payers and others, a low volume.
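This per-category separation is just the bad rate per level, i.e. a groupby mean of the target; a minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature and the binary target
df_demo = pd.DataFrame({
    'Occupation': ['Teacher', 'Other', 'Other', 'Teacher', 'Analyst', 'Other'],
    'bad': [0, 1, 0, 0, 0, 1],
})
# Bad rate per category: levels far from the global average carry signal
br = df_demo.groupby('Occupation')['bad'].mean()
print(br)
```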

df_aggcat, avg_br = df_vol_br_cat('Occupation', [0, 100], 10, df, 'bad')


df_aggcat

cuts N BR
0 Accountant/CPA 614 0.078176
1 Administrative Assistant 782 0.113811
2 Analyst 608 0.070724
3 Computer Programmer 838 0.069212
4 Executive 767 0.065189
5 Other 5728 0.124127
6 Professional 2635 0.083112
7 Sales - Commission 582 0.127148
8 Sales - Retail 546 0.124542
9 Teacher 663 0.066365


In the case of a numerical variable, a variable with high predictive power is one for which certain ranges concentrate a high volume of bad payers and others, a low volume.

df_aggnum, avg_br = df_vol_br_num('DebtToIncomeRatio', [0, 97.5], 5, df, 'bad')


df_aggnum
bad DebtToIncomeRatio DebtToIncomeRatio_
0 False 0.13 (-0.001, 0.13]
1 False 0.12 (-0.001, 0.13]
2 False 0.18 (0.13, 0.26]
3 False 0.26 (0.13, 0.26]
4 False 0.36 (0.26, 0.39]
... ... ... ...
20699 False 0.34 (0.26, 0.39]
20701 False 0.20 (0.13, 0.26]
20702 False 0.08 (-0.001, 0.13]
20703 False 0.31 (0.26, 0.39]
20704 False 0.27 (0.26, 0.39]

[18013 rows x 3 columns]


cuts N BR
0 NA 2249 0.171187
1 (-0.001, 0.13] 4697 0.081116
2 (0.13, 0.26] 7681 0.087879
3 (0.26, 0.39] 3775 0.103841
4 (0.39, 0.52] 1311 0.149504
5 (0.52, 0.65] 549 0.182149


keyboard_arrow_down Estimate the IV for all features

def get_iv(df_vbr, col_target='Default'):
    """Returns the IV (Information Value).

    Args:
        df_vbr: Pandas DataFrame, containing the volume ('N') and the bad rate per bin
        col_target: Name of the bad-rate column in df_vbr

    Returns:
        Estimated IV contribution per bin (sum over bins to get the feature IV)
    """
    #bin-wise good/bad counts
    N_bad_bin = ((df_vbr[col_target]) * df_vbr.N).round()
    N_good_bin = df_vbr.N - N_bad_bin
    #total goods/bads
    N_bad = N_bad_bin.sum()
    N_good = N_good_bin.sum()
    #bin-wise distribution of goods/bads
    dist_goods = N_good_bin / N_good
    dist_bads = N_bad_bin / N_bad
    #bin-wise WoE (small constant avoids division by zero)
    woe = np.log(dist_goods / (dist_bads + 0.00000001))
    #bin-wise IV
    iv = (dist_goods - dist_bads) * woe
    return iv

ivs = []
#for all features
for c in features:
    #split the feature in 10 buckets, and get the volume of observations and defaults per bin
    df_tr, avg_br = capture_df(c, [0., 97.5], 10, df, 'bad')
    #estimate the IV using that binning
    ivs.append(get_iv(df_tr, col_target='BR').sum())
df_iv = pd.DataFrame({'feature': features,
                      'IV': ivs}).sort_values(by='IV', ascending=False)
#show sorted by IV
df_iv
bad Term Term_
0 False 36 (28.0, 44.0]
1 False 36 (28.0, 44.0]
2 False 36 (28.0, 44.0]
3 False 36 (28.0, 44.0]
4 False 36 (28.0, 44.0]
... ... ... ...
20700 False 36 (28.0, 44.0]
20701 False 60 (44.0, 60.0]
20702 False 36 (28.0, 44.0]
20703 False 36 (28.0, 44.0]
20704 False 12 (11.999, 28.0]

[20705 rows x 3 columns]


(the same intermediate binned DataFrame is printed for each remaining feature)
0 False 1913 (-0.001, 5846.6]
1 False 25483 (23386.4, 29233.0]
2 False 45734 (40926.2, 46772.8]
3 False 43923 (40926.2, 46772.8]
4 False 42738 (40926.2, 46772.8]
... ... ... ...
20700 False 16079 (11693.2, 17539.8]
20701 False 613 (-0.001, 5846.6]
20702 False 6931 (5846.6, 11693.2]
20703 False 7122 (5846.6, 11693.2]
20704 False 5500 (-0.001, 5846.6]

[20188 rows x 3 columns]


bad TotalTrades TotalTrades_
0 False 36 (35.3, 40.2]
1 False 9 (5.9, 10.8]
2 False 31 (30.4, 35.3]
3 False 47 (45.1, 50.0]
4 False 25 (20.6, 25.5]
... ... ... ...
20699 False 34 (30.4, 35.3]
20700 False 20 (15.7, 20.6]
20701 False 23 (20.6, 25.5]
20702 False 8 (5.9, 10.8]
20703 False 27 (25.5, 30.4]

[20189 rows x 3 columns]


bad TradesNeverDelinquent (percentage) \
0 False 0.66
1 False 1.00
2 False 1.00
3 False 0.97
4 False 1.00
... ... ...
20700 False 1.00
20701 False 0.86
20702 False 1.00
20703 False 0.74
20704 False 0.70

TradesNeverDelinquent (percentage)_
0 (0.632, 0.724]
1 (0.908, 1.0]
2 (0.908, 1.0]
3 (0.908, 1.0]
4 (0.908, 1.0]
... ...
20700 (0.908, 1.0]
20701 (0.816, 0.908]
20702 (0.908, 1.0]
20703 (0.724, 0.816]
20704 (0.632, 0.724]

[20705 rows x 3 columns]


bad TradesOpenedLast6Months TradesOpenedLast6Months_
0 False 1 (0.8, 1.6]
1 False 1 (0.8, 1.6]
2 False 0 (-0.001, 0.8]
3 False 0 (-0.001, 0.8]
4 False 1 (0.8, 1.6]
... ... ... ...
20700 False 0 (-0.001, 0.8]
20701 False 1 (0.8, 1.6]
20702 False 0 (-0.001, 0.8]
20703 False 1 (0.8, 1.6]
20704 False 0 (-0.001, 0.8]

[20492 rows x 3 columns]


bad DebtToIncomeRatio DebtToIncomeRatio_
0 False 0.13 (0.065, 0.13]
1 False 0.12 (0.065, 0.13]
2 False 0.18 (0.13, 0.195]
3 False 0.26 (0.195, 0.26]
4 False 0.36 (0.325, 0.39]
... ... ... ...
20699 False 0.34 (0.325, 0.39]
20701 False 0.20 (0.195, 0.26]
20702 False 0.08 (0.065, 0.13]
20703 False 0.31 (0.26, 0.325]
20704 False 0.27 (0.26, 0.325]

[18013 rows x 3 columns]


bad StatedMonthlyIncome StatedMonthlyIncome_
0 False 5916.666667 (4750.0, 6333.333]
1 False 4166.666667 (3166.667, 4750.0]
2 False 7083.333333 (6333.333, 7916.667]
3 False 3166.666667 (3166.667, 4750.0]
4 False 3750.000000 (3166.667, 4750.0]
... ... ... ...
20700 False 15800.000000 (14250.0, 15833.333]
20701 False 4304.333333 (3166.667, 4750.0]
20702 False 2416.666667 (1583.333, 3166.667]
20703 False 4166.666667 (3166.667, 4750.0]
20704 False 6000.000000 (4750.0, 6333.333]

[20200 rows x 3 columns]


bad TotalProsperLoans TotalProsperLoans_
3 False 1.0 (0.999, 1.667]
10 False 1.0 (0.999, 1.667]
12 False 1.0 (0.999, 1.667]
15 False 1.0 (0.999, 1.667]
17 False 2.0 (1.667, 2.333]
... ... ... ...
20683 True 1.0 (0.999, 1.667]
20686 False 1.0 (0.999, 1.667]
20687 False 1.0 (0.999, 1.667]
20690 False 2.0 (1.667, 2.333]
20697 False 1.0 (0.999, 1.667]

[6118 rows x 3 columns]


bad TotalProsperPaymentsBilled TotalProsperPaymentsBilled_
3 False 23.0 (21.6, 28.8]
10 False 11.0 (7.2, 14.4]
12 False 9.0 (7.2, 14.4]
15 False 35.0 (28.8, 36.0]
17 False 23.0 (21.6, 28.8]
... ... ... ...
20683 True 10.0 (7.2, 14.4]
20686 False 6.0 (-0.001, 7.2]
20687 False 8.0 (7.2, 14.4]
20690 False 15.0 (14.4, 21.6]
20697 False 10.0 (7.2, 14.4]
[6084 rows x 3 columns]
bad OnTimeProsperPayments OnTimeProsperPayments_
3 False 23.0 (21.0, 28.0]
10 False 11.0 (7.0, 14.0]
12 False 9.0 (7.0, 14.0]
15 False 35.0 (28.0, 35.0]
17 False 23.0 (21.0, 28.0]
... ... ... ...
20683 True 1.0 (-0.001, 7.0]
20686 False 6.0 (-0.001, 7.0]
20687 False 8.0 (7.0, 14.0]
20690 False 15.0 (14.0, 21.0]
20697 False 10.0 (7.0, 14.0]

[6086 rows x 3 columns]


bad ProsperPaymentsLessThanOneMonthLate \
3 False 0.0
10 False 0.0
12 False 0.0
15 False 0.0
17 False 0.0
... ... ...
20682 False 0.0
20686 False 0.0
20687 False 0.0
20690 False 0.0
20697 False 0.0

ProsperPaymentsLessThanOneMonthLate_
3 (-0.001, 0.889]
10 (-0.001, 0.889]
12 (-0.001, 0.889]
15 (-0.001, 0.889]
17 (-0.001, 0.889]
... ...
20682 (-0.001, 0.889]
20686 (-0.001, 0.889]
20687 (-0.001, 0.889]
20690 (-0.001, 0.889]
20697 (-0.001, 0.889]

[6089 rows x 3 columns]


bad ProsperPaymentsOneMonthPlusLate ProsperPaymentsOneMonthPlusLate_
3 False 0.0 (-0.001, 0.5]
10 False 0.0 (-0.001, 0.5]
12 False 0.0 (-0.001, 0.5]
15 False 0.0 (-0.001, 0.5]
17 False 0.0 (-0.001, 0.5]
... ... ... ...
20683 True 0.0 (-0.001, 0.5]
20686 False 0.0 (-0.001, 0.5]
20687 False 0.0 (-0.001, 0.5]
20690 False 0.0 (-0.001, 0.5]
20697 False 0.0 (-0.001, 0.5]

[6159 rows x 3 columns]


bad ProsperPrincipalBorrowed ProsperPrincipalBorrowed_
3 False 3000.00 (999.999, 3475.25]
10 False 6400.00 (5950.5, 8425.75]
12 False 10000.00 (8425.75, 10901.0]
15 False 4000.00 (3475.25, 5950.5]
17 False 9000.00 (8425.75, 10901.0]
... ... ... ...
20683 True 2000.00 (999.999, 3475.25]
20686 False 14425.64 (13376.25, 15851.5]
20687 False 5000.00 (3475.25, 5950.5]
20690 False 2000.00 (999.999, 3475.25]
20697 False 4000.00 (3475.25, 5950.5]

[6083 rows x 3 columns]


bad ProsperPrincipalOutstanding ProsperPrincipalOutstanding_
3 False 1158.02 (-0.001, 1202.123]
10 False 4624.39 (3606.369, 4808.492]
12 False 7272.29 (7212.738, 8414.861]
15 False 0.00 (-0.001, 1202.123]
17 False 0.00 (-0.001, 1202.123]
... ... ... ...
20682 False 782.80 (-0.001, 1202.123]
20683 True 1546.84 (1202.123, 2404.246]
20687 False 4062.01 (3606.369, 4808.492]
20690 False 0.00 (-0.001, 1202.123]
20697 False 3317.05 (2404.246, 3606.369]

[6083 rows x 3 columns]


bad ScorexChangeAtTimeOfListing ScorexChangeAtTimeOfListing_
3 False 30.0 (9.4, 40.6]
10 False 4.0 (-21.8, 9.4]
12 False -35.0 (-53.0, -21.8]
15 False 28.0 (9.4, 40.6]
17 False -12.0 (-21.8, 9.4]
... ... ... ...
20682 False 0.0 (-21.8, 9.4]
20683 True 4.0 (-21.8, 9.4]
20686 False -15.0 (-21.8, 9.4]
20687 False -30.0 (-53.0, -21.8]
20697 False -33.0 (-53.0, -21.8]

[6068 rows x 3 columns]


bad LoanOriginalAmount LoanOriginalAmount_
0 False 4151 (3800.0, 5600.0]
1 False 2200 (1999.999, 3800.0]
2 False 6000 (5600.0, 7400.0]
3 False 5700 (5600.0, 7400.0]
4 False 13000 (12800.0, 14600.0]
... ... ... ...
20700 False 4000 (3800.0, 5600.0]
20701 False 4000 (3800.0, 5600.0]
20702 False 2000 (1999.999, 3800.0]
20703 False 4000 (3800.0, 5600.0]
20704 False 2000 (1999.999, 3800.0]

[20203 rows x 3 columns]


bad MonthlyLoanPayment MonthlyLoanPayment_
0 False 166.39 (123.148, 184.722]
1 False 73.81 (61.574, 123.148]
2 False 183.87 (123.148, 184.722]
3 False 186.88 (184.722, 246.296]
4 False 469.92 (431.018, 492.592]
... ... ... ...
20700 False 173.71 (123.148, 184.722]
20701 False 136.98 (123.148, 184.722]
20702 False 65.43 (61.574, 123.148]
20703 False 173.71 (123.148, 184.722]
20704 False 191.73 (184.722, 246.296]

[20188 rows x 3 columns]


bad PercentFunded PercentFunded_
0 False 0.7547 (0.73, 0.76]
1 False 1.0000 (0.97, 1.0]
2 False 1.0000 (0.97, 1.0]
3 False 1.0000 (0.97, 1.0]
4 False 1.0000 (0.97, 1.0]
... ... ... ...
20700 False 1.0000 (0.97, 1.0]
20701 False 1.0000 (0.97, 1.0]
20702 False 1.0000 (0.97, 1.0]
20703 False 1.0000 (0.97, 1.0]
20704 False 1.0000 (0.97, 1.0]

[20703 rows x 3 columns]


bad Recommendations Recommendations_
0 False 0 (-0.001, 0.0]
1 False 0 (-0.001, 0.0]
2 False 0 (-0.001, 0.0]
3 False 0 (-0.001, 0.0]
4 False 0 (-0.001, 0.0]
... ... ... ...
20700 False 0 (-0.001, 0.0]
20701 False 0 (-0.001, 0.0]
20702 False 0 (-0.001, 0.0]
20703 False 0 (-0.001, 0.0]
20704 False 0 (-0.001, 0.0]

[20321 rows x 3 columns]
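Each table above pairs a raw feature with its binned version (the trailing-underscore column); the per-bin bad rate and volume that the later plots show can be computed along these lines. This is a minimal sketch with an illustrative mini dataset, not the notebook's actual helper:

```python
import pandas as pd

# illustrative mini dataset: a numeric feature and a boolean default flag
demo = pd.DataFrame({'bad': [False, False, True, False, True, False],
                     'x':   [1, 2, 9, 4, 8, 3]})

# bin the feature, mirroring how each table pairs 'x' with 'x_'
demo['x_'] = pd.cut(demo['x'], bins=2, include_lowest=True)

# bad rate (mean of the boolean flag) and volume per bin
br = demo.groupby('x_', observed=False)['bad'].agg(['mean', 'size'])
```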


feature IV
28 StatedMonthlyIncome 0.148338
26 IncomeRange 0.142623
25 DebtToIncomeRatio 0.109896
7 CreditScoreRangeLower 0.084819
8 CreditScoreRangeUpper 0.084819
36 ScorexChangeAtTimeOfListing 0.077857
3 Occupation 0.075511
31 OnTimeProsperPayments 0.073233
4 EmploymentStatus 0.064521
30 TotalProsperPaymentsBilled 0.060063
38 MonthlyLoanPayment 0.059123
14 InquiriesLast6Months 0.059109
37 LoanOriginalAmount 0.055845
10 OpenCreditLines 0.055633
9 CurrentCreditLines 0.050520
12 OpenRevolvingAccounts 0.050037
35 ProsperPrincipalOutstanding 0.049404
27 IncomeVerifiable 0.048851
22 TotalTrades 0.046902
32 ProsperPaymentsLessThanOneMonthLate 0.041367
11 TotalCreditLinespast7years 0.033806
24 TradesOpenedLast6Months 0.033370
2 BorrowerState 0.030815
34 ProsperPrincipalBorrowed 0.027194
20 BankcardUtilization 0.026802
13 OpenRevolvingMonthlyPayment 0.026497
29 TotalProsperLoans 0.023694
6 IsBorrowerHomeowner 0.022738
21 AvailableBankcardCredit 0.019976
33 ProsperPaymentsOneMonthPlusLate 0.017298
5 EmploymentStatusDuration 0.016949
1 ListingCategory (numeric) 0.015899
23 TradesNeverDelinquent (percentage) 0.014159
15 TotalInquiries 0.013414
0 Term 0.012167
19 RevolvingCreditBalance 0.009131
39 PercentFunded 0.005642
17 PublicRecordsLast10Years 0.002602
16 DelinquenciesLast7Years 0.002380
18 PublicRecordsLast12Months 0.000000
40 Recommendations 0.000000
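For reference, the IV figures in the table above come from a weight-of-evidence computation per bin. The `information_value` helper and the toy data below are illustrative sketches, not the notebook's actual implementation:

```python
import numpy as np
import pandas as pd

def information_value(binned, bad):
    """IV = sum over bins of (share_good - share_bad) * WoE, with WoE = ln(share_good / share_bad)."""
    tab = pd.crosstab(binned, bad)                 # rows: bins, columns: False/True counts
    share_good = tab[False] / tab[False].sum()     # share of goods per bin
    share_bad = tab[True] / tab[True].sum()        # share of bads per bin
    woe = np.log(share_good / share_bad)           # weight of evidence per bin
    return float(((share_good - share_bad) * woe).sum())

# toy feature whose 'high' bin concentrates the defaults
demo = pd.DataFrame({'bin': ['low'] * 50 + ['high'] * 50,
                     'bad': [False] * 45 + [True] * 5 + [False] * 35 + [True] * 15})
iv = information_value(demo['bin'], demo['bad'])
```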


keyboard_arrow_down Visualize some (predictive and less predictive) features

#dynamic plotting libraries
import plotly.io as pyio
import plotly.offline as py
import plotly.graph_objs as go
#py.init_notebook_mode(connected=True)
pyio.renderers.default = "colab"

#pick some really good features by IV, and some not-so-good ones
df_iv_loc = df_iv.loc[(df_iv.IV > 0.14) | ((df_iv.IV < 0.0050) & (df_iv.IV > 0.00))]
df_iv_loc
feature IV
28 StatedMonthlyIncome 0.148338
26 IncomeRange 0.142623
17 PublicRecordsLast10Years 0.002602
16 DelinquenciesLast7Years 0.002380

#plot features
for c in df_iv_loc.feature.values.tolist():
    py.iplot(output_graph_update(c, [0., 97.5], 6, df, 'bad'))
bad StatedMonthlyIncome StatedMonthlyIncome_
0 False 5916.666667 (5277.778, 7916.667]
1 False 4166.666667 (2638.889, 5277.778]
2 False 7083.333333 (5277.778, 7916.667]
3 False 3166.666667 (2638.889, 5277.778]
4 False 3750.000000 (2638.889, 5277.778]
... ... ... ...
20700 False 15800.000000 (13194.444, 15833.333]
20701 False 4304.333333 (2638.889, 5277.778]
20702 False 2416.666667 (-0.001, 2638.889]
20703 False 4166.666667 (2638.889, 5277.778]
20704 False 6000.000000 (5277.778, 7916.667]

[20200 rows x 3 columns]

[Bar chart "BR for StatedMonthlyIncome": Volume per bin, (-0.001, 2638.889], (2638.889, 5277.778], (5277.778, 7916.667]]

[Bar chart "BR for IncomeRange": Volume per income band, $1-24,999, $25,000-49,999, $100,000+]

bad PublicRecordsLast10Years PublicRecordsLast10Years_


0 False 1 (0.667, 1.333]
1 False 0 (-0.001, 0.667]
2 False 0 (-0.001, 0.667]
3 False 0 (-0.001, 0.667]
4 False 0 (-0.001, 0.667]
... ... ... ...
20700 False 0 (-0.001, 0.667]
20701 False 1 (0.667, 1.333]
20702 False 0 (-0.001, 0.667]
20703 False 0 (-0.001, 0.667]
20704 False 1 (0.667, 1.333]

[20482 rows x 3 columns]


[Bar chart "BR for PublicRecordsLast10Years": Volume per bin, (-0.001, 0.667], (0.667, 1.333]]

bad DelinquenciesLast7Years DelinquenciesLast7Years_


0 False 8 (5.333, 10.667]
1 False 0 (-0.001, 5.333]
2 False 0 (-0.001, 5.333]
3 False 0 (-0.001, 5.333]
4 False 0 (-0.001, 5.333]
... ... ... ...
20700 False 0 (-0.001, 5.333]
20701 False 1 (-0.001, 5.333]
20702 False 0 (-0.001, 5.333]
20703 False 13 (10.667, 16.0]
20704 False 9 (5.333, 10.667]

[20188 rows x 3 columns]

[Bar chart "BR for DelinquenciesLast7Years": Volume per bin, (-0.001, 5.333], (5.333, 10.667], (10.667, 16.0]]

keyboard_arrow_down Stability check - PSI


Population Stability Index (PSI) is a metric that measures how much the distribution of a variable changes over time.

In simple words, PSI compares the distribution of a scoring variable (e.g. the predicted probability) in the scoring data set to the training data set that was used to develop the model. The idea is to check how the current scoring population compares to the population seen when the model was trained.
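Concretely, PSI sums (p_in - p_out) * ln(p_in / p_out) over the bins of the variable, where p_in and p_out are the per-bin shares in each sample. A minimal numeric sketch (the bin shares below are made up for illustration):

```python
import numpy as np

# illustrative per-bin shares for the in-time (training) and out-of-time (scoring) samples
p_in  = np.array([0.25, 0.25, 0.25, 0.25])
p_out = np.array([0.20, 0.30, 0.25, 0.25])

# PSI: identical distributions give 0; larger values mean more drift
psi = np.sum((p_in - p_out) * np.log(p_in / p_out))
```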
def PSI_numeric(series, in_out_time_series):
    """Returns the population stability index for numerical variables

    Args:
        series: Pandas Series, the variable to describe
        in_out_time_series: Pandas Series that contains the in time / out of time flag

    Returns:
        Estimated PSI
    """
    pd_aux = pd.DataFrame(dict(data=series, in_out=in_out_time_series)).reset_index()
    #capture in time and out of time series
    in_series = pd_aux.loc[pd_aux.in_out == True]['data']
    out_series = pd_aux.loc[pd_aux.in_out == False]['data']

    #base data deciles
    qqs = in_series.quantile(q=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

    #cut the data, based on the base series deciles
    in_series_cut = pd.cut(in_series, sorted(list(set(qqs.values))), include_lowest=True)
    out_series_cut = pd.cut(out_series, sorted(list(set(qqs.values))), include_lowest=True)

    #count volume per bin
    in_grp = in_series_cut.value_counts(dropna=False)  # may contain NaNs
    out_grp = out_series_cut.value_counts(dropna=not (np.nan in in_grp.index.values.tolist()))  # keep NaNs only if the in-time bins have them
    #small fix, so empty bins do not produce infinite contributions
    out_grp[out_grp == 0] = 0.01

    #N observations in each series
    N_in = len(in_series_cut)
    N_out = len(out_series_cut)

    #convert to share in each bin
    in_grp = in_grp / N_in
    out_grp = out_grp / N_out

    return sum((in_grp - out_grp) * np.log(in_grp / out_grp))

def PSI_categorical(series, in_out_time_series):
    """Returns the population stability index for categorical variables

    Args:
        series: Pandas Series, the variable to describe
        in_out_time_series: Pandas Series that contains the in time / out of time flag

    Returns:
        Estimated PSI
    """
    pd_aux = pd.DataFrame(dict(data=series, in_out=in_out_time_series)).reset_index()
    #capture in time and out of time series
    in_series = pd_aux.loc[pd_aux.in_out == True]['data']
    out_series = pd_aux.loc[pd_aux.in_out == False]['data']

    #count volume per level
    in_grp = in_series.value_counts(dropna=False)
    out_grp = out_series.value_counts(dropna=not (np.nan in in_grp.index.values.tolist()))

    #N observations in each series
    N_in = len(in_series)
    N_out = len(out_series)

    #convert to share in each level
    in_grp = in_grp / N_in
    out_grp = out_grp / N_out

    #put all together in a df
    df_grp = in_grp.to_frame().join(out_grp.to_frame(), lsuffix='_in', rsuffix='_out')
    df_grp = df_grp.fillna(0.000001)

    return sum((df_grp.data_in - df_grp.data_out) * np.log(df_grp.data_in / df_grp.data_out))


psi = []
#capture in time - out of time flag (1 = in time, 0 = out of time)
it_oot_series = pd.Series(np.hstack((np.ones(len(df)), np.zeros(len(df_oot)))))
#features_ = ['DebtToIncomeRatio']  # leftover single-feature test list; not used below
#compute PSI for all features
for c in features:
    col_series = pd.concat([df[c], df_oot[c]], ignore_index=True)
    if (df[c].dtypes == object) | (df[c].dtypes == bool):
        psi.append(PSI_categorical(col_series, it_oot_series))
    else:
        psi.append(PSI_numeric(col_series, it_oot_series))

df_psi = pd.DataFrame({'feature': features,
                       'PSI': psi})
df_psi = df_psi.sort_values(by='PSI')
df_psi
feature PSI
40 Recommendations 0.000000
18 PublicRecordsLast12Months 0.000000
27 IncomeVerifiable 0.000071
6 IsBorrowerHomeowner 0.000213
39 PercentFunded 0.000411
17 PublicRecordsLast10Years 0.002025
12 OpenRevolvingAccounts 0.002125
19 RevolvingCreditBalance 0.002709
10 OpenCreditLines 0.003613
11 TotalCreditLinespast7years 0.004108
14 InquiriesLast6Months 0.006294
22 TotalTrades 0.006907
9 CurrentCreditLines 0.007141
13 OpenRevolvingMonthlyPayment 0.007249
21 AvailableBankcardCredit 0.008571
20 BankcardUtilization 0.011055
28 StatedMonthlyIncome 0.011694
26 IncomeRange 0.015400
23 TradesNeverDelinquent (percentage) 0.017419
33 ProsperPaymentsOneMonthPlusLate 0.017845
32 ProsperPaymentsLessThanOneMonthLate 0.020497
25 DebtToIncomeRatio 0.021003
15 TotalInquiries 0.022078
16 DelinquenciesLast7Years 0.024698
24 TradesOpenedLast6Months 0.032809
2 BorrowerState 0.033648
29 TotalProsperLoans 0.037846
3 Occupation 0.047130
5 EmploymentStatusDuration 0.050429
34 ProsperPrincipalBorrowed 0.061380
8 CreditScoreRangeUpper 0.080324
7 CreditScoreRangeLower 0.080324
35 ProsperPrincipalOutstanding 0.112830
31 OnTimeProsperPayments 0.118072
30 TotalProsperPaymentsBilled 0.121264
36 ScorexChangeAtTimeOfListing 0.216728
37 LoanOriginalAmount 0.373789
38 MonthlyLoanPayment 0.780153
0 Term 0.977066
1 ListingCategory (numeric) 1.446103
4 EmploymentStatus 2.404796
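As a common rule of thumb (not stated in this notebook), PSI below 0.1 is read as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as unstable. The ranking above can be bucketed along these lines; the few rows below are copied from the table for illustration:

```python
import pandas as pd

# a few representative rows from the PSI table
df_psi = pd.DataFrame({'feature': ['Recommendations', 'ScorexChangeAtTimeOfListing',
                                   'MonthlyLoanPayment', 'EmploymentStatus'],
                       'PSI': [0.000000, 0.216728, 0.780153, 2.404796]})

# bucket by the usual 0.1 / 0.25 cutoffs
df_psi['stability'] = pd.cut(df_psi['PSI'],
                             bins=[-0.001, 0.1, 0.25, float('inf')],
                             labels=['stable', 'moderate drift', 'unstable'])
```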


keyboard_arrow_down Examples of very stable features


#custom function to describe a numerical variable
def desc_num(df, df_oot, col):
    """Function that returns a custom descriptive for the numerical variable. It returns:
        - Mean, median, minimum, maximum, p25, p75, std, %na, %nonzero, %unique
        - Histogram plot
        - Stability plot

    Args:
        df: Pandas DataFrame with the in time input data
        df_oot: Pandas DataFrame with the out of time input data
        col: Name of the column with the feature under study

    Returns:
        Dictionary that contains the main statistics of the feature
    """
    #dictionary to keep main statistics
    dict_stats = {'Mean': df[col].mean(),
                  'Median': df[col].median(),
                  'Min': df[col].min(),
                  'Max': df[col].max(),
                  'p25': df[col].quantile(0.25),
                  'p75': df[col].quantile(0.75),
                  'Std': df[col].std(),
                  '%NA': 100. * df[col].isna().sum() / df.shape[0],
                  '%Nonzero': 100. * (df[col] != 0).sum() / df.shape[0],
                  '%Unique': 100. * df[col].nunique() / df.shape[0]}
    #plot data distribution (clipped to the 2.5-97.5 percentile range)
    plt_mn, plt_mx = df[col].quantile(q=[0.025, 0.975])
    df[col].hist(range=(plt_mn, plt_mx))
    plt.show()
    #plot stability distribution: in time vs out of time
    plt.hist(df[col], label='IT', range=(plt_mn, plt_mx), alpha=0.55)
    plt.hist(df_oot[col], label='OOT', range=(plt_mn, plt_mx), alpha=0.55)
    plt.legend(loc='upper right')
    plt.show()

    return dict_stats

def desc_cat(df, df_oot, col):
    """Function that returns a custom descriptive for the categorical variable. It returns:
        - Number of unique levels, %unique, top level and its frequency, %na
        - Bar plot of the top levels
        - Stability plot

    Args:
        df: Pandas DataFrame with the in time input data
        df_oot: Pandas DataFrame with the out of time input data
        col: Name of the column with the feature under study

    Returns:
        Dictionary that contains the main statistics of the feature
    """
    #dictionary to keep main statistics
    dict_stats = {'Unique': df[col].nunique(),
                  '%Unique': 100. * df[col].nunique() / df.shape[0],
                  'Top': df[col].value_counts()[:1].index[0],
                  'Freq @ top': df[col].value_counts()[:1].values[0],
                  '%NA': 100. * df[col].isna().sum() / df.shape[0]}
    #plot data distribution (only top 15 levels)
    df[col].value_counts()[:15].sort_values().plot(kind='barh', colormap='Blues_r')
    plt.show()
    #plot stability distribution: in time vs out of time shares of the top 15 levels
    it_vc = df[col].value_counts()[:15]
    oot_vc = df_oot[col].value_counts()[:15]
    df_vc = pd.DataFrame({'it_vc': 100. * it_vc / it_vc.sum(),
                          'oot_vc': 100. * oot_vc / oot_vc.sum()})
    df_vc = df_vc.fillna(0.)
    df_vc = df_vc.sort_values(by='it_vc', ascending=False)
    plt.bar(np.arange(df_vc.shape[0]), df_vc.it_vc, color='b', width=0.25, label='IT')
    plt.bar(np.arange(df_vc.shape[0]) + 0.25, df_vc.oot_vc, color='g', width=0.25, label='OOT')
    plt.xticks(np.arange(df_vc.shape[0]), df_vc.index.values, rotation='vertical')
    plt.legend()
    plt.show()

    return dict_stats

desc_num(df, df_oot, 'StatedMonthlyIncome')


{'Mean': 5657.364420836705,
'Median': 4604.166667,
'Min': 0.0,
'Max': 466666.666667,
'p25': 3166.666667,
'p75': 6875.0,
'Std': 6302.240785234885,
'%NA': 0.0,
'%Nonzero': 98.31441680753441,
'%Unique': 15.36826853417049}

desc_cat(df, df_oot, 'IncomeRange')
