Credit Risk Modeling in R

Data structure
Lore Dirick, Manager of Data Science Curriculum at Flatiron School
What is loan default?
EL = PD × EAD × LGD

Expected Loss (EL) is the product of the Probability of Default (PD), the Exposure at Default (EAD), and the Loss Given Default (LGD).
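A quick numerical illustration of the formula; the loan figures below are made up for the example, not taken from the course data.

```r
# Hypothetical loan, for illustration only
PD  <- 0.10      # probability of default: 10%
EAD <- 10000     # exposure at default, in dollars
LGD <- 0.60      # loss given default: 60% of the exposure is lost on default

EL <- PD * EAD * LGD
EL               # expected loss: 600 dollars
```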
Application information: marital status, ...
Behavioral information: current account balance, ...
Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|-------------------------|

                         | loan_data$loan_status
loan_data$home_ownership |         0 |         1 | Row Total |
-------------------------|-----------|-----------|-----------|
                MORTGAGE |     10821 |      1181 |     12002 |
                         |     0.902 |     0.098 |     0.413 |
-------------------------|-----------|-----------|-----------|
                   OTHER |        80 |        17 |        97 |
                         |     0.825 |     0.175 |     0.003 |
-------------------------|-----------|-----------|-----------|
                     OWN |      2049 |       252 |      2301 |
                         |     0.890 |     0.110 |     0.079 |
-------------------------|-----------|-----------|-----------|
                    RENT |     12915 |      1777 |     14692 |
                         |     0.879 |     0.121 |     0.505 |
-------------------------|-----------|-----------|-----------|
            Column Total |     25865 |      3227 |     29092 |
-------------------------|-----------|-----------|-----------|
Using function hist()
hist(loan_data$int_rate)
[Histogram; x-axis ticks run from 0 to 4,500,000]
Rule of thumb: outlier if bigger than Q3 + 1.5 * IQR

# Expert judgment: flag annual incomes above $3 million
index_outlier_expert <- which(loan_data$annual_inc > 3000000)
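The rule-of-thumb alternative to the expert cut-off can be sketched the same way; the income vector below is made up for illustration, not the course data.

```r
# Flag values beyond Q3 + 1.5 * IQR (toy income vector, one extreme value)
annual_inc <- c(40000, 55000, 62000, 75000, 90000, 6000000)

q3     <- quantile(annual_inc, 0.75)
cutoff <- q3 + 1.5 * IQR(annual_inc)

index_outlier_iqr <- which(annual_inc > cutoff)
annual_inc[index_outlier_iqr]   # only 6000000 is flagged
```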
Outlier deleted
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
          0      5000    12.73     C         12       MORTGAGE    6000000 144
Options for treating missing inputs: replace or keep.
CONTINUOUS: replace with the median (median imputation), or keep via coarse classification (bin the variable and add a "missing" bin).
CATEGORICAL: replace with the mode, or keep missing values as their own category.
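Both options for a continuous variable can be sketched in base R; the toy employment-length vector and bin edges below are illustrative, not the course's.

```r
# Toy continuous variable with missing values
emp_length <- c(1, 4, NA, 8, 12, NA, 3)

# Replace: median imputation
emp_replaced <- emp_length
emp_replaced[is.na(emp_replaced)] <- median(emp_length, na.rm = TRUE)

# Keep: coarse classification -- bin the variable, then give missing
# values their own "Missing" category
emp_cat <- cut(emp_length, breaks = c(-Inf, 2, 5, 10, Inf),
               labels = c("0-2", "2-5", "5-10", "10+"))
emp_cat <- as.character(emp_cat)
emp_cat[is.na(emp_cat)] <- "Missing"
table(emp_cat)
```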
Start analysis
Final data structure
str(training_set)
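training_set comes from a prior train/test split. A common way to make such a split is sketched below on simulated stand-in data; the 2/3 proportion and the seed are illustrative, not necessarily the course's.

```r
# Simulated stand-in for the loan data (illustrative only)
set.seed(567)
loan_data <- data.frame(loan_status = rbinom(100, 1, 0.11),
                        age         = sample(20:60, 100, replace = TRUE))

# 2/3 training, 1/3 test
index_train  <- sample(1:nrow(loan_data), floor(2 / 3 * nrow(loan_data)))
training_set <- loan_data[index_train, ]
test_set     <- loan_data[-index_train, ]

str(training_set)
```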
P(loan_status = 1 | age) = 1 / (1 + e^-(β̂0 + β̂1 × age))

If βj > 0, then e^βj > 1: the odds increase as xj increases.
An example with "age" and "home ownership"
log_model_small <- glm(loan_status ~ age + home_ownership, family = "binomial", data = training_set)
log_model_small
P(loan_status = 1 | age, home_ownership) = 1 / (1 + e^-(β̂0 + β̂1 × age + β̂2 × OTHER + β̂3 × OWN + β̂4 × RENT))

For a 33-year-old applicant (with one home-ownership dummy equal to 1):

= 1 / (1 + e^-(-1.886396 + (-0.009308) × 33 + 0.158581 × 1))
= 1 / (1 + e^2.03499)
= 0.1155779
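The hand computation above can be reproduced with base R's plogis(), which evaluates 1 / (1 + e^-x); the coefficients are the ones shown in the slide.

```r
# Linear predictor for a 33-year-old with one home-ownership dummy = 1,
# using the fitted coefficients from the slide
linear_pred <- -1.886396 + (-0.009308) * 33 + 0.158581 * 1

p_default <- plogis(linear_pred)   # 1 / (1 + exp(-linear_pred))
p_default                          # approximately 0.1156
```

In practice the course obtains these values with predict(log_model_small, newdata = ..., type = "response") rather than by hand.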
Recap: model evaluation
Actual loan status vs. model prediction:

          test_set$loan_status   model_prediction
      ...                  ...                ...
[8066,]                      1                  1
[8067,]                      0                  0
[8068,]                      0                  0
[8069,]                      0                  0
[8070,]                      0                  0
[8071,]                      0                  1
[8072,]                      1                  0
[8073,]                      1                  1
[8074,]                      0                  0
[8075,]                      0                  0
[8076,]                      0                  0
[8077,]                      1                  1
[8078,]                      0                  0
[8079,]                      0                  1
      ...                  ...                ...

Confusion matrix for the example rows shown:

                        Model prediction
Actual loan status      No default (0)   Default (1)
No default (0)                8               2
Default (1)                   1               3
Best cut-off for accuracy?
Accuracy = (TP + TN) / (TP + FP + TN + FN)

Accuracy = 89.31%
Error rate = (100 − 89.31)% = 10.69%
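As a quick check of the formula, applied to the small example confusion matrix from the recap (8 true negatives, 2 false positives, 1 false negative, 3 true positives):

```r
# Counts from the small recap example
TN <- 8; FP <- 2; FN <- 1; TP <- 3

accuracy <- (TP + TN) / (TP + FP + TN + FN)
accuracy   # 11 / 14, about 0.786
```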
Recall:

P(loan_status = 1 | x1, ..., xm) = 1 / (1 + e^-(β0 + β1 x1 + ... + βm xm))

βj < 0: the probability of default decreases as xj increases.
βj > 0: the probability of default increases as xj increases.
Decision tree example
Gain = Gini_root − prop(cases in N1) × Gini_N1 − prop(cases in N2) × Gini_N2
     = 0.039488

Maximum gain
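The gain computation can be sketched numerically; the node counts below are made up for illustration, and the Gini of a node is taken as 2 × p × (1 − p), with p the proportion of defaults in the node.

```r
# Gini impurity of a node with default proportion p
gini <- function(p) 2 * p * (1 - p)

# Made-up split: root has 500 cases (100 defaults), split into
# N1 (400 cases, 40 defaults) and N2 (100 cases, 60 defaults)
gini_root <- gini(100 / 500)
gini_N1   <- gini(40 / 400)
gini_N2   <- gini(60 / 100)

gain <- gini_root - (400 / 500) * gini_N1 - (100 / 500) * gini_N2
gain   # 0.08
```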
Imagine...
Problems with large decision trees
Too complex: the tree is no longer interpretable
Classification tree:
rpart(formula = loan_status ~ ., data = undersampled_training_set, method = "class",
control = rpart.control(cp = 0.001))
Variables actually used in tree construction:
age annual_inc emp_cat grade home_ownership ir_cat loan_amnt
Root node error: 2190/6570 = 0.33333
n= 6570
CP nsplit rel error xerror xstd
1 0.0059361 0 1.00000 1.00000 0.017447
2 0.0044140 4 0.97443 0.99909 0.017443
3 0.0036530 7 0.96119 0.98174 0.017366
4 0.0031963 8 0.95753 0.98904 0.017399
...
16 0.0010654 76 0.84247 1.02511 0.017554
17 0.0010000 79 0.83927 1.02511 0.017554
Minimum cross-validated error (xerror) at CP = 0.003653
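The course prunes the large tree at the cp with the lowest cross-validated error. A runnable sketch of that workflow on the built-in iris data (the course's tree_undersample and its data are not available here):

```r
library(rpart)

# Grow a deliberately large tree
big_tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(cp = 0.001, minsplit = 2))

# Pick the cp with the lowest cross-validated error from the cp table,
# then prune back to it
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
ptree    <- prune(big_tree, cp = best_cp)
```

In the course this corresponds to ptree_undersample <- prune(tree_undersample, cp = 0.003653).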
# Plot the pruned tree
plot(ptree_undersample, uniform = TRUE)
text(ptree_undersample)

# Same plot, with node counts added (use.n = TRUE)
plot(ptree_undersample, uniform = TRUE)
text(ptree_undersample, use.n = TRUE)
Other interesting rpart() arguments

In rpart():
weights: include case weights
parms: set prior probabilities (parms = list(prior = ...)) or a loss matrix (parms = list(loss = ...))
Predicted probabilities per class:

               0          1
    1  0.7382920  0.2617080
    2  0.5665138  0.4334862
    3  0.5992366  0.4007634
  ...        ...        ...
29084  0.7382920  0.2617080
29090  0.7382920  0.2617080
29091  0.7382920  0.2617080

Confusion matrix:

   pred_undersample_class
        0      1
  0  8314    346
  1   964     73
Constructing a confusion matrix
predict(log_reg_model, newdata = test_set, type = "response")
         1          2           3          4           5  ...
0.08825517  0.3502768  0.28632298  0.1657199  0.11264550  ...

Predicted probabilities per class:

           0          1
1  0.7873134  0.2126866
2  0.6250000  0.3750000
3  0.6250000  0.3750000
4  0.7873134  0.2126866
5  0.5756867  0.4243133

Acceptance rate 80% → cut-off value 0.1600124

    test_set$loan_status  pred_full_20
  1                    0             0
  2                    0             0
  3                    0             1
  4                    0             0
  5                    0             1
...                  ...           ...

Bad rate: 0.08972541
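The path from an acceptance rate to a cut-off and a bad rate can be sketched in base R; the predicted probabilities below are simulated stand-ins, not the course's pred_full vector.

```r
# Simulated predicted default probabilities and matching outcomes
set.seed(42)
pred_full   <- rbeta(1000, 2, 12)          # stand-in for model predictions
loan_status <- rbinom(1000, 1, pred_full)  # simulated actual defaults

# Accept the 80% of applicants with the lowest predicted default probability:
# the cut-off is the 80th percentile of the predictions
cutoff   <- quantile(pred_full, 0.8)
accepted <- pred_full <= cutoff

# Bad rate: share of actual defaults among the accepted loans
bad_rate <- mean(loan_status[accepted])
```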
Until now
Strategy table/curve: still requires an assumption (an acceptance rate or cut-off must be chosen)
Actual loan status vs. model prediction:

                        Model prediction
Actual loan status      No default (0)   Default (1)
No default (0)                TN              FP
Default (1)                   FN              TP

Accuracy    = (TP + TN) / (TP + FP + TN + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
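Applying the last two formulas to the earlier recap example (TN = 8, FP = 2, FN = 1, TP = 3):

```r
# Counts from the small recap example
TN <- 8; FP <- 2; FN <- 1; TP <- 3

sensitivity <- TP / (TP + FN)   # 3 / 4  = 0.75
specificity <- TN / (TN + FP)   # 8 / 10 = 0.80
```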
ROC curves for 4 logistic regression models
auc(test_set$loan_status, pred_1_remove_amnt)
auc(test_set$loan_status, pred_1_remove_grade)
auc(test_set$loan_status, pred_1_remove_home)
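auc() above comes from the pROC package. What it computes can be illustrated in base R with the rank-sum (Wilcoxon) formulation of the AUC, shown here on a tiny toy vector; the function name is mine, for illustration.

```r
# AUC via the rank-sum formulation: the probability that a randomly chosen
# default gets a higher predicted probability than a randomly chosen non-default
auc_by_hand <- function(labels, scores) {
  n1 <- sum(labels == 1)        # number of defaults
  n0 <- sum(labels == 0)        # number of non-defaults
  r  <- rank(scores)            # ranks of the predicted probabilities
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Perfect ranking gives AUC = 1; an uninformative model is near 0.5
auc_by_hand(c(0, 0, 1, 1), c(0.1, 0.2, 0.7, 0.9))   # 1
```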
Other methods
Discriminant analysis
Random forest
Neural networks