Using Logistic Regression to Predict Credit Default
Steven Leopard and Jun Song
Dr. Jennifer Priestley and Professor Michael Frankel
Methods
The methods for this project included:
1. Data Discovery: Cleansing, merging, imputing, and deleting (a merge-and-imputation sketch follows this list).
2. Multicollinearity: Removing variables with high variance inflation factors (VIFs).
3. Variable Preparation: User-defined and SAS-defined discretization.
4. Modeling and Logistic Regression: Training and validation files created, then modeled.
5. KS Testing and Cluster Analysis: Optimization of profit and group discovery.
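In practice, step 1 reduces to merging the source files and imputing missing values. A minimal sketch in SAS, assuming the two source files are named CPR and PERF and join on MATCHKEY (names inferred from the poster figure; the variable lists are placeholders):

    /* Merge the two source files on the match key */
    proc sort data=cpr;  by matchkey; run;
    proc sort data=perf; by matchkey; run;

    data master;
       merge cpr(in=a) perf(in=b);
       by matchkey;
       if a and b;   /* keep records present in both files */
    run;

    /* Replace (impute) missing numeric values with the column median */
    proc stdize data=master out=master_imputed reponly method=median;
       var _numeric_;
    run;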
[Figures: the CPR and PERF files merged on MATCHKEY, with percent-by-age distribution panels for each file (ages roughly 10-100).]
Tables Illustrating Cutoff Points for the Variance Inflation Factor (VIF) Macro
PCTREM   MSTD   Number of Obs.   Number of Variables
0.8      4      1,255,429        317
0.6      4      1,255,429        299
0.4      4      1,255,429        182

[Chart: number of variables retained (roughly 150-350) at each PCTREM value.]
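The PCTREM and MSTD values above are parameters of the variable-reduction macro used for the project, which is not reproduced on the poster. The underlying multicollinearity check it automates can be sketched with PROC REG's VIF option (variable names are placeholders):

    /* Flag multicollinearity: variables with large VIFs are candidates for removal */
    proc reg data=master_imputed;
       model perf = var1 var2 var3 /* ...candidate predictors... */ / vif tol;
    run;
    quit;

Variables whose VIF exceeds the chosen cutoff are dropped and the model is refit until every remaining VIF falls below the cutoff.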
Typically, an entirely different dataset would be used for validation; here the master file was split instead. After the split, PROC LOGISTIC was run on the training file. The model was built with backward selection, which removed 85 redundant or insignificant variables. Backward selection starts with every candidate variable in the model and iteratively removes the variables whose elimination improves, or least harms, the model's predictive capability. The ROC curves below show the area under the curve (c-statistic) for the model with all remaining variable transformations, and again with only the ten variables with the highest chi-square values.

[Figure: percent by prminqs (0-100).]
[Figure: ROC curves. All variables: c = 0.85; 10 variables: c = 0.81.]
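A minimal sketch of the split-and-model step, assuming a 50/50 split, a binary target PERF, and placeholder candidates var1-var100 (the actual split ratio and candidate list are not given on the poster):

    /* Split the master file into training and validation halves */
    proc surveyselect data=master_imputed out=split samprate=0.5 outall seed=12345;
    run;

    data train valid;
       set split;
       if selected then output train;
       else output valid;
    run;

    /* Backward selection: start with all candidates, remove the weakest in turn */
    ods graphics on;
    proc logistic data=train plots(only)=roc;
       model perf(event='1') = var1-var100 / selection=backward slstay=0.05;
    run;

SLSTAY= controls how significant a variable must remain to survive elimination; 0.05 is an assumed threshold, not one stated on the poster.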
[Figure: average default rate (approximately 0.14-0.21) by ORDprminqs bin, bins 1-8.]
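ORDprminqs appears to be prminqs discretized into eight ordered bins, a user-defined transformation of the kind listed under Variable Preparation. A hedged sketch with PROC RANK:

    /* Bin prminqs into 8 roughly equal-sized ordered groups */
    proc rank data=train groups=8 out=train_binned;
       var prminqs;
       ranks ORDprminqs;   /* values 0-7; add 1 if 1-8 labels are preferred */
    run;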
Classification Table

Prob     Correct   Correct     Incorrect  Incorrect   Percent   Sensi-   Speci-   False   False
Level    Event     Non-Event   Event      Non-Event   Correct   tivity   ficity   POS     NEG
0.100    137E3     387E3       338E3      17636       59.6      88.6     53.4     71.2    4.4
0.200    105E3     566E3       159E3      49494       76.3      67.9     78.1     60.2    8.0
0.300    79817     639E3       85796      74477       81.8      51.7     88.2     51.8    10.4
0.400    59516     679E3       46001      94778       84.0      38.6     93.7     43.6    12.3
0.500    41719     7E5         25255      113E3       84.3      27.0     96.5     37.7    13.9
0.600    26380     713E3       12121      128E3       84.1      17.1     98.3     31.5    15.2
0.700    13379     72E4        4669       141E3       83.4      8.7      99.4     25.9    16.4
0.800    4225      724E3       1073       15E4        82.8      2.7      99.9     20.3    17.2
0.900    217       725E3       30         154E3       82.5      0.1      100.0    12.1    17.5
1.000    0         725E3       0          154E3       82.4      0.0      100.0    .       17.6

[Figure: KS chart, cutoff percent by decile (0-10).]
[Figure: profit by cluster (Cluster1-Cluster4), $0 to $100,000.]
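A classification table in this exact layout is produced by PROC LOGISTIC's CTABLE option; a sketch, assuming the final model kept placeholder variables var1-var10:

    proc logistic data=train;
       model perf(event='1') = var1-var10 / ctable pprob=(0.1 to 1.0 by 0.1);
       /* Score the held-out validation file for the KS and profit analyses below */
       score data=valid out=valid_scored;   /* adds P_1, the predicted default probability */
    run;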
KS Curve and Cluster Analysis

A KS test is predominantly used in a marketing context but can be applied in the financial market as well. The idea behind the KS test: given a list of potential customers ranked by model score, how deep into that list should solicitation go in order to maximize profit? The KS statistic marks the point of greatest separation between the cumulative distributions of good and bad accounts.
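A sketch of both halves of this section, run against the scored validation file from the modeling step (cluster inputs are placeholders):

    /* KS statistic: maximum gap between the score distributions of goods and bads */
    proc npar1way data=valid_scored edf;
       class perf;
       var p_1;
    run;

    /* k-means clustering into the four groups profiled above */
    proc fastclus data=valid_scored maxclusters=4 out=clustered;
       var var1-var10;
    run;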
Conclusion

After several procedures (cleansing the data, eliminating variables that were over-coded, transforming the remaining variables, and running PROC LOGISTIC), the model had a c-statistic of 0.8122. The profitability function peaked at approximately 25%; in other words, a customer should receive a credit loan only if their predicted probability of default is at or below 0.25. This function showed an average profit per customer of $117.11. KS testing showed that targeting 31-40% of customers yields the greatest difference between actual good and bad credit observations. Each cluster was scored on the validation file used for the model, and the profit per 1,000 people varied from $70,000 to $140,000. Based on the cluster analysis, cluster 2 yielded the largest profit and cluster 3 the lowest.