
#analyticsx

Using Logistic Regression to Predict Credit Default


Steven Leopard and Jun Song
Dr. Jennifer Priestley and Professor Michael Frankel

Executive Summary

The following documentation was created over the course of developing a functional model to predict the risk of default in customers seeking a credit loan, using data provided by Equifax Credit Union. Maximum profitability was achieved by weighing the risk of defaulted loans against the potential profit of successful credit extensions in the sub-prime market.

The original inner merge of the pre- and post-credit-extension data contained 1.4 million observations and 336 predictors. The binary response was created from the delinquency cycles of the observations. The remaining variables were cleansed of all coded values and then transformed into 3 or 6 additional versions, depending on whether the variable was naturally binary. These versions included both SAS-defined and user-defined discretizations, along with the odds ratio of default and the log of that ratio. All variables with variance inflation factors of 4 or greater were removed to prevent multicollinearity.

After transforming and cleansing, the data was split into two separate sets used to build and validate the model. A C-statistic of .812 was found after trimming the model down to a more manageable and cost-effective 10 variables. The variables were single versions of each original raw data source and were scored to produce a profitability of $107 per person at a 24% risk of default.

Two additional analyses were performed to further optimize the model. The KS test reported that the 31-40% decile of observations yielded the largest difference between good and bad credit risks, and the cluster analysis found four groups within the dataset. Of these four groups, cluster 2 produced a profit of $140,000. Once finished, the model provided a profitable way to predict credit default, optimize the sample size needed, and distinguish the ideal group in which to target credit extensions.

Introduction

This research describes the process and results of developing a binary classification model, using Logistic Regression, to generate Credit Risk Scores. These scores are then used to maximize a profitability function.

The data for this project came from a Sub-Prime lender. Three datasets were provided:

CPR. 1,462,955 observations and 338 variables. Each observation represents a unique customer. This file contains all of the potential predictors of credit performance. The variables have differing levels of completeness.

PERF. 17,244,104 observations and 18 variables. This file contains the post hoc performance data for each customer, including the response variable for modeling, DELQID.

TRAN. 8,536,608 observations and 5 variables. This file contains information on the transaction patterns of each customer.

Each file contains a consistent "matchkey" variable, which was used to merge the datasets. The process of the project is outlined in the Methods section below.

Methods
The Methods for this project included:
1. Data Discovery: Cleansing, merging, imputing, and deleting.
2. Multicollinearity: Removing variance inflation factors.
3. Variable Preparation: User and SAS defined discretization.
4. Modeling and Logistic Regression: Training and validation files created then modeled.
5. KS testing and Cluster Analysis: Optimization of profit and group discovery.

Data Cleansing and Merging

The merge of the raw data was made possible by the ordinal variable MATCHKEY: customers with the same value for this variable in both datasets were included in an inner merge, i.e., the intersection of the two datasets by MATCHKEY. In cases where duplicate MATCHKEYs exist, we pick the highest value of DELQID to minimize risk.

The variables in the CPR dataset provide information on clientele before credit is extended and will be used as the basis for prediction. The PERF dataset provides information post credit approval and cannot be used for prediction on its own, but rather as justification of any predictions. More specifically, the PERF variable DELQID will be used as the response, the value we are trying to predict. DELQID is a quantitative variable numbered 0-7 in which each number specifies a customer's current payment status: a person given a 0 is too new to rate, a 1 signifies the person is in the current payment cycle, a 2 signifies that the person is one cycle late, a 3 is two cycles late, and so on. After using SAS to merge the two datasets we are left with 1,743,505 observations and 356 variables.
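The poster does not show the merge code itself, but a minimal SAS sketch of the steps just described (keeping the highest DELQID per duplicate MATCHKEY, then taking the intersection of the two files) might look like the following; the dataset names cpr and perf are assumptions.

/* Keep one row per MATCHKEY in PERF, retaining the highest DELQID to minimize risk. */
proc sort data=perf;
   by matchkey delqid;
run;

data perf_worst;
   set perf;
   by matchkey;
   if last.matchkey;   /* last row per MATCHKEY carries the largest DELQID */
run;

/* Inner merge: keep only customers present in both CPR and PERF. */
proc sort data=cpr;
   by matchkey;
run;

data merged;
   merge cpr(in=in_cpr) perf_worst(in=in_perf);
   by matchkey;
   if in_cpr and in_perf;
run;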
Imputing Coded Variables

For this part of Data Discovery, we deal with coded values and create our binary response variable. Some imputing was done by hand before we applied the macro, and the figures below show AGE before and after as an example. Ages equal to 99 were in fact coded values and were reset to the median age (47); we can see the spike leave 99 and move to 47. We use the median in our macro instead of the mean because most of the data is skewed, and for normally distributed data the mean and median are nearly equal.

For the macro, we chose any coded values above 4 standard deviations to be imputed. We chose 4 so that the bulk of the data, and any possible non-coded outliers, were still preserved for both normally distributed and skewed variables. The second step of the macro was to delete any variables that had more than 25% coded values. We chose 25% simply by observing a pattern in convergence: originally we started by deleting at more than 80% coded, then 60%, 40%, 25%, 20%, and 5%, and eventually chose 25%.
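As a concrete illustration of the hand imputation of AGE described above, a minimal SAS sketch follows; the coded value 99 and the use of the median come from the text, while the dataset and intermediate names are assumptions.

/* Compute the median of AGE (ignoring the coded 99s) and use it to replace coded values. */
proc means data=cpr noprint;
   where age < 99;                 /* exclude coded ages from the median calculation */
   var age;
   output out=age_stats (drop=_type_ _freq_) median=age_median;
run;

data cpr_imputed;
   if _n_ = 1 then set age_stats;  /* make age_median available on every row */
   set cpr;
   if age >= 99 then age = age_median;   /* coded value becomes the median (47 in our data) */
run;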

Figure: Merging Visualization and Multiple Matchkeys (CPR and PERF joined by MATCHKEY).

Figure: Histograms of Age Pre/Post Imputation (Distribution of AGE; x-axis: age, y-axis: percent).

Imputing and/or Deleting Variables

For the macro, we chose any coded values above 4 standard deviations (MSTD) to be imputed. We chose 4 so that the bulk of the data, and any possible non-coded outliers, were still preserved for both normally distributed and skewed variables. The second step of the macro was to delete any variables that had more than 25% coded values (PCTREM). We chose 25% simply by observing a pattern in convergence after running the macro several times: originally we started by deleting at more than 80% coded, then 60%, 40%, 25%, 20%, and 5%, and eventually chose 25%.
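The macro's source is not shown on the poster, but a minimal sketch of the per-variable logic it describes (flag values beyond MSTD standard deviations as coded, drop the variable if more than PCTREM of its values are coded, otherwise impute the coded values with the median) might look like the following; the macro name, parameter names, and flagging rule are assumptions, and for simplicity the median here is computed over all values rather than the non-coded values only.

%macro clean_var(ds=, var=, mstd=4, pctrem=0.25);
   /* Summary statistics for the variable, including the median used for imputation. */
   proc means data=&ds noprint;
      var &var;
      output out=_stats mean=_mean std=_std median=_median n=_n;
   run;

   /* Count how many values fall more than &mstd standard deviations from the mean. */
   data _null_;
      set &ds end=last;
      if _n_ = 1 then set _stats;
      retain _coded 0;
      if abs(&var - _mean) > &mstd * _std then _coded = _coded + 1;
      if last then call symputx('pct_coded', _coded / _n);
   run;

   %if %sysevalf(&pct_coded > &pctrem) %then %do;
      /* Too many coded values: drop the variable entirely. */
      data &ds; set &ds(drop=&var); run;
   %end;
   %else %do;
      /* Otherwise impute the coded values with the median. */
      data &ds;
         if _n_ = 1 then set _stats(keep=_mean _std _median);
         set &ds;
         if abs(&var - _mean) > &mstd * _std then &var = _median;
         drop _mean _std _median;
      run;
   %end;
%mend clean_var;

A call such as %clean_var(ds=merged, var=age); would impute AGE, while a heavily coded variable would be dropped instead.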
Multicollinearity

After the macro has run, we can remove the remaining variables from the PERF dataset. PERF alone had only 18 variables, of which DELQID is the only relevant one for now; we also keep MATCHKEY since it links to CPR. The last step in the data cleansing is eliminating variance inflation, or multicollinearity. We can use SAS to check for variance inflation factors (VIFs) by running a linear regression using all the variables. Some texts say a variable with a VIF of 10 or higher should be removed, but others say 5 (e.g., Rogerson, 2001) or even 4 (e.g., Pan & Jackson, 2008). VIFs are calculated as the reciprocal of tolerance, VIF = 1 / (1 - R^2), where R^2 is the percentage of variation in a predictor that can be explained by the other predictors and the tolerance, 1 - R^2, is the percentage that cannot. As the tolerance gets lower, its reciprocal, the VIF, gets higher. Variance inflation can cause signs to change, such that a beta coefficient that should increase the predicted value of the response instead decreases it, or vice versa. After removing all variables with VIFs of 4 or greater and the blank variable BEACON, we are left with 56 variables.
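A minimal sketch of the VIF check in SAS follows; the dataset name and the truncated predictor list are placeholders, and the choice of dependent variable does not affect the VIFs, which depend only on the predictors.

/* Run an ordinary linear regression purely to obtain VIF and tolerance diagnostics. */
proc reg data=merged;
   model delqid = age prminqs /* ... remaining candidate predictors ... */ / vif tol;
run;
quit;

/* Predictors whose VIF is 4 or greater are then dropped before modeling. */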

Tables Illustrating Cutoff Points for Macro:

PCTREM   MSTD   Obs.      Number of Variables
0.8      4      1255429   317
0.6      4      1255429   299
0.4      4      1255429   182
0.25     4      1255429   146
0.2      4      1255429   143
0.05     4      1255429   139

Figure: Number of Variables based on PCTREM Value.

Figure: Variance Inflation Factors (VIFs).

Creating the Response

Finally, the variable GOODBAD was created so we could give a simple yes-or-no, 0-or-1 answer to the question concerning credit. This variable is made from DELQID such that any value less than 3 is given a 0 and considered good, while the rest are given 1's and considered bad. This is the primary component in binary classification and the Bernoulli probability distribution. The table below is a frequency chart of GOODBAD; note that over 80% of the observations are good (0).

Frequencies of the Response Variable GOODBAD

Goodbad   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0         1034829     82.43     1034829                82.43
1         220600      17.57     1255429                100.00
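A one-step SAS sketch of this recoding, assuming the cleansed dataset is named merged:

/* DELQID < 3 (current or one cycle late) is good; 3 or more is bad. */
data merged;
   set merged;
   goodbad = (delqid >= 3);   /* boolean expression evaluates to 1 (bad) or 0 (good) */
run;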

Discretization

This section deals with the transformation of the variables remaining after cleansing the data. Here we create 3 to 6 discrete versions of each variable in a way that better fits the dependent variable GOODBAD. We are trying to map the optimal monotonic transformation of the continuous variable back to the form of the function used in the model. Certain variables in their raw form are not useful in binary classification. This creates a problem when a variable like AGE is graphed against GOODBAD: we see only horizontal lines at 0 and 1, as in the following figure.

Figure: Horizontal Plot of Age Before Discretization (AGE graphed against GOODBAD).

User and SAS Defined Discretization

To solve the problem previously described, we need to transform a continuous variable into a bounded, discrete variable, a process otherwise known as discretization. There are two ways we discretize the data: user defined (Disc1) and SAS defined (Disc2). For both discretizations, there are 3 transformations of each variable (ordinal, odds, and log of odds), giving seven versions in total including the original. This holds true unless we come across an inherently binary variable; such a variable will have only 4 versions: itself and the Disc1 transformations. The following figures show the difference between user and SAS defined discretization of the variable PRMINQS (Number of promotions, account money / revolving inquiries).
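The discretization code is not shown on the poster, but a minimal SAS sketch of the general idea (bin a continuous predictor, then attach the empirical odds and log odds of GOODBAD for each bin) could look like the following; the bin count of eight simply mirrors the ORDprminqs axis in the figures, and the output names are assumptions. Bins with 0% or 100% bad rates would need special handling before taking logs.

/* 1. SAS-defined style discretization: cut PRMINQS into 8 ordinal bins of equal size. */
proc rank data=merged out=binned groups=8;
   var prminqs;
   ranks ord_prminqs;
run;

/* 2. Empirical default rate per bin, used to build the odds and log-odds versions. */
proc sql;
   create table bin_rates as
   select ord_prminqs,
          mean(goodbad) as bad_rate
   from binned
   group by ord_prminqs;

   create table discretized as
   select b.*,
          r.bad_rate / (1 - r.bad_rate)   as odds_prminqs,
          log(calculated odds_prminqs)    as logodds_prminqs
   from binned as b
   left join bin_rates as r
   on b.ord_prminqs = r.ord_prminqs;
quit;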

Figure: Raw Form of PRMINQS (Distribution of PRMINQS; x-axis: prminqs, y-axis: percent).

Building the Model

After all the variables have been discretized and the variance inflation factors have been removed, one additional step remains before modeling: the training and validation datasets must be created to both build and validate the model. To do this, a new variable, RANDOM, is used to split the data 70/30, so that the training dataset receives the majority of observations. With the training portion, the model can be built and then scored on the random, mutually exclusive validation file. This is done so that the model can be tested on a subset of the data different from the one used to create it; theoretically, the model should then work on any subset to which it is applied. In some work environments an entirely separate dataset would be used for validation. After the master file has been split, proc logistic is run on the training file. The model was created using backward selection, in which 85 redundant or insignificant variables were removed. With backward selection, the model starts from everything at once and then removes variables whose elimination ultimately enhances the predictive capability of the model. The ROC curves below show the area under the curve, or C-statistic, for the model with all remaining variable transformations and again with only the 10 variables with the highest chi-square values.
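A minimal SAS sketch of the 70/30 split and backward-selection fit described above; the seed, significance level, and the candidate variable list macro variable are assumptions, not values reported on the poster.

/* 70/30 split using a uniform random number, as described above. */
data train valid;
   set master;
   random = ranuni(12345);          /* assumed seed */
   if random <= 0.70 then output train;
   else output valid;
run;

/* Backward selection on the training file; GOODBAD = 1 is the event of interest. */
proc logistic data=train descending outmodel=credit_model;
   model goodbad = &candidate_vars / selection=backward slstay=0.05;
run;

/* Score the held-out validation file with the fitted model. */
proc logistic inmodel=credit_model;
   score data=valid out=valid_scored;
run;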
Figure: ROC curves. All Variables: C-statistic 0.85. 10 Variables: C-statistic 0.81.

Figure: User Created and SAS Created discretizations of PRMINQS (mean GOODBAD by ORDprminqs bin, bins 1-8).

Scoring on the Validation Set

Once the model has been built, the validation file can be used to score the data. When scoring, percentiles are created to determine the cutoff point for the probability of default; in this way we are able to maximize the profitability of the model. Using the classification table, we can create a graph to see where the derivative of the profit function equals 0.
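The classification table below is the kind of output produced by the CTABLE option of proc logistic; a minimal sketch follows, where the final variable list macro variable is a placeholder.

/* Classification table at probability cutoffs 0.0 to 1.0 in steps of 0.1. */
proc logistic data=train descending;
   model goodbad = &final_vars / ctable pprob=(0 to 1 by 0.1);
run;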
Classification Table and Profit Function

Prob     Correct             Incorrect           Percentages
Level    Event   Non-Event   Event   Non-Event   Correct   Sensitivity   Specificity   False POS   False NEG
0.000    154E3   0           725E3   0           17.6      100.0         0.0           82.4        .
0.100    137E3   387E3       338E3   17636       59.6      88.6          53.4          71.2        4.4
0.200    105E3   566E3       159E3   49494       76.3      67.9          78.1          60.2        8.0
0.300    79817   639E3       85796   74477       81.8      51.7          88.2          51.8        10.4
0.400    59516   679E3       46001   94778       84.0      38.6          93.7          43.6        12.3
0.500    41719   7E5         25255   113E3       84.3      27.0          96.5          37.7        13.9
0.600    26380   713E3       12121   128E3       84.1      17.1          98.3          31.5        15.2
0.700    13379   72E4        4669    141E3       83.4      8.7           99.4          25.9        16.4
0.800    4225    724E3       1073    15E4        82.8      2.7           99.9          20.3        17.2
0.900    217     725E3       30      154E3       82.5      0.1           100.0         12.1        17.5
1.000    0       725E3       0       154E3       82.4      0.0           100.0         .           17.6

Figure: Peak Average Profit Per Customer by cutoff percent (profit in dollars; cutoffs 14-32%).

Figure: KS Curve by decile (deciles 0-10; y-axis: percent).

Figure: Profit Per 1000 People for Clusters 1-4 ($0 to $160,000).
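The poster reports the resulting profit curve but not the underlying profit and loss per account, so the sketch below only illustrates the mechanics of searching for a profit-maximizing cutoff; the per-account profit and loss values, and the scored probability column name P_1, are placeholders rather than figures from the study.

/* For each candidate cutoff, extend credit to customers whose predicted default
   probability is below the cutoff and compute the average profit per customer. */
%let profit_good = 250;    /* placeholder revenue from a customer who repays  */
%let loss_bad    = -1000;  /* placeholder loss from a customer who defaults   */

data profit_by_cutoff;
   set valid_scored end=last;
   array profit[20] _temporary_;      /* accumulators for cutoffs 0.05, 0.10, ..., 1.00 */
   array count[20]  _temporary_;
   do i = 1 to 20;
      cutoff = i * 0.05;
      if P_1 < cutoff then do;        /* credit extended at this cutoff */
         profit[i] = sum(profit[i], ifn(goodbad = 0, &profit_good, &loss_bad));
         count[i]  = sum(count[i], 1);
      end;
   end;
   if last then do i = 1 to 20;
      cutoff = i * 0.05;
      avg_profit = profit[i] / max(count[i], 1);
      output;
   end;
   keep cutoff avg_profit;
run;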

KS Curve and Cluster Analysis

A KS test is predominantly used in a marketing context but can be used in the financial market as well. The idea behind a KS test is this: if a list of some number of potential customers existed, stretched out over some domain, how deep into that list should solicitation go in order to optimize profit?

A clustering analysis determines how many groups, or clusters, if any, exist within a dataset. To decide how many clusters the dataset has, three tests are used: the Cubic Clustering Criterion, the Pseudo-F statistic, and the Pseudo-T statistic. The inflection points within these tests occur when the number of clusters is four.
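A minimal SAS sketch of how the KS statistic and the cluster diagnostics could be obtained; the scored validation file valid_scored, the probability column P_1, and the clustering variable list are assumptions, since the poster does not show the actual procedure calls. On data this large, the hierarchical step would typically be run on a sample.

/* KS statistic: maximum separation between the score distributions of goods and bads. */
proc npar1way data=valid_scored edf;
   class goodbad;
   var P_1;
run;

/* Cluster diagnostics: CCC, Pseudo-F, and Pseudo-T to choose the number of clusters. */
proc cluster data=valid_scored method=ward ccc pseudo outtree=tree;
   var age prminqs /* ... assumed clustering variables ... */;
run;

/* Assign observations to the chosen four clusters. */
proc fastclus data=valid_scored maxclusters=4 out=clustered;
   var age prminqs /* ... assumed clustering variables ... */;
run;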
Conclusion

After several procedures (cleansing the data, eliminating variables that were heavily coded, transforming the remaining variables, and running proc logistic), the model had a C-statistic of .8122. The profitability function peaked at approximately 25%; in other words, a probability of default at or below .25 is acceptable for receiving a credit loan. This function showed an average profit per customer of $117.11. KS testing showed that targeting the 31-40% decile of customers yields the greatest difference between actual good and bad credit observations. Each cluster was scored using the validation file from the model, and the profit per 1000 people varies from $70,000 to $140,000. Based on the clustering analysis, cluster 2 yielded the largest profit and cluster 3 the lowest.
