Case: German Credit: Var. # Variable Name Description Variable Type Code Description
New applicants for credit can also be evaluated on these 30 "predictor" variables. We want to develop a
credit scoring rule that determines whether a new applicant is a good credit risk or a bad credit risk,
based on the values of one or more of the predictor variables. All the variables are explained in Table 1.1.
(Note: The original data set had a number of categorical variables, some of which have been transformed
into a series of binary variables so that XLMiner can handle them appropriately. Several ordered
categorical variables have been left as is, to be treated by XLMiner as numerical. The data have been
organized in the spreadsheet German CreditI.xls.)
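The recoding described in the note can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the field name `purpose`, the level "OTHER", and the three sample applicants are hypothetical, and the function is not the procedure XLMiner runs. Only the resulting column names (NEW_CAR, USED_CAR, FURNITURE, RADIO/TV) come from the data set.

```python
def to_binary_columns(records, field, levels):
    """Replace one categorical field with a 0/1 column for each listed level."""
    encoded = []
    for rec in records:
        # Copy every field except the categorical one being recoded.
        row = {k: v for k, v in rec.items() if k != field}
        for level in levels:
            row[level] = 1 if rec[field] == level else 0
        encoded.append(row)
    return encoded


# Hypothetical applicants; "purpose" is an assumed name for the original field.
applicants = [
    {"AGE": 67, "purpose": "RADIO/TV"},
    {"AGE": 45, "purpose": "FURNITURE"},
    {"AGE": 30, "purpose": "OTHER"},  # a level with no column of its own: all zeros
]
levels = ["NEW_CAR", "USED_CAR", "FURNITURE", "RADIO/TV"]
encoded = to_binary_columns(applicants, "purpose", levels)
```

Each applicant ends up with at most one binary column set to 1, and a level that has no column of its own is represented by zeros throughout.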
Var. #  Variable Name    Description                               Variable Type  Code Description
                                                                                  1 < ... <= 2 years
                                                                                  2 < ... <= 3 years
                                                                                  3 : > 4 years
21.     REAL_ESTATE      Applicant owns real estate                Binary         0: No, 1: Yes
22.     PROP_UNKN_NONE   Applicant owns no property (or unknown)   Binary         0: No, 1: Yes
                                                                                  1 : unskilled - resident
                                                                                  2 : skilled employee / official
                                                                                  3 : management / self-employed / highly qualified employee / officer
29.     NUM_DEPENDENTS   Number of people for whom liable to       Numerical
                         provide maintenance
30.     TELEPHONE        Applicant has phone in his or her name    Binary         0: No, 1: Yes
31.     FOREIGN          Foreign worker                            Binary         0: No, 1: Yes
32.     RESPONSE         Credit rating is good                     Binary         0: No, 1: Yes
Variable            Obs. 1   Obs. 2   Obs. 3   Obs. 4
CHK_ACCT                 0        1        3        0
DURATION                 6       48       12       42
HISTORY                  4        2        4        2
NEW_CAR                  0        0        0        0
USED_CAR                 0        0        0        0
FURNITURE                0        0        0        1
RADIO/TV                 1        1        0        0
EDUCATION                0        0        1        0
RETRAINING               0        0        0        0
AMOUNT                1169     5951     2096     7882
SAV_ACCT                 4        0        0        0
EMPLOYMENT               4        2        3        3
INSTALL_RATE             4        2        2        2
MALE_DIV                 0        0        0        0
MALE_SINGLE              1        0        1        1
MALE_MAR_or_WID          0        0        0        0
CO-APPLICANT             0        0        0        0
GUARANTOR                0        0        0        1
PRESENT_RESIDENT         4        2        3        4
REAL_ESTATE              1        1        1        0
PROP_UNKN_NONE           0        0        0        0
AGE                     67       22       49       45
OTHER_INSTALL            0        0        0        0
RENT                     0        0        0        0
OWN_RES                  1        1        1        0
NUM_CREDITS              2        1        1        1
JOB                      2        2        1        2
NUM_DEPENDENTS           1        1        2        2
TELEPHONE                1        0        0        0
FOREIGN                  0        0        0        0
RESPONSE                 1        0        1        1
Table 1.2 The data (first four observations, one column per observation)
The consequences of misclassification have been assessed as follows: the cost of a false positive
(incorrectly saying an applicant is a good credit risk) outweighs the cost of a false negative (incorrectly
saying an applicant is a bad credit risk) by a factor of five. This can be summarized in the following table.
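Opportunity cost (in Deutsche Marks)
                 Predicted (Decision)
Actual           Good (Accept)    Bad (Reject)
Good             0                100 DM
Bad              500 DM           0

Average net profit (in Deutsche Marks)
                 Predicted (Decision)
Actual           Good (Accept)    Bad (Reject)
Good             100 DM           0
Bad              -500 DM          0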
Let us use the latter table, stated as net profit, in assessing the performance of the various models, because
it is simpler to explain to decision-makers who are used to thinking of their decisions in terms of net profits.
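To see what the net profit figures imply for a single applicant, the expected profit of extending credit can be written as a function of the probability that the applicant is a good risk. The Python sketch below is an illustration, not part of the case: the function and constant names are hypothetical, and the 100 DM gain on a good applicant is taken from the stated five-to-one ratio against the 500 DM loss on a bad one.

```python
from fractions import Fraction

GAIN_GOOD_ACCEPTED = 100   # DM gained when an accepted applicant is a good risk (assumed from the 5:1 ratio)
LOSS_BAD_ACCEPTED = -500   # DM lost when an accepted applicant is a bad risk


def expected_net_profit(p_good):
    """Expected DM from accepting an applicant who is good with probability p_good.

    Rejecting an applicant yields 0 DM, so acceptance pays only when this is positive.
    """
    p = Fraction(p_good)
    return p * GAIN_GOOD_ACCEPTED + (1 - p) * LOSS_BAD_ACCEPTED


# Break-even probability: solve 100*p - 500*(1 - p) = 0 for p.
break_even = Fraction(-LOSS_BAD_ACCEPTED, GAIN_GOOD_ACCEPTED - LOSS_BAD_ACCEPTED)
```

Under these figures the break-even probability is 5/6, so an applicant judged only 50% likely to be good has a negative expected profit. This is a theoretical benchmark that assumes well-calibrated probabilities; the cutoff that works best on the validation data may differ.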
Assignment
1. Review the predictor variables and guess from their definitions what role each might play in a credit
decision. Are there any surprises in the data?
2. Divide the data randomly into training (60%) and validation (40%) partitions, and develop classification
models using the following data mining techniques in XLMiner:
Logistic regression
Classification trees
Neural networks
Discriminant analysis
3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the
validation data. For the logistic regression model use a cutoff predicted probability of success
("success"=1) of 0.5. Which technique gives the most net profit on the validation data?
4. Let's see if we can improve our performance by changing the cutoff. Rather than accepting XLMiner's
initial classification of everyone's credit status, let's use the "predicted probability of success" in logistic
regression as a basis for selecting the best credit risks first, followed by poorer risk applicants.
a. Sort the validation data on "predicted probability of success."
b. For each validation case, calculate the actual cost/gain of extending credit.
c. Add another column for cumulative net profit.
d. How far into the validation data do you go to get maximum net profit? (Often this is specified as a
percentile or rounded to deciles.)
e. If this logistic regression model is used to score future applicants, what "probability of success" cutoff
should be used in extending credit?
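The calculations in questions 3 and 4 can be sketched outside XLMiner as follows. This is a minimal Python illustration under stated assumptions: the function names and the ten validation cases are hypothetical, the scores stand in for the "predicted probability of success" exported from a fitted model, and the default gain of 100 DM and loss of 500 DM follow the net profit figures given earlier.

```python
def net_profit_at_cutoff(cases, cutoff=0.5, gain=100, loss=-500):
    """Confusion counts and net profit when credit is extended at prob >= cutoff.

    cases: list of (predicted_probability_of_success, actual), actual 1 = good, 0 = bad.
    """
    counts = {"good_accepted": 0, "bad_accepted": 0, "good_rejected": 0, "bad_rejected": 0}
    for prob, actual in cases:
        decision = "accepted" if prob >= cutoff else "rejected"
        counts[("good_" if actual == 1 else "bad_") + decision] += 1
    # Only accepted applicants contribute; rejections are worth 0 DM.
    profit = counts["good_accepted"] * gain + counts["bad_accepted"] * loss
    return counts, profit


def best_cumulative_profit(cases, gain=100, loss=-500):
    """Steps a to e: sort by probability, accumulate net profit, locate the maximum."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)      # step a
    best = {"n_accepted": 0, "profit": 0, "cutoff": None}         # accepting nobody earns 0
    running = 0
    for n, (prob, actual) in enumerate(ranked, start=1):
        running += gain if actual == 1 else loss                  # steps b and c
        if running > best["profit"]:
            best = {"n_accepted": n, "profit": running, "cutoff": prob}  # steps d and e
    return best


# Hypothetical validation scores (probability, actual outcome), for illustration only.
validation = [(0.95, 1), (0.90, 1), (0.85, 1), (0.80, 0), (0.75, 1),
              (0.70, 1), (0.60, 0), (0.55, 1), (0.40, 0), (0.30, 1)]
counts, profit_at_half = net_profit_at_cutoff(validation)
best = best_cumulative_profit(validation)
```

With these made-up scores, the 0.5 cutoff accepts six good and two bad applicants for a net loss of 400 DM, while the cumulative profit peaks at 300 DM after the top three cases (the top 30% of the validation data), which corresponds to a cutoff of 0.85. The sketch ignores ties in predicted probability; tied cases would need to be accepted or rejected together.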