Predictive Versus Descriptive Analytics
• Basis
– Predictive: use one or more independent variables to predict the value of a dependent variable
– Descriptive: analyze a data set with many attributes to uncover underlying patterns; there is no dependent variable
• Scenario
– Predictive: whether a customer will default is a dependent variable that can be predicted from factors such as the customer's gender, age, income, occupation, financial situation, and credit history
– Descriptive: examine how similar individuals are, for example customer segmentation based on age, gender, income, and other factors, or discovering associations between products from customers' purchases of multiple products
• Main algorithms
– Predictive: decision trees, linear regression, logistic regression, support vector machines, neural networks, discriminant analysis, …
– Descriptive: clustering, association analysis, factor analysis, principal component analysis, social network analysis, …
(Slide sidebar: Base, statistical analysis, time series, operations research)
22
Data Mining Methodology: SEMMA
3
Section 1
Preprocessing Data
for Credit Scoring and
PD Modeling: Univariate Analysis
Motivation
Dirty, noisy data
– For example, Age = -2003
Inconsistent data
– Value '0' means actual zero or missing value
Incomplete data
– Income=?
Data integration and data merging problems
– Amounts in euros versus amounts in dollars
Duplicate data
– Salary versus professional income
5
Preprocessing Data for Credit Scoring
Types of variables
Sampling
Visual data exploration
Missing values
Outlier detection and treatment
Transforming data
Recoding categorical variables
6
Types of Variables
Continuous
– Defined on a continuous interval
– For example, income, amount on savings account
– In SAS: interval
Discrete
– Binary
For example, gender
– Nominal
No ordering between values
For example, purpose of loan, marital status
– Ordinal
Implicit ordering between values
For example, credit rating (AAA is better than AA, AA is better than A, …)
7
Sampling
Take sample of past applicants to build scoring model
Think carefully about the population on which the model built from the sample
will be used
Timing of sample
– How far do I go back to get my sample?
– Trade-off: more data versus more recent data
Number of bads versus number of goods
– Undersampling, oversampling might be needed (dependent on
classification algorithm, see later)
Sample taken must be from a normal business period to get as accurate
a picture as possible of the target population
Make sure performance window is long enough to stabilize bad rate
(for example, 18 months)
Example sampling problems
– Application scoring: reject inference
– Behavioral scoring: seasonality depending upon the choice of the
observation point
8
Visual Data Exploration
9
Missing Values
Reasons
– Non-applicable (e.g., default date not known for
non-defaulters)
– Not disclosed (e.g., income)
– Error when merging data (e.g., typos in name and/or ID)
Keep
The fact that a variable is missing can be important
information.
Add an additional category for the missing values.
Add an additional missing value indicator variable
(either one per variable, or one for the entire
observation).
10 continued...
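A minimal SAS sketch of the "keep" strategy, assuming a hypothetical input table mydata.applicants with a numeric income variable (the cutoff used for the extra categories is purely illustrative):

data work.applicants_keep;
   set mydata.applicants;
   income_missing = (income = .);              /* missing value indicator variable */
   length income_cat $8;
   if income = . then income_cat = 'Missing';  /* additional category for missings */
   else if income < 1500 then income_cat = 'Low';
   else income_cat = 'High';
run;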
Missing Values
Delete
– When too many missing values, removing the variable
or observation might be an option.
– Horizontally (observations) versus vertically (variables) missing values
Replace
– Estimate missing value using imputation procedures.
– Be consistent when treating missing values during
model development and during model usage!
11
Deleting Missing Values
[Table in slide: ID, Age, Income, Marital Status, Credit Bureau Score, Class]
12
Imputation Procedures for Missing Values
For continuous attributes
– Replace with median/mean (median more robust to outliers)
– If missing values only occur during model development, can
also replace with median/mean of all instances of the same
class
For ordinal/nominal attributes
– Replace with modal value (= most frequent category)
– If missing values only occur during model development,
replace with modal value of all instances of the same class
Regression or tree-based imputation
– Predict missing value using other variables
– Cannot use target class as predictor if missing values can
occur during model usage
– More complicated and often does not substantially improve
the performance of the scorecard!
13
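A minimal SAS sketch of median imputation for continuous attributes, assuming a hypothetical table mydata.applicants; the REPONLY option makes PROC STDIZE replace only the missing values:

proc stdize data=mydata.applicants out=work.applicants_imputed
            reponly method=median;   /* replace missings with the median */
   var income age;
run;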
Outliers
Extreme or unusual observations
– E.g., due to recording, data entry errors or noise
Types of outliers
– Valid observation: salary of boss, ratio variables
– Invalid observation: age = -2003
Outliers can be hidden in one-dimensional views
of the data (multidimensional nature of data)
Univariate outliers versus multivariate outliers
Detection versus treatment
14
Multivariate Outliers
[Figure: scatter plot highlighting the multivariate outliers]
15
Univariate Outlier Detection Methods:
Histograms
[Figure: histogram of Age with frequency (0–3,500) on the y-axis; regular bins between 20 and 70, plus suspicious bins at 0–5 and 150–200 that point to outliers]
16
Univariate Outlier Detection Methods:
z-score
The z-score measures how many standard deviations an observation lies away from the mean for a specific variable:

z_i = \frac{x_i - \mu}{\sigma}

For example, with \mu = 40 and \sigma = 10:

ID  Age  z-score
1   30   (30-40)/10 = -1
2   50   (50-40)/10 = +1
3   10   (10-40)/10 = -3
4   40   (40-40)/10 = 0
5   60   (60-40)/10 = +2
6   80   (80-40)/10 = +4
…   …    …

[Figure: box plot with Min, Q1, M (median), Q3 and the outliers marked]
19
Multivariate Outlier Detection Methods
Mahalanobis distance
D^2 = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)
20
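A small SAS/IML sketch of this computation, using a hypothetical 5 x 2 data matrix (rows = observations, columns = age and income):

proc iml;
   x = {30 1200, 25 800, 52 2200, 48 2000, 34 1800};  /* hypothetical data */
   n = nrow(x);
   xbar = x[:,];                          /* column means */
   xc = x - repeat(xbar, n, 1);           /* centered data */
   S = (xc` * xc) / (n - 1);              /* sample covariance matrix */
   d2 = vecdiag(xc * inv(S) * xc`);       /* squared Mahalanobis distance per row */
   print d2;
quit;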
Outlier Treatment
For invalid outliers:
– For example, age = 300 years
– Treat as missing value (keep, delete, replace)
For valid outliers: truncation/winsorizing/capping
– Truncation based on z-scores:
Replace all variable values having z-scores of
> 3 with the mean + 3 times the standard deviation
Replace all variable values having z-scores of
< -3 with the mean -3 times the standard deviation
– Truncation based on IQR (more robust than z-scores)
Truncate to M±3s, with M=median and s=IQR/(2 x 0.6745)
See Van Gestel, Baesens et al. 2007
– Truncation using a sigmoid
Use a sigmoid transform, f(x) = 1/(1 + e^{-x})
21
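A minimal SAS sketch of z-score based truncation for one variable, assuming a hypothetical table mydata.applicants with a continuous variable income:

proc means data=mydata.applicants noprint;
   var income;
   output out=work.stats mean=inc_mean std=inc_std;
run;

data work.applicants_trunc;
   if _n_ = 1 then set work.stats;                 /* bring mean and std into the step */
   set mydata.applicants;
   if income > inc_mean + 3*inc_std then income = inc_mean + 3*inc_std;
   else if . < income < inc_mean - 3*inc_std then income = inc_mean - 3*inc_std;  /* keep missings untouched */
run;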
Truncation: Example
[Figure: distribution truncated at μ - 3σ and μ + 3σ]
22
Section 2
Preprocessing Data
for Credit Scoring and
PD Modeling: Dimensionality
Reduction
Improving Regression Selection
[Figure: run time in minutes (0–60) versus number of variables (25–100) for all-subsets selection and stepwise selection; the all-subsets curve grows steeply while the stepwise curve stays nearly flat]
24
Improving Input Selection
Variable Clustering
Categorical Recoding
25
Variable Redundancy
26
First Principal Component
First Eigenvalue=1.94
27
Second Principal Component
28
Variable Clustering Alternative
Candidate inputs: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Selected inputs: X1, X3, X4, X6, X8, X9, X10
Inputs selected by
• cluster representation
• expert opinion
• target correlation
33 ...
Second Principal Component
35
Input/Rotated Component Correlations
[Table in slide: correlations of inputs X1, X2, X3 with the first and second rotated components (RCs)]
36
Cluster Inputs
[Table in slide: inputs X1, X2, X3 grouped by the rotated component (first or second) with which they correlate most strongly]
37
Split New Cluster?
[Figure: a variable with R² = 0.90 with its own cluster's principal component (first cluster PC) and R² = 0.01 with the next closest cluster's principal component (second cluster PC)]

\frac{1 - R^2_{\text{own cluster}}}{1 - R^2_{\text{next closest}}} = \frac{1 - 0.90}{1 - 0.01} = 0.101
39 ...
Implementing Variable
Clustering
40
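A minimal SAS sketch of variable clustering with PROC VARCLUS, assuming hypothetical candidate inputs x1-x10; the 1-R**2 ratio in the output helps pick one representative per cluster:

proc varclus data=mydata.applicants maxeigen=0.7 short;
   var x1-x10;    /* hypothetical candidate inputs */
run;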
Recoding Techniques
Enumeration: 1, 2, 3, …
Dummy coding: 0 0 1 0 0 0
41 ...
Dummy-Coding
Level DA DB DC DD DE DF DG DH DI DJ
A 1 0 0 0 0 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0
E 0 0 0 0 1 0 0 0 0 0
F 0 0 0 0 0 1 0 0 0 0
G 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 1 0 0
I 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 1
42
Quasi-complete separation
– Occurs when some levels of a categorical input contain only events or only non-events, so the corresponding dummy coefficients cannot be estimated reliably
43
Target-Based Enumeration
Level Ni ΣYi pi
A 1562 430 0.28
B 970 432 0.45
C 223 45 0.20
D 111 36 0.32
E 85 23 0.27
F 50 20 0.40
G 23 8 0.35
H 17 5 0.29
I 12 6 0.50
J 5 5 1.00
44
Target-Based Enumeration
Same levels sorted by event rate p_i = ΣY_i / N_i; the enumeration follows this ordering
Level Ni ΣYi pi
J 5 5 1.00
I 12 6 0.50
B 970 432 0.45
F 50 20 0.40
G 23 8 0.35
D 111 36 0.32
H 17 5 0.29
A 1562 430 0.28
E 85 23 0.27
C 223 45 0.20
45
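A small SAS sketch of how the event rates above could be computed, assuming a hypothetical table mydata.applicants with a categorical input level and a 0/1 target y:

proc sql;
   create table work.level_rates as
   select level,
          count(*) as Ni,
          sum(y)   as SumYi,
          mean(y)  as pi       /* event rate per level */
   from mydata.applicants
   group by level
   order by pi desc;           /* same ordering as the table above */
quit;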
Weight of Evidence
46
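The slide's figure is not reproduced here; for reference, the weight of evidence of a category is commonly defined as

\text{WoE}_{\text{category}} = \ln\!\left( \frac{n_{\text{good,category}} / n_{\text{good,total}}}{n_{\text{bad,category}} / n_{\text{bad,total}}} \right)

that is, the log of the distribution of goods over the distribution of bads within that category.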
Level Clustering
47
Level Clustering
48
Level Clustering Dummy-Coding
49
Smoothed Weight of Evidence
50
Smoothed Weight of Evidence
51
Section 3
Classification Techniques
for Credit Scoring and
PD Modeling
The Classification Problem
The classification problem can be stated as follows:
– Given an observation (also called pattern) with
characteristics x=(x1,..,xn), determine its class c
from a predetermined set of classes {c1,...,cm}
The classes {c1,..,cm} are known beforehand
Supervised learning!
Binary (2 classes) versus multiclass (> 2 classes)
classification
53
Regression for classification
The good/bad status is coded as a target Y (G = 1, B = 0) so that a regression can be estimated:

Customer  Age  Income  Gender  G/B  Y
John      30   1200    M       B    0
Sarah     25   800     F       G    1
Sophie    52   2200    F       G    1
David     48   2000    M       B    0
Peter     34   1800    M       G    1

[Figure: S-shaped (sigmoid) curve mapping values between -7 and 7 into the interval (0, 1)]
54
Logistic Regression
• Linear regression combined with a transformation that keeps the output between 0 and 1, so that it can be interpreted as a probability (e.g., the probability of being a good customer):

P(\text{customer good} \mid \text{age}, \text{income}, \text{gender}, \ldots) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{age} + \beta_2 \text{income} + \beta_3 \text{gender} + \ldots)}}

• Or, alternatively:

\ln\!\left( \frac{P(\text{customer good} \mid \text{age}, \text{income}, \text{gender}, \ldots)}{P(\text{customer bad} \mid \text{age}, \text{income}, \text{gender}, \ldots)} \right) = \beta_0 + \beta_1 \text{age} + \beta_2 \text{income} + \beta_3 \text{gender} + \ldots
55
Logistic Regression in SAS
Historical data
Customer Age Income Gender Response
John 30 1200 M No
Sarah 25 800 F Yes
Sophie 52 2200 F Yes
David 48 2000 M No
Peter 34 1800 M Yes
proc logistic data=responsedata;
class Gender;
model Response=Age Income Gender;
run;
P(\text{response} = \text{yes} \mid \text{Age}, \text{Income}, \text{Gender}, \ldots) = \frac{1}{1 + e^{-(0.10 + 0.22\,\text{Age} + 0.050\,\text{Income} + 0.80\,\text{Gender} + \ldots)}}
New data
Customer  Age  Income  Gender  Response score
Emma      28   1000    F       0.44
Will      44   1500    M       0.76
Dan       30   1200    M       0.18
Bob       58   2400    M       0.88
56
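A minimal SAS sketch of scoring new applicants with the fitted model, assuming a hypothetical table work.newcustomers containing the same inputs; the SCORE statement adds the predicted probabilities (e.g., P_Yes):

proc logistic data=responsedata;
   class Gender;
   model Response(event='Yes') = Age Income Gender;
   score data=work.newcustomers out=work.scored;   /* predicted probabilities per customer */
run;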
Logistic Regression
The logistic regression model is formulated as follows:
P(Y = 1 \mid X_1, \ldots, X_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n)}} = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}

P(Y = 0 \mid X_1, \ldots, X_n) = 1 - P(Y = 1 \mid X_1, \ldots, X_n) = \frac{1}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}

Hence,

0 \le P(Y = 1 \mid X_1, \ldots, X_n),\ P(Y = 0 \mid X_1, \ldots, X_n) \le 1
57 continued...
Logistic Regression
\ln\!\left( \frac{P(Y = 1 \mid X_1, \ldots, X_n)}{P(Y = 0 \mid X_1, \ldots, X_n)} \right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n

\frac{P(Y = 1 \mid X_1, \ldots, X_n)}{1 - P(Y = 1 \mid X_1, \ldots, X_n)} is the odds in favor of Y = 1

\ln\!\left( \frac{P(Y = 1 \mid X_1, \ldots, X_n)}{1 - P(Y = 1 \mid X_1, \ldots, X_n)} \right) is called the logit
58
Logistic Regression
If X_i increases by 1:

\text{logit}(X_i + 1) - \text{logit}(X_i) = \beta_i

\text{odds}(X_i + 1) = \text{odds}(X_i) \cdot e^{\beta_i}

e^{\beta_i} is the odds ratio: the multiplicative increase in the odds when X_i increases by one (other variables remaining constant / ceteris paribus)

\beta_i > 0 \Rightarrow e^{\beta_i} > 1 \Rightarrow odds and probability increase with X_i

\beta_i < 0 \Rightarrow e^{\beta_i} < 1 \Rightarrow odds and probability decrease with X_i
59
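A small worked example, using the illustrative age coefficient 0.22 from the earlier SAS output:

e^{0.22} \approx 1.25

so, other variables remaining constant, each additional year of age multiplies the odds in favor of the event by roughly 1.25.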
Logistic Regression and Weight of Evidence
Coding
Cust ID  Actual Age  Age Group (after classing)  Age WoE (after recoding)
1        20          1: until 22                  -1.1
2        31          2: 22 until 35                0.2
3        49          3: 35+                        0.9
60 continued...
Logistic Regression and Weight of Evidence
Coding
P(Y = 1 \mid X_1, \ldots, X_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_{\text{age}} \cdot \text{age\_woe} + \beta_{\text{purpose}} \cdot \text{purpose\_woe} + \ldots)}}
No dummy variables!
More robust
61
Section 4
Measuring the
Performance of Credit Scoring
Classification Models
How to Measure Performance?
Performance
– How well does the trained model perform in predicting
new unseen (!) instances?
– Decide on performance measure
Classification: Percentage Correctly Classified (PCC),
Sensitivity, Specificity, Area Under ROC curve
(AUROC), ...
Regression: Mean Absolute Deviation (MAD), Mean
Squared Error (MSE), ...
Methods
64
Evaluating classification models
• Train (Estimation) data versus Test (Hold-out) data
• Train data is used to build model (e.g. logistic regression or decision tree)
• Test data is used to measure performance
• Strict separation between training and test set needed!
Train data

Customer  Age  Income  Gender  Response  …  Target
John      30   1200    M       No           0
Sarah     25   800     F       Yes          1
Sophie    52   2200    F       Yes          1
David     48   2000    M       No           0
Peter     34   1800    M       Yes          1

Build model:

proc logistic data=responsedata;
   class Gender;
   model Response=Age Income Gender;
run;

Apply model to the test data:

Customer  Age  Income  Gender  …  Response  Response score
Emma      28   1000    F          No        0.44
Will      44   1500    M          Yes       0.76
Dan       30   1200    M          No        0.18
Bob       58   2400    M          No        0.88
65
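A minimal SAS sketch of creating such a split, assuming a hypothetical table mydata.applicants; PROC SURVEYSELECT flags a random 70% of the observations for training:

proc surveyselect data=mydata.applicants out=work.split
                  samprate=0.7 outall seed=12345;
run;

data work.train work.test;
   set work.split;
   if Selected = 1 then output work.train;   /* Selected flag created by OUTALL */
   else output work.test;
run;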
Performance Measures for Classification
Classification accuracy, confusion matrix, sensitivity,
specificity
Area under the ROC curve
The Lorenz (Power) curve
The Cumulative lift curve
Kolmogorov-Smirnov
66
Classification performance: confusion matrix
Customer  G/B  Score  Predicted (cutoff = 0.50)
John      G    0.72   G
Sophie    B    0.56   G
David     G    0.44   B
Emma      B    0.18   B
Bob       B    0.36   B
Confusion matrix
Actual status
Positive (Good) Negative (Bad)
Predicted Positive (Good) True Positive (John) False Positive (Sophie)
status Negative (Bad) False Negative (David) True Negative (Emma, Bob)
Classification accuracy=(TP+TN)/(TP+FP+FN+TN)=3/5
Classification error = (FP +FN)/(TP+FP+FN+TN)=2/5
Sensitivity=TP/(TP+FN)=1/2
Specificity=TN/(FP+TN)=2/3
[Figure: for every cutoff between 0.01 and 0.99, sensitivity is plotted against (1 - specificity), giving the ROC curves of scorecards A and B]
• Perfect model has sensitivity of 1 and specificity of 1 (i.e. upper left corner)
• Scorecard A is better than B in above figure
• ROC curve can be summarized by the area underneath (AUC); the bigger
the better!
68
The Receiver Operating Characteristic Curve
[Figure: ROC curve with sensitivity on the y-axis and (1 - specificity) on the x-axis]
69 ...
The Area Under the ROC Curve
How to compare intersecting ROC curves?
The area under the ROC curve (AUC)
The AUC provides a simple figure-of-merit for the
performance of the constructed classifier
An intuitive interpretation of the AUC is that it provides
an estimate of the probability that a randomly chosen
instance of class 1 is correctly ranked higher than a
randomly chosen instance of class 0 (Hanley and
McNeil, 1983) (Wilcoxon or Mann-Whitney or U
statistic)
The higher the better
A good classifier should have an AUC larger than 0.5
70
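As a practical note, PROC LOGISTIC reports the AUROC as the c statistic in its association table and can plot the ROC curve; a minimal sketch, reusing the earlier response model:

ods graphics on;
proc logistic data=responsedata plots(only)=roc;
   class Gender;
   model Response(event='Yes') = Age Income Gender;
run;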
Cumulative Accuracy Profile (CAP)
[Figure: cumulative accuracy profile of the current model versus the perfect model and the random model; A is the area between the perfect model and the current model, B is the area between the current model and the random model]

AR = B / (A + B)
73 continued...
The Kolmogorov-Smirnov Distance
[Figure: cumulative score distributions P(s|B) and P(s|G) plotted against the score (y-axis from 0 to 1); the KS distance is the maximum vertical difference between the two curves]
74
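In formula form (a standard definition, not spelled out on the slide):

KS = \max_s \left| P(S \le s \mid G) - P(S \le s \mid B) \right|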
Section 5
Input Selection for
Classification
Input Selection
Inputs=Features=Attributes=Characteristics=Variables.
Also called feature selection, attribute selection,
characteristic selection, variable selection
If n features are present, 2^n - 1 possible non-empty feature subsets can
be considered (already more than a million for n = 20).
Heuristic search methods are needed!
Good feature subsets contain features highly correlated
with (predictive of) the class, yet uncorrelated with (not
predictive of) each other (Hall and Smith 1998).
Can improve the performance and estimation time
of the classifier.
76 continued...
Filter Methods for Input Selection
                    Continuous target (e.g., LGD)   Discrete target (e.g., PD)
Continuous input    Pearson correlation              Fisher score
(e.g., income)      Spearman correlation
                    Hoeffding's D
77
Pearson Correlation
Compute Pearson correlation between each
continuous variable and continuous target
Always varies between -1 and +1
Only keep variables for which |ρP| > 0.50; or keep,
e.g., top 10%
78
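For reference, the Pearson correlation between a continuous variable X and a continuous target Y is

\rho_P = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}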
Chi-Squared-Based Filter
Good Payer Bad Payer Total
Observed
Frequencies Married 500 100 600
Not Married 300 100 400
800 200 1000
80
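A worked continuation (assuming this is what the following slides compute): under independence the expected frequencies are 480, 120, 320, and 80, so

\chi^2 = \frac{(500-480)^2}{480} + \frac{(100-120)^2}{120} + \frac{(300-320)^2}{320} + \frac{(100-80)^2}{80} \approx 10.4

with 1 degree of freedom, pointing to a significant association between marital status and repayment behavior.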
Information Value/Gain
Information value is defined as (see previously):

\text{IV} = \sum_{\text{categories}} \left( p_{\text{good,category}} - p_{\text{bad,category}} \right) \cdot \text{WoE}_{\text{category}}

A predictive variable has IV > 0.10.
Check Linearity
84
Input Selection Procedure
Step 1: Use a filter procedure
– For quick filtering of inputs
– Inputs are selected independent of the
classification algorithm (e.g., logistic regression).
Step 2: Forward/backward/stepwise regression
– Use the p-value of the logistic regression for input
selection.
85
Forward/Backward/Stepwise Regression
Use the p-value to decide upon the importance
of the inputs:
– p-value < 0.01: highly significant
– 0.01 < p-value < 0.05: significant
– 0.05 < p-value < 0.10: weakly significant
– 0.1 < p-value: not significant
Can be used in different ways:
– Forward
– Backward
– Stepwise
86
Search Strategies
Forward selection
– Starts from empty model and always adds
variables based on low p-values
Backward elimination
– Starts from full model and always deletes variables
based on high p-values
Stepwise
– Starts as forward selection, but at each step also checks whether
variables added earlier can be removed again
87
Example: Search Space for Four Inputs
{}
{I1} {I2} {I3} {I4}
{I1, I2} {I1, I3} {I2, I3} {I1, I4} {I2, I4} {I3, I4}
{I1, I2, I3} {I1, I2, I4} {I1, I3, I4} {I2, I3, I4}
{I1, I2, I3, I4}
90 continued...
Stepwise Logistic Regression
SELECTION=STEPWISE   This is similar to the SELECTION=FORWARD option except that
                     effects already in the model do not necessarily remain. Effects are
                     entered into and removed from the model in such a way that each
                     forward selection step might be followed by one or more backward
                     elimination steps. The stepwise selection process terminates if no
                     further effect can be added to the model or if the effect just entered
                     into the model is the only effect removed in the subsequent backward
                     elimination.

SELECTION=SCORE      PROC LOGISTIC uses the branch and bound algorithm of Furnival
                     and Wilson (1974).
91
Stepwise Logistic Regression: Example
proc logistic data=mydata.applicants;
class checking history purpose savings
employed marital coapp resident
property other housing;
model good_bad= amount duration age
installp checking history purpose
savings employed marital coapp
resident property other housing
/selection=stepwise slentry=0.10
slstay=0.01;
run;
92
Thanks!
93