Predictive Versus Descriptive Analytics
• Basis
– Predictive: use one or more independent variables to predict the value of a dependent variable
– Descriptive: analyze a data set with many attributes to uncover underlying patterns; there is no dependent variable
• Scenario
– Predictive: whether a customer will default is a dependent variable that can be predicted from factors such as the customer's gender, age, income, occupation, financial situation, and credit history
– Descriptive: examine how similar individuals are, for example customer segmentation based on age, gender, income, and other factors, or discovering associations between products from customers' purchases of multiple products
• Main algorithms
– Predictive: decision trees, linear regression, logistic regression, support vector machines, neural networks, discriminant analysis, …
– Descriptive: clustering, association analysis, factor analysis, principal component analysis, social network analysis, …
(Slide sidebar: Base, statistical analysis, time series, operations research)
22
Data Mining Methodology: SEMMA
3
Section 1
Preprocessing Data
for Credit Scoring and
PD Modeling: Univariate Analysis
Motivation
Dirty, noisy data
– For example, Age = -2003
Inconsistent data
– Value '0' means actual zero or missing value
Incomplete data
– Income=?
Data integration and data merging problems
– Amounts in euros versus amounts in dollars
Duplicate data
– Salary versus professional income
5
Preprocessing Data for Credit Scoring
Types of variables
Sampling
Visual data exploration
Missing values
Outlier detection and treatment
Transforming data
Recoding categorical variables
6
Types of Variables
Continuous
– Defined on a continuous interval
– For example, income, amount on savings account
– In SAS: interval
Discrete
– Binary
For example, gender
– Nominal
No ordering between values
For example, purpose of loan, marital status
– Ordinal
Implicit ordering between values
For example, credit rating (AAA is better than AA, AA is better than A, …)
7
Sampling
Take sample of past applicants to build scoring model
Think carefully about the population on which the model built from the sample
will be used
Timing of sample
– How far do I go back to get my sample?
– Trade-off: more data versus more recent data
Number of bads versus number of goods
– Undersampling, oversampling might be needed (dependent on
classification algorithm, see later)
Sample taken must be from a normal business period to get as accurate
a picture as possible of the target population
Make sure performance window is long enough to stabilize bad rate
(for example, 18 months)
Example sampling problems
– Application scoring: reject inference
– Behavioral scoring: seasonality depending upon the choice of the
observation point
8
Visual Data Exploration
9
Missing Values
Reasons
– Non-applicable (e.g., default date not known for
non-defaulters)
– Not disclosed (e.g., income)
– Error when merging data (e.g., typos in name and/or ID)
Keep
The fact that a variable is missing can be important
information.
Add an additional category for the missing values.
Add an additional missing value indicator variable
(either one per variable, or one for the entire
observation).
10 continued...
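A minimal SAS sketch of the "keep" strategy, assuming a hypothetical input table mydata.applicants with a numeric income variable (the cutoff used for the extra categories is purely illustrative):

data work.applicants_keep;
   set mydata.applicants;
   income_missing = (income = .);              /* missing value indicator variable */
   length income_cat $8;
   if income = . then income_cat = 'Missing';  /* additional category for missings */
   else if income < 1500 then income_cat = 'Low';
   else income_cat = 'High';
run;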
Missing Values
Delete
– When too many missing values, removing the variable
or observation might be an option.
– Horizontally (observations) versus vertically (variables) missing values
Replace
– Estimate missing value using imputation procedures.
– Be consistent when treating missing values during
model development and during model usage!
11
Deleting Missing Values
[Table in slide: ID, Age, Income, Marital Status, Credit Bureau Score, Class]
12
Imputation Procedures for Missing Values
For continuous attributes
– Replace with median/mean (median more robust to outliers)
– If missing values only occur during model development, can
also replace with median/mean of all instances of the same
class
For ordinal/nominal attributes
– Replace with modal value (= most frequent category)
– If missing values only occur during model development,
replace with modal value of all instances of the same class
Regression or tree-based imputation
– Predict missing value using other variables
– Cannot use target class as predictor if missing values can
occur during model usage
– More complicated and often does not substantially improve
the performance of the scorecard!
13
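A minimal SAS sketch of median imputation for continuous attributes, assuming a hypothetical table mydata.applicants; the REPONLY option makes PROC STDIZE replace only the missing values:

proc stdize data=mydata.applicants out=work.applicants_imputed
            reponly method=median;   /* replace missings with the median */
   var income age;
run;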
Outliers
Extreme or unusual observations
– E.g., due to recording, data entry errors or noise
Types of outliers
– Valid observation: salary of boss, ratio variables
– Invalid observation: age = -2003
Outliers can be hidden in one-dimensional views
of the data (multidimensional nature of data)
Univariate outliers versus multivariate outliers
Detection versus treatment
14
Multivariate Outliers
[Figure: scatter plot highlighting the multivariate outliers]
15
Univariate Outlier Detection Methods:
Histograms
[Figure: histogram of Age with frequency (0–3,500) on the y-axis; regular bins between 20 and 70, plus suspicious bins at 0–5 and 150–200 that point to outliers]
16
Univariate Outlier Detection Methods:
z-score
The z-score measures how many standard deviations an observation lies away from the mean for a specific variable:

z_i = \frac{x_i - \mu}{\sigma}

For example, with \mu = 40 and \sigma = 10:

ID  Age  z-score
1   30   (30-40)/10 = -1
2   50   (50-40)/10 = +1
3   10   (10-40)/10 = -3
4   40   (40-40)/10 = 0
5   60   (60-40)/10 = +2
6   80   (80-40)/10 = +4
…   …    …

[Figure: box plot with Min, Q1, M (median), Q3 and the outliers marked]
19
Multivariate Outlier Detection Methods
Mahalanobis distance
D^2 = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)
20
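A small SAS/IML sketch of this computation, using a hypothetical 5 x 2 data matrix (rows = observations, columns = age and income):

proc iml;
   x = {30 1200, 25 800, 52 2200, 48 2000, 34 1800};  /* hypothetical data */
   n = nrow(x);
   xbar = x[:,];                          /* column means */
   xc = x - repeat(xbar, n, 1);           /* centered data */
   S = (xc` * xc) / (n - 1);              /* sample covariance matrix */
   d2 = vecdiag(xc * inv(S) * xc`);       /* squared Mahalanobis distance per row */
   print d2;
quit;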
Outlier Treatment
For invalid outliers:
– For example, age = 300 years
– Treat as missing value (keep, delete, replace)
For valid outliers: truncation/winsorizing/capping
– Truncation based on z-scores:
Replace all variable values having z-scores of
> 3 with the mean + 3 times the standard deviation
Replace all variable values having z-scores of
< -3 with the mean -3 times the standard deviation
– Truncation based on IQR (more robust than z-scores)
Truncate to M±3s, with M=median and s=IQR/(2 x 0.6745)
See Van Gestel, Baesens et al. 2007
– Truncation using a sigmoid
Use a sigmoid transform, f(x) = 1/(1 + e^{-x})
21
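A minimal SAS sketch of z-score based truncation for one variable, assuming a hypothetical table mydata.applicants with a continuous variable income:

proc means data=mydata.applicants noprint;
   var income;
   output out=work.stats mean=inc_mean std=inc_std;
run;

data work.applicants_trunc;
   if _n_ = 1 then set work.stats;                 /* bring mean and std into the step */
   set mydata.applicants;
   if income > inc_mean + 3*inc_std then income = inc_mean + 3*inc_std;
   else if . < income < inc_mean - 3*inc_std then income = inc_mean - 3*inc_std;  /* keep missings untouched */
run;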
Truncation: Example
[Figure: distribution truncated at μ - 3σ and μ + 3σ]
22
Section 2
Preprocessing Data
for Credit Scoring and
PD Modeling: Dimensionality
Reduction
Improving Regression Selection
[Figure: run time in minutes (0–60) versus number of variables (25–100) for all-subsets selection and stepwise selection; the all-subsets curve grows steeply while the stepwise curve stays nearly flat]
24
Improving Input Selection
Variable Clustering
Categorical Recoding
25
Variable Redundancy
26
First Principal Component
First Eigenvalue=1.94
27
Second Principal Component
28
Variable Clustering Alternative
Candidate inputs: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Selected inputs: X1, X3, X4, X6, X8, X9, X10
Inputs selected by
• cluster representation
• expert opinion
• target correlation
33 ...
Second Principal Component
35
Input/Rotated Component Correlations
[Table in slide: correlations of inputs X1, X2, X3 with the first and second rotated components (RCs)]
36
Cluster Inputs
[Table in slide: inputs X1, X2, X3 grouped by the rotated component (first or second) with which they correlate most strongly]
37
Split New Cluster?
[Figure: a variable with R² = 0.90 with its own cluster's principal component (first cluster PC) and R² = 0.01 with the next closest cluster's principal component (second cluster PC)]

\frac{1 - R^2_{\text{own cluster}}}{1 - R^2_{\text{next closest}}} = \frac{1 - 0.90}{1 - 0.01} = 0.101
39 ...
Implementing Variable
Clustering
40
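A minimal SAS sketch of variable clustering with PROC VARCLUS, assuming hypothetical candidate inputs x1-x10; the 1-R**2 ratio in the output helps pick one representative per cluster:

proc varclus data=mydata.applicants maxeigen=0.7 short;
   var x1-x10;    /* hypothetical candidate inputs */
run;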
Recoding Techniques
Enumeration: 1, 2, 3, …
Dummy coding: 0 0 1 0 0 0
41 ...
Dummy-Coding
Level DA DB DC DD DE DF DG DH DI DJ
A 1 0 0 0 0 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0
E 0 0 0 0 1 0 0 0 0 0
F 0 0 0 0 0 1 0 0 0 0
G 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 1 0 0
I 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 1
42
Quasi-complete separation
– Occurs when some levels of a categorical input contain only events or only non-events, so the corresponding dummy coefficients cannot be estimated reliably
43
Target-Based Enumeration
Level Ni ΣYi pi
A 1562 430 0.28
B 970 432 0.45
C 223 45 0.20
D 111 36 0.32
E 85 23 0.27
F 50 20 0.40
G 23 8 0.35
H 17 5 0.29
I 12 6 0.50
J 5 5 1.00
44
Target-Based Enumeration
Same levels sorted by event rate p_i = ΣY_i / N_i; the enumeration follows this ordering
Level Ni ΣYi pi
J 5 5 1.00
I 12 6 0.50
B 970 432 0.45
F 50 20 0.40
G 23 8 0.35
D 111 36 0.32
H 17 5 0.29
A 1562 430 0.28
E 85 23 0.27
C 223 45 0.20
45
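A small SAS sketch of how the event rates above could be computed, assuming a hypothetical table mydata.applicants with a categorical input level and a 0/1 target y:

proc sql;
   create table work.level_rates as
   select level,
          count(*) as Ni,
          sum(y)   as SumYi,
          mean(y)  as pi       /* event rate per level */
   from mydata.applicants
   group by level
   order by pi desc;           /* same ordering as the table above */
quit;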
Weight of Evidence
46
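The slide's figure is not reproduced here; for reference, the weight of evidence of a category is commonly defined as

\text{WoE}_{\text{category}} = \ln\!\left( \frac{n_{\text{good,category}} / n_{\text{good,total}}}{n_{\text{bad,category}} / n_{\text{bad,total}}} \right)

that is, the log of the distribution of goods over the distribution of bads within that category.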
Level Clustering
47
Level Clustering
48
Level Clustering Dummy-Coding
49
Smoothed Weight of Evidence
50
Smoothed Weight of Evidence
51
Section 3
Classification Techniques
for Credit Scoring and
PD Modeling
The Classification Problem
The classification problem can be stated as follows:
– Given an observation (also called pattern) with
characteristics x=(x1,..,xn), determine its class c
from a predetermined set of classes {c1,...,cm}
The classes {c1,..,cm} are known beforehand
Supervised learning!
Binary (2 classes) versus multiclass (> 2 classes)
classification
53
Regression for classification
The good/bad status is coded as a target Y (G = 1, B = 0) so that a regression can be estimated:

Customer  Age  Income  Gender  G/B  Y
John      30   1200    M       B    0
Sarah     25   800     F       G    1
Sophie    52   2200    F       G    1
David     48   2000    M       B    0
Peter     34   1800    M       G    1

[Figure: S-shaped (sigmoid) curve mapping values between -7 and 7 into the interval (0, 1)]
54
Logistic Regression
• Linear regression combined with a transformation that keeps the output between 0 and 1, so that it can be interpreted as a probability (e.g., the probability of being a good customer):

P(\text{customer good} \mid \text{age}, \text{income}, \text{gender}, \ldots) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{age} + \beta_2 \text{income} + \beta_3 \text{gender} + \ldots)}}

• Or, alternatively:

\ln\!\left( \frac{P(\text{customer good} \mid \text{age}, \text{income}, \text{gender}, \ldots)}{P(\text{customer bad} \mid \text{age}, \text{income}, \text{gender}, \ldots)} \right) = \beta_0 + \beta_1 \text{age} + \beta_2 \text{income} + \beta_3 \text{gender} + \ldots
55
Logistic Regression in SAS
Historical data
Customer Age Income Gender Response
John 30 1200 M No
Sarah 25 800 F Yes
Sophie 52 2200 F Yes
David 48 2000 M No
Peter 34 1800 M Yes
proc logistic data=responsedata;
class Gender;
model Response=Age Income Gender;
run;
P(\text{response} = \text{yes} \mid \text{Age}, \text{Income}, \text{Gender}, \ldots) = \frac{1}{1 + e^{-(0.10 + 0.22\,\text{Age} + 0.050\,\text{Income} + 0.80\,\text{Gender} + \ldots)}}
New data
Customer  Age  Income  Gender  Response score
Emma      28   1000    F       0.44
Will      44   1500    M       0.76
Dan       30   1200    M       0.18
Bob       58   2400    M       0.88
56
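A minimal SAS sketch of scoring new applicants with the fitted model, assuming a hypothetical table work.newcustomers containing the same inputs; the SCORE statement adds the predicted probabilities (e.g., P_Yes):

proc logistic data=responsedata;
   class Gender;
   model Response(event='Yes') = Age Income Gender;
   score data=work.newcustomers out=work.scored;   /* predicted probabilities per customer */
run;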
Logistic Regression
The logistic regression model is formulated as follows:
P(Y = 1 \mid X_1, \ldots, X_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n)}} = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}

P(Y = 0 \mid X_1, \ldots, X_n) = 1 - P(Y = 1 \mid X_1, \ldots, X_n) = \frac{1}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}}

Hence,

0 \le P(Y = 1 \mid X_1, \ldots, X_n),\ P(Y = 0 \mid X_1, \ldots, X_n) \le 1
57 continued...
Logistic Regression
\ln\!\left( \frac{P(Y = 1 \mid X_1, \ldots, X_n)}{P(Y = 0 \mid X_1, \ldots, X_n)} \right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n

\frac{P(Y = 1 \mid X_1, \ldots, X_n)}{1 - P(Y = 1 \mid X_1, \ldots, X_n)} is the odds in favor of Y = 1

\ln\!\left( \frac{P(Y = 1 \mid X_1, \ldots, X_n)}{1 - P(Y = 1 \mid X_1, \ldots, X_n)} \right) is called the logit
58
Logistic Regression
If X_i increases by 1:

\text{logit}(X_i + 1) - \text{logit}(X_i) = \beta_i

\text{odds}(X_i + 1) = \text{odds}(X_i) \cdot e^{\beta_i}

e^{\beta_i} is the odds ratio: the multiplicative increase in the odds when X_i increases by one (other variables remaining constant / ceteris paribus)

\beta_i > 0 \Rightarrow e^{\beta_i} > 1 \Rightarrow odds and probability increase with X_i

\beta_i < 0 \Rightarrow e^{\beta_i} < 1 \Rightarrow odds and probability decrease with X_i
59
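A small worked example, using the illustrative age coefficient 0.22 from the earlier SAS output:

e^{0.22} \approx 1.25

so, other variables remaining constant, each additional year of age multiplies the odds in favor of the event by roughly 1.25.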
Logistic Regression and Weight of Evidence
Coding
Cust ID  Actual Age  Age Group (after classing)  Age WoE (after recoding)
1        20          1: until 22                  -1.1
2        31          2: 22 until 35                0.2
3        49          3: 35+                        0.9
60 continued...
Logistic Regression and Weight of Evidence
Coding
P(Y = 1 \mid X_1, \ldots, X_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_{\text{age}} \cdot \text{age\_woe} + \beta_{\text{purpose}} \cdot \text{purpose\_woe} + \ldots)}}
No dummy variables!
More robust
61
Section 4
Measuring the
Performance of Credit Scoring
Classification Models
How to Measure Performance?
Performance
– How well does the trained model perform in predicting
new unseen (!) instances?
– Decide on performance measure
Classification: Percentage Correctly Classified (PCC),
Sensitivity, Specificity, Area Under ROC curve
(AUROC), ...
Regression: Mean Absolute Deviation (MAD), Mean
Squared Error (MSE), ...
Methods
64
Evaluating classification models
• Train (Estimation) data versus Test (Hold-out) data
• Train data is used to build model (e.g. logistic regression or decision tree)
• Test data is used to measure performance
• Strict separation between training and test set needed!
Train data

Customer  Age  Income  Gender  Response  …  Target
John      30   1200    M       No           0
Sarah     25   800     F       Yes          1
Sophie    52   2200    F       Yes          1
David     48   2000    M       No           0
Peter     34   1800    M       Yes          1

Build model:

proc logistic data=responsedata;
   class Gender;
   model Response=Age Income Gender;
run;

Apply model to the test data:

Customer  Age  Income  Gender  …  Response  Response score
Emma      28   1000    F          No        0.44
Will      44   1500    M          Yes       0.76
Dan       30   1200    M          No        0.18
Bob       58   2400    M          No        0.88
65
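A minimal SAS sketch of creating such a split, assuming a hypothetical table mydata.applicants; PROC SURVEYSELECT flags a random 70% of the observations for training:

proc surveyselect data=mydata.applicants out=work.split
                  samprate=0.7 outall seed=12345;
run;

data work.train work.test;
   set work.split;
   if Selected = 1 then output work.train;   /* Selected flag created by OUTALL */
   else output work.test;
run;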
Performance Measures for Classification
Classification accuracy, confusion matrix, sensitivity,
specificity
Area under the ROC curve
The Lorenz (Power) curve
The Cumulative lift curve
Kolmogorov-Smirnov
66
Classification performance: confusion matrix
Customer  G/B  Score  Predicted (cutoff = 0.50)
John      G    0.72   G
Sophie    B    0.56   G
David     G    0.44   B
Emma      B    0.18   B
Bob       B    0.36   B
Confusion matrix
Actual status
Positive (Good) Negative (Bad)
Predicted Positive (Good) True Positive (John) False Positive (Sophie)
status Negative (Bad) False Negative (David) True Negative (Emma, Bob)
Classification accuracy=(TP+TN)/(TP+FP+FN+TN)=3/5
Classification error = (FP +FN)/(TP+FP+FN+TN)=2/5
Sensitivity=TP/(TP+FN)=1/2
Specificity=TN/(FP+TN)=2/3
[Figure: for every cutoff between 0.01 and 0.99, sensitivity is plotted against (1 - specificity), giving the ROC curves of scorecards A and B]
• Perfect model has sensitivity of 1 and specificity of 1 (i.e. upper left corner)
• Scorecard A is better than B in above figure
• ROC curve can be summarized by the area underneath (AUC); the bigger
the better!
68
The Receiver Operating Characteristic Curve
[Figure: ROC curve with sensitivity on the y-axis and (1 - specificity) on the x-axis]
69 ...
The Area Under the ROC Curve
How to compare intersecting ROC curves?
The area under the ROC curve (AUC)
The AUC provides a simple figure-of-merit for the
performance of the constructed classifier
An intuitive interpretation of the AUC is that it provides
an estimate of the probability that a randomly chosen
instance of class 1 is correctly ranked higher than a
randomly chosen instance of class 0 (Hanley and
McNeil, 1983) (Wilcoxon or Mann-Whitney or U
statistic)
The higher the better
A good classifier should have an AUC larger than 0.5
70
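As a practical note, PROC LOGISTIC reports the AUROC as the c statistic in its association table and can plot the ROC curve; a minimal sketch, reusing the earlier response model:

ods graphics on;
proc logistic data=responsedata plots(only)=roc;
   class Gender;
   model Response(event='Yes') = Age Income Gender;
run;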
Cumulative Accuracy Profile (CAP)
[Figure: cumulative accuracy profile of the current model versus the perfect model and the random model; A is the area between the perfect model and the current model, B is the area between the current model and the random model]

AR = B / (A + B)
73 continued...
The Kolmogorov-Smirnov Distance
[Figure: cumulative score distributions P(s|B) and P(s|G) plotted against the score (y-axis from 0 to 1); the KS distance is the maximum vertical difference between the two curves]
74
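In formula form (a standard definition, not spelled out on the slide):

KS = \max_s \left| P(S \le s \mid G) - P(S \le s \mid B) \right|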
Section 5
Input Selection for
Classification
Input Selection
Inputs=Features=Attributes=Characteristics=Variables.
Also called feature selection, attribute selection,
characteristic selection, variable selection
If n features are present, 2^n - 1 possible non-empty feature subsets can
be considered (already more than a million for n = 20).
Heuristic search methods are needed!
Good feature subsets contain features highly correlated
with (predictive of) the class, yet uncorrelated with (not
predictive of) each other (Hall and Smith 1998).
Can improve the performance and estimation time
of the classifier.
76 continued...
Filter Methods for Input Selection
                    Continuous target (e.g., LGD)   Discrete target (e.g., PD)
Continuous input    Pearson correlation              Fisher score
(e.g., income)      Spearman correlation
                    Hoeffding's D
77
Pearson Correlation
Compute Pearson correlation between each
continuous variable and continuous target
Always varies between -1 and +1
Only keep variables for which |ρP| > 0.50; or keep,
e.g., top 10%
78
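For reference, the Pearson correlation between a continuous variable X and a continuous target Y is

\rho_P = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}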
Chi-Squared-Based Filter
Good Payer Bad Payer Total
Observed
Frequencies Married 500 100 600
Not Married 300 100 400
800 200 1000
80
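A worked continuation (assuming this is what the following slides compute): under independence the expected frequencies are 480, 120, 320, and 80, so

\chi^2 = \frac{(500-480)^2}{480} + \frac{(100-120)^2}{120} + \frac{(300-320)^2}{320} + \frac{(100-80)^2}{80} \approx 10.4

with 1 degree of freedom, pointing to a significant association between marital status and repayment behavior.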
Information Value/Gain
Information value is defined as (see previously):

\text{IV} = \sum_{\text{categories}} \left( p_{\text{good,category}} - p_{\text{bad,category}} \right) \cdot \text{WoE}_{\text{category}}

A predictive variable has IV > 0.10.
Check Linearity
84
Input Selection Procedure
Step 1: Use a filter procedure
– For quick filtering of inputs
– Inputs are selected independent of the
classification algorithm (e.g., logistic regression).
Step 2: Forward/backward/stepwise regression
– Use the p-value of the logistic regression for input
selection.
85
Forward/Backward/Stepwise Regression
Use the p-value to decide upon the importance
of the inputs:
– p-value < 0.01: highly significant
– 0.01 < p-value < 0.05: significant
– 0.05 < p-value < 0.10: weakly significant
– 0.1 < p-value: not significant
Can be used in different ways:
– Forward
– Backward
– Stepwise
86
Search Strategies
Forward selection
– Starts from empty model and always adds
variables based on low p-values
Backward elimination
– Starts from full model and always deletes variables
based on high p-values
Stepwise
– Starts as forward selection, but at each step also checks whether
variables added earlier can be removed again
87
Example: Search Space for Four Inputs
{}
{I1} {I2} {I3} {I4}
{I1, I2} {I1, I3} {I2, I3} {I1, I4} {I2, I4} {I3, I4}
{I1, I2, I3} {I1, I2, I4} {I1, I3, I4} {I2, I3, I4}
{I1, I2, I3, I4}
90 continued...
Stepwise Logistic Regression
SELECTION=STEPWISE   This is similar to the SELECTION=FORWARD option except that
                     effects already in the model do not necessarily remain. Effects are
                     entered into and removed from the model in such a way that each
                     forward selection step might be followed by one or more backward
                     elimination steps. The stepwise selection process terminates if no
                     further effect can be added to the model or if the effect just entered
                     into the model is the only effect removed in the subsequent backward
                     elimination.

SELECTION=SCORE      PROC LOGISTIC uses the branch and bound algorithm of Furnival
                     and Wilson (1974).
91
Stepwise Logistic Regression: Example
proc logistic data=mydata.applicants;
class checking history purpose savings
employed marital coapp resident
property other housing;
model good_bad= amount duration age
installp checking history purpose
savings employed marital coapp
resident property other housing
/selection=stepwise slentry=0.10
slstay=0.01;
run;
92
Thanks!
93