
Predictive versus Descriptive Analytics

Predictive
• Uses one or more independent variables to predict the value of a dependent variable
• Scenario: whether a customer will default is a dependent variable that can be predicted from factors such as the customer's gender, age, income, occupation, economic situation, and historical credit status
• Main algorithms: decision trees, linear regression, logistic regression, support vector machines, neural networks, discriminant analysis, …

Descriptive
• Analyzes a data set with many attributes to find underlying patterns; there is no dependent variable
• Scenario: measure the similarity between individuals, for example customer segmentation based on age, gender, income, and similar factors; or discover correlations between products from customers' purchases of multiple products
• Main algorithms: clustering, association analysis, factor analysis, principal component analysis, social network analysis, …

(The margin of the original slide also lists the technique families: base, statistical analysis, time series, operations research optimization.)

22
Data Mining Methodology: SEMMA

3
Section 1
Preprocessing Data
for Credit Scoring and
PD Modeling: Univariate Analysis
Motivation
 Dirty, noisy data
– For example, Age = -2003
 Inconsistent data
– Value '0' means actual zero or missing value
 Incomplete data
– Income=?
 Data integration and data merging problems
– Amounts in euro versus amounts in dollar
 Duplicate data
– Salary versus professional Income

5
Preprocessing Data for Credit Scoring
 Types of variables
 Sampling
 Visual data exploration
 Missing values
 Outlier detection and treatment
 Transforming data
 Recoding categorical variables

6
Types of Variables
 Continuous
– Defined on a continuous interval
– For example, income, amount on savings account
– In SAS: interval
 Discrete
– Binary
 For example, gender
– Nominal
 No ordering between values
 For example, purpose of loan, marital status
– Ordinal
 Implicit ordering between values
 For example, credit rating (AAA is better than AA,
AA is better than A, …)
7
Sampling
 Take sample of past applicants to build scoring model
 Think carefully about the population on which the model that is going
to be built using the sample will be used
 Timing of sample
– How far do I go back to get my sample?
– Trade-off: amount of data versus recency of data
 Number of bads versus number of goods
– Undersampling, oversampling might be needed (dependent on
classification algorithm, see later)
 Sample taken must be from a normal business period to get as accurate
a picture as possible of the target population
 Make sure performance window is long enough to stabilize bad rate
(for example, 18 months)
 Example sampling problems
– Application scoring: reject inference
– Behavioral scoring: seasonality depending upon the choice of the
observation point
8
Visual Data Exploration

9
Missing Values
 Reasons
– Non-applicable (e.g., default date not known for
non-defaulters)
– Not disclosed (e.g., income)
– Error when merging data (e.g., typos in name and/or ID)
 Keep
 The fact that a variable is missing can be important
information.
 Add an additional category for the missing values.
 Add an additional missing value indicator variable
(either one per variable, or one for the entire
observation).

10 continued...
Missing Values
 Delete
– When too many missing values, removing the variable
or observation might be an option.
– Horizontally versus vertically missing values
 Replace
– Estimate missing value using imputation procedures.
– Be consistent when treating missing values during
model development and during model usage!

11
Deleting Missing Values
ID  Age  Income  Marital Status  Credit Bureau Score  Class
1   34   1800    ?               620                  Bad
2   28   1200    Single          ?                    Good
3   22   1000    Single          ?                    Good
4   60   2200    Widowed         700                  Bad
5   58   2000    Married         ?                    Good
6   44   ?       ?               ?                    Good
7   22   1200    Single          ?                    Good
8   26   1500    Married         350                  Good
9   34   ?       Single          ?                    Bad
10  50   2100    Divorced        ?                    Good

12
Imputation Procedures for Missing Values
 For continuous attributes
– Replace with median/mean (median more robust to outliers)
– If missing values only occur during model development, can
also replace with median/mean of all instances of the same
class
 For ordinal/nominal attributes
– Replace with modal value (= most frequent category)
– If missing values only occur during model development,
replace with modal value of all instances of the same class
 Regression or tree-based imputation
– Predict missing value using other variables
– Cannot use target class as predictor if missing values can
occur during model usage
– More complicated and often do not substantially improve
the performance of the scorecard!
13
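The keep/delete/replace options above can be illustrated with a small Python sketch of median/mode imputation plus a missing-value indicator (the toy records are hypothetical, loosely echoing the table on the previous slide):

```python
import statistics

# Toy data with missing values encoded as None (hypothetical records).
incomes = [1800, 1200, 1000, 2200, 2000, None, 1200, 1500, None, 2100]
marital = ["Single", "Single", "Widowed", None, "Married", "Divorced"]

# Continuous attribute: replace missing values with the median
# (more robust to outliers than the mean).
observed = [x for x in incomes if x is not None]
median_income = statistics.median(observed)
imputed_incomes = [x if x is not None else median_income for x in incomes]

# Nominal attribute: replace missing values with the modal (most frequent) value.
mode_marital = statistics.mode([x for x in marital if x is not None])
imputed_marital = [x if x is not None else mode_marital for x in marital]

# Keep option: a missing-value indicator per variable, since the fact
# that a value was missing can itself be predictive.
income_missing = [int(x is None) for x in incomes]
```

Whatever treatment is chosen, the same treatment must be applied at model usage time, as the slide stresses.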
Outliers
 Extreme or unusual observations
– E.g., due to recording, data entry errors or noise
 Types of outliers
– Valid observation: salary of boss, ratio variables
– Invalid observation: age = -2003
 Outliers can be hidden in one-dimensional views
of the data (multidimensional nature of data)
 Univariate outliers versus multivariate outliers
 Detection versus treatment

14
Multivariate Outliers
Multivariate outliers!

15
Univariate Outlier Detection Methods:
Histograms
[Histogram of Age with frequency on the vertical axis: most observations fall between 20 and 70, while the 0-5 and 150-200 bins reveal outliers]
16
Univariate Outlier Detection Methods:
z-score
 The z-score measures how many standard deviations an observation lies away from the mean for a specific variable:

   z_i = (x_i - μ) / σ

where μ is the mean of the variable x and σ its standard deviation.

ID  Age  z-score
1   30   (30-40)/10 = -1
2   50   (50-40)/10 = +1
3   10   (10-40)/10 = -3
4   40   (40-40)/10 = 0
5   60   (60-40)/10 = +2
6   80   (80-40)/10 = +4
    μ=40  μ=0
    σ=10  σ=1

 Outliers are defined as observations with |z_i| > 3 (or 2.5).
 The z-scores have mean 0 and standard deviation 1.
 Calculate the z-score in SAS using PROC STANDARD.
17
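Outside SAS, the same computation is a one-liner; this Python sketch reproduces the table above (μ = 40 and σ = 10 are the values given on the slide, not estimated from the six ages):

```python
# z-score: number of standard deviations an observation lies from the mean.
mu, sigma = 40, 10  # values given on the slide

ages = [30, 50, 10, 40, 60, 80]
z_scores = [(age - mu) / sigma for age in ages]
print(z_scores)  # [-1.0, 1.0, -3.0, 0.0, 2.0, 4.0]

# Flag outliers with the |z| > 3 rule from the slide.
outliers = [age for age, z in zip(ages, z_scores) if abs(z) > 3]
print(outliers)  # [80]
```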
Computing the z-scores in SAS
data credit;
input age income;
datalines;
34 1300
24 1000
20 2000
40 2100
54 1700
39 2500
23 2200
34 700
56 1500
;

proc standard data=credit mean=0 std=1
              out=creditstand;
run;
18
Univariate Outlier Detection Methods: Box Plot

 A box plot is a visual representation of five numbers:
– Median M: P(X≤M) = 0.50
– First quartile Q1: P(X≤Q1) = 0.25
– Third quartile Q3: P(X≤Q3) = 0.75
– Minimum
– Maximum
 The interquartile range is IQR = Q3 - Q1; observations more than 1.5*IQR below Q1 or above Q3 are flagged as outliers.

[Box plot diagram: Min, Q1, M, Q3, whiskers extending 1.5*IQR, points beyond marked as outliers]
19
Multivariate Outlier Detection Methods
 Mahalanobis distance

   D² = (x_i - μ)ᵀ Σ⁻¹ (x_i - μ)

– μ is the vector of means and Σ is the covariance matrix
– Calculate the distance for every point x_i and sort
 Clustering methods
– Look for elements outside clusters
 Regression methods
– Fit a regression line and look for points with large errors
– Residual plots
 Practical advice: focus only on the univariate outliers!

20
Outlier Treatment
 For invalid outliers:
– For example, age = 300 years
– Treat as missing value (keep, delete, replace)
 For valid outliers: truncation/winsorizing/capping
– Truncation based on z-scores:
 Replace all variable values having z-scores of
> 3 with the mean + 3 times the standard deviation
 Replace all variable values having z-scores of
< -3 with the mean - 3 times the standard deviation
– Truncation based on the IQR (more robust than z-scores)
 Truncate to M ± 3s, with M = median and s = IQR/(2 × 0.6745)
 See Van Gestel, Baesens et al. 2007
– Truncation using a sigmoid
 Use a sigmoid transform, f(x) = 1/(1+e^(-x))

21
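The z-score and IQR truncation rules above can be sketched in Python as follows (μ, σ, M, Q1, and Q3 would normally be estimated on the training data; the values below are illustrative):

```python
def truncate_z(x, mu, sigma, k=3):
    """Cap values whose z-score exceeds k in absolute value."""
    lower, upper = mu - k * sigma, mu + k * sigma
    return max(lower, min(upper, x))

def truncate_iqr(x, median, q1, q3, k=3):
    """IQR-based truncation: cap at M +/- k*s with s = IQR / (2 * 0.6745)."""
    s = (q3 - q1) / (2 * 0.6745)
    return max(median - k * s, min(median + k * s, x))

# z-score capping with mean 40 and std 10 -> valid range [10, 70]
print(truncate_z(300, 40, 10))    # 70
print(truncate_z(-2003, 40, 10))  # 10 (the invalid age example gets capped)
print(truncate_z(35, 40, 10))     # 35 (unchanged)
```

The IQR variant is more robust because the median and quartiles, unlike the mean and standard deviation, are barely affected by the outliers being treated.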
Truncation: Example

μ-3σ μ μ+3σ

22
Section 2
Preprocessing Data
for Credit Scoring and
PD Modeling: Dimensionality Reduction
Improving Regression Selection
[Plot: minutes of computation (0-60) versus number of variables (25-100); all-subsets selection grows steeply, while stepwise selection stays nearly flat]
24
Improving Input Selection

Variable Clustering

Categorical Recoding

25
Variable Redundancy

26
First Principal Component

First Eigenvalue=1.94

27
Second Principal Component

First Eigenvalue=1.94 Second Eigenvalue=1.02

28
Variable Clustering Alternative
X1
X2
X3
X4
X5
X6
X7
X8
X9
X10

29 ...
Variable Clustering Alternative
X1
X2
X3
X4 Inputs selected by
X5 • cluster representation
X6 • expert opinion
X7 • target correlation
X8
X9
X10

31 ...
Variable Clustering Alternative
X1 X1
X2
X3
X4 X4 Inputs selected by
X5 • cluster representation
X6 X6 • expert opinion
X7 • target correlation
X8 X8
X9 X9
X10 X10

32 ...
Variable Clustering Alternative
X1 X1
X2
X3 X3
X4 X4 Inputs selected by
X5 • cluster representation
X6 X6 • expert opinion
X7 • target correlation
X8 X8
X9 X9
X10 X10

33 ...
Second Principal Component

First Eigenvalue=1.94 Second Eigenvalue=1.02


34 ...
Rotated Components

35
Input/Rotated Component Correlations
X1 X2 X3

First
RC

Second
RC

36
Cluster Inputs
X1 X2 X3

First
RC

Second
RC

37
Split New Cluster?

First Eigenvalue=1.95 Second Eigenvalue=0.05


38 ...
Selection by 1 – R² Ratio

For input X2, with R² = 0.90 against its own cluster's principal component (first cluster PC) and R² = 0.01 against the next closest cluster's PC (second cluster PC):

1 – R² ratio = (1 – R²_own cluster) / (1 – R²_next closest)
             = (1 – 0.90) / (1 – 0.01) = 0.101
39 ...
Implementing Variable
Clustering

This demonstration illustrates using the


Variable Clustering node to group variables
according to their similarity.

40
Recoding Techniques

1,2,3,… Enumeration

001000 Dummy-coding

x=w(Σyi) Target-based transform

41 ...
Dummy-Coding

Level DA DB DC DD DE DF DG DH DI DJ
A 1 0 0 0 0 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0
E 0 0 0 0 1 0 0 0 0 0
F 0 0 0 0 0 1 0 0 0 0
G 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 1 0 0
I 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 1

42
Quasi-complete separation

43
Target-Based Enumeration

Level Ni ΣYi pi
A 1562 430 0.28
B 970 432 0.45
C 223 45 0.20
D 111 36 0.32
E 85 23 0.27
F 50 20 0.40
G 23 8 0.35
H 17 5 0.29
I 12 6 0.50
J 5 5 1.00

44
Target-Based Enumeration

Level Ni ΣYi pi
J 5 5 1.00
I 12 6 0.50
B 970 432 0.45
F 50 20 0.40
G 23 8 0.35
D 111 36 0.32
H 17 5 0.29
A 1562 430 0.28
E 85 23 0.27
C 223 45 0.20

45
Weight of Evidence

Level Ni ΣYi pi log(pi/(1-pi))


J 5 5 1.00 .
I 12 6 0.50 0.00
B 970 432 0.45 -0.10
F 50 20 0.40 -0.18
G 23 8 0.35 -0.27
D 111 36 0.32 -0.32
H 17 5 0.29 -0.38
A 1562 430 0.28 -0.42
E 85 23 0.27 -0.43
C 223 45 0.20 -0.60

46
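The pi and log-odds columns of the table can be reproduced in a few lines of Python; note that the slide's values match the base-10 logarithm, not the natural one:

```python
import math

# (level, N_i, sum_Y_i) taken from the table above
levels = [("J", 5, 5), ("I", 12, 6), ("B", 970, 432), ("F", 50, 20),
          ("G", 23, 8), ("D", 111, 36), ("H", 17, 5), ("A", 1562, 430),
          ("E", 85, 23), ("C", 223, 45)]

woe = {}
for level, n, y in levels:
    p = y / n
    # log-odds is undefined when p = 1 (level J, shown as '.' on the slide);
    # the slides use the base-10 logarithm
    logodds = round(math.log10(p / (1 - p)), 2) if p < 1 else None
    woe[level] = (round(p, 2), logodds)

print(woe["B"])  # (0.45, -0.1)
print(woe["C"])  # (0.2, -0.6)
```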
Level Clustering

Level Ni ΣYi pi log(pi/(1-pi))


J 5 5 1.00 .
I 12 6 0.50 0.00
B 970 432 0.45 -0.10
F 50 20 0.40 -0.18
G 23 8 0.35 -0.27
D 111 36 0.32 -0.32
H 17 5 0.29 -0.38
A 1562 430 0.28 -0.42
E 85 23 0.27 -0.43
C 223 45 0.20 -0.60

47
Level Clustering

Level Ni ΣYi pi log(pi/(1-pi))


CL1 1037 463 0.45 -0.09

CL2 134 44 0.33 -0.31

CL3 1664 458 0.28 -0.42

CL4 223 45 0.20 -0.60

48
Level Clustering Dummy-Coding

Level Ni ΣYi pi log(pi/(1-pi))


CL1 1037 463 0.45 -0.09

CL2 134 44 0.33 -0.31

CL3 1664 458 0.28 -0.42

CL4 223 45 0.20 -0.60

49
Smoothed Weight of Evidence

Level Ni ΣYi pi log(pi/(1-pi))


J 5 5 1.00 .
I 12 6 0.50 0.00
B 970 432 0.45 -0.10
F 50 20 0.40 -0.18
G 23 8 0.35 -0.27
D 111 36 0.32 -0.32
H 17 5 0.29 -0.38
A 1562 430 0.28 -0.42
E 85 23 0.27 -0.43
C 223 45 0.20 -0.60

50
Smoothed Weight of Evidence

Level Ni ΣYi pi log(pi/(1-pi))


J 5 +24 5 +8 0.45 -0.09
I 12 +24 6 +8 0.39 -0.19
B 970 +24 432 +8 0.44 -0.10
F 50 +24 20 +8 0.38 -0.22
G 23 +24 8 +8 0.34 -0.29
D 111 +24 36 +8 0.33 -0.32
H 17 +24 5 +8 0.32 -0.33
A 1562 +24 430 +8 0.28 -0.42
E 85 +24 23 +8 0.28 -0.40
C 223 +24 45 +8 0.21 -0.56

51
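A Python sketch of the smoothing above: the pseudo-counts 24 and 8 act as a prior sample whose event rate (8/24 ≈ 0.33) is roughly the overall rate (1010/3058 ≈ 0.33), so sparse levels such as J are shrunk toward the population average instead of producing degenerate log-odds:

```python
import math

# (level, N_i, sum_Y_i) for a few levels of the table
levels = [("J", 5, 5), ("B", 970, 432), ("C", 223, 45)]
N_PRIOR, Y_PRIOR = 24, 8  # pseudo-counts used on the slide

smoothed = {}
for level, n, y in levels:
    p = (y + Y_PRIOR) / (n + N_PRIOR)  # shrink toward the overall rate
    smoothed[level] = (round(p, 2), round(math.log10(p / (1 - p)), 2))

print(smoothed["J"])  # (0.45, -0.09): no longer the degenerate p = 1 of the raw table
```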
Section 3
Classification Techniques
for Credit Scoring and
PD Modeling
The Classification Problem
 The classification problem can be stated as follows:
– Given an observation (also called pattern) with
characteristics x=(x1,..,xn), determine its class c
from a predetermined set of classes {c1,...,cm}
 The classes {c1,..,cm} are known on beforehand
 Supervised learning!
 Binary (2 classes) versus multiclass (> 2 classes)
classification

53
Regression for classification
Customer  Age  Income  Gender  G/B  Y
John      30   1200    M       B    0
Sarah     25   800     F       G    1
Sophie    52   2200    F       G    1
David     48   2000    M       B    0
Peter     34   1800    M       G    1

(The two side-by-side tables on the original slide show the G/B label being recoded into a numeric target Y: G → 1, B → 0.)

Linear regression gives: Y = β0 + β1·Age + β2·Income + β3·Gender

Can be estimated using OLS (in SAS, use proc reg or proc glm).
Two problems:
• No guarantee that Y is between 0 and 1 (i.e., a probability)
• Target/errors are not normally distributed

Using a bounding function to limit the outcome between 0 and 1:

   f(z) = 1 / (1 + e^(-z))
[Plot of the sigmoid f(z) for z from -7 to 7, rising from 0 to 1]

54
Logistic Regression
• Linear regression with a transformation such that the output is always between 0 and 1 can thus be interpreted as a probability (e.g., the probability of a good customer):

   P(customer = good | age, income, gender, ...) = 1 / (1 + e^-(β0 + β1·age + β2·income + β3·gender + ...))

• Or, alternatively:

   ln( P(customer = good | age, income, gender, ...) / P(customer = bad | age, income, gender) ) = β0 + β1·age + β2·income + β3·gender + ...

• The parameters can be estimated in SAS using proc logistic


• Once the model has been estimated using historical data, we
can use it to score or assign probabilities to new data

55
Logistic Regression in SAS
Historical data
Customer Age Income Gender Response
John 30 1200 M No
Sarah 25 800 F Yes
Sophie 52 2200 F Yes
David 48 2000 M No
Peter 34 1800 M Yes
proc logistic data=responsedata;
class Gender;
model Response=Age Income Gender;
run;

P(response = yes | Age, Income, Gender, ...) = 1 / (1 + e^-(0.10 + 0.22·Age + 0.050·Income + 0.80·Gender + ...))

New data
Customer  Age  Income  Gender  Response score
Emma      28   1000    F       0.44
Will      44   1500    M       0.76
Dan       30   1200    M       0.18
Bob       58   2400    M       0.88
56
Logistic Regression
 The logistic regression model is formulated as follows:

   P(Y=1 | X1,...,Xn) = 1 / (1 + e^-(β0 + β1X1 + ... + βnXn)) = e^(β0 + β1X1 + ... + βnXn) / (1 + e^(β0 + β1X1 + ... + βnXn))

   P(Y=0 | X1,...,Xn) = 1 - P(Y=1 | X1,...,Xn) = 1 / (1 + e^(β0 + β1X1 + ... + βnXn))

 Hence,

   0 ≤ P(Y=1 | X1,...,Xn), P(Y=0 | X1,...,Xn) ≤ 1

 Model reformulation:

   P(Y=1 | X1,...,Xn) / P(Y=0 | X1,...,Xn) = e^(β0 + β1X1 + ... + βnXn)
57 continued...
Logistic Regression
P(Y  1 | X 1 ,..., X n )
ln ( )   0  1 X 1  ...   n X n
P(Y  0 | X 1 ,..., X n )
P(Y  1 | X 1 ,..., X n )
is the odds in favor of Y=1
1  P(Y  1 | X 1 ,..., X n )

P(Y  1 | X 1 ,..., X n )
ln ( ) is called the logit
1  P(Y  1 | X 1 ,..., X n )

58
Logistic Regression
 If Xi increases by 1:

   logit | Xi+1 = logit | Xi + βi
   odds | Xi+1 = odds | Xi · e^βi

   e^βi is the odds ratio: the multiplicative increase in the odds when Xi increases by one (other variables remaining constant/ceteris paribus)
 βi > 0 ⟹ e^βi > 1 ⟹ odds and probability increase with Xi
 βi < 0 ⟹ e^βi < 1 ⟹ odds and probability decrease with Xi
59
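The odds-ratio interpretation can be verified numerically; the coefficient and logit values below are illustrative assumptions, not estimates from any data set:

```python
import math

def prob(logit):
    """Logistic (sigmoid) transform of the logit."""
    return 1 / (1 + math.exp(-logit))

beta_i = 0.7         # hypothetical coefficient for X_i
logit_before = -1.0  # hypothetical logit at X_i = x
logit_after = logit_before + beta_i  # X_i increases by 1

odds_before = prob(logit_before) / (1 - prob(logit_before))
odds_after = prob(logit_after) / (1 - prob(logit_after))

# The odds are multiplied by exactly exp(beta_i), the odds ratio.
print(round(odds_after / odds_before, 4), round(math.exp(beta_i), 4))
```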
Logistic Regression and Weight of Evidence
Coding
Cust ID  Actual Age  Age Group (after classing)  Age WoE (after recoding)
1        20          1: until 22                 -1.1
2        31          2: 22 until 35              0.2
3        49          3: 35+                      0.9
60 continued...
Logistic Regression and Weight of Evidence
Coding

   P(Y=1 | X1,...,Xn) = 1 / (1 + e^-(β0 + β_age·age_woe + β_purpose·purpose_woe + ...))

 No dummy variables!
 More robust

61
Section 4
Measuring the
Performance of Credit Scoring
Classification Models
How to Measure Performance?
Performance
– How well does the trained model perform in predicting
new unseen (!) instances?
– Decide on performance measure
 Classification: Percentage Correctly Classified (PCC),
Sensitivity, Specificity, Area Under ROC curve
(AUROC), ...
 Regression: Mean Absolute Deviation (MAD), Mean
Squared Error (MSE), ...
 Methods

– Split sample method


– Single sample method
– N-fold cross-validation
63
Split Sample Method
 For large data sets
– Large = more than 1,000 observations, with more than 50 bad customers
 Set aside a test set (typically one-third of the data)
which is NOT used during training!
 Estimate performance of trained classifier on test set
 Note: for decision trees, validation set is part of
training set
 Stratification
– Same class distribution in training set and test set

64
Evaluating classification models
• Train (Estimation) data versus Test (Hold-out) data
• Train data is used to build model (e.g. logistic regression or decision tree)
• Test data is used to measure performance
• Strict separation between training and test set needed!

Train data:
Customer  Age  Income  Gender  Response  Target
John      30   1200    M       No        0
Sarah     25   800     F       Yes       1
Sophie    52   2200    F       Yes       1
David     48   2000    M       No        0
Peter     34   1800    M       Yes       1

Build the model on the train data:
proc logistic data=responsedata;
  class Gender;
  model Response=Age Income Gender;
run;

Apply the model to the test data:
Customer  Age  Income  Gender  Response  Response score
Emma      28   1000    F       No        0.44
Will      44   1500    M       Yes       0.76
Dan       30   1200    M       No        0.18
Bob       58   2400    M       No        0.88

65
Performance Measures for Classification
 Classification accuracy, confusion matrix, sensitivity,
specificity
 Area under the ROC curve
 The Lorenz (Power) curve
 The Cumulative lift curve
 Kolmogorov Smirnov

66
Classification performance: confusion matrix
Scores and predicted labels (cut-off = 0.50):
Customer  G/B  Score  Predicted
John      G    0.72   G
Sophie    B    0.56   G
David     G    0.44   B
Emma      B    0.18   B
Bob       B    0.36   B

Confusion matrix (actual status in columns):
                     Positive (Good)         Negative (Bad)
Predicted Positive   True Positive (John)    False Positive (Sophie)
Predicted Negative   False Negative (David)  True Negative (Emma, Bob)

Classification accuracy = (TP+TN)/(TP+FP+FN+TN) = 3/5
Classification error = (FP+FN)/(TP+FP+FN+TN) = 2/5
Sensitivity = TP/(TP+FN) = 1/2
Specificity = TN/(FP+TN) = 2/3

All these measures depend upon the cut off!


67
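The four measures can be computed directly from the five-customer example above (cut-off 0.50, Good as the positive class):

```python
# (actual, score) for the five customers on the slide
data = [("G", 0.72), ("B", 0.56), ("G", 0.44), ("B", 0.18), ("B", 0.36)]
cutoff = 0.50

tp = sum(1 for actual, s in data if actual == "G" and s >= cutoff)  # John
fp = sum(1 for actual, s in data if actual == "B" and s >= cutoff)  # Sophie
fn = sum(1 for actual, s in data if actual == "G" and s < cutoff)   # David
tn = sum(1 for actual, s in data if actual == "B" and s < cutoff)   # Emma, Bob

accuracy = (tp + tn) / len(data)  # 3/5
sensitivity = tp / (tp + fn)      # 1/2
specificity = tn / (fp + tn)      # 2/3
print(accuracy, sensitivity, specificity)
```

Changing `cutoff` changes all four numbers, which is exactly why the slide warns that these measures depend on the cut-off.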
Classification Performance: ROC analysis
• Make a table with sensitivity and specificity for each possible cut-off
• The ROC curve plots sensitivity versus 1 - specificity for each possible cut-off

Cut-off  Sensitivity  Specificity  1 - Specificity
0        1            0            1
0.01     ...          ...          ...
0.02     ...          ...          ...
...      ...          ...          ...
0.99     ...          ...          ...
1        0            1            0

[ROC curve: sensitivity against 1 - specificity for Scorecard A, a random model, and Scorecard B]

• Perfect model has sensitivity of 1 and specificity of 1 (i.e. upper left corner)
• Scorecard A is better than B in above figure
• ROC curve can be summarized by the area underneath (AUC); the bigger
the better!

68
The Receiver Operating Characteristic Curve

[ROC curve plotting sensitivity (vertical axis) against 1 - specificity (horizontal axis) for Scorecard A, a random model, and Scorecard B]

69 ...
The Area Under the ROC Curve
 How to compare intersecting ROC curves?
 The area under the ROC curve (AUC)
 The AUC provides a simple figure-of-merit for the
performance of the constructed classifier
 An intuitive interpretation of the AUC is that it provides
an estimate of the probability that a randomly chosen
instance of class 1 is correctly ranked higher than a
randomly chosen instance of class 0 (Hanley and
McNeil, 1983) (Wilcoxon or Mann-Whitney or U
statistic)
 The higher the better
 A good classifier should have an AUC larger than 0.5

70
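This ranking interpretation translates directly into code: compare every (good, bad) score pair and count how often the good is ranked higher, with ties counting one half. A Python sketch with hypothetical scores:

```python
def auc(goods, bads):
    """AUC as the probability that a randomly chosen good (class 1)
    is ranked above a randomly chosen bad (class 0); ties count 0.5."""
    wins = sum(1.0 if g > b else 0.5 if g == b else 0.0
               for g in goods for b in bads)
    return wins / (len(goods) * len(bads))

goods = [0.72, 0.44]       # scores of actual goods (hypothetical)
bads = [0.56, 0.18, 0.36]  # scores of actual bads (hypothetical)
print(auc(goods, bads))    # 5/6: five of the six (good, bad) pairs are ranked correctly

ar = 2 * auc(goods, bads) - 1  # accuracy ratio (Gini coefficient), AR = 2*AUC - 1
```

This pairwise count is the Mann-Whitney U statistic normalized by the number of pairs, which is why the slide cites Wilcoxon/Mann-Whitney.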
Cumulative Accuracy Profile (CAP)

• Sort the population from high model score to low model score


• Measure the (cumulative) percentage of bads for each score decile
• E.g. top 30% most likely bads according to model captures 65% of true
bads
• Often summarised as top-decile lift, or how many bads in top 10%
71
Accuracy Ratio
[CAP figure: A = area between the perfect model's curve and the current model's curve; B = area between the current model's curve and the random model's diagonal]

AR = B/(A+B)

 The accuracy ratio (AR) is defined as:
(Area below power curve for current model - Area below power curve for random model) /
(Area below power curve for perfect model - Area below power curve for random model)
 Perfect model has an AR of 1
 Random model has an AR of 0
 AR is sometimes also called the "Gini coefficient"
 AR = 2*AUC - 1
72
The Kolmogorov-Smirnov (KS) Distance
 Separation measure
 Measures the distance between the cumulative score
distributions P(s|B) and P(s|G)
 KS = max_s |P(s|G) - P(s|B)|, where:
 P(s|G) = Σ_{x≤s} p(x|G) (equals 1 - sensitivity)
 P(s|B) = Σ_{x≤s} p(x|B) (equals the specificity)
 The KS distance is the maximum vertical distance between both curves
 The KS distance can also be measured on the ROC graph
– Maximum vertical distance between the ROC curve and the diagonal

73 continued...
The Kolmogorov-Smirnov Distance
[Figure: cumulative score distributions P(s|B) and P(s|G) plotted against the score; the KS distance is the maximum vertical gap between them]

74
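The KS distance can be computed by sweeping all candidate cut-offs and taking the largest gap between the two cumulative score distributions; a Python sketch with hypothetical scores:

```python
def ks_distance(goods, bads):
    """Maximum distance between the cumulative score
    distributions P(s|G) and P(s|B)."""
    thresholds = sorted(set(goods) | set(bads))
    best = 0.0
    for s in thresholds:
        p_g = sum(1 for x in goods if x <= s) / len(goods)  # P(s|G)
        p_b = sum(1 for x in bads if x <= s) / len(bads)    # P(s|B)
        best = max(best, abs(p_g - p_b))
    return best

goods = [0.72, 0.61, 0.44, 0.80]  # hypothetical scores of goods
bads = [0.56, 0.18, 0.36, 0.41]   # hypothetical scores of bads
print(ks_distance(goods, bads))   # 0.75, reached just below the lowest good score
```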
Section 5
Input Selection for
Classification
Input Selection
 Inputs=Features=Attributes=Characteristics=Variables.
 Also called feature selection, attribute selection,
characteristic selection, variable selection
 If n features are present, 2^n - 1 possible feature sets can be considered.
 Heuristic search methods are needed!
 Good feature subsets contain features highly correlated
with (predictive of) the class, yet uncorrelated with (not
predictive of) each other (Hall and Smith 1998).
 Can improve the performance and estimation time
of the classifier.

76 continued...
Filter Methods for Input Selection
                        Continuous target (e.g., LGD)   Discrete target (e.g., PD)
Continuous input        Pearson correlation             Fisher score
(e.g., income)          Spearman correlation
                        Hoeffding's D

Categorical input       Fisher score                    Chi-squared analysis
(e.g., marital status)  ANOVA analysis                  Cramer's V
                                                        Information value
                                                        Gain/entropy
77
Pearson Correlation
 Compute Pearson correlation between each
continuous variable and continuous target
 Always varies between -1 and +1
 Only keep variables for which |ρP| > 0.50; or keep,
e.g., top 10%

78
Chi-Squared-Based Filter
Good Payer Bad Payer Total
Observed
Frequencies Married 500 100 600
Not Married 300 100 400
800 200 1000

Under the independence assumption,


P(married and good payer) = P(married).P(good payer) = 0.6 * 0.8.
The expected number of good payers that are married is 0.6*0.8*1000 = 480.

Good Bad Payer Total


Payer
Independence
Frequencies Married 480 120 600
Not Married 320 80 400
800 200 1000
79 continued...
Chi-Squared Based Filter
 2500  480² 100  120² 300  320² 100  80²
    10.41
480 120 320 80

 χ² is chi-squared distributed with k-1 degrees of freedom, with k being the number of classes of the characteristic.
 The bigger (lower) the value of χ², the more (less) predictive the
attribute is.
 Apply as filter: rank all attributes with respect to their p-value and
select the most predictive ones.
 Note that

   Cramer's V = sqrt(χ²/n)

is always bounded between 0 and 1; higher values indicate a more predictive input; a threshold of 0.10 can be used.

80
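The χ² and Cramer's V computation for the marital-status example can be reproduced as follows (for a 2×2 table, k - 1 = 1, so Cramer's V reduces to √(χ²/n)):

```python
import math

# Observed frequencies: rows = married / not married, cols = good / bad payer
observed = [[500, 100], [300, 100]]
n = sum(sum(row) for row in observed)              # 1000

row_totals = [sum(row) for row in observed]        # [600, 400]
col_totals = [sum(col) for col in zip(*observed)]  # [800, 200]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence, e.g., 600 * 800 / 1000 = 480
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

cramers_v = math.sqrt(chi2 / n)
print(round(chi2, 2), round(cramers_v, 2))  # 10.42 0.1 (the slide rounds chi2 to 10.41)
```

With Cramer's V just above the 0.10 threshold, marital status would be kept as a (weakly) predictive input in this example.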
Information Value/Gain
Information value is defined as (see previously):
   IV = Σ_categories (p_good,category - p_bad,category) × WoE_category
A predictive variable has IV > 0.10.

Gain measures the decrease in impurity (based on, e.g.,


entropy or gini) when splitting on the variable (see
decision trees).

Cramer’s V, Information Value, and Gain typically give


similar results in terms of input importance.

Both Cramer’s V and Information Value are readily


available in SAS Enterprise Miner
81
Example: Filters (Applicants Data Set)

Van Gestel, Baesens 2008


82
Fisher Score
 Defined as

   Fisher score = |x̄_G - x̄_B| / sqrt(s²_G + s²_B)
 A higher value indicates a more predictive input.
 It can also be used if the target is continuous and input
is discrete.
 It essentially generalizes to an ANOVA test in the case
of multiple categories.
 Example for Applicants data set:

Van Gestel, Baesens 2008


83
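The Fisher score can be sketched as follows; the income values per class are hypothetical, not taken from the Applicants data set:

```python
import statistics

def fisher_score(goods, bads):
    """|mean_G - mean_B| / sqrt(var_G + var_B): higher = more predictive."""
    num = abs(statistics.mean(goods) - statistics.mean(bads))
    den = (statistics.variance(goods) + statistics.variance(bads)) ** 0.5
    return num / den

# Hypothetical income values per class: well separated means relative
# to the within-class spread give a high score.
income_goods = [1800, 2200, 2000, 2100, 1500]
income_bads = [1200, 800, 1000, 1300]
print(round(fisher_score(income_goods, income_bads), 2))
```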
Empirical Logit Plots

Check Linearity

84
Input Selection Procedure
 Step 1: Use a filter procedure
– For quick filtering of inputs
– Inputs are selected independent of the
classification algorithm (e.g., logistic regression).
 Step 2: Forward/backward/stepwise regression
– Use the p-value of the logistic regression for input
selection.

85
Forward/Backward/Stepwise Regression
 Use the p-value to decide upon the importance
of the inputs:
– p-value < 0.01: highly significant
– 0.01 < p-value < 0.05: significant
– 0.05 < p-value < 0.10: weakly significant
– 0.1 < p-value: not significant
 Can be used in different ways:
– Forward
– Backward
– Stepwise

86
Search Strategies
 Forward selection
– Starts from empty model and always adds
variables based on low p-values
 Backward elimination
– Starts from full model and always deletes variables
based on high p-values
 Stepwise
– Starts as forward selection, but checks whether
added variables cannot be removed later

87
Example: Search Space for Four Inputs
{}

{I1} {I2} {I3} {I4}

{I1, I2} {I1, I3} {I2, I3} {I1, I4} {I2, I4} {I3, I4}

{I1, I2, I3} {I1, I2, I4} {I1, I3, I4} {I2, I3, I4}

{I1, I2, I3, I4}

Note: I1=age; I2=income;


88
I3=marital status; I4=employment status
Forward/Backward/Stepwise Logistic
Regression
SELECTION= PROC LOGISTIC first estimates parameters for
FORWARD, effects forced into the model. These effects are the
intercepts and the first n explanatory effects in the
MODEL statement, where n is the number specified
by the START= or INCLUDE= option in the MODEL
statement (n is zero by default). Next, the
procedure computes the score chi-square statistic
for each effect not in the model and examines the
largest of these statistics. If it is significant at the
SLENTRY= level, the corresponding effect is added
to the model. After an effect is entered in the model,
it is never removed from the model. The process is
repeated until none of the remaining effects meet
the specified level for entry or until the STOP=
value is reached.
89 continued...
Stepwise Logistic Regression
SELECTION= Parameters for the complete model as specified in
BACKWARD the MODEL statement are estimated unless the
, START= option is specified. In that case, only the
parameters for the intercepts and the first n
explanatory effects in the MODEL statement are
estimated, where n is the number specified by the
START= option. Results of the Wald test for
individual parameters are examined. The least
significant effect that does not meet the SLSTAY=
level for staying in the model is removed. After an
effect is removed from the model, it remains
excluded. The process is repeated until no other
effect in the model meets the specified level for
removal or until the STOP= value is reached.

90 continued...
Stepwise Logistic Regression
SELECTION= This is similar to the SELECTION=FORWARD
STEPWISE option except that effects already in the model do
not necessarily remain. Effects are entered into and
removed from the model in such a way that each
forward selection step might be followed by one or
more backward elimination steps. The stepwise
selection process terminates if no further effect can
be added to the model or if the effect just entered
into the model is the only effect removed in the
subsequent backward elimination.
SELECTION= PROC LOGISTIC uses the branch and bound
SCORE algorithm of Furnival and Wilson (1974).

91
Stepwise Logistic Regression: Example
proc logistic data=mydata.applicants;
class checking history purpose savings
employed marital coapp resident
property other housing;
model good_bad= amount duration age
installp checking history purpose
savings employed marital coapp
resident property other housing
/selection=stepwise slentry=0.10
slstay=0.01;
run;

92
Thanks!

93
