Logistic Regression
Course Notes
Predictive Modeling Using Logistic Regression Course Notes was developed by Mike Patetta. Additional
contributions were made by Dan Kelly. Editing and production support was provided by the Curriculum
Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
Predictive Modeling Using Logistic Regression Course Notes
Copyright © 2007 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
Book code E70291, course code LWPMLR, prepared date 07Feb2007. LWPMLR_001
ISBN 978-1-59994-335-0
Table of Contents
Prerequisites
1.1 Introduction
Course Description
This course covers predictive modeling using SAS/STAT software with emphasis on the LOGISTIC
procedure. This course also discusses selecting variables, assessing models, treating missing values, and
using efficiency techniques for massive data sets.
To learn more…
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department at
1-800-727-3228 or send e-mail to [email protected]. Customers outside the
USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for a
complete list of books and a convenient order form.
Prerequisites
Before attending this course, you should
• have experience executing SAS programs and creating SAS data sets, which you can gain from the SAS Programming I: Essentials course
• have experience building statistical models using SAS software
• have completed a statistics course that covers linear regression and logistic regression, such as the Statistics I: Introduction to ANOVA, Regression, and Logistic Regression course.
Chapter 1 Predictive Modeling
1.1 Introduction
Supervised Classification
The data used to develop a predictive model consists of a set of cases (observations, examples).
Associated with each case is a vector of input variables (predictors, explanatory variables, features) and a
target variable (outcome, response). A predictive model maps the vector of input variables to the target.
The target is the outcome to be predicted. The cases are the units on which the prediction is made.
In supervised classification (Hand 1997), the target is a class label. A predictive model assigns, to each
case, a score (or a set of scores) that measures the propensity that the case belongs to a particular class.
With two classes, the target is binary and usually represents the occurrence of an event.
The term supervised is used when the class label is known for each case. If the label is known, then why
build a prediction model?
Generalization
The prediction model is used on new cases where the values of the input variables are known, but the
class labels are unknown. The principal aim of predictive modeling is generalization. Generalization
means the ability to predict the outcome on novel cases.
In contrast, the principal aim of traditional statistical analysis is inference. Confidence intervals,
hypothesis tests, and p-values are the common inferential tools. Similar methods used by predictive
modelers (such as logistic regression) may be used to infer how input variables affect the target. The
validity of the inference relies on understanding the statistical properties of methods and applying them
correctly.
Understanding the relationships between random variables can be important in predictive modeling as
well. However, many of the methods used are ad hoc with poorly understood statistical properties.
Consequently, the discovery of structure in predictive modeling is informal and exploratory. Some
predictive modeling methods (for example, neural networks) are inscrutable yet successful because they
generalize well. The validity of predictive modeling methods is assessed empirically. If a model
generalizes well, then the method is useful, regardless of its statistical properties.
Applications
Target Marketing
Attrition Prediction
Credit Scoring
Fraud Detection
There are many business applications of predictive modeling. Database marketing uses customer
databases to improve sales promotions and product loyalty. In target marketing, the cases are customers,
the inputs are attributes such as previous purchase history and demographics, and the target is often a
binary variable indicating a response to a past promotion. The aim is to find segments of customers that
are likely to respond to some offer so they can be targeted. Historic customer databases can also be used
to predict who is likely to switch brands or cancel services (churn). Loyalty promotions can then be
targeted at new cases that are at risk.
Credit scoring (Hand and Henley 1997) is used to decide whether to extend credit to applicants. The cases
are past applicants. Most input variables come from the credit application or credit reports. A relevant
binary target is whether the case defaulted (charged-off) or paid-off the debt. The aim is to reduce defaults
and serious delinquencies on new applicants for credit.
In fraud detection, the cases are transactions (for example, telephone calls, credit card purchases) or
insurance claims. The inputs are the particulars and circumstances of the transaction. The binary target is
whether that case was fraudulent. The aim is to anticipate fraud or abuse on new transactions or claims so
that they can be investigated or impeded.
Supervised classification also has less business-oriented uses. Image classification has applications in
areas such as astronomy, nuclear medicine, and molecular genetics (McLachlan 1992; Ripley 1996; Hand
1997).
The develop data set is a retail-banking example. The data set has 32,264 cases (banking customers)
and 47 input variables. The binary target variable Ins indicates whether the customer has an insurance
product (variable annuity). The 47 input variables represent other product usage and demographics prior
to their acquiring the insurance product. Two of the inputs are nominally scaled; the others are interval or
binary.
The DATA step reads in the original data and creates a data set in the work library. This prevents writing
over the original data.
libname pmlr 'SAS−data−library';
data develop;
set pmlr.develop;
run;
The %LET statement enables you to define a macro variable and assign it a value. The statement below
creates a macro variable named inputs and assigns it a string that contains all the numeric input
variable names. This reduces the amount of text you need to enter in other programs.
%let inputs=acctage dda ddabal dep depamt cashbk checks
dirdep nsf nsfamt phone teller atm atmamt pos posamt
cd cdbal ira irabal loc locbal inv invbal ils ilsbal
mm mmbal mmcred mtg mtgbal sav savbal cc ccbal
ccpurc sdb income hmown lores hmval age crscore
moved inarea;
The MEANS procedure generates descriptive statistics for the numeric variables. The statistics requested
below are the number of observations, the number of missing values, the mean, the minimum value, and
the maximum value. The macro variable, inputs, is referenced in the VAR statement by prefixing an
ampersand (&) to the macro variable name. The FREQ procedure examines the values of the target
variable and the nominal input variables.
proc means data=develop n nmiss mean min max;
var &inputs;
run;
proc freq data=develop;
tables ins branch res;
run;
Variable  Label                       N    N Miss          Mean        Minimum        Maximum
----------------------------------------------------------------------------------------------
AcctAge Age of Oldest Account 30194 2070 5.9086772 0.3000000 61.5000000
DDA Checking Account 32264 0 0.8156459 0 1.0000000
DDABal Checking Balance 32264 0 2170.02 -774.8300000 278093.83
Dep Checking Deposits 32264 0 2.1346082 0 28.0000000
DepAmt Amount Deposited 32264 0 2232.76 0 484893.67
CashBk Number Cash Back 32264 0 0.0159621 0 4.0000000
Checks Number of Checks 32264 0 4.2599182 0 49.0000000
DirDep Direct Deposit 32264 0 0.2955616 0 1.0000000
NSF Number Insufficient Fund 32264 0 0.0870630 0 1.0000000
NSFAmt Amount NSF 32264 0 2.2905464 0 666.8500000
Phone Number Telephone Banking 28131 4133 0.4056024 0 30.0000000
Teller Teller Visits 32264 0 1.3652678 0 27.0000000
Sav Saving Account 32264 0 0.4668981 0 1.0000000
SavBal Saving Balance 32264 0 3170.60 0 700026.94
ATM ATM 32264 0 0.6099368 0 1.0000000
ATMAmt ATM Withdrawal Amount 32264 0 1235.41 0 427731.26
POS Number Point of Sale 28131 4133 1.0756816 0 54.0000000
POSAmt Amount Point of Sale 28131 4133 48.9261782 0 3293.49
CD Certificate of Deposit 32264 0 0.1258368 0 1.0000000
CDBal CD Balance 32264 0 2530.71 0 1053900.00
IRA Retirement Account 32264 0 0.0532792 0 1.0000000
IRABal IRA Balance 32264 0 617.5704550 0 596497.60
LOC Line of Credit 32264 0 0.0633833 0 1.0000000
LOCBal Line of Credit Balance 32264 0 1175.22 -613.0000000 523147.24
Inv Investment 28131 4133 0.0296826 0 1.0000000
InvBal Investment Balance 28131 4133 1599.17 -2214.92 8323796.02
ILS Installment Loan 32264 0 0.0495909 0 1.0000000
ILSBal Loan Balance 32264 0 517.5692344 0 29162.79
MM Money Market 32264 0 0.1148959 0 1.0000000
MMBal Money Market Balance 32264 0 1875.76 0 120801.11
MMCred Money Market Credits 32264 0 0.0563786 0 5.0000000
MTG Mortgage 32264 0 0.0493429 0 1.0000000
MTGBal Mortgage Balance 32264 0 8081.74 0 10887573.28
CC Credit Card 28131 4133 0.4830969 0 1.0000000
CCBal Credit Card Balance 28131 4133 9586.55 -2060.51 10641354.78
CCPurc Credit Card Purchases 28131 4133 0.1541716 0 5.0000000
SDB Safety Deposit Box 32264 0 0.1086660 0 1.0000000
Income Income 26482 5782 40.5889283 0 233.0000000
HMOwn Owns Home 26731 5533 0.5418802 0 1.0000000
LORes Length of Residence 26482 5782 7.0056642 0.5000000 19.5000000
HMVal Home Value 26482 5782 110.9121290 67.0000000 754.0000000
Age Age 25907 6357 47.9283205 16.0000000 94.0000000
CRScore Credit Score 31557 707 666.4935197 509.0000000 820.0000000
Moved Recent Address Change 32264 0 0.0296305 0 1.0000000
InArea Local Address 32264 0 0.9602963 0 1.0000000
----------------------------------------------------------------------------------------------
The results of PROC MEANS show that several variables have missing values and several variables are
(probably) binary.
Ins    Frequency    Percent    Cumulative Frequency    Cumulative Percent
--------------------------------------------------------------------------
0 21089 65.36 21089 65.36
1 11175 34.64 32264 100.00
Branch of Bank
Branch    Frequency    Percent    Cumulative Frequency    Cumulative Percent
-----------------------------------------------------------------------------
B1 2819 8.74 2819 8.74
B10 273 0.85 3092 9.58
B11 247 0.77 3339 10.35
B12 549 1.70 3888 12.05
B13 535 1.66 4423 13.71
B14 1072 3.32 5495 17.03
B15 2235 6.93 7730 23.96
B16 1534 4.75 9264 28.71
B17 850 2.63 10114 31.35
B18 541 1.68 10655 33.02
B19 285 0.88 10940 33.91
B2 5345 16.57 16285 50.47
B3 2844 8.81 19129 59.29
B4 5633 17.46 24762 76.75
B5 2752 8.53 27514 85.28
B6 1438 4.46 28952 89.73
B7 1413 4.38 30365 94.11
B8 1341 4.16 31706 98.27
B9 558 1.73 32264 100.00
Area Classification
Res    Frequency    Percent    Cumulative Frequency    Cumulative Percent
--------------------------------------------------------------------------
R 8077 25.03 8077 25.03
S 11506 35.66 19583 60.70
U 12681 39.30 32264 100.00
The results of PROC FREQ show that 34.6 percent of the observations have acquired the insurance
product. There are 19 levels of Branch and 3 levels under Res (R=rural, S=suburb, U=urban).
1.2 Analytical Challenges
Opportunistic Data
• Operational / Observational
• Massive
• Missing Values
The data that is typically used to develop predictive models can be characterized as opportunistic. It was
collected for operational purposes unrelated to statistical analysis (Huber 1997). Such data is usually
massive, dynamic, and dirty. Preparing data for predictive modeling can be agonizing. For example, the
parts of the data that are relevant to the analysis need to be acquired. Furthermore, the relevant inputs
usually need to be created from the raw operational fields. Many statistical methods do not scale well to
massive data sets because most methods were developed for small data sets generated from designed
experiments.
(Slide: examples of input measurement scales: ordinal grades F, D, C, B, A; counts 0, 1, 2, 3, …; binary M, F.)
When there are a large number of input variables, there are usually a variety of measurement scales
represented. The input variable may be intervally scaled (amounts), binary, nominally scaled (names),
ordinally scaled (grades), or counts. Nominal input variables with a large number of levels (such as ZIP
code) are commonplace and present complications for regression analysis.
High Dimensionality
(Slide: scatter plots of the inputs X1, X2, and X3 illustrating high dimensionality.)
The dimension refers to the number of input variables (actually input degrees of freedom). Predictive
modelers consider large numbers (hundreds) of input variables. The number of variables often has a
greater effect on computational performance than the number of cases. High dimensionality limits the
ability to explore and model the relationships among the variables. This is known as the curse of
dimensionality, which was distilled by Breiman et al. (1984) as
The complexity of a data set increases rapidly with increasing dimensionality.
The remedy is dimension reduction; ignore irrelevant and redundant dimensions without inadvertently
ignoring important ones.
In predictive modeling, the event of interest is often rare. Usually more data leads to better models, but
having an ever-larger number of nonevent cases has rapidly diminishing return and can even have
detrimental effects. With rare events, the effective sample size for building a reliable prediction model is
closer to 3 times the number of event cases than to the nominal size of the data set (Harrell 1997). A
seemingly massive data set might have the predictive potential of one that is much smaller.
One widespread strategy for predicting rare events is to build a model on a sample that disproportionally
over-represents the event cases (for example, an equal number of events and nonevents). Such an analysis
introduces biases that need to be corrected so that the results are applicable to the population.
(Slide: surfaces of E(y) over x1 and x2 illustrating linear versus nonlinear and additive versus nonadditive effects.)
Predictive modeling is a multivariate problem. Each important dimension might affect the target in
complicated ways. Moreover, the effect of each input variable might depend on the values of other input
variables. The curse of dimensionality makes this difficult to untangle. Many classical modeling methods
(including standard logistic regression) were developed for inputs with effects that have a constant rate of
change and do not depend on any other inputs.
Model Selection
(Slide: fitted curves illustrating underfitting, overfitting, and a just-right fit.)
Predictive modeling typically involves choosing from among a set of models. These might be different types of models, or different complexities of models of the same type. A common pitfall is to overfit the data, that is, to use too complex a model. An overly complex model might be too sensitive to peculiarities in the sample data set and not generalize well to new data. However, using too simple a model can lead to underfitting, where true features are disregarded.
Chapter 2 Fitting the Model
2.1 The Model
Functional Form
The standard logistic regression model assumes that the logit of the posterior probability is a linear combination of the input variables:

$$\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$

The data set consists of $i = 1, 2, \ldots, n$ cases. Each case belongs to one of two classes, and a binary indicator variable $y_i$ represents the class label for each case. The posterior probability of the event is

$$p_i = E(y_i \mid \mathbf{x}_i) = \Pr(y_i = 1 \mid \mathbf{x}_i)$$

The parameters $\beta_0, \ldots, \beta_k$ are unknown constants that must be estimated from the data.
$$\operatorname{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = \eta \quad\Longleftrightarrow\quad p_i = \frac{1}{1 + e^{-\eta}}$$

(Slide: the sigmoidal curve of $p_i$ against $\eta$, approaching 0 as $\eta$ becomes smaller and 1 as $\eta$ becomes larger.)
A linear combination can take any value. Probability must be between zero and one. The logit
transformation (which is the log of the odds) is a device for constraining the posterior probability to
be between zero and one. The logit function transforms the probability scale to the real line (–∞, +∞).
Therefore, modeling the logit with a linear combination gives estimated probabilities that are constrained
to be between zero and one.
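As a quick numerical check of this transformation, a logit of 0 corresponds to a probability of exactly one half, and logits of ±2 already lie near the extremes:

$$\frac{1}{1 + e^{-0}} = .5, \qquad \frac{1}{1 + e^{-2}} \approx .88, \qquad \frac{1}{1 + e^{2}} \approx .12$$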
Logistic regression is a special case of the generalized linear model
$$g(E(y \mid \mathbf{x})) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
where the expected value of the target is linked to the linear predictor by the function g(•). The link
function depends on the scale of the target.
(Slide: a linear surface on the logit(p) scale over x1 and x2, and the corresponding sigmoidal surface on the probability scale.)
The graph of a linear combination on the logit scale is a (hyper)plane. On the probability scale it becomes
a sigmoidal surface. Different parameter values give different surfaces with different slopes and different
orientations. The nonlinearity is solely due to the constrained scale of the target. The nonlinearity only
appears when the fitted values are close to the limits of their sensible range (>.8 or <.2).
Interpretation
(Slide: a unit change in x2 produces a β2 change in the logit and a 100(exp(β2) − 1)% change in the odds.)
A linear-additive model is particularly easy to interpret: each input variable affects the logit linearly. The
coefficients are the slopes. Exponentiating each parameter estimate gives the odds ratios, which compares
the odds of the event in one group to the odds of the event in another group. For example, the odds ratio
for a binary input variable (X) would compare the odds of the event when X=1 to the odds of the event
when X =0. The odds ratio represents the multiplicative effect of each input variable. Moreover, the effect
of each input variable does not depend on the values of the other inputs (additivity).
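As a worked illustration using the DDA coefficient fitted later in this chapter (estimate of approximately −0.97), exponentiation gives the odds ratio and the percentage change in the odds:

$$e^{-0.97} \approx .379, \qquad 100\,(e^{-0.97} - 1)\% \approx -62\%$$

That is, having a checking account multiplies the odds of the event by about .379, a reduction of roughly 62%, holding the other inputs fixed.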
However, this simple interpretation depends on the model being correctly specified. In predictive
modeling, you should not presume that the true posterior probability has such a simple form. Think of the
model as an approximating (hyper)plane. Consequently, you can determine the extent that the inputs are
important to the approximating plane.
Logistic Discrimination
(Slide: the fitted probability surface p over x1 and x2 with a cutoff; cases above the cutoff are allocated to class 1 and cases below to class 0.)
In supervised classification, the ultimate use of logistic regression is to allocate cases to classes. This
is more correctly termed logistic discrimination (McLachlan 1989). An allocation rule is merely an
assignment of a cutoff probability, where cases above the cutoff are allocated to class 1 and cases below
the cutoff are allocated to class 0. The standard logistic discrimination model separates the classes by a
linear surface ((hyper)plane). The decision boundary is always linear. Determining the best cutoff is a
fundamental concern in logistic discrimination.
Log-likelihood
(Slide: the log-likelihood surface as a function of the parameter estimates $\hat{\beta}_1$ and $\hat{\beta}_2$.)
The method of maximum likelihood (ML) is usually used to estimate the unknown parameters in the
logistic regression model. The likelihood function is the joint probability density function of the data
treated as a function of the parameters. The maximum likelihood estimates are the values of
the parameters that maximize the probability of obtaining the sample data.
If you assume that the yi independently have Bernoulli distributions with probability pi (which is a
function of the parameters), then the log of the likelihood is given by
$$\ln L = \sum_{i=1}^{n} \bigl[\, y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i) \,\bigr] = \sum_{i=1}^{n_1} \ln(p_i) + \sum_{i=1}^{n_0} \ln(1 - p_i)$$
where n0 and n1 are the numbers of class 0 and class 1, respectively. The form of the log-likelihood shows
the intuitively appealing result that the ML estimates are chosen so that pi is large when yi=1 and small
when yi=0.
ML estimation seeks the combination of parameter values that maximizes the likelihood (or log-likelihood).
There is, in general, no closed-form analytical solution for the ML estimates as there is for
linear regression on a normally distributed response. They must be determined using an iterative
optimization algorithm. Consequently, logistic regression is considerably more computationally expensive
than linear regression.
Software for ML estimation of the logistic model is commonplace. Many SAS procedures can be used;
most notable are the LOGISTIC, GENMOD, CATMOD, and DMREG procedures (SAS Enterprise
Miner).
The LOGISTIC procedure fits a binary logistic regression model. The seven input variables included
in the model were selected arbitrarily. The DES (short for descending) option is used to reverse the
sorting order for the levels of the response variable Ins. The CLASS statement names the classification
variables to be used in the analysis. The CLASS statement must precede the MODEL statement. The
PARAM option in the CLASS statement specifies the parameterization method for the classification
variable or variables and the REF option specifies the reference level. In this example, the
parameterization method is reference cell coding and the reference level is S.
The STB option displays the standardized estimates for the parameters for the continuous input variables.
The UNITS statement enables you to obtain an odds ratio estimate for a specified change in an input
variable. In this example, the UNITS statement enables you to estimate the change in odds for a 1000-unit
change in DDABal and DepAmt.
proc logistic data=develop des;
class res (param=ref ref='S');
model ins = dda ddabal dep depamt
cashbk checks res
/ stb;
units ddabal=1000 depamt=1000;
run;
Model Information
Response Profile
Ordered Total
Value Ins Frequency
1 1 11175
2 0 21089
The results consist of a number of tables. The Response Profile table shows the target variable
values listed according to their ordered values. By default, the target-variable values are ordered
alphanumerically and PROC LOGISTIC always models the probability of ordered value 1. The
DES option reverses the order so that PROC LOGISTIC models the probability that Ins=1.
Another way to force the LOGISTIC procedure to model the probability of a particular level
of the target variable is with EVENT= syntax in the MODEL statement.
Class Level Information

Class    Value    Design Variables
Res R 1 0
S 0 0
U 0 1
The Class Level Information table shows the Res variable was dummy coded into two design variables
using reference cell coding and the level S as the reference level.
Model Fit Statistics

Criterion    Intercept Only    Intercept and Covariates
The Model Fit Statistics table contains the Akaike information criteria (AIC) and the Schwarz criterion
(SC). These are goodness-of-fit measures you can use to compare one model to another.
Testing Global Null Hypothesis: BETA=0
The likelihood ratio, Wald, and Score tests all test the null hypothesis that all regression coefficients of the
model other than the intercept are 0.
Type 3 Analysis of Effects

Effect    DF    Wald Chi-Square    Pr > ChiSq
The Type 3 Analysis of Effects table shows which input variables are significant controlling for all of the
other input variables in the model.
The parameter estimates measure the rate of change in the logit (log odds) corresponding to a one-unit
change in input variable, adjusted for the effects of the other inputs. The parameter estimates are difficult
to compare because they depend on the units in which the variables are measured. The standardized
estimates convert them to standard deviation units. The absolute value of the standardized estimates can
be used to give an approximate ranking of the relative importance of the input variables on the fitted
logistic model. The variable Res has no standardized estimate because it is a class variable.
Odds Ratio Estimates
The odds ratio measures the effect of the input variable on the target adjusted for the effect of the other
input variables. For example, the odds of acquiring an insurance product for DDA (checking account)
customers are .379 times the odds for non-DDA customers. Equivalently, the odds of acquiring an insurance product are 1/.379, or 2.64, times greater for non-DDA customers than for DDA customers. By default, PROC LOGISTIC reports the 95% Wald confidence interval.
Association of Predicted Probabilities and Observed Responses
The Association of Predicted Probabilities and Observed Responses table lists several measures that
assess the predictive ability of the model. For all pairs of observations with different values of the target
variable, a pair is concordant if the observation with the outcome has a higher predicted outcome
probability (based on the model) than the observation without the outcome. A pair is discordant if the
observation with the outcome has a lower predicted outcome probability than the observation without
the outcome.
The four rank correlation indexes (Somers' D, Gamma, Tau-a, and c) are computed from the numbers of
concordant and discordant pairs of observations. In general, a model with higher values for these indexes
(the maximum value is 1) has better predictive ability than a model with lower values for these indexes.
Adjusted Odds Ratios
For continuous variables, it may be useful to convert the odds ratio to a percentage increase or decrease in
odds. For example, the odds ratio for a 1000-unit change in DDABal is 1.074. Consequently, the odds of
acquiring the insurance product increases 7.4% (calculated as 100(1.074−1)) for every thousand-dollar
increase in the checking balance, assuming that the other variables do not change.
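This figure follows directly from the DDABal parameter estimate of roughly 0.0000718 per dollar; scaling by the 1000-unit change requested in the UNITS statement gives

$$e^{1000 \times 0.0000718} = e^{0.0718} \approx 1.074$$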
(Slide: scoring a new case x = (1.1, 3.0) with the fitted model gives a predicted probability p̂ = .05.)
The overriding purpose of predictive modeling is to score new cases. Predictions can be made by simply
plugging in the new values of the inputs.
The LOGISTIC procedure can score data sets, without using the OUTPUT statement, beginning
in SAS®9. The SCORE statement in the LOGISTIC procedure applies the model to a new data set.
The DATA= option names the data set to be scored, and the OUT= option names the resulting
scored data set. The predicted probability that Ins is 1 is named P_1.
proc logistic data=develop des;
model ins=dda ddabal dep depamt cashbk checks;
score data = pmlr.new out=scored;
run;
(Partial output of the scored data set: Obs, P_1, DDA, DDABal, Dep, DepAmt, CashBk, Checks.)
The LOGISTIC procedure outputs the final parameter estimates to a data set using the OUTEST= option.
proc logistic data=develop des outest=betas1;
model ins=dda ddabal dep depamt cashbk checks;
run;
The output data set contains one observation and a variable for each parameter estimate. The estimates are
named corresponding to their input variable.
The SCORE procedure multiplies values from two SAS data sets, one containing coefficients (SCORE=)
and the other containing the data to be scored (DATA=). Typically, the data set to be scored would not
have a target variable. The OUT= option specifies the name of the scored data set created by PROC
SCORE. The TYPE=PARMS option is required for scoring regression models.
proc score data=pmlr.new
out=scored
score=betas1
type=parms;
var dda ddabal dep depamt cashbk checks;
run;
The linear combination produced by PROC SCORE (the variable Ins) estimates the logit, not the
posterior probability. The logistic function (inverse of the logit) needs to be applied to compute the
posterior probability.
data scored;
set scored;
p=1/(1+exp(-ins));
run;
(Partial output of the scored data set: Obs, p, Ins, DDA, DDABal, Dep, DepAmt, CashBk, Checks.)
Data can also be scored directly in PROC LOGISTIC using the OUTPUT statement. This has several
disadvantages over using PROC SCORE: it does not scale well with large data sets, it requires a target
variable (or some proxy), and the adjustments for oversampling, discussed in the next section, are not
automatically applied.
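For reference, a minimal sketch of the OUTPUT statement approach follows. It scores the cases in the input data set itself, so new cases would have to be appended to that data set (with a missing target) before the procedure runs; the variable name phat is arbitrary.

proc logistic data=develop des;
   model ins=dda ddabal dep depamt cashbk checks;
   output out=outscored p=phat;   /* phat holds the predicted probability for each case */
run;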
The Output Delivery System and Automatic Score Code Generation (Self-Study)
The Output Delivery System is an easy way to save results that are typically displayed as output into
a data set. The data set betas2 contains the output listed under the heading “Analysis of Maximum
Likelihood Estimates” in the PROC LOGISTIC output.
ods output parameterEstimates = betas2;
proc logistic data=develop des;
model ins=dda ddabal dep depamt cashbk checks;
run;
Obs    Variable      Estimate

  1    Intercept       0.1259
2 DDA -0.9705
3 DDABal 0.000072
4 Dep -0.0715
5 DepAmt 0.000018
6 CashBk -0.5618
7 Checks -0.00400
The code to generate score code from the betas2 data set follows. This code creates a file that contains
the code to create the linear predictor. The macro variable target is used in the name and label of the
predicted probability variable. The code handles categorical inputs as well, even though there are none
in this example. Categorical inputs should be numerically coded using reference cell coding. The code
essentially automates the tedious task of cutting and pasting parameter estimates from logistic regression
results into DATA step code.
%let target=INS;
filename scorecd 'c:\temp\logistic score code.sas';  /* fileref used by the FILE and %INCLUDE statements */

data _null_;
   attrib PREDNAME length=$32
          TARGNAME length=$32
          LastParm length=$32
   ;
   file scorecd;
   set betas2 end=last;
   retain TARGNAME PREDNAME LastParm ' ';
   if (Variable="Intercept") then do;
      TARGNAME=compress("&target");
      PREDNAME="P_"||compress(TARGNAME);
      put "**********************************************;";
      put "*** begin scoring code for Logistic Regression;";
      put "**********************************************;";
      put "length " PREDNAME " 8;";
      put "label " PREDNAME "= 'Predicted: " TARGNAME +(-1) "';";
      put "*** accumulate XBETA;";
      put "XBETA = " Estimate best20. ";";
   end;
   else do;
      /* one accumulation line per remaining (numeric) input */
      put "XBETA = XBETA + (" Estimate best20. ") * " Variable ";";
   end;
   if last then put PREDNAME "= 1/(1+exp(-XBETA));";
run;
The output of the code generator is DATA step code to score a new data set.
**********************************************;
*** begin scoring code for Logistic Regression;
**********************************************;
length P_INS 8;
label P_INS = 'Predicted: INS';
*** accumulate XBETA;
XBETA = 0.12592268473366;
XBETA = XBETA + ( -0.97052008343956) * DDA ;
XBETA = XBETA + ( 0.00007181904404) * DDABal ;
XBETA = XBETA + ( -0.07153101800749) * Dep ;
XBETA = XBETA + ( 0.0000178287498) * DepAmt ;
XBETA = XBETA + ( -0.56175142056463) * CashBk ;
XBETA = XBETA + ( -0.00399923596541) * Checks ;
P_INS = 1/(1+exp(-XBETA));
The %INCLUDE statement brings a SAS programming statement, data lines, or both, into a current
SAS program. Using the SOURCE2 option after a slash treats the code as if it was pasted into the
program editor—the text appears in the log. The %INCLUDE statement can use an explicit file name
(C:\Temp\logistic score code.sas) or a fileref (scorecd).
data scored;
set pmlr.new;
%include "c:\temp\logistic score code.sas" /source2;
run;
The above code generates the same results as the following code:
data scored;
set pmlr.new;
%include scorecd /source2;
run;
The results are equivalent to the earlier two techniques.
proc print data=scored(obs=20);
var p_ins xbeta dda ddabal dep depamt cashbk checks;
run;
In order to use the model, you must create scoring code. This method automatically creates a DATA step
that you could use to score new cases, without running the SCORE procedure or having to cut and paste
parameter estimates into code by hand.
(Partial output: Obs, P_INS, XBETA, DDA, DDABal, Dep, DepAmt, CashBk, Checks.)
2.2 Adjustments for Oversampling
Sampling Designs
(Slide: in joint sampling, (x, y) pairs are drawn from the joint distribution of the inputs and the target; in separate sampling, x values are drawn separately within each target class, y=0 and y=1.)
In joint (mixture) sampling, the input-target pairs are randomly selected from their joint distribution.
In separate sampling, the inputs are randomly selected from their distributions within each target class.
Separate sampling is standard practice in supervised classification. When the target event is rare, it is
common to oversample the rare event, that is, take a disproportionately large number of event cases.
Oversampling rare events is generally believed to lead to better predictions (Scott and Wild 1986).
Separate sampling is also known as
• case-control sampling
• choice-based sampling
• stratified sampling on the target, not necessarily taken with proportional allocation
• biased sampling
• y-conditional sampling
• outcome-dependent sampling
• oversampling.
The priors, π0 and π1, represent the population proportions of class 0 and 1, respectively. The proportions
of the target classes in the sample are denoted ρ0 and ρ1. In separate sampling (nonproportional) π0 ≠ ρ0
and π1 ≠ ρ1. The adjustments for oversampling require the priors be known a priori.
The maximum likelihood estimates were derived under the assumption that yi have independent Bernoulli
distributions. This assumption is appropriate for joint sampling but not for separate sampling. However,
the effects of violating this assumption can be easily corrected. In logistic regression, only the estimate of
the intercept, β0, is affected by using Bernoulli ML on data from a separate sampling design (Prentice and
Pike 1979). If the standard model
$$\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
is appropriate for joint sampling, then ML estimates of the parameters under separate sampling can be
determined by fitting the pseudo model (Scott and Wild 1986, 1997)
$$\operatorname{logit}(p_i^{*}) = \ln\!\left(\frac{\rho_1 \pi_0}{\rho_0 \pi_1}\right) + \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
where p* is the posterior probability corresponding to the biased sample. Consequently, the effect of
oversampling is to shift the logits by a constant amount, the offset

$$\ln\!\left(\frac{\rho_1 \pi_0}{\rho_0 \pi_1}\right)$$
When rare events have been oversampled (π0 > ρ0 and π1 < ρ1), the offset is positive; that is, the logit is too large. This vertical shift of the logit affects the posterior probability in a corresponding fashion.
Offset
(Slide: the fitted logit is shifted vertically by the constant offset $\ln\!\left(\frac{\pi_0 \rho_1}{\pi_1 \rho_0}\right)$.)
The pseudo model can be fitted directly by incorporating the offset into the model. Alternatively, the
offset could be applied after the standard model is fitted. Subtracting the offset from the predicted values
and solving for the posterior probability gives
$$\hat{p}_i = \frac{\hat{p}_i^{*}\, \rho_0 \pi_1}{(1 - \hat{p}_i^{*})\, \rho_1 \pi_0 + \hat{p}_i^{*}\, \rho_0 \pi_1}$$

where $\hat{p}_i^{*}$ is the unadjusted estimate of the posterior probability. Both approaches give identical results.
For both types of adjustments, the population priors, π0 and π1, need to be known a priori while the
sample priors, ρ0 and ρ1, can be estimated from the data.
Because only the intercept is affected, the adjustments may not be necessary. If the goal of the analysis is
to understand the relationships between the inputs and the target, or to rank order the population, then the
adjustment is not critical. If the predicted probabilities are important, and not just necessary for rank
ordering or classification, then the correction for oversampling is necessary.
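A minimal DATA step sketch of this adjustment follows. It assumes the unadjusted posterior probability is stored in a variable named p_star and that the macro variables pi1 and rho1 (defined in the next example) hold π1 and ρ1; the data set names are placeholders.

data adjusted;
   set scored;   /* assumed to contain the unadjusted probability p_star */
   /* corrected posterior probability on the population scale */
   p = (p_star*(1-&rho1)*&pi1) /
       ((1-p_star)*&rho1*(1-&pi1) + p_star*(1-&rho1)*&pi1);
run;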
Separate sampling was used to create the INS data set. The proportion of the target event in the
population was .02, not .346 as appears in the sample. The %LET statement defines the macro variable
pi1 for the population prior for class 1 (Ins=1).
%let pi1=.02; /* supply the prior for class 1 */
The SQL procedure can be used to create macro variables as well. The following code is equivalent to
%let rho1 = 0.346361;.
proc SQL noprint;
select mean(INS) into :rho1 from develop;
quit;
The SCORE statement in the LOGISTIC procedure will correct predicted probabilities back to the
population scale. The option to do this is PRIOREVENT=.
proc logistic data=develop des;
model ins=dda ddabal dep depamt cashbk checks;
score data = pmlr.new out=scored priorevent=&pi1;
run;
(Partial output of the scored data set: Obs, P_1, DDA, DDABal, Dep, DepAmt, CashBk, Checks.)
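The intercept can instead be corrected at fit time by including an offset variable in the model. A minimal sketch follows, assuming the offset variable is named off and the estimates are written to the data set betas2 that the PROC SCORE step below reads.

data develop_off;
   set develop;
   /* offset = ln(rho1*pi0 / (rho0*pi1)) */
   off=log((&rho1*(1-&pi1))/((1-&rho1)*&pi1));
run;

proc logistic data=develop_off des outest=betas2;
   model ins=dda ddabal dep depamt cashbk checks / offset=off;
run;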
The score code follows. Note the change in the intercept term, which was 0.12592268473366 before.
**********************************************;
*** begin scoring code for Logistic Regression;
**********************************************;
length P_INS 8;
label P_INS = 'Predicted: INS';
*** accumulate XBETA;
XBETA = -3.13082398583352;
XBETA = XBETA + ( -0.97052008343956) * DDA ;
XBETA = XBETA + ( 0.00007181904404) * DDABal ;
XBETA = XBETA + ( -0.07153101800749) * Dep ;
XBETA = XBETA + ( 0.0000178287498) * DepAmt ;
XBETA = XBETA + ( -0.56175142056463) * CashBk ;
XBETA = XBETA + ( -0.00399923596541) * Checks ;
P_INS = 1/(1+exp(-XBETA));
Again, you can use the %INCLUDE statement to score data with this code.
data scored;
set pmlr.new;
%include scorecd /source2;
run;
Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Chi-Square    Pr > ChiSq
The list of parameter estimates contains a new entry for the variable off, which has a fixed value of one.
The probabilities computed from this model have been adjusted down because the population probability
is much lower than the sample probability (.02 versus .346).
The SCORE procedure uses the final parameter estimates from the logistic model with the offset variable
to score new data.
proc score data=pmlr.new out=scored score=betas2
type=parms;
var dda ddabal dep depamt cashbk checks;
run;
data scored;
set scored;
p=1/(1+exp(-ins));
run;
(Partial output of the scored data set: Obs, p, Ins, DDA, DDABal, Dep, DepAmt, CashBk, Checks.)
The OFFSET= option is less efficient than fitting an unadjusted model. When the OFFSET= option is
used, PROC LOGISTIC uses CPU time roughly equivalent to two logistic regressions. A more efficient
way of adjusting the posterior probabilities for the offset is to fit the model without the offset and adjust
the fitted posterior probabilities afterwards in a DATA step. The two approaches are statistically
equivalent.
proc logistic data=develop des outest=betas3;
model ins=dda ddabal dep depamt cashbk checks;
run;
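The scored data set read by the following DATA step can be produced with PROC SCORE, as in the earlier examples. A minimal sketch, assuming the betas3 estimates and the same inputs:

proc score data=pmlr.new out=scored score=betas3
           type=parms;
   var dda ddabal dep depamt cashbk checks;
run;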
data scored;
set scored;
off=log(((1-&pi1)*&rho1)/(&pi1*(1-&rho1)));
p=1/(1+exp(-(ins-off)));
run;
(Partial output of the scored data set: Obs, p, Ins, DDA, DDABal, Dep, DepAmt, CashBk, Checks.)
Chapter 3 Preparing the Input Variables
3.1 Missing Values
Missing values of the input variables can arise from several different mechanisms (Little 1992). A value is
missing completely at random (MCAR) if the probability that it is missing is independent of the data.
MCAR is a particularly easy mechanism to manage but is unrealistic in most predictive modeling
applications.
The probability that a value is missing might depend on the unobserved value. Credit applicants with
fewer years at their current job might be less inclined to provide this information.
The probability that a value is missing might depend on observed values of other input variables.
Customers with longer tenures might be less likely to have certain historic transactional data. Missingness
might depend on a combination of values of correlated inputs.
An even more pathological missing-value mechanism occurs when the probability that a value is missing
depends on values of unobserved (lurking) predictors. Transient customers might have missing values on
a number of variables.
A fundamental concern for predictive modeling is that the missingness is related to the target. The more
transient customers may be the best prospects for a new offer.
(Slide: a grid of cases by input variables; complete-case analysis keeps only the rows with no missing values.)
The default method for treating missing values in most SAS modeling procedures (including the
LOGISTIC procedure) is complete-case analysis. In complete-case analysis, only those cases without any
missing values are used in the analysis.
Complete-case analysis has some moderately attractive theoretical properties even when the missingness
depends on observed values of other inputs (Donner 1982; Jones 1996). However, complete-case analysis
has serious practical shortcomings with regards to predictive modeling. Even a smattering of missing
values can cause an enormous loss of data in high dimensions. For instance, suppose each of the k input
variables can be MCAR with probability α; in this situation, the expected proportion of complete cases is
$(1 - \alpha)^k$
Therefore, a 1% probability of missing (α=.01) for 100 inputs would leave only 37% of the data for
analysis, 200 would leave 13%, and 400 would leave 2%. If the missingness was increased to 5%
(α=.05), then <1% of the data would be available with 100 inputs.
Fitted Model:
$$\operatorname{logit}(\hat{p}) = -2.1 + .072\,x_1 - .89\,x_2 - 1.4\,x_3$$
Predicted Value (new case with $x_2$ missing):
$$\operatorname{logit}(\hat{p}) = -2.1 + .144 - .89\,(\,?\,) + .7$$
Another practical consideration of any treatment of missing values is scorability (the practicality of
method when it is deployed). The purpose of predictive modeling is scoring new cases. How would a
model built on the complete cases score a new case if it had a missing value? To decline to score new
incomplete cases would only be practical if there were a very small number of missing values.
Because of the aforementioned drawbacks of complete-case analysis, some type of missing value
imputation is necessary. Imputation means filling in the missing values with some reasonable value. Many
methods have been developed for imputing missing values (Little 1992). The principal consideration for
most methods is getting valid statistical inference on the imputed data, not generalization.
Often, subject-matter knowledge can be used to impute missing data. For example, the missing values
might be miscoded zeros.
Imputation + Indicators
Incomplete Completed Missing
Data Data Indicator
34 34 0
63 63 0
. 30 1
22 22 0
26 26 0
54 54 0
18 18 0
. 30 1
47 47 0
20 20 0
Median = 30
One reasonable strategy for handling missing values in predictive modeling is to do the following steps.
1. Create missing indicators
$$MI_j = \begin{cases} 1 & \text{if } x_j \text{ is missing} \\ 0 & \text{otherwise} \end{cases}$$
and treat them as new input variables in the analysis.
2. Use median imputation. Fill the missing value of xj with the median of the complete cases for that
variable.
3. Create a new level representing missing (unknown) for categorical inputs.
If a very large percentage of values are missing (>50%), then the variable might be better handled by
omitting it from the analysis or by creating the missing indicator only. If a very small percentage of the
values are missing (<1%), then the missing indicator is of little value.
This strategy is somewhat unsophisticated but satisfies two of the most important considerations in
predictive modeling: scorability and the potential relationship of missingness with the target. A new case
is easily scored; first replace the missing values with the medians from the development data and then
apply the prediction model.
There is statistical literature concerning different missing value imputation methods, including
discussions of the demerits of mean and median imputation and missing indicators (Donner 1982; Jones
1997). Unfortunately, most of the advice is based on considerations that are peripheral to predictive
modeling. There is very little advice when the functional form of the model is not assumed to be correct,
when the goal is to get good predictions that can be practically applied to new cases, when p-values and
hypothesis tests are largely irrelevant, and when the missingness may be highly pathological, in other
words, depending on lurking predictors.
The objective of the following program is to create missing value indicator variables and to replace
missing values with the variable median.
proc print data=develop(obs=30);
var ccbal ccpurc income hmown;
run;
Obs        CCBal    CCPurc    Income    HMOwn

  1       483.65         0        16        1
2 0.00 1 4 1
3 0.00 0 30 1
4 65.76 0 125 1
5 0.00 0 25 1
6 38.62 0 19 0
7 85202.99 0 55 1
8 0.00 0 13 0
9 . . 20 0
10 0.00 0 54 0
11 0.00 0 . .
12 0.00 0 25 1
13 . . 102 1
14 . . 24 1
15 0.00 0 8 1
16 0.00 0 100 1
17 323.13 0 13 1
18 . . 17 0
19 . . 8 1
20 0.00 0 7 1
21 0.00 0 . .
22 32366.86 0 . 1
23 0.00 0 9 0
24 . . 45 1
25 . . 36 1
26 1378.46 1 60 1
27 . . 35 1
28 17135.95 0 40 1
29 0.00 0 42 0
30 0.00 0 112 1
Fifteen of the input variables were selected for imputation. Two arrays are created, one called MI, which
contains the missing value indicator variables, and one called X, which contains the input variables. It is
critical that the order of the variables in the array MI matches the order of the variables in array X.
Defining the dimension with an asterisk causes the array elements to be automatically counted. In the DO
loop, the DIM function returns the dimension of the array. Thus, the DO loop will execute 15 times in this
example. The assignment statement inside the DO loop causes the entries of MI to be 1 if the
corresponding entry in X is missing, and 0 otherwise.
data develop1;
set develop;
/* name the missing indicator variables */
array mi{*} MIAcctAg MIPhone MIPOS MIPOSAmt
MIInv MIInvBal MICC MICCBal
MICCPurc MIIncome MIHMOwn MILORes
MIHMVal MIAge MICRScor;
/* select variables with missing values */
array x{*} acctage phone pos posamt
inv invbal cc ccbal
ccpurc income hmown lores
hmval age crscore;
do i=1 to dim(mi);
mi{i}=(x{i}=.);
end;
run;
The STDIZE procedure with the REPONLY option can be used to replace missing values. The
METHOD= option enables you to choose several different location measures such as the mean, median,
and midrange. The output data set created by the OUT= option contains all the variables in the input data
set where the variables listed in the VAR statement are imputed. Only numeric input variables should be
used in PROC STDIZE.
proc stdize data=develop1
reponly
method=median
out=imputed;
var &inputs;
run;
The REPLACE option in PROC STANDARD can be used to replace missing values with the
mean of that variable on the nonmissing cases.
Obs CCBal MICCBal CCPurc MICCPurc Income MIIncome HMOwn MIHMOwn
1 483.65 0 0 0 16 0 1 0
2 0.00 0 1 0 4 0 1 0
3 0.00 0 0 0 30 0 1 0
4 65.76 0 0 0 125 0 1 0
5 0.00 0 0 0 25 0 1 0
6 38.62 0 0 0 19 0 0 0
7 85202.99 0 0 0 55 0 1 0
8 0.00 0 0 0 13 0 0 0
9 0.00 1 0 1 20 0 0 0
10 0.00 0 0 0 54 0 0 0
11 0.00 0 0 0 35 1 1 1
12 0.00 0 0 0 25 0 1 0
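As noted above, PROC STANDARD with the REPLACE option offers a mean-imputation alternative to PROC STDIZE. A minimal sketch, assuming the imputed data set is to be named imputed2:

proc standard data=develop1 replace out=imputed2;
   var &inputs;
run;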
Cluster Imputation
(Slide: a new case with X1 observed and X2 missing; the missing value is imputed from the cluster that is closest in the nonmissing dimension.)
Mean-imputation uses the unconditional mean of the variable. An attractive extension would be to use the
mean conditional on the other inputs. This is referred to as regression imputation. Regression imputation
would usually give better estimates of the missing values. Specifically, k linear regression models could
be built, one for each input variable using the other inputs as predictors. This would presumably give
better imputations and be able to accommodate missingness that depends on the values of the other
inputs. An added complication is that the other inputs may have missing values. Consequently, the k
imputation regressions also need to accommodate missing values.
Cluster-mean imputation is a somewhat more practical alternative:
1. cluster the cases into relatively homogeneous subgroups
2. mean-imputation within each group
3. for new cases with multiple missing values, use the cluster mean that is closest in all the nonmissing
dimensions.
This method can accommodate missingness that depends on the other input variables. This method is
implemented in SAS Enterprise Miner. For large data sets, the FASTCLUS procedure may also be
appropriate.
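A rough PROC FASTCLUS sketch of the idea follows; the number of clusters, the IMPUTE option (which fills missing values from the assigned cluster), and the data set names are assumptions rather than course code.

proc fastclus data=develop1 maxclusters=20 impute
              out=clusimputed;
   var &inputs;   /* inputs may need standardizing first because FASTCLUS uses Euclidean distance */
run;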
A simple but less effective alternative is to define a priori segments (for example, high, middle, low, and
unknown income), and then do mean or median imputation within each segment (see the exercises in
Appendix A).
3.2 Categorical Inputs
Dummy Variables
X DA DB DC DD
D 0 0 0 1
B 0 1 0 0
C 0 0 1 0
C 0 0 1 0
A 1 0 0 0
A 1 0 0 0
D 0 0 0 1
C 0 0 1 0
A 1 0 0 0
(and so on for the remaining cases)
With the CLASS statement, you can use categorical input variables in the LOGISTIC procedure without
having to create dummy variables in a DATA step. You can specify the type of parameterization to use,
such as effect coding and reference coding, and the reference level. The choice of the reference level is
immaterial in predictive modeling because different reference levels give the same predictions.
Smarter Variables
Expanding categorical inputs into dummy variables can greatly increase the dimension of the input space.
A smarter method is to use subject-matter information to create new inputs that represent relevant sources
of variation. A categorical input might be best thought of as a link to other data sets. For example,
geographic areas are often mapped to several relevant demographic variables.
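The join below is purely illustrative; the lookup table pmlr.branch_demogs and its columns are hypothetical names invented for the sketch, not part of the course data.

proc sql;
   create table develop_demog as
   select d.*, b.median_income, b.pct_homeowners
   from develop1 as d
        left join pmlr.branch_demogs as b   /* hypothetical demographic lookup keyed by branch */
        on d.branch = b.branch;
quit;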
Quasi-Complete Separation
X      0     1    DA   DB   DC   DD
A     28     7     1    0    0    0
B     16     0     0    1    0    0
C     94    11     0    0    1    0
D     23    21     0    0    0    1
Including categorical inputs in the model can cause quasi-complete separation. Quasi-complete
separation occurs when a level of the categorical input has a target event rate of 0 or 100%. The
coefficient of a dummy variable represents the difference in the logits between that level and the reference
level. When quasi-complete separation occurs, one of the logits will be infinite. The likelihood will not
have a maximum in at least one dimension, so the ML estimate of that coefficient will be infinite. If the
zero-cell category is the reference level, then all the coefficients for the dummy variables will be infinite.
Quasi-complete separation complicates model interpretation. It can also affect the convergence of the
estimation algorithm. Furthermore, it might lead to incorrect decisions regarding variable selection.
The most common cause of quasi-complete separation in predictive modeling is categorical inputs with
rare categories. The best remedy for sparseness is collapsing levels of the categorical variable.
Clustering Levels
(Slide: the 4×2 table of target counts by level, A: 28/7, B: 16/0, C: 94/11, D: 23/21, is collapsed step by step: B and C merge first (110/11), then A joins them (138/18), and finally the combined cluster merges with D (161/39).)
Ideally, subject-matter considerations should be used to collapse levels (reduce the dimension) of
categorical inputs. This is not always practical in predictive modeling. A simple data-driven method for
collapsing levels of contingency tables was developed by Greenacre (1988, 1993). The levels (rows) are
hierarchically clustered based on the reduction in the chi-squared test of association between the
categorical variable and the target. At each step, the two levels that give the least reduction in the chi-
squared statistic are merged. The process is continued until the reduction in chi-squared drops below
some threshold (for example, 99%). This method will quickly throw rare categories in with other
categories that have similar marginal response rates. While this method is simple and effective, there is a
potential loss of information because only univariate associations are considered.
The levels of a categorical input can be clustered using Greenacre’s method (1988, 1993) in the
CLUSTER procedure. PROC CLUSTER was designed for general clustering applications, but with some
simple pre-processing of the data, it can be made to cluster levels of categorical variables.
The variable Branch has 19 levels. The first step is to create a data set that contains the proportion of the
target event (ins) and number of cases in each level. The NWAY option caused the output data set to
have 19 observations, one for each of the 19 levels. The output data set also has a variable prop for the
proportion of target events. It automatically creates the variable _FREQ_, which counts the number of
cases in each level.
proc means data=imputed noprint nway;
class branch;
var ins;
output out=level mean=prop;
run;
Obs    Branch    _TYPE_    _FREQ_       prop

  1    B1             1      2819      0.36999
2 B10 1 273 0.40293
3 B11 1 247 0.35628
4 B12 1 549 0.36430
5 B13 1 535 0.37196
6 B14 1 1072 0.19123
7 B15 1 2235 0.24251
8 B16 1 1534 0.27771
9 B17 1 850 0.37059
10 B18 1 541 0.35675
11 B19 1 285 0.38596
12 B2 1 5345 0.32460
13 B3 1 2844 0.38186
14 B4 1 5633 0.37493
15 B5 1 2752 0.38118
16 B6 1 1438 0.37830
17 B7 1 1413 0.34678
18 B8 1 1341 0.38553
19 B9 1 558 0.37814
Using the FREQ statement and METHOD=WARD in PROC CLUSTER gives identical results to
Greenacre’s method. The OUTTREE= option creates an output data set that can be used by the TREE
procedure to draw a tree diagram. The ID statement specifies a variable that identifies observations in the
printed cluster history and in the OUTTREE= data set. The ODS TRACE statement with the LISTING
option will show which output objects are associated with each table in the PROC CLUSTER output.
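The PROC CLUSTER step itself might look like the following sketch. The data set and variable names come from the PROC MEANS step above and the PROC TREE step later in this section, but the exact options are assumptions.

ods trace on / listing;

proc cluster data=level method=ward outtree=fortree;
   freq _freq_;    /* number of cases in each level */
   var prop;       /* proportion of target events in each level */
   id branch;
run;

ods trace off;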
Output Added:
-------------
Name: EigenvalueTable
Label: Eigenvalues of the Covariance Matrix
Template: stat.cluster.EigenvalueTable
Path: Cluster.EigenvalueTable
-------------
Output Added:
-------------
Name: ClusterHistory
Label: Cluster History
Template: stat.cluster.ClusterHistory
Path: Cluster.ClusterHistory
-------------
Cluster History

NCL    --Clusters Joined---    FREQ    SPRSQ    RSQ    Tie
The column labeled RSQ in the output is equivalent to the proportion of chi-squared in the 19×2
contingency table remaining after the levels are collapsed. At each step, the levels that give the smallest
decrease in chi-squared are merged. The change in chi-squared is listed in the SPRSQ column. The rows
in the summary represent the results after the listed clusters were merged. The number of clusters is
reduced from 18 to 1. When previously collapsed levels are merged, they are denoted using the CL as the
prefix and the number of resulting clusters as the suffix. For example, at the sixth step, CL15 represents
B1 and B17 that were merged at the fourth step creating 15 clusters.
To calculate the optimum number of clusters, the chi-square statistic and the associated p-value need to be computed for each collapsed contingency table. This information can be obtained by multiplying the chi-square statistic from the 19×2 contingency table by the proportion of chi-squared remaining after the levels are collapsed. Therefore, the next program converts the output object CLUSTERHISTORY to
an output data set. The FREQ procedure is used to compute the chi-square statistic for the 19×2
contingency table.
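A sketch of those two steps follows, using names consistent with the DATA step below (cluster, chi, and _PCHI_); the details are assumptions.

ods output clusterhistory=cluster;

proc cluster data=level method=ward outtree=fortree;
   freq _freq_;
   var prop;
   id branch;
run;

proc freq data=imputed noprint;
   tables branch*ins / chisq;
   output out=chi pchi;   /* writes the Pearson chi-square statistic as _PCHI_ */
run;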
(Partial listing of the cluster data set: Obs, NumberOfClusters, Idj1, Idj2, FreqOfNewCluster, SemipartialRSq, RSquared, Tie.)

Obs    _PCHI_
  1    350.164
The DATA step computes the chi-square statistic for each collapsed contingency table. The _n_ variable
is used to put the overall chi-square value in each observation for the data set cutoff. The LOGSDF
function computes the log of the probability that an observation from a specified distribution is greater
than or equal to a specified value. The arguments for the function are the specified distribution in quotes,
the numeric random variable, and the degrees of freedom. The log of the p-value is calculated in order to
produce a more visually appealing graph.
data cutoff;
if _n_ = 1 then set chi;
set cluster;
chisquare=_pchi_*rsquared;
degfree=numberofclusters-1;
logpvalue=logsdf('CHISQ',chisquare,degfree);
run;
The PLOT procedure plots the log of the p-value by the number of clusters. The VPOS option specifies
the number of print positions on the vertical axis.
proc plot data=cutoff;
plot logpvalue*numberofclusters/vpos=30;
run; quit;
(Plot of logpvalue by Number Of Clusters, 1 through 18.)
The graph shows that the 4-, 5-, and 6-cluster solutions had the lowest p-values.
The note at the bottom of the plot refers to an observation with a missing log p-value. Of course,
the 1-cluster solution is associated with a χ2 test with no degrees of freedom. Since this makes no
sense, there is no p-value.
To find the cluster solution with the lowest p-value, and assign that value to the macro variable NCL, use
the SQL procedure.
proc sql;
select NumberOfClusters into :ncl
from cutoff
having logpvalue=min(logpvalue);
quit;
The 5-cluster solution has the smallest log p-value.
Number Of Clusters
------------------
                 5
The TREE procedure produces a high-resolution dendrogram. The H= option specifies the variable to be
used as the height axis of the dendrogram. In this example, the proportion of the chi-squared statistic is
the vertical axis.
proc tree data=fortree h=rsq
nclusters=&ncl out=clus;
id branch;
run;
The dendrogram shows that several branches can be combined with a minuscule reduction in chi-squared.
The horizontal axis is also roughly ordered by the mean proportion of events in each cluster. Choosing a
cluster near the center of the tree (for example, B11, B18, B7, and B2) as a reference group may lead to
better models if the variable selection method you choose incrementally adds variables to the model
(Cohen 1991). The clus data set from the OUT= option in PROC TREE shows which levels of Branch
are associated with each cluster (if the NCLUSTERS= option is correctly specified).
proc sort data=clus;
by clusname;
run;
ClusName    Branch    Cluster
B14         B14          5
B15         B15          3
B16         B16          4
CL5         B11          2
            B18          2
            B7           2
            B2           2
CL6         B6           1
            B9           1
            B19          1
            B8           1
            B1           1
            B17          1
            B3           1
            B5           1
            B13          1
            B12          1
            B4           1
            B10          1
A DATA step is used to assign the branches to dummy variables. Four dummy variables are created with
the second cluster designated as the reference level. Note that the dummy variables are numbered
sequentially.
data imputed;
set imputed;
brclus1=(branch in ('B6','B9','B19','B8','B1','B17',
'B3','B5','B13','B12','B4','B10'));
brclus2=(branch='B15');
brclus3=(branch='B16');
brclus4=(branch='B14');
run;
Redundancy

[Slide: diagram illustrating redundant inputs (x1, x3, x5).]

Principal Components

[Slide: correlation matrix and covariance matrix of the inputs.]
Principal components analysis (Jackson 1991) can be used for reducing redundant dimensions. A set of k
variables can be transformed into a set of k principal components. The principal components (PCs) are
linear combinations of the k variables constructed to be jointly uncorrelated and to explain the total
variability among the original (standardized) variables.
The correlation matrix is the covariance matrix of the standardized variables. Because each standardized
variable has a variance equal to one, the total variability among the standardized variables is just the
number of variables. The principal components are produced by an eigen-decomposition of the
correlation matrix. The eigenvalues are the variances of the PCs; they sum to the number of variables. The
first PC corresponds to the first eigenvalue and explains the largest proportion of the variability. Each PC
explains a decreasing amount of the total variability. In the above example, the first three PCs explain
90% of the total variability.
In practice, dimension reduction is achieved by retaining only the first few PCs provided they explain a
sufficient proportion of the total variation. The reduced set of PCs might then be used in place of the
original variables in the analysis.
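For reference, a principal components analysis of the numeric inputs can be run with the PRINCOMP procedure. This is a minimal sketch, not part of the course code; the N= value and the pcs data set name are arbitrary.

proc princomp data=imputed out=pcs n=10;
   /* The correlation matrix (standardized variables) is used by default. */
   /* The eigenvalue table shows the proportion of variability explained  */
   /* by each PC; OUT= adds the PC scores (Prin1, Prin2, ...) to pcs.     */
   var &inputs;
run;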
Cluster Components

[Slide: correlation matrix and covariance matrix for the cluster components.]
Variable clustering as implemented in the VARCLUS procedure (SAS Institute Inc. 1990) is an alternative
method for eliminating redundant dimensions that is closely related to principal components analysis. The
result of clustering k variables is a set of ≤ k cluster components. Like PCs, the cluster components are
linear combinations of the original variables. Unlike the PCs, the cluster components are not uncorrelated
and do not explain all the variability in the original variables. The cluster components (scores) are
standardized to have unit variance.
The three cluster components in the above example correspond to eigenvalues of 1.7, 1.7, and 1.0.
Consequently, 88% of variation among the original (standardized) variables is explained by the three
clusters.
Coefficients
The chief advantage of variable clustering over principal components is the coefficients. The coefficients
of the PCs (eigenvectors) are usually nonzero for all the original variables. Thus, even if only a few PCs
were used, all the inputs would still have to be retained in the analysis. In contrast, the cluster component
scores have nonzero coefficients on disjoint subsets of the variables. These subsets correspond to disjoint
clusters of the original variables. In the above example, the clusters are {x3, x5}, {x1, x4}, and {x2}.
Divisive Clustering

[Slide: all variables {X1, X2, X3, X4, X5} start in one cluster; the cluster is split when its second
eigenvalue (1.7 here) exceeds the threshold.]
Variable clustering finds groups of variables that are as correlated as possible among themselves and as
uncorrelated as possible with variables in other clusters. The basic algorithm is binary and divisive. All
variables start in one cluster. A principal components analysis is done on the variables in the cluster. If the
second eigenvalue is greater than a specified threshold (in other words, there is more than one dominant
dimension), then the cluster is split. The PC scores are then rotated obliquely so that the variables can be
split into two groups. This process is repeated for the two child clusters until the second eigenvalue drops
below the threshold. (By default, PROC VARCLUS does a nonhierarchical version of this algorithm
where variables can also be reassigned to other clusters.)
Larger thresholds for the second eigenvalue give fewer clusters and less of the variation is explained.
Smaller thresholds give more clusters and more of the variation is explained. The value 1 is a common
choice for the threshold because it represents the average size of the eigenvalues. To account for sampling
variability, smaller values such as .7 have been suggested (Jackson 1991).
Cluster Representatives

    1 - R-squared ratio = (1 - R-squared with own cluster) / (1 - R-squared with next closest cluster)

A high R-squared with its own cluster shrinks the numerator and a low R-squared with the next closest
cluster grows the denominator, so good representatives have small ratios.
As with principal components analysis, dimension reduction could be achieved by replacing the original
variables with the cluster scores (components). A simple alternative is to select a representative variable
from each cluster. An ideal representative would have high correlation with its own cluster and have a low
correlation with the other clusters. Consequently, variables with the lowest 1 - R-squared ratio (defined above) in
each cluster would be good representatives. Of course, subject-matter considerations might dictate the
selection of other representatives.
Variable Clustering
The VARCLUS procedure¹ clusters numeric variables. The MAXEIGEN= option specifies the largest
permissible value of the second eigenvalue in each cluster. The default is 1 (using the correlation matrix).
The SHORT option suppresses some of the output. The OUTTREE= option creates an output data set that
can be used in the TREE procedure. When the OUTTREE= option is used, the clusters at different levels
maintain a hierarchical structure that prevents variables from transferring from one cluster to another after
the split is made. If the OUTTREE= option is not used, then use the HI option to get a hierarchical
structure. The VAR statement lists the numeric variables to cluster: the original numeric inputs, the
missing indicators, and the dummy variables for the collapsed Branch (64 total).
proc varclus data=imputed
maxeigen=.7
outtree=fortree
short;
var &inputs brclus1-brclus4 miacctag
miphone mipos miposamt miinv
miinvbal micc miccbal miccpurc
miincome mihmown milores mihmval
miage micrscor;
run;
The output from PROC VARCLUS shows the results for each step in the divisive clustering algorithm.
Even with the SHORT option, the amount of printed output is voluminous.
Partial Output
Oblique Principal Component Cluster Analysis
Cluster 1 will be split because it has the largest second eigenvalue, 5.097981, which is greater
than the MAXEIGEN=0.7 value.
¹ The default settings for the VARCLUS procedure changed from SAS Version 8 to SAS®9. If you would like to
emulate the behavior of SAS Version 8 exactly, specify MAXSEARCH=0 in the PROC VARCLUS statement.
2 Clusters

                               R-squared with    R-squared with    1-R**2
Cluster       Variable         Own Cluster       Next Closest      Ratio      Variable Label
-----------------------------------------------------------------------------------------------
Cluster 1 AcctAge 0.0004 0.0002 0.9998 Age of Oldest Account
Dep 0.0010 0.0001 0.9991 Checking Deposits
DepAmt 0.0007 0.0000 0.9993 Amount Deposited
CashBk 0.0003 0.0000 0.9997 Number Cash Back
Checks 0.0037 0.0001 0.9963 Number of Checks
DirDep 0.0001 0.0001 0.9999 Direct Deposit
NSF 0.0002 0.0001 0.9999 Number Insufficient Fund
NSFAmt 0.0000 0.0000 1.0000 Amount NSF
Phone 0.0179 0.0001 0.9821 Number Telephone Banking
Teller 0.0012 0.0001 0.9989 Teller Visits
Sav 0.0006 0.0000 0.9994 Saving Account
SavBal 0.0000 0.0000 1.0000 Saving Balance
ATM 0.0003 0.0000 0.9997 ATM
ATMAmt 0.0013 0.0000 0.9987 ATM Withdrawal Amount
POS 0.0257 0.0000 0.9743 Number Point of Sale
POSAmt 0.0242 0.0001 0.9759 Amount Point of Sale
CD 0.0002 0.0000 0.9998 Certificate of Deposit
CDBal 0.0002 0.0000 0.9998 CD Balance
LOC 0.0066 0.0000 0.9934 Line of Credit
LOCBal 0.0014 0.0001 0.9987 Line of Credit Balance
Inv 0.0052 0.0001 0.9949 Investment
InvBal 0.0002 0.0000 0.9999 Investment Balance
ILS 0.0026 0.0000 0.9974 Installment Loan
ILSBal 0.0024 0.0000 0.9977 Loan Balance
MM 0.0000 0.0000 1.0000 Money Market
MMBal 0.0001 0.0000 1.0000 Money Market Balance
MTG 0.0032 0.0003 0.9970 Mortgage
MTGBal 0.0006 0.0000 0.9995 Mortgage Balance
CC 0.1279 0.0007 0.8727 Credit Card
CCBal 0.0019 0.0000 0.9981 Credit Card Balance
CCPurc 0.0211 0.0000 0.9789 Credit Card Purchases
SDB 0.0006 0.0000 0.9995 Safety Deposit Box
CRScore 0.0001 0.0000 0.9999 Credit Score
Moved 0.0000 0.0000 1.0000 Recent Address Change
InArea 0.0003 0.0000 0.9997 Local Address
brclus1 0.2239 0.0535 0.8200
brclus2 0.5331 0.0022 0.4679
brclus3 0.0061 0.0019 0.9958
brclus4 0.2426 0.0010 0.7582
MIAcctAg 0.0000 0.0000 1.0000
MIPhone 0.9925 0.0044 0.0076
MIPOS 0.9925 0.0044 0.0076
MIPOSAmt 0.9925 0.0044 0.0076
MIInv 0.9925 0.0044 0.0076
MIInvBal 0.9925 0.0044 0.0076
MICC 0.9925 0.0044 0.0076
MICCBal 0.9925 0.0044 0.0076
MICCPurc 0.9925 0.0044 0.0076
----------------------------------------------------------------------------------
Cluster 2 DDA 0.0002 0.0001 0.9999 Checking Account
DDABal 0.0004 0.0001 0.9997 Checking Balance
IRA 0.0000 0.0000 1.0000 Retirement Account
IRABal 0.0001 0.0000 0.9999 IRA Balance
MMCred 0.0000 0.0000 1.0000 Money Market Credits
Income 0.0113 0.0000 0.9887 Income
HMOwn 0.1850 0.0000 0.8150 Owns Home
LORes 0.0005 0.0001 0.9996 Length of Residence
HMVal 0.0137 0.0000 0.9863 Home Value
Age 0.0009 0.0000 0.9991 Age
MIIncome 0.9857 0.0028 0.0143
MIHMOwn 0.9541 0.0030 0.0461
MILORes 0.9857 0.0028 0.0143
MIHMVal 0.9857 0.0028 0.0143
MIAge 0.9194 0.0027 0.0808
MICRScor 0.0044 0.0000 0.9956
Cluster 1 will be split because it has the largest second eigenvalue, 3.431637, which is greater
than the MAXEIGEN=0.7 value.
Using the Output Delivery System, you can restrict the output to just the objects of interest. This requires
knowledge of the object names, which can be found in the documentation for the VARCLUS procedure
or obtained by using the ODS TRACE statement.
ods trace on/listing;
proc varclus data=imputed
maxeigen=.7
outtree=fortree
short;
var &inputs brclus1-brclus4 miacctag
miphone mipos miposamt miinv
miinvbal micc miccbal miccpurc
miincome mihmown milores mihmval
miage micrscor;
run;
ods trace off;
Partial Output
Oblique Principal Component Cluster Analysis
Output Added:
-------------
Name: DataOptSummary
Label: Data and Options Summary
Template: Stat.Varclus.DataOptSummary
Path: Varclus.DataOptSummary
-------------
Output Added:
-------------
Name: ConvergenceStatus
Label: Convergence Status
Template: Stat.Varclus.ConvergenceStatus
Path: Varclus.GROUP.ConvergenceStatus
-------------
Output Added:
-------------
Name: ClusterSummary
Label: Cluster Summary
Template: Stat.Varclus.ClusterSummary
Path: Varclus.GROUP.ClusterSummary
-------------
Cluster 1 will be split because it has the largest second eigenvalue, 5.097981, which is greater
than the MAXEIGEN=0.7 value.
Output Added:
-------------
Name: ConvergenceStatus
Label: Convergence Status
Template: Stat.Varclus.ConvergenceStatus
Path: Varclus.GROUP.ConvergenceStatus
-------------
Output Added:
-------------
Name: ClusterSummary
Label: Cluster Summary
Template: Stat.Varclus.ClusterSummary
Path: Varclus.GROUP.ClusterSummary
-------------
Output Added:
-------------
Name: RSquare
Label: R-squared
Template: Stat.Varclus.RSquare
Path: Varclus.GROUP.RSquare
-------------
2 Clusters

                               R-squared with    R-squared with    1-R**2
Cluster       Variable         Own Cluster       Next Closest      Ratio      Variable Label
-----------------------------------------------------------------------------------------------
Cluster 1 AcctAge 0.0004 0.0002 0.9998 Age of Oldest Account
Dep 0.0010 0.0001 0.9991 Checking Deposits
DepAmt 0.0007 0.0000 0.9993 Amount Deposited
CashBk 0.0003 0.0000 0.9997 Number Cash Back
Checks 0.0037 0.0001 0.9963 Number of Checks
DirDep 0.0001 0.0001 0.9999 Direct Deposit
NSF 0.0002 0.0001 0.9999 Number Insufficient Fund
NSFAmt 0.0000 0.0000 1.0000 Amount NSF
Phone 0.0179 0.0001 0.9821 Number Telephone Banking
Teller 0.0012 0.0001 0.9989 Teller Visits
Sav 0.0006 0.0000 0.9994 Saving Account
SavBal 0.0000 0.0000 1.0000 Saving Balance
ATM 0.0003 0.0000 0.9997 ATM
ATMAmt 0.0013 0.0000 0.9987 ATM Withdrawal Amount
POS 0.0257 0.0000 0.9743 Number Point of Sale
POSAmt 0.0242 0.0001 0.9759 Amount Point of Sale
CD 0.0002 0.0000 0.9998 Certificate of Deposit
CDBal 0.0002 0.0000 0.9998 CD Balance
LOC 0.0066 0.0000 0.9934 Line of Credit
LOCBal 0.0014 0.0001 0.9987 Line of Credit Balance
Inv 0.0052 0.0001 0.9949 Investment
InvBal 0.0002 0.0000 0.9999 Investment Balance
ILS 0.0026 0.0000 0.9974 Installment Loan
ILSBal 0.0024 0.0000 0.9977 Loan Balance
MM 0.0000 0.0000 1.0000 Money Market
MMBal 0.0001 0.0000 1.0000 Money Market Balance
MTG 0.0032 0.0003 0.9970 Mortgage
MTGBal 0.0006 0.0000 0.9995 Mortgage Balance
CC 0.1279 0.0007 0.8727 Credit Card
CCBal 0.0019 0.0000 0.9981 Credit Card Balance
CCPurc 0.0211 0.0000 0.9789 Credit Card Purchases
SDB 0.0006 0.0000 0.9995 Safety Deposit Box
CRScore 0.0001 0.0000 0.9999 Credit Score
Moved 0.0000 0.0000 1.0000 Recent Address Change
InArea 0.0003 0.0000 0.9997 Local Address
brclus1 0.2239 0.0535 0.8200
brclus2 0.5331 0.0022 0.4679
brclus3 0.0061 0.0019 0.9958
brclus4 0.2426 0.0010 0.7582
MIAcctAg 0.0000 0.0000 1.0000
MIPhone 0.9925 0.0044 0.0076
MIPOS 0.9925 0.0044 0.0076
MIPOSAmt 0.9925 0.0044 0.0076
MIInv 0.9925 0.0044 0.0076
MIInvBal 0.9925 0.0044 0.0076
MICC 0.9925 0.0044 0.0076
MICCBal 0.9925 0.0044 0.0076
MICCPurc 0.9925 0.0044 0.0076
----------------------------------------------------------------------------------
Cluster 2 DDA 0.0002 0.0001 0.9999 Checking Account
DDABal 0.0004 0.0001 0.9997 Checking Balance
IRA 0.0000 0.0000 1.0000 Retirement Account
IRABal 0.0001 0.0000 0.9999 IRA Balance
MMCred 0.0000 0.0000 1.0000 Money Market Credits
Income 0.0113 0.0000 0.9887 Income
HMOwn 0.1850 0.0000 0.8150 Owns Home
LORes 0.0005 0.0001 0.9996 Length of Residence
HMVal 0.0137 0.0000 0.9863 Home Value
Age 0.0009 0.0000 0.9991 Age
MIIncome 0.9857 0.0028 0.0143
MIHMOwn 0.9541 0.0030 0.0461
MILORes 0.9857 0.0028 0.0143
MIHMVal 0.9857 0.0028 0.0143
MIAge 0.9194 0.0027 0.0808
MICRScor 0.0044 0.0000 0.9956
Cluster 1 will be split because it has the largest second eigenvalue, 3.431637, which is greater
than the MAXEIGEN=0.7 value.
The output object RSQUARE contains the results of each iteration of the algorithm except the first one
(the cluster with all of the variables). To print out the last iteration, you need the number of clusters in the
last iteration.
The variable NumberOfClusters in the output object ClusterQuality contains the number of clusters
for the final iteration. The output object also has information on the proportion of variation explained by
the clusters and the maximum second eigenvalue in a cluster. This information can be used to decide
whether too few or too many clusters have been formed.
The ODS OUTPUT statement creates a SAS data set named Summary from the output object
ClusterQuality, and a SAS data set named Clusters from each RSQUARE output object. Because
there are 40 RSQUARE objects created (one for the 2-cluster solution, one for the 3-cluster solution, and
so on up to the 41-cluster solution), the Clusters data set concatenates these 40 objects. There is a
column called NumberOfClusters that indicates which cluster solution each observation in the
Clusters data set belongs to.
The CALL SYMPUT routine creates the macro variable nvar, which contains the number of clusters in
the last iteration of the clustering algorithm. The COMPRESS function strips blanks from the value.
Because the Clusters data set contains the results of all 40 cluster solutions, the nvar macro variable
is used to restrict focus to the final result of the VARCLUS algorithm; here, this is the 41-cluster
solution. The nvar macro variable will also be useful later, because if you select one representative
from each variable cluster, you will have nvar inputs for future modeling consideration.
ods listing close;
ods output clusterquality=summary
rsquare=clusters;
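   /* The PROC VARCLUS step shown earlier must be resubmitted at this     */
   /* point so that the ODS OUTPUT statement above captures its           */
   /* ClusterQuality and RSquare tables; an ODS LISTING statement then    */
   /* reopens the listing destination.                                    */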
data _null_;
set summary;
call symput('nvar',compress(NumberOfClusters));
run;
Partial Output

                       RSquare
Cluster    Variable    Ratio      VariableLabel
The output shows the cluster number, the names of the variables in each cluster, and the 1 - R-squared
ratio (RSquareRatio).
PROC TREE draws a dendrogram of the divisive variable clustering process. The AXIS statements are
used to control the plotting and printing of the axes and their labels. The HORIZONTAL option specifies that
the dendrogram should be plotted horizontally. The HAXIS= and VAXIS= options tie the axis
definitions in the AXIS statements to the axes of the plot. The HEIGHT statement plots the second
eigenvalue on the horizontal axis. For example, when a principal components analysis is done on all the
variables (the first cluster), the second eigenvalue is 5.10. When this cluster is split, the second eigenvalue
is 3.17 for one child cluster and 2.06 for the other.
axis1 value=(font = tahoma color=blue rotate=0 height=.8)
label=(font = tahoma angle=90 height=2);
axis2 order=(0 to 6 by 2);
proc tree data=fortree
horizontal
vaxis=axis1
haxis=axis2;
height _MAXEIG_;
run;
To reduce the amount of text that needs to be entered, a new macro variable named Reduced is created
that contains the names of all the variables selected from PROC VARCLUS. As was mentioned above,
subject-matter considerations should play a part in deciding which inputs become the cluster
representatives. Consider the first cluster, which consists of the variables brclus2, MIPhone, MIPOS,
MIPOSAmt, MIInv, MIInvBal, MICC, MICCBal, and MICCPurc. According to the 1-R-Square
ratio, any of the missing indicators would be the best cluster representative. However, closer investigation
of the development data set might reveal that the missingness of those eight inputs has more to do with
poor data quality than any other source of missing data. If these missing indicators in actuality indicate
poor data quality, then their usefulness as predictors with good generalizing power is suspect. With that
subject-matter consideration in mind, perhaps brclus2 is the best cluster representative from cluster 1.
%let reduced=
brclus2 miincome checks ccbal
mmbal income ilsbal posamt
nsfamt cd irabal age
loc sav dda invbal
crscore brclus3 cc brclus1
cashbk miacctag micrscor moved
acctage dirdep savbal ddabal
sdb ccpurc inarea atmamt
phone mmcred hmown inv
depamt brclus4 atm lores
mtg;
Univariate Screening

[Slide: the event probability plotted against x2 separately for X1=1 and X1=2, illustrating how partial
associations can be missed by univariate screening.]
It is tempting to use univariate associations to detect irrelevant input variables. Each input variable is
screened individually versus the target (chi-squared tests, correlation coefficients, two-sample tests, and
so on). Only the most important inputs are retained in the analysis. This method does not account for
partial associations among the inputs. Inputs could be erroneously omitted or erroneously included.
Partial association occurs when the effect of one variable changes in the presence of another variable.
Multivariate methods that consider subsets of variables jointly are preferable. The best k inputs in a
univariate sense would not necessarily be the best k-element subset. The presence of interactions can also
give misleading univariate associations.
Variable Screening
Even after variable clustering, some further variable reduction may be needed prior to using the variable
selection techniques in the LOGISTIC procedure. Very liberal univariate screening may be helpful when
the number of clusters created in the VARCLUS procedure is still relatively large. Because some of the
variable selection techniques use the full model, eliminating clearly irrelevant variables (for example,
p-values greater than .50) will stabilize the full model and may improve the variable selection technique
without much risk of eliminating important input variables. Keep in mind that univariate screening can
give misleading results when there are partial associations. This problem is minimized because the
screening is done after PROC VARCLUS, and is used in eliminating clearly irrelevant variables, rather
than searching for the best predictors.
The CORR procedure can be used for univariate screening. The SPEARMAN option requests Spearman
correlation statistics, which are correlations of the ranks of the input variables with the binary target. The
Spearman correlation statistic was used rather than the Pearson correlation statistic because Spearman is
less sensitive to nonlinearities and outliers. However, when variables are not monotonically related to
each other, the Spearman correlation statistic can miss important associations. A general and robust
similarity measure is Hoeffding's D (requested by the HOEFFDING option), which detects a wide
variety of associations between two variables. Hoeffding's D statistic has values between -0.5 and 1,
although smaller values may result when there are many ties. The RANK option prints the correlation
coefficients for each variable in order from highest to lowest.
A useful table (or plot) would compare the rank order of the Spearman correlation statistic to the rank
order of the Hoeffding’s D statistic. If the Spearman rank is high but the Hoeffding’s D rank is low, then
the association is probably not monotonic. Empirical logit plots could be used to investigate this type of
relationship.
The output object for the table of Spearman correlation statistics is called SPEARMANCORR, and the
output object for the table of Hoeffding’s D statistics is called HOEFFDINGCORR.
ods listing close;
ods output spearmancorr=spearman
hoeffdingcorr=hoeffding;
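   /* A sketch of the PROC CORR step that the ODS OUTPUT statement above   */
   /* captures; the use of &reduced in the VAR statement is an assumption  */
   /* based on the surrounding text. The WITH statement pairs each input   */
   /* with the binary target Ins.                                          */
   proc corr data=imputed spearman hoeffding rank;
      var &reduced;
      with ins;
   run;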
ods listing;
The variable names in the SAS data sets Spearman and Hoeffding are in the variables Best1
through Best41, the correlation statistics are in the variables R1 through R41, and the p-values are in
the variables P1 through P41. The macro variable nvar was created in the last demonstration. It points
to the number of clusters in the final cluster solution from the VARCLUS procedure (here it is 41). In
order to make a more useful table, a DATA step is used to transpose the variables to observations.
data spearman1(keep=variable scorr spvalue ranksp);
length variable $ 8;
set spearman;
array best(*) best1--best&nvar;
array r(*) r1--r&nvar;
array p(*) p1--p&nvar;
do i=1 to dim(best);
variable=best(i);
scorr=r(i);
spvalue=p(i);
ranksp=i;
output;
end;
run;
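The analogous step for the Hoeffding statistics is sketched below. The names hpvalue and rankho match the names referenced in the later PROC SQL step; hcorr is a hypothetical name for the statistic itself. Both data sets are then sorted by variable in preparation for the merge.

data hoeffding1(keep=variable hcorr hpvalue rankho);
   length variable $ 8;
   set hoeffding;
   array best(*) best1--best&nvar;
   array r(*) r1--r&nvar;
   array p(*) p1--p&nvar;
   do i=1 to dim(best);
      variable=best(i);
      hcorr=r(i);      /* Hoeffding's D statistic        */
      hpvalue=p(i);    /* p-value for Hoeffding's D      */
      rankho=i;        /* rank order of the Hoeffding D  */
      output;
   end;
run;

proc sort data=spearman1;
   by variable;
run;

proc sort data=hoeffding1;
   by variable;
run;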
data correlations;
merge spearman1 hoeffding1;
by variable;
run;
The final data set is sorted by the rank of the Spearman correlation statistic and then a table is generated
showing the rank and associated statistics of the Spearman correlations and Hoeffding’s D statistics.
proc sort data=correlations;
by ranksp;
run;
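A listing step such as the following produces that table (a sketch; the hcorr and hpvalue names come from the Hoeffding DATA step sketched above):

proc print data=correlations;
   var variable ranksp scorr spvalue rankho hcorr hpvalue;
run;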
A graphical representation of this table may aid decision making. The PLOT procedure is used to create a
scatter plot of Spearman ranks against Hoeffding ranks. For ease of interpretation, the variable names are
placed next to the point markers.
In order to have reference lines on the plot, the SQL procedure is used to create macro variables
that point to the smallest Spearman rank and Hoeffding rank associated with a p-value greater
than 0.5. This is just for presentation.
In general, the upper-right corner of the plot contains the names of variables that could reasonably be
excluded from further analysis, due to their poor rank on both metrics. The criterion to use in eliminating
variables is a subjective decision. Thus, four variables are eliminated from the analysis.
proc sql noprint;
select min(ranksp) into :vref
from (select ranksp
from correlations
having spvalue > .5);
select min(rankho) into :href
from (select rankho
from correlations
having hpvalue > .5);
quit;
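A sketch of the plotting step described above: the $ operator in PROC PLOT places the variable name next to each point, and the VREF= and HREF= options draw the reference lines from the macro variables created in the PROC SQL step.

proc plot data=correlations;
   plot ranksp*rankho $ variable / vref=&vref href=&href;
run; quit;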
[Plot of Spearman rank (ranksp) versus Hoeffding rank (rankho), with each point labeled by variable name
and reference lines at the smallest ranks whose p-values exceed 0.5.]
In addition to being a useful screening tool, this analysis may point toward further analyses. High ranks
for Spearman and low ranks for Hoeffding’s D are found for the variables DDABal, DepAmt, and
ATMAmt. Even though these variables do not have a monotonic relationship with Ins, some other type of
relationship is detected by Hoeffding’s D statistic. Empirical logit plots should be used to examine these
relationships.
The %LET statement creates a macro variable called screened that has the names of the variables
remaining after the univariate screening method.
%let screened =
brclus2 checks ccbal
mmbal income ilsbal posamt
nsfamt cd irabal age
sav dda invbal
crscore brclus3 cc brclus1
cashbk miacctag micrscor moved
acctage dirdep savbal ddabal
sdb ccpurc inarea atmamt
phone mmcred inv
depamt brclus4 atm lores;
Univariate Smoothing

[Slide: scatter plot of a binary target versus an input x, motivating the need for smoothing.]
In regression analysis, it is standard practice to examine scatter plots of the target versus each input
variable. When the target is binary, these plots are not very enlightening. A useful plot to detect nonlinear
relationships is a plot of the empirical logits.
Empirical Logits

    elogit_i = ln( (m_i + sqrt(M_i)/2) / (M_i - m_i + sqrt(M_i)/2) )

where
    m_i = number of events in bin i
    M_i = number of cases in bin i
Univariate plots of binary data need to be smoothed. A simple, scalable, and robust smoothing method is
to plot empirical logits for quantiles of the input variables. These logits use a minimax estimate of the
proportion of events in each bin (Duffy and Santner 1989). This eliminates the problem caused by zero
counts and reduces variability.
The number of bins determines the amount of smoothing (for example, the fewer bins, the more
smoothing). One large bin would give a constant logit. For very large data sets and intervally scaled
inputs, 100 bins often works well. If the standard logistic model were true, then the plots should be linear.
Sample variability can cause apparent deviations, particularly when the bin size is too small. However,
serious nonlinearities, such as nonmonotonicity, are usually easy to detect.
To create a plot of the empirical logits versus a continuous input variable, the input variable first needs to
be binned. Use the RANK procedure with the GROUPS= option to bin variables automatically. The bins
will be equal size (quantiles) except when the number of tied values exceeds the bin size, in which case
the bin will be enlarged to contain all the tied values. The VAR statement lists the variable in the DATA=
data set to be grouped. The RANKS statement names the variable representing the groups in the OUT=
data set. If the RANKS statement is not used, the VAR variable is replaced by the groups.
%let var=DDABal;
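A sketch of the binning step described above (the GROUPS=100 value is an assumption based on the later demonstration); a short PROC PRINT shows the first few observations of the binned data set listed below.

proc rank data=imputed groups=100 out=out;
   var &var;      /* DDABal                    */
   ranks bin;     /* bin takes values 0 to 99  */
run;

proc print data=out(obs=10);
   var &var bin;
run;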
Partial Output

Obs       DDABal    bin
  1       419.27     44
  2      1986.81     76
  3         0.00      9
  4      1594.84     72
  5      2813.45     82
  6      1069.78     63
  7      1437.57     69
  8      1683.28     73
  9       190.03     33
 10       462.12     46
To compute the empirical logit, the number of target event cases (ins=1) and total cases in each bin
needs to be computed. The empirical logits are plotted against the mean of the input variable in each bin.
This needs to be computed as well. Both tasks can be done in the MEANS procedure using the CLASS
statement. The appropriate statistics (SUM and MEAN) need to be specified in the OUTPUT statement.
proc means data=out noprint nway;
class bin;
var ins &var;
output out=bins sum(ins)=ins mean(&var)=&var;
run;
Obs    bin    _TYPE_    _FREQ_     Ins       DDABal
  1      0       1         198      21     -31.8287
  2      9       1        5987    3179       0.0000
  3     19       1         267      26       2.7617
  4     20       1         323      26       9.0509
  5     21       1         323      38      17.2157
  6     22       1         322      51      28.1966
  7     23       1         323      38      40.9160
  8     24       1         322      47      53.4157
  9     25       1         323      45      66.7146
 10     26       1         323      64      81.9273
The variable ins contains the number of events while the automatic variable _FREQ_ contains the bin
size.
data bins;
set bins;
elogit=log((ins+(sqrt(_FREQ_)/2))/
(_FREQ_-ins+(sqrt(_FREQ_)/2)));
run;
The empirical logits can be plotted using the GPLOT procedure. It is easy to get a simple plot. The
statement PLOT Y * X requests that the variable Y be plotted on the vertical axis and that the variable X
be plotted on the horizontal axis. The default picture is a scatter plot of Y by X. The following code plots
elogit against &var, here DDABal.
proc gplot data = bins;
title "Empirical Logit against &var";
plot elogit * &var;
run; quit;
Using a few more commands², it is possible to control many features of the plot. The code below uses a
SYMBOL statement to produce the following:
• a line that interpolates between the points on the scatter plot (I=JOIN)
• a blue line (C=BLUE)
• stars as the plotting symbol (V=STAR).
In addition to plotting the empirical logits against the DDA balance, this code plots the logits against bin
number.
symbol i=join c=blue v=star;
proc gplot data = bins;
title "Empirical Logit against &var";
plot elogit * &var;
title "Empirical Logit against Binned &var";
plot elogit * bin;
run; quit;
² If you choose a point-and-click interface for creating graphs, the Graph-N-Go utility is supplied in SAS/GRAPH
software. However, pointing and clicking will require much more user interaction.
The pattern made by plotting logit against checking account balance has two striking features. There is a
spike in the logits at the $0 balance level. Aside from that spike, the trend is monotonic but certainly not
linear.
The pattern made by plotting logit against the rank of checking account balance seems to accommodate
the nonlinearity of the last plot, but the spike indicates a portion of the population who are not behaving
as their balance would lead one to believe. The trend in checking account balance is clear, and this is
probably a very good input. How, though, can one account for the nonlinear relationship between
response behavior and balance amount?
Remedies
1. Hand-Crafted New Input Variables
2. Polynomial Models
3. Flexible Multivariate Function Estimators
4. Do Nothing
1. Skilled and patient data analysts can accommodate nonlinearities in a logistic regression model by
transforming or discretizing the input variables. This can become impractical with high-dimensional
data and increases the risk of overfitting.
2. Polynomial terms may improve model fit but hamper model interpretation and generalization.
Quadratic terms and two-factor interactions can be useful additions to a modeler’s tool kit but are no
panacea. Higher-order polynomials are not reliable smoothers.
3. Methods such as classification trees, generalized additive models, projection pursuit, multivariate
adaptive regression splines, radial basis function networks, and multilayer perceptrons are flexible
alternatives to logistic regression (Hastie and Tibshirani 1990, Ripley 1996).
4. Standard (linear) logistic regression can produce powerful and useful classifiers even when the
estimates of the posterior probabilities are poor. Often more flexible approaches do not show enough
improvement to warrant the effort.
Accommodating Nonlinearities
In order to use the checking account balance, you might consider taking a logarithmic transformation, a
square root transformation, or some other transformation in an attempt to linearize the relationship
between the logit and the account balances. This linearization will require two steps. First, that spike at $0
needs to be accounted for. Second, you can transform the balances to some scale that reflects the behavior
exhibited in the data.
It seems suspicious that a large portion of the population has exactly $0 in their checking accounts.
Investigation (of the data set or of the people who constructed the data set) will show that most of the
individuals with exactly $0 balances do not have checking accounts. Their balances have been set to $0 as
part of the data pre-processing. This rule seems reasonable from a logical imputation standpoint. How can
someone with no checking account have a checking balance? But it is clear from the logit plots that those
individuals are behaving like people with much more than $0 in their checking accounts. The CLASS
statement yields results for each level of the DDA variable; the results of interest are the mean, minimum,
and maximum for the DDABal and Ins variables.
title;
proc means data=imputed mean min max;
class dda;
var ddabal ins;
run;
From the output, it is clear that the individuals without checking accounts have had $0 balances imputed
for them, and those individuals respond at a higher rate than individuals with checking accounts. If this
seems unreasonable, consider that the individuals without checking accounts presumably do their
everyday banking with a different bank altogether, and treat this bank as an investment institution.
The MEANS Procedure

Checking
Account    N Obs    Variable    Label                   Mean       Minimum      Maximum
-----------------------------------------------------------------------------------------
      0     5948    DDABal      Checking Balance           0             0            0
                    Ins                              0.5314391           0    1.0000000
There are several possible ways to account for this discrepancy between the DDABal value and the
empirically observed response behavior. One of the most straightforward is to re-impute. The following
code creates a macro variable mean that contains the mean checking account balance of those customers
who actually have checking accounts. That mean is then substituted for the $0 balances³ of those who do
not have checking accounts. The SELECT INTO statement in the SQL procedure creates the macro
variable, and the following DATA step fills in the $0 balances with the mean balance of customers who
have their checking accounts at this bank.
³ Arguably, most, if not all, of these customers have checking accounts somewhere, and therefore do have balances.
proc sql;
select mean(ddabal) into :mean
from imputed where dda;
quit;
--------
2660.489
data imputed;
set imputed;
if not dda then ddabal = &mean;
run;
To evaluate the effectiveness of this re-imputation, take another look at the logit plots.
%let var=DDABal;
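/* Re-create the bins from the re-imputed data before recomputing the   */
/* empirical logits: a sketch repeating the earlier PROC RANK and       */
/* PROC MEANS steps.                                                    */
proc rank data=imputed groups=100 out=out;
   var &var;
   ranks bin;
run;

proc means data=out noprint nway;
   class bin;
   var ins &var;
   output out=bins sum(ins)=ins mean(&var)=&var;
run;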
data bins;
set bins;
elogit=log((ins+(sqrt(_FREQ_ )/2))/
( _FREQ_ -ins+(sqrt(_FREQ_ )/2)));
run;
The spike seems to be accounted for. Now, the question that remains is that of transformation. Which
transformation does the best job of linearizing the relationship? You may have several of your favorite
transformations; the limitations are usually only the analyst’s imagination and the software on the server
that houses the database. Here, it seems like the bins do a very good job of capturing the way that
response behavior changes with respect to checking balances in a fairly linear way. Hence, the following
code creates a new input, B_DDABal, which represents the binned checking account balances. Because
you would need this recipe to create the B_DDABal input if it eventually is part of your model, the
following code generates DATA step code to perform the binning.
Some analysts might reserve the term binning for the practice of creating a categorical predictor
from a continuous input. Therefore, it might be more appropriate to label this a rank-group or
percentile transformation.
Regardless of the input’s original distribution, the percentiles have a uniform distribution. This
uniformly distributed version of DDABal has a linear association with the logit. Even better,
unlike the log transformation, the percentile transformation affords easy interpretability with odds
ratios: a one-unit change in the percentiles results in the corresponding change in the odds of
response.
The RANK procedure assigns observations to 100 groups, according to DDABal. The MEANS procedure
creates a data set that contains the maximum DDABal value in each bin. This information can be used to
assign a new data set its own bins, without running the RANK procedure.
proc rank data=imputed groups=100 out=out;
var ddabal;
ranks bin;
run;
title;
proc means data = out noprint nway;
class bin;
var ddabal;
output out=endpts max=max;
run;
Partial Output

Obs    bin    _TYPE_    _FREQ_       max
  1      0       1         323      1.72
  2      1       1         322      8.52
  3      2       1         322     16.38
  4      3       1         323     27.12
  5      4       1         323     40.27
  6      5       1         322     52.51
  7      6       1         323     65.29
  8      7       1         323     81.54
  9      8       1         322     94.38
 10      9       1         323    110.38
Rather than writing out, by hand, a series of rules like IF DDABal <= 1.72 then bin =0; ELSE IF… for
many bins, you can use a DATA step to write the rules. An easy way to do this is to use a FILENAME
statement to point to a file and then write the rules to that file using the PUT statement.
The FILENAME statement creates a fileref, Rank, which points to the physical file C:\temp\rank.sas. The
DATA _NULL_ statement enables you to use the DATA step processing without being required to write
out a data set. The FILE statement specifies where you want to put the output of the DATA step (much as
the INFILE statement would enable you to specify the input source for a DATA step.) The SET statement
brings in the data set endpts, which has the maximum balance in each bin. The option END= creates an
internal flag that you can use to identify the last record of the data set. You could write
IF…THEN…ELSE syntax to do the bin assignment. The following example uses
SELECT…WHEN…OTHERWISE to perform the same task. The last record captures everyone with a
balance larger than the maximum in the penultimate bin.
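For reference, a FILENAME statement matching the path described above (adjust the path for your environment):

filename rank 'C:\temp\rank.sas';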
data _null_;
file rank;
set endpts end=last;
if _n_ = 1 then put "select;";
if not last then do;
put " when (ddabal <= " max ") B_DDABal =" bin ";";
end;
else if last then do;
put " otherwise B_DDABal =" bin ";";
put "end;";
end;
run;
Partial Listing of RANK.SAS
select;
when (ddabal <= 1.72 ) B_DDABal =0 ;
when (ddabal <= 8.52 ) B_DDABal =1 ;
when (ddabal <= 16.38 ) B_DDABal =2 ;
when (ddabal <= 27.12 ) B_DDABal =3 ;
when (ddabal <= 40.27 ) B_DDABal =4 ;
when (ddabal <= 52.51 ) B_DDABal =5 ;
when (ddabal <= 65.29 ) B_DDABal =6 ;
when (ddabal <= 81.54 ) B_DDABal =7 ;
when (ddabal <= 94.38 ) B_DDABal =8 ;
when (ddabal <= 110.38 ) B_DDABal =9 ;
To use this code, you can use the %INCLUDE statement embedded in a DATA step. For example, the
following code puts B_DDABal on the Imputed data set. The SOURCE2 option requests that the code
be included in the log as well.
data imputed;
set imputed;
%include rank /source2;
run;
To see that the recoding worked, evaluate the minimum and maximum checking account balances in each
of the bins. The MEANS procedure with a CLASS statement will do this.
proc means data = imputed min max;
class B_DDABal;
var DDABal;
run;
Partial Output
The MEANS Procedure
Because the binned balance input is a replacement for the DDABal column, switch B_DDABal for
DDABal in the Screened macro variable.
%let screened =
brclus2 checks ccbal
mmbal income ilsbal posamt
nsfamt cd irabal age
sav dda invbal
crscore brclus3 cc brclus1
cashbk miacctag micrscor moved
acctage dirdep savbal B_DDABAL
sdb ccpurc inarea atmamt
phone mmcred inv
depamt brclus4 atm lores;
Binning is not the only option. Other popular techniques for hand-crafting inputs include the
transformations mentioned above as well as splines. Notice that B_DDABal is not a categorical
input. The point of the linearization exercise is to take advantage of the linear relationship
between logits and bins, not to add 100 dummy variables to the list of inputs.
If there were any question about data quality, the binning code would have to be much more
robust.
All Subsets

[Slide: the lattice of all possible subsets of k inputs, 2^k models in total.]
Variable selection methods in regression are concerned with finding subsets of the inputs that are jointly
important in predicting the target. The most thorough search would consider all possible subsets. This can
be prohibitively expensive when the number of inputs, k, is large, as there are 2k possible subsets to
consider.
Stepwise Selection

[Slide: diagram of the stepwise algorithm adding (and possibly removing) one input at each step until it stops.]
Stepwise variable selection is a much-maligned yet heavily used subset selection method. Stepwise
selection first searches the 1-input models and selects the best. It then searches the 2-input models that
contain the input selected in the first step and selects the best. The model is incrementally built in this
fashion until no improvement is made. There is also a backward portion of the algorithm where at each
step, the variables in the current model can be removed if they have become unimportant. The usual
criterion used for entry and removal from the model is the p-value from a significance test that the
coefficient is zero, although other criteria can also be used. Note that in subset selection, the p-value is
merely a tuning parameter that measures the relative strength of the candidate variables.
Stepwise selection was devised to give a computationally efficient alternative to examining all subsets. It
is not guaranteed to find the best subset and it can be shown to perform badly in many situations (Harrell
1997).
Backward Elimination

[Slide: diagram of the backward algorithm starting with all inputs and removing one at each step until it stops.]
Backward variable selection starts with all the candidate variables in the model simultaneously. At each
step, the least important input variable is removed (as determined by the p-value). Backward elimination
is less inclined to exclude important inputs or include spurious inputs than forward (stepwise) methods
(Mantel 1970; Harrell 1997). However, it is considered more computationally expensive than stepwise
because more steps are usually required and they involve larger models.
[Chart: computation time versus number of inputs for all subsets, stepwise, and fast backward selection.]
Most of the literature on the different subset selection methods has considered linear rather than logistic
regression. The conventional wisdom regarding computation time is that
stepwise < backwards < all subsets.
However, logistic regression (as implemented by PROC LOGISTIC) gives a different story. For up to ≈60
inputs, the results are reversed:
all subsets < backwards < stepwise.
For any number of inputs, backward elimination (with the FAST option) is more efficient than stepwise.
(The above simulation was conducted with 50,000 cases and 200 intercorrelated inputs; 16 of the inputs
were important, 6 strongly so.)
Logistic regression requires an iterative optimization algorithm, so each model fit is much more expensive
than with linear regression, and each step in the stepwise algorithm requires iterative optimization. To find 16
variables (considering only the forward part of stepwise) would take 16k - 120 separate nonlinear
regressions (3,080 when k = 200), each of which might require several iterations. Stepwise logistic
regression is a poor performer. Because its only universally acknowledged advantage is speed, there is
little reason to prefer it for logistic regression (using PROC LOGISTIC).
All-subsets selection is executed in PROC LOGISTIC with the SELECTION=SCORE option (SAS
Institute Inc. 1997). This method only requires that one model be fit (the full model). The results are then
manipulated to calculate a score test for each possible combination of input variables. It also uses a
branch and bound method for efficiently searching the many combinations. This method is the fastest
until the number of possible combinations becomes unmanageable, at which point the performance
acutely deteriorates. If redundant inputs are eliminated first (using variable clustering), then all-subsets
selection can be a practical method for predictive modeling.
When combined with the FAST option, backward variable selection requires only a single logistic
regression (SAS Institute Inc. 1997). PROC LOGISTIC uses the method of Lawless and Singhal (1978)
to manipulate the full model fit to approximate fitting reduced models. The FAST option is extremely
efficient because the model is not refitted for every variable removed. Fast backward elimination had the
best overall performance, a linear increase in time as the number of inputs increased. Note that ordinary
backwards (without the FAST option) would have been slower than stepwise.
PROC LOGISTIC can be used to further reduce the number of input variables. In the MODEL statement,
the SELECTION= option specifies the method, in this example, the backward elimination method. The
FAST option uses the full model fit to approximate the remaining slope estimates for each subsequent
elimination of a variable from the model. The SLSTAY option specifies the significance level for a
variable to stay in the model in a backward elimination step. The significance level⁴ was chosen
arbitrarily for illustrative purposes.
proc logistic data=imputed des;
class res;
model ins=&screened res / selection=backward fast
slstay=.001;
run;
Partial Output
Summary of Backward Elimination
⁴ The choice of significance level is a multi-faceted question. Analysts with experience may be able to posit a
significance level that will secure a model with good generalizing power. If not, the information criterion-based
method that follows or the techniques of the following chapters can be used to find a model that generalizes well.
                                    Standard        Wald
Parameter    DF    Estimate         Error           Chi-Square    Pr > ChiSq
The results show that PROC LOGISTIC using the backward elimination method reduced the number of
variables down from 38 to 18.
The SELECTION=SCORE option finds the best subsets of each model size. The number of models
printed of each size is controlled by the BEST= option. Because the best subsets method does not support
class variables, dummy variables for Res are created in a DATA step.
data imputed;
set imputed;
resr=(res='R');
resu=(res='U');
run;
Partial Output
Regression Models Selected by Score Criterion
Number of     Score
Variables     Chi-Square    Variables Included in Model
1 2781.3404 B_DDABal
2 3711.5967 CD B_DDABal
11 5872.1641 brclus2 Checks MMBal CD Sav brclus3 CC SavBal B_DDABal Inv brclus4
12 5924.1821 brclus2 Checks MMBal CD Sav brclus3 CC DirDep SavBal B_DDABal Inv brclus4
13 5961.7085 brclus2 Checks MMBal CD Sav brclus3 CC AcctAge DirDep SavBal B_DDABal Inv
brclus4
14 6009.4822 brclus2 Checks MMBal CD Sav brclus3 CC AcctAge SavBal B_DDABal ATMAmt Inv
brclus4 ATM
The score test statistic increases with model size. The Schwarz Bayes criterion (SBC) is often used for
model selection. The SBC is essentially the –2 log likelihood plus a penalty term that increases as the
model gets bigger. The penalty term for SBC is (k+1)⋅ln(n), where k is the number of variables in the
model and n is the sample size. Smaller values of SBC are preferable. The score test statistic is
asymptotically equivalent to the likelihood ratio statistic. The –2 log likelihood is a constant minus the
likelihood ratio statistic. Thus, an SBC type statistic could be computed from the score statistic as
–(score) + (k+1)⋅ln(n), where smaller values would be preferable.
When the SELECTION=SCORE option is used, output data sets are not available. Therefore, the Output
Delivery System is used to create an output data set with the score statistic and the number of variables.
The output objects with this information are called NOBS and BESTSUBSETS.
ods listing close;
ods output NObs=NObs
bestsubsets=score;
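   /* A sketch of the PROC LOGISTIC step that the ODS OUTPUT statement   */
   /* above captures; BEST=1 and the use of the resr and resu dummy      */
   /* variables in place of the CLASS variable Res are assumptions based */
   /* on the surrounding text.                                           */
   proc logistic data=imputed des;
      model ins=&screened resr resu / selection=score best=1;
   run;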
ods listing;
Partial Output

The NObs data set (columns Obs, Label, N, NObsRead, NObsUsed, SumFreqsRead, SumFreqsUsed)
contains the number of observations read and used. The first few rows of the score data set show the
score chi-square and the variables in each best model:

Obs    NumberOfVariables    ScoreChiSq    VariablesInModel
  1            1             2781.3404    B_DDABal
  2            2             3711.5967    CD B_DDABal
  3            3             4380.4694    CD SavBal B_DDABal
  4            4             4763.8784    CD Sav SavBal B_DDABal
  5            5             5147.5244    MMBal CD Sav SavBal B_DDABal
The variables of interest to calculate SBC are NumberOfVariables and ScoreChiSq from the
score data set and N from the NObs data set.
The first DATA step selects one observation (the number of observations) and creates a macro variable
obs that contains the number of observations. The second DATA step computes the SBC statistic.
data _NULL_;
set NObs;
where label = 'Number of Observations Used';
call symput('obs',n);
run;
data subset;
set score;
sbc=-scorechisq+log(&obs)*(numberofvariables+1);
run;
Partial Output
Obs sbc
1 -2760.58
2 -3680.45
3 -4338.94
4 -4711.97
5 -5085.23
6 -5303.42
7 -5484.55
8 -5588.03
9 -5646.06
10 -5692.97
11 -5747.58
12 -5789.22
13 -5816.36
14 -5853.76
15 -5875.59
16 -5879.15
17 -5879.46
18 -5879.71
19 -5879.35
20 -5878.21
Obs VariablesInModel
1 B_DDABal
2 CD B_DDABal
3 CD SavBal B_DDABal
4 CD Sav SavBal B_DDABal
5 MMBal CD Sav SavBal B_DDABal
6 Checks MMBal CD Sav SavBal B_DDABal
7 Checks MMBal CD Sav CC SavBal B_DDABal
8 Checks MMBal CD Sav CC brclus1 SavBal B_DDABal
9 Checks MMBal CD Sav CC brclus1 SavBal B_DDABal Inv
10 brclus2 Checks MMBal CD Sav brclus3 CC SavBal B_DDABal brclus4
11 brclus2 Checks MMBal CD Sav brclus3 CC SavBal B_DDABal Inv brclus4
12 brclus2 Checks MMBal CD Sav brclus3 CC DirDep SavBal B_DDABal Inv brclus4
13 brclus2 Checks MMBal CD Sav brclus3 CC AcctAge DirDep SavBal B_DDABal Inv brclus4
14 brclus2 Checks MMBal CD Sav brclus3 CC AcctAge SavBal B_DDABal ATMAmt Inv brclus4 ATM
15 brclus2 Checks MMBal CD Sav brclus3 CC AcctAge DirDep SavBal B_DDABal ATMAmt Inv brclus4 ATM
16 brclus2 Checks MMBal ILSBal CD Sav brclus3 CC AcctAge DirDep SavBal B_DDABal ATMAmt Inv brclu
17 brclus2 Checks MMBal ILSBal CD Sav brclus3 CC AcctAge DirDep SavBal B_DDABal ATMAmt Phone Inv
18 brclus2 Checks MMBal ILSBal NSFAmt CD Sav brclus3 CC AcctAge DirDep SavBal B_DDABal ATMAmt Ph
19 brclus2 Checks MMBal ILSBal NSFAmt CD IRABal Sav brclus3 CC AcctAge DirDep SavBal B_DDABal AT
20 brclus2 Checks MMBal ILSBal NSFAmt CD IRABal Sav brclus3 CC brclus1 AcctAge DirDep SavBal B_D
Careful inspection reveals that the 18-input model has the smallest value of sbc. The following SQL
code creates a macro variable selected that contains the names of the inputs in that model.
proc sql;
select VariablesInModel into :selected
from subset
having sbc=min(sbc);
quit;
These automatic selection routines raise several questions. For techniques like stepwise selection and
backward elimination, what are good stopping rules? For best subsets, what number of inputs yields the
best model?
The answers to these questions lie in the purposes of the models. The goal of most predictive modeling is
generalization. Hence, the best model is the model that generalizes the best. How does one measure
generalizing ability of a model? What are some statistics that summarize a model’s performance? The
next chapters offer several suggestions for comparing and selecting models.
Chapter 4 Measuring Classifier Performance

[Slide: a classifier fit to 10 training cases (accuracy = 70%) applied to 100 new test cases
(accuracy = 47%), plotted in the space of the inputs x1 and x2.]
Evaluating the performance of a classifier on the same data used to train the classifier usually leads to an
optimistically biased assessment.
For example, the above classifier was fit (or more properly overfit) to a 10-case data set. It correctly
classified 70% of the cases. However, when the same classification rule was applied to 100 new cases
from the same distribution, only 47% were correctly classified. This is called overfitting. The model was
overly sensitive to peculiarities of the particular training data, in addition to true features of their joint
distribution.
The more flexible the underlying model and the less plentiful the data, the more of a problem overfitting becomes.
When a relatively inflexible model like (linear) logistic regression is fitted to massive amounts of data,
overfitting may not be a problem (Hand 1997). However, the chance of overfitting is increased by
variable selection methods and supervised input preparation (such as collapsing levels of nominal
variables based on associations with the target). It is prudent to assume overfitting until proven otherwise.
Large differences between the performance on the training and test sets usually indicate overfitting.
Data Splitting

[Slide: the development data is split into training data (used for model fitting), validation data (used for
comparison, selection, and tuning), and test data (used for final assessment).]
The simplest strategy for correcting the optimistic bias is to hold out a portion of the development data for
assessment. The model is fit to the remainder (the training data set) and performance is evaluated on the
holdout portion (the test data set). Usually from one-fourth to one-half of the development data is used as a
test set (Picard and Berk 1990). After assessment, it is common practice to refit the final model on the
entire undivided data set.
When the holdout data is used for comparing, selecting, and tuning models and the chosen model is
assessed on the same data set that was used for comparison, then the optimism principle again applies. In
this situation, the holdout sample is more correctly called a validation data set, not a test set. The test set
is used for a final assessment of a fully specified classifier (Ripley 1996). If model tuning and a final
assessment are both needed, then the data should be split three ways into training, validation, and test sets.
Other Approaches
A B C D E
Train Validate
1) BCDE A
2) ACDE B
3) ABDE C
4) ABCE D
5) ABCD E
Data splitting is a simple but costly technique. When data is scarce, it is inefficient to use only a portion
for training. Furthermore, when the test set is small, the performance measures may be unreliable because
of high variability. For small and moderate data sets, ν-fold cross-validation (Breiman et al. 1984; Ripley
1996; Hand 1997) is a better strategy. In 5-fold cross-validation, for instance, the data would be split into
five equal sets. The entire modeling process would be redone on each four-fifths of the data using the
remaining one-fifth for assessment. The five assessments would then be averaged. In this way, all the data
is used for both training and assessment.
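A minimal sketch, not part of the course code, of assigning the folds for 5-fold cross-validation; the fold variable name and the seed are arbitrary.

data cvfolds;
   set pmlr.develop;
   fold=ceil(5*ranuni(27513));   /* each case lands in one of folds 1-5 */
run;

The modeling process would then be repeated five times, each time training on the cases outside the held-out fold and assessing on the remaining fifth.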
Another approach that is frugal with the data is to assess the model on the same data set that was used for
training but to penalize the assessment for optimism (Ripley 1996). The appropriate penalty can be
determined theoretically or using computationally intensive methods such as the bootstrap.
The objective is to split the data into training and validation sets. The model and all the input preparation
steps then need to be redone on the training set. The validation data will be used for assessment.
Consequently, it needs to be treated as if it were truly new data where the target is unknown. The results
of the analysis on the training data need to be applied to the validation data, not recalculated.
Several input-preparation steps can be done before the data is split. Creating missing indicators should be
done on the full development data because the results will not change. The rho1 macro variable is also
created before the data is split to get the best estimate of the proportion of events.
%let pi1=0.02;
data develop(drop=i);
set pmlr.develop;
/* name the missing indicator variables */
array mi{*} MIAcctAg MIPhone MIPOS MIPOSAmt
MIInv MIInvBal MICC MICCBal
MICCPurc MIIncome MIHMOwn MILORes
MIHMVal MIAge MICRScor;
/* select variables with missing values */
array x{*} acctage phone pos posamt
inv invbal cc ccbal
ccpurc income hmown lores
hmval age crscore;
do i=1 to dim(mi);
mi{i}=(x{i}=.);
end;
run;
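The step that creates the rho1 macro variable is not shown in this extract; a minimal sketch, assuming Ins is the 0/1 target in the develop data set:

proc sql noprint;
   /* proportion of events in the full development data */
   select mean(ins) into :rho1
   from develop;
quit;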
The SURVEYSELECT procedure can be used to select the records for the training and validation data
sets. To create a stratified sample, the data must be sorted by the stratum variable. The SAMPRATE=
option specifies what proportion of the develop data set should be selected. The default behavior of
PROC SURVEYSELECT is to output the sample, not the entire data set, so the OUTALL option can be
used to return the initial data set augmented by a flag to indicate selection in the sample.
Of course, this flag indicates membership in the training and validation data sets in this context. The
FREQ procedure verifies the stratification. The SEED= option enables the user to control what series of
pseudo-random numbers is generated to do the partitioning. A particular number greater than zero produces the same split each time the SURVEYSELECT procedure is run. If the seed is zero, then the data is split differently each time the procedure is run.
proc sort data=develop out=develop;
by ins;
run;
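The PROC SURVEYSELECT and PROC FREQ calls themselves are not reproduced here; a sketch consistent with the options described above and with the frequencies shown below (the two-thirds sampling rate and the seed value are assumptions):

proc surveyselect data=develop
                  samprate=.6667
                  out=develop
                  seed=27513
                  outall;
   strata ins;
run;

proc freq data=develop;
   tables ins*selected;
run;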
Table of Ins by Selected

   Ins        Selected=0    Selected=1      Total
   0                7028         14061      21089   (65.36%)
   1                3724          7451      11175   (34.64%)
   Total           10752         21512      32264

Within both the Selected=0 and Selected=1 groups, 65.36% of the cases have Ins=0 and 34.64% have Ins=1, so the stratification preserves the event proportion.
The next DATA step creates the two data sets, Train and Valid. All other input-preparation steps are
done after the split, on the training data only, because the validation data is treated as if it were truly new
cases. This is imperative for supervised methods such as collapsing the levels of Branch based on its
association with Ins. It could be argued that unsupervised methods, such as median imputation, could be
done on Develop because they do not involve Ins. This example presents a cautious approach and does
not involve the validation data at all.
This conservative approach also permits the comparison of models based on different imputation
schemes.
data train valid;
set develop;
if selected then output train;
else output valid;
run;
The STDIZE procedure imputes missing values. Here, as before, the median is used.
proc stdize data=train
reponly
method=median
out=train1;
var &inputs;
run;
The CLUSTER procedure is used to perform Greenacre’s correspondence analysis; this is the method
used to reduce the cardinality of the nominal input Branch.
proc means data=train1 noprint nway;
class branch;
var ins;
output out=level mean=prop;
run;
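The steps that produce the chi and cluster data sets used below are not shown in this extract. A sketch of the usual approach (the OUTTREE= data set name and the ODS table and option names are assumptions):

proc freq data=train1 noprint;
   tables branch*ins / chisq;
   /* _PCHI_ is the Pearson chi-square for the full Branch*Ins table */
   output out=chi(keep=_pchi_) chisq;
run;

ods listing close;
ods output clusterhistory=cluster;

proc cluster data=level method=ward rsquare outtree=fortree;
   /* cluster the branch-level event proportions */
   freq _freq_;
   var prop;
   id branch;
run;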
ods listing;
data cutoff;
if _n_ = 1 then set chi;
set cluster;
chisquare=_pchi_*rsquared;
degfree=numberofclusters-1;
logpvalue=logsdf('CHISQ',chisquare,degfree);
run;
proc plot data=cutoff;
plot logpvalue*numberofclusters/vpos=30;
run; quit;
(Plot: logpvalue against Number Of Clusters, for 1 through 18 clusters.)
NOTE: 1 obs had missing values.
The SQL procedure is used to create a macro variable that contains the number of clusters associated with the smallest p-value. The TREE procedure then creates a data set containing the information about that
cluster solution.
proc sql;
select NumberOfClusters into :ncl
from cutoff
having logpvalue=min(logpvalue);
quit;
   Number Of Clusters
                    4
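The PROC TREE step that produces the cluster assignments listed below is not shown; a sketch, assuming the tree data set fortree from the clustering step and an output data set named clus (these names, and the NOPRINT option, are assumptions):

proc tree data=fortree nclusters=&ncl out=clus noprint;
   id branch;
run;

proc sort data=clus;
   by clusname;
run;

proc print data=clus;
   by clusname;
   id clusname;
run;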
   Cluster Name   Branches                                        Cluster
   B14            B14                                                4
   CL4            B17 B6 B10 B19 B3 B8 B18 B5 B9 B12 B13 B1 B4       1
   CL5            B15 B16                                            3
   CL8            B11 B7 B2                                          2
Three dummy variables are created, and the cluster with the response rate closest to ρ1 is made the
reference level.
data train1;
set train1;
brclus1=(branch='B14');
brclus2=(branch in ('B12','B5','B8',
'B3','B18','B19','B17',
'B4','B6','B10','B9',
'B1','B13'));
brclus3=(branch in ('B15','B16'));
run;
The VARCLUS procedure is used to group correlated inputs with one another. This allows variable
reduction based on redundancy within the inputs.
ods listing close;
ods output clusterquality=summary
rsquare=clusters;
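The PROC VARCLUS call that these ODS statements capture is not shown in this extract. A sketch under the assumption that the clustering is run on the imputed numeric inputs, the branch dummies, and the missing indicators, with a MAXEIGEN= stopping rule (the option value and the exact variable list are assumptions):

proc varclus data=train1 maxeigen=.7 short hi;
   /* MIAcctAg--MICRScor is a positional range; it relies on the
      missing indicators having been created consecutively */
   var &inputs brclus1-brclus3 MIAcctAg--MICRScor;
run;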
data _null_;
set summary;
call symput('nvar',compress(NumberOfClusters));
run;
(PROC VARCLUS R-square listing by Cluster, Variable, R-square Ratio, and Variable Label; the table body is not reproduced here.)
A macro variable reduced is created with the names of the selected cluster representatives.
%let reduced=
MIPhone MIIncome Teller MM
Income ILS LOC POSAmt
NSFAmt CD LORes CCPurc
ATMAmt brclus2 Inv Dep
CashBk Moved IRA CRScore
MIAcctAg IRABal MICRScor MTGBal
AcctAge SavBal DDABal SDB
InArea Sav Phone CCBal
InvBal MTG HMOwn DepAmt
DirDep ATM brclus1 Age;
Using correlation coefficients, the list of potential inputs can be further screened. Recall that the purpose
of this step is to eliminate poor performers, not select the top inputs.
ods listing close;
ods output spearmancorr=spearman
hoeffdingcorr=hoeffding;
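The PROC CORR call captured by these ODS statements, and the intermediate steps that turn the raw spearman and hoeffding tables into the spearman1 and hoeffding1 data sets merged below, are not shown in this extract. A sketch of the correlation call (the RANK option and the VAR/WITH arrangement are assumptions):

proc corr data=train1 spearman hoeffding rank;
   var &reduced;
   with ins;
run;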
ods listing;
data correlations;
merge spearman1 hoeffding1;
by variable;
run;
Dropping any variable that has both the Hoeffding and the Spearman p-value above .5 leaves 36 inputs.
For each bin, the empirical logit is computed from the number of events in the bin (Ins) and the bin size (_FREQ_), with a smoothing term of sqrt(_FREQ_)/2 added to the numerator and denominator:

data bins;
   set bins;
   elogit=log((ins+(sqrt(_FREQ_)/2))/
              (_FREQ_-ins+(sqrt(_FREQ_)/2)));
run;
1 You might also consider DepAmt and ATMAmt as skewed inputs.
As before, the logical imputation of a $0 balance for customers without a DDA account will be replaced
with mean imputation to account for the spike. Also, the rank-transformed DDABal looks like it has a
more linear relationship with the target, so that is an attractive option.
Performing a similar evaluation on SavBal reveals two things: the spike induced by logical imputation
of $0 balances for customers without savings accounts is not as drastic as it was for DDABal, and the
relationship would be (roughly) linear if the top 5% or 10% of individuals had their balances capped at
the 95th or 90th percentile. Hence, a reasonable “transformation” of SavBal might be simply replacing
every balance greater than some number with that number.
This capping could be handled using the UNIVARIATE procedure (or the MEANS procedure) to find the
percentile and using macro variables to pass that information into DATA step processing, but rudimentary
visual inspection indicates that the 95th percentile of savings balance is around $16,000. This appears to
be a point where the relationship is turning a corner; hence, it seems like a good value for the cap.
The binned savings balance does not have the overwhelming linear relationship that was seen in the
earlier DDABal plots. Hence, the transformation for the SavBal input is probably best left at this
capping.
To handle these transformations, first tend to the DDABal variable. For individuals who have no checking account, replace the imputed $0 balance with the mean balance of those who do have a checking account.
proc sql;
select mean(ddabal) into :mean
from train1 where dda;
quit;
   2620.85
data train1;
set train1;
if not dda then ddabal = &mean;
run;
The rank-group or percentile transformation looks like it will help linearize the relationship between the
logits and the predictor. The MEANS procedure calculates the endpoints of the bins, and the DATA step
is used to create DATA step code that will perform the rank-group transformation. The final DATA step
uses the binning code to create B_DDABal and to truncate the larger values of SavBal.
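The steps that create the endpts data set and the rank fileref used in the DATA step below are not shown; a sketch, assuming DDABal is grouped into 100 rank bins and that rank is a temporary file:

filename rank temp;

proc rank data=train1 groups=100 out=ranked;
   var ddabal;
   ranks bin;
run;

proc means data=ranked noprint nway;
   class bin;
   var ddabal;
   /* max = upper endpoint of each rank bin */
   output out=endpts max=max;
run;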
data _null_;
file rank;
set endpts end=last;
if _n_ = 1 then put "select;";
if not last then do;
put " when (ddabal <= " max ") B_DDABal =" bin ";";
end;
else if last then do;
put "otherwise B_DDABal =" bin ";";
put "end;";
end;
run;
data train1;
set train1;
%include rank /source2;
if savbal > 16000 then savbal=16000;
run;
After performing the transformation, replace DDABal with the transformed input B_DDABal in the
screened macro variable.
%let screened =
MIPhone Teller MM
Income ILS LOC POSAmt
NSFAmt CD CCPurc
ATMAmt brclus2 INV DEP
CashBk IRA CRScore
MIAcctAg IRABal MICRScor MTGBal
AcctAge SavBal B_DDABal SDB
InArea Sav Phone CCBal
INVBal MTG DEPAmt
DirDep ATM brclus1 Age;
Using, as an example, the fast backward elimination input selection algorithm, a model with 19 inputs is
selected.
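The PROC LOGISTIC call is not reproduced here; a sketch consistent with the description above (the SLSTAY= value is an assumption):

proc logistic data=train1 des;
   model ins=&screened / selection=backward fast slstay=.001;
run;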
(Partial output: the Wald chi-square tests for the remaining effects and the Analysis of Maximum Likelihood Estimates table are not reproduced here.)
The 19 inputs in the model are named in the macro variable selected.
%let selected =
MIPhone TELLER MM LOC CD
ATMAMT brclus2 INV DEP IRA
MTGBAL ACCTAGE SAVBAL B_DDABal SAV
PHONE CCBAL DIRDEP ATM;
To assess the model generalization performance, the validation data need to be prepared for scoring the
same way that the training data was prepared for model building. Missing values need to be imputed,
new inputs need to be created, and any transformations need to be applied. The MEANS procedure can
be used to see which inputs need imputation on this valid data set.
proc means data = valid nmiss;
var MIPhone Teller MM LOC CD
ATMAmt INV DEP IRA MTGBal
AcctAge SavBal DDABal Sav
Phone CCBal DirDep ATM;
run;
   Variable   Label                        N Miss
   MIPhone                                      0
   Teller     Teller Visits                     0
   MM         Money Market                      0
   LOC        Line of Credit                    0
   CD         Certificate of Deposit            0
   ATMAmt     ATM Withdrawal Amount             0
   Inv                                       1369
   Dep                                          0
   IRA        Retirement Account                0
   MTGBal     Mortgage Balance                  0
   AcctAge    Age of Oldest Account            687
   SavBal     Saving Balance                    0
   DDABal     Checking Balance                  0
   Sav        Saving Account                    0
   Phone      Number Telephone Banking       1369
   CCBal      Credit Card Balance            1369
   DirDep     Direct Deposit                    0
   ATM        ATM                               0
In the validation data set, missing values should be replaced with the medians from the training data set.
Because the variables AcctAge, Phone, Inv and CCBal have missing values, PROC UNIVARIATE is
used to create an output data set with the medians of those variables. The PCTLPTS option requests the
50th percentile and the PCTLPRE option specifies the prefix for the variable names in the output data set.
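The PROC UNIVARIATE call (and the DATA step that combines these medians with the validation data to create valid1) is not reproduced here; a sketch of the percentile request described above, assuming the medians come from the unimputed train data set:

proc univariate data=train noprint;
   var acctage phone inv ccbal;
   /* output variables are named acctage50, phone50, inv50, ccbal50 */
   output out=medians
          pctlpts=50
          pctlpre=acctage phone inv ccbal;
run;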
BrClus1 and BrClus3 are not used by this model, but if you are considering several possible
models, it may be easier to create one validation data set that has received exactly the same
treatment as the training data set so that assessing model performance is simplified. To that end, it
may be useful to know about the code that follows.
Now that the validation data is prepared for scoring, you could use any of the techniques discussed in Chapter 2 to score it. However, this raises a question: what metrics can you use to measure model
performance? This question and the related question of optimal use of a model are the foci of the
following sections.
As before, use the STDIZE procedure with the REPONLY option to impute missing values on the training
data. In addition, specify the OUTSTAT= option to save the imputed values in a separate data set, called
med.
proc stdize data = train out=train2
method=median reponly
OUTSTAT=med;
var &inputs;
run;
After the med data set has been created, it can be used to impute for missing values in a different data set.
The code below creates a data set, valid2, based on the unimputed valid data, with the medians from
the training data imputed. The option to specify the data set with the median information is
METHOD=IN(data-set-name).
proc stdize data=valid out=valid2
reponly method=in(med);
var &inputs;
run;
To see that the values imputed for AcctAge, Phone, Inv, and CCBal are the same in valid1
(created above) as in valid2 (created here) you can use the COMPARE procedure.
proc compare base= valid1 compare=valid2;
var acctage phone inv ccbal;
run;
Variables Summary and Observation Summary: both data sets contain observations 1 through 10752.
NOTE: No unequal values were found. All values compared are exactly equal.
The train2 and valid2 data sets will be used in the next chapter; in order to generate a series of
models, you probably want to impute for all inputs first. This way, you can evaluate model performance
on the validation data without having to intervene by hand to impute for missing values.
4.2 Misclassification
Confusion Matrix

(Slide: a two-by-two table of actual class by predicted class; predicted class 0 cases are the predicted negatives and predicted class 1 cases are the predicted positives.)
Supervised classification does not usually end with an estimate of the posterior probability. An allocation
rule corresponds to a threshold value (cutoff) of the posterior probability. For example, all cases with
probabilities of default greater than .04 might be rejected for a loan. For a given cutoff, how well does the
classifier perform?
The fundamental assessment tool is the confusion matrix. The confusion matrix is a crosstabulation of the
actual and predicted classes. It quantifies the confusion of the classifier. The event of interest, whether it
is unfavorable (like fraud, churn, or default) or favorable (like response to offer), is often called a
positive, although this convention is arbitrary. The simplest performance statistics are accuracy
(true positives and negatives) / (total cases)
and error rate
(false positives and negatives) / (total cases).
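As a quick worked example using the first confusion matrix shown later in the Cutoffs table (70 true negatives, 5 false positives, 9 false negatives, and 16 true positives among 100 cases): accuracy = (70 + 16)/100 = .86 and error rate = (5 + 9)/100 = .14.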
(Slide: in the crosstabulation of actual class by predicted class, the cell where actual class 1 meets predicted class 1 holds the true positives; the actual positives form the class 1 row and the predicted positives form the class 1 column.)
ROC Curve

(Plot: sensitivity against 1 − specificity, traced out over the full range of cutoff values.)
The receiver operating characteristic (ROC) curve was adapted from signal detection theory for the
assessment of classifiers. The ROC curve displays the sensitivity and specificity for the entire range of
cutoff values. As the cutoff decreases, more and more cases are allocated to class 1; hence, the sensitivity
increases and specificity decreases. As the cutoff increases, more and more cases are allocated to class 0,
hence the sensitivity decreases and specificity increases. Consequently, the ROC curve intersects (0,0)
and (1,1). If the posterior probabilities were arbitrarily assigned to the cases, then the ratio of false
positives to true positives would be the same as the ratio of the total actual negatives to the total actual
positives. Consequently, the baseline (random) model is a 45° angle going through the origin. As the
ROC curve bows above the diagonal, the predictive power increases. A perfect model would reach the
(0,1) point where both sensitivity and specificity equal 1.
Gains Chart

(Plot: PV+ against depth (%), from 0 to 100, traced out as the cutoff varies.)
The depth of a classification rule is the total proportion of cases that were allocated to class 1. The
(cumulative) gains chart displays the positive predicted value and depth for a range of cutoff values.
As the cutoff decreases, more and more cases are allocated to class 1; hence, the depth increases and
the PV+ approaches the marginal event rate. When the cutoff is minimum, then 100% of the cases are
selected and the response rate is ρ1. As the cutoff increases the depth decreases. A model with good
predictive power would have increasing PV+ (response rate) as the depth decreases. If the posterior
probabilities were arbitrarily assigned to the cases, then the gains chart would be a horizontal line
at ρ1.
The gains chart is widely used in database marketing to decide how deep in a database to go with a
promotion. The simplest way to construct this curve is to sort and bin the predicted posterior probabilities
(for example, deciles). The gains chart is easily augmented with revenue and cost information. The lift is
PV+/ρ1, so for a given depth, there are (lift)× more responders targeted by the model than by random
chance.
A plot of sensitivity versus depth is sometimes called a Lorenz curve, concentration curve, or a lift curve
(although lift value is not explicitly displayed). This plot and the ROC curve are very similar because
depth and 1–specificity are monotonically related.
                     Sample                          Population
                   Predicted                         Predicted
                   0       1    Total                0       1    Total
   Actual   0     29      21       50               56      41       97
            1     17      33       50                1       2        3
   Total          46      54      100               57      43      100
If the holdout data was obtained by splitting oversampled data, then it is oversampled as well. If the
proper adjustments were made when the model was fitted, then the predicted posterior probabilities
are correct. However, the confusion matrices would be incorrect (with regard to the population) because
the event cases are over-represented. Consequently, PV+ (response rate) might be badly overestimated.
Sensitivity and specificity, however, are not affected by separate sampling because they do not depend
on the proportion of each class in the sample.
                            Predicted Class
                          0               1             Total
   Actual Class   0    π0 ⋅ Sp        π0 ⋅ (1 − Sp)       π0
                  1    π1 ⋅ (1 − Se)  π1 ⋅ Se             π1
Knowing sensitivity, specificity, and the priors is sufficient for adjusting the confusion matrix for
oversampling. For example, if the sample represented the population, then nπ1 cases are in class 1. The
proportion of those that were allocated to class 1 is Se. Thus, there are nπ1⋅Se true positives. Note that
these adjustments are equivalent to multiplying the cell counts by their sample weights, for example
   TP_sample ⋅ wt1 = TP_sample ⋅ (π1 / ρ1) = TP_sample ⋅ (n ⋅ π1 / TotPos_sample) = n ⋅ π1 ⋅ Se

where TP_sample is the number of true positives in the sample, TotPos_sample is the number of actual positives in the sample, and the sample weight for class i is defined as πi / ρi.
The performance measures need to be calculated on the validation data. One approach would be to use
the SCORE procedure to score the validation data and then use DATA steps and the FREQ procedure to
calculate misclassification measures for different cutoffs. An easier approach is to score the validation
data inside the LOGISTIC procedure and use the OUTROC= data set, which contains many of the
statistics necessary for assessment. The SCORE statement allows you to output this data set.
The OUTROC= option creates an output data set with sensitivity (_SENSIT_) and one minus specificity
(_1MSPEC_) calculated for a full range of cutoff probabilities (_PROB_). The other statistics in the
OUTROC= data set are not useful when the data is oversampled. The two variables _SENSIT_ and
_1MSPEC_ in the OUTROC= data set are correct whether or not the validation data is oversampled. The
variable _PROB_ is correct, provided that the PRIOREVENT= option was set to π1. If it was not, then _PROB_ needs to be adjusted using the formula

   p̂i = ( p̂i* ⋅ ρ0 ⋅ π1 ) / ( (1 − p̂i*) ⋅ ρ1 ⋅ π0 + p̂i* ⋅ ρ0 ⋅ π1 )

where p̂i* is the unadjusted estimate of the posterior probability (_PROB_). The Scoval (scored validation) data set will be used later.
proc logistic data=train1 des;
model ins=&selected;
score data=valid1 out=scoval
priorevent=&pi1 outroc=roc;
run;
Response Profile

   Ordered                  Total
     Value      Ins     Frequency
         1        1          7451
         2        0         14061

(The Model Information, model fit statistics, and parameter estimates tables are not reproduced here.)
Knowledge of the population priors and sensitivity and specificity is sufficient to fill in the confusion
matrices. Several additional statistics can be calculated in a DATA step:
• TP = proportion of true positives
• FN = proportion of false negatives
• TN = proportion of true negatives
• FP = proportion of false positives
• POSPV = positive predicted value
• NEGPV = negative predicted value
• ACC = accuracy
• DEPTH = proportion allocated to class 1
• LIFT = positive predicted value/π1.
Each row in the OUTROC= data set corresponds to a cutoff (_PROB_). The selected cutoffs occur where values of the estimated posterior probability change, provided that consecutive probabilities are more than .0001 apart; otherwise, they are grouped. Consequently, the maximum number of rows in the OUTROC= data set is 9999, but it is usually far fewer. The grouping can be made coarser by using the ROCEPS=
option in the MODEL statement.
data roc;
set roc;
cutoff=_PROB_;
specif=1-_1MSPEC_;
tp=&pi1*_SENSIT_;
fn=&pi1*(1-_SENSIT_);
tn=(1-&pi1)*specif;
fp=(1-&pi1)*_1MSPEC_;
depth=tp+fp;
pospv=tp/depth;
negpv=tn/(1-depth);
acc=tp+tn;
lift=pospv/&pi1;
keep cutoff tn fp fn tp
_SENSIT_ _1MSPEC_ specif depth
pospv negpv acc lift;
run;
ROC curves can be created by plotting _SENSIT_ versus _1MSPEC_. Gains charts can be created by
plotting PosPV versus Depth or Lift versus Depth. Often, overlying many statistics versus Depth
is informative. These plots follow.
To generate the ROC curve, use the GPLOT procedure. The SYMBOL statement defines the line that will
connect the points: join the points (I=JOIN), use no symbol to plot each point (V=NONE), and color the
line black.
symbol i=join v=none c=black;
proc gplot data = roc;
title "ROC Curve for the Validation Data Set";
plot _SENSIT_*_1MSPEC_;
run; quit;
This plot could be improved, if desired, by making the axes of equal length. Also, the baseline through
(0,0) and (1,1) is a useful addition. Finally, the labels on the axes are redundant. An ROC curve is defined
by the values plotted, so you could remove the labels. The length of the axes can be controlled in the
AXIS statement with the LENGTH= option and the labels can be removed from the axes by specifying
LABEL=NONE. The ORDER= option in the AXIS statement enables you to control the major tick
marks. To overlay the ROC curve with the straight line, you can add _1MSPEC_*_1MSPEC_ to the
PLOT statement and specify the OVERLAY option following a slash. Since you will be plotting two sets
of points, a SYMBOL2 statement specifies the plotting options for the second set of points listed.
axis order=(0 to 1 by .1) label=none length=4in;
symbol i=join v=none c=black;
symbol2 i=join v=none c=black;
proc gplot data = roc;
title "ROC Curve for the Validation Data Set";
plot _SENSIT_*_1MSPEC_ _1MSPEC_*_1MSPEC_
/ overlay vaxis=axis haxis=axis;
run; quit;
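The listing for the first version of the lift chart does not appear in this extract; presumably it is a simple plot of lift against depth from the roc data set, along these lines (the title and options here are assumptions):

symbol i=join v=none c=black;
proc gplot data=roc;
   title "Lift Chart for Validation Data";
   plot lift*depth;
run; quit;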
To improve this graph, consider adding a reference line at the base line (a lift of 1) and restricting the
focus to the region where depth is greater than 0.5% (less erratic) and less than 50%. The final TITLE
statement, with no options, clears the title. The VREF= option in the PLOT statement puts a reference line
on the vertical axis.
symbol i=join v=none c=black;
proc gplot data=roc;
where 0.005 < depth < 0.50;
title "Lift Chart for Validation Data";
plot lift*depth / vref=1;
run; quit;
title;
Finally, a plot of sensitivity, specificity, positive predicted value, and depth against cutoff may help guide
you to an appropriate cutoff.2 Allocation rules are discussed in more detail in the next section, but
considering a plot that shows how several statistics of interest react to changes in the cutoff may be
useful.
Specify the plotting controls in SYMBOL statements, as before. To be able to discriminate between lines,
you can specify L= for different line types (from 1 to 46), and you can change the width of the lines with
the W= option. Finally, to include a legend with your plot, include the LEGEND option in the PLOT
statement.
symbol i=join v=none c=black l=1 w=1;
symbol2 i=join v=none c=black l=2 w=2;
symbol3 i=join v=none c=black l=3 w=3;
symbol4 i=join v=none c=black l=4 w=4;
proc gplot data=roc;
title "Selected Statistics against Cutoff";
plot (_SENSIT_ specif depth pospv)*cutoff
/ overlay legend;
run; quit;
2 Recall that cutoff refers to the probability that defines your decision rule. Individuals with a posterior probability greater than the cutoff are predicted to be events.
This plot could be improved by a few simple steps. With a target this rare, evaluating cutoffs above 0.1
makes little sense; restrict the focus. In that range, the depth and sensitivity lines are on a different scale
than the specificity and positive predicted value. Hence, plot depth and sensitivity on the right axis.
Finally, add a reference line to the horizontal axis at a cutoff of π1.
The SYMBOL statements can be used to differentiate lines. In the GPLOT procedure, the PLOT2
statement has syntax just like the PLOT statement, except it plots against the right side axis. The
HREF= option specifies the point at which to draw a reference line on the horizontal axis. C=LIGR
specifies that those lines should be drawn in light gray.
symbol i=join v=none c=black l=1 w=1;
symbol2 i=join v=none c=black l=2 w=1;
symbol3 i=join v=none c=ligr l=1 w=3;
symbol4 i=join v=none c=ligr l=2 w=3;
proc gplot data=roc;
where cutoff < 0.1;
title "Selected Statistics against Cutoff";
plot (specif pospv)*cutoff
/ overlay legend href=&pi1;
plot2 (depth _sensit_)*cutoff
/overlay legend;
run; quit;
title;
4.3 Allocation Rules
Cutoffs

Three cutoffs applied to the same 100 cases (counts are true negatives, false positives, false negatives, and true positives):

   Cutoff       TN    FP    FN    TP      Se     Sp
   Highest      70     5     9    16     .64    .93
   Middle       66     9     4    21     .84    .88
   Lowest       57    18     1    24     .96    .76
Different cutoffs produce different allocations and different confusion matrices. To determine the optimal
cutoff, a performance criterion needs to be defined. If the goal were to increase the sensitivity of the
classifier, then the optimal classifier would allocate all cases to class 1. If the goal were to increase
specificity, then the optimal classifier would be to allocate all cases to class 0. For realistic data, there is a
trade-off between sensitivity and specificity. Higher cutoffs decrease sensitivity and increase specificity.
Lower cutoffs decrease specificity and increase sensitivity.
Misclassification Costs

Cost matrix (actual class in rows, predicted class in columns): correct classifications cost 0, a false positive costs 1, and a false negative costs 4. Applying these costs to the three confusion matrices above gives the total costs:

   Confusion matrix (TN, FP, FN, TP)    Total Cost
   70,  5,  9, 16                       9*4 + 5  = 41
   66,  9,  4, 21                       4*4 + 9  = 25
   57, 18,  1, 24                       1*4 + 18 = 22
A formal approach to determining the optimal cutoff uses statistical decision theory (McLachlan 1992;
Ripley 1996; Hand 1997). The decision-theoretic approach starts by assigning misclassification costs
(losses) to each type of error (false positives and false negatives). The optimal decision rule minimizes
the total expected cost (risk).
Bayes Rule

Allocate a case to class 1 if

   p > 1 / (1 + cost_FN / cost_FP)
The Bayes rule is the decision rule that minimizes the expected cost. In the two-class situation, the Bayes rule can be determined analytically. If you classify a case into class 1, then the cost is

   (1 − p) ⋅ cost_FP

where p is the true posterior probability that the case belongs to class 1. If you classify the case into class 0, then the cost is

   p ⋅ cost_FN

Allocate the case to class 1 when (1 − p) ⋅ cost_FP < p ⋅ cost_FN; otherwise, allocate the case to class 0. Solving for p gives the optimal cutoff probability. Because p must be estimated from the data, the plug-in Bayes rule is used in practice:

   p̂ > 1 / (1 + cost_FN / cost_FP)

Consequently, the plug-in Bayes rule may not achieve the minimum cost if the posterior probability is poorly estimated.
Note that the Bayes rule only depends on the ratio of the costs, not on their actual values.
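For example, with the misclassification costs shown earlier (cost_FN = 4, cost_FP = 1), the plug-in Bayes rule allocates a case to class 1 when p̂ > 1/(1 + 4) = .2.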
If the misclassification costs are equal, then the Bayes rule corresponds to a cutoff of 0.5. The expected cost (risk) equals

   cost_FN ⋅ π1 ⋅ (1 − Se) + cost_FP ⋅ π0 ⋅ (1 − Sp)

When the cost ratio equals one, the expected cost is proportional to the error rate. A .5 cutoff tends to
minimize error rate (maximize accuracy). Hand (1997) commented that
The use of error rate often suggests insufficiently careful thought about the real objectives….
When the target event is rare, the cost of a false negative is usually greater than the cost of a false
positive. The cost of not soliciting a responder is greater than the cost of sending a promotion to someone
who does not respond. The cost of accepting an applicant who will default is greater than the cost of
rejecting someone who would pay off the loan. The cost of approving a fraudulent transaction is greater
than the cost of denying a legitimate one. Such considerations dictate cutoffs that are less (often much
less) than .5.
Cost Ratio

(Plot: sensitivity, specificity, and mean(Se, Sp) as functions of the cost ratio CFN/CFP, from 0 to +∞; the prior ratio π0/π1 is marked on the horizontal axis.)
In many situations, it is difficult to precisely quantify the cost ratio, costFN/costFP. Examining the
performance of a classifier over a range of cost ratios can be useful. A pertinent range of cost ratios is
usually centered on the ratio of priors π0/π1. This corresponds to a cutoff of π1
   p > 1 / (1 + π0/π1) = π1
For instance, if the nonevent cases were nine times more prevalent than the event cases, then a false
negative would be nine times more costly than a false positive and the cutoff would be 0.1. When the
cost ratio equals π0/π1, the expected cost is equivalent to the negative sum of sensitivity and specificity.
The central cutoff, π1, tends to maximize the mean of sensitivity and specificity. Because increasing
sensitivity usually corresponds to decreasing specificity, the central cutoff tends to equalize sensitivity
and specificity.
If separate samples were taken with equal allocation (50% events and 50% nonevents), then using the
unadjusted cutoff of 0.5 on the biased sample is equivalent to using the central cutoff, π1, on the
population.
4.4 Overall Predictive Power
Profit Matrix

Profit matrix (actual class in rows, predicted class in columns): a true positive earns 4, a false positive earns −1, and decisions for class 0 earn 0. Applying these profits to the same three confusion matrices gives the total profits:

   Confusion matrix (TN, FP, FN, TP)    Total Profit
   70,  5,  9, 16                       16*4 − 5  = 59
   66,  9,  4, 21                       21*4 − 9  = 75
   57, 18,  1, 24                       24*4 − 18 = 78
Defining a profit matrix (instead of a cost matrix, as in the last section) will not lead to a different
classification rule. It does point to a useful statistic for measuring classifier performance. The model
yields posterior probabilities, and those probabilities (in conjunction with a profit or cost matrix) classify
individuals into likely positives and likely negatives. On the validation data, the behavior of these
individuals is known; hence, it is feasible to calculate each individual’s expected profit, and hence it is
also feasible to calculate a total profit. This total profit can be used as a model selection and assessment
criterion.
Gathering profit information can be difficult. If profit information can be used, however, it permits a more
familiar scale for comparing models. Consumers of models may not have a feel for what kind of lift they
expect, or what constitutes a good value for sensitivity. Using total or average profit as an assessment
statistic may skirt those issues.
Consider the following profit matrix:
                           Predicted Class
                       0 (ignore)    1 (solicit)
   Actual Class  0         $0            −$1
                 1         $0            $99
This profit matrix is consistent with a marketing effort that costs $1, and, when successful, garners
revenue of $100. Hence, the profit for soliciting a non-responder is -$1, and the profit for soliciting a
responder is $100-$1 = $99. Given that each individual has a posterior probability pi, you can resort either
to the Bayes rule or simple algebra to find the optimum cutoff.
A typical decision rule would be: Solicit if the expected profit for soliciting, given the posterior
probability, is higher than the expected profit for ignoring the customer.
Solicit if:
E(Profit | pi, solicit) > E(Profit | pi, do not solicit)
pi*99 + (1-pi)*(-1) > pi*0 + (1-pi)*(0)
99*pi -1 + pi > 0
100*pi -1 > 0
pi > 0.01.
This cutoff of 0.01 can be used to calculate the expected profit of using this rule with the current model.
In order to calculate total and average profit comparable to what would be achieved in the population,
weights must be calculated. The decision variable is created as a flag indicating whether the predicted
probability is greater than the cutoff, 0.01. Using the information about decision and response, the profit
per individual is calculated. These profits are summed and averaged by the MEANS procedure.
data scoval;
set scoval;
sampwt = (&pi1/&rho1)*(INS)
+ ((1-&pi1)/(1-&rho1))*(1-INS);
decision = (p_1 > 0.01);
profit = decision*INS*99
- decision*(1-INS)*1;
run;
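The PROC MEANS call is not reproduced here; a sketch consistent with the Sum and Mean shown below (the WEIGHT statement applies the sampling weights so that the totals reflect the population):

proc means data=scoval sum mean;
   weight sampwt;
   var profit;
run;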
   Sum              Mean
   13269.60         1.2341397
Using this model to score a population of, say, 1,000,000 individuals and soliciting only those with pi
greater than 0.01 would yield a total expected profit of $1,234,139.70. With the above profit matrix and
π1=0.02, the “solicit everyone” rule generates a profit of 1,000,000*0.02*99-1,000,000*.98*1=
$1,000,000. Using the model and some elementary decision theory leads to better decisions and more
profit. Other models can be compared to this current model with this statistic.
To see how other cutoffs fare, you can use the information in the ROC data set to draw a plot of how
average profit changes as a function of the cutoff (or the solicited depth).
Using the true positive and false positive rates calculated earlier, you can calculate the average profit for
each of the cutoffs considered in the ROC data set. A false positive (fp) is an individual who is solicited
but does not respond. Hence, the cost of soliciting the false positives, on average, is the $1 solicitation
cost times the false positive rate. Likewise, the profit associated with individuals who are solicited and
respond is $99 times the true positive rate, tp. The difference of these two terms is the average profit.
Using the SYMBOL statements, specify two rules for plotting the points: the first rule plots short (H=.5),
light gray (C=GRAY) plus signs (V=PLUS) for the points; the second line joins the points with a thick
(W=2), smoothed spline (I=SM##) that is colored black (C=BLACK). The plot is restricted to the region
around the cutoff of 0.01.
data roc;
set roc;
AveProf = 99*tp - 1*fp;
run;
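The plotting code described above is not reproduced here; a sketch (the WHERE range and the SM smoothing value are assumptions):

proc sort data=roc;
   by cutoff;
run;

symbol1 i=none v=plus h=.5 c=gray;
symbol2 i=sm50 v=none w=2 c=black;
proc gplot data=roc;
   where .001 < cutoff < .05;
   title "Average Profit against Cutoff";
   plot aveprof*cutoff=1 aveprof*cutoff=2 / overlay;
run; quit;
title;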
Class Separation

(Plot: densities of the predicted posterior probabilities for class 0 and class 1, for a model with less separation and a model with more separation.)
Statistics such as sensitivity, positive predictive value, and risk depend on the choice of cutoff value.
Statistics that summarize the performance of a classifier across a range of cutoffs can also be useful for
assessing global discriminatory power. One approach is to measure the separation between the predicted
posterior probabilities for each class. The more that the distributions overlap, the weaker the model.
The simplest statistics are based on the difference between the means of the two distributions. In credit
scoring, the divergence statistic is a scaled difference between the means (Nelson 1997). Hand (1997)
discusses several summary measures based on the difference between the means.
The well-known t-test for comparing two distributions is based on the difference between the means.
The t-test has many optimal properties when the two distributions are symmetric with equal variance
(and have light tails). However, the distributions of the predicted posterior probabilities are typically
asymmetric with unequal variance. Many other two-sample tests have been devised for nonnormal
distributions (Conover 1980).
K-S Statistic

(Plot: the probability density functions and empirical distribution functions of the posterior probability for the two classes; D = .49 is the maximum vertical distance between the two EDFs.)
The Kolmogorov-Smirnov two-sample test is based on the distance between the empirical distribution
functions (Conover 1980). The test statistic, D, is the maximum vertical difference between the
cumulative distributions. If D equals zero, the distributions are everywhere identical. If D>0, then there is
some posterior probability where the distributions differ. The maximum value of the K-S statistic, 1,
occurs when the distributions are perfectly separated. Use of the K-S statistic for comparing predictive
models is popular in database marketing.
An oversampled validation data set does not affect D because the empirical distribution function is
unchanged if each case represents more than one case in the population.
(Slide: two example models, one with D = .24 and c = .66 and one with D = .49 and c = .82.)
The Kolmogorov-Smirnov two-sample test is sensitive to all types of differences between the
distributions – location, scale, and shape. In the predictive modeling context, it could be argued that
location differences are paramount. Because of its generality, the K-S test is not particularly powerful at
detecting location differences. The most powerful nonparametric two-sample test is the Wilcoxon-Mann-
Whitney test. Remarkably, the Wilcoxon-Mann-Whitney test statistic is also equivalent to the area under
the ROC curve (Hand 1997).
The Wilcoxon version of this popular two-sample test is based on the ranks of the data. In the predictive
modeling context, the predicted posterior probabilities would be ranked from smallest to largest. The test
statistic is based on the sum of the ranks in the classes. The area under the ROC curve, c, can be
determined from the rank-sum in class 1.
   c = ( Σ{i: yi=1} Ri − (1/2)·n1·(n1 + 1) ) / (n1 ⋅ n0)

The first term in the numerator is the sum of the ranks in class 1.
A perfect ROC curve would be a horizontal line at one – that is, sensitivity and specificity would both
equal one for all cutoffs. In this case, the c statistic would equal one. The c statistic technically ranges
from zero to one, but in practice, it should not get much lower than one-half. A perfectly random model,
where the posterior probabilities were assigned arbitrarily, would give a 45° angle straight ROC curve
that intersects the origin; hence, it would give a c statistic of 0.5.
Oversampling does not affect the area under the ROC curve because sensitivity and specificity are
unaffected. The area under the ROC curve is also equivalent to the Gini coefficient, which is used to summarize the performance of a Lorenz curve (Hand 1997).
The K-S statistic can be computed in the NPAR1WAY procedure using the scored validation data set. The
EDF and WILCOXON options in PROC NPAR1WAY request the Kolmogorov-Smirnov and Wilcoxon
tests, respectively. The tests compare the values of the variable listed in the VAR statement between the
groups listed in the CLASS statement.
proc npar1way edf wilcoxon data=scoval;
class ins;
var p_1;
run;
Partial Output
Kolmogorov-Smirnov Test for Variable P_1
Classified by Variable Ins
The results show that the two-sample Kolmogorov statistic D is 0.44, which represents the largest
separation between the two cumulative distributions. The results of the Wilcoxon test can be used to
compute the c statistic. However, this is automatically computed in the LOGISTIC procedure.
proc logistic data=scoval des;
model ins=p_1;
run;
Partial Output
Association of Predicted Probabilities and Observed Responses
Comparing this model to a different model, say the DDA, DDABal, Dep, DepAmt, CashBk, Checks,
and Res model posited in Chapter 2, you would find that this model performs much better on the
validation data and hence would be the model to use.
This process of selecting inputs for a model, fitting that model, and evaluating that model’s fit on the
validation data can be automated with macro programming. This will allow you to consider many
candidate models in a small time frame; this should lead to better model generalization.
The next chapter features an example of a macro program that takes the series of models generated by
best subsets logistic regression and compares their performance on the validation data. A plot shows the
performance gains as a function of model complexity, which should be a very useful tool for final model
selection.
Chapter 5 Generating and Evaluating Many Models

5.1 Model Selection Plots
Complexity
To compare many models, an appropriate fit statistic (or statistics) must be selected. For statistics like
average profit, c, and Kolmogorov-Smirnov’s D, higher values mean better fitting models. Because the
goal of most predictive modeling efforts is a model that generalizes well, these statistics are typically
measured on the validation data set. For a series of models, which could be generated by an automatic
selection routine, it is conceivable to plot a fit measure against some index of complexity. For standard
logistic regression models, this index is likely equivalent to the degrees of freedom in the model.
Typically, model performance follows a fairly straightforward trend. As the complexity increases (that is,
as terms are added) the fit on the training data gets better. After a point, the fit may plateau, but on the
training data, the fit gets better as model complexity increases. Some of this increase is attributable to the
model capturing relevant trends in the data. Detecting these trends is the goal of modeling. Some of the
increase, however, is due to the model identifying vagaries of the training data set. This behavior has been
called overfitting. Because these vagaries are not likely to be repeated, in the validation data or in future
observations, it is reasonable to want to eliminate those models. Hence, the model fit on the validation
data, for models of varying complexity, is also plotted. The typical behavior of the validation fit line is an
increase (as more complex models detect more usable patterns) followed by a plateau, which may finally
result in a decline1 in performance. The decline in performance is due to overfitting. The plateau just
indicates more complicated models that have no fit-based arguments for their use. A reasonable rule
would be to select the model associated with the complexity that has the highest validation fit statistic.
Plotting the training and validation results together permits a further assessment of the model’s
generalizing power. Typically, the performance will deteriorate from the training data to the validation
data. This phenomenon, sometimes known as shrinkage, is an additional statistic that some modelers use
to get a measure of the generalizing power of the selected model. For example, the rule used to select a
model from the many different models plotted might be: “Choose the simplest model that has the highest
validation fit measure, with no more than 10% shrinkage from the training to the validation results.”
1 This behavior is sometimes called peaking.
If the measure of model fit is some sort of error rate, then the plot looks like the above but flipped about
the horizontal axis. In the absence of profit or cost information, the Mean Squared Error (MSE) is one
such fitness statistic that measures how poorly a model fits; that is, smaller is better.
   MSE = E[(D − D̂)²]

   Estimated MSE = (1/N) Σ (Di − D̂i)²,  i = 1, …, N

(Slide: a fitted curve D̂ overlaid on the training data, with the target D plotted against an input X.)
Mean squared error (MSE) is a commonly employed method of establishing and assessing the
relationship between inputs and the expected value of the target. MSE is estimated from a sample by
differencing model predictions from observed target values, squaring these differences, and averaging
across all data points in the sample.
   MSE        =    Variance        +    Bias²
   E[(D − D̂)²] =  E[(D − E[D])²]  +  [E(D̂ − E[D])]²

   Estimated MSE = (1/N) Σ (Di − D̂i)²,  i = 1, …, N   (training data)
In theory, the MSE can be decomposed into two components, each involving a deviance from the true
(but unknown) expected value of the target variable.
The first of these components is the residual variance of the target variable. This term quantifies the
theoretical limit of prediction accuracy and the absolute lower bound for the MSE. The variance
component is independent of any fitted model.
The second MSE component is the average prediction bias squared. This term quantifies the difference
between the predicted and actual expected value of the target.
(Slide: the same decomposition, with the MSE now estimated on the validation data rather than the training data.)
As always, you must be careful to obtain an unbiased estimate of MSE. MSE estimates obtained from
the data used to fit the model will almost certainly be overly optimistic. Estimates of MSE from an
independent validation data set allow for an honest assessment of model performance.
For a binary target B, the same decomposition applies:

   MSE (Inaccuracy)  =  Variance (Inseparability)  +  Bias² (Imprecision)
   E[(B − B̂)²]       =  E[(B − E[B])²]             +  [E(B̂ − E[B])]²

   Estimated MSE = (1/N) Σ (Bi − B̂i)²,  i = 1, …, N   (validation data)
While MSE is an obvious choice for comparing interval target models, it is also useful for assessing
binary target models (Hand 1997). The estimated MSE can be thought of as measuring the overall
inaccuracy of model prediction. This inaccuracy estimate can be decomposed into a term related to the
inseparability of the two target levels (corresponding to the variance component) plus a term related to the
imprecision of the model estimate (corresponding to the bias-squared component). In this way, the model
with the smallest estimated MSE will also be the least imprecise.
Whether you choose the MSE, the c statistic, the average profit, Kolmogorov-Smirnov’s D or any other
measure of model fit, the model selection plot should help point you toward the appropriate level of
complexity to achieve good validation generalization.
Likewise, there are many ways to consider generating a series of models of different complexity. Consider
any of the automatic variable selection routines described in Chapter 3. Backward elimination of variables
creates a series of models of decreasing complexity. Stepwise selection, in general, generates a series of
models of increasing complexity. All subsets selection techniques likewise create a series of increasingly
complex models. Any of these techniques lends itself to model selection based on validation data
performance; the next demonstration shows an example of using all subsets regression.
This demonstration uses the validation performance of many different models as a guide to selecting
models that will generalize well. Hence, the validation data must be prepared in the same way as the
training data. To this end, the data sets train2 and valid2, which were created by imputing for all
numeric variables on the train and valid data sets, are augmented with the modified DDABal and
SavBal variables. Moreover, because the all subsets algorithm in the LOGISTIC procedure cannot be
used with the CLASS statement, the dummy variables for Res are created on these data sets as well.
data train2;
set train2;
resr=(res="R");
resu=(res="U");
if not dda then ddabal = &mean;
brclus1=(branch='B14');
brclus2=(branch in ('B12','B5','B8',
'B3','B18','B19','B17',
'B4','B6','B10','B9',
'B1','B13'));
brclus3=(branch in ('B15','B16'));
%include rank;
if savbal > 16000 then savbal=16000;
run;
data valid2;
set valid2;
resr=(res="R");
resu=(res="U");
if not dda then ddabal = &mean;
brclus1=(branch='B14');
brclus2=(branch in ('B12','B5','B8',
'B3','B18','B19','B17',
'B4','B6','B10','B9',
'B1','B13'));
brclus3=(branch in ('B15','B16'));
%include rank;
if savbal > 16000 then savbal=16000;
run;
The LOGISTIC procedure with SELECTION equal to SCORE is called. The number of models per
complexity is specified by the BEST= option. The ODS output data set score contains the inputs used at each complexity level. Setting BEST=2 will generate twice as many candidate models as BEST=1; this
may prove useful in finding a model that fits the validation data, and hence the population, well.
ods listing close;
ods output bestsubsets=score;
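/* The PROC LOGISTIC call itself is not reproduced in this extract.     */
/* A sketch consistent with the text: BEST=2 models per model size and  */
/* the screened inputs plus the Res dummies (the exact input list is an */
/* assumption).                                                         */
proc logistic data=train2 des;
   model ins=&screened resr resu / selection=score best=2;
run;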
ods listing;
proc print data=score(obs=12);
run;
 Obs   NumberOfVariables   ScoreChiSq   VariablesInModel
   1           1            2085.5103   SavBal
   2           1            1870.2276   B_DDABal
   3           2            3316.1397   SavBal B_DDABal
   4           2            2794.5356   CD SavBal
   5           3            3856.4605   CD SavBal B_DDABal
   6           3            3691.8664   MM SavBal B_DDABal
   7           4            4153.3824   MM CD SavBal B_DDABal
   8           4            3977.5128   MIPhone CD SavBal B_DDABal
   9           5            4278.3879   MIPhone MM CD SavBal B_DDABal
  10           5            4272.2890   MM CD brclus2 SavBal B_DDABal
  11           6            4352.5207   MIPhone MM CD SavBal B_DDABal DirDep
  12           6            4347.5405   MM CD brclus2 SavBal B_DDABal DirDep
Recall that the screened macro variable contains the list of inputs created by making missing
indicators (MIPhone, and so forth), clustering categorical input levels (brclus1, and so forth),
and then clustering variables and choosing representatives. The inputs deemed to have low
correlation with the target according to the Spearman and Hoeffding measures have been
excluded. All in all, this leaves 36 inputs.
The inputs placed in score by the LOGISTIC procedure are transferred to macro variables using the SQL procedure's SELECT ... INTO syntax. The automatic macro variable SQLOBS gives the number of models
produced overall.
proc sql noprint;
select variablesinmodel into :inputs1 - :inputs99999
from score;
select NumberOfVariables into :ic1 - :ic99999
from score;
quit;
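The assignment that creates the LASTINDX macro variable referenced later by %FITANDSCORE is not shown; presumably it simply captures SQLOBS immediately after the PROC SQL step:

%let lastindx=&sqlobs;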
The variables π1, ρ1, π0, and ρ0 are necessary to calculate sampling weights. π1 must be known a priori,
and ρ1 was automatically calculated in an earlier demonstration. To simplify the code, create macro
variables to house π0 and ρ0 as well. The %SYSEVALF function takes the text string in its argument and
performs the calculation. For example, %SYSEVALF(1-&pi1) will resolve to 0.98 in this instance.
%let rho0 = %sysevalf(1-&rho1);
%let pi0 = %sysevalf(1-&pi1);
Likewise, the profit matrix is placed into macro variables.
%let pf11 = 99; %let pf10 = 0;
%let pf01 = -1; %let pf00 = 0;
After these preliminaries have been established, it is easy to imagine the data set necessary to generate the
model selection plot. It requires a summary of the performance for each model on the training and on the
validation data sets. The particular data wrangling that needs to take place depends on what statistics are
of interest and how you choose to calculate them. The following code calculates c, ASE, and average
overall profit. ASE, the average squared error, is related to MSE but has a different divisor.
Two macros will be compiled. One, %ASSESS, will assess the performance of a model on a particular
data set, and append a record summarizing that model’s performance to the Results data set. The
%FITANDSCORE macro will fit and score many different models; here, the series of models created by
the all subsets algorithm.
These macros are predicated on assumptions of modeling data structure and the macro variables
created by the SQL SELECT INTO call above. Applying these techniques, or similar techniques,
on your data may require careful consideration of the data.
The %ASSESS macro is defined. This macro simplifies the assessment of each of the models found by
the best subsets procedure. Data for assessment are assumed stored in a data set named SCORE&DATA,
where DATA is a macro parameter with anticipated values TRAIN, VALID, and potentially TEST. The
INPUTCOUNT= macro parameter points to the number of parameters in the model in question. The
INPUTSINMODEL= macro parameter points to the list of parameters in the model. The INDEX= macro
parameter is a unique index for each model in the series. This allows you to have more than one model
with, say, 16 inputs.
%macro assess(data=,inputcount=,inputsinmodel=,index=);
The assessment data is sorted by descending posterior probability. This sorting allows the calculation of a
c statistic by the ASSESS DATA step below.
proc sort data=scored&data;
by descending p_1;
run;
The main assessment DATA step begins. The DATAROLE variable is defined to contain the type of
assessment (for example, VALID). A two-by-two temporary array, n, is defined and initialized to all
zeros. The n array will contain the confusion matrix based on the profit matrix specified by decision
processing.
A two-by-one temporary array, w, is defined and initialized to the ratio of the population and sample
marginal averages (π0/ρ0) and (π1/ρ1). The w array will be used to adjust the model posterior probabilities
as well as to calculate total profit and the confusion matrix.
data assess;
attrib DATAROLE length=$5;
retain sse 0 csum 0 DATAROLE "&data";
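The array definitions and the SET statement described above do not appear in this extract; a sketch consistent with the later references to n, w, and last (the zero-based indexing and the use of temporary arrays are assumptions):

array n[0:1,0:1] _temporary_ (0 0 0 0);
array w[0:1] _temporary_
      (%sysevalf(&pi0/&rho0) %sysevalf(&pi1/&rho1));
set scored&data end=last;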
Assuming that the LOGISTIC call that generates the scored data sets will correct for
oversampling, you can use the predicted probabilities. If the probabilities are biased due to
oversampling, and you do not correct for this before the assessment, you could do that here.
Profits for each decision alternative are calculated, based on the user-specified decision data.
d1=&PF11*p_1+&PF01*p_0;
d0=&PF10*p_1+&PF00*p_0;
Variable t equals 1 if the target equals the primary target value and zero otherwise; that is, t is a flag for
response. Variable d equals 1 if the profit for decision 1 exceeds the profit for decision 0; that is, d is the
profit maximizing decision. The STRIP function removes trailing and leading blanks. Contrast this with
the default behavior of the COMPRESS function, which removes all blanks in a field.
These variables may be defined differently in the presence of different decision rules or different
values for the target classes; for example a different profit matrix or a Y/N target instead of a 1/0
target.
t=(strip(ins)="1");
d=(d1>d0);
The appropriate cell of the confusion matrix array n is incremented by the appropriate element of the
weight vector. The sum of squared error, sse, is incremented. The sum used to calculate the c statistic,
csum, is incremented. The increment will be positive only when the target value is zero. This is in effect
a Riemann sum of the Receiver Operating Characteristic (ROC) curve.
n[t,d] + w[t];
sse + (ins-p_1)**2;
csum + ((n[1,1]+n[1,0])*(1-t)*w[0]);
On the last time through the DATA step, finalize the fit statistics and output. To finalize the ASE, divide
the sum of squared errors by the sample size. Total profit is the product of the sample sizes in each cell of
n with the appropriate profit amount. The c statistic is based on the sum of the area of rectangles that
approximate the ROC curve, but those areas are based on the sample size. To restrict this area to be
between 0 and 1, the range of the c statistic, divide through by the product of n1 and n0.
if last then do;
INPUT_COUNT=&inputcount;
TOTAL_PROFIT =
sum(&PF11*n[1,1],&PF10*n[1,0],&PF01*n[0,1],&PF00*n[0,0]);
OVERALL_AVG_PROFIT =
TOTAL_PROFIT/sum(n[0,0],n[1,0],n[0,1],n[1,1]);
ASE = sse/sum(n[0,0],n[1,0],n[0,1],n[1,1]);
C = csum/(sum(n[0,0],n[0,1])*sum(n[1,0],n[1,1]));
index=&index;
output;
end;
run;
Update the Results data set with this data’s assessment results.
proc append base=results data=assess force;
run;
End the %ASSESS macro.
%mend assess;
Define another macro %FITANDSCORE. This macro will refit the series of logistic regression models,
and create the scored data sets that the %ASSESS macro needs.
%macro fitandscore();
Any previous version of the Results data set is removed using the DATASETS procedure.
proc datasets
library=work
nodetails
nolist;
delete results;
run;
A loop begins over all models fit by the original LOGISTIC procedure. The macro variable IM is set
equal to the inputs in the MODEL_INDX model. IC is set equal to the number of inputs in the
MODEL_INDX model.
%do model_indx=1 %to &lastindx;
%let im=&&inputs&model_indx;
%let ic=&&ic&model_indx;
The macro variable LASTINDX was created earlier; it points to the total number of models in the
series of models to be considered.
The LOGISTIC procedure refits the model with inputs specified in IM. The SCORE option is used to
score the training and validation data sets and place the result in ScoredTrain and ScoredValid,
respectively. Only the target value and posterior probabilities are kept. The PRIOREVENT= option
corrects the predicted probabilities for oversampling. If you have many models, or if you are not
interested in the output, you can use the Output Delivery System to eliminate the printed listing of the
LOGISTIC procedure results.
proc logistic data=train2 des;
model ins=&im;
score data=train2
out=scoredtrain(keep=ins p_1 p_0)
priorevent=&pi1;
score data=valid2
out=scoredvalid(keep=ins p_1 p_0)
priorevent=&pi1;
run;
The %ASSESS macro, discussed above, is called for each of the scored data sets. The values of
the parameters in the call (DATA=, INPUTCOUNT=, and so forth) become macro variables in the
%ASSESS macro.
%assess(data=TRAIN,
inputcount=&ic,
inputsinmodel=&im,
index=&model_indx);
%assess(data=VALID,
inputcount=&ic,
inputsinmodel=&im,
index=&model_indx);
The loop over all models fit by the LOGISTIC procedure with the SCORE option ends.
%end;
The %FITANDSCORE macro definition ends.
%mend fitandscore;
The %FITANDSCORE macro is called.
%fitandscore;
The result of the %FITANDSCORE macro is a results data set.
The SYMBOL1 statement specifies a dark red (C=DARKRED) unbroken line (L=1) joining points plotted with
the letter c (V=c) at 1.5 times the default marker height (H=1.5). The F=ARIAL option specifies the font used
to draw the plotting character. This symbol definition is associated with the first series of points plotted,
which is the c statistic for each model on the training data. The SYMBOL2 statement specifies a dashed red
line (L=2) for the validation data points. In the PLOT statement, the reference lines on the vertical axis are
set to the training and validation c statistics of the model selected in Chapter 4.
symbol1 i=join f=arial v=c h=1.5 c=darkred l=1;
symbol2 i=join f=arial v=c h=1.5 c=red l=2;
proc gplot data = results;
where index > 12;
plot C*Index=datarole / vref=.784 .780;
run; quit;
The plot seems to reach a plateau around index 45 or 50. Because the goal is to find a model that
generalizes well, the simplest model that has good performance on the validation data is probably
the best candidate.
Of course, it is difficult to quantify the trade-off between a slightly more complex model and a slightly
higher c statistic. The trade-off may be easier to think about in terms of profit. What is it worth to the
organization to use a more complex model? A plot of profit may help to visualize, and quantify, this
trade-off.
This plot uses SYMBOL and GPLOT procedure syntax that is similar to earlier examples. The reference
line is the average profit achieved using the model built in Chapter 4.
symbol1 i=join f=arial v=$ h=1.5 c=darkgreen w=2 l=1;
symbol2 i=join f=arial v=$ h=1.5 c=green w=2 l=2;
proc gplot data = results;
where index > 12;
plot OVERALL_AVG_PROFIT*Index=datarole / vref=1.234;
run; quit;
Again, the plot seems to show that validation performance peaks near index 50. Plotting the ASE shows a
similar pattern. Of course, you could use the MEANS procedure or the SQL procedure to find the model index
associated with the highest validation profit or c statistic (or the lowest ASE). To see the results of, say, the
highest-profit model, submit the following code. Because the highest validation performance is achieved by
the model with index 48, that model is a good candidate.
proc logistic data=train2 des;
model ins=&inputs48;
score data=valid2 out=scoval2;
run;
Partial Output

Response Profile

            Ordered                      Total
              Value          INS     Frequency
                  1            1          7451
                  2            0         14061

(The Model Information, Model Fit Statistics, and Analysis of Maximum Likelihood Estimates tables are not reproduced here.)
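As noted above, you could also locate the best model programmatically instead of reading it from the plots. A minimal sketch using the SQL procedure, assuming the Results data set contains a DATAROLE column that holds the value of the DATA= parameter passed to %ASSESS (here VALID):

proc sql;
   /* pick the index of the model with the highest validation profit */
   select index into :best_index
   from results(where=(upcase(datarole)='VALID'))
   having overall_avg_profit=max(overall_avg_profit);
quit;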
A national veterans' organization seeks to better target its solicitations for donations. By soliciting only the most
likely donors, less money is spent on solicitation efforts and more money is available for charitable concerns.
Solicitations involve sending a small gift to an individual together with a request for a donation. Gifts include
mailing labels and greeting cards.
The organization has more than 3.5 million individuals in its mailing database. These individuals have been
classified by their response behavior to previous solicitation efforts. Of particular interest is the class of individuals
identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago.
The organization found that by predicting the response behavior of this group, it could use the model to rank all
3.5 million individuals in its database. With this ranking, a decision can be made to either solicit or ignore an
individual in the current solicitation campaign. The current campaign refers to a greeting card mailing sent in
June of 1997; it is identified in the raw data as the 97NK campaign.
The raw analysis data have been reduced for the purposes of this course. A subset of slightly more than 19,000
records was selected for modeling; as will be seen, this subset was not chosen arbitrarily. In addition, the 481
fields were reduced to 47. Some fields were eliminated after considering their potential association with the
analysis objective (for example, it is doubtful that CD player ownership is strongly correlated with donation
potential). Other fields were combined to form summaries of particular customer behaviors. The following
table details the variables and their
descriptions:
Variable Description
CARD_PROM_12 Count of card promotions in the last 12 months
CLUSTER_CODE Socio-Economic Cluster Code
CONTROL_NUMBER ID
DONOR_AGE Donor Age
DONOR_GENDER Donor Gender
FREQUENCY_STATUS_97NK Count of Donations between June 1995 and June 1996 (capped at 4)
HOME_OWNER Home Owner flag
INCOME_GROUP Income Bracket, from 1 to 7
IN_HOUSE Flag for In-House donor program
LAST_GIFT_AMT Amount of most recent donation
LIFETIME_AVG_GIFT_AMT Average donation amount, ever
LIFETIME_CARD_PROM Number of card promotions, ever
LIFETIME_GIFT_AMOUNT Total donation amount, ever
LIFETIME_GIFT_COUNT Total number of donations, ever
LIFETIME_GIFT_RANGE Maximum gift amount less minimum gift amount
LIFETIME_MAX_GIFT_AMT Maximum gift amount, ever
LIFETIME_MIN_GIFT_AMT Minimum gift amount, ever
LIFETIME_PROM Count of solicitations ever sent
Session 1: Exercises
Chapter 2 Exercises
1. Investigate the inputs in the file PMLR.PVA_RAW_DATA and investigate a preliminary model. You may
choose to create a temporary copy of the data (named, for example, PVA).
a. Create a table of the mean, minimum, maximum, and count of missing for each numeric input.
b. Create tables of the categorical inputs. Do not create a table using CONTROL_NUMBER, the
identification key.
c. Create a macro variable to store π1, the proportion of responders in the population. This value is
0.05.
d. The current model consists of PEP_STAR, RECENT_AVG_GIFT_AMT, and
FREQUENCY_STATUS_97NK. Fit this model to the data. Use the SCORE statement to append
the predicted probability to the data, correcting for oversampling. Investigate the minimum,
maximum, and average predicted probability of response based on this model.
e. How many individuals in the sample have a predicted response rate greater than 0.025? 0.05?
0.075? How many responders?
Chapter 3 Exercises
1. Replace missing values using group-median imputation.
a. Create missing value indicators for inputs that have missing values.
b. Submit the program below to group the variables RECENT_RESPONSE_PROP and
RECENT_AVG_GIFT_AMT into three groups. The RANK procedure with the GROUPS= option
bins variables into quantiles. The VAR statement lists the variables to be grouped. The RANKS
statement names the group indicators in the OUT= data set. If the RANKS statement is omitted,
then the group indicators replace the VAR variables in the OUT= data set.
proc rank data=pva out=pva groups=3;
var recent_response_prop recent_avg_gift_amt;
ranks grp_resp grp_amt;
run;
c. Sort PVA by grp_resp and grp_amt.
d. Use the STDIZE procedure with a BY statement to impute missing values for each BY group and
output the completed data set. Name the output data set PVA1.
e. Use the MEANS procedure to determine the values that the missing values were replaced with.
f. (Optional) If you use the OUTSTAT= option in the PROC STDIZE statement, you can save the
information used by the procedure for imputation. Print out the OUTSTAT= data set for a more
compact view of the information.
a. Use the MEANS procedure to generate a data set with information about the average response
rate and sample size for each level of CLUSTER_CODE.
If you do not specify OUTTREE=, use the HI option in the PROC VARCLUS statement
to force divisive clustering and save some processing time.
b. Select one variable from each cluster as a representative. Did you choose variables according to
the smallest 1-R2 ratio, or did you choose some according to subject-matter considerations?
c. Fit a logistic regression model with the FAST BACKWARD method using only the selected
cluster representatives as inputs. How will you assess this model’s fit?
4. Compare variable selection methods.
a. Fit a logistic regression model with the STEPWISE method. Use all of the numeric variables and
the URBANICITY, SES, and HOME_OWNER categorical variables. Set SLSTAY and SLENTRY
equal to .001.
b. Fit a logistic regression model with the FAST BACKWARD method. Use all of the numeric
variables and the URBANICITY, SES, and HOME_OWNER categorical variables. Set SLSTAY
equal to .001.
c. Fit a logistic regression model with the SCORE method and the BEST=1 option. Use all of the
numeric variables and create dummies for the URBANICITY, SES, and HOME_OWNER
categorical variables. Use the Output Delivery System to determine which model is the best
according to the SBC criterion. (How did you decide what level of the dummies to make the
reference level?)
Chapter 4 Exercises
1. Assess some logistic regression models using a validation data set.
a. Split the imputed data set into training and validation data sets. Use 50% of the data for each data
set role. Stratify on the target variable.
b. (Optional) Use the Spearman and Hoeffding correlation coefficients to screen the inputs with the
least evidence of a relationship with the target. If necessary, consider transforming some inputs
that do well according to the Hoeffding correlation measure.
Chapter 5 Exercises
1. Assess some logistic regression models using a validation data set.
a. Using the macros outlined in Chapter 5, score and evaluate the performance of the best subsets
models on the training and validation data sets.
If you want to use average profit, you can calculate the profit matrix from the following
information: the average responder donates $15.62, and the cost of a solicitation is $0.68.
(One way to set up the matrix is sketched after these exercises.)
b. Use graphical summaries to determine what models are of interest. If these are inconclusive, what
measures will you take to choose the best model? You might consider lift charts or ROC curves.
2. (Optional) Prepare scoring code for your selected model. Accommodate any specialized input that you
created; for example, clustered levels of CLUSTER_CODE.
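As referenced in the note to exercise 1.a above, one conventional way to turn that information into a profit matrix (an assumption about the convention, not something specified in the original exercise) is to treat individuals who are not solicited as contributing zero profit. Then soliciting a responder yields 15.62 − 0.68 = 14.94, soliciting a nonresponder yields −0.68, and not soliciting yields 0 regardless of the outcome.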
Session 1: Solutions
Chapter 2 Solutions
1. Investigate the inputs in the file, PMLR.PVA_RAW_DATA.
data pva(drop=control_number);
set pmlr.pva_raw_data;
run;
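The PROC MEANS step that produced the following table is not shown in these notes; a minimal sketch of what it likely looks like:

proc means data=pva n nmiss mean std min max;
   /* summary statistics for every numeric input */
   var _numeric_;
run;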
N
Variable N Miss Mean Std Dev Minimum Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
MONTHS_SINCE_ORIGIN 19372 0 73.4099732 41.2555742 5.0000000 137.0000000
DONOR_AGE 14577 4795 58.9190506 16.6693824 0 87.0000000
IN_HOUSE 19372 0 0.0731984 0.2604687 0 1.0000000
INCOME_GROUP 14980 4392 3.9075434 1.8647962 1.0000000 7.0000000
PUBLISHED_PHONE 19372 0 0.4977287 0.5000077 0 1.0000000
MOR_HIT_RATE 19372 0 3.3616560 9.5034812 0 241.0000000
WEALTH_RATING 10562 8810 5.0053967 2.8153860 0 9.0000000
MEDIAN_HOME_VALUE 19372 0 1079.87 960.7534484 0 6000.00
MEDIAN_HOUSEHOLD_INCOME 19372 0 341.9702147 164.2078074 0 1500.00
PCT_OWNER_OCCUPIED 19372 0 69.6989986 21.7110186 0 99.0000000
PER_CAPITA_INCOME 19372 0 15857.33 8710.63 0 174523.00
PCT_MALE_MILITARY 19372 0 1.0290109 4.9182974 0 97.0000000
PCT_MALE_VETERANS 19372 0 30.5739211 11.4214714 0 99.0000000
PCT_VIETNAM_VETERANS 19372 0 29.6032934 15.1203598 0 99.0000000
PCT_WWII_VETERANS 19372 0 32.8524675 17.8397648 0 99.0000000
PEP_STAR 19372 0 0.5044394 0.4999932 0 1.0000000
RECENT_STAR_STATUS 19372 0 0.9311377 2.5455850 0 22.0000000
FREQUENCY_STATUS_97NK 19372 0 1.9839975 1.0993458 1.0000000 4.0000000
RECENT_RESPONSE_PROP 19372 0 0.1901275 0.1139467 0 1.0000000
RECENT_AVG_GIFT_AMT 19372 0 15.3653959 10.1674849 0 260.0000000
RECENT_CARD_RESPONSE_PROP 19372 0 0.2308077 0.1862301 0 1.0000000
RECENT_AVG_CARD_GIFT_AMT 19372 0 11.6854703 10.8341202 0 300.0000000
RECENT_RESPONSE_COUNT 19372 0 3.0431034 2.0464006 0 16.0000000
RECENT_CARD_RESPONSE_COUNT 19372 0 1.7305389 1.5355208 0 9.0000000
LIFETIME_CARD_PROM 19372 0 18.6680776 8.5587782 2.0000000 56.0000000
LIFETIME_PROM 19372 0 47.5705141 22.9501581 5.0000000 194.0000000
LIFETIME_GIFT_AMOUNT 19372 0 104.4257165 105.7224599 15.0000000 3775.00
LIFETIME_GIFT_COUNT 19372 0 9.9797646 8.6881633 1.0000000 95.0000000
LIFETIME_AVG_GIFT_AMT 19372 0 12.8583383 8.7877579 1.3600000 450.0000000
LIFETIME_GIFT_RANGE 19372 0 11.5878758 15.1168929 0 997.0000000
LIFETIME_MAX_GIFT_AMT 19372 0 19.2088081 16.1011278 5.0000000 1000.00
LIFETIME_MIN_GIFT_AMT 19372 0 7.6209323 7.9597857 0 450.0000000
LAST_GIFT_AMT 19372 0 16.5841988 11.9775577 0 450.0000000
CARD_PROM_12 19372 0 5.3671278 1.2642046 0 17.0000000
NUMBER_PROM_12 19372 0 12.9018687 4.6420721 2.0000000 64.0000000
MONTHS_SINCE_LAST_GIFT 19372 0 18.1911522 4.0330648 4.0000000 27.0000000
MONTHS_SINCE_FIRST_GIFT 19372 0 69.4820875 37.5681693 15.0000000 260.0000000
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
b. Create tables of the categorical inputs. Do not create a table using CONTROL_NUMBER, the
identification key.
proc freq data=pva;
tables _character_ target_b / missing;
run;
Cumulative Cumulative
URBANICITY Frequency Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
? 454 2.34 454 2.34
C 4022 20.76 4476 23.11
R 4005 20.67 8481 43.78
S 4491 23.18 12972 66.96
T 3944 20.36 16916 87.32
U 2456 12.68 19372 100.00
Cumulative Cumulative
SES Frequency Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
1 5924 30.58 5924 30.58
2 9284 47.92 15208 78.51
3 3323 17.15 18531 95.66
4 387 2.00 18918 97.66
? 454 2.34 19372 100.00
RECENCY_
STATUS_ Cumulative Cumulative
96NK Frequency Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
A 11918 61.52 11918 61.52
E 427 2.20 12345 63.73
F 1521 7.85 13866 71.58
L 93 0.48 13959 72.06
N 1192 6.15 15151 78.21
S 4221 21.79 19372 100.00
Cumulative Cumulative
TARGET_B Frequency Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
0 14529 75.00 14529 75.00
1 4843 25.00 19372 100.00
c. Create a macro variable to store π1, the proportion of responders in the population. This value is
0.05.
%let ex_pi1=0.05;
d. The current model consists of PEP_STAR, RECENT_AVG_GIFT_AMT, and
FREQUENCY_STATUS_97NK. Fit this model to the data. Use the SCORE statement to append
the predicted probability to the data, correcting for oversampling. Investigate the minimum,
maximum, and average predicted probability of response based on this model.
proc logistic data=pva des;
model target_b=PEP_STAR
RECENT_AVG_GIFT_AMT
FREQUENCY_STATUS_97NK;
score data=PMLR.PVA_RAW_DATA
out=scopva
priorevent=&ex_pi1;
run;
Model Information
Model Information
Response Profile
Ordered Total
Value TARGET_B Frequency
1 1 4843
2 0 14529
Intercept
Intercept and
Criterion Only Covariates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
e. How many individuals in the sample have a predicted response rate greater than 0.025? 0.05?
0.075? How many responders?
proc means data=scopva n sum;
where p_1 > .025;
var target_b;
run;
N Sum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
19323 4833.00
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
N Sum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
8161 2579.00
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
N Sum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
2240 830.0000000
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Chapter 3 Solutions
1. Replace missing values using group-median imputation.
a. Create missing value indicators for inputs that have missing values.
data pva(drop=i);
set pva;
/* name the missing indicator variables */
array mi{*} mi_DONOR_AGE mi_INCOME_GROUP
mi_WEALTH_RATING;
/* select variables with missing values */
array x{*} DONOR_AGE INCOME_GROUP WEALTH_RATING;
do i=1 to dim(mi);
mi{i}=(x{i}=.);
end;
run;
b. Group the variables RECENT_RESPONSE_PROP and RECENT_AVG_GIFT_AMT into three
groups.
proc rank data=pva out=pva groups=3;
var recent_response_prop recent_avg_gift_amt;
ranks grp_resp grp_amt;
run;
f. (Optional) If you use the OUTSTAT= option in the PROC STDIZE statement, you can save the
information used by the procedure for imputation. Print out the OUTSTAT= data set for a more
compact view of the information.
proc stdize data=pva method=median
reponly out=pva1 outstat=med;
by grp_resp grp_amt;
var DONOR_AGE INCOME_GROUP WEALTH_RATING;
run;
Obs    grp_resp    grp_amt    _TYPE_      DONOR_AGE    INCOME_GROUP    WEALTH_RATING
  1        0           0      LOCATION        64              4               5
  9        0           1      LOCATION        59              4               5
 17        0           2      LOCATION        57              4               6
 25        1           0      LOCATION        65              4               5
 33        1           1      LOCATION        58              4               5
 41        1           2      LOCATION        57              4               6
 49        2           0      LOCATION        63              4               5
 57        2           1      LOCATION        61              4               5
 65        2           2      LOCATION        59              4               6
a. Use the MEANS procedure to generate a data set with information about the average response
rate and sample size for each level of CLUSTER_CODE.
proc means data=pva1 noprint nway;
class CLUSTER_CODE;
var target_b;
output out=level mean=prop;
run;
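The clustering step that creates the CLUSTER data set used below is not reproduced in these notes. A sketch of what it might look like, following the level-clustering approach from Chapter 3 (the METHOD=, FREQ, and ID choices are assumptions):

ods listing close;
ods output clusterhistory=cluster;
proc cluster data=level method=ward rsquare;
   freq _freq_;      /* weight each level by its sample size         */
   var prop;         /* cluster the levels on their response rates   */
   id cluster_code;
run;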
ods listing;
c. Use the p-value of the appropriate χ2 test to determine which level of clustering is appropriate.
proc freq data=pva1 noprint;
tables CLUSTER_CODE*TARGET_B / chisq;
output out=chi(keep=_pchi_) chisq;
run;
data cutoff;
if _n_ = 1 then set chi;
set cluster;
chisquare=_pchi_*rsquared;
degfree=numberofclusters-1;
logpvalue=logsdf('CHISQ',chisquare,degfree);
run;
(Plot of logpvalue, vertical axis from about -10 to -50, against Number Of Clusters, horizontal axis from 0 to 60.)
proc sql;
select NumberOfClusters into :ncl
from cutoff
having logpvalue=min(logpvalue);
quit;
Number
Of
Clusters
ƒƒƒƒƒƒƒƒ
5
CLUSTER_
CLUSNAME CODE CLUSTER
CL5 23 3
39 3
07 3
34 3
27 3
31 3
33 3
22 3
36 3
05 3
29 3
04 3
48 3
12 3
26 3
15 3
25 3
19 3
42 3
02 3
11 3
17 3
CL6 13 5
20 5
53 5
. 5
28 5
CL7 16 1
38 1
03 1
40 1
18 1
24 1
01 1
14 1
46 1
35 1
CL8 06 4
10 4
32 4
41 4
44 4
47 4
CL9 09 2
43 2
49 2
51 2
21 2
30 2
45 2
52 2
08 2
37 2
50 2
a. Use the VARCLUS procedure to cluster all numeric variables, but use PERCENT=80 instead of
MAXEIGEN= as your stopping criterion.
proc varclus data=pva1
short hi
percent=80;
var &ex_inputs ClusCd: mi_: statusFL statusES;
run;
b. Select one variable from each cluster as a representative. Using only the 1-R2 ratio as a criterion,
you might get:
%let ex_screened=
LIFETIME_CARD_PROM LIFETIME_MIN_GIFT_AMT PER_CAPITA_INCOME
CARD_PROM_12 mi_INCOME_GROUP ClusCdGrp1
RECENT_RESPONSE_COUNT ClusCdGrp3 PCT_MALE_MILITARY
DONOR_AGE PCT_VIETNAM_VETERANS MOR_HIT_RATE
PCT_OWNER_OCCUPIED PCT_MALE_VETERANS PUBLISHED_PHONE
ClusCdGrp4 WEALTH_RATING ClusCdGrp2
MONTHS_SINCE_LAST_GIFT RECENT_STAR_STATUS LIFETIME_GIFT_RANGE
INCOME_GROUP IN_HOUSE statusFL
statusES RECENT_AVG_GIFT_AMT PCT_WWII_VETERANS
LIFETIME_GIFT_AMOUNT PEP_STAR mi_DONOR_AGE
RECENT_AVG_CARD_GIFT_AMT RECENT_CARD_RESPONSE_PROP
;
c. Fit a logistic regression model with the FAST BACKWARD method using only the selected
cluster representatives as inputs. How will you assess this model’s fit?
proc logistic data=pva1 des namelen=32;
model target_b = &ex_screened
/selection=backward fast;
run;
Partial Output

Response Profile

            Ordered                        Total
              Value     TARGET_B       Frequency
                  1            1            4843
                  2            0           14529

         Chi-Square      DF    Pr > ChiSq
            12.9763      16        0.6745

(The Model Information, Model Fit Statistics, backward elimination summary, and Analysis of Maximum Likelihood Estimates tables are not reproduced here.)
a. Fit a logistic regression model with the STEPWISE method. Use all of the numeric variables and
the URBANICITY, SES, and HOME_OWNER categorical variables. Set SLSTAY and SLENTRY
equal to .001.
proc logistic data=pva1 des namelen=32;
class URBANICITY SES HOME_OWNER;
model target_b = &ex_screened URBANICITY
SES HOME_OWNER
/selection=stepwise slentry=.001 slstay=.001;
run;
Partial Output
Analysis of Maximum Likelihood Estimates (parameter estimates not reproduced here)
b. Fit a logistic regression model with the FAST BACKWARD method. Use all of the numeric
variables and the URBANICITY, SES, and HOME_OWNER categorical variables. Set SLSTAY
equal to .001.
proc logistic data=pva1 des namelen=32;
class URBANICITY SES HOME_OWNER;
model target_b = &ex_screened URBANICITY
SES HOME_OWNER
/selection=backward FAST slstay=.001;
run;
Partial Output
Analysis of Maximum Likelihood Estimates (parameter estimates not reproduced here)
c. Fit a logistic regression model with the SCORE method and the BEST=1 option. Use all of the
numeric variables and create dummies for the URBANICITY, SES, and HOME_OWNER
categorical variables. Use the Output Delivery System to determine which model is the best
according to the SBC criterion.
proc freq data=pva1;
tables (URBANICITY SES HOME_OWNER)*target_b;
run;
URBANICITY TARGET_B
Frequency‚
Percent ‚
Row Pct ‚
Col Pct ‚ 0‚ 1‚ Total
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
? ‚ 314 ‚ 140 ‚ 454
‚ 1.62 ‚ 0.72 ‚ 2.34
‚ 69.16 ‚ 30.84 ‚
‚ 2.16 ‚ 2.89 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
C ‚ 3016 ‚ 1006 ‚ 4022
‚ 15.57 ‚ 5.19 ‚ 20.76
‚ 74.99 ‚ 25.01 ‚
‚ 20.76 ‚ 20.77 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
R ‚ 3078 ‚ 927 ‚ 4005
‚ 15.89 ‚ 4.79 ‚ 20.67
‚ 76.85 ‚ 23.15 ‚
‚ 21.19 ‚ 19.14 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
S ‚ 3304 ‚ 1187 ‚ 4491
‚ 17.06 ‚ 6.13 ‚ 23.18
‚ 73.57 ‚ 26.43 ‚
‚ 22.74 ‚ 24.51 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
T ‚ 2938 ‚ 1006 ‚ 3944
‚ 15.17 ‚ 5.19 ‚ 20.36
‚ 74.49 ‚ 25.51 ‚
‚ 20.22 ‚ 20.77 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
U ‚ 1879 ‚ 577 ‚ 2456
‚ 9.70 ‚ 2.98 ‚ 12.68
‚ 76.51 ‚ 23.49 ‚
‚ 12.93 ‚ 11.91 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total 14529 4843 19372
75.00 25.00 100.00
SES TARGET_B
Frequency‚
Percent ‚
Row Pct ‚
Col Pct ‚ 0‚ 1‚ Total
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
1 ‚ 4341 ‚ 1583 ‚ 5924
‚ 22.41 ‚ 8.17 ‚ 30.58
‚ 73.28 ‚ 26.72 ‚
‚ 29.88 ‚ 32.69 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
2 ‚ 6966 ‚ 2318 ‚ 9284
‚ 35.96 ‚ 11.97 ‚ 47.92
‚ 75.03 ‚ 24.97 ‚
‚ 47.95 ‚ 47.86 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
3 ‚ 2590 ‚ 733 ‚ 3323
‚ 13.37 ‚ 3.78 ‚ 17.15
‚ 77.94 ‚ 22.06 ‚
‚ 17.83 ‚ 15.14 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
4 ‚ 318 ‚ 69 ‚ 387
‚ 1.64 ‚ 0.36 ‚ 2.00
‚ 82.17 ‚ 17.83 ‚
‚ 2.19 ‚ 1.42 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
? ‚ 314 ‚ 140 ‚ 454
‚ 1.62 ‚ 0.72 ‚ 2.34
‚ 69.16 ‚ 30.84 ‚
‚ 2.16 ‚ 2.89 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total 14529 4843 19372
75.00 25.00 100.00
HOME_OWNER TARGET_B
Frequency‚
Percent ‚
Row Pct ‚
Col Pct ‚ 0‚ 1‚ Total
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
H ‚ 7898 ‚ 2708 ‚ 10606
‚ 40.77 ‚ 13.98 ‚ 54.75
‚ 74.47 ‚ 25.53 ‚
‚ 54.36 ‚ 55.92 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
U ‚ 6631 ‚ 2135 ‚ 8766
‚ 34.23 ‚ 11.02 ‚ 45.25
‚ 75.64 ‚ 24.36 ‚
‚ 45.64 ‚ 44.08 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total 14529 4843 19372
75.00 25.00 100.00
data pva2;
set pva1;
home01=(HOME_OWNER="H");
nses1=(SES="1");
nses3=(SES="3");
nses4=(SES="4");
nses_=(SES="?");
nurbr=(URBANICITY="R");
nurbu=(URBANICITY="U");
nurbs=(URBANICITY="S");
nurbt=(URBANICITY="T");
nurb_=(URBANICITY="?");
run;
data _NULL_;
set NObs;
where label = 'Number of Observations Used';
call symput('obs',n);
run;
proc sql;
select VariablesInModel into :ex_selected
from subset
having sbc=min(sbc);
quit;
(Output tables from these steps, including an Analysis of Maximum Likelihood Estimates table and several Analysis of Variance and Parameter Estimates tables, are not reproduced here.)
Partial Output
All variables left in the model are significant at the 0.0010 level.
No other variable met the 0.0010 significance level for entry into the model.
        Variable                          Variable
Step    Entered                           Removed
   1    RECENT_RESPONSE_COUNT
   2    MONTHS_SINCE_LAST_GIFT
   3    PEP_STAR
   4    PER_CAPITA_INCOME
   5    RECENT_CARD_RESPONSE_PROP
   6    INCOME_GROUP
   7    statusFL
   8    ClusCdGrp3
   9    ClusCdGrp4
  10                                      PER_CAPITA_INCOME
  11    RECENT_AVG_GIFT_AMT
  12    DONOR_AGE
e. How does the c statistic computed from the model generated in PROC REG compare with the c
statistics from the models generated in PROC LOGISTIC? What reasons might one have for
choosing PROC LOGISTIC over PROC REG as a modeling tool? What reasons might one have
for choosing PROC REG over PROC LOGISTIC as a modeling tool?
proc logistic data=out1;
model target_b=p;
run;
Partial Output
Association of Predicted Probabilities and Observed Responses (table not reproduced here)
A possible reason to consider PROC REG as a modeling tool is its speed. A reason to prefer
PROC LOGISTIC is that PROC REG can yield predicted probabilities that are negative or
greater than one.
Chapter 4 Solutions
1. Assess some logistic regression models using a validation data set.
a. Split the imputed data set into training and validation data sets. Use 50% of the data for each data
set role. Stratify on the target variable.
proc sort data=pva2 out=pva2;
by target_b;
run;
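The split itself is not reproduced in these notes. A sketch of one way to do it, assuming a 50% stratified sample and the data set names PVA_TRAIN and PVA_VALID that the later steps use (the seed value is arbitrary):

proc surveyselect noprint data=pva2 samprate=.5
                  out=pva2 outall seed=27513;
   strata target_b;    /* stratify the split on the target */
run;

data pva_train pva_valid;
   set pva2;
   /* OUTALL adds a SELECTED flag to every observation */
   if selected then output pva_train;
   else output pva_valid;
run;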
b. (Optional) Use the Spearman and Hoeffding correlation coefficients to screen the inputs with the
least evidence of a relationship with the target. If necessary, consider transforming some inputs
that do well according to the Hoeffding correlation measure.
ods listing close;
ods output spearmancorr=spearman
hoeffdingcorr=hoeffding;
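/* The PROC CORR step itself is not reproduced in these notes;   */
/* the following is a sketch of what it likely looks like. The   */
/* variable list is an assumption (it should contain the         */
/* candidate inputs).                                             */
proc corr data=pva_train spearman hoeffding;
   var &ex_screened;
   with target_b;
run;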
ods listing;
Obs    Variable                        Spearman rank    Hoeffding rank
  1    RECENT_RESPONSE_COUNT                 1                 1
2 RECENT_AVG_GIFT_AMT 2 2
3 PEP_STAR 3 6
4 RECENT_CARD_RESPONSE_PROP 4 3
5 LIFETIME_MIN_GIFT_AMT 5 4
6 statusES 6 7
7 MONTHS_SINCE_LAST_GIFT 7 5
8 RECENT_STAR_STATUS 8 12
9 LIFETIME_CARD_PROM 9 8
10 LIFETIME_GIFT_AMOUNT 10 9
11 PER_CAPITA_INCOME 11 10
12 statusFL 12 21
13 ClusCdGrp4 13 17
14 ClusCdGrp3 14 25
15 IN_HOUSE 15 31
16 DONOR_AGE 16 13
17 CARD_PROM_12 17 15
18 nses3 18 20
19 nurbr 19 19
20 INCOME_GROUP 20 16
21 nurb_ 21 40
22 nses_ 22 41
23 nses4 23 42
24 MOR_HIT_RATE 24 18
25 ClusCdGrp1 25 36
26 ClusCdGrp2 26 26
27 nses1 27 23
28 mi_DONOR_AGE 28 28
29 LIFETIME_GIFT_RANGE 29 14
30 nurbs 30 33
31 RECENT_AVG_CARD_GIFT_AMT 31 11
32 home01 32 32
33 PUBLISHED_PHONE 33 34
34 nurbu 34 39
35 WEALTH_RATING 35 22
36 PCT_MALE_VETERANS 36 24
37 PCT_MALE_MILITARY 37 35
38 PCT_OWNER_OCCUPIED 38 29
39 PCT_WWII_VETERANS 39 27
40 PCT_VIETNAM_VETERANS 40 30
41 nurbt 41 38
42 mi_INCOME_GROUP 42 37
%let var=RECENT_AVG_CARD_GIFT_AMT;

/* repeat the steps below for each input of interest, for example */

%let var=LIFETIME_GIFT_RANGE;
proc rank data=pva_train groups=25 out=out;
var &var;
ranks bin;
run;
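The intermediate PROC MEANS step that builds the BINS data set used below is not reproduced in these notes; a sketch of what it likely looks like (each bin gets its frequency in _FREQ_, the number of responders in TARGET_B, and the average input value):

proc means data=out noprint nway;
   class bin;
   var target_b &var;
   output out=bins sum(target_b)=target_b mean(&var)=&var;
run;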
data bins;
set bins;
elogit=log((target_b+(sqrt(_FREQ_ )/2))/
( _FREQ_ -target_b+(sqrt(_FREQ_ )/2)));
run;
goptions reset=all;
symbol i=join c=blue v=star;
proc gplot data = bins;
title "Empirical Logit against &var";
plot elogit * &var;
run;
title "Empirical Logit against Binned &var";
plot elogit * bin;
run; quit;
title;
(The empirical logit plots for RECENT_AVG_CARD_GIFT_AMT and LIFETIME_GIFT_RANGE, against both the raw and the binned inputs, appear here.)
2) You can construct new inputs to account for the nonlinear relationships seen in these plots.
For example, try quadratic terms for both inputs.
Chapter 5 Solutions
1. Assess some logistic regression models using a validation data set.
a. Using the macros outlined in Chapter 5, score and evaluate the performance of the best subsets
models on the training and validation data sets.
ods listing close;
ods output bestsubsets=score;
proc logistic data=pva_train des;
model target_b = &ex_screened
nses1 nses3 nses4 nses_
nurbr nurbu nurbs nurbt nurb_
home01
RECENT_AVG_CARD_GIFT_AMT*RECENT_AVG_CARD_GIFT_AMT
LIFETIME_GIFT_RANGE*LIFETIME_GIFT_RANGE
/selection=score best=1;
run;
ods listing;
If you elected to skip the optional transformation exercise, then you can ignore the two squared
terms in the PROC LOGISTIC call above.
proc sql noprint;
select variablesinmodel into :inputs1 - :inputs99999
from score;
select NumberOfVariables into :ic1 - :ic99999
from score;
quit;
%let lastindx = &SQLOBS;
* Save the observed response rate, for the weighted sums of profit etc.;
proc sql;
select mean(target_b) into :ex_rho1
from pva;
quit;
The macro variables defined above make the comparisons of literally hundreds of models
easier.
data train2;
set pva_train;
RECENT_AVG_CARD_GIFT_AMTRECENT_A=
RECENT_AVG_CARD_GIFT_AMT*RECENT_AVG_CARD_GIFT_AMT;
LIFETIME_GIFT_RANGELIFETIME_GIFT=
LIFETIME_GIFT_RANGE*LIFETIME_GIFT_RANGE;
run;
data valid2;
set pva_valid;
RECENT_AVG_CARD_GIFT_AMTRECENT_A=
RECENT_AVG_CARD_GIFT_AMT*RECENT_AVG_CARD_GIFT_AMT;
LIFETIME_GIFT_RANGELIFETIME_GIFT=
LIFETIME_GIFT_RANGE*LIFETIME_GIFT_RANGE;
run;
Any new inputs need to be created on the training and validation data sets. These
variables are named according to the rules used to refer to polynomial terms in the best
subsets output.
%macro ex_assess(data=,inputcount=,inputsinmodel=,index=);
/* sort data set from likely to unlikely to respond */
proc sort data=scored&data;
by descending p_1;
run;
%mend ex_assess;
%macro ex_fitandscore();
proc datasets
library=work
nodetails
nolist;
delete results;
run;
%let im=&&inputs&model_indx;
%let ic=&&ic&model_indx;
The EX_ASSESS and EX_FITANDSCORE macros are updated versions of the model
fitting and assessing macros discussed in Chapter 5.
%ex_fitandscore;
b. Use graphical summaries to determine what models are of interest. If these are inconclusive,
what measures will you take to choose the best model? You might consider lift charts or ROC
curves.
symbol1 i=join f=arial v=c h=1.5 c=darkred l=1;
symbol2 i=join f=arial v=c h=1.5 c=red l=2;
proc gplot data = results;
plot C*Index=datarole;
run; quit;
1) On validation profit, the model with index 11 seems superior. By the c statistic, model 10
seems best. You could compare these using the ROC curve or lift charts to see if they seem to
perform differently.
proc logistic data=train2 des namelen=32 noprint;
model target_b=&inputs10;
score data=valid2 out=scoval10 outroc=outroc10 priorevent=&ex_pi1;
run;
data outroc10;
set outroc10;
cutoff=_PROB_;
specif10=1-_1MSPEC_;
sensit10=_SENSIT_;
tp10=&ex_pi1*_SENSIT_;
fn10=&ex_pi1*(1-_SENSIT_);
tn10=(1-&ex_pi1)*specif10;
fp10=(1-&ex_pi1)*_1MSPEC_;
depth=tp10+fp10;
pospv10=tp10/depth;
negpv10=tn10/(1-depth);
acc10=tp10+tn10;
lift10=pospv10/&ex_pi1;
run;
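The analogous scoring step for the 11-input model, which creates OUTROC11, is not reproduced in these notes; it mirrors the model-10 step above:

proc logistic data=train2 des namelen=32 noprint;
   model target_b=&inputs11;
   score data=valid2 out=scoval11 outroc=outroc11 priorevent=&ex_pi1;
run;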
data outroc11;
set outroc11;
cutoff=_PROB_;
specif11=1-_1MSPEC_;
sensit11=_SENSIT_;
tp11=&ex_pi1*_SENSIT_;
fn11=&ex_pi1*(1-_SENSIT_);
tn11=(1-&ex_pi1)*specif11;
fp11=(1-&ex_pi1)*_1MSPEC_;
depth=tp11+fp11;
pospv11=tp11/depth;
negpv11=tn11/(1-depth);
acc11=tp11+tn11;
lift11=pospv11/&ex_pi1;
run;
data ex_roc;
merge outroc10 outroc11;
by descending _PROB_;
run;
(Plots comparing the performance of the 10-input and 11-input models appear here.)
Because it seems like there is little difference in performance, and because a 10-input model will be easier
to use than an 11-input model, you might consider using the 10-input model.
2. (Optional) Prepare scoring code for your selected model. Accommodate any specialized input that you
created; for example, clustered levels of CLUSTER_CODE.
ods output parameterEstimates = betas2;
proc logistic data=train2 des namelen=32;
model target_b=&inputs10;
run;
%let target=TARGET_B;
data _null_;
attrib PREDNAME length=$32
TARGNAME length=$32
LastParm length=$32
;
file scorecd;
set betas2 end=last;
retain TARGNAME PREDNAME LastParm ' ';
if (Variable="Intercept") then do;
TARGNAME=compress("&target");
PREDNAME="P_"||compress(TARGNAME);
put "**********************************************;";
put "*** begin scoring code for Logistic Regression;";
put "**********************************************;";
B.1 Sampling Weights

$$
\mathrm{weight}_i =
\begin{cases}
\dfrac{\pi_1}{\rho_1} & \text{if } y_i = 1 \\[1.5ex]
\dfrac{\pi_0}{\rho_0} & \text{if } y_i = 0
\end{cases}
$$
Another method for adjusting for oversampling is to incorporate sampling weights. Sampling weights
adjust the data so that it better represents the true population. When a rare target event has been
oversampled, class 0 is under-represented in the sample. Consequently, a class-0 case should actually
count more in the analysis than a class-1 case. The predicted values will be properly corrected by using
weights that are inversely proportional to selection probabilities (for each class, the number of cases in the
sample divided by the number of cases in the population). It is convenient to use the normalized sample
weights because they sum to the original sample size:

$$
\sum_{i=1}^{n} \mathrm{weight}_i = n_0\,\frac{\pi_0}{\rho_0} + n_1\,\frac{\pi_1}{\rho_1} = n\pi_0 + n\pi_1 = n
$$
The weights adjust the number of cases in the sample to be nπ0 and nπ1, in class 0 and 1 respectively. The
classes now are in the same proportion in the adjusted sample as they are in the population. The
normalization causes less distortion in standard errors and p-values. While statistical inference is not the
goal of the analysis, p-values are used as tuning parameters in variable selection algorithms.
The offset method and the weighted method are not statistically equivalent. The parameter estimates are
not exactly the same, but they have the same large-sample statistical properties. When the linear-logistic
model is correctly specified, the offset method (un-weighted) analysis is considered superior. However,
when the logistic model is merely an approximation to some nonlinear model, weighted analysis has
advantages (Scott and Wild 1986).
$$
\text{Sampling Weight} =
\begin{cases}
\dfrac{0.02}{0.346} \approx 0.058 & \text{if Ins} = 1 \\[1.5ex]
\dfrac{0.98}{0.654} \approx 1.50  & \text{if Ins} = 0
\end{cases}
$$
Sampling Weights
The DATA step adds the sampling weights to the data set DEVELOP. The weights are .058 (.02/.346) for
class 1 and 1.5 (.98/.654) for class 0. They could have been assigned manually without having to
reference macro variables. The logical expressions (Ins=1) and (Ins=0) in the assignment statement
return the value one when true and zero when false. Consequently, this syntax is a more compact way of
expressing a conditional.
libname pmlr "c:\workshop\winsas\pmlr";
data develop;
set pmlr.develop;
sampwt=((1-&pi1)/(1-&rho1))*(ins=0)+(&pi1/&rho1)*(ins=1);
run;
The WEIGHT statement in PROC LOGISTIC weights each observation in the input data set by the value
of the WEIGHT variable.
proc logistic data=develop des;
weight sampwt;
model ins = dda ddabal dep depamt cashbk checks / stb;
score data=pmlr.new out=scored;
run;
Partial Output

Response Profile

            Ordered                     Total          Total
              Value     INS         Frequency         Weight
                  1       1             11175        645.281
                  2       0             21089      31618.707
The results show that the response profile table has a new column named Total Weight. The figures in the
column represent the sample sizes adjusted to the population proportions. Note that the Sum of the
Weights equals the total sample size.
(The Model Convergence Status and Model Fit Statistics tables, and a listing of the scored data set with the variables P_1, DDA, DDABal, Dep, DepAmt, CashBk, and Checks, are not reproduced here.)
The probabilities from the weighted analysis are similar but not equivalent to the probabilities estimated
by the offset method (pseudo model).
B.2 References
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees,
Chapman and Hall.
Cohen, A. (1991), “Dummy Variables in Stepwise Regression,” The American Statistician, 45, 226-228.
Conover, W. J. (1980), Practical Nonparametric Statistics, New York: John Wiley & Sons.
Donner, A. (1982), “The Relative Effectiveness of Procedures Commonly Used in Multiple Regression
Analysis for Dealing with Missing Values,” The American Statistician, 36, 378-381.
Duffy, T. J. and Santner, D. E. (1989), The Statistical Analysis of Discrete Data, New York:
Springer-Verlag.
Georges, J.E. (2004), Advanced Predictive Modeling Using SAS® Enterprise Miner 5.1 Course Notes,
Cary, NC: SAS Institute Inc.
Greenacre, M. J. (1993), Correspondence Analysis in Practice, San Diego, CA: Academic Press.
Hand, D. J. (1997), Construction and Assessment of Classification Rules, New York: John Wiley & Sons.
Hand, D. J. and Henley, W. E. (1997), “Statistical Classification Methods in Consumer Credit Scoring: A
Review,” Journal of the Royal Statistical Society A, 160, 523-541.
Harrell, F. E. (1997), Predicting Outcomes: Applied Survival Analysis and Logistic Regression,
Charlottesville, Virginia: School of Medicine, University of Virginia.
Hastie, T. J. and Tibshirani, R. J. (1990), Generalized Additive Models, London: Chapman and Hall.
Huber, P. J. (1997), “From Large to Huge: A Statistician’s Reactions to KDD & DM,” Proceedings of the
Third International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI
Press.
Jackson, J. E. (1991), A User's Guide to Principal Components, New York: John Wiley & Sons.
Jones, M. P. (1996), “Indicator and Stratification Methods for Missing Explanatory Variables in Multiple
Linear Regression,” Journal of the American Statistical Association, 91, 222-230.
Little, R. J. A. (1992), “Regression with Missing X’s: A Review,” Journal of the American Statistical
Association, 87, 1227-1237.
Magee, L. (1998), “Nonlocal Behavior in Polynomial Regressions,” The American Statistician, 52, 20-22.
Mantel, N. (1970), “Why Stepdown Procedures in Variable Selection,” Technometrics, 12, 621-625.
McLachlan, G. J. (1992), Discriminant Analysis and Statistical Pattern Recognition, New York: John
Wiley & Sons.
Prentice, R. L. and Pike, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,”
Biometrika, 66, 403-411.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.
Sarle, W. S. (1994), “Neural Networks and Statistical Models,” Proceedings of the 19th Annual SUGI,
Cary, NC: SAS Institute Inc.
SAS Institute Inc., SAS® 9.1 Language Reference: Concepts, Third Edition, Cary, NC: SAS Institute Inc.,
2005.
SAS Institute Inc., Base SAS® 9.1 Procedures Guide, Second Edition, Cary, NC: SAS Institute Inc., 2006.
SAS Institute Inc., SAS/STAT® 9.1 User's Guide, Cary, NC: SAS Institute Inc., 2004.
SAS Institute Inc., Logistic Regression Examples Using the SAS System, Version 6, First Edition, Cary,
NC: SAS Institute Inc., 1995.
Scott, A. J. and Wild, C. J. (1986), “Fitting Logistic Regression Models under Case-Control or Choice
Based Sampling,” Journal of the Royal Statistical Society B, 48, 170-182.
Scott, A. J. and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum
Likelihood,” Biometrika, 84, 57-71.
C Index
%
%LET statement, 1-6

A
accuracy, 4-31
all subsets regression, 5-7–5-18
ASSESS macro, 5-9
assessing fit
   using profit, 4-52–4-54
attrition predicting, 1-5
automatic score code generation
   correcting the intercept, 2-23–2-24
   Output Delivery System, 2-11–2-18

B
backward elimination method, 3-61
Bayes rule, 4-48
binary indicator variables, 2-3
binary logistic regression model, 2-7–2-10
bins, 3-46

C
c statistic, 4-57, 4-58
cases, 1-3
categorical inputs, 3-10
class separation, 4-55
CLASS statement, LOGISTIC procedure
   PARAM option, 2-7
   REF option, 2-7
CLASS statement, MEANS procedure, 3-47
classification rules
   depth, 4-34
classifier performance, 4-36–4-45
cluster components, 3-24
cluster imputation, 3-9
CLUSTER procedure, 4-8
   FREQ statement, 3-13
cluster representatives, 3-26
clustering, 3-27–3-38
clustering levels, 3-12, 3-13–3-21
cluster-mean imputation, 3-9
coefficients, 3-24
complete-case analysis, 3-4
complexity, 5-3
concentration curves, 4-34
confusion matrix, 4-31, 4-47
cost ratios, 4-50
credit scoring, 1-5
cross-validation, 4-5
curse of dimensionality, 1-10
cutoff values, 4-47

D
data sets
   test, 4-4
   training, 4-4
   validation, 4-4
data splitting, 4-4
database marketing, 1-5
depth
   classification rules, 4-34
DES option
   PROC LOGISTIC statement, 2-7
dimensions, 1-10
divergence statistics, 4-55
divisive clustering, 3-25
dummy variables, 3-10

E
EDF option
   PROC NPAR1WAY statement, 4-58
eigenvalues, 3-23
eigenvectors, 3-24
empirical logits, 3-46–3-51
error rate, 4-31
estimators
   flexible multivariate function, 3-52

F
FAST option
   MODEL statement (LOGISTIC), 3-62
FITANDSCORE macro, 5-11
fitted surface, 2-4
flexible multivariate function estimators, 3-52
forward selection method, 3-61
fraud detection, 1-5
FREQ procedure, 1-6