Using Results From Proc Corr For Variable Screening
Feature Engineering
The Spearman correlation statistic is the correlation between the ranks of an input variable and the ranks of the binary target.
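The definition above can be checked directly. The following is a minimal sketch, assuming that d.develop_a contains the input ddabal and the target ins with no missing values; the rank variable names r_ddabal and r_ins and the data set work.ranked are illustrative. If the assumptions hold, the two PROC CORR steps should report the same coefficient.

```sas
/* Sketch: Spearman correlation as the Pearson correlation of ranks. */
/* Assumes no missing values in ddabal or ins.                       */

/* Rank both variables; tied values receive the average rank. */
proc rank data=d.develop_a out=work.ranked ties=mean;
   var ddabal ins;
   ranks r_ddabal r_ins;   /* illustrative names for the rank variables */
run;

/* Pearson correlation computed on the ranks */
proc corr data=work.ranked pearson;
   var r_ddabal;
   with r_ins;
run;

/* Spearman correlation computed on the raw values */
proc corr data=d.develop_a spearman;
   var ddabal;
   with ins;
run;
```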
Compare the results of the Spearman and Hoeffding’s D statistics, paying attention to:
The RANK option in PROC CORR: some details
The set for consideration:
%let reduced=
MIPhone MICCBal Dep MM ILS MTGBal Income POS CD IRA
brclus1 Sav NSF Age SavBal LOCBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea ATMAmt DDABal
DDA brclus2 CC HMOwn DepAmt Phone ATM LORes brclus4;
The rank option in PROC CORR
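The PROC CORR step itself is not shown here. A minimal sketch of what such a step might look like follows, assuming the input data set is d.develop_a, the binary target is ins, and the macro variable reduced holds the candidate inputs. With the RANK option, the correlations are ordered from the largest to the smallest absolute value, and the ODS OUTPUT statement saves the two tables as the data sets Spearman and Hoeffding.

```sas
/* Sketch only: the data set name and target are assumptions. */

/* Save the Spearman and Hoeffding tables as data sets. */
ods output spearmancorr=spearman hoeffdingcorr=hoeffding;

proc corr data=d.develop_a spearman hoeffding rank;
   var &reduced;   /* candidate inputs */
   with ins;       /* binary target */
run;
```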
proc contents data=spearman;
run;
proc print data=hoeffding;
run;
The variable names in the SAS data sets Spearman and Hoeffding are stored in the variables best1 through best39.
We need to restructure the data sets so that the identifier is the variable name and there is a single observation for each variable name.
We also want to keep the correlation measure, its rank, and its p-value for each observation (named differently in the two data sets).
Restructure Spearman data
%let nvar=39;/*reduced set*/
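The restructuring steps are not shown here. The sketch below is one way to do it, under the assumption that the ODS tables store the correlations in r1 through r39 and the p-values in p1 through p39 alongside best1 through best39. The output names scorr, spvalue, hcorr, hpvalue, and rankho are illustrative; ranksp matches the name used in the later sort. Both outputs are sorted by variable so that they can be merged.

```sas
/* Sketch: one observation per input variable (Spearman). */
data spearman1(keep=variable scorr spvalue ranksp);
   length variable $ 32;
   set spearman;
   array best(*) best1-best&nvar;   /* variable names, strongest first */
   array r(*) r1-r&nvar;            /* correlations (assumed names)    */
   array p(*) p1-p&nvar;            /* p-values (assumed names)        */
   do i = 1 to &nvar;
      variable = best(i);
      scorr = r(i);
      spvalue = p(i);
      ranksp = i;                   /* rank 1 = strongest association  */
      output;
   end;
run;

/* Same restructuring for Hoeffding's D. */
data hoeffding1(keep=variable hcorr hpvalue rankho);
   length variable $ 32;
   set hoeffding;
   array best(*) best1-best&nvar;
   array r(*) r1-r&nvar;
   array p(*) p1-p&nvar;
   do i = 1 to &nvar;
      variable = best(i);
      hcorr = r(i);
      hpvalue = p(i);
      rankho = i;
      output;
   end;
run;

/* MERGE with BY requires both data sets to be sorted by variable. */
proc sort data=spearman1;
   by variable;
run;
proc sort data=hoeffding1;
   by variable;
run;
```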
/* Combine the two restructured data sets; both must be sorted by variable */
data correlations;
   merge spearman1 hoeffding1;
   by variable;
run;
Print results
proc sort data=correlations;
by ranksp;
run;
If the Spearman rank is high (a weak monotonic association) but the Hoeffding’s D rank is low (a strong general association), then the association is probably not monotonic. (Empirical logit plots can be used to investigate this type of relationship.)
A graph might help.
Get some values to draw reference lines
Plot rank of Spearman vs rank of Hoeffding
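The code for these two steps is not shown here. The sketch below is one possible version, assuming the data set correlations contains the variables ranksp, rankho, spvalue, and hpvalue (the last three names are illustrative). The p-value cutoff of 0.5 is an assumption chosen for illustration, not a value taken from the slides. If no variable exceeds the cutoff, the macro variables do not hold a usable value and the REFLINE statements need to be dropped.

```sas
/* Sketch: rank of the first variable whose p-value exceeds the cutoff. */
/* The 0.5 cutoff is an assumption for illustration.                    */
proc sql noprint;
   select min(ranksp) into :vref
      from correlations
      where spvalue > 0.5;
   select min(rankho) into :href
      from correlations
      where hpvalue > 0.5;
quit;

/* Scatter plot of the two ranks with reference lines at the cutoffs */
proc sgplot data=correlations;
   refline &vref / axis=y;
   refline &href / axis=x;
   scatter y=ranksp x=rankho / datalabel=variable;
   yaxis label="Rank of Spearman";
   xaxis label="Rank of Hoeffding";
run;
```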
Four variables are eliminated from the analysis: HMOwn, MTGBal, MICCBal, and LOCBal.
High ranks for Spearman and low ranks for Hoeffding’s D are found for the variables
DDABal, DepAmt, and ATMAmt. Even though these variables do not have a monotonic
relationship with Ins, some other type of relationship is detected by Hoeffding’s D
statistic. Empirical logit plots should be used to examine these relationships.
The variables remaining
%let screened=
MIPhone Dep MM ILS Income POS CD IRA
brclus1 Sav NSF Age SavBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea
ATMAmt DDABal
DDA brclus2 CC DepAmt Phone ATM LORes brclus4;
Investigate DDABal.
Empirical Logits
$$\operatorname{logit}_i = \log\left(\frac{m_i + 1}{M_i - m_i + 1}\right)$$
where
$m_i$ = number of events in bin $i$
$M_i$ = number of observations in bin $i$
A new macro PlotLogitsSeries
%macro PlotLogitsSeries(indata=,numgrp=7,indepvar=,depvar=);
   /* Bin the input into &numgrp rank groups */
   proc rank data=&indata groups=&numgrp out=Ranks;
      var &indepvar;
      ranks Bin;
   run;
   /* Per bin: mean of the input, event count, bin size, empirical logit */
   proc sql;
      create table toplot as
      select
         avg(&indepvar) as mean label="Mean of group",
         sum(&depvar) as num_chd label="Number of Events",
         count(*) as binsize label="Number at Risk",
         log((calculated num_chd+1)/
             (calculated binsize-calculated num_chd+1)) as logit
      from ranks
      group by bin;
   quit;
   /* Plot the logits against the bin means, with a linear fit for reference */
   proc sgplot data=toplot;
      series x=mean y=logit/markers;
      reg x=mean y=logit;
      title "Estimated Logit Plot &indepvar, &numgrp groups";
   run;
   title;
%mend PlotLogitsSeries;
%PlotLogitsSeries(indata=d.develop_a,numgrp=100,indepvar=ddabal,depvar=ins);
There is a spike in the logits at the $0 balance level. Aside from that spike, the trend is
monotonic but certainly not linear.
Examining the means a little more closely: the spike at $0
proc means data=d.develop;
   class dda;
   var ddabal;
run;
proc freq data=d.develop;
where ddabal=0;
tables dda;
run;
Most of the individuals with exactly $0 balances do not have checking accounts. It
turns out that their balances have been set to $0 as part of the data pre-processing.
This rule seems reasonable from a logical imputation standpoint, but less so for analysis.
The logit plot suggests that those individuals with a $0 balance are behaving like people
with much more than $0 in their checking accounts.
Impute ddabal and add a new variable to d.develop_a.
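/* Store the mean checking balance of customers who have a checking account */
proc sql;
   select mean(ddabal) into :mnbal
   from d.develop_a
   where dda eq 1;
quit;
%put &mnbal;
/* For customers without a checking account, use that mean instead of $0 */
data d.develop_a;
   set d.develop_a;
   imputed_ddabal=ddabal;
   if dda = 0 then imputed_ddabal=&mnbal;
run;
proc means data=d.develop_a;
   var ddabal imputed_ddabal;
run;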
%PlotLogitsSeries(indata=d.develop_a,numgrp=100,
indepvar=imputed_ddabal,depvar=ins);
Plot logits by bin rather than mean
%let indata=d.develop_a;
%let numgrp=100;
%let indepvar=imputed_ddabal;
%let depvar=ins;
proc rank data=&indata groups=&numgrp out=Ranks;
   var &indepvar;
   ranks Bin;
run;
proc sql;
   create table toplot as
   select
      bin label="Bin number",
      avg(&indepvar) as mean label="Mean of group",
      sum(&depvar) as num_chd label="Number of Events",
      count(*) as binsize label="Number at Risk",
      log((calculated num_chd+1)/
          (calculated binsize-calculated num_chd+1)) as logit
   from ranks
   group by bin;
quit;
proc sort data=toplot;
   by bin;
run;
proc sgplot data=toplot;
   series x=bin y=logit/markers;
   reg x=bin y=logit;
   title "Estimated Logit Plot &indepvar, &numgrp groups";
   title2 "Using bin number rather than mean";
run;
title;
Some more "feature engineering"
Assigning new cases to imputed_ddabal “bins” at scoring time is perhaps best
done using percentiles of the distribution.
First get the information for 100 bins
proc rank data=d.develop_a groups=100 out=out;
var imputed_ddabal;
ranks bin;
run;
title;
proc means data = out noprint nway;
class bin;
var imputed_ddabal;
output out=endpts max=max;
run;
Using this information isn’t difficult, but
requires a lot of code.
Using a select construct requires that we write a line of code for each
endpoint.
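To make the shape of such a SELECT construct concrete, here is a small sketch with three bins. The cut points (100 and 1000), the data values, and the data set names work.demo and work.demo_binned are hypothetical and chosen only for illustration; with 100 bins the same pattern needs 99 WHEN lines plus the OTHERWISE line.

```sas
/* Hypothetical data and cut points, for illustration only. */
data work.demo;
   input imputed_ddabal;
   datalines;
50
400
2500
;
run;

/* One WHEN line per bin endpoint; the last bin is caught by OTHERWISE. */
data work.demo_binned;
   set work.demo;
   select;
      when (imputed_ddabal <= 100)  B_DDABal = 0;
      when (imputed_ddabal <= 1000) B_DDABal = 1;
      otherwise                     B_DDABal = 2;
   end;
run;
```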
A program to write the necessary code
/* Assumption: the fileref rank is not defined in the code shown, */
/* so a temporary file is assigned here                           */
filename rank temp;
data _null_;
   file rank;
   set endpts end=last;
   if _n_ = 1 then put "select;";
   if not last then do;
      put " when (imputed_ddabal <= " max ") B_DDABal =" bin ";";
   end;
   else do;
      put " otherwise B_DDABal =" bin ";";
      put "end;";
   end;
run;
A program that uses the code
data d.develop_a;
set d.develop_a;
%include rank / source;
run;
%PlotLogitsSeries(indata=d.develop_a,numgrp=100,
indepvar=b_ddabal,depvar=ins);
The new screened set
%let screened=
MIPhone Dep MM ILS Income POS CD IRA
brclus1 Sav NSF Age SavBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea
ATMAmt b_DDABal
DDA brclus2 CC DepAmt Phone ATM LORes brclus4;