Using Results From Proc Corr For Variable Screening

The document discusses using Spearman and Hoeffding's D correlation statistics to screen variables for feature engineering. It restructures the correlation output and plots the ranks to identify variables for exclusion or further investigation. A variable is identified where the zero value is imputed and the relationship with the target reexamined.

Using results from PROC CORR for Variable Screening

1
Feature Engineering

2
The Spearman correlation statistic is the correlation between the ranks of an input variable
and the ranks of the binary target, so it measures monotonic association.

Hoeffding's D detects a wide variety of associations between two variables, including non-monotonic ones.
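
The difference between the two statistics can be seen on a small simulated example. The sketch below is illustrative only: the data set toy, the variables x and y, the seed, and the U-shaped relationship are all assumptions made for the demonstration and are not part of the deck's data. Because y depends on x through a symmetric curve, the Spearman correlation should be close to zero while Hoeffding's D should still flag an association.

```sas
/* Hypothetical toy data: y is a U-shaped (non-monotonic) function of x */
data toy;
   call streaminit(12345);                   /* arbitrary seed for reproducibility */
   do i = 1 to 500;
      x = rand('uniform') * 2 - 1;           /* x between -1 and 1 */
      y = x**2 + 0.05 * rand('normal');      /* symmetric curve plus small noise */
      output;
   end;
   drop i;
run;

/* Request both statistics for the same pair of variables */
proc corr data=toy spearman hoeffding;
   var x;
   with y;
run;
```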

3
Compare the results of Spearman and Hoeffding, paying attention to:

Neither measure shows a relationship: drop the variable.
   Decision based on p-value.

Hoeffding results in a higher measure than Spearman: perhaps some "feature engineering" is needed.
   Use ranking of measures for decisions.

4
The rank option in PROC CORR: some details

5
The set for consideration.

%let reduced=
MIPhone MICCBal Dep MM ILS MTGBal Income POS CD IRA
brclus1 Sav NSF Age SavBal LOCBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea ATMAmt DDABal
DDA brclus2 CC HMOwn DepAmt Phone ATM LORes brclus4;

6
The rank option in PROC CORR

%let reduced=
MIPhone MICCBal Dep MM ILS MTGBal Income POS CD IRA
brclus1 Sav NSF Age SavBal LOCBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea ATMAmt DDABal
DDA brclus2 CC HMOwn DepAmt Phone ATM LORes brclus4;

ods output spearmancorr=spearman
           hoeffdingcorr=hoeffding;

proc corr data=d.develop_a spearman hoeffding rank;
   var &reduced;
   with ins;
run;

7
proc contents data=spearman;run;
proc print data=hoeffding;run;

The variable names in the SAS data sets Spearman and Hoeffding are in the variables best1
through best39

The correlation statistics are in the variables r1 through r39

The p-values are in the variables p1 through p39.

8
We need to restructure the data sets so that the identifier is the variable
name and there is a single observation for each variable name.

We also want to keep the correlation measure, its rank, and its p-value
for each observation (named differently in the two data sets).

9
Restructure Spearman data
%let nvar=39; /* number of variables in the reduced set */

data spearman1(keep=variable scorr spvalue ranksp);
   length variable $ 8;
   set spearman;
   array best(*) best1--best&nvar;   /* variable names, strongest first */
   array r(*) r1--r&nvar;            /* correlation statistics */
   array p(*) p1--p&nvar;            /* p-values */
   do i=1 to dim(best);
      variable=best(i);
      scorr=r(i);
      spvalue=p(i);
      ranksp=i;                      /* position in the ranked list */
      output;
   end;
run;
10
Restructure Hoeffding data.

data hoeffding1(keep=variable hcorr hpvalue rankho);
   length variable $ 8;
   set hoeffding;
   array best(*) best1--best&nvar;
   array r(*) r1--r&nvar;
   array p(*) p1--p&nvar;
   do i=1 to dim(best);
      variable=best(i);
      hcorr=r(i);
      hpvalue=p(i);
      rankho=i;
      output;
   end;
run;
11
Merge the two data sets by variable name.

proc sort data=spearman1;
   by variable;
run;

proc sort data=hoeffding1;
   by variable;
run;

data correlations;
   merge spearman1 hoeffding1;
   by variable;
run;
12
Print results
proc sort data=correlations;
   by ranksp;
run;

proc print data=correlations label split='*';
   var variable ranksp rankho scorr spvalue hcorr hpvalue;
   label ranksp  = 'Spearman rank*of variables'
         scorr   = 'Spearman Correlation'
         spvalue = 'Spearman p-value'
         rankho  = 'Hoeffding rank*of variables'
         hcorr   = 'Hoeffding Correlation'
         hpvalue = 'Hoeffding p-value';
   title "Rank of Spearman Correlations and Hoeffding Correlations";
run;
title;
13
A low rank (near 1) means a strong association and therefore a low p-value.

If the Spearman rank is high but the Hoeffding's D rank is low, then there may be an association that is not monotonic. (Empirical logit plots can be used to investigate this type of relationship.)
A graph might help.

14
Get some values to draw reference lines

proc sql noprint;
   /* lowest Spearman rank whose p-value exceeds 0.5 */
   select min(ranksp) into :vref
      from correlations
      where spvalue > .5;
   /* lowest Hoeffding rank whose p-value exceeds 0.5 */
   select min(rankho) into :href
      from correlations
      where hpvalue > .5;
quit;

15
Plot rank of Spearman vs rank of Hoeffding

proc sgplot data=correlations;
   refline &vref / axis=y;
   refline &href / axis=x;
   scatter y=ranksp x=rankho / datalabel=variable;
   yaxis label="Rank of Spearman";
   xaxis label="Rank of Hoeffding";
   title "Scatter Plot of the Ranks of Spearman vs. Hoeffding";
run;
title;
16
In general, the upper right corner of the plot contains the names of variables that
could reasonably be excluded from further analysis, due to their poor rank on both
metrics. The criterion to use in eliminating variables is a subjective decision.

Four variables are eliminated from the analysis: HMOwn, MTGBal, MICCBal, and LOCBal.

High ranks for Spearman and low ranks for Hoeffding’s D are found for the variables
DDABal, DepAmt, and ATMAmt. Even though these variables do not have a monotonic
relationship with Ins, some other type of relationship is detected by Hoeffding’s D
statistic. Empirical logit plots should be used to examine these relationships.
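
The regions of the plot can also be listed as a table. The following is a minimal sketch, not the deck's method: it assumes the macro variables vref and href from the reference-line step exist (they are not created if no p-value exceeds 0.5) and treats them as cutoffs, which is only one possible choice since the elimination criterion is subjective. The data set quadrants and the variable region are hypothetical names.

```sas
/* Label each variable by where it falls relative to the reference lines */
data quadrants;
   set correlations;
   length region $ 28;
   if ranksp >= &vref and rankho >= &href then
      region = 'weak on both: drop?';        /* upper right corner */
   else if ranksp >= &vref and rankho < &href then
      region = 'non-monotonic? inspect';     /* poor Spearman, better Hoeffding */
   else region = 'keep';
run;

/* Show only the variables that need a decision */
proc print data=quadrants;
   where region ne 'keep';
   var variable ranksp rankho region;
run;
```

Variables flagged for inspection are the ones to examine with empirical logit plots before deciding.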

17
The variables remaining

%let screened=
MIPhone Dep MM ILS Income POS CD IRA
brclus1 Sav NSF Age SavBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea
ATMAmt DDABal
DDA brclus2 CC DepAmt Phone ATM LORes brclus4;

18
Investigate DDABal.

19
Empirical Logits

$$\log\left(\frac{m_i + 1}{M_i - m_i + 1}\right)$$

where
$m_i$ = number of events in bin $i$
$M_i$ = number of observations in bin $i$
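
As a worked illustration of the empirical logit, the sketch below evaluates the expression for two hypothetical bins. The counts are invented for the example. The second bin has no events, which shows why 1 is added to both counts: without it the logarithm of zero would be undefined.

```sas
/* Empirical logit for two hypothetical bins */
data _null_;
   /* bin with 12 events among 100 observations */
   events = 12;
   nobs   = 100;
   elogit = log((events + 1) / (nobs - events + 1));
   put events= nobs= elogit= 8.4;

   /* bin with no events: still defined thanks to the +1 terms */
   events = 0;
   elogit = log((events + 1) / (nobs - events + 1));
   put events= nobs= elogit= 8.4;
run;
```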
20
A new macro PlotLogitsSeries
%macro PlotLogitsSeries(indata=,numgrp=7,indepvar=,depvar=);
   proc rank data=&indata groups=&numgrp out=Ranks;
      var &indepvar;
      ranks Bin;
   run;
   proc sql;
      create table toplot as
      select
         avg(&indepvar) as mean label="Mean of group",
         sum(&depvar) as num_chd label="Number of Events",
         count(*) as binsize label="Number at Risk",
         log((calculated num_chd+1)/
             (calculated binsize-calculated num_chd+1)) as logit
      from ranks
      group by bin;
   quit;
   proc sgplot data=toplot;
      series x=mean y=logit/markers;
      reg x=mean y=logit;
      title "Estimated Logit Plot &indepvar, &numgrp groups";
   run;
   title;
%mend PlotLogitsSeries;
21
%PlotLogitsSeries(indata=d.develop_a,numgrp=100,indepvar=ddabal,depvar=ins);

There is a spike in the logits at the $0 balance level. Aside from that spike, the trend is
monotonic but certainly not linear.
22
Examining means a little more closely: the spike at $0
proc means data=d.develop;
   class dda;
   var ddabal;
run;

23
proc freq data=d.develop;
   where ddabal=0;
   tables dda;
run;

24
Most of the individuals with exactly $0 balances do not have checking accounts. It
turns out that their balances have been set to $0 as part of the data pre-processing.
This rule seems reasonable from a logical imputation standpoint, less so for analysis.

The logit plot suggests that those individuals with 0 balance are behaving like people
with much more than $0 in their checking accounts.
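
One way to check this reading directly is to compare the event rate of customers with and without a checking account. This is a minimal sketch that assumes ins is coded 0/1, so that its mean within each group is the proportion of events.

```sas
/* Event rate of the target by checking-account indicator */
proc means data=d.develop_a n mean;
   class dda;
   var ins;     /* mean of a 0/1 target is the event rate */
run;
```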

25
Impute ddabal and add a new variable to d.develop_a.

26
proc sql;
   select mean(ddabal) into :mnbal
      from d.develop_a
      where dda eq 1;
quit;
%put &mnbal;

data d.develop_a;
   set d.develop_a;
   imputed_ddabal=ddabal;
   if dda = 0 then imputed_ddabal=&mnbal;
run;

proc means data=d.develop_a;
   var ddabal imputed_ddabal;
run;

27
%PlotLogitsSeries(indata=d.develop_a,numgrp=100,
indepvar=imputed_ddabal,depvar=ins);

28
Plot logits by bin rather than mean

29
%let indata=d.develop_a;
%let numgrp=100;
%let indepvar=imputed_ddabal;
%let depvar=ins;
proc rank data=&indata groups=&numgrp out=Ranks;
   var &indepvar;
   ranks Bin;
run;
proc sql;
   create table toplot as
   select
      bin label="Bin number",
      avg(&indepvar) as mean label="Mean of group",
      sum(&depvar) as num_chd label="Number of Events",
      count(*) as binsize label="Number at Risk",
      log((calculated num_chd+1)/
          (calculated binsize-calculated num_chd+1)) as logit
   from ranks
   group by bin;
quit;
proc sort data=toplot;
   by bin;
run;
proc sgplot data=toplot;
   series x=bin y=logit/markers;
   reg x=bin y=logit;
   title "Estimated Logit Plot &indepvar, &numgrp groups";
   title2 "Using bin number rather than mean";
run;
title;
30
31
Some more "feature engineering"

Scoring new cases with imputed_ddabal "bins" is perhaps best
done using percentiles of the distribution.

32
First get the information for 100 bins
proc rank data=d.develop_a groups=100 out=out;
   var imputed_ddabal;
   ranks bin;
run;

title;
proc means data=out noprint nway;
   class bin;
   var imputed_ddabal;
   output out=endpts max=max;
run;

proc print data=endpts;
run;

33
Using this information isn't difficult, but it requires a lot of code.
Using a select construct requires that we write a line of code for each
endpoint.
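
To make the shape of that hand-written code concrete, here is a minimal sketch of a select construct with one line per endpoint. The data set new_cases, its values, and the endpoints 100, 500, and 5000 are hypothetical; the real endpoints come from the maximum of imputed_ddabal in each bin.

```sas
/* Hypothetical new cases to score */
data new_cases;
   input imputed_ddabal;
   datalines;
0
250.75
1200
98000
;
run;

/* One WHEN line per bin endpoint; the first true condition wins */
data scored;
   set new_cases;
   select;
      when (imputed_ddabal <= 100)  B_DDABal = 0;   /* hypothetical endpoint */
      when (imputed_ddabal <= 500)  B_DDABal = 1;   /* hypothetical endpoint */
      when (imputed_ddabal <= 5000) B_DDABal = 2;   /* hypothetical endpoint */
      otherwise                     B_DDABal = 3;   /* everything above the last endpoint */
   end;
run;

proc print data=scored;
run;
```

With 100 bins this pattern needs 100 such lines, which is why generating the code is attractive.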

34
A program to write the necessary code

filename rank "C:\tmp\rank.sas";

data _null_;
   file rank;
   set endpts end=last;
   if _n_ = 1 then put "select;";
   if not last then do;
      put " when (imputed_ddabal <= " max ") B_DDABal =" bin ";";
   end;
   else if last then do;
      put " otherwise B_DDABal =" bin ";";
      put "end;";
   end;
run;

35
A program that uses the code

data d.develop_a;
   set d.develop_a;
   %include rank / source;
run;

proc means data=d.develop_a min max;
   class B_DDABal;
   var imputed_DDABal;
run;

36
%PlotLogitsSeries(indata=d.develop_a,numgrp=100,
indepvar=b_ddabal,depvar=ins);

37
The new screened set

%let screened=
MIPhone Dep MM ILS Income POS CD IRA
brclus1 Sav NSF Age SavBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea
ATMAmt b_DDABal
DDA brclus2 CC DepAmt Phone ATM LORes brclus4;

38
