Using Results from PROC CORR for Variable Screening

The document discusses using the Spearman and Hoeffding's D correlation statistics to screen variables for feature engineering. It restructures the correlation output and plots the ranks to identify variables for exclusion or further investigation. For one variable, DDABal, the $0 values turn out to be imputed; the variable is re-imputed and its relationship with the target is reexamined.

Uploaded by

Reno Felipe Tavares Costa

Feature Engineering

The Spearman correlation statistic is the correlation of the ranks of the input variables
with the binary target.

Hoeffding’s D detects a wide variety of associations between two variables.
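
As a minimal sketch of the difference between the two statistics, the step below builds a small synthetic data set in which y depends on x through a U-shaped, non-monotonic relationship. The data set name demo and the variables x and y are made up for illustration and are not part of the original analysis. Under these assumptions the Spearman correlation should be close to zero, while Hoeffding's D should still indicate an association.

```sas
/* Hypothetical example data: y is a U-shaped function of x. */
data demo;
   do x = -50 to 50;
      y = x*x;
      output;
   end;
run;

/* Request both statistics for the same pair of variables. */
proc corr data=demo spearman hoeffding;
   var x;
   with y;
run;
```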

Compare the Spearman and Hoeffding results, paying attention to two cases:

Neither measure shows a relationship: drop the variable.

Base this decision on the p-values.

Hoeffding's D indicates a stronger association than Spearman: the variable may need some "feature engineering".

Compare the ranks of the two measures to decide.

The rank option in PROC CORR: some details

The set for consideration.

%let reduced=
MIPhone MICCBal Dep MM ILS MTGBal Income POS CD IRA
brclus1 Sav NSF Age SavBal LOCBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea ATMAmt DDABal
DDA brclus2 CC HMOwn DepAmt Phone ATM LORes brclus4;

The rank option in PROC CORR

ods output spearmancorr=spearman

hoeffdingcorr=hoeffding;

proc corr data=d.develop_a spearman hoeffding rank;

var &reduced;
with ins;
run;

proc contents data=spearman;run;
proc print data=hoeffding;run;

In the SAS data sets Spearman and Hoeffding, the names of the input variables are stored in the variables best1 through best39, in rank order (best1 holds the top-ranked variable).

The correlation statistics are in the variables r1 through r39.

The p-values are in the variables p1 through p39.

We need to restructure the data sets so that the identifier is the variable name and there is a single observation for each variable name.

We also want to keep the correlation measure, its rank, and its p-value for each observation (named differently in the two data sets).

Restructure Spearman data
%let nvar=39;/*reduced set*/

data spearman1(keep=variable scorr spvalue ranksp);

length variable $ 8;
set spearman;
array best(*) best1--best&nvar;
array r(*) r1--r&nvar;
array p(*) p1--p&nvar;
do i=1 to dim(best);
variable=best(i);
scorr=r(i);
spvalue=p(i);
ranksp=i;
output;
end;
run;

Restructure Hoeffding data.

data hoeffding1(keep=variable hcorr hpvalue rankho);

length variable $ 8;
set hoeffding;
array best(*) best1--best&nvar;
array r(*) r1--r&nvar;
array p(*) p1--p&nvar;
do i=1 to dim(best);
variable=best(i);
hcorr=r(i);
hpvalue=p(i);
rankho=i;
output;
end;
run;

Merge the two data sets by variable name.

proc sort data=spearman1;

by variable;
run;

proc sort data=hoeffding1;

by variable;
run;

data correlations;
merge spearman1 hoeffding1;
by variable;
run;

Print results
proc sort data=correlations;
by ranksp;
run;

proc print data=correlations label split='*';

var variable ranksp rankho scorr spvalue hcorr hpvalue;
label ranksp = 'Spearman rank*of variables'
scorr = 'Spearman Correlation'
spvalue = 'Spearman p-value'
rankho = 'Hoeffding rank*of variables'
hcorr = 'Hoeffding Correlation'
hpvalue = 'Hoeffding p-value';
title "Rank of Spearman Correlations and Hoeffding Correlations";
run;
title;

A low rank means a strong association and therefore a low p-value.

If the Spearman rank is high but the Hoeffding's D rank is low, then there may be an association, and it is probably not monotonic. (Empirical logit plots can be used to investigate this type of relationship.)
A graph might help.
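
One way to shortlist such variables before plotting is to compare the two ranks directly. This is only a sketch: the data set name investigate, the variable rankgap, and the cutoff of 10 rank positions are arbitrary assumptions, not part of the original analysis.

```sas
/* Flag variables that rank much better on Hoeffding's D than on Spearman. */
/* The gap of 10 positions is an arbitrary, illustrative threshold.        */
data investigate;
   set correlations;
   rankgap = ranksp - rankho;
   if rankgap >= 10;
run;

proc print data=investigate;
   var variable ranksp rankho rankgap;
run;
```

Any variable listed here would still need an empirical logit plot; the gap only suggests where to look first.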

Get some values to draw reference lines

proc sql noprint;

/* Best (lowest) Spearman rank among variables with p-value > .5 */
select min(ranksp) into :vref
from correlations
where spvalue > .5;
/* Best (lowest) Hoeffding rank among variables with p-value > .5 */
select min(rankho) into :href
from correlations
where hpvalue > .5;
quit;

Plot rank of Spearman vs rank of Hoeffding

proc sgplot data=correlations;

refline &vref / axis=y;
refline &href / axis=x;
scatter y=ranksp x=rankho / datalabel=variable;
yaxis label="Rank of Spearman";
xaxis label="Rank of Hoeffding";
title "Scatter Plot of the Ranks of Spearman vs. Hoeffding";
run;
title;

In general, the upper right corner of the plot contains the names of variables that
could reasonably be excluded from further analysis, due to their poor rank on both
metrics. The criterion to use in eliminating variables is a subjective decision.

Four variables are eliminated from the analysis: HMOwn, MTGBal, MICCBal, and LOCBal.

High ranks for Spearman and low ranks for Hoeffding’s D are found for the variables
DDABal, DepAmt, and ATMAmt. Even though these variables do not have a monotonic
relationship with Ins, some other type of relationship is detected by Hoeffding’s D
statistic. Empirical logit plots should be used to examine these relationships.
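
The reduced list can also be rebuilt from the correlations data set instead of being retyped by hand. A minimal sketch, assuming the four variables named above are the only exclusions; the macro variable name screened_auto is hypothetical. The names come back in the data set's current sort order and in whatever case PROC CORR stored them.

```sas
/* Collect every variable name except the four eliminated ones. */
proc sql noprint;
   select variable
      into :screened_auto separated by ' '
      from correlations
      where upcase(variable) not in ('HMOWN', 'MTGBAL', 'MICCBAL', 'LOCBAL');
quit;

%put &screened_auto;
```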

The variables remaining

%let screened=
MIPhone Dep MM ILS Income POS CD IRA
brclus1 Sav NSF Age SavBal NSFAmt Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk AcctAge InArea
ATMAmt DDABal
DDA brclus2 CC DepAmt Phone ATM LORes brclus4;

Investigate DDABal.

Empirical Logits

\[
\log\left( \frac{m_i + 1}{M_i - m_i + 1} \right)
\]

where
m_i = number of events
M_i = number of observations
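
As a quick check of the formula, the sketch below evaluates it for a single made-up bin with 9 events among 100 observations. The variable names events and nobs and both counts are assumptions chosen only for illustration. Adding 1 to the numerator and denominator keeps the logarithm defined when a bin has no events or consists entirely of events.

```sas
/* Hypothetical bin: 9 events among 100 observations. */
data _null_;
   events = 9;
   nobs   = 100;
   logit  = log((events + 1) / (nobs - events + 1));
   put logit=;   /* log(10/92), roughly -2.22 */
run;
```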
A new macro PlotLogitsSeries
%macro PlotLogitsSeries(indata=,numgrp=7,indepvar=,depvar=);
proc rank data=&indata groups=&numgrp out=Ranks;
var &indepvar;
ranks Bin;
run;
proc sql;
create table toplot as
select
avg(&indepvar) as mean label="Mean of group",
sum(&depvar) as num_chd label="Number of Events",
count(*) as binsize label="Number at Risk",
log((calculated num_chd+1)/
(calculated binsize-calculated num_chd+1)) as logit
from ranks
group by bin;
quit;
proc sgplot data=toplot;
series x=mean y=logit/markers;
reg x=mean y=logit;
title "Estimated Logit Plot &indepvar, &numgrp groups";
run;
title;
%mend PlotLogitsSeries;

%PlotLogitsSeries(indata=d.develop_a,numgrp=100,indepvar=ddabal,depvar=ins);

There is a spike in the logits at the $0 balance level. Aside from that spike, the trend is
monotonic but certainly not linear.

Examining means a little more closely: the spike at $0
proc means data=d.develop;
class dda;
var ddabal;
run;

proc freq data=d.develop;
where ddabal=0;
tables dda;
run;

Most of the individuals with exactly $0 balances do not have checking accounts. It
turns out that their balances have been set to $0 as part of the data pre-processing.
This rule seems reasonable from a logical imputation standpoint, less so for analysis.

The logit plot suggests that those individuals with 0 balance are behaving like people
with much more than $0 in their checking accounts.

Impute ddabal and add a new variable to d.develop_a.

proc sql;
select mean(ddabal) into : mnbal
from d.develop_a
where dda eq 1
;
quit;
%put &mnbal;

data d.develop_a;
set d.develop_a;
imputed_ddabal=ddabal;
if dda = 0 then imputed_ddabal=&mnbal;
run;
proc means data=d.develop_a;
var ddabal imputed_ddabal;
run;

%PlotLogitsSeries(indata=d.develop_a,numgrp=100,
indepvar=imputed_ddabal,depvar=ins);

Plot logits by bin rather than mean

%let indata=d.develop_a;
%let numgrp=100;
%let indepvar=imputed_ddabal;
%let depvar=ins;
proc rank data=&indata groups=&numgrp out=Ranks;
var &indepvar;
ranks Bin;
run;
proc sql;
create table toplot as
select
bin label="Bin number",
avg(&indepvar) as mean label="Mean of group",
sum(&depvar) as num_chd label="Number of Events",
count(*) as binsize label="Number at Risk",
log((calculated num_chd+1)/
(calculated binsize-calculated num_chd+1)) as logit
from ranks
group by bin;
quit;
proc sort data=toplot;by bin;run;
proc sgplot data=toplot;
series x=bin y=logit/markers;
reg x=bin y=logit;
title "Estimated Logit Plot &indepvar, &numgrp groups";
title2 "Using bin number rather than mean";
run;
title;

Some more "feature engineering"

Scoring new cases with imputed_ddabal "bins" is perhaps best done using percentiles of the distribution.

First get the information for 100 bins
proc rank data=d.develop_a groups=100 out=out;
var imputed_ddabal;
ranks bin;
run;

title;
proc means data = out noprint nway;
class bin;
var imputed_ddabal;
output out=endpts max=max;
run;

proc print data = endpts;

run;

Using this information isn't difficult, but it requires a lot of code.
A select construct needs one line of code for each endpoint.
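
To make the shape of such a select construct concrete, here is a small sketch with only three bins. The data set names example and example_binned, the four balances, and the cut-offs 100 and 1000 are all invented for illustration; the real endpoints come from the endpts data set.

```sas
/* Hypothetical balances to bin; the values are invented. */
data example;
   input imputed_ddabal @@;
   datalines;
0 150 820 4300
;
run;

/* One WHEN clause per bin endpoint; the cut-offs 100 and 1000 are invented. */
data example_binned;
   set example;
   select;
      when (imputed_ddabal <= 100)  B_DDABal = 0;
      when (imputed_ddabal <= 1000) B_DDABal = 1;
      otherwise                     B_DDABal = 2;
   end;
run;
```

With 100 bins the same pattern needs 99 WHEN clauses plus the OTHERWISE clause, which is why generating the code is attractive.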

A program to write the necessary code

filename rank "C:\tmp\rank.sas";

data _null_;
file rank;
set endpts end=last;
if _n_ = 1 then put "select;";
if not last then do;
put " when (imputed_ddabal <= " max ") B_DDABal =" bin ";";
end;
else if last then do;
put " otherwise B_DDABal =" bin ";";
put "end;";
end;
run;

A program that uses the code

data d.develop_a;
set d.develop_a;
%include rank / source;
run;

proc means data = d.develop_a min max;

class B_DDABal;
var imputed_DDABal;
run;

%PlotLogitsSeries(indata=d.develop_a,numgrp=100,
indepvar=b_ddabal,depvar=ins);

The new screened set

%let screened=
MIPhone Dep MM ILS Income
POS CD IRA
brclus1 Sav NSF Age SavBal NSFAmt
Inv MIHMVal CRScore
MIAcctAg InvBal DirDep CCPurc SDB CashBk
AcctAge InArea ATMAmt b_DDABal
DDA brclus2 CC DepAmt Phone ATM LORes
brclus4;
