Data Exploration & Descriptive Analysis - Tutorial 2
Chapter 4
TIP To determine which node an icon represents, position the mouse pointer over
the icon and read the tooltip.
3. Connect the DONOR_RAW_DATA input data source node to the StatExplore node.
To connect the two nodes, position the mouse pointer over the right edge of the input
data source node until the pointer becomes a pencil. With the left mouse button held
down, drag the pencil to the left edge of the StatExplore node. Then, release the
mouse button. An arrow between the two nodes indicates a successful connection.
4. Select the StatExplore node. In the Properties Panel, scroll down to view the Chi-
Square Statistics properties group. Click the value of Interval Variables and select
Yes from the drop-down menu that appears.
Chi-square statistics are always computed for categorical variables. Changing the
selection for interval variables causes SAS Enterprise Miner to distribute interval
variables into five (by default) bins and compute chi-square statistics for the binned
variables when you run the node.
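To make the binning step concrete, the following is a minimal Python sketch of the idea, not the SAS Enterprise Miner implementation. It assumes equal-width bins (the binning method that the node actually uses is not stated here), and the names gift_amount and target are hypothetical illustration data.

```python
from collections import Counter

def equal_width_bins(values, n_bins=5):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def chi_square(rows, cols):
    """Pearson chi-square statistic for two equal-length lists of labels."""
    n = len(rows)
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    observed = Counter(zip(rows, cols))
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

# Hypothetical data: one interval input and a binary target.
gift_amount = [5, 10, 12, 20, 25, 40, 45, 60, 80, 100]
target = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

bins = equal_width_bins(gift_amount, n_bins=5)
stat = chi_square(bins, target)
print(bins, round(stat, 3))
```

After binning, the interval variable is treated exactly like a categorical variable, which is why the same chi-square computation applies to both.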
5. In the Diagram Workspace, right-click the StatExplore node, and select Run from
the resulting menu. Click Yes in the Confirmation window that opens.
When you run a node, all of the nodes preceding it in the process flow are also run in
order, beginning with the first node that has changed since the flow was last run. If
no nodes other than the one that you select have changed since the last run, then only
the node that you select is run. You can watch the icons in the process flow diagram
to monitor the status of execution.
• Nodes that are outlined in green are currently running.
• Nodes that are denoted with a check mark inside a green circle have successfully
run.
• Nodes that are outlined in red have failed to run due to errors.
In this example, the DONOR_RAW_DATA input data source node had not yet been run.
Therefore, both nodes are run when you choose to run the StatExplore node.
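The run rule described above can be sketched in a few lines of Python. This is only an illustration of the logic for a simple linear flow; the function nodes_to_run and its arguments are hypothetical and are not part of SAS Enterprise Miner.

```python
def nodes_to_run(flow, changed, selected):
    """Return the nodes that execute when `selected` is run.

    flow     -- node names in process-flow order (assumes a linear flow)
    changed  -- set of nodes that have changed since the flow was last run
    selected -- the node that the user chose to run
    """
    upto = flow[: flow.index(selected) + 1]
    for i, node in enumerate(upto):
        if node in changed:
            # Run from the first changed node through the selected node.
            return upto[i:]
    # Nothing upstream has changed: only the selected node is run.
    return [selected]

flow = ["DONOR_RAW_DATA", "StatExplore"]
print(nodes_to_run(flow, {"DONOR_RAW_DATA", "StatExplore"}, "StatExplore"))
```

With both nodes marked as changed, the sketch returns both node names, which matches the behavior in this example.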
6. In the window that appears when processing completes, click Results. The Results
window appears.
Note: Panels in Results windows might not have the same arrangement on your
screen, due to window resizing. When the Results window is resized, SAS
Enterprise Miner redistributes panels for optimal viewing.
The Results window displays the following:
• a plot that orders the variables by their worth in predicting the target variable.
Note: In the StatExplore node, SAS Enterprise Miner calculates variable worth
using the Gini split worth statistic that would be generated by building a
decision tree of depth 1. For detailed information about Gini split worth, see
the SAS Enterprise Miner Help.
• the SAS output from the node.
• a plot that orders the top 20 variables by their chi-square statistics. You can also
choose to view the top 20 variables ordered by their Cramer's V statistics on this
plot.
TIP In SAS Enterprise Miner, you can select graphs, tables, and rows within
tables and select Copy from the right-click pop-up menu to copy these items for
subsequent pasting in other applications such as Microsoft Word and Microsoft
Excel.
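For readers who want to see how Cramer's V relates to the chi-square statistic, the following Python sketch uses the standard textbook definition, V = sqrt(chi-square / (n * (min(rows, columns) - 1))). It is an illustration only; the exact variant that SAS Enterprise Miner reports (for example, whether the value is signed for 2-by-2 tables) is not confirmed here, and the sample data are hypothetical.

```python
import math
from collections import Counter

def cramers_v(rows, cols):
    """Unsigned Cramer's V for two equal-length lists of category labels."""
    n = len(rows)
    row_totals, col_totals = Counter(rows), Counter(cols)
    observed = Counter(zip(rows, cols))
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    # With a single level on either side, association is undefined; return 0.
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical data: a class input that perfectly separates a binary target.
v_perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
# Hypothetical data: a class input that is independent of the target.
v_none = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(v_perfect, v_none)
```

Because V is scaled by the sample size and the table dimensions, it stays between 0 and 1, which makes it easier to compare across variables with different numbers of levels than the raw chi-square statistic.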
7. Expand the Output window, and then scroll to the Class Variable Summary
Statistics and the Interval Variable Summary Statistics sections of the output.
• Notice that two class variables and two interval variables have missing
values. Later in the example, you will impute values to use in place of the
missing values for these variables.
• Notice that several variables have relatively large standard deviations. Later in
the example, you will plot the data and explore transformations that can reduce
the variances of these variables.
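The two checks above (missing-value counts and standard deviations) can be approximated outside the tool with a short Python sketch. The column values below are hypothetical, None stands in for a SAS missing value, and the log transformation is only one assumed example of a variance-reducing transformation, not necessarily the one used later in the example.

```python
import math
import statistics

def summarize(column):
    """Count missing values and compute the mean and sample standard deviation."""
    present = [v for v in column if v is not None]
    return {
        "n_missing": len(column) - len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),
    }

# Hypothetical, strongly right-skewed interval variable with one missing value.
amounts = [5.0, 10.0, 20.0, None, 50.0, 1000.0]

raw = summarize(amounts)
# log(1 + x) compresses large values; missing values stay missing.
logged = summarize([None if v is None else math.log1p(v) for v in amounts])
print(raw["n_missing"], round(raw["std"], 2), round(logged["std"], 2))
```

The standard deviation of the transformed column is much smaller than that of the raw column, which is the effect the later transformation steps aim for.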
Chapter 4 • Explore the Data and Replace Input Values
4. Select the Data Partition node. In the Properties Panel, scroll down to view the Data
Set Allocations in the Train properties.
• Click the value of Training, and enter 55.0
• Click the value of Validation, and enter 45.0
• Click the value of Test, and enter 0.0
These properties define the percentage of input data that is used in each type of
mining data set. In this example, you use a training data set and a validation data set,
but you do not use a test data set.
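As an illustration of what these percentages mean, the following Python sketch splits a list of rows 55/45/0 by simple random sampling. It is an assumption-laden stand-in: the Data Partition node may use a different method (such as stratified sampling), and the function partition and its seed value are hypothetical.

```python
import random

def partition(rows, train_pct=55.0, valid_pct=45.0, test_pct=0.0, seed=12345):
    """Randomly split rows into training, validation, and test lists."""
    if abs(train_pct + valid_pct + test_pct - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_pct / 100.0)
    n_valid = round(len(shuffled) * valid_pct / 100.0)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains (empty for 55/45/0)
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Every row lands in exactly one of the three lists, so the training and validation sets never overlap.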
5. In the Diagram Workspace, right-click the Data Partition node, and select Run from
the resulting menu. Click Yes in the Confirmation window that opens.
6. In the window that appears when processing completes, click OK.
Replace Missing Values
4. Select the Replacement node. In the Properties Panel, scroll down to view the Train
properties.
a. For Interval Variables, click the value of Default Limits Method, and select
None from the drop-down menu. This selection indicates that no values of
interval variables should be replaced. With the default selection, a particular
range for the values of each interval variable would have been enforced. In this
example, you do not want to enforce such a range.
Note: In this data set, all missing interval variable values are correctly coded as
SAS missing values (a blank or a period).
b. For Class Variables, click the ellipsis button that represents the value of
Replacement Editor. The Replacement Editor opens.
• Notice that SES and URBANICITY both have a level that contains
observations with the value ?. For these two variables, this level represents
observations with missing values. Enter _MISSING_ as the Replacement
Value for the two rows, as shown in the following image. This action enables
SAS Enterprise Miner to recognize that the question marks indicate missing
values for these two variables. Later, you will impute values for observations
with missing values.
In the data that is exported from the Replacement node, a new variable is created for
each variable that is replaced (in this example, SES, URBANICITY, and
DONOR_GENDER). The original variable is not overwritten. Instead, the new variable
has the same name as the original variable, prefixed with REP_. The original
version of each variable also remains in the exported data, with the role Rejected.
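The naming convention can be illustrated with a small Python sketch. It is not how the Replacement node is implemented: the function apply_replacement is hypothetical, the sample rows are invented, and None is used to stand in for the _MISSING_ replacement value.

```python
def apply_replacement(rows, variable, mapping):
    """Add a REP_-prefixed copy of `variable` with selected values replaced."""
    rep_name = "REP_" + variable
    out = []
    for row in rows:
        new_row = dict(row)  # copy the row so the original variable is kept as-is
        value = row[variable]
        new_row[rep_name] = mapping.get(value, value)  # unmapped values pass through
        out.append(new_row)
    return out

# Hypothetical rows: the "?" level marks a missing value for SES.
rows = [{"SES": "1"}, {"SES": "?"}, {"SES": "3"}]
replaced = apply_replacement(rows, "SES", {"?": None})
print(replaced)
```

Each output row carries both SES and REP_SES, which mirrors the exported data: downstream nodes use the REP_ version, while the original is retained but rejected.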
To view the data that is exported by a node, click the ellipsis button that represents the
value of the General property Exported Data in the Properties Panel. To view the
exported variables, click Properties in the window that opens, and then view the
Variables tab. Similarly, you can view the data that is imported and used by a node by
clicking the ellipsis button that represents the value of the General property Imported
Data in the Properties Panel.
TIP “Predictive Modeling with SAS Enterprise Miner: Practical Solutions for
Business Applications” provides examples of and options for the StatExplore and
Replacement nodes. The book also discusses alternate configurations for the Data
Partition node.