Data Exploration & Descriptive Analysis - Tutorial 2

Uploaded by

NURASYIKIN shamsuddin

Chapter 4

Explore the Data and Replace Input Values

About the Tasks That You Will Perform
Generate Descriptive Statistics
Partition the Data
Replace Missing Values

About the Tasks That You Will Perform

You have already set up the project and defined the input data source that you will use in
this example. Now, you will import the data and perform the following tasks, which help
you learn properties of the input data and prepare it for subsequent modeling:
1. You will explore the statistical properties of the variables in the input data set. The
results that are generated in this step will give you an idea of which variables are
most useful in predicting the target response (whether a person donates or not) in this
data set.
2. You will partition the data into two data sets, a training data set and a validation data
set. Such partitioning is common practice in data mining and enables you to develop
a complete model that is not overfitted to a particular set of data.
3. You will specify how SAS Enterprise Miner should handle missing values of
predictor variables.
TIP It is always a good idea to plot the input data and to check it for missing values
before you proceed to model building. Knowing the statistical properties of your
input data is essential for building an accurate and robust predictive model.
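As a rough illustration of the kind of check the tip describes, the following Python sketch counts missing values and computes the mean and sample standard deviation of one interval variable. It is not how SAS Enterprise Miner computes its summaries; the function name summarize, the use of None to mark a missing value, and the sample values are all assumptions made for this example.

```python
import math

def summarize(values):
    """Summarize one interval variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n if n else float("nan")
    # Sample standard deviation (n - 1 in the denominator).
    if n > 1:
        std = math.sqrt(sum((v - mean) ** 2 for v in present) / (n - 1))
    else:
        std = float("nan")
    return {"n": n, "missing": len(values) - n, "mean": mean, "std": std}

# Hypothetical values, not taken from the tutorial's data set.
ages = [34, None, 51, 46, None, 29]
summary = summarize(ages)
print(summary)
```

A large count in the missing field, or a standard deviation that is large relative to the mean, flags a variable that might need imputation or transformation before modeling.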

Generate Descriptive Statistics

To use the StatExplore node to produce a statistical summary of the input data:
1. Select the Explore tab on the Toolbar.
2. Select the StatExplore node icon. Drag the node into the Diagram Workspace.
TIP To determine which node an icon represents, position the mouse pointer over
the icon and read the tooltip.
3. Connect the DONOR_RAW_DATA input data source node to the StatExplore node.
To connect the two nodes, position the mouse pointer over the right edge of the input
data source node until the pointer becomes a pencil. With the left mouse button held
down, drag the pencil to the left edge of the StatExplore node. Then, release the
mouse button. An arrow between the two nodes indicates a successful connection.

4. Select the StatExplore node. In the Properties Panel, scroll down to view the Chi-
Square Statistics properties group. Click the value of Interval Variables and select
Yes from the drop-down menu that appears.
Chi-square statistics are always computed for categorical variables. Changing the
selection for interval variables causes SAS Enterprise Miner to distribute interval
variables into five (by default) bins and compute chi-square statistics for the binned
variables when you run the node.

5. In the Diagram Workspace, right-click the StatExplore node, and select Run from
the resulting menu. Click Yes in the Confirmation window that opens.
When you run a node, all of the nodes preceding it in the process flow are also run in
order, beginning with the first node that has changed since the flow was last run. If
no nodes other than the one that you select have changed since the last run, then only
the node that you select is run. You can watch the icons in the process flow diagram
to monitor the status of execution.
• Nodes that are outlined in green are currently running.
• Nodes that are denoted with a check mark inside a green circle have successfully
run.
• Nodes that are outlined in red have failed to run due to errors.
In this example, the DONOR_RAW_DATA input data node had not yet been run.
Therefore, both nodes are run when you choose to run the StatExplore node.
6. In the window that appears when processing completes, click Results. The Results
window appears.
Note: Panels in Results windows might not have the same arrangement on your
screen, due to window resizing. When the Results window is resized, SAS
Enterprise Miner redistributes panels for optimal viewing.
The Results window displays the following:
• a plot that orders the variables by their worth in predicting the target variable.
Note: In the StatExplore node, SAS Enterprise Miner calculates variable worth
using the Gini split worth statistic that would be generated by building a
decision tree of depth 1. For detailed information about Gini split worth, see
the SAS Enterprise Miner Help.
• the SAS output from the node.
• a plot that orders the top 20 variables by their chi-square statistics. You can also
choose to view the top 20 variables ordered by their Cramér's V statistics on this
plot.
TIP In SAS Enterprise Miner, you can select graphs, tables, and rows within
tables and select Copy from the right-click pop-up menu to copy these items for
subsequent pasting in other applications such as Microsoft Word and Microsoft
Excel.
7. Expand the Output window, and then scroll to the Class Variable Summary
Statistics and the Interval Variable Summary Statistics sections of the output.
• Notice that there are two class variables and two interval variables for which
there are missing values. Later in the example, you will impute values to use in
the place of missing values for these variables.
• Notice that several variables have relatively large standard deviations. Later in
the example, you will plot the data and explore transformations that can reduce
the variances of these variables.
8. Close the Results window.
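
The chi-square and Cramér's V statistics mentioned in the steps above can be sketched in plain Python. This is a minimal illustration, not the StatExplore node's implementation: it assumes equal-width binning for interval variables (the binning rule that SAS Enterprise Miner applies is not stated here), and the function names and sample values are hypothetical.

```python
import math
from collections import Counter

def equal_width_bins(values, n_bins=5):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def cramers_v(xs, ys):
    """Cramér's V: chi-square rescaled to the range 0 to 1."""
    k = min(len(set(xs)), len(set(ys))) - 1
    return math.sqrt(chi_square(xs, ys) / (len(xs) * k)) if k > 0 else 0.0

# Hypothetical interval variable and binary target, for illustration only.
amounts = [0, 10, 20, 30, 40, 50]
donated = [0, 0, 0, 1, 1, 1]
bins = equal_width_bins(amounts)
print(bins, chi_square(bins, donated), cramers_v(bins, donated))
```

Because Cramér's V divides the chi-square statistic by the sample size and the smaller table dimension, it lets you compare variables that have different numbers of levels on a common scale.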

Partition the Data

In data mining, a strategy for assessing the quality of model generalization is to partition
the data source. A portion of the data, called the training data set, is used for preliminary
model fitting. The rest is reserved for empirical validation and is often split into two
parts: validation data and test data. The validation data set is used to prevent a modeling
node from overfitting the training data and to compare models. The test data set is used
for a final assessment of the model.
Note: In SAS Enterprise Miner, the default data partitioning method for class target
variables is to stratify on the target variable or variables. This method is appropriate
for this sample data because there is a large number of non-donors in the input data
relative to the number of donors. Stratifying ensures that both non-donors and donors
are well-represented in the data partitions.
To use the Data Partition node to partition the input data into training and validation sets:
1. Select the Sample tab on the Toolbar.
2. Select the Data Partition node icon. Drag the node into the Diagram Workspace.
3. Connect the StatExplore node to the Data Partition node.

4. Select the Data Partition node. In the Properties Panel, scroll down to view the Data
Set Allocations in the Train properties.
• Click the value of Training, and enter 55.0
• Click the value of Validation, and enter 45.0
• Click the value of Test, and enter 0.0
These properties define the percentage of input data that is used in each type of
mining data set. In this example, you use a training data set and a validation data set,
but you do not use a test data set.
5. In the Diagram Workspace, right-click the Data Partition node, and select Run from
the resulting menu. Click Yes in the Confirmation window that opens.
6. In the window that appears when processing completes, click OK.
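
The idea behind a stratified 55/45 partition can be sketched in Python as follows. This is only an illustration of the technique, not the Data Partition node's algorithm: the function name stratified_partition, the column name DONATED, the seed value, and the generated records are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_partition(rows, target, train_pct=55.0, seed=12345):
    """Split rows into training and validation lists, stratified on target."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target]].append(row)
    train, validate = [], []
    # Split each target level separately so both partitions keep
    # roughly the same proportion of donors and non-donors.
    for level_rows in by_level.values():
        rng.shuffle(level_rows)
        cut = round(len(level_rows) * train_pct / 100.0)
        train.extend(level_rows[:cut])
        validate.extend(level_rows[cut:])
    return train, validate

# Hypothetical data: 20 donors among 200 records.
rows = [{"id": i, "DONATED": 1 if i < 20 else 0} for i in range(200)]
train, validate = stratified_partition(rows, "DONATED")
print(len(train), len(validate))
```

Splitting within each target level, rather than across the whole data set at once, is what keeps the rare donor class represented in both partitions.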

Replace Missing Values

In this example, the variables SES and URBANICITY are class variables for which the
value ? denotes a missing value. Because a question mark is not a missing value as SAS
defines one (that is, a blank or a period), SAS Enterprise Miner treats it as an
additional level of the class variable. However, the knowledge that these values are
missing will be useful later in the model-building process.
To use the Replacement node to interactively specify that such observations of these
variables are missing:
1. Select the Modify tab on the Toolbar.
2. Select the Replacement node icon. Drag the node into the Diagram Workspace.
3. Connect the Data Partition node to the Replacement node.

4. Select the Replacement node. In the Properties Panel, scroll down to view the Train
properties.
a. For Interval Variables, click the value of Default Limits Method, and select
None from the drop-down menu. This selection indicates that no values of
interval variables should be replaced. With the default selection, a particular
range for the values of each interval variable would have been enforced. In this
example, you do not want to enforce such a range.
Note: In this data set, all missing interval variable values are correctly coded as
SAS missing values (a blank or a period).
b. For Class Variables, click the ellipsis button that represents the value of
Replacement Editor. The Replacement Editor opens.
• Notice that SES and URBANICITY both have a level that contains
observations with the value ?. For these two variables, this level represents
observations with missing values. Enter _MISSING_ as the Replacement
Value for the two rows. This action enables SAS Enterprise Miner to
recognize that the question marks indicate missing values for these two
variables. Later, you will impute values for observations with missing values.
• Enter _UNKNOWN_ as the Replacement Value for the level of
DONOR_GENDER that has the value A. This value is the result of a data
entry error, and you do not know whether the intention was to code it as an F
or an M.
Click OK.
5. In the Diagram Workspace, right-click the Replacement node, and select Run from
the resulting menu. Click Yes in the Confirmation window that opens.
6. In the window that appears when processing completes, click OK.

In the data that is exported from the Replacement node, a new variable is created for
each variable that is replaced (in this example, SES, URBANICITY, and
DONOR_GENDER). The original variable is not overwritten. Instead, the new variable
has the same name as the original variable, prefixed with REP_. The original
version of each variable also exists in the exported data and has the role Rejected.
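
The effect of these replacements on the exported data can be sketched in Python. This is a simplified illustration, not the Replacement node's implementation: the function name apply_replacements, the use of None as a stand-in for a SAS missing value, and the sample rows are assumptions; only the variable names, the ? and A levels, and the REP_ prefix come from the example above.

```python
def apply_replacements(rows, replacements):
    """Add a REP_<name> column per replaced variable; keep the original."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the original columns are not overwritten
        for name, mapping in replacements.items():
            value = row[name]
            # Values without a replacement rule pass through unchanged.
            new_row["REP_" + name] = mapping.get(value, value)
        out.append(new_row)
    return out

MISSING = None  # stand-in for a SAS missing value in this sketch

replacements = {
    "SES": {"?": MISSING},
    "URBANICITY": {"?": MISSING},
    "DONOR_GENDER": {"A": "_UNKNOWN_"},
}

# Hypothetical rows, for illustration only.
rows = [
    {"SES": "2", "URBANICITY": "?", "DONOR_GENDER": "F"},
    {"SES": "?", "URBANICITY": "C", "DONOR_GENDER": "A"},
]
replaced = apply_replacements(rows, replacements)
for row in replaced:
    print(row)
```

Keeping the original columns alongside the REP_ columns mirrors the behavior described above, where the original variables remain in the exported data with the role Rejected.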
To view the data that is exported by a node, click the ellipsis button that represents the
value of the General property Exported Data in the Properties Panel. To view the
exported variables, click Properties in the window that opens, and then view the
Variables tab. Similarly, you can view the data that is imported and used by a node by
clicking the ellipsis button that represents the value of the General property Imported
Data in the Properties Panel.
TIP “Predictive Modeling with SAS Enterprise Miner: Practical Solutions for
Business Applications” provides examples of and options for the StatExplore and
Replacement nodes. The book also discusses alternate configurations for the Data
Partition node.
