Data Preparation

This document provides steps for preparing data for analysis in RapidMiner, including:
1. Importing data from Excel files using the "Import Data" or "Read Excel" operators.
2. Cleaning the data through filtering cases, imputing missing values, dealing with miscoded entries, and selecting/setting attribute roles. Techniques include using the Filter Examples, Replace Missing Values, Trim, Remove Duplicates, and Replace operators.
3. Exploring the basic statistics of attributes using the "Statistics" operator to identify issues like missing or miscoded values.
The document provides detailed instructions for each data preparation technique and emphasizes exploring the data to identify issues before analyzing.

Uploaded by

jessie nando

Data Preparation using RapidMiner

Asst. Prof. Arturo J. Patungan Jr
Asst. Prof. Xandro Alexi A. Nieto
Mr. Eduardo Dulay
Data Cleaning
The process of preparing data for analysis by removing or modifying incorrect, incomplete, irrelevant, duplicated, or improperly formatted data.
The Interface
[Figure: the RapidMiner interface, labeled with the Repository/Source tabs, Parameter tabs, Canvas, Operators/Analysis tabs, and Description tabs.]
Importing Data
• Click File then Import Data, or click the corresponding button in the Repository tab.
• Choose the source of your data set.
• Locate the data, then click Next. For this lecture, choose CustomerDetails.xls.
• Verify the cells you want to import and click Next.
• Format the columns to your specifications. You may change the type, role, and name of each attribute (variable).
Importing Data
• Types:
polynominal
many different string values (for example: red, green, blue, yellow)

binominal
exactly two values (for example: true/false, yes/no)

real
a fractional number (for example: 11.23 or -0.0001)

integer
a whole number (for example: 23, -5, or 11,024,768)

date_time
both date and time (for example: 23.12.2014 17:59)

date
date without time (for example: 23.12.2014)

time
time without date (for example: 17:59)
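To make the attribute types concrete, the sketch below parses the example values from the list in plain Python. It is only an illustration and not part of RapidMiner; the helper name nominal_kind and its two-value rule are assumptions made for this example, and the nominal type names follow RapidMiner's spelling (polynominal, binominal).

```python
from datetime import datetime

# Example values from the type list, converted to matching Python types.
real_value = float("11.23")                          # real: a fractional number
integer_value = int("11,024,768".replace(",", ""))   # integer: drop thousands separators first
date_time_value = datetime.strptime("23.12.2014 17:59", "%d.%m.%Y %H:%M")
date_value = datetime.strptime("23.12.2014", "%d.%m.%Y").date()
time_value = datetime.strptime("17:59", "%H:%M").time()


def nominal_kind(values):
    """Hypothetical helper: exactly two distinct values count as binominal,
    anything else is treated as polynominal."""
    distinct = set(values)
    return "binominal" if len(distinct) == 2 else "polynominal"


print(nominal_kind(["yes", "no", "yes"]))
print(nominal_kind(["red", "green", "blue", "yellow"]))
```

The day.month.year format string matches the example dates in the list; data in another date format would need a different format string.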
Importing Data
• Click Next.
• Choose the folder where the data will be stored.
• Type the file name.
• Click Finish.
• The data will appear in the Results tab.
Importing Data
This time, import the data using a RapidMiner operator.
• In the Views tab, click Design.
• Search for Read Excel in the operator tab.
• Drag and drop it onto the canvas.
• Click Import Configuration Wizard.
• Locate and open the file. For this lecture, choose OrderDetails.xls.
• Click Next, Next, and Finish.
Exploratory Analysis
• To find the basic statistics of each attribute, click Statistics.
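As a rough picture of what a statistics view reports for a numeric attribute, the sketch below counts missing values and computes the minimum, maximum, and average in plain Python. The Discount values are made up for illustration, and basic_statistics is a hypothetical helper, not a RapidMiner function.

```python
from statistics import mean

# Hypothetical Discount values; None marks a missing entry.
discounts = [0.10, None, 0.25, 0.05, None]


def basic_statistics(values):
    """Summarize one numeric attribute, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "average": mean(present),
    }


stats = basic_statistics(discounts)
print(stats)
```

A missing count above zero or an implausible minimum or maximum is the kind of issue the later cleaning steps address.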
Data Preparation
• Go back to Design view.
• Connect the Out node of the Read Excel operator to the res knob of the result.
• Click Run to execute the process.
Data Filtering using RapidMiner
Data Preparation
Go back to Design view.
1. Filtering cases.
• In the operator tab, search for Filter Examples, then drag and drop it onto the line connecting the Read Excel operator and the res knob.
• In the parameter tab, choose Add Filter in the condition class.
• Choose the attribute’s filtering criteria. For example, retain only the orders before 2016. This will remove the case(s) ordered from 2016 onward.
• You may add more criteria by clicking Add Entry.
• Once all criteria have been set, click OK, then Run.
RapidMiner removed 1 case, an order taken from 2016 onward.
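The same filtering rule can be sketched outside RapidMiner. The example below keeps only orders dated before 2016 in plain Python; the rows and the attribute names OrderID and OrderDate are assumptions for illustration, not the contents of OrderDetails.xls.

```python
from datetime import date

# Hypothetical order rows.
orders = [
    {"OrderID": 1, "OrderDate": date(2015, 3, 14)},
    {"OrderID": 2, "OrderDate": date(2015, 12, 31)},
    {"OrderID": 3, "OrderDate": date(2016, 1, 1)},
]

# Retain only the orders placed before 2016.
cutoff = date(2016, 1, 1)
kept = [row for row in orders if row["OrderDate"] < cutoff]
removed = len(orders) - len(kept)

print(f"kept {len(kept)} rows, removed {removed}")
```

Further criteria would be added as extra conditions in the comprehension, much like adding entries to the filter.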
Missing Value Imputation using RapidMiner
Data Preparation
Instead of adding custom filters, you may remove all cases with missing values by choosing the corresponding option in the condition class.
As seen in the statistics of the data, 199 cases have missing values in the Discount attribute.
Go back to Design view.
2. Imputing Missing Data
• In the operator tab, search for Replace Missing Values, then drag and drop it onto the line connecting the Filter Examples operator and the res knob.
• In the parameter tab, select the attribute filter type. Choose single if the imputation will apply to a single attribute.
• Select the attribute to which the imputation will be applied.
• Select the imputation method in the default parameter.
• Click Run to see the result.
No more missing values remain in the Discount attribute.
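One common imputation method is to replace each missing value with the average of the observed values. The sketch below assumes that method (the slides do not say which one was chosen) and uses made-up rows; impute_average is a hypothetical helper.

```python
from statistics import mean

# Hypothetical rows; None marks a missing Discount.
rows = [
    {"OrderID": 1, "Discount": 0.10},
    {"OrderID": 2, "Discount": None},
    {"OrderID": 3, "Discount": 0.30},
]


def impute_average(rows, attribute):
    """Return new rows in which missing values of one attribute
    are replaced by the average of its observed values."""
    present = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(present)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


imputed = impute_average(rows, "Discount")
print(imputed)
```

Only the chosen attribute is touched, which mirrors applying the imputation to a single attribute.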
Dealing with Miscoded Entries using RapidMiner
Data Preparation
Go back to Design view.
• Instead of the Order Details data, we will use the Customer Details data.
• Drag and drop the Customer Details data onto the canvas.
• The Customer Details data can be viewed in the Results view.
• Notice in the statistics tab that the Gender attribute has miscoded entries. Click Details… to inspect them.
Go back to Design view.
Data Preparation
3. Dealing with miscoded data
• Connect the Out node of the Retrieve Customer operator to the second res knob of the result.
• To remove “white spaces” in the encoding, use the Trim operator.
• Select single if trimming shall be applied to a single attribute. Then click Run.
You may see the trimming result by viewing the statistics. Click Details… to compare the values before and after trimming.
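The effect of trimming can be illustrated in plain Python: stray leading or trailing spaces make otherwise identical codes count as different values. The Gender values below are made up for illustration.

```python
# Hypothetical Gender values with stray white space.
genders = [" male", "male", "female ", "female"]

# Remove leading and trailing white space from every value.
trimmed = [g.strip() for g in genders]

print(sorted(set(genders)))   # distinct values before trimming
print(sorted(set(trimmed)))   # distinct values after trimming
```

Before trimming there are four distinct values; afterwards only two remain.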
Data Preparation
Go back to Design view.
3. Dealing with miscoded data
• To remove “duplicates” in the encoding, use the Remove Duplicates operator.
• Select single if duplicate removal shall be applied to a single attribute. This will retain only one entry if duplicate Customer IDs have been found. Then click Run.
Still, 2267 cases are retained, indicating that there are no duplicates in Customer IDs.
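Removing duplicates on a single attribute can be sketched as follows. Keeping the first occurrence of each Customer ID is an assumption of this example, and the rows and the helper remove_duplicates are hypothetical.

```python
def remove_duplicates(rows, key):
    """Keep only the first row seen for each value of the key attribute."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Hypothetical customer rows; CustomerID 2 appears twice.
customers = [
    {"CustomerID": 1, "Gender": "female"},
    {"CustomerID": 2, "Gender": "male"},
    {"CustomerID": 2, "Gender": "male"},
]

deduplicated = remove_duplicates(customers, "CustomerID")
print(len(customers), "->", len(deduplicated))
```

If the row count is unchanged after this step, as in the lecture data, the key attribute held no duplicates.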
Data Preparation
Go back to Design view.
3. Dealing with miscoded data
• To recode miscoded values, use the REPLACE operator.
• Select single if the replacement of values shall be applied to a single attribute.
• Temporarily, replace female with girl.
• Add another REPLACE operator replacing FEMALE with girl;
• Add another REPLACE operator replacing male with boy;
• Add another REPLACE operator replacing m with boy;
• Add another REPLACE operator replacing f with girl;
• Add another REPLACE operator replacing m with boy;
• Add another REPLACE operator replacing MALE with boy;
• Add another REPLACE operator replacing Male with boy;
To change girl and boy back to female and male, respectively,
• Add another REPLACE operator replacing girl with female;
• Add another REPLACE operator replacing boy with male.
Data Preparation
3. Dealing with miscoded data
• Click RUN to verify the process.
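A likely reason for the temporary girl and boy codes is that replacing male directly would also change the male inside female. The sketch below assumes replacements act on substrings, as Python's re.sub does, and chains the steps in the order given in the slides; apply_replacements and the sample values are hypothetical.

```python
import re


def apply_replacements(value, steps):
    """Apply each (pattern, replacement) pair in order, on substrings."""
    for pattern, replacement in steps:
        value = re.sub(pattern, replacement, value)
    return value


# Replacing "male" directly corrupts "female".
naive = apply_replacements("female", [("male", "boy")])

# Recode through temporary values first, then map them back.
steps = [
    ("female", "girl"), ("FEMALE", "girl"),
    ("male", "boy"), ("m", "boy"), ("f", "girl"),
    ("MALE", "boy"), ("Male", "boy"),
    ("girl", "female"), ("boy", "male"),
]
raw = ["female", "FEMALE", "male", "m", "f", "MALE", "Male"]
cleaned = [apply_replacements(v, steps) for v in raw]

print(naive)
print(cleaned)
```

The order matters: the female variants are recoded before any male pattern is applied, so no later step can match inside them.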
Data Preparation
You may impute missing values in other attributes using the REPLACE MISSING VALUES operator.
Selecting and Setting Roles of Attributes using RapidMiner
Data Preparation
4. Selecting the Attributes for Analysis
• Use the Select Attributes operator to select the attributes that you need for analysis.
• You can select all the attributes, a single attribute, or a subset. Click Select Attributes.
• Select the attributes that will be used for analysis. This will remove the names and the Responder attribute from the final data.
Data Preparation
5. Setting the role that an attribute will perform.
• Use the Set Role operator to tag the attribute that will be used as the label (target variable), or to assign any other role it will play in the analysis.
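Selecting attributes and assigning a label role can be pictured as two small steps: keep only the wanted columns, then separate the target from the remaining attributes. All attribute names and values below are hypothetical, including the choice of Income as the label; select_attributes and split_by_role are illustrative helpers, not RapidMiner operators.

```python
def select_attributes(rows, keep):
    """Keep only the named attributes in every row."""
    return [{name: row[name] for name in keep} for row in rows]


def split_by_role(rows, label):
    """Separate the label (target variable) from the regular attributes."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


# Hypothetical customer rows.
customers = [
    {"CustomerID": 1, "Name": "A", "Gender": "female", "Income": "high", "Responder": "yes"},
    {"CustomerID": 2, "Name": "B", "Gender": "male", "Income": "low", "Responder": "no"},
]

# Drop Name and Responder, then treat Income as the label.
selected = select_attributes(customers, ["CustomerID", "Gender", "Income"])
features, labels = split_by_role(selected, "Income")

print(features)
print(labels)
```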
Combining Data Sets using RapidMiner
Data Preparation
6. Joining Two Data Sets
If two data sets need to be merged for an analysis, use the Join operator.
• Connect the first data set, or its result, to the left node of the Join operator and the other data set to the right node.
• In the parameter tab, use Inner as the join type, then click Edit List.
• Select the attribute in the first data set (left) and in the second data set (right) that will be used to match the two data sets.
• Click Apply, then click Run.
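An inner join keeps only the rows whose key value appears in both data sets. The sketch below shows that behavior in plain Python; the rows, the attribute names, and the helper inner_join are assumptions for illustration, and on a name collision the left value is kept.

```python
def inner_join(left, right, left_key, right_key):
    """Combine rows from both sides whenever their key values match."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    joined = []
    for l in left:
        for r in index.get(l[left_key], []):
            joined.append({**r, **l})  # left values win on shared names
    return joined


# Hypothetical data: order 12 has no matching customer, customer 3 has no order.
orders = [
    {"OrderID": 10, "CustomerID": 1},
    {"OrderID": 11, "CustomerID": 2},
    {"OrderID": 12, "CustomerID": 9},
]
customers = [
    {"CustomerID": 1, "Gender": "female"},
    {"CustomerID": 2, "Gender": "male"},
    {"CustomerID": 3, "Gender": "female"},
]

joined = inner_join(orders, customers, "CustomerID", "CustomerID")
print(joined)
```

Unmatched rows on either side are dropped, which is what distinguishes an inner join from the other join types.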


Data Preparation
7. Creating a new data set from the cleaned/pre-processed data.
• Use the “Store” operator to create a RapidMiner data set from the process.
• Use the “Write ***” operator to store the data in the format you want.
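Writing the cleaned data to a file format can be sketched with Python's csv module. CSV is only one possible target format, chosen here as an assumption; the rows and the helper write_csv are hypothetical, and an in-memory buffer stands in for a file.

```python
import csv
import io


def write_csv(rows, stream):
    """Write rows (dictionaries with identical keys) as CSV with a header."""
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical cleaned data.
cleaned = [
    {"CustomerID": 1, "Gender": "female"},
    {"CustomerID": 2, "Gender": "male"},
]

buffer = io.StringIO()
write_csv(cleaned, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the buffer.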