
Data Preparation

This document discusses the data preparation step in data mining. It describes activities like selecting databases and fields, cleaning data by addressing missing values, invalid values and outliers. It also covers transforming and aggregating data, including techniques for encoding categorical variables, binning numerical variables, and handling relationships between multiple tables. The goal is to construct the final modeling dataset from raw data sources.


Data Preparation

Dr. Saed Sayad


University of Toronto
2010
http://chem-eng.utoronto.ca/~datamining/
Data Mining Steps
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment
1. Problem Definition
Understanding the project objectives and
requirements from a business perspective,
and then converting this knowledge into a
data mining problem definition with a
preliminary plan designed to achieve the
objectives.
Source: http://www.crisp-dm.org/Process/index.htm
2. Data Preparation
The data preparation step covers all
activities to construct the final dataset
for modeling from the raw data. Tasks
include database, table, record, and field
selection as well as cleaning, aggregation
and transformation of data.
Data Preparation
[Diagram with labels: Data, Text, DSN, ETL, Modeling Data]
Data Sources
            Text Files   Relational Database             Multi-dimensional Database
Entities    File         Table                           Cube
Attributes  Row and Col  Record, Field, Index            Dimension, Level, Measurement
Methods     Read, Write  Select, Insert, Update, Delete  Drill down, Drill up, Drill through
Language    -            SQL                             MDX
Data Types
Data
  Numerical
    Measurement: Ratio, Interval
    Counting
  Categorical
    Ordinal
    Nominal
Denormalization
One Row per Subject
[Diagram: Transformation. Customer maps 1 to 1 to Customer Transformed; Transaction maps 1 to 1 to Transaction Transformed; Customer relates 1 to N to Transaction, before and after transformation.]
Copy and Aggregate
[Diagram: Customer and Transaction tables, linked by Copy and Aggregate]
Data Preparation - Aggregation
Aggregation
  Categorical: Count, Count%
  Numeric: Count, Sum, Mean, Std, Min, Max
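The aggregation functions listed above can be sketched in plain Python. This is a minimal illustration, not the deck's own implementation: the function names are hypothetical, "Std" is assumed to mean the sample standard deviation, and the input values are the Married flags and customer 2's purchase amounts from the one-to-many example in this deck.

```python
from collections import Counter
from statistics import mean, stdev


def aggregate_categorical(values):
    """Return {category: (count, percent of all values)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


def aggregate_numeric(values):
    """Return the numeric aggregates named on the slide."""
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": mean(values),
        "std": stdev(values),  # assumption: sample (n - 1) standard deviation
        "min": min(values),
        "max": max(values),
    }


married = ["N", "Y", "Y"]   # Married flag of customers 1 to 3
amounts = [100, 85, 24]     # purchase amounts of customer 2

married_summary = aggregate_categorical(married)
amount_summary = aggregate_numeric(amounts)
```

`stdev` needs at least two values, so a subject with a single record would need separate handling.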
One to Many Relationship
Customers
Customer ID  Age  Married
1            25   N
2            38   Y
3            46   Y

Transactions
Transaction ID  Customer ID  Purchased Amount
1               1            250
2               1            125
3               2            100
4               2            85
5               2            24
6               3            400

Relationship: Customers (1) to Transactions (N)
Data Preparation - Copy
Transaction ID  Customer ID  Purchased Amount  Age  Married
1               1            250               25   N
2               1            125               25   N
3               2            100               38   Y
4               2            85                38   Y
5               2            24                38   Y
6               3            400               46   Y
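The Copy step can be sketched as a lookup from each transaction to its customer. The data is the deck's example; the dictionary layout, key names and the function name `copy_customer_fields` are assumptions made for illustration.

```python
# Customers keyed by Customer ID (values from the slides).
customers = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}

transactions = [
    {"transaction_id": 1, "customer_id": 1, "amount": 250},
    {"transaction_id": 2, "customer_id": 1, "amount": 125},
    {"transaction_id": 3, "customer_id": 2, "amount": 100},
    {"transaction_id": 4, "customer_id": 2, "amount": 85},
    {"transaction_id": 5, "customer_id": 2, "amount": 24},
    {"transaction_id": 6, "customer_id": 3, "amount": 400},
]


def copy_customer_fields(transactions, customers):
    """Repeat each customer's fields on every one of that customer's transactions."""
    return [{**t, **customers[t["customer_id"]]} for t in transactions]


copied = copy_customer_fields(transactions, customers)
```

The result keeps one row per transaction, so customer fields are duplicated wherever a customer has several transactions.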
Data Preparation - Aggregation
Customer ID  Age  Married  Purchased Count  Purchased Total
1            25   N        2                375
2            38   Y        3                209
3            46   Y        1                400
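The Aggregation step rolls the transactions up to one row per customer. A minimal sketch on the deck's example data follows; the tuple layout, key names and the function name `aggregate_per_customer` are assumptions.

```python
from collections import defaultdict

# Customers keyed by Customer ID (values from the slides).
customers = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}

# (transaction_id, customer_id, purchased_amount)
transactions = [
    (1, 1, 250),
    (2, 1, 125),
    (3, 2, 100),
    (4, 2, 85),
    (5, 2, 24),
    (6, 3, 400),
]


def aggregate_per_customer(customers, transactions):
    """Return one row per customer with the purchase count and purchase total."""
    count = defaultdict(int)
    total = defaultdict(int)
    for _transaction_id, customer_id, amount in transactions:
        count[customer_id] += 1
        total[customer_id] += amount
    return [
        {
            "customer_id": cid,
            **fields,
            "purchased_count": count[cid],  # 0 for customers with no transactions
            "purchased_total": total[cid],
        }
        for cid, fields in customers.items()
    ]


rows = aggregate_per_customer(customers, transactions)
```

This produces the one-row-per-subject layout: the customer is the subject, and the N transactions collapse into summary columns.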
Data Transformation and Cleansing
Variable
Categorical     Numeric
Missing Values  Missing Values
Invalid Values  Invalid & Outliers
Encoding        Binning
Missing Values
[Bar chart: frequency of Education by value (BLANK, 1, 2, 3, 4); annotation: 83% Missing Value]
Invalid Values
[Bar chart: frequency of doc_type_id by value (NULL, Z, X, 1, 2, 3); annotation: Invalid]
Missing and Invalid Values and Outliers
[Chart: Months in Business]
Box Plot
[Box plot: outliers marked with *]
Missing Values
Fill in missing values manually, based on domain knowledge
Ignore the records with missing data
Fill them in automatically with:
  A global constant (e.g., ?)
  The variable mean
  Inference-based methods such as Bayes' rule, a decision tree, or the EM algorithm
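The two simplest automatic options above, a global constant and the variable mean, can be sketched as follows. This assumes missing values are represented as `None`; the function names and the sample ages are hypothetical.

```python
from statistics import mean


def fill_with_constant(values, constant="?"):
    """Replace every missing value (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    m = mean(observed)
    return [m if v is None else v for v in values]


ages = [25, None, 46, 37]  # hypothetical sample with one missing age
filled_ages = fill_with_mean(ages)
filled_labels = fill_with_constant(["Y", None, "N"])
```

Mean filling keeps the variable's average unchanged but shrinks its variance, which is one reason the inference-based methods are sometimes preferred.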
Managing Outliers
Data points inconsistent with the majority of the data
Kinds of outliers:
  Valid: a CEO's salary
  Noisy: a person's age = 200, widely deviating points
Removal methods:
  Box plot
  Clustering
  Curve-fitting
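The box plot method can be sketched with the conventional 1.5 x IQR fences. The multiplier 1.5 is the usual box plot convention and is an assumption here, since the slides do not state one; the function name and the sample ages are hypothetical.

```python
from statistics import quantiles


def box_plot_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [23, 25, 31, 38, 42, 46, 51, 200]  # hypothetical sample with one noisy age
flagged = box_plot_outliers(ages)
```

The rule only flags candidates. Deciding whether a flagged point is valid (a CEO's salary) or noisy (an age of 200) still takes domain knowledge.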
Encoding Categorical Variables
Encoding is the process of transforming categorical variables into numerical counterparts.
Encoding methods:
  Binary method
  Ordinal method
  Target-based encoding
Encoding
Housing (for free, own, rent)
Binary method:
  for free: 1, 0, 0
  own: 0, 1, 0
  rent: 0, 0, 1
Ordinal method:
  own: 1
  for free: 3
  rent: 5
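Both encodings of the Housing variable can be reproduced in a few lines. The category order and the ordinal codes are taken from the slide; the function names are hypothetical.

```python
HOUSING = ["for free", "own", "rent"]                 # category order from the slide
ORDINAL_CODES = {"own": 1, "for free": 3, "rent": 5}  # codes from the slide


def binary_encode(value, categories=HOUSING):
    """Return one 0/1 indicator per category, with a single 1 for the matching one."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_encode(value, codes=ORDINAL_CODES):
    """Return the numeric code assigned to the category."""
    return codes[value]


encoded_own = binary_encode("own")
encoded_free = ordinal_encode("for free")
```

The binary method adds one column per category and imposes no order. The ordinal method keeps a single column, but the chosen codes imply an order and a spacing between categories.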
Binning Numerical Variables
Binning is the process of transforming numerical variables into categorical counterparts.
Binning methods:
  Equal Width
  Equal Frequency
  Entropy Based
Binning
Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
Equi-width binning:
  Bin 1 [-, 10): 0, 4
  Bin 2 [10, 20): 12, 16, 16, 18
  Bin 3 [20, +): 24, 26, 28
Equi-frequency binning:
  Bin 1 [-, 14): 0, 4, 12
  Bin 2 [14, 21): 16, 16, 18
  Bin 3 [21, +): 24, 26, 28
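The two binnings of the example variable can be reproduced as follows. This is a sketch under stated assumptions: the equi-width cut points 10 and 20 are passed in explicitly as on the slide, the equi-frequency split simply slices the sorted values into groups of equal size, and the function names are hypothetical.

```python
from bisect import bisect_right


def assign_bins(values, cut_points):
    """Place each value in a half-open bin; bin i covers [cut_points[i-1], cut_points[i])."""
    bins = [[] for _ in range(len(cut_points) + 1)]
    for v in values:
        bins[bisect_right(cut_points, v)].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # first `extra` bins get one more
        bins.append(ordered[start:end])
        start = end
    return bins


variable = [0, 4, 12, 16, 16, 18, 24, 26, 28]
equi_width = assign_bins(variable, [10, 20])
equi_frequency = equal_frequency_bins(variable, 3)
```

The slide's equi-frequency boundaries, 14 and 21, are the midpoints between the neighbouring values of adjacent bins (12 and 16, 18 and 24). Note that slicing by position can put tied values into different bins.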
Binning
[Chart: binning of Months in Business]
Summary
In the data preparation step, the final modeling dataset is constructed from the raw data.
One Row per Subject is the heart of the data preparation activities for building the modeling dataset.
Tasks include database, table, record, and field selection, as well as cleaning, aggregation, and transformation of data, including the handling of missing values, invalid values, and outliers.
