This document discusses the data preparation step in data mining. It describes activities like selecting databases and fields, cleaning data by addressing missing values, invalid values and outliers. It also covers transforming and aggregating data, including techniques for encoding categorical variables, binning numerical variables, and handling relationships between multiple tables. The goal is to construct the final modeling dataset from raw data sources.
Data Preparation
Dr. Saed Sayad
University of Toronto 2010
http://chem-eng.utoronto.ca/~datamining/

Data Mining Steps

1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment

1. Problem Definition

Understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives.

Source: http://www.crisp-dm.org/Process/index.htm

2. Data Preparation

The data preparation step covers all activities to construct the final dataset for modeling from the raw data. Tasks include database, table, record, and field selection as well as cleaning, aggregation and transformation of data.

[Diagram: Data Preparation. Labels: Data, Text Data, DSN, ETL, Modeling Data]

Data Sources

             Text Files       Relational Database              Multi-dimensional Database
Entities     File             Table                            Cube
Attributes   Row and Column   Record, Field, Index             Dimension, Level, Measurement
Methods      Read, Write      Select, Insert, Update, Delete   Drill down, Drill up, Drill through
Language     -                SQL                              MDX

Data Types

Data
  Numerical
    Measurement: Ratio, Interval
    Counting
  Categorical
    Ordinal
    Nominal

Denormalization

One Row per Subject Transformation

[Diagram: Customer is transformed 1 to 1 into Customer Transformed, and Transaction is transformed 1 to 1 into Transaction Transformed. Customer relates 1 to N to Transaction, before and after the transformation.]

Copy and Aggregate

[Diagram: Customer and Transaction tables, linked by Copy and Aggregate]

Data Preparation - Aggregation

Aggregation
  Categorical: Count, Count%
  Numeric: Count, Sum, Mean, Std, Min, Max

One to Many Relationship

Customers (1) to Transactions (N)

Customers
ID   Age   Married
1    25    N
2    38    Y
3    46    Y

Transactions
ID   Customer ID   Purchased Amount
1    1             250
2    1             125
3    2             100
4    2             85
5    2             24
6    3             400

Data Preparation - Copy

Transaction ID   Customer ID   Purchased Amount   Age   Married
1                1             250                25    N
2                1             125                25    N
3                2             100                38    Y
4                2             85                 38    Y
5                2             24                 38    Y
6                3             400                46    Y

Data Preparation - Aggregation

Customer ID   Age   Married   Purchased Count   Purchased Total
1             25    N         2                 375
2             38    Y         3                 209
3             46    Y         1                 400

Data Transformation and Cleansing

Variable
  Categorical: Missing Values, Invalid Values, Encoding
  Numeric: Missing Values, Invalid Values & Outliers, Binning

Missing Values

[Bar chart: frequency of Education values (BLANK, 1, 2, 3, 4). Annotation: 83% Missing Value]

Invalid Values

[Bar chart: frequency of doc_type_id values (NULL, Z, X, 1, 2, 3). Annotation: Invalid]

Missing and Invalid Values and Outliers

[Chart: Months in Business]

Box Plot

[Box plot with outliers marked by *]

Missing Values

- Fill in missing values manually based on our domain knowledge
- Ignore the records with missing data
- Fill them in automatically with:
  - A global constant (e.g., ?)
  - The variable mean
  - Inference-based methods such as Bayes rule, decision tree, or EM algorithm

Managing Outliers

Outliers are data points inconsistent with the majority of the data.

Different outliers:
- Valid: a CEO's salary
- Noisy: a person's age = 200, widely deviated points

Removal methods:
- Box plot
- Clustering
- Curve-fitting

Encoding Categorical Variables

Encoding is the process of transforming categorical variables into numerical counterparts.

Encoding methods:
- Binary method
- Ordinal method
- Target-based encoding

Encoding

Variable: Housing (for free, own, rent)

Binary method:
  for free: 1, 0, 0
  own:      0, 1, 0
  rent:     0, 0, 1

Ordinal method:
  own:      1
  for free: 3
  rent:     5

Binning Numerical Variables

Binning is the process of transforming numerical variables into categorical counterparts.

Binning methods:
- Equal width
- Equal frequency
- Entropy based

Binning

Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning:
  Bin 1: 0, 4              [-, 10) bin
  Bin 2: 12, 16, 16, 18    [10, 20) bin
  Bin 3: 24, 26, 28        [20, +) bin

Equi-frequency binning:
  Bin 1: 0, 4, 12          [-, 14) bin
  Bin 2: 16, 16, 18        [14, 21) bin
  Bin 3: 24, 26, 28        [21, +) bin

Binning

[Chart: Months in Business]

Summary

- In the data preparation step the final modeling dataset is constructed from the raw data.
- One Row per Subject is the heart of the data preparation activities for building the modeling dataset.
- Tasks include database, table, record, and field selection as well as cleaning, aggregation and transformation of data, along with handling missing values, invalid values and outliers.
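The Copy and Aggregation tables for the Customers and Transactions example can be reproduced with a short sketch. The Python below is only an illustration using the standard library; the function and field names (copy_customer_fields, aggregate_transactions, purchased_count, purchased_total) are assumptions made for this example and do not come from the slides.

```python
from collections import defaultdict

# Example data from the One to Many Relationship slide.
CUSTOMERS = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}
# (transaction_id, customer_id, purchased_amount)
TRANSACTIONS = [(1, 1, 250), (2, 1, 125), (3, 2, 100), (4, 2, 85), (5, 2, 24), (6, 3, 400)]


def copy_customer_fields(customers, transactions):
    """Copy: one row per transaction, with the customer fields repeated."""
    return [
        {"transaction_id": t_id, "customer_id": c_id, "amount": amount, **customers[c_id]}
        for t_id, c_id, amount in transactions
    ]


def aggregate_transactions(customers, transactions):
    """Aggregate: one row per customer, with a count and a sum of purchases."""
    totals = defaultdict(lambda: {"purchased_count": 0, "purchased_total": 0})
    for _t_id, c_id, amount in transactions:
        totals[c_id]["purchased_count"] += 1
        totals[c_id]["purchased_total"] += amount
    # Customers without transactions keep the default count and total of 0.
    return [
        {"customer_id": c_id, **fields, **totals[c_id]}
        for c_id, fields in sorted(customers.items())
    ]


copied = copy_customer_fields(CUSTOMERS, TRANSACTIONS)
aggregated = aggregate_transactions(CUSTOMERS, TRANSACTIONS)
for row in aggregated:
    print(row)
```

The aggregated form keeps one row per customer, which is the "One Row per Subject" layout, while the copied form keeps one row per transaction.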
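Two of the automatic options listed for missing values, a global constant and the variable mean, are simple enough to sketch directly. The Python below is a minimal illustration; it assumes missing entries are represented as None, and the function names are hypothetical.

```python
import statistics


def fill_with_constant(values, constant="?"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


ages = [20, None, 40]
print(fill_with_constant(ages))
print(fill_with_mean(ages))
```

Mean imputation keeps the variable mean unchanged but shrinks its variance, which is one reason the inference-based methods may be preferable when many values are missing.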
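The box plot method for spotting outliers can be sketched as follows. The slides do not state which rule the box plot uses, so this example assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the function name and sample ages are made up for illustration.

```python
import statistics


def box_plot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed 1.5*IQR rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [25, 29, 31, 38, 46, 52, 200]  # hypothetical ages, including a noisy value
print(box_plot_outliers(ages))
```

A flagged point still needs a domain judgment: an age of 200 is noise, whereas a CEO's salary may be a valid extreme that should stay in the data.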
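The binary and ordinal encodings of the Housing variable can be written out as a small sketch. The category order and the ordinal codes (own: 1, for free: 3, rent: 5) are taken from the Encoding slide; the function names are assumptions for this example.

```python
HOUSING_CATEGORIES = ["for free", "own", "rent"]
# Ordinal codes as given on the Encoding slide.
HOUSING_ORDINAL = {"own": 1, "for free": 3, "rent": 5}


def binary_encode(value, categories):
    """Binary method: one 0/1 indicator per category, with a single 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def ordinal_encode(value, codes):
    """Ordinal method: look up the numeric code assigned to the category."""
    return codes[value]


for housing in HOUSING_CATEGORIES:
    print(housing, binary_encode(housing, HOUSING_CATEGORIES), ordinal_encode(housing, HOUSING_ORDINAL))
```

The binary method adds one column per category and imposes no order, while the ordinal method keeps a single column but implies a ranking and spacing between categories.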
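The equi-width and equi-frequency bins from the Binning slide can be reproduced with the sketch below. It assumes a fixed bin width of 10 for the equi-width case, matching the [10, 20) boundaries on the slide, and it splits the sorted values into equally sized groups for the equi-frequency case; the function names are hypothetical.

```python
def equal_width_bins(values, width):
    """Group values into consecutive intervals [k*width, (k+1)*width)."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(int(v // width), []).append(v)
    return [bins[k] for k in sorted(bins)]


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size.

    Note: tied values can land in different bins with this simple split.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


values = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(values, width=10))
print(equal_frequency_bins(values, n_bins=3))
```

The equi-frequency boundaries shown on the slide, 14 and 21, are consistent with taking the midpoint between the last value of one bin and the first value of the next.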