
INSY 662 – Fall 2023

Data Mining and Visualization

Week 1: Data Pre-processing


August 31, 2023
Elizabeth Han
Why Do We Preprocess Data?
▪ Raw data are often incomplete and noisy
▪ They usually contain:
– Obsolete fields
– Missing values
– Outliers
– Data in a form not suitable for data mining
– Erroneous values
– Irrelevant data

Data Pre-Processing
▪ Minimize GIGO (Garbage In, Garbage Out)
– IF garbage input is minimized,
THEN garbage output is minimized

▪ For data mining purposes, raw data must undergo data cleaning and data transformation

▪ Data preparation is ~70% of the effort in the data mining process

Data Cleaning

▪ Inconsistent formatting or labeling
– Not all countries use the same postal code format
e.g., 90210 (U.S.) vs. J2S7K7 (Canada)
– Truncation of leading zeros in numeric fields (see the sketch below)
e.g., 6269 vs. 06269 (ZIP codes in New England states begin with 0)

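A minimal pandas sketch of preserving leading zeros (the file name `customers.csv` and column name `zip` are hypothetical):

```python
import pandas as pd

# Parsing "06269" as a number silently drops the leading zero (-> 6269).
# Reading the column as a string keeps the ZIP code exactly as entered.
df = pd.read_csv("customers.csv", dtype={"zip": str})

# If the zeros were already lost, U.S. ZIP codes can be restored to
# their fixed five-digit width:
df["zip"] = df["zip"].str.zfill(5)
```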
Data Cleaning

▪ Missing data
– Pose problems for data analysis methods
– More common in massive datasets with a large number of fields
– Dropping is the naïve approach
▪ Drop columns with missing values
→ What if all columns contain missing values?
▪ Drop rows with missing values
→ What if values are not missing at random?

Data Cleaning

▪ Missing data
1. Replace with a user-defined constant
2. Replace with the mean, median, or mode
3. Replace with random values drawn from the underlying distribution
4. Create a model to predict the missing values
(options 1 and 2 are sketched below)

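A minimal pandas sketch of options 1 and 2 (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40],
                   "city": ["NY", "LA", None, "NY", "NY"]})

# 1. Replace with a user-defined constant
df["city"] = df["city"].fillna("Unknown")

# 2. Replace with the mean (for a numeric field)
df["age"] = df["age"].fillna(df["age"].mean())
```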
Data Cleaning

▪ Outliers

– Should we always remove all outliers?

Data Cleaning

▪ Create an index field
– To track the sort order of the records in the database
– Data mining data gets partitioned at least once (and sometimes several times)
– It is helpful to have an index field so that the original sort order may be recreated (see the sketch below)

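A sketch of an index field in pandas (the column names and the shuffle step are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 30, 20]})

# Record the original row order before any partitioning or shuffling.
df["record_id"] = range(len(df))

# ...after sampling, splits, or merges, recover the original order:
df = df.sample(frac=1)            # e.g., a random shuffle
df = df.sort_values("record_id")  # original sort order restored
```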
Data Cleaning

▪ Remove unary (or nearly unary) variables
– Variables that take on only a single value
– Sometimes a variable can be very nearly unary
e.g., Suppose that 99.95% of the players in a field hockey league are female, with the remaining 0.05% male
– While it may be useful to investigate the male players, some algorithms will tend to treat the variable as essentially unary

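One way to flag nearly unary columns (the 99.5% cutoff and the toy data are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["F"] * 1999 + ["M"],   # nearly unary
                   "goals": list(range(2000))})   # varied

cutoff = 0.995  # illustrative threshold
# A column is nearly unary if its most frequent value covers
# almost every record.
nearly_unary = [col for col in df.columns
                if df[col].value_counts(normalize=True).iloc[0] >= cutoff]
print(nearly_unary)  # ['sex']
```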
Data Cleaning

▪ Removing variables with ≥90% missing values
– But should we always remove them?
e.g., the variable ‘donation’ from survey data
– If most people do not donate, the data will contain many missing values.

▪ Recommendation
– Create a dummy variable (see the sketch below)
(1 = record without a missing value; 0 = record with a missing value)

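A minimal sketch of such a missingness flag in pandas (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"donation": [50, None, None, 20, None]})

# 1 = record without a missing value; 0 = record with a missing value
df["donation_flag"] = df["donation"].notna().astype(int)
```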
Data Cleaning

▪ Removing strongly correlated variables
– In statistics, they lead to the issue of multicollinearity
– In data mining and predictive analytics, they may cause a double-count of a particular aspect of the analysis and, at worst, lead to instability of the model results

▪ Recommendation
– Remove the variables from the model (see the sketch below)
– Apply dimension reduction techniques, such as principal components analysis (PCA)
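A sketch for spotting strongly correlated pairs before removal (the 0.9 cutoff and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [170, 182, 165, 190],
                   "height_in": [66.9, 71.7, 65.0, 74.8],
                   "age":       [23, 35, 29, 41]})

corr = df.corr().abs()
# Scan the upper triangle of the correlation matrix for strong pairs.
strong_pairs = [(a, b, round(corr.loc[a, b], 3))
                for i, a in enumerate(corr.columns)
                for b in corr.columns[i + 1:]
                if corr.loc[a, b] >= 0.9]
print(strong_pairs)  # height_cm vs. height_in will appear here
```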
Data Cleaning

▪ Removing duplicates
– May occur after merging datasets
– Lead to an overweighting of the data values in those records

But are they really duplicates?

▪ Recommendation
– Weigh the likelihood that the duplicates truly represent different records against the likelihood that the duplicates are indeed just duplicated records (see the sketch below)

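A minimal sketch of dropping duplicates in pandas (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                   "name": ["Ann", "Bo", "Bo", "Cy"]})

# Keep the first occurrence of each fully identical record.
deduped = df.drop_duplicates()

# Or judge duplicates on a key column only:
deduped_by_id = df.drop_duplicates(subset=["customer_id"])
```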
Data Transformation

▪ Adjust the scale of variables
– Variables tend to have different ranges
e.g., two fields in a baseball player data set:
– Batting average: [0.0, 0.400]
– Number of home runs: [0, 70]
– Differing ranges will influence the prediction process of some data mining algorithms
– By standardizing numeric field values, we can ensure that the impact of variables on the model is similar
Data Transformation

▪ Adjust the scale of variables


1. Min-Max scaling
– Results in [0, 1]
– Sensitive to extreme values

$X_{mm} = \dfrac{X - \min(X)}{\max(X) - \min(X)}$

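A one-line sketch in pandas (the values are illustrative):

```python
import pandas as pd

x = pd.Series([0.210, 0.275, 0.310, 0.400])

# Min-max scaling: maps the values onto [0, 1].
x_mm = (x - x.min()) / (x.max() - x.min())
```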
Data Transformation

▪ Adjust the scale of variables


2. Decimal scaling
– Reduce the magnitude by dividing by a power of 10
– Results in [-1, 1]

$X_{ds} = \dfrac{X}{10^{d}}$

where d represents the number of digits in the data value with the largest absolute value

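A sketch of decimal scaling, computing d from the data (the values are illustrative):

```python
import pandas as pd

x = pd.Series([12, -450, 7, 6269])

# d = number of digits in the value with the largest absolute value
# (6269 -> d = 4)
d = len(str(int(x.abs().max())))
x_ds = x / (10 ** d)  # all values now fall within [-1, 1]
```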
Data Transformation

▪ Adjust the scale of variables


3. Z-score standardization
– Rescales values to mean = 0 and SD = 1 (the scale of the standard normal distribution)

$X_{zs} = \dfrac{X - \text{mean}(X)}{\text{SD}(X)}$

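The same in pandas (illustrative values; `std()` here is the sample standard deviation):

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40])

# Z-score standardization: mean 0, standard deviation 1.
x_zs = (x - x.mean()) / x.std()
```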
Data Transformation

▪ Adjust the scale of variables


4. Log transformation
– To account for skewness
– Common choices: ln(x); √x; 1/√x

$\text{Skewness}(X) = \dfrac{3\,(\text{mean}(X) - \text{median}(X))}{\text{SD}(X)}$

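A sketch of measuring skewness with the formula above and then applying a log transformation (the values are illustrative):

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 2, 3, 4, 8, 15, 40])  # right-skewed values

# Pearson's second skewness coefficient, as defined above
skewness = 3 * (x.mean() - x.median()) / x.std()

# Log transformation to reduce the right skew (requires positive values)
x_log = np.log(x)
```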
Data Transformation

▪ Dummy variables (a.k.a. flag or indicator)
– A categorical variable taking only 0 or 1
– Create k−1 dummies for a categorical predictor with k possible values, and use the unassigned category as the reference category

e.g., For a variable “region”: {north, east, south, west}, the dummy variables will be:
– dummy_north = 1 if region = north
– dummy_east = 1 if region = east
– dummy_south = 1 if region = south
(west is the unassigned reference category)

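A pandas sketch; note that `get_dummies(..., drop_first=True)` drops the first category in sorted order (here "east"), so the reference category may differ from the slide's choice of west:

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "east", "south", "west"]})

# k-1 dummies: the dropped (first) category becomes the reference.
dummies = pd.get_dummies(df["region"], prefix="dummy", drop_first=True)
df = pd.concat([df, dummies], axis=1)
```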
Data Transformation
▪ Binning of numeric variables
– Partitioning numeric values into bins
– Equal width binning: create k categories with
equal width
– Equal frequency binning: create k categories,
each with the same number of records
– Binning by clustering: use a clustering algorithm

e.g., X = {1,1,1,1,1,2,2,11,11,12,12,44} & k = 3 (see the sketch below)

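A sketch of the first two schemes on the slide's example data:

```python
import pandas as pd

x = pd.Series([1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44])

# Equal-width binning: 3 bins of equal width over [1, 44]
equal_width = pd.cut(x, bins=3)

# Equal-frequency binning: 3 bins with (roughly) equal record counts;
# heavy ties like the repeated 1s can force pandas to merge quantile edges.
equal_freq = pd.qcut(x, q=3, duplicates="drop")
```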
Data Transformation

▪ Transforming categorical to numerical


– Most of the time, this should be avoided
– The exception: when the categorical variable is clearly ordered

e.g., a variable “survey_response” with values {never, sometimes, usually, always}

– Even then, the coding is debatable: should “never” be “0” rather than “1”? Is “always” closer to “usually” than “usually” is to “sometimes”?
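One possible ordinal coding in pandas (the equal spacing between levels is itself an assumption, as the questions above suggest):

```python
import pandas as pd

df = pd.DataFrame({"survey_response":
                   ["never", "sometimes", "usually", "always"]})

# Map ordered categories to integers; the choice of spacing is debatable.
order = {"never": 1, "sometimes": 2, "usually": 3, "always": 4}
df["response_num"] = df["survey_response"].map(order)
```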
Data Transformation

▪ Reclassifying categorical variables
– Sometimes there may be too many categories
e.g., the 50 states in the U.S.

▪ Recommendation
– Reclassify as a variable “region” with five field values: {Northeast, Southeast, North Central, Southwest, West}
– Or reclassify as a variable “economic_level” with three field values: {the richer states, the midrange states, the poorer states}
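A sketch of the “region” reclassification (the mapping shown is partial and illustrative; a real one would cover all 50 states):

```python
import pandas as pd

df = pd.DataFrame({"state": ["NY", "GA", "OH", "TX", "CA"]})

# Partial, illustrative state-to-region mapping
state_to_region = {"NY": "Northeast", "GA": "Southeast",
                   "OH": "North Central", "TX": "Southwest",
                   "CA": "West"}
df["region"] = df["state"].map(state_to_region)
```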
