
INSY 662 – Fall 2023

Data Mining and Visualization

Week 1: Data Pre-processing


August 31, 2023
Elizabeth Han
Why Do We Preprocess Data?
▪ Raw data are often incomplete and noisy
▪ They usually contain:
– Obsolete fields
– Missing values
– Outliers
– Data in a form not suitable for data mining
– Erroneous values
– Irrelevant data

Data Pre-Processing
▪ Minimize GIGO (Garbage In, Garbage Out)
– IF garbage input is minimized,
THEN garbage output is minimized

▪ For data mining purposes, raw data must undergo data cleaning and data transformation

▪ Data preparation is ~70% of the effort in the data mining process

Data Cleaning

▪ Inconsistent formatting or labeling
– Not all countries use the same postal code format
e.g., 90210 (U.S.) vs. J2S7K7 (Canada)
– Truncation of leading zeros in numeric fields (see the sketch below)
e.g., 6269 vs. 06269 (ZIP codes in New England states begin with 0)

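A minimal pandas sketch of preserving leading zeros (the file name `customers.csv` and column name `zip` are hypothetical):

```python
import pandas as pd

# Parsing "06269" as a number silently drops the leading zero (-> 6269).
# Reading the column as a string keeps the ZIP code exactly as entered.
df = pd.read_csv("customers.csv", dtype={"zip": str})

# If the zeros were already lost, U.S. ZIP codes can be restored to
# their fixed five-digit width:
df["zip"] = df["zip"].str.zfill(5)
```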
Data Cleaning

▪ Missing data
– Pose problems for data analysis methods
– More common in massive datasets with a large number of fields
– Dropping is the naïve approach
▪ Drop columns with missing values
→ What if all columns contain missing values?
▪ Drop rows with missing values
→ What if values are not missing at random?

Data Cleaning

▪ Missing data
1. Replace with a user-defined constant
2. Replace with the mean, median, or mode
3. Replace with random values drawn from the underlying distribution
4. Create a model to predict the missing values
(options 1 and 2 are sketched below)

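A minimal pandas sketch of options 1 and 2 (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40],
                   "city": ["NY", "LA", None, "NY", "NY"]})

# 1. Replace with a user-defined constant
df["city"] = df["city"].fillna("Unknown")

# 2. Replace with the mean (for a numeric field)
df["age"] = df["age"].fillna(df["age"].mean())
```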
Data Cleaning

▪ Outliers

– Should we always remove all outliers?

Data Cleaning

▪ Create an index field
– To track the sort order of the records in the database
– Data mining data gets partitioned at least once (and sometimes several times)
– It is helpful to have an index field so that the original sort order may be recreated (see the sketch below)

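A sketch of an index field in pandas (the column names and the shuffle step are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 30, 20]})

# Record the original row order before any partitioning or shuffling.
df["record_id"] = range(len(df))

# ...after sampling, splits, or merges, recover the original order:
df = df.sample(frac=1)            # e.g., a random shuffle
df = df.sort_values("record_id")  # original sort order restored
```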
Data Cleaning

▪ Remove unary (or nearly unary) variables
– Variables that take on only a single value
– Sometimes a variable can be very nearly unary
e.g., Suppose that 99.95% of the players in a field hockey league are female, with the remaining 0.05% male
– While it may be useful to investigate the male players, some algorithms will tend to treat the variable as essentially unary

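One way to flag nearly unary columns (the 99.5% cutoff and the toy data are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["F"] * 1999 + ["M"],   # nearly unary
                   "goals": list(range(2000))})   # varied

cutoff = 0.995  # illustrative threshold
# A column is nearly unary if its most frequent value covers
# almost every record.
nearly_unary = [col for col in df.columns
                if df[col].value_counts(normalize=True).iloc[0] >= cutoff]
print(nearly_unary)  # ['sex']
```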
Data Cleaning

▪ Removing variables with ≥90% missing values
– But should we always remove them?
e.g., the variable ‘donation’ from survey data
– If most people do not donate, the data will contain many missing values.

▪ Recommendation
– Create a dummy variable (see the sketch below)
(1 = record without a missing value; 0 = record with a missing value)

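A minimal sketch of such a missingness flag in pandas (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"donation": [50, None, None, 20, None]})

# 1 = record without a missing value; 0 = record with a missing value
df["donation_flag"] = df["donation"].notna().astype(int)
```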
Data Cleaning

▪ Removing strongly correlated variables
– In statistics, they lead to the issue of multicollinearity
– In data mining and predictive analytics, they may cause a double-count of a particular aspect of the analysis and, at worst, lead to instability of the model results

▪ Recommendation
– Remove the variables from the model (see the sketch below)
– Apply dimension reduction techniques, such as principal components analysis (PCA)
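A sketch for spotting strongly correlated pairs before removal (the 0.9 cutoff and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [170, 182, 165, 190],
                   "height_in": [66.9, 71.7, 65.0, 74.8],
                   "age":       [23, 35, 29, 41]})

corr = df.corr().abs()
# Scan the upper triangle of the correlation matrix for strong pairs.
strong_pairs = [(a, b, round(corr.loc[a, b], 3))
                for i, a in enumerate(corr.columns)
                for b in corr.columns[i + 1:]
                if corr.loc[a, b] >= 0.9]
print(strong_pairs)  # height_cm vs. height_in will appear here
```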
Data Cleaning

▪ Removing duplicates
– May occur after merging datasets
– Lead to an overweighting of the data values in those records

But are they really duplicates?

▪ Recommendation
– Weigh the likelihood that the duplicates truly represent different records against the likelihood that the duplicates are indeed just duplicated records (see the sketch below)

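A minimal sketch of dropping duplicates in pandas (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                   "name": ["Ann", "Bo", "Bo", "Cy"]})

# Keep the first occurrence of each fully identical record.
deduped = df.drop_duplicates()

# Or judge duplicates on a key column only:
deduped_by_id = df.drop_duplicates(subset=["customer_id"])
```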
Data Transformation

▪ Adjust the scale of variables
– Variables tend to have different ranges
e.g., two fields in a baseball player data set:
– Batting average: [0.0, 0.400]
– Number of home runs: [0, 70]
– Differing ranges will influence the prediction process of some data mining algorithms
– By standardizing numeric field values, we can ensure that the impact of variables on the model is similar
Data Transformation

▪ Adjust the scale of variables


1. Min-Max scaling
– Results in [0, 1]
– Sensitive to extreme values

$X_{mm} = \dfrac{X - \min(X)}{\max(X) - \min(X)}$

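A one-line sketch in pandas (the values are illustrative):

```python
import pandas as pd

x = pd.Series([0.210, 0.275, 0.310, 0.400])

# Min-max scaling: maps the values onto [0, 1].
x_mm = (x - x.min()) / (x.max() - x.min())
```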
Data Transformation

▪ Adjust the scale of variables


2. Decimal scaling
– Reduce the magnitude by dividing by a power of 10
– Results in [-1, 1]

$X_{ds} = \dfrac{X}{10^{d}}$

where d represents the number of digits in the data value with the largest absolute value

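A sketch of decimal scaling, computing d from the data (the values are illustrative):

```python
import pandas as pd

x = pd.Series([12, -450, 7, 6269])

# d = number of digits in the value with the largest absolute value
# (6269 -> d = 4)
d = len(str(int(x.abs().max())))
x_ds = x / (10 ** d)  # all values now fall within [-1, 1]
```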
Data Transformation

▪ Adjust the scale of variables


3. Z-score standardization
– Rescales values to mean = 0 and SD = 1 (the scale of the standard normal distribution)

$X_{zs} = \dfrac{X - \text{mean}(X)}{\text{SD}(X)}$

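The same in pandas (illustrative values; `std()` here is the sample standard deviation):

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40])

# Z-score standardization: mean 0, standard deviation 1.
x_zs = (x - x.mean()) / x.std()
```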
Data Transformation

▪ Adjust the scale of variables


4. Log transformation
– To account for skewness
– Common choices: ln(x); √x; 1/√x

$\text{Skewness}(X) = \dfrac{3\,(\text{mean}(X) - \text{median}(X))}{\text{SD}(X)}$

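A sketch of measuring skewness with the formula above and then applying a log transformation (the values are illustrative):

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 2, 3, 4, 8, 15, 40])  # right-skewed values

# Pearson's second skewness coefficient, as defined above
skewness = 3 * (x.mean() - x.median()) / x.std()

# Log transformation to reduce the right skew (requires positive values)
x_log = np.log(x)
```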
Data Transformation

▪ Dummy variables (a.k.a. flag or indicator)
– A categorical variable taking only 0 or 1
– Create k−1 dummies for a categorical predictor with k possible values, and use the unassigned category as the reference category

e.g., For a variable “region”: {north, east, south, west}, the dummy variables will be:
– dummy_north = 1 if region = north
– dummy_east = 1 if region = east
– dummy_south = 1 if region = south
(west is the unassigned reference category)

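A pandas sketch; note that `get_dummies(..., drop_first=True)` drops the first category in sorted order (here "east"), so the reference category may differ from the slide's choice of west:

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "east", "south", "west"]})

# k-1 dummies: the dropped (first) category becomes the reference.
dummies = pd.get_dummies(df["region"], prefix="dummy", drop_first=True)
df = pd.concat([df, dummies], axis=1)
```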
Data Transformation
▪ Binning of numeric variables
– Partitioning numeric values into bins
– Equal width binning: create k categories with
equal width
– Equal frequency binning: create k categories,
each with the same number of records
– Binning by clustering: use a clustering algorithm

e.g., X = {1,1,1,1,1,2,2,11,11,12,12,44} & k = 3 (see the sketch below)

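A sketch of the first two schemes on the slide's example data:

```python
import pandas as pd

x = pd.Series([1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44])

# Equal-width binning: 3 bins of equal width over [1, 44]
equal_width = pd.cut(x, bins=3)

# Equal-frequency binning: 3 bins with (roughly) equal record counts;
# heavy ties like the repeated 1s can force pandas to merge quantile edges.
equal_freq = pd.qcut(x, q=3, duplicates="drop")
```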
Data Transformation

▪ Transforming categorical to numerical


– Most of the time, this should be avoided
– The exception: when the categorical variable is clearly ordered

e.g., a variable “survey_response” with values {never, sometimes, usually, always}

– Even then, the coding is debatable: should “never” be “0” rather than “1”? Is “always” closer to “usually” than “usually” is to “sometimes”?
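One possible ordinal coding in pandas (the equal spacing between levels is itself an assumption, as the questions above suggest):

```python
import pandas as pd

df = pd.DataFrame({"survey_response":
                   ["never", "sometimes", "usually", "always"]})

# Map ordered categories to integers; the choice of spacing is debatable.
order = {"never": 1, "sometimes": 2, "usually": 3, "always": 4}
df["response_num"] = df["survey_response"].map(order)
```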
Data Transformation

▪ Reclassifying categorical variables
– Sometimes there may be too many categories
e.g., the 50 states in the U.S.

▪ Recommendation
– Reclassify as a variable “region” with five field values: {Northeast, Southeast, North Central, Southwest, West}
– Or reclassify as a variable “economic_level” with three field values: {the richer states, the midrange states, the poorer states}
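A sketch of the “region” reclassification (the mapping shown is partial and illustrative; a real one would cover all 50 states):

```python
import pandas as pd

df = pd.DataFrame({"state": ["NY", "GA", "OH", "TX", "CA"]})

# Partial, illustrative state-to-region mapping
state_to_region = {"NY": "Northeast", "GA": "Southeast",
                   "OH": "North Central", "TX": "Southwest",
                   "CA": "West"}
df["region"] = df["state"].map(state_to_region)
```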
