Data Preprocessing for Clustering

The document provides an overview of data preprocessing, including definitions of data, attributes, and types of attributes such as nominal, ordinal, interval, and ratio. It discusses the importance of understanding dataset characteristics, handling missing values, outliers, and the methods for data normalization and discretization. Additionally, it covers techniques for managing categorical and continuous attributes, as well as measures of similarity and dissimilarity in data analysis.


Data Preprocessing
What is Data?
➢ A collection of data objects (the rows) and their attributes (the columns)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
➢ There are different types of attributes
➢ Nominal
➢ Examples: ID numbers, eye color, zip codes
➢ Ordinal
➢ Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
➢ Interval
➢ Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
➢ Ratio
➢ Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
➢ The type of an attribute depends on which of the
following properties it possesses:
➢ Distinctness: =, ≠
➢ Order: <, >
➢ Addition: +, −
➢ Multiplication: ×, /

➢ Nominal attribute: distinctness


➢ Ordinal attribute: distinctness & order
➢ Interval attribute: distinctness, order & addition
➢ Ratio attribute: all 4 properties
Discrete, Continuous, & Asymmetric Attributes
➢ Discrete Attribute
➢ Has only a finite or countably infinite set of values
➢ Ex: zip codes, counts, or the set of words in a collection of documents
➢ Often represented as integer variables (Nominal, ordinal, binary attributes)
➢ Continuous Attribute
➢ Has real numbers as attribute values
➢ Interval and ratio attributes
➢ Ex: temperature, height, or weight
➢ Asymmetric Attribute
➢ Only presence is regarded as important
➢ Ex: If students are compared on the basis of the courses they do not take,
then most students would seem very similar
Step 1: To describe the dataset

What do your records represent?

What does each attribute mean?

What type of attributes?
• Categorical
• Numerical
  • Discrete
  • Continuous
• Binary – Asymmetric
Step 2: To explore the dataset
➢ Preliminary investigation of the data to better
understand its specific characteristics
➢ It can help to answer some of the data mining questions
➢ To help in selecting pre-processing tools
➢ To help in selecting appropriate data mining algorithms
➢ Things to look at
➢ Class balance
➢ Dispersion of data attribute values
➢ Skewness, outliers, missing values
➢ Attributes that vary together
➢ Visualization tools (histograms, scatter plots) are important
Useful Statistics
➢ Discrete attributes
➢ Frequency of each value
➢ Mode = value with highest frequency
➢ Continuous attributes
➢ Range of values, i.e. min and max
➢ Mean (average)
➢ Sensitive to outliers
➢ Median
➢ Better indication of the "middle" of a set of values in a skewed distribution
➢ Skewed distribution
➢ mean and median are quite different
Skewed Distributions of Attribute Values
Dispersion of Data
➢ How do the values of an attribute spread?
➢ Variance
➢ Variance is sensitive to outliers
$$\mathrm{variance}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
➢ What if the distribution of values is multimodal, i.e.
data has several bumps?

➢ Visualization tools are useful


Attributes that Vary Together
➢ Correlation is a measure that describe how two attributes
vary together
$$\mathrm{corr}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
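Both formulas translate directly to NumPy; a minimal sketch with two made-up attribute vectors x and y:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

# Variance: mean squared deviation from the mean (sensitive to outliers)
variance = np.mean((x - x.mean()) ** 2)        # equivalent to np.var(x)

# Correlation, exactly as in the formula above
corr = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)                                              # equivalent to np.corrcoef(x, y)[0, 1]
print(variance, corr)
```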
Data Quality
➢ Examples of data quality problems:
➢ Noise and outliers
➢ Missing values
➢ Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    <- A mistake or a millionaire?
6    No      NULL            60K             No     <- Missing value
7    Yes     Divorced        220K            NULL   <- Missing value
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No     <- Inconsistent duplicate entries
Missing Values
➢ Reasons for missing values
➢ Information is not collected
(e.g., people decline to give their age and weight)
➢ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

➢ Handling missing values


➢ Eliminate Data Objects
➢ Estimate Missing Values
➢ Ignore the Missing Value During Analysis
➢ Replace with all possible values (weighted by their probabilities)
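The first three strategies map directly onto pandas calls; a minimal sketch (the DataFrame and column names are assumptions, not from the slides):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"marital_status": ["Single", None, "Married"],
                   "income": [125.0, 60.0, np.nan]})

# Eliminate data objects: drop every row that has a missing value
dropped = df.dropna()

# Estimate missing values: fill a numeric column with its median,
# a categorical column with its mode
filled = df.fillna({"income": df["income"].median(),
                    "marital_status": df["marital_status"].mode()[0]})

# Ignoring missing values during analysis is often the default behavior:
# e.g., df["income"].mean() skips NaN entries automatically
```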
Outliers
➢ Outliers are data objects with characteristics that are considerably
different from most of the other data objects in the data set
➢ Can help to
➢ detect new phenomena
➢ discover unusual behavior in data
➢ detect problems
How to Handle Noisy Data?
➢ Binning method:
➢ first sort data and partition into (equi-depth) bins
➢ then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
➢ Clustering
➢ detect and remove outliers
➢ Combined computer and human inspection
➢ detect suspicious values and check by human
➢ Regression
➢ smooth by fitting the data to regression functions
Discretization

➢ Divide the range of a continuous attribute into intervals

➢ Interval labels can be used to replace actual data values

➢ Reduce data size by discretization

➢ Some data mining algorithms only work with discrete attributes
➢ E.g., Apriori for association rule mining (ARM)
Binning (Equal-width)
➢ Equal-width (distance) partitioning
➢ Divide the attribute values x into k equally sized bins
➢ If xmin ≤ x ≤ xmax, then the bin width δ is given by
$$\delta = \frac{x_{max} - x_{min}}{k}$$

➢ Disadvantages:
➢ outliers may dominate presentation
➢ Skewed data is not handled well.
Binning (Equal-frequency)
➢ Equal-depth (frequency) partitioning:
➢ Divides the range into N intervals, each containing approximately the same number of samples
➢ Good data scaling

➢ Disadvantage:
➢ Many occurrences of the same continuous value could
cause the values to be assigned into different bins
➢ Managing categorical attributes can be tricky.
Binning Example
Attribute values (for one attribute, e.g., age):
• 0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning, for a bin width of e.g. 10:
• Bin 1: 0, 4              [−∞, 10) bin
• Bin 2: 12, 16, 16, 18    [10, 20) bin
• Bin 3: 24, 26, 28        [20, +∞) bin
• (−∞ denotes negative infinity, +∞ positive infinity)

Equi-frequency binning, for a bin density of e.g. 3:
• Bin 1: 0, 4, 12      [−∞, 14) bin
• Bin 2: 16, 16, 18    [14, 21) bin
• Bin 3: 24, 26, 28    [21, +∞) bin
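The same example can be reproduced in pandas, where `cut` gives equal-width bins and `qcut` equal-frequency bins; a minimal sketch:

```python
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equi-width binning with the slide's boundaries (width 10, half-open intervals)
equi_width = pd.cut(ages, bins=[float("-inf"), 10, 20, float("inf")], right=False)

# Equi-frequency binning into 3 bins of roughly 3 values each
equi_freq = pd.qcut(ages, q=3)

print(equi_width.value_counts(sort=False))
print(equi_freq.value_counts(sort=False))
```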
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equi-depth bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34

Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
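A minimal pure-Python sketch of both smoothing variants for the price data above (bins hard-coded as on the slide):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[0:4], prices[4:8], prices[8:12]]   # equi-depth, 4 values per bin

# Smoothing by bin means: each value is replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```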
Data Normalization
Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0
➢ Min-Max normalization
➢ performs a linear transformation on the original data:
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
➢ Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively.
➢ Map income to the range [0.0, 1.0]: a value of $73,600 is transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716
Data Normalization
➢ z-score normalization (or zero-mean normalization)
➢ The values of an attribute A are normalized based on the mean and standard deviation of A:
$$v' = \frac{v - mean_A}{stand\_dev_A}$$
➢ Example: Let mean = 54,000 and standard deviation = 16,000 for the attribute income
➢ With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225
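Both normalizations as plain Python, reproducing the two worked examples above:

```python
v = 73_600.0

# Min-max normalization of income to [0.0, 1.0]
min_a, max_a = 12_000.0, 98_000.0
new_min, new_max = 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_minmax, 3))   # 0.716

# z-score normalization of income
mean_a, std_a = 54_000.0, 16_000.0
v_z = (v - mean_a) / std_a
print(v_z)                  # 1.225
```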
Continuous and Categorical Attributes
How to apply the association analysis formulation to non-asymmetric binary variables?

Session Id  Country    Session Length (sec)  Number of Web Pages Viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …

Example of an association rule:
{Number of Pages ∈ [5,10) ∧ (Browser = Mozilla)} → {Buy = No}
Handling Categorical Attributes
➢ Categorical Attributes:
➢ finite number of possible values,
➢ no ordering among values
➢ Transform categorical attribute into asymmetric
binary variables
➢ Introduce a new “item” for each distinct
attribute-value pair
➢ Example: replace Browser Type attribute with
➢ Browser Type = Internet Explorer
➢ Browser Type = Mozilla
➢ Browser Type = Chrome
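This transformation is one-hot encoding; a minimal pandas sketch (the browser values are taken from the example above):

```python
import pandas as pd

df = pd.DataFrame({"Browser Type": ["IE", "Mozilla", "Chrome", "IE"]})

# One new asymmetric binary "item" per distinct attribute-value pair
items = pd.get_dummies(df["Browser Type"], prefix="Browser Type")
print(items)
# Columns: "Browser Type_Chrome", "Browser Type_IE", "Browser Type_Mozilla",
# with a 1 (True) only where that attribute-value pair is present
```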
Handling Categorical Attributes
➢ Potential Issues
➢ What if an attribute has many possible values?
➢ Example: the attribute country has more than 200 possible values
➢ Many of the attribute values may have very low support
➢ Potential solution: aggregate the low-support attribute values
➢ Replace the less frequent attribute values with a category called "others", as in the sketch below
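A minimal pandas sketch of this aggregation (the 20% support threshold is an assumption for illustration):

```python
import pandas as pd

country = pd.Series(["USA", "USA", "China", "Germany", "Fiji", "Malta"])

# Keep values whose support (relative frequency) reaches the threshold;
# replace all rarer values with the category "others"
support = country.value_counts(normalize=True)
keep = support[support >= 0.20].index
country_agg = country.where(country.isin(keep), other="others")
print(country_agg.tolist())   # ['USA', 'USA', 'others', 'others', 'others', 'others']
```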
Handling Categorical Attributes

➢ Potential Issue: What if the distribution of attribute values is highly skewed?
➢ Example: In an online survey, we collected information regarding the attributes gender, education, state, computer at home, chat online, shop online and privacy concern.
➢ 85% of the participants have a computer at home
➢ {Computer at home = yes, Shop online = yes} → {Privacy concerns = yes}
➢ Better: {Shop online = yes} → {Privacy concerns = yes}
➢ Potential solution: drop the highly frequent items
Handling Continuous Attributes
➢ Different kinds of rules:
➢ Age ∈ [21,35) ∧ Salary ∈ [70k,120k) → Buy
➢ Salary ∈ [70k,120k) ∧ Buy → Age: μ = 28, σ = 4
➢ Different methods:
➢Discretization-based
➢Equal-width binning
➢Equal-depth binning
➢Clustering
➢Statistics-based
Similarity and Dissimilarity
Similarity and Dissimilarity
Similarity
• Numerical measure of how alike two data objects are
• Higher when objects are more alike
• Often falls in the range [0,1]

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0; the upper limit varies

Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects. Per attribute type:
• Nominal: d(p, q) = 0 if p = q, and 1 otherwise
• Ordinal: map the values to 0 … n − 1, then d(p, q) = |p − q| / (n − 1)
• Interval or ratio: d(p, q) = |p − q|
Euclidean Distance

➢ Euclidean Distance
$$dist = \sqrt{\sum_{k=1}^{n}(p_k - q_k)^2}$$
where n is the number of dimensions (attributes), and p_k and q_k are the k-th attributes (components) of data objects p and q.

➢ Standardization is necessary if scales differ, as in the sketch below.
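A direct NumPy translation, with z-score standardization first since the attributes sit on very different scales (the numbers reuse the session table above):

```python
import numpy as np

# Objects as (session length in sec, pages viewed), from the session table
X = np.array([[982.0, 8.0], [811.0, 10.0], [2125.0, 45.0]])

# Raw Euclidean distance between the first two objects
dist = np.sqrt(np.sum((X[0] - X[1]) ** 2))

# Standardize each attribute (z-score), then recompute the distance
Z = (X - X.mean(axis=0)) / X.std(axis=0)
dist_std = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))
print(dist, dist_std)
```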


General Approach for Combining Similarities
Sometimes attributes are of many different types, but an
overall similarity is needed.
Using Weights to Combine Similarities

➢ May not want to treat all attributes the same.


➢ Use weights w_k that are between 0 and 1 and sum to 1.
Example

➢ One categorical variable, test-1,


➢ d(i, j) evaluates to 0 if objects i and j take the same value, and 1 otherwise
Example

➢ Ordinal variable, test-2,


➢ d(i, j) = |i − j| / (n − 1), where the n ordinal values are mapped to ranks 0 to n − 1
Example

➢ Ratio-scaled variable, test-3

➢ Normalize (min-max normalization), with max = 64 and min = 22
➢ Then apply a distance measure (Manhattan or Euclidean distance)
Example

➢ Variable of mixed types

➢ We combine the dissimilarity matrices for the three variables, as in the sketch below
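A minimal sketch of the combination for one pair of objects, following the per-type formulas above with equal weights (the attribute values and weights are assumptions, not the slides' actual tables):

```python
# Objects described by (test-1: nominal, test-2: ordinal rank, test-3: ratio)
obj_i = ("code-A", 2, 45.0)
obj_j = ("code-B", 0, 22.0)

n_ranks = 3                    # ordinal values already mapped to 0 .. n-1
t3_min, t3_max = 22.0, 64.0    # min/max of test-3, as in the example

d_nominal = 0.0 if obj_i[0] == obj_j[0] else 1.0
d_ordinal = abs(obj_i[1] - obj_j[1]) / (n_ranks - 1)
d_ratio = abs(obj_i[2] - obj_j[2]) / (t3_max - t3_min)   # after min-max scaling

# Equal weights between 0 and 1 that sum to 1
weights = (1 / 3, 1 / 3, 1 / 3)
d_mixed = sum(w * d for w, d in zip(weights, (d_nominal, d_ordinal, d_ratio)))
print(d_mixed)
```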
Assignment 4
➢ Preprocessing and Clustering using K-means in PySpark
➢ Due on 24th May
Project ???
➢ Replaced with
➢ more assignments
➢ a mini project
➢ a review of a recent research paper on Spark
