Data Preprocessing for Clustering

The document provides an overview of data preprocessing, including definitions of data, attributes, and types of attributes such as nominal, ordinal, interval, and ratio. It discusses the importance of understanding dataset characteristics, handling missing values, outliers, and the methods for data normalization and discretization. Additionally, it covers techniques for managing categorical and continuous attributes, as well as measures of similarity and dissimilarity in data analysis.


Data Preprocessing
What is Data?
➢ A collection of data objects (the rows) and their attributes (the columns)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
➢ There are different types of attributes
➢ Nominal
➢ Examples: ID numbers, eye color, zip codes
➢ Ordinal
➢ Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
➢ Interval
➢ Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
➢ Ratio
➢ Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
➢ The type of an attribute depends on which of the
following properties it possesses:
➢ Distinctness: =, ≠
➢ Order: <, >
➢ Addition: +, −
➢ Multiplication: ×, /

➢ Nominal attribute: distinctness


➢ Ordinal attribute: distinctness & order
➢ Interval attribute: distinctness, order & addition
➢ Ratio attribute: all 4 properties
Discrete, Continuous, & Asymmetric Attributes
➢ Discrete Attribute
➢ Has only a finite or countably infinite set of values
➢ Ex: zip codes, counts, or the set of words in a collection of documents
➢ Often represented as integer variables (Nominal, ordinal, binary attributes)
➢ Continuous Attribute
➢ Has real numbers as attribute values
➢ Interval and ratio attributes
➢ Ex: temperature, height, or weight
➢ Asymmetric Attribute
➢ Only presence is regarded as important
➢ Ex: If students are compared on the basis of the courses they do not take,
then most students would seem very similar
Step 1: To describe the dataset

What do your records represent?

What does each attribute mean?

What type of attributes?
• Categorical
• Numerical
  • Discrete
  • Continuous
• Binary – Asymmetric
Step 2: To explore the dataset
➢ Preliminary investigation of the data to better
understand its specific characteristics
➢ It can help to answer some of the data mining questions
➢ To help in selecting pre-processing tools
➢ To help in selecting appropriate data mining algorithms
➢ Things to look at
➢ Class balance
➢ Dispersion of data attribute values
➢ Skewness, outliers, missing values
➢ Attributes that vary together
➢ Visualization tools (histograms, scatter plots) are important
Useful Statistics
➢ Discrete attributes
➢ Frequency of each value
➢ Mode = value with highest frequency
➢ Continuous attributes
➢ Range of values, i.e. min and max
➢ Mean (average)
➢ Sensitive to outliers
➢ Median
➢ Better indication of the "middle" of a set of values in a skewed distribution
➢ Skewed distribution
➢ mean and median are quite different
Skewed Distributions of Attribute Values
Dispersion of Data
➢ How do the values of an attribute spread?
➢ Variance
➢ Variance is sensitive to outliers
$$\mathrm{variance}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
➢ What if the distribution of values is multimodal, i.e.
data has several bumps?

➢ Visualization tools are useful


Attributes that Vary Together
➢ Correlation is a measure that describe how two attributes
vary together
$$\mathrm{corr}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
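Both formulas translate directly to NumPy; a minimal sketch with two made-up attribute vectors x and y:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

# Variance: mean squared deviation from the mean (sensitive to outliers)
variance = np.mean((x - x.mean()) ** 2)        # equivalent to np.var(x)

# Correlation, exactly as in the formula above
corr = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)                                              # equivalent to np.corrcoef(x, y)[0, 1]
print(variance, corr)
```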
Data Quality
➢ Examples of data quality problems:
➢ Noise and outliers
➢ Missing values
➢ Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    <- A mistake or a millionaire?
6    No      NULL            60K             No     <- Missing value
7    Yes     Divorced        220K            NULL   <- Missing value
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No     <- Inconsistent duplicate entries
Missing Values
➢ Reasons for missing values
➢ Information is not collected
(e.g., people decline to give their age and weight)
➢ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

➢ Handling missing values


➢ Eliminate Data Objects
➢ Estimate Missing Values
➢ Ignore the Missing Value During Analysis
➢ Replace with all possible values (weighted by their probabilities)
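The first three strategies map directly onto pandas calls; a minimal sketch (the DataFrame and column names are assumptions, not from the slides):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"marital_status": ["Single", None, "Married"],
                   "income": [125.0, 60.0, np.nan]})

# Eliminate data objects: drop every row that has a missing value
dropped = df.dropna()

# Estimate missing values: fill a numeric column with its median,
# a categorical column with its mode
filled = df.fillna({"income": df["income"].median(),
                    "marital_status": df["marital_status"].mode()[0]})

# Ignoring missing values during analysis is often the default behavior:
# e.g., df["income"].mean() skips NaN entries automatically
```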
Outliers
➢ Outliers are data objects with characteristics that are considerably
different from most of the other data objects in the data set
➢ Can help to
➢ detect new phenomena
➢ discover unusual behavior in data
➢ detect problems
How to Handle Noisy Data?
➢ Binning method:
➢ first sort data and partition into (equi-depth) bins
➢ then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
➢ Clustering
➢ detect and remove outliers
➢ Combined computer and human inspection
➢ detect suspicious values and check by human
➢ Regression
➢ smooth by fitting the data to regression functions
Discretization

➢ Divide the range of a continuous attribute into intervals

➢ Interval labels can be used to replace actual data values

➢ Reduce data size by discretization

➢ Some data mining algorithms only work with discrete attributes
➢ E.g., Apriori for association rule mining (ARM)
Binning (Equal-width)
➢ Equal-width (distance) partitioning
➢ Divide the attribute values x into k equally sized bins
➢ If xmin ≤ x ≤ xmax, then the bin width δ is given by
$$\delta = \frac{x_{max} - x_{min}}{k}$$

➢ Disadvantages:
➢ outliers may dominate presentation
➢ Skewed data is not handled well.
Binning (Equal-frequency)
➢ Equal-depth (frequency) partitioning:
➢ Divides the range into N intervals, each containing approximately the same number of samples
➢ Good data scaling

➢ Disadvantage:
➢ Many occurrences of the same continuous value could
cause the values to be assigned into different bins
➢ Managing categorical attributes can be tricky.
Binning Example
Attribute values (for one attribute, e.g., age):
• 0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning, for a bin width of e.g. 10:
• Bin 1: 0, 4              [−∞, 10) bin
• Bin 2: 12, 16, 16, 18    [10, 20) bin
• Bin 3: 24, 26, 28        [20, +∞) bin
• (−∞ denotes negative infinity, +∞ positive infinity)

Equi-frequency binning, for a bin density of e.g. 3:
• Bin 1: 0, 4, 12      [−∞, 14) bin
• Bin 2: 16, 16, 18    [14, 21) bin
• Bin 3: 24, 26, 28    [21, +∞) bin
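The same example can be reproduced in pandas, where `cut` gives equal-width bins and `qcut` equal-frequency bins; a minimal sketch:

```python
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equi-width binning with the slide's boundaries (width 10, half-open intervals)
equi_width = pd.cut(ages, bins=[float("-inf"), 10, 20, float("inf")], right=False)

# Equi-frequency binning into 3 bins of roughly 3 values each
equi_freq = pd.qcut(ages, q=3)

print(equi_width.value_counts(sort=False))
print(equi_freq.value_counts(sort=False))
```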
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equi-depth bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34

Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
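A minimal pure-Python sketch of both smoothing variants for the price data above (bins hard-coded as on the slide):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[0:4], prices[4:8], prices[8:12]]   # equi-depth, 4 values per bin

# Smoothing by bin means: each value is replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```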
Data Normalization
Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0
➢ Min-Max normalization
➢ performs a linear transformation on the original data:
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
➢ Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively.
➢ Map income to the range [0.0, 1.0]: a value of $73,600 is transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716
Data Normalization
➢ z-score normalization (or zero-mean normalization)
➢ The values of an attribute A are normalized based on the mean and standard deviation of A:
$$v' = \frac{v - mean_A}{stand\_dev_A}$$
➢ Example: Let mean = 54,000 and standard deviation = 16,000 for the attribute income
➢ With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225
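Both normalizations as plain Python, reproducing the two worked examples above:

```python
v = 73_600.0

# Min-max normalization of income to [0.0, 1.0]
min_a, max_a = 12_000.0, 98_000.0
new_min, new_max = 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_minmax, 3))   # 0.716

# z-score normalization of income
mean_a, std_a = 54_000.0, 16_000.0
v_z = (v - mean_a) / std_a
print(v_z)                  # 1.225
```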
Continuous and Categorical Attributes
How to apply the association analysis formulation to non-asymmetric binary variables?

Session Id  Country    Session Length (sec)  Number of Web Pages Viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …

Example of an association rule:
{Number of Pages ∈ [5,10) ∧ (Browser = Mozilla)} → {Buy = No}
Handling Categorical Attributes
➢ Categorical Attributes:
➢ finite number of possible values,
➢ no ordering among values
➢ Transform categorical attribute into asymmetric
binary variables
➢ Introduce a new “item” for each distinct
attribute-value pair
➢ Example: replace Browser Type attribute with
➢ Browser Type = Internet Explorer
➢ Browser Type = Mozilla
➢ Browser Type = Chrome
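This transformation is one-hot encoding; a minimal pandas sketch (the browser values are taken from the example above):

```python
import pandas as pd

df = pd.DataFrame({"Browser Type": ["IE", "Mozilla", "Chrome", "IE"]})

# One new asymmetric binary "item" per distinct attribute-value pair
items = pd.get_dummies(df["Browser Type"], prefix="Browser Type")
print(items)
# Columns: "Browser Type_Chrome", "Browser Type_IE", "Browser Type_Mozilla",
# with a 1 (True) only where that attribute-value pair is present
```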
Handling Categorical Attributes
➢ Potential Issues
➢ What if an attribute has many possible values?
➢ Example: the attribute country has more than 200 possible values
➢ Many of the attribute values may have very low support
➢ Potential solution: aggregate the low-support attribute values
➢ Replace the less frequent attribute values with a category called "others", as in the sketch below
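A minimal pandas sketch of this aggregation (the 20% support threshold is an assumption for illustration):

```python
import pandas as pd

country = pd.Series(["USA", "USA", "China", "Germany", "Fiji", "Malta"])

# Keep values whose support (relative frequency) reaches the threshold;
# replace all rarer values with the category "others"
support = country.value_counts(normalize=True)
keep = support[support >= 0.20].index
country_agg = country.where(country.isin(keep), other="others")
print(country_agg.tolist())   # ['USA', 'USA', 'others', 'others', 'others', 'others']
```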
Handling Categorical Attributes

➢ Potential Issue: What if the distribution of attribute values is highly skewed?
➢ Example: In an online survey, we collected information regarding the attributes gender, education, state, computer at home, chat online, shop online and privacy concern.
➢ 85% of the participants have a computer at home
➢ {Computer at home = yes, Shop online = yes} → {Privacy concerns = yes}
➢ Better: {Shop online = yes} → {Privacy concerns = yes}
➢ Potential solution: drop the highly frequent items
Handling Continuous Attributes
➢ Different kinds of rules:
➢ Age ∈ [21,35) ∧ Salary ∈ [70k,120k) → Buy
➢ Salary ∈ [70k,120k) ∧ Buy → Age: μ = 28, σ = 4
➢ Different methods:
➢Discretization-based
➢Equal-width binning
➢Equal-depth binning
➢Clustering
➢Statistics-based
Similarity and Dissimilarity
Similarity and Dissimilarity
Similarity
• Numerical measure of how alike two data objects are
• Higher when objects are more alike
• Often falls in the range [0,1]

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0; the upper limit varies

Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects. Per attribute type:
• Nominal: d(p, q) = 0 if p = q, and 1 otherwise
• Ordinal: map the values to 0 … n − 1, then d(p, q) = |p − q| / (n − 1)
• Interval or ratio: d(p, q) = |p − q|
Euclidean Distance

➢ Euclidean Distance
$$dist = \sqrt{\sum_{k=1}^{n}(p_k - q_k)^2}$$
where n is the number of dimensions (attributes), and p_k and q_k are the k-th attributes (components) of data objects p and q.

➢ Standardization is necessary if scales differ, as in the sketch below.
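A direct NumPy translation, with z-score standardization first since the attributes sit on very different scales (the numbers reuse the session table above):

```python
import numpy as np

# Objects as (session length in sec, pages viewed), from the session table
X = np.array([[982.0, 8.0], [811.0, 10.0], [2125.0, 45.0]])

# Raw Euclidean distance between the first two objects
dist = np.sqrt(np.sum((X[0] - X[1]) ** 2))

# Standardize each attribute (z-score), then recompute the distance
Z = (X - X.mean(axis=0)) / X.std(axis=0)
dist_std = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))
print(dist, dist_std)
```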


General Approach for Combining Similarities
Sometimes attributes are of many different types, but an
overall similarity is needed.
Using Weights to Combine Similarities

➢ May not want to treat all attributes the same.


➢ Use weights w_k that are between 0 and 1 and sum to 1.
Example

➢ One categorical variable, test-1,


➢ d(i, j) evaluates to 0 if objects i and j take the same value, and 1 otherwise
Example

➢ Ordinal variable, test-2,


➢ d(i, j) = |i − j| / (n − 1), where the n ordinal values are mapped to ranks 0 to n − 1
Example

➢ Ratio-scaled variable, test-3

➢ Normalize (min-max normalization), with max = 64 and min = 22
➢ Then apply a distance measure (Manhattan or Euclidean distance)
Example

➢ Variable of mixed types

➢ We combine the dissimilarity matrices for the three variables, as in the sketch below
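A minimal sketch of the combination for one pair of objects, following the per-type formulas above with equal weights (the attribute values and weights are assumptions, not the slides' actual tables):

```python
# Objects described by (test-1: nominal, test-2: ordinal rank, test-3: ratio)
obj_i = ("code-A", 2, 45.0)
obj_j = ("code-B", 0, 22.0)

n_ranks = 3                    # ordinal values already mapped to 0 .. n-1
t3_min, t3_max = 22.0, 64.0    # min/max of test-3, as in the example

d_nominal = 0.0 if obj_i[0] == obj_j[0] else 1.0
d_ordinal = abs(obj_i[1] - obj_j[1]) / (n_ranks - 1)
d_ratio = abs(obj_i[2] - obj_j[2]) / (t3_max - t3_min)   # after min-max scaling

# Equal weights between 0 and 1 that sum to 1
weights = (1 / 3, 1 / 3, 1 / 3)
d_mixed = sum(w * d for w, d in zip(weights, (d_nominal, d_ordinal, d_ratio)))
print(d_mixed)
```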
Assignment 4
➢ Preprocessing and Clustering using K-means in PySpark
➢ Due on 24th May
Project ???
➢ Replaced with
➢ more assignments
➢ a mini project
➢ a review of a recent research paper on Spark
