Modified Module 2-DM
MODULE 2
Chapter 2: Data Preprocessing
Data Preprocessing
Types of data?
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Similarity and Dissimilarity measures.
What is Data?
Data is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object.
An object is also known as a record, point, case, sample, entity, or instance.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of attributes…
1) Nominal:
Items are differentiated by a simple naming system.
They may have numbers assigned to them, but these are not actual numbers; they simply serve as labels for reference.
They are 'categorical', i.e., they belong to a definable category.
Ex: ID numbers, eye color, zip codes
Types of attributes…
2) Ordinal:
Values have some kind of order given by their position on a scale.
The order of items can be defined by assigning numbers indicating relative position; letters can also be assigned.
They are 'categorical'.
Arithmetic is not meaningful – only the ordering property holds.
Ex:
Rankings (taste of potato chips on a scale from 1–10)
Grades in {A, B, C, D, E}
Height in {small, medium, large}
Types of attributes…
3) Interval:
Measured along a scale in which each position is equidistant from the next; differences are meaningful, but there is no true zero point.
Ex:
Calendar dates
Temperature in Celsius/Fahrenheit
Types of attributes…
4) Ratio:
Numbers can be compared as multiples of one another:
One person can be twice as tall as another person.
The number zero is meaningful – the scale has a true zero point.
Ex:
The difference between a person of age 35 and a person of age 38 is the same as the difference between people aged 12 and 15 (35 to 38 = 3, 12 to 15 = 3), and a 30-year-old is twice the age of a 15-year-old.
Ratio data can be multiplied and divided.
Types of attributes…
Interval and ratio data measure quantities and
hence are quantitative.
Ex: length, time, count
Types of attributes…
Nominal (symbolic, categorical): values from an unordered set
Ordinal: values from an ordered set
Continuous (interval or ratio): e.g., age in years
If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
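A minimal sketch of this view, assuming NumPy and a made-up set of records with attributes (age, income, years employed):

```python
import numpy as np

# Four data objects sharing the same fixed numeric attributes
# (age, income, years_employed), viewed as points in 3-D space.
records = np.array([
    [25, 50000, 2],
    [40, 80000, 15],
    [33, 62000, 8],
    [25, 51000, 3],
])

print(records.shape)  # (4, 3): four points in a 3-dimensional attribute space
```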
Document Data
Each document becomes a 'term' vector: each term is an attribute, and the value of the attribute is the number of times the term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
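A small sketch of how such term vectors can be built; the term_vector helper and the sample text are illustrative, not part of the original slides:

```python
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    # Count how often each vocabulary term occurs in the text.
    words = text.lower().split()
    return [words.count(term) for term in vocab]

print(term_vector("team play game game", vocabulary))
# [1, 0, 1, 0, 0, 2, 0, 0, 0, 0]
```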
Transaction Data

Graph Data
Nodes – atoms
Links – chemical bonds
Ex: Benzene molecule, C6H6
Mining Substructures
Which substructures occur frequently in a chemical compound?
Is the presence of one substructure associated with the presence of others? Patterns?
Spatio-Temporal Data
Has spatial attributes – position and area.
Ex: weather data; Earth sciences data – temperature and pressure measured at points on a latitude–longitude grid.
[Figure: Average Monthly Temperature of land and ocean]
Time-series data
A special type of sequential data: each record is a time series, i.e., a series of measurements taken over time.
Ex: a financial data set whose objects are the time series of the daily prices of various stocks.
Time series have temporal autocorrelation: if two measurements are close in time, their values are often similar.
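A short sketch of temporal autocorrelation, assuming NumPy and an invented price series; lag-1 autocorrelation is the Pearson correlation between the series and itself shifted by one step:

```python
import numpy as np

prices = np.array([100.0, 101.5, 101.0, 102.3, 103.1, 102.8, 104.0])

def lag1_autocorr(x):
    # Correlation between consecutive measurements: x[t] vs. x[t+1].
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(round(lag1_autocorr(prices), 3))  # close to 1 for a smooth series
```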
Data Quality: Why Preprocess the Data?
Measures of data quality include accuracy, completeness, consistency, timeliness, believability, and interpretability; real-world data rarely satisfies all of these without preprocessing.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Data in the real world is dirty: there is lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors.
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., Occupation=" " (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary="−10" (an error)
inconsistent: containing discrepancies in codes or names
e.g., Age="42" vs. Birthday="03/07/2010"
rating was "1, 2, 3", now rating is "A, B, C"
discrepancies between duplicate records
intentional (e.g., disguised missing data)
e.g., Jan. 1 recorded as everyone's birthday
Incomplete (Missing) Data
Data is not always available. Missing values may arise from technology limitations, incomplete data collection, or inconsistencies that caused data to be discarded.
How to Handle Noisy Data?
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, bin medians, or bin boundaries (see the sketch below)
Clustering
detect and remove outliers
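A minimal sketch of smoothing by equal-frequency bin means; the function name is our own, and the price values are the classic textbook illustration:

```python
def smooth_by_bin_means(values, depth):
    # Sort, partition into bins of `depth` values, then replace
    # each value by the mean of its bin.
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_vals = data[i:i + depth]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```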
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
Redundant data occur often when multiple databases are integrated. Redundant attributes may be detected by correlation analysis: the χ² (chi-square) test for nominal data, and the correlation coefficient or covariance for numeric data.
Correlation Analysis (Numeric Data)
The correlation between attributes A and B is measured by the (Pearson) correlation coefficient:

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means, and \sigma_A and \sigma_B are the respective standard deviations of A and B.
[Figure: scatter plots showing correlations ranging from –1 to 1.]
Correlation (viewed as linear relationship)
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, A and B, and then take their dot product
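A sketch of this statement, assuming NumPy and reusing the stock prices from the covariance example below; with the sample (n−1) convention, the standardized dot product matches Pearson's r:

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize: subtract the mean, divide by the (sample) standard deviation.
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)

r = np.dot(A_std, B_std) / (len(A) - 1)
print(round(r, 4))                        # 0.9407
print(round(np.corrcoef(A, B)[0, 1], 4))  # 0.9407, same value from NumPy
```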
Covariance (Numeric Data)
Covariance is similar to correlation:

Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n} (a_i-\bar{A})(b_i-\bar{B})}{n} = E(A \cdot B) - \bar{A}\,\bar{B}

Correlation coefficient: r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}

Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
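The worked example can be checked with NumPy (a sketch; ddof=0 selects the population covariance E(A·B) − Ā·B̄ used above):

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = np.mean(A * B) - A.mean() * B.mean()
print(cov)                         # 4.0
print(np.cov(A, B, ddof=0)[0, 1])  # 4.0, same value
```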
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation:
Linear regression
Histogram
Clustering
Sampling
Parametric Data Reduction: Regression and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability distributions
Regression Analysis
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors).
The parameters are estimated so as to give a "best fit" of the data.
Most commonly the best fit is evaluated using the least squares method, but other criteria have also been used.
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
[Figure: data points fitted by the line y = x + 1; Y1 is an observed value and Y1' its prediction at X1.]
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are estimated from the data at hand by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
Useful for dimensionality reduction and data smoothing
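A minimal sketch of least-squares estimation of w and b for Y = wX + b, using the closed-form solution for simple linear regression (the data points are made up):

```python
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least squares: w = S_xy / S_xx, b = mean_y - w * mean_x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    return w, mean_y - w * mean_x

w, b = fit_line([1, 2, 3, 4], [2.1, 2.9, 4.2, 4.8])
print(w, b)  # w = 0.94, b = 1.15
```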
Histogram Analysis
Divide data into buckets and store the average (or sum) for each bucket.
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of samples
[Figure: histogram of a price attribute, with buckets from 10,000 to 100,000 and counts from 0 to 40.]
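A sketch of the two partitioning rules (bucket boundaries and bucket contents only; the data values are illustrative):

```python
def equal_width_edges(values, k):
    # k buckets of equal range between min and max.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_frequency_bins(values, k):
    # k buckets holding (roughly) the same number of sorted values.
    data = sorted(values)
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_edges(prices, 3))     # [4.0, 14.0, 24.0, 34.0]
print(equal_frequency_bins(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```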
Discretization and Normalization: Overview
Discretization/Quantization:
Supervised: entropy-based
Unsupervised: equal-width and equal-frequency (histogram) binning
Normalization:
Min-max
Z-score
Decimal scaling
Binarization
Three types of attributes:
Nominal – values from an unordered set
Ordinal – values from an ordered set
Continuous – real numbers
Discretization:
Divide the range of a continuous attribute into intervals.
Some classification algorithms only accept categorical attributes.
Discretization also reduces data size.
Approaches:
Binning – unsupervised, does not depend on the class labels.
Entropy-based – supervised, dependent on the class label.
Example – attribute values with class labels:
Value: 10  15   7  10   6  13   7
Class:  +   +   +   +   −   −   +
Entropy-Based Discretization: Procedure
1. Sort the attribute values to be discretized, S.
2. Bisect the values so that the resulting two intervals have minimum entropy (see the sketch below):
   i. Consider each value T as a possible split point, where T is the midpoint of each pair of consecutive attribute values.
   ii. Compute the information gain before and after choosing T as the split point: Gain = E(S) − E(T, S).
   iii. Select the T that gives the highest information gain as the optimal split.
3. Repeat step 2 on the interval with the highest entropy until a user-specified number of intervals is reached or some other stopping criterion is met.
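A sketch of one bisection step (step 2), assuming values paired with class labels as in the small example above; best_split and its helper are our own names:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(pairs):
    # Try the midpoint of each pair of consecutive sorted values as T
    # and keep the split with the highest information gain.
    data = sorted(pairs)          # sort by attribute value
    labels = [c for _, c in data]
    base = entropy(labels)        # E(S)
    best_t, best_gain = None, -1.0
    for i in range(1, len(data)):
        if data[i - 1][0] == data[i][0]:
            continue              # no midpoint between equal values
        t = (data[i - 1][0] + data[i][0]) / 2
        left, right = labels[:i], labels[i:]
        e_ts = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
        if base - e_ts > best_gain:
            best_t, best_gain = t, base - e_ts
    return best_t, best_gain

pairs = [(10, '+'), (15, '+'), (7, '+'), (10, '+'), (6, '-'), (13, '-'), (7, '+')]
print(best_split(pairs))  # splits at T = 6.5, isolating the '-' value 6
```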
Normalization
Scale attribute values to fall within a small, specified range.
Min-max: v' = (v − min)/(max − min) × (new_max − new_min) + new_min
Z-score: v' = (v − mean)/σ
Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
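A minimal sketch of the three formulas, applied to the age exercise later in this deck (value 35, min 13, max 70, std 12.94 as quoted there; the mean 29.96 is computed from that data):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1
    return v / (10 ** j)

print(round(min_max(35, 13, 70), 3))        # 0.386
print(round(z_score(35, 29.96, 12.94), 3))  # 0.389
print(decimal_scaling(35, 2))               # 0.35 (max age 70 needs j = 2)
```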
Similarity and Dissimilarity
• Important in clustering, some classification, and anomaly detection.
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different are two data objects
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Objects with a Single Attribute
p and q are the attribute values for two data objects.
Nominal: d = 0 if p = q, else 1; s = 1 if p = q, else 0
Ordinal: d = |p − q| / (n − 1), with values mapped to ranks 0 … n − 1; s = 1 − d
Interval/Ratio: d = |p − q|; s = 1 / (1 + d) is one common choice
Dissimilarities between Data Objects with Multiple Attributes
• For ordinal attributes: the attribute ranks need to be normalized.
• Refer to Example 2.21 on page 75 of the textbook.
Dissimilarities between Data Objects with Multiple Numeric Attributes
• Euclidean Distance

dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions and p_k and q_k are, respectively, the kth attributes of data objects p and q.
Euclidean Distance: Example

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

[Figure: the four points plotted in the x–y plane.]
Manhattan (or city block) distance
d(i, j) = |x_{i1} − x_{j1}| + |x_{i2} − x_{j2}| + · · · + |x_{ip} − x_{jp}|
Minkowski Distance
• Minkowski distance is a generalization of Euclidean distance. Given two objects p and q:

dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions, and p_k and q_k are, respectively, the kth attributes of data objects p and q.
Minkowski Distance: Examples

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

r = 1: Manhattan (L1) distance
L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

r = 2: Euclidean (L2) distance
L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
Distance Matrix
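A short sketch computing the L1 and L2 matrices above from the four points (the function name is our own):

```python
def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1, p2, p3, p4

for r in (1, 2):  # r = 1: Manhattan (L1); r = 2: Euclidean (L2)
    for p in points:
        print([round(minkowski(p, q, r), 3) for q in points])
```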
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness); distances are never negative.
2. d(p, q) = d(q, p) for all p and q (symmetry).
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality).
where d(p, q) is the distance (dissimilarity) between data objects p and q.
• A distance that satisfies all these properties is a metric.
Cosine Similarity
If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||).
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
||d1|| = (3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
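The computation above can be verified with a few lines (a sketch; vectors copied from the example):

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # 5
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6)  ≈ 2.449
print(round(dot / (norm1 * norm2), 4))     # cos(d1, d2) ≈ 0.315
```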
Extended Jaccard (Tanimoto) coefficient
• A variation of the Jaccard coefficient (JC)
• Used for document data
• Reduces to the Jaccard coefficient for binary attributes
Correlation (Pearson's Correlation)
• If correlation between two variables x and y is -1,
they are negatively correlated.
• If one increases, the other decreases and vice versa.
• If correlation between two variables x and y is
+1, they are positively correlated.
• Either both increase or both decrease.
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
The attribute age is given below in ascending order:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
(b) What other methods are there for data smoothing?
Discuss the issues to consider during data integration.
What are the value ranges of the following normalization methods?
(a) min-max normalization
(b) z-score normalization
(c) normalization by decimal scaling
Use these methods to normalize the following group of data: 200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
Using the data for age given in the smoothing exercise above, answer the following:
(a) Use min-max normalization to transform the
value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value
35 for age, where the standard deviation of age is
12.94 years.
(c) Use normalization by decimal scaling to
transform the value 35 for age.
d) Comment on which method you would prefer to
use for the given data, giving reasons as to why.
Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) clustering
Briefly outline how to compute the dissimilarity
between objects described by the following:
(a) Nominal attributes
(b) Asymmetric binary attributes
(c) Numeric attributes
Given two objects represented by the tuples (22, 1, 42,
10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two
objects.
(b) Compute the Manhattan distance between the two
objects.
(c) Compute the Minkowski distance between the two
objects, using q = 3.
It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure; results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation. Suppose we have the following 2-D data set:
(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6), as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.