DM Chapter 3 Data Preprocessing
Data Mining
Data Preprocessing
(Ch # 3)
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
2
Why Data Preprocessing?
u Data in the real world is dirty
w incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
w noisy: containing errors or outliers
w inconsistent: containing discrepancies in codes or names
u Data quality is a major concern in Data Mining and Knowledge
Discovery tasks.
u Why: Almost all Data Mining algorithms induce knowledge
strictly from data.
u No quality data, no quality mining results!
w Quality decisions must be based on quality data
u No quality data, inefficient mining process!
w Complete, noise-free, and consistent data means faster
algorithms
w The quality of knowledge extracted highly depends on the quality
of data 3
u Measures for Data Quality: A multidimensional
view
w Accuracy: correct or wrong, accurate or not
w Completeness: not recorded, unavailable, …
w Consistency: some modified but some not,
dangling, …
w Timeliness: timely update?
w Believability: how much the data are trusted to be
correct?
w Interpretability: how easily the data can be
understood? 4
Major Tasks in Data Preprocessing
u Data cleaning
w Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
u Data integration
w Integration of multiple databases, data cubes, or files
u Data reduction
w Obtains reduced representation in volume but
produces the same or similar analytical results
u Data transformation
w Normalization and aggregation
u Data discretization
w Part of data reduction but with particular importance,
especially for numerical data
5
Forms of data preprocessing
6
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
7
Data Cleaning
u Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission
error
w incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., Occupation = “ ” (missing data)
w noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
w inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
w Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
8
Incomplete (Missing) Data
u Data is not always available
w E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
u Missing data may be due to
w equipment malfunction
w inconsistent with other recorded data and thus
deleted
w data not entered due to misunderstanding
w certain data may not be considered important at the
time of entry
w history or changes of the data were not registered
9
u Missing data may need to be inferred
Data Cleaning
10
Methods of Treating Missing Data
u Ignoring and discarding data:- There are two main ways to discard data
with missing values.
w Discard all records that have missing data (also called discard-case
analysis). Usually done when the class label is missing (assuming the task is
classification).
w Discard only those attributes that have a high level of missing data.
u Fill in the missing value manually: tedious + infeasible?
u Use a global constant to fill in the missing value: e.g., “unknown”, a new
class.
u Imputation using mean/median or mode:- One of the most frequently used
methods (statistical technique); a pandas sketch follows this slide.
w Use the attribute mean to fill in the missing value
w Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
w Replace missing values of numeric (continuous) attributes using the mean or
median (the median is robust against noise).
w Replace missing values of discrete attributes using the mode.
11
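A minimal pandas sketch of the mean/median/mode imputation just described. The DataFrame, its column names, and the class column are hypothetical, chosen only to illustrate the idea.

import pandas as pd

# Hypothetical data with missing values (NaN / None)
df = pd.DataFrame({
    "income": [45000.0, None, 52000.0, 61000.0, None],
    "city":   ["Lahore", "Karachi", None, "Lahore", "Karachi"],
    "class":  ["yes", "no", "yes", "no", "yes"],
})

# Numeric (continuous) attribute: fill with the median (robust against noise)
df["income"] = df["income"].fillna(df["income"].median())

# Discrete (nominal) attribute: fill with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Smarter variant: fill with the mean of samples belonging to the same class
# df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(df)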
Methods of Treating Missing Data
u Replace missing values using a prediction/classification
model:-
w Use the most probable value to fill in the missing value:
inference-based, such as a Bayesian formula or a decision tree
w Advantage:- it considers the relationships among the known
attribute values and the missing values, so the imputation
accuracy is high.
w Disadvantage:- if no correlation exists between some missing
attribute values and the known attribute values, the imputation
cannot be performed.
w (Alternative approach):- use a hybrid combination of a prediction/
classification model and mean/mode.
• First try to impute the missing value using the prediction/classification
model, and then fall back to the median/mode.
w We will study this topic further in Association Rule Mining.
12
Methods of Treating Missing Data
u K-Nearest Neighbor (k-NN) approach (best approach):-
w k-NN imputes the missing attribute value on the basis
of the K nearest neighbors. Neighbors are determined on the
basis of a distance measure.
w Once the K neighbors are determined, the missing value is
imputed by taking the mean/median or mode of the neighbors'
known values of the missing attribute (see the sketch after this slide).
u Smoothing by Regression
• Smooth by fitting the data to regression functions
17
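A short sketch of the k-NN imputation idea using scikit-learn's KNNImputer, one common implementation of this approach. The array values and the choice of K = 3 are illustrative assumptions.

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; np.nan marks missing attribute values
X = np.array([
    [25.0, 50000.0],
    [27.0, np.nan],
    [30.0, 61000.0],
    [np.nan, 58000.0],
    [45.0, 90000.0],
])

# Each missing value is replaced by the mean of that attribute
# over the K nearest neighbors (distance computed over known entries)
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X)
print(X_imputed)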
Smoothing by Binning Method
u Equal-width (distance) partitioning:
w It divides the range into k intervals of equal size (width): a uniform grid
w If A and B are the lowest and highest values of the attribute, the width of
the intervals will be W = (B − A)/k, where k is the number of bins.
w The most straightforward approach
w But outliers may dominate the presentation
w Skewed data is not handled well.
w Where does k come from?
u Equal-depth / equal-height (frequency) partitioning:
w It divides the range into M intervals, each containing approximately the
same number of samples
w Good data scaling
18
u Equal width is easier to implement, but equal depth (frequency) gives
better results.
19
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-width) bins with boundaries at A+W, A+2W, …, A+(k−1)W
(here A = 4, B = 34, k = 3, so W = 10 and the boundaries are 14 and 24):
- Bin 1: 4, 8, 9
- Bin 2: 15, 21, 21, 24
- Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
20
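A small Python sketch of equal-width binning and smoothing by bin means on the price data above. The three-bin choice mirrors the example; the code is a plain reimplementation for illustration, not any particular library's API.

import math

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
k = 3
A, B = min(prices), max(prices)
W = (B - A) / k                      # (34 - 4) / 3 = 10

# Equal-width bins with right-closed intervals: [4,14], (14,24], (24,34]
bins = [[] for _ in range(k)]
for v in prices:
    idx = 0 if v == A else min(k - 1, math.ceil((v - A) / W) - 1)
    bins[idx].append(v)

# Smoothing by bin means: replace every value with its bin's mean
smoothed = [[round(sum(b) / len(b)) for _ in b] for b in bins]
print(bins)      # [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]
print(smoothed)  # [[7, 7, 7], [20, 20, 20, 20], [28, 28, 28, 28, 28]]

# Equal-depth (frequency) bins could instead be obtained with pandas.qcut(prices, k).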
Regression Method for Smoothing the Data
u Regression is a technique that conforms data values to a function. Linear
regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
[Figure: scatter of (x, y) data points with the fitted line y = x + 1; the observed
value Y1 at X1 is replaced by the predicted value Y1′ on the line.]
21
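A brief numpy sketch of smoothing by linear regression in the spirit of the figure. The x/y values are made up for illustration, and numpy.polyfit is just one way to fit the line.

import numpy as np

# Hypothetical noisy observations of two attributes
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3, 6.9])

# Fit the "best" straight line y = a*x + b (least squares)
a, b = np.polyfit(x, y, deg=1)

# Smooth: replace each observed y with the value predicted by the line
y_smoothed = a * x + b
print(a, b)          # roughly a ≈ 1, b ≈ 1, i.e. y ≈ x + 1
print(y_smoothed)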
Detecting Outliers (Clustering)
u Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.
22
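A hedged sketch of the clustering idea above, here using density-based clustering (DBSCAN) so that values falling outside every dense group are flagged as outliers. The data, eps, and min_samples are illustrative assumptions, not part of the slide.

import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of values, plus one obvious outlier (95)
values = np.array([4, 5, 6, 7, 25, 26, 27, 28, 95], dtype=float).reshape(-1, 1)

# Points that do not fall inside any dense cluster get the label -1 ("noise")
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(values)
outliers = values.ravel()[labels == -1]
print(outliers)    # expected: [95.]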
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
23
Data Integration
u Data integration:
w combines data from multiple sources into a
coherent store
24
Data Integration (Problem 1)
u Attribute naming (in schema integration)
Problem: Entity identification problem: identify real-world entities from
multiple data sources. Attributes are named differently across different
data sources, e.g., A.cust-id vs. B.cust-# (integrate metadata from
different sources).
[Figure: source tables whose customer-key attribute is named CustomerID,
CustID, and ClientID are merged into a single CustomerID attribute by an
Extraction, Transformation, and Loading (ETL) tool.]
[Figure: value conflicts across sources: Gender coded as Male/Female in one
source and M/F in another; Weight recorded in kilograms (6, 10) in one source
and in pounds (6, 10) in another, integrated as kilograms (6, 10, 2.72, 4.54).]
27
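A hedged pandas sketch of the two integration problems in the figure: reconciling attribute names (CustomerID vs. CustID vs. ClientID) and reconciling value conventions (M/F vs. Male/Female, pounds vs. kilograms). The table contents are hypothetical; a real ETL tool would do this at scale.

import pandas as pd

# Two hypothetical sources with different schemas and conventions
src_a = pd.DataFrame({"CustID": [1, 2], "Gender": ["M", "F"], "Weight_lb": [6.0, 10.0]})
src_b = pd.DataFrame({"ClientID": [3, 4], "Gender": ["Male", "Female"], "Weight_kg": [6.0, 10.0]})

# Entity identification: map differing attribute names onto one schema
a = src_a.rename(columns={"CustID": "CustomerID"})
b = src_b.rename(columns={"ClientID": "CustomerID"})

# Resolve value conflicts: one gender coding, one unit (kilograms)
a["Gender"] = a["Gender"].map({"M": "Male", "F": "Female"})
a["Weight_kg"] = a["Weight_lb"] * 0.4536      # pounds -> kilograms (6 lb ≈ 2.72 kg)
a = a.drop(columns=["Weight_lb"])

integrated = pd.concat([a, b], ignore_index=True)
print(integrated)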
Handling Redundant Data in Data Integration
u Redundant data occur often when integrating
multiple databases
w The same attribute may have different names in
different databases
w One attribute may be a “derived” attribute in another
table, e.g., annual revenue
u Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
u Redundancy can be checked using correlation analysis.
28
Correlation Analysis for Detecting Redundancy
u For numeric attributes we can use correlation and covariance
u Correlation between two numeric attributes A and B can be checked with the
correlation coefficient:

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}
        = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}

1. If the resulting value > 0, then A and B are positively correlated: if A
increases, B will also increase. If the value of r is close to 1, either A or B
can be removed as redundant.
2. If the resulting value = 0, then A and B are independent (no correlation).
3. If the resulting value < 0, then A and B are negatively correlated: if the
value of A increases, the value of B will decrease.
u Covariance between two numeric attributes A and B is

Cov(A,B) = \sigma_{AB} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}
         = \frac{\sum_{i=1}^{n} a_i b_i}{n} - \bar{A}\,\bar{B}

u Under some assumptions, a covariance of 0 implies independence
29
Example: Correlation and Covariance
u Example: for the following data, find the correlation and covariance values
30
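Since the example's data table did not survive extraction, here is a hedged numpy sketch with made-up values for two attributes A and B, showing how the r_{A,B} and Cov(A,B) formulas above can be computed.

import numpy as np

# Hypothetical paired observations of attributes A and B
A = np.array([2.0, 3.0, 5.0, 6.0, 9.0])
B = np.array([30.0, 36.0, 54.0, 63.0, 90.0])
n = len(A)

# Covariance (dividing by n, as in the slide)
cov_AB = np.sum((A - A.mean()) * (B - B.mean())) / n      # = mean(A*B) - mean(A)*mean(B)

# Correlation coefficient r_{A,B} = Cov(A,B) / (sigma_A * sigma_B)
r_AB = cov_AB / (A.std() * B.std())                        # np.std divides by n by default
print(cov_AB, r_AB)

# Cross-check with numpy's built-in Pearson correlation
print(np.corrcoef(A, B)[0, 1])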
χ² Correlation Test for Nominal Data
u For nominal data, correlation between two attributes can be checked with the
chi-square (χ²) statistic over their r × c contingency table of observed counts
o_{ij} and expected counts e_{ij}:

\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
32
u Example: For the following data, find whether the two attributes are
independent or not. Suppose that we have 1500 data points.
35
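The contingency table for this example is not shown in the extracted text, so the sketch below uses a hypothetical 2×2 table of 1500 counts (the numbers are invented) just to show how the χ² statistic above can be computed, here with scipy.stats.chi2_contingency.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts o_ij
# (rows: values of attribute 1, columns: values of attribute 2), 1500 points total
observed = np.array([
    [300, 150],
    [200, 850],
])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
print(expected)   # e_ij = (row total * column total) / grand total

# If p_value is below the chosen significance level (e.g. 0.05),
# reject independence: the two attributes are correlated.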
Data Reduction
wReduced representation of the data set
that is much smaller in volume, yet
closely maintains the integrity of the
original data. Different Strategies are:
• Dimensionality Reduction
• Numerosity reduction
• Data Compression
36
Dimensionality Reduction (DR)
wProcess of reducing the number of
random attributes under consideration.
wTwo very common methods are Wavelet
Transforms and Principal Components
Analysis
wThey transform or project the data onto a
smaller space
37
Numerosity Reduction
wReplace the original data volume by
alternative smaller forms of data
representation.
wRegression and log-linear models
(parametric methods)
wHistograms, clustering, sampling and
data cube aggregation (nonparametric
methods)
38
Data Compression
w Transformations are applied to obtain a
reduced or compressed representation
w Lossless: The original data can be
reconstructed from the compressed data
without any information loss.
w Lossy: Only an approximation of the original
data can be reconstructed.
39
DR: Principal Components Analysis (PCA)
u Why PCA?
u PCA is a useful statistical technique that has
found applications in:
w Face recognition
w Image Compression
w Reducing dimension of data
40
PCA Goal:
Removing Dimensional Redundancy
u The major goal of PCA in Data Mining is to remove
the “dimensional redundancy” from data.
u What does that mean?
w A typical dataset contains several dimensions
(variables) that may or may not correlate.
w Dimensions that correlate vary together.
w The information represented by a set of dimensions
with high correlation can be extracted by studying just
one dimension that represents the whole set.
w Hence the goal is to reduce the dimensions of a dataset
to a smaller set of representative dimensions that do
not correlate. 41
PCA Goal:
Removing Dimensional Redundancy
[Figure: a dataset with 12 dimensions, Dim 1 through Dim 12.]
Analyzing 12-dimensional data is challenging!
PCA Goal:
Removing Dimensional Redundancy
[Figure: the same 12 dimensions, Dim 1 through Dim 12.]
But some dimensions represent redundant information. Can we “reduce” these?
PCA Goal:
Removing Dimensional Redundancy
[Figure: the 12 dimensions, Dim 1 through Dim 12, feeding into a “PCA black box”.]
Let’s assume we have a “PCA black box” that can reduce the correlating
dimensions. Pass the 12-dimensional data set through the black box to get a
three-dimensional data set.
PCA Goal:
Removing Dimensional Redundancy
Given an appropriate reduction, analyzing the reduced dataset is much more
efficient than analyzing the original “redundant” data.
[Figure: the 12 dimensions (Dim 1–Dim 12) pass through the PCA black box and
come out as three dimensions (Dim A, Dim B, Dim C).]
Pass the 12-dimensional data set through the black box to get a
three-dimensional data set.
PCA Goal: Change of Basis
u Assume X is the 6-dimensional data set given as
input
Dimensions (columns) of X:

X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & x_{16} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & x_{26} \\
\vdots &        &        &        &        & \vdots \\
x_{51} & x_{52} & x_{53} & x_{54} & x_{55} & x_{56}
\end{bmatrix}

u PCA looks for a change of basis: an m × m matrix P that maps each data
vector x to new coordinates y:

\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1m} \\
p_{21} & p_{22} & \cdots & p_{2m} \\
\vdots &        &        & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mm}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}

u The covariance of two dimensions X and Y over n samples is

cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}
Covariance Interpretation
u We have data set for students study hour (H) and
marks achieved (M)
u We find cov(H,M)
u Exact value of covariance is not as important as the
sign (i.e. positive or negative)
u +ve , both dimensions increase together
u -ve , as one dimension increases other decreases
u Zero, there exists no relationship
Covariance Matrix
u Covariance is always measured
between 2 – dim.
u What if we have a data set with more
than 2-dim?
u We have to calculate more than one
covariance measurement.
u E.g., from a 3-dim data set (dimensions
x, y, z) we could calculate cov(x,y),
cov(x,z), and cov(y,z)
Covariance Matrix
u Can use covariance matrix to find
covariance of all the possible pairs
u Since cov(a,b)=cov(b,a)
The matrix is symmetrical about the
main diagonal
u Step 2:
w Mean normalization and feature scaling
w Subtract the mean from each data point
Step 1 & Step 2
w The covariance matrix of the example data:

cov = \begin{bmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{bmatrix}
Step 4: Calculate the eigenvectors and eigenvalues of the
covariance matrix
u |C − λI| = 0 is used to find the eigenvalues λ of the covariance matrix C
u Then (C − λI)x = 0 (i.e., Bx = 0 with B = C − λI) is solved for the eigenvectors
u Since the covariance matrix is square, we can calculate the
eigenvectors and eigenvalues of the matrix

eigenvalues = \begin{bmatrix} 0.04908 \\ 1.28403 \end{bmatrix}

eigenvectors = \begin{bmatrix} -0.735 & -0.678 \\ 0.678 & -0.735 \end{bmatrix}

(each column is a unit eigenvector; the first corresponds to λ = 0.04908, the
second to λ = 1.28403)
What does this all mean?
[Figure: the data points plotted together with the two eigenvectors of the
covariance matrix overlaid.]
Conclusion
u Eigenvector give us information about
the pattern.
u By looking at graph in previous slide.
See how one of the eigenvectors go
through the middle of the points.
u Second eigenvector tells about another
weak pattern in data.
u So by finding eigenvectors of
covariance matrix we are able to extract
lines that characterize the data.
Step 5: Choosing components and forming a
feature vector.
u The eigenvector with the highest eigenvalue is the
principal component of the data set.
u In our example, the eigenvector with the
largest eigenvalue was the one that
pointed down the middle of the data.
u So, once the eigenvectors are found, the
next step is to order them by
eigenvalue, highest to lowest.
u This gives the components in order of
significance.
Cont’d
u We have n – dimension
u So we will find n eigenvectors
u But if we choose only the first p eigenvectors,
then the final dataset has only p
dimensions.
Cont’d
u To obtain the final dataset, we multiply
the transpose of the chosen eigenvector
(feature-vector) matrix with the transpose
of the original data matrix.
u Final dataset will have data items in
columns and dimensions along rows.
u So we have original data set
represented in terms of the vectors we
chose.
Original data set represented using two
eigenvectors.
Original data set represented using one
eigenvector.
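A compact numpy sketch of the PCA steps walked through above (mean-adjust, covariance matrix, eigen-decomposition, keep the top p eigenvectors, project). The 2-D data here is made up, so the numbers will differ from the slides' covariance matrix and eigenvalues; the projection is written with data points as rows, which is equivalent to the transposed product described in the slides.

import numpy as np

# Hypothetical 2-D data set: rows are data points, columns are dimensions
X = np.array([[2.5, 3.0], [0.5, 1.1], [2.2, 2.7], [1.9, 2.0],
              [3.1, 3.6], [2.3, 2.6], [2.0, 2.4], [1.0, 1.2]])

# Steps 1-2: mean normalization (subtract the mean of each dimension)
X_adj = X - X.mean(axis=0)

# Step 3: covariance matrix of the mean-adjusted data
C = np.cov(X_adj, rowvar=False)          # divides by n-1, as in the slides

# Step 4: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric

# Step 5: order eigenvectors by eigenvalue (highest first) and keep the top p
order = np.argsort(eigvals)[::-1]
p = 1
feature_vector = eigvecs[:, order[:p]]   # columns = chosen principal components

# Final data: project the mean-adjusted data onto the chosen eigenvectors
final_data = X_adj @ feature_vector
print(eigvals[order])
print(final_data)                        # 8 points, now with only p = 1 dimension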
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
67
Data Transformation
u The data are transformed or consolidated into forms
appropriate for mining. Different strategies
include:
w Smoothing – binning, regression, and clustering
w Attribute construction – new attributes are constructed from
the given set of attributes.
w Aggregation – Summary or aggregation operations are applied
e.g. construction of data cube
w Normalization – data are scaled so as to fall within a smaller
range, e.g., -1.0 to 1.0 or 0.0 to 1.0
w Discretization – values of a numeric attribute are replaced by
interval labels or conceptual labels (a concept hierarchy for the
numeric attribute)
w Concept hierarchy generation for nominal data – nominal
attribute values are generalized to higher-level concepts, e.g.,
street is generalized to block, city, or country.
68
Data Normalization
u A database can contain n numbers of continuous type
attributes.
u A continuous attribute with a larger range, or noise, can
dominate the distance between objects.
w (Remember: All continuous type attributes similarity is checked
using a single Euclidean distance formula).
u For example: ‘Income’ attribute can dominate the distance
as compared to ‘Weight’ and ‘Age’ attributes.
u The objective of normalization is to convert all numeric
attributes so that their values fall within a small specified
range, such as 0.0 to 1.0.
u Normalization is particularly useful for clustering and
distance measure algorithms such as k-nearest-neighbor.
Data Normalization
u Min-max normalization:

v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

u Z-score normalization:

v' = \frac{v - mean_A}{stand\_dev_A}

u Decimal scaling normalization:

v' = \frac{v}{10^{j}}, where j is the smallest integer such that \max(|v'|) < 1
Building mineable data sets
Data Transformation: Normalization
u Min-max normalization:

v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

u Z-score normalization:

v' = \frac{v - mean_A}{stand\_dev_A}
73
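A small numpy sketch of the normalization formulas above applied to a hypothetical attribute; the values and the [0.0, 1.0] target range are illustrative.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization into [new_min, new_max] = [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10 ** j

print(v_minmax)
print(v_zscore)
print(v_decimal)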
Concept Hierarchy Generation
u Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
u Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
u Concept hierarchy formation: recursively reduce the data by collecting
and replacing low-level concepts (such as numeric values for age) by
higher-level concepts (such as youth, adult, or senior)
u Concept hierarchies can be explicitly specified by domain experts and/
or data warehouse designers
u Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
74
Concept Hierarchy Generation
for Nominal Data
u Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
w street < city < state < country
u Specification of a hierarchy for a set of values by explicit
data grouping
w {Urbana, Champaign, Chicago} < Illinois
u Specification of only a partial set of attributes
w E.g., only street < city, not others
u Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
w E.g., for a set of attributes: {street, city, state, country }
75
Automatic Concept Hierarchy Generation
u Some hierarchies can be automatically generated
based on the analysis of the number of distinct values
per attribute in the data set
w The attribute with the most distinct values is placed
at the lowest level of the hierarchy
w Exceptions, e.g., weekday, month, quarter, year
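A tiny pandas sketch of the automatic-generation heuristic just described: count distinct values per attribute and order the attributes from most distinct (lowest hierarchy level) to fewest (highest level). The table and its values are hypothetical.

import pandas as pd

# Hypothetical location data
df = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA"],
    "state":   ["Ontario", "Ontario", "Texas", "Texas", "Ohio"],
    "city":    ["Toronto", "Montreal", "Austin", "Austin", "Columbus"],
    "street":  ["1 Main St", "2 Bay St", "3 Elm St", "4 Oak St", "5 Pine St"],
})

# The attribute with the most distinct values goes to the lowest level of the hierarchy
distinct_counts = df.nunique().sort_values(ascending=False)
hierarchy = list(distinct_counts.index)     # lowest level first
print(distinct_counts)
print(" < ".join(hierarchy))                # street < city < state < country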