Data Preprocessing - NG
Prof. Navneet Goyal Department of Computer Science & Information Systems BITS, Pilani
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Distributive measure: sum, count, max, min
Algebraic measure: an algebraic function applied to one or more distributive measures
Examples: average, weighted average
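A minimal sketch (plain Python, hypothetical partition values) of why this distinction matters: distributive measures can be computed per partition and merged, and an algebraic measure such as average is then derived from two distributive ones.

```python
# Distributive measures (sum, count, max, min) are computed on each data
# partition and merged; an algebraic measure such as average is a function
# of a bounded number of distributive measures (here, sum and count).
partitions = [[3, 7, 5], [10, 2], [8, 6, 4]]   # hypothetical partitioned data

part_sums   = [sum(p) for p in partitions]
part_counts = [len(p) for p in partitions]
part_maxes  = [max(p) for p in partitions]

total_sum   = sum(part_sums)           # distributive: merge by summing
total_count = sum(part_counts)         # distributive
overall_max = max(part_maxes)          # distributive
average     = total_sum / total_count  # algebraic: sum / count
print(average, overall_max)
```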
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Figure taken from Han & Kamber, Data Mining: Concepts & Techniques, 2e
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Use the most probable value to fill in the missing value: e.g., a value inferred with regression, a Bayesian formalism, or decision-tree induction
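A hedged pandas sketch of two of these strategies (the column names cust_class and income are made up): filling with the class-conditional mean, and filling with the most frequent value as a simple stand-in for the "most probable" value.

```python
import pandas as pd

# Hypothetical sales data with a missing customer income value.
df = pd.DataFrame({
    "cust_class": ["A", "A", "A", "B", "B"],
    "income":     [40_000, None, 44_000, 90_000, 86_000],
})

# Strategy: attribute mean of all samples in the same class (smarter than
# the global mean).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cust_class")["income"].transform("mean")
)

# Strategy: most probable value; approximated here by the mode, though a
# regression or Bayesian model could be used instead.
df["income_mode"] = df["income"].fillna(df["income"].mode().iloc[0])
print(df)
```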
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistencies in naming conventions
Smoothing Techniques
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, by bin medians, by bin boundaries, etc.
Clustering
detect and remove outliers
Regression
smooth by fitting the data into regression functions
Binning
Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it
Sorted values are distributed into a number of buckets or bins
Binning does local smoothing
The different binning methods are illustrated by the example below
Binning is also used as a data discretization technique
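A short sketch of the example (the price values are the ones commonly used in Han & Kamber; the helper name to_boundary is just illustrative), showing equi-depth bins smoothed by bin means and by bin boundaries.

```python
# Equi-depth binning of sorted prices, smoothing by bin means and by
# bin boundaries.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3                                    # equi-depth: 3 values per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

def to_boundary(b):
    lo, hi = b[0], b[-1]                     # bin boundaries
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [to_boundary(b) for b in bins]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```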
Regression
[Figure: noisy data points smoothed by fitting the regression line y = x + 1]
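A minimal numpy sketch of regression-based smoothing under the same assumption as the figure, i.e. that a simple linear model fits the data: fit a least-squares line and replace each noisy y value by its fitted value (the data points are synthetic).

```python
import numpy as np

# Synthetic noisy observations roughly following y = x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.8, 3.3, 3.9, 5.2, 5.8])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
y_smoothed = slope * x + intercept          # replace noisy values by fitted ones

print(round(slope, 2), round(intercept, 2))
print(y_smoothed)
```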
Cluster Analysis
Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Redundant data may be detected by correlation analysis (Pearson's correlation coefficient)
Correlation does not imply causality
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality
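A hedged sketch of correlation-based redundancy detection (the attribute names and the 0.9 threshold are illustrative): compute Pearson's r between two numeric attributes coming from different sources; a value near +1 or -1 suggests one of them may be redundant.

```python
import numpy as np

# Two numeric attributes that may be redundant after integration
# (hypothetical values).
annual_income  = np.array([40.0, 55.0, 60.0, 72.0, 90.0])
monthly_income = np.array([3.3, 4.6, 5.0, 6.0, 7.5])

r = np.corrcoef(annual_income, monthly_income)[0, 1]  # Pearson's r
print(round(r, 3))
if abs(r) > 0.9:   # illustrative threshold, not a fixed rule
    print("Highly correlated; one attribute may be redundant.")
```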
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range (see the sketch below)
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
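A sketch of the three normalization methods listed above, assuming an illustrative numeric attribute (the values and the [0, 1] target range are made up).

```python
import numpy as np

v = np.array([-986.0, -300.0, 12.0, 310.0, 917.0])  # illustrative attribute values

# Min-max normalization to the new range [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: divide by 10**j, where j is the smallest
# integer such that every scaled value has absolute value below 1
# (here j = 3, so -986 maps to -0.986).
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / 10 ** j

print(minmax, zscore, decimal_scaled, sep="\n")
```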
Data Cube Aggregation
Cubes are created at various levels of abstraction, depending upon the analysis task; each level corresponds to a cuboid
A cube is a lattice of cuboids
Data volume reduces as we move up from the base cuboid to the apex cuboid
While doing data mining, the smallest available cuboid relevant to the given task should be used
Cube aggregation gives smaller data without loss of information necessary for the analysis task
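A small pandas sketch of this idea, using hypothetical quarterly sales: rolling up from the quarterly (base) level to the annual level yields a much smaller table that still answers annual-level queries.

```python
import pandas as pd

# Hypothetical quarterly sales: the more detailed, base-level data.
quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 300, 412, 390, 620],
})

# Roll up to the annual level: fewer rows, yet sufficient for an analysis
# task that only needs yearly totals.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```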
[Figure: decision tree with leaf nodes labeled Class 1 and Class 2]
Wavelet Transforms
Discrete wavelet transform (DWT): a linear signal processing technique
Example wavelet families: Haar-2, Daubechies-4
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
Method (see the sketch below):
The length, L, must be an integer power of 2 (pad with 0s when necessary)
Each transform has two functions: smoothing and difference
They are applied to pairs of data values, producing two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
Figure taken from Han & Kamber, Data Mining: Concepts & Techniques, 2e
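A minimal sketch of that pairwise smoothing/difference recursion, using the un-normalized averaging form of the Haar transform (a real DWT implementation, e.g. from a dedicated wavelet library, would handle normalization and padding more carefully).

```python
# Pairwise smoothing (average) and difference steps of a Haar-style
# transform, applied recursively to the smoothed half.
def haar_transform(data):
    assert len(data) & (len(data) - 1) == 0, "length must be a power of 2 (pad with 0s)"
    coeffs = []
    while len(data) > 1:
        smooth = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        detail = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        coeffs = detail + coeffs   # keep the difference (detail) coefficients
        data = smooth              # recurse on the smoothed half (length L/2)
    return data + coeffs           # overall average followed by detail coefficients

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
# Dropping the smallest detail coefficients (setting them to 0) keeps only
# the strongest coefficients, giving a compressed approximation.
```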
Principal Component Analysis (PCA)
Each data vector is a linear combination of the c principal component vectors
Works for numeric data only
Used when the number of dimensions is large
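A hedged numpy sketch of the PCA idea on synthetic numeric data (in practice a library implementation such as scikit-learn's PCA would usually be used): center the data, take the eigenvectors of the covariance matrix, and keep the c strongest components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # hypothetical numeric data (tuples x attributes)
c = 2                                   # number of principal components to keep

Xc = X - X.mean(axis=0)                 # center each attribute
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)  # eigen-decomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]       # components sorted by decreasing variance
components = eigvecs[:, order[:c]]      # the c principal component vectors

X_reduced = Xc @ components             # each tuple expressed by c coordinates
X_approx = X_reduced @ components.T + X.mean(axis=0)  # approximate reconstruction
print(X_reduced.shape, X_approx.shape)  # (100, 2) (100, 5)
```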
Numerosity Reduction
Can we reduce the data volume by choosing alternative, smaller forms of data representation?
Techniques:
Parametric methods
Non-parametric methods
Numerosity Reduction
Parametric methods
Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Example: log-linear models
Non-parametric methods
Do not assume a model; store reduced representations of the data
Major families: histograms, clustering, sampling
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: p(a, b, c, d) ≈ α_ab · β_ac · χ_ad · δ_bcd, where each factor is a lower-order table over a subset of the attributes
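As a minimal illustration of approximating a joint table by lower-order tables, the sketch below uses the simplest log-linear model, the independence model p(a, b) ≈ p(a)·p(b); fitting models with higher-order factors like those in the formula above would typically use iterative proportional fitting, which is not shown here.

```python
import numpy as np

# Observed 2-way joint probability table for attributes A (rows) and B (columns).
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_a = joint.sum(axis=1)   # lower-order table: marginal distribution of A
p_b = joint.sum(axis=0)   # lower-order table: marginal distribution of B

# Independence (simplest log-linear) model: store only the marginals and
# reconstruct the joint table as their outer product.
approx = np.outer(p_a, p_b)
print(approx)             # [[0.12 0.18] [0.28 0.42]] vs. the true joint table
```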
Histograms
A popular data reduction technique
Divide the data into buckets and store the average (or sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems
[Figure: histogram of values bucketed over the range 10,000 to 90,000]
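A short numpy sketch of histogram-based reduction with equal-width buckets (the price values are made up; only the per-bucket counts and averages would be stored).

```python
import numpy as np

# Illustrative price values; only per-bucket counts/sums are kept.
prices = np.array([12000, 18000, 25000, 31000, 33000, 47000,
                   52000, 58000, 64000, 71000, 83000, 88000])

edges = np.arange(10000, 90001, 20000)         # equal-width buckets of 20,000
counts, _ = np.histogram(prices, bins=edges)
bucket_ids = np.digitize(prices, edges) - 1    # bucket index for each value
sums = np.bincount(bucket_ids, weights=prices, minlength=len(edges) - 1)

for lo, hi, n, s in zip(edges[:-1], edges[1:], counts, sums):
    print(f"[{lo}, {hi}): count={n}, avg={s / n:.0f}")
```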
Clustering
Partition the data set into clusters, and store only the cluster representations
Can be very effective if the data is clustered, but not if the data is "smeared"
Hierarchical clustering is possible, and the clusters can be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms (detailed further in Chapter 8 of Han & Kamber)
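A hedged sketch of clustering-based reduction; the choice of scikit-learn's KMeans and the synthetic 2-D data are arbitrary illustrations. Only a centroid and a count per cluster are stored instead of all the points.

```python
import numpy as np
from sklearn.cluster import KMeans   # arbitrary choice of clustering algorithm

rng = np.random.default_rng(1)
# Synthetic 2-D data that actually forms clusters (not "smeared").
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Reduced representation: one centroid and one count per cluster,
# instead of the 600 original points.
centroids = km.cluster_centers_
sizes = np.bincount(km.labels_)
print(centroids)
print(sizes)
```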
Sampling
Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data
Note: sampling may not reduce database I/Os (data is read a page at a time)
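A hedged pandas sketch contrasting simple random sampling with stratified sampling on skewed class data (the column names are made up; GroupBy.sample requires pandas 1.1 or later).

```python
import pandas as pd

# Hypothetical data with a skewed class distribution (90% A, 10% B).
df = pd.DataFrame({
    "cust_class": ["A"] * 90 + ["B"] * 10,
    "income":     list(range(100)),
})

# Simple random sample without replacement: may badly under-represent
# the rare class B.
srs = df.sample(n=20, random_state=0)

# Stratified sample: draw the same fraction from each class, preserving
# the class proportions of the full data.
stratified = df.groupby("cust_class", group_keys=False).sample(frac=0.2, random_state=0)

print(srs["cust_class"].value_counts())
print(stratified["cust_class"].value_counts())
```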
Sampling
[Figure: a sample drawn from the raw data]
[Figure: a cluster/stratified sample drawn from the raw data]