Module 2: Data Preprocessing
— Chapter 3 —
Chapter 3: Data Preprocessing
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
■ technology limitation
■ incomplete data
■ inconsistent data
How to Handle Noisy Data?
■ Binning
■ first sort data and partition into (equal-frequency) bins
■ Clustering
■ detect and remove outliers
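The binning bullet above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the slide names only the sort-and-partition step, so smoothing each bin by its mean is an assumed follow-up, and the function names and the sample `prices` list are hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean (assumed smoothing rule)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


# Hypothetical sample data, given unsorted on purpose.
prices = [21, 4, 28, 15, 8, 34, 9, 21, 25, 24, 26, 29]
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
```

Each bin holds the same number of values, so a dense region of the data gets narrow bins and a sparse region gets wide ones.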
Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
■ Check field overloading
■ Check uniqueness rule, consecutive rule and null rule
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code,
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are
different
■ Possible reasons: different representations, different scales, e.g., metric
vs. British units
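The integration steps above can be sketched as a small Python example. It assumes two hypothetical sources whose key fields mirror A.cust-id and B.cust-#, and whose weights are stored in kilograms and pounds respectively. The record layout, the `weight_kg` and `weight_lb` fields, and the tolerance value are all illustrative assumptions.

```python
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

# Hypothetical records from two sources.
source_a = [{"cust-id": 1, "name": "Bill Clinton", "weight_kg": 80.0}]
source_b = [{"cust-#": 1, "name": "William Clinton", "weight_lb": 176.4}]


def normalize_b(record):
    """Map a source-B record onto source A's schema and units."""
    return {
        "cust-id": record["cust-#"],                  # schema integration
        "name": record["name"],
        "weight_kg": record["weight_lb"] * LB_TO_KG,  # scale conversion
    }


def find_conflicts(a_records, b_records, tolerance=0.5):
    """Return (cust-id, field, a_value, b_value) for values that still disagree."""
    b_by_id = {r["cust-id"]: r for r in map(normalize_b, b_records)}
    conflicts = []
    for a in a_records:
        b = b_by_id.get(a["cust-id"])
        if b is None:
            continue  # entity appears in source A only
        if a["name"] != b["name"]:
            conflicts.append((a["cust-id"], "name", a["name"], b["name"]))
        if abs(a["weight_kg"] - b["weight_kg"]) > tolerance:
            conflicts.append(
                (a["cust-id"], "weight_kg", a["weight_kg"], b["weight_kg"])
            )
    return conflicts


conflicts = find_conflicts(source_a, source_b)
```

After unit conversion the two weights agree within the tolerance, so only the name is reported. Deciding whether "Bill Clinton" and "William Clinton" denote the same entity is the entity identification problem, and this sketch leaves it to a later step.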
Handling Redundancy in Data Integration
Clustering
■ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
■ Can be very effective if data is clustered but not if data is
“smeared”
■ Clusters can be hierarchical, and their representations can be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms
■ Cluster analysis will be studied in depth in Chapter 10
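As a rough sketch of storing only the cluster representation, the Python below reduces each cluster to a centroid and a diameter. It assumes the cluster assignment is already known and takes the diameter to be the largest pairwise Euclidean distance, which is one of several possible definitions. The function name `summarize_cluster` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of equal-length tuples."""
    n = len(points)
    centroid = tuple(sum(coords) / n for coords in zip(*points))
    # Diameter here is the largest pairwise distance (an assumed definition).
    diameter = max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    return centroid, diameter


# Hypothetical, already-clustered 2-D points.
clusters = {
    "a": [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
    "b": [(10.0, 10.0), (13.0, 14.0)],
}
summaries = {label: summarize_cluster(pts) for label, pts in clusters.items()}
```

Six points shrink to two (centroid, diameter) pairs. If the points were evenly "smeared" instead, the diameters would be large and the summaries would say little about where the data lies.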
Sampling
Types of Sampling
■ Stratified sampling:
■ Partition the data set, and draw samples from each partition
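Stratified sampling as described above might look like the following Python sketch. It assumes proportional allocation (roughly the same fraction from every partition) with at least one record per partition. That allocation rule, the function name `stratified_sample`, and the `customers` data are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw about `fraction` of each stratum, with at least one record per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # partition the data set
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # draw within the partition
    return sample


# Hypothetical skewed data: 6 "young" customers and 94 "senior" customers.
customers = [("young", i) for i in range(6)] + [("senior", i) for i in range(94)]
sample = stratified_sample(customers, key=lambda r: r[0], fraction=0.1)
```

Because every partition contributes, the small "young" group is guaranteed to appear in the sample, which a simple random sample of the same size does not guarantee.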
Sampling: With or without Replacement
[Figure: Raw Data sampled as SRSWOR (simple random sample without replacement) and as SRSWR]
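The difference between the two schemes can be shown with Python's standard `random` module. This is a minimal sketch in which `raw_data`, the sample size, and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
raw_data = list(range(1, 11))

# SRSWOR: each tuple can be drawn at most once.
srswor = rng.sample(raw_data, k=5)

# SRSWR: a drawn tuple is put back, so it may be drawn again.
srswr = rng.choices(raw_data, k=5)

# With replacement, the sample may even be larger than the data set.
oversized = rng.choices(raw_data, k=20)
```

Asking `rng.sample` for more items than `raw_data` contains raises a `ValueError`, which reflects the without-replacement constraint.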
Sampling: Cluster or Stratified Sampling
Data Cube Aggregation
Data Reduction 3: Data Compression
■ String compression
■ There are extensive theories and well-tuned algorithms
without expansion
■ Audio/video compression
■ Typically lossy compression, with progressive refinement
[Figure: Original Data and its Approximated version under lossy compression]
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year
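The distinct-value heuristic above can be sketched in a few lines of Python. The location attributes and rows are hypothetical sample data, and `generate_hierarchy` is an illustrative name rather than an established API.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from top (fewest distinct values) to bottom (most)."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])


# Hypothetical location data.
rows = [
    {"street": "1 Main St", "city": "Urbana", "state": "IL", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "state": "IL", "country": "USA"},
    {"street": "3 Elm Rd", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "4 Pine Ln", "city": "Madison", "state": "WI", "country": "USA"},
    {"street": "5 Bay St", "city": "Toronto", "state": "ON", "country": "Canada"},
    {"street": "6 King St", "city": "Ottawa", "state": "ON", "country": "Canada"},
]
hierarchy = generate_hierarchy(rows, ["street", "city", "state", "country"])
```

Here `street` has the most distinct values and lands at the lowest level. The same heuristic would misplace the exceptions noted above: weekday has only 7 distinct values and month has 12, so weekday would be ranked above month.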
Summary
■ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
■ Data cleaning: e.g., missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem; Remove redundancies; Detect
inconsistencies
■ Data reduction
■ Dimensionality reduction; Numerosity reduction; Data
compression
■ Data transformation and data discretization
■ Normalization; Concept hierarchy generation