Lecture 7 - Data Preprocessing - Cleaning-M
Data Mining
Lecture # 7
Data Preprocessing
(Ch # 3)
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data reduction
Data transformation and discretization
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
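The three kinds of dirty data above can be illustrated with a small audit routine. This is only a sketch: the records, the field names (`id`, `age`, `gender`), the plausible age range, and the gender coding scheme are all hypothetical, chosen just to show one instance of each problem.

```python
# Hypothetical records; each problem row is annotated.
records = [
    {"id": 1, "age": 34, "gender": "M"},
    {"id": 2, "age": None, "gender": "F"},    # incomplete: missing age
    {"id": 3, "age": -5, "gender": "F"},      # noisy: impossible age
    {"id": 4, "age": 51, "gender": "male"},   # inconsistent: different code for "M"
]

VALID_GENDER_CODES = {"M", "F"}   # assumed coding scheme
AGE_RANGE = (0, 120)              # assumed plausible range


def audit(rows):
    """Return the ids of rows that are incomplete, noisy, or inconsistent."""
    report = {"incomplete": [], "noisy": [], "inconsistent": []}
    for row in rows:
        # Incomplete: any attribute value is missing.
        if any(value is None for value in row.values()):
            report["incomplete"].append(row["id"])
        # Noisy: a present value lies outside the plausible range.
        age = row["age"]
        if age is not None and not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            report["noisy"].append(row["id"])
        # Inconsistent: a code that does not follow the agreed scheme.
        if row["gender"] is not None and row["gender"] not in VALID_GENDER_CODES:
            report["inconsistent"].append(row["id"])
    return report


print(audit(records))
# {'incomplete': [2], 'noisy': [3], 'inconsistent': [4]}
```

In practice the range and the valid codes would come from domain knowledge or metadata about each attribute rather than from constants in the code.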
Data quality is a major concern in Data Mining and
Knowledge Discovery tasks.
Why: Almost all data mining algorithms induce
knowledge strictly from data.
No quality data, no quality mining results!
Quality decisions must be based on quality data
No quality data, inefficient mining process!
Complete, noise-free, and consistent data means faster
algorithms
The quality of the extracted knowledge depends heavily on
the quality of the data
Effect of Noisy Data on Results Accuracy
[Figure omitted; surviving label: X1]
Detecting Outliers (Clustering)
Outliers may be detected by clustering, where
similar values are organized into groups or
“clusters”. Values that fall outside every
cluster may be considered outliers.
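A minimal sketch of clustering-based outlier detection on one numeric attribute. It assumes a very simple clustering rule that the lecture does not prescribe: sorted values are grouped together while the gap between neighbours stays within `max_gap`, and any value left in a cluster smaller than `min_size` is flagged. The function names, both parameters, and the sample ages are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group values into clusters; a new cluster starts whenever the gap
    to the previous sorted value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # too far: start a new cluster
    return clusters


def find_outliers(values, max_gap, min_size=2):
    """Return values that fall in clusters smaller than min_size."""
    outliers = []
    for cluster in cluster_1d(values, max_gap):
        if len(cluster) < min_size:
            outliers.extend(cluster)
    return outliers


ages = [21, 22, 23, 25, 40, 41, 43, 44, 97]
print(cluster_1d(ages, max_gap=3))
# [[21, 22, 23, 25], [40, 41, 43, 44], [97]]
print(find_outliers(ages, max_gap=3))
# [97]
```

For multi-dimensional data the same idea is usually applied with a general-purpose clustering algorithm, treating points that belong to no cluster, or only to very small ones, as outlier candidates.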