Why Data Preprocessing
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistencies in naming conventions
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
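The three cleaning problems listed above can be illustrated with a minimal sketch. The records, field names, and the CANONICAL_COUNTRY mapping are hypothetical, invented only for illustration; real cleaning rules depend on the data set.

```python
# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "name": "Alice", "country": "USA"},
    {"id": 1, "name": "Alice", "country": "USA"},      # duplicate record
    {"id": 2, "name": "Bob", "country": None},         # incomplete data
    {"id": 3, "name": "Carol", "country": "U.S.A."},   # inconsistent naming
]

# Assumed mapping from naming variants to one canonical form.
CANONICAL_COUNTRY = {"USA": "USA", "U.S.A.": "USA", "United States": "USA"}


def clean(rows):
    """Normalize naming, drop exact duplicates, and set aside incomplete rows."""
    seen = set()
    complete, incomplete = [], []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        country = row["country"]
        if country is not None:
            # Resolve inconsistent naming; unknown values pass through unchanged.
            row["country"] = CANONICAL_COUNTRY.get(country, country)
        # A record is a duplicate if all of its field values were seen before.
        key = tuple(row[k] for k in sorted(row))
        if key in seen:
            continue
        seen.add(key)
        # Rows with a missing value are kept apart for later handling.
        if any(value is None for value in row.values()):
            incomplete.append(row)
        else:
            complete.append(row)
    return complete, incomplete


complete, incomplete = clean(records)
```

Here incomplete rows are separated rather than dropped, so a later step can decide whether to fill in or discard them.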
Data Integration
Data integration:
► Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
► Integrate metadata from different sources
Entity identification problem:
► Identify real-world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
► For the same real-world entity, attribute values from different sources
are different
► Possible reasons: different representations, different scales, e.g.,
metric vs. British units
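A minimal sketch of the three integration steps above, under stated assumptions: the two sources, the height attribute, the NICKNAMES alias table, and the tolerance are all hypothetical and chosen only to mirror the slide's examples (cust-id vs. cust-#, Bill Clinton vs. William Clinton, metric vs. British units).

```python
# Hypothetical source A: key named "cust-id", height in centimeters.
source_a = [{"cust-id": 7, "name": "William Clinton", "height_cm": 188.0}]
# Hypothetical source B: key named "cust-#", height in inches.
source_b = [{"cust-#": 7, "name": "Bill Clinton", "height_in": 74.0}]

NICKNAMES = {"bill": "william"}  # assumed alias table for entity identification
CM_PER_INCH = 2.54


def canonical_name(name):
    """Lower-case the name and replace a known nickname in the first token."""
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])


def from_a(row):
    # Schema integration: A.cust-id maps to the unified field cust_id.
    return {"cust_id": row["cust-id"],
            "name": canonical_name(row["name"]),
            "height_cm": row["height_cm"]}


def from_b(row):
    # Schema integration: B.cust-# maps to cust_id; inches become centimeters.
    return {"cust_id": row["cust-#"],
            "name": canonical_name(row["name"]),
            "height_cm": round(row["height_in"] * CM_PER_INCH, 2)}


def integrate(a_rows, b_rows, tolerance_cm=1.0):
    """Merge both sources and report value conflicts larger than the tolerance."""
    merged, conflicts = {}, []
    unified = [from_a(r) for r in a_rows] + [from_b(r) for r in b_rows]
    for row in unified:
        key = (row["cust_id"], row["name"])  # same key means same real-world entity
        if key not in merged:
            merged[key] = row
        elif abs(merged[key]["height_cm"] - row["height_cm"]) > tolerance_cm:
            conflicts.append((merged[key], row))
    return list(merged.values()), conflicts


entities, conflicts = integrate(source_a, source_b)
```

Converting to a common scale before comparing is what makes the two height values agree; compared raw, 188.0 and 74.0 would look like a conflict.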