Lecture123
Quality
• Data cleaning
• Data reduction
• Discretization
• Summary
Data Cleaning
• Importance
• “Data cleaning is the number one problem in data warehousing”
Data Integration
• Data integration:
• combines data from multiple sources
• Schema integration
• integrate metadata from different sources
• Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-# (the same customer key under two names)
• Detecting and resolving data value conflicts
• for the same real-world entity, attribute values from different sources differ, e.g., because of different scales (metric vs. British units)
• Removing duplicates and redundant data
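The integration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their field names (`height_cm`, `height_in`), the sample values, and the "keep source A on duplicates" policy are all hypothetical and not taken from the lecture.

```python
# Hypothetical records from two sources; field names and values are illustrative.
source_a = [
    {"cust-id": 1, "name": "Ann", "height_cm": 170.0},
    {"cust-id": 2, "name": "Bob", "height_cm": 182.0},
]
source_b = [
    {"cust-#": 2, "name": "Bob", "height_in": 71.65},
    {"cust-#": 3, "name": "Cy", "height_in": 65.0},
]

CM_PER_INCH = 2.54


def integrate(a_rows, b_rows):
    """Merge two sources into one list keyed by a unified customer id."""
    merged = {}
    # Schema integration: A.cust-id and B.cust-# are mapped to one key, cust_id.
    for row in a_rows:
        merged[row["cust-id"]] = {
            "cust_id": row["cust-id"],
            "name": row["name"],
            "height_cm": row["height_cm"],
        }
    for row in b_rows:
        key = row["cust-#"]
        if key in merged:
            # Duplicate entity: keep source A's record (an assumed policy).
            continue
        # Value conflict: source B stores inches, so convert to centimetres.
        merged[key] = {
            "cust_id": key,
            "name": row["name"],
            "height_cm": round(row["height_in"] * CM_PER_INCH, 1),
        }
    return [merged[k] for k in sorted(merged)]


unified = integrate(source_a, source_b)
for record in unified:
    print(record)
```

In practice the duplicate policy is the hard part: real sources rarely share a clean key, so entity matching often needs fuzzy comparison on several attributes rather than a single id lookup.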
Data Transformation
• Smoothing: remove noise from data
• Normalization: values are scaled to fall within a small, specified range (e.g., -1.0 to 1.0 or 0.0 to 1.0)
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: summarization
• Generalization: concept hierarchy climbing
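Two of the transformations listed above, normalization and smoothing, are easy to show concretely. The sketch below assumes min-max scaling as the normalization method and equal-depth bin means as the smoothing method; the lecture does not name specific techniques, so both choices, the function names, and the sample data are illustrative.

```python
from statistics import mean


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-depth bins, and replace
    each value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed


# Illustrative sample data (not from the lecture).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

unit_range = min_max(prices)                 # scaled into 0.0 to 1.0
signed_range = min_max(prices, -1.0, 1.0)    # scaled into -1.0 to 1.0
smoothed = smooth_by_bin_means(prices, 3)    # bin means are 9, 22 and 29

print(unit_range)
print(signed_range)
print(smoothed)
```

Min-max scaling preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value stretches the range; smoothing by bin means trades detail for noise reduction, and the bin size controls that trade-off.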
Data Preprocessing
• Data cleaning
• Data reduction
• Discretization
• Summary