Chapter 2 - Data Preprocessing
Data Preprocessing
■ Data cleaning
  ■ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
■ Data integration
  ■ Integrate multiple databases, data cubes, or files
■ Data reduction
  ■ Obtain a representation that is reduced in volume but produces the same or similar analytical results
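As a rough illustration of integration and reduction, the sketch below joins two small in-memory sources on a shared key and then aggregates the result. The data, field names, and the helper functions `integrate` and `reduce_by_customer` are all hypothetical, chosen only for this example; aggregation is just one of several possible reduction strategies.

```python
from collections import defaultdict

# Hypothetical source 1: a customer table keyed by customer id.
crm = {1: {"name": "Ann"}, 2: {"name": "Bob"}}

# Hypothetical source 2: billing rows that reference the same ids.
billing = [
    {"customer_id": 1, "month": "2024-01", "amount": 20.0},
    {"customer_id": 1, "month": "2024-02", "amount": 30.0},
    {"customer_id": 2, "month": "2024-01", "amount": 10.0},
    {"customer_id": 2, "month": "2024-02", "amount": 15.0},
]


def integrate(crm, billing):
    """Data integration: attach the customer name to each billing row."""
    return [
        {**row, "name": crm[row["customer_id"]]["name"]}
        for row in billing
        if row["customer_id"] in crm  # skip rows with no matching customer
    ]


def reduce_by_customer(rows):
    """Data reduction by aggregation: one total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["name"]] += row["amount"]
    return dict(totals)


merged = integrate(crm, billing)
reduced = reduce_by_customer(merged)
print(reduced)
```

The reduced form has two entries instead of four rows, yet an analysis that only needs per-customer or overall totals gets the same answer from either representation.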
■ Importance
  ■ “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  ■ “Data cleaning is the number one problem in data warehousing” (DCI survey)
■ Data cleaning tasks
  ■ Fill in missing values
  ■ Identify outliers and smooth out noisy data
  ■ Correct inconsistent data
  ■ Resolve redundancy caused by data integration
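The four cleaning tasks can be sketched on a toy record set. This is a minimal illustration, not a prescribed method: the records, the `clean` function, the plausible age range of 0 to 120, and the choice of the median as the fill value are all assumptions made for the example.

```python
from statistics import median

# Hypothetical records showing each kind of problem.
records = [
    {"id": 1, "city": "New York", "age": 34},
    {"id": 2, "city": "new york ", "age": None},  # inconsistent label, missing age
    {"id": 3, "city": "Boston", "age": 29},
    {"id": 3, "city": "Boston", "age": 29},       # redundant copy of id 3
    {"id": 4, "city": "boston", "age": 310},      # implausible age
    {"id": 5, "city": "Boston", "age": 41},
]


def clean(records, age_range=(0, 120)):
    """Return cleaned copies of the records; the input is left unchanged."""
    # 1. Resolve redundancy: keep the first record seen for each id.
    seen, unique = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(dict(r))

    # 2. Correct inconsistent data: normalize whitespace and capitalization.
    for r in unique:
        r["city"] = r["city"].strip().title()

    # 3. Identify outliers: flag ages outside the assumed plausible range
    #    and treat them as missing.
    lo, hi = age_range
    for r in unique:
        r["outlier"] = r["age"] is not None and not (lo <= r["age"] <= hi)
        if r["outlier"]:
            r["age"] = None

    # 4. Fill in missing values with the median of the remaining ages.
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
    return unique


cleaned = clean(records)
for row in cleaned:
    print(row)
```

The order of the steps matters here: outliers are set aside before the fill value is computed, so the implausible age does not distort the median used for the missing entries.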