Data Mining and Data Warehousing - Data Preprocessing - L03
Madava Viranjan
Why Preprocess Data?
• Inaccurate Data
• Incomplete Data
• Inconsistent Data
Tasks in Data Preprocessing
• Data Cleaning
– Smooth noise, remove outliers, fill in missing values, resolve inconsistencies
• Data Integration
– Integrate data from multiple sources
• Data Reduction
– Obtain a reduced representation of the data set
• Data Transformation
– Normalization, concept hierarchies
Data Cleaning – Missing Values
• Ignore
– Ignore the tuple if the attribute is missing (e.g., the class label in a classification task)
• Manually fill
• Fill via global constant
– Fill with a constant such as ‘Unknown’
• Fill with measure
– E.g., the attribute mean or median
• Use the most probable value
– The value can be determined by methods such as regression or decision tree induction
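The fill-with-measure strategy above can be sketched in a few lines; the `ages` values are a made-up illustration, and `None` stands in for a missing entry:

```python
from statistics import mean, median

# Hypothetical attribute values; None marks missing entries.
ages = [23, None, 31, 27, None, 45]

# Compute the fill value from the observed (non-missing) values only.
observed = [v for v in ages if v is not None]
fill = mean(observed)  # median(observed) is more robust for skewed data

# Replace each missing value with the chosen measure.
filled = [v if v is not None else fill for v in ages]
```

For categorical attributes, the mode (most frequent value) plays the same role as the mean here.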
Data Cleaning – Noise
• Binning
– Sorted values are distributed into a number of bins, and each value is smoothed based on its bin contents (e.g., replaced by the bin mean, median, or nearest boundary)
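Smoothing by bin means can be sketched as follows; the sorted `prices` list and the bin depth of 3 are illustrative choices:

```python
# Equal-frequency (equal-depth) binning: partition sorted values into
# bins of fixed size, then replace each value with its bin's mean.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    b = prices[i:i + bin_size]
    m = sum(b) / len(b)          # bin mean
    smoothed.extend([m] * len(b))
```

Smoothing by bin boundaries would instead replace each value with the closest of the bin's minimum and maximum.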
Data Cleaning – Noise
• Regression
– Fit the data values to a function; the fitted function then replaces the noisy values (e.g., linear regression finds the best-fitting line)
• Outlier Analysis
– Outliers can be detected by clustering: values falling outside all clusters may be outliers
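A minimal sketch of regression-based smoothing, assuming a single numeric attribute `ys` observed at points `xs` (both invented for illustration): fit a least-squares line and replace each value with the fitted one.

```python
# Fit y = a*x + b by ordinary least squares (stdlib only), then
# smooth the data by projecting each value onto the fitted line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly linear, with noise

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

smoothed = [a * x + b for x in xs]
```

With more attributes, multiple regression fits the value as a function of several other attributes in the same way.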
Data Cleaning - Discrepancy
• Use Meta Data
• Look out for inconsistent use of codes and formats
– Eg: “2010/12/25” and “25/12/2010” as dates
• Consecutive Rule
– There can be no missing values between the defined minimum and maximum values
• Unique rule
– Each value of the attribute must be different from all others
• Null rule
– Specifies how empty (null) values should be handled
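The unique rule and null rule above can be checked mechanically; the `ids` column here is a hypothetical attribute with one deliberate duplicate and one null:

```python
# Hypothetical checks for the unique rule and the null rule
# on a single attribute (column) of a data set.
ids = ["A01", "A02", "A02", None, "A04"]

non_null = [v for v in ids if v is not None]
# Unique rule: no two non-null values may be equal.
violates_unique = len(non_null) != len(set(non_null))
# Null rule: flag the presence of empty values for handling.
violates_null = any(v is None for v in ids)
```

In practice such rules are stored as metadata and applied by a data auditing or scrubbing tool rather than ad-hoc scripts.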
Data Integration
• Entity Identification Problem
– How to match equivalent entities from different data sources
• Redundancy of Attributes
– An attribute is redundant if it can be derived from other attributes
• Tuple Duplication
– Two or more identical tuples exist for a unique data entry
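Detecting and removing duplicate tuples can be sketched as below; the `rows` relation is a made-up example with one exact duplicate:

```python
# Remove exact duplicate tuples while preserving first-seen order.
rows = [
    ("Alice", 30, "NY"),
    ("Bob", 25, "LA"),
    ("Alice", 30, "NY"),  # exact duplicate tuple
]

seen = set()
deduped = []
for r in rows:
    if r not in seen:
        seen.add(r)
        deduped.append(r)
```

Near-duplicates (same entity, slightly different values) are harder and typically need entity matching rather than exact comparison.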
Data Reduction
• Dimensionality Reduction
– Keep only the required dimensions (attributes); remove irrelevant ones
• Numerosity Reduction
– Replace the original data volume by smaller forms of data representation
• Data Compression
– Lossless and lossy compression
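One simple form of numerosity reduction is random sampling; the data set, seed, and sample size of 10 below are arbitrary illustrations:

```python
import random

# Numerosity reduction by simple random sampling without replacement:
# a small sample stands in for the full data volume.
random.seed(0)  # fixed seed so the sketch is reproducible
data = list(range(100))
sample = random.sample(data, 10)
```

Other numerosity-reduction choices include histograms, clustering, and parametric models that store only model parameters instead of the data.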
Data Transformation
Data is transformed or consolidated into forms appropriate for mining.
• Smoothing
– Remove noise
• Attribute construction
– Construct new attributes
• Aggregation
– Summary or aggregation operations on data
• Normalization
– Scale data to fall into smaller range
• Discretization
– Replace raw values with interval or conceptual labels
• Concept Hierarchy
– Generalize low-level concepts to higher-level concepts (e.g., street → city → country)
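The normalization bullet above can be made concrete with min-max normalization; the `values` list is invented, and the target range [0, 1] is one common choice:

```python
# Min-max normalization: linearly rescale values into [0, 1].
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
```

Z-score normalization ((v − mean) / stdev) is preferred when the true minimum and maximum are unknown or outliers dominate the range.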