Data Mining and Data Warehousing - Data Preprocessing - L03

Data Preprocessing

Madava Viranjan
Why Preprocess Data?
• Inaccurate Data

• Incomplete Data

• Inconsistent Data
Tasks in Data Preprocessing
• Data Cleaning
– Smooth noise, remove outliers, fill in missing values, resolve inconsistencies

• Data Integration
– Integrate data from multiple sources

• Data Reduction
– Obtain a reduced representation of the data set

• Data Transformation
– Normalization, concept hierarchies
Data Cleaning – Missing Values
• Ignore
– Ignore the tuple if an attribute is missing (e.g., the class label in a
classification task)
• Manually fill
• Fill via global constant
– Fill with a constant such as ‘Unknown’
• Fill with a measure of central tendency
– E.g., the mean or median of the attribute
• Use the most probable value
– The value can be predicted by a method such as regression or a
decision tree
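
The simpler filling strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the records, the attribute names, and the helper functions `drop_missing`, `fill_constant`, and `fill_measure` are all hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median

# Hypothetical toy records; None is assumed to mark a missing value.
records = [
    {"age": 25, "city": "Colombo"},
    {"age": None, "city": "Kandy"},
    {"age": 40, "city": None},
    {"age": 31, "city": "Galle"},
]

def drop_missing(rows, attr):
    """Ignore (drop) every tuple whose value for attr is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_constant(rows, attr, constant="Unknown"):
    """Fill missing values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_measure(rows, attr, measure=mean):
    """Fill missing values of attr with a measure (mean or median) of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    fill = measure(known)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

# Fill the nominal attribute with a constant, then the numeric one with the mean.
filled = fill_measure(fill_constant(records, "city"), "age", measure=mean)
print(filled)
```

Passing `measure=median` instead of `mean` is usually the safer choice when the attribute is skewed, because the median is less affected by extreme values.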
Data Cleaning – Noise
• Binning
– Sorted values are distributed into a number of bins, and each
value is then smoothed using the contents of its bin (e.g., replaced
by the bin mean)
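
As a rough sketch of binning, the code below partitions sorted values into equal-frequency bins and then smooths them either by bin means or by bin boundaries. The function names and the sample `prices` list are illustrative assumptions, and equal-frequency partitioning is only one of several possible ways to form the bins.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```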
Data Cleaning – Noise
• Regression
– Smooths the data by fitting the values to a function

• Outlier Analysis
– Outliers can be detected by clustering: values that fall outside
every cluster are candidate outliers
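
Both ideas can be sketched on toy data. The first part fits a straight line by ordinary least squares and uses the fitted values as the smoothed data. The second part uses a deliberately naive one-dimensional clustering (grouping sorted values whose gap is at most `max_gap`) and flags single-member clusters as candidate outliers. The function names, the data, and the gap-based clustering rule are assumptions made for illustration; a real task would likely use a proper clustering algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gap_clusters(values, max_gap):
    """Naive 1-D clustering: start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

# Regression smoothing on hypothetical noisy measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]

# Outlier analysis: a value left alone in its own cluster is a candidate outlier.
clusters = gap_clusters([10, 11, 12, 50, 30, 31, 32], max_gap=3)
outliers = [c[0] for c in clusters if len(c) == 1]
print(smoothed, outliers)
```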
Data Cleaning – Discrepancy
• Use metadata
• Look out for inconsistent use of codes and formats
– E.g., “2010/12/25” and “25/12/2010” as dates
• Consecutive rule
– No values may be missing between the defined minimum
and maximum values of the attribute
• Unique rule
– Each value of the attribute must differ from all other
values of that attribute
• Null rule
– Specifies which markers indicate an empty value and how
such values are handled
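
The rules above lend themselves to small automated checks. The sketch below is one possible way to express them in Python; the helper names, the set of null markers in `NULL_MARKERS`, and the two accepted date formats are assumptions chosen for this example rather than part of any standard.

```python
from datetime import datetime

# Assumed markers that indicate an empty value (the "null rule" for this example).
NULL_MARKERS = {"", "?", "N/A", None}

# Assumed date formats seen in the sources, tried in order.
DATE_FORMATS = ("%Y/%m/%d", "%d/%m/%Y")

def normalize_date(text):
    """Convert a date written in any accepted format to ISO form (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

def apply_null_rule(values):
    """Map every assumed null marker to None so missing values are handled uniformly."""
    return [None if v in NULL_MARKERS else v for v in values]

def violates_unique(values):
    """Unique rule: True if any non-null value occurs more than once."""
    non_null = [v for v in values if v is not None]
    return len(non_null) != len(set(non_null))

def missing_in_sequence(values):
    """Consecutive rule: integers absent between the observed minimum and maximum."""
    present = {v for v in values if v is not None}
    return [v for v in range(min(present), max(present) + 1) if v not in present]

print(normalize_date("2010/12/25"), normalize_date("25/12/2010"))
print(missing_in_sequence([1, 2, 4, 7]))  # [3, 5, 6]
```

Note that a date such as “01/02/2010” is ambiguous between day-first and month-first conventions; the sketch simply assumes day-first, which is exactly the kind of decision metadata should record.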
Data Integration
• Entity Identification Problem
– How to match equivalent entities from different data sets

• Redundancy of Attributes
– An attribute is redundant if it can be derived from other attributes

• Tuple Duplication
– Two or more identical tuples exist for what should be a unique entry

• Data Value Conflicts
– Different representation, scaling, or encoding of the same attribute
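
Two of these problems can be illustrated briefly. Correlation analysis is one common way to spot a numeric attribute that may be derivable from another, and exact duplicates can be dropped with a set of already-seen tuples. The sketch below assumes hypothetical salary and customer data and hypothetical helper names; a correlation near 1 or -1 is only a hint of redundancy, not proof.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_duplicate_tuples(rows):
    """Keep the first occurrence of each tuple, preserving the original order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical attributes: the annual salary is derived from the monthly salary.
monthly_salary = [1000, 1500, 2000, 3000]
annual_salary = [12 * m for m in monthly_salary]
r = pearson(monthly_salary, annual_salary)
print(f"correlation = {r:.3f}")  # close to 1, so one attribute is likely redundant

# Hypothetical customer tuples with one exact duplicate.
customers = [("C1", "Ann"), ("C2", "Bob"), ("C1", "Ann")]
print(drop_duplicate_tuples(customers))
```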
Data Reduction
Obtain a reduced representation of the data that is much smaller in
volume yet closely maintains the integrity of the original data

• Dimensionality Reduction
– Keep only the attributes (dimensions) that are needed

• Numerosity Reduction
– Replace the original data with smaller forms of data representation

• Data Compression
– Lossless and lossy compression
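
Each of the three strategies has a very simple form that can be shown with the Python standard library. The sketch below uses attribute selection for dimensionality reduction, simple random sampling without replacement for numerosity reduction, and `zlib` for lossless compression. The helper names, the sample sizes, and the toy rows are assumptions for illustration; sampling is just one numerosity-reduction technique among several.

```python
import random
import zlib

def keep_attributes(rows, attrs):
    """Dimensionality reduction in its simplest form: keep only the chosen attributes."""
    return [{a: r[a] for a in attrs} for r in rows]

def simple_random_sample(rows, n, seed=0):
    """Numerosity reduction: draw n rows without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(rows, n)

# Hypothetical rows with three attributes each.
rows = [{"id": i, "value": i * 2, "note": "n/a"} for i in range(1000)]

reduced = keep_attributes(rows, ["id", "value"])
sample = simple_random_sample(reduced, 50)

# Lossless compression: the original bytes are recovered exactly.
raw = ",".join(str(r["value"]) for r in rows).encode("utf-8")
packed = zlib.compress(raw)
restored = zlib.decompress(packed)

print(len(rows), len(sample), len(raw), len(packed), restored == raw)
```

A lossy scheme, by contrast, would only allow an approximation of `raw` to be reconstructed.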
Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

• Smoothing
– Remove noise

• Attribute construction
– Construct new attributes

• Aggregation
– Summary or aggregation operations on data

• Normalization
– Scale data to fall within a smaller range, such as 0.0 to 1.0

• Discretization
– Replace raw numeric values with interval or conceptual labels

• Concept Hierarchy
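
Three of the transformations listed above can be sketched as follows: min-max normalization, discretization of a numeric attribute into labels, and aggregation of detailed records into totals. The function names, the age cut points, the labels, and the sales figures are all hypothetical choices made for this example.

```python
from collections import defaultdict

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def discretize_age(age):
    """Discretization: replace a raw age by a label (cut points are assumptions)."""
    if age < 20:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"

def aggregate_by_key(pairs):
    """Aggregation: sum the amounts recorded under each key (e.g., sales per year)."""
    totals = defaultdict(int)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)

print(min_max([10, 20, 30]))                         # [0.0, 0.5, 1.0]
print([discretize_age(a) for a in (15, 35, 70)])     # ['youth', 'adult', 'senior']
print(aggregate_by_key([("2010", 100), ("2010", 150), ("2011", 200)]))
```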
