36. Why Data Preprocessing: Introduction
Data Preprocessing - Introduction
What is Data Preprocessing
Transforming raw data into an understandable format.
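Why Data Preprocessing (DP)
• No quality data = no quality data mining (DM) results
• Quality decisions must be based on quality data (QD)
Noisy data
• Random variance and/or error in measurement
• Containing errors or outliers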
Incomplete data
• Lacking attribute values
• Lacking certain attributes of interest
• Containing only aggregate data
Inconsistent data
• Containing discrepancies in codes or names
• Age=“42” but birthday=“03/07/1997”
• Rating “1, 2, 3” in one place and “A, B, C” in another
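A discrepancy like the Age/birthday pair above can be detected mechanically. The following is a minimal sketch, not a definitive check: it assumes the birthday is written as MM/DD/YYYY and that the caller supplies a reference date. The function names `age_on` and `is_age_consistent` are hypothetical and chosen only for illustration.

```python
from datetime import date, datetime

def age_on(birthday: date, today: date) -> int:
    """Return the age in whole years on the given reference date."""
    years = today.year - birthday.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

def is_age_consistent(record: dict, today: date) -> bool:
    """Check that the recorded age matches the age implied by the birthday."""
    # Assumption: birthdays are formatted as MM/DD/YYYY.
    birthday = datetime.strptime(record["birthday"], "%m/%d/%Y").date()
    return int(record["Age"]) == age_on(birthday, today)

record = {"Age": "42", "birthday": "03/07/1997"}
# The reference date is an arbitrary choice for illustration.
print(is_age_consistent(record, date(2020, 1, 1)))  # False
```

A record that fails this check is flagged as inconsistent; deciding which of the two fields is wrong still needs another source or human review.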
Data Mining
Why Data Preprocessing: Why is data dirty
Why is data dirty
Reasons
• Noise
• Incompleteness
• Inaccuracy
• Inconsistency
• Timeliness
Reasons for Noise
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
Incompleteness
• “Not applicable” data value when collected
• Human, hardware, or software problems
Reasons for Inaccuracy
• Data transmission errors
• Inconsistent naming conventions
• Duplicate tuples
• Inaccurate data collection
Inconsistency & Timeliness
• Different data sources
• Functional dependency violation
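A functional dependency X → Y is violated when the same X value appears with more than one Y value, which often happens when data from different sources is merged. The sketch below illustrates one way to detect this. The helper `find_fd_violations` and the zip/city sample rows are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def find_fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one rhs value (lhs -> rhs is violated)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    # Keep only the keys that were seen with several distinct values.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

# Hypothetical rows merged from two sources; zip -> city is expected to hold.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]
violations = find_fd_violations(rows, "zip", "city")
print(violations)  # {'10001': ['NYC', 'New York']}
```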
Why Data Preprocessing: Multi-Dimensional Measure of Data Quality
Measuring Data Quality
Accuracy & Completeness
• Whether the stored data is correct.
• Unambiguous.
Consistency & Timeliness
• Data is in the same format at all times and across different sources.
• Data is available within the required time.
Interpretability & Accessibility
• How easily the data can be understood.
Data Cleaning: Introduction
• Identifying outliers
• Correcting inconsistencies
Advantage
• Avoids false, inaccurate, or misleading conclusions
Need
• Transmission errors
• Faulty equipment
• Availability of data
Data Cleaning: Missing Data
Missing data is the unavailability of essential data that is required to draw a conclusion or derive information.
Cause: data inconsistent with other recorded data, and therefore deleted
Fill in automatically with:
• a global constant
• the attribute mean
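The two automatic fill-in strategies above can be sketched in a few lines. This is an illustration only: it assumes missing entries are represented as `None`, and the function names and the sample `income` values are hypothetical.

```python
from statistics import mean

def fill_with_constant(values, constant):
    """Replace every missing value (None) with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean: no observed values")
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

# Hypothetical attribute with two missing entries.
income = [30, None, 50, 40, None]
print(fill_with_constant(income, -1))  # [30, -1, 50, 40, -1]
print(fill_with_mean(income))          # [30, 40, 50, 40, 40]
```

A global constant keeps the missing entries recognizable but can distort statistics computed later, while the attribute mean preserves the average at the cost of reducing the variance.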
Data Cleaning: Noisy Data - Introduction
Noisy Data Intro
Noisy data
Random error or variance in a measured variable.
Causes:
• Transmission problems
• Technology limitations
• Inconsistency in naming conventions
Handling Techniques
• Binning
• Regression analysis
• Outlier analysis in clustering
• Combined computer and human inspection
Data Cleaning: Binning
Smooth sorted data values by consulting their neighborhood, that is, the values around them.
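One common way to smooth by neighborhood is to sort the values, split them into bins, and replace each value with the mean of its bin. The slides do not say which variant is intended, so the sketch below assumes fixed-size bins and smoothing by bin means. The function name and the sample values are hypothetical.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace each value with its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        # Every member of the bin takes the bin's mean.
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical sample values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smaller bins follow the data more closely, and larger bins smooth more aggressively.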
Data Cleaning: Models
Data Cleaning - Models
Models
• Linear Regression
• Clustering
Linear Regression
• Finds the line that best fits two attributes
• An approximating function that captures the important patterns and values
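Fitting a line to two attributes can be sketched with ordinary least squares. This is a minimal illustration under the assumption that least squares is the intended fitting method. The function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical attribute pair; y is roughly 2x with some noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)

# Smoothing: replace each noisy y with the value predicted by the line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
```

For these sample points the fitted slope is close to 1.97 and the intercept close to 0.09, so the smoothed values lie on a single line instead of scattering around it.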
Clustering
• Organizes similar values into groups, or clusters
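Grouping similar values also supports outlier analysis: values that fall outside every sizeable cluster are candidates for noise. The sketch below uses a simple gap-based grouping of one-dimensional values, which is a stand-in chosen for brevity and not a specific clustering algorithm named in the slides. The function names, the `max_gap` and `min_size` parameters, and the sample ages are hypothetical.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted 1-D values; a new cluster starts when the gap to the previous value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size=2):
    """Treat values in clusters smaller than min_size as outliers."""
    return [v for c in cluster_by_gap(values, max_gap) if len(c) < min_size for v in c]

# Hypothetical ages with one implausible entry.
ages = [21, 22, 23, 45, 46, 47, 120]
print(cluster_by_gap(ages, 5))  # [[21, 22, 23], [45, 46, 47], [120]]
print(find_outliers(ages, 5))   # [120]
```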
Procedure