Why Data Preprocessing: Introduction

The document discusses data preprocessing and data cleaning techniques. It introduces why data preprocessing is important for data mining and describes common data quality issues like noise, incompleteness and inconsistencies in data. It then explains different techniques for handling missing values, noisy data, binning and using models like linear regression and clustering for cleaning dirty data.

Uploaded by

amna shahid
Copyright
© All Rights Reserved

Data Mining

Why Data Preprocessing: Introduction
Data Preprocessing - Introduction
What is Data Preprocessing

• Processing raw data to prepare it for another processing procedure
• Transforming raw data into an understandable format

Why Data Preprocessing (DP)

• No quality data, no quality data mining (DM)
• Quality decisions need quality data (QD)
• Real-world data is dirty: noisy, incomplete, and inconsistent

Noisy & Inconsistent Data

Noisy data
• Random variance and/or error in measurement
• Containing errors or outliers

Incomplete data
• Lacking attribute values
• Lacking certain attributes of interest
• Containing only aggregate data

Inconsistent data
• Containing discrepancies in codes or names
• Example: Age = “42” but birthday = “03/07/1997”
• Example: rating recorded as “1, 2, 3” in one place and as “A, B, C” in another
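
The age/birthday discrepancy above can be caught with a simple cross-check. The sketch below is only an illustration: the function names, the day/month/year reading of the birthday, and the reference date are all assumptions, not part of the slides.

```python
from datetime import date


def age_on(birthday: date, today: date) -> int:
    """Whole years elapsed between birthday and today."""
    years = today.year - birthday.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def is_consistent(recorded_age: int, birthday: date, today: date) -> bool:
    """True when the recorded age matches the age implied by the birthday."""
    return recorded_age == age_on(birthday, today)


# Hypothetical record modelled on the slide's example.
# Assumption: "03/07/1997" is read as 3 July 1997.
record = {"age": 42, "birthday": date(1997, 7, 3)}
reference = date(2024, 1, 1)  # assumed "today" for the check

print(age_on(record["birthday"], reference))
print(is_consistent(record["age"], record["birthday"], reference))
```

A record that fails the check is only flagged; deciding which of the two attributes is wrong still needs another source or human inspection.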
Data Mining

Why Data Preprocessing: Why Is Data Dirty
Why is data dirty
Reasons

• Noise
• Incompleteness
• Inaccuracy
• Inconsistency
• Timeliness

Reasons for Noise
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission

Reasons for Incompleteness
• “Not applicable” data value when collected
• Time difference between data collection and data analysis
• Human, hardware, or software problems

Reasons for Inaccuracy
• Data transmission errors
• Inconsistent naming conventions
• Duplicate tuples
• Inaccurate data collection

Inconsistency & Timeliness
• Different data sources
• Functional dependency violations
• Data not collected at the required frequency
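
A functional dependency violation means that one value of an attribute maps to more than one value of the attribute it is supposed to determine. A minimal sketch of such a check follows; the function name `fd_violations`, the attribute names, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of rhs values} wherever lhs -> rhs is violated."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    # The dependency holds only if every lhs value maps to exactly one rhs value.
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Invented rows: dept_id is assumed to determine dept_name.
rows = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 10, "dept_name": "Marketing"},
    {"dept_id": 20, "dept_name": "HR"},
]

violations = fd_violations(rows, "dept_id", "dept_name")
print(violations)
```

Such violations often appear when rows from different data sources are merged, which ties this check to the first bullet as well.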
Data Mining

Why Data Preprocessing: Multi-Dimensional Measure of Data Quality
Measuring Data Quality

Measure of Data Quality


• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Accuracy & Completeness
• Accuracy: whether the stored data is correct and unambiguous
• Completeness: whether all the data needed for the required information is available
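
The slides define completeness only qualitatively. One possible way to put a number on it, shown here purely as an assumption, is the fraction of required attribute values that are actually present. The function name and the sample rows are hypothetical.

```python
def completeness(rows, required):
    """Fraction of required attribute values that are present.

    A value counts as missing when it is None, an empty string,
    or the attribute is absent from the row (an assumed convention).
    """
    total = len(rows) * len(required)
    if total == 0:
        return 1.0  # nothing is required, so nothing is missing
    present = sum(
        1
        for row in rows
        for attr in required
        if row.get(attr) not in (None, "")
    )
    return present / total


# Invented sample: two of the eight required values are missing.
rows = [
    {"name": "A", "age": 30},
    {"name": "B", "age": None},
    {"name": "", "age": 25},
    {"name": "D", "age": 41},
]

print(completeness(rows, ["name", "age"]))
```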

Consistency & Timeliness
• Consistency: data is in the same format at all times and across different sources
• Timeliness: data is available within the required time

Believability & Value Added
• Believability: how much the data can be trusted to be true
• Value added: what impact new data has on existing data

Interpretability & Accessibility
• Interpretability: how easily the data can be understood
• Accessibility: how, and how easily, the data can be accessed
Data Mining

Data Cleaning: Introduction
Data Cleaning

Introduction

• Fill in missing values
• Smooth out noise
• Identify outliers
• Correct inconsistencies

Advantages
• Avoids false, inaccurate, or misleading conclusions
• Makes data more reliable and accurate

Need
• Transmission errors
• Faulty equipment
• Errors due to different conventions or scales
• Availability of data
Data Mining

Data Cleaning: Missing Data
Missing Data

Missing data
Missing data is the unavailability of essential data that is required to draw a conclusion or obtain information.

Reasons for Missing Data
• Equipment malfunction
• Data inconsistent with other recorded data and therefore deleted
• Data not entered
• History or changes of the data not registered
Handling missing values
• Ignore the tuple
• Fill in the missing value manually
• Fill in the missing value automatically with a global constant, the attribute mean, or the most probable value
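
The automatic strategies listed above can be sketched in a few lines. This is only an illustration: the function names and the sample column are made up, missing values are assumed to be stored as `None`, and the most frequent value (mode) is used as a simple stand-in for "most probable value", which in practice may come from a predictive model instead.

```python
from statistics import mean, mode


def drop_missing(values):
    """Ignore the tuple: keep only entries that have a value."""
    return [v for v in values if v is not None]


def fill_constant(values, constant):
    """Replace every missing entry with one global constant."""
    return [constant if v is None else v for v in values]


def fill_mean(values):
    """Replace missing entries with the mean of the observed values."""
    m = mean(drop_missing(values))
    return [m if v is None else v for v in values]


def fill_mode(values):
    """Replace missing entries with the most frequent observed value
    (a simplification of 'most probable value')."""
    m = mode(drop_missing(values))
    return [m if v is None else v for v in values]


# Invented attribute column with two missing entries.
incomes = [30, None, 50, 30, None, 70]

print(drop_missing(incomes))
print(fill_constant(incomes, -1))
print(fill_mean(incomes))
print(fill_mode(incomes))
```

Ignoring the tuple loses the other attribute values in that row, while a global constant or the mean can distort the distribution, so the choice depends on how much data is missing and how it will be used.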


Data Mining

Data Cleaning: Noisy Data Introduction
Noisy Data Intro
Noisy data
• Random error or variance in a measured variable
• Noisy data can be described as meaningless or corrupt data that cannot be understood by a machine
Reasons for Noisy Data
• Faulty instruments
• Data entry problems
• Transmission problems
• Technology limitations
• Inconsistency in naming conventions

Handling Techniques
• Binning
• Regression analysis
• Outlier analysis through clustering
• Combined computer and human inspection
Data Mining

Data Cleaning: Binning
Binning

Binning
• Smooths sorted data by consulting its neighborhood, that is, the values around it
• The sorted values are distributed into a number of buckets, or bins
Binning Methods
• Smoothing by bin medians
• Smoothing by bin boundaries
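
Both smoothing methods can be sketched as follows. The example assumes equal-depth bins (the same number of values in each bin), which the slides do not specify. The function names and the sample values are illustrative, and ties in boundary smoothing go to the lower boundary by assumption.

```python
from statistics import median


def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_medians(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max.

    Ties go to the lower boundary (an assumed convention).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative values, not taken from the slides.
prices = [15, 4, 21, 8, 24, 21, 34, 25, 28]

bins = equal_depth_bins(prices, depth=3)
print(bins)
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

Median smoothing pulls every value toward the centre of its bin, while boundary smoothing keeps the extremes of each bin and snaps the rest to them.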
Data Mining

Data Cleaning: Models
Data Cleaning - Models

Models

Linear Regression

Clustering
Linear Regression
• Find the line that best fits two attributes
• Use one attribute to predict the other
• Fit the data to functions
• Use an approximating function to capture important patterns and values
• Use the function to find data set values
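
Fitting a line to two attributes and using one to predict the other can be sketched with ordinary least squares. Least squares is an assumption here, since the slides do not name a fitting method. The function names and the sample data are also illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # undefined if all x values are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Value of the fitted line at x."""
    return slope * x + intercept


# Invented attribute pairs that lie exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(6, slope, intercept))
```

For cleaning, a noisy or missing value of the second attribute can be replaced by the fitted line's value at the corresponding first attribute.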

Clustering
• Group similar values into groups, or clusters
• Detect and remove outliers
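
As a rough illustration of outlier detection by clustering, the sketch below groups one-dimensional values whenever neighbouring values are close together, then treats very small clusters as outliers. This gap-based grouping is a deliberately simple stand-in, not the method from the slides. The function names and the `max_gap` and `min_size` parameters are assumptions.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def split_outliers(values, max_gap, min_size):
    """Return (kept, outliers): values in clusters smaller than min_size are outliers."""
    clusters = cluster_by_gap(values, max_gap)
    kept = [v for c in clusters if len(c) >= min_size for v in c]
    outliers = [v for c in clusters if len(c) < min_size for v in c]
    return kept, outliers


# Invented measurements with one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 200]

kept, outliers = split_outliers(readings, max_gap=5, min_size=2)
print(kept)
print(outliers)
```

Values that fall outside every sizeable cluster are candidates for removal, ideally after the combined computer and human inspection mentioned earlier.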

Procedure
