Unit 1: Data Preprocessing
Introduction
Real-world databases often contain noisy, missing, and inconsistent data.
Preprocessing techniques are used to improve data quality and, in turn, the
efficiency and accuracy of mining results
• Data Cleaning: removes noise and corrects inconsistencies in the data
• Data Integration: merges data from multiple sources into a coherent data
store such as a data warehouse
• Data Reduction: reduces the data size by performing aggregations,
eliminating redundant features, and clustering
• Data Transformation: scales data to fall within a smaller range, such as
0.0 to 1.0, which can improve the accuracy and efficiency of mining
algorithms involving distance measures (a scaling sketch follows below)
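As a small illustration of the transformation step, here is a minimal sketch in Python of scaling an attribute into the range 0.0 to 1.0; min-max normalization is one common way to do this, and the values below are hypothetical, not from the source.

import numpy as np

# Hypothetical attribute values (e.g., incomes) to be scaled into [0.0, 1.0].
v = np.array([12000.0, 35000.0, 58000.0, 73600.0, 98000.0])

# Min-max normalization: map the smallest value to 0.0 and the largest to 1.0.
v_scaled = (v - v.min()) / (v.max() - v.min())
print(v_scaled)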
Why Preprocess the Data?
• Data Quality: Data have quality if they satisfy
the requirements of the intended use.
• Many factors make up data quality:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
Cont.,
• Several attributes for various tuples may have no recorded values; such
missing data lowers quality and leads to reported errors, unusual values,
and inconsistencies
• The data you wish to analyze by data mining techniques may be
– Incomplete (lacking attribute values or containing only aggregate data)
– Inaccurate or noisy (containing incorrect attribute values that deviate
from the expected)
– Inconsistent (containing discrepancies in the department codes used to
categorize items)
Accuracy, Completeness and Consistency
• Inaccurate, incomplete, and inconsistent data are common properties of
large databases and data warehouses
• Reasons:
– The data collection instruments used may be faulty
– Human or computer errors occurring at data entry
– Users may purposely submit incorrect values for mandatory fields when
they do not wish to disclose personal information (e.g., date of birth)
– Errors in data transmission
– Technology limitations, such as limited buffer size for coordinating
synchronized data transfer and computation
– Incorrect data may also result from inconsistencies in naming
conventions or formats in input fields (e.g., dates)
Timeliness
• Timeliness also affects data quality
– Example: All Electronics updates its sales details at the end of each month
– Some sales managers fail to update their records before the last day of
the month
– The updated details are then subject to further corrections and adjustments
– The fact that the month-end data are not updated in a timely fashion has a
negative impact on data quality
Believability and Interpretability
• Believability reflects how much the data are
trusted by users/employees
• Interpretability reflects how easily the data are understood
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Cleaning
• Data cleaning routines attempt to
– Fill in missing values
– smooth out noise while identifying outliers
– correct inconsistencies in the data
– Resolve redundancy caused by data integration
Missing Values
1. Ignore the tuple
2. Fill in the missing value manually: time consuming and may not be
feasible for large data sets
3. Use a global constant to fill in the missing value, such as "Unknown" or ∞
4. Use a measure of central tendency: the mean for symmetric data and the
median for skewed data (see the sketch below)
5. Use the attribute mean or median for all samples belonging to the same
class as the given tuple
6. Use the most probable value to fill in the missing value
– Determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction
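A minimal sketch of strategy 4 in Python with pandas, assuming a toy table with a numeric income attribute; the column names and values are illustrative, not from the source.

import pandas as pd

# Hypothetical data with missing "income" values.
df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52, 38],
    "income": [48000, None, 61000, None, 83000, 57000],
})

# Fill with a measure of central tendency: the mean for roughly symmetric
# data, the median for skewed data.
df["income_mean"]   = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())
print(df)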
Noisy Data
• Noise is a random error or variance in a measured variable
– Boxplots and scatter plots can be used to identify outliers, which may
represent noise
– Example: for the attribute "price", we may have to smooth out the data to
remove noise (an IQR-based boxplot rule is sketched below)
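A minimal sketch of the boxplot (interquartile-range) rule for flagging outliers, using hypothetical price values; the 1.5 × IQR threshold is the usual convention, not something prescribed by these notes.

import numpy as np

# Hypothetical "price" values; the last one is an obvious outlier.
price = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 210])

# Boxplot rule: values more than 1.5 * IQR beyond the quartiles are flagged.
q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1
outliers = price[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]
print(outliers)   # -> [210]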
Smoothing techniques
• Binning: smooths a sorted data value by consulting its neighbourhood,
i.e., the values around it
• The sorted values are distributed into a number of buckets, or bins, and
then local smoothing is performed
Smoothing techniques
• Smoothing by bin means
• Smoothing by bin medians
• Smoothing by bin boundaries (the minimum and maximum values of each bin
serve as the bin boundaries; see the sketch below)
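A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price values and the choice of three bins are illustrative assumptions.

import numpy as np

# Sorted hypothetical "price" data.
price = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency (equal-depth) binning into 3 bins of 3 values each.
bins = price.reshape(3, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)

# Smoothing by bin boundaries: each value is replaced by the closer of the
# bin's minimum or maximum value.
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds)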
Regression
• Regression is a technique that conforms data values to a function
• Linear regression involves finding the "best" line to fit two attributes,
so that one attribute can be used to predict the other (a minimal fit is
sketched below)
• Multiple linear regression is an extension of linear regression in which
more than two attributes are involved and the data are fit to a
multidimensional surface
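A minimal sketch of smoothing by linear regression, assuming hypothetical noisy (x, y) pairs: a least-squares line is fitted and each y is replaced by its predicted value.

import numpy as np

# Hypothetical noisy (x, y) pairs where y is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Fit the "best" line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: replace each y by the value predicted from x.
y_smooth = a * x + b
print(a, b)
print(y_smooth)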
• Outlier analysis: outliers may be detected by clustering
• Example: similar values are organized into groups, or clusters; values
that fall outside of the set of clusters are considered outliers (see the
sketch below)
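A minimal sketch of clustering-based outlier detection using scikit-learn's KMeans; the data, the choice of two clusters, and the distance threshold are illustrative assumptions, not the notes' prescription.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D values; 15.0 lies far from both natural groups.
values = np.array([2.0, 2.2, 1.9, 2.1, 8.0, 8.3, 7.9, 8.1, 15.0]).reshape(-1, 1)

# Cluster the values, then flag points far from their nearest cluster center.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
dist = np.min(km.transform(values), axis=1)   # distance to nearest center
threshold = dist.mean() + 2 * dist.std()
outliers = values[dist > threshold].ravel()
print(outliers)   # -> [15.]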
Data Integration
• The semantic heterogeneity and structure of the data pose great
challenges in data integration
– Entity identification problem: how can we match schema and objects from
different sources?
– Correlation tests on numeric and nominal data: measure how strongly the
values of one attribute imply those of another
Entity Identification Problem
• A number of issues must be considered during data integration
– Schema integration and object matching can be tricky
• Matching equivalent real-world entities from multiple data sources is
known as the entity identification problem
– Example: do cust_id in one source and customer_id in another refer to the
same attribute, possibly with different representations and scales?
• Metadata for each attribute include its name, meaning, data type, range
of values permitted for the attribute, and null rules for handling blanks
and zeros
• Such metadata may also be used to help avoid errors in schema integration
Redundancy and Correlation Analysis
• An attribute may be redundant if it can be derived from another attribute
or set of attributes
• Careful integration of data from multiple sources can help reduce or
avoid redundancies and inconsistencies
• Some redundancies can be detected by correlation analysis, e.g., the
correlation coefficient for numeric data and the chi-square test for
nominal data (a numeric sketch follows below)
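A minimal sketch of correlation analysis for numeric data; the two price attributes and their values are hypothetical.

import numpy as np

# Hypothetical numeric attributes from two integrated sources.
price_usd = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
price_eur = np.array([ 9.1, 13.8, 18.3, 22.9, 27.6, 32.1])

# Correlation coefficient: a value near +1 or -1 suggests one attribute can
# largely be derived from the other, i.e. it may be redundant.
r = np.corrcoef(price_usd, price_eur)[0, 1]
print(round(r, 3))   # close to 1.0 for these values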
Normalization by Decimal Scaling
• Normalizes by moving the decimal point of
values of attribute A
• The number of decimal places moved depends on the maximum absolute
value of A
• A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1
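A minimal sketch of decimal scaling in Python; the attribute values are illustrative.

import numpy as np

# Hypothetical attribute A with values ranging from -986 to 917.
A = np.array([-986.0, 120.0, 340.0, 917.0])

# Decimal scaling: divide by 10^j, where j is the smallest integer such
# that the largest absolute normalized value is below 1.
j = int(np.ceil(np.log10(np.abs(A).max())))
A_scaled = A / 10 ** j
print(j, A_scaled)   # j = 3 -> [-0.986, 0.12, 0.34, 0.917]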
• When decision tree induction is used for attribute subset selection, a tree is constructed from the
given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of
attributes appearing in the tree form the reduced subset of attributes.
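A minimal sketch of this idea using scikit-learn's DecisionTreeClassifier on synthetic data; the data, depth limit, and class rule are assumptions made for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 4 candidate attributes, only the first two are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Induce a decision tree, then keep only the attributes it actually used;
# attributes absent from the tree are assumed irrelevant.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
print("reduced attribute subset:", used)   # typically [0, 1]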
Thank You