
DATA PREPROCESSING

By
R. Siva Narayana
D. Sri Lakshmi
Introduction
Real-world databases are highly susceptible to noisy, missing and inconsistent data.
Preprocessing techniques are used to improve the quality and efficiency of the data
and the quality of the mining results
• Data Cleaning: remove noise and correct the inconsistencies in
data
• Data Integration: merges data from multiple sources into
coherent data store such as a data warehouse.
• Data Reduction: reduce data size by performing aggregations,
eliminating redundant features and clustering.
• Data Transformations: where data are scaled to fall within a
smaller range, such as 0.0 to 1.0; this can improve the accuracy and
efficiency of mining algorithms involving distance measures.
Why Preprocess the Data?
• Data Quality: Data have quality if they satisfy
the requirements of the intended use.
• Many factors comprise data quality
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
Cont.,
• Several attributes of various tuples may have no recorded
values; such missing data reduces quality, leading to
reported errors, unusual values and
inconsistencies
• The data you wish to analyze by DM techniques may be
– Incomplete (lacking attribute values or containing only aggregate
data)
– Inaccurate or noisy (having incorrect attribute values that
deviate from the expected)
– Inconsistent (containing discrepancies in the dept. codes used to
categorize items)
Accuracy, Completeness and Consistency
• Inaccurate, incomplete and inconsistent data are common
properties of large databases and data warehouses
• Reasons:
– Data collection instruments used may be faulty
– Human errors or computer errors occurring at data entry
– Users may purposely submit incorrect data values for mandatory
fields when they do not wish to submit personal information (e.g., DoB)
– Errors in data transmission
– Technology limitations, such as limited buffer size for coordinating
synchronized data transfer and computation
– Incorrect data may also result from inconsistencies in naming
conventions or formats of input fields (e.g., dates)
Timeliness
• Timeliness also affects data quality
– AllElectronics updates sales details at the month
end
– Some sales managers fail to update before the month's
last day
– Updates made later also include corrections and
adjustments
– The fact that month-end data are not updated in a timely
fashion has a negative impact on data quality
Believability and Interpretability
• Believability reflects how much the data are
trusted by users/employees
• Interpretability reflects how easily the data are
understood
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Preprocessing
Data Cleaning
• Data cleaning routines attempt to
– Fill in missing values
– smooth out noise while identifying outliers
– correct inconsistencies in the data
– Resolve redundancy caused by data integration
Missing Values
1. Ignore the tuple (usually done when the class label is missing)
2. Fill in the missing value manually: time consuming and may not be
feasible for large datasets
3. Use a global constant to fill in the missing value, such as "Unknown"
or ∞
4. Use a measure of central tendency- mean for symmetric data
and median for skewed data
5. Use the attribute mean or median for all samples belonging to
the same class as the given tuple
6. Use the most probable value to fill in the missing value
– Determined with regression, inference based tools using a Bayesian
formalism or decision tree induction
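A minimal pandas sketch of options 3–5 above (not part of the slides); the column names cls and income are hypothetical, and option 6 is only described in a comment.

```python
import pandas as pd

df = pd.DataFrame({"cls": ["A", "A", "B", "B"],
                   "income": [30000, None, 52000, None]})

# Option 3: fill with a global constant
df["inc_const"] = df["income"].fillna(-1)

# Option 4: fill with a measure of central tendency (mean here; median for skewed data)
df["inc_mean"] = df["income"].fillna(df["income"].mean())

# Option 5: fill with the mean of all samples in the same class as the tuple
df["inc_class_mean"] = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

# Option 6 would train a model (e.g., regression or decision tree induction)
# on the complete tuples and predict each missing value.
print(df)
```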
Noisy Data
• Noise is a random error or variance in a
measured variable
– Boxplots and scatter plots are used to identify
outliers, which may represent noise
– Ex: for the attribute "price", we may have to smooth the
data to remove noise
Smoothing techniques
• Binning: smooths a sorted data value by
consulting its neighbourhood, i.e. the values
around it
• The sorted values are distributed into a number of
buckets, or bins, and then local
smoothing is performed
• Smoothing by bin means
• Smoothing by bin medians
• Smoothing by bin
boundaries (the min and max
values in a bin are the bin
boundaries; each value is replaced by the closest boundary)
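A small illustration of equal-frequency binning with smoothing by bin means and by bin boundaries, assuming NumPy is available; the sorted price list is illustrative.

```python
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(prices, 3)     # equal-frequency (equal-depth) bins of 3 values each

# Smooth by bin means: every value in a bin is replaced by the bin mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smooth by bin boundaries: every value is replaced by the closer of the bin min/max.
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(by_means)    # bin means: 9, 22, 29
print(by_bounds)   # each value snapped to its nearest bin boundary
```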
• Regression: a technique that conforms data
values to a function
• Linear regression involves finding the best
line to fit two attributes, so that one attribute
can be used to predict the other
• Multiple linear regression is an extension of
linear regression, where more than two
attributes are involved and the data are fit to a
multidimensional surface
• Outlier analysis: outliers may be detected by clustering
• Ex: similar values are organized into groups or
clusters; values that fall outside of the set of clusters
are considered outliers
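A sketch of clustering-based outlier detection using scikit-learn's KMeans (assumed installed); the one-dimensional values are made up, with 8.0 deliberately placed far from both tight groups.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [0.9], [1.1], [5.0], [5.1], [4.8], [5.2], [8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of each value to the centroid of its assigned cluster.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Values lying far from their cluster centroid are outlier candidates.
print(X[np.argmax(dist)])   # expected to be the stray value 8.0
```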
Data cleaning as a Process
• First step is discrepancy detection (similarity or
difference)
• Discrepancies can be caused by several factors
– Poorly designed data entry forms having many optional
fields
– Human errors in data entry
– Deliberate errors (e.g., respondents not wanting to give information)
– Data decay (e.g., outdated addresses)
– Errors in instrumentation devices that record data
– Errors introduced by the data integration process
• Data discrepancy detection
– Use metadata (knowledge about the data: data type
and domain of each attribute, acceptable values)
– Watch for inconsistent data
representations ("2020/01/03" vs. "03/01/2020")
– Field overloading is another error source; it results
when developers squeeze new attribute definitions
into unused portions of already defined attributes
(e.g., an unused bit of an attribute that uses 31 out of 32 bits)
• The data should also be examined for unique rules,
consecutive rules and null rules
– Unique rule: each value of the given attribute
must be different
– Consecutive rule: there can be no missing values
between the lowest and highest values, and all
values must also be unique
– Null rule: specifies the use of blanks, question
marks, special characters or other strings that may
indicate the null condition
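A pandas sketch of checking a unique rule and a null rule during discrepancy detection; the cust_id column and the placeholder strings treated as null markers are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"cust_id": ["C01", "C02", "C02", None, "?"]})

# Unique rule: each value of the attribute must be different.
duplicates = df[df["cust_id"].duplicated(keep=False) & df["cust_id"].notna()]

# Null rule: treat blanks, "?" and similar placeholder strings as the null condition.
null_markers = {"", "?", "N/A"}
nulls = df[df["cust_id"].isna() | df["cust_id"].isin(null_markers)]

print(duplicates)   # rows violating the unique rule
print(nulls)        # rows matching the null rule
```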
• Once we find data discrepancies, we need to define
and apply a series of transformations to correct them
• Commercial tools assist in the data transformation
step
• Data Migration tools: allow simple transformations
to be specified (replace gender by sex)
• ETL (Extraction/transformation/loading) tools:
allow users to specify transforms through a GUI
Data Integration
• The semantic heterogeneity and structure of
the data pose great challenges in the data
integration
– Entity identification problem – How can we match
schema and objects from different sources?
– Correlation tests on numeric data and nominal
data – measure the correlation between attributes
Entity Identification Problem
• A number of issues must be considered during data integration
– Schema integration and object matching can be tricky
• Matching equivalent real-world entities from multiple
data sources is known as the entity identification problem
– Ex: different representations and different scales, e.g., do
cust_id in one source and customer_id in another refer to the same attribute?
• Metadata for each attribute (name, meaning, datatype,
range of values permitted for the attribute, null rules for
handling blanks, zeros)
• Such metadata may also be used to help transform the data.
Redundancy and Correlation Analysis
• An attribute may be redundant if it can be derived from
another attribute.
• Careful integration of data from multiple sources can
help reduce or avoid redundancies

• Some redundancies can be detected by correlation analysis


• Given two attributes, correlation analysis can measure how
strongly one attribute implies the other, based on the available
data
– For Nominal data, we use Chi-Square test
– For Numeric attribute, we use Correlation Coefficient and Covariance

• Chi-Square statistic tests the hypothesis that A and B are
independent, i.e., there is no correlation b/w them
• χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij, where o_ij is the observed (actual) count and
e_ij the expected count of the joint event (A_i, B_j)
• The test is based on a significance level, with (r−1)×(c−1)
degrees of freedom.
• If the hypothesis can be rejected, then we say that A and B are
statistically correlated.

• In terms of a p-value and a chosen significance level (alpha), the
test can be interpreted as follows:
– If p-value <= alpha: significant result, reject null hypothesis (H0),
dependent.
– If p-value > alpha: not significant result, fail to reject null hypothesis
(H0), independent.
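A sketch of the chi-square independence test using SciPy (assumed available); the 2×2 contingency table of counts is illustrative only.

```python
from scipy.stats import chi2_contingency

# Observed counts for two nominal attributes, e.g. gender vs. preferred reading.
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.001
if p_value <= alpha:
    print(f"chi2={chi2:.1f}, p={p_value:.2g}: reject H0, attributes are correlated")
else:
    print(f"chi2={chi2:.1f}, p={p_value:.2g}: fail to reject H0, attributes are independent")
```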

Correlation Coefficient for Numeric Data

r(A,B) = Σ (a_i − Ā)(b_i − B̄) / (n · σ_A · σ_B)

where n is the number of tuples, Ā and B̄ are the means, and σ_A, σ_B are the
standard deviations of A and B; −1 <= r(A,B) <= +1

Covariance of Numeric Data

Cov(A,B) = E[(A − Ā)(B − B̄)] = (1/n) Σ (a_i − Ā)(b_i − B̄), so that
r(A,B) = Cov(A,B) / (σ_A · σ_B)
• If COV(A,B) is greater than 0, then A and B are
Positively Correlated (values of A increase as
the values of B increase)
• If the resulting value = 0, then A and B are
independent, and there is no correlation
between them
• If the resulting value is < 0, then A and B are
Negatively correlated (values of A increase as
the values of B decrease)
Example

• If the stocks are affected by the same industry
trends, will their prices rise or fall together?
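A NumPy sketch computing the covariance and correlation coefficient of two hypothetical weekly stock-price series (the values are made up), matching the interpretation above.

```python
import numpy as np

stock_a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])    # hypothetical weekly prices of stock A
stock_b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])  # hypothetical weekly prices of stock B

cov = np.mean((stock_a - stock_a.mean()) * (stock_b - stock_b.mean()))
r = np.corrcoef(stock_a, stock_b)[0, 1]

print(f"Cov(A,B) = {cov:.2f}")   # > 0: the prices tend to rise and fall together
print(f"r(A,B)   = {r:.2f}")     # correlation coefficient in [-1, 1]
```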
Tuple Duplication
• The use of denormalized tables (often done to
improve performance by avoiding joins) is
another source of data redundancy
• Inconsistencies often arise between various
duplicates, due to inaccurate data entry or
updating some but not all data occurrences.
DATA TRANSFORMATION
• Smoothing: To remove noise from the data (Techniques include binning, regression
and clustering)
• Attribute construction (or feature construction): New attributes are constructed
from the given set of attributes to help the mining process.
• Aggregation: Aggregation operations are applied to the data. Ex- the daily sales
data may be aggregated so as to compute monthly and annual total amounts.
• Normalization: The attribute data are scaled so as to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
• Discretization: The raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized into higher-level concepts,
resulting in a concept hierarchy for the numeric attribute.
• Concept hierarchy generation for nominal data: Attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for
nominal attributes are implicit within the database schema and can be
automatically defined at the schema definition level.
Data Transformation by Normalization
• The measurement unit used can affect the analysis; to avoid
dependence on the choice of units, the data should be normalized
• Normalization transforms the data to fall within a smaller range such
as [-1,1] or [0.0,1.0]
• Normalization is particularly useful for classification
algorithms involving neural networks or distance
measurements such as clustering
• Data Normalization methods
– Min-max normalization
– Z-score normalization
– Normalization by Decimal Scaling
Min-Max Normalization

v  minA
v'  (new _ maxA  new _ minA)  new _ minA
maxA  minA

73,600  12,000
(1.0  0)  0  0.716
98,000  12,000
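A minimal sketch of min-max normalization reproducing the 0.716 result above, assuming NumPy is available.

```python
import numpy as np

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))                     # 0.716
print(min_max(np.array([12_000, 55_000, 98_000]), 12_000, 98_000))   # [0.  0.5 1. ]
```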
Z-Score (Zero-Mean) Normalization

v' = (v − Ā) / σ_A

where Ā is the mean and σ_A the standard deviation of attribute A
Normalization by Decimal Scaling

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1
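A short sketch of z-score normalization and decimal scaling, assuming NumPy; the attribute values are made up.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Z-score: subtract the attribute mean and divide by its standard deviation.
z = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
scaled = values / 10 ** j

print(z.round(2))   # zero mean, unit variance
print(scaled)       # [0.02 0.03 0.04 0.06 0.1 ]
```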
Data Reduction
• To obtain a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results.
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis and mining may take a very long time to run on the
complete data set.
• Data reduction strategies
– Dimensionality reduction, e.g., reduces the no.of random variables or
attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction, Replace the original data volume by alternative or
smaller forms
• Parametric methods: Regression and Log-Linear Models
• Non – parametric methods: Histograms, clustering, sampling
– Data compression:
• Lossy
• Lossless
Wavelet Transforms
• The discrete wavelet transform (DWT) is a linear
signal processing technique that, when applied to
a data vector X, transforms it to a numerically
different vector, X’, of wavelet coefficients. The
two vectors are of the same length. When
applying this technique to data reduction, we
consider each tuple as an n-dimensional data
vector, that is, X =(x1,x2, … ,xn), depicting n
measurements made on the tuple from n
database attributes
• “How can this technique be useful for data reduction if the wavelet
transformed data are of the same length as the original data?”
• The usefulness lies in the fact that the wavelet transformed data can be
truncated.
• A compressed approximation of the data can be retained by storing only a
small fraction of the strongest of the wavelet coefficients.
• Ex: all wavelet coefficients larger than some user-specified threshold can be
retained. All other coefficients are set to 0.
– The resulting data representation is therefore very sparse, so that operations that
can take advantage of data sparsity are computationally very fast if performed in
wavelet space.
• The technique also works to remove noise without smoothing out the main
features of the data, making it effective for data cleaning as well.
• Given a set of coefficients, an approximation of the original data can be
constructed by applying the inverse of the DWT used.
• DWT achieves better lossy compression.
• There are several families of DWT
– Haar
– Daubechies-4
– Daubechies-6
• DWT uses a hierarchical pyramid algorithm
that halves the data at each iteration,
resulting in fast computational speed
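A sketch of DWT-based reduction using the PyWavelets package (assumed installed): transform a data vector, truncate the weak coefficients, and reconstruct an approximation with the inverse transform. The signal and threshold are illustrative.

```python
import numpy as np
import pywt

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])   # one tuple viewed as a data vector

coeffs = pywt.wavedec(x, "haar")                 # hierarchical (pyramid) Haar DWT
arr, slices = pywt.coeffs_to_array(coeffs)

threshold = 1.0                                  # user-specified threshold
arr[np.abs(arr) < threshold] = 0.0               # truncate weak coefficients -> sparse

approx = pywt.waverec(pywt.array_to_coeffs(arr, slices, output_format="wavedec"), "haar")
print(approx.round(2))                           # lossy approximation of x
```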
Hierarchical Pyramidal Algorithm

Principal Components Analysis
(Karhunen-Loeve, or K-L, method)
• Suppose that the data to be reduced consist of tuples or data vectors
described by n attributes or dimensions.
• Principal components analysis searches for k n-dimensional orthogonal
vectors that can best be used to represent the data, where k <= n.
• The original data are thus projected onto a much smaller space, resulting
in dimensionality reduction.
• Unlike attribute subset selection which reduces the attribute set size by
retaining a subset of the initial set of attributes, PCA “combines” the
essence of attributes by creating an alternative, smaller set of variables.
• The initial data can then be projected onto this smaller set. PCA often
reveals relationships that were not previously suspected and thereby
allows interpretations that would not ordinarily result.
PCA Procedure
• The input data are normalized, so that each attribute falls within the same range.
This step helps ensure that attributes with large domains will not dominate
attributes with smaller domains.
• PCA computes k orthonormal vectors that provide a basis for the normalized input
data. These are unit vectors that each point in a direction perpendicular to the
others. These vectors are referred to as the principal components. The input data
are a linear combination of the principal components.
• The principal components are sorted in order of decreasing “significance” or
strength. The principal components essentially serve as a new set of axes for the
data, providing important information about variance. That is, the sorted axes are
such that the first axis shows the most variance among the data, the second axis
shows the next highest variance, and so on.
• Because the components are sorted in decreasing order of “significance,” the data
size can be reduced by eliminating the weaker components, that is, those with low
variance. Using the strongest principal components, it should be possible to
reconstruct a good approximation of the original data.
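A scikit-learn sketch of this procedure (normalize the input, compute the components, keep the k strongest), assuming scikit-learn is installed; the data is randomly generated for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 tuples, n = 5 attributes

X_norm = StandardScaler().fit_transform(X)       # step 1: normalize the input data
pca = PCA(n_components=2).fit(X_norm)            # keep the k = 2 strongest components
X_reduced = pca.transform(X_norm)                # project onto the smaller space

print(X_reduced.shape)                           # (100, 2)
print(pca.explained_variance_ratio_)             # variance captured by each component
```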
PCA - Advantages
• PCA can be applied to ordered and unordered attributes,
and can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can
be handled by reducing the problem to two dimensions.
• Principal components may be used as inputs to multiple
regression and cluster analysis.
• In comparison with wavelet transforms, PCA tends to be
better at handling sparse data, whereas wavelet
transforms are more suitable for data of high
dimensionality.
Attribute Subset Selection
• Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or
dimensions).
• The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting
probability distribution of the data classes is as close as
possible to the original distribution obtained using all
attributes.
• It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns easier
to understand.

Greedy(Heuristic) methods for Attribute Subset
Selection
• Stepwise forward selection:
– starts with an empty set of attributes as the reduced set.
– Best of the original attributes is determined and added to the reduced set.
– At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
• Stepwise backward elimination:
– Starts with the full set of attributes.
– At each step, it removes the worst attribute remaining in the set.
• Combination of forward selection and backward elimination:
– The stepwise forward selection and backward elimination methods can be combined
– At each step, the procedure selects the best attribute and removes the worst from among the remaining
attributes.
• Decision tree induction:
– Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
– Decision tree induction constructs a flow chart like structure where each internal (nonleaf) node denotes a
test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to partition the data into individual classes.

• When decision tree induction is used for attribute subset selection, a tree is constructed from the
given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of
attributes appearing in the tree form the reduced subset of attributes.
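A brief stepwise selection sketch built on scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the dataset and estimator are illustrative choices, not prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Greedy stepwise forward selection: start with an empty set and
# repeatedly add the best of the remaining attributes.
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=5,
    direction="forward",          # use "backward" for stepwise backward elimination
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the retained attributes
```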
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
– Ex.: Log-linear models—obtain the value at a point in m-D
space as the product of appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear
function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions

Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more
independent variables (a.k.a. explanatory
variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
(Figure: data points plotted against x and y with a fitted line y = x + 1; Y1' is the value predicted for X1.)
Regression Analysis and Log-Linear Models
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– The least squares criterion is applied to the known values of Y1, Y2, …, X1, X2,
….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– Approximate discrete multidimensional probability distributions
– Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
– Useful for dimensionality reduction and data smoothing
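A NumPy sketch of parametric reduction with linear regression: fit y = wx + b by least squares and store only the two coefficients instead of the raw points (the data is made up).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

w, b = np.polyfit(x, y, deg=1)        # least-squares estimates of the line y = w*x + b
print(f"store only w={w:.2f}, b={b:.2f}")

y_hat = w * x + b                     # the data can be approximated from the parameters alone
print(np.round(y_hat, 2))
```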
Histogram Analysis
• Divide data into buckets and
store the average (or sum) for each
bucket
• Partitioning rules:
– Equal-width: equal bucket
range
– Equal-frequency (or equal-
depth)
(Figure: example histogram of prices with buckets between 10,000 and 90,000.)
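A NumPy sketch contrasting equal-width and equal-frequency (equal-depth) bucketing of a small price list; the values are made up.

```python
import numpy as np

prices = np.array([5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30, 30, 45, 60])

# Equal-width: buckets span equal ranges of the attribute.
counts, edges = np.histogram(prices, bins=3)
print(edges.round(1), counts)

# Equal-frequency: each bucket holds roughly the same number of values.
depth_edges = np.quantile(prices, [0, 1 / 3, 2 / 3, 1])
print(depth_edges.round(1), np.histogram(prices, bins=depth_edges)[0])
```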
Clustering
• Partition the data set into clusters based on similarity, and store only the
cluster representation (e.g., centroid and diameter)
• Can be very effective if the data is clustered, but not if the data is
"smeared"
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
• Cluster analysis will be studied in depth in Chapter 10
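A scikit-learn sketch of clustering as numerosity reduction: keep only each cluster's centroid and an approximate diameter instead of the raw points (random data, assuming scikit-learn is installed).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])  # 150 points

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
centroids = km.cluster_centers_
# Approximate each cluster's diameter as twice the largest distance to its centroid.
diameters = [np.linalg.norm(X[km.labels_ == i] - c, axis=1).max() * 2
             for i, c in enumerate(centroids)]

# Store 3 centroids + 3 diameters instead of 150 raw points.
print(centroids.round(2), np.round(diameters, 2))
```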
Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance
in the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
– Used in conjunction with skewed data
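A pandas sketch of the sampling variants above; the region and sales columns are hypothetical and the sample sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "N", "N", "S", "S", "E", "E"],
                   "sales": [10, 12, 9, 11, 30, 28, 50, 55]})

srswor = df.sample(n=3, replace=False, random_state=0)   # simple random sample without replacement
srswr = df.sample(n=3, replace=True, random_state=0)     # simple random sample with replacement

# Stratified sample: draw ~50% from each region, so every stratum stays represented.
stratified = df.groupby("region").sample(frac=0.5, random_state=0)

print(srswor, srswr, stratified, sep="\n\n")
```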

Sampling: With or without Replacement
(Figure: raw data sampled by SRSWOR, simple random sampling without
replacement, and by SRSWR, simple random sampling with replacement.)
Sampling: Cluster or Stratified Sampling
(Figure: raw data reduced to a cluster sample and a stratified sample.)
Data Cube Aggregation
• The lowest level of a data cube (base cuboid)
– The aggregated data for an individual entity of interest
– E.g., a customer in a phone calling data warehouse
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the
task
• Queries regarding aggregated information should be answered
using data cube, when possible
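A pandas sketch of climbing from a base cuboid (per-branch, per-quarter sales) to higher aggregation levels; the columns and values are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "year": [2022, 2022, 2022, 2023, 2023],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2"],
    "branch": ["A", "B", "A", "A", "B"],
    "amount": [400, 310, 520, 450, 600],
})

# Base cuboid: (year, quarter, branch); higher levels aggregate it further.
per_quarter = sales.groupby(["year", "quarter"], as_index=False)["amount"].sum()
per_year = sales.groupby("year", as_index=False)["amount"].sum()

print(per_quarter)   # quarterly totals
print(per_year)      # annual totals: the smallest representation that answers yearly queries
```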
Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible
without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequences (which are not audio)
– Typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be
considered as forms of data compression
Data Compression
(Figure: original data is transformed into compressed data; lossless
compression reconstructs the original data exactly, while lossy compression
yields only an approximation of the original data.)
Thank You
