Data Preprocessing

Why Preprocess the Data?


Today’s real-world databases are highly susceptible to
noisy, missing, and inconsistent data due to their typically
huge size and their likely origin from multiple,
heterogenous sources.

Low-quality data will lead to low-quality mining results.


Major Tasks in Data Preprocessing

● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning
Data cleaning routines
attempt to fill in missing
values, smooth out noise
while identifying outliers,
and correct inconsistencies
in the data.
Methods for Filling in Missing Values
Ignore the tuple
This method is not very effective, unless the tuple contains
several attributes with missing values. It is especially
poor when the percentage of missing values per attribute
varies considerably.

By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.
Fill in the missing value manually
In this approach, you find each missing value and manually enter it into the database to fix the tuple.

In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value
Replace all missing attribute values by the same constant
such as a label like “Unknown” or −∞.

If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common. Hence, although this method is simple, it is not foolproof.
Use a measure of central tendency for the attribute
In this method, we find the central tendency of all the data for the attribute and replace the missing value with it.

For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.
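A minimal sketch of this strategy, assuming pandas is available and using a hypothetical numeric attribute income:

```python
import pandas as pd

# Hypothetical attribute with missing values; income is typically skewed,
# so the median is the safer choice here.
df = pd.DataFrame({"income": [30_000, 45_000, None, 52_000, 1_000_000, None]})

mean_filled = df["income"].fillna(df["income"].mean())      # symmetric data
median_filled = df["income"].fillna(df["income"].median())  # skewed data
print(median_filled.tolist())
```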
Use the attribute mean or median for all samples
belonging to the same class as the given tuple
In this method, we once again use a measure of central tendency to obtain a value that we can substitute for the missing one.

The difference is that we compute the measure of central tendency only over the tuples belonging to the same class (category) as the given tuple.
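A minimal sketch of this class-conditional variant, again assuming pandas and hypothetical credit_risk / income attributes:

```python
import pandas as pd

# Fill missing income with the median of the tuple's own class
# (credit_risk) rather than the global median.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "low", "high", "high", "high"],
    "income": [50_000, None, 55_000, 20_000, 22_000, None],
})

df["income"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
```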
Use the most probable value to fill in the missing value
In this method, we predict each missing value using an algorithm and fill it in accordingly.

This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
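A minimal sketch of predictive filling, here using decision tree induction from scikit-learn on a hypothetical age/income table (regression or Bayesian tools could be substituted):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Train on the complete tuples, then predict the missing income values from age.
df = pd.DataFrame({
    "age": [23, 35, 45, 52, 29, 61],
    "income": [28_000, 52_000, None, 75_000, None, 80_000],
})

known = df[df["income"].notna()]
unknown = df[df["income"].isna()]

model = DecisionTreeRegressor().fit(known[["age"]], known["income"])
df.loc[df["income"].isna(), "income"] = model.predict(unknown[["age"]])
print(df)
```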
Dealing with
Noisy Data
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.

The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
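A minimal sketch of equal-frequency binning with smoothing by bin means, on illustrative price values:

```python
import numpy as np

# Sorted values partitioned into 3 equal-frequency bins of 3 values each;
# every value is then replaced by its bin mean (local smoothing).
prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.split(prices, 3)

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```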
Binning
Regression
Data smoothing can also be done by regression, a technique
that conforms data values to a function.

Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.

Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
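A minimal sketch of smoothing by simple linear regression, where a hypothetical noisy attribute y is replaced by the values predicted from x:

```python
import numpy as np

# Fit the "best" straight line through (x, y) and use its predictions
# as the smoothed version of y.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # noisy observations

w, b = np.polyfit(x, y, deg=1)   # slope and intercept of the fitted line
y_smoothed = w * x + b
print(y_smoothed)
```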
Outlier analysis
Outliers may be detected by
clustering, for example,
where similar values are
organized into groups, or
“clusters.” Intuitively,
values that fall outside of
the set of clusters may be
considered outliers.
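A minimal sketch of clustering-based outlier detection using k-means from scikit-learn; the data values and the distance cutoff are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the points, then flag any point that lies far from its
# assigned cluster centroid as a candidate outlier.
values = np.array([[1.0], [1.1], [0.9], [1.05],
                   [8.0], [8.2], [7.9], [8.1], [4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
dist = np.linalg.norm(values - km.cluster_centers_[km.labels_], axis=1)

print(values[dist > 2.0])   # illustrative cutoff; flags the value 4.5
```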
Fixing Data
Inconsistencies
Discrepancy Detection Tools
Data scrubbing tools use simple domain knowledge (e.g.,
knowledge of postal addresses and spell-checking) to detect
errors and make corrections in the data.

Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions.
Data Transformation Tools
Data migration tools allow simple transformations to be
specified such as to replace the string “gender” by “sex”.

ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).
Data Integration
Data Integration

● Entity Identification Problem
● Redundancy and Correlation Analysis
● Tuple Duplication
● Data Value Conflict Detection and Resolution
Data Reduction
Data Reduction
Data reduction is used to obtain a reduced representation of
the data set that is much smaller in volume, yet closely
maintains the integrity of the original data.

That is, mining on the reduced data set should be more efficient yet produce the same analytical results.
Data Reduction Strategies
Dimensionality Reduction:
● Wavelet Transforms
● Principal Components Analysis
● Attribute Subset Selection

Numerosity Reduction:
● Regression
● Histograms
● Clustering
● Sampling
● Data Cube Aggregation
Data Reduction
Dimensionality reduction is the process of reducing the
number of random variables or attributes under
consideration.

Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
Dimensionality
Reduction
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients.

When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
Applying a Discrete Wavelet Transform
1. The length, L, of the input data vector must be an
integer power of 2. This condition can be met by padding
the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first
applies some data smoothing, such as a sum or weighted
average. The second performs a weighted difference, which
acts to bring out the detailed features of the data.
Applying a Discrete Wavelet Transform
3. The two functions are applied to pairs of data points in
X, that is, to all pairs of measurements (x2i,x2i+1).
This results in two data sets of length L/2.
4. The two functions are recursively applied to the data
sets obtained in the previous loop, until the resulting
data sets obtained are of length 2.
5. Selected values from the data sets obtained in the
previous iterations are designated the wavelet
coefficients of the transformed data.
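A minimal sketch of one level of this procedure using the Haar wavelet, assuming the third-party PyWavelets package (pywt) is installed:

```python
import numpy as np
import pywt  # PyWavelets, assumed installed (pip install PyWavelets)

# Pad the 7-value vector with zeros so its length is a power of 2 (L = 8),
# then split it into smoothed (approximation) and detail coefficients.
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0])
x = np.pad(x, (0, 1))

approx, detail = pywt.dwt(x, "haar")   # weighted averages and differences
print(approx)   # length L/2 smoothed coefficients
print(detail)   # length L/2 detail coefficients
```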
Principal Components Analysis
Principal components analysis searches for k n-dimensional
orthogonal vectors that can best be used to represent the
data, where k ≤ n.
The original data are thus projected onto a much smaller
space, resulting in dimensionality reduction.
PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.
Principal Components Analysis Procedure
1. The input data are normalized, so that each attribute
falls within the same range. This step helps ensure that
attributes with large domains will not dominate
attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis
for the normalized input data. These are unit vectors
that each point in a direction perpendicular to the
others. These vectors are referred to as the principal
components. The input data are a linear combination of
the principal components.
Principal Components Analysis Procedure
3. The principal components are sorted in
order of decreasing “significance” or
strength. The principal components
essentially serve as a new set of axes
for the data, providing important
information about variance. That is,
the sorted axes are such that the
first axis shows the most variance
among the data, the second axis shows
the next highest variance, and so on.
Principal Components Analysis Procedure
4. Because the components are sorted in decreasing order of
“significance,” the data size can be reduced by
eliminating the weaker components, that is, those with
low variance. Using the strongest principal components,
it should be possible to reconstruct a good approximation
of the original data.
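A minimal sketch of this procedure with scikit-learn on hypothetical data: normalize the attributes, keep k = 2 principal components, and inspect how much variance each captures:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # 100 tuples, n = 5 attributes

X_norm = StandardScaler().fit_transform(X)   # step 1: normalization
pca = PCA(n_components=2).fit(X_norm)        # steps 2-4: keep k = 2 components
X_reduced = pca.transform(X_norm)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each new axis
```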
Attribute Subset Selection
Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or dimensions).

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
Attribute Subset Selection
For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase.

Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection.
Heuristic Methods Techniques
1. Stepwise forward selection: The procedure starts with an
empty set of attributes as the reduced set. The best of
the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the
set.
2. Stepwise backward elimination: The procedure starts with
the full set of attributes. At each step, it removes the
worst attribute remaining in the set.
Heuristic Methods Techniques
3. Combination of forward selection and backward
elimination: The stepwise forward selection and backward
elimination methods can be combined so that, at each
step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
Heuristic Methods Techniques
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes.
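A minimal sketch of stepwise forward selection, using the cross-validated accuracy of a decision tree on the Iris data as one possible attribute "goodness" measure (scikit-learn assumed available):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []
best_prev = 0.0

while remaining:
    # Score each candidate attribute when added to the current subset.
    scores = [
        (cross_val_score(DecisionTreeClassifier(random_state=0),
                         X[:, selected + [a]], y, cv=5).mean(), a)
        for a in remaining
    ]
    best_score, best_attr = max(scores)
    if best_score <= best_prev:       # stop when no attribute improves the score
        break
    selected.append(best_attr)
    remaining.remove(best_attr)
    best_prev = best_score

print(selected)   # indices of the chosen attribute subset
```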
Numerosity
Reduction
Linear Regression Models
Regression and log-linear
models can be used to
approximate the given data.
In (simple) linear
regression, the data are
modeled to fit a straight
line.
Simple Linear Regression
For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation y = wx + b, where the variance of y is assumed to be constant.

The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively.
Multiple linear regression
Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
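A minimal sketch with scikit-learn on hypothetical data: once the model is fit, only the regression coefficients and intercept need to be stored as the reduced, parametric representation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # two predictor attributes
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)          # the stored parameters
```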
Histograms
Histograms use binning to approximate data distributions and
are a popular form of data reduction.
Partitioning Rules
● Equal-width: In an equal-width histogram, the width of
each bucket range is uniform.
● Equal-frequency (or equal-depth): In an equal-frequency
histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant.
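A minimal sketch contrasting the two partitioning rules with pandas on illustrative price values:

```python
import pandas as pd

prices = pd.Series([1, 1, 5, 5, 5, 8, 8, 10, 10, 12,
                    14, 14, 15, 18, 20, 21, 25, 28, 30])

equal_width = pd.cut(prices, bins=3)    # uniform bucket width
equal_freq = pd.qcut(prices, q=3)       # roughly equal count per bucket

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```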
Clustering
Clustering techniques consider data tuples as objects.

They partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters.
Clustering
The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.

Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
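A minimal sketch of both quality measures for one illustrative 2-D cluster (numpy and scipy assumed available):

```python
import numpy as np
from scipy.spatial.distance import pdist

cluster = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [1.2, 1.8]])

diameter = pdist(cluster).max()          # maximum pairwise distance
centroid = cluster.mean(axis=0)
centroid_distance = np.linalg.norm(cluster - centroid, axis=1).mean()

print(diameter, centroid_distance)
```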
Sampling
Sampling can be used as a data reduction technique because
it allows a large data set to be represented by a much
smaller random data sample (or subset).

Suppose that a large dataset, D, contains N tuples.


Common ways to sample for data reduction
Simple random sample without replacement (SRSWOR) of size s:

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
Common ways to sample for data reduction
Simple random sample with replacement (SRSWR) of size s:

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
SRSWOR / SRSWR Example
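A minimal sketch of both schemes with pandas, on a hypothetical data set D of N = 1000 tuples and a sample size s = 100:

```python
import pandas as pd

D = pd.DataFrame({"value": range(1000)})

srswor = D.sample(n=100, replace=False, random_state=0)   # without replacement
srswr = D.sample(n=100, replace=True, random_state=0)     # with replacement

print(len(srswor), srswor.index.is_unique)   # 100, always unique tuples
print(len(srswr), srswr.index.is_unique)     # 100, duplicates are possible
```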
Common ways to sample for data reduction
Cluster sample:

If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained, where s < M.
Common ways to sample for data reduction
Stratified sample:

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed.
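A minimal sketch of a stratified sample with pandas (1.1+), drawing an SRS of 10% within each stratum of a hypothetical, skewed customer_type attribute:

```python
import pandas as pd

D = pd.DataFrame({
    "customer_type": ["regular"] * 900 + ["premium"] * 100,
    "spend": range(1000),
})

stratified = D.groupby("customer_type").sample(frac=0.1, random_state=0)
print(stratified["customer_type"].value_counts())   # 90 regular, 10 premium
```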
Data Cube Aggregation
Data cubes store
multidimensional aggregated
information. Each cell holds
an aggregate data value,
corresponding to the data
point in multidimensional
space.
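A minimal sketch of the idea using a pandas pivot table as a small, two-dimensional "cube" over hypothetical sales data; each cell holds total sales for one year x branch combination:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023],
    "branch": ["A", "B", "A", "B", "B"],
    "amount": [100, 150, 120, 180, 60],
})

cube = sales.pivot_table(values="amount", index="year",
                         columns="branch", aggfunc="sum")
print(cube)
```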
Data
Transformation
Data Transformation Strategies
● Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
● Attribute construction (or feature construction), where
new attributes are constructed and added from the given
set of attributes to help the mining process.
● Aggregation, where summary or aggregation operations are
applied to the data. This step is typically used in
constructing a data cube for data analysis at multiple
abstraction levels.
Data Transformation Strategies
● Normalization, where the attribute data are scaled so as
to fall within a smaller range, such as −1.0 to 1.0, or
0.0 to 1.0.
● Discretization, where the raw values of a numeric
attribute (e.g., age) are replaced by interval labels
(e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior). The labels, in turn, can be
recursively organized into higher-level concepts,
resulting in a concept hierarchy for the numeric
attribute.
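A minimal sketch of the normalization and discretization strategies with pandas, on a hypothetical age attribute (the interval boundaries are illustrative):

```python
import pandas as pd

age = pd.Series([5, 16, 23, 35, 48, 67, 80])

# Min-max normalization to the range [0.0, 1.0].
normalized = (age - age.min()) / (age.max() - age.min())

# Discretization into conceptual labels via illustrative interval boundaries.
labels = pd.cut(age, bins=[0, 17, 64, 120],
                labels=["youth", "adult", "senior"])

print(normalized.round(2).tolist())
print(labels.tolist())
```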
Data Transformation Strategies
● Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to
higher-level concepts, like city or country. Many
hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at
the schema definition level.
Key Points
Key Points
● Data cleaning routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data. Data cleaning is usually
performed as an iterative two-step process consisting of
discrepancy detection and data transformation.
Key Points
● Data integration combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration.
Key Points
● Data reduction techniques obtain a reduced representation
of the data while minimizing the loss of information
content. These include methods of dimensionality
reduction, numerosity reduction, and data compression.
● Dimensionality reduction reduces the number of random
variables or attributes under consideration.
● Numerosity reduction methods use parametric or nonparametric models to obtain smaller representations of the original data.
Key Points
● Data transformation routines convert the data into
appropriate forms for mining. For example, in
normalization, attribute data are scaled so as to fall
within a small range such as 0.0 to 1.0. Other examples
are data discretization and concept hierarchy generation.
The End
