DWDM Unit II
UNIT –II
Data preprocessing
Introduction
Data preprocessing describes any type of processing performed on raw data to prepare it for
another processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively processed for
the purpose of the user.
2.3.1 Missing values:
1. Ignore the tuple: This is usually done when the class label is missing.
2. Fill in the missing values manually: This approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "unknown" or −∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income
of customers is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit-risk category as that of the given tuple (strategies 4 and 5 are sketched in code after this list).
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in the data set, a decision tree can be constructed to predict the missing values for income.
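The sketch below is a minimal illustration (not part of the original notes) of strategies 4 and 5, assuming a small pandas DataFrame with hypothetical columns "income" and "credit_risk":

# Minimal sketch: fill missing income with the overall mean (strategy 4)
# and with the mean of the tuple's own class (strategy 5).
# Column names and values are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [45000.0, np.nan, 30000.0, np.nan, 67000.0],
})

# Strategy 4: replace every missing income with the attribute mean.
overall_mean = df["income"].mean()
df["income_filled_mean"] = df["income"].fillna(overall_mean)

# Strategy 5: replace a missing income with the mean income of tuples
# in the same credit-risk category.
df["income_filled_class_mean"] = df.groupby("credit_risk")["income"] \
                                   .transform(lambda s: s.fillna(s.mean()))

print(df)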
2.3.2 Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used to remove such noise.
Several Data smoothing techniques:
1 Binning methods:
Binning methods smooth a sorted data value by consulting its "neighborhood", or the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In this technique:
Step 1: The data are first sorted.
Step 2: The sorted values are partitioned into equal-frequency bins.
Step 3: The arithmetic mean of each bin is calculated.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin (a code sketch follows the bins below).
Bin 1: 14, 14, 14 Bin 2: 18, 18, 18 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26, 26, 26 Bin 6: 33, 33, 33
Bin 7: 35, 35, 35 Bin 8: 40, 40, 40 Bin 9: 56, 56, 56
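A minimal sketch of smoothing by (equal-frequency) bin means, using an illustrative data set rather than the one behind the bins above:

# Minimal sketch of smoothing by bin means with equal-frequency bins.
# The input values are illustrative.
def smooth_by_bin_means(values, bin_size):
    data = sorted(values)                      # Step 1: sort the data
    smoothed = []
    for i in range(0, len(data), bin_size):    # Step 2: partition into equal-frequency bins
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)   # Step 3: arithmetic mean of the bin
        smoothed.extend([round(mean, 2)] * len(bin_vals))  # Step 4: replace values by the mean
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]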
2. Clustering:
Outliers in the data may be detected by clustering, where similar values are organized into
groups, or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers.
3. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.
2.3.3 Inconsistent Data
Inconsistencies may exist in the data stored in the transaction database. They can arise during data entry, from functional dependencies between attributes, or from missing values.
The inconsistencies can be detected and corrected either manually or by knowledge engineering tools.
2.3.4 Data cleaning as a process
Data cleaning is a two-step process:
1. Discrepancy detection
2. Data transformations
1. Discrepancy detection
Data auditing tools – analyze the data to discover rules and relationships, and detect data that violate such conditions.
2. Data transformations:
This is the second step in data cleaning as a process. After detecting discrepancies, we need to
define and apply (a series of) transformations to correct them.
Data Transformations Tools:
Data migration tools – allow simple transformations to be specified, such as replacing the string "gender" by "sex".
The correlation between two numeric attributes A and B can be measured by the correlation coefficient
r(A,B) = Σ (ai − Ā)(bi − B̄) / (n σA σB)
where
- n is the number of tuples
- Ā is the mean value of A
- B̄ is the mean value of B
- σA is the standard deviation of A
- σB is the standard deviation of B
If
r(A,B) > 0, then A and B are positively correlated.
r(A,B) < 0, then A and B are negatively correlated.
r(A,B) = 0, then there is no correlation between A and B.
For categorical (discrete) attributes, a correlation relationship between A and B can be discovered by a χ² (chi-square) test:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / n
where n is the number of data tuples.
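As a rough illustration (with made-up values, since the original worked example is not reproduced here), the sketch below computes r(A,B) for two numeric attributes and the χ² statistic for a 2x2 contingency table:

# Minimal sketch: correlation coefficient for numeric attributes and
# chi-square for categorical attributes. All data values are illustrative.
import numpy as np

# r(A,B) = sum((a - mean_A)(b - mean_B)) / (n * sigma_A * sigma_B)
A = np.array([2.0, 4.0, 6.0, 8.0])
B = np.array([1.0, 3.0, 5.0, 9.0])
n = len(A)
r = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())
print("r(A,B) =", round(r, 3))            # > 0, so A and B are positively correlated

# Chi-square from observed counts o_ij and expected counts
# e_ij = count(A = a_i) * count(B = b_j) / n
observed = np.array([[250, 200],          # rows: categories of A
                     [ 50, 1000]])        # columns: categories of B
total = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
chi2 = np.sum((observed - expected) ** 2 / expected)
print("chi-square =", round(chi2, 1))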
Normalization: An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0. The following methods are commonly used.
1. Min-max normalization: performs a linear transformation on the original data. Suppose minA and maxA are the minimum and maximum values of attribute A. Min-max normalization maps a value v of A to v' in the new range [new_minA, new_maxA] by computing
v' = ((v − minA) / (maxA − minA)) (new_maxA − new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
Example: Given the one-dimensional data set X = {−5.0, 23.0, 17.6, 9.23, 1.11}, normalize the data set using
(a) Min-max normalization on interval [0,1],
(b) Min-max normalization on interval [-1,1],
(c) Standard deviation normalization.
a) Min-max normalization on [0, 1] (parts (a)-(c) are worked out in the code sketch following the decimal-scaling example below).
2. Z-score normalization: In z-score (zero-mean) normalization, the values for attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing
v' = (v − meanA) / stand_devA
This method is useful when the actual minimum and maximum values of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: Suppose that the mean and standard deviation of the values for the attribute income are $52,000 and $14,000, respectively. With z-score normalization, a value of $72,000 for income is transformed to (72,000 − 52,000) / 14,000 = 1.43.
3. Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Example: Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
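A minimal sketch of the three normalization methods, assuming the example data set above is X = {−5.0, 23.0, 17.6, 9.23, 1.11}; it works parts (a)-(c) of that example and also applies decimal scaling to X:

# Minimal sketch of min-max, z-score, and decimal-scaling normalization.
import numpy as np

X = np.array([-5.0, 23.0, 17.6, 9.23, 1.11])   # assumed reading of the example data

# (a)/(b) Min-max normalization onto [new_min, new_max]
def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# (c) Z-score (standard deviation) normalization: v' = (v - mean) / std
def z_score(v):
    return (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest j with max(|v'|) < 1
def decimal_scaling(v):
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

print(min_max(X))             # values mapped to [0, 1]
print(min_max(X, -1.0, 1.0))  # values mapped to [-1, 1]
print(z_score(X))
print(decimal_scaling(X))     # here max |x| = 23, so j = 2 and we divide by 100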
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of
a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (e.g., regression and log-linear models) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels.
2. Lattice of cuboids- Data cubes created for varying levels of abstraction are often referred to
as cuboids.
The following database consists of sales per quarter for the years 1997-1999.
Suppose the analyst is interested in the annual sales rather than sales per quarter; the above data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
Lossless - If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.
Lossy - If only an approximation of the original data can be reconstructed from the compressed data (i.e., some information is lost), the data reduction is called lossy.
1. Wavelet compression is a form of data compression well suited for image compression.
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied
to a data vector D, transforms it to a numerically different vector, D', of wavelet coefficients.
The general algorithm for a discrete wavelet transform is as follows.
1. The length, L, of the input data vector must be an integer power of two. This condition can be
met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions:
data smoothing
calculating weighted difference
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length
L/2.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until
the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated the
wavelet coefficients of the transformed data.
Wavelet coefficients larger than some user-specified threshold are retained; the remaining coefficients are set to 0.
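The general algorithm can be illustrated with the Haar wavelet, one common choice for the smoothing/weighted-difference pair (other wavelet families use different filters); the data vector and threshold below are illustrative:

# Minimal sketch of a Haar-style wavelet decomposition: pairwise averages
# (smoothing) and pairwise differences, applied recursively to the smoothed half.
def haar_dwt(data):
    data = list(data)
    assert len(data) & (len(data) - 1) == 0, "length must be a power of two (pad with zeros)"
    output = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]   # smoothing
        details  = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]   # weighted differences
        output = details + output      # keep this level's detail coefficients
        data = averages                # recurse on the smoothed data of length L/2
    return data + output               # overall average followed by all detail coefficients

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
# Coefficients below a threshold can be set to 0 for compression.
threshold = 0.4
compressed = [c if abs(c) >= threshold else 0 for c in coeffs]
print(coeffs, compressed)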
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These
are unit vectors that each point in a direction perpendicular to the others.
2. Multidimensional data of more than two dimensions can be handled by reducing the problem
to two dimensions.
3. Principal components may be used as inputs to multiple regression and cluster analysis.
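A minimal PCA sketch in NumPy, assuming a small two-attribute data set; it centers the data, finds the k orthonormal components from the covariance matrix, and projects the data onto them:

# Minimal PCA sketch: normalize (center) the data, compute k orthonormal
# principal components, and re-express the data in the reduced space.
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)            # center the input data
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
    eig_vals, eig_vecs = np.linalg.eigh(cov)   # orthonormal eigenvectors
    order = np.argsort(eig_vals)[::-1]         # sort by decreasing variance ("significance")
    components = eig_vecs[:, order[:k]]        # keep the k strongest components
    return X_centered @ components             # projected, dimensionality-reduced data

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, k=1))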
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation Y = αX + β,
where the variance of Y is assumed to be constant. The coefficients α and β (called regression coefficients) specify the slope of the line and the Y-intercept, respectively.
Multiple- linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a response
variable Y, to be modeled as a linear function of two or more predictor variables.
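A minimal least-squares sketch (with illustrative data) of both cases: simple linear regression Y = αX + β, and multiple linear regression with two hypothetical predictors X1 and X2:

# Minimal least-squares sketch for simple and multiple linear regression.
import numpy as np

# Simple linear regression: fit Y = alpha * X + beta
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
alpha, beta = np.polyfit(X, Y, deg=1)       # slope and Y-intercept
print("Y ~= %.2f * X + %.2f" % (alpha, beta))

# Multiple linear regression: Y modeled as a linear function of X1 and X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
design = np.column_stack([X1, X2, np.ones_like(X1)])   # columns for X1, X2 and the intercept
coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("coefficients (X1, X2, intercept):", np.round(coeffs, 2))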
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a multidimensional
space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
2 Histogram
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Example:
The following data are a list of prices of commonly sold items. The numbers have been sorted.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Draw a histogram plot for price where each bucket has an equal width of 10 (a code sketch appears after the partitioning rules below).
The buckets can be determined based on the following partitioning rules:
1. Equi-width: the bucket ranges (histogram bars) all have the same width.
2. Equi-depth (equal-frequency): each bucket contains roughly the same number of data samples, so the bars have the same height.
3. V-Optimal: the histogram with the least variance.
4. MaxDiff: bucket boundaries are placed between the pairs of adjacent values with the largest differences.
V-Optimal: The V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where the bucket weight is equal to the number of values in the bucket.
MaxDiff: Here we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for pairs having the β−1 largest differences, where β is the user-specified number of buckets.
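A minimal sketch for the exercise above, assuming equal-width buckets 1-10, 11-20, and 21-30:

# Minimal sketch: equal-width (width 10) buckets for the sorted price list,
# counting how many items fall into each bucket.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
counts = Counter((p - 1) // width for p in prices)   # bucket 0: 1-10, bucket 1: 11-20, ...
for b in sorted(counts):
    low, high = b * width + 1, (b + 1) * width
    print("%d-%d: %d items" % (low, high, counts[b]))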
3. Clustering techniques
The quality of clusters may be measured by their diameter (the maximum distance between any two objects in the cluster) or by the centroid distance (the average distance of each cluster object from its centroid); both measures are computed in the sketch below.
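A small sketch (with illustrative 2-D points) of both quality measures:

# Minimal sketch of cluster diameter and average centroid distance.
import numpy as np

cluster = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.8, 2.1]])

# Diameter: the largest pairwise distance within the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
diameter = np.sqrt((diffs ** 2).sum(axis=-1)).max()

# Centroid distance: average distance of each object from the centroid.
centroid = cluster.mean(axis=0)
centroid_distance = np.sqrt(((cluster - centroid) ** 2).sum(axis=1)).mean()

print(round(diameter, 3), round(centroid_distance, 3))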
4.Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be represented
by a much smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples. Let's have a look at some possible samples for
D.
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing
n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples
are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is
drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
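A minimal sketch of the four sampling schemes on a toy tuple list D, with hypothetical "page" clusters of size 5 and age-group strata:

# Minimal sketch of SRSWOR, SRSWR, cluster, and stratified sampling.
import random

D = [("t%d" % i, "young" if i < 8 else "senior") for i in range(10)]  # (tuple_id, stratum)
n = 4

# 1. SRSWOR: draw n tuples without replacement, each equally likely.
srswor = random.sample(D, n)

# 2. SRSWR: draw n tuples with replacement (a tuple may be drawn again).
srswr = [random.choice(D) for _ in range(n)]

# 3. Cluster sample: group D into "pages" of 5 tuples and draw whole clusters.
pages = [D[i:i + 5] for i in range(0, len(D), 5)]
cluster_sample = random.sample(pages, 1)

# 4. Stratified sample: draw an SRS from every stratum (age group) separately,
#    so even the smallest group is represented.
strata = {}
for t in D:
    strata.setdefault(t[1], []).append(t)
stratified = [random.choice(group) for group in strata.values()]

print(srswor, srswr, cluster_sample, stratified, sep="\n")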
Advantages of sampling
Easy to use
Top-down discretization or splitting – Here, the process starts by first finding one or a few
points (called split points or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals.
Bottom-up discretization or merging – Here, the process starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals.
Binning
Histogram analysis
Entropy-based discretization
𝝌𝟐-merging
Cluster analysis
I. Binning
Binning is a top-down splitting technique based on a specified number of bins.
These methods are also used as discretization methods for numerosity reduction and concept hierarchy generation.
These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies.
Binning does not use class information and is therefore an unsupervised discretization
technique.
It is sensitive to the user-specified number of bins, as well as the presence of outliers.
2. Suppose we want to classify the tuples in D by partitioning on attribute A and some split-
point. Ideally, we would like this partitioning to result in an exact classification of the tuples.
For example, if we had two classes, we would hope that all of the tuples of, say, class C1 will
fall into one partition and all of the tuples of class C2 will fall into the other partition.
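A minimal sketch of one step of this idea (with illustrative values and class labels): it evaluates candidate split-points on A and picks the one that minimizes the expected (weighted) entropy of the two resulting partitions:

# Minimal sketch of entropy-based split-point selection on attribute A.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2          # midpoint candidate
        left  = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or info < best[1]:
            best = (split, info)
    return best

A      = [1, 2, 3, 10, 11, 12]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
print(best_split(A, labels))   # split near 6.5 separates C1 and C2 exactly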
Adjacent intervals with the least χ² values are merged together, because low χ² values for a pair indicate similar class distributions.
This merging process proceeds recursively until a predefined stopping criterion is met.
V. Cluster Analysis
Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the
values of A into clusters or groups.
Clustering can be used to generate a concept hierarchy for A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
In the top-down splitting strategy, each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
In the bottom-up merging strategy, clusters are formed by repeatedly grouping neighboring
clusters in order to form higher-level concepts.
VI. Discretization by intuitive partitioning
Although the above discretization methods are useful in the generation of numerical
hierarchies, many users would like to see numerical ranges partitioned into relatively uniform,
easy-to-read intervals that appear intuitive or natural.
The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural seeming
intervals. In general, the rule partitions a given range of data into 3, 4 or 5 relatively equal-
width intervals, recursively and level by level, based on the value range at the most significant
digit.
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9, and 3 intervals in the grouping of 2-3-2 for 7).
If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
The rule can be recursively applied to each interval, creating a concept hierarchy for the given
numerical attribute.
Real-world data often contain extremely large positive and/or negative outlier values, which
could distort any top-down discretization method, based on minimum and maximum data values.
Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.