MODULE 2
DATA PREPROCESSING
Data Preprocessing: Data Pre-processing Concepts, Data Cleaning, Data integration
and transformation, Data Reduction, Discretization and concept hierarchy.
----------------------------------------------------------------------------------------------------
Data preprocessing
Data preprocessing is the preparatory step for data mining. Preprocessing is
needed because data collected from different sources can be incomplete, noisy,
and inconsistent.
Incomplete data can occur because attributes of interest may not always be
available, relevant data may not have been recorded due to a misunderstanding
or equipment malfunctions, or data that were inconsistent with other recorded
data may have been deleted.
Reasons for noisy data (data having incorrect attribute values) include faulty
data collection instruments, human or computer errors at data entry, errors in
data transmission, and technology limitations.
Data Cleaning
Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.
a) Missing Values
Many tuples have no recorded value for several attributes. The methods used
are listed below (a short code sketch follows the list):
i. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective, unless the tuple contains several attributes with missing
values. It is especially poor when the percentage of missing values per
attribute varies considerably.
ii. Fill in the missing value manually: This approach is time-consuming
and may not be feasible given a large data set with many missing
values.
iii. Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”
or −∞.
iv. Use the attribute mean to fill in the missing value: For example,
suppose that the average income of customers is $56,000. Use this
value to replace the missing value for income.
v. Use the attribute mean for all samples belonging to the same class
as the given tuple: For example, if classifying customers according to
credit risk, replace the missing value with the average income of
customers in the same credit risk category as that of the given tuple.
vi. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
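A minimal sketch of methods (iii)–(v), assuming the data sit in a pandas DataFrame with hypothetical income and credit_risk columns:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, np.nan, 30000, np.nan, 58000],
})

# (iii) Fill with a global constant (here a sentinel number).
filled_const = df["income"].fillna(-1)

# (iv) Fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# (v) Fill with the mean of the samples in the same class (credit_risk group).
filled_class_mean = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
```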
b) Noisy Data
Noise is a random error or variance in a measured variable. The methods for
noise removal are:
i. Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing (a code sketch follows this list of methods).
ii. Regression
Data can be smoothed by fitting the data to a function, such as with
regression. Regression finds a mathematical function corresponding to a
set of data, so that all data points approximately fall on that function.
If the function is linear, we call it linear regression.
iii. Clustering:
Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers
iv. Combined computer and human inspection
Outliers can be identified through a combination of computer and human
inspection. Outliers may be informative or garbage.
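A minimal sketch of binning-based smoothing (method i), using made-up price values and an equal-frequency split:

```python
# Equal-frequency (equal-depth) binning with smoothing by bin means
# and by bin boundaries; the price values are illustrative.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3  # values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer bin boundary.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```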
c) Inconsistent data
There may be inconsistencies in the data recorded for some transactions.
Such errors may be corrected manually using external references.
Knowledge engineering tools may also be used to detect violations of known
data constraints.
Data integration
It combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat
files.
Issues to consider during data integration are:
a) Schema integration
b) Redundancy
c) Detection and resolution of data value conflicts.
Schema integration
Schema integration can be tricky because the same entity can be represented in
different forms in different database tables. This is referred to as the entity
identification problem.
Eg: cust_id in one database may be represented as cust_no in another.
Databases and data warehouses have metadata. Metadata helps to avoid errors in
schema integration.
Redundancy
Redundancy is another important issue. An attribute (such as annual revenue, for
instance) may be redundant if it can be “derived” from another attribute or set of
attributes. Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set. Some redundancies can be detected by
correlation analysis. Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the available data. For numerical
attributes, we can evaluate the correlation between two attributes A and B, by
computing the correlation coefficient:
$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{n\,\sigma_A \sigma_B} $$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the means of attributes A and B
respectively, and $\sigma_A$, $\sigma_B$ are the standard deviations of A and B.
If $r_{A,B}$ is greater than 0, then A and B are positively correlated, meaning that the
values of A increase as the values of B increase. The higher the value, the stronger
the correlation (i.e., the more each attribute implies the other). Hence, a high
value may indicate that A (or B) can be removed as a redundancy. If the resulting
value is equal to 0, then A and B are independent and there is no correlation
between them. If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the values of the other
attribute decrease. This means that each attribute discourages the other.
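A minimal sketch of this correlation check with NumPy, using hypothetical values for the two attributes:

```python
import numpy as np

# Hypothetical values of two numerical attributes A and B.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# Population standard deviations (ddof=0), matching the formula above.
r_AB = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())

print(r_AB)                     # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via NumPy's built-in Pearson r
```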
Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate
for mining. Data transformation can involve the following:
i. Smoothing, which works to remove noise from the data. Such techniques
include binning, regression, and clustering.
ii. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing
a data cube for analysis of the data at multiple granularities.
iii. Generalization of the data, where low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies.
For example, categorical attributes, like street, can be generalized to higher-
level concepts, like city or country. Values for numerical attributes, like
age, may be mapped to higher-level
concepts, like youth, middle-aged, and senior.
iv. Normalization, where the attribute data are scaled so as to fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
v. Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining
process.
There are many methods for data normalization (a short sketch follows the list):
a) min-max normalization,
b) z-score normalization,
c) normalization by decimal scaling.
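Minimal sketches of the three methods; the mean, standard deviation, and sample values used below are assumed only for illustration:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Normalize v using the attribute's mean and standard deviation."""
    return (v - mean) / std

def decimal_scaling(v, j):
    """Divide v by 10**j, where j is the smallest integer such that
    the largest absolute normalized value is below 1."""
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))   # 0.716...
print(z_score(73600, 54000, 16000))   # 1.225  (assumed mean and std)
print(decimal_scaling(-986, 3))       # -0.986 (j = 3, assuming max |value| is 986)
```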
Eg: Suppose that the minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to:
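Using the standard min-max formula, the example works out as:

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0.0) + 0.0 \approx 0.716 $$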
Data Reduction
Complex data analysis and mining on huge amounts of data can take a long
time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data.
Dimensionality reduction
Dimensionality reduction reduces the number of random variables or attributes
under consideration. Methods include:
a) Wavelet transforms
b) Principal component analysis
c) Attribute subset selection
a) Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically
different vector, X′, of wavelet coefficients. The two vectors are of the same
length. A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
The method is as follows (a Haar-wavelet sketch of the pairwise step appears after the steps):
1. The length, L, of the input data vector must be an integer power of 2.
This condition can be met by padding the data vector with zeros as
necessary (L ≥ n, where n is the original length).
2. Each transform involves applying two functions. The first applies some
data smoothing, such as a sum or weighted average. The second
performs a weighted difference, which acts to bring out the detailed
features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all
pairs of measurements (x_{2i}, x_{2i+1}). This results in two sets of data of
length L/2.
4. The two functions are recursively applied to the sets of data obtained in
the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the above iterations are
designated the wavelet coefficients of the transformed data.
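A minimal sketch of steps 2–4, assuming the Haar wavelet's simple pairwise averages and differences as the two functions (one common choice):

```python
def haar_dwt(x):
    """Recursively apply pairwise averages (smoothing) and pairwise
    differences (detail) until the smoothed part has length 1."""
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with zeros)"
    coeffs = []
    while len(x) > 1:
        averages = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        details = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        coeffs = details + coeffs  # keep the detail coefficients from every level
        x = averages               # recurse on the smoothed half
    return x + coeffs              # overall average followed by the details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```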
b) Principal Components Analysis (PCA)
PCA searches for a small set of orthogonal vectors (the principal components)
that best represent the data, sorted in order of decreasing significance. By
keeping only the strongest principal components, it should be possible to
reconstruct a good approximation of the original data.
Numerosity reduction
Numerosity reduction replaces the original data volume by alternative, smaller
forms of data representation. There are two techniques:
i. Parametric: A model is used to estimate the data, so only the model
parameters need to be stored instead of the actual data.
ii. Non-parametric: Reduced representations of the data are stored, such as
histograms, clusters, and samples (described in the sections that follow).
In linear regression, for example, the data are modeled to fit a straight line,
y = wx + b, where the variance of y is assumed to be constant. Here x and y are
numerical database attributes. The coefficients w and b (called regression
coefficients) specify the slope of the line and the y-intercept, respectively.
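As a sketch, the coefficients can be estimated with the standard least-squares formulas:

$$ w = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b = \bar{y} - w\,\bar{x} $$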
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction.
A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, or buckets.
If each bucket represents only a single attribute-value/frequency pair,
the buckets are called singleton buckets. Often, buckets instead
represent continuous ranges for the given attribute.
Eg: The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 2.18 shows a histogram for the data using singleton buckets.
There are several partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket
range is uniform
(such as the width of $10 for the buckets in Figure 2.19).
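A minimal sketch of building equal-width buckets of width $10 for the price list above (the bucket boundaries 1-10, 11-20, 21-30 are assumed for illustration):

```python
from collections import Counter

# The sorted AllElectronics price list from the example above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10  # equal-width buckets of $10
counts = Counter((p - 1) // width for p in prices)  # bucket index 0, 1, 2

for b in sorted(counts):
    low, high = b * width + 1, (b + 1) * width
    print(f"${low}-${high}: {counts[b]} items")  # 1-10: 13, 11-20: 25, 21-30: 14
```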
Clustering
Clustering techniques consider data tuples as objects. They partition
the objects into groups or clusters, so that objects within a cluster
are “similar” to one another and “dissimilar” to objects in other
clusters.
Similarity is commonly defined in terms of how “close” the objects
are in space, based on a distance function.
The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and
is defined as the average distance of each cluster object from the
cluster centroid.
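A minimal sketch of the two cluster-quality measures on a hypothetical cluster of 2-D points:

```python
import math

# A hypothetical cluster of 2-D data objects.
cluster = [(2.0, 3.0), (3.0, 3.0), (2.5, 4.0), (3.5, 4.5)]

# Diameter: the maximum distance between any two objects in the cluster.
diameter = max(math.dist(p, q) for p in cluster for q in cluster)

# Centroid distance: the average distance of each object from the centroid.
centroid = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
centroid_distance = sum(math.dist(p, centroid) for p in cluster) / len(cluster)

print(round(diameter, 3), round(centroid_distance, 3))
```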
Sampling
It allows a large data set to be represented by a much smaller random
sample (or subset) of the data. Suppose that a large data set, D, contains N
tuples.
The most common ways that we could sample D for data reduction are:
i. Simple random sample without replacement (SRSWOR) of size s
This is created by drawing s of the N tuples from D (s < N), where the
probability of drawing any tuple in D is 1/N, that is, all tuples are equally
likely to be sampled.
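A minimal sketch of drawing an SRSWOR from a hypothetical data set D (a with-replacement sample is shown only for contrast):

```python
import random

# Hypothetical data set D of N tuples.
D = [("tuple", i) for i in range(1000)]
s = 50

# Simple random sample without replacement (SRSWOR) of size s:
# each tuple can appear at most once in the sample.
srswor = random.sample(D, s)

# For contrast: a simple random sample with replacement (SRSWR) of size s.
srswr = [random.choice(D) for _ in range(s)]
```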
Data Compression
Data compression applies transformations to obtain a reduced or compressed
representation of the original data. There are 2 types:
i. Lossy: If we can reconstruct only an approximation of the original data,
the compression is called lossy.
ii. Lossless: If the original data can be reconstructed from the compressed
data without loss of information, the data reduction is called lossless.
Binning
Binning is a top-down splitting technique based on a specified number of bins.
These methods are also used as discretization methods for numerosity reduction
and concept hierarchy generation. For example, attribute values can be discretized
by applying equal-width or equal-frequency binning, and then replacing each bin
value by the bin mean or median, as in smoothing by bin means or smoothing by
bin medians. These techniques can be applied recursively to the resulting partitions
in order to generate concept hierarchies.
Histogram Analysis
Histogram analysis is an unsupervised discretization technique because it does not
use class information. Histograms partition the values for an attribute, A, into
disjoint ranges called buckets. In an equal-width histogram, for example, the
values are partitioned into equal-sized partitions or ranges. With an equal
frequency histogram, the values are partitioned so that each partition contains the
same number of data tuples. The histogram analysis algorithm can be applied
recursively to each partition in order to automatically generate a multilevel concept
hierarchy, with the procedure terminating once a prespecified number of concept
levels has been reached. A minimum interval size can also be used per level to
control the recursive procedure.
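A minimal sketch of equal-frequency partitioning, using made-up attribute values:

```python
# Partition sorted values so each partition holds the same number of tuples.
values = sorted([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
k = 3                       # desired number of partitions
size = len(values) // k     # tuples per partition (assumes len is divisible by k)
partitions = [values[i * size:(i + 1) * size] for i in range(k)]

print(partitions)  # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```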
Entropy-Based Discretization
Entropy-based discretization is a supervised, top-down splitting technique. Given a
set of data tuples S, the basic method for entropy-based discretization of an
attribute A is as follows:
1. Each value of A can be considered as a potential interval boundary or
threshold T, partitioning the samples in S into two subsets satisfying
A < T and A ≥ T, respectively.
2. Given S, the threshold value selected is the one that maximizes the
information gain resulting from the subsequent partitioning, i.e., that
minimizes the expected information requirement

$$ I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2) $$

where $S_1$ and $S_2$ correspond to the samples in S satisfying the conditions
A < T and A ≥ T, respectively. The entropy function Ent for a given set is
calculated based on the class distribution of the samples in the set. For example,
given m classes, the entropy of $S_1$ is

$$ \mathrm{Ent}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$

where $p_i$ is the probability of class i in $S_1$, determined by dividing the
number of samples of class i in $S_1$ by the total number of samples in $S_1$.
The value of $\mathrm{Ent}(S_2)$ can be computed similarly.
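A minimal sketch of choosing the split threshold by this criterion, on hypothetical (age, class) samples:

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return the threshold T minimizing I(S, T), i.e. maximizing the
    information gain Ent(S) - I(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_i = None, float("inf")
    for k in range(1, len(pairs)):
        t = pairs[k][0]                              # candidate boundary: A >= t
        left = [lab for v, lab in pairs if v < t]    # S1
        right = [lab for v, lab in pairs if v >= t]  # S2
        if not left or not right:
            continue
        i_st = (len(left) / len(pairs)) * entropy(left) \
             + (len(right) / len(pairs)) * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t

# Hypothetical samples: attribute values paired with class labels.
ages = [23, 25, 30, 35, 40, 45, 50, 55]
risk = ["high", "high", "high", "low", "low", "low", "low", "low"]
print(best_threshold(ages, risk))  # 35, which separates the two classes exactly
```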
Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numerical attribute, A, by partitioning the
values of A into clusters or groups. Clustering takes the distribution of A into
consideration, as well as the closeness of data points, and therefore is able to
produce high-quality discretization results. Clustering can be used to generate a
concept hierarchy for A by following either a top-down splitting strategy or a
bottom-up merging strategy, where each cluster forms a node of the concept
hierarchy.
Segmentation by Natural Partitioning
The 3-4-5 rule can be used to segment numerical data into relatively uniform,
natural-seeming intervals. In general, the rule partitions a given range of data
into 3, 4, or 5 relatively equal-width intervals, recursively and level by level,
based on the value range at the most significant digit.
The rule is as follows:
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit,
then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and
9; and 3 intervals in the grouping of 2-3-2 for 7).
If it covers 2, 4, or 8 distinct values at the most significant digit, then
partition the range into 4 equal-width intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, then
partition the range into 5 equal-width intervals.
For example, if the values of an attribute range from 0 to 700, the most
significant (hundreds) digit covers 7 distinct values, so the range is partitioned
into 3 intervals using the 2-3-2 grouping: (0, 200], (200, 500], (500, 700].
The rule can be recursively applied to each interval, creating a concept
hierarchy for the given numerical attribute. Real-world data often contain
extremely large positive and/or negative outlier values, which could distort any
top-down discretization method based on minimum and maximum data values.
QUESTIONS
1. Why do we need data transformation? What are the different ways of data
transformation?
2. Where do we use Linear regression? Explain linear regression.
3. Summarize the various pre-processing activities involved in data mining.
Use the two methods below to normalize the following group of data:
1000, 2000, 3000, 5000, 9000
i) min-max normalization by setting min = 0 and max = 1
ii) z-score normalization
4. What is data cleaning? Describe the approaches to fill missing values.
5. What is noisy data? Explain the binning methods for data smoothing.
6. What is data normalization? Explain any two normalization methods.
7. What is attribute selection measure? Briefly describe the attribute selection
measures for decision tree induction.
8. What is a concept hierarchy?
9. The following data is given in increasing order for the attribute age:
13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,36,40,45,46,
52,70.
a) Use smoothing by bin boundaries to smooth these data, using a bin
depth of 3.
b) How might you determine outliers in the data?
c) What other methods are there for data smoothing?
10. Explain the following procedures for attribute subset selection
a) Stepwise forward selection
b) Stepwise backward elimination
c) A combination of forward selection and backward elimination
11. Real-world data tend to be incomplete, noisy, and inconsistent. What are
the various approaches adopted to clean the data?
12. Summarize the various pre-processing activities involved in data mining.