Syllabus: Data Warehousing and Data Mining
UNIT-II
DATA PREPROCESSING
1. Preprocessing
Real-world databases are highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size (often several gigabytes or more) and their likely origin from multiple,
heterogeneous sources. Low-quality data will lead to low-quality mining results, so data
preprocessing techniques are applied to improve the quality of the data before mining.
Data Preprocessing Techniques
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as a
data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering, for instance. These techniques are not mutually exclusive; they
may work together.
* Data transformations, such as normalization, may be applied.
Need for preprocessing
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
Incomplete data can occur for a number of reasons:
* Attributes of interest may not always be available.
* Relevant data may not have been recorded due to a misunderstanding, or because of equipment malfunctions.
* Data that were inconsistent with other recorded data may have been deleted.
* Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Noisy (incorrect) data can also occur for a number of reasons:
* The data collection instruments used may be faulty.
* There may have been human or computer errors at data entry.
* Errors in data transmission can also occur.
* There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files. Some
attributes representing a given concept may have different names in different databases, causing
inconsistencies and redundancies.
Data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that contribute toward the success of
the mining process.
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
2. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data.
Missing Values
Many tuples may have no recorded value for several attributes, such as customer income, so the
missing values for these attributes need to be filled in.
The following methods are useful for filling in the missing values (a short sketch of methods 3 and 4 follows the list):
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or -∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace the missing value for
income.
5. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision tree
can be constructed to predict the missing values for income.
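A minimal pandas sketch of methods 3 and 4 above, using a small hypothetical customer table; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "age": [23, 35, 45, 29, 52],
    "income": [30000, None, 56000, None, 82000],
})

# Method 3: fill with a global constant (here a numeric sentinel standing in for "unknown").
filled_constant = df["income"].fillna(-1)

# Method 4: fill with the attribute mean (the average income).
filled_mean = df["income"].fillna(df["income"].mean())

print(filled_mean.tolist())   # [30000.0, 56000.0, 56000.0, 56000.0, 82000.0]
```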
Noisy Data
Noise is a random error or variance in a measured variable. Noise can be removed using data
smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the
values around it. The sorted values are distributed into a number of "buckets" or "bins." Because
binning methods consult the neighborhood of values, they perform local smoothing.
Sorted data for price (in dollars): 3,7,14,19,23,24,31,33,38.
Example 1: Partition into (equal-frequency) bins:
Bin 1: 3,7,14
Bin 2: 19,23,24
Bin 3: 31,33,38
In the above method, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
Smoothing by bin means:
Bin 1: 8,8,8
Bin 2: 22,22,22
Bin 3: 34,34,34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 3, 7, and 14 in Bin 1 is 8 [(3+7+14)/3].
Smoothing by bin boundaries:
Bin 1: 3,3,14
Bin 2: 19,24,24
Bin 3: 31,31,38
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be
equal-width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following data using smoothing techniques:
8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34
Sorted data for price (in dollars): 4,8,9,15,21,21,24,25,26,28,29,34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4,8,9,15
Bin 2: 21,21,24,25
Bin 3: 26,28,29,34
Smoothing by bin means:
Bin 1: 9,9,9,9
Bin 2: 23,23,23,23
Bin 3: 29,29,29,29
Smoothing by bin boundaries:
Bin 1: 4,4,4,15
Bin 2: 21,21,25,25
Bin 3: 26,26,26,34
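A small numpy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, reproducing the numbers of Example 2; the reshape-based bucketing assumes the sorted data divide evenly into bins of equal depth.

```python
import numpy as np

data = np.sort(np.array([8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34]))
n_bins, depth = 3, 4                       # equal-frequency bins of depth 4
bins = data.reshape(n_bins, depth)         # rows: [4,8,9,15], [21,21,24,25], [26,28,29,34]

# Smoothing by bin means: replace each value by its (rounded) bin mean.
by_means = np.repeat(bins.mean(axis=1).round().astype(int), depth).reshape(n_bins, depth)

# Smoothing by bin boundaries: replace each value by the closer of its bin's min or max.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```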
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the "best" line to fit two attributes (or variables), so that one attribute
can be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a multidimensional surface.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups,
or "clusters." Intuitively, values that fall outside of the set of clusters may be considered
outliers.
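A brief sketch of outlier detection by clustering; it uses scikit-learn's KMeans as one possible clustering algorithm (the text does not prescribe a particular one) and flags points that lie unusually far from their cluster centroid. The data and the 3-standard-deviation cutoff are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense clusters plus one far-away point acting as an outlier.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2)), [[30.0, 30.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point from its own cluster centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance is unusually large (here: beyond mean + 3 std).
outliers = np.where(dist > dist.mean() + 3 * dist.std())[0]
print(outliers)   # expected to contain index 100, the far-away point
```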
3. Data Integration
Data mining often requires data integration, the merging of data from multiple data stores into a coherent data
store, as in data warehousing. These sources may include multiple databases, data cubes, or flat
files.
Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & resolution of data value conflicts
a) Schema Integration & Object Matching
Schema integration & object matching can be tricky because the same entity can be
represented in different forms in different tables. This is referred to as the entity identification
problem. Metadata can be used to help avoid errors in schema integration. The metadata may
also be used to help transform the data.
b) Redundancy:
Redundancy is another important issue. An attribute (such as annual revenue, for instance)
may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies
in attribute or dimension naming can also cause redundancies in the resulting data set. Some
redundancies can be detected by correlation analysis and covariance analysis.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and covariance.
Correlation analysis for nominal data (χ² test):
For nominal data, a correlation relationship between two attributes, A and B, can be
discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, a3,
..., ac, and B has r distinct values, namely b1, b2, b3, ..., br. The data tuples can be described
by a contingency table. The χ² value is computed as

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

where o_{ij} is the observed frequency of the joint event (A_i, B_j) and e_{ij} is the expected
frequency of (A_i, B_j), which can be computed as

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}

where n is the number of data tuples.
Covariance analysis for numeric data:
For two numeric attributes A and B observed over n tuples, the covariance is computed as

Cov(A, B) = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})

where \bar{A} and \bar{B} are the respective mean values of A and B.
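A short numpy sketch of both measures above: the χ² statistic for a hypothetical contingency table of two nominal attributes, and the covariance of two small made-up numeric attributes.

```python
import numpy as np

# Hypothetical contingency table: rows = values of one attribute, columns = values of the other.
observed = np.array([[250,  200],
                     [ 50, 1000]])
n = observed.sum()

# Expected counts: e_ij = count(row value) * count(column value) / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 1))        # 507.9 for this made-up table

# Covariance of two small made-up numeric attributes A and B:
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
cov = np.mean((A - A.mean()) * (B - B.mean()))
print(round(cov, 2))         # 7.0
```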
4. Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results.
Why data reduction? A database or data warehouse may store terabytes of data, so
complex data analysis may take a very long time to run on the complete data set.
Attribute subset selection:
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. Because an exhaustive search over all attribute subsets is usually too expensive, heuristic (greedy) methods are commonly used; they make a locally optimal choice at each step in the hope that this will lead to a
globally optimal solution. Many other attribute evaluation measures can be used, such as the
information gain measure used in building decision trees for classification. Basic heuristic methods of attribute subset selection include the following:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At each
subsequent iteration or step, the best of the remaining original attributes is added to the set (see the sketch after this list).
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure where
each internal node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each leaf node denotes a class prediction. At each node, the algorithm chooses the
"best" attribute to partition the data into individual classes. A tree is constructed from the
given data. All attributes that do not appear in the tree are assumed to be irrelevant; the set of
attributes appearing in the tree form the reduced subset of attributes. A threshold measure can be used
as the stopping criterion.
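A minimal sketch of stepwise forward selection (method 1). The `score` callback is an assumption standing in for whatever attribute evaluation measure is chosen (e.g., information gain or validation accuracy); the toy scoring function at the end is purely illustrative.

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection.

    attributes : list of candidate attribute names
    score      : callable taking a list of attributes and returning a number to maximize
    k          : number of attributes to keep (a simple stopping criterion)
    """
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Add the remaining attribute that improves the score the most.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: the scoring function simply rewards a fixed "useful" set of attributes.
useful = {"income", "age"}
print(forward_selection(["name", "age", "income", "zip"],
                        lambda s: len(useful & set(s)), k=2))   # ['age', 'income']
```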
Numerosity Reduction:
Numerosity reduction is used to reduce the data volume by choosing alternative, smaller forms of
data representation.
Techniques for Numerosity reduction:
Parametric - In this model, only the data parameters need to be stored, instead of the
actual data (e.g., log-linear models, regression).
Nonparametric - These methods store reduced representations of the data, including
histograms, clustering, and sampling.
Parametric model
1. Regression
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a random
variable, Y (called a response variable), can be modeled as a linear function of
another random variable, X (called a predictor variable), with the equation Y = αX + β,
where the variance of Y is assumed to be constant. The coefficients, α and β (called
regression coefficients), specify the slope of the line and the Y-intercept,
respectively.
Multiple linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a
response variable Y, to be modeled as a linear function of two or more predictor
variables.
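A small sketch fitting Y = αX + β by least squares with numpy; the (X, Y) values are made up, e.g., years of experience versus salary in thousands of dollars.

```python
import numpy as np

# Made-up (X, Y) pairs, e.g. years of experience vs. salary in $1000s.
X = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
Y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# Fit Y = alpha * X + beta (a degree-1 polynomial is simple linear regression).
alpha, beta = np.polyfit(X, Y, deg=1)
print(alpha, beta)

# The two coefficients can now stand in for the raw data (parametric reduction),
# or be used to smooth/predict Y for new X values.
Y_hat = alpha * X + beta
```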
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller subset
of dimensional combinations.
Nonparametric Model
1. Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or
buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are
called singleton buckets.
Ex: The following data are a list of prices of commonly sold items at All Electronics. The numbers
have been sorted:
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,21,21,21,21,21,25,25,25,25,25,28,28,30,30,30
Equal-frequency (or equi-depth): the frequency of each bucket is constant.
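A sketch, using numpy, of building an equal-width histogram and equal-frequency (equi-depth) bucket boundaries for the price list above; the choice of three buckets is arbitrary.

```python
import numpy as np

prices = np.array([1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,
                   18,18,18,18,18,18,18,18,20,20,20,20,20,20,21,21,21,21,21,
                   25,25,25,25,25,28,28,30,30,30])

# Equal-width histogram: 3 buckets covering equal value ranges.
counts, edges = np.histogram(prices, bins=3)
print(edges)    # bucket boundaries, approximately [1, 10.67, 20.33, 30]
print(counts)   # number of prices falling in each bucket

# Equal-frequency (equi-depth) boundaries: placed at the quantiles, so each
# bucket holds roughly the same number of values.
eq_depth_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(eq_depth_edges)
```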
2. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups,
or clusters, so that objects within a cluster are similar to one another and dissimilar to objects in
other clusters. Similarity is defined in terms of how close the objects are in space, based on a
distance function. The quality of a cluster may be represented by its diameter, the maximum
distance between any two objects in the cluster. Centroid distance is an alternative measure of
cluster quality and is defined as the average distance of each cluster object from the cluster
centroid.
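A small numpy sketch computing the two cluster-quality measures just described, diameter and centroid distance, for one made-up cluster of 2-D points.

```python
import numpy as np

# One made-up cluster of 2-D points (tuples treated as objects).
cluster = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 2.5], [2.2, 2.1]])

# Diameter: maximum distance between any two objects in the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
pairwise = np.linalg.norm(diffs, axis=-1)
diameter = pairwise.max()

# Centroid distance: average distance of each object from the cluster centroid.
centroid = cluster.mean(axis=0)
centroid_distance = np.linalg.norm(cluster - centroid, axis=1).mean()

print(diameter, centroid_distance)
```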
3. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random sample (or subset) of the data. Suppose that a large data
set, D, contains N tuples. One possible sample is a simple random sample without
replacement (SRSWOR) of size n: this is created by drawing n of the N tuples from D
(n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely to
be sampled.
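A minimal numpy sketch of SRSWOR; D here is a made-up data set of N tuples (setting replace=True instead would give sampling with replacement).

```python
import numpy as np

rng = np.random.default_rng(42)
N, n = 1000, 50                          # N tuples in D, sample size n (n < N)
D = rng.integers(0, 100, size=(N, 3))    # made-up data set of N tuples

# SRSWOR: draw n of the N tuples, each equally likely, no tuple drawn twice.
idx = rng.choice(N, size=n, replace=False)
sample = D[idx]
print(sample.shape)   # (50, 3)
```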
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are applied so as to obtain a
reduced or "compressed" representation of the original data.
Dimension Reduction Types
Lossless - If the original data can be reconstructed from the compressed data without any
loss of information, the data reduction is called lossless.
Lossy - If the original data can be reconstructed from the compressed data only with loss of
information, the data reduction is called lossy.
Effective methods of lossy dimensionality reduction:
a) Wavelet transforms
b) Principal components analysis.
a) Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector, transforms it to a numerically different vector of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X=(x1,x2,…………,xn), depicting n
measurements made on the tuple from n database attributes.
For example, all wavelet coefficients larger than some user-specified threshold can be
retained; all other coefficients are set to 0. The resulting data representation is therefore very
sparse, so that operations that can take advantage of data sparsity are computationally very fast if
performed in wavelet space.
The number next to a wavelet name is the number of vanishing moments of the wavelet;
this is a set of mathematical relationships that the coefficients must satisfy and is related to the number
of coefficients.
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L >=n).
2. Each transform involves applying two functions
The first applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out the detailed
features of data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (X2i , X2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and the high-frequency
content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets are of length 2 (a small sketch of this procedure follows).
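A minimal sketch of the recursive pairwise smoothing/differencing procedure described in steps 2-4, using simple pairwise averages and halved differences in the style of the Haar wavelet; real DWT implementations use specific wavelet filter coefficients, so this is only an illustration of the idea.

```python
import numpy as np

def haar_like_dwt(x):
    """Recursively apply pairwise averages (smoothing) and halved differences
    (detail) to a vector whose length is an integer power of 2."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 2:                                # recurse until length 2 (step 4)
        pairs = x.reshape(-1, 2)
        smooth = pairs.mean(axis=1)                  # low-frequency part, length L/2
        detail = (pairs[:, 0] - pairs[:, 1]) / 2.0   # high-frequency (detail) part
        details.append(detail)
        x = smooth
    return x, details                                # final smooth part + all detail sets

coeffs, details = haar_like_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)     # [1.5 4. ]
```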
b) Principal components analysis (PCA):
Principal components analysis searches for a new, smaller set of axes (the principal components) onto
which the original data can be projected. For example, for data originally mapped to the axes X1 and X2,
the first two principal components Y1 and Y2 form new axes. This information helps identify groups or
patterns within the data. The new axes are sorted such that the first axis shows the most variance among
the data, the second axis shows the next highest variance, and so on.
The size of the data can be reduced by eliminating the weaker components.
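A short numpy sketch of PCA via eigendecomposition of the covariance matrix of made-up 2-D data, keeping only the strongest component; a library routine such as scikit-learn's PCA could be used instead.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up correlated 2-D data (attributes X1, X2).
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components (Y1, Y2, ...),
# sorted so the first component carries the most variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project onto the first principal component only (eliminating the weaker component).
Y1 = Xc @ components[:, :1]
print(eigvals[order])   # variances along Y1, Y2 (largest first)
print(Y1.shape)         # (200, 1)
```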
Advantage of PCA
PCA is computationally inexpensive
Multidimensional data of more than two dimensions can be handled by reducing the
problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis.
5. Data Transformation
In data transformation, the data are transformed into forms appropriate for mining. In normalization,
the attribute data are scaled so as to fall within a smaller range, such as [0.0, 1.0]. Common
normalization methods include:
a) Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute, A. Min-max
normalization maps a value, vi, of A to vi' in the range [new_minA, new_maxA] by computing
vi' = ((vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values. It will encounter
an ―out-of-bounds‖ error if a future input case for normalization falls outside of the original data
range for A.
Example: Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
(73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716.
b) Z-Score Normalization
In z-score normalization, the values for an attribute, A, are normalized based on the mean (i.e., average)
and standard deviation of A. A value, vi, of A is normalized to vi' by computing
vi' = (vi - Ā) / σA
where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Example: z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value
of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
c) Normalization by decimal scaling moves the decimal point of values of attribute A. The number of
decimal points moved depends on the maximum absolute value of A: a value, vi, of A is normalized to
vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1.
Example: Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each
value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
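A small Python sketch applying the three normalization methods with the numbers from the examples above ($73,600 income with min $12,000, max $98,000, mean $54,000, standard deviation $16,000, and the decimal-scaling range -986 to 917).

```python
import numpy as np

v = 73600.0

# a) Min-max normalization to [0.0, 1.0] with min = 12,000 and max = 98,000.
min_a, max_a, new_min, new_max = 12000.0, 98000.0, 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_minmax, 3))    # 0.716

# b) Z-score normalization with mean = 54,000 and std = 16,000.
mean_a, std_a = 54000.0, 16000.0
v_zscore = (v - mean_a) / std_a
print(v_zscore)              # 1.225

# c) Decimal scaling: divide by 10**j, with j the smallest integer so max(|v'|) < 1.
A = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(A).max())))
print(A / 10 ** j)           # [-0.986  0.917]
```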
Concept hierarchy generation for nominal data:
3. Specification of a set of attributes, but not of their partial ordering: A user may
specify a set of attributes forming a concept hierarchy, but omit to explicitly
state their partial ordering. The system can then try to automatically
generate the attribute ordering so as to construct a meaningful
concept hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless when
defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.
Consequently, the user may have included only a small subset of the relevant attributes in the
hierarchy specification.