DWDM Unit II
UNIT –II
Data preprocessing
Introduction
Data preprocessing describes any type of processing performed on raw data to prepare it for
another processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively processed for
the purpose of the user.
2.3.1 Missing values:
1. Ignore the tuple: This is usually done when the class label is missing.
2. Fill in the missing values manually: This approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "unknown" or −∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income
of customers is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit-risk category as that of the given tuple (strategies 4 and 5 are sketched in code after this list).
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in the data set, a decision tree can be constructed to predict the missing values for income.
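The sketch below is a minimal illustration (not part of the original notes) of strategies 4 and 5, assuming a small pandas DataFrame with hypothetical columns "income" and "credit_risk":

# Minimal sketch: fill missing income with the overall mean (strategy 4)
# and with the mean of the tuple's own class (strategy 5).
# Column names and values are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [45000.0, np.nan, 30000.0, np.nan, 67000.0],
})

# Strategy 4: replace every missing income with the attribute mean.
overall_mean = df["income"].mean()
df["income_filled_mean"] = df["income"].fillna(overall_mean)

# Strategy 5: replace a missing income with the mean income of tuples
# in the same credit-risk category.
df["income_filled_class_mean"] = df.groupby("credit_risk")["income"] \
                                   .transform(lambda s: s.fillna(s.mean()))

print(df)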
2.3.2 Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used to remove such noise.
Several Data smoothing techniques:
1 Binning methods:
Binning methods smooth a sorted data value by consulting its "neighborhood", or the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In this technique:
Step 1: The data are first sorted.
Step 2: The sorted values are partitioned into equal-frequency bins.
Step 3: The arithmetic mean of each bin is calculated.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin (a code sketch follows the bins below).
Bin 1: 14, 14, 14 Bin 2: 18, 18, 18 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26, 26, 26 Bin 6: 33, 33, 33
Bin 7: 35, 35, 35 Bin 8: 40, 40, 40 Bin 9: 56, 56, 56
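A minimal sketch of smoothing by (equal-frequency) bin means, using an illustrative data set rather than the one behind the bins above:

# Minimal sketch of smoothing by bin means with equal-frequency bins.
# The input values are illustrative.
def smooth_by_bin_means(values, bin_size):
    data = sorted(values)                      # Step 1: sort the data
    smoothed = []
    for i in range(0, len(data), bin_size):    # Step 2: partition into equal-frequency bins
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)   # Step 3: arithmetic mean of the bin
        smoothed.extend([round(mean, 2)] * len(bin_vals))  # Step 4: replace values by the mean
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]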
2. Clustering:
Outliers in the data may be detected by clustering, where similar values are organized into
groups, or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers.
3. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.
2.3.3 Inconsistent Data
Inconsistencies may exist in the data stored in the transaction database. They can arise during data entry, from functional dependencies between attributes, or from missing values.
The inconsistencies can be detected and corrected either manually or by knowledge engineering tools.
2.3.4 Data cleaning as a process
Data cleaning is a two-step process:
1. Discrepancy detection
2. Data transformations
1. Discrepancy detection
Data auditing tools – analyze the data to discover rules and relationships, and detect data that violate such conditions.
2. Data transformations:
This is the second step in data cleaning as a process. After detecting discrepancies, we need to
define and apply (a series of) transformations to correct them.
Data Transformations Tools:
Data migration tools – allow simple transformations to be specified, such as replacing the string "gender" by "sex".
The correlation between two numeric attributes A and B can be measured by the correlation coefficient
r(A,B) = Σ (ai − Ā)(bi − B̄) / (n σA σB)
where
- n is the number of tuples
- Ā is the mean value of A
- B̄ is the mean value of B
- σA is the standard deviation of A
- σB is the standard deviation of B
If
r(A,B) > 0, then A and B are positively correlated.
r(A,B) < 0, then A and B are negatively correlated.
r(A,B) = 0, then there is no correlation between A and B.
For categorical (discrete) attributes, a correlation relationship between A and B can be discovered by a χ² (chi-square) test:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / n
where n is the number of data tuples.
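As a rough illustration (with made-up values, since the original worked example is not reproduced here), the sketch below computes r(A,B) for two numeric attributes and the χ² statistic for a 2x2 contingency table:

# Minimal sketch: correlation coefficient for numeric attributes and
# chi-square for categorical attributes. All data values are illustrative.
import numpy as np

# r(A,B) = sum((a - mean_A)(b - mean_B)) / (n * sigma_A * sigma_B)
A = np.array([2.0, 4.0, 6.0, 8.0])
B = np.array([1.0, 3.0, 5.0, 9.0])
n = len(A)
r = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())
print("r(A,B) =", round(r, 3))            # > 0, so A and B are positively correlated

# Chi-square from observed counts o_ij and expected counts
# e_ij = count(A = a_i) * count(B = b_j) / n
observed = np.array([[250, 200],          # rows: categories of A
                     [ 50, 1000]])        # columns: categories of B
total = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
chi2 = np.sum((observed - expected) ** 2 / expected)
print("chi-square =", round(chi2, 1))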
Normalization: An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0. The following methods are commonly used.
1. Min-max normalization: performs a linear transformation on the original data. Suppose minA and maxA are the minimum and maximum values of attribute A. Min-max normalization maps a value v of A to v' in the new range [new_minA, new_maxA] by computing
v' = ((v − minA) / (maxA − minA)) (new_maxA − new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
Example: Given the one-dimensional data set X = {−5.0, 23.0, 17.6, 9.23, 1.11}, normalize the data set using
(a) Min-max normalization on interval [0,1],
(b) Min-max normalization on interval [-1,1],
(c) Standard deviation normalization.
a) Min-max normalization on [0, 1] (parts (a)-(c) are worked out in the code sketch following the decimal-scaling example below).
2. Z-score normalization: In z-score (zero-mean) normalization, the values for attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing
v' = (v − meanA) / stand_devA
This method is useful when the actual minimum and maximum values of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: Suppose that the mean and standard deviation of the values for the attribute income are $52,000 and $14,000, respectively. With z-score normalization, a value of $72,000 for income is transformed to (72,000 − 52,000) / 14,000 = 1.43.
3. Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Example: Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
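A minimal sketch of the three normalization methods, assuming the example data set above is X = {−5.0, 23.0, 17.6, 9.23, 1.11}; it works parts (a)-(c) of that example and also applies decimal scaling to X:

# Minimal sketch of min-max, z-score, and decimal-scaling normalization.
import numpy as np

X = np.array([-5.0, 23.0, 17.6, 9.23, 1.11])   # assumed reading of the example data

# (a)/(b) Min-max normalization onto [new_min, new_max]
def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# (c) Z-score (standard deviation) normalization: v' = (v - mean) / std
def z_score(v):
    return (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest j with max(|v'|) < 1
def decimal_scaling(v):
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

print(min_max(X))             # values mapped to [0, 1]
print(min_max(X, -1.0, 1.0))  # values mapped to [-1, 1]
print(z_score(X))
print(decimal_scaling(X))     # here max |x| = 23, so j = 2 and we divide by 100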
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of
a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (e.g., regression and log-linear models) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels.
2. Lattice of cuboids- Data cubes created for varying levels of abstraction are often referred to
as cuboids.
The following database consists of sales per quarter for the years 1997-1999.
Suppose the analyst is interested in the annual sales rather than sales per quarter; the above data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
Lossless - If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.
Lossy - If only an approximation of the original data can be reconstructed from the compressed data (i.e., some information is lost), the data reduction is called lossy.
1. Wavelet compression is a form of data compression well suited for image compression.
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied
to a data vector D, transforms it to a numerically different vector, D', of wavelet coefficients.
The general algorithm for a discrete wavelet transform is as follows.
1. The length, L, of the input data vector must be an integer power of two. This condition can be
met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions:
data smoothing
calculating weighted difference
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length
L/2.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until
the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated the
wavelet coefficients of the transformed data.
Wavelet coefficients larger than some user-specified threshold are retained; the remaining coefficients are set to 0.
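The general algorithm can be illustrated with the Haar wavelet, one common choice for the smoothing/weighted-difference pair (other wavelet families use different filters); the data vector and threshold below are illustrative:

# Minimal sketch of a Haar-style wavelet decomposition: pairwise averages
# (smoothing) and pairwise differences, applied recursively to the smoothed half.
def haar_dwt(data):
    data = list(data)
    assert len(data) & (len(data) - 1) == 0, "length must be a power of two (pad with zeros)"
    output = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]   # smoothing
        details  = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]   # weighted differences
        output = details + output      # keep this level's detail coefficients
        data = averages                # recurse on the smoothed data of length L/2
    return data + output               # overall average followed by all detail coefficients

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
# Coefficients below a threshold can be set to 0 for compression.
threshold = 0.4
compressed = [c if abs(c) >= threshold else 0 for c in coeffs]
print(coeffs, compressed)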
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These
are unit vectors that each point in a direction perpendicular to the others.
2. Multidimensional data of more than two dimensions can be handled by reducing the problem
to two dimensions.
3. Principal components may be used as inputs to multiple regression and cluster analysis.
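A minimal PCA sketch in NumPy, assuming a small two-attribute data set; it centers the data, finds the k orthonormal components from the covariance matrix, and projects the data onto them:

# Minimal PCA sketch: normalize (center) the data, compute k orthonormal
# principal components, and re-express the data in the reduced space.
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)            # center the input data
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
    eig_vals, eig_vecs = np.linalg.eigh(cov)   # orthonormal eigenvectors
    order = np.argsort(eig_vals)[::-1]         # sort by decreasing variance ("significance")
    components = eig_vecs[:, order[:k]]        # keep the k strongest components
    return X_centered @ components             # projected, dimensionality-reduced data

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, k=1))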
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation Y = αX + β,
where the variance of Y is assumed to be constant. The coefficients α and β (called regression coefficients) specify the slope of the line and the Y-intercept, respectively.
Multiple- linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a response
variable Y, to be modeled as a linear function of two or more predictor variables.
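A minimal least-squares sketch (with illustrative data) of both cases: simple linear regression Y = αX + β, and multiple linear regression with two hypothetical predictors X1 and X2:

# Minimal least-squares sketch for simple and multiple linear regression.
import numpy as np

# Simple linear regression: fit Y = alpha * X + beta
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
alpha, beta = np.polyfit(X, Y, deg=1)       # slope and Y-intercept
print("Y ~= %.2f * X + %.2f" % (alpha, beta))

# Multiple linear regression: Y modeled as a linear function of X1 and X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
design = np.column_stack([X1, X2, np.ones_like(X1)])   # columns for X1, X2 and the intercept
coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("coefficients (X1, X2, intercept):", np.round(coeffs, 2))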
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a multidimensional
space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
2 Histogram
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Example:
The following data are a list of prices of commonly sold items. The numbers have been sorted.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Draw a histogram plot for price where each bucket has an equal width of 10 (a code sketch appears after the partitioning rules below).
The buckets can be determined based on the following partitioning rules:
1. Equi-width: the bucket ranges (histogram bars) all have the same width.
2. Equi-depth (equal-frequency): each bucket contains roughly the same number of data samples, so the bars have the same height.
3. V-Optimal: the histogram with the least variance.
4. MaxDiff: bucket boundaries are placed between the pairs of adjacent values with the largest differences.
V-Optimal: The V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where the bucket weight is equal to the number of values in the bucket.
MaxDiff: Here we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for pairs having the β−1 largest differences, where β is the user-specified number of buckets.
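A minimal sketch for the exercise above, assuming equal-width buckets 1-10, 11-20, and 21-30:

# Minimal sketch: equal-width (width 10) buckets for the sorted price list,
# counting how many items fall into each bucket.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
counts = Counter((p - 1) // width for p in prices)   # bucket 0: 1-10, bucket 1: 11-20, ...
for b in sorted(counts):
    low, high = b * width + 1, (b + 1) * width
    print("%d-%d: %d items" % (low, high, counts[b]))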
3. Clustering techniques
The quality of clusters may be measured by their diameter (the maximum distance between any two objects in the cluster) or by the centroid distance (the average distance of each cluster object from its centroid); both measures are computed in the sketch below.
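A small sketch (with illustrative 2-D points) of both quality measures:

# Minimal sketch of cluster diameter and average centroid distance.
import numpy as np

cluster = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.8, 2.1]])

# Diameter: the largest pairwise distance within the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
diameter = np.sqrt((diffs ** 2).sum(axis=-1)).max()

# Centroid distance: average distance of each object from the centroid.
centroid = cluster.mean(axis=0)
centroid_distance = np.sqrt(((cluster - centroid) ** 2).sum(axis=1)).mean()

print(round(diameter, 3), round(centroid_distance, 3))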
4.Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be represented
by a much smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples. Let's have a look at some possible samples for
D.
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing
n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples
are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is
drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
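A minimal sketch of the four sampling schemes on a toy tuple list D, with hypothetical "page" clusters of size 5 and age-group strata:

# Minimal sketch of SRSWOR, SRSWR, cluster, and stratified sampling.
import random

D = [("t%d" % i, "young" if i < 8 else "senior") for i in range(10)]  # (tuple_id, stratum)
n = 4

# 1. SRSWOR: draw n tuples without replacement, each equally likely.
srswor = random.sample(D, n)

# 2. SRSWR: draw n tuples with replacement (a tuple may be drawn again).
srswr = [random.choice(D) for _ in range(n)]

# 3. Cluster sample: group D into "pages" of 5 tuples and draw whole clusters.
pages = [D[i:i + 5] for i in range(0, len(D), 5)]
cluster_sample = random.sample(pages, 1)

# 4. Stratified sample: draw an SRS from every stratum (age group) separately,
#    so even the smallest group is represented.
strata = {}
for t in D:
    strata.setdefault(t[1], []).append(t)
stratified = [random.choice(group) for group in strata.values()]

print(srswor, srswr, cluster_sample, stratified, sep="\n")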
Advantages of sampling
Easy to use
Top-down discretization or splitting – Here, the process starts by first finding one or a few
points (called split points or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals.
Bottom-up discretization or merging – Here, the process starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals.
Binning
Histogram analysis
Entropy-based discretization
𝝌𝟐-merging
Cluster analysis
I. Binning
Binning is a top-down splitting technique based on a specified number of bins.
These methods are also used as discretization methods for numerosity reduction and concept hierarchy generation.
These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies.
Binning does not use class information and is therefore an unsupervised discretization
technique.
It is sensitive to the user-specified number of bins, as well as the presence of outliers.
2. Suppose we want to classify the tuples in D by partitioning on attribute A and some split-
point. Ideally, we would like this partitioning to result in an exact classification of the tuples.
For example, if we had two classes, we would hope that all of the tuples of, say, class C1 will
fall into one partition and all of the tuples of class C2 will fall into the other partition.
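A minimal sketch of one step of this idea (with illustrative values and class labels): it evaluates candidate split-points on A and picks the one that minimizes the expected (weighted) entropy of the two resulting partitions:

# Minimal sketch of entropy-based split-point selection on attribute A.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2          # midpoint candidate
        left  = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or info < best[1]:
            best = (split, info)
    return best

A      = [1, 2, 3, 10, 11, 12]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
print(best_split(A, labels))   # split near 6.5 separates C1 and C2 exactly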
Adjacent intervals with the least χ² values are merged together, because low χ² values for a pair indicate similar class distributions.
This merging process proceeds recursively until a predefined stopping criterion is met.
V. Cluster Analysis
Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the
values of A into clusters or groups.
Clustering can be used to generate a concept hierarchy for A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
In the top-down splitting strategy, each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
In the bottom-up merging strategy, clusters are formed by repeatedly grouping neighboring
clusters in order to form higher-level concepts.
VI. Discretization by intuitive partitioning
Although the above discretization methods are useful in the generation of numerical
hierarchies, many users would like to see numerical ranges partitioned into relatively uniform,
easy-to-read intervals that appear intuitive or natural.
The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural seeming
intervals. In general, the rule partitions a given range of data into 3, 4 or 5 relatively equal-
width intervals, recursively and level by level, based on the value range at the most significant
digit.
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9, and 3 intervals in the grouping of 2-3-2 for 7).
If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
The rule can be recursively applied to each interval, creating a concept hierarchy for the given
numerical attribute.
Real-world data often contain extremely large positive and/or negative outlier values, which
could distort any top-down discretization method, based on minimum and maximum data values.
Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.