
III CSE DWDM -II

UNIT –II
Data preprocessing
Introduction
Data preprocessing describes any type of processing performed on raw data to prepare it for
another processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively processed for
the purpose of the user.

2.1 Why Preprocess the Data ?


Data in the real world is dirty: it is often incomplete, noisy, and inconsistent. Such data needs to be preprocessed in order to improve the quality of the data and, consequently, the quality of the mining results.
 If there is no quality data, then there are no quality mining results; quality decisions must be based on quality data.
 If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase becomes more difficult.
 Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data. e.g., occupation=“ ”.
 Noisy data: containing errors or outliers data. e.g., Salary=“-10”
 Inconsistent data: containing discrepancies in codes or names. e.g., Age=“42”
Birthday=“03/07/1997”
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection by instruments
 Human or computer error at data entry

 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing

Forms of data preprocessing.


2.3 Data cleaning:

Data cleaning routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.

2.3.1 Missing values


The various methods for handling the problem of missing values in data tuples include:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

2. Fill in the missing values manually: This approach is time-consuming and may not be feasible
given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “unknown” or -∞.

4. Use the attribute mean to fill in the missing value: For example, suppose that the average income
of customers is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in the data set, a decision tree can be constructed to predict the missing values for income.
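As an illustration of strategies 3-5 above, here is a minimal Python sketch using pandas, with a small hypothetical customer table (the column names income and credit_risk are assumptions for the example, not part of the original notes):

    import pandas as pd
    import numpy as np

    # hypothetical customer data with missing income values
    df = pd.DataFrame({
        "income": [56000, np.nan, 42000, np.nan, 61000],
        "credit_risk": ["low", "low", "high", "high", "low"],
    })

    # strategy 3: fill with a global constant / sentinel value
    filled_constant = df["income"].fillna(-1)

    # strategy 4: fill with the overall attribute mean
    filled_mean = df["income"].fillna(df["income"].mean())

    # strategy 5: fill with the mean income of tuples in the same credit-risk class
    filled_class_mean = df["income"].fillna(
        df.groupby("credit_risk")["income"].transform("mean"))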
2.3.2 Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used to remove such noise.
Several Data smoothing techniques:
1. Binning methods:
Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. The sorted values are distributed into a number of “buckets”, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In this technique,
1. The data are first sorted

2. Then the sorted list is partitioned into equal-depth (equal-frequency) bins.
3. Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
a. Smoothing by bin means: Each value in the bin is replaced by the mean value of the bin.
b. Smoothing by bin medians: Each value in the bin is replaced by the bin median.
c. Smoothing by bin boundaries: The minimum and maximum values of a bin are identified as the bin boundaries. Each bin value is replaced by the closest boundary value.

 Example: Binning Methods for Data Smoothing


Remove the noise in the following price data using smoothing techniques:
8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34
o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
o Partition into equal-depth bins (depth of 4, since each bin contains four values):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
o Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
o Smoothing by bin medians (the median of an even-sized bin is the average of its two middle values):
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
o Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by
the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Suppose that the data for analysis include the attribute age. The age values for the data tuples
are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.
Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means with a bin
depth of 3.
• Step 1: Sort the data. (This step is not required here as the data are already sorted.)

• Step 2: Partition the data into equi depth bins of depth 3.


Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70

• Step 3: Calculate the arithmetic mean of each bin.

• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14.7, 14.7, 14.7 Bin 2: 18.3, 18.3, 18.3 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26.7, 26.7, 26.7 Bin 6: 33.7, 33.7, 33.7
Bin 7: 35, 35, 35 Bin 8: 40.3, 40.3, 40.3 Bin 9: 56, 56, 56
This technique smooths out the noise in the data, but it also smooths away some detail, since each group of three neighboring values is collapsed to a single mean value.
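A minimal Python sketch of the equal-depth binning and smoothing steps above, using numpy and the same sorted age data (a bin depth of 3 is assumed, as in the exercise):

    import numpy as np

    ages = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
                     30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

    depth = 3
    bins = ages.reshape(-1, depth)                  # data already sorted: 27 values -> 9 bins of 3

    # smoothing by bin means: every value in a bin becomes the bin mean
    smoothed_by_means = np.repeat(bins.mean(axis=1), depth)

    # smoothing by bin boundaries: every value snaps to the closer of its bin's min or max
    lo = bins.min(axis=1, keepdims=True)
    hi = bins.max(axis=1, keepdims=True)
    smoothed_by_boundaries = np.where(bins - lo <= hi - bins, lo, hi).ravel()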

2. Clustering:
Outliers in the data may be detected by clustering, where similar values are organized into
groups, or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers.


Outliers detected by clustering analysis


3. Regression:
Smooth by fitting the data to regression functions.
 Linear regression involves finding the “best” line to fit two variables, so that one variable can be used to predict the other.

 Multiple linear regression is an extension of linear regression, where more than two
variables are involved and the data are fit to a multidimensional surface.
2.3.3 Inconsistent Data
Inconsistencies may exist in the data stored in transactions. They can arise from errors during data entry, functional dependency violations between attributes, and missing values.
The inconsistencies can be detected and corrected either manually or with knowledge engineering tools.
2.3.4 Data cleaning as a process
Data cleaning is a two-step process:
1. Discrepancy detection

2. Data transformations

1. Discrepancy detection

The first step in data cleaning is discrepancy detection. It uses knowledge of the metadata and examines rules such as the following to detect discrepancies.
Unique rule – each value of the given attribute must be different from all other values for that attribute.
Consecutive rule – there can be no missing values between the lowest and highest values for the attribute, and all values must also be unique.
Null rule – specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
Discrepancy detection Tools:
 Data scrubbing tools – use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data.

 Data auditing tools – analyze the data to discover rules and relationships, and detect data that violate such conditions.

2. Data transformations:
This is the second step in data cleaning as a process. After detecting discrepancies, we need to
define and apply (a series of) transformations to correct them.
Data Transformations Tools:
 Data migration tools – allow simple transformations to be specified, such as replacing the string “gender” by “sex”.

 ETL (Extraction/Transformation/Loading) tools – allow users to specify transformations through a graphical user interface (GUI).

2.3.5 Disadvantages of the Data Cleaning Process

* Nested discrepancies: some discrepancies can only be detected after others have been corrected, so detection and transformation must be iterated.
* Lack of interactivity: the two steps are typically run as separate, error-prone batch passes; newer tools aim to address this by increasing interactivity.

2.4 Data Integration


It combines data from multiple sources into a coherent store. There are a number of issues to consider during data integration.
Issues:


 Schema integration: refers to the integration of metadata from different sources.


 Entity identification problem: Identifying real-world entities from multiple data sources. For example, customer_id in one database and customer_no in another database may refer to the same entity.
 Detecting and resolving data value conflicts: Attribute values from different sources can differ due to different representations or different scales, e.g., metric vs. British units.
 Redundancy: is another issue while performing data integration. Redundancy can occur due to the following reasons:
 Object identification: The same attribute may have different names in different databases.
 Derived data: One attribute may be derived from another attribute.

Handling redundant data in data integration


1. Correlation analysis
For numeric data
Some redundancy can be identified by correlation analysis. The correlation between two attributes A and B can be measured by computing the correlation coefficient

    r(A,B) = Σi (ai − Ā)(bi − B̄) / (n · σA · σB)
where
- n is the number of tuples,
- Ā is the mean value of A,
- B̄ is the mean value of B,
- σA is the standard deviation of A, and
- σB is the standard deviation of B.

If
r(A,B) > 0, then A and B are positively correlated;
r(A,B) < 0, then A and B are negatively correlated;
r(A,B) = 0, then there is no (linear) correlation between A and B.

- r(A,B) is also called Pearson’s product-moment coefficient.
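A minimal Python sketch of the correlation coefficient above, on two hypothetical attribute columns (np.std defaults to the population standard deviation, matching the n in the denominator):

    import numpy as np

    A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical values of attribute A
    B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])   # hypothetical values of attribute B

    n = len(A)
    r = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())
    print(r)   # equivalent to np.corrcoef(A, B)[0, 1]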


For categorical or nominal data

For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 (chi-square) test. Suppose A has c distinct values, namely a1, a2, …, ac, and B has r distinct values, namely b1, b2, …, br. The data tuples described by A and B can be shown as a contingency table.
The χ2 value (also known as the Pearson χ2 statistic) is computed as

    χ2 = Σi Σj (oij − eij)² / eij

where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as

    eij = count(A = ai) × count(B = bj) / n

where n is the number of data tuples.
Example:
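A minimal Python sketch of the χ2 computation on a hypothetical 2×2 contingency table (rows are the values of B, columns the values of A; the counts are illustrative, not taken from the original figure):

    import numpy as np

    observed = np.array([[250,  200],      # hypothetical observed counts o_ij
                         [ 50, 1000]])

    n = observed.sum()
    # e_ij = (row total x column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    print(chi2)   # compare against a chi-square distribution with (r-1)(c-1) degrees of freedom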


2.5 Data Transformation


Data transformation can involve the following:
 Smoothing: which works to remove noise from the data

 Aggregation: where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute weekly and annual total amounts.
 Generalization of the data: where low-level or “primitive” (raw) data are replaced by higher-
level concepts through the use of concept hierarchies. For example, categorical attributes, like
street, can be generalized to higher-level concepts, like city or country.
 Normalization: where the attribute data are scaled so as to fall within a small specified range,
such as −1.0 to 1.0, or 0.0 to 1.0.
 Attribute construction (feature construction): this is where new attributes are constructed
and added from the given set of attributes to help the mining process.
Normalization
In normalization, data are scaled to fall within a small, specified range. This is useful for classification algorithms involving neural networks and for distance-based methods such as nearest-neighbor classification and clustering.
There are 3 methods for data normalization. They are:
1. min-max normalization
2. z-score normalization
3. normalization by decimal scaling
1. Min-max normalization:
performs a linear transformation on the original data values.
Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v′ in the range [new_minA, new_maxA] by computing

    v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

where
v is the value to be normalized,
minA, maxA are the minimum and maximum values of attribute A, and
new_minA, new_maxA define the target normalization range.

Min-max normalization preserves the relationships among the original data values. It will
encounter an ”out-of-bounds ” error if a future input case for normalization falls outside of the original
data range for A.

Example: Suppose that the minimum and maximum values for the attribute income are $1,000 and $15,000, respectively, and that we want to map income to the range [0.0, 1.0]. By min-max normalization, a value of $12,000 for income is transformed to (12,000 − 1,000)/(15,000 − 1,000) × (1.0 − 0.0) + 0.0 ≈ 0.786.
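A minimal Python sketch of min-max normalization, using the income figures from the example above:

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        # linearly map v from [min_a, max_a] onto [new_min, new_max]
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    print(min_max(12000, 1000, 15000))        # about 0.786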

Example: Given the one-dimensional data set X = {−5.0, 23.0, 17.6, 9.23, 1.11}, normalize the data set using
(a) min-max normalization on the interval [0, 1],
(b) min-max normalization on the interval [−1, 1],
(c) standard deviation normalization.
a) Min-max normalization on [0, 1], with minA = −5.0 and maxA = 23.0 (range 28.0), gives X′ = {0.0, 1.0, 0.807, 0.508, 0.218}.


2. Z-score normalization / zero-mean normalization:


In which values of an attribute A are normalized based on the mean and standard deviation of
A. It can be defined as,

    v′ = (v − meanA) / stand_devA
This method is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: Suppose that the mean and standard deviation of the values for the attribute income are $52,000 and $14,000, respectively. With z-score normalization, a value of $72,000 for income is transformed to (72,000 − 52,000)/14,000 ≈ 1.43.
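A minimal Python sketch of z-score normalization, using the figures from the example above:

    def z_score(v, mean_a, std_a):
        # how many standard deviations v lies above the attribute mean
        return (v - mean_a) / std_a

    print(z_score(72000, 52000, 14000))       # about 1.43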

3. Normalization by decimal scaling: normalizes by moving the decimal point of the values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v′ by computing

    v′ = v / 10^j

where j is the smallest integer such that max(|v′|) < 1.
Example: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
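A minimal Python sketch of decimal scaling, assuming j is chosen as the smallest integer for which all scaled values have absolute value below 1:

    import math

    def decimal_scale(values):
        max_abs = max(abs(v) for v in values)
        j = math.floor(math.log10(max_abs)) + 1     # smallest j with max(|v|) / 10**j < 1
        return [v / 10 ** j for v in values], j

    print(decimal_scale([-986, 917]))               # ([-0.986, 0.917], 3)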

2.6 Data Reduction techniques


These techniques can be applied to obtain a reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity of the original data.
 Data reduction includes,

1. Data cube aggregation, where aggregation operations are applied to the data in the construction of
a data cube.

2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.

3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.

4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations.

5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced
by ranges or higher conceptual levels

2.6.1 Data cube aggregation:


Data cube aggregation, where aggregation operations are applied to the data for construction
of a data cube.
 Data cubes store multidimensional aggregated information.
 Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.
 Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple level of
abstraction.
 Data cubes provide fast access to pre computed summarized data, thereby benefiting on-line
analytical processing as well as data mining.
 The cube can be created in three ways:
1. Base cuboid – the cube created at the lowest level of abstraction is referred to as the base cuboid.

2. Lattice of cuboids – data cubes created for varying levels of abstraction are often referred to as cuboids, and together they form a lattice of cuboids.

3. Apex cuboid – the cuboid at the highest level of abstraction is the apex cuboid.

The following database consists of sales per quarter for the years 1997-1999.


 Suppose the analyst is interested in the annual sales rather than sales per quarter. The above data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
 The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
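A minimal pandas sketch of this roll-up from quarterly to annual sales (the sales figures are hypothetical, not the ones in the original table):

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [1997] * 4 + [1998] * 4 + [1999] * 4,
        "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
        "amount":  [224, 408, 350, 586, 300, 416, 380, 610, 310, 430, 390, 620],
    })

    # aggregate to the coarser year level, as a data-cube roll-up would
    annual_sales = sales.groupby("year", as_index=False)["amount"].sum()
    print(annual_sales)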

2.6.2 Attribute sub selection / Feature selection


 Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes
(or dimensions).
 The goal of attribute subset selection is to find a minimum set of attributes.
 It reduces the number of attributes appearing in the discovered patterns, helping to make the
patterns easier to understand.

To find out a ‘good’ subset from the original attributes


 For n attributes, there are 2^n possible subsets.
 An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially
as n and the number of data classes increase.
 Therefore, heuristic methods that explore a reduced search space are commonly used for attribute
subset selection.
Techniques for heuristic methods of attribute sub set selection
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
1. Step-wise forward selection:

The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set (a sketch of this greedy search appears after this list of techniques).
2. Step-wise backward elimination:
The procedure starts with the full set of attributes. At each step, it removes the worst attribute
remaining in the set.
3. Combination forward selection and backward elimination:
The step-wise forward selection and backward elimination methods can be combined, where
at each step one selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction:
 Decision tree induction constructs a flow-chart-like structure where each internal (non-leaf) node
denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external
(leaf) node denotes a class prediction.
 At each node, the algorithm chooses the “best" attribute to partition the data into individual
classes.
 When decision tree induction is used for attribute subset selection, a tree is constructed from the
given data.
 All attributes that do not appear in the tree are assumed to be irrelevant.
 The set of attributes appearing in the tree form the reduced subset of attributes.
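A minimal sketch of the greedy forward-selection search described above. The scoring function score_subset is an assumed placeholder for whatever attribute-relevance measure is used (e.g., a statistical significance test or the accuracy of a simple classifier):

    def forward_selection(attributes, score_subset, k):
        # greedy stepwise forward selection of at most k attributes;
        # score_subset(list_of_attributes) is assumed to return a quality score (higher is better)
        selected, remaining = [], list(attributes)
        while remaining and len(selected) < k:
            best = max(remaining, key=lambda a: score_subset(selected + [a]))
            selected.append(best)
            remaining.remove(best)
        return selected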

2.6.3 Dimensionality reduction

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
Dimension Reduction Types
 Lossless - If the original data can be reconstructed from the compressed data without any loss
of information

 Lossy - If the original data can be reconstructed from the compressed data with loss of
information, then the data reduction is called lossy.

Effective methods in lossy dimensional reduction


1. Wavelet transforms

2. Principal components analysis.

1. Wavelet compression is a form of data compression well suited for image compression.
 The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector, D′, of wavelet coefficients.
The general algorithm for a discrete wavelet transform is as follows.
1. The length, L, of the input data vector must be an integer power of two. This condition can be
met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions:
 data smoothing
 calculating weighted difference
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length
L/2.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until
the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated the
wavelet coefficients of the transformed data.
 Wavelet coefficients larger than some user-specified threshold are retained; the remaining coefficients are set to 0.

Haar-2 and Daubechies-4 are two popular wavelet transforms.
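A minimal sketch of one- and multi-level Haar transforms (pairwise smoothing and differencing), assuming the input length is a power of two; libraries such as PyWavelets provide production implementations:

    import numpy as np

    def haar_step(x):
        # one DWT level: pairwise weighted sums (smooth) and differences (detail)
        pairs = np.asarray(x, dtype=float).reshape(-1, 2)
        smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
        return smooth, detail

    def haar_dwt(x):
        # recursively transform the smoothed half until a single coefficient remains
        smooth, details = np.asarray(x, dtype=float), []
        while len(smooth) > 1:
            smooth, d = haar_step(smooth)
            details.append(d)
        return smooth, details[::-1]

    approx, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])   # input length must be a power of two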


2. Principal Component Analysis (PCA)


 It is also called as Karhunen-Loeve (K-L) method.
 The basic procedure is as follows:
1. The input data are normalized.

2. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These
are unit vectors that each point in a direction perpendicular to the others.

3. The principal components are sorted in order of decreasing significance or strength.

Principal components analysis


 In the above figure, Y1 and Y2 are the first two principal components for the given set of data, which was originally mapped to the axes X1 and X2.
 This information helps identify groups or patterns within the data.
 The sorted axes are such that the first axis shows the most variance among the data, the second
axis shows the next highest variance, and so on.
 The size of the data can be reduced by eliminating the weaker components.
Advantage of PCA
1. PCA is computationally inexpensive

2. Multidimensional data of more than two dimensions can be handled by reducing the problem
to two dimensions.

3. Principal components may be used as inputs to multiple regression and cluster analysis.
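A minimal numpy sketch of the PCA steps above on hypothetical 2-D data (scikit-learn's PCA class offers an equivalent, more complete implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])   # hypothetical correlated data

    Xc = X - X.mean(axis=0)                       # step 1: normalize (center) the input data
    cov = np.cov(Xc, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov)      # step 2: orthonormal principal axes

    order = np.argsort(eig_vals)[::-1]            # step 3: sort by decreasing variance (strength)
    components = eig_vecs[:, order]

    reduced = Xc @ components[:, :1]              # keep only the strongest component to reduce the data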

2.6.4 Numerosity Reduction


Data volume can be reduced by choosing alternative, smaller forms of data representation. These techniques may be:
 Parametric method
 Non parametric method
Parametric: Assume the data fits some model, then estimate model parameters, and store only the
parameters, instead of actual data.
Non parametric: In which histogram, clustering and sampling is used to store reduced form of data.
Parametric model
1. Regression

 Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a random variable Y (called a response variable) can be modeled as a linear function of another random variable X (called a predictor variable), with the equation Y = αX + β, where the variance of Y is assumed to be constant. The coefficients α and β (called regression coefficients) specify the slope of the line and the Y-intercept, respectively. (A sketch of fitting such a line appears at the end of this subsection.)
 Multiple- linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a response
variable Y, to be modeled as a linear function of two or more predictor variables.
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a multidimensional
space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
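A minimal numpy sketch of the linear regression model Y = αX + β on hypothetical data; for numerosity reduction, only the two fitted coefficients need to be stored:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor variable X
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # response variable Y

    alpha, beta = np.polyfit(x, y, deg=1)        # least-squares slope and intercept
    y_estimated = alpha * x + beta               # reconstruct/estimate Y from the stored parameters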

Non-parametric models
1. Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Example:
The following data are a list of prices of commonly sold items. The numbers have been sorted.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

The buckets are displayed along a horizontal axis, while the height of a bucket represents the average frequency of the values it represents.

Draw a histogram plot for price where each bucket has an equal width of 10 (i.e., the ranges 1-10, 11-20, and 21-30).

The buckets can be determined based on the following partitioning rules:
1. Equal-width: the width of each bucket range is uniform.
2. Equal-depth (equal-frequency): each bucket contains roughly the same number of contiguous data samples.
3. V-Optimal: the histogram with the least variance, for a given number of buckets.
4. MaxDiff: bucket boundaries are placed between the pairs of adjacent values with the largest differences.

V-Optimal: The V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.
MaxDiff: Here we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for pairs having the β−1 largest differences, where β is the user-specified number of buckets.
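A minimal numpy sketch of the equal-width histogram requested above (buckets 1-10, 11-20, 21-30) for the same price list:

    import numpy as np

    prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
              15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
              21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

    # equal-width buckets of width 10: 1-10, 11-20, 21-30
    counts, edges = np.histogram(prices, bins=[0.5, 10.5, 20.5, 30.5])
    print(counts)          # number of prices falling in each bucket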

2. Clustering techniques

 Consider data tuples as objects. They partition the objects into groups or clusters, so that objects
within a cluster are “similar" to one another and “dissimilar" to objects in other clusters.
 Similarity is commonly defined in terms of how “close" the objects are in space, based on a distance
function.

 Quality of clusters measured by their diameter (max distance between any two objects in the
cluster) or centroid distance (avg. distance of each cluster object from its centroid)

3. Sampling
 Sampling can be used as a data reduction technique since it allows a large data set to be represented
by a much smaller random sample (or subset) of the data.
 Suppose that a large data set, D, contains N tuples. Let's have a look at some possible samples for
D.


1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is
drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters", then a SRS of
m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a
page at a time, so that each page can be considered a cluster. A reduced data representation can be
obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called “strata”, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
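A minimal pandas sketch of SRSWOR, SRSWR, and a stratified sample (the customer table and its age_group column are hypothetical):

    import pandas as pd

    D = pd.DataFrame({
        "customer_id": range(1, 101),
        "age_group": ["youth"] * 20 + ["middle_aged"] * 50 + ["senior"] * 30,
    })

    srswor = D.sample(n=10, replace=False, random_state=1)   # simple random sample without replacement
    srswr  = D.sample(n=10, replace=True,  random_state=1)   # simple random sample with replacement

    # stratified sample: an SRS of 10% drawn within each age_group stratum
    stratified = D.groupby("age_group", group_keys=False).apply(
        lambda stratum: stratum.sample(frac=0.1, random_state=1))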
Advantages of sampling

1. An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, n, as opposed to N, the data set size. Hence, sampling
complexity is potentially sub-linear to the size of the data.
2. When applied to data reduction, sampling is most commonly used to estimate the answer to an
aggregate query.

2.7 DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION


2.7.1 DATA DISCRETIZATION
 Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
 Interval labels can then be used to replace actual data values.
Features of Data Discretization
 Discretization leads to a concise, easy-to-use, knowledge-level representation of mining results.


Categories of Data Discretization
 Supervised discretization – uses class information.

 Unsupervised discretization – does not use class information.

 Top-down discretization or splitting – the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.

 Bottom-up discretization or merging – the process starts by considering all of the continuous values as potential split-points, removes some by merging neighboring values to form intervals, and then applies this process recursively to the resulting intervals.

2.7.2 Concept Hierarchy


 A concept hierarchy for a given numerical attribute defines a discretization of the attribute.
 Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior).


2.7.3 Discretization and Concept Hierarchy Generation for Numerical Data


 Concept hierarchies for numerical attributes can be constructed automatically based on data
discretization.
Methods for handling numerical data over concept hierarchy
 Binning

 Histogram analysis

 Entropy-based discretization

 𝝌𝟐-merging

 Cluster analysis

 Discretization by intuitive partitioning

I. Binning
 Binning is a top-down splitting technique based on a specified number of bins.
 These methods are also used as discretization methods for numerosity reduction and concept hierarchy generation.
 These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies.
 Binning does not use class information and is therefore an unsupervised discretization
technique.
 It is sensitive to the user-specified number of bins, as well as the presence of outliers.

II. Histogram Analysis
 Like binning, histogram analysis is an unsupervised discretization technique because it does
not use class information.
 Histograms partition the value for an attribute A, into disjoint ranges called buckets.
 The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels has been reached.

III. Entropy-Based Discretization


 Entropy is one of the most commonly used discretization measures.
 Entropy-based discretization is a supervised, top-down splitting technique.
 It explores class distribution information in its calculation and determination of split-points.
 To discretize a numerical attribute, A, the method selects the value of A that has the minimum
entropy as a split-point, and recursively partitions the resulting intervals to arrive at a
hierarchical discretization. Such discretization forms a concept hierarchy for A.
 Let D consist of data tuples defined by a set of attributes and a class-label attribute.
 The basic method for entropy-based discretization of an attribute A within the set is as
follows:
1. Each value of A can be considered as a potential interval boundary or split-point to partition the range of A. That is, a split-point for A can partition the tuples in D into two subsets satisfying the conditions A ≤ split_point and A > split_point, respectively, thereby creating a binary discretization.

2. Suppose we want to classify the tuples in D by partitioning on attribute A and some split-point. Ideally, we would like this partitioning to result in an exact classification of the tuples. For example, if we had two classes, we would hope that all of the tuples of, say, class C1 will fall into one partition and all of the tuples of class C2 will fall into the other partition. The expected information requirement for classifying a tuple in D based on partitioning by A is InfoA(D) = |D1|/|D| · Entropy(D1) + |D2|/|D| · Entropy(D2), where D1 and D2 are the two partitions and Entropy(Di) = −Σ pi log2(pi) over the classes; the split-point that minimizes this expected information requirement is selected.

3. The process of determining a split-point is recursively applied to each partition obtained, until some stopping criterion is met, such as when the minimum information requirement on all candidate split-points is less than a small threshold, ε, or when the number of intervals is greater than a threshold, max_interval.
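A minimal Python sketch of choosing a single binary split-point by minimum expected information requirement, on hypothetical values and class labels:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split_point(values, labels):
        data = sorted(zip(values, labels))
        n = len(data)
        best_info, best_split = float("inf"), None
        for split in sorted(set(values))[:-1]:     # the largest value would leave an empty right side
            left  = [lab for v, lab in data if v <= split]
            right = [lab for v, lab in data if v > split]
            # Info_A(D) = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2)
            info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if info < best_info:
                best_info, best_split = info, split
        return best_split, best_info

    ages   = [13, 15, 16, 19, 20, 22, 25, 30, 35, 40, 46, 52, 70]                # hypothetical values
    labels = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y", "y", "y", "y"]   # hypothetical classes
    print(best_split_point(ages, labels))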

IV. Interval merging by 𝝌𝟐 Analysis
 ChiMerge employs a bottom-up approach by finding the best neighboring intervals and then merging these to form larger intervals, recursively.
 The method is supervised in that it uses class information.
 ChiMerge proceeds as follows:
 Initially, each distinct value of a numerical attribute A is considered to be one interval.

 χ2 tests are performed for every pair of adjacent intervals.

 Adjacent intervals with the lowest χ2 values are merged together, because low χ2 values for a pair indicate similar class distributions.

 This merging process proceeds recursively until a predefined stopping criterion is met.

V. Cluster Analysis
 Cluster analysis is a popular data discretization method.
 A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the
values of A into clusters or groups.
 Clustering can be used to generate a concept hierarchy for A by following either a top down
splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the
concept hierarchy.
 In the top down splitting strategy, each initial cluster or partition may be further decomposed
into several sub clusters, forming a lower level of the hierarchy.
 In the bottom-up merging strategy, clusters are formed by repeatedly grouping neighboring
clusters in order to form higher-level concepts.
VI. Discretization by intuitive partitioning
 Although the above discretization methods are useful in the generation of numerical
hierarchies, many users would like to see numerical ranges partitioned into relatively uniform,
easy-to-read intervals that appear intuitive or natural.
 The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural seeming
intervals. In general, the rule partitions a given range of data into 3, 4 or 5 relatively equal-
width intervals, recursively and level by level, based on the value range at the most significant
digit.

The rule is as follows:
 If an interval covers 3,6,7, or 9 distinct values at the most significant digit, then partition the
range into 3 intervals(3 equal-width intervals for 3,6, and 9; and 3 intervals in the grouping of
2-3-2 for 7)

 If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4
equal-width intervals.

 If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into
5 equal-width intervals.
 The rule can be recursively applied to each interval, creating a concept hierarchy for the given
numerical attribute.
 Real-world data often contain extremely large positive and/or negative outlier values, which
could distort any top-down discretization method, based on minimum and maximum data values.

2.7.4 Concept Hierarchy Generation for Categorical Data


 Categorical data are discrete data.
 Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
 Methods of generating a concept hierarchy for categorical data.
 Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.

 Specification of a portion of a hierarchy by explicit data grouping: This is essentially the


manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to
define an entire concept hierarchy by explicit value enumeration. However, it is realistic to
specify explicit groupings for a small portion of intermediate-level data.

 Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
