MODULE 2
DATA PREPROCESSING
Data Preprocessing: Data Pre-processing Concepts, Data Cleaning, Data integration
and transformation, Data Reduction, Discretization and concept hierarchy.
----------------------------------------------------------------------------------------------------
Data preprocessing
Data preprocessing is the preparatory step for data mining. Preprocessing is
needed because data collected from different sources can be incomplete, noisy,
and inconsistent.
Incomplete data can occur because attributes of interest may not always be
available, relevant data may not have been recorded due to a misunderstanding
or equipment malfunctions, or data that were inconsistent with other recorded
data may have been deleted.
Reasons for noisy data (data having incorrect attribute values) include faulty
data collection instruments, human or computer errors at data entry, errors in
data transmission, and technology limitations.
Data Cleaning
Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.
a) Missing Values
Many tuples have no recorded value for several attributes. The methods used
are listed below (a short code sketch follows the list):
i. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective, unless the tuple contains several attributes with missing
values. It is especially poor when the percentage of missing values per
attribute varies considerably.
ii. Fill in the missing value manually: This approach is time-consuming
and may not be feasible given a large data set with many missing
values.
iii. Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”
or −∞.
iv. Use the attribute mean to fill in the missing value: For example,
suppose that the average income of customers is $56,000. Use this
value to replace the missing value for income.
v. Use the attribute mean for all samples belonging to the same class
as the given tuple: For example, if classifying customers according to
credit risk, replace the missing value with the average income of
customers in the same credit risk category as that of the given tuple.
vi. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
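A minimal sketch of methods (iii)–(v), assuming the data sit in a pandas DataFrame with hypothetical income and credit_risk columns:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, np.nan, 30000, np.nan, 58000],
})

# (iii) Fill with a global constant (here a sentinel number).
filled_const = df["income"].fillna(-1)

# (iv) Fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# (v) Fill with the mean of the samples in the same class (credit_risk group).
filled_class_mean = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
```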
b) Noisy Data
Noise is a random error or variance in a measured variable. The methods for
noise removal are:
i. Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing (a code sketch follows this list of methods).
ii. Regression
Data can be smoothed by fitting the data to a function, such as with
regression. Regression finds a mathematical function corresponding to a
set of data, so that all data points approximately fall on that function.
If the function is linear, we call it linear regression.
iii. Clustering:
Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers
iv. Combined computer and human inspection
Outliers can be identified through a combination of computer and human
inspection. Outliers may be informative or garbage.
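A minimal sketch of binning-based smoothing (method i), using made-up price values and an equal-frequency split:

```python
# Equal-frequency (equal-depth) binning with smoothing by bin means
# and by bin boundaries; the price values are illustrative.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3  # values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer bin boundary.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```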
c) Inconsistent data
There may be inconsistencies in the data recorded for some transactions.
Such errors may be corrected manually using external references.
Knowledge engineering tools may also be used to detect violations of known
data constraints.
Data integration
It combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat
files.
Issues to consider during data integration are:
a) Schema integration
b) Redundancy
c) Detection and resolution of data value conflicts.
Schema integration
Schema integration can be tricky because the same entity can be represented in
different forms in different database tables. This is referred to as the entity
identification problem.
Eg: cust_id in one database may be represented as cust_no in another.
Databases and data warehouses have metadata. Metadata helps to avoid errors in
schema integration.
Redundancy
Redundancy is another important issue. An attribute (such as annual revenue, for
instance) may be redundant if it can be “derived” from another attribute or set of
attributes. Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set. Some redundancies can be detected by
correlation analysis. Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the available data. For numerical
attributes, we can evaluate the correlation between two attributes A and B, by
computing the correlation coefficient:
$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{n\,\sigma_A \sigma_B} $$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the means of attributes A and B
respectively, and $\sigma_A$, $\sigma_B$ are the standard deviations of A and B.
If $r_{A,B}$ is greater than 0, then A and B are positively correlated, meaning that the
values of A increase as the values of B increase. The higher the value, the stronger
the correlation (i.e., the more each attribute implies the other). Hence, a high
value may indicate that A (or B) can be removed as a redundancy. If the resulting
value is equal to 0, then A and B are independent and there is no correlation
between them. If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the values of the other
attribute decrease. This means that each attribute discourages the other.
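A minimal sketch of this correlation check with NumPy, using hypothetical values for the two attributes:

```python
import numpy as np

# Hypothetical values of two numerical attributes A and B.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# Population standard deviations (ddof=0), matching the formula above.
r_AB = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())

print(r_AB)                     # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via NumPy's built-in Pearson r
```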
Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate
for mining. Data transformation can involve the following:
i. Smoothing, which works to remove noise from the data. Such techniques
include binning, regression, and clustering.
ii. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing
a data cube for analysis of the data at multiple granularities.
iii. Generalization of the data, where low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies.
For example, categorical attributes, like street, can be generalized to higher-
level concepts, like city or country. Values for numerical attributes, like
age, may be mapped to higher-level
concepts, like youth, middle-aged, and senior.
iv. Normalization, where the attribute data are scaled so as to fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
v. Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining
process.
There are many methods for data normalization (a short sketch follows the list):
a) min-max normalization,
b) z-score normalization,
c) normalization by decimal scaling.
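Minimal sketches of the three methods; the mean, standard deviation, and sample values used below are assumed only for illustration:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Normalize v using the attribute's mean and standard deviation."""
    return (v - mean) / std

def decimal_scaling(v, j):
    """Divide v by 10**j, where j is the smallest integer such that
    the largest absolute normalized value is below 1."""
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))   # 0.716...
print(z_score(73600, 54000, 16000))   # 1.225  (assumed mean and std)
print(decimal_scaling(-986, 3))       # -0.986 (j = 3, assuming max |value| is 986)
```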
Eg: Suppose that the minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to:
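Using the standard min-max formula, the example works out as:

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0.0) + 0.0 \approx 0.716 $$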
Data Reduction
Complex data analysis and mining on huge amounts of data can take a long
time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data.
Dimensionality reduction
Dimensionality reduction reduces the number of random variables or attributes
under consideration. Methods include:
a) Wavelet transforms
b) Principal component analysis
c) Attribute subset selection
a) Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically
different vector, X′, of wavelet coefficients. The two vectors are of the same
length. A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
The method is as follows (a Haar-wavelet sketch of the pairwise step appears after the steps):
1. The length, L, of the input data vector must be an integer power of 2.
This condition can be met by padding the data vector with zeros as
necessary (L ≥ n, where n is the original length).
2. Each transform involves applying two functions. The first applies some
data smoothing, such as a sum or weighted average. The second
performs a weighted difference, which acts to bring out the detailed
features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all
pairs of measurements (x_{2i}, x_{2i+1}). This results in two sets of data of
length L/2.
4. The two functions are recursively applied to the sets of data obtained in
the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the above iterations are
designated the wavelet coefficients of the transformed data.
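A minimal sketch of steps 2–4, assuming the Haar wavelet's simple pairwise averages and differences as the two functions (one common choice):

```python
def haar_dwt(x):
    """Recursively apply pairwise averages (smoothing) and pairwise
    differences (detail) until the smoothed part has length 1."""
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with zeros)"
    coeffs = []
    while len(x) > 1:
        averages = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        details = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        coeffs = details + coeffs  # keep the detail coefficients from every level
        x = averages               # recurse on the smoothed half
    return x + coeffs              # overall average followed by the details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```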
b) Principal Components Analysis (PCA)
PCA searches for a small set of orthogonal vectors (the principal components)
that best represent the data, sorted in order of decreasing significance. By
keeping only the strongest principal components, it should be possible to
reconstruct a good approximation of the original data.
Numerosity reduction
Numerosity reduction replaces the original data volume by alternative, smaller
forms of data representation. There are two techniques:
i. Parametric: A model is used to estimate the data, so only the model
parameters need to be stored instead of the actual data.
ii. Non-parametric: Reduced representations of the data are stored, such as
histograms, clusters, and samples (described in the sections that follow).
In linear regression, for example, the data are modeled to fit a straight line,
y = wx + b, where the variance of y is assumed to be constant. Here x and y are
numerical database attributes. The coefficients w and b (called regression
coefficients) specify the slope of the line and the y-intercept, respectively.
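As a sketch, the coefficients can be estimated with the standard least-squares formulas:

$$ w = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b = \bar{y} - w\,\bar{x} $$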
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction.
A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, or buckets.
If each bucket represents only a single attribute-value/frequency pair,
the buckets are called singleton buckets. Often, buckets instead
represent continuous ranges for the given attribute.
Eg: The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 2.18 shows a histogram for the data using singleton buckets.
There are several partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket
range is uniform
(such as the width of $10 for the buckets in Figure 2.19).
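A minimal sketch of building equal-width buckets of width $10 for the price list above (the bucket boundaries 1-10, 11-20, 21-30 are assumed for illustration):

```python
from collections import Counter

# The sorted AllElectronics price list from the example above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10  # equal-width buckets of $10
counts = Counter((p - 1) // width for p in prices)  # bucket index 0, 1, 2

for b in sorted(counts):
    low, high = b * width + 1, (b + 1) * width
    print(f"${low}-${high}: {counts[b]} items")  # 1-10: 13, 11-20: 25, 21-30: 14
```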
Clustering
Clustering techniques consider data tuples as objects. They partition
the objects into groups or clusters, so that objects within a cluster
are “similar” to one another and “dissimilar” to objects in other
clusters.
Similarity is commonly defined in terms of how “close” the objects
are in space, based on a distance function.
The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and
is defined as the average distance of each cluster object from the
cluster centroid.
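A minimal sketch of the two cluster-quality measures on a hypothetical cluster of 2-D points:

```python
import math

# A hypothetical cluster of 2-D data objects.
cluster = [(2.0, 3.0), (3.0, 3.0), (2.5, 4.0), (3.5, 4.5)]

# Diameter: the maximum distance between any two objects in the cluster.
diameter = max(math.dist(p, q) for p in cluster for q in cluster)

# Centroid distance: the average distance of each object from the centroid.
centroid = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
centroid_distance = sum(math.dist(p, centroid) for p in cluster) / len(cluster)

print(round(diameter, 3), round(centroid_distance, 3))
```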
Sampling
It allows a large data set to be represented by a much smaller random
sample (or subset) of the data. Suppose that a large data set, D, contains N
tuples.
The most common ways that we could sample D for data reduction are:
i. Simple random sample without replacement (SRSWOR) of size s
This is created by drawing s of the N tuples from D (s < N), where the
probability of drawing any tuple in D is 1/N, that is, all tuples are equally
likely to be sampled.
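A minimal sketch of drawing an SRSWOR from a hypothetical data set D (a with-replacement sample is shown only for contrast):

```python
import random

# Hypothetical data set D of N tuples.
D = [("tuple", i) for i in range(1000)]
s = 50

# Simple random sample without replacement (SRSWOR) of size s:
# each tuple can appear at most once in the sample.
srswor = random.sample(D, s)

# For contrast: a simple random sample with replacement (SRSWR) of size s.
srswr = [random.choice(D) for _ in range(s)]
```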
Data Compression
Data compression applies transformations to obtain a reduced or compressed
representation of the original data. There are 2 types:
i. Lossy: If we can reconstruct only an approximation of the original data,
the compression is called lossy.
ii. Lossless: If the original data can be reconstructed from the compressed
data without loss of information, the data reduction is called lossless.
Binning
Binning is a top-down splitting technique based on a specified number of bins.
These methods are also used as discretization methods for numerosity reduction
and concept hierarchy generation. For example, attribute values can be discretized
by applying equal-width or equal-frequency binning, and then replacing each bin
value by the bin mean or median, as in smoothing by bin means or smoothing by
bin medians. These techniques can be applied recursively to the resulting partitions
in order to generate concept hierarchies.
Histogram Analysis
Histogram analysis is an unsupervised discretization technique because it does not
use class information. Histograms partition the values for an attribute, A, into
disjoint ranges called buckets. In an equal-width histogram, for example, the
values are partitioned into equal-sized partitions or ranges. With an equal
frequency histogram, the values are partitioned so that each partition contains the
same number of data tuples. The histogram analysis algorithm can be applied
recursively to each partition in order to automatically generate a multilevel concept
hierarchy, with the procedure terminating once a prespecified number of concept
levels has been reached. A minimum interval size can also be used per level to
control the recursive procedure.
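A minimal sketch of equal-frequency partitioning, using made-up attribute values:

```python
# Partition sorted values so each partition holds the same number of tuples.
values = sorted([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
k = 3                       # desired number of partitions
size = len(values) // k     # tuples per partition (assumes len is divisible by k)
partitions = [values[i * size:(i + 1) * size] for i in range(k)]

print(partitions)  # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```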
Entropy-Based Discretization
Entropy-based discretization is a supervised, top-down splitting technique. Given a
set of data tuples S, the basic method for entropy-based discretization of an
attribute A is as follows:
1. Each value of A can be considered as a potential interval boundary or
threshold T, partitioning the samples in S into two subsets satisfying
A < T and A ≥ T, respectively.
2. Given S, the threshold value selected is the one that maximizes the
information gain resulting from the subsequent partitioning, i.e., that
minimizes the expected information requirement

$$ I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2) $$

where $S_1$ and $S_2$ correspond to the samples in S satisfying the conditions
A < T and A ≥ T, respectively. The entropy function Ent for a given set is
calculated based on the class distribution of the samples in the set. For example,
given m classes, the entropy of $S_1$ is

$$ \mathrm{Ent}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$

where $p_i$ is the probability of class i in $S_1$, determined by dividing the
number of samples of class i in $S_1$ by the total number of samples in $S_1$.
The value of $\mathrm{Ent}(S_2)$ can be computed similarly.
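A minimal sketch of choosing the split threshold by this criterion, on hypothetical (age, class) samples:

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return the threshold T minimizing I(S, T), i.e. maximizing the
    information gain Ent(S) - I(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_i = None, float("inf")
    for k in range(1, len(pairs)):
        t = pairs[k][0]                              # candidate boundary: A >= t
        left = [lab for v, lab in pairs if v < t]    # S1
        right = [lab for v, lab in pairs if v >= t]  # S2
        if not left or not right:
            continue
        i_st = (len(left) / len(pairs)) * entropy(left) \
             + (len(right) / len(pairs)) * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t

# Hypothetical samples: attribute values paired with class labels.
ages = [23, 25, 30, 35, 40, 45, 50, 55]
risk = ["high", "high", "high", "low", "low", "low", "low", "low"]
print(best_threshold(ages, risk))  # 35, which separates the two classes exactly
```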
Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numerical attribute, A, by partitioning the
values of A into clusters or groups. Clustering takes the distribution of A into
consideration, as well as the closeness of data points, and therefore is able to
produce high-quality discretization results. Clustering can be used to generate a
concept hierarchy for A by following either a top-down splitting strategy or a
bottom-up merging strategy, where each cluster forms a node of the concept
hierarchy.
Segmentation by Natural Partitioning
The 3-4-5 rule can be used to segment numerical data into relatively uniform,
natural-seeming intervals. In general, the rule partitions a given range of data
into 3, 4, or 5 relatively equal-width intervals, recursively and level by level,
based on the value range at the most significant digit.
The rule is as follows:
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit,
then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and
9; and 3 intervals in the grouping of 2-3-2 for 7).
If it covers 2, 4, or 8 distinct values at the most significant digit, then
partition the range into 4 equal-width intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, then
partition the range into 5 equal-width intervals.
For example, if the values of an attribute range from 0 to 700, the most
significant (hundreds) digit covers 7 distinct values, so the range is partitioned
into 3 intervals using the 2-3-2 grouping: (0, 200], (200, 500], (500, 700].
The rule can be recursively applied to each interval, creating a concept
hierarchy for the given numerical attribute. Real-world data often contain
extremely large positive and/or negative outlier values, which could distort any
top-down discretization method based on minimum and maximum data values.
QUESTIONS
1. Why do we need data transformation? What are the different ways of data
transformation?
2. Where do we use Linear regression? Explain linear regression.
3. Summarize the various pre-processing activities involved in data mining.
Use the two methods below to normalize the following group of data:
1000, 2000, 3000, 5000, 9000
i) min-max normalization by setting min = 0 and max = 1
ii) z-score normalization
4. What is data cleaning? Describe the approaches to fill missing values.
5. What is noisy data? Explain the binning methods for data smoothing.
6. What is data normalization? Explain any two normalization methods.
7. What is attribute selection measure? Briefly describe the attribute selection
measures for decision tree induction.
8. What is a concept hierarchy?
9. The following data is given in increasing order for the attribute age:
13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,36,40,45,46,
52,70.
a) Use smoothing by bin boundaries to smooth these data, using a bin
depth of 3.
b) How might you determine outliers in the data?
c) What other methods are there for data smoothing?
10. Explain the following procedures for attribute subset selection
a) Stepwise forward selection
b) Stepwise backward elimination
c) A combination of forward selection and backward elimination
11. Real-world data tend to be incomplete, noisy, and inconsistent. What are
the various approaches adopted to clean the data?
12. Summarize the various pre-processing activities involved in data mining.