Syllabus: Data Warehousing and Data Mining
UNIT-II
DATA PREPROCESSING
1. Preprocessing
Real-world databases are highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size (often several gigabytes or more) and their likely origin from multiple,
heterogeneous sources. Low-quality data will lead to low-quality mining results, so data
preprocessing techniques are applied to improve the quality of the data before mining.
Data Preprocessing Techniques
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as a
data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering, for instance. These techniques are not mutually exclusive; they
may work together.
* Data transformations, such as normalization, may be applied.
Need for preprocessing
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
Incomplete data can occur for a number of reasons:
* Attributes of interest may not always be available.
* Relevant data may not have been recorded due to a misunderstanding, or because of equipment malfunctions.
* Data that were inconsistent with other recorded data may have been deleted.
* Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Noisy (incorrect) data can also occur for a number of reasons:
* The data collection instruments used may be faulty.
* There may have been human or computer errors at data entry.
* Errors in data transmission can also occur.
* There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files. Some
attributes representing a given concept may have different names in different databases, causing
inconsistencies and redundancies.
Data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that contribute toward the success of
the mining process.
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
2. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data.
Missing Values
Many tuples may have no recorded value for several attributes, such as customer income, so the
missing values for these attributes need to be filled in.
The following methods are useful for filling in the missing values (a short sketch of methods 3 and 4 follows the list):
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or -∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace the missing value for
income.
5. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision tree
can be constructed to predict the missing values for income.
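A minimal pandas sketch of methods 3 and 4 above, using a small hypothetical customer table; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "age": [23, 35, 45, 29, 52],
    "income": [30000, None, 56000, None, 82000],
})

# Method 3: fill with a global constant (here a numeric sentinel standing in for "unknown").
filled_constant = df["income"].fillna(-1)

# Method 4: fill with the attribute mean (the average income).
filled_mean = df["income"].fillna(df["income"].mean())

print(filled_mean.tolist())   # [30000.0, 56000.0, 56000.0, 56000.0, 82000.0]
```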
Noisy Data
Noise is a random error or variance in a measured variable. Noise can be removed using data
smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the
values around it. The sorted values are distributed into a number of "buckets" or "bins." Because
binning methods consult the neighborhood of values, they perform local smoothing.
Sorted data for price (in dollars): 3,7,14,19,23,24,31,33,38.
Example 1: Partition into (equal-frequency) bins:
Bin 1: 3,7,14
Bin 2: 19,23,24
Bin 3: 31,33,38
In the above method, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
Smoothing by bin means:
Bin 1: 8,8,8
Bin 2: 22,22,22
Bin 3: 34,34,34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 3, 7, and 14 in Bin 1 is 8 [(3+7+14)/3].
Smoothing by bin boundaries:
Bin 1: 3,3,14
Bin 2: 19,24,24
Bin 3: 31,31,38
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be
equal-width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following data using smoothing techniques:
8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34
Sorted data for price (in dollars): 4,8,9,15,21,21,24,25,26,28,29,34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4,8,9,15
Bin 2: 21,21,24,25
Bin 3: 26,28,29,34
Smoothing by bin means:
Bin 1: 9,9,9,9
Bin 2: 23,23,23,23
Bin 3: 29,29,29,29
Smoothing by bin boundaries:
Bin 1: 4,4,4,15
Bin 2: 21,21,25,25
Bin 3: 26,26,26,34
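A small numpy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, reproducing the numbers of Example 2; the reshape-based bucketing assumes the sorted data divide evenly into bins of equal depth.

```python
import numpy as np

data = np.sort(np.array([8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34]))
n_bins, depth = 3, 4                       # equal-frequency bins of depth 4
bins = data.reshape(n_bins, depth)         # rows: [4,8,9,15], [21,21,24,25], [26,28,29,34]

# Smoothing by bin means: replace each value by its (rounded) bin mean.
by_means = np.repeat(bins.mean(axis=1).round().astype(int), depth).reshape(n_bins, depth)

# Smoothing by bin boundaries: replace each value by the closer of its bin's min or max.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```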
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the "best" line to fit two attributes (or variables), so that one attribute
can be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a multidimensional surface.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups,
or "clusters." Intuitively, values that fall outside of the set of clusters may be considered
outliers.
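A brief sketch of outlier detection by clustering; it uses scikit-learn's KMeans as one possible clustering algorithm (the text does not prescribe a particular one) and flags points that lie unusually far from their cluster centroid. The data and the 3-standard-deviation cutoff are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense clusters plus one far-away point acting as an outlier.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2)), [[30.0, 30.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point from its own cluster centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance is unusually large (here: beyond mean + 3 std).
outliers = np.where(dist > dist.mean() + 3 * dist.std())[0]
print(outliers)   # expected to contain index 100, the far-away point
```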
3. Data Integration
Data mining often requires data integration, the merging of data from multiple data stores into a coherent data
store, as in data warehousing. These sources may include multiple databases, data cubes, or flat
files.
Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & resolution of data value conflicts
a) Schema Integration & Object Matching
Schema integration & object matching can be tricky because the same entity can be
represented in different forms in different tables. This is referred to as the entity identification
problem. Metadata can be used to help avoid errors in schema integration. The metadata may
also be used to help transform the data.
b) Redundancy:
Redundancy is another important issue. An attribute (such as annual revenue, for instance)
may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies
in attribute or dimension naming can also cause redundancies in the resulting data set. Some
redundancies can be detected by correlation analysis and covariance analysis.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and covariance.
Correlation analysis for nominal data (χ² test):
For nominal data, a correlation relationship between two attributes, A and B, can be
discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, a3,
..., ac, and B has r distinct values, namely b1, b2, b3, ..., br. The data tuples can be described
by a contingency table. The χ² value is computed as

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

where o_{ij} is the observed frequency of the joint event (A_i, B_j) and e_{ij} is the expected
frequency of (A_i, B_j), which can be computed as

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}

where n is the number of data tuples.
Covariance analysis for numeric data:
For two numeric attributes A and B observed over n tuples, the covariance is computed as

Cov(A, B) = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})

where \bar{A} and \bar{B} are the respective mean values of A and B.
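A short numpy sketch of both measures above: the χ² statistic for a hypothetical contingency table of two nominal attributes, and the covariance of two small made-up numeric attributes.

```python
import numpy as np

# Hypothetical contingency table: rows = values of one attribute, columns = values of the other.
observed = np.array([[250,  200],
                     [ 50, 1000]])
n = observed.sum()

# Expected counts: e_ij = count(row value) * count(column value) / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 1))        # 507.9 for this made-up table

# Covariance of two small made-up numeric attributes A and B:
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
cov = np.mean((A - A.mean()) * (B - B.mean()))
print(round(cov, 2))         # 7.0
```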
4. Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results.
Why data reduction? A database or data warehouse may store terabytes of data, so
complex data analysis may take a very long time to run on the complete data set.
Attribute subset selection:
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. Because an exhaustive search over all attribute subsets is usually too expensive, heuristic (greedy) methods are commonly used; they make a locally optimal choice at each step in the hope that this will lead to a
globally optimal solution. Many other attribute evaluation measures can be used, such as the
information gain measure used in building decision trees for classification. Basic heuristic methods of attribute subset selection include the following:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At each
subsequent iteration or step, the best of the remaining original attributes is added to the set (see the sketch after this list).
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure where
each internal node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each leaf node denotes a class prediction. At each node, the algorithm chooses the
"best" attribute to partition the data into individual classes. A tree is constructed from the
given data. All attributes that do not appear in the tree are assumed to be irrelevant; the set of
attributes appearing in the tree form the reduced subset of attributes. A threshold measure can be used
as the stopping criterion.
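A minimal sketch of stepwise forward selection (method 1). The `score` callback is an assumption standing in for whatever attribute evaluation measure is chosen (e.g., information gain or validation accuracy); the toy scoring function at the end is purely illustrative.

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection.

    attributes : list of candidate attribute names
    score      : callable taking a list of attributes and returning a number to maximize
    k          : number of attributes to keep (a simple stopping criterion)
    """
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Add the remaining attribute that improves the score the most.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: the scoring function simply rewards a fixed "useful" set of attributes.
useful = {"income", "age"}
print(forward_selection(["name", "age", "income", "zip"],
                        lambda s: len(useful & set(s)), k=2))   # ['age', 'income']
```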
Numerosity Reduction:
Numerosity reduction is used to reduce the data volume by choosing alternative, smaller forms of
data representation.
Techniques for Numerosity reduction:
Parametric - In this model, only the data parameters need to be stored, instead of the
actual data (e.g., log-linear models, regression).
Nonparametric - These methods store reduced representations of the data, including
histograms, clustering, and sampling.
Parametric model
1. Regression
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a random
variable, Y (called a response variable), can be modeled as a linear function of
another random variable, X (called a predictor variable), with the equation Y = αX + β,
where the variance of Y is assumed to be constant. The coefficients, α and β (called
regression coefficients), specify the slope of the line and the Y-intercept,
respectively.
Multiple linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a
response variable Y, to be modeled as a linear function of two or more predictor
variables.
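A small sketch fitting Y = αX + β by least squares with numpy; the (X, Y) values are made up, e.g., years of experience versus salary in thousands of dollars.

```python
import numpy as np

# Made-up (X, Y) pairs, e.g. years of experience vs. salary in $1000s.
X = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
Y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# Fit Y = alpha * X + beta (a degree-1 polynomial is simple linear regression).
alpha, beta = np.polyfit(X, Y, deg=1)
print(alpha, beta)

# The two coefficients can now stand in for the raw data (parametric reduction),
# or be used to smooth/predict Y for new X values.
Y_hat = alpha * X + beta
```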
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller subset
of dimensional combinations.
Nonparametric Model
1. Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or
buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are
called singleton buckets.
Ex: The following data are a list of prices of commonly sold items at All Electronics. The numbers
have been sorted:
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,21,21,21,21,21,25,25,25,25,25,28,28,30,30,30
Equal-frequency (or equi-depth): the frequency of each bucket is constant.
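A sketch, using numpy, of building an equal-width histogram and equal-frequency (equi-depth) bucket boundaries for the price list above; the choice of three buckets is arbitrary.

```python
import numpy as np

prices = np.array([1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,
                   18,18,18,18,18,18,18,18,20,20,20,20,20,20,21,21,21,21,21,
                   25,25,25,25,25,28,28,30,30,30])

# Equal-width histogram: 3 buckets covering equal value ranges.
counts, edges = np.histogram(prices, bins=3)
print(edges)    # bucket boundaries, approximately [1, 10.67, 20.33, 30]
print(counts)   # number of prices falling in each bucket

# Equal-frequency (equi-depth) boundaries: placed at the quantiles, so each
# bucket holds roughly the same number of values.
eq_depth_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(eq_depth_edges)
```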
2. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups,
or clusters, so that objects within a cluster are similar to one another and dissimilar to objects in
other clusters. Similarity is defined in terms of how close the objects are in space, based on a
distance function. The quality of a cluster may be represented by its diameter, the maximum
distance between any two objects in the cluster. Centroid distance is an alternative measure of
cluster quality and is defined as the average distance of each cluster object from the cluster
centroid.
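A small numpy sketch computing the two cluster-quality measures just described, diameter and centroid distance, for one made-up cluster of 2-D points.

```python
import numpy as np

# One made-up cluster of 2-D points (tuples treated as objects).
cluster = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 2.5], [2.2, 2.1]])

# Diameter: maximum distance between any two objects in the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
pairwise = np.linalg.norm(diffs, axis=-1)
diameter = pairwise.max()

# Centroid distance: average distance of each object from the cluster centroid.
centroid = cluster.mean(axis=0)
centroid_distance = np.linalg.norm(cluster - centroid, axis=1).mean()

print(diameter, centroid_distance)
```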
3. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random sample (or subset) of the data. Suppose that a large data
set, D, contains N tuples. One possible sample is a simple random sample without
replacement (SRSWOR) of size n: this is created by drawing n of the N tuples from D
(n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely to
be sampled.
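A minimal numpy sketch of SRSWOR; D here is a made-up data set of N tuples (setting replace=True instead would give sampling with replacement).

```python
import numpy as np

rng = np.random.default_rng(42)
N, n = 1000, 50                          # N tuples in D, sample size n (n < N)
D = rng.integers(0, 100, size=(N, 3))    # made-up data set of N tuples

# SRSWOR: draw n of the N tuples, each equally likely, no tuple drawn twice.
idx = rng.choice(N, size=n, replace=False)
sample = D[idx]
print(sample.shape)   # (50, 3)
```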
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are applied so as to obtain a
reduced or "compressed" representation of the original data.
Dimension Reduction Types
Lossless - If the original data can be reconstructed from the compressed data without any
loss of information, the data reduction is called lossless.
Lossy - If the original data can be reconstructed from the compressed data only with loss of
information, the data reduction is called lossy.
Effective methods of lossy dimensionality reduction:
a) Wavelet transforms
b) Principal components analysis.
a) Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector, transforms it to a numerically different vector of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X=(x1,x2,…………,xn), depicting n
measurements made on the tuple from n database attributes.
For example, all wavelet coefficients larger than some user-specified threshold can be
retained; all other coefficients are set to 0. The resulting data representation is therefore very
sparse, so that operations that can take advantage of data sparsity are computationally very fast if
performed in wavelet space.
The number next to a wavelet name is the number of vanishing moments of the wavelet;
this is a set of mathematical relationships that the coefficients must satisfy and is related to the number
of coefficients.
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L >=n).
2. Each transform involves applying two functions
The first applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out the detailed
features of data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (X2i , X2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and the high-frequency
content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets are of length 2 (a small sketch of this procedure follows).
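A minimal sketch of the recursive pairwise smoothing/differencing procedure described in steps 2-4, using simple pairwise averages and halved differences in the style of the Haar wavelet; real DWT implementations use specific wavelet filter coefficients, so this is only an illustration of the idea.

```python
import numpy as np

def haar_like_dwt(x):
    """Recursively apply pairwise averages (smoothing) and halved differences
    (detail) to a vector whose length is an integer power of 2."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 2:                                # recurse until length 2 (step 4)
        pairs = x.reshape(-1, 2)
        smooth = pairs.mean(axis=1)                  # low-frequency part, length L/2
        detail = (pairs[:, 0] - pairs[:, 1]) / 2.0   # high-frequency (detail) part
        details.append(detail)
        x = smooth
    return x, details                                # final smooth part + all detail sets

coeffs, details = haar_like_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)     # [1.5 4. ]
```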
b) Principal components analysis (PCA):
Principal components analysis searches for a new, smaller set of axes (the principal components) onto
which the original data can be projected. For example, for data originally mapped to the axes X1 and X2,
the first two principal components Y1 and Y2 form new axes. This information helps identify groups or
patterns within the data. The new axes are sorted such that the first axis shows the most variance among
the data, the second axis shows the next highest variance, and so on.
The size of the data can be reduced by eliminating the weaker components.
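A short numpy sketch of PCA via eigendecomposition of the covariance matrix of made-up 2-D data, keeping only the strongest component; a library routine such as scikit-learn's PCA could be used instead.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up correlated 2-D data (attributes X1, X2).
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components (Y1, Y2, ...),
# sorted so the first component carries the most variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project onto the first principal component only (eliminating the weaker component).
Y1 = Xc @ components[:, :1]
print(eigvals[order])   # variances along Y1, Y2 (largest first)
print(Y1.shape)         # (200, 1)
```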
Advantage of PCA
PCA is computationally inexpensive
Multidimensional data of more than two dimensions can be handled by reducing the
problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis.
5. Data Transformation
In data transformation, the data are transformed into forms appropriate for mining. In normalization,
the attribute data are scaled so as to fall within a smaller range, such as [0.0, 1.0]. Common
normalization methods include:
a) Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute, A. Min-max
normalization maps a value, vi, of A to vi' in the range [new_minA, new_maxA] by computing
vi' = ((vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values. It will encounter
an ―out-of-bounds‖ error if a future input case for normalization falls outside of the original data
range for A.
Example: Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
(73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716.
b) Z-Score Normalization
In z-score normalization, the values for an attribute, A, are normalized based on the mean (i.e., average)
and standard deviation of A. A value, vi, of A is normalized to vi' by computing
vi' = (vi - Ā) / σA
where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Example: z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value
of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
c) Normalization by decimal scaling moves the decimal point of values of attribute A. The number of
decimal points moved depends on the maximum absolute value of A: a value, vi, of A is normalized to
vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1.
Example: Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each
value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
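A small Python sketch applying the three normalization methods with the numbers from the examples above ($73,600 income with min $12,000, max $98,000, mean $54,000, standard deviation $16,000, and the decimal-scaling range -986 to 917).

```python
import numpy as np

v = 73600.0

# a) Min-max normalization to [0.0, 1.0] with min = 12,000 and max = 98,000.
min_a, max_a, new_min, new_max = 12000.0, 98000.0, 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_minmax, 3))    # 0.716

# b) Z-score normalization with mean = 54,000 and std = 16,000.
mean_a, std_a = 54000.0, 16000.0
v_zscore = (v - mean_a) / std_a
print(v_zscore)              # 1.225

# c) Decimal scaling: divide by 10**j, with j the smallest integer so max(|v'|) < 1.
A = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(A).max())))
print(A / 10 ** j)           # [-0.986  0.917]
```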
Concept hierarchy generation for nominal data:
3. Specification of a set of attributes, but not of their partial ordering: A user may
specify a set of attributes forming a concept hierarchy, but omit to explicitly
state their partial ordering. The system can then try to automatically
generate the attribute ordering so as to construct a meaningful
concept hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless when
defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.
Consequently, the user may have included only a small subset of the relevant attributes in the
hierarchy specification.