Data Warehouse and Data Mining - Unit 3

3

DATA PREPROCESSING

CHAPTER OUTLINE

Data cleaning
Data integration and transformation
Data reduction
Data discretization and Concept Hierarchy Generation
Data mining primitives

INTRODUCTION

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. The set of techniques used prior to the application of a data mining method is named data preprocessing for data mining, and it is known to be one of the most meaningful issues within the famous Knowledge Discovery from Data (KDD) process, as shown in Figure 3.1.
[Figure 3.1: Place of data preprocessing in the KDD process — problem understanding and problem specification (DATA), followed by data preprocessing, data mining, evaluation, and result exploitation (KNOWLEDGE)]

Data goes through a series of steps during preprocessing:


Data Cleaning: Data is cleansed through processes such as filling in missing values or
deleting rows with missing data, smoothing the noisy data, or resolving the
inconsistencies in the data. Smoothing noisy data is particularly important for ML
datasets, since machines cannot make use of data they cannot interpret. Data can be
cleaned by dividing it into equal-size segments that are then smoothed (binning), by
fitting it to a linear or multiple regression function (regression), or by grouping it into
clusters of similar data (clustering).
Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
Data Transformation: Data is normalized and generalized. Normalization is a process
that ensures that no data is redundant, it is all stored in a single place, and all the
dependencies are logical.
Data Reduction: When the volume of data is huge, databases can become slower,
costly to access, and challenging to store properly. The data reduction step aims to present
a reduced representation of the data in a data warehouse. Data reduction strategies include:
Data cube aggregation
Dimensionality reduction
Data compression
Numerosity reduction
Discretization and concept hierarchy generation
Data Discretization: Data could also be discretized to replace raw values with interval
levels. This step involves reducing the number of values of a continuous attribute
by dividing the range of the attribute into intervals.
Data Sampling: Sometimes, due to time, storage or memory constraints, a dataset is too
big or too complex to be worked with. Sampling techniques can be used to select and
work with just a subset of the dataset, provided that it has approximately the same
properties as the original one.
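As a minimal sketch (the data and sampling fraction below are invented for illustration), simple random sampling can be done with pandas:

```python
import pandas as pd

# Hypothetical dataset that is too large to mine in full
df = pd.DataFrame({"amount": range(1_000_000)})

# Draw a 1% simple random sample; random_state keeps the draw reproducible
sample = df.sample(frac=0.01, random_state=42)

# The sample keeps roughly the same statistical properties as the original
print(len(sample), sample["amount"].mean(), df["amount"].mean())
```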
[Figure 3.2: Data preprocessing tasks — data cleaning, data transformation, data integration, data normalization, missing values imputation, noise identification]

Why do we need to preprocess data?

By preprocessing data, we:
Make our database more accurate: We eliminate the incorrect or missing values that
are there as a result of the human factor or bugs.
Boost consistency: When there are inconsistencies in data or duplicates, it affects the
accuracy of the results.
Make the database more complete: We can fill in the attributes that are missing if
needed.
Smooth the data: This way we make it easier to use and interpret.

DATA CLEANING

Data cleaning is a technique that is applied to remove the noisy data and correct the inconsistencies
in data. It involves transformations to correct the wrong data. Data cleaning is performed as a data
preprocessing step while preparing the data for a data warehouse. When combining multiple data
sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect,
outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary from
dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know
you are doing it the right way every time. The data can have many irrelevant and missing parts. To
handle this part, data cleaning is done. It involves handling of missing data, noisy data etc.

[Figure 3.3: Data cleaning process — polluted data from source systems goes through data analysis, definition of the transformation workflow and mapping rules, and verification and transformation (using vendor tools or in-house programs); cleaning functions produce cleaned data in the staging area, with a backflow of cleaned data before loading into the data warehouse]


a) Missing Data
This situation arises when some data is missing in the data. It can be handled in various ways.
Some of them are:
Ignore the tuples: This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
Fill the missing values: There are various ways to do this task. You can choose to fill
the missing values manually, by the attribute mean, or by the most probable value.
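A minimal sketch of the fills described above with pandas; the small table and column names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40],
                   "city": ["Pokhara", "Naxal", None, "Naxal", "Naxal"]})

# Fill a numeric attribute with its mean
df["age"] = df["age"].fillna(df["age"].mean())

# Fill a categorical attribute with its most probable (most frequent) value
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternatively, ignore (drop) any tuples that still contain missing values
df = df.dropna()
print(df)
```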
b) Noisy Data
Noisy data is meaningless data that can't be interpreted by machines. It can be generated
due to faulty data collection, data entry errors etc. It can be handled in the following ways:
Binning method: This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all data in a
segment by its mean, or boundary values can be used to complete the task.
Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).
Clustering: This approach groups the similar data in a cluster. The outliers may go
undetected, or they will fall outside the clusters.
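A minimal sketch of smoothing by bin means, assuming equal-size bins of four sorted values (the numbers are illustrative, not from the text):

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bin_size = 4  # each equal-size segment holds four values

smoothed = data.astype(float)
for start in range(0, len(data), bin_size):
    segment = data[start:start + bin_size]
    smoothed[start:start + bin_size] = segment.mean()  # replace by bin mean

print(smoothed)
# [ 9.    9.    9.    9.   22.75 22.75 22.75 22.75 29.25 29.25 29.25 29.25]
```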

Benefits of Data Cleaning

Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:
Removal of errors when multiple sources of data are at play.
Fewer errors make for happier clients and less-frustrated employees.
Ability to map the different functions and what your data is intended to do.
Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
Using tools for data cleaning will make for more efficient business practices and
quicker decision-making.
Data Cleaning Methods
Prevention of unnecessary observations: One of the important achievements of data
cleansing is to ensure that the data set is free from unnecessary observations. Unnecessary
observations are of two types: duplicate observations and irrelevant observations.
Fix data structure: Structural errors may arise during data exchange due to human
oversight or to a person who is not well trained.
Filter out outliers: Outliers are data points that depart significantly from the rest of the
observations in a data set and deserve careful examination.
Handle missing data: We may end up with missing values in data because of a lack of
attention during data gathering or because of confidentiality concerns. There are two
ways of managing unavailable data: one is dropping the affected observations from the
data set, and the second is filling in new information.
Drop missing values: Dropping unavailable information assists in making a good
decision.

DATA INTEGRATION AND TRANSFORMATION

Data integration is a data preprocessing technique that combines data from multiple sources and
provides users a unified view of these data.

[Figure 3.4: Data integration — Data Source 1, Data Source 2, and Data Source 3 are combined through data integration into a unified view]
These sources may include multiple databases, data cubes, or flat files. One of the most well-known
implementations of data integration is building an enterprise's data warehouse. A data
warehouse enables a business to perform analyses based on the data it contains. There
are mainly two major approaches for data integration:
Tight coupling: In tight coupling, data is combined from different sources into a single
physical location through the process of ETL - Extraction, Transformation and Loading.
Loose coupling: In loose coupling, data only remains in the actual source databases. In
this approach, an interface is provided that takes a query from the user, transforms it in a
way the source database can understand, and then sends the query directly to the
source databases to obtain the result.
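A minimal sketch of the idea of a unified view using pandas; the two source tables and the join key are assumptions made up for illustration:

```python
import pandas as pd

# Two hypothetical source systems describing the same customers
crm = pd.DataFrame({"cust_id": [1, 2, 3],
                    "name": ["Indra", "Bhupi", "Kamala"]})
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3],
                      "amount": [224000, 408000, 350000, 586000]})

# Unified view: one row per customer with total purchases from both sources
totals = sales.groupby("cust_id", as_index=False)["amount"].sum()
unified = crm.merge(totals, on="cust_id", how="left")
print(unified)
```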
Data transformation is the process of converting data from one format to another, typically from the
format of a source system into the required format of a destination system. Data transformation is a
component of most data integration and data management tasks, such as data wrangling and data
warehousing. As one step in the ELT/ETL process, data transformation may be described as either
"simple" or "complex," depending on the kinds of changes that must occur to the data before it is
delivered to its target destination. The data transformation process can be automated, handled
manually, or completed using a combination of the two.
Once the data is cleansed, the following steps in the transformation process occur:
Data discovery: The first step in the data transformation process consists of identifying
and understanding the data in its source format. This is usually accomplished with the
help of a data profiling tool. This step helps you decide what needs to happen to the
data in order to get it into the desired format.
Data mapping: During this phase, the actual transformation process is planned.
Generating code: In order for the transformation process to be completed, code must
be created to run the transformation job. Often this code is generated with the help
of a data transformation tool or platform.

Executing the code: The data transformation process that has been planned and coded
is now put into motion, and the data is converted to the desired output.
Review: Transformed data is checked to make sure it has been formatted correctly.
Some Data Transformation Strategies

Let us study the strategies used for data transformation in brief, some of which we have already studied
in data reduction and data cleaning.
Data Smoothing: Smoothing is a process of removing noise from the data. We have
studied this technique of data smoothing in our previous content 'data cleaning'.
Smoothing the data means removing noise from the considered data set. There we have
seen how the noise is removed from the data using techniques such as binning,
regression, and clustering.
Binning: This method splits the sorted data into a number of bins and
smoothens the data values in each bin considering the neighborhood values
around it.
Regression: This method identifies the relation among two dependent attributes
so that if we have one attribute it can be used to predict the other attribute.
Clustering: This method groups similar data values to form a cluster. The
values that lie outside a cluster are termed outliers.
Data Aggregation: Data aggregation transforms a large set of data to a smaller volume by
implementing an aggregation operation on the data set.
Example: we have a data set of sales reports of an enterprise that has quarterly sales for each
year. We can aggregate the data in order to get the annual sales report of the enterprise.

[Figure 3.5: Data aggregation — quarterly sales for the years 2008-2010 are aggregated into annual sales: 2008 Rs.1,568,000; 2009 Rs.2,356,000; 2010 Rs.3,594,000]
Generalization: In generalization, low-level data are replaced with high-level data by
using concept hierarchy climbing.
Data Normalization: Normalizing the data refers to scaling the data values to a much
smaller range such as [-1.0, 1.0] or [0.0, 1.0] (see the sketch after this list).
Attribute Construction: In the attribute construction method, new attributes are
constructed from the existing set of attributes in order to construct a new data set
that eases data mining. This can be understood with the help of an example: consider
we have a data set referring to measurements of different plots, i.e., we may have the height
and width of each plot. Here we can construct a new attribute 'area' from the attributes
'height' and 'width'. This also helps in understanding the relations among the
attributes in a data set.
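A minimal sketch of min-max normalization to [0.0, 1.0] and of constructing the new 'area' attribute described above; the plot measurements are invented for illustration:

```python
import pandas as pd

plots = pd.DataFrame({"height": [10.0, 12.5, 8.0, 20.0],
                      "width":  [5.0, 4.0, 6.5, 10.0]})

# Attribute construction: derive 'area' from the existing attributes
plots["area"] = plots["height"] * plots["width"]

# Min-max normalization: scale 'area' into the range [0.0, 1.0]
lo, hi = plots["area"].min(), plots["area"].max()
plots["area_norm"] = (plots["area"] - lo) / (hi - lo)

print(plots)
```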

Benefits of Data Transformation

Whether it's information about customer behaviors, internal processes, supply chains, or even the
weather, businesses and organizations across all industries understand that data has the potential to
increase efficiencies and generate revenue. The challenge here is to make sure that all the data that's
being collected can be used. By using a data transformation process, companies are able to reap
massive benefits from their data, including:
Getting maximum value from data: Data transformation tools allow companies to
standardize data to improve accessibility and usability.
Managing data more effectively: With data being generated from an increasing
number of sources, inconsistencies in metadata can make it a challenge to organize and
understand data. Data transformation refines metadata to make it easier to organize
and understand what's in your data set.
Performing faster queries: Transformed data is standardized and stored in a source
location, where it can be quickly and easily retrieved.
Enhancing data quality: Data quality is becoming a major concern for organizations
due to the risks and costs of
using bad data to obtain business intelligence. The process
of transforming data can reduce or eliminate
quality issues like inconsistencies and
missing values.

DATA REDUCTION

A database or data warehouse may store terabytes of data. So, it may take very long to perform data
analysis and mining on such huge amounts of data. Data reduction techniques can be applied to
obtain a reduced representation of the data set that is much smaller in volume but still contains
critical information. Data reduction increases the efficiency of data mining. In the following section,
we will discuss the techniques of data reduction. Techniques of data reduction include:
Dimensionality reduction
Numerosity reduction
Data compression
1. Dimensionality Reduction
Dimensionality reduction eliminates attributes from the data set under consideration,
thereby reducing the volume of the original data. Example:
Eid  Ename   Mobile No    Mobile Network
E1   Indra   9851265434   NTC Postpaid
E2   Bhupi   9849148484   NTC Prepaid
E3   Kamala  9802323245   NCELL
E4   Aadesh  9803324331   NCELL
Figure 3.6: Before dimension reduction
If we know the mobile number, then we can know the mobile network, so we can reduce one
dimension of the above table as below:
Eid  Ename   Mobile No
E1   Indra   9851265434
E2   Bhupi   9849148484
E3   Kamala  9802323245
E4   Aadesh  9803324331
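A minimal sketch of this reduction with pandas, dropping the attribute that is derivable from another (the column identifiers are written as assumptions for illustration):

```python
import pandas as pd

emp = pd.DataFrame({
    "Eid": ["E1", "E2", "E3", "E4"],
    "Ename": ["Indra", "Bhupi", "Kamala", "Aadesh"],
    "MobileNo": ["9851265434", "9849148484", "9802323245", "9803324331"],
    "MobileNetwork": ["NTC Postpaid", "NTC Prepaid", "NCELL", "NCELL"],
})

# 'MobileNetwork' can be derived from 'MobileNo', so it is a redundant dimension
reduced = emp.drop(columns=["MobileNetwork"])
print(reduced)
```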
2. Numerosity Reduction
The numerosity reduction reduces the volume of the original data and represents it in a much
smaller form. This technique includes two types: parametric and non-parametric numerosity
reduction.
Parametric
Parametric numerosity reduction means storing only the data parameters instead of
the original data. One method of parametric numerosity reduction is regression and
the log-linear method. Linear regression models a relationship between two attributes
by fitting a linear equation to the data set. Suppose we need to model a linear
function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. If we discuss in terms
of data mining, the attribute x and the attribute y are numeric database attributes,
whereas w and b are regression coefficients.
Multiple linear regression allows the response variable y to be modeled as a linear function of
two or more predictor variables. The log-linear model discovers the relation between two or
more discrete attributes in the database. Suppose we have a set of tuples presented in
n-dimensional space. Then the log-linear model is used to study the probability of each
tuple in a multidimensional space.
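A minimal sketch of parametric reduction with NumPy: only the fitted coefficients w and b are kept instead of the raw (x, y) pairs (the data points are invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response attribute

# Least-squares fit of y = w*x + b; only w and b need to be stored
w, b = np.polyfit(x, y, deg=1)
print(w, b)

# The original values can later be approximated from the two parameters
y_approx = w * x + b
```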
Non-parametric
Non-parametric methods for storing reduced representations of the data include
histograms, clustering, and sampling. Clustering techniques group similar objects
from the data in such a way that the objects in a cluster are similar to each other but
dissimilar to objects in another cluster.
How similar the objects inside a cluster are can be calculated by using a distance
function. The more similar the objects in a cluster, the closer they appear in the cluster.
The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance
between any two objects in the cluster. The original data is replaced by the cluster
representation. This technique is more effective if the present data can be classified into
distinct clusters.
3. Data Compression
Data compression is a technique where a data transformation technique is applied to the
original data in order to obtain compressed data. If the compressed data can again be
reconstructed to form the original data without losing any information, then it is a 'lossless'
data reduction. If you are unable to reconstruct the original data from the compressed one,
then your data reduction is 'lossy'. Dimensionality and numerosity reduction methods are also
used for data compression.
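A minimal sketch of lossless compression with Python's standard zlib module: decompressing recovers exactly the original bytes, which is the 'lossless' case described above.

```python
import zlib

original = b"NTC Postpaid,NTC Prepaid,NCELL,NCELL," * 100

compressed = zlib.compress(original)     # much smaller representation
restored = zlib.decompress(compressed)   # reconstruct without losing information

print(len(original), len(compressed), restored == original)
```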
Data Reduction Strategies
Data cube aggregation: Aggregation operations are applied to the data in the
construction of a data cube.
Dimensionality reduction: In dimensionality reduction, redundant attributes are
detected and removed, which reduces the data set size.
Data compression: Encoding mechanisms are used to reduce the data set size.
Numerosity reduction: In numerosity reduction, the data are replaced or estimated by
alternatives.
Discretization and concept hierarchy generation: Raw data values for attributes are
replaced by ranges or higher conceptual levels.


Difference between Dimensionality Reduction and Numerosity Reduction

In dimensionality reduction, data encoding or data transformations are applied to obtain a reduced or compressed representation of the original data. In numerosity reduction, data volume is reduced by choosing suitable alternative forms of data representation.
Dimensionality reduction can be used to remove irrelevant or redundant attributes. Numerosity reduction is merely a technique for representing the original data in a smaller form.
In dimensionality reduction, some data can be lost which is irrelevant. In numerosity reduction, there is no loss of data.
Methods for dimensionality reduction are wavelet transformations and principal component analysis. Methods for numerosity reduction are regression or the log-linear model (parametric), and histograms, clustering, and sampling (non-parametric).
The components of dimensionality reduction are feature selection and feature extraction. Numerosity reduction has no components, only methods that ensure reduction of data volume.
Dimensionality reduction leads to less misleading data and more model accuracy. Numerosity reduction preserves the integrity of the data while the data volume is also reduced.

DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION

Discretization techniques can be used to reduce the number of values for a given continuous
attribute, by dividing the range of the attribute into intervals. Interval value labels can then be used to
replace actual data values. These methods are typically recursive, where a large amount of time is
spent on sorting the data at each step. The smaller the number of distinct values to sort, the faster
these methods should be. Here, numerous continuous attribute values are replaced by a small number
of interval labels. This leads to a concise, easy-to-use, knowledge-level representation of mining results.

Example: We have an attribute of age with the following values:


Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
Attribute:             Age1                    Age2                    Age3
Values:                10, 11, 13, 14, 17, 19  30, 31, 32, 38, 40, 42  70, 72, 73, 75
After discretization:  Young                   Mature                  Old
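A minimal sketch of this age discretization with pandas.cut; the cut points 20 and 60 are assumptions chosen to reproduce the grouping above:

```python
import pandas as pd

age = pd.Series([10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75])

# Replace raw values with the interval labels Young / Mature / Old
labels = pd.cut(age, bins=[0, 20, 60, 120], labels=["Young", "Mature", "Old"])
print(labels.value_counts())   # Young: 6, Mature: 6, Old: 4
```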
Discretization can be categorized into the following two types:

Top-down discretization: If the process starts by first finding one or a few points
(called split points or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-down discretization or splitting.
Bottom-up discretization: If the process starts by considering all of the continuous
values as potential split points, and removes some by merging neighborhood values to
form intervals, then it is called bottom-up discretization or merging.
Many discretization techniques can be applied recursively in order to provide a hierarchical or
multi-resolution partitioning of the attribute values, known as a concept hierarchy.

Concept Hierarchies

A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept
hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as
numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Although detail is lost by such generalization, the generalized data become more meaningful and easier to interpret. In
the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies. This organization provides
users with the flexibility to view data from different perspectives. Data mining on a reduced data set
means fewer input/output operations and is more efficient than mining on a larger data set. Because
of these benefits, discretization techniques and concept hierarchies are typically applied before data
mining, rather than during mining.
[Figure 3.7: Concept hierarchy for the attribute price — (Rs.0...Rs.1000] is split into (Rs.0...Rs.200], (Rs.200...Rs.400], (Rs.400...Rs.600], (Rs.600...Rs.800], and (Rs.800...Rs.1000], each of which is further split into Rs.100 intervals such as (Rs.0...Rs.100] and (Rs.100...Rs.200]]
1. Discretization and Concept Hierarchy Generation for Numeric Data
It is difficult and laborious to specify concept hierarchies for numeric attributes due to the
wide diversity of possible data ranges and the frequent updates of data values. Manual
specification also could be arbitrary. Concept hierarchies for numeric attributes can be
constructed automatically based on data distribution analysis. Five methods for concept
hierarchy generation are described below:

Binning
Histogram analysis
Cluster analysis
Entropy-based discretization
Data segmentation by natural partitioning

a) Binning
Attribute values can be discretized by distributing the values into bins and replacing
each bin by the bin mean or bin median value. These techniques can be applied
recursively to the resulting partitions in order to generate concept hierarchies.
b) Histogram Analysis
Histograms can also be used for discretization. Partitioning rules can be applied to
define ranges of values. The histogram analysis algorithm can be applied recursively to
each partition in order to automatically generate a multilevel concept hierarchy, with
the procedure terminating once a pre-specified number of concept levels has been
reached. A minimum interval size can be used per level to control the recursive
procedure. This specifies the minimum width of a partition, or the minimum number of
members of each partition at each level.
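A minimal sketch of deriving first-level split points from an equal-width histogram with NumPy; the price values and the number of bins are assumptions for illustration:

```python
import numpy as np

prices = np.array([45, 60, 75, 110, 180, 210, 240, 320, 410, 520, 680, 910])

# An equal-width histogram with 5 bins; the bin edges act as candidate cut
# points for the first level of a concept hierarchy
counts, edges = np.histogram(prices, bins=5)
print(counts)   # how many values fall in each interval
print(edges)    # split points that could be refined recursively per partition
```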

c) Cluster Analysis
A clustering algorithm can be applied to partition data into clusters or groups. Each
cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual
level. Each cluster may be further decomposed into sub-clusters, forming a lower level
in the hierarchy. Clusters may also be grouped together to form a higher-level concept
in the hierarchy.
d) Segmentation by natural partitioning
Breaking up annual salaries into ranges like Rs.50,000-Rs.100,000 is often more
desirable than ranges like Rs.51,263.89-Rs.60,765.30 arrived at by cluster analysis.
The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural
intervals. In general, the rule partitions a given range of data into 3, 4, or 5
equi-width intervals, recursively level by level, based on the value range at the most
significant digit. The rule can be recursively applied to each interval, creating a concept
hierarchy for the given numeric attribute (a simplified sketch follows).
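A simplified sketch of the 3-4-5 rule (an approximation: it picks 3, 4, or 5 equi-width intervals from the most significant digit of the range, without the rounding to "nice" boundaries that the full rule performs):

```python
# Simplified 3-4-5 partitioning: choose 3, 4, or 5 equi-width intervals based
# on the most significant digit of the value range.
def partition_3_4_5(low, high):
    span = high - low
    msd = int(str(int(span))[0])       # most significant digit of the range
    if msd in (3, 6, 7, 9):
        parts = 3
    elif msd in (2, 4, 8):
        parts = 4
    else:                              # msd 1 or 5
        parts = 5
    width = span / parts
    return [low + i * width for i in range(parts + 1)]

print(partition_3_4_5(0, 1000))   # [0.0, 200.0, 400.0, 600.0, 800.0, 1000.0]
```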
2. Discretization and Concept Hierarchy Generation for Categorical Data
Categorical data are discrete data. Categorical attributes have a finite number of distinct values
with no ordering among the values; examples include geographic location, item type and job
category. There are several methods for generating concept hierarchies for categorical data:
a) Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Concept hierarchies for categorical attributes or dimensions typically involve a group
of attributes. A user or an expert can easily define a concept hierarchy by specifying a
partial or total ordering of the attributes at the schema level. A hierarchy can be defined
at the schema level such as street < city < province or state < country.

b) Specification of a portion of a hierarchy by explicit data grouping
This is essentially a manual definition of a portion of a concept hierarchy. In a large
database, it is unrealistic to define an entire concept hierarchy by explicit value
enumeration. However, it is realistic to specify explicit groupings for a small portion of
the intermediate-level data.
c) Specification of a set of attributes but not their partial ordering
A user may specify a set of attributes forming a concept hierarchy, but omit to specify
their partial ordering. The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy.
d) Specification of only a partial set of attributes
Sometimes a user can be sloppy when defining a hierarchy, or may have only a vague
idea about what should be included in a hierarchy. Consequently, the user may have
included only a small subset of the relevant attributes; for example, for location, the user may
have specified only street and city. To handle such partially specified hierarchies, it is
important to embed data semantics in the database schema so that attributes with tight
semantic connections can be pinned together.

DATA MINING PRIMITIVES

A data mining query is defined in terms of the following primitives


Task-relevant data: This is the database portion to be investigated. For example,
suppose that you are a manager of Bhat-Bhateni Supermarket in charge of sales in
Pokhara and Dhangadi. In particular, you would like to study the buying trends of
customers in Naxal. Rather than mining on the entire database, you can specify only the
portion of the data relevant to this task. The attributes to be studied are referred to as
relevant attributes.
The kinds of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association, classification,
clustering, or evolution analysis. For instance, if studying the buying habits of
customers in Naxal, you may choose to mine associations between customer profiles
and the items that these customers like to buy.
Background knowledge: Users can specify background knowledge, or knowledge
about the domain to be mined. This knowledge is useful for guiding the knowledge

discovery process, and for evaluating the patterns found. There are several kinds of

background knowledge.
Interestingness measures: These functions are used to separate uninteresting patterns
from knowledge. They may be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of knowledge may have different
interestingness measures.
Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for
knowledge presentation, such as rules, tables, charts, graphs, decision trees, and cubes.

Exercise
1. Explain about introduction to concept hierarchy.
2. Explain discretization and concept hierarchy generation for numeric data.
3. Explain discretization and concept hierarchy generation for categorical data.
4. Explain data cleaning in data mining. Also describe its importance.
5. What is data integration? Explain with a suitable example.
6. What is data reduction? Explain.
7. What is data discretization? Explain its advantages and disadvantages.
8. What is the concept behind hierarchy generation? Explain.
9. What are data mining primitives? Explain.
10. What are typical data preprocessing tasks?
11. What do you mean by data transformation? Explain.
12. Why do we need to preprocess data? Explain.
13. What is meant by data discretization? What are the discretization processes involved in data preprocessing?
14. What are the steps involved in data preprocessing? Explain.
15. Describe why concept hierarchies are useful in data mining.
16. Define concept hierarchy.
17. What is a preprocessing technique? Why do we need data transformation?
18. What is meant by data discretization? What are the discretization processes involved in data preprocessing?
19. What is visualization? Explain.
20. Why is data preprocessing important?
