Data Warehouse and Data Mining - Unit 3
DATA PREPROCESSING
CHAPTER OUTLINE
Data cleaning
Data integration and transformation
Data reduction
Data discretization and Concept Hierarchy Generation
Data mining primitives
INTRODUCTION
[Figure: Overview of the knowledge discovery process - from problem specification and problem understanding, through preprocessing steps such as noise identification and data integration, to evaluation and result exploitation, turning raw DATA into KNOWLEDGE.]
Data preprocessing prepares raw data for mining and improves the accuracy of the results. Among other things, preprocessing can:
- Make the database more complete: We can fill in the attribute values that are missing, if needed.
- Smooth the data: This way we make it easier to use and interpret.
DATA CLEANING
Data cleaning is a technique that is applied to remove noisy data and correct inconsistencies in data. It involves transformations to correct wrong data. Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset, but it is crucial to establish a template for your data cleaning process so you know you are applying it consistently.
[Figure: Data cleaning between source systems and the data warehouse - vendor tools or in-house programs perform verification and transformation.]
Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include:
- Removal of errors when multiple sources of data are at play.
- Fewer errors make for happier clients and less-frustrated employees.
- Ability to map the different functions and what your data is intended to do.
- Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
- Using tools for data cleaning will make for more efficient business practices and quicker decision-making.
Data Cleaning Methods
- Remove unnecessary observations: One important goal of data cleansing is to ensure that the dataset is free from unnecessary observations. Unnecessary observations are of two types: duplicate observations and irrelevant observations.
- Fix structural errors: Structural errors may arise during data transfer due to human oversight or the inability of a person who is not well trained.
- Filter out outliers: Outliers are data points that depart significantly from the other observations in a dataset. They should be examined carefully and removed only when they are clearly erroneous.
- Handle missing data: We may end up with missing values in data because of a lack of attention during data gathering or because respondents withhold information. There are two ways of managing missing data: one is dropping the affected observations from the dataset, and the other is filling in new values.
- Drop missing values: Dropping the observations with unavailable values is the simplest approach and assists in making a sound decision from the remaining, complete data.
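The following is a minimal sketch of these steps using the pandas library; the column names (age, income), the sample values, and the outlier threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw data containing a duplicate row, an outlier, and missing values
df = pd.DataFrame({
    "age":    [25, 25, 32, 47, 250, 38, None],
    "income": [30000, 30000, 42000, 55000, 61000, None, 48000],
})

# 1. Remove unnecessary (duplicate) observations
df = df.drop_duplicates()

# 2. Filter out outliers: keep only ages in a plausible range (missing ages are kept)
df = df[df["age"].between(0, 120) | df["age"].isna()]

# 3. Handle missing data: either drop the affected observations ...
dropped = df.dropna()

# ... or fill in new values (here: the column mean)
filled = df.fillna(df.mean(numeric_only=True))

print(filled)
```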
DATA INTEGRATION AND TRANSFORMATION
Data integration combines data from multiple sources into a coherent store, giving users a single unified view of the data.
Figure 3.4: Data integration - data from several data sources is combined to provide a unified view
These sources may include multiple databases, data cubes, or flat files. One of the most well-known implementations of data integration is building an enterprise's data warehouse. A data warehouse enables a business to perform analyses based on the data it holds. There are two major approaches to data integration:
Tight Coupling: In tight coupling, data is combined from different sources into a single physical location through the process of ETL - Extraction, Transformation and Loading (a small sketch follows this list).
Loose Coupling: In loose coupling, data remains only in the actual source databases. In this approach, an interface is provided that takes a query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
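As a rough illustration of the tight-coupling (ETL) approach, the sketch below extracts records from two hypothetical in-memory sources, transforms them into a common schema, and loads them into a single unified table; all column names and values are assumptions made for the example.

```python
import pandas as pd

# Extract: records from two hypothetical source systems
sales_a = pd.DataFrame({"cust_id": [1, 2], "amount_rs": [2500, 4000]})
sales_b = pd.DataFrame({"customer": [2, 3], "amount": [1500, 3200]})

# Transform: map the second source onto the common schema
sales_b = sales_b.rename(columns={"customer": "cust_id", "amount": "amount_rs"})

# Load: combine everything into one physical, unified table
unified = pd.concat([sales_a, sales_b], ignore_index=True)
print(unified)
```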
Data transformation is the
process of converting data from one format to another, typically from the
format of a source system into the
required format of a destination system. Data transformation is a
component of most data integration and data management tasks, such as data wrangling and data
warehousing. One step in the ELT/ETL process, data transformation may be described as either
"simple" or "complex," depending on the kinds of changes that must occur to the data before it is
delivered to its target destination. The data transformation process can be automated, handled
manually, or completed using a combination of the two.
Once the data is cleansed, the following steps in the transformation process occur:
Data discovery: The first step in the data transformation process consists of identifying and understanding the data in its source format. This is usually accomplished with the help of a data profiling tool. This step helps you decide what needs to happen to the data in order to get it into the desired format.
Data mapping: During this phase, the actual transformation process is planned.
Generating code: In order for the transformation process to be completed, code must be created to run the transformation job. Often this code is generated with the help of a data transformation tool or platform.
Executing the code: The data transformation process that has been planned and coded
is now put into motion, and the data is converted to the desired output.
Review: Transformed data is checked to make sure it has been formatted correctly.
Some Data Transformation Strategies
Let us briefly study the strategies used for data transformation, some of which we have already studied under data reduction and data cleaning.
Data Smoothing: Smoothing is a process of removing noise from the data. We have studied this technique in our previous content on data cleaning. Smoothing the data means removing noise from the considered data set. There we have seen how the noise is removed from the data using techniques such as binning, regression, and clustering.
Binning: This method splits the sorted data into a number of bins and smooths the data values in each bin by considering the neighborhood values around it (a small sketch follows this list).
Regression: This method identifies the relation between two dependent attributes so that if we have one attribute, it can be used to predict the other attribute.
Clustering: This method groups similar data values to form a cluster. The values that lie outside a cluster are termed outliers.
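A minimal sketch of smoothing by bin means is shown below; the price values and the choice of three equal-frequency bins are illustrative assumptions.

```python
import numpy as np

# Sorted data to be smoothed (illustrative prices)
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Split into 3 equal-frequency bins and replace every value by its bin mean
bins = np.array_split(prices, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```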
Data Aggregation: Data aggregation transforms a large set of data into a smaller volume by applying aggregation operations to the data set.
Example: Suppose we have a data set of sales reports of an enterprise that has quarterly sales for each year. We can aggregate the data in order to get the annual sales report of the enterprise.
Figure 3.5: Data Aggregation - quarterly sales for the years 2008-2010 are rolled up into annual sales:
    Year    Annual Sales
    2008    Rs.1,568,000
    2009    Rs.2,356,000
    2010    Rs.3,594,000
(The 2008 quarters shown are Rs.224,000, Rs.408,000, Rs.350,000 and Rs.586,000.)
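As a small sketch, the same roll-up can be done with a pandas group-by; the quarterly values below are the 2008 figures recoverable from Figure 3.5, with the quarter labels partly assumed.

```python
import pandas as pd

# Quarterly sales records for 2008 (values from Figure 3.5, quarter labels assumed)
quarterly = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008],
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
    "sales":   [224000, 408000, 350000, 586000],
})

# Aggregate the quarterly sales into annual sales
annual = quarterly.groupby("year")["sales"].sum()
print(annual)  # 2008    1568000
```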
Generalization: In generalization, low-level data are replaced with high-level data by using concept hierarchy climbing.
Data Normalization: Normalizing the data refers to scaling the data values to a much smaller range, such as [-1.0, 1.0] or [0.0, 1.0].
Attribute Construction: In the attribute construction method, new attributes are constructed from the existing set of attributes in order to build a new data set that eases data mining. This can be understood with the help of an example: suppose we have a data set containing measurements of different plots, i.e., we have the height and width of each plot. Here we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set (see the sketch after this list).
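The sketch below illustrates the last two strategies together: construction of a new 'area' attribute from 'height' and 'width', followed by min-max normalization of every column into the range [0.0, 1.0]. The plot measurements are hypothetical.

```python
import pandas as pd

# Hypothetical plot measurements
plots = pd.DataFrame({"height": [10.0, 12.5, 8.0], "width": [4.0, 6.0, 5.5]})

# Attribute construction: derive a new attribute 'area' from existing attributes
plots["area"] = plots["height"] * plots["width"]

# Data normalization: min-max scale each column into [0.0, 1.0]
normalized = (plots - plots.min()) / (plots.max() - plots.min())

print(plots)
print(normalized)
```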
Whether it's information about customer behaviors, internal processes, supply chains, or even the
weather, businesses and organizations across all industries understand that data has the potential to
increase efficiencies and generate revenue. The challenge here is to make sure that all the data that's
being collected can be used. By using a data transformation process, companies are able to reap
massive benefits from their data, including:
Getting maximum value from data: Data transformation tools allow companies to
standardize data to improve accessibility and usability.
Managing data more effectively: With data being generated from an increasing
number of sources, inconsistencies in metadata can make it a challenge to organize and
understand data. Data transformation refines metadata to make it easier to
organize
and understand what's in your data set.
Performing faster queries: Transformed data is standardized and stored in a source
location, where it can be quickly and easily retrieved.
Enhancing data quality: Data quality is becoming a major concern for organizations
due to the risks and costs of
using bad data to obtain business intelligence. The process
of transforming data can reduce or eliminate
quality issues like inconsistencies and
missing values.
DATA REDUCTION
Data reduction techniques are applied to obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Common data reduction strategies include:
- Attribute Subset Selection: Irrelevant, weakly relevant, or redundant attributes are detected and removed, which reduces the data set size.
- Data Compression: Encoding mechanisms are used to reduce the data set size.
- Numerosity Reduction: In numerosity reduction, the data are replaced or estimated by alternative, smaller data representations.
- Discretization and Concept Hierarchy Generation: Raw data values for attributes are replaced by ranges or higher conceptual levels.

DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION
Discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. These methods are typically recursive, where a large amount of time is spent on sorting the data at each step. The smaller the number of distinct values to sort, the faster these methods should be. Here, numerous continuous attribute values are replaced by a small number of interval labels. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Concept Hierarchies
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior). Although detail is lost by such generalization, the generalized data become more meaningful and easier to interpret. In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. Data mining on a reduced data set means fewer input/output operations and is more efficient than mining on a larger data set. Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.
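As a rough sketch of such a discretization, pandas' cut function can replace continuous values by interval (concept) labels; the age values, bin edges, and label names are assumptions for illustration.

```python
import pandas as pd

ages = pd.Series([13, 22, 35, 47, 58, 64, 71])

# Replace continuous age values with higher-level concept labels
labels = pd.cut(ages, bins=[0, 30, 60, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())  # ['young', 'young', 'middle-aged', 'middle-aged', ...]
```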
[Figure: A concept hierarchy for the attribute price - the interval (Rs.0...Rs.1000] is recursively partitioned into subintervals such as (Rs.0...Rs.100], (Rs.200...Rs.300], (Rs.400...Rs.500], (Rs.600...Rs.700] and (Rs.800...Rs.900].]
The following techniques can be used for discretization and concept hierarchy generation for numeric data:
- Binning
- Histogram analysis
- Entropy-based discretization
- Data segmentation by natural partitioning
a) Binning
Binning partitions the sorted attribute values into a number of bins (for example, equal-width or equal-frequency bins). The technique can be applied recursively to the resulting partitions in order to generate concept hierarchies.
b) Histogram Analysis
Histograms can also be used for discretization. A histogram partitions the values of an attribute into disjoint ranges of values. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached (a rough sketch of this recursive partitioning appears after item d below). A minimum interval size can also be used per level to control the recursive procedure; this specifies the minimum width of a partition, or the minimum number of partitions at each level.
c) Cluster Analysis
A clustering algorithm can be applied to partition the data into clusters or groups. Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level.
d) Segmentation by natural partitioning
Breaking up annual salaries into ranges like (Rs.50,000 - Rs.100,000] is often more desirable than ranges like (Rs.51,263.89 - Rs.60,765.3] arrived at by cluster analysis. The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural-seeming intervals. In general, the rule partitions a given range of data into 3, 4, or 5 equal-width intervals, recursively and level by level, based on the value range at the most significant digit. The rule can be applied recursively to each interval, creating a concept hierarchy for the given numeric attribute.
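A rough sketch of the recursive equal-width partitioning idea behind binning and histogram-based hierarchy generation (items a and b above) is given below. The price values, the two partitions per level, and the fixed depth limit (used here instead of a minimum interval size) are all assumptions made for the sketch.

```python
import numpy as np

def build_hierarchy(values, levels, parts=2):
    """Recursively split the value range into equal-width intervals."""
    lo, hi = float(min(values)), float(max(values))
    if levels == 0 or lo == hi:
        return (lo, hi)
    edges = np.linspace(lo, hi, parts + 1)
    children = []
    for a, b in zip(edges[:-1], edges[1:]):
        sub = [v for v in values if a <= v <= b]
        if sub:
            children.append(build_hierarchy(sub, levels - 1, parts))
    return {(lo, hi): children}

prices = [5, 12, 40, 55, 180, 420, 610, 940]
print(build_hierarchy(prices, levels=2))
```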
Discretization and Concept Hierarchy Generation for Categorical Data
Categorical data are discrete data. Categorical attributes have a finite number of distinct values, with no ordering among the values; examples include geographic location, item type, and job category. There are several methods for the generation of concept hierarchies for categorical data:
a) Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or an expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. For example, a hierarchy can be defined at the schema level such as street < city < province or state < country.
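A tiny sketch of applying such a schema-level ordering to generalize a location attribute is shown below; the lookup tables and place names are hypothetical and stand in for what would normally come from the database schema.

```python
# Hypothetical lookup tables encoding the ordering city < province < country
city_to_province = {"Kathmandu": "Bagmati", "Pokhara": "Gandaki"}
province_to_country = {"Bagmati": "Nepal", "Gandaki": "Nepal"}

def generalize(city, level):
    # Climb the concept hierarchy from city up to the requested level
    if level == "city":
        return city
    province = city_to_province[city]
    return province if level == "province" else province_to_country[province]

print(generalize("Pokhara", "province"))  # Gandaki
print(generalize("Pokhara", "country"))   # Nepal
```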
DATA MINING PRIMITIVES
A data mining task can be specified in terms of the following primitives:
Task-relevant data: This specifies the portions of the database or the set of data in which the user is interested, including the relevant attributes.
The kinds of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. For instance, if studying the buying habits of customers in Naxal, you may choose to mine associations between customer profiles and the items that these customers like to buy.
Background knowledge: Users can specify background knowledge, or knowledge
about the domain to be mined. This knowledge is useful for guiding the knowledge
discovery process, and for evaluating the patterns found. There are several kinds of
background knowledge.
Interestingness measures: These functions are used to separate uninteresting patterns
from knowledge. They may be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of knowledge may have different
interestingness measures.
Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for
knowledge presentation, such as rules, tables, charts, graphs, decision trees, and cubes.
Exercise
1. Explain about the introduction to concept hierarchy.
2. Explain discretization and concept hierarchy generation for numeric data.
3. Explain discretization and concept hierarchy generation for categorical data.
4. Explain data cleaning in data mining. Also describe its importance.
5. What is data integration? Explain with a suitable example.
6. What is data reduction? Explain.
7. What is data discretization? Explain its advantages and disadvantages.