DWM - Module 2
• Data integration is one of the steps of data pre-processing that involves combining
data residing in different sources and providing users with a unified view of these data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• The statistical strategy in data integration is formally stated as a triple (G, S, M),
where G is the global schema, S is the set of heterogeneous source schemas,
and M is the mapping between queries over the source schemas and queries over the global schema.
What are the Different Approaches to Data Integration in
Data Mining?
Data Integration in Data Mining is mainly categorized into two types of approaches.
Tight Coupling
• This approach involves the creation of a centralized database that integrates data from
different sources. The data is loaded into the centralized database using extract,
transform, and load (ETL) processes.
• In this approach, the integration is tightly coupled, meaning that the data is physically
stored in the central database, and any updates or changes made to the data sources
are immediately reflected in the central database.
• Tight coupling is suitable for situations where real-time access to the data is required,
and data consistency is critical. However, this approach can be costly and complex,
especially when dealing with large volumes of data.
Loose Coupling
• This approach involves the integration of data from different sources without
physically storing it in a centralized database.
• In this approach, data is accessed from the source systems as needed and combined in
real-time to provide a unified view. This approach uses middleware, such as application
programming interfaces (APIs) and web services, to connect the source systems and
access the data.
• In loose coupling, the data is kept in the actual source databases. Users are given an
interface to submit a query; the query is transformed into a format suitable for each data
source, the source executes it, and the results are sent back to the user as per the query.
• Loose coupling is suitable for situations where real-time access to the data is not
critical, and the data sources are highly distributed. This approach is more cost-
effective and flexible than tight coupling but can be more complex to set up and
maintain.
Data Integration Techniques
1.Manual Integration
• This technique avoids the use of automation during data integration.
• The data analyst collects the data, cleans it, and integrates it to provide useful
information.
• This technique can be implemented for a small organization with a small data set.
• However, it would be tedious for large, complex, and recurring integrations, because the
entire process has to be carried out manually and is therefore time-consuming.
2.Middleware Integration
• Middleware software is employed to collect the information from different sources,
normalize the data, and store it in the resultant data set.
• This technique is adopted when the enterprise wants to integrate data from the legacy
systems to modern systems.
• Middleware software acts as an interpreter between the legacy systems and the advanced
systems.
• A helpful analogy is an adapter that connects two systems with different interfaces.
However, this technique can be applied only to certain systems.
3. Application-Based Integration
• This technique makes use of a software application to extract, transform, and load the
data from the heterogeneous sources.
• This technique also makes the data from disparate sources compatible in order to ease
the transfer of the data from one system to another.
• This technique saves time and effort but is a little complicated, as designing such an
application requires technical knowledge.
4. Uniform Access Integration
• This technique integrates data from even more disparate sources. However, the location of
the data is not changed; the data stays in its original location.
• This technique only creates a unified view which represents the integrated data. No
separate storage is required to store the integrated data as only the integrated view is
created for the end-user.
5. Data Warehousing
• This technique loosely relates to the uniform access integration technique. But the
difference is that the unified view is stored in certain storage.
• This allows the data analyst to handle more complex queries.
• Though this is a promising technique, it has increased storage cost, as the view or copy
of the unified data needs separate storage, and it also incurs increased maintenance
cost.
Issues in Data Integration
While integrating the data we have to deal with several issues which are discussed below.
1. Entity Identification Problem
• Since the data is unified from heterogeneous sources, the question is how to
match the real-world entities across the data.
• For example, we have customer data from two different data sources. An entity from one
data source has customer_id and the entity from the other data source has
customer_number.
• Now, how would the data analyst or the system understand that these two attributes
refer to the same entity?
• Here, schema integration can be achieved using the metadata of each attribute.
• The metadata of an attribute includes its name, its meaning in the particular
scenario, its data type, and the range of values it can accept.
• What rules does the attribute follow for null, blank, or zero values? Analyzing this
metadata information prevents errors in schema integration (a small sketch of this idea
follows at the end of this subsection).
• Structural integration can be achieved by ensuring that the functional dependency
of an attribute in the source system and its referential constraints matches the functional
dependency and referential constraint of the same attribute in the target system.
• This can be understood with the help of an example: suppose that in one system the
discount is applied to an entire order, but in another system the discount is
applied to every single item in the order. This difference must be caught before the
data from these two sources is integrated into the target system.
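As a rough illustration of metadata-driven entity matching, the sketch below assumes two small pandas DataFrames in which customer_id and customer_number name the same attribute; the column names and records are invented for illustration.

```python
import pandas as pd

# Source A calls the key "customer_id"; source B calls the same attribute "customer_number".
source_a = pd.DataFrame({"customer_id": [101, 102], "city": ["Pune", "Mumbai"]})
source_b = pd.DataFrame({"customer_number": [101, 103], "discount": [0.05, 0.10]})

# Metadata (name, meaning, data type, value range) tells us both columns hold the
# same real-world entity key, so one column is renamed before integration.
source_b = source_b.rename(columns={"customer_number": "customer_id"})

# An outer join then gives a unified view of the customers from both sources.
unified = source_a.merge(source_b, on="customer_id", how="outer")
print(unified)
```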
2. Redundancy and Correlation Analysis
• Redundancy is one of the big issues during data integration.
• Redundant data is unimportant data, or data that is no longer needed.
• It can also arise due to attributes that could be derived using another attribute in the
data set.
• For example, if one data set has the customer's age and the other data set has the customer's
date of birth, then age is a redundant attribute, as it can be derived from the
date of birth.
• Inconsistencies in attributes also raise the level of redundancy. Redundancy can
be discovered using correlation analysis.
The attributes are analyzed to detect their interdependency on each other, thereby
detecting the correlation between them, as illustrated in the sketch below.
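A minimal sketch of correlation analysis for spotting redundant numeric attributes, assuming pandas is available; the birth_year, age, and income values below are hypothetical.

```python
import pandas as pd

# Hypothetical data where "age" can be derived from "birth_year" (a redundant pair).
df = pd.DataFrame({
    "birth_year": [1990, 1985, 2000, 1975, 1995],
    "age":        [  34,   39,   24,   49,   29],
    "income":     [52000, 61000, 31000, 80000, 45000],
})

# Pearson correlation matrix; coefficients near +1 or -1 flag strongly correlated
# (potentially redundant) attribute pairs that can be dropped before mining.
corr = df.corr(method="pearson")
print(corr)
print("age vs birth_year:", corr.loc["age", "birth_year"])
```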
3. Tuple Duplication
• Along with redundancies, data integration also has to deal with duplicate tuples.
• Duplicate tuples may appear in the resultant data if a denormalized table has been
used as a source for data integration.
4. Data Conflict Detection and Resolution
• Data conflict means that the data merged from different sources does not match; for example,
attribute values may differ across data sets.
• The difference may arise because the values are represented differently in the different data
sets. For example, the price of a hotel room may be represented in different currencies in
different data sources.
• These kinds of issues are detected and resolved during data integration.
So far, we have discussed the issues that a data analyst or system has to deal with during
data integration.
Data Reduction
• Data reduction is a process that reduces the volume of original data and
represents it in a much smaller volume.
• Data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume by maintaining the integrity of the
original data.
• By reducing the data, the efficiency of the data mining process is improved,
which produces the same analytical results.
• Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information.
• This can be beneficial in situations where the dataset is too large to be processed
efficiently, or where the dataset contains a large amount of irrelevant or
redundant information.
Data Reduction Techniques
1. Data Cube Aggregation
• This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
• For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to
the year 2022.
• If you want to get the annual sale per year, you just have to aggregate the sales per quarter for each
year.
• In this way, aggregation provides you with the required data, which is much smaller in size, and
thereby we achieve data reduction even without losing any data.
• The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis.
The data cube presents precomputed and summarized data, which gives data mining fast
access to it; a small aggregation sketch follows.
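A minimal sketch of quarter-to-year aggregation with pandas, assuming the quarterly sales figures below (they are made up, not the actual All Electronics numbers).

```python
import pandas as pd

# Hypothetical quarterly sales (in thousands) for two of the years.
sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 402, 390, 610],
})

# Rolling the quarters up to years keeps everything needed for an annual
# analysis while storing far fewer rows, which is the data reduction.
annual = sales.groupby("year", as_index=False)["sales"].sum()
print(annual)
```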
2.Dimensionality Reduction
• Dimensionality reduction eliminates the attributes from the data set under consideration
thereby reducing the volume of original data.
• In the section below, we will discuss three methods of dimensionality reduction.
a. Wavelet Transform
• In the wavelet transform, a data vector X is transformed into a numerically different data
vector X' such that both the X and X' vectors are of the same length. How, then, is it useful in
reducing data?
• The data obtained from the wavelet transform can be truncated. The compressed data is
obtained by retaining only a small fraction of the strongest wavelet coefficients.
• The wavelet transform can be applied to data cubes, sparse data, or skewed data.
b. Principal Component Analysis
• Suppose the data set to be analyzed has tuples with n attributes. Principal component
analysis searches for k orthogonal vectors (k ≤ n) that can best be used to
represent the data set.
• In this way, the original data can be projected onto a much smaller space, and
dimensionality reduction is achieved.
• Principal component analysis can be applied to sparse and skewed data; a minimal sketch follows.
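A minimal PCA sketch using scikit-learn on randomly generated data with n = 4 attributes; the data and the choice of k = 2 components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with n = 4 attributes; one attribute is correlated with another,
# so most of the variance can be captured in fewer directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

# Project the tuples onto k = 2 principal components (k < n).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2): same tuples, smaller space
print(pca.explained_variance_ratio_)     # fraction of variance kept per component
```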
c. Attribute Subset Selection
• A large data set has many attributes, some of which are irrelevant to data mining and
some of which are redundant.
• Attribute subset selection reduces the volume of data by eliminating the redundant
and irrelevant attributes.
• Attribute subset selection ensures that even after eliminating the unwanted
attributes we get a good subset of the original attributes, such that the resulting probability
distribution of the data is as close as possible to the original distribution using all the
attributes. A minimal selection sketch follows.
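A small attribute subset selection sketch using scikit-learn's SelectKBest on the built-in Iris data set as a stand-in; keeping k = 2 of the 4 attributes is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# A small labelled data set with 4 attributes; keep only the k = 2 attributes
# most relevant to the class label and discard the rest.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_subset = selector.fit_transform(X, y)

print(X.shape, "->", X_subset.shape)                  # (150, 4) -> (150, 2)
print("kept attribute indices:", selector.get_support(indices=True))
```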
3. Numerosity Reduction
Numerosity reduction reduces the volume of the original data and represents it in a much
smaller form. This technique includes two types: parametric and non-parametric numerosity
reduction.
Parametric
• Parametric numerosity reduction stores only the data parameters (the model) instead of the
original data.
• One method of parametric numerosity reduction is the regression and log-linear method.
Regression and Log-Linear
Linear regression models a relationship between two attributes by fitting a linear
equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms,
the attributes x and y are numeric database attributes, whereas w and b
are regression coefficients.
Multiple linear regression allows the response variable y to be modeled as a linear function of two or
more predictor variables.
• Log-linear model discovers the relation between two or more discrete attributes in the
database.
• Suppose, we have a set of tuples presented in n-dimensional space. Then the log-linear
model is used to study the probability of each tuple in a multidimensional space.
• Regression and log-linear methods can be used for sparse data and skewed data; a small
regression sketch follows.
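A minimal sketch of the parametric idea: fit y = wx + b with NumPy and keep only the two parameters (w, b) instead of the raw tuples. The x and y values below are invented.

```python
import numpy as np

# Hypothetical numeric attributes: x is the predictor, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = w*x + b; only the parameters (w, b) need to be stored,
# not the original (x, y) tuples.
w, b = np.polyfit(x, y, deg=1)
print("w =", round(w, 3), ", b =", round(b, 3))

# The response values can later be approximated from x using the stored model.
print(w * x + b)
```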
Non-Parametric
1.Histogram
A histogram is a graph that represents a frequency distribution, which describes how
often a value appears in the data. A histogram uses the binning method to represent
the data distribution of an attribute. It uses disjoint subsets, which we call bins or buckets.
We have data for All Electronics data set, which contains prices for regularly sold items.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18,
18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28,
30,30,30.
An equal-width histogram of these prices shows the frequency of the price
distribution.
A histogram is capable of representing dense, sparse, uniform, or skewed data. Instead
of only one attribute, the histogram can be implemented for multiple attributes. It can
effectively represent up to five attributes. The sketch below builds an equal-width histogram of the prices above.
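A minimal sketch that computes the equal-width histogram of the price list above with NumPy; the bucket boundaries 0, 10, 20, 30 are an illustrative choice.

```python
import numpy as np

# Price list from the All Electronics example above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets of width 10: the per-bucket counts replace the full price list.
counts, edges = np.histogram(prices, bins=[0, 10, 20, 30])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi}: {c} items")
```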
2.Clustering
• Clustering techniques group similar objects from the data in such a way that the
objects in a cluster are similar to each other but dissimilar to objects in
other clusters.
• How similar the objects inside a cluster are can be calculated using a
distance function: the greater the similarity between objects in a cluster, the closer they
appear in the cluster.
• The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance
between any two objects in the cluster.
• The original data is replaced by the cluster representation. This technique is more
effective if the data can be organized into distinct clusters, as in the sketch below.
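A minimal clustering sketch using k-means from scikit-learn: 100 synthetic 2-D points are replaced by 2 cluster representatives. The data and the choice of 2 clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# 100 synthetic 2-D points that fall into two well-separated groups.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
                    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))])

# Replace the 100 original points with 2 cluster representatives (the centroids).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first 10 labels:", kmeans.labels_[:10])
```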
3. Sampling
• One of the methods used for data reduction is sampling, as it is capable of reducing a
large data set to a much smaller data sample.
• Below we discuss the different methods by which we can sample a large data set
D containing N tuples:
Simple random sample without replacement (SRSWOR) of size s:
• Here, s tuples are drawn from the N tuples of data set D (s < N).
• The probability of drawing any tuple from data set D is 1/N; this means all tuples
have an equal probability of being sampled.
Simple random sample with replacement (SRSWR) of size s:
• It is similar to SRSWOR, except that each tuple drawn from data set D is recorded and
then placed back into data set D so that it can be drawn again.
Cluster sample:
• The tuples in data set D are clustered into M mutually disjoint subsets.
• From these clusters, a simple random sample of size s can be generated, where
s < M.
• Data reduction can then be applied by implementing SRSWOR on these clusters.
Stratified sample:
• The large data set D is partitioned into mutually disjoint sets called ‘strata’.
• Now a simple random sample is taken from each stratum to get stratified data.
• This method is effective for skewed data.
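A minimal sampling sketch with pandas covering SRSWOR, SRSWR, and a stratified sample; the data set D and the stratum definition are invented for illustration.

```python
import pandas as pd

# Hypothetical data set D with N = 1000 tuples.
D = pd.DataFrame({"value": range(1000)})

# SRSWOR: simple random sample of size s = 50 without replacement.
srswor = D.sample(n=50, replace=False, random_state=0)

# SRSWR: same size, but each drawn tuple is placed back and may be drawn again.
srswr = D.sample(n=50, replace=True, random_state=0)

# Stratified sample: partition D into strata and draw a sample from each stratum.
D["stratum"] = D["value"] % 4
stratified = D.groupby("stratum", group_keys=False).sample(n=10, random_state=0)

print(len(srswor), len(srswr), len(stratified))   # 50 50 40
```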
4.Data Compression
• The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding).
• We can divide it into two types based on their compression techniques.
Lossless Compression –
• Encoding techniques such as run-length encoding allow a simple and modest reduction in
data size; a small sketch follows after this list.
• Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
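A minimal run-length encoding sketch in plain Python showing that the exact original sequence can be restored, which is what makes the compression lossless; the helper function names are my own.

```python
def run_length_encode(values):
    """Store the sequence as (value, run length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

def run_length_decode(encoded):
    """Rebuild the exact original sequence from the (value, count) pairs."""
    return [v for v, count in encoded for _ in range(count)]

data = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
packed = run_length_encode(data)
print(packed)                              # [['A', 3], ['B', 2], ['C', 4]]
assert run_length_decode(packed) == data   # nothing is lost
```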
Lossy Compression –
• Methods such as the discrete wavelet transform and PCA (principal component
analysis) are examples of this compression. For example, the JPEG image format uses lossy
compression, but we can still recover meaning equivalent to the original image.
• In lossy-data compression, the decompressed data may differ from the original data but
are useful enough to retrieve information from them.
5.Discretization and concept hierarchy generation −
• Techniques of data discretization are used to divide attributes of a continuous
nature into data with intervals.
• Many constant values of the attributes are replaced by labels of small intervals. This
means that mining results are presented in a concise and easily understandable way.
Top-down discretization –
• If you first consider one or a couple of points (so-called breakpoints or split points) to
divide the whole range of attribute values, and repeat this method on the resulting intervals,
the process is known as top-down discretization, also known as splitting.
Bottom-up discretization –
• If you first consider all the constant values as split points, and some are discarded through
a combination of neighbouring values into intervals, that process is called bottom-up
discretization, also known as merging.
Concept Hierarchies:
• It reduces the data size by collecting and then replacing the low-level concepts (such as
43 for age) with high-level concepts (categorical variables such as middle age or Senior).
• For numeric data following techniques can be followed:
Binning
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like the process of binning, histogram analysis is used to partition the values of an attribute X
into disjoint ranges called buckets. There are several partitioning rules:
Equal-frequency partitioning: partitioning the values so that each bucket contains roughly the
same number of occurrences from the data set.
Equal-width partitioning: partitioning the values into intervals of fixed width based on the number
of bins, e.g., a set of values split into the ranges 0-20, 20-40, and so on.
Clustering: Grouping similar data together.
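A minimal discretization sketch with pandas showing equal-width (pd.cut) and equal-frequency (pd.qcut) partitioning; the ages and interval labels are hypothetical.

```python
import pandas as pd

# Hypothetical continuous attribute.
ages = pd.Series([5, 13, 22, 27, 34, 41, 46, 58, 63, 71])

# Equal-width partitioning: 3 intervals of equal width over the value range.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

# Equal-frequency partitioning: 3 intervals holding roughly the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```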
Data Transformation & Data Discretization
• Data transformation in data mining is done by combining unstructured data with
structured data to analyze it later. It is also important when the data is transferred
to a new cloud data warehouse.
• When the data is homogeneous and well-structured, it is easier to analyze and
look for patterns.
• For example, a company has acquired another firm and now has to consolidate
all the business data. The smaller company may be using a different database
than the parent firm. Also, the data in these databases may have unique IDs,
keys, and values. All this needs to be formatted so that all the records are similar
and can be evaluated.
• This is why data transformation methods are applied. And, they are described
below:
Data Smoothing
This method is used for removing the noise from a dataset. Noise is referred to as the
distorted and meaningless data within a dataset.
Smoothing uses algorithms to highlight the special features in the data.
After removing noise, the process can detect small changes in the data and thus reveal
special patterns.
Any data modification or trend can be identified by this method; a small smoothing sketch follows.
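A minimal smoothing sketch, assuming pandas: noisy values are smoothed by equal-frequency bin means and, alternatively, by a centered moving average. The price readings and the bin depth of 4 are made up.

```python
import pandas as pd

# Hypothetical noisy price readings, already sorted.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Smoothing by bin means: group the values into equal-frequency bins of depth 4
# and replace every value by the mean of its bin.
bins = prices.index // 4
bin_means = prices.groupby(bins).transform("mean")

# Alternative: a centered moving average, another simple smoothing filter.
moving_avg = prices.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"raw": prices, "bin_mean": bin_means, "moving_avg": moving_avg}))
```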
Data Aggregation
• Aggregation is the process of collecting data from a variety of sources and storing it in
a single format. Here, data is collected, stored, analyzed, and presented in a report or
summary format.
• It helps in gathering more information about a particular data cluster. The method helps
in collecting vast amounts of data.
• This is a crucial step as accuracy and quantity of data is important for proper analysis.
• Companies collect data about their website visitors. This gives them an idea about
customer demographics and behavior metrics. This aggregated data assists them in
designing personalized messages, offers, and discounts.
Discretization
• This is a process of converting continuous data into a set of data intervals.
Continuous attribute values are substituted by small interval labels. This makes the
data easier to study and analyze.
• If a continuous attribute is handled by a data mining task, its continuous values can
be replaced by discrete interval labels.
• This improves the efficiency of the task.
• This method is also called a data reduction mechanism as it transforms a large
dataset into a set of categorical data.
• Discretization can be done by Binning, Histogram Analysis, and Correlation
Analyses.
• Discretization also uses decision tree-based algorithms to produce short, compact,
and accurate results when using discrete values.
Generalization
• In this process, low-level data attributes are transformed into high-level data
attributes using concept hierarchies.
• This conversion from a lower level to a higher conceptual level is useful to get a
clearer picture of the data.
• For example, age data in a dataset may take numeric values such as 20 or 30. It is
transformed to a higher conceptual level, such as the categorical values young and old.
• Data generalization can be divided into two approaches:
the data cube (OLAP) approach and the attribute-oriented induction (AOI) approach.
Attribute construction
In the attribute construction method, new attributes are
created from an existing set of attributes.
For example, in a dataset of employee information, the
attributes can be employee name, employee ID, and address.
These attributes can be used to construct another dataset that
contains information about the employees who joined in
the year 2019 only.
This method of construction makes mining more efficient and
helps in creating new datasets quickly; a small sketch follows.
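A minimal attribute construction sketch with pandas: a new joining_year attribute is derived from an assumed joining_date attribute and used to build the 2019-only data set; all column names and records are hypothetical.

```python
import pandas as pd

# Hypothetical employee records; joining_date is an existing attribute.
employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meera"],
    "joining_date": pd.to_datetime(["2019-03-01", "2018-07-15", "2019-11-20"]),
})

# Construct a new attribute from an existing one, then use it to derive a
# data set of employees who joined in 2019 only.
employees["joining_year"] = employees["joining_date"].dt.year
joined_2019 = employees[employees["joining_year"] == 2019]
print(joined_2019)
```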
Normalization
A part of data pre-processing, this is one of the crucial techniques for data transformation in data mining.
Here, the data is transformed so that it falls within a given range. When attributes are on different ranges or scales, data
modeling and mining can be difficult.
Normalization helps in applying data mining algorithms and extracting data faster.
The popular normalization methods are:
Min-max normalization
In this technique of data normalization, a linear transformation is performed on the original data. Minimum and
maximum value from data is fetched and each value is replaced according to the following formula:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where A is the attribute, min_A and max_A are the minimum and maximum values of A, v is the old value of
each entry in the data, v' is the new value of each entry, and new_min_A and new_max_A are the minimum and
maximum values of the required range (i.e. the boundary values of the range) respectively.
Example: Suppose the income attribute ranges from $10,000 to $95,000 and is normalized to [0.0, 1.0]. By min-max normalization, a
value of $64,300 for income is transformed to ((64,300 - 10,000) / (95,000 - 10,000)) * (1.0 - 0.0) + 0.0 = 0.6388.
Z-score normalization
In this technique, the values of an attribute A are normalized based on the mean and standard deviation of A: v' = (v - mean_A) / std_A.
Decimal scaling
It normalizes by moving the decimal point of the values of attribute A.
The number of decimal places moved depends on the maximum absolute value of A.
A value v of A is normalized to v' by using the formula below:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1. A small sketch of these
normalization methods follows.
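A minimal sketch of the three normalization methods with NumPy, applied to an invented income attribute (the $64,300 entry reproduces the 0.6388 result from the min-max example above).

```python
import numpy as np

# Invented income values; min is 10,000 and max is 95,000 as in the example.
income = np.array([10_000, 25_000, 64_300, 80_000, 95_000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min
print(minmax)                 # the 64,300 entry maps to ~0.6388

# Z-score normalization: (v - mean_A) / std_A.
zscore = (income - income.mean()) / income.std()
print(zscore)

# Decimal scaling: divide by 10^j, where j is the smallest integer
# making the largest absolute scaled value less than 1.
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
print(j, income / 10 ** j)    # j = 5, values scaled into (0, 1)
```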