Data Integration in Data Mining
Data integration is the process of merging data from several disparate sources. While performing
data integration, you must deal with data redundancy, inconsistency, duplication, and so on. In data
mining, data integration is a data preprocessing step that combines data from multiple heterogeneous
sources into a coherent store, providing a unified view of the data. These sources may include
multiple data cubes, databases, or flat files. The data integration approach is formally stated as a
triple (G, S, M), where G represents the global schema, S represents the schemas of the heterogeneous
sources, and M represents the mapping between queries over the source schemas and queries over the
global schema.
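To make the triple more concrete, here is a minimal Python sketch of (G, S, M): two hypothetical source schemas are mapped onto a single global schema. All table and column names are assumptions made for this illustration, not part of any standard.

```python
# A minimal sketch of the (G, S, M) triple: a global schema G,
# two source schemas S, and a mapping M from source columns to G.
# All schema and column names here are hypothetical.

G = ["customer_id", "full_name", "country"]           # global schema

S = {
    "crm_db":   ["cust_no", "name", "country_code"],  # source schema 1
    "web_shop": ["id", "customer_name", "country"],   # source schema 2
}

M = {  # mapping: (source, source column) -> global column
    ("crm_db", "cust_no"): "customer_id",
    ("crm_db", "name"): "full_name",
    ("crm_db", "country_code"): "country",
    ("web_shop", "id"): "customer_id",
    ("web_shop", "customer_name"): "full_name",
    ("web_shop", "country"): "country",
}

def to_global(source: str, record: dict) -> dict:
    """Rewrite one source record into the global schema using M."""
    return {M[(source, col)]: value for col, value in record.items()}

print(to_global("crm_db", {"cust_no": 7, "name": "Ada", "country_code": "UK"}))
```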
In this article, you will learn about data integration in data mining, including its methods, issues,
techniques, and tools.
Data integration is important because it provides a uniform view of data scattered across sources while
also maintaining data accuracy. It helps the data mining program extract meaningful information, which
in turn helps executives and managers make strategic decisions for the enterprise's benefit.
There are two main approaches to data integration: tight coupling and loose coupling.
Tight Coupling
It is the process of using ETL (Extraction, Transformation, and Loading) to combine data
from various sources into a single physical location.
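As a rough sketch, the ETL flow behind tight coupling can be illustrated with pandas and SQLite: data is extracted from two sources, transformed into a common schema, and loaded into one physical store. The source data, column names, and the warehouse.db file are assumptions for this example.

```python
import sqlite3
import pandas as pd

# Extract: read from two hypothetical, heterogeneous sources.
crm = pd.DataFrame({"cust_no": [1, 2], "name": ["Ada", "Bob"]})
shop = pd.DataFrame({"id": [2, 3], "customer_name": ["Bob", "Cleo"]})

# Transform: rename columns so both sources share the global schema.
crm = crm.rename(columns={"cust_no": "customer_id", "name": "full_name"})
shop = shop.rename(columns={"id": "customer_id", "customer_name": "full_name"})
unified = pd.concat([crm, shop], ignore_index=True)

# Load: store the unified view in a single physical location (SQLite here).
with sqlite3.connect("warehouse.db") as conn:
    unified.to_sql("customers", conn, if_exists="replace", index=False)
```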
Loose Coupling
With loose coupling, the data is kept in the actual source databases. This approach provides an
interface that takes a query from the user, translates it into a format that each source database
can understand, and then sends the query directly to the source databases to obtain the result.
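The sketch below illustrates this idea: a mediator keeps no copy of the data, it translates a query on the global schema into each source's own format at query time and merges the answers. The two "source" functions and their fields are hypothetical stand-ins for real heterogeneous databases.

```python
# Loose coupling sketch: data stays in the sources; a mediator translates
# a global-schema query for each source and merges the results on demand.

def query_crm(country_code):
    rows = [{"cust_no": 1, "name": "Ada", "country_code": "UK"}]
    return [r for r in rows if r["country_code"] == country_code]

def query_shop(country):
    rows = [{"id": 3, "customer_name": "Cleo", "country": "UK"}]
    return [r for r in rows if r["country"] == country]

def mediator(country):
    """Translate one global query into source-specific queries and merge."""
    results = []
    for r in query_crm(country):
        results.append({"customer_id": r["cust_no"], "full_name": r["name"]})
    for r in query_shop(country):
        results.append({"customer_id": r["id"], "full_name": r["customer_name"]})
    return results

print(mediator("UK"))
```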
Redundancy and Correlation Analysis
Inconsistencies in attribute naming or representation can further increase redundancy in the
integrated data. Correlation analysis can be used to detect such redundancy: the attributes are
examined to determine how strongly one depends on the other, thereby revealing the link between them.
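As a small sketch, correlation analysis between numeric attributes can be done with pandas; a correlation coefficient close to +1 or -1 suggests one attribute is largely redundant given the other. The data and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical integrated data with a possibly redundant attribute pair.
df = pd.DataFrame({
    "price_usd": [10.0, 20.0, 35.0, 50.0],
    "price_eur": [9.2, 18.4, 32.2, 46.0],   # derived from price_usd
    "quantity":  [3, 1, 4, 2],
})

# Pearson correlation: values near +/-1 indicate likely redundancy.
corr = df.corr()
print(corr.loc["price_usd", "price_eur"])   # ~1.0 -> redundant attribute
```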
Tuple Duplication
In addition to redundancy, data integration must also deal with duplicate tuples. Duplicate tuples
may appear in the resulting data if a denormalized table was used as a source for data integration.
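A quick way to detect and remove duplicate tuples after integration is sketched below with pandas; the sample records are made up for this example.

```python
import pandas as pd

# Two sources may contribute the same customer twice after integration.
merged = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "full_name":   ["Ada", "Bob", "Bob", "Cleo"],
})

print(merged.duplicated().sum())         # count duplicate tuples
deduplicated = merged.drop_duplicates()  # keep the first occurrence
print(deduplicated)
```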
Manual Integration
This method avoids using automation during data integration. The data analyst collects, cleans, and
integrates the data by hand to produce meaningful information. This strategy is suitable for a small
organization with a limited data set. However, because the entire process must be done manually, it
becomes time-consuming for large, complex, and recurring integrations.
Middleware Integration
Middleware software is used to take data from many sources, normalize it, and store it in the
resulting data set. This technique is used when an enterprise needs to integrate data from legacy
systems into modern systems. The middleware acts as a translator between the legacy and modern
systems; you can think of it as an adapter that allows two systems with different interfaces to be
connected. It is only applicable to certain systems.
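The adapter idea can be sketched as follows; the fixed-width legacy record format and the modern schema shown here are assumptions chosen purely for illustration.

```python
# Middleware sketch: an adapter translates records from a legacy system's
# format into the schema the modern system expects. Formats are hypothetical.

class LegacyAdapter:
    """Normalizes pipe-delimited legacy records into modern dictionaries."""

    def translate(self, legacy_record: str) -> dict:
        # Assumed legacy format: "ID:0001|NAME:ADA LOVELACE|CTRY:UK"
        fields = dict(part.split(":", 1) for part in legacy_record.split("|"))
        return {
            "customer_id": int(fields["ID"]),
            "full_name": fields["NAME"].title(),
            "country": fields["CTRY"],
        }

adapter = LegacyAdapter()
print(adapter.translate("ID:0001|NAME:ADA LOVELACE|CTRY:UK"))
```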
Application-based integration
This technique uses software applications to extract, transform, and load data from disparate
sources. It saves time and effort, but it is a little more complicated because building such an
application requires technical understanding.
Data Warehousing
This technique is loosely related to the uniform access integration technique, except that the
unified view is stored in a separate location. This enables the data analyst to handle more
sophisticated queries. Although it is a promising solution, the unified view or copy of the data
requires separate storage, which adds storage and maintenance costs.
Integration tools
There are various data integration tools used in data mining. Some of them are as follows: