Data Integration and The Extraction, Transformation and Loading Processes
Global competitive pressures, demand for return on investment (ROI), management and
investor inquiry, and government regulations are forcing business managers to rethink how they
integrate and manage their businesses. A decision maker typically needs access to multiple
sources of data that must be integrated. Before data warehouses, DMs, and BI software,
providing access to data sources was a major, laborious process. Even with modern Web-based
data management tools, recognizing what data to access and providing them to the decision
maker is a nontrivial task that requires database specialists. As data warehouses grow in size, the
issues of integrating data grow as well. Business analysis needs also continue to evolve. Mergers
and acquisitions, regulatory requirements, and the introduction of new channels can drive changes
in data requirements; in addition, business users increasingly demand access to real-time, unstructured, and/or remote data.
And everything must be integrated with the contents of an existing data warehouse. Moreover,
access via PDAs and through speech recognition and synthesis is becoming more commonplace,
further complicating integration issues (Edwards, 2003). Many integration projects involve
enterprise-wide systems. Orovic (2003) provided a checklist of what works and what does not
work when attempting such a project. Properly integrating data from various databases and other
disparate sources is difficult. When it is not done properly, though, it can lead to disaster in
enterprise-wide systems such as CRM, ERP, and supply-chain projects (Nash, 2002).

Data Integration

Data integration comprises three major processes that, when correctly implemented,
permit data to be accessed and made accessible to an array of ETL and analysis tools and the
data warehousing environment: data access (i.e., the ability to access and extract data from any
data source), data federation (i.e., the integration of business views across multiple data stores),
and change capture (based on the identification, capture, and delivery of the changes made to
enterprise data sources). See Application Case 3.2 for an example of how BP Lubricant benefits
from implementing a data warehouse that integrates data from many sources. Some vendors,
such as SAS Institute, Inc., have developed strong data integration tools. The SAS Enterprise Data
Integration Server includes customer data integration tools that improve data quality in the
integration process. The Oracle Business Intelligence Suite assists in integrating data as well.
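To make the change-capture process concrete, the following is a minimal sketch in Python (using only the standard sqlite3 module). It detects inserts, updates, and deletes by comparing the current source rows against a snapshot kept from the previous extract; the customer table and all of its values are hypothetical, and real change-capture tools typically read database transaction logs rather than diffing snapshots.

import sqlite3

# Hypothetical enterprise data source.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
source.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                   [(1, "Acme", "Dallas"), (2, "Globex", "Austin"), (3, "Initech", "Tulsa")])

# Snapshot of the same table as it looked during the previous extract run.
previous = {1: ("Acme", "Dallas"), 2: ("Globex", "Houston")}

# Change capture: identify, capture, and deliver only the rows that changed.
changes = []
current_ids = set()
for cid, name, city in source.execute("SELECT id, name, city FROM customer"):
    current_ids.add(cid)
    if cid not in previous:
        changes.append(("insert", cid, name, city))
    elif previous[cid] != (name, city):
        changes.append(("update", cid, name, city))
for cid in previous.keys() - current_ids:
    changes.append(("delete", cid, None, None))

print(changes)
# [('update', 2, 'Globex', 'Austin'), ('insert', 3, 'Initech', 'Tulsa')]

Only the delta is delivered downstream, which is what lets the warehouse stay current without re-extracting every enterprise data source in full.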
At the heart of the technical side of the data warehousing process is extraction,
transformation, and load (ETL). ETL technologies, which have existed for some time, are
instrumental in the process and use of data warehouses. The ETL process is an integral
component in any data-centric project. IT managers are often faced with challenges because the
ETL process typically consumes 70% of the time in a data-centric project. The ETL process
consists of extraction (i.e., reading data from one or more databases), transformation (i.e.,
converting the extracted data from its previous form into the form required by the data warehouse
or another target database), and load (i.e., putting the data
into the data warehouse). Transformation occurs by using rules or lookup tables or by combining
the data with other data. The three database functions are integrated into one tool to pull data out
of one or more databases and place them into another, consolidated database or a data
warehouse. ETL tools also transport data between sources and targets, document how data
elements (e.g., metadata) change as they move between source and target, exchange metadata
with other applications as needed, and administer all runtime processes and operations (e.g.,
scheduling, error management, audit logs, statistics). ETL is extremely important for data
integration as well as for data warehousing. The purpose of the ETL process is to load the
warehouse with integrated and cleansed data. The data used in ETL processes can come from
any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel
spreadsheet, or even a message queue. In Figure 3.9, we outline the ETL process.

FIGURE 3.9 The ETL Process. [Diagram: a packaged application, a legacy system, other internal applications, and a transient data source feed extract, transform, and cleanse steps whose output is loaded into the data warehouse and its data marts.]

The process of migrating data to a data warehouse involves the extraction of data from all relevant sources. Data sources may consist of files extracted from OLTP databases, spreadsheets, personal databases (e.g., Microsoft Access), or external files. Typically, all the input files are written to a set of staging tables, which are designed to facilitate the load process.
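To ground these three steps, here is a minimal sketch of the ETL flow in Python, again using only the standard sqlite3 module: it extracts rows from a hypothetical orders source, transforms them with a lookup table that standardizes an encoded attribute (one of the transformation techniques mentioned above), and loads the result into a staging table. Every table, column, and rule is invented for illustration; commercial ETL tools add the scheduling, error management, and audit logging described earlier.

import sqlite3

# Hypothetical source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, region_code TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "TX", 250.0), (2, "CA", 125.5), (3, "tx", 80.0)])

# Hypothetical warehouse with a staging table to facilitate the load.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE stg_orders
                     (order_id INTEGER, region_name TEXT, amount REAL)""")

# Transformation rule: a lookup table that standardizes an encoded attribute.
REGION_LOOKUP = {"TX": "Texas", "CA": "California"}

# Extract: read data from the source database.
rows = source.execute("SELECT order_id, region_code, amount FROM orders")

# Transform: standardize region codes; upper() cleanses inconsistent source casing.
cleansed = [(oid, REGION_LOOKUP.get(code.upper(), "Unknown"), amt)
            for oid, code, amt in rows]

# Load: put the transformed data into the warehouse staging table.
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", cleansed)
warehouse.commit()
print(warehouse.execute("SELECT * FROM stg_orders").fetchall())
# [(1, 'Texas', 250.0), (2, 'California', 125.5), (3, 'Texas', 80.0)]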
A data warehouse contains numerous business rules that define such things as how the data will be used, summarization
rules, standardization of encoded attributes, and calculation rules. Any data quality issues
pertaining to the source files need to be corrected before the data are loaded into the data
warehouse. One of the benefits of a well-designed data warehouse is that these rules can be
stored in a metadata repository and applied to the data warehouse centrally. This differs from an
OLTP approach, which typically has data and business rules scattered throughout the system.
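As a rough illustration of this idea, the sketch below keeps business rules in a single repository (here simply a Python dict standing in for a metadata repository) and applies them centrally during loading, rather than scattering them through application code as in the OLTP approach. The rule names, columns, and values are all hypothetical.

# Centrally stored business rules: each rule lives in one repository and is
# applied uniformly to every record during the load. Names are hypothetical.
RULE_REPOSITORY = {
    "country": lambda v: {"US": "United States", "UK": "United Kingdom"}.get(v, v),
    "revenue": lambda v: round(float(v), 2),      # calculation rule
    "status":  lambda v: v.strip().upper(),       # standardization of an encoded attribute
}

def apply_rules(record: dict) -> dict:
    """Apply every centrally defined rule to the matching fields of a record."""
    return {col: RULE_REPOSITORY.get(col, lambda v: v)(val)
            for col, val in record.items()}

print(apply_rules({"country": "US", "revenue": "1234.567", "status": " active "}))
# {'country': 'United States', 'revenue': 1234.57, 'status': 'ACTIVE'}

Because the rules are data rather than scattered code, changing a summarization or standardization rule means updating the repository once, and every subsequent load picks up the change.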
The process of loading data into a data warehouse can be performed either through data
transformation tools that provide a GUI to aid in the development and maintenance of business
rules or through more traditional methods, such as developing programs or utilities to load the
data warehouse, using programming languages such as PL/SQL, C++, Java, or .NET Framework
languages. This decision is not easy for organizations. Several issues affect whether an
organization will purchase data transformation tools or build the transformation process itself:

• Data transformation tools are expensive.
• Data transformation tools may have a long learning curve.
• It is difficult to measure how well the IT organization is doing until it has learned to use the data transformation tools.

In the long run, however, a transformation-tool approach should simplify the detection and
scrubbing of data (i.e., the removal of any anomalies in the data). OLAP and data mining tools
rely on how well the data are transformed. As an example of effective ETL, Motorola, Inc., uses
ETL to feed its data warehouses. Motorola collects information from 30 different procurement
systems and sends it to its global SCM data warehouse for analysis of aggregate company
spending (see Songini, 2004). Solomon (2005) classified ETL technologies into four categories:
sophisticated, enabler, simple, and rudimentary. It is generally acknowledged that tools in the
sophisticated category will result in the ETL process being better documented and more
accurately managed as the data warehouse project evolves. Even though it is possible for
programmers to develop software for ETL, it is simpler to use an existing ETL tool. The
following are some of the important criteria in selecting an ETL tool (see Brown, 2004):

• Ability to read from and write to an unlimited number of data source architectures
• Automatic capturing and delivery of metadata
• An easy-to-use interface for the developer and the functional user

Performing extensive ETL may be a sign of
poorly managed data and a fundamental lack of a coherent data management strategy. Karacsony
(2006) indicated that there is a direct correlation between the extent of redundant data and the
number of ETL processes. When data are managed correctly as an enterprise asset, ETL efforts
are significantly reduced, and redundant data are completely eliminated. This leads to huge
savings in maintenance and greater efficiency in new development while also improving data
quality. Poorly designed ETL processes are costly to maintain, change, and update.
Consequently, it is crucial to make the proper choices in terms of the technology and tools to use
for developing and maintaining the ETL process. A number of packaged ETL tools are available.
Database vendors currently offer ETL capabilities that both enhance and compete with
independent ETL tools. SAS acknowledges the importance of data quality and offers the
industry’s first fully integrated solution that merges ETL and data quality to transform data into
strategically valuable assets. Other ETL software providers include Microsoft, Oracle, IBM,