
Data Integration and the Extraction, Transformation, and Load (ETL) Processes

By Sharda et al. (2018).

Global competitive pressures, demand for return on investment (ROI), management and

investor inquiry, and government regulations are forcing business managers to rethink how they

integrate and manage their businesses. A decision maker typically needs access to multiple

sources of data that must be integrated. Before data warehouses, data marts (DMs), and BI software,

providing access to data sources was a major, laborious process. Even with modern Web-based

data management tools, recognizing what data to access and providing them to the decision

maker is a nontrivial task that requires database specialists. As data warehouses grow in size, the

issues of integrating data grow as well. The business analysis needs continue to evolve. Mergers

and acquisitions, regulatory requirements, and the introduction of new channels can drive

changes in BI requirements. In addition to historical, cleansed, consolidated, and point-in-time

data, business users increasingly demand access to real-time, unstructured, and/or remote data.

And everything must be integrated with the contents of an existing data warehouse. Moreover,

access via PDAs and through speech recognition and synthesis is becoming more commonplace,

further complicating integration issues (Edwards, 2003). Many integration projects involve

enterprise-wide systems. Orovic (2003) provided a checklist of what works and what does not

work when attempting such a project. Properly integrating data from various databases and other

disparate sources is difficult. When it is not done properly, though, it can lead to disaster in

enterprise-wide systems such as CRM, ERP, and supply-chain projects (Nash, 2002).

Data Integration

Data integration comprises three major processes that, when correctly implemented,

permit data to be accessed and made accessible to an array of ETL and analysis tools and the

data warehousing environment: data access (i.e., the ability to access and extract data from any
data source), data federation (i.e., the integration of business views across multiple data stores),

and change capture (based on the identification, capture, and delivery of the changes made to

enterprise data sources). See Application Case 3.2 for an example of how BP Lubricants benefits

from implementing a data warehouse that integrates data from many sources. Some vendors,

such as SAS Institute, Inc., have developed strong data integration tools. The SAS enterprise data

integration server includes customer data integration tools that improve data quality in the

integration process. The Oracle Business Intelligence Suite assists in integrating data as well.
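
Of these three processes, change capture is perhaps the easiest to illustrate in code. The following Python sketch is purely illustrative and not from the text: it shows one simple way to capture changes by differencing the current state of a source table against the previously extracted snapshot, so that only new, modified, or deleted rows are delivered downstream. The table, key, and column names are hypothetical.

```python
# Illustrative change-capture sketch (hypothetical names, not from the text).
# Compare the rows extracted last time with the rows in the source now and
# deliver only the differences to the warehouse load process.

def capture_changes(previous_snapshot, current_rows, key="customer_id"):
    """Return inserts, updates, and deletes since the last extraction."""
    old = {row[key]: row for row in previous_snapshot}
    new = {row[key]: row for row in current_rows}

    inserts = [row for k, row in new.items() if k not in old]
    updates = [row for k, row in new.items() if k in old and row != old[k]]
    deletes = [row for k, row in old.items() if k not in new]
    return inserts, updates, deletes


previous = [{"customer_id": 1, "segment": "RETAIL"},
            {"customer_id": 2, "segment": "FLEET"}]
current = [{"customer_id": 1, "segment": "COMMERCIAL"},  # changed row
           {"customer_id": 3, "segment": "RETAIL"}]       # new row

inserts, updates, deletes = capture_changes(previous, current)
print(inserts, updates, deletes)  # row 3 inserted, row 1 updated, row 2 deleted
```

A production change-capture tool usually works against database transaction logs rather than full snapshots, but the goal is the same: identify, capture, and deliver only the changed enterprise data.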

Extraction, Transformation, and Load

At the heart of the technical side of the data warehousing process is extraction,

transformation, and load (ETL). ETL technologies, which have existed for some time, are

instrumental in the process and use of data warehouses. The ETL process is an integral

component in any data-centric project. IT managers are often faced with challenges because the

ETL process typically consumes 70% of the time in a data-centric project. The ETL process

consists of extraction (i.e., reading data from one or more databases), transformation (i.e.,

converting the extracted data from its previous form into the form in which it needs to be so that

it can be placed into a data warehouse or simply another database), and load (i.e., putting the data

into the data warehouse). Transformation occurs by using rules or lookup tables or by combining

the data with other data. The three database functions are integrated into one tool to pull data out

of one or more databases and place them into another, consolidated database or a data

warehouse. ETL tools also transport data between sources and targets, document how data

elements (e.g., metadata) change as they move between source and target, exchange metadata

with other applications as needed, and administer all runtime processes and operations (e.g.,

scheduling, error management, audit logs, statistics). ETL is extremely important for data
integration as well as for data warehousing. The purpose of the ETL process is to load the

warehouse with integrated and cleansed data. The data used in ETL processes can come from

any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel

spreadsheet, or even a message queue. In Figure 3.9, we outline the ETL process.

FIGURE 3.9 The ETL Process. Data from a legacy system, a packaged application, other internal applications, and transient data sources are extracted, transformed, cleansed, and loaded into the data warehouse and data marts.

The process of migrating data to a data warehouse involves the extraction of data from all relevant sources.
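
To make the three steps concrete, the sketch below builds a toy ETL run in Python, with SQLite standing in for both the source system and the warehouse. Everything in it is hypothetical and is not the process shown in Figure 3.9 or any vendor's tool: the table names, the country-code lookup, and the staging table are invented purely to show extract, transform, and load in order.

```python
import sqlite3

# Toy ETL run (hypothetical tables and codes). SQLite stands in for both the
# OLTP source and the data warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "US", 120.0), (2, "U.S.A.", 80.0), (3, "DE", 45.5)])

warehouse.execute("CREATE TABLE stg_orders (order_id INTEGER, country TEXT, amount REAL)")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, country_code TEXT, amount REAL)")

# Extract: read the data from the source database.
rows = source.execute("SELECT order_id, country, amount FROM orders").fetchall()

# Stage: write the raw input to a staging table to facilitate the load.
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

# Transform: standardize an encoded attribute using a lookup table.
country_lookup = {"US": "USA", "U.S.A.": "USA", "DE": "DEU"}
transformed = [(order_id, country_lookup.get(country, "UNKNOWN"), amount)
               for order_id, country, amount in warehouse.execute("SELECT * FROM stg_orders")]

# Load: put the cleansed, conformed rows into the warehouse fact table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
warehouse.commit()

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```

A real ETL tool adds what hand-written scripts such as this tend to omit: scheduling, error handling, audit logging, and metadata describing how each element changed on its way from source to target.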

Data sources may consist of files extracted from OLTP databases, spreadsheets, personal

databases (e.g., Microsoft Access), or external files. Typically, all the input files are written to a

set of staging tables, which are designed to facilitate the load process. A data warehouse contains

numerous business rules that define such things as how the data will be used, summarization

rules, standardization of encoded attributes, and calculation rules. Any data quality issues

pertaining to the source files need to be corrected before the data are loaded into the data

warehouse. One of the benefits of a well-designed data warehouse is that these rules can be

stored in a metadata repository and applied to the data warehouse centrally. This differs from an

OLTP approach, which typically has data and business rules scattered throughout the system.
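
The sketch below is a hypothetical illustration of that idea, with made-up attribute names and rules: the standardization and calculation rules live in a single rule repository (playing the role of the metadata repository) and every record passes through the same function, instead of each OLTP program encoding its own version of the rules.

```python
# Hypothetical sketch: business rules held centrally and applied uniformly
# during the ETL load, rather than scattered through individual applications.

# Stand-in for a metadata repository: one place that defines the
# standardization and calculation rules for the warehouse attributes.
rule_repository = {
    "gender":  {"type": "standardize", "map": {"M": "MALE", "F": "FEMALE"}},
    "revenue": {"type": "calculate",
                "formula": lambda rec: rec["units"] * rec["unit_price"]},
}

def apply_rules(record, rules=rule_repository):
    """Apply every centrally defined rule to one incoming record."""
    out = dict(record)
    for attribute, rule in rules.items():
        if rule["type"] == "standardize" and attribute in out:
            out[attribute] = rule["map"].get(out[attribute], "UNKNOWN")
        elif rule["type"] == "calculate":
            out[attribute] = rule["formula"](out)
    return out

source_row = {"customer_id": 7, "gender": "F", "units": 3, "unit_price": 20.0}
print(apply_rules(source_row))
# -> {'customer_id': 7, 'gender': 'FEMALE', 'units': 3, 'unit_price': 20.0, 'revenue': 60.0}
```

Changing a rule then means updating the repository once, and every subsequent load picks up the new definition.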

The process of loading data into a data warehouse can be performed either through data

transformation tools that provide a GUI to aid in the development and maintenance of business

rules or through more traditional methods, such as developing programs or utilities to load the

data warehouse, using programming languages such as PL/SQL, C++, Java, or .NET Framework

languages. This decision is not easy for organizations. Several issues affect whether an

organization will purchase data transformation tools or build the transformation process itself:

• Data transformation tools are expensive.
• Data transformation tools may have a long learning curve.
• It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools.

In the long run, a transformation-tool approach should simplify the

maintenance of an organization’s data warehouse. Transformation tools can also be effective in

detecting and scrubbing data (i.e., removing any anomalies in the data). OLAP and data mining tools

rely on how well the data are transformed. As an example of effective ETL, Motorola, Inc., uses

ETL to feed its data warehouses. Motorola collects information from 30 different procurement

systems and sends it to its global SCM data warehouse for analysis of aggregate company

spending (see Songini, 2004). Solomon (2005) classified ETL technologies into four categories:

sophisticated, enabler, simple, and rudimentary. It is generally acknowledged that tools in the

sophisticated category will result in the ETL process being better documented and more

accurately managed as the data warehouse project evolves. Even though it is possible for

programmers to develop software for ETL, it is simpler to use an existing ETL tool. The

following are some of the important criteria in selecting an ETL tool (see Brown, 2004):

• Ability to read from and write to an unlimited number of data source architectures
• Automatic capturing and delivery of metadata
• A history of conforming to open standards
• An easy-to-use interface for the developer and the functional user

Performing extensive ETL may be a sign of

poorly managed data and a fundamental lack of a coherent data management strategy. Karacsony

(2006) indicated that there is a direct correlation between the extent of redundant data and the

number of ETL processes. When data are managed correctly as an enterprise asset, ETL efforts

are significantly reduced, and redundant data are completely eliminated. This leads to huge

savings in maintenance and greater efficiency in new development while also improving data

quality. Poorly designed ETL processes are costly to maintain, change, and update.

Consequently, it is crucial to make the proper choices in terms of the technology and tools to use
for developing and maintaining the ETL process. A number of packaged ETL tools are available.

Database vendors currently offer ETL capabilities that both enhance and compete with

independent ETL tools. SAS acknowledges the importance of data quality and offers the

industry’s first fully integrated solution that merges ETL and data quality to transform data into

strategic valuable assets. Other ETL software providers include Microsoft, Oracle, IBM,

Informatica, Embarcadero, and Tibco. For additional info
