
ETL - Extract, Transform and Load

What is a Data Warehouse?

A data warehouse is:

➢ Typically, a relational database that is designed for query and analysis rather than for transaction processing.
➢ A place where historical data is stored for archival, analysis and security purposes.
➢ Contains either raw data or formatted data.
➢ Combines data from multiple sources (e.g., sales data, salaries, operational data, human resource data, inventory data, web logs, social networks, Internet text and documents, and others).
➢ As the name implies, a data warehouse is a warehouse for data: a store for large volumes of aggregated data collected from a wide range of sources within an organization.
➢ A source can be flat files, database files or Excel files.

For example: Baskin-Robbins (famous for being the world's largest chain of ice cream specialty shops) has many shops in India as well as across the world. Say there is a Baskin-Robbins shop in our area and it has its own system for saving customer visit and product purchase history, with the data stored in an Excel file. Once a week, all of this area data is collected and stored in a centralized city data center, which is nothing but a data warehouse for all the small areas. In the same way, all of the city data is collected and stored at the state level. A large data store accumulated from a wide range of sources is known as a Data Warehouse.

Data Mart

➢ A data mart is a subset of a data warehouse that has the same characteristics but is usually
smaller and is focused on the data for one division or one workgroup within an enterprise.

What is Big Data?

➢ Big Data refers to data whose volume, velocity and variability are too great to handle on normal database architectures.
➢ Typically, Hadoop is the architecture used for big data lakes.
➢ Hadoop is an Apache open-source project that develops software for scalable, distributed
computing.

Difference between Database and Data warehouse

• A data warehouse is one kind of database, a large database formed by combining multiple databases.
• A database, compared with a data warehouse, stores data/information for a particular entity.
• A database is used for Online Transaction Processing (OLTP).
• A data warehouse is used for Online Analytical Processing (OLAP).
• OLTP is a decentralized system normally used in Internet websites, banks and airlines to avoid single points of failure and to spread the volume between multiple servers. This kind of system is good for controlling and running fundamental business tasks.
• OLAP is a centralized system that helps with planning, problem solving and decision support. Queries are often very complex and are used for a relatively low volume of transactions.
• Database tables are typically in a normalized structure.
• Data warehouse tables are typically in a de-normalized structure.
• Normalized database - designed in such a way that the same column data is not repeated; in simple words, there is no redundant data. Here you will find lots of joins using foreign keys and primary keys.
• De-normalized database - you get more repeated data, but database management becomes easy and there is no need for joins using foreign keys and primary keys.
• Normalized database - in terms of performance, the many joins tend to hurt query performance.
• De-normalized database - fewer joins (or sometimes no joins) improve query performance. A minimal SQL sketch of this difference follows.
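The following minimal SQL sketch illustrates the difference (the orders, customers and sales_fact_denorm tables are illustrative, not taken from any specific system):

-- Normalized (OLTP): customer attributes sit in their own table and are reached
-- through a primary-key/foreign-key join.
SELECT o.order_id, o.order_date, c.cust_name, c.city
FROM orders o
JOIN customers c ON c.cust_id = o.cust_id;

-- De-normalized (DW): the same attributes are repeated on every row of one wide
-- table, so the query needs no join.
SELECT order_id, order_date, cust_name, city
FROM sales_fact_denorm;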

What is ETL?

➢ ETL stands for Extract, Transform and Load, which is a process used to collect data from
various sources, transform the data depending on business rules/needs and load the data
into a destination database.
➢ It is the process in data warehousing of extracting data, transforming it, and loading it into the final target.
➢ The need to use ETL arises from the fact that in modern computing business data resides
in multiple locations and in many incompatible formats.
For example, business data might be stored on the file system in various formats (Word docs, PDFs, spreadsheets, plain text, etc.), or it can be stored as email files, or kept in various database servers such as MS SQL Server, Oracle and MySQL. Handling all this business information efficiently is a great challenge, and ETL plays an important role in solving this problem.

➢ ETL stands for Extract, Transform and Load.

➢ An ETL tool extracts the data from different RDBMS source systems, transforms the data (applying calculations, concatenations, etc.), and then loads the data into the Data Warehouse system.
➢ The data is loaded into the DW system in the form of dimension and fact tables (see the sketch after this list).

➢ ETL (Extract, Transform, Load) is an automated process which takes raw data, extracts the
information required for analysis, transforms it into a format that can serve business
needs, and loads it to a data warehouse.
➢ ETL typically summarizes data to reduce its size and improve performance for specific
types of analysis.
➢ When you build an ETL infrastructure, you must first integrate data from a variety of
sources. Then you must carefully plan and test to ensure you transform the data correctly.
➢ This process is complicated and time-consuming.
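As a minimal sketch of what "dimension and fact tables" look like in the target, the DDL below defines a hypothetical customer dimension and sales fact table (names and columns are illustrative; real warehouse designs vary):

CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key owned by the warehouse
    cust_id       VARCHAR(20),           -- natural/business key from the source system
    first_name    VARCHAR(100),
    last_name     VARCHAR(100),
    country       VARCHAR(50)
);

CREATE TABLE sales_fact (
    sale_id       INTEGER PRIMARY KEY,
    customer_key  INTEGER REFERENCES customer_dim (customer_key),  -- foreign key to the dimension
    sale_date     DATE,
    revenue       DECIMAL(12,2)
);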

When do we need ETL Testing?

➢ ETL is commonly associated with Data Warehousing projects but in reality, any form of
bulk data movement from a source to a target can be considered ETL.
➢ Large enterprises often have a need to move application data from one source to another
for data integration or data migration purposes.
➢ ETL testing is a data centric testing process to validate that the data has been transformed
and loaded into the target as expected.

Extract, Transform and Load:


The ETL process has 3 main steps, which are Extract, Transform and Load.
Extract:
➢ The first step in the ETL process is extracting the data from various sources. Each of the source systems may store its data in a completely different format from the rest. The sources are usually flat files or RDBMSs, but almost any data store can be used as a source for an ETL process.
➢ Three data extraction methods (sketched below):
▪ Full Extraction
▪ Partial Extraction - without update notification
▪ Partial Extraction - with update notification
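A minimal SQL sketch of the first two methods, assuming a hypothetical src_orders source table with a last_updated audit column:

-- Full extraction: pull everything on every run.
SELECT * FROM src_orders;

-- Partial (incremental) extraction without update notification: rely on a
-- last-modified timestamp to pick up only rows changed since the previous run.
SELECT * FROM src_orders
WHERE last_updated > TIMESTAMP '2021-01-01 00:00:00';  -- high-water mark from the last run

Partial extraction with update notification works the same way, except the source system itself tells the ETL process which rows changed (for example through change data capture), so no scan of the audit column is needed.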

Transform:
➢ Once the data has been extracted and converted into the expected format, it's time for the next step in the ETL process, which is transforming the data according to a set of business rules. The data transformation may include various operations including, but not limited to, filtering, sorting, aggregating, joining data, cleaning data, generating calculated data based on existing values, validating data, etc.
➢ It is one of the important ETL concepts where you apply a set of functions to the extracted data. Data that does not require any transformation is called direct-move or pass-through data.
➢ Transform data to DW (Data Warehouse) format
➢ Build keys – A key is one or more data attributes that uniquely identify an entity. The various types of keys are primary key, alternate key, foreign key, composite key and surrogate key. The data warehouse owns these keys and never allows any other entity to assign them.
➢ Cleansing of data: After the data is extracted, it moves into the next phase of cleaning and conforming the data. Cleaning fills omissions in the data as well as identifying and fixing errors. Conforming means resolving the conflicts between data that is incompatible, so that it can be used in an enterprise data warehouse. In addition, this step creates metadata that is used to diagnose source-system problems and improve data quality.

Load:
➢ The final ETL step involves loading the transformed data into the destination target, which might be a database or data warehouse.
➢ Load data into the DW (Data Warehouse).
➢ Build aggregates – Creating an aggregate means summarizing and storing data that is available in the fact table in order to improve the performance of end-user queries.
➢ Types of Loading (sketched below):
o Initial Load — populating all the Data Warehouse tables.
o Incremental Load — applying ongoing changes periodically as needed.
o Full Refresh — erasing the contents of one or more tables and reloading with fresh data.
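A minimal sketch of the last two loading types, assuming hypothetical staging and target tables (MERGE syntax varies slightly by database):

-- Full Refresh: erase the target table and reload it from staging.
TRUNCATE TABLE customer_dim;
INSERT INTO customer_dim SELECT * FROM stg_customer;

-- Incremental Load: apply only the ongoing changes captured in a delta table.
MERGE INTO customer_dim d
USING stg_customer_delta s
ON (d.cust_id = s.cust_id)
WHEN MATCHED THEN UPDATE SET d.city = s.city
WHEN NOT MATCHED THEN INSERT (cust_id, city) VALUES (s.cust_id, s.city);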

➢ Each step is performed sequentially. However, the exact nature of each step – which
format is required for the target database – depends on the enterprise’s specific needs
and requirements.
➢ Extraction can involve copying data to tables quickly to minimize the time spent querying
the source system.
➢ In the transformation step, the data is most usually stored in one set of staging tables as
part of the process.
➢ Finally, a secondary transformation step might place data in tables that are copies of the
warehouse tables, which eases loading.
➢ Each ETL stage requires interaction by data engineers and developers to deal with the
capacity limitations of traditional data warehouses.

Why perform an ETL?

To load a data warehouse or data mart regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis, or to move data from files, XML or other sources into a big data lake, data warehouse or data mart.

Steps involves in ETL Process:

The 5 steps of the ETL process are: extract, clean, transform, load, and analyze. Of the 5, extract,
transform, and load are the most important process steps.

• Extract: Retrieves raw data from an unstructured data pool and migrates it into a
temporary, staging data repository

• Clean: Cleans data extracted from an unstructured data pool, ensuring the quality of the
data prior to transformation.

• Transform: Structures and converts the data to match the target system

• Load: Loads the structured data into a data warehouse so it can be properly analyzed and
used

• Analyze: Big data analysis is processed within the warehouse, enabling the business to
gain insight from the correctly configured data.

What is ELT?

ELT stands for “extract, load, and transform” — the processes a data pipeline uses to replicate
data from a source system into a target system such as a cloud data warehouse.

❖ Extraction: This first step involves copying data from the source system.
❖ Loading: During the loading step, the pipeline replicates data from the source into the
target system, which might be a data warehouse or data lake.
❖ Transformation: Once the data is in the target system, organizations can run whatever
transformations they need. Often organizations will transform raw data in different ways
for use with different tools or business processes.
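A minimal sketch of this ELT pattern: the raw data has already been loaded into the warehouse (a hypothetical raw_events table here), and the transformation runs later inside the target system itself (CREATE TABLE ... AS SELECT syntax; some warehouses use slightly different DDL):

CREATE TABLE daily_sales_summary AS
SELECT CAST(event_time AS DATE) AS sale_date,
       country,
       SUM(amount)              AS total_revenue
FROM raw_events
WHERE event_type = 'purchase'
GROUP BY CAST(event_time AS DATE), country;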

Benefits of ELT

The explosion in the types and volume of data that businesses must process can put a strain on
traditional data warehouses. Using an ETL process to manage millions of records in these new
formats can be time-consuming and costly. ELT offers a number of advantages, including:

• Simplifying management — ELT separates the loading and transformation tasks, minimizing the interdependencies between these processes, lowering risk, and streamlining project management.
• Future-proofed data sets — ELT implementations can be used directly for data
warehousing systems, but oftentimes ELT is used in the data lake approach in which data
is collected from a range of sources. This, combined with the separation of the
transformation process, makes it easier to make future changes to the warehouse
structure.
• Leveraging the latest technologies — ELT solutions harness the power of new
technologies in order to push improvements, security, and compliance across the
enterprise. ELT also leverages the native capabilities of modern cloud data warehouses
and big data processing frameworks.
• Lowering costs — Like most cloud services, cloud-based ELT can result in lower total cost
of ownership, because an upfront investment in hardware is often unnecessary.
• Flexibility — The ELT process is adaptable and flexible, so it’s suitable for a variety of
businesses, applications, and goals.
• Scalability — The scalability of a cloud infrastructure and hosted services like integration
platform-as-a-service (iPaaS) and software-as-a-service (SaaS) give organizations the
ability to expand resources on the fly. They add the compute time and storage space
necessary for even massive data transformation tasks.

Although it is still evolving, ELT offers the promise of unlimited access to data, less development
time, and significant cost savings. In these and other ways, the cloud is redefining data
integration.

ETL vs ELT:

As stated above, ETL = Extract, Transform, Load. ELT, on the other hand = Extract, Load,
Transform.

• According to IBM, “the most obvious difference between ETL and ELT is the difference in
order of operations.
• ELT copies or exports the data from the source locations, but instead of loading it to a
staging area for transformation, it loads the raw data directly to the target data store to
be transformed as needed.

While both processes leverage a variety of data repositories, such as databases, data
warehouses, and data lakes, each process has its advantages and disadvantages.
➢ ELT is particularly useful for high-volume, unstructured datasets as loading can occur
directly from the source.
➢ ELT can be more ideal for big data management since it doesn’t need much upfront
planning for data extraction and storage.
➢ The ETL process, on the other hand, requires more definition at the onset. Specific data
points need to be identified for extraction along with any potential “keys” to integrate
across disparate source systems.
➢ Even after that work is completed, the business rules for data transformations need to be
constructed.
➢ While ELT has become increasingly more popular with the adoption of cloud databases,
it has its own disadvantages for being the newer process, meaning that best practices are
still being established.”

The ETL and ELT approaches to data integration differ in several key ways.

• Load time — It takes significantly longer to get data from source systems to the target
system with ETL.
• Transformation time — ELT performs data transformation on-demand, using the target
system’s computing power, reducing wait times for transformation.
• Complexity — ETL tools typically have an easy-to-use GUI that simplifies the process. ELT
requires in-depth knowledge of BI tools, masses of raw data, and a database that can
transform it effectively.

• Data warehouse support — ETL is a better fit for legacy on-premise data warehouses and
structured data. ELT is designed for the scalability of the cloud.
• Maintenance — ETL requires significant maintenance for updating data in the data
warehouse. With ELT, data is always available in near real-time.

Both ETL and ELT processes have their place in today’s competitive landscape, and understanding
a business’ unique needs and strategies is key to determining which process will deliver the best
results.

The five critical differences of ETL vs ELT:

1. ETL is the Extract, Transform, and Load process for data. ELT is Extract, Load, and
Transform process for data.
2. In ETL, data moves from the data source, to staging, and then into the data warehouse.
3. ELT leverages the data warehouse to do basic transformations. There is no need for data
staging.
4. ETL can help with data privacy and compliance by cleaning sensitive and secure data even
before loading into the data warehouse.
5. ETL can perform sophisticated data transformations and can be more cost-effective than
ELT.

ETL vs. ELT Comparison

Adoption of the technology and availability of tools and experts
ETL: ETL is a well-developed process used for over 20 years, and ETL experts are readily available.
ELT: ELT is a new technology, so it can be difficult to locate experts and more challenging to develop an ELT pipeline compared to an ETL pipeline.

Availability of data in the system
ETL: ETL only transforms and loads the data that you decide is necessary when creating the data warehouse and ETL process. Therefore, only this information will be available.
ELT: ELT can load all data immediately, and users can determine later which data to transform and analyze.

Can you add calculations?
ETL: Calculations will either replace existing columns, or you can append the dataset to push the calculation result to the target data system.
ELT: ELT adds calculated columns directly to the existing dataset.

Compatible with data lakes?
ETL: ETL is not normally a solution for data lakes. It transforms data for integration with a structured relational data warehouse system.
ELT: ELT offers a pipeline for data lakes to ingest unstructured data. Then it transforms the data on an as-needed basis for analysis.

Compliance
ETL: ETL can redact and remove sensitive information before putting it into the data warehouse or cloud server. This makes it easier to satisfy GDPR, HIPAA, and CCPA compliance standards. It also protects data from hacks and inadvertent exposure.
ELT: ELT requires you to upload the data before redacting/removing sensitive information. This could violate GDPR, HIPAA, and CCPA standards. Sensitive information will be more vulnerable to hacks and inadvertent exposure. You could also violate some compliance standards if the cloud server is in another country.

Data size vs. complexity of transformations
ETL: ETL is best suited for dealing with smaller data sets that require complex transformations.
ELT: ELT is best when dealing with massive amounts of structured and unstructured data.

Data warehousing support
ETL: ETL works with cloud-based and onsite data warehouses. It requires a relational or structured data format.
ELT: ELT works with cloud-based data warehousing solutions to support structured, unstructured, semi-structured, and raw data types.

Hardware requirements
ETL: Cloud-based ETL platforms (like Xplenty) don't require special hardware. Legacy, onsite ETL processes have extensive and costly hardware requirements, but they are not as popular today.
ELT: ELT processes are cloud-based and don't require special hardware.

How are aggregations different?
ETL: Aggregation becomes more complicated as the dataset increases in size.
ELT: As long as you have a powerful, cloud-based target data system, you can quickly process massive amounts of data.

Implementation complexity
ETL: ETL experts are easy to procure when building an ETL pipeline. Highly evolved ETL tools are also available to facilitate this process.
ELT: As a new technology, the tools to implement an ELT solution are still evolving. Moreover, experts with the requisite ELT knowledge and skills can be difficult to find.

Maintenance requirement
ETL: Automated, cloud-based ETL solutions, like Xplenty, require little maintenance. However, an onsite ETL solution that uses a physical server will require frequent maintenance.
ELT: ELT is cloud-based and generally incorporates automated solutions, so very little maintenance is required.

Order of the extract, transform, load process
ETL: Data transformations happen immediately after extraction within a staging area. After transformation, the data is loaded into the data warehouse.
ELT: Data is extracted, then loaded into the target data system first. Only later is some of the data transformed on an "as-needed" basis for analytical purposes.

Costs
ETL: Cloud-based SaaS ETL platforms that bill with a pay-per-session pricing model (such as Xplenty) offer flexible plans that start at approximately $100 and go up from there, depending on usage requirements. Meanwhile, an enterprise-level onsite ETL solution like Informatica could cost over $1 million a year!
ELT: Cloud-based SaaS ELT platforms that bill with a pay-per-session pricing model offer flexible plans that start at approximately $100 and go up from there. One cost advantage of ELT is that you can load and save your data without incurring large fees, then apply transformations as needed. This can save money on initial costs if you just want to load and save information. However, financially strapped businesses may never be able to afford the processing power required to reap the full benefits of their data lake.

Transformation process
ETL: Transformations happen within a staging area outside the data warehouse.
ELT: Transformations happen inside the data system itself, and no staging area is required.

Unstructured data support
ETL: ETL can be used to structure unstructured data, but it can't be used to pass unstructured data into the target system.
ELT: ELT is a solution for uploading unstructured data into a data lake and making unstructured data available to business intelligence systems.

Waiting time to load information
ETL: ETL load times are longer than ELT because it's a multi-stage process: (1) data loads into the staging area, (2) transformations take place, (3) data loads into the data warehouse. Once the data is loaded, analysis of the information is faster than with ELT.
ELT: Data loading happens faster because there's no waiting for transformations and the data only loads one time into the target data system. However, analysis of the information is slower than with ETL.

Waiting time to perform transformations
ETL: Data transformations take more time initially because every piece of data requires transformation before loading. Also, as the size of the data system increases, transformations take longer. However, once transformed and in the system, analysis happens quickly and efficiently.
ELT: Since transformations happen after loading, on an as-needed basis, and you transform only the data you need to analyze at the time, transformations happen a lot faster. However, the need to continually transform data slows down the total time it takes for querying/analysis.

Why is ETL Data Integration Important?

Have you ever heard the phrase, “garbage in, garbage out?” That phrase is more appropriate
than ever in today’s digital data landscape because it emphasizes how the quality of data is
directly related to accurate insights and better decision-making. And thus, we have ETL – extract,
transform, load – to help ensure good data hygiene and added business value on the output.

ETL database tools perform several critical business functions:

• Reconcile varying data formats to move data from a legacy system into modern
technology, which often cannot support the legacy format
• Sync data from external ecosystem partners, like suppliers, vendors, and customers
• Consolidate data from various overlapping systems acquired via merger and/or
acquisition
• Combine transactional data from a data store so it can be read and understood by
business users

Who is involved in the ETL process?


There are at least 4 roles involved. They are:
➢ Data Analyst: Creates data requirements (source-to-target map or mapping doc)
➢ Data Architect: Models and builds data store (Big Data Lake, Data Warehouse, Data Mart,
etc.)
➢ ETL Developer: Transforms and loads data from sources to target data stores
➢ ETL Tester: Validates the data, based on mappings, as it moves and transforms from
sources to targets

The image shows the intertwined roles, tasks and timelines for performing ETL Testing with the
sampling method.

Responsibilities of an ETL Tester

❖ Writing ETL test cases—crafting SQL queries that can simulate important parts of the ETL
process.
❖ Verifying source system tables—displaying expertise in data sources, their meaning, and
how they contribute to the final data in the data warehouse.
❖ Applying transformation logic—ETL testers are qualified ETL engineers who can run and
design transformation scripts.
❖ Loading data—loading data into staging environments to enable testing.
❖ ETL tool testing—checking and troubleshooting the underlying ETL system.
❖ Testing end user reports and dashboards—ensuring that business analysts and data
scientists have a complete and accurate view of the data they need.

SQL and ETL:


➢ When companies engage in fast data manipulation, for example, SQL is often the primary
language that allows database servers to communicate, edit, and store data.

➢ When you load customer data into a data warehouse, for example, ETL can help you
engage in batch processing to derive the most value for your data warehouse.

➢ SQL stands for "Structured Query Language"; it is used to access information from a variety of relational databases. By writing SQL commands, you can perform tasks such as updates, data retrieval, deletion, and more.

A very basic example SQL query would be something like this:

SELECT email, created_date
FROM customers
WHERE country = 'India'
ORDER BY created_date LIMIT 100

➢ In this case, the SQL query asks the database for the email address and creation date of the first 100 customer records created in India, pulling that information from the "customers" table.

ETL can be described as the process of moving data from homogeneous or heterogeneous
sources to its target destination (most often a data warehouse). ETL, for example, encompasses
three steps that make it easy to extract raw data from multiple sources, process it (or transform
it), and load it into a data warehouse for analysis.

➢ Extract: to retrieve raw data from unstructured data pools, and migrate it into a
temporary, staging data repository;
➢ Transform: to structure, enrich, and convert raw data to match the target destination;
➢ Load: to load structured data into a data warehouse for analysis or to be leveraged by
Business Intelligence (BI) tools.

❖ For decades, data engineers have built or bought ETL pipelines to integrate diverse data
types into data warehouses for analysis, seamlessly.
❖ The goal here is simple – leveraging ETL makes the process of data analysis much easier
(especially when using online analytical processing or OLAP data warehouses).
❖ ETL is also essential when you want to transform online transactional processing
databases to work with an OLAP data warehouse quickly.

In recent years, it has become increasingly easy to perform a number of processes involved in
ETL, including exception handling, by taking advantage of user-friendly ETL tools that come with
robust graphical user interfaces.

The noticeable difference here is that SQL is a query language, while ETL is an approach to extract,
process, and load data from multiple sources into a centralized target destination.

Some businesses that have the capital to bear large data warehouse costs opt to use several databases (customer details, sales records, recent transactions, etc.). In such cases, you'll need a tool to access, adjust, or add to the database as necessary. That's where SQL comes in.

When working in a data warehouse with SQL, you can:

➢ Create new tables, views, and stored procedures within the data warehouse
➢ Execute queries to ask and receive answers to structured questions
➢ Insert, update or delete records as needed
➢ Retrieve data and visualize the contents of the data warehouse
➢ Implement strict permissions to access tables, views, and stored procedures

❖ The modern data stack comes with a variety of tools, including ETL tools, and they use
SQL to read, write, and query warehouse data.
❖ SQL syntax can also be used to frame questions answered using a data warehouse.
❖ However, for the manipulation part of the process, for the most part, data needs to be
collected and interpreted before it’s moved to the target destination. That's where ETL
comes in.
❖ ETL uses batch or stream processing protocols to pull raw data, modify it according to
reporting requirements, and load the transformed data into a centralized data
warehouse.
❖ However, data scientists can query and analyze the information directly without having
to build complex ETL processes.
❖ So, we can say that ETL is used to get the data ready and acts as a data integration layer.
SQL is used when data needs to be manipulated and tweaked.
❖ ETL also helps companies implement data governance best practices and achieve data
democracy.

Example:
Let us assume there is a manufacturing company having multiple departments such as sales, HR,
Material Management, EWM, etc. All these departments have separate databases which they
use to maintain information w.r.t. their work and each database has a different technology,
landscape, table names, columns, etc. Now, if the company wants to analyze historical data and generate reports, all the data from these data sources should be extracted and loaded into a Data
Warehouse to save it for analytical work.

An ETL tool extracts the data from all these heterogeneous data sources, transforms the data
(like applying calculations, joining fields, keys, removing incorrect data fields, etc.), and loads it
into a Data Warehouse.
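A minimal SQL sketch of the kind of transformation described above (the departmental source tables and columns are hypothetical; an ETL tool would generate equivalent logic rather than hand-written SQL):

SELECT s.order_id,
       s.amount,
       e.emp_name                    AS sales_rep,        -- joining fields across source systems
       UPPER(TRIM(s.customer_name))  AS customer_name     -- a simple cleansing calculation
FROM sales_db.orders s
JOIN hr_db.employees e ON e.emp_id = s.sales_rep_id
WHERE s.amount IS NOT NULL;                               -- removing incorrect data fields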

What is ETL Testing?

➢ ETL testing is done to ensure that the data that has been loaded from a source to the
destination after business transformation is accurate.
➢ It also involves the verification of data at various middle stages that are being used
between source and destination.
➢ ETL stands for Extract-Transform-Load. ETL testing is a crucial part of ETL, because ETL is
typically performed on mission critical data.

There are several types of ETL testing:

❖ testing the accuracy of the data;


❖ its completeness (whether any parts are missing);
❖ validating that the data hasn’t changed in transition and complies with business rules;
❖ testing metadata to ensure it hasn’t changed in transit;
❖ testing syntax of formally-defined data types;
❖ reference testing against business dictionaries and master data;
❖ and interface and performance testing for the ETL system.

ETL Testing Process

❖ Similar to other testing processes, ETL testing also goes through different phases. The different phases of the ETL testing process are as follows.

ETL testing is performed in five stages:

1. Identifying data sources and requirements
2. Data acquisition
3. Implement business logic and dimensional modelling
4. Build and populate data
5. Build reports

Types of ETL Testing:

Production Validation Testing: Also called "table balancing" or "production reconciliation", this type of ETL testing is done on data as it is being moved into production systems. To support your business decisions, the data in your production systems has to be in the correct order. Informatica Data Validation Option provides the ETL testing automation and management capabilities to ensure that production systems are not compromised by the data.

Source to Target Testing (Validation Testing): This type of testing is carried out to validate whether the data values transformed are the expected data values.

Application Upgrades: This type of ETL testing can be automatically generated, saving substantial test development time. It checks whether the data extracted from an older application or repository is exactly the same as the data in the new application or repository.

Metadata Testing: Metadata testing includes data type checks, data length checks and index/constraint checks.

Data Completeness Testing: Done to verify that all the expected data is loaded into the target from the source. Some of the tests that can be run are: compare and validate counts, aggregates and actual data between the source and target for columns with simple transformation or no transformation.

Data Accuracy Testing: Done to ensure that the data is accurately loaded and transformed as expected.

Data Transformation Testing: Testing data transformation is done because in many cases it cannot be achieved by writing one source SQL query and comparing the output with the target. Multiple SQL queries may need to be run for each row to verify the transformation rules.

Data Quality Testing: Data quality tests include syntax and reference tests, and are done in order to avoid errors due to, for example, dates or order numbers during business processes.
Syntax Tests: report dirty data based on invalid characters, character patterns, incorrect upper or lower case, etc.
Reference Tests: check the data against the data model. For example: Customer ID.
Data quality testing also includes number checks, date checks, precision checks, data checks, null checks, etc.

Incremental ETL Testing: Done to check the data integrity of old and new data with the addition of new data. Incremental testing verifies that inserts and updates are processed as expected during the incremental ETL process.

GUI/Navigation Testing: Done to check the navigation or GUI aspects of the front-end reports.

How to Create ETL Test Case:

❖ ETL testing is a concept which can be applied to different tools and databases in the information management industry.
❖ The objective of ETL testing is to ensure that the data that has been loaded from a source to a destination after business transformation is accurate.
❖ It also involves the verification of data at various middle stages that are being used
between source and destination.

While performing ETL testing, two documents will always be used by an ETL tester:

1. ETL mapping sheets: An ETL mapping sheet contains all the information about the source and destination tables, including each and every column and their look-ups in reference tables. ETL testers need to be comfortable with SQL queries, as ETL testing may involve writing big queries with multiple joins to validate data at any stage of ETL. ETL mapping sheets provide significant help while writing queries for data verification (a minimal query sketch follows this list).
2. DB schema of source and target: It should be kept handy to verify any detail in the mapping sheets.
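As a minimal sketch of the kind of query a mapping sheet drives (table and column names are taken from the Customer/Customer_dim examples later in this document; the join condition and transformation rule would come from the mapping doc):

SELECT s.cust_id
FROM Customer s
JOIN Customer_dim t ON t.integration_id = s.cust_id
WHERE t.full_name <> s.fst_name || ',' || s.lst_name;   -- rows where the target value disagrees with the mapped rule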

ETL Test Scenarios and Test Cases:

Mapping doc validation: Verify whether the corresponding ETL information is provided in the mapping doc. A change log should be maintained in every mapping doc.

Validation:
1. Validate the source and target table structure against the corresponding mapping doc.
2. Source data type and target data type should be the same.
3. Length of data types in both source and target should be equal.
4. Verify that data field types and formats are specified.
5. Source data type length should not be less than the target data type length.
6. Validate the names of columns in the table against the mapping doc.

Constraint Validation: Ensure the constraints are defined for the specific table as expected.

Data consistency issues:
1. The data type and length for a particular attribute may vary in files or tables even though the semantic definition is the same.
2. Misuse of integrity constraints.

Completeness Issues:
1. Ensure that all expected data is loaded into the target table.
2. Compare record counts between source and target.
3. Check for any rejected records.
4. Check that data is not truncated in the columns of the target tables.
5. Check boundary value analysis.
6. Compare unique values of key fields between the data loaded to the warehouse and the source data.

Correctness Issues:
1. Data that is misspelled or inaccurately recorded.
2. Null, non-unique or out-of-range data.

Transformation: Validate the transformation rules applied between source and target.

Data Quality:
1. Number check: validate the numbers.
2. Date check: dates have to follow the date format, and it should be the same across all records.
3. Precision check.
4. Data check.
5. Null check.

Null Validate: Verify the null values where "Not Null" is specified for a specific column.

Duplicate Check:
1. Validate that the unique key, primary key and any other column that should be unique as per the business requirements do not have duplicate rows.
2. Check whether any duplicate values exist in any column that is extracted from multiple columns in the source and combined into one column.
3. As per the client requirements, ensure that there are no duplicates in a combination of multiple columns within the target.

Date Validation: Date values are used in many areas of ETL development:
1. To know the row creation date.
2. To identify active records from the ETL development perspective.
3. To identify active records from the business requirements perspective.
4. Sometimes updates and inserts are generated based on the date values.

Complete Data Validation (see the sketch after this table):
1. To validate the complete data set in the source and target tables, a minus query is the best solution.
2. We need to run source minus target and target minus source.
3. If the minus query returns any rows, those should be considered mismatching rows.
4. Match rows between source and target using an intersect statement.
5. The count returned by intersect should match the individual counts of the source and target tables.
6. If the minus query returns rows and the intersect count is less than the source count or the target table count, then we can consider that duplicate rows exist.

Data Cleanness: Unnecessary columns should be deleted before loading into the staging area.
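A minimal sketch of the minus/intersect checks above, reusing the Customer and Customer_dim examples from later sections (MINUS is Oracle syntax; other databases use EXCEPT, and the selected columns must be comparable between source and target):

-- Source minus target: rows present in the source but missing or different in the target.
SELECT cust_id, fst_name, lst_name FROM Customer
MINUS
SELECT integration_id, first_name, last_name FROM Customer_dim;

-- Target minus source: rows present in the target but not in the source.
SELECT integration_id, first_name, last_name FROM Customer_dim
MINUS
SELECT cust_id, fst_name, lst_name FROM Customer;

-- Intersect count: should match the individual source and target counts.
SELECT count(*) FROM (
    SELECT cust_id, fst_name, lst_name FROM Customer
    INTERSECT
    SELECT integration_id, first_name, last_name FROM Customer_dim
);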

Types of ETL Bugs

User interface bugs / cosmetic bugs: Related to the GUI of the application (font style, font size, colours, alignment, spelling mistakes, navigation and so on).

Boundary Value Analysis (BVA) related bugs: Minimum and maximum values.

Equivalence Class Partitioning (ECP) related bugs: Valid and invalid types.

Input/Output bugs: Valid values not accepted; invalid values accepted.

Calculation bugs: Mathematical errors; the final output is wrong.

Load Condition bugs: Does not allow multiple users; does not allow the customer-expected load.

Race Condition bugs: System crash and hang; system cannot run on client platforms.

Version control bugs: No logo matching; no version information available; usually found in regression testing.

H/W bugs: Device is not responding to the application.

Help Source bugs: Mistakes in help documents.

Difference between Database Testing and ETL Testing

ETL Testing:
• Verifies whether data is moved as expected.
• Verifies whether counts in the source and target are matching.
• Verifies whether the data transformed is as per expectation.
• Verifies that the foreign-primary key relations are preserved during the ETL.
• Verifies for duplication in loaded data.

Database Testing:
• The primary goal is to check whether the data is following the rules/standards defined in the Data Model.
• Verifies that there are no orphan records and that foreign-primary key relations are maintained.
• Verifies that there are no redundant tables and the database is optimally normalized.
• Verifies whether data is missing in columns where required.

Challenges in ETL Testing

ETL Testing is different from application testing because it requires a data-centric testing approach. Some of the challenges in ETL Testing are:

• ETL Testing involves comparing large volumes of data, typically millions of records.

• The data that needs to be tested is in heterogeneous data sources (e.g., databases, flat files).
• Data is often transformed which might require complex SQL queries for comparing the
data.
• ETL testing is very much dependent on the availability of test data with different test
scenarios.

ETL Test scenarios in detail:

Metadata Testing:

The purpose of Metadata Testing is to verify that the table definitions conform to the data model
and application design specifications.

❖ Data Type Check


➢ Verify that the table and column data type definitions are as per the data model
design specifications.

Example: Data Model column data type is NUMBER but the database column data type is
STRING (or VARCHAR).

❖ Data Length Check


➢ Verify that the length of database columns is as per the data model design
specifications.

Example: Data Model specification for the ‘first_name’ column is of length 100 but the
corresponding database table column is only 80 characters long.

❖ Index / Constraint Check


➢ Verify that proper constraints and indexes are defined on the database tables as per
the design specifications.

1. Verify that the columns that cannot be null have the ‘NOT NULL’ constraint.
2. Verify that the unique key and foreign key columns are indexed as per the
requirement.
3. Verify that the table was named according to the table naming convention.

Example 1: A column was defined as ‘NOT NULL’ but it can be optional as per the design.
Example 2: Foreign key constraints were not defined on the database table resulting in
orphan records in the child table.
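A minimal sketch of how such checks can be automated with SQL, assuming a database that exposes the standard information_schema views (PostgreSQL, MySQL, SQL Server); the schema, table and column names are illustrative:

-- Columns the data model marks as mandatory but that are nullable in the database.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_name = 'customer_dim'
  AND column_name IN ('customer_key', 'cust_id')   -- columns expected to be NOT NULL
  AND is_nullable = 'YES';

-- Tables in the warehouse schema that have no primary key constraint defined.
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'dw'
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL;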

❖ Metadata Naming Standards Check
➢ Verify that the names of the database metadata such as tables, columns, indexes are
as per the naming standards.
➢ Example: The naming standard for fact tables is to end with '_F', but some of the fact table names end with '_FACT'.
❖ Metadata Check Across Environments
➢ Compare table and column metadata across environments to ensure that changes
have been migrated appropriately.
➢ Example: A new column added to the SALES fact table was not migrated from the
Development to the Test environment resulting in ETL failures.
❖ Automate metadata testing with ETL Validator
➢ ETL Validator comes with Metadata Compare Wizard for automatically capturing and
comparing Table Metadata.

1. Track changes to Table metadata over a period of time. This helps ensure that the QA
and development teams are aware of the changes to table metadata in both Source
and Target systems.
2. Compare table metadata across environments to ensure that metadata changes have
been migrated properly to the test and production environments.
3. Compare column data types between source and target environments.
4. Validate Reference data between spreadsheet and database or across environments.

Data Accuracy:

➢ In ETL testing, data accuracy testing is used to ensure that data is accurately loaded into the target system as per expectations. The key steps in performing data accuracy testing are as follows.

Value Comparison

❖ Value comparison involves comparing the data in source and target system with minimum
or no transformation.
❖ It can be done using various ETL Testing tools, for example, Source Qualifier
Transformation in Informatica.
❖ Some expression transformations can also be performed in data accuracy testing.
❖ Various set operators can be used in SQL statements to check data accuracy in the source
and the target systems. Common operators are Minus and Intersect operators.
❖ The results of these operators can be considered as deviations in values between the target and the source systems.

Check Critical Data Columns

❖ Critical data columns can be checked by comparing distinct values in the source and the
target systems.

Here is a sample query that can be used to check critical data columns −

SELECT cust_name, Order_Id, city, count(*) FROM customer
GROUP BY cust_name, Order_Id, city;

Data Quality Testing:

❖ The purpose of Data Quality tests is to verify the accuracy of the data.
❖ Data profiling is used to identify data quality issues, and the ETL is designed to fix or handle these issues.
❖ However, source data keeps changing, and new data quality issues may be discovered even after the ETL is in production.
❖ Automating the data quality checks in the source and target system is an important aspect
of ETL execution and testing.

Duplicate Data Checks

❖ Look for duplicate rows with same unique key column or a unique combination of
columns as per business requirement.

Example: The business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique.

Sample query to identify duplicates


SELECT fst_name, lst_name, mid_name, date_of_birth, count(1) FROM Customer
GROUP BY fst_name, lst_name, mid_name, date_of_birth HAVING count(1)>1

Data Validation Rules

❖ Many database fields can contain a range of values that cannot be enumerated. However,
there are reasonable constraints or rules that can be applied to detect situations where
the data is clearly wrong.
❖ Instances of fields containing values violating the validation rules defined represent a
quality gap that can impact ETL processing.

Example: Date of birth (DOB). This is defined as the DATE datatype and can assume any valid date. However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also, a child's date of birth should not be earlier than that of their parents.
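A minimal sketch of such a validation rule as a test query (column names follow the Customer examples used elsewhere in this document; exact date arithmetic syntax varies by database):

SELECT cust_id, date_of_birth
FROM Customer
WHERE date_of_birth > CURRENT_DATE                          -- date of birth in the future
   OR date_of_birth < CURRENT_DATE - INTERVAL '100' YEAR;   -- more than 100 years in the past

Any rows returned violate the rule and represent a data quality gap to investigate before or during ETL processing.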

Data Integrity Checks

❖ This measurement addresses “keyed” relationships of entities within a domain.

❖ The goal of these checks is to identify orphan records in the child entity with a foreign key
to the parent entity.

1. Count of records with null foreign key values in the child table.
2. Count of invalid foreign key values in the child table that do not have a corresponding
primary key in the parent table.

Example: In a data warehouse scenario, fact tables have foreign keys to the dimension tables. If
an ETL process does a full refresh of the dimension tables while the fact table is not refreshed,
the surrogate foreign keys in the fact table are not valid anymore. “Late arriving dimensions” is
another scenario where a foreign key relationship mismatch might occur because the fact record
gets loaded ahead of the dimension record.

1. Count of null or unspecified dimension keys in a Fact table:

SELECT count(cust_id) FROM sales where cust_id is null

2. Count of invalid foreign key values in the Fact table:

SELECT cust_id FROM sales minus SELECT s.cust_id FROM sales s, customers c where
s.cust_id=c.cust_id

Automate data quality testing using ETL Validator:

ETL Validator comes with Data Rules Test Plan and Foreign Key Test Plan for automating the
data quality testing.

1. Data Rules Test Plan: Define data rules and execute them on a periodic basis to check for
data that violates them.
2. Foreign Key Test Plan: Define data joins and identify data integrity issues without writing
any SQL queries.

Data Completeness Testing:

➢ The purpose of Data Completeness tests is to verify that all the expected data is loaded
in target from the source.
➢ Some of the tests that can be run are: Compare and Validate counts, aggregates (min,
max, sum, avg) and actual data between the source and target.

Record Count Validation

❖ Compare count of records of the primary source table and target table. Check for any
rejected records.

Example: A simple count of records comparison between the source and target tables.

Source Query

SELECT count(1) src_count FROM customer

Target Query

SELECT count(1) tgt_count FROM customer_dim

Column Data Profile Validation

❖ Column or attribute level data profiling is an effective tool to compare source and target
data without actually comparing the entire data.
❖ It is similar to comparing the checksum of your source and target data. These tests are
essential when testing large amounts of data.

Some of the common data profile comparisons that can be done between the source and target
are:

1. Compare unique values in a column between the source and target.
2. Compare max, min, avg, max length and min length values for columns, depending on the data type.
3. Compare null values in a column between the source and target.
4. For important columns, compare the data distribution (frequency) in a column between the source and target.

Example 1: Compare column counts with values (non-null values) between source and target
for each column based on the mapping.

Source Query
SELECT count(row_id), count(fst_name), count(lst_name), avg(revenue) FROM customer

Target Query
SELECT count(row_id), count(first_name), count(last_name), avg(revenue) FROM customer_dim

Example 2: Compare the number of customers by country between the source and target.

Source Query

SELECT country, count(*) FROM customer GROUP BY country

Target Query
SELECT country_cd, count(*) FROM customer_dim GROUP BY country_cd

Compare Entire Source and Target Data

❖ Compare data (values) between the flat file and the target data, effectively validating 100% of the data.
❖ In regulated industries such as finance and pharmaceuticals, 100% data validation might be a compliance requirement.
❖ It is also a key requirement for data migration projects. However, performing 100% data validation is a challenge when large volumes of data are involved.
❖ This is where ETL testing tools such as ETL Validator can be used, because they have an inbuilt ELV engine (Extract, Load, Validate) capable of comparing large volumes of data.

Example: Write a source query that matches the data in the target table after transformation.

Source Query
SELECT cust_id, fst_name, lst_name, fst_name||','||lst_name, DOB FROM Customer

Target Query
SELECT integration_id, first_name, Last_name, full_name, date_of_birth FROM Customer_dim

Automate Data Completeness Testing using ETL Validator

ETL Validator comes with Data Profile Test Case, Component Test Case and Query Compare Test
Case for automating the comparison of source and target data.

1. Data Profile Test Case: Automatically computes profile of the source and target query
results – count, count distinct, nulls, avg, max, min, max length and min length.
2. Component Test Case: Provides a visual test case builder that can be used to compare
multiple sources and target.
3. Query Compare Test Case: Simplifies the comparison of results from source and target
queries.

Sample Interview Questions:

1. What is the difference between Manual Testing and ETL Testing?
2. Explain the need for ETL Testing.
3. Explain how ETL is used in third-party data management.
4. Explain how ETL is used in data warehousing.
5. What are the different characteristics of a Data Warehouse?
6. What are the different types of Data Warehouse systems?
7. What is ETL?