ETL Testing Process

ETL testing verifies that an ETL process accurately extracts, transforms, and loads data according to specifications by validating and comparing input and output data. ETL testing is important for data-centric projects with large amounts of data or many data pipelines to identify issues early and ensure accurate data in production.


What is ETL Testing?

ETL testing verifies that an ETL process accurately extracts, transforms, and
loads data according to the specifications. ETL testing is done by validating and/or
comparing the input and output data transformed by the ETL process.

ETL testing is used in data-centric projects that have a huge amount of data or a substantial number of data pipelines. It should not be confused with application testing, which usually involves a small amount of transactional data.
ETL Testing Full Form
ETL in testing stands for extract, transform, and load: a process that reads data from multiple source systems, transports it to a data transformation layer for further processing (cleaning, consolidating, and integrating), and then loads it into a target database or file. The method used to test ETL is described at the top of this article.

Why is ETL Testing Required?

Any time an application is developed, it must be tested, and an ETL process is ultimately a piece of software written by a developer. An ETL process is at the heart of any data-centric system or project, and mistakes in the ETL process directly impact the data and the downstream applications.
1. Without ETL testing there is no way of knowing if the
process is built to the specifications and as per
requirements.
2. Without ETL testing the code cannot be released or
deployed in production.
3. ETL testing enables root cause analysis to identify data
issues due to the source data or the ETL process.
4. It is very expensive and difficult to fix data issues in
production. ETL testing ensures that the data issues are
caught early in the development lifecycle.

ETL Testing Basics


ETL processes read data from a source, transform the data, and then load it into the destination databases. An organization can easily have thousands of such ETL jobs processing its financial, customer, or operations data.
“Just like any piece of code that is developed,
the ETL code must be tested to ensure that
the developers have written the code correctly.”
ETL testing is unique because:
– ETL processes are background processes and don't have user screens.
– ETL testing involves a large amount of data.
– ETL processes are like functions: testing requires executing the ETL process and then comparing the input and output data.
– Defects in ETL processes cannot be detected by simply reviewing the ETL code.

How to do ETL Testing?


ETL processes are evaluated indirectly through a black-box testing approach: the ETL process is first executed to create the output data, and then the quality of the ETL process is determined by verifying that output data.
The ETL testing process can be summarized in the following three steps:

1. First, the ETL code is executed to generate the output data.
2. Then the output data is compared with the predetermined expected data.
3. Based on the comparison results, the quality of the ETL process is determined.
For ETL testing you can follow the legacy approach, which is outdated, or the newer rules-based ETL testing pioneered by iceDQ.
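The three steps above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical in-memory data and a stand-in run_etl function, not a real ETL job:

```python
# Minimal black-box ETL test: run the process, then compare the actual
# output rows against predetermined expected rows (hypothetical data).

def run_etl(source_rows):
    # Stand-in for the real ETL job: uppercase names, drop inactive rows.
    return [(cid, name.upper()) for cid, name, active in source_rows if active]

source = [(1, "alice", True), (2, "bob", False), (3, "carol", True)]
expected = [(1, "ALICE"), (3, "CAROL")]

actual = run_etl(source)            # step 1: execute the ETL code
diff = set(expected) ^ set(actual)  # step 2: compare with the expected data
etl_passed = not diff               # step 3: judge quality from the comparison
print(etl_passed)
```

Any row present on one side but not the other lands in `diff`, which is what a real comparison report would list.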

Difference between Manual Testing and ETL Testing

Manual ETL testing has the following limitations:
– It is resource-intensive and hence very costly.
– Testing is based on a few sampled records.
– It is ad hoc and hence not repeatable.
– The tester must do all the testing on their desktop.
– Test coverage is incomplete.
– Regression testing is not possible.
The diagram below shows legacy manual ETL testing vs. the rules-based approach.
1. Manual Testing
2. Manual- Pseudo code-based ETL Testing
3. Manual- Golden Copy based ETL Testing
4. Rules Based ETL Testing
1. Manual ETL Testing
In this approach, the data created by the ETL process is sampled and inspected visually by a tester. If the data output is as expected, the ETL process is certified.
This manual testing approach is not scalable, as humans are not capable of dealing with more than a few hundred records.
2. Manual Pseudo-code-based ETL Testing:
QA recreates pseudo ETL code in parallel to the developers' actual ETL process. This pseudo ETL code processes a subset of data and generates an output. The actual ETL process also uses the same input data and generates data. The ETL tester then compares the data generated by both processes and documents the differences. If the data is an exact match, the ETL process passes the quality test.
Parallel development of pseudo ETL code by the QA team is wasteful because:
– A significant amount of time, money, and resources is spent reproducing the pseudo ETL code.
– Since the pseudo ETL code is itself a development effort, there is no guarantee that it, or the data output it generates, is error free.
3. Manual Golden-Copy-based ETL Testing:
In this method, QA manually creates the expected data output based on their understanding of the data processing logic. First the manual data, also called the golden copy, is created and stored in a database. Next, the ETL process is executed, and the data it generates is compared to the golden copy created by the QA team. The ETL process is certified based on the comparison results.
This approach is also severely limited because:
– It only works with sampled data.
– It does not ensure test coverage.
– It does not work if the input data changes.

4. Rules Based ETL Testing:


An ETL process is a set of actions that transforms input data into a desired data output. Rules-based ETL testing understands these data transformation requirements and derives data audit rules, which are later used to test the ETL processes.
The prior three legacy methods have inherent limitations. Beyond the obvious economic reasons, there are a few fundamental flaws because of which the conventional approach to ETL testing will never work.
– Large Data Volume: The recent exponential growth in data volumes, as well as growth in the number of ETL processes, has made the above three approaches impractical. It is no longer feasible to manually test the ETL processes.
– Data Sampling: Data sampling does not work because many corner cases will only be discovered if almost all the data is processed and the output generated by the ETL process is inspected.
– Dynamic Input Data: The output generated by an ETL process is totally dependent on the input data, and the input is dynamic. Hence any conventional ETL testing that uses a predetermined output will not work.
– Cross-database comparisons: Since an ETL process reads data from one system and loads it into another, it is almost impossible to bring the data into one place and then do the comparison.
Rules-based ETL testing is designed to avoid all the above pitfalls. Its concept is derived from the auditing of financial transactions. For example, if $100 is transferred from account 'A' to account 'B', then for the financial audit the balance in account 'A' must decrease while the balance in account 'B' must simultaneously increase by the same amount.
Account 'A' (Original Balance – New Balance) = Account 'B' (New Balance – Original Balance)
This same concept has been extended to ETL testing: the logic of the ETL transformation is understood, and audit rules are created to certify the ETL process.
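The transfer audit above can be expressed as a single invariant over before/after balances. The following is a sketch with assumed example balances, not a real ledger:

```python
# Rules-based audit sketch: the amount debited from account A must equal
# the amount credited to account B (hypothetical balances).

balances_before = {"A": 500.0, "B": 200.0}
balances_after = {"A": 400.0, "B": 300.0}

debit_a = balances_before["A"] - balances_after["A"]
credit_b = balances_after["B"] - balances_before["B"]

# The audit rule: the balances may change every run, the rule does not.
rule_holds = (debit_a == credit_b)
print(rule_holds)
```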
Advantages of Rules-based ETL Testing:
– The data can change, but the rules don't.
– No need to recreate the ETL processes.
– There is no limit on data volume.
– No manual intervention.
– No data sampling is required.
– The rules are stored in a central knowledge repository.
– The rules are executed on the server via a schedule or on demand.
– Cross-database testing is possible.
– The rules can be stateless to support dynamic changes in the input data.

ETL Testing Concepts with Examples


Rules-based ETL testing operates under two assumptions:
– The input data provided will change for each ETL run.
– The data transformation rules and conditions remain the same unless there are changes in business requirements.
For ETL testing you must understand the concepts of static ETL testing and dynamic ETL testing.
1. Static ETL Testing
For static ETL testing, only the output data is used for certifying the ETL process. The output value generated by the ETL process is compared to a fixed value (or values) or a condition that is previously known or defined by the tester. The input data used by the ETL process is not taken into consideration. Here are a few examples:
1. Static Value Example: The Account Category column can only have one of three values: 'Checking, Savings, Trading'.
2. Static Condition Example: Net Amount must equal Gross Amount minus the sum of Tax, Commission, and Fees.
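Both static examples can be sketched as checks over the output rows alone. The row layout and column names below are assumptions for illustration:

```python
# Static ETL tests: only the output rows are checked against fixed
# values/conditions known in advance (hypothetical output data).

output_rows = [
    {"account_category": "Checking", "gross": 100.0, "tax": 5.0,
     "commission": 2.0, "fees": 1.0, "net": 92.0},
    {"account_category": "Savings", "gross": 50.0, "tax": 2.5,
     "commission": 0.0, "fees": 0.5, "net": 47.0},
]

ALLOWED_CATEGORIES = {"Checking", "Savings", "Trading"}

# Static value rule: Account Category must be one of three fixed values.
value_ok = all(r["account_category"] in ALLOWED_CATEGORIES for r in output_rows)

# Static condition rule: Net = Gross - (Tax + Commission + Fees).
condition_ok = all(
    abs(r["net"] - (r["gross"] - (r["tax"] + r["commission"] + r["fees"]))) < 1e-9
    for r in output_rows
)
print(value_ok, condition_ok)
```

Notice that neither rule looks at the input data; a fixed expectation is enough.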
2. Dynamic ETL Testing
For dynamic ETL testing, both input and output data are taken into consideration while testing the ETL process. In many cases the output data of an ETL process is totally dependent on the input data provided at runtime. Even though the transformation logic is fixed, the final value cannot be determined without knowing the input values used by the ETL process at runtime.
Thus, ETL testing must support the dynamic nature of input data that is provided during execution. This can be represented by the following simplistic equation.

Input Data + Transformation = Output Data


The ETL testing example below explains the dynamic ETL testing concept:
An ETL process loads a customer list. The list contains two types of customers: corporate and individual. The ETL developer is asked to load only individual customers and not corporate customers. To test this ETL process, the total of individual customers in the source must exactly match the customers in the target.
How many customers should be in the target table? That can only be known by counting the individual customers in the source that were provided to the ETL process at runtime.
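The customer example can be sketched as follows, with hypothetical source and target data; the key point is that the expected count is computed from the input at runtime rather than fixed in advance:

```python
# Dynamic ETL test: the expected target count is derived from the input
# data at runtime, not predetermined (hypothetical customer lists).

source_customers = [
    {"id": 1, "type": "individual"},
    {"id": 2, "type": "corporate"},
    {"id": 3, "type": "individual"},
]

# Output of the (assumed) ETL run that should load individuals only.
target_customers = [{"id": 1}, {"id": 3}]

# The rule is stateless: count individuals in whatever input arrived.
expected_count = sum(1 for c in source_customers if c["type"] == "individual")
test_passed = (expected_count == len(target_customers))
print(test_passed)
```

If tomorrow's feed contains 10,000 individuals, the same rule still applies without any change to the test.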

Types of ETL Testing


The types of ETL testing are listed below
1. ETL Source Data Validation Testing
2. ETL Source to Target Data Reconciliation Testing
3. ETL Data Transformation testing
4. ETL Data Validation
5. ETL Referential Integrity Testing
6. ETL Integration Testing
7. ETL Performance Testing
8. ETL Functional Testing
9. ETL Unit Testing
10. ETL Validation
1. ETL Source Data Validation Testing
This ETL testing checks the input data for validity, because if the input data itself is not valid, you cannot expect the ETL process to transform the data correctly, or even to execute at all.

2. ETL Source to Target Data Reconciliation Testing


This test is mostly done to prove that there is no leakage while transporting or staging the data. Comparing the source (input) data and the target (output) data ensures that data completeness and consistency are not lost because of any issues in the ETL process. For example,

– Make sure the row count between the source and the target table matches.
– Compare all the customer data in the source and the target table to ensure that the ETL loaded the data into the destination table as per the mapping rules.
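A reconciliation check of this kind can be sketched with in-memory stand-ins for the source and target tables (hypothetical data):

```python
# Source-to-target reconciliation sketch: row counts must match, and no
# row may leak in transit or appear unexpectedly (hypothetical tables).

source_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
target_rows = [(1, "alice"), (2, "bob"), (3, "carol")]

counts_match = len(source_rows) == len(target_rows)
missing_in_target = set(source_rows) - set(target_rows)  # leakage
extra_in_target = set(target_rows) - set(source_rows)    # unexpected rows
print(counts_match, missing_in_target, extra_in_target)
```

In practice the two sides live in different databases, so a testing tool computes these counts and set differences across systems rather than in one process.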
3. ETL Data Transformation Testing
Data transformation tests ensure that every row has been transformed successfully based on the mapping document. Testing data transformations involves reconciling the data between source and destination to verify that the ETL is transforming the data as expected. For example,

– Test the transformation of the first name and last name source columns into the full name target column.
– Make sure the ETL is calculating derived values correctly.
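The full-name example can be sketched as a row-by-row reconciliation. The column names and mapping rule (space-separated concatenation) are assumptions for illustration:

```python
# Transformation test sketch: verify that first_name + last_name in the
# source maps to full_name in the target, per an assumed mapping rule.

source_rows = [{"id": 1, "first_name": "Ada", "last_name": "Lovelace"},
               {"id": 2, "first_name": "Alan", "last_name": "Turing"}]
target_full_names = {1: "Ada Lovelace", 2: "Alan Turing"}  # keyed by id

# Recompute the expected value per row and collect any mismatching ids.
mismatches = [
    r["id"] for r in source_rows
    if target_full_names.get(r["id"]) != f'{r["first_name"]} {r["last_name"]}'
]
print(mismatches)
```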
4. ETL Data Validation
An ETL process loads data into a destination data store. ETL data validation verifies the data loaded by the ETL process. This is also called ETL load testing, as the data is validated only after the ETL loads it. Data validation tests are used to validate a single data source, be it a database table, a data extract, a dimension table, or a fact table. Check some of the examples below:
– Check if there are any nulls in the column.
– Validate that the format of the email is valid.
– There should be only one active record in a dimension table.
– The date in the birth date column should be a valid date.
– Check that the Net Amount is not less than zero.
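Several of the checks above can be sketched against a single loaded table. The row layout and the (deliberately simple) email pattern are assumptions:

```python
# Data validation sketch over a single target table (hypothetical rows):
# null check, email format, and non-negative net amount.
import re

rows = [
    {"email": "a@example.com", "net_amount": 10.0},
    {"email": "b@example.com", "net_amount": 0.0},
]

# Simplistic email shape check, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

failures = []
for i, r in enumerate(rows):
    if r["email"] is None or not EMAIL_RE.match(r["email"]):
        failures.append((i, "email"))
    if r["net_amount"] is None or r["net_amount"] < 0:
        failures.append((i, "net_amount"))
print(failures)
```

Each failing (row, column) pair is recorded, which is the shape of output a validation report typically needs.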
5. ETL Referential Integrity Testing
Referential integrity testing ensures that the child table only has foreign key values that exist in the parent table.

In the case above, the gender table has M, F, and Others. The ETL testing involves reconciling so that the Gender attribute in the customer table will only have one of those three values.
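The gender lookup check can be sketched as an orphan-key scan; the customer rows below are hypothetical, with one deliberately invalid code:

```python
# Referential integrity sketch: every gender code in the customer (child)
# table must exist in the gender (parent) lookup table.

gender_parent = {"M", "F", "Others"}
customers = [{"id": 1, "gender": "M"},
             {"id": 2, "gender": "F"},
             {"id": 3, "gender": "X"}]  # "X" is an orphan foreign key

orphans = [c["id"] for c in customers if c["gender"] not in gender_parent]
print(orphans)
```

In SQL this is the classic anti-join of the child table against the parent; the sketch above does the same scan in memory.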
6. ETL Integration Testing
ETL integration testing is done to verify that the ETL process
has integrated the data correctly. One of the key purposes of
an ETL process is to integrate data from multiple data sources
or multiple subject areas.
Vertical Integration Testing: In this case data is brought in from multiple data sources and integrated into a table. For example, in this type of integration a customer list from a CRM system and an accounting system is integrated into a single unified list. The integration must ensure that:
– Attributes from multiple sources are mapped correctly to the destination.
– No duplicate records exist.
Horizontal Integration Testing: In this scenario data from multiple subject areas and sources is linked together to form meaningful relationships. A typical example is linking salesperson data with sales data to calculate commission. Mostly, referential integrity/foreign keys are created, and different tables are linked together.
ETL integration testing involves the creation of multiple ETL testing rules to verify that the data integration is done correctly. This is true because even though there might be one ETL process that integrates the data, it nevertheless contains multiple business rules for data transformation. ETL testing must ensure that each of those integration rules is implemented correctly. This testing includes all the above types of testing.
– Ensure the data is going to the respective attributes.
– No duplicate entities exist, and at the same time no unrelated entities are unified.
– Ensure the entities are linked correctly.
7. ETL Performance Testing
Even if the ETL process is coded correctly, it is possible that, when executed, it takes an unreasonably long time to finish the job. ETL performance testing measures the time taken to finish processing a certain number of records against user expectations. ETL performance metrics are usually measured in the number of rows processed per second.
To measure performance, three metrics are needed: the ETL process start time, the ETL process end time, and the number of records processed. The sources for the above metrics are:

– A special ETL log table which captures all the ETL process execution stats.
– Some of the metrics are derived from the target table with row-level logging attributes such as record insert datetime and record update datetime.
There is no universal standard for performance testing numbers, so it all depends on expectations. However, some parameters that directly affect the ETL process performance numbers must be taken into consideration:
– Number of records inserted.
– Number of records updated or deleted.
– Whether logging is enabled in the target database.
– Row-level locking settings in destination tables.
– Presence of indexes.
– The size of the processing machine.
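From the three metrics named above, the rows-per-second figure is a simple division. The log-record layout below is an assumption, standing in for whatever the real ETL log table captures:

```python
# Performance metric sketch: rows/second from (assumed) ETL log fields.
from datetime import datetime

etl_log = {
    "start_time": datetime(2024, 1, 1, 2, 0, 0),
    "end_time": datetime(2024, 1, 1, 2, 5, 0),   # a 5-minute run
    "records_processed": 1_500_000,
}

elapsed_seconds = (etl_log["end_time"] - etl_log["start_time"]).total_seconds()
rows_per_second = etl_log["records_processed"] / elapsed_seconds
print(rows_per_second)
```

The resulting figure is then compared against the expectation agreed with the users, since there is no universal threshold.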
8. ETL Functional Testing
ETL functional tests are designed to ensure that the ETL is following the business rules defined by the users. While the tests might look similar, the thought behind a functional test is totally different. The example below highlights the key differentiators.
Suppose there are processes that load orders and shipments data from a source. The tester can compare the source orders with target orders and the source shipments with target shipments, and the data is a perfect match. While this is technically correct and the test is a success, it could still be a failure from a business rules or functional point of view. Here are a few such scenarios:
If there are more shipments than orders, that means shipments were done without orders. It is also possible that the shipment quantity and the order quantity, or the prices, don't match between them.
These kinds of issues are tested by reconciling orders and shipments, and they fall under ETL functional testing.
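The order/shipment reconciliation can be sketched as two business-rule checks. The record layout and quantities below are assumptions, with one deliberately orphaned shipment:

```python
# Functional test sketch: every shipment must trace back to an order,
# and shipped quantity must not exceed ordered quantity (assumed data).

orders = {101: {"qty": 5}, 102: {"qty": 2}}
shipments = [{"order_id": 101, "qty": 5},
             {"order_id": 103, "qty": 1}]  # 103 has no matching order

shipments_without_orders = [s["order_id"] for s in shipments
                            if s["order_id"] not in orders]
over_shipped = [s["order_id"] for s in shipments
                if s["order_id"] in orders
                and s["qty"] > orders[s["order_id"]]["qty"]]
print(shipments_without_orders, over_shipped)
```

Both source-to-target comparisons could pass while these cross-entity rules fail, which is exactly the gap functional testing covers.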
9. ETL Unit Testing
ETL testing done by the developer during development is called ETL unit testing. An ETL unit test is a test written by the programmer to verify that a relatively small piece of ETL code is doing what it is intended to do. Unit tests are narrow in scope. While testing activity is expected from the QA team, some basic testing tasks are executed by the ETL developer during the development phase itself, well before the code is transitioned to the QA team. This ensures basic data transformations are done correctly.
10. ETL Validation
ETL validation is a technical test done as part of unit testing to ensure that the ETL executes without crashing. It is usually done by the developer as part of the sanity checks explained above.

ETL Testing Scenarios


The following ETL testing scenarios should be considered for any data project.
1. Record level ETL tests
2. Attribute data level ETL tests
3. Aggregate data level ETL tests
4. Execution level ETL tests
These ETL testing scenarios repeat in multiple situations regardless of the type of data being processed.

Record Level Scenarios: These are record level ETL tests.
– Record Count Testing: This is a primary test, to check if all the available records are populated – nothing more, nothing less. It ensures that the ETL process has loaded all the records, but it does not know if the data in the records is correct.
– Duplicate Records Testing: Duplicate records happen if primary key or unique key constraints are not implemented in the database. In such cases specific ETL tests are needed to ensure duplicate records are not generated by the ETL process.
– Record Aggregation Test: In many scenarios transaction level records are aggregated by time or other dimensions. Tests are needed to ensure that the dimensions chosen for the aggregation of records are correct.
– Row Filter Testing: Often ETL developers miss adding filters or forget to remove filters that were added during testing. Create ETL tests to ensure proper data filters are implemented as per requirements.
– Type II Dimension Testing: The Type II dimension ETL logic retires old records and inserts new records. This test ensures that only one valid record is present and that the expiry dates don't overlap.
Attribute Level Scenarios: These are attribute level tests.
– Data Mapping Testing: During development the developer might make a mistake in mapping the source and target attributes. This ETL test ensures that the data is populated in the correct target attributes.
– Calculations (Numeric and Date): Many mathematical calculations are used to populate calculated fields. This ETL test ensures that the calculations are done correctly by the ETL process.
– Expressions (String): Various string manipulations and operations, such as CONCAT, SUBSTRING, and TRIM, are done on strings. This test ensures string transformations are done correctly by the ETL process.
– Data Truncation: If the target column has a shorter size than the data, the data can get truncated. This ETL test ensures string data is not truncated by the ETL process or during load time.
– Data Rounding (Numbers and Dates): This can happen if the datatype is not chosen correctly in either the ETL process variables or the target table. Numbers can get rounded, and dates can lose time or second components. Ensure decimal data is not rounded incorrectly.
– Formatting Issues (Dates and Strings): This mostly happens with string datatypes, as they accept data in almost any format. Ensure date and string data is formatted correctly.
– Reference Data or Dimension Lookup: Ensure that the child or transaction attributes have reference data that is present in the master.
Aggregate Scenarios: This involves testing of summarized (balances, snapshots, aggregates) data.
– Aggregate Calculation: Ensure the aggregations of data are done correctly.
– Simple Row Counts: Ensure the number of records populated is neither more nor less than the expected number of records; the row count in the destination matches the source system.
– Simple Sums: Match the sums of numeric values between source and target to ensure the numbers are correct.
– Grouped Row Count: Reconcile counts for different groups between source and target.
– Group Sums: Reconcile aggregate sums for different groups between source and target.
Execution Scenarios: This involves testing of ETL processes related to their execution.
– Incremental Load: Often data is loaded in increments based on delta logic. This ETL test ensures the incremental loads reconcile correctly with the source and that no gaps or overlaps are generated.
– Multi Execution Tests: Normally you would not expect the same data to be processed again, but in many situations data is reprocessed or accidentally re-executed. This test ensures multiple reruns of the ETL process with the same data do not generate extra records.
– ETL Performance Test: The data processing must finish within the required timeframe. The ETL performance test ensures that the ETL processing time is acceptable by checking the run logs.

Scope of ETL Testing


The scope of ETL testing is restricted to ensuring that the ETL process is correctly developed and is processing data as per the business requirements.
– Data transformation and data loading are correct.
– The ETL process does not create duplicate data.
– ETL process execution is done in the proper order.
– The ETL process has correct incremental logic for processing data.
– The ETL emits proper exit codes on errors.
– The ETL processes do not crash due to data exceptions.
– The ETL process logs metadata about its execution.
