Etl Testing Tutorial
Audience
This tutorial has been designed for all those readers who want to learn the basics of ETL
testing. It is especially going to be useful for all those software testing professionals who
are required to perform data analysis to extract relevant information from a database.
Prerequisites
We assume the readers of this tutorial have hands-on experience of handling a database
using SQL queries. In addition, it is going to help if the readers have an elementary
knowledge of data warehousing concepts.
ETL Testing
1. ETL Introduction
The data in a Data Warehouse system is loaded with an ETL (Extract, Transform, Load)
tool. As the name suggests, it performs the following three operations:
Extracts the data from your transactional system, which can be an Oracle, Microsoft, or any other relational database.
Transforms the data by applying calculations, joining fields, removing incorrect data fields, and so on.
Loads the transformed data into the Data Warehouse system.
You can also extract data from flat files like spreadsheets and CSV files using an ETL tool
and load it into an OLAP data warehouse for data analysis and reporting. Let us take an
example to understand it better.
Example
Let us assume there is a manufacturing company having multiple departments such as
sales, HR, Material Management, EWM, etc. All these departments have separate
databases which they use to maintain information with respect to their work, and each database
has a different technology, landscape, table names, columns, etc. Now, if the company
wants to analyze historical data and generate reports, all the data from these data
sources should be extracted and loaded into a Data Warehouse to save it for analytical
work.
An ETL tool extracts the data from all these heterogeneous data sources, transforms the
data (like applying calculations, joining fields, keys, removing incorrect data fields, etc.),
and loads it into a Data Warehouse. Later, you can use various Business Intelligence (BI)
tools to generate meaningful reports, dashboards, and visualizations using this data.
ETL Process
Let us now discuss in a little more detail the key steps involved in an ETL procedure.
Staging Layer: The staging layer or staging database is used to store the data extracted from different source data systems.

Data Integration Layer: The integration layer transforms the data from the staging layer and moves it to a database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of fact and dimension tables in a DW system is called a schema.

Access Layer: The access layer is used by end-users to retrieve the data for analytical reporting and information.
The following illustration shows how the three layers interact with each other.
ETL testing is done before data is moved into a production data warehouse system. It is sometimes also called table balancing or production reconciliation. It differs from database testing in its scope and in the steps taken to complete it.
The main objective of ETL testing is to identify and mitigate data defects and general
errors that occur prior to processing of data for analytical reporting.
Both ETL testing and database testing involve data validation, but they are not the
same. ETL testing is normally performed on data in a data warehouse system, whereas
database testing is commonly performed on transactional systems where the data comes
from different applications into the transactional database.
Here, we have highlighted the major differences between ETL testing and Database
testing.
ETL Testing
ETL testing involves the following operations:
1. Validation of data movement from the source to the target system.
2. Verification of data count in the source and the target system.
3. Verifying data extraction, transformation as per requirement and expectation.
4. Verifying if table relations (joins and keys) are preserved during the transformation.
Common ETL testing tools include QuerySurge, Informatica, etc.
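As a sketch, the first two operations above can be automated with plain SQL against the source and target. The table names (src_orders, tgt_orders) are hypothetical, and an in-memory SQLite database stands in for the real source and target connections:

```python
import sqlite3

# In-memory database standing in for both the source and target systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE tgt_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO tgt_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")

# 1. Verify the record counts match between the source and the target.
src_count = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
tgt_count = conn.execute("SELECT COUNT(*) FROM tgt_orders").fetchone()[0]
counts_match = (src_count == tgt_count)

# 2. Verify data movement: rows present in the source but missing in the target.
missing_rows = conn.execute(
    "SELECT * FROM src_orders EXCEPT SELECT * FROM tgt_orders"
).fetchall()

print(counts_match, missing_rows)  # True [] when the load is complete
```

In a real project, the two queries would run against separate connections to the source and target databases; the comparison logic stays the same.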
Database Testing
Database testing focuses on data accuracy, correctness of data, and valid values.
It involves the following operations:
1. Verifying if primary and foreign keys are maintained.
2. Verifying if the columns in a table have valid data values.
3. Verifying data accuracy in columns. Example: the Number of Months column shouldn't have a value greater than 12.
4. Verifying missing data in columns. Check if there are null columns which actually
should have a valid value.
Common database testing tools include Selenium, QTP, etc.
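The validity and null checks above can be expressed as simple queries. As a minimal sketch, assuming a hypothetical employee table with a months_worked column that must not exceed 12 and a hire_date column that must not be null:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        months_worked INTEGER,
        hire_date TEXT
    );
    -- Employee 2 violates both rules on purpose.
    INSERT INTO employee VALUES (1, 11, '2015-03-01'), (2, 14, NULL);
""")

# Check 3: no "number of months" value may be greater than 12.
bad_months = conn.execute(
    "SELECT emp_id FROM employee WHERE months_worked > 12"
).fetchall()

# Check 4: columns that should always hold a value must not be NULL.
null_dates = conn.execute(
    "SELECT emp_id FROM employee WHERE hire_date IS NULL"
).fetchall()

print(bad_months, null_dates)  # [(2,)] [(2,)]
```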
The following list captures the key differences between database testing and ETL testing, function by function:

Primary Goal: Database testing is done for data validation and integration; ETL testing is done for data extraction, transformation, and loading for BI reporting.

Applicable System: Database testing applies to transactional systems where the business flow occurs; ETL testing applies to systems containing historical data, outside the business flow environment.

Common Tools: QTP, Selenium, etc. for database testing; QuerySurge, Informatica, etc. for ETL testing.

Business Need: Database testing is used to integrate data from multiple applications; ETL testing is used for analytical reporting and forecasting.

Modeling: ER method for database testing; multidimensional for ETL testing.

Database Type: Database testing is normally applied to OLTP systems; ETL testing is applied to OLAP systems.

Data Type: Normalized data with more joins in database testing; de-normalized data with fewer joins, more indexes, and aggregations in ETL testing.
ETL Testing categorization is done based on the objectives of testing and reporting. Testing categories vary as per the organization's standards and the client's requirements. Generally, ETL testing is categorized based on the following points:

Source to Target Data Testing: It involves data validation between the source and the target systems. It also involves data integration, threshold value checks, and duplicate data checks in the target system.

Retesting: It involves fixing the bugs and defects in data in the target system and running the reports again for data validation.

System Integration Testing: It involves testing all the individual systems and later combining the results to find if there are any deviations. There are three approaches that can be used to perform this: top-down, bottom-up, and hybrid.
Based on the structure of a Data Warehouse system, ETL testing (irrespective of the tool
that is used) can be divided into the following categories:
Migration Testing
In migration testing, customers have an existing Data Warehouse and ETL process, but they look for a new ETL tool to improve efficiency. It involves migrating data from the existing system using the new ETL tool.
Change Testing
In change testing, new data is added from different data sources to an existing system.
Customers can also change the existing rules for ETL or a new rule can also be added.
Report Testing
Report testing involves creating reports for data validation. Reports are the final output
of any DW system. Reports are tested based on their layout, data in the report, and
calculated values.
ETL testing is different from database testing or any other conventional testing. One may have to face different types of challenges while performing ETL testing. A few common challenges are listed here:

A DW system contains historical data, so the data volume is too large and complex for performing ETL testing in the target system.

ETL testers are normally not provided with access to see job schedules in the ETL tool. They hardly have access to BI reporting tools to see the final layout of reports and the data inside the reports.

It is tough to generate and build test cases, as the data volume is too high and complex.

ETL testers normally don't have an idea of the end-user report requirements and the business flow of the information.

ETL testing involves various complex SQL concepts for data validation in the target system.

Sometimes the testers are not provided with the source-to-target mapping information.
An ETL tester is primarily responsible for validating the data sources, extraction of data,
applying transformation logic, and loading the data in the target tables.
The key responsibilities of an ETL tester are listed below.

Count check.

Data threshold validation check; for example, the age value shouldn't be more than 100.

Record count check, before and after the transformation logic is applied.

Data flow validation from the staging area to the intermediate tables.

Data Loading
Data is loaded from the staging area to the target system. It involves the following operations:

Record count check from the intermediate table to the target system.

Check if the aggregate values and calculated measures are loaded in the fact tables.

Check the BI reports based on the loaded fact and dimension tables and as per the expected results.

Create, design, and execute the test plans and test cases.
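The aggregate check above can be sketched by recomputing the measure from the staging table and diffing it against the fact table. The table names (stg_sales, fact_sales) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_sales (prod_id INTEGER, amount REAL);
    CREATE TABLE fact_sales (prod_id INTEGER, total_amount REAL);
    INSERT INTO stg_sales VALUES (101, 10.0), (101, 15.0), (102, 20.0);
    -- The ETL is expected to aggregate the staging rows per product.
    INSERT INTO fact_sales VALUES (101, 25.0), (102, 20.0);
""")

# Recompute the aggregate from staging and compare it to the fact table.
mismatches = conn.execute("""
    SELECT s.prod_id, s.total, f.total_amount
    FROM (SELECT prod_id, SUM(amount) AS total
          FROM stg_sales GROUP BY prod_id) AS s
    JOIN fact_sales AS f ON f.prod_id = s.prod_id
    WHERE s.total <> f.total_amount
""").fetchall()

print(mismatches)  # [] when all calculated measures loaded correctly
```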
It is important to define the correct ETL testing technique before starting the testing process. You should obtain acceptance from all the stakeholders and ensure that a correct technique is selected to perform ETL testing. This technique should be well known to the testing team, and they should be aware of the steps involved in the testing process.
There are various types of testing techniques that can be used. In this chapter, we will
discuss the testing techniques in brief.
Manual errors while transferring data from the source to the target system.
Incremental Testing
Incremental testing is performed to verify if Insert and Update statements are executed
as per the expected result. This testing is performed step-by-step with old and new data.
Regression Testing
When we make changes to the data transformation and aggregation rules to add new functionality, which also helps the tester find new errors, it is called Regression Testing. The bugs in data that come up during regression testing are called regressions.
Retesting
When you run the tests after fixing the code, it is called retesting.
Navigation Testing
Navigation testing is also known as testing the front-end of the system. It involves testing from an end-user point of view by checking all aspects of the front-end report: data in the various fields, calculations and aggregates, and so on.
ETL testing covers all the steps involved in an ETL lifecycle. It starts with understanding the business requirements and ends with the generation of a summary report.
The common steps under ETL Testing lifecycle are listed below:
Test Estimation is used to provide the estimated time to run test cases and to complete the summary report.

Test Planning involves finding the testing technique based on the inputs as per the business requirements.

Once the test cases are ready and approved, the next step is to perform a pre-execution check.

The last step is to generate a complete summary report and file a closure process.
ETL Test Scenarios are used to validate an ETL testing process. Some of the most common scenarios used by ETL testers are:

Validating the mapping document
Validating constraints
Data correctness validation
Null validation
Duplicate validation
Data cleaning
ETL performance tuning is used to ensure that an ETL system can handle an expected load of multiple users and transactions. Performance tuning typically involves server-side workload on the ETL system. It is used to test the server response in a multi-user environment and to find bottlenecks. These can be found in the source and target systems, the mapping of systems, and configuration such as session management properties.

Step 2: Create new data of the same load, or move data from production to your local performance server.

Step 3: Disable the ETL until you generate the required load.

Step 4: Take the count of the needed data from the tables of the database.

Step 5: Note down the last run of the ETL and enable the ETL, so that it gets enough stress to transform the entire load created. Run the ETL.

Step 6: After the ETL completes its run, take the count of the data created. Check that the entire expected load got extracted and transferred.
The goal of ETL testing is to achieve credible data. Data credibility can be attained by making the testing cycle more effective.

A comprehensive test strategy is key to setting up an effective test cycle. The testing strategy should cover test planning for each stage of the ETL process, every time the data moves, and should state the responsibilities of each stakeholder, e.g., business analysts, infrastructure team, QA team, DBAs, developers, and business users.
To ensure testing readiness from all aspects, the key areas a test strategy should focus
on are:
In ETL testing, data accuracy checking is used to ensure that data is accurately loaded to the target system as per expectation. The key steps in performing data accuracy checks are as follows:
Value Comparison
Value comparison involves comparing the data in source and target system with
minimum or no transformation. It can be done using various ETL Testing tools, for
example, Source Qualifier Transformation in Informatica.
Some expression transformations can also be performed in data accuracy testing.
Various set operators can be used in SQL statements to check data accuracy in the
source and the target systems. Common operators are Minus and Intersect operators.
The results of these operators can be considered as deviations in value between the target and the source systems.
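The set-operator comparison above can be sketched as follows. Note that SQLite, used here as a stand-in, spells the Minus operator EXCEPT (as does standard SQL); Oracle uses MINUS. The src and tgt tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, val TEXT);
    CREATE TABLE tgt (id INTEGER, val TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tgt VALUES (1, 'a'), (2, 'b'), (4, 'd');
""")

# Source MINUS target: rows lost (or changed) during the load.
lost = conn.execute("SELECT * FROM src EXCEPT SELECT * FROM tgt").fetchall()

# Target MINUS source: rows that appeared in the target unexpectedly.
extra = conn.execute("SELECT * FROM tgt EXCEPT SELECT * FROM src").fetchall()

# INTERSECT: rows identical on both sides.
common = conn.execute("SELECT * FROM src INTERSECT SELECT * FROM tgt").fetchall()

print(lost, extra, len(common))  # [(3, 'c')] [(4, 'd')] 2
```

Any non-empty result from either minus query flags a deviation that the tester must investigate.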
Checking the metadata involves validating the source and the target table structures with respect to the mapping document. The mapping document has details of the source and target columns, data transformation rules, and the data types: all the fields that define the structure of tables in the source and the target systems.
The first step is to create a list of scenarios of input data and the expected results, and validate these with the business customer. This is a good approach for requirements gathering during design and could also be used as a part of testing.

The next step is to create test data that contains all the scenarios. Have an ETL developer automate the process of populating the datasets from the scenario spreadsheet, since the scenarios are likely to change.

Next, use data profiling results to compare the range and distribution of values in each field between the source and target data.

Validate the accurate processing of ETL-generated fields, e.g., surrogate keys.

Validate that the data types within the warehouse are the same as specified in the data model or design.

The final step is to validate the lookup transformation. Your lookup query should be straightforward, without any aggregation, and expected to return only one value per source-table row. You can directly join the lookup table in the source qualifier, as in the previous test. If this is not the case, write a query joining the lookup table with the main table in the source and compare the data in the corresponding columns in the target.
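The lookup validation described in the final step can be sketched as a query that reproduces the lookup in the source and diffs it against the loaded target column. All table and column names here (src_orders, lkp_country, tgt_orders) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, country_code TEXT);
    CREATE TABLE lkp_country (country_code TEXT, country_name TEXT);
    CREATE TABLE tgt_orders (order_id INTEGER, country_name TEXT);
    INSERT INTO src_orders VALUES (1, 'IN'), (2, 'US');
    INSERT INTO lkp_country VALUES ('IN', 'India'), ('US', 'United States');
    INSERT INTO tgt_orders VALUES (1, 'India'), (2, 'United States');
""")

# Join the lookup table with the main source table and compare the
# resolved value against the corresponding target column.
mismatches = conn.execute("""
    SELECT s.order_id
    FROM src_orders AS s
    JOIN lkp_country AS l ON l.country_code = s.country_code
    JOIN tgt_orders AS t ON t.order_id = s.order_id
    WHERE t.country_name <> l.country_name
""").fetchall()

print(mismatches)  # [] when the lookup transformation was applied correctly
```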
Checking data quality during ETL testing involves performing quality checks on data that
is loaded in the target system. It includes the following tests:
Number check
The number format should be the same across the target system. For example, if a column in the source system stores values in the format x.30 but the target column expects only 30, the data has to be loaded without the x. prefix into the target column.
Date Check
The date format should be consistent in both the source and the target systems. For example, it should be the same across all the records. The standard format is yyyy-mm-dd.
Precision Check
The precision value should display as expected in the target table. For example, in the source table the value is 15.2323422, but in the target table it should display as 15.23, or rounded to 15.
Data Check
It involves checking the data as per the business requirement. The records that don't meet certain criteria should be filtered out.

Example: Only those records whose date_id >= 2015 and Account_Id != 001 should be loaded in the target table.
Null Check
Some columns should have Null as per the requirement and the possible values for that field.

Example: The Termination Date column should display Null unless and until its Active Status column is 'T' or 'Deceased'.
Other Checks
Common checks, such as From_Date not being greater than To_Date, can also be done.
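Several of the quality checks above can be scripted against the loaded target table. A minimal sketch, assuming a hypothetical tgt table with a date column in yyyy-mm-dd format, a price column with two-decimal precision, and a From_Date/To_Date pair (row 2 deliberately violates all three rules):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tgt (
        record_id INTEGER,
        order_date TEXT,   -- expected format: yyyy-mm-dd
        price REAL,        -- expected precision: 2 decimal places
        from_date TEXT,
        to_date TEXT
    );
    INSERT INTO tgt VALUES
        (1, '2015-06-01', 15.23,      '2015-01-01', '2015-12-31'),
        (2, '01/06/2015', 15.2323422, '2016-01-01', '2015-12-31');
""")
rows = conn.execute("SELECT * FROM tgt").fetchall()

# Date check: every date must match the standard yyyy-mm-dd format.
date_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")
bad_dates = [r[0] for r in rows if not date_re.match(r[1])]

# Precision check: values must already be rounded to 2 decimal places.
bad_precision = [r[0] for r in rows if round(r[2], 2) != r[2]]

# Other checks: From_Date should not be greater than To_Date.
bad_ranges = [r[0] for r in rows if r[3] > r[4]]

print(bad_dates, bad_precision, bad_ranges)  # [2] [2] [2]
```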
Checking Data Completeness is done to verify that the data in the target system is as
per expectation after loading.
The common tests that can be performed for this are as follows:
Checking and validating the counts and the actual data between the source and
the target for columns without transformations or with simple transformations.
Count Validation
Compare the number of records in the source and the target tables. It can be done by writing the following queries:

SELECT count(1) FROM employee;
SELECT count(1) FROM emp_dim;
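The two count queries above can be run and compared programmatically. A minimal sketch using an in-memory SQLite database, with sample data standing in for the real employee source table and emp_dim target table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER, name TEXT);
    CREATE TABLE emp_dim (emp_key INTEGER, emp_id INTEGER, name TEXT);
    INSERT INTO employee VALUES (1, 'Jason'), (2, 'Anna');
    INSERT INTO emp_dim VALUES (10, 1, 'Jason'), (11, 2, 'Anna');
""")

# Run the count validation queries on the source and the target.
src = conn.execute("SELECT count(1) FROM employee").fetchone()[0]
tgt = conn.execute("SELECT count(1) FROM emp_dim").fetchone()[0]

print(src, tgt)  # 2 2 -- the two counts are expected to match
```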
Backup recovery for a system is planned to ensure that the system is restored from a failure as soon as possible and operations are resumed as early as possible without losing any important data.

ETL backup recovery testing is used to ensure that the Data Warehouse system recovers successfully from a hardware, software, or network failure without losing any data.

A proper backup plan must be prepared to ensure maximum system availability. Backup systems should be able to restore with ease and take over from the failed system without any data loss.

ETL backup recovery testing involves exposing the application or the DW system to extreme conditions, such as a hardware component failure or software crash. The next step is to ensure that the recovery process is initiated, system verification is done, and data recovery is achieved.
ETL testing is mostly done using SQL scripts and gathering the data in spreadsheets.
This approach to perform ETL testing is very slow and time-consuming, error-prone, and
is performed on sample data.
QuerySurge
QuerySurge is a data testing solution designed for testing Big Data, Data Warehouses,
and the ETL process. It can automate the entire process for you and fit nicely into your
DevOps strategy.
The key features of QuerySurge are as follows:
It has Query Wizards to create test QueryPairs quickly and easily, without the user having to write any SQL.
It has a Design Library with reusable Query Snippets. You can create custom
QueryPairs as well.
It can compare data from source files and data stores to the target Data
Warehouse or Big Data store.
It allows the user to schedule tests to run (1) immediately, (2) any date/time, or
(3) automatically after an event ends.
It can produce informative reports, view updates, and auto-email results to your
team.
To automate the entire process, your ETL tool should start QuerySurge through its command-line API after the ETL software completes its load process.
QuerySurge will run automatically and unattended, executing all tests and then emailing
everyone on the team with results.
Informatica Data Validation
Just like QuerySurge, Informatica Data Validation provides an ETL testing tool that helps
you to accelerate and automate the ETL testing process in the development and
production environment. It allows you to deliver complete, repeatable, and auditable test
coverage in less time. It requires no programming skills!
system study. This knowledge helps the ETL team to identify changed data capture
problems and determine the most appropriate strategy.
Scalability
It is a best practice to make sure the offered ETL solution is scalable. At the time of implementation, one needs to ensure that the ETL solution can scale with the business requirements and their potential future growth.
Staging Layer: The staging layer is used to store the data extracted from different source data systems.

Data Integration Layer: The integration layer transforms the data from the staging layer and moves the data to a database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of fact and dimension tables in a DW system is called a schema.

Access Layer: The access layer is used by end-users to retrieve the data for analytical reporting.
Example: A fact table

Prod_Id    Time_Id
101        24         25
102        25         15
103        26         30
A dimension table stores attributes or dimensions that describe the objects in a fact
table. It is a set of companion tables to a fact table.
Example: Dim_Customer

Cust_Id    Cust_Name    Gender
101        Jason
102        Anna
Common aggregate functions include:

MAX
SUM
AVG
COUNT
COUNT(*)
Example
SELECT AVG(salary)
FROM employee
WHERE title = 'developer';
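The AVG example above can be run end to end. A minimal sketch using sample data in SQLite; the rows here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, title TEXT, salary REAL);
    INSERT INTO employee VALUES
        ('A', 'developer', 3000.0),
        ('B', 'developer', 5000.0),
        ('C', 'manager',   9000.0);
""")

# AVG only considers the rows matched by the WHERE clause.
avg_salary = conn.execute(
    "SELECT AVG(salary) FROM employee WHERE title = 'developer'"
).fetchone()[0]

print(avg_salary)  # 4000.0
```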
11. Explain the difference between DDL, DML, and DCL statements.

Data Definition Language (DDL) statements are used to define the database structure or schema. Examples: CREATE, ALTER, DROP.

Data Manipulation Language (DML) statements are used to manipulate data within a database. Examples: SELECT, INSERT, UPDATE, DELETE.

Note: DELETE removes records from a table, but the space allocated for the records remains.

Data Control Language (DCL) statements are used to control access to database objects. Examples: GRANT, REVOKE.
Arithmetic Operators
Comparison/Relational Operators
Logical Operators
Set Operators: UNION, UNION ALL, INTERSECT, MINUS
14. What is the difference between Minus and Intersect? What is their
use in ETL testing?
The Intersect operation is used to combine two SELECT statements, but it only returns the records which are common to both SELECT statements. In the case of Intersect, the number of columns and their data types must be the same. MySQL does not support the INTERSECT operator. An Intersect query looks as follows:
SELECT * FROM first
INTERSECT
SELECT * FROM second;
The Minus operation combines the results of two SELECT statements and returns only those results which belong to the first set of results. A Minus query looks as follows:
SELECT * FROM first
MINUS
SELECT * FROM second;
If you perform source minus target and target minus source, and either minus query returns a value, then it should be considered a case of mismatching rows.

If the minus query returns no value and the intersect count is less than the source count or the target table count, then the source and target tables contain duplicate rows.
Country    Salary
India      3000
US         2500
India      500
US         1500
Group by Country:

Country    Salary
India      3000
India      500
US         2500
US         1500
The key differences between database testing and ETL testing are:

Primary Goal: Database testing is done for data validation and integration; ETL testing is done for data extraction, transformation, and loading for BI reporting.

Applicable System: Database testing applies to transactional systems where the business flow occurs; ETL testing applies to systems containing historical data, not in a business flow environment.

Common Tools: QTP, Selenium, etc. for database testing; QuerySurge, Informatica, etc. for ETL testing.

Business Need: Database testing is used to integrate data from multiple applications; ETL testing is used for analytical reporting and forecasting.

Modeling: ER method for database testing; multidimensional for ETL testing.

Database Type: Database testing is normally applied to OLTP systems; ETL testing is applied to OLAP systems.

Data Type: Normalized data with more joins in database testing; de-normalized data with fewer joins, more indexes, and aggregations in ETL testing.
18. What are the different ETL Testing categories as per their function?
ETL testing can be divided into the following categories based on their function:

End-User Testing: It involves generating reports for end users to verify if the data in the reports is as per expectation. It involves finding deviations in reports and cross-checking the data in the target system for report validation.

Retesting: It involves fixing the bugs and defects in data in the target system and running the reports again for data validation.

System Integration Testing: It involves testing all the individual systems and later combining the results to find if there are any deviations.
19. Explain the key challenges that you face while performing ETL
Testing.
A DW system contains historical data, so the data volume is too large and really complex for performing ETL testing in the target system.

ETL testers are normally not provided with access to see job schedules in the ETL tool. They hardly have access to BI reporting tools to see the final layout of reports and the data inside the reports.

It is tough to generate and build test cases, as the data volume is too high and complex.

ETL testers normally don't have an idea of the end-user report requirements and the business flow of the information.

ETL testing involves various complex SQL concepts for data validation in the target system.

Sometimes the testers are not provided with the source-to-target mapping information.
Verifying the tables in the source system: count check, data type check, checking that keys are not missing, duplicate data check.

Applying the transformation logic before loading the data: data threshold validation, surrogate key check, etc.

Data loading from the staging area to the target system: aggregate values and calculated measures, key fields are not missing, count check in the target table, BI report validation, etc.

Testing of the ETL tool and its components: create, design, and execute test plans and test cases; test the ETL tool and its functions; test the DW system, etc.

Sessions are defined to instruct the ETL tool when and how the data is moved from the source to the target system.
28. What is the difference between surrogate key and primary key?
A Surrogate key has sequence-generated numbers with no meaning. It is meant to
identify the rows uniquely.
A Primary key is used to identify the rows uniquely. It is visible to users and can be
changed as per requirement.
29. If there are thousands of records in the source system, how do you
ensure that all the records are loaded to the target in a timely manner?
In such cases, you can apply the checksum method. Start by checking the number of records in the source and the target systems, then select the sums of key columns and compare them.
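The checksum method above can be sketched as a count plus column sums acting as a cheap fingerprint of each table. The src_txn and tgt_txn tables here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_txn (txn_id INTEGER, amount REAL);
    CREATE TABLE tgt_txn (txn_id INTEGER, amount REAL);
    INSERT INTO src_txn VALUES (1, 100.0), (2, 250.0), (3, 75.0);
    INSERT INTO tgt_txn VALUES (1, 100.0), (2, 250.0), (3, 75.0);
""")

def checksum(table):
    # The record count plus the sums of key columns act as a cheap
    # fingerprint of the table's contents.
    return conn.execute(
        f"SELECT COUNT(*), SUM(txn_id), SUM(amount) FROM {table}"
    ).fetchone()

match = checksum("src_txn") == checksum("tgt_txn")
print(match)  # True when the counts and sums agree
```

Matching checksums do not prove the rows are identical, but a mismatch is a fast, cheap signal that a full row-by-row comparison is needed.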
34. Name the three approaches that can be followed for system
integration.
The three approaches are: top-down, bottom-up, and hybrid.
Common ETL test scenarios include:

Structure validation
Validate constraints
Null validation
Duplicate validation
Data cleaning
37. What do you call the testing bug that comes while performing
threshold validation testing?
It is called a Boundary Value Analysis related bug.
39. Name a few checks that can be performed to achieve ETL Testing
Data accuracy.
Value comparison: It involves comparing the data in the source and the target systems with minimum or no transformation. It can be done using various ETL testing tools, such as the Source Qualifier Transformation in Informatica.

Critical data columns can be checked by comparing the distinct values in the source and the target systems.
45. What is a slowly changing dimension and what are its types?
Slowly Changing Dimensions refer to the changing value of an attribute over time. SCDs
are of three types: Type 1, Type 2, and Type 3.
46. User A is already logged into the application and User B is trying to log in, but the system does not allow it. Which type of bug is it?
(a) Race Condition bug
(b) Calculation bug
(c) Hardware bug
(d) Load Condition bug
Answer: D
47. Which testing type is used to check the data type and length of
attributes in ETL transformation?
(a) Production Validation Testing
(b) Data Accuracy Testing
(c) Metadata Testing
(d) Data Transformation testing
Answer: C
48. Which of the following statements is/are not true on the Referential
join?
(a) It is only used when referential integrity between both tables is guaranteed.
(b) It is only used if a filter is set on the right side table
(c) It is considered as optimized Inner join.
(d) It is only executed when fields from both the tables are requested
Answer: B
50. Which bug type in ETL testing doesn't allow you to enter valid values?
(a) Load Condition bugs
(b) Calculation bugs
(c) Race condition bug
(d) Input/ Output bug
Answer: D