ETL Tutorial
An ETL tool extracts data from heterogeneous data sources, transforms the data (by applying calculations, joining fields and keys, removing incorrect data fields, etc.), and loads it into a Data Warehouse. This is an introductory tutorial that explains all the fundamentals of ETL testing.
Audience
This tutorial has been designed for all those readers who want to learn the basics of
ETL testing. It is especially going to be useful for all those software testing
professionals who are required to perform data analysis to extract relevant
information from a database.
Prerequisites
We assume the readers of this tutorial have hands-on experience of handling a
database using SQL queries. In addition, it is going to help if the readers have an
elementary knowledge of data warehousing concepts.
INTRODUCTION
The data in a Data Warehouse system is loaded with an ETL (Extract, Transform,
Load) tool. As the name suggests, it performs the following three operations −
Extracts the data from your transactional system, which can be Oracle, Microsoft SQL Server, or any other relational database,
Transforms the data by performing data cleansing operations, and then
Loads the data into an OLAP Data Warehouse.
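As a rough sketch of this flow in plain SQL (assuming hypothetical tables src_orders in the transactional system and dw_orders in the warehouse, both reachable from one database), the transform-and-load step could look as follows −
-- Transform: apply a simple calculation and filter out incorrect records, then load into the warehouse table
INSERT INTO dw_orders (order_id, customer_id, order_amount, order_date)
SELECT order_id, customer_id, ROUND(order_amount, 2), order_date
FROM src_orders
WHERE order_id IS NOT NULL
AND customer_id IS NOT NULL;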
You can also extract data from flat files like spreadsheets and CSV files using an ETL
tool and load it into an OLAP data warehouse for data analysis and reporting. Let us
take an example to understand it better.
Example
Let us assume there is a manufacturing company with multiple departments such as sales, HR, Material Management, EWM, etc. All these departments maintain separate databases to hold information related to their work, and each database has a different technology, landscape, table names, columns, etc. Now, if the company wants to analyze historical data and generate reports, all the data from these data sources should be extracted and loaded into a Data Warehouse to keep it available for analytical work.
An ETL tool extracts the data from all these heterogeneous data sources, transforms
the data (like applying calculations, joining fields, keys, removing incorrect data fields,
etc.), and loads it into a Data Warehouse. Later, you can use various Business
Intelligence (BI) tools to generate meaningful reports, dashboards, and visualizations
using this data.
Difference between ETL and BI Tools
An ETL tool is used to extract data from different data sources, transform the data,
and load it into a DW system; a BI tool, on the other hand, is used to generate interactive and ad-hoc reports for end-users, dashboards for senior management, and data visualizations for monthly, quarterly, and annual board meetings.
The most common ETL tools include − SAP BO Data Services (BODS), Informatica PowerCenter, Microsoft SSIS, Oracle Data Integrator (ODI), Talend Open Studio, CloverETL (open source), etc.
Some popular BI tools include − SAP Business Objects, SAP Lumira, IBM Cognos,
JasperSoft, Microsoft BI Platform, Tableau, Oracle Business Intelligence Enterprise
Edition, etc.
ETL Process
Let us now discuss in a little more detail the three layers involved in an ETL procedure −
Staging Layer − The staging layer or staging database is used to store the data extracted from the different source data systems.
Data Integration Layer − The integration layer transforms the data from the staging layer and moves it to a database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of fact and dimension tables in a DW system is called a schema.
Access Layer − The access layer is used by end-users to retrieve the data for analytical reporting and information.
The following illustration shows how the three layers interact with each other.
ETL Testing – Tasks
ETL testing is done before data is moved into a production data warehouse system.
It is sometimes also called table balancing or production reconciliation. It differs from database testing in terms of its scope and the steps taken to complete it.
The main objective of ETL testing is to identify and mitigate data defects and general errors that occur before the data is processed for analytical reporting.
ETL Testing
ETL testing involves the following operations −
Validation of data movement from the source to the target system.
Verification of data count in the source and the target system.
Verifying that data extraction and transformation are as per requirement and expectation.
Verifying if table relations − joins and keys − are preserved during the
transformation.
Common ETL testing tools include QuerySurge, Informatica, etc.
Database Testing
Database testing focuses more on data accuracy, correctness of data, and valid values. It involves the following operations −
Verifying if primary and foreign keys are maintained.
Verifying if the columns in a table have valid data values.
Verifying data accuracy in columns. Example − a number-of-months column shouldn’t have a value greater than 12 (see the sample queries after this list).
Verifying missing data in columns. Check if there are null columns which
actually should have a valid value.
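For instance, the checks above could be scripted as simple validation queries (a sketch only, assuming a hypothetical employee_leave table with num_months and emp_id columns) −
-- Rows violating the rule: number of months must not exceed 12
SELECT * FROM employee_leave WHERE num_months > 12;
-- Rows where a mandatory column is unexpectedly NULL
SELECT * FROM employee_leave WHERE emp_id IS NULL;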
Common database testing tools include Selenium, QTP, etc.
The following table captures the key features of Database and ETL testing and their
comparison −
Migration Testing
In migration testing, customers have an existing Data Warehouse and ETL, but they
look for a new ETL tool to improve the efficiency. It involves migration of data from
the existing system using a new ETL tool.
Change Testing
In change testing, new data is added from different data sources to an existing system. Customers can also change the existing ETL rules, or new rules can be added.
Report Testing
Report testing involves creating reports for data validation. Reports are the final
output of any DW system. Reports are tested based on their layout, data in the report,
and calculated values.
Data Loading
Data is loaded from the staging area to the target system. It involves the following
operations −
Record count check from the intermediate table to the target system.
Ensure the key field data is not missing or Null.
Check if the aggregate values and calculated measures are loaded in the fact
tables.
Check modeling views based on the target tables.
Check if CDC has been applied on the incremental load table.
Data checks in the dimension tables and history table checks.
Check the BI reports based on the loaded fact and dimension tables, as per the expected results.
Incremental Testing
Incremental testing is performed to verify if Insert and Update statements are
executed as per the expected result. This testing is performed step-by-step with old
and new data.
Regression Testing
Regression testing is performed when we make changes to the data transformation and aggregation rules to add new functionality; it also helps the tester find new errors introduced by these changes. The bugs that appear in the data during regression testing are called regression defects.
Retesting
When you run the tests after fixing the codes, it is called retesting.
Navigation Testing
Navigation testing is also known as testing the front-end of the system. It involves testing from the end-user's point of view by checking all aspects of the front-end report − data in the various fields, calculations and aggregates, etc.
ETL testing covers all the steps involved in an ETL lifecycle. It starts with understanding the business requirements and ends with the generation of a summary report.
The common steps under ETL Testing lifecycle are listed below −
Understanding the business requirement.
Validation of the business requirement.
Test Estimation is used to provide the estimated time to run test-cases and to
complete the summary report.
Test Planning involves finding the Testing technique based on the inputs as
per business requirement.
Creating test scenarios and test cases.
Once the test-cases are ready and approved, the next step is to perform a pre-execution check.
Execute all the test-cases.
The last step is to generate a complete summary report and file a closure
process.
Full Data Validation (Minus Query)
You need to match the rows in the source and the target using the Minus and Intersect statements. The count returned by Intersect should match the individual counts of the source and target tables. If the minus query returns no rows and the Intersect count is less than the source count or the target table count, then the table holds duplicate rows.
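A sketch of such a validation, assuming the source table is employee and the target table is emp_dim (as in the count example later in this tutorial), and that emp_id and emp_name are the columns being compared, could be −
-- Rows present in the source but missing in the target
SELECT emp_id, emp_name FROM employee
MINUS
SELECT emp_id, emp_name FROM emp_dim;
-- Count of distinct rows common to both; compare this with the individual table counts
SELECT COUNT(*) FROM
(SELECT emp_id, emp_name FROM employee
INTERSECT
SELECT emp_id, emp_name FROM emp_dim);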
Value Comparison
Value comparison involves comparing the data in source and target system with
minimum or no transformation. It can be done using various ETL Testing tools, for
example, Source Qualifier Transformation in Informatica.
Some expression transformations can also be performed in data accuracy testing.
Various set operators can be used in SQL statements to check data accuracy in the source and the target systems. Common operators are the Minus and Intersect operators. The results of these operators indicate deviations in value between the target and the source systems.
Number check
The number format should be the same across the source and the target systems. For example, if the source system stores the column values in the format x.30 but the target column expects only 30, the data has to be loaded without the x. prefix in the target column.
Date Check
The Date format should be consistent in both the source and the target systems. For
example, it should be same across all the records. The Standard format is: yyyy-mm-
dd.
Precision Check
The precision value should display as expected in the target table. For example, in the source table the value is 15.2323422, but in the target table it should display as 15.23, or as 15 if it is rounded off.
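For instance, a precision check could compare the rounded source value with the target value (a sketch, assuming hypothetical src_rates and tgt_rates tables sharing a rate_id key and a rate column) −
-- Rows where the target value does not match the source value rounded to 2 decimal places
SELECT s.rate_id, s.rate, t.rate
FROM src_rates s JOIN tgt_rates t ON s.rate_id = t.rate_id
WHERE ROUND(s.rate, 2) <> t.rate;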
Data Check
It involves checking the data as per the business requirement. The records that don’t
meet certain criteria should be filtered out.
Example − Only those records whose date_id >= 2015 and Account_Id != ‘001’ should be loaded into the target table.
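A corresponding check (a sketch, assuming a hypothetical target table tgt_accounts with date_id and Account_Id columns) verifies that no filtered-out record has reached the target −
-- Any row returned here violates the loading rule
SELECT * FROM tgt_accounts
WHERE date_id < 2015 OR Account_Id = '001';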
Null Check
Some columns should have a Null value as per the requirement and the possible values for that field.
Example − The Termination Date column should display Null unless its Active Status column is “T” or “Deceased”.
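A query for this rule could look as follows (a sketch, assuming a hypothetical employee_dim table with termination_date and active_status columns) −
-- Rows violating the rule: Termination Date is filled although the status is neither 'T' nor 'Deceased'
SELECT * FROM employee_dim
WHERE termination_date IS NOT NULL
AND active_status NOT IN ('T', 'Deceased');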
Other Checks
Common checks, such as verifying that From_Date is not greater than To_Date, can also be performed.
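Such a check can be written directly in SQL (a sketch, assuming a hypothetical contracts table with From_Date and To_Date columns) −
-- Rows where From_Date is later than To_Date
SELECT * FROM contracts WHERE From_Date > To_Date;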
ETL Testing – Data Completeness
Checking Data Completeness is done to verify that the data in the target system is
as per expectation after loading.
The common tests that can be performed for this are as follows −
Checking Aggregate functions (sum, max, min, count),
Checking and validating the counts and the actual data between the source
and the target for columns without transformations or with simple
transformations.
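For instance, an aggregate check could compare a measure between the source and the target (a sketch, assuming a hypothetical sales_amount column present in both the source table sales_src and the target fact table fact_sales) −
-- The two sums should match after the load
SELECT SUM(sales_amount) FROM sales_src;
SELECT SUM(sales_amount) FROM fact_sales;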
Count Validation
Compare the number of records in the source and the target tables. It can be done by writing the following queries −
SELECT COUNT(1) FROM employee;
SELECT COUNT(1) FROM emp_dim;
QuerySurge
QuerySurge is a data testing solution designed for testing Big Data, Data
Warehouses, and the ETL process. It can automate the entire process for you and fit
nicely into your DevOps strategy.
The key features of QuerySurge are as follows −
It has Query Wizards to create test QueryPairs fast and easily without the user
having to write any SQL.
It has a Design Library with reusable Query Snippets. You can create custom
QueryPairs as well.
It can compare data from source files and data stores to the target Data
Warehouse or Big Data store.
It can compare millions of rows and columns of data in minutes.
It allows the user to schedule tests to run (1) immediately, (2) any date/time, or
(3) automatically after an event ends.
It can produce informative reports, view updates, and auto-email results to your
team.
To automate the entire process, your ETL tool should start QuerySurge through its command-line API after the ETL software completes its load process.
QuerySurge will run automatically and unattended, executing all tests and then
emailing everyone on the team with results.
Informatica Data Validation
Just like QuerySurge, Informatica Data Validation provides an ETL testing tool that helps you accelerate and automate the ETL testing process in the development and production environments. It allows you to deliver complete, repeatable, and auditable test coverage in less time. It requires no programming skills!
Scalability
It is best practice to make sure the offered ETL solution is scalable. At the time of
implementation, one needs to ensure that ETL solution is scalable with the business
requirement and its potential growth in future.
A fact table contains the measures of a business process and the keys that reference the associated dimension tables.
Example − Fact_Units
Cust_Id   Prod_Id   Time_Id   Units_Sold
101       24        1         25
102       25        2         15
103       26        3         30
A dimension table stores attributes or dimensions that describe the objects in a fact
table. It is a set of companion tables to a fact table.
Example − Dim_Customer
Cust_Id   Cust_Name   Gender
101       Jason       M
102       Anna        F
What are the different types of operators available in SQL?
The different types of SQL operators are −
Arithmetic Operators
Comparison/Relational Operators
Logical Operators
Set Operators
Operators used to negate conditions
What are the common set operators in SQL?
The common set operators in SQL are −
UNION
UNION ALL
INTERSECT
MINUS
What is the difference between Minus and Intersect? What is their use in ETL testing?
The Intersect operation is used to combine two SELECT statements, but it only returns the records that are common to both SELECT statements. In case of Intersect, the number of columns and their datatypes must be the same. MySQL (before version 8.0.31) does not support the INTERSECT operator. An Intersect query looks as follows −
SELECT * FROM First
INTERSECT
SELECT * FROM Second;
The Minus operation combines the results of two SELECT statements and returns only those rows that belong to the first result set. A Minus query looks as follows −
SELECT * FROM First
MINUS
SELECT * FROM Second;
If you perform source minus target and target minus source, and if the minus query
returns a value, then it should be considered as a case of mismatching rows.
If the minus query returns no rows and the Intersect count is less than the source count or the target table count, then the source and target tables contain duplicate rows.
Explain ‘Group-by’ and ‘Having’ clause with an example.
The GROUP BY clause is used with the SELECT statement to collect rows of similar data into groups. HAVING is very similar to WHERE, except that the conditions within it are of an aggregate nature.
Syntax −
SELECT dept_no, COUNT(1) FROM employee GROUP BY dept_no;
SELECT dept_no, COUNT(1) FROM employee GROUP BY dept_no HAVING COUNT(1) > 1;
Example − Employee table
Country Salary
India 3000
US 2500
India 500
US 1500
Group by Country
Country Salary
India 3000
India 500
US 2500
US 1500
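Note that when an aggregate function is added, the grouped rows collapse into one row per country. With the data above, the query and its result would be −
SELECT Country, SUM(Salary) AS Total_Salary FROM employee GROUP BY Country;
Country Total_Salary
India 3500
US 4000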
What are the different ETL Testing categories as per their function?
ETL testing can be divided into the following categories based on their function −
Source to Target Count Testing − It involves matching of count of records in
source and target system.
Source to Target Data Testing − It involves data validation between the source and the target system. It also involves data integration, threshold value checks, and duplicate data checks in the target system.
Data Mapping or Transformation Testing − It confirms the mapping of
objects in source and target system. It also involves checking functionality of
data in target system.
End-User Testing − It involves generating reports for end users to verify whether the data in the reports is as per expectation. It involves finding deviations in reports and cross-checking the data in the target system for report validation.
Retesting − It involves fixing the bugs and defects in the data in the target system and running the reports again for data validation.
System Integration Testing − It involves testing all the individual systems, and later combining the results to find if there are any deviations.
Explain the key challenges that you face while performing ETL Testing.
Data loss during the ETL process.
Incorrect, incomplete or duplicate data.
The DW system contains historical data, so the data volume is too large and extremely complex to perform ETL testing on in the target system.
ETL testers are normally not provided with access to see job schedules in the ETL tool. They hardly have access to BI reporting tools to see the final layout of reports and the data inside the reports.
It is tough to generate and build test cases, as the data volume is too high and complex.
ETL testers normally don't have an idea of end-user report requirements and the business flow of the information.
ETL testing involves various complex SQL concepts for data validation in the target system.
Sometimes testers are not provided with the source-to-target mapping information.
An unstable testing environment results in delays in the development and testing process.
What are your responsibilities as an ETL Tester?
The key responsibilities of an ETL tester include −
Verifying the tables in the source system − Count check, Data type check, keys
are not missing, duplicate data.
Applying the transformation logic before loading the data: data threshold validation, surrogate key check, etc.
Data Loading from the Staging area to the target system: Aggregate values
and calculated measures, key fields are not missing, Count Check in target
table, BI report validation, etc.
Testing of the ETL tool and its components, and test cases − creating, designing and executing test plans and test cases, testing the ETL tool and its functions, testing the DW system, etc.
What do you understand by the term ‘transformation’?
A transformation is a set of rules which generates, modifies, or passes data.
Transformation can be of two types − Active and Passive.
What do you understand by Active and Passive Transformations?
In an active transformation, the number of rows created as output can change once the transformation has occurred. This does not happen during a passive transformation − the same number of rows that is given as input is passed through to the output.
What is Partitioning? Explain different types of partitioning.
Partitioning is the division of a data store into parts. It is normally done to improve the performance of transactions.
If your DW system is huge in size, it will take time to locate the data. Partitioning the storage space allows you to find and analyze the data more easily and quickly.
Partitioning can be of two types − round-robin partitioning and hash partitioning.
What is the difference between round-robin partitioning and Hash partitioning?
In round-robin partitioning, data is evenly distributed among all the partitions, so the number of rows in each partition is relatively the same. In hash partitioning, the server uses a hash function to create partition keys that group the data.
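As an illustration only (Oracle-style syntax, assuming a hypothetical sales table), hash partitioning on a key column could be declared as follows −
-- Rows are assigned to one of 4 partitions based on a hash of cust_id
CREATE TABLE sales (
sale_id NUMBER,
cust_id NUMBER,
amount NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 4;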
Explain the terms − mapplet, session, mapping, workflow − in an ETL process?
A Mapplet defines the transformation rules.
A Session is a set of instructions that tells the server when and how to move data from the source to the target system.
A Workflow is a set of instructions that instructs the server on how to execute tasks.
A Mapping represents the flow of data from the source to the destination.
What is lookup transformation and when is it used?
A Lookup transformation allows you to access data from relational tables that are not defined in the mapping documents. It helps you update slowly changing dimension tables by determining whether the records already exist in the target or not.
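In plain SQL terms, a lookup is similar to joining the incoming data against the existing dimension to see whether a record is already present (a sketch, assuming hypothetical stg_customer and dim_customer tables sharing a cust_id column) −
-- Incoming records that do not yet exist in the dimension (candidates for insert)
SELECT s.*
FROM stg_customer s
LEFT JOIN dim_customer d ON s.cust_id = d.cust_id
WHERE d.cust_id IS NULL;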
What is a surrogate key in a database?
A surrogate key is a sequence-generated number with no business meaning, used only to identify the row uniquely. It is not visible to users or applications. It is also referred to as a synthetic or artificial key.
What is the difference between surrogate key and primary key?
A Surrogate key has sequence-generated numbers with no meaning. It is meant to
identify the rows uniquely.
A Primary key is used to identify the rows uniquely. It is visible to users and can be
changed as per requirement.
If there are thousands of records in the source system, how do you ensure that all the records
are loaded to the target in a timely manner?
In such cases, you can apply the checksum method. You can start by checking the
number of records in the source and the target systems. Select the sums and
compare the information.
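A sketch of this check (assuming a hypothetical orders table with an amount column, present in both systems) −
-- Run against the source and the target; both the counts and the sums should match
SELECT COUNT(*), SUM(amount) FROM orders;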
What do you understand by Threshold value validation Testing? Explain with an example.
In this testing, a tester validates the range of data. All the threshold values in the
target system are to be checked to ensure they are as per the expected result.
Example − Age attribute shouldn’t have a value greater than 100. In Date column
DD/MM/YY, month field shouldn’t have a value greater than 12.
Write an SQL statement to perform Duplicate Data check Testing.
SELECT Cust_Id, Cust_NAME, Quantity, COUNT(*)
FROM Customer
GROUP BY Cust_Id, Cust_NAME, Quantity
HAVING COUNT(*) > 1;
How does duplicate data appear in a target system?
When no primary key is defined, then duplicate values may appear.
Data duplication may also arise due to incorrect mapping, and manual errors while
transferring data from source to target system.
What is Regression testing?
Regression testing is performed when we make changes to the data transformation and aggregation rules to add new functionality; it also helps the tester find new errors introduced by the changes. The bugs that appear in the data during regression testing are called regression defects.
Name the three approaches that can be followed for system integration.
The three approaches are − top-down, bottom-up, and hybrid.
What are the common ETL Testing scenarios?
The most common ETL testing scenarios are −
Structure validation
Validating Mapping document
Validate Constraints
Data Consistency check
Data Completeness Validation
Data Correctness Validation
Data Transform validation
Data Quality Validation
Null Validation
Duplicate Validation
Date Validation check
Full Data Validation using minus query
Other Test Scenarios
Data Cleaning
What is data purging?
Data purging is a process of deleting data from a data warehouse. It removes junk
data like rows with null values or extra spaces.
What do you understand by a cosmetic bug in ETL testing?
Cosmetic bug is related to the GUI of an application. It can be related to font style,
font size, colors, alignment, spelling mistakes, navigation, etc.
What do you call the testing bug that comes while performing threshold validation testing?
It is called Boundary Value Analysis related bug.
I have 50 records in my source system but I want to load only 5 records to the target for each
run. How can I achieve this?
You can do it by creating a mapping variable and a filter transformation. You might need to generate a sequence in order to have the specifically sorted records you require.
Name a few checks that can be performed to achieve ETL Testing Data accuracy.
Value comparison − It involves comparing the data in the source and the target
systems with minimum or no transformation. It can be done using various ETL Testing
tools such as Source Qualifier Transformation in Informatica.
Critical data columns can be checked by comparing distinct values in source and
target systems.
Which SQL statements can be used to perform Data completeness validation?
You can use the Minus and Intersect statements to perform data completeness validation. When you perform source minus target and target minus source, and the minus query returns a value, then it is a sign of mismatching rows.
If the minus query returns no rows and the Intersect count is less than the source count or the target table count, then duplicate rows exist.
What is the difference between shortcut and reusable transformation?
Shortcut Transformation is a reference to an object that is available in a shared
folder. These references are commonly used for various sources and targets which
are to be shared between different projects or environments.
In the Repository Manager, a shortcut is created by assigning ‘Shared’ status to a folder. Later, objects can be dragged from this folder into another folder. This process allows a single point of control for the object, and multiple projects do not have to import all sources and targets into their local folders.
A Reusable Transformation is local to a folder. Example − a reusable sequence generator for allocating warehouse customer IDs. It is useful for loading customer details from multiple source systems and allocating unique IDs to each new source key.
What is Self-Join?
When you join a single table to itself, it is called Self-Join.
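For example, a self-join is commonly used to relate employees to their managers stored in the same table (a sketch, assuming a hypothetical employee table with emp_id, emp_name, and manager_id columns) −
-- Each employee listed along with the name of his/her manager
SELECT e.emp_name, m.emp_name AS manager_name
FROM employee e
JOIN employee m ON e.manager_id = m.emp_id;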
What do you understand by Normalization?
Database normalization is the process of organizing the attributes and tables of a
relational database to minimize data redundancy.
Normalization involves decomposing a table into less redundant (and smaller) tables
but without losing information.
What do you understand by fact-less fact table?
A fact-less fact table is a fact table that does not have any measures. It is essentially
an intersection of dimensions. There are two types of fact-less tables: One is for
capturing an event, and the other is for describing conditions.
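For instance, an event-capturing fact-less fact table might hold only the keys that describe the event (a sketch with hypothetical names) −
-- Records the fact that a student attended a class on a given day; there is no numeric measure
CREATE TABLE fact_attendance (
student_id NUMBER,
class_id NUMBER,
date_id NUMBER
);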
What is a slowly changing dimension and what are its types?
Slowly Changing Dimensions (SCD) refer to dimension attributes whose values change slowly over time. SCDs are of three types − Type 1, Type 2, and Type 3.