ETL - Extract, Transform and Load: What Is A Data Warehouse?
➢ Typically, a relational database that is designed for query and analysis rather than for
transaction processing
➢ A place where historical data is stored for archival, analysis and security purposes.
➢ contains either raw data or formatted data
➢ combines data from multiple sources (e.g., sales data, salaries, operational data, human
resource data, inventory data, web logs, social networks, Internet text and documents, and others)
➢ As the name implies, a data warehouse is a warehouse for storing large volumes of aggregated
data collected from a wide range of sources within an organization.
➢ A source can be flat files, database files or Excel files.
For example: Baskin-Robbins (famous as the world's largest chain of ice cream specialty shops) has
many shops in India as well as across the world. Let's say there is a Baskin-Robbins shop in our
area and it has its own system for saving customer visits and product purchase history, for example
in an Excel file. Once a week, all of this area-level data is collected and stored in a centralized
city data centre, which is nothing but a data warehouse for all the small areas. In the same way,
all the city-level data is collected and stored at the state level. A large data store which is
accumulated from such a wide range of sources is known as a Data Warehouse.
Data Mart
➢ A data mart is a subset of a data warehouse that has the same characteristics but is usually
smaller and is focused on the data for one division or one workgroup within an enterprise.
➢ Big Data is defined as data with too much volume, velocity and variability to work on normal
database architectures.
➢ Typically, Hadoop is the architecture used for big data lakes.
➢ Hadoop is an Apache open-source project that develops software for scalable, distributed
computing.
Difference between Database and Data warehouse
• A data warehouse is one kind of database, essentially a large database, typically formed by
combining data from multiple databases.
• A database, in contrast to a data warehouse, stores data/information for a particular
entity.
• Database is used for Online Transactional Processing (OLTP).
• Data warehouse is used for Online Analytical Processing (OLAP)
• OLTP is a decentralized system normally used by Internet websites, banks and airlines to
avoid single points of failure and to spread the volume across multiple servers. Such a
system is good for controlling and running fundamental business tasks.
• OLAP is a centralized system that helps with planning, problem solving, and decision support.
Queries are often very complex and the transaction volume is relatively low.
• Database tables are always in a normalized structure.
• Data Warehouse tables are always in a de-normalized structure.
• Normalized database - designed in such a way that the same column data is not repeated;
in simple words, there is no redundant data. Here you will find lots of joins using
foreign keys and primary keys.
• De-normalized database - contains more repeated data, which makes database management
easier. There is little or no need for joins using foreign keys and primary keys.
• Normalized database - the many joins tend to affect query performance.
• De-normalized database - fewer joins (or sometimes no joins at all) improve query
performance, as the sketch below illustrates.
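As an illustration, a minimal sketch in SQL (the table and column names here are hypothetical):

-- Normalized design: customer attributes live in one table; orders reference them by key
SELECT o.order_id, c.customer_name, c.city
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- De-normalized design (typical of a warehouse): the same attributes are repeated on every row, so no join is needed
SELECT order_id, customer_name, city
FROM sales_fact_denormalized;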
What is ETL?
➢ ETL stands for Extract, Transform and Load, which is a process used to collect data from
various sources, transform the data depending on business rules/needs and load the data
into a destination database.
➢ It is a process in data warehousing to extract data, transform data and load data into the
final target.
➢ The need to use ETL arises from the fact that in modern computing business data resides
in multiple locations and in many incompatible formats.
For example, business data might be stored on the file system in various formats (Word docs,
PDF, spreadsheets, plain text, etc.), or stored as email files, or kept in various database
servers such as MS SQL Server, Oracle and MySQL. Handling all this business
information efficiently is a great challenge, and ETL plays an important role in solving this problem.
➢ An ETL tool extracts the data from different RDBMS source systems, transforms the data
(applying calculations, concatenations, etc.) and then loads the data into the Data Warehouse
system.
➢ The data is loaded in the DW system in the form of dimension and fact tables.
➢ ETL (Extract, Transform, Load) is an automated process which takes raw data, extracts the
information required for analysis, transforms it into a format that can serve business
needs, and loads it to a data warehouse.
➢ ETL typically summarizes data to reduce its size and improve performance for specific
types of analysis.
➢ When you build an ETL infrastructure, you must first integrate data from a variety of
sources. Then you must carefully plan and test to ensure you transform the data correctly.
➢ This process is complicated and time-consuming.
When do we need ETL Testing?
➢ ETL is commonly associated with Data Warehousing projects but in reality, any form of
bulk data movement from a source to a target can be considered ETL.
➢ Large enterprises often have a need to move application data from one source to another
for data integration or data migration purposes.
➢ ETL testing is a data centric testing process to validate that the data has been transformed
and loaded into the target as expected.
Transform:
➢ Once the data has been extracted and converted in the expected format, it’s time for the
next step in the ETL process, which is transforming the data according to set of business
rules. The data transformation may include various operations including but not limited
to filtering, sorting, aggregating, joining data, cleaning data, generating calculated data
based on existing values, validating data, etc.
➢ It is one of the important ETL concepts where you apply a set of functions to the extracted
data. Data that does not require any transformation is called direct move or pass-
through data.
➢ Transform data to DW (Data Warehouse) format
➢ Build keys – A key is one or more data attributes that uniquely identify an entity. Various
types of keys are primary key, alternate key, foreign key, composite key, surrogate key.
The data warehouse owns these keys and never allows any other entity to assign them.
➢ Cleansing of data: After the data is extracted, it moves into the next phase of cleaning
and conforming the data. Cleaning removes unwanted data and identifies and fixes errors.
Conforming means resolving conflicts between data that is incompatible, so that it can be
used in an enterprise data warehouse. In addition, this step creates metadata that is used
to diagnose source system problems and improve data quality.
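A minimal sketch of such transformation logic expressed in SQL (the staging table, the column names and the || concatenation operator are assumptions and vary by database):

SELECT
  TRIM(UPPER(cust_name))         AS cust_name_clean,  -- cleansing: trim spaces, standardize case
  COALESCE(phone, 'UNKNOWN')     AS phone,            -- handle missing values
  first_name || ' ' || last_name AS full_name,        -- derived/calculated column
  order_amount * 0.18            AS tax_amount        -- business-rule calculation
FROM staging_orders
WHERE order_amount > 0;                               -- filter out invalid records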
Load:
➢ The final ETL step involves loading the transformed data into the destination target, which
might be a database or data warehouse.
➢ Load data into DW (Data Warehouse)
➢ Build aggregates – Creating an aggregate means summarizing and storing data that is
available in the fact table in order to improve the performance of end-user queries.
➢ Types of Loading (a sketch of incremental load vs. full refresh follows this list):
o Initial Load — populating all the Data Warehouse tables.
o Incremental Load — applying ongoing changes periodically, as and when needed.
o Full Refresh — erasing the contents of one or more tables and reloading them with fresh
data.
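A minimal sketch of the last two load types in plain SQL (the dw_sales and staging_sales tables and the load_date column are hypothetical; real pipelines usually rely on an ETL tool or MERGE logic):

-- Incremental load: append only the rows that arrived since the last run
INSERT INTO dw_sales
SELECT * FROM staging_sales
WHERE load_date > (SELECT MAX(load_date) FROM dw_sales);

-- Full refresh: erase the table contents and reload with fresh data
TRUNCATE TABLE dw_sales;
INSERT INTO dw_sales
SELECT * FROM staging_sales;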
➢ Each step is performed sequentially. However, the exact nature of each step – which
format is required for the target database – depends on the enterprise’s specific needs
and requirements.
➢ Extraction can involve copying data to tables quickly to minimize the time spent querying
the source system.
➢ In the transformation step, the data is most usually stored in one set of staging tables as
part of the process.
➢ Finally, a secondary transformation step might place data in tables that are copies of the
warehouse tables, which eases loading.
➢ Each ETL stage requires interaction by data engineers and developers to deal with the
capacity limitations of traditional data warehouses.
ETL is used to load a data warehouse or data mart regularly (daily/weekly) so that it can serve its
purpose of facilitating business analysis, or to move data from files, XML or other sources to a big
data lake, data warehouse or data mart.
The 5 steps of the ETL process are: extract, clean, transform, load, and analyze. Of the 5, extract,
transform, and load are the most important process steps.
• Extract: Retrieves raw data from an unstructured data pool and migrates it into a
temporary, staging data repository
• Clean: Cleans data extracted from an unstructured data pool, ensuring the quality of the
data prior to transformation.
• Transform: Structures and converts the data to match the target destination
• Load: Loads the structured data into a data warehouse so it can be properly analyzed and
used
• Analyze: Big data analysis is processed within the warehouse, enabling the business to
gain insight from the correctly configured data.
What is ELT?
ELT stands for “extract, load, and transform” — the processes a data pipeline uses to replicate
data from a source system into a target system such as a cloud data warehouse.
❖ Extraction: This first step involves copying data from the source system.
❖ Loading: During the loading step, the pipeline replicates data from the source into the
target system, which might be a data warehouse or data lake.
❖ Transformation: Once the data is in the target system, organizations can run whatever
transformations they need. Often organizations will transform raw data in different ways
for use with different tools or business processes.
Benefits of ELT
The explosion in the types and volume of data that businesses must process can put a strain on
traditional data warehouses. Using an ETL process to manage millions of records in these new
formats can be time-consuming and costly. ELT offers a number of advantages over this approach.
Although it is still evolving, ELT offers the promise of unlimited access to data, less development
time, and significant cost savings. In these and other ways, the cloud is redefining data
integration.
ETL vs ELT:
As stated above, ETL = Extract, Transform, Load. ELT, on the other hand = Extract, Load,
Transform.
• According to IBM, “the most obvious difference between ETL and ELT is the difference in
order of operations.
• ELT copies or exports the data from the source locations, but instead of loading it to a
staging area for transformation, it loads the raw data directly to the target data store to
be transformed as needed.
While both processes leverage a variety of data repositories, such as databases, data
warehouses, and data lakes, each process has its advantages and disadvantages.
➢ ELT is particularly useful for high-volume, unstructured datasets as loading can occur
directly from the source.
➢ ELT can be more ideal for big data management since it doesn’t need much upfront
planning for data extraction and storage.
➢ The ETL process, on the other hand, requires more definition at the onset. Specific data
points need to be identified for extraction along with any potential “keys” to integrate
across disparate source systems.
➢ Even after that work is completed, the business rules for data transformations need to be
constructed.
➢ While ELT has become increasingly popular with the adoption of cloud databases, it has
its own disadvantages as the newer process, meaning that best practices are still being
established.”
The ETL and ELT approaches to data integration differ in several key ways.
• Load time — It takes significantly longer to get data from source systems to the target
system with ETL.
• Transformation time — ELT performs data transformation on-demand, using the target
system’s computing power, reducing wait times for transformation.
• Complexity — ETL tools typically have an easy-to-use GUI that simplifies the process. ELT
requires in-depth knowledge of BI tools, masses of raw data, and a database that can
transform it effectively.
• Data warehouse support — ETL is a better fit for legacy on-premise data warehouses and
structured data. ELT is designed for the scalability of the cloud.
• Maintenance — ETL requires significant maintenance for updating data in the data
warehouse. With ELT, data is always available in near real-time.
Both ETL and ELT processes have their place in today’s competitive landscape, and understanding
a business’ unique needs and strategies is key to determining which process will deliver the best
results.
1. ETL is the Extract, Transform, and Load process for data. ELT is the Extract, Load, and
Transform process for data.
2. In ETL, data moves from the data source to staging into the data warehouse.
3. ELT leverages the data warehouse to do basic transformations. There is no need for data
staging.
4. ETL can help with data privacy and compliance by cleaning sensitive and secure data even
before loading into the data warehouse.
5. ETL can perform sophisticated data transformations and can be more cost-effective than
ELT.
ETL vs. ELT, aspect by aspect:

Adoption of the technology and availability of tools and experts
ETL: ETL is a well-developed process used for over 20 years, and ETL experts are readily available.
ELT: ELT is a new technology, so it can be difficult to locate experts and more challenging to develop an ELT pipeline compared to an ETL pipeline.

Availability of data in the system
ETL: ETL only transforms and loads the data that you decide is necessary when creating the data warehouse and ETL process. Therefore, only this information will be available.
ELT: ELT can load all data immediately, and users can determine later which data to transform and analyze.

Can you add calculations?
ETL: Calculations will either replace existing columns, or you can append the dataset to push the calculation result to the target data system.
ELT: ELT adds calculated columns directly to the existing dataset.

Compatible with data lakes?
ETL: ETL is not normally a solution for data lakes. It transforms data for integration with a structured relational data warehouse system.
ELT: ELT offers a pipeline for data lakes to ingest unstructured data. It then transforms the data on an as-needed basis for analysis.

Compliance
ETL: ETL can redact and remove sensitive information before putting it into the data warehouse or cloud server. This makes it easier to satisfy GDPR, HIPAA, and CCPA compliance standards. It also protects data from hacks and inadvertent exposure.
ELT: ELT requires you to upload the data before redacting/removing sensitive information. This could violate GDPR, HIPAA, and CCPA standards. Sensitive information will be more vulnerable to hacks and inadvertent exposure. You could also violate some compliance standards if the cloud server is in another country.

Data size vs. complexity of transformations
ETL: ETL is best suited for dealing with smaller data sets that require complex transformations.
ELT: ELT is best when dealing with massive amounts of structured and unstructured data.

Data warehousing support
ETL: ETL works with cloud-based and onsite data warehouses. It requires a relational or structured data format.
ELT: ELT works with cloud-based data warehousing solutions to support structured, unstructured, semi-structured, and raw data types.

Hardware requirements
ETL: Cloud-based ETL platforms (like Xplenty) don't require special hardware. Legacy, onsite ETL processes have extensive and costly hardware requirements, but they are not as popular today.
ELT: ELT processes are cloud-based and don't require special hardware.

How are aggregations different?
ETL: Aggregation becomes more complicated as the dataset increases in size.
ELT: As long as you have a powerful, cloud-based target data system, you can quickly process massive amounts of data.

Implementation complexity
ETL: ETL experts are easy to procure when building an ETL pipeline. Highly evolved ETL tools are also available to facilitate this process.
ELT: As a new technology, the tools to implement an ELT solution are still evolving. Moreover, experts with the requisite ELT knowledge and skills can be difficult to find.

Maintenance requirement
ETL: Automated, cloud-based ETL solutions, like Xplenty, require little maintenance. However, an onsite ETL solution that uses a physical server will require frequent maintenance.
ELT: ELT is cloud-based and generally incorporates automated solutions, so very little maintenance is required.

Order of the extract, transform, load process
ETL: Data transformations happen immediately after extraction within a staging area. After transformation, the data is loaded into the data warehouse.
ELT: Data is extracted, then loaded into the target data system first. Only later is some of the data transformed on an “as-needed” basis for analytical purposes.

Costs
ETL: Cloud-based SaaS ETL platforms that bill with a pay-per-session pricing model (such as Xplenty) offer flexible plans that start at approximately $100 and go up from there, depending on usage requirements. Meanwhile, an enterprise-level onsite ETL solution like Informatica could cost over $1 million a year!
ELT: Cloud-based SaaS ELT platforms that bill with a pay-per-session pricing model offer flexible plans that start at approximately $100 and go up from there. One cost advantage of ELT is that you can load and save your data without incurring large fees, then apply transformations as needed. This can save money on initial costs if you just want to load and save information. However, financially strapped businesses may never be able to afford the processing power required to reap the full benefits of their data lake.

Transformation process
ETL: Transformations happen within a staging area outside the data warehouse.
ELT: Transformations happen inside the data system itself, and no staging area is required.

Unstructured data support
ETL: ETL can be used to structure unstructured data, but it can't be used to pass unstructured data into the target system.
ELT: ELT is a solution for uploading unstructured data into a data lake and making unstructured data available to business intelligence systems.

Waiting time to load information
ETL: ETL load times are longer than ELT because it's a multi-stage process: (1) data loads into the staging area, (2) transformations take place, (3) data loads into the data warehouse. Once the data is loaded, analysis of the information is faster than ELT.
ELT: Data loading happens faster because there's no waiting for transformations and the data only loads one time into the target data system. However, analysis of the information is slower than ETL.

Waiting time to perform transformations
ETL: Data transformations take more time initially because every piece of data requires transformation before loading. Also, as the size of the data system increases, transformations take longer. However, once transformed and in the system, analysis happens quickly and efficiently.
ELT: Since transformations happen after loading, on an as-needed basis, and you transform only the data you need to analyze at the time, transformations happen a lot faster. However, the need to continually transform data slows down the total time it takes for querying/analysis.
Have you ever heard the phrase, “garbage in, garbage out?” That phrase is more appropriate
than ever in today’s digital data landscape because it emphasizes how the quality of data is
directly related to accurate insights and better decision-making. And thus we have ETL – extract,
transform, load – to help ensure good data hygiene and add business value to the output. Typical ETL use cases include:
• Reconcile varying data formats to move data from a legacy system into modern
technology, which often cannot support the legacy format
• Sync data from external ecosystem partners, like suppliers, vendors, and customers
• Consolidate data from various overlapping systems acquired via merger and/or
acquisition
• Combine transactional data from a data store so it can be read and understood by
business users
The image shows the intertwined roles, tasks and timelines for performing ETL Testing with the
sampling method.
Responsibilities of an ETL Tester
❖ Writing ETL test cases—crafting SQL queries that can simulate important parts of the ETL
process.
❖ Verifying source system tables—displaying expertise in data sources, their meaning, and
how they contribute to the final data in the data warehouse.
❖ Applying transformation logic—ETL testers are qualified ETL engineers who can run and
design transformation scripts.
❖ Loading data—loading data into staging environments to enable testing.
❖ ETL tool testing—checking and troubleshooting the underlying ETL system.
❖ Testing end user reports and dashboards—ensuring that business analysts and data
scientists have a complete and accurate view of the data they need.
➢ When you load customer data into a data warehouse, for example, ETL can help you
engage in batch processing to derive the most value for your data warehouse.
➢ SQL stands for “Structured Query Language”, a language used to access information from a
variety of relational databases. By writing SQL commands, you can perform tasks such as
updates, data retrieval, deletion, and more.
➢ For instance, a SQL query might ask the database for the email addresses and the dates the
customer records were created for the first 100 customers in India, and pull that information
from the “customers” table.
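A minimal sketch of such a query (the column names email, created_date and country, and the LIMIT clause, are assumptions; the exact syntax for limiting rows varies by database):

SELECT email, created_date
FROM customers
WHERE country = 'India'
ORDER BY created_date
LIMIT 100;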
ETL can be described as the process of moving data from homogeneous or heterogeneous
sources to its target destination (most often a data warehouse). ETL encompasses three steps that
make it easy to extract raw data from multiple sources, process (or transform) it, and load it into
a data warehouse for analysis.
➢ Extract: to retrieve raw data from unstructured data pools, and migrate it into a
temporary, staging data repository;
➢ Transform: to structure, enrich, and convert raw data to match the target destination;
➢ Load: to load structured data into a data warehouse for analysis or to be leveraged by
Business Intelligence (BI) tools.
❖ For decades, data engineers have built or bought ETL pipelines to integrate diverse data
types into data warehouses for analysis, seamlessly.
❖ The goal here is simple – leveraging ETL makes the process of data analysis much easier
(especially when using online analytical processing or OLAP data warehouses).
❖ ETL is also essential when you want to transform online transactional processing
databases to work with an OLAP data warehouse quickly.
In recent years, it has become increasingly easy to perform a number of processes involved in
ETL, including exception handling, by taking advantage of user-friendly ETL tools that come with
robust graphical user interfaces.
The noticeable difference here is that SQL is a query language, while ETL is an approach to extract,
process, and load data from multiple sources into a centralized target destination.
Some businesses that have the capital to bear large data warehouse costs opt to use several
databases (customer details, sales records, recent transactions, etc.). In such cases, you’ll need a
tool to access, adjust, or add to the database as necessary. That’s where SQL comes in. With SQL you can:
➢ Create new tables, views, and stored procedures within the data warehouse
➢ Execute queries to ask and receive answers to structured questions
➢ Insert, update or delete records as needed
➢ Retrieve data and visualize the contents of the data warehouse
➢ Implement strict permissions to access tables, views, and stored procedures
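A minimal sketch of two of these tasks (the warehouse object names and the reporting_role are hypothetical):

-- Create a view over a warehouse table for reporting
CREATE VIEW v_monthly_sales AS
SELECT region, SUM(amount) AS total_amount
FROM sales_fact
GROUP BY region;

-- Grant read-only access to that view
GRANT SELECT ON v_monthly_sales TO reporting_role;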
❖ The modern data stack comes with a variety of tools, including ETL tools, and they use
SQL to read, write, and query warehouse data.
❖ SQL syntax can also be used to frame questions answered using a data warehouse.
❖ However, for the manipulation part of the process, for the most part, data needs to be
collected and interpreted before it’s moved to the target destination. That's where ETL
comes in.
❖ ETL uses batch or stream processing protocols to pull raw data, modify it according to
reporting requirements, and load the transformed data into a centralized data
warehouse.
❖ However, data scientists can query and analyze the information directly without having
to build complex ETL processes.
❖ So, we can say that ETL is used to get the data ready and acts as a data integration layer.
SQL is used when data needs to be manipulated and tweaked.
❖ ETL also helps companies implement data governance best practices and achieve data
democracy.
Example:
Let us assume there is a manufacturing company having multiple departments such as sales, HR,
Material Management, EWM, etc. All these departments have separate databases which they
use to maintain information w.r.t. their work and each database has a different technology,
landscape, table names, columns, etc. Now, if the company wants to analyze historical data and
generate reports, all the data from these data sources should be extracted and loaded into a Data
Warehouse to save it for analytical work.
An ETL tool extracts the data from all these heterogeneous data sources, transforms the data
(like applying calculations, joining fields, keys, removing incorrect data fields, etc.), and loads it
into a Data Warehouse.
➢ ETL testing is done to ensure that the data that has been loaded from a source to the
destination after business transformation is accurate.
➢ It also involves the verification of data at various middle stages that are being used
between source and destination.
➢ ETL stands for Extract-Transform-Load. ETL testing is a crucial part of ETL, because ETL is
typically performed on mission critical data.
ETL Testing Process
❖ Similar to other testing processes, ETL testing also goes through different phases. The different
phases of the ETL testing process are as follows:
Types of ETL Testing:
Types of Testing and their testing process:

Data Completeness Testing: Compare and validate counts, aggregates and actual data between the source and target for columns with simple transformation or no transformation.

Data Accuracy Testing: This testing is done to ensure that the data is accurately loaded and transformed as expected.

Data Transformation Testing: Testing of data transformation is done because in many cases it cannot be achieved by writing one source SQL query and comparing the output with the target. Multiple SQL queries may need to be run for each row to verify the transformation rules.

Data Quality Testing: Data quality tests include syntax and reference tests. Data quality testing is done in order to avoid any errors due to dates or order numbers during the business process. It includes number checks, date checks, precision checks, data checks, null checks, etc.

Incremental ETL Testing: This testing is done to check the data integrity of old and new data with the addition of new data. Incremental testing verifies that inserts and updates are processed as expected during the incremental ETL process.

GUI/Navigation Testing: This testing is done to check the navigation or GUI aspects of the front-end reports.
❖ ETL testing is a concept which can be applied to different tools and databases in the
information management industry.
❖ The objective of ETL testing is to assure that the data that has been loaded from a source
to destination after business transformation is accurate.
❖ It also involves the verification of data at various middle stages that are being used
between source and destination.
While performing ETL testing, two documents that will always be used by an ETL tester are
1. ETL mapping sheets: An ETL mapping sheet contains all the information about the
source and destination tables, including each and every column and their look-
ups in reference tables. ETL testers need to be comfortable with SQL queries,
as ETL testing may involve writing big queries with multiple joins to validate
data at any stage of ETL. ETL mapping sheets provide significant help while
writing queries for data verification.
2. DB Schema of Source, Target: It should be kept handy to verify any detail in
mapping sheets.
Test Scenario: Test Cases

Constraint Validation: Ensure the constraints are defined for the specific table as expected.

Data consistency issues:
1. The data type and length for a particular attribute may vary in files or tables even though the semantic definition is the same.
2. Misuse of integrity constraints.
3. Null, non-unique or out-of-range data.

Data Quality:
1. Number check: numbers need to be checked and validated.
2. Date check: dates have to follow the date format and it should be the same across all records.
3. Precision check.
4. Data check.
5. Null check.

Null Validate: Verify the null values where “Not Null” is specified for a specific column.

Duplicate Check:
1. Validate that the unique key, primary key and any other column that should be unique as per the business requirements do not have any duplicate rows.
2. Check whether any duplicate values exist in any column which is extracted from multiple columns in the source and combined into one column.
3. As per the client requirements, ensure that there are no duplicates in the combination of multiple columns within the target only.

Complete Data Validation:
1. To validate the complete data set between the source and target tables, a minus query is the best solution.
2. We need to run source minus target and target minus source.
3. If the minus query returns any rows, those should be considered mismatching rows.
4. Match the rows between source and target using an intersect statement.
5. The count returned by intersect should match the individual counts of the source and target tables.
6. If the minus query returns rows and the intersect count is less than the source count or the target count, then duplicate rows exist.
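A minimal sketch of the minus/intersect checks above (the table names are hypothetical; MINUS is Oracle syntax, other databases use EXCEPT):

-- Rows present in the source but missing or different in the target
SELECT cust_id, cust_name FROM src_customer
MINUS
SELECT cust_id, cust_name FROM tgt_customer_dim;

-- Rows that match in both; the count should equal the individual source and target counts
SELECT cust_id, cust_name FROM src_customer
INTERSECT
SELECT cust_id, cust_name FROM tgt_customer_dim;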
Type of Bugs: Description

Calculation bugs:
• Mathematical errors
• Final output is wrong

Version control bugs:
• No logo matching
• No version information available
• This occurs usually in Regression Testing
ETL Testing is different from application testing because it requires a data centric testing
approach. Some of the challenges in ETL Testing are –
• ETL Testing involves comparing large volumes of data, typically millions of records.
• The data that needs to be tested is in heterogeneous data sources (e.g., databases, flat
files).
• Data is often transformed which might require complex SQL queries for comparing the
data.
• ETL testing is very much dependent on the availability of test data with different test
scenarios.
Metadata Testing:
The purpose of Metadata Testing is to verify that the table definitions conform to the data model
and application design specifications.
Example: Data Model column data type is NUMBER but the database column data type is
STRING (or VARCHAR).
Example: Data Model specification for the ‘first_name’ column is of length 100 but the
corresponding database table column is only 80 characters long.
1. Verify that the columns that cannot be null have the ‘NOT NULL’ constraint.
2. Verify that the unique key and foreign key columns are indexed as per the
requirement.
3. Verify that the table was named according to the table naming convention.
Example 1: A column was defined as ‘NOT NULL’ but it can be optional as per the design.
Example 2: Foreign key constraints were not defined on the database table resulting in
orphan records in the child table.
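A minimal sketch of such a metadata check using the standard INFORMATION_SCHEMA views (the table name is hypothetical, and not every database exposes these views in the same way):

SELECT column_name, data_type, character_maximum_length, is_nullable
FROM information_schema.columns
WHERE table_name = 'CUSTOMER_DIM'
ORDER BY ordinal_position;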
❖ Metadata Naming Standards Check
➢ Verify that the names of the database metadata such as tables, columns, indexes are
as per the naming standards.
➢ Example: The naming standard for Fact tables is to end with an ‘_F’ but some of the
fact table names end with ‘_FACT’.
❖ Metadata Check Across Environments
➢ Compare table and column metadata across environments to ensure that changes
have been migrated appropriately.
➢ Example: A new column added to the SALES fact table was not migrated from the
Development to the Test environment resulting in ETL failures.
❖ Automate metadata testing with ETL Validator
➢ ETL Validator comes with Metadata Compare Wizard for automatically capturing and
comparing Table Metadata.
1. Track changes to Table metadata over a period of time. This helps ensure that the QA
and development teams are aware of the changes to table metadata in both Source
and Target systems.
2. Compare table metadata across environments to ensure that metadata changes have
been migrated properly to the test and production environments.
3. Compare column data types between source and target environments.
4. Validate Reference data between spreadsheet and database or across environments.
Data Accuracy:
➢ In ETL testing, data accuracy testing is used to ensure that data is accurately loaded to the target
system as per the expectation. The key steps in performing data accuracy testing are as follows:
Value Comparison
❖ Value comparison involves comparing the data in source and target system with minimum
or no transformation.
❖ It can be done using various ETL Testing tools, for example, Source Qualifier
Transformation in Informatica.
❖ Some expression transformations can also be performed in data accuracy testing.
❖ Various set operators can be used in SQL statements to check data accuracy in the source
and the target systems. Common operators are Minus and Intersect operators.
❖ Rows returned by a Minus query can be considered deviations in value between the target and
the source system.
❖ Critical data columns can be checked by comparing distinct values in the source and the
target systems.
Here is a sample query of the kind that can be used to check critical data columns:
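A minimal sketch, assuming a critical column country in the source customer table and a corresponding country_cd in the target customer_dim (MINUS is Oracle syntax; other databases use EXCEPT):

SELECT DISTINCT country FROM customer
MINUS
SELECT DISTINCT country_cd FROM customer_dim;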
❖ The purpose of Data Quality tests is to verify the accuracy of the data.
❖ Data profiling is used to identify data quality issues and the ETL is designed to fix or handle
this issue.
❖ However, source data keeps changing and new data quality issues may be discovered
even after the ETL is being used in production.
❖ Automating the data quality checks in the source and target system is an important aspect
of ETL execution and testing.
❖ Look for duplicate rows with same unique key column or a unique combination of
columns as per business requirement.
Example: Business requirement says that a combination of First Name, Last Name, Middle
Name and Date of Birth should be unique.
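A minimal sketch of such a duplicate check (assuming a target table customer_dim with these columns):

SELECT first_name, last_name, middle_name, date_of_birth, COUNT(*) AS cnt
FROM customer_dim
GROUP BY first_name, last_name, middle_name, date_of_birth
HAVING COUNT(*) > 1;  -- any row returned is a duplicated combination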
❖ Many database fields can contain a range of values that cannot be enumerated. However,
there are reasonable constraints or rules that can be applied to detect situations where
the data is clearly wrong.
❖ Instances of fields containing values violating the validation rules defined represent a
quality gap that can impact ETL processing.
Example: Date of birth (DOB). This is defined as the DATE datatype and can assume any valid
date. However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also,
the date of birth of a child should not be earlier than that of their parents.
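A minimal sketch of such a validation rule (the table and column names are hypothetical, and date arithmetic syntax varies by database):

SELECT cust_id, date_of_birth
FROM customer_dim
WHERE date_of_birth > CURRENT_DATE                                               -- DOB in the future
   OR EXTRACT(YEAR FROM date_of_birth) < EXTRACT(YEAR FROM CURRENT_DATE) - 100;  -- roughly more than 100 years in the past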
❖ The goal of these checks is to identify orphan records in the child entity with a foreign key
to the parent entity.
1. Count of records with null foreign key values in the child table.
2. Count of invalid foreign key values in the child table that do not have a corresponding
primary key in the parent table.
Example: In a data warehouse scenario, fact tables have foreign keys to the dimension tables. If
an ETL process does a full refresh of the dimension tables while the fact table is not refreshed,
the surrogate foreign keys in the fact table are not valid anymore. “Late arriving dimensions” is
another scenario where a foreign key relationship mismatch might occur because the fact record
gets loaded ahead of the dimension record.
SELECT cust_id FROM sales
MINUS
SELECT s.cust_id FROM sales s, customers c WHERE s.cust_id = c.cust_id
ETL Validator comes with Data Rules Test Plan and Foreign Key Test Plan for automating the
data quality testing.
1. Data Rules Test Plan: Define data rules and execute them on a periodic basis to check for
data that violates them.
2. Foreign Key Test Plan: Define data joins and identify data integrity issues without writing
any SQL queries.
➢ The purpose of Data Completeness tests is to verify that all the expected data is loaded
in target from the source.
➢ Some of the tests that can be run are: Compare and Validate counts, aggregates (min,
max, sum, avg) and actual data between the source and target.
❖ Compare count of records of the primary source table and target table. Check for any
rejected records.
Example: A simple count of records comparison between the source and target tables; a minimal
sketch follows, assuming a source table customer and a target table customer_dim.
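Source Query
SELECT COUNT(*) FROM customer;
Target Query
SELECT COUNT(*) FROM customer_dim;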
❖ Column or attribute level data profiling is an effective tool to compare source and target
data without actually comparing the entire data.
❖ It is similar to comparing the checksum of your source and target data. These tests are
essential when testing large amounts of data.
Some of the common data profile comparisons that can be done between the source and target
are:
Example 1: Compare column counts with values (non-null values) between source and target
for each column based on the mapping.
Source Query
SELECT count(row_id), count(fst_name), count(lst_name), avg(revenue) FROM customer
Target Query
SELECT count(row_id), count(first_name), count(last_name), avg(revenue) FROM customer_dim
Example 2: Compare the number of customers by country between the source and target.
Source Query (a minimal sketch, assuming a source table customer with a country column)
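SELECT country, COUNT(*) FROM customer GROUP BY country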
Target Query
SELECT country_cd, count(*) FROM customer_dim GROUP BY country_cd
❖ Compare data (values) between the flat file and target data effectively validating 100% of
the data.
❖ In regulated industries such as finance and pharmaceutical, 100% data validation might
be a compliance requirement.
❖ It is also a key requirement for data migration projects. However, performing 100% data
validation is a challenge when large volumes of data are involved.
❖ This is where ETL testing tools such as ETL Validator can be used, because they have an
inbuilt ELV engine (Extract, Load, Validate) capable of comparing large volumes of data.
Example: Write a source query that matches the data in the target table after transformation.
Source Query
SELECT cust_id, fst_name, lst_name, fst_name||','||lst_name, DOB FROM Customer
Target Query
SELECT integration_id, first_name, Last_name, full_name, date_of_birth FROM Customer_dim
ETL Validator comes with Data Profile Test Case, Component Test Case and Query Compare Test
Case for automating the comparison of source and target data.
1. Data Profile Test Case: Automatically computes profile of the source and target query
results – count, count distinct, nulls, avg, max, min, max length and min length.
2. Component Test Case: Provides a visual test case builder that can be used to compare
multiple sources and target.
3. Query Compare Test Case: Simplifies the comparison of results from source and target
queries.
Sample Interview Questions: