Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & The Solution
Prepared by
datagaps
http://www.datagaps.com
http://www.youtube.com/datagaps
http://www.twitter.com/datagaps
Contact
[email protected]
Contributors
Narendar Yalamanchilli
Adi Buddhavarapu
THE PROBLEM
Data warehouse projects often involve extracting large volumes of data from
multiple data sources before transforming and loading it. Testing a data
warehouse requires a data-centric testing approach, which differs from
testing traditional transactional applications. Most current test automation
tools are focused on transactional application testing and are not designed
for testing data warehouses. Hence it is common practice for Business
Analysts, Engineers, and QA teams to test the ETL manually by copying a
handful of records from source and target tables into an Excel spreadsheet
and comparing the data. This is a tedious and error-prone process when done
manually for a large number of ETL jobs. Testing is further complicated
because the ETL process can run in either incremental or full mode, often
making it necessary to create separate sets of test cases for the two modes.
Every incremental change to the source or target system increases the scope
for regression, and these manual tests have to be repeated every time.
These issues often result in an increased cost of ownership, inaccurate
insights due to bad data and reports, and a loss of trust in enterprise data
and reporting capabilities. The sections below identify the kinds of issues
we see in each area and the best practices to address them.
We believe there are three areas to cover to ensure proper data testing.
First, testing has to be done at the level of each data entity to ensure
that the data coming into the warehouse is as expected.
Second, as data moves from the various sources into the warehouse, we need
to ensure that it arrives exactly as expected.
Finally, the reporting system itself has to be tested for accuracy,
regression, performance, and scalability.
Issues with Data Entities
Data can enter an enterprise from various sources such as transactional
systems, XML files, flat files, and web services, and it is possible for data
to take unexpected forms at this very first step. A few examples are below
(a sketch of SQL checks for some of them follows the list):
1. An application does not have proper data validation and thus allows
inaccurate data to be entered into the system:
a. Invalid email addresses that are missing an @ or do not follow the
email address pattern
b. A column that stores country names has values that do not fall in
the encoded list of values defined for it
c. SSN is a required data attribute but has null values or fewer than
9 digits
d. Orphan order item (child) records persist even though the
corresponding order (parent) record has been deleted
e. Duplicate records are created in the system for the same customer
f. Start and end dates are inconsistent because the end date is
earlier than the start date
g. The address entered by the customer is not a valid one, or the
postal code is null for US addresses
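Many of these checks can be expressed directly as SQL data rules. Below is a
minimal sketch of such checks; the table and column names (customers, orders,
order_items, email, ssn) are hypothetical, and string functions such as
LENGTH vary slightly between databases.

-- (a) Email addresses that are null or do not match a basic pattern
SELECT customer_id, email
FROM customers
WHERE email IS NULL
   OR email NOT LIKE '%_@_%._%';

-- (c) Required SSN that is null or shorter than 9 characters
SELECT customer_id, ssn
FROM customers
WHERE ssn IS NULL
   OR LENGTH(ssn) < 9;

-- (d) Orphan order items whose parent order has been deleted
SELECT oi.order_item_id
FROM order_items oi
LEFT JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_id IS NULL;

-- (e) Duplicate customer records on name + address
SELECT customer_name, address, COUNT(*) AS record_count
FROM customers
GROUP BY customer_name, address
HAVING COUNT(*) > 1;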
Prior to moving this data into other systems within the enterprise, it is
important to identify such discrepancies early on. Current Data Quality tools
focus primarily on data discovery through profiling and on data cleansing;
they do not provide a good means of measuring data quality through
user-defined data rules. These tools are also often very expensive and
difficult to implement.
Issues with Data Warehouses
Data warehouse (and data integration) projects typically involve the
movement of large volumes of data from multiple data sources.
BEST PRACTICES
Testing data warehouses and business intelligence applications requires a
data-centric testing approach. This section lists the types of tests and the
strategy or best practice for performing each.
Data Validity: Are the values within the acceptable subset of an encoded list?
Strategy: For each attribute with a defined list of values, check the data
entity for values that are not in the list of valid values.

Data Consistency: Are there any unexpected duplicates?
Strategy: Write queries that check for duplicates based on the combinations
of columns that are expected to be unique.

Referential Integrity: Are there any orphan records or missing foreign keys?
Strategy: Identify the parent-child and many-to-many relationships between
data entities and write queries to identify inconsistencies.

Data Validity: Is the data as per business rules? For example: the email
column does not contain valid email addresses; the birth date has values
older than 200 years; the postal code has non-numeric values for US
addresses; SSN has null values for some records; the combination of customer
name and address columns is unique.
Strategy: For each attribute, run SQL queries to check the data. This may
involve using regular expressions.

Data Completeness: Are the counts matching between the source and the
target? For example, source records can be missing in the target due to a
bad filter, rejected records, or wrong data types.
Strategy: While it is ideal to compare the entire source and target data,
this is often a very time-consuming and resource-intensive process, so
comparing the counts is a good strategy. It will also help identify
duplicates introduced during the ETL (see the count-comparison sketch after
this table).

Data Transformations: How to test an expression transformation? For example,
are nulls replaced appropriately?
Strategy: Write a SQL query that applies the expected transformation to the
source data and compare the result with the data in the target table (see
the sketch after this table).

Data Transformations: My transformations are too complex to test using
single source and target queries.
Strategy: Set up test data in the source for the different conditions.
Verify the data in the target against expected-result queries.

Data Transformations: How to perform regression testing of transformations?
Strategy: Benchmark the target table records for data that is not expected
to change (e.g., 10,000 records) and compare them after the ETL change.
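To make the completeness and transformation strategies above concrete, here
is a minimal sketch in SQL. The source and target table names (src_orders,
dw_orders) are hypothetical, the queries assume both tables are reachable
from a single connection (for example, via a database link), and the sketch
assumes an ETL rule that replaces null regions with 'UNKNOWN'. Note that
EXCEPT is called MINUS on Oracle.

-- Data completeness: compare row counts between source and target.
SELECT 'source' AS side, COUNT(*) AS row_count FROM src_orders
UNION ALL
SELECT 'target' AS side, COUNT(*) AS row_count FROM dw_orders;

-- Data transformation: apply the expected transformation to the source
-- and diff it against the target; any rows returned are mismatches.
SELECT order_id, COALESCE(region, 'UNKNOWN') AS region
FROM src_orders
EXCEPT
SELECT order_id, region
FROM dw_orders;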
Business Intelligence Testing
While data warehouse (ETL) testing is essential for ensuring that the data
shown in reports is correct, issues in the BI tool's metadata model can
still cause that data to be reported wrongly. Hence the primary focus of BI
testing is to ensure that the numbers and data shown in reports are correct
when compared to the source, and that the reports are accessible to end
users from a performance and security standpoint. Listed below are some of
the common test scenarios and strategies.
Unit Testing: Dashboard prompts show wrong data, or they are not applied to
reports appropriately.
Strategy: For each prompt in a dashboard page, check that the list of values
shown is correct where applicable. Apply all the prompts and verify the
report and the corresponding physical SQL. Verify that all prompts are
applied appropriately to the drilldowns.

Unit Testing: A dashboard page does not conform to UI standards.
Strategy: Verify that the report layout, prompts, titles, download options,
and filter display meet the design.

Unit Testing: The SQL query generated by the BI Server for a report is
complex and shows wrong data (numbers).
Strategy: Write your own SQL query against the target database and compare
its output with the report data.

Unit Testing: An RPD change for a new subject area changes the join
conditions for an existing report.
Strategy: Compare the SQL query generated by the report before and after
making the changes to the RPD.

Unit Testing: Mismatches between summary and detail report data.
Strategy: Aggregate the detail report output and compare it with the summary
report data (a SQL sketch of this reconciliation follows the table). Check
the links to the detail report from charts, data, and table headings in the
summary report.

Unit Testing: A defect in the new ETL results in the corresponding report
showing wrong data.
Strategy: Benchmark the report output and compare against it. Another option
is to compare the data shown in reports with the output of a query against
the source database.

Unit Testing: A webcat merge from the local desktop to the development
environment is unsuccessful.
Strategy: Compare the report data between the local and development
environments.

Unit Testing: How can we test ad hoc reporting when the number of possible
test cases is unlimited? While it is not practical to test all combinations
of ad hoc reports, is there a method for testing them?
Strategy: 1) For each dimension folder, pick all attributes and run reports;
the objective is to check whether any attribute is not mapped properly. Also
check the dimension hierarchy. 2) Run a report picking one attribute from
each dimension and one measure to verify the joins. Watch for any unwanted
joins; a complex query is usually a sign of a bad model or design.
3) Verify that the list of measures and their names are meaningful.

Unit Testing: An RPD merge results in duplicate attributes.
Strategy: Compare the list of presentation layer attributes before and after
the merge.

Regression Testing: A developer knowingly or unknowingly makes a dashboard
UI change that the business analyst/tester is not aware of.
Strategy: Take screenshots of the dashboard pages before and after the
changes and compare them.

Regression Testing: Some dashboard pages suddenly start running slowly after
a recent release or system change.
Strategy: Benchmark and compare the dashboard page run time and the
corresponding SQL query execution plan before and after the change.

Regression Testing: An ETL change results in wrong measure values in the
reports after the full ETL run.
Strategy: Benchmark and compare the report data and charts before and after
the change.

Regression Testing: A name change of an attribute in the presentation layer
adversely impacts user reports in production.
Strategy: Search for impacted reports in production usage tracking data or
search the web catalog.

Regression Testing: It is difficult to prioritize which reports or ad hoc
requests to regression test.
Strategy: Regression test the most commonly used reports or ad hoc requests
based on production usage tracking data.

Regression Testing: New reports or dashboard pages run very slowly and do
not meet SLAs.
Strategy: Capture dashboard page run times after disabling caching.
Alternatively, capture average run times by running the pages multiple times
at different times.

Regression Testing: A dashboard page runs fine for some roles but is very
slow for other roles (e.g., CEO).
Strategy: Capture dashboard page run times with caching disabled for users
with different roles.

Regression Testing: Drilldown on hierarchy columns is very slow.
Strategy: Verify the drilldown time for each hierarchy column with caching
disabled.

Multi-Environment: A recent webcat migration results in access issues due to
which some users cannot access their reports.
Strategy: Compare report access for key users with different roles across
environments after the migration.

Multi-Environment: Uptake of a new BI release (e.g., 10g to 11g, or 11.1.1.5
to 11.1.1.7) results in report data, security, and performance issues.
Strategy: Set up pre-upgrade and post-upgrade environments pointing to the
same database and compare report data and performance between the two
environments.

Security: Users see data in reports that they are not supposed to have
access to.
Strategy: Test data security by comparing the data for the same report
between two users with different roles.

Security: Users lose access to reports and dashboards that they previously
had access to.
Strategy: Verify access to reports and dashboards for users with different
roles before and after the security changes or webcat migration.

Stress Testing: Dashboard pages start running slowly after a rollout to new
users or the addition of new data.
Strategy: Stress test by simulating concurrent users and compare dashboard
page performance to see whether the SLA requirements are met.

Stress Testing: A software or hardware upgrade adversely impacts dashboard
page performance.
Strategy: Stress test by simulating concurrent users and compare dashboard
page performance to see whether the SLA requirements are met.

Stress Testing: Reports run fine in test environments but slowly in
production.
Strategy: Simulate a realistic user load in a test environment based on
production usage tracking data.
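As one example of the unit-testing strategies above, the summary-versus-
detail reconciliation can be scripted once the two report outputs have been
exported into staging tables. The table and column names below
(detail_report_export, summary_report_export, region, revenue) are
hypothetical.

-- Aggregate the detail report output and diff it against the summary.
-- Run the comparison in both directions (swap the two queries) to catch
-- rows present on only one side; empty results both ways mean the
-- reports reconcile.
SELECT region, SUM(revenue) AS revenue
FROM detail_report_export
GROUP BY region
EXCEPT
SELECT region, revenue
FROM summary_report_export;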
TESTING METRICS
Data Rules
User            0     0      12
Customer        5     10%    25
Sales           12    75%    15
Sales-Customer  0     0
Sales-Product   10    1%
Target Table   Data Integrity   Data Completeness   Data Consistency   Data Transformation   Total
               Tests            Tests               Tests              Tests
Costs          1                1                   0                  0                     2
SalesRep       0                0                   0                  0                     0
Customer       0                1                   1                  0                     2
Sales          1                0                   0                  0                     1
THE SOLUTION
Datagaps offers two solutions that together enable comprehensive data
warehouse and business intelligence testing automation. Testing teams,
Business Analysts, and Engineers can leverage them to minimize testing
costs, expedite time to market, and deliver high-quality dashboards and
reports to business users. Details of the solutions are below:
ETL Validator: This primarily addresses Data Entity Testing and ETL Testing. Key
features include:
Wizard driven test case creation
Enterprise grade test case visibility
Enterprise collaboration
Flat file testing automation
Data entity testing automation
BI Validator: This addresses BI Testing Automation. Key features include:
Dashboard testing
Report testing
Multi Environment testing
Upgrade testing
Stress testing
Data sheets for both products are available on our website for your reference.