Testing A Data Warehouse
Restructuring databases and streamlining data warehouses (DWH) have become integral requirements in every
organization. Managers now realize the need to study their business, scrutinize their data, and optimize available
information to their advantage, in order to stay competitive.
Business information is available in many forms, but mostly in repositories of unstructured data. While data warehousing projects are on the rise, testing plays a significant role in determining the success of each project, evaluating data throughout the process to ensure it meets specified business needs and the scope of work. Two key challenges dominate data warehousing projects: increased complexity and the sheer volume of data. [2], [4]
This article addresses the following DWH testing topics:
The Challenges
Test Planning
Tester Skills
ETL Test Scenarios
To ensure a methodical analysis of the end result, businesses should focus on testing the following areas of the DWH:
Data completeness and quality
Data transformations, source to target
Referential integrity of DWH facts and dimensions
Defect management
Performance and scalability
Integration testing
User-acceptance testing
Regression testing
Adherence to compliance standards
Planning for such data-intensive tests is complex and requires exceptional human and technical resources, but it is vital for data quality and the overall success of your project. How do you achieve the desired success? Through test planning that includes a study of technical artifacts such as data models, business rules, data mapping documents, and DWH loading design logic.
From this reading you can expect to take away DWH planning checklists, a variety of testing scenarios, concepts for data profiling, and methods for data verification.
The central area of focus for DWH testing is verifying the quality and completeness of data. Data completeness testing ensures that all expected records from the source are loaded into the target database, reconciling against error and reject records.
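As a simple illustration, a completeness check can reconcile row counts across the load. The sketch below assumes hypothetical staging, target, and reject tables (stg_orders, tgt_orders, rej_orders); substitute the tables of your own ETL flow.

    -- Source rows should equal target rows plus rejected rows;
    -- unaccounted_rows should therefore be zero.
    SELECT
        (SELECT COUNT(*) FROM stg_orders) AS source_rows,
        (SELECT COUNT(*) FROM tgt_orders) AS target_rows,
        (SELECT COUNT(*) FROM rej_orders) AS rejected_rows,
        (SELECT COUNT(*) FROM stg_orders)
          - (SELECT COUNT(*) FROM tgt_orders)
          - (SELECT COUNT(*) FROM rej_orders) AS unaccounted_rows;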
Best DWH QA practices encompass multiple disciplines that are designed to validate data completeness, data integrity,
business rule implementations and transformations, database ETL functions, and performance. See Figure 1 below.
What data should be used for testing? This question often arises very early in the project and is part of the test data acquisition activity. Data sources are often the operational systems (OLTP), providing the lowest level of data. Source data can come from DB sources, front-end user applications, legacy feeders, modular data models, or files (flat files, .xls, ASCII, EBCDIC, etc.). Multiple data sources introduce a large number of issues, such as semantic conflicts and problems with data capture and synchronization. So the biggest testing challenge here is to understand the different sources of data and to check whether all data from these heterogeneous systems is properly fed into the target tables. [5]
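One way to check that every source record reached the target is a set-difference query. The sketch below uses hypothetical src_orders and tgt_orders tables keyed by order_id; note that Oracle uses MINUS where the SQL standard uses EXCEPT.

    -- Business keys present in the source feed but missing from the target.
    -- An empty result indicates this source was fed completely.
    SELECT order_id FROM src_orders
    EXCEPT
    SELECT order_id FROM tgt_orders;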
Is my data accurate? Once the data acquisition phase is complete, the next step is often data cleansing and filtering. Data cleansing is the process of removing unwanted data; after cleansing, the data becomes usable and ready to be fed to other work areas. The main testing challenge here is to check the data and ensure that no corrupt records slip through. In addition, QA often validates field values against known lists of entities and makes sure that the first level of transformation rules is correctly implemented.
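Validating field values against a known list of entities is often a simple anti-join. A minimal sketch, assuming a hypothetical tgt_customers table and a ref_country_codes reference list:

    -- Rows whose country_code does not appear in the reference list.
    SELECT t.customer_id, t.country_code
    FROM tgt_customers t
    LEFT JOIN ref_country_codes r
           ON r.country_code = t.country_code
    WHERE r.country_code IS NULL;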
Are the expected data transformations completed? Data transformation is the process of mapping source data to destination data using the business transformation logic. Mappings include one-to-one look-ups, switch cases, database logic, combinations, truncating, defaulting, and null processing. End-to-end testing of the data flow is very important to ensure accuracy, completeness (no missing or invalid data), and consistency of the transformed data. Modeling business concepts in technical terms is the major testing challenge, so every business rule should be validated against its business objective.
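One way to validate a transformation rule is to re-derive the expected target value from the source and flag mismatches. The sketch below assumes a hypothetical status-code mapping between src_orders and tgt_orders; in practice the CASE logic would come from the data mapping document.

    -- Flag rows where the loaded status_desc disagrees with the mapping rule.
    SELECT s.order_id, s.status_code, t.status_desc
    FROM src_orders s
    JOIN tgt_orders t ON t.order_id = s.order_id
    WHERE t.status_desc <>
          CASE s.status_code
               WHEN 'O' THEN 'Open'
               WHEN 'C' THEN 'Closed'
               ELSE 'Unknown'   -- illustrative default / null-processing rule
          END;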
Are all desired targets loaded correctly? The data load phase moves data into the end target, usually the DWH. Depending on the business requirement, data loads can be full or incremental, and loading can take place daily, weekly, or monthly; testing in this phase needs to assure that the correct data is loaded within the defined window. The loads into the dimension and fact tables present the final reporting picture of the warehouse. Business Intelligence (BI), being the user-facing area, is very important, and the facts act as feeders for the reports. Testing should further verify that data flows properly from the data sources, through staging, into the data warehouse.
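Because the facts feed the reports, a load-phase test should confirm that every fact row joins to its dimensions. A minimal sketch, assuming hypothetical fact_sales and dim_customer tables joined on customer_key:

    -- Fact rows with no matching dimension row ("orphan" facts);
    -- any result signals a referential integrity failure in the load.
    SELECT f.customer_key, COUNT(*) AS orphan_rows
    FROM fact_sales f
    LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
    GROUP BY f.customer_key;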
Following are other examples of DWH challenges that should be addressed when DWH test planning is under way:
Inadequate ETL design documents are available for test planning; for example, data models and mapping or detailed design documents are missing or not current.
Source table field values are unexpectedly null or do not meet data mapping specifications; source data profiling was not completed early in the project.
Target DWH table field values are unexpectedly null or do not meet data mapping specifications; data profiling was not conducted after each load to the DWH.
Source data to populate the DWH is dirty -- that is, not fully in compliance with expectations and requirements.
Excessive ETL errors may be discovered after entry to QA; developer unit testing was minimal, or the ETL was developed without the aid of ETL tools such as Informatica, SSIS, etc.
Issues with the source-to-target data model and mappings: for example, a) not reviewed / approved by project stakeholders, b) not consistently maintained through the development lifecycle.
Huge source data volumes and varied data types: due to loading months or years of historical data from a variety of sources, important data may be missing or may not meet requirements over the entire timeframe.
Field constraints and SQL not coded correctly for the Informatica ETL.
ETL logs and messages to be acted upon can be excessive in early phases.
Business and IT subject matter experts may not be available to support QA.
The following project artifacts are key inputs to DWH test planning:
Business requirements documentation: testers use it to develop the project test strategy and high-level test scenarios, and to notify BAs and others of potential quality issues.
Data models for source and DWH target schemas: the QA team gains an understanding of primary and foreign keys and of data flows from source to target.
ETL or stored procedure design and logic documents: testers can develop grey-, white-, and black-box test cases to verify the end-to-end ETL process.
Production and QA deployment tasks: the QA team develops test scenarios that address the physical deployment and load of the DWH.
Required QA skills, methods, and tools: the multiple phases of the ETL process require a variety of skills and tools, ranging from data comparison to data profiling to performance testing and automation. Figure 2 illustrates the high-level activities and the implied testing that must be planned.
Figure 2: Verifying the ETL process. Graphic courtesy of RTTS, QuerySurge ETL tool [6]
When candidate data sources are identified and finalized, data profiling should be planned, then implemented immediately on that source data. Data profiling is the examination and assessment of your source systems' data quality, integrity, and consistency -- sometimes known as source systems analysis. Data profiling is fundamentally important, yet often ignored; as a result, data warehouse quality can be significantly compromised. [1]
The problems uncovered in this analysis can then be reviewed by business analysts to determine their root causes. A single analyst can surface many issues for study, leading to a great many data quality problems being corrected.
At the beginning of a data warehouse project, as soon as a candidate data source is identified, a (quick) data profiling
assessment should be conducted to provide a go/no-go decision about proceeding with the project. Table 3 depicts
the possible causes of data quality degradation at the data profiling stage of data warehousing.
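In practice, much of a quick profiling assessment reduces to a handful of aggregate queries per candidate column. A sketch, assuming a hypothetical src_customers table and birth_date column:

    -- Volume, null rate, cardinality, and range for one source column.
    SELECT COUNT(*)                                            AS row_count,
           SUM(CASE WHEN birth_date IS NULL THEN 1 ELSE 0 END) AS null_count,
           COUNT(DISTINCT birth_date)                          AS distinct_values,
           MIN(birth_date)                                     AS min_value,
           MAX(birth_date)                                     AS max_value
    FROM src_customers;

Unexpected null rates, out-of-range dates, or implausibly low cardinality at this stage are classic triggers for a no-go decision.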
Identifying Those Crucial Tester Skills
The data warehouse testing lead and hands-on testers are expected to demonstrate extensive experience designing, planning, and executing database and data warehouse testing strategies and tactics to ensure DWH quality throughout all stages of the ETL lifecycle.
Testing the data warehouse through all phases can be summarized as follows: the QA lead should ensure that test cases are prepared for each of the basic requirement verifications associated with data warehouse testing (examples below). This checklist represents central issues that surface during DWH testing -- or, when not tested thoroughly, the same issues that may arise when DWH data is used by applications in production.
Verify data mappings, source to target before and after ETLs begin
Verify that all table fields were loaded from source to staging
Verify that primary and foreign keys were properly generated using sequence generator
Verify that not-null fields were populated in all target DWH objects
Verify no data was truncated in each field
Verify field lengths, data types and data formats are as specified in the design phase
Verify no duplicate records in individual target tables (a sample check follows this list)
Verify data transformations were applied based on business rules
Verify that numeric fields are populated with correct precision
Verify that every ETL session completed with only planned exceptions
Verify all cleansing, error and exception handling were implemented as planned
Verify ETL data calculations and aggregations
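As an example of one checklist item, the duplicate-record verification can be a GROUP BY over the natural key. The sketch assumes a hypothetical tgt_customers table keyed by customer_id:

    -- Natural keys that appear more than once in the target table;
    -- an empty result verifies the no-duplicates requirement.
    SELECT customer_id, COUNT(*) AS occurrences
    FROM tgt_customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1;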
Figure 3 illustrates a variety of testing and tools that should be considered during the DWH project lifecycle.
Figure 3: Testing methods to support DWH development process. [3]
TESTING
Validate data acquisition:
- Compare schema and data between sources and staged data (row counts, new and missing data, mismatched data, missing or invalid constraints, e.g., primary / foreign keys)
- Tune performance of data staging jobs
Validate data integration:
- Validate business logic and transformation logic
- Compare schema and data between staged data and that loaded in the DWH
- Validate the dimension model
- Tune performance of data loading jobs
Validate data mart design:
- Compare data between the DWH / ODS and the data marts with SQL queries
- Tune performance of data mart access
Validate data on reports:
- Validate report filters and drill downs
- Tune performance of data retrieval to reports
Although there are many non-functional requirements (NFRs) surrounding the performance of the ETL process and of report responses, it can be helpful to follow these guidelines (a sample load-window check follows the list):
Execution with peak production volume to check for completion of the ETL process within the agreed window
Analysis of ETL loading times, with a smaller amount of data, to gauge scalability issues
Verification of ETL processing times, component by component, to identify areas for improvement
Shutdown of the server during ETL execution, to test restartability
Recreation of maximum concurrent-user load for all BI reports and for ad-hoc reports
Ensuring access to BI reports during simulated ETL production loads
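Where ETL jobs write to an audit table, the load-window check can be automated with a query such as the sketch below. The etl_audit table and the four-hour window are illustrative assumptions, and interval syntax varies by database.

    -- Jobs whose run exceeded the agreed four-hour batch window.
    SELECT job_name, start_time, end_time
    FROM etl_audit
    WHERE end_time - start_time > INTERVAL '4' HOUR;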
Recommendations / Conclusion
Data warehouse solutions are becoming almost ubiquitous as a supporting technology for the operational and strategic
functions at most organizations. Data warehouses play an integral role in business functions as diverse as enterprise
process management and monitoring, business intelligence, and production of financial statements. The approach
described here combines an understanding of the business rules applied to the data with the ability to develop and use
testing procedures that check the accuracy of entire data sets. This level of testing rigor requires additional effort and skilled resources. However, by employing this methodology, project teams can be confident in the quality of the data from day one of the DWH implementation. The result will build confidence in the end-user community and will ultimately lead to a more effective implementation.
References:
[1] Matt Austin, 2010, "The Necessity of Data Profiling: A How-to Guide to Getting Started and Driving Value", TDWI.org
[2] Jonathan G. Geiger, 2004, Data Quality Management, The Most Critical Initiative You Can Implement, SUGI 29,
Intelligent Solutions, Inc.
[3] Raj Kamal, 2013, Adventures with Testing BI/DW Application: On a crusade to find the Holy Grail, Microsoft Corp.
[4] Syntel Corp., 2012, Proven Testing Techniques in Large Data Warehousing Projects
[5] Abhijit Singh, 2013, Meeting the Data Warehouse Business Intelligence Testing Challenges, L&T Infotech
[6] Jeffrey R. Bocarsly, Ph.D, 2011, Complex ETL Testing: A Strategic Approach, RTTS (Real-Time Testing Solutions), NY, NY
[7] Virtusa IT Global Services, 2012, Data Warehouse Testing