Testing a Data Warehouse

Triumph Through Effective Up-front Planning


Assuring Data Warehouse Content, Structure and Quality
Wayne Yaddow Published in Professional Tester Magazine, 02/2014

Restructuring databases and streamlining data warehouses (DWH) have become integral requirements in every
organization. Managers now realize the need to study their business, scrutinize their data, and optimize available
information to their advantage, in order to stay competitive.
Business information is available in many forms, but mostly in knowledge repositories of unstructured data. And while
data warehousing projects are on the rise, testing plays a significant role, determining the success of each project by
evaluating data throughout the process to ensure it meets specified business needs and the scope of work. However,
there are two key challenges involved in data warehousing projects - increased complexities and the significant volume
of data. [2], [4]
This article addresses the following DWH testing topics:

The Challenges
Test Planning
Tester skills
ETL Test Scenarios

To ensure a methodical analysis of the end result, businesses should focus on testing the following areas of the DWH:
Data completeness and quality
Data transformations, source to target
Referential integrity of DWH facts and dimensions
Defect management
Performance and scalability
Integration testing
User-acceptance testing
Regression testing
Adherence to compliance standards

Planning for such data-intensive tests is complex and requires exceptional human and technical resources, but it is vital
for data quality and the overall success of your project. How can you achieve the desired success? Through test planning
that includes a study of technical artifacts such as data models, business rules, data mapping documents, and data
warehouse (DWH) loading design logic.

From this article you can expect to take away DWH planning checklists, a variety of testing scenarios, concepts for data
profiling and methods for data verification.

The central area of focus for DWH testing is verifying the quality and completeness of data. Data completeness testing
ensures that all expected records from the source are loaded into the database, by reconciling the loaded records against
the error and reject records.
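
The reconciliation described above can be sketched as a pair of queries. This is a minimal illustration, not a prescribed method: the table names (`src_orders`, `dwh_orders`, `etl_rejects`), the key column and the sample rows are all hypothetical, and SQLite stands in for whatever database the project uses. The idea is that every source row should be either loaded or rejected, so source count minus target count minus reject count should be zero.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical source, target and reject tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_orders  (order_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE dwh_orders  (order_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE etl_rejects (order_id INTEGER PRIMARY KEY, reason TEXT);
        INSERT INTO src_orders  VALUES (1, 10.0), (2, 20.0), (3, NULL), (4, 40.0);
        INSERT INTO dwh_orders  VALUES (1, 10.0), (2, 20.0);
        INSERT INTO etl_rejects VALUES (3, 'amount is null');
        """
    )
    return conn


def reconcile_counts(conn, source, target, rejects):
    """Return (source_rows, target_rows, rejected_rows, unaccounted_rows)."""
    def count(table):
        # Table names come from the test plan, not from user input,
        # so plain string formatting is acceptable in this sketch.
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    src, tgt, rej = count(source), count(target), count(rejects)
    return src, tgt, rej, src - tgt - rej


def missing_keys(conn, source, target, rejects, key):
    """List source keys that were neither loaded nor rejected."""
    sql = f"""
        SELECT {key} FROM {source}
        EXCEPT SELECT {key} FROM {target}
        EXCEPT SELECT {key} FROM {rejects}
        ORDER BY {key}
    """
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(reconcile_counts(conn, "src_orders", "dwh_orders", "etl_rejects"))
    print(missing_keys(conn, "src_orders", "dwh_orders", "etl_rejects", "order_id"))
```

A non-zero unaccounted count tells the tester that rows were silently dropped; the key list says which ones to investigate.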

Best DWH QA practices encompass multiple disciplines that are designed to validate data completeness, data integrity,
business rule implementations and transformations, database ETL functions, and performance. See Figure 1 below.

Graphic, courtesy of Virtusa IT Global Services

Figure 1: Primary disciplines of DWH testing [7]

Challenges of Data Warehouse Testing


DWH stakeholders often ask questions like these as they review and comment on the DWH test strategy document:

What data should be used for testing? This question often arises very early in the project and is part of the test data
acquisition activity. Data sources are often the operational (OLTP) systems, which provide the lowest level of data. Source
data can come from databases, front-end user applications, legacy feeders, modular data models or files (flat files, .xls,
ASCII, EBCDIC, etc.). Multiple data sources introduce a large number of issues, such as semantic conflicts and problems
with data capture and synchronization. The biggest testing challenge here is to understand the different sources of data
and to check whether all data from the heterogeneous systems is properly fed into the target tables. [5]

Is my data accurate? Once the data acquisition phase is complete, the next step is often data cleansing and filtering.
Data cleansing is the process of removing unwanted data. After cleansing, data becomes usable and ready to be fed to
other work areas. The main testing challenge here is to check the data and ensure that it contains no corrupting errors.
In addition, QA often validates field values against known lists of entities and makes sure that the first level of
transformation rules is correctly implemented.
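
Validating field values against a known list can be as simple as the sketch below. The field name, the reference list of country codes and the sample rows are hypothetical, chosen only to illustrate the check; a real project would load the reference list from its own master data.

```python
# Hypothetical reference list of valid values for a "country" field.
VALID_COUNTRY_CODES = frozenset({"US", "CA", "MX"})


def find_invalid_values(rows, field, allowed):
    """Return (row_number, value) pairs whose field value is not in the allowed set.

    Row numbers start at 1. Missing or null values are reported as invalid.
    """
    invalid = []
    for row_number, row in enumerate(rows, start=1):
        value = row.get(field)
        if value not in allowed:
            invalid.append((row_number, value))
    return invalid


if __name__ == "__main__":
    cleansed_rows = [
        {"customer_id": 1, "country": "US"},
        {"customer_id": 2, "country": "usa"},  # inconsistent representation
        {"customer_id": 3, "country": None},   # missing value
        {"customer_id": 4, "country": "CA"},
    ]
    for row_number, value in find_invalid_values(cleansed_rows, "country", VALID_COUNTRY_CODES):
        print(f"row {row_number}: unexpected country value {value!r}")
```

The comparison is deliberately exact, so that inconsistent representations such as "usa" are reported rather than silently normalized.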

Are the expected data transformations completed? Data transformation is the process of mapping source data to
destination data using business transformation logic. Mappings include 1-1 look-ups, switch cases, DB logic,
combinations, truncating, defaulting and null processing. End-to-end testing of data flow is very important to ensure the
accuracy, completeness (no missing or invalid data) and consistency of the transformed data. Modeling business concepts
in technical terms is the major testing challenge, so every business rule should be validated against its business
objective.
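
One common way to test transformations is for QA to re-implement each mapping rule independently and compare the result with what the ETL actually loaded. The sketch below assumes three hypothetical rules (a status look-up with a default, null processing on the name, and truncation to 10 characters); the rules, field names and rows are illustrative only and do not come from any particular mapping document.

```python
# Hypothetical mapping rules, as a tester might derive them from a mapping document.
STATUS_LOOKUP = {"A": "ACTIVE", "I": "INACTIVE"}  # 1-1 look-up
DEFAULT_STATUS = "UNKNOWN"                        # defaulting for unmapped or null codes
NAME_MAX_LEN = 10                                 # truncation length in the target


def expected_target(source_row):
    """Apply the mapping rules to one source row, independently of the ETL code."""
    name = (source_row["name"] or "").strip()[:NAME_MAX_LEN]  # null processing, truncating
    status = STATUS_LOOKUP.get(source_row["status"], DEFAULT_STATUS)
    return {"id": source_row["id"], "name": name, "status": status}


def compare_transformations(source_rows, target_rows):
    """Return (id, expected, actual) for every source row whose target row differs or is missing."""
    target_by_id = {row["id"]: row for row in target_rows}
    mismatches = []
    for source_row in source_rows:
        expected = expected_target(source_row)
        actual = target_by_id.get(source_row["id"])
        if actual != expected:
            mismatches.append((source_row["id"], expected, actual))
    return mismatches


if __name__ == "__main__":
    source = [
        {"id": 1, "name": "  Alice ", "status": "A"},
        {"id": 2, "name": "Bartholomew Jones", "status": "I"},
        {"id": 3, "name": None, "status": None},
    ]
    target = [
        {"id": 1, "name": "Alice", "status": "ACTIVE"},
        {"id": 2, "name": "Bartholomew", "status": "INACTIVE"},  # not truncated to 10
        {"id": 3, "name": "", "status": "UNKNOWN"},
    ]
    for row_id, expected, actual in compare_transformations(source, target):
        print(f"id {row_id}: expected {expected}, found {actual}")
```

Keeping the expected-value logic separate from the ETL code is the point of the exercise: a shared bug would otherwise pass unnoticed.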

Are all desired targets loaded correctly? Data load phases generally process data into the end target, usually the DWH.
Depending on the business requirement, data loads can be full or incremental. Loading can take place daily, weekly or
monthly, so testing in this phase needs to assure that the correct data is loaded within the defined time frame. The data
loaded into the dimension and fact tables presents the final reporting picture of the warehouse. Business Intelligence
(BI), being the user-facing area, is a very important part, and facts act as feeders for the reports. Testing should
further verify that data flows properly from the data source to staging and then into the data warehouse.
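
For incremental loads, one possible check is that every source row changed since the previous load appears in the target with the same change timestamp. The sketch below assumes hypothetical tables `src_customer` and `dwh_customer`, an `updated_at` column stored as ISO-8601 text (so string comparison matches date order), and SQLite as a stand-in database.

```python
import sqlite3


def build_demo_db():
    """Create hypothetical source and target tables with one stale target row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_customer (id INTEGER PRIMARY KEY, updated_at TEXT);
        CREATE TABLE dwh_customer (id INTEGER PRIMARY KEY, updated_at TEXT);
        INSERT INTO src_customer VALUES (1, '2014-01-01'), (2, '2014-01-05'), (3, '2014-01-06');
        INSERT INTO dwh_customer VALUES (1, '2014-01-01'), (2, '2014-01-03'), (3, '2014-01-06');
        """
    )
    return conn


def rows_missed_by_incremental_load(conn, last_load_ts):
    """Return ids of source rows changed after last_load_ts that the target does not reflect."""
    sql = """
        SELECT s.id
        FROM src_customer AS s
        LEFT JOIN dwh_customer AS t
               ON t.id = s.id AND t.updated_at = s.updated_at
        WHERE s.updated_at > ?
          AND t.id IS NULL
        ORDER BY s.id
    """
    return [row[0] for row in conn.execute(sql, (last_load_ts,))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(rows_missed_by_incremental_load(conn, "2014-01-02"))
```

In the sample data, customer 2 changed after the previous load but the target still holds the older version, so it is reported.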

Following are other examples of DWH challenges that should be addressed when DWH test planning is under way:

Inadequate ETL design documents available for test planning; for example, data models and mapping or detailed
design documents are missing or not current.
Source table field values are unexpectedly null or do not meet data mapping specifications; source data profiling was
not completed early in the project.
Target DWH table field values are unexpectedly null or do not meet data mapping specifications; data profiling is
not conducted after each load to the DWH.
Source data used to populate the DWH is dirty -- that is, not fully in compliance with expectations and requirements.
Excessive ETL errors may be discovered after entry to QA because developer unit testing is minimal or the ETL was
developed without the aid of ETL tools such as Informatica, SSIS, etc.
Issues with the source-to-target data model and mappings: for example, a) not reviewed / approved by project
stakeholders, b) not consistently maintained through the development lifecycle.
Huge source data volumes and data types: due to loading months or years of historical data from a variety of
sources, important data may be missing or may not meet requirements over the entire timeframe.
Field constraints and SQL not coded correctly for Informatica ETL.
ETL logs and messages to be acted upon can be excessive in early phases.
Business and IT subject matter experts may not be available for support of QA.

Planning the DWH QA Strategy


Those planning the QA strategy and detailed test plans should carefully review the following as a source of project
knowledge:

Business requirements documentation: testers use it to develop the project test strategy and high-level test scenarios
and, as a result, can notify business analysts (BAs) and others of potential quality issues.
Data models for source and DWH target schemas: the QA team gains an understanding of primary and foreign keys and
data flows from source to target.
ETL or stored procedure design & logic documents: testers can develop grey, white and black box test cases to verify
the end-to-end ETL process.
Production and QA deployment tasks: the QA team develops test scenarios that address the physical deployment and
load of the DWH.
Required QA skills, methods and tools: multiple phases of different ETL procedures require a variety of skills and tools.
They range from data comparisons to data profiling to performance and automation. Figure 2 gives a high-level view of
the activities and implied testing that must be planned.

Figure 2: Verifying the ETL process. Graphic courtesy of RTTS, QuerySurge ETL tool [6]

Data Profiling: Planning for a Key DWH QA Process

When candidate data sources are identified and finalized, data profiling should be planned, then implemented
immediately on that source data. Data profiling is the examination and assessment of your source systems' data
quality, integrity and consistency -- sometimes known as source systems analysis. Data profiling is fundamentally
important, yet often ignored. As a result, data warehouse quality can be significantly compromised. [1]

Examples of problems that are often uncovered through data profiling:

Data elements used for purposes other than expected
Empty columns; columns containing no data at all
Invalid values in columns
Inconsistent methods of representing the same value in a field
Missing values where a field is defined not null
Violation of structural dependencies
Violation of expected column relationships such as order of date values
Violation of business rules
Unrealistic percentages of specific values appearing in a column

The problems uncovered in analysis can then be reviewed by business analysts to determine their root causes. A single
analyst can surface many issues for study, which can lead to a great many data quality problems being corrected.
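
Several of the problems listed above (empty columns, inconsistent representations, unrealistic shares of one value, out-of-order dates) can be surfaced with very little code. The following is a minimal profiling sketch in plain Python; the column names and sample rows are hypothetical, and dates are assumed to be ISO-8601 strings so that string comparison follows date order. Dedicated profiling tools do far more than this.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, nulls, distinct values and the dominant value's share."""
    values = [row.get(column) for row in rows]
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "rows": total,
        "nulls": total - len(non_null),
        "distinct": len(counts),
        "top_value": top_value,
        "top_share": top_count / total if total else 0.0,
    }


def date_order_violations(rows, earlier, later):
    """Return rows where both dates are present and the 'earlier' date falls after the 'later' one."""
    return [
        row for row in rows
        if row.get(earlier) and row.get(later) and row[earlier] > row[later]
    ]


if __name__ == "__main__":
    orders = [
        {"order_id": 1, "state": "NY", "order_date": "2014-01-02", "ship_date": "2014-01-05"},
        {"order_id": 2, "state": "NY", "order_date": "2014-01-10", "ship_date": "2014-01-08"},
        {"order_id": 3, "state": "N.Y.", "order_date": "2014-01-11", "ship_date": "2014-01-12"},
        {"order_id": 4, "state": None, "order_date": "2014-01-12", "ship_date": None},
    ]
    print(profile_column(orders, "state"))
    print([row["order_id"] for row in date_order_violations(orders, "order_date", "ship_date")])
```

A distinct count higher than expected ("NY" and "N.Y.") hints at inconsistent representation, and a null count equal to the row count flags an empty column.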

At the beginning of a data warehouse project, as soon as a candidate data source is identified, a (quick) data profiling
assessment should be conducted to provide a go/no-go decision about proceeding with the project. Table 3 depicts
the possible causes of data quality degradation at the data profiling stage of data warehousing.

Identifying Those Crucial Tester Skills
The data warehouse testing lead and hands-on testers are expected to demonstrate extensive experience in designing,
planning and executing database and data warehouse testing strategies and tactics that ensure DWH quality
throughout all stages of the ETL lifecycle. Essential skills include:

Firm understanding of DWH and DB concepts
High levels of skill with SQL queries, stored procedures, and DB and SQL editors
Understanding of project data used by the business (data sources, data tables, data dictionary, business
terminology)
The practice of data profiling methods and tools
Developing strategies, test plans and test cases specific to DWH and the business
Creating effective ETL test cases / scenarios based on ETL DB loading technology and business requirements
Understanding of data models, data mapping documents, ETL design and ETL coding; ability to communicate
effectively with DWH designers and developers
Experience with multiple DB systems: Oracle, SQL Server, Sybase, DB2
Troubleshooting of ETL (e.g., Informatica / DataStage) sessions and workflows
Deployment of DB code to databases
Unix scripting, Autosys, Anthill, etc.
Use of Excel & MS Access for data analysis
Implementation of automated testing for the ETLs

The Basics of ETL Test Verifications

Testing the data warehouse through all phases can be summarized in the following way. The QA lead should assure that
test cases are prepared for each of the basic requirement verifications associated with data warehouse testing
(examples below). This checklist represents central issues that surface during DWH tests; when they are not tested
thoroughly, those same issues may arise when DWH data is used by applications during production operations.

Verify data mappings, source to target before and after ETLs begin
Verify that all table fields were loaded from source to staging
Verify that primary and foreign keys were properly generated using sequence generator
Verify that not-null fields were populated in all target DWH objects
Verify no data was truncated in each field
Verify field lengths, data types and data formats are as specified in the design phase
Verify no duplicate records in individual target tables
Verify data transformations were applied based on business rules
Verify that numeric fields are populated with correct precision
Verify that every ETL session completed with only planned exceptions
Verify all cleansing, error and exception handling were implemented as planned
Verify ETL data calculations and aggregations
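
Several items in the checklist above translate directly into SQL queries that should return no rows. The sketch below runs three such checks (duplicates, not-null fields, foreign keys without a matching dimension row) against hypothetical `fact_sales` and `dim_customer` tables, with SQLite as a stand-in; the table and column names are assumptions made for illustration.

```python
import sqlite3

# Each check is a query that should return no rows when the load is correct.
CHECKS = {
    "duplicate_order_numbers": """
        SELECT order_no FROM fact_sales
        GROUP BY order_no HAVING COUNT(*) > 1
        ORDER BY order_no
    """,
    "null_amounts": """
        SELECT sale_id FROM fact_sales
        WHERE amount IS NULL
        ORDER BY sale_id
    """,
    "orphan_customer_keys": """
        SELECT f.sale_id
        FROM fact_sales AS f
        LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
        WHERE d.customer_key IS NULL
        ORDER BY f.sale_id
    """,
}


def build_demo_db():
    """Create hypothetical dimension and fact tables seeded with one defect per check."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (
            sale_id INTEGER PRIMARY KEY,
            order_no TEXT,
            customer_key INTEGER,
            amount REAL
        );
        INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO fact_sales VALUES
            (1, 'A-1', 1, 10.0),
            (2, 'A-2', 2, NULL),
            (3, 'A-2', 2, 5.0),
            (4, 'A-3', 9, 7.5);
        """
    )
    return conn


def run_checks(conn, checks):
    """Run every check and return {check_name: [offending values]}."""
    return {name: [row[0] for row in conn.execute(sql)] for name, sql in checks.items()}


if __name__ == "__main__":
    for name, offenders in run_checks(build_demo_db(), CHECKS).items():
        status = "PASS" if not offenders else f"FAIL {offenders}"
        print(f"{name}: {status}")
```

Expressing each verification as a query that must come back empty keeps the checks easy to automate and to rerun after every load.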

Figure 3 illustrates a variety of testing and tools that should be considered during the DWH project lifecycle.

Figure 3: Testing methods to support DWH development process. [3]

TESTING

- Validate data acquisition business logic
- Compare schema & data between sources and staged data (row counts, new and missing data, mismatched data,
missing or invalid constraints, e.g., primary / foreign keys)
- Tune performance of data loading jobs

- Validate data integration and transformation logic between staged data and that loaded in the DWH
- Validate the dimension model
- Tune performance of data staging jobs

- Validate data mart design
- Compare data between ODS and data marts with SQL queries
- Tune performance of data mart access

- Validate data on reports with DWH and data marts
- Validate report filters and drill downs
- Performance tune data retrieval to reports

Performance Evaluations for the DWH Project


Loading and populating the data warehouse with relevant and complete data, and ensuring the relevance of reports,
constitute the majority of user and DWH stakeholder expectations. But these tasks have to be completed within a given
timeline and should be scalable to support the ever-growing system. Testing the performance of ETL and reports for
responsiveness and scalability is critical to the success of the design.

Although there are many non-functional requirements (NFRs) surrounding the performance of ETL and report responses,
it can be helpful to follow these guidelines:

Execution with peak production volume to check that the ETL process completes within the agreed window
Analysis of ETL loading times, with a smaller amount of data, to gauge scalability issues
Verification of ETL processing times, component by component, to identify areas of improvement
Shutdown of the server during ETL execution, to test restartability
Simulation of maximum concurrent users for all BI reports and for ad hoc reports
Ensuring access to BI reports during simulated ETL production loads
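
Two of the guidelines above, component-by-component timing and completion within the load window, can be sketched with a small timing harness. The step names, the placeholder step functions and the window length are hypothetical; in practice each step would invoke a real ETL job, and the `clock` parameter exists only so the harness itself can be tested with a fake clock.

```python
import time


def time_steps(steps, clock=time.perf_counter):
    """Run each (name, callable) step in order and return {name: elapsed_seconds}."""
    timings = {}
    for name, step in steps:
        start = clock()
        step()
        timings[name] = clock() - start
    return timings


def check_window(timings, window_seconds):
    """Compare total elapsed time with the load window and name the slowest step."""
    total = sum(timings.values())
    slowest = max(timings, key=timings.get) if timings else None
    return {
        "total_seconds": total,
        "within_window": total <= window_seconds,
        "slowest_step": slowest,
    }


if __name__ == "__main__":
    # Placeholder steps; replace with calls that launch the actual ETL components.
    etl_steps = [
        ("extract", lambda: sum(range(100_000))),
        ("transform", lambda: sorted(range(100_000), reverse=True)),
        ("load", lambda: list(range(100_000))),
    ]
    timings = time_steps(etl_steps)
    print(timings)
    print(check_window(timings, window_seconds=4 * 60 * 60))  # hypothetical 4-hour window
```

Running the same harness against a smaller and a full data volume gives a rough first view of how each component scales.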

Recommendations / Conclusion

Data warehouse solutions are becoming almost ubiquitous as a supporting technology for the operational and strategic
functions at most organizations. Data warehouses play an integral role in business functions as diverse as enterprise
process management and monitoring, business intelligence, and production of financial statements. The approach
described here combines an understanding of the business rules applied to the data with the ability to develop and use
testing procedures that check the accuracy of entire data sets. This level of testing rigor requires additional effort and
skilled resources. However, by employing this methodology, project teams can be more confident, from day one of the
implementation of the DWH, in the quality of the data. The result will build confidence in the end-user community, and it
will ultimately lead to a more effective implementation.

References:
[1] Matt Austin, 2010, "The Necessity of Data Profiling: A How-to Guide to Getting Started and Driving Value", TDWI.org
[2] Jonathan G. Geiger, 2004, "Data Quality Management, The Most Critical Initiative You Can Implement", SUGI 29,
Intelligent Solutions, Inc.
[3] Raj Kamal, 2013, "Adventures with Testing BI/DW Application: On a crusade to find the Holy Grail", Microsoft Corp.
[4] Syntel Corp., 2012, "Proven Testing Techniques in Large Data Warehousing Projects"
[5] Abhijit, Singh, 2013, "Meeting the Data Warehouse Business Intelligence Testing Challenges", L&T Infotech
[6] Jeffrey R. Bocarsly, Ph.D, 2011, "Complex ETL Testing: A Strategic Approach", RTTS (Real-Time Testing Solutions), NY, NY
[7] Virtusa IT Global Services, 2012, "Data Warehouse Testing"
