Alex Whittles Dissertation
FACULTY OF ACES
by
Alexander Whittles
This dissertation does NOT contain confidential material and thus can be made
available to staff and students via the library.
Acknowledgements
Thank you to Angela Lauener and Keith Jones, from Sheffield Hallam University, for
their valuable assistance with this project.
A core part of this research relied on access to state of the art solid state hardware. I’d
like to thank Fusion IO for their support of this work, and for the loan of their hardware
which made the research possible.
The time taken to undertake this research has been at the cost of spending time at
work. I’d like to thank Purple Frog Systems Ltd for supporting me through this project.
Thanks to Tony Rogerson for helping define the technical specification of the test
server.
Finally, and most importantly, thanks go to my wife, Hollie, who has supported me
through this dissertation and throughout the entire MSc process. Without her support,
encouragement, understanding and limitless patience I would not have been able to
complete this work. My wholehearted thanks go to her.
Abstract
In the computer science field of Business Intelligence, one of the most fundamental
concepts is that of the dimensional data warehouse as proposed by Ralph Kimball
(Kimball and Ross 2002). A significant portion of the cost of implementing a data
warehouse is the extract, transform and load (ETL) process which retrieves the data
from source systems and populates it into the data warehouse.
Critical to the functionality of most dimensional data warehouses is the ability to track
historical changes of attribute values within each dimension, often referred to as
Slowly Changing Dimensions (SCD).
There are countless methods of loading data into SCDs within the ETL process, all
achieving a similar goal but using different techniques. This study investigates the
performance characteristics of four such methods under multiple scenarios covering
different volumes of data as well as traditional hard disk storage versus solid state
storage. The study focuses on the most complex SCD implementation, Type 2, which
stores multiple copies of each member, each valid for a different period of time.
The study uses Microsoft SQL Server 2012 as its test platform.
Using statistical analysis techniques, the methods are compared against each other,
with the most appropriate methods identified for the differing scenarios.
It is found that using a Merge Join approach within the ETL pipeline offers the best
performance under high data volumes of at least 500k new or changed records. The T-
SQL Merge statement offers comparable performance for data volumes lower than
500k new or changed rows.
It is also found that the use of solid state storage significantly improves ETL load
performance, reducing load time by up to 92% (12.5x), but does not affect the
comparative performance characteristics between the methods, and so should not
impact the decision as to the optimal design approach.
Contents
Acknowledgements
Contents
1. Introduction
E. Conclusion
F. Toolset
5. Discussion
6. Conclusion
7. Evaluation
8. References
9. Appendix
Appendix 3. SAS Code – General Linear Model (Log, category variables)
Appendix 10. ANOVA Results – Method/Row Count Least Square Means
1. Introduction
A core component of any data warehouse project is the ETL (Extract, Transform and
Load) layer which extracts data from the source systems, transforms the data into a
new data model and loads the results into the warehouse. The ETL system is often
estimated to consume 70 percent of the time and effort of building a business
intelligence environment (Becker and Kimball 2007).
A study by Gagnon in 1999, cited by Hwang and Xu (Hwang and Xu 2007), reported that the average data warehouse costs $2.2m to implement. Watson and Haley (Watson and Haley 1997) report that a typical data warehouse project costs over $1m in the
first year alone. Although the cost will vary dramatically from project to project, these
sources illustrate the level of financial investment that can be required. Inmon states
that the long term cost of a data warehouse depends more on the developers and
designers and the decisions they make than on the actual cost of technology (Inmon
2007). There is therefore a compelling financial reason to ensure that the correct ETL approach is taken from the outset, and that the right technical decisions are made about which techniques to employ.
A Kimball style data warehouse comprises fact and dimension tables (Kimball and Ross
2002). Fact tables store the numerical measure data to be aggregated, whereas
dimension tables store the attributes and hierarchies by which the fact data can be
filtered, sliced, grouped and pivoted. It is a common requirement that warehouses be
able to store a history of these attributes as they change, so they represent the value
as it was at the time each fact happened, instead of what the value is now. This is
implemented using a technique called Slowly Changing Dimensions (SCD) (Kimball
2008), used within the ETL process.
There are numerous methods of implementing SCDs, of which the following three are the most common (Ross and Kimball 2005) (Kimball 2008) (Wikipedia 2010):

Type 1: Only the current value is stored; history is lost. This is used where changes are treated as corrections rather than genuine changes, or where no history is required.
Type 2: Multiple copies of a record are maintained, each valid for a period of time. Fact records are linked to the appropriate dimension record that was valid when the fact occurred, e.g. a customer's address. To analyse sales by region, sales should be allocated against the address where the customer was living when they purchased the product, not where they live now.
Type 3: Two (or more) separate fields are maintained for each attribute, storing the current and previous values. No further history is stored, e.g. a customer's surname, where it may be required to store only the current surname and maiden name, not the full history of all names.
Type 0 and Type 6 SCDs are special cases: Type 0 does not track changes at all, and Type 6 is a rare hybrid of Types 1, 2 and 3. Neither is therefore relevant to this research.
Type 1 SCDs are the simplest approach to implement (Kimball and Ross 2002); however, all history is lost. Type 3 SCDs are used infrequently (Kimball and Ross 2002) due to their limited ability to track history. Neither of these SCD types presents any maintainability or performance problems for the vast majority of data warehouses (Wikipedia 2010). The most common form of SCD is therefore Type 2, which is recommended for most attribute history tracking by most dimensional modellers, including Ralph Kimball himself (Kimball and Ross 2002). The downside of Type 2 is that it requires much more complex processing and is a frequent cause of performance bottlenecks (Wikipedia 2010).
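To make the Type 2 structure concrete, the sketch below shows a minimal dimension table of the kind discussed throughout this research. The table and column names (DimCustomer, ValidFrom, ValidTo, IsCurrent) are illustrative assumptions only, not the exact schema used in the tests.

-- Minimal sketch of a Type 2 customer dimension (illustrative names only).
-- Each version of a customer is a separate row, valid between ValidFrom and ValidTo.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey   INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerID    INT           NOT NULL,                 -- business key from the source system
    CustomerName  NVARCHAR(100) NOT NULL,
    Address       NVARCHAR(200) NOT NULL,                 -- a Type 2 tracked attribute
    ValidFrom     DATETIME      NOT NULL,
    ValidTo       DATETIME      NULL,                     -- NULL while this is the current version
    IsCurrent     BIT           NOT NULL DEFAULT (1)
);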
This research compares the performance of four methods of performing the Type 2 load, summarised below.

Bulk insert (ETL) & singleton updates (ETL) - The whole process is managed within the ETL data pipeline. For each input record, the ETL process determines whether it is a new or changed record via a singleton query to the dimension, and then handles the two streams of data individually. New records are bulk inserted into the dimension, while changed records are processed individually using singleton update statements.
Bulk insert (ETL) & bulk update (DB) (using Lookup) - The SCD processing is split
between the ETL and the database. The ETL pipeline uses a ‘lookup’ approach to
identify each record as either a new record requiring an insert or an existing record
requiring an update. All inserts are piped to a bulk insert component within the ETL; all
updates are bulk inserted into a staging table to then be processed into the live
dimension table by the database engine using a MERGE statement. The ‘lookup’
approach is an ETL technique analogous to a nested loop join operation in T-SQL.
Bulk insert (ETL) & bulk update (DB) (using Merge Join) - The SCD processing is split
between the ETL and the database. The ETL pipeline uses a ‘merge join’ approach to
identify each record as either a new record requiring an insert or an existing record
requiring an update. All inserts are piped to a bulk insert component within the ETL; all
updates are bulk inserted into a staging table to then be processed into the live
dimension table by the database engine using a MERGE statement. The ‘merge join’
approach is an ETL technique analogous to a merge join operation in T-SQL.
Bulk inserts and updates (DB) - The ETL process does not perform any of the SCD
processing, instead it is entirely handled within the database engine. The ETL pipeline
outputs all records to a staging table using a bulk insert, then all records in the staging
table are processed into the live dimension table at once using a MERGE statement.
This single database operation manages the entire complexity of differentiating
between new and changed rows, as well as performing the resulting operations.
The majority of data warehouses are populated daily during an overnight ETL load
(Mundy, Thornthwaite and Kimball 2011). The performance of the load is vital in order
to ensure the entire data batch can be completed in an often very tight time window
between end of day processing within the source transactional systems and the start
of the following business day. There is now a growing trend towards real-time data
warehouses, with current data warehousing technologies making it possible to deliver
decision support systems with a latency of only a few minutes or even seconds
(Watson and Wixom 2007) (Mundy, Thornthwaite and Kimball 2011). The performance
focus is therefore shifting from a single bulk load of data to a large number of smaller
data loads. This research will concentrate on the performance aspects of the more
typical overnight batch ETL load as it is still the most common business practice
(Mundy, Thornthwaite and Kimball 2011).
Historically, data warehouses have used traditional hard disk storage media for the physical storage of the data. There has recently been significant growth in the availability and reliability of NAND flash based solid state storage, and a corresponding reduction in cost. A case study by Fusion-IO for a leading online university (Fusion-IO
2011) shows the very large difference in performance for database operations when
comparing physical disk based media with solid state, increasing the random read IOPS
(input/output operations per second) from 3,548 to 110,000 and the random write
IOPS from 3,517 to 145,000. A test query in this case study improved in performance
from 2.5 hours on disk based storage to only 5 minutes on solid state storage.
The intended outcome is to be able to predict the optimal method for a given set of
dimension data and hardware platform, to enable data warehouse ETL developers to
optimise the initial design in order to maximise the data throughput, minimising the
required data warehouse load window.
The processes and methods for loading Type 2 SCDs are generic across technology platforms; however, this investigation will be carried out using the Microsoft SQL Server toolset, including the SQL Server database engine and the Integration Services ETL platform. SQL Server is one of the most widely used database platforms in use today, if not the most widely used (Embarcadero 2010). The techniques used in this research are equally suited to other database platforms such as Oracle.
Document Summary
Chapter 2 discusses the background literature and existing research that has been
conducted in this field. It also presents justification for this research.
Chapter 3 explains the methodology appropriate to the research question. The details
of the quantitative tests are discussed, as well as a summary of the statistical analysis
methodology.
Chapter 4 presents the test results and identifies the most appropriate statistical
models to be used. The results are analysed and interpreted using a variety of
statistical and data mining models.
Chapter 7 evaluates the research, identifying the limitations of the approach taken,
and discusses how further research could be conducted to improve the understanding
beyond that presented in this research.
2. Literature Review
This chapter explores the existing research that has been undertaken in this area, and examines the justification for this research. The specific topic of SCD performance is investigated, as well as the more general performance of database operations and the relevance of the industry's trend towards solid state storage devices.

Although the investigation and research approach is based primarily on the Microsoft SQL Server toolset, the performance of loading SCD Type 2 data is a generic issue and just as big a problem when using competing technologies such as SAS (Dramatically Increasing SAS DI Studio performance of SCD Type-2 Loader Transforms 2010). Although the terminology and implementation details differ between platforms, the concept therefore has a much wider scope.
The subject of SCD Type 2 load performance is widely discussed in user forums and blogs, providing an indication of the size of the problem. A simple Google search on the topic returns ¼ million results, including (Priyankara 2010) (Novoselac 2009) (Various 2004). Given this, it is surprising that there is a lack of detailed studies in either academia or the commercial field. The concept of a Type 2 SCD is discussed in the majority of books covering ETL methods for star schema data warehouses, for example (Kimball 2008) (Veerman, Lachev and Sarka 2009); however, alternative implementation approaches are often not presented, and no sufficient performance analysis was identified during the background investigation for this research.
Kimball (Kimball 2004) offers bulk merge (SET) as a method of improving the
functionality of a Type 2 data load, but as with other resources, does not discuss the
performance considerations of it. Warren Thornthwaite does however investigate this
approach in more detail in a more recent document (Thornthwaite 2008), explaining
that being able to handle the multiple required actions in a single pass should be
extremely efficient given a well-tuned optimizer. Uli Bethke has taken this same
approach and applied it to an Oracle environment (Bethke 2009).
Joost van Rossum has written a blog post on this topic (Rossum 2011), providing a number of options for loading data into SCDs along with some basic timing statistics for each. Although this is not an academic or refereed source, the author has many years of experience as a business intelligence consultant, receiving the Microsoft Community Contributor award in 2011 and the BI Professional of the Year award from Atos Origin in 2009. The post presents four alternatives, labelled (a) to (d), to the built-in Slowly Changing Dimension component.
Rossum chose to compare option (d) against the built-in component, and extends this option into two tests, one performing singleton updates and one performing a batch update. No reason is given for not pursuing the first three options; however, option (c) appears to have been added after the publication of the post, which explains its absence. Many corporations impose restrictions on the use of third party software components; it is also preferable to use transparent techniques whose functionality can be understood rather than black box components which cannot be analysed, which explains the absence of options (a) and (b).
In Rossum’s tests, he uses a small test dimension of 15,481 members, with a small
change set of 128 members and 100 new members. The results are provided in Table
1.
There is clearly a large performance variation between the methods; however, with such a small number of records the results can only provide an indication of the difference and cannot be interpreted with any degree of confidence. Rossum does not perform any statistical analysis on the results, does not repeat the experiments with different volumes of data, and does not provide any information on the conditions under which the tests were performed.
It’s disappointing that, although the topic is commonly discussed, no authors other than Rossum have been identified who have investigated the performance characteristics of the available methods. It is this shortage of existing research, along with the regularity with which the problem is encountered in the commercial field, which has prompted this research to investigate the load characteristics of SCD methods in more detail.
A related study by Muslih and Saleh (Muslih and Saleh 2010) describes the performance of different join statements in SQL queries. Their comparison of nested loop joins and sort-merge joins shows that there can be a dramatic difference in query cost dependent on the size of the datasets being used. They advise that nested loop joins should be used when there is a small number of rows, but that sort-merge joins are preferable with large amounts of data. Although the present research focuses on the performance of the ETL process rather than the database engine, there is a strong parallel, as the ETL process must join two streams of data together: the incoming and the existing data. These findings can therefore be taken into account when determining the methods to be used.
Olsen and Hauser (Olsen and Hauser 2007) advise that to get the best performance
from relational database systems the operations should be performed in bulk if more
than a very small portion of the database is updated.
Scharlock describes replacing a cursor based approach, which processed records individually, with a query that updated all rows in a single set based operation. He calculates that the cursor based approach would have taken in excess of 8 months to complete, whereas the set based operation completed in approximately 24 hours. Scharlock acknowledges that there is a much greater resource cost to the set based operation, although he does not present any details or evidence of this.
C. Random Vs Sequential IO
Loading data into a data warehouse dimension requires both random disk access as
well as sequential disk access.
Because track-to-track seeks are much faster than random seeks, it is possible to
achieve much higher throughput from a disk when performing sequential IO (Whalen
et al. 2006).
In contrast, solid state storage has no physically moving parts, so random seeks require less overhead. Solid state devices are therefore able to achieve much higher performance, specifically with respect to random read operations (Shaw and Back 2010). Tony Rogerson (Rogerson 2012) states that the more contiguously the data is held on disk, the lower the latency and the higher the throughput, with Solid State Devices (SSD) turning that reasoning on its head. Rogerson acknowledges that SSDs still offer the best access performance for contiguous data; however, their access latency is significantly less variable than that of hard disks, enabling a much higher comparative performance for random access.
Given the change in nature of their performance, it is expected that the use of SSD will
change the performance characteristics of loading data when compared with
traditional disk based storage.
D. Data Growth
Data volumes within organisations continue to grow at a phenomenal rate, as more
data is made available from social media, cloud sources, improved internal IT systems,
data capture devices, etc. Data growth projections vary; however, a recent McKinsey & Co report projects 40% annual growth in global data, with only a corresponding 5% growth in IT spending (McKinsey Global Institute 2011). There is therefore a
compelling need in industry to maximise the efficiency of any data processing system
whilst also minimising the cost of implementation and maintenance.
E. Conclusion
From the research presented, it’s clear that the performance of loading Type 2 slowly
changing dimensions is of concern to a large number of people in the Business
Intelligence industry, and as data volumes increase the problem will become more
prevalent.
Although numerous authors and bloggers have presented their own personal or
professional views on which method to use, there is very little experimental or
statistical evidence to justify their claims. Nor has any academic research been identified that investigates the performance characteristics of these ETL processes.
This lack of empirical evidence makes it impossible to determine which is the best
approach to loading data warehouse dimensions for a given scenario, leaving
architects and developers to make design decisions based either on their own, often
limited, experience or on anecdotal evidence.
This is made more problematic by the introduction of solid state hardware, providing
yet another option for the data warehouse architect to consider.
The author therefore considers this research to be of great importance to the Business
Intelligence community, to provide guidance to those looking to optimise their system
design.
3. Methodology

A. Inductive Vs Deductive
It is the intention of this research to perform an inductive investigation. This research
does not set out to prove an existing hypothesis that one method of loading data is
faster than another, but instead offers a number of different methods and scenarios
commonly found in industry, and attempts to compare them to investigate which is
the preferable method in any given scenario.
Following the ‘Research Wheel’ approach (Rudestam, Kjell and Newton 2001)
presented in Figure 1, the research will start with the empirical observation from the
author’s own experience in industry that the performance of loading Type 2 slowly
changing dimensions is a problematic area, and warrants investigation.
The previous chapter explored the literature in detail, presented justification for the
research and explored some of the specific questions and topics that have been raised,
which this research will explore in more detail.
Results will then be collected and analysed, then the cycle will be continued to
whatever extent is necessary in order to draw sufficient conclusions which can then be
applied to practical scenarios outside of this project.
B. Qualitative Vs Quantitative
Two high level approaches were considered for this research, quantitative and
qualitative (Rudestam, Kjell and Newton 2001).
A qualitative investigation could take the form of a questionnaire issued to experienced data warehouse practitioners, asking which load method they would use for a number of given scenarios. The results would be interpreted to extract common findings from the answers provided for each scenario. A quantitative investigation could also be adopted if the participants were asked to rate each method on a performance scale.
The primary concern with this approach is that it is highly unlikely to actually reveal a
genuine performance difference between the methods, instead revealing each
individual’s preference for each method, which is likely to also be based on
convenience, lack of awareness of other methods, maintainability, code simplification,
available toolsets etc. This method would however enable the research to cover a
broader spectrum of technologies and implementation styles.
This approach also relies on getting responses from the questionnaire, which can be
problematic and costly.
To perform a quantitative analysis of the load performance, a simple data load test can
be set up to measure the time taken to process a number of new and changed rows in
a simulated data warehouse environment. The proportion of new and changed rows
can be altered to provide measurements of the data throughput.
There is no existing data source available from which to measure the relevant metrics. It is therefore intended to set up a series of tests that will generate the required measurements. To perform this, a number of components must be set up.
C. Source Database
A representative online transactional processing (OLTP) database will be created, complete with a set of data records suitable to be populated into a data warehouse dimension. The contents of this database will be preloaded into the data warehouse dimension, and then one of a number of change scripts will be run to generate the required volume of SCD Type 2 changes.
The nature of this database is immaterial, so an arbitrary set of tables will be created modelling a frequently used dimension: Customer. The Customer dimension is often the most challenging dimension in a data warehouse due to its large size and often rapidly changing attributes (Kimball 2001). These tables will be normalised to third normal form to accurately model a real-world OLTP source database. As this research is solely focussed on the performance of SCD Type 2 dimension data loads, it is not necessary to simulate fact data such as sales or account balances.
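As an illustration of the kind of normalised source structure described above, the sketch below shows two hypothetical 3NF tables; the names and columns are assumptions for illustration only, not the exact schema generated for the tests.

-- Illustrative 3NF source tables for the Customer data (hypothetical schema).
CREATE TABLE dbo.Address
(
    AddressID  INT           NOT NULL PRIMARY KEY,
    Street     NVARCHAR(100) NOT NULL,
    City       NVARCHAR(50)  NOT NULL,
    PostCode   NVARCHAR(10)  NOT NULL
);

CREATE TABLE dbo.Customer
(
    CustomerID INT           NOT NULL PRIMARY KEY,
    FirstName  NVARCHAR(50)  NOT NULL,
    Surname    NVARCHAR(50)  NOT NULL,
    AddressID  INT           NOT NULL REFERENCES dbo.Address (AddressID)  -- address held separately in 3NF
);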
The source OLTP database will need to be populated with random but realistic data. To achieve this, the SQL Data Generator application provided by RedGate will be used. This allows each field to be populated using a pseudo-random generator within specified constraints, or selected randomly from a list of available values, preventing any violation of each field’s constraints. This method will be used to generate the starting dataset as well as the new and changed records for the ETL load test.
To generate the change data, SQL scripts will be written which will update a specified
percentage of the records, altering at least one of the fields being tracked by the type
2 process.
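A minimal sketch of such a change script is shown below, assuming the hypothetical source tables named earlier; the 10 percent sample and the suffix applied to the surname are illustrative choices only.

-- Illustrative change script: alter a tracked attribute on roughly 10% of customers
-- so that the next ETL run detects them as Type 2 changes (hypothetical schema).
UPDATE c
SET    c.Surname = c.Surname + N'-chg'   -- change at least one tracked field
FROM   dbo.Customer AS c
WHERE  c.CustomerID % 10 = 0;            -- deterministic sample of approximately 10% of records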
To ensure consistency between the methods, each test will use identical datasets.
D. Data Warehouse
A suitable data warehouse dimension will be created, following Kimball Group best
practices (Mundy, Thornthwaite and Kimball 2011). This will be a single dimension that
would normally form part of a larger star schema of fact and dimension tables within
the warehouse.
Fact data will not form part of the performance tests, so the complete star schema
does not need to be built.
E. ETL Process
To perform the data load, a number of ETL (Extract, Transform & Load) packages will be created to populate the dimension from the source database, each performing the data load in a different way. Each package will log the ETL method being used, the number of new rows to be inserted, the number of changed rows retrieved from the source database, and the duration of the load process.
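A minimal sketch of the kind of logging described above is shown below; the table and column names are illustrative assumptions rather than the exact logging schema used in the tests.

-- Illustrative ETL test log: one row per test run (hypothetical schema).
CREATE TABLE dbo.EtlTestLog
(
    TestLogID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    MethodName  NVARCHAR(50) NOT NULL,   -- e.g. Singleton, Lookup, MergeJoin, Merge
    Hardware    NVARCHAR(10) NOT NULL,   -- e.g. HDD or SSD
    NewRows     INT          NOT NULL,
    ChangedRows INT          NOT NULL,
    DurationSec INT          NOT NULL,
    LoggedAt    DATETIME     NOT NULL DEFAULT (GETDATE())
);

-- Each package would record its result at the end of a run (placeholder values only, not test results):
INSERT INTO dbo.EtlTestLog (MethodName, Hardware, NewRows, ChangedRows, DurationSec)
VALUES (N'MergeJoin', N'SSD', 0, 0, 0);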
F. Toolset
There are a number of database systems and ETL tools available to use, from Oracle
and SQL Server to MySQL and DB2, and SSIS to Syncsort and SAS.
This analysis will make use of Microsoft SQL Server. SQL Server is one of the most, if
not the most, widely used database platforms in use today (Embarcadero 2010). It
integrates a highly scalable DBMS (database management system) with an integrated
ETL toolset, SSIS.
G. Quantitative Tests
The comparative performance of the load methods is expected to change depending
on the number of rows being loaded, and the ratio of new records to changed records.
It will therefore be necessary to create numerous different change data sets, each with
a different percentage of new data and changed data.
The tests will all be performed on the same hardware, with the exception of the different storage platforms. This will ensure consistency; however, it should be noted that the results may be influenced by the specification of the server used. For example, some of the methods are very memory intensive and so may be expected to perform better when given access to more memory. Ideally the datasets would be small enough to ensure that memory is not an influencing factor; however, it is important to perform the tests on data that is of sufficient size to provide usable and meaningful
data. Each ETL process will incur fixed processing overhead to initiate the process and
pre-validate the components and metadata etc. If the datasets were too small, the
fixed processing overheads could obscure the timing results. A dimension with 50m
records will therefore be used. This size is representative of a large dimension of a
typical large organisation, for example a customer dimension. The resulting size of the
databases will also be within the available hardware capacity of the solid state drives
available for the tests.
Four different ETL systems will be created to perform SCD type 2 dimension loads with
the following methods.
Method 1: Bulk inserts (ETL) and singleton updates (ETL)

The process is managed entirely within the ETL pipeline. New records which don’t already exist in the dimension will be bulk inserted within the ETL pipeline, with a full lock allowed on the destination table.

Changed records will be dealt with individually within the ETL pipeline, with two actions performed for each change: the existing dimension record is expired by setting its end date and clearing its current flag, and a new record is inserted containing the updated attribute values, as illustrated in the sketch below.
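The sketch below shows the two singleton statements in outline, assuming the hypothetical DimCustomer structure shown earlier and parameters supplied per incoming record; in the actual test packages this logic is performed by the SSIS Slowly Changing Dimension component rather than hand-written statements.

-- Illustrative singleton Type 2 update for one changed customer (hypothetical schema).
-- @CustomerID, @NewAddress and @LoadDate would be supplied per record by the ETL pipeline.
DECLARE @CustomerID INT           = 12345,           -- placeholder values
        @NewAddress NVARCHAR(200) = N'1 New Street',
        @LoadDate   DATETIME      = GETDATE();

-- 1. Expire the current version of the record.
UPDATE dbo.DimCustomer
SET    ValidTo   = @LoadDate,
       IsCurrent = 0
WHERE  CustomerID = @CustomerID
  AND  IsCurrent  = 1;

-- 2. Insert a new version containing the changed attribute values.
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Address, ValidFrom, ValidTo, IsCurrent)
SELECT CustomerID, CustomerName, @NewAddress, @LoadDate, NULL, 1
FROM   dbo.DimCustomer
WHERE  CustomerID = @CustomerID
  AND  ValidTo    = @LoadDate;   -- the version just expired supplies the unchanged attributes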
Figure 2 – Typical (simplified) structure of a Singleton load process using the Slowly Changing Dimension
component [taken from a screenshot of the actual load process used for this test]
Method 2: Bulk inserts (ETL) and bulk updates (DB), split using Lookup (ETL)
The process is managed by both the ETL layer and the database engine.
The ETL layer includes a Lookup component which cross references each incoming
record against the existing dimension contents. New records which don’t already exist
in the dimension will be bulk inserted, with a full lock allowed on the destination table.
Existing records will be loaded into a staging table and then merged into the destination dimension in a single operation. The Merge operation takes care of the multi-stage process required for Type 2 changes: expiring the existing version of each changed record and inserting a new version containing the updated values.
This addresses the main scalability limitation of the singleton method, in which each record is processed individually. In order to make any database operation truly scalable, the updates should be managed in bulk. As Olsen and Hauser describe, one should make careful use of edit scripts and replace them with bulk operations if more than a very small portion of the database is updated (Olsen and Hauser 2007). An adaptation of this approach to utilise bulk updating was adopted by Rossum in his tests (Rossum 2011).
Method 3: Bulk inserts (ETL) and bulk updates (DB), split using Join (ETL)
The process is managed by both the ETL layer and the database engine.
The ETL layer includes a Merge Join component which left outer joins every incoming
record to a matching dimension record if one already exists. New records which don’t
already exist in the dimension will be bulk inserted, with a full lock allowed on the
destination table.
Existing records will be loaded into a staging table and then merged into the destination dimension in a single operation. The Merge operation takes care of the same multi-stage process required for Type 2 changes as described for method 2: expiring the existing version of each changed record and inserting a new version containing the updated values.
This method is very similar to method 2 in its approach, utilising the ETL pipeline to
distinguish the new and existing records, and processing both streams in bulk.
The key difference is the technique used to cross reference incoming records against
the existing dimension records. Method 2 uses a ‘Lookup’ approach, whereas this
method replaces it with a Merge Join.
The Lookup transformation uses an in memory hash table to index the data (Microsoft
2011), with each incoming record looking up its corresponding value in the hash table.
This means the entire existing dimension must be loaded into memory before the ETL
script can begin, and it remains in memory for the duration of the script.
The Merge Join transformation however applies a LEFT OUTER JOIN between the
incoming data and the existing dimension data. The downside of this is that both data
sets must be sorted prior to processing which can add a sizeable load to the data
sourcing. However, the existing dimension records only need to be kept in memory
whilst they are being used within the ETL processing pipeline. This has the advantages
of requiring potentially less memory as well as a reduced processing time prior to
execution, assuming the sort operations can be processed efficiently.
These two approaches can draw parallels with the different query join techniques
compared by Muslih and Saleh (Muslih and Saleh 2010), from which they identified a
sizeable difference in performance.
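For readers more familiar with T-SQL than SSIS, the contrast can be illustrated with query join hints, which force the database engine to use the equivalent physical join operators. This is only an analogy to the SSIS Lookup and Merge Join components, not part of the test packages themselves, and the table names are the hypothetical ones used earlier.

-- Analogous behaviour in T-SQL: the same logical join executed with different physical operators.

-- Roughly analogous to the Lookup approach: each incoming row probes the existing dimension.
SELECT c.CustomerID, d.CustomerKey
FROM   dbo.Customer AS c
LEFT OUTER JOIN dbo.DimCustomer AS d
       ON d.CustomerID = c.CustomerID AND d.IsCurrent = 1
OPTION (LOOP JOIN);

-- Roughly analogous to the Merge Join approach: both inputs sorted, then merged in a single pass.
SELECT c.CustomerID, d.CustomerKey
FROM   dbo.Customer AS c
LEFT OUTER JOIN dbo.DimCustomer AS d
       ON d.CustomerID = c.CustomerID AND d.IsCurrent = 1
OPTION (MERGE JOIN);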
Method 4: Bulk inserts and updates (DB), using a single T-SQL Merge

All records from the ETL pipeline will be loaded into a staging table, regardless of whether they are new or changed rows. They are then merged into the destination dimension table in a single operation. The single merge statement performs three actions on all records within a single transaction: inserting brand new records, expiring the current version of each changed record, and inserting a new version of each changed record.
This is the method proposed by Thornthwaite (Thornthwaite 2008) and Bethke (Bethke 2009), making use of advances and new functionality in the T-SQL language and database engine. Once the technique has been learned it is also very fast and simple to implement; a sketch of the statement is shown below.
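The sketch below illustrates the general shape of such a statement against the hypothetical DimCustomer and staging tables used in earlier examples. It follows the pattern described by Thornthwaite, but the exact statement used in the test packages may differ.

-- Illustrative Type 2 MERGE (hypothetical schema). The MERGE expires changed rows and inserts
-- brand new rows; its OUTPUT clause then feeds the insert of new versions of the changed rows.
DECLARE @LoadDate DATETIME = GETDATE();

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Address, ValidFrom, ValidTo, IsCurrent)
SELECT CustomerID, CustomerName, Address, @LoadDate, NULL, 1
FROM
(
    MERGE dbo.DimCustomer AS tgt
    USING dbo.StagingCustomer AS src
        ON tgt.CustomerID = src.CustomerID AND tgt.IsCurrent = 1
    WHEN NOT MATCHED BY TARGET THEN                      -- brand new customer
        INSERT (CustomerID, CustomerName, Address, ValidFrom, ValidTo, IsCurrent)
        VALUES (src.CustomerID, src.CustomerName, src.Address, @LoadDate, NULL, 1)
    WHEN MATCHED AND tgt.Address <> src.Address THEN     -- a tracked attribute has changed
        UPDATE SET tgt.ValidTo = @LoadDate, tgt.IsCurrent = 0
    OUTPUT $action AS MergeAction, src.CustomerID, src.CustomerName, src.Address
) AS changes
WHERE changes.MergeAction = 'UPDATE';   -- only expired rows need a new current version inserting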
As can be seen in Figure 5, this is a much simpler process to implement within the ETL pipeline in SSIS, as the complexity of the process is contained entirely within the Merge statement.
Tests
All four ETL methods will be run against numerous sets of test data, with varying sizes
of destination data and percentages of change data. The proposed tests to be
conducted are presented in Table 2:
Table 2 – Summary of tests covering different data volumes for new/changed data
H. Statistical Analysis
Logarithmic intervals of sample percentages will be used in order to examine both
small and large test sets.
Each ETL package will contain duration measurement functionality which will log how
long each test takes to complete. This duration is taken as the result for each test.
When repeated for each of the hardware platforms, and then for each of the four load methods, this will result in 200 tests. To help mitigate any external influencing factors, each test will be run three times, resulting in 600 individual data load tests being run.
The results of each test will be analysed, with the four ETL methods compared using
statistical techniques appropriate for the distribution of the results, such as a
univariate analysis of variance (ANOVA). This will reveal whether there is any
statistically significant difference in performance between the methods for each test.
A decision tree data mining technique will also be employed to analyse the influence of
the parameters on the preferred method.
The specification of the test server hardware platform was largely influenced by Tony Rogerson’s work on the Reporting-Brick (Rogerson 2012).
The first storage platform will be a RAID 10 array of 7,200 rpm hard disks internal to the server. It is common for corporate database servers to use an external NAS (network attached storage) system of 15,000 rpm drives for storage; however, in the interests of creating an isolated environment, maximising performance and reducing the
associated costs, internal 7,200 rpm drives will be used. A RAID 10 array has been used to provide the increased performance expected in a corporate environment.
The second storage platform will be a solid state 160GB Fusion-IO ioXtreme card, directly attached to the server’s PCI Express (PCIe) bus.
The purpose of these tests is to identify the performance of loading data into the data warehouse; it is therefore important to isolate the performance of data retrieval from the source systems and ensure that data sourcing does not have an impact on the results. The server will therefore also be equipped with a separate solid state drive which will serve the source data to the ETL tests.
The tests will be run within a Hyper-V virtual machine provisioned with 4 cores and 12GB RAM (random access memory), running 64-bit Windows Server 2008 R2 and SQL Server 2012 Enterprise edition. The host server is a 6-core AMD Phenom II X6 1090T 3.2GHz with 16GB RAM, running 64-bit Windows Server 2008 R2.
The ETL tasks will rely heavily on RAM. Further tests could be run using different
amounts of RAM in order to introduce this as a factor into the method comparison;
however this remains outside the scope of this project.
Database engines make heavy use of caching in order to optimise the performance of repeated tasks. This would distort the performance tests being run, penalising the first tests and benefiting later tests. To remove this influence, all services (database, ETL engine, etc.) will be restarted between each test to clear the RAM and reset any caches.
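For reference, SQL Server also exposes commands that clear its own caches without a full service restart; a minimal sketch is shown below. Restarting the services, as described above, goes further by also clearing the SSIS process memory, so the two approaches are complementary rather than interchangeable.

-- Clear SQL Server's buffer pool and plan cache between test runs
-- (the test procedure described above restarts the services instead,
-- which also resets the SSIS engine's memory).
CHECKPOINT;                -- flush dirty pages so the buffer pool can be dropped cleanly
DBCC DROPCLEANBUFFERS;     -- empty the data cache (buffer pool)
DBCC FREEPROCCACHE;        -- empty the plan cache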
All results will be collected from managed tests against databases created specifically
for this task, which will not require permission from any third party.
It is not expected that any problems will be encountered relating to the issues of
access or ethics.
4. Results

Figure 6 presents a series of charts showing the average duration of the three instances of each test. These are grouped by the number of new rows and changed rows. Each chart compares the average duration of tests for each hardware and method combination.
A number of findings can be drawn from this, before any statistical analysis has been
performed.
The Singleton method, when used with traditional hard disks, performed considerably worse than any other method for large data volumes (>= 0.5m) of either new or changed rows. This was expected, and confirms the advice of Mundy et al (Mundy, Thornthwaite and Kimball 2006), who advise that the Slowly Changing Dimension component is only suitable for data sets of fewer than 10,000 rows.
It should be noted that the Singleton method actually outperforms all other methods on both hardware platforms when fewer than 5k new or changed rows are being loaded. The recommendation therefore stands that the SCD component should only be used for small data sets; however, the hardware platform clearly has an impact on what is considered a small data set.
When the Singleton approach is excluded, the remaining three methods are much closer together in their performance; however, the Lookup method is consistently the slowest of the three in the vast majority of the tests.
Figure 6 – Average duration of each test, grouped by new and changed rows, comparing the methods for each hardware platform
Figure 7 (HDD) and Figure 8 (SSD) show the same results, grouped by method. The left-hand column of charts groups the results by the number of new records, showing the number of changed rows within each group. The right-hand column shows the opposite, with the changed row count in the outer grouping.

The difference in pattern is immediately obvious, with the right-hand column of charts showing a much stronger correlation. This indicates that the number of changed rows is the driving factor in determining the load time, with the number of new rows having less of an impact.
These results will be examined in more detail using appropriate statistical analysis.
On the face of it, the distribution of the raw duration results is not normal, but heavily positively skewed with a seemingly exponential shape.
To test this, a general linear model (PROC GLM) was run using the code presented in Appendix 1. The normal probability plot of the studentised residuals shown in Figure 10 passes through the origin but is clearly far from a straight line. The assumption of near-normality of the random errors is therefore not supported by this model.
Given the logarithmic intervals of the new row and changed row input variables, the same test was run against the logarithm of the duration result, using the code presented in Appendix 2. The resulting normal probability plot is shown in Figure 11 below. This shows that in most cases the studentised residuals conform to an approximate straight line of unit slope passing through the origin. There is, however, a sizeable number of points forming a noticeable tail, resulting in a curvilinear plot indicating negative skewness. Although most points conform, the assumption of near-normality of the random errors is not supported when using the logarithm of the duration result.
The plot presented in Figure 12 above shows that the studentised residuals are not randomly scattered about a mean of zero; the variance appears to decrease as the fitted value increases.
The model used above treats the hardware and method as categorical factors and the
new and changed rows as numerical variables. The test was then repeated with all
inputs treated as categorical factors, using the SAS code presented in Appendix 3.
Categorical inputs
Figure 13 – Normal Probability Plot (QQ Plot) of Studentised Residuals (Log) with categorical variables
The plot presented in Figure 13 shows that the studentised residuals clearly conform
to an approximate straight line of unit slope passing through the origin. In this model
there is no tail of non-conforming values, indicating that the assumption of the near-
normality of the random errors is supported. This is further supported by the plot
presented in Figure 14, which shows that the studentised residuals are randomly
scattered about a mean of zero.
The smaller ranges at the extremes of the plot are likely to reflect the smaller number of observations at these extremes rather than a genuine reduction in variance. Figure 15 below shows a histogram of the studentised residuals, which fits the superimposed normal curve very closely. The studentised residuals therefore appear to be symmetrically distributed and unimodal, as required.
Based on this evidence, normal distribution of the error component can be assumed, and the ANOVA test using a general linear model is an appropriate form of analysis for this data when all input parameters are treated as categorical factors.
The problem with this model is that, as can be seen from the results in Appendix 4, the number of factor combinations and the number and complexity of the significant interactions make interpretation very challenging. Treating the row counts as categorical factors also does not provide sufficient information in the statistical analysis results to interpolate or extrapolate the expected performance characteristics for data volumes not tested in this research. This reduces the ability to apply the findings of this research to real-world scenarios.
The Singleton method has already been discounted as a viable option for all scenarios
where the data volumes exceed 5k rows, as found in the original data plots in Figure 6.
Where necessary, the Singleton method’s performance characteristics can be
extracted from the categorical factor model analysis, with the scalability analysis for
the remaining methods derived from the numerical variable model.
The SAS code to generate the revised numerical model is presented in Appendix 13.
The analysis of the studentised residuals in Figure 16, Figure 17 and Figure 18 below
show that the numerical model is an appropriate form of analysis for this data, when
the singleton method is excluded from the results.
The plot presented in Figure 16 shows that the studentised residuals clearly conform
to an approximate straight line of unit slope passing through the origin.
The studentised residuals shown in Figure 17 appear randomly scattered about a mean
of zero. Again, the reduced range at the extremes of this plot reflects a smaller number
of observations. The histogram shown in Figure 18 shows a very close fit to the
superimposed normal curve.
Figure 16 - Normal Probability Plot (QQ Plot) of Studentised Residuals (Log), numerical row counts, excluding
Singleton
Figure 17 - Plot of Studentised Residuals against Fitted Values, numerical row counts, excluding Singleton
Figure 18 - Histogram of the Studentised Residual, numerical row counts, excluding Singleton
The results from the Analysis of Variance (ANOVA) test using row counts as categorical
factors are presented in Appendix 4.
The ANOVA results presented in Appendix 4 show that, with p values of <0.0001, all of
the individual explanatory terms are highly statistically significant, and therefore have
a proven impact on the duration of the ETL load.
With p values of <0.0001, all of the interactions between the explanatory factors are also highly statistically significant, the only exception being the four-way interaction between all of the factors: Method, Hardware, ChangeRows and NewRows.
By itself this doesn’t provide much in the way of useful information for interpretation.
However by conducting further analysis of the least squares means (LS Means, or
marginal means) of the lower order factors it’s possible to investigate the relative
influence of the factors and their interactions.
The SAS code for this analysis is presented in Appendix 5 with the results presented in
Appendix 6 through Appendix 12.
Table 3 below shows the least squares means analysis comparing just the method, excluding all other factors and interactions, with the Join method as the baseline (Appendix 6). The performance degradation caused by the Lookup and Singleton methods is clearly visible, with the Singleton method being by far the worst performing. The Merge and Join methods are very close in performance, with Join being the marginally better choice.
The hardware choice, excluding any interactions, also shows a sizeable difference as
shown in Table 4, using the results from Appendix 7. As expected, the solid state
storage outperforms traditional hard disks.
When the method is introduced, the interaction effects show the differing impact on performance for each combination, with the combined least squares means shown in Table 5 below and the full results presented in Appendix 8.

The solid state storage tests showed consistently better performance across all methods. The interactions between hardware and method show that the Singleton and Merge methods benefit more from solid state storage than the other methods.
Table 6 below shows the least squares means analysis of the number of new and changed rows, excluding any other interactions. The LS means clearly increase at a visibly consistent rate as the new and changed rows are increased, with a larger increase for the number of changed rows. As the analysis is of the log of the result, this indicates that the impact is increasing in an approximately exponential fashion, which would be expected as the input row counts also increase exponentially. The full results are presented in Appendix 9.
It is also interesting to note that the interaction between new rows and changed rows is highly statistically significant, with the details presented in Table 7 below. This shows that the effects on log(result) are not additive, i.e. the log of the result for a combined load is lower than for two separate loads of the new and changed rows. Even though the least squares means reflect an interaction, the pattern of the values is consistent throughout, with higher log times for greater numbers of new and changed rows.
The next analyses, the results of which are presented in Appendix 10, investigate the
interactions between the method and increasing row counts.
Merge is found to perform the best in low data volume scenarios, and in all tests with
a low volume of new rows (500k and less), with Join becoming preferable with higher
volumes of new rows. Singleton proves comparable at very low data volumes (5k and
less), but scales very poorly. Lookup performs better than Singleton in tests with
greater than 50k rows, but only marginally. Although never the worst performing
method, Lookup is also never the best. This is visualised in the chart in Figure 19, which clearly shows that the Singleton method provides comparable performance for data volumes up to 5k new rows and 5k changed rows, but not beyond. This confirms the earlier findings from the analysis of Figure 6, as well as the published guidance discussed in the literature review (Mundy, Thornthwaite and Kimball 2006).
Figure 19 - Combined Least Squares Means for Method and Varying Input Row Counts
The statistics presented earlier confirm that the two best performing methods are the T-SQL Merge method and the SSIS Merge Join method, with no statistically significant difference between them. This reaffirms the findings from the initial plots in Figure 6.
The next investigation focuses on these two methods, and examines the interaction
between these methods and other parameters. Note that the statistical model used
for this excludes the other two methods (Lookup and Singleton), and is analysing a
subset of the original test data. The code for this is presented in Appendix 11, with the
results presented in Appendix 12.
The more detailed investigation again shows that there is no significant difference between the methods, with a p value of 0.7182; however, the hardware and all two and three way interactions are highly significant at the 1% level, with the exception of method against hardware, which is only just significant at the 5% level.
When looking at the parameter estimates, the Merge method is significantly better
than Join for the baseline data of SSD and zero new & change rows, with a relative
parameter estimate difference of 0.426, significant at the 5% level.
All other 2 way interactions prove to be significant, again highlighting the complex
nature of the performance characteristics of ETL loads.
In this section the numerical variable model is interpreted; this model excludes the Singleton method and treats the new and changed rows as numerical variables, using the code presented in Appendix 13. Note that due to the very large values of new and changed rows, the parameter estimate per row is extremely small. To improve the accuracy of the analysis, the number of rows has been divided by 1000 to allow greater precision in the parameter estimates.
Treating the row counts as numerical variables allows a more in depth analysis of the
impact on ETL duration of varying numbers of new and change rows for the Join,
Merge and Lookup methods.
The results from the reduced model, excluding non-significant interactions, are
presented in Appendix 14.
With a parameter estimate of 0.990, hard disks are significantly slower than solid state storage.

Without taking any interactions into account, the number of changed rows has a significantly higher impact on performance degradation than the number of new rows, with parameter estimates of 476×10⁻⁹ and 380×10⁻⁹ respectively.

Also of note is the interaction between new and changed rows. This interaction is statistically significant but appears to have a negligibly small parameter estimate. However, the estimate is applied to the product of the new and changed row counts, each of which can be as high as 5m, giving a product of (5m)². The interaction therefore makes a material difference to the model at high data volumes.
Table 8 – Combined parameter estimates per row of input data, by hardware and method
From this we can see that the Lookup method starts out from a worse performing
position, with a starting parameter estimate of 6.737 against 5.827 for the other
methods on a hard disk platform, and 5.747 against 4.837 for solid state.
The log duration increases as the number of change rows increases, but at the same
rate for all three methods and for both hardware platforms. The impact of change
rows increasing is higher than that for new rows.
It should be noted that, although the parameter estimate (for log duration) per
change row is the same for HDD and SSD, SSD has a lower baseline value, so the impact
on the untransformed duration is smaller; that is, SSD scales much better than HDD for
increasing volumes of change rows. Contrast this with the increased parameter
estimates for SSD for volumes of new rows when compared with HDD. This confirms
the findings of Rogerson (Rogerson 2012) and Shaw and Back (Shaw and Back 2010)
that the performance gains of SSD are best realised in random IO scenarios such as
database updates, rather than sequential IO such as database inserts.
The log duration of the load increases as the number of new rows increases, with the
largest increase per row for the Merge method, followed by Join, with Lookup
increasing the least, for both hardware platforms. This indicates that the gap between
the Lookup method's parameter and those of the other two will decrease as data
volumes increase. It should be noted, however, that this model estimates the log of the
load duration, not the duration itself. The Join method also has a higher baseline
parameter estimate than the Merge method on both hardware platforms; although the
two start out with comparable performance, the Join method is likely to scale better at
high volumes of new data.
These findings are backed up by the visualisations in Figure 20 and Figure 21, which
show the effect on the parameter estimate of increasing the volume of input rows.
Figure 20 – Chart comparing the parameter estimates for the methods using HDD with increasing data volumes
Figure 21 - Chart comparing the parameter estimates for the methods using SSD with increasing data volumes
As the dependent variable being analysed is the log of the duration, the following two
charts, Figure 22 and Figure 23, show the same data but with the parameter estimates
transformed back into duration (in seconds) by taking the exponential of the
parameter estimate.
Figure 22 - Chart comparing the estimated load duration for the methods using HDD with increasing data volumes
Figure 23 - Chart comparing the estimated load duration for the methods using SSD with increasing data volumes
These charts clearly show that although the parameter estimate increases less for the
lookup method than the other methods, the logarithm transformation hides the fact
that the lookup method scales far worse than the merge or join methods.
It can also be seen that the performance characteristics of the methods when using
SSD are very similar to those when using HDD. Therefore, despite the significant
performance improvement offered by solid state storage, the relative comparison
between the methods, and hence the choice of method, remains largely unchanged.
These charts represent the performance characteristics when loading data with a
new/change split of 25%/75%. The characteristics and nature of the curves will change
if this split is varied. The following two plots in Figure 24 and Figure 25 show the same
curves when the split is reversed, at 75% new rows and 25% change rows.
Figure 24 - Chart comparing the estimated load duration for the methods using HDD with increasing data volumes
Figure 25 - Chart comparing the estimated load duration for the methods using SSD with increasing data volumes
These plots show that, with a higher proportion of new rows to change rows, the
Merge method scales considerably worse on the HDD hardware platform, and slightly
worse when using SSD.
It should be noted that the durations presented on the y axis of the charts are only of
relevance to the hardware configuration used in this research. Different hardware
platforms with different CPUs, memory etc. will produce a different scale on the y
axis; however, the characteristics and nature of the performance comparison would be
expected to be consistent.
All of the above charts show an exponential increase in ETL load duration as the input
data volumes increase. It is expected that this is in large part caused by limitations of
hardware resources: a server has a finite amount of memory, a finite database cache
and so on. At lower data volumes the exponential curve closely approximates a linear
relationship, whereas the curvature of the lines becomes far more apparent at higher
data volumes, which is to be expected as system resources reach capacity.
Further research should be performed to investigate the scalability of the ETL methods
and the effect on their performance as server resources are increased.
D. Projection Model
The models discussed above and presented in Figure 22 to Figure 25 use formulae
derived from the parameter estimates of the various terms in the model. As discussed,
the scale of the duration will be affected by the specific details of the hardware
platform, however the characteristics should be relatively consistent. The terms a and b
are included to provide customisation for different hardware platforms; these should
take the values 0 and 1 respectively to reproduce the model used in this research.
The formulae for the models are presented in Equation 1 to Equation 6 below.
Equation 1 – ETL Duration formula for using the Join method on HDD
Equation 2 - ETL Duration formula for using the Merge method on HDD
Equation 3 - ETL Duration formula for using the Lookup method on HDD
Equation 4 – ETL Duration formula for using the Join method on SSD
Equation 5 - ETL Duration formula for using the Merge method on SSD
Equation 6 - ETL Duration formula for using the Lookup method on SSD
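As a minimal illustration of the general shape of these formulae, the following T-SQL sketch estimates a load duration from the fitted log-duration model. It assumes the hardware customisation terms a and b described above, and uses only approximate values quoted in this chapter (a non-Lookup HDD intercept of 5.827, with per-row slopes of 476×10⁻⁹ for change rows and 380×10⁻⁹ for new rows); the full equations also include method, hardware and interaction terms, so this sketch should not be used in place of them.
-- Illustrative sketch only; coefficient values are approximate and taken from the text.
DECLARE @a float = 0,                    -- hardware offset term (0 reproduces the research model)
        @b float = 1,                    -- hardware scaling term (1 reproduces the research model)
        @baseline float = 5.827,         -- illustrative intercept: a non-Lookup method on HDD
        @perChangeRow float = 476e-9,    -- illustrative log-duration increase per change row
        @perNewRow float = 380e-9,       -- illustrative log-duration increase per new row
        @changeRows float = 500000,      -- changed rows in the incoming load
        @newRows float = 500000;         -- new rows in the incoming load
-- The model predicts log(duration), so the estimate is transformed back with EXP()
SELECT EXP(@a + @b * (@baseline
                    + @perChangeRow * @changeRows
                    + @perNewRow * @newRows)) AS EstimatedDurationSeconds;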
Figure 26 – Decision Tree showing the probability of being the best method for a given scenario
E. Decision Tree
Each of the methods was ranked within each test, with the best performing method
given a rank of 1 and the worst performing method a rank of 4.
A decision tree data mining algorithm was then applied to this rank data to determine
the decision process a user should use to identify the best method for a given scenario.
This was performed using the Microsoft Decision Trees Algorithm within SQL Server
Analysis Services.
Four input variables were used (Method, Hardware, NewRows and ChangeRows), with
the Rank being predicted. The results of this are presented in Figure 26 on the previous
page.
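As an illustration of the ranking step, a window function such as the following hypothetical T-SQL could be used to derive the rank from the raw timing results before they are fed into the Analysis Services model; the table and column names (dbo.TestResults, LoadDurationSeconds) are illustrative only and not those of the research database.
-- Illustrative only: rank the four methods within each test scenario,
-- the fastest method receiving rank 1 and the slowest rank 4.
SELECT MethodName,
       Hardware,
       NewRows,
       ChangeRows,
       RANK() OVER (PARTITION BY Hardware, NewRows, ChangeRows
                    ORDER BY LoadDurationSeconds ASC) AS MethodRank
FROM dbo.TestResults;
The resulting rank column corresponds to the Rank attribute predicted by the decision tree.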
A number of conclusions can be drawn from the resulting decision tree map.
The Singleton method ranks last more than any other method, in 67% of the tests.
However, it still ranks 1st in 14% of cases. Tracing the Singleton path through to levels 6
and 7, it is clear that the most effective situation for this method is where SSD
hardware is used with a small number of change rows (<= 5k).
The Lookup method ranks 3rd in 53% of tests, only ranking 1st in 7%; the majority of
the cases where it ranked 1st had zero changed rows.
The Merge and Join methods are ranked similarly, with the Join method preferred in
44% of cases and Merge in 34%. Merge is the preferred method when there are 50k
new rows. The Join method ranks better when there are a higher number of change
rows: it ranked 1st in only 10% of cases with zero change rows, 36% of cases with 5k
change rows and 58% of cases with 50k changes and above.
F. Dependency Network
The resulting dependency network, presented in Figure 27, shows that the strongest
influencer of achieving the top rank is the Method itself. This indicates that the methods
are relatively stable with respect to being ranked 1st.
The number of change rows has the next strongest influence, followed by the number
of new rows.
The hardware platform influences the rank the least of all the variables.
5. Discussion
This chapter takes the statistical analysis performed in the previous chapter and breaks
it down into a number of summarised interpretations applicable to real world
scenarios. It relates the findings to those identified in the literature review, and aims
to provide those embarking on the development of a new ETL system with sufficient
knowledge from which to make informed choices.
A. Singleton Method
The statistical analysis shows that the singleton approach to loading SCD data offers
significantly lower performance than other methods in most scenarios.
The analysis presented in the discussion of Figure 6 and Figure 19 shows that the
singleton method has comparable performance to the other methods with zero new
and changed records, but that its performance degrades far more dramatically than
that of the other methods as data volumes increase. This indicates that the singleton
method is a potentially viable option for low data volume scenarios, especially when
solid state storage is in use.
The decision tree in Figure 26 shows that the singleton approach is particularly well
suited to <= 5k changed rows when solid state storage is used, and when the number
of new rows is less than 5m. The charts in Figure 6 also confirm this visually,
highlighting that this approach is well suited to low volumes of new and change
records (<=5k), especially when using solid state storage. The recommendation offered
by Mundy et al (Mundy, Thornthwaite and Kimball 2006) that the singleton approach is
most suited to small datasets with less than 10,000 rows is therefore confirmed.
All analysis shows that this approach is the least preferred method in most other cases.
These findings also confirm the findings of Olsen and Hauser (Olsen and Hauser 2007)
and Peter Scharlock (Scharlock 2008) that, when loading any sizeable volume of data,
bulk set based operations are preferable to row based singleton operations.
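To make the contrast with set based loading concrete, the following is a minimal T-SQL sketch of a row-by-row type 2 update using a cursor. It illustrates the general singleton pattern of processing one incoming record at a time rather than the specific implementation benchmarked in this research, and all table and column names are hypothetical.
-- Illustrative sketch of a row-by-row (singleton) type 2 load; names are hypothetical.
DECLARE @BK int, @Name nvarchar(100), @City nvarchar(100);
DECLARE src CURSOR LOCAL FAST_FORWARD FOR
    SELECT CustomerBK, Name, City FROM dbo.StageCustomer;
OPEN src;
FETCH NEXT FROM src INTO @BK, @Name, @City;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Expire the current version if any tracked attribute has changed
    UPDATE dbo.DimCustomer
    SET IsCurrent = 0, ValidTo = GETDATE()
    WHERE CustomerBK = @BK AND IsCurrent = 1
      AND (Name <> @Name OR City <> @City);
    -- Insert a new current version for brand new or changed members
    IF NOT EXISTS (SELECT 1 FROM dbo.DimCustomer
                   WHERE CustomerBK = @BK AND IsCurrent = 1)
        INSERT dbo.DimCustomer (CustomerBK, Name, City, ValidFrom, ValidTo, IsCurrent)
        VALUES (@BK, @Name, @City, GETDATE(), '9999-12-31', 1);
    FETCH NEXT FROM src INTO @BK, @Name, @City;
END
CLOSE src;
DEALLOCATE src;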
It should be noted, however, that even though the Singleton method offers the best
performance for these very low data volumes, the maximum benefit compared to the
next best performing method (T-SQL Merge) was only 54 seconds. The benefit is
therefore minimal when compared to the significant performance degradation as
volumes scale up.
B. Lookup Method
All analyses indicate that using the Lookup method should be avoided. The charts in
Figure 6 show that although it is rarely the worst performing method, it is very rarely
the best performing method. This is confirmed by the statistical analysis presented in
Table 3 and Table 5. Figure 22 and Figure 23, showing the duration estimates for HDD
and SSD from the ANOVA model, both show a clear problem with the Lookup method,
both in its initial performance and in its scalability when compared with the Merge and
Join methods.
The decision tree in Figure 26 shows that all bar one of the instances when this is the
preferred option are when there are zero changed records. As the purpose of a type 2
SCD is to manage changes, this is expected to be a rare occurrence in reality. It is
therefore advised not to use the Lookup method as a high performance load option.
It should be noted that these results may be skewed by the large base data set used
(50m rows). The Lookup method requires the entire base data set to be loaded into
memory before ETL processing can begin, making this method more susceptible to
memory availability and increases in the base data set size. Further investigation
should be performed on smaller base sets to identify whether this method is more
appropriate in smaller scale scenarios which are out of scope of this research.
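As a rough, purely illustrative calculation of that memory pressure, assuming a hypothetical 100 bytes per cached row for the business key, surrogate key and tracked attributes, a 50m row base dimension implies a lookup cache in the region of 5 GB before a single incoming row can be processed:
-- Illustrative only: approximate full-cache size for the Lookup method,
-- assuming a hypothetical 100 bytes per cached dimension row.
SELECT CAST(50000000 AS bigint) * 100 / 1024.0 / 1024.0 / 1024.0 AS ApproxCacheSizeGB;  -- roughly 4.7 GB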
C. Merge and Join Methods
Figure 22 to Figure 25 indicate that at very high volumes of input data, the Join
method is usually preferable, which is backed up by the raw test results visualised in
the charts in Figure 6. Figure 24 shows that this is most prominent for traditional hard
disks and where there are a high proportion of new rows compared to change rows,
where the performance of the methods starts to diverge as early as 500k input rows.
On SSD the divergence starts at 3m rows. However, where there is a high proportion of
change rows to new rows, Merge always outperforms Join for all data volumes on SSD,
and up to 2m input rows on HDD.
The charts presented in Figure 6 show that the Merge method performed better than
the Join method in all cases with lower data volumes, specifically <=5k changed rows
and <=50k new rows, for both hardware platforms. The Join method seems to scale
better, with marginally improved performance when compared against Merge as
either new or change rows reach and exceed 500k rows. This is confirmed by the
results presented in Appendix 10.
The decision tree presented in Figure 26 finds that the Join method is the best option
in the largest proportion of cases, followed very closely by the Merge method. Merge
performs top in 31% and 2nd in 47% of tests, with Join performing top in 44% and 2nd
in 37% of cases.
The decision tree then refines the criteria for each, showing Join as unsuitable when
there are zero change rows, and showing Merge as most suitable when there are 50k
new rows.
These two approaches compete for the role of the best performing method, with each
marginally outperforming the other in different scenarios.
Given the comparable performance of the two methods, it should be left to the system
architect to determine the best approach, taking into account other factors such as
speed of development, maintainability, experience, code flexibility etc.
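For reference, the following is a minimal sketch of the widely used T-SQL MERGE pattern for a type 2 load, following the general approach described by Thornthwaite (2008): the MERGE statement inserts brand new members and expires changed ones, while its OUTPUT clause feeds an outer INSERT that creates the new current version of each changed member. It is illustrative only, is not the exact statement benchmarked in this research, and all table and column names are hypothetical.
-- Hedged sketch only: a common T-SQL MERGE pattern for a type 2 SCD load.
INSERT INTO dbo.DimCustomer (CustomerBK, Name, City, ValidFrom, ValidTo, IsCurrent)
SELECT CustomerBK, Name, City, ValidFrom, ValidTo, IsCurrent
FROM (
    MERGE dbo.DimCustomer AS tgt
    USING dbo.StageCustomer AS src
        ON tgt.CustomerBK = src.CustomerBK AND tgt.IsCurrent = 1
    -- Brand new members: insert the first (current) version directly
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerBK, Name, City, ValidFrom, ValidTo, IsCurrent)
        VALUES (src.CustomerBK, src.Name, src.City, GETDATE(), '9999-12-31', 1)
    -- Changed members: expire the existing current version
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
        UPDATE SET tgt.IsCurrent = 0, tgt.ValidTo = GETDATE()
    OUTPUT $action AS MergeAction,
           src.CustomerBK AS CustomerBK,
           src.Name AS Name,
           src.City AS City,
           GETDATE() AS ValidFrom,
           CAST('9999-12-31' AS datetime) AS ValidTo,
           1 AS IsCurrent
) AS changes
-- Only members expired by the UPDATE branch need a new current version inserting
WHERE changes.MergeAction = 'UPDATE';
The SSIS Merge Join method achieves a similar outcome within the data flow, typically by joining sorted streams of incoming and existing dimension rows to detect new and changed members.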
D. Storage Platform
The dependency network in Figure 27 also confirms that the storage platform has the
least influence of all the parameters when considering which design method offers the
best performance.
The statistical analysis however confirms that the use of solid state storage provides a
significant improvement in load performance in every scenario. The use of SSD
technology will therefore have a large beneficial impact on the duration of the data
loads in all cases.
Although the use of SSD should not alter the design decisions that are made when
planning a new data load project, it is clear that the technology will significantly
improve the performance of any implementation it is applied to.
As can be seen from Figure 6, the performance benefit of SSD is most noticeable with
the singleton method, with the impact increasing at higher volumes of change
records. In some cases the performance improvement was up to 92% (12.5x
performance) on like for like tests. The nature of this performance gain can be
attributed to the characteristics of solid state, as presented by Shaw and Back (Shaw
and Back 2010), Fusion IO (Fusion-IO 2011) and Tony Rogerson (Rogerson 2012); the
singleton method relies very heavily on random read operations to read each existing
dimension record, one at a time. The biggest performance difference between
traditional disks and solid state storage is the performance of random reads, which
explains the slow results when using traditional disks and the significant improvement
when using solid state technology.
The timing results show that the impact of solid state storage was smallest in tests
with 5m new rows, although still providing on average a 52.9% performance
improvement (2.1x). Loading new records requires largely sequential IO, writing
all new rows in a single sequential block. This does not exploit the random IO
benefits of solid state; however, solid state still provides a significant improvement in
performance of at least 19.5% (1.2x) in the worst case scenario for the singleton
method (5m new rows, 0 change rows).
Although earlier analysis showed that the use of solid state devices shouldn’t change
the design approach for a new system, this shows that it can be a very effective
solution to improve the performance of existing systems which may not have been
designed in an optimal way, and may negate the need to rewrite systems that are
approaching the limit of the available data load window.
The dominance of the change records over the new records is backed up by the
dependency network in Figure 27 as well as visually in Figure 7 and Figure 8.
Figure 22 and Figure 24 also show that the ratio of new to change rows can impact
the relative performance of ETL load methods, with Merge scaling comparatively much
better when there is a higher proportion of change rows, and worse when there is a
low proportion of changes to new rows.
6. Conclusion
The results and analyses of this research have identified a number of criteria that affect
the performance of loading data into Type 2 data warehouse slowly changing
dimensions. This chapter provides a high level overview of the findings.
The use of solid state devices for data storage provides a significant benefit to the
performance of loading data in virtually every scenario, with performance benefits of
up to 92% (12.5x). The use of solid state storage, however, should not fundamentally
change how ETL systems are designed.
When determining the most appropriate method to manage the loading of Type 2
SCDs, both the T-SQL Merge and SSIS Merge Join methods offered significantly higher
performance than the other methods in most tests. Merge Join, however, should be
preferred for higher volume scenarios, where the number of new or changed rows
reaches or exceeds 500k. For other scenarios the choice can be determined by other
factors such as personal preference or server architecture.
The exception to this is where there are a very small number of changed rows, at 5k
rows or less, especially when solid state storage is in use. In these cases a Singleton
approach becomes feasible from a performance perspective. However, considering the
small benefit over other methods, as well as the inability of the method to scale, it is
recommended that the Singleton approach is not adopted.
It should be noted that this research focuses entirely on batch ETL load systems. As
described in the introduction, there is a growing trend towards real-time data
warehouse systems which by their very nature need to load small volumes of data as
soon as they’re received. The entire load framework is therefore constrained by design
to use a singleton approach to load the incoming data. The findings in this research
show that solid state storage systems should be of particular interest to these
scenarios, as they should be able to leverage the maximum possible benefit from SSD
technology.
This research has focused entirely on the performance of the methods and other
variables. In reality the run-time performance is only one of a number of factors which
need to be considered, including implementation complexity, development
duration, hardware cost, resource/skill availability and simplicity/ease of maintenance.
Given the lack of detailed analysis found during the research phase of this work, the
author hopes that this project will go some way towards filling that void, providing
business intelligence architects, designers and developers with greater confidence
when selecting an ETL methodology.
7. Evaluation
The issue of loading data into data warehouse dimensions is in itself an extremely
broad topic. This research has attempted to provide detailed analysis on the core
functionality in order to provide direction to anyone embarking on a new ETL project.
It should be noted however that due to the sheer number of possible factor
combinations, a single research investigation is unable to cover all possible scenarios.
This research has investigated the primary factors and provided a comprehensive
understanding of the nature of those factors. The results will however not necessarily
hold true for every scenario.
Further research should be conducted exploring the impact of other variables, such as:
Server memory & other hardware specification – The considerable impact of the hard
disk platform has been shown in this research; however, this is only one of many
variables in hardware selection. The Lookup method is especially impacted by the
available memory due to its requirement to load the complete dimension into
memory; however, the impact on the other methods is not explored by this research.
The exponential nature of the performance curves, as presented in Figure 22 and
Figure 23, indicates that scalability is likely to be impacted by hardware constraints.
Changing the size of the base data set – The data set in this research used a static 50m
records. It’s possible that smaller or larger data sets may provide different results,
especially when tested in conjunction with the available server memory, and the
width/size of each record.
Storage Area Network (SAN) storage – This research used local storage for both
hardware platforms, HDD RAID 10 and SSD, in order to provide an isolated test
environment. The impact of the storage platform has been proven; it would therefore
be of interest to explore different storage platforms. It's common for data warehouses
in the real world to use storage area networks, which exhibit their own unique
performance characteristics.
Solid State Storage – The solid state device used in this research was a relatively low
performance card compared to some that are now available from a variety of
manufacturers. Fusion IO now offer a very wide range of cards, including an Octal card
which offers performance up to 8 times that of the card used in this project. This is
likely to amplify the HDD/SSD differences considerably, and may expose
performance characteristics not revealed by this research. Fusion IO is also only one of
many enterprise NAND/SSD storage providers, including X-IO and Violin, each of which
offers different performance characteristics.
Splitting the workload onto a number of servers – This research used a single server
to run the ETL process as well as the source and destination databases. These three
elements are often split up onto three separate servers to improve performance
further. This offers an opportunity to benefit from specific performance characteristics
of different load methods, based on the relative performance of the method. For
example the Singleton process relies heavily on the ETL server to manage the load
process, whereas the T-SQL Merge method offloads the bulk of the work to the
database server.
Loading data into multiple partitions – In large data warehouses it is common to
partition fact tables to improve query and load performance. It may also be of benefit
to explore the impact of partitioning dimension data, if the dataset is suitable.
Data throughput characteristics of retrieving data from source systems – The tests
performed in this project sourced the incoming data from a local solid state device in
order to exclude the performance of source data retrieval from the results. It’s
common for source systems to provide data at a rate slower than the capacity of the
ETL mechanism, reducing the impact of ETL method selection.
Derivative or alternative ETL load methods – There are countless enhancements and
alternative methods available aside from the four presented in this research. Third
party components, checksums and similar techniques all provide ETL load options not
this project. It would be of interest to take the two best methods identified by this
project (Merge Join and T-SQL Merge) and explore the impact of evolving these
further.
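As a purely illustrative example of the checksum idea (Novoselac 2009), a hash of the tracked attributes can be compared in place of column-by-column comparisons; the following hedged sketch shows the general shape, with hypothetical table and column names.
-- Illustrative sketch: identify changed members by comparing a hash of the
-- tracked attributes instead of comparing each column individually.
SELECT src.CustomerBK, src.Name, src.City
FROM dbo.StageCustomer AS src
JOIN dbo.DimCustomer AS dim
  ON dim.CustomerBK = src.CustomerBK AND dim.IsCurrent = 1
WHERE HASHBYTES('SHA1', CONCAT(src.Name, '|', src.City))
   <> HASHBYTES('SHA1', CONCAT(dim.Name, '|', dim.City));
In practice the hash would usually be computed once and persisted on the dimension row, so that only the incoming rows need hashing during each load.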
Different toolset – SQL Server Integration Services is only one of a number of toolsets
that can be used for ETL processing, including SAS Data Integration Server, Informatica
PowerCenter, Oracle Data Integrator and IBM InfoSphere. Although the theory behind
the load methods is broadly consistent across toolsets, each tool implements them
differently, so the relative performance observed here may not transfer directly to
other platforms.
This research has found significant differences in the performance of loading data,
depending on the hardware and method used. It is expected that most of the factors
above are also likely to have an impact on the load performance; some may change
the relative performance of the methods whereas others may not.
Analysing the interaction of the variables present in this research presented something
of a challenge due to the sheer number of statistically significant interactions. Increasing
the number of variables further would render statistical analysis even more complex,
so is unlikely to be feasible. It is therefore likely that further research would benefit
from selecting a different subset of the parameters, or from adopting an alternative
statistical method.
Given the scope of this research, and taking into account the limitations discussed
above, the findings provide clear guidance to data warehouse architects and
developers on the relative merits of the different load methods. It’s now clear that the
Merge Join and T-SQL Merge methods are equivalent in performance and in most
cases should be considered the only choices; the decision between them can be left to
personal choice or other input factors not considered here.
It’s hoped that the work undertaken here will be of benefit to any organisation looking
to implement a data warehouse, reducing both the cost and duration of development
by providing clear guidelines and reducing the need to perform investigative
prototypes.
It’s also hoped that organisations will benefit from the investigation into the
performance of solid state storage. There is a clear benefit both to new projects, and
also as a remedy for poorly performing systems, for which the use of SSD technology
may be far more cost effective than the redesign and redevelopment of the ETL layer.
8. References
BECKER, B and KIMBALL, R (2007). Kimball University: Think Critically When Applying
Best Practices. [online]. Last accessed 28 May 2011 at:
http://www.kimballgroup.com/html/articles_search/articles2007/0703IE.html?articleI
D=198700049
BETHKE, Uli (2009). One pass SCD2 load: How to load a Slowly Changing Dimension
Type 2 with one SQL Merge statement in Oracle. [online]. Last accessed 17 December 2010 at:
http://www.business-intelligence-quotient.com/?p=66
EMBARCADERO (2010). Database Trends Survey. [online]. Last accessed 12 December 2010 at:
http://www.embarcadero.com/reports/database-trends-survey
FUSION-IO (2011). Online University Learns the Power of Fusion-io. [online]. Last
accessed 22 October 2011 at: http://www.fusionio.com/case-studies/online-university/
HWANG, Mark I and XU, Hongjiang (2007). The Effect of Implementation Factors on
Data Warehousing Success: An Exploratory Study. Journal of Information, Information
Technology, and Organizations, 2, 1-14.
INMON, W. H. (2007). Some straight talk about the costs of data warehousing. Inmon
Consulting.
KIMBALL, R (2004). The Data Warehouse ETL Toolkit : Practical Techniques for
Extracting, Cleaning, Conforming, and Delivering Data. Wiley.
KIMBALL, Ralph (2001). Kimball Design Tip #22: Variable Depth Customer Dimensions.
[online]. Last accessed 14 January 2012 at:
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2001/KimballDT22Varia
bleDepth.pdf
KIMBALL, R and ROSS, M (2002). The Data Warehouse Toolkit. 2nd ed., John Wiley and
Sons.
MCKINSEY GLOBAL INSTITUTE (2011). Big Data: The next frontier for innovation,
competition, and productivity. White Paper, McKinsey Global Institute.
MUSLIH, O.K. and SALEH, I.H. (2010). Increasing Database Performance through
Optimizing Structure Query Language Join Statement. Journal of Computer Science, 6
(5), 585-590.
NOVOSELAC, Steve (2009). SSIS - Using Checksums to Load Data into Slowly Changing
Dimensions. [online]. Last accessed 11 March 2012 at:
http://sqlserverpedia.com/wiki/SSIS_-
_Using_Checksum_to_Load_Data_into_Slowly_Changing_Dimensions
OLSEN, David and HAUSER, Karina (2007). Teaching Advanced SQL Skills: Text Bulk
Loading. Journal of Information Systems Education, 18 (4), 399.
PRIYANKARA, Dinesh (2010). SSIS: Replacing SCD Wizard with the MERGE statement.
[online]. Last accessed 11 March 2012 at: http://dinesql.blogspot.com/2010/11/ssis-
replacing-slowly-changing.html
ROSS, M and KIMBALL, R (2005). Slowly Changing Dimensions Are Not Always as Easy as
1, 2, 3. Intelligent Enterprise, 8 (3), 41-43.
ROSSUM, Joost van (2011). Slowly Changing Dimension Alternatives. [online]. Last
accessed 22 October 2011 at: http://microsoft-ssis.blogspot.com/2011/01/slowly-
changing-dimension-alternatives.html
RUDESTAM, Kjell Erik and NEWTON, Rae R (2001). Surviving your dissertation: A
comprehensive guide to content and process. Thousand Oaks, Calif., Sage Publications.
SCHARLOCK, Peter (2008). Increase your SQL Server performance by replacing cursors
with set operations. [online]. Last accessed 14 October 2011 at:
http://blogs.msdn.com/b/sqlprogrammability/archive/2008/03/18/increase-your-sql-
server-performance-by-replacing-cursors-with-set-operations.aspx
SHAW, Steve and BACK, Martin (2010). Pro Oracle Database 11g RAC on Linux. Apress
Academic.
THORNTHWAITE, Warren (2008). Design Tip #107 Using the SQL MERGE Statement for
Slowly Changing Dimension Processing. [online]. Last accessed 17 December 2010 at:
http://www.rkimball.com/html/08dt/KU107_UsingSQL_MERGESlowlyChangingDimens
ion.pdf
VARIOUS (2004). Best method to handle SCD. [online]. Last accessed 11 March 2012 at:
http://www.sqlservercentral.com/Forums/Topic1200461-363-1.aspx
VEERMAN, Erik, LACHEV, Teo and SARKA, Dejan (2009). Microsoft SQL Server 2008 -
Business Intelligence Development and Maintenance. Redmond, Microsoft Press.
WHALEN, Edward, et al. (2006). Microsoft SQL Server 2005 Administrator’s Companion.
Microsoft Press.
WIKIPEDIA (2010). Slowly Changing Dimension. [online]. Last accessed 18 December 2010 at:
http://en.wikipedia.org/wiki/Slowly_changing_dimension
9. Appendix
/* Diagnostic plot of E against P (typically the model residuals against the
   predicted values from the preceding model), with a reference line at zero. */
proc gplot;
plot E*P/href=0;
run;
quit;
proc gplot;
plot E*P/href=0;
run;
quit;
[Appendix] 1
proc gplot;
plot E*P/href=0;
run;
quit;
[Appendix] 2
[Appendix] 3
/* Custom formats controlling the ordering of output: RowOrd maps the raw row
   counts to the categorical labels used in the results, and $MethOrd renames
   'Join' to 'zJoin' so that it sorts last among the methods. */
proc format;
value RowOrd 5000000='5000k' 500000='500k' 50000='50k' 5000='5k' 0='Zero';
value $MethOrd 'Join'='zJoin' 'Lookup'='Lookup' 'Singleton'='Singleton'
'Merge'='Merge';
run;
quit;
[Appendix] 4
logresults
MethodName LSMEAN
Lookup 6.69438576
Merge 6.18270936
Singleton 7.46886445
zJoin 6.16685334
[Appendix] 5
logresults
Hardware LSMEAN
HDD 7.05939923
SSD 6.19700723
[Appendix] 6
logresults
Hardware MethodName LSMEAN
[Appendix] 7
logresults
changerows newrows LSMEAN
5000k 5k 8.44269542
500k 5k 7.19890130
50k 5k 6.00435179
5k 5000k 7.37022561
5k 500k 6.10132129
5k 50k 5.16488665
5k 5k 4.96281769
5k Zero 4.85320691
Zero 5k 4.26221253
[Appendix] 8
logresults
MethodName changerows newrows LSMEAN
Lookup 5k 5k 5.4965852
[Appendix] 9
logresults
MethodName changerows newrows LSMEAN
Merge 5k 5k 4.6075410
Singleton 5k 5k 4.7616561
[Appendix] 10
logresults
MethodName changerows newrows LSMEAN
zJoin 5k 5k 4.9854885
[Appendix] 11
quit;
[Appendix] 12
Parameter                                   Estimate      Standard Error    t Value    Pr > |t|
Intercept 4.142689750 B 0.13169792 31.46 <.0001
MethodName Join 0.425552708 B 0.18624899 2.28 0.0231
MethodName Merge 0.000000000 B . . .
Hardware HDD 0.464630837 B 0.18624899 2.49 0.0132
Hardware SSD 0.000000000 B . . .
MethodName*Hardware Join HDD -0.054055915 B 0.26339585 -0.21 0.8376
MethodName*Hardware Join SSD 0.000000000 B . . .
MethodName*Hardware Merge HDD 0.000000000 B . . .
MethodName*Hardware Merge SSD 0.000000000 B . . .
MethodNam*changerows Join 5000k 2.704117172 B 0.13882180 19.48 <.0001
MethodNam*changerows Join 500k 1.333582846 B 0.13882180 9.61 <.0001
MethodNam*changerows Join 50k 0.625439344 B 0.13882180 4.51 <.0001
MethodNam*changerows Join 5k 0.147147697 B 0.13882180 1.06 0.2901
MethodNam*changerows Join Zero 0.000000000 B . . .
MethodNam*changerows Merge 5000k 2.806825815 B 0.13882180 20.22 <.0001
MethodNam*changerows Merge 500k 1.479337039 B 0.13882180 10.66 <.0001
MethodNam*changerows Merge 50k 0.777049662 B 0.13882180 5.60 <.0001
MethodNam*changerows Merge 5k 0.298682695 B 0.13882180 2.15 0.0323
MethodNam*changerows Merge Zero 0.000000000 B . . .
MethodName*newrows Join 5000k 1.148798851 B 0.13882180 8.28 <.0001
MethodName*newrows Join 500k 0.328658299 B 0.13882180 2.37 0.0186
MethodName*newrows Join 50k -0.028393304 B 0.13882180 -0.20 0.8381
[Appendix] 13
Parameter                                   Estimate      Standard Error    t Value    Pr > |t|
MethodName*newrows Join 5k -0.063072701 B 0.13882180 -0.45 0.6500
MethodName*newrows Join Zero 0.000000000 B . . .
MethodName*newrows Merge 5000k 1.866529177 B 0.13882180 13.45 <.0001
MethodName*newrows Merge 500k 0.834375788 B 0.13882180 6.01 <.0001
MethodName*newrows Merge 50k 0.010436300 B 0.13882180 0.08 0.9401
MethodName*newrows Merge 5k -0.107154821 B 0.13882180 -0.77 0.4409
MethodName*newrows Merge Zero 0.000000000 B . . .
Method*Hardwa*change Join HDD 5000k 0.062713738 B 0.19632367 0.32 0.7496
Method*Hardwa*change Join HDD 500k 0.455555813 B 0.19632367 2.32 0.0211
Method*Hardwa*change Join HDD 50k 0.671227236 B 0.19632367 3.42 0.0007
Method*Hardwa*change Join HDD 5k 0.381935111 B 0.19632367 1.95 0.0528
Method*Hardwa*change Join HDD Zero 0.000000000 B . . .
Method*Hardwa*change Join SSD 5000k 0.000000000 B . . .
Method*Hardwa*change Join SSD 500k 0.000000000 B . . .
Method*Hardwa*change Join SSD 50k 0.000000000 B . . .
Method*Hardwa*change Join SSD 5k 0.000000000 B . . .
Method*Hardwa*change Join SSD Zero 0.000000000 B . . .
Method*Hardwa*change Merge HDD 5000k 0.135016286 B 0.19632367 0.69 0.4922
Method*Hardwa*change Merge HDD 500k 1.149062584 B 0.19632367 5.85 <.0001
Method*Hardwa*change Merge HDD 50k 0.980363107 B 0.19632367 4.99 <.0001
Method*Hardwa*change Merge HDD 5k 0.403303852 B 0.19632367 2.05 0.0409
Method*Hardwa*change Merge HDD Zero 0.000000000 B . . .
Method*Hardwa*change Merge SSD 5000k 0.000000000 B . . .
Method*Hardwa*change Merge SSD 500k 0.000000000 B . . .
Method*Hardwa*change Merge SSD 50k 0.000000000 B . . .
Method*Hardwa*change Merge SSD 5k 0.000000000 B . . .
Method*Hardwa*change Merge SSD Zero 0.000000000 B . . .
Method*Hardwa*newrow Join HDD 5000k -0.289088246 B 0.19632367 -1.47 0.1421
Method*Hardwa*newrow Join HDD 500k -0.098592314 B 0.19632367 -0.50 0.6160
Method*Hardwa*newrow Join HDD 50k 0.135230950 B 0.19632367 0.69 0.4915
Method*Hardwa*newrow Join HDD 5k 0.221695494 B 0.19632367 1.13 0.2598
Method*Hardwa*newrow Join HDD Zero 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 5000k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 500k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 50k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 5k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD Zero 0.000000000 B . . .
Method*Hardwa*newrow Merge HDD 5000k -0.618608060 B 0.19632367 -3.15 0.0018
Method*Hardwa*newrow Merge HDD 500k -0.306753690 B 0.19632367 -1.56 0.1194
Method*Hardwa*newrow Merge HDD 50k 0.172620281 B 0.19632367 0.88 0.3801
Method*Hardwa*newrow Merge HDD 5k 0.229874256 B 0.19632367 1.17 0.2427
Method*Hardwa*newrow Merge HDD Zero 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD 5000k 0.000000000 B . . .
[Appendix] 14
Parameter                                   Estimate      Standard Error    t Value    Pr > |t|
Method*Hardwa*newrow Merge SSD 500k 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD 50k 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD 5k 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD Zero 0.000000000 B . . .
[Appendix] 15
Title;
proc gplot;
plot E*P/href=0;
run;
quit;
[Appendix] 16
Parameter                                   Estimate      Standard Error    t Value    Pr > |t|
[Appendix] 17
Parameter                                   Estimate      Standard Error    t Value    Pr > |t|
[Appendix] 18
[Appendix] 19
[Appendix] 27