0% found this document useful (0 votes)

78 views99 pages

Alex Whittles Dissertation

This dissertation examines the performance of four techniques for loading type 2 slowly changing dimensions in a Kimball style data warehouse: the singleton method, lookup method, join method, and merge method. Experiments were conducted using SQL Server 2012 with varying data volumes and storage types. Statistical analysis found the merge join approach had the best performance with over 500k new records. The merge statement performed comparably under 500k records. Solid state storage reduced load times up to 12.5x but did not impact the best technique. The optimal method depends on data volume and storage type.

Uploaded by

Murali Bala

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

78 views99 pages

Alex Whittles Dissertation

Uploaded by

Murali Bala

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 99

SHEFFIELD HALLAM UNIVERSITY

FACULTY OF ACES

“PERFORMANCE COMPARISON OF TECHNIQUES TO LOAD TYPE 2

SLOWLY CHANGING DIMENSIONS IN A KIMBALL STYLE DATA
WAREHOUSE”

Alexander Whittles

27th April 2012

Supervised by: Angela Lauener

This dissertation does NOT contain confidential material and thus can be made
available to staff and students via the library.

A dissertation submitted in partial fulfilment of the requirements of Sheffield

Hallam University for the degree of Master of Science in Business Intelligence .
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Acknowledgements
Thank you to Angela Lauener and Keith Jones, from Sheffield Hallam University, for
their valuable assistance with this project.

A core part of this research relied on access to state of the art solid state hardware. I’d
like to thank Fusion IO for their support of this work, and for the loan of their hardware
which made the research possible.

The time taken to undertake this research has been at the cost of spending time at
work. I’d like to thank Purple Frog Systems Ltd for supporting me through this project.

Thanks to Tony Rogerson for helping define the technical specification of the test
server.

My thanks also go to the SQLBits conference committee, who asked me to present a

summary of this work at the UK launch of SQL Server 2012.

Finally, and most importantly, thanks go to my wife, Hollie, who has supported me
through this dissertation and throughout the entire MSc process. Without her support,
encouragement, understanding and limitless patience I would not have been able to
complete this work. My wholehearted thanks go to her.

ii
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Abstract
In the computer science field of Business Intelligence, one of the most fundamental
concepts is that of the dimensional data warehouse as proposed by Ralph Kimball
(Kimball and Ross 2002). A significant portion of the cost of implementing a data
warehouse is the extract, transform and load (ETL) process which retrieves the data
from source systems and populates it into the data warehouse.

Critical to the functionality of most dimensional data warehouses is the ability to track
historical changes of attribute values within each dimension, often referred to as
Slowly Changing Dimensions (SCD).

There are countless methods of loading data into SCDs within the ETL process, all
achieving a similar goal but using different techniques. This study investigates the
performance characteristics of four such methods under multiple scenarios covering
different volumes of data as well as traditional hard disk storage versus solid state
storage. The study focuses on the most complex SCD implementation, Type 2, which
stores multiple copies of each member, each valid for a different period of time.

The study uses Microsoft SQL Server 2012 as its test platform.

Using statistical analysis techniques, the methods are compared against each other,
with the most appropriate methods identified for the differing scenarios.

It is found that using a Merge Join approach within the ETL pipeline offers the best
performance under high data volumes of at least 500k new or changed records. The T-
SQL Merge statement offers comparable performance for data volumes lower than
500k new or changed rows.

It is also found that the use of solid state storage significantly improves ETL load
performance, reducing load time by up to 92% (12.5x), but does not affect the
comparative performance characteristics between the methods, and so should not
impact the decision as to the optimal design approach.

iii
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Contents
Acknowledgements ................................................................................................................... ii

Abstract .................................................................................................................................... iii

Contents ................................................................................................................................... iv

1. Introduction........................................................................................................................ 1

2. Literature Review ............................................................................................................... 6

A. Slowly Changing Dimension Performance............................................................. 6

B. Database Operation Performance......................................................................... 9

C. Random Vs Sequential IO .................................................................................... 10

D. Data Growth ........................................................................................................ 11

E. Conclusion ........................................................................................................... 11

3. Methodology and data collection methods ..................................................................... 12

A. Inductive Vs Deductive ........................................................................................ 12

B. Qualitative Vs Quantitative ................................................................................. 13

C. Source Database .................................................................................................. 14

D. Data Warehouse .................................................................................................. 14

E. ETL Process .......................................................................................................... 15

F. Toolset ................................................................................................................. 15

G. Quantitative Tests ............................................................................................... 15

H. Statistical Analysis ............................................................................................... 22

I. Test Rig Hardware ............................................................................................... 22

J. Issues of access and ethics .................................................................................. 23

4. Results and Data analysis ................................................................................................. 24

A. Statistical Analysis Method ................................................................................. 29

B. Statistical Analysis – Factor Model ...................................................................... 37

C. Statistical Analysis – Numerical Model ............................................................... 42

iv
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

D. Projection Model ................................................................................................. 49

E. Decision Tree ....................................................................................................... 51

F. Dependency Network .......................................................................................... 52

5. Discussion ......................................................................................................................... 53

A. Singleton Method ................................................................................................ 53

B. Lookup Method ................................................................................................... 54

C. Join & Merge Methods ........................................................................................ 54

D. Solid State Storage............................................................................................... 56

E. New & Changed Rows ......................................................................................... 57

6. Conclusion ........................................................................................................................ 58

7. Evaluation ......................................................................................................................... 60

8. References ........................................................................................................................ 63

9. Appendix ............................................................................................................................ 1

Appendix 1. SAS Code – General Linear Model....................................................... 1

Appendix 2. SAS Code – General Linear Model (Log).............................................. 1

Appendix 3. SAS Code – General Linear Model (Log, category variables) .............. 2

Appendix 4. ANOVA Statistical Results.................................................................... 3

Appendix 5. SAS Analysis code ................................................................................ 4

Appendix 6. ANOVA Results – Method Least Square Means .................................. 5

Appendix 7. ANOVA Results – Hardware Least Square Means ............................... 6

Appendix 8. ANOVA Results – Hardware/Method Least Square Means ................ 7

Appendix 9. ANOVA Results – Row Count Least Square Means ............................. 8

Appendix 10. ANOVA Results – Method/Row Count Least Square Means .............. 9

Appendix 11. SAS Analysis Code – Join and Merge ................................................. 12

Appendix 12. ANOVA Results – Join and Merge ..................................................... 13

Appendix 13. SAS Code – Numerical model excluding Singleton ........................... 16

v
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 14. Statistical Results – Reduced numerical model excluding singleton 17

Appendix 15. Full Test Results ................................................................................. 19

vi
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

1. Introduction
A core component of any data warehouse project is the ETL (Extract, Transform and
Load) layer which extracts data from the source systems, transforms the data into a
new data model and loads the results into the warehouse. The ETL system is often
estimated to consume 70 percent of the time and effort of building a business
intelligence environment (Becker and Kimball 2007).

A study by Gagnon in 1999, cited by Hwang and Xu (Hwang and Xu 2007) reported that
the average data warehouse costs $2.2m to implement. Watson and Haley (Watson
and Haley 1997) report that a typical data warehouse project costs over $1m in the
first year alone. Although the cost will vary dramatically from project to project, these
sources illustrate the level of financial investment that can be required. Inmon states
that the long term cost of a data warehouse depends more on the developers and
designers and the decisions they make than on the actual cost of technology (Inmon
2007). There is therefore a compelling financial reason to ensure that the correct ETL
approach is taken from the outset, and that the right technical decisions are taken on
which techniques are employed.

A Kimball style data warehouse comprises fact and dimension tables (Kimball and Ross
2002). Fact tables store the numerical measure data to be aggregated, whereas
dimension tables store the attributes and hierarchies by which the fact data can be
filtered, sliced, grouped and pivoted. It is a common requirement that warehouses be
able to store a history of these attributes as they change, so they represent the value
as it was at the time each fact happened, instead of what the value is now. This is
implemented using a technique called Slowly Changing Dimensions (SCD) (Kimball
2008), used within the ETL process.

There are numerous different methods of implementing SCDs, of which the following
three are the most common (Ross and Kimball 2005) (Kimball 2008) (Wikipedia 2010):

Type 1: Only the current value is stored, history is lost. This is used where
changes are treated as corrections instead of genuine changes, or no history is
required.
1
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Type 2: Multiple copies of a record are maintained, each valid for a period of
time. Fact records are linked to the appropriate dimension record that was
valid when the fact happened. e.g. Customers address. To analyse sales by
region, sales should be allocated against the correct address where the
customer was living when they purchased the product, not where they live
now.

Type 3: Two (or more) separate fields are maintained for each attribute, storing
the current and previous value. No further history is stored. e.g. Customer’s
surname. It may be required to only store the current surname and maiden
name, not the full history of all names.

Type 0 and 6 SCDs are rare special cases. Type 0 does not track changes at all, and Type
6 is a rare hybrid of 1, 2 & 3. Neither are therefore relevant to this research.

Type 1 SCDs are the simplest approach to implement (Kimball and Ross 2002) however
all history is lost. Type 3 SCDs are used infrequently (Kimball and Ross 2002) due to
their limited ability to track history. These SCD types don’t provide any maintainability
or performance problems for the vast majority of data warehouses (Wikipedia 2010).
The most common form of SCD is therefore Type 2, which is recommended for most
attribute history tracking by most dimensional modellers including Ralph Kimball
himself (Kimball and Ross 2002). The downside of Type 2 is that it requires much more
complex processing, and is a frequent cause of performance bottlenecks (Wikipedia
2010).

It is the intention of this research assignment to perform an inductive investigation to

compare the performance of different methods of implementing type 2 SCDs, with a
view to identifying the most effective method for different scales and characteristics of
data warehouse. The methods that will be assessed are:

Bulk insert (ETL) & singleton updates (ETL) - The whole process is managed
within the ETL data pipeline. For each input record, the ETL process determines
whether it’s a new or changed record via a singleton query to the dimension, and then
handles the two streams of data individually. New records can be inserted into the

2
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

dimension table in bulk. Changed records however are processed individually by

executing update & insert statements against the database.

Bulk insert (ETL) & bulk update (DB) (using Lookup) - The SCD processing is split
between the ETL and the database. The ETL pipeline uses a ‘lookup’ approach to
identify each record as either a new record requiring an insert or an existing record
requiring an update. All inserts are piped to a bulk insert component within the ETL; all
updates are bulk inserted into a staging table to then be processed into the live
dimension table by the database engine using a MERGE statement. The ‘lookup’
approach is an ETL technique analogous to a nested loop join operation in T-SQL.

Bulk insert (ETL) & bulk update (DB) (using Merge Join) - The SCD processing is split
between the ETL and the database. The ETL pipeline uses a ‘merge join’ approach to
identify each record as either a new record requiring an insert or an existing record
requiring an update. All inserts are piped to a bulk insert component within the ETL; all
updates are bulk inserted into a staging table to then be processed into the live
dimension table by the database engine using a MERGE statement. The ‘merge join’
approach is an ETL technique analogous to a merge join operation in T-SQL.

Bulk inserts and updates (DB) - The ETL process does not perform any of the SCD
processing, instead it is entirely handled within the database engine. The ETL pipeline
outputs all records to a staging table using a bulk insert, then all records in the staging
table are processed into the live dimension table at once using a MERGE statement.
This single database operation manages the entire complexity of differentiating
between new and changed rows, as well as performing the resulting operations.

The majority of data warehouses are populated daily during an overnight ETL load
(Mundy, Thornthwaite and Kimball 2011). The performance of the load is vital in order
to ensure the entire data batch can be completed in an often very tight time window
between end of day processing within the source transactional systems and the start
of the following business day. There is now a growing trend towards real-time data
warehouses, with current data warehousing technologies making it possible to deliver
decision support systems with a latency of only a few minutes or even seconds
(Watson and Wixom 2007) (Mundy, Thornthwaite and Kimball 2011). The performance
3
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

focus is therefore shifting from a single bulk load of data to a large number of smaller
data loads. This research will concentrate on the performance aspects of the more
typical overnight batch ETL load as it is still the most common business practice
(Mundy, Thornthwaite and Kimball 2011).

Historically data warehouses have used traditional hard disk storage media for the
physical storage of the data. There has been significant growth recently in the
availability and reliability of NAND flash based solid state storage, and an equivalent
reduction in cost. A case study by Fusion-IO for a leading online university (Fusion-IO
2011) shows the very large difference in performance for database operations when
comparing physical disk based media with solid state, increasing the random read IOPS
(input/output operations per second) from 3,548 to 110,000 and the random write
IOPS from 3,517 to 145,000. A test query in this case study improved in performance
from 2.5 hours on disk based storage to only 5 minutes on solid state storage.

This sizeable shift in the potential performance of database systems is therefore of

great relevance to this project; it raises the question of whether the performance of
the hardware platform has an impact on the preferred methodology. It stands to
reason that loading data and processing SCDs is likely to be significantly faster using
such hardware, it is of interest to this project whether the change in hardware actually
changes the relative merits of each method and may perhaps influence the selection
process.

The intended outcome is to be able to predict the optimal method for a given set of
dimension data and hardware platform, to enable data warehouse ETL developers to
optimise the initial design in order to maximise the data throughput, minimising the
required data warehouse load window.

The process and methods of loading type 2 SCDs is generic across technology
platforms, however this investigation will be carried out using the Microsoft SQL
Server toolset, including the SQL Server database engine and the Integration Services
ETL platform. SQL Server is one of the most, if not the most, widely used database
platforms in use today (Embarcadero 2010). The techniques used in this research are
equally suited to other database platforms such as Oracle.
4
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Document Summary

Chapter 2 discusses the background literature and existing research that has been
conducted in this field. It also presents justification for this research.

Chapter 3 explains the methodology appropriate to the research question. The details
of the quantitative tests are discussed, as well as a summary of the statistical analysis
methodology.

Chapter 4 presents the test results and identifies the most appropriate statistical
models to be used. The results are analysed and interpreted using a variety of
statistical and data mining models.

Chapter 5 presents a summary and interpretation of the statistical results, cross

referencing the findings to the literature review and presenting it in a manor more
appropriate for use in a future non-academic scenario.

Chapter 6 summarises the research in a high level overview.

Chapter 7 evaluates the research, identifying the limitations of the approach taken,
and discusses how further research could be conducted to improve the understanding
beyond that presented in this research.

5
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

2. Literature Review
This chapter explores the existing research that has been undertaken in this area, and
examines the justification for this research. The specific topic of SCD performance is
investigated, as well as the more generic performance of database operations and
then the relevance of the industry’s trend towards solid state storage devices.

A. Slowly Changing Dimension Performance

There is a component shipped with SQL Server Integration Services (SSIS) which is
intended to take care of slowly changing dimension loads for the developer, The Slowly
Changing Dimension component (Veerman, Lachev and Sarka 2009). This automates
the creation of the first of the intended methods, bulk insert and singleton updates. It
is widely accepted that this component is satisfactory for small dimensions, but when
the complexity or size increases it becomes less of an option (Mundy, Thornthwaite
and Kimball 2006).

Although the investigation and research approach is based primarily on the Microsoft
SQL Server toolset, the performance of loading data SCD Type 2 data is a generic issue
and just as big a problem when using other competing technologies such as SAS
(Dramatically Increasing SAS DI Studio performance of SCD Type-2 Loader Transforms
2010), as such although the terminology and implementation details will differ, the
concept has a much wider scope.

The subject of SCD Type 2 load performance is widely discussed in user forums and
blogs, providing an indication to the size of the problem. A simple Google search on
the topic returns ¼ million results including (Priyankara 2010) (Novoselac 2009)
(Various 2004). Given this, it is surprising that there has been a lack of any detailed
studies in academia or the commercial field. The concept of a Type 2 SCD is discussed
in the majority of books covering ETL methods of star schema data warehouses, for
example (Kimball 2008) (Veerman, Lachev and Sarka 2009), however alternative
implementation approaches are often not presented, and no sufficient performance
analysis was identified during background investigation for this research.

6
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Kimball (Kimball 2004) offers bulk merge (SET) as a method of improving the
functionality of a Type 2 data load, but as with other resources, does not discuss the
performance considerations of it. Warren Thornthwaite does however investigate this
approach in more detail in a more recent document (Thornthwaite 2008), explaining
that being able to handle the multiple required actions in a single pass should be
extremely efficient given a well-tuned optimizer. Uli Bethke has taken this same
approach and applied it to an Oracle environment (Bethke 2009).

Joost van Rossum has written a blog post on this topic (Rossum 2011), and provides a
number of options for loading data into SCDs, and also provides some basic timing
statistics for them. Although this is not an academic or refereed source, the author has
many years of experience as a business intelligence consultant, receiving the Microsoft
Community Contributor award in 2011 and the BI Professional of the year award from
Atos Origin in 2009. This post presents four alternatives to the Slowly Changing
Dimension Component:

a) An open source project, “SSIS Dimension Merge SCD Component”

b) A commercial “Table Difference Component” or free “Konesans Checksum
Transformation”
c) The T-SQL Merge statement
d) Standard SSIS lookup components

Rossum chose to compare option (d) against the built in component, and proceeds to
extend this option into two tests, one performing singleton updates and one
performing a batch update. No reason is given for not pursuing the first three options,
however option (c) seems to have been added after the publication of the post which
explains its absence. Many corporations impose restrictions on the use of third party
software components; it is also preferable to use transparent techniques in which the
functionality can be understood instead of black box components which can’t be
analysed, explaining the absence of options (a) & (b).

In Rossum’s tests, he uses a small test dimension of 15,481 members, with a small
change set of 128 members and 100 new members. The results are provided in Table
1.
7
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Method Duration (s)

Slowly Changing Dimension Component 25

SSIS Lookups (singleton update) 1.5

SSIS Lookups (batch update) 6

Table 1 – Results of Rossum’s SCD method tests

There is clearly a large performance variation between the methods; however with
such a small number of records, the results can only be used to provide an indication
of the difference, and cannot be interpreted with any degree of confidence. Rossum
does not perform any statistical analysis on the results, and does not repeat the
experiments with different volumes of data, or provide any information on the
conditions under which these tests were performed.

Mundy, Thornthwaite and Kimball (Mundy, Thornthwaite and Kimball 2006)

recommend using the Slowly Changing Dimension component approach for small data
sets with less than 10,000 input or destination rows, they also advise that the
performance should be acceptable even for large dimensions but which only have
small input change data sets. Rossum’s findings show that, although the SCD
component does take longer in his change dataset of only 228 members, the durations
are so small that it’s likely to be acceptable.

In a higher volume scenario, Mundy et al advise a manual approach to SCD processing,

using a lookup or merge join component within SSIS to map incoming records to
existing members in the dimension. Once records are mapped using the
natural/business key, the input stream is split into new and existing members. The
attributes of the existing stream can then be compared to determine whether the
record has changed or not. The ‘new’ stream should be piped directly to a bulk insert
component. They advise recreating a singleton update process for the update stream
and comment that this could be improved for performance and functionality but stop
short of presenting options on how to accomplish this. The obvious solution to

8
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

increase performance however is to simply process the updates in a single batch

operation rather than individually, using the database engine to perform the work.

It’s disappointing that, although commonly discussed, no authors have been identified,
other than Rossum, that have investigated the performance characteristics of the
available methods. It is this shortage of existing research, along with the regularity
with which this problem is encountered in the commercial field, which has prompted
this research to investigate the load characteristics of SCD methods in more detail.

B. Database Operation Performance

Despite the shortage of research focusing on data warehouse SCD load performance,
there has been considerable activity investigating the operational performance of
database engines, and the optimisation of queries.

One such study by Muslih and Saleh (Muslih and Saleh 2010) describes the
performance of different join statements in SQL queries. Their comparison of nested
loop joins and sort-merge joins shows that there can be a dramatic difference in query
cost dependant on the size of the datasets being used. They advise that nested loop
joins should be used when there are a small number of rows, but sort-merge joins are
preferable with large amounts of data. Although this study is focusing on the
performance of the ETL process not the database engine, there is a high degree of
parallel as the ETL process must join two streams of data together, the incoming and
existing data. These findings can therefore be taken into account when determining
the methods to be used.

Olsen and Hauser (Olsen and Hauser 2007) advise that to get the best performance
from relational database systems the operations should be performed in bulk if more
than a very small portion of the database is updated.

An investigation by Peter Scharlock (Scharlock 2008) into the performance of using

cursors in SQL Server showed quite how great the performance differential can be
between row based operations and set based operations. He created two experiments
updating 200 million rows in a single table; in the first experiment each row was
updated separately using a cursor to loop through them, whereas the second test

9
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

updated all rows in a single set based operation. He calculates that the cursor based
approach would have taken in excess of 8 months to complete, whereas the set based
operation completed in approximately 24 hours. Scharlock acknowledges that there is
a much greater resource cost of the set based operation, although he doesn’t present
any details or evidence of this.

C. Random Vs Sequential IO
Loading data into a data warehouse dimension requires both random disk access as
well as sequential disk access.

In traditional physical disk drives, sequential IO (Input/Output) requires only a single

seek operation to move the disk head to the correct location, following which all the
necessary data can be read or written to the same physical location with a simple seek
from one track to its adjacent track. Random IO is required when the data to be read
or written exists in different locations on the disk, requiring multiple seeks to correctly
position the head to tracks in differing physical locations.

Because track-to-track seeks are much faster than random seeks, it is possible to
achieve much higher throughput from a disk when performing sequential IO (Whalen
et al. 2006).

In contrast, solid state storage has no physically moving parts so random seeks require
less overhead. They are therefore able to achieve a much higher performance,
specifically with respect to random read operations (Shaw and Back 2010). Tony
Rogerson (Rogerson 2012) states that the more contiguous the data is held in the same
locality on disk the lower the latency and higher throughput, with Solid State Devices
(SSD) turning that reasoning on its head. Rogerson acknowledges that SSDs still offer
the best access performance for contiguous data, however the access latency is
significantly less variant than with hard disks, enabling a much higher comparative
performance for random access.

Given the change in nature of their performance, it is expected that the use of SSD will
change the performance characteristics of loading data when compared with
traditional disk based storage.

10
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

D. Data Growth
Data volumes within organisations continue to grow at a phenomenal rate, as more
data is made available from social media, cloud sources, improved internal IT systems,
data capture devices etc. Data growth projections vary, however a recent McKinsey &
Co report projects a 40% annual growth on global data, with only a corresponding 5%
growth in IT spending (McKinsey Global Institute 2011). There is therefore a
compelling need in industry to maximise the efficiency of any data processing system
whilst also minimising the cost of implementation and maintenance.

E. Conclusion
From the research presented, it’s clear that the performance of loading Type 2 slowly
changing dimensions is of concern to a large number of people in the Business
Intelligence industry, and as data volumes increase the problem will become more
prevalent.

Although numerous authors and bloggers have presented their own personal or
professional views on which method to use, there is very little experimental or
statistical evidence to justify their claims. There has also been no research undertaken
within academic circles to investigate the performance characteristics of ETL
processes.

This lack of empirical evidence makes it impossible to determine which is the best
approach to loading data warehouse dimensions for a given scenario, leaving
architects and developers to make design decisions based either on their own, often
limited, experience or on anecdotal evidence.

This is made more problematic by the introduction of solid state hardware, providing
yet another option for the data warehouse architect to consider.

The author therefore considers this research to be of great importance to the Business
Intelligence community, to provide guidance to those looking to optimise their system
design.

11
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

3. Methodology and data collection methods

This chapter explores the methods available to undertake this research, and identifies
the relevant approach that is likely to generate the most useful results.

A. Inductive Vs Deductive
It is the intention of this research to perform an inductive investigation. This research
does not set out to prove an existing hypothesis that one method of loading data is
faster than another, but instead offers a number of different methods and scenarios
commonly found in industry, and attempts to compare them to investigate which is
the preferable method in any given scenario.

Following the ‘Research Wheel’ approach (Rudestam, Kjell and Newton 2001)
presented in Figure 1, the research will start with the empirical observation from the
author’s own experience in industry that the performance of loading Type 2 slowly
changing dimensions is a problematic area, and warrants investigation.

Figure 1 – The Research Wheel

As an inductive investigation, the proposition is to explore the nature and performance

of loading Type 2 SCDs, with a view to determining the most appropriate method(s) for
a given scenario.

The previous chapter explored the literature in detail, presented justification for the
research and explored some of the specific questions and topics that have been raised,
which this research will explore in more detail.

Results will then be collected and analysed, then the cycle will be continued to
whatever extent is necessary in order to draw sufficient conclusions which can then be
applied to practical scenarios outside of this project.
12
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

B. Qualitative Vs Quantitative
Two high level approaches were considered for this research, quantitative and
qualitative (Rudestam, Kjell and Newton 2001).

To perform a qualitative assessment a questionnaire would be distributed to business

intelligence consultants, professionals, architects and programmers requesting, in their
opinion, the relative pros and cons of the approaches given different scenarios. Each
scenario would represent different percentage change factors in the source data.

The results would be interpreted to extract common findings from the answers
provided for each scenario. A quantitative investigation could also be adopted if the
participants were asked to rate each method on a performance scale.

The primary concern with this approach is that it is highly unlikely to actually reveal a
genuine performance difference between the methods, instead revealing each
individual’s preference for each method, which is likely to also be based on
convenience, lack of awareness of other methods, maintainability, code simplification,
available toolsets etc. This method would however enable the research to cover a
broader spectrum of technologies and implementation styles.

This approach also relies on getting responses from the questionnaire, which can be
problematic and costly.

To perform a quantitative analysis of the load performance, a simple data load test can
be set up to measure the time taken to process a number of new and changed rows in
a simulated data warehouse environment. The proportion of new and changed rows
can be altered to provide measurements of the data throughput.

The resulting measurements can be statistically analysed to determine whether there

is a significant difference between the methods.

The primary outcome of this research is focused on the performance of data

throughput, so the quantitative approach is the more appropriate as it will allow
control over the majority of external influencing factors in order to isolate and

13
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

measure the relevant metrics. It is therefore intended to set up a series of tests that
will generate the required measurements. To perform this, a number of components
must be set up.

C. Source Database
A representative online transactional processing (OLTP) database, complete with a set
of data records suitable to be populated in a data warehouse dimension. The contents
of this database will be preloaded into the data warehouse dimension, and then one of
a number of change scripts will be run to generate the required volume of SCD type 2
changes.

The nature of this database is immaterial, so an arbitrary set of tables will be created
modelling a frequently used dimension, Customer. The Customer dimension is often
the most challenging dimension in a data warehouse due to its large size and often
quickly changing attributes (Kimball 2001). These tables will be normalised to 3rd
Normal Form to accurately model a real-world OLTP source database. As this research
is solely focussing on the performance of SCD type 2 dimension data loads, it is not
necessary to simulate fact data such as sales or account balances.

The source OLTP database will need to be populated with random but realistic data. To
achieve this the SQL Data Generator application provided by RedGate will be used. This
allows for each field to be populated using a pseudo random generator but within
specified constraints, or selected randomly from a list of available values. This prevents
any violation of each fields’ constraints. This method will be used to generate the
starting dataset as well as the new and changed records for the ETL load test.

To generate the change data, SQL scripts will be written which will update a specified
percentage of the records, altering at least one of the fields being tracked by the type
2 process.

To ensure consistency between the methods, each test will use identical datasets.

D. Data Warehouse
A suitable data warehouse dimension will be created, following Kimball Group best
practices (Mundy, Thornthwaite and Kimball 2011). This will be a single dimension that

14
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

would normally form part of a larger star schema of fact and dimension tables within
the warehouse.

Fact data will not form part of the performance tests, so the complete star schema
does not need to be built.

E. ETL Process
To perform the data load, a number of ETL (Extract, Transform & Load) packages will
be created to populate the dimension from the source database, each performing the
data load in a different way. Each package will log the ETL method being used, the
number of new rows to be inserted, the number of change rows retrieved from the
source database and duration of the load process.

F. Toolset
There are a number of database systems and ETL tools available to use, from Oracle
and SQL Server to MySQL and DB2, and SSIS to Syncsort and SAS.

This analysis will make use of Microsoft SQL Server. SQL Server is one of the most, if
not the most, widely used database platforms in use today (Embarcadero 2010). It
integrates a highly scalable DBMS (database management system) with an integrated
ETL toolset, SSIS.

G. Quantitative Tests
The comparative performance of the load methods is expected to change depending
on the number of rows being loaded, and the ratio of new records to changed records.
It will therefore be necessary to create numerous different change data sets, each with
a different percentage of new data and changed data.

The tests will all be performed on the same hardware, with the exception of the
different storage platforms. This will ensure consistency, however it should be noted
that the results may be influenced by the specification of server used. For example,
some of the methods are very memory intensive and so may be expected to perform
better when given access to more memory. Ideally the datasets would be small enough
to ensure that memory would not be an influencing factor, however it is important to
perform the tests on data that is of sufficient size to provide usable and meaningful

15
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

data. Each ETL process will incur fixed processing overhead to initiate the process and
pre-validate the components and metadata etc. If the datasets were too small, the
fixed processing overheads could obscure the timing results. A dimension with 50m
records will therefore be used. This size is representative of a large dimension of a
typical large organisation, for example a customer dimension. The resulting size of the
databases will also be within the available hardware capacity of the solid state drives
available for the tests.

Four different ETL systems will be created to perform SCD type 2 dimension loads with
the following methods.

Method 1: Bulk insert (ETL) and singleton updates (ETL)

The whole process is managed within the ETL layer.

Each record is checked individually to determine whether it already exists in the

dimension or not.

New records which don’t already exist in the dimension will be bulk inserted within the
ETL pipeline, with a full lock allowed on the destination table.

Changed records will be dealt with individually within the ETL pipeline, with two
actions performed for each change:

- Terminate the previous record by flagging it as historic

- Insert new record

This method is that recommended by Mundy et al (Mundy, Thornthwaite and Kimball

2006) for smaller data sets and is an obvious inclusion being the simplest to
implement.

16
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 2 – Typical (simplified) structure of a Singleton load process using the Slowly Changing Dimension
component [taken from a screenshot of the actual load process used for this test]

Method 2: Bulk inserts (ETL) and bulk updates (DB), split using Lookup (ETL)

The process is managed by both the ETL layer and the database engine.

The ETL layer includes a Lookup component which cross references each incoming
record against the existing dimension contents. New records which don’t already exist
in the dimension will be bulk inserted, with a full lock allowed on the destination table.

Existing records will be loaded into a staging table and then merged into the
destination dimension in a single operation. The Merge operation takes care of the
multi stage process required for Type 2 changes:

- Terminate the previous record by flagging it as historic

- Insert new record

This method of using the Lookup to differentiate new/existing records is also

recommended by Mundy et al for larger data sets, although they recommend still
processing the existing channel using singleton updates. This falls short of an ideal

17
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

method, as each record is being processed individually. In order to make any database
operation truly scalable the updates should be managed in bulk. As Olson and Hauser
describe, one should make careful use of edit scripts and replace them with bulk
operations if more than a very small portion of the database is updated (Olsen and
Hauser 2007). An adaptation to this approach to utilise bulk updating was adopted by
Rossum in his tests (Rossum 2011).

Figure 3 – Typical (simplified) structure of a load process using Lookup

Method 3: Bulk inserts (ETL) and bulk updates (DB), split using Join (ETL)

The process is managed by both the ETL layer and the database engine.

The ETL layer includes a Merge Join component which left outer joins every incoming
record to a matching dimension record if one already exists. New records which don’t
already exist in the dimension will be bulk inserted, with a full lock allowed on the
destination table.

18
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

- Terminate the previous record by flagging it as historic

- Insert new record

This method is very similar to method 2 in its approach, utilising the ETL pipeline to
distinguish the new and existing records, and processing both streams in bulk.

The key difference is the technique used to cross reference incoming records against
the existing dimension records. Method 2 uses a ‘Lookup’ approach, whereas this
method replaces it with a Merge Join.

The Lookup transformation uses an in memory hash table to index the data (Microsoft
2011), with each incoming record looking up its corresponding value in the hash table.
This means the entire existing dimension must be loaded into memory before the ETL
script can begin, and it remains in memory for the duration of the script.

The Merge Join transformation however applies a LEFT OUTER JOIN between the
incoming data and the existing dimension data. The downside of this is that both data
sets must be sorted prior to processing which can add a sizeable load to the data
sourcing. However, the existing dimension records only need to be kept in memory
whilst they are being used within the ETL processing pipeline. This has the advantages
of requiring potentially less memory as well as a reduced processing time prior to
execution, assuming the sort operations can be processed efficiently.

These two approaches can draw parallels with the different query join techniques
compared by Muslih and Saleh (Muslih and Saleh 2010), from which they identified a
sizeable difference in performance.

19
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 4 - Typical (simplified) structure of a load process using Merge Join

Method 4: Merge insert and updates (DB)

The entire process is managed within the database engine.

All records from the ETL pipeline will be loaded into a staging table, regardless of
whether they are new or changed rows. They are then merged into the destination
dimension table in a single operation. The single merge statement will perform three
actions on all records within a single transaction

- Insert new records

- Terminate previous records
- Insert changed records

This is the method proposed by Thornthwaite (Thornthwaite 2008) and Bethke (Bethke
2009) to make use of advances and new functionality in the T-SQL language and
database engines. Once this technique is learned it is also very fast and simple to
implement.

20
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 5 - Typical (simplified) structure of a load process using T-SQL Merge

As can be seen in Figure 5, this is a much more simple process to implement within the
ETL pipeline in SSIS, as the complexity of the process is all contained within the Merge
statement.

Tests

All four ETL methods will be run against numerous sets of test data, with varying sizes
of destination data and percentages of change data. The proposed tests to be
conducted are presented in Table 2:

% of rows containing changes

0% 0.01% 0.1% 1% 10%

0% Test 0 Test 1 Test 2 Test 3 Test 4

% of rows containing new data

0.01% Test 5 Test 6 Test 7 Test 8 Test 9

0.1% Test 10 Test 11 Test 12 Test 13 Test 14

1% Test 15 Test 16 Test 17 Test 18 Test 19

10% Test 20 Test 21 Test 22 Test 23 Test 24

Table 2 – Summary of tests covering different data volumes for new/changed data

21
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

H. Statistical Analysis
Logarithmic intervals of sample percentages will be used in order to examine both
small and large test sets.

Each ETL package will contain duration measurement functionality which will log how
long each test takes to complete. This duration is taken as the result for each test.

When repeated for each of the hardware platforms, and then for each of the four load
methods this will result in 200 tests. To help mitigate against any external influencing
factors, each test will be run three times, resulting in 600 individual data load tests
being run.

The results of each test will be analysed, with the four ETL methods compared using
statistical techniques appropriate for the distribution of the results, such as a
univariate analysis of variance (ANOVA). This will reveal whether there is any
statistically significant difference in performance between the methods for each test.

A decision tree data mining technique will also be employed to analyse the influence of
the parameters on the preferred method.

I. Test Rig Hardware

The results of the tests will be influenced heavily by the specification and performance
of the hardware running the tests. To ensure consistency across all tests, they will all
be run on the same machine, which will be isolated from any external influencing
factors and will not run any software other than that necessary for the tests.

The specification for this hardware platform was largely influenced by the work of
Tony Rogerson (Rogerson 2012) from his work on the Reporting-Brick.

The first storage platform will be a Raid 10 array of 7,200 rpm hard disks internal to the
server. It is common for corporate database server to use an external NAS (network
attached storage) system of 15,000 rpm drives for storage, however in the interests of
creating an isolated environment, maximising performance and reducing the

22
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

associated costs, internal 7,200 rpm drives will be used. A Raid 10 array has been used
to provide the increased performance expected from a corporate environment.

The second storage platform will be a solid state 160Gb Fusion-IO ioXtreme card,
directly attached to the server’s PCI bus.

The purpose of these tests is to identify the performance of loading data into the data
warehouse, it is therefore important to isolate the performance of data retrieval from
the source systems and insure that data sourcing does not have an impact on the
results. The server will therefore also be equipped with a separate solid state drive
which will serve the data to the ETL tests.

The tests will be run within a Hyper-V virtual machine provisioned with 4 cores and
12Gb RAM (random access memory), running 64 bit Windows Server 2008 R2 and SQL
Server 2012 Enterprise edition. The host server is a 6 core AMD Phenom II X6 1090T
3.2Ghz with 16Gb RAM running 64 bit Windows Server 2008 R2.

The ETL tasks will rely heavily on RAM. Further tests could be run using different
amounts of RAM in order to introduce this as a factor into the method comparison;
however this remains outside the scope of this project.

Database engines heavily use cache in order to optimise the performance of repeated
tasks. This would impact the performance tests being run, negatively impacting the
first tests and benefiting latter tests. To remove this influence, all services (database,
ETL engine, etc.) will be restarted between each test to clear out the RAM and reset
any cache.

J. Issues of access and ethics

For the purposes of this research, realistic dummy data will be generated in order to
prevent any issues arising from data security or confidentiality.

All results will be collected from managed tests against databases created specifically
for this task, which will not require permission from any third party.

It is not expected that any problems will be encountered relating to the issues of
access or ethics.

23
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

4. Results and Data analysis

This chapter presents the results of the data load tests, and explores the statistical
analysis and data mining techniques used to interpret the results. Statistically
significant outcomes are drawn from the various analyses, which will be further
interpreted in the following chapter.

Figure 6 (shown on page 26) presents a series of charts showing the average duration
of the three instances of each test. These are grouped by the number of new rows and
changed rows. Each chart compares the average duration of tests for each hardware
and method combination.

Note that these charts do not share the same scale.

A number of findings can be drawn from this, before any statistical analysis has been
performed.

The Singleton method, when used with traditional hard disks, performed considerably
worse than any other method for large data volumes (>= 0.5m) of either new or
changed rows. This was expected, and confirms the advice of Mundy et al (Mundy,
Thornthwaite and Kimball 2006) who recommend that the Slowly Changing Dimension
component is only advisable for data sets of less than 10,000 rows.

However, it is interesting to note that their recommendations aren’t as accurate when

solid state drives are in use. The results in the charts clearly show that the singleton
method performed on a par with or better than the other methods for both the 50k
(changed & new) data sets.

It should be noted that the Singleton method actually out performs all other methods
for both hardware platforms when less than 5k new or changed rows are being loaded.

The recommendation therefore stands that the SCD component should only be used
for small data sets, however the hardware platform clearly has an impact on what is
considered a small data set.

24
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

When the Singleton approach is excluded, the remaining three methods are much
closer together in their performance; however the Lookup method is consistently the
next lowest performer in the vast majority of the tests.

25
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 6 – Average duration of each test, grouped by new and changed rows, comparing the methods for each hardware platform

26
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Duration (s) --->

Figure 7 – HDD Results grouped by Method

27
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Duration (s) --->

Figure 8 – SSD Results grouped by Method

28
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 7 (HDD) & Figure 8 (SSD) show the same results, grouped by the method. The
first column groups the results by the number of new records, showing the number of
changed rows within each group. The right column shows the opposite, with the
changed row count in the outer grouping.

The difference in pattern is immediately obvious, with the right hand column of charts
showing a much stronger correlation. This indicates that the number of changed rows
is the driving factor in determining the time taken to load, with the number of new
rows making less of an impact.

These results will be examined in more detail using appropriate statistical analysis.

A. Statistical Analysis Method

In order to determine the appropriate statistical analysis method, the distribution of
the data was considered. The distribution of the raw dependent variable (duration) is
presented in Figure 9 below.

Figure 9 – Distribution of the dependent Duration variable

On the face of it this is not normally distributed, but heavily positively skewed with a
seemingly exponential distribution.

This is however a misleading representation, as the majority of the variation in the

results is expected to be caused by the input parameters (method, input rows, etc.).
Once these are taken into account, the remaining variance between the tests is
expected to be normally distributed.

29
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

To test this, a general linear model (prog glm) was run using code presented in
Appendix 1. The normal probability plot of the studentised residuals shown in Figure
10 passes through the origin but clearly shows a far from straight line. The assumption
of near-normality of the random errors is therefore not supported by this model.

Normal Probability Plot of Studentised Residuals for

the ETL Duration

Figure 10 – Normal Probability Plot (QQ Plot) of Studentised Residual

Given the logarithmic intervals of the new row and changed row input variables, the
same test was run against the logarithm of the duration result, using the code
presented in Appendix 2. The resulting normal probability plot is shown in Figure 11
below. This shows that in most cases the studentised residuals conform to an
approximate straight line of unit slope passing through the origin. There are however a
sizeable number of points showing with a noticeable tail, resulting in a curvilinear plot
indicating negative skewness. Although most points conform, the assumption of the
near-normality of the random errors is not supported when using the logarithm of the
duration result.

30
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Normal Probability Plot of Studentised Residuals for

the Logarithm of the ETL Duration

Figure 11 - Normal Probability Plot (QQ Plot) of Studentised Residuals (Log)

Studentised Residuals against Fitted Values for the ETL Duration

Figure 12 – Plot of Studentised Residuals against Fitted Values

31
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

The plot presented in Figure 12 above shows that the studentised residuals are not
randomly scattered about a mean of zero, the variance appears to decrease as the
fitted value increases.

The model used above treats the hardware and method as categorical factors and the
new and changed rows as numerical variables. The test was then repeated with all
inputs treated as categorical factors, using the SAS code presented in Appendix 3.

Normal Probability Plot of Studentised Residuals for

the Logarithm of the ETL Duration

Categorical inputs

Figure 13 – Normal Probability Plot (QQ Plot) of Studentised Residuals (Log) with categorical variables

The plot presented in Figure 13 shows that the studentised residuals clearly conform
to an approximate straight line of unit slope passing through the origin. In this model
there is no tail of non-conforming values, indicating that the assumption of the near-
normality of the random errors is supported. This is further supported by the plot
presented in Figure 14, which shows that the studentised residuals are randomly
scattered about a mean of zero.

32
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Studentised Residuals against Fitted Values for the ETL Duration

Figure 14 – Plot of Studentised Residuals against Fitted Values

The smaller ranges at the extremes of the plot are likely to be reflective of the smaller
number of observations at these extremes rather than a genuine reduction in variance.

Figure 15 below shows a histogram of the studentised residuals, and shows a very
close fit to the superimposed normal curve. The studentised residuals therefore
appear to be symmetrically distributed and unimodel as required.

Based on this evidence, normal distribution of the error component can be assumed,
and the multivariate ANOVA test using a general linear model is an appropriate form of
analysis for this data when treating all input parameters as categorical factors.

33
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Histogram of the Studentised Residuals for the ETL Duration

Figure 15 – Histogram of the Studentised Residual

The problem that this model causes is that, as can be seen from the results in
Appendix 4, the number of factor combinations and the number and complex nature
of the significant interactions makes interpretation very challenging. Treating the row
counts as categorical factors also does not provide sufficient information in the
statistical analysis results to interpolate or extrapolate the expected performance
characteristics of data volumes not tested in this research. This will reduce the ability
to apply the findings of this research to real-world scenarios.

After further experimentation with different transformations of the result, it was

found that both the curvilinear nature of the QQ Plot in Figure 11 and the decrease in
studentised residuals at high fitted values in Figure 12 appear to be largely caused by
the results from the Singleton method.

The Singleton method has already been discounted as a viable option for all scenarios
where the data volumes exceed 5k rows, as found in the original data plots in Figure 6.
Where necessary, the Singleton method’s performance characteristics can be

34
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

extracted from the categorical factor model analysis, with the scalability analysis for
the remaining methods derived from the numerical variable model.

The SAS code to generate the revised numerical model is presented in Appendix 13.
The analysis of the studentised residuals in Figure 16, Figure 17 and Figure 18 below
show that the numerical model is an appropriate form of analysis for this data, when
the singleton method is excluded from the results.

The plot presented in Figure 16 shows that the studentised residuals clearly conform
to an approximate straight line of unit slope passing through the origin.

The studentised residuals shown in Figure 17 appear randomly scattered about a mean
of zero. Again, the reduced range at the extremes of this plot reflects a smaller number
of observations. The histogram shown in Figure 18 shows a very close fit to the
superimposed normal curve.

Normal Probability Plot of Studentised Residuals for

the Logarithm of the ETL Duration

Numerical row counts, Excluding Singleton

Figure 16 - Normal Probability Plot (QQ Plot) of Studentised Residuals (Log), numerical row counts, excluding
Singleton

35
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Studentised Residuals against Fitted Values for the ETL Duration

Numerical row counts, excluding Singleton

Figure 17 - Plot of Studentised Residuals against Fitted Values, numerical row counts, excluding Singleton

Histogram of the Studentised Residuals for the ETL Duration

Numerical row counts
Excluding Singleton

Figure 18 - Histogram of the Studentised Residual, numerical row counts, excluding Singleton

36
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

B. Statistical Analysis – Factor Model

The results from the Analysis of Variance (ANOVA) test using row counts as categorical
factors are presented in Appendix 4.

The ANOVA results presented in Appendix 4 show that, with p values of <0.0001, all of
the individual explanatory terms are highly statistically significant, and therefore have
a proven impact on the duration of the ETL load.

With p values of <0.0001, all of the interactions between the explanatory factors are
also highly statistically significant, the only exception being the four way interaction
between all of the factors; Method, Hardware, ChangeRows and NewRows.

By itself this doesn’t provide much in the way of useful information for interpretation.
However by conducting further analysis of the least squares means (LS Means, or
marginal means) of the lower order factors it’s possible to investigate the relative
influence of the factors and their interactions.

The SAS code for this analysis is presented in Appendix 5 with the results presented in
Appendix 6 through Appendix 12.

Table 3 below shows the least squares means analysis comparing just the method,
excluding all other factors and interactions, with the Join method as the baseline
(Appendix 6). The performance degradation using the Lookup and Singleton methods
are both highly visible, with the Singleton method being the considerably worse
performing method. The Merge and Join methods are very close in performance, with
Join being the marginally better choice.

Parameter Least Squares Means

Lookup 0.527532426
Merge 0.015856022
Singleton 1.302011109
Join 0
Table 3 – Least Squares Means of Log(Duration) for the Methods, no interactions

37
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

The hardware choice, excluding any interactions, also shows a sizeable difference as
shown in Table 4, using the results from Appendix 7. As expected, the solid state
storage outperforms traditional hard disks.

Parameter Least Squares Means

Hardware HDD 0.862392001
Hardware SSD 0

Table 4 – Least Squares Means of Log(Duration) for the hardware, no interactions

When we introduce the method, the interaction effects show the different impact on
performance for the combinations, with the combined least square means shown in
Table 5 below, with the full results presented in Appendix 8.

Least Squares Means

Join HDD 6.526209
Join SSD 5.807498
Lookup HDD 6.901355
Lookup SSD 6.487416
Merge HDD 6.629513
Merge SSD 5.735906
Singleton HDD 8.180520
Singleton SSD 6.757208
Table 5 - Least Squares Means of Log(Duration) for the hardware and method interaction

The solid state storage tests showed consistently better performance across all
methods. The interactions between hardware and method show that the Singleton
and Merge methods benefit more from solid state that the other methods.

Both hardware platforms show a consistent pattern of performance across the

methods, putting the Singleton approach as the worst performing, with Join and
Merge as the best.

Table 6 below shows the least squares means analysis of the number of new and
change rows, excluding any other interactions, The LS Means clearly increase at a
visibly consistent rate as the new and change rows are increased, with a larger
increase for the number of changed rows. As we’re analysing the log of the result, this
indicates that the impact is increasing in an approximately exponential fashion, which
would be expected as the input row counts also increase exponentially. The full results
are presented in Appendix 9.

38
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Parameter Least Squares Means

changerows 5000k 4.755016135
changerows 500k 3.558073051
changerows 50k 2.283023333
changerows 5k 1.178104491
changerows Zero 0
newrows 5000k 3.205715956
newrows 500k 1.893797812
newrows 50k 0.983901698
newrows 5k 0.587110111
newrows Zero 0
Table 6 – Least Squares Means of Log(Duration) for new and changed rows, no interactions

It is also interesting to note that the interaction between new rows and change rows is
also highly statistically significant, with the details presented in Table 7 below. This
shows that the effects on log(result) are not additive. i.e. the log of the result is lower
for a combined load than for two individual loads of new and changed rows separated.

Even though the least squares means reflect an interaction, the pattern of the values is
consistent throughout, with higher log times for greater numbers of new and change
rows.

39
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Parameter Least Squares Means

changerows*newrows 5000k 5000k -2.774849232
changerows*newrows 5000k 500k -1.725871034
changerows*newrows 5000k 50k -0.925542348
changerows*newrows 5000k 5k -0.57453325
changerows*newrows 5000k Zero 0
changerows*newrows 500k 5000k -2.556194177
changerows*newrows 500k 500k -1.772295029
changerows*newrows 500k 50k -0.962048611
changerows*newrows 500k 5k -0.621384278
changerows*newrows 500k Zero 0
changerows*newrows 50k 5000k -1.401072078
changerows*newrows 50k 500k -1.197757761
changerows*newrows 50k 50k -0.859658893
changerows*newrows 50k 5k -0.540884073
changerows*newrows 50k Zero 0
changerows*newrows 5k 5000k -0.688697255
changerows*newrows 5k 500k -0.64568343
changerows*newrows 5k 50k -0.67222196
changerows*newrows 5k 5k -0.477499334
changerows*newrows 5k Zero 0
changerows*newrows Zero 5000k 0
changerows*newrows Zero 500k 0
changerows*newrows Zero 50k 0
changerows*newrows Zero 5k 0
changerows*newrows Zero Zero 0
Table 7 – Least Squares Means of Log(Duration) for the interaction between new and change rows

The next analyses, the results of which are presented in Appendix 10, investigate the
interactions between the method and increasing row counts.

Merge is found to perform the best in low data volume scenarios, and in all tests with
a low volume of new rows (500k and less), with Join becoming preferable with higher
volumes of new rows. Singleton proves comparable at very low data volumes (5k and
less), but scales very poorly. Lookup performs better than Singleton in tests with
greater than 50k rows, but only marginally. Although never the worst performing
method, Lookup is also never the best. This is visualised in the following chart in Figure
19 which clearly shows that the Singleton method provides comparable performance
for data volumes up to 5k new rows and 5k changed rows, but not beyond. This
confirms our earlier findings from the analysis of Figure 6, as well as that from other

40
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

sources covered in the literature review including Mundy et al (Mundy, Thornthwaite

and Kimball 2006).

Figure 19 - Combined Least Squares Means for Method and Varying Input Row Counts

The statistics presented earlier confirm that the two best performing methods are the
T-SQL Merge method and the SSIS Merge Join methods, with no statistically significant
difference between them. This reaffirms the findings from the initial plots in Figure 6
(page 26).

The next investigation focuses on these two methods, and examines the interaction
between these methods and other parameters. Note that the statistical model used
for this excludes the other two methods (Lookup and Singleton), and is analysing a
subset of the original test data. The code for this is presented in Appendix 11, with the
results presented in Appendix 12.

The more detailed investigation again shows that there is no significant difference
between the methods, with a p value of 0.7182, however the hardware and all two
and three way interactions are all highly significant at the 1% level, with the exception
of method against hardware which is only just significant at the 5% level.

41
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

When looking at the parameter estimates, the Merge method is significantly better
than Join for the baseline data of SSD and zero new & change rows, with a relative
parameter estimate difference of 0.426, significant at the 5% level.

Both methods show no statistically significant degradation in performance as new

rows are increased to 5k or 50k, only showing a significant increase in duration when
new rows reach 500k and 5m. Both methods however show a significant increase in
duration when the volume of change rows is increased to 50k and above, with Merge
also showing an increase at 5k change rows.

All other 2 way interactions prove to be significant, again highlighting the complex
nature of the performance characteristics of ETL loads.

C. Statistical Analysis – Numerical Model

In this section the numerical variable model will be interpreted, which excludes the
Singleton method and treats the new and change rows as numerical variables, using
the code presented in Appendix 13. Note that due to the very large values of new and
change rows, the parameter estimate per row is incredibly small. To improve the
accuracy of the analysis the number of rows has been divided by 1000 to allow greater
precision in the parameter estimates.

Treating the row counts as numerical variables allows a more in depth analysis of the
impact on ETL duration of varying numbers of new and change rows for the Join,
Merge and Lookup methods.

The results from the reduced model, excluding non-significant interactions, are
presented in Appendix 14.

As found previously, there is no statistically significant difference between the Merge

and Join methods, with the Lookup method performing significantly worse with a
parameter estimate of 0.910. Note that this is before any higher order interactions are
taken into account.

42
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

With a parameter estimate of 0.990, hard disks are significantly slower than solid state
storage.

Without taking any interactions into account, the number of change rows has a
significantly higher impact on performance degradation than the number of new rows,
with parameter estimates of 476x10-9 and 380x10-9 respectively.

Comparing method against hardware, Table 8 presents the combined parameter

estimates showing the baseline performance of the hardware and method
combination, and the impact per row of new and change data.

Also note the interaction between new and change rows. This is a statistically
significant interaction, but appears to have a negligibly small parameter estimate.
However this interaction estimate is applied to the product of new and change rows,
each with values up to 5m having a product of (5m)2. This interaction therefore
generates a material difference to the model at high data volumes.

Baseline: Zero new or change rows HDD SSD

Join 5.827 4.837
Lookup 6.737 5.747
Merge 5.827 4.837

Per 1000 Change Rows HDD SSD

Join 0.000475907 0.000475907
Lookup 0.000475907 0.000475907
Merge 0.000475907 0.000475907

Per 1000 New Rows HDD SSD

Join 0.000193906 0.000280638
Lookup 0.000150218 0.000236950
Merge 0.000292869 0.000379601

Per 1000 New & Change Rows (Interaction) HDD SSD

Join -0.000000042 -0.000000042
Lookup -0.000000042 -0.000000042
Merge -0.000000042 -0.000000042

Table 8 – Combined parameter estimates per row of input data, by hardware and method

From this we can see that the Lookup method starts out from a worse performing
position, with a starting parameter estimate of 6.737 against 5.827 for the other
methods on a hard disk platform, and 5.747 against 4.837 for solid state.

43
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

The log duration increases as the number of change rows increases, but at the same
rate for all three methods and for both hardware platforms. The impact of change
rows increasing is higher than that for new rows.

It should be noted that, although the parameter estimate (log duration) for HDD and
SSD are the same per change row, as SSD has a lower baseline value the impact on the
untransformed duration will be less. i.e. SSD scales much better than HDD for
increasing volumes of change rows. Contrast this with the increased parameter
estimates for SSD for volumes of new rows when compared with HDD. This confirms
the findings by Rogerson (Rogerson 2012) and Shaw & Black (Shaw and Back 2010)
that the performance gains of SSD can be best realised for random IO scenarios such as
database updates, not sequential IO such as database inserts.

The log duration of the load increases as the number of new rows increases, with the
largest increase per row for the merge method, followed by join then lookup
increasing the least, for both hardware platforms. This indicates that the gap between
the lookup method’s parameter and the other two will decrease as the data volumes
increase. It should be noted however that this model is estimating the log of the load
duration, not the duration itself. The Join method also has a higher parameter
estimate than the merge method for both hardware platforms. Although they start out
with comparable performance, the join method is likely to scale better at high volumes
of new data.

These findings are backed up by the visualisations in Figure 20 and Figure 21, which
show the effect on the parameter estimate of increasing the volume of input rows.

44
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 20 – Chart comparing the parameter estimates for the methods using HDD with increasing data volumes

Figure 21 - Chart comparing the parameter estimates for the methods using SSD with increasing data volumes

As the dependent variable being analysed is the log of the duration, the following two
charts, Figure 22 and Figure 23, show the same data but with the parameter estimates
transformed back into duration (in seconds) by taking the exponential of the
parameter estimate.

45
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 22 - Chart comparing the estimated load duration for the methods using HDD with increasing data volumes

Figure 23 - Chart comparing the estimated load duration for the methods using SSD with increasing data volumes

These charts clearly show that although the parameter estimate increases less for the
lookup method than the other methods, the logarithm transformation hides the fact
that the lookup method scales far worse than the merge or join methods.

It can also be seen that the performance characteristics of the methods when using
SSD are very similar to those when using HDD. Therefore despite the significant

46
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

improvement in performance that SSD provides, it doesn’t materially change the

nature of the performance characteristics. The only impact that SSD does have is that
the Merge method scales comparatively better and is still the best choice at very high
data volumes. In the HDD model, the Join method scales better than Merge and is the
best choice for high data volumes, diverging from Merge when loading over 2m rows.

These charts represent the performance characteristics when loading data with a
new/change split of 25%/75%. The characteristics and nature of the curves will change
if this split is varied. The following two plots in Figure 24 and Figure 25 show the same
curves when the split is reversed, at 75% new rows and 25% change rows.

Figure 24 - Chart comparing the estimated load duration for the methods using HDD with increasing data volumes

47
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 25 - Chart comparing the estimated load duration for the methods using SSD with increasing data volumes

These plots show that with a higher proportion of new rows to change rows, the
Merge method doesn’t scale anywhere near as well when on the HDD hardware
platform, and slightly worse when using SSD.

It should be noted that the durations presented on the y axis of the charts are only of
relevance to the hardware configuration used in this research. Different hardware
platforms with different CPUs, memory etc. will experience a different scale on the y-
axis, however the characteristics and nature of the performance comparison would be
expected to be consistent.

All of the above charts show an exponential increase in ETL load duration as the input
data volumes increase. It is expected that this is in large part caused by limitations of
hardware resource. There’s a finite amount of memory on a server and a finite size of
database cache etc. At lower data volumes the exponential curve is a very close
approximation to a linear relationship whereas the curved nature of the lines becomes
far more apparent at higher data volumes, which is to be expected as system resources
reach capacity.

Further research should be performed to investigate the scalability of the ETL methods
and the effect on their performance as server resources are increased.

48
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

D. Projection Model

The models discussed above and presented in Figure 22 to Figure 25 use formulae
derived from the parameter estimates of the various terms in the model. As discussed,
the scale of the duration will be impacted by the specific details of the hardware
platform, however the characteristics should be relatively consistent. The terms a & b
are included to provide customisation for different hardware platforms. These should
take the values 0 and 1 respectively to achieve the model used in this research.

The formulae for the models are presented in Equation 1 to Equation 6 below, where
the terms have the following meaning:

t = Time, ETL duration (seconds)

c = number of change rows / 1000
n = number of new rows / 1000
a = customisation term to apply model to different hardware scenarios (default 0)
b = customisation term to apply model to different hardware scenarios (default 1)

HDD Join model

Equation 1 – ETL Duration formula for using the Join method on HDD

HHD Merge model

Equation 2 - ETL Duration formula for using the Merge method on HDD

HHD Lookup model

Equation 3 - ETL Duration formula for using the Lookup method on HDD

SSD Join model

Equation 4 – ETL Duration formula for using the Join method on SSD

SSD Merge model

Equation 5 - ETL Duration formula for using the Merge method on SSD

SSD Lookup model

Equation 6 - ETL Duration formula for using the Lookup method on SSD

49
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 26 – Decision Tree showing the probability of being the best method for a given scenario

50
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

E. Decision Tree

Each of the methods was ranked within each test, with the best performing method
being given a rank of 1, with the worst performing method ranked 4.

A decision tree data mining algorithm was then applied to this rank data to determine
the decision process a user should use to identify the best method for a given scenario.
This was performed using the Microsoft Decision Trees Algorithm within SQL Server
Analysis Services.

Four input variables were used (Method, Hardware, NewRows and ChangeRows), with
the Rank being predicted. The results of this are presented in Figure 26 on the previous
page.

A number of conclusions can be drawn from the resulting decision tree map.

The Singleton method ranks last more than any other method, in 67% of the tests.
However it still ranks 1st in 14% of cases. Tracing the Singleton path through to levels 6
and 7 it is clear that the most effective situation for this method is where SSD
hardware is used, and with a small number of change rows, <= 5k.

The Lookup method ranks 3rd in 53% of tests, only ranking 1st in 7%; the majority of
cases where it was ranked 1st were in cases with zero changed rows.

The Merge and Join methods are ranked similarly, with the Join method preferred in
44% of cases and Merge with 34%. Merge is the preferred method when there are 50k
new rows. The Join method ranks better when there are a higher number of change
rows, it only ranked 1st in 10% of cases with zero change rows, 36% in cases with 5k
change rows and 58% of cases of 50k changes and above.

51
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

F. Dependency Network

The resulting dependency network, presented in Figure 27 shows that the strongest
influencer of achieving the top rank is the Method itself. This shows that the methods
are relatively stable with respect to being ranked 1st.

Figure 27 – Dependency Network

The number of change rows has the next strongest influence, followed by the number
of new rows.

The hardware platform influences the rank the least of all the variables.

52
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

5. Discussion
This chapter takes the statistical analysis performed in the previous chapter and breaks
it down into a number of summarised interpretations applicable to real world
scenarios. It applies the findings to those identified in the literature review, and aims
to provide those embarking on the development of a new ETL system with sufficient
knowledge from which to make informed choices.

A. Singleton Method
The statistical analysis shows that the singleton approach to loading SCD data offers
significantly lower performance than other methods in most scenarios.

The analysis presented in the discussion of Figure 6 and Figure 19 show that the
singleton method has comparable performance to the other methods with zero new
and changed records, but that the performance decreases far more dramatically than
the other methods when the data volumes increase. This indicates that the singleton
method is a potentially viable option for low data volume scenarios, especially when
solid state storage is in use.

The decision tree in Figure 26 shows that the singleton approach is particularly well
suited to <= 5k changed rows when solid state storage is used, and when the number
of new rows is less than 5m. The charts in Figure 6 also confirm this visually,
highlighting that this approach is well suited to low volumes of new and change
records (<=5k), especially when using solid state storage. The recommendation offered
by Mundy et al (Mundy, Thornthwaite and Kimball 2006) that the singleton approach is
most suited to small datasets with less than 10,000 rows is therefore confirmed.

All analysis shows that this approach is the least preferred method in most other cases.

These findings also confirm the findings of Olsen and Hauser (Olsen and Hauser 2007)
and Peter Scharlock (Scharlock 2008) in that when loading any sizeable data, bulk, set
based operations are preferable over row based singleton operations.

53
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

It should be noted, however, that even though the Singleton method offers the best
performance for these very low data volumes, the maximum benefit compared to the
next best performing method (T-SQL Merge) was only 54 seconds. The benefit is
therefore minimal when compared to the significant performance degradation as
volumes scale up.

B. Lookup Method
All analyses indicate that using the Lookup method should be avoided. The charts in
Figure 6 show that although rarely the worst performing method, it is very rarely the
best performing method. This is confirmed by the statistical analysis presented in Table
3 and Table 5. Figure 22 and Figure 23, showing the duration estimates for HDD and
SSD from the ANOVA model, both show a clear problem with the Lookup method, both
in its initial performance as well as its saleability when compared with the Merge and
Join methods.

The decision tree in Figure 26 shows that all bar one of the instances when this is the
preferred option are when there are zero changed records. As the purpose of a type 2
SCD is to manage changes, this is expected to be a rare occurrence in reality. It is
therefore advised to not use the lookup method as a high performance load option.

It should be noted that these results may be skewed by the large base data set used
(50m rows). The Lookup method requires the entire base data set to be loaded into
memory before ETL processing can begin, making this method more susceptible to
memory availability and increases in the base data set size. Further investigation
should be performed on smaller base sets to identify whether this method is more
appropriate in smaller scale scenarios which are out of scope of this research.

C. Join & Merge Methods

The analyses conducted in Table 3, Appendix 6, Appendix 12 and Appendix 14 indicate
that there is no significant difference between the performance of the Join and Merge
methods for either traditional disk storage or SSD storage. The charts presented in
54
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Figure 22 to Figure 25 indicate that at very high volumes of input data, the Join
method is usually preferable, which is backed up by the raw test results visualised in
the charts in Figure 6. Figure 24 shows that this is most prominent for traditional hard
disks and where there are a high proportion of new rows compared to change rows,
where the performance of the methods starts to diverge as early as 500k input rows.
On SSD the divergence starts at 3m rows. However, where there is high proportion of
change rows to new rows, Merge always outperforms Join for all data volumes on SSD,
and up to 2m input rows on HDD.

The charts presented in Figure 6 show that the Merge method performed better than
the Join method in all cases with lower data volumes, specifically <=5k changed rows
and <=50k new rows, for both hardware platforms. The Join method seems to scale
better, with marginally improved performance when compared against Merge as
either new or change rows reach and exceed 500k rows. This is confirmed by the
results presented in Appendix 10.

The decision tree presented in Figure 26 finds that the Join method is the best option
in most cases, followed very closely by the Merge method. Merge performs top in 31%
and 2nd in 47% of tests, with Join performing top in 44% and 2nd in 37% of cases.

The decision tree then refines the criteria for each, showing Join as unsuitable when
there are zero change rows, and showing Merge as most suitable when there are 50k
new rows.

These two approaches compete for the role of the best performing method, with each
marginally outperforming the other in different scenarios.

Given the comparable performance of the two methods, it should be left to the system
architect to determine the best approach, taking into account other factors such as
speed of development, maintainability, experience, code flexibility etc.

55
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

D. Solid State Storage

It is clear that using solid state storage does not fundamentally change the design
approach of which method is the most appropriate to achieve maximum performance.
The decision tree in Figure 8 shows that the only case where it does have a noticeable
impact is when the singleton method is employed, and where there are low number of
change records (<=5k).

The dependency network in Figure 27 also confirms that the storage platform has the
least influence of all the parameters when considering which design method offers the
best performance.

The statistical analysis however confirms that the use of solid state storage provides a
significant improvement in load performance in every scenario. The use of SSD
technology will therefore have a large beneficial impact on the duration of the data
loads in all cases.

Although the use of SSD should not alter the design decisions that are made when
planning a new data load project, it is clear that the technology will significantly
improve the performance of any implementation it is applied to.

As can be seen from Figure 6, the performance benefit of SSD is most noticeable with
the singleton method, and with the impact increasing with higher volumes of change
records. In some cases the performance improvement was up to 92% (12.5x
performance) on like for like tests. The nature of this performance gain can be
attributed to the characteristics of solid state, as presented by Shaw and Back (Shaw
and Back 2010), Fusion IO (Fusion-IO 2011) and Tony Rogerson (Rogerson 2012); the
singleton method relies very heavily on random read operations to read each existing
dimension record, one at a time. The biggest performance difference between
traditional disks and solid state storage is the performance of random reads, which
explains the slow results when using traditional disks and the significant improvement
when using solid state technology.

The timing results show that the impact of solid state storage was smallest in tests
with 5m new rows, although still providing on average a 52.9% performance

56
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

improvement (2.1x). The nature of new records requires largely sequential IO, writing
all new rows in a single sequential block. This doesn’t make use of the random IO
benefits of solid state, however solid state still provides a significant improvement in
performance of at least 19.5% (1.2x) in the worst case scenario for the singleton
method (5m new rows, 0 change rows).

Although earlier analysis showed that the use of solid state devices shouldn’t change
the design approach for a new system, this shows that it can be a very effective
solution to improve the performance of existing systems which may not have been
designed in an optimal way, and may negate the need to rewrite systems that are
approaching the limit of the available data load window.

E. New & Changed Rows

The statistical analysis presented in Appendix 4 indicates that the number of changed
rows has a higher impact than the number of new rows being loaded into the
dimension.

The dominance of the change records over the new records is backed up by the
dependency network in Figure 27 as well as visually in Figure 7 and Figure 8.

Figure 22 and Figure 24 also show that the ratio of new to change rows can also impact
the relative performance of ETL load methods, with Merge scaling comparatively much
better when there is a higher proportion of change rows, and worse when there’s a
low proportion of changes to new rows.

57
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

6. Conclusion
The results and analyses of this research has identified a number of criteria that affect
the performance of loading data into Type 2 data warehouse slowly changing
dimensions. This chapter provides a high level overview of the findings.

The use of solid state devices for data storage provides a significant benefit to the
performance of loading data in virtually every scenario, with performance benefits of
up to 92% (12.5x). Using solid state storage however should not fundamentally change
the design patterns of how ETL systems are designed.

When determining the most appropriate method to manage the loading of Type 2
SCDs, both the T-SQL Merge and SSIS Merge Join methods offered significantly higher
performance than the other methods in most tests. Merge Join however should be
preferred for higher volume scenarios, where the number of new or changed rows
reached and exceeds 500k. For other scenarios the choice can be determined by other
factors such as personal preference or server architecture.

The exception to this is where there are a very small number of changed rows, at 5k
rows or less, especially when solid state storage is in use. In these cases a Singleton
approach becomes feasible from a performance perspective. However, considering the
small benefit over other methods, as well as the inability of the method to scale, it is
recommended that the Singleton approach is not adopted.

It should be noted that this research focuses entirely on batch ETL load systems. As
described in the introduction, there is a growing trend towards real-time data
warehouse systems which by their very nature need to load small volumes of data as
soon as they’re received. The entire load framework is therefore constrained by design
to use a singleton approach to load the incoming data. The findings in this research
show that solid state storage systems should be of particular interest to these
scenarios, as they should be able to leverage the maximum possible benefit from SSD
technology.

58
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

This research has focused entirely on the performance of the methods and other
variables. In reality the run-time performance is only one of a number of factors which
need to be considered including the implementation complexity, development
duration, hardware cost, resource/skill availability and simplicity/ease of maintenance.

Given the lack of detailed analysis found during the research phase of this work, the
author hopes that this project will go some way to filling the void, and provide some
guidance to business intelligence architects, designers and developers to have more
confidence in their choice when selecting an ETL methodology.

59
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

7. Evaluation
The issue of loading data into data warehouse dimensions is in itself an incredibly
broad scope. This research has attempted to provide detailed analysis on the core
functionality in order to provide direction to anyone embarking on a new ETL project.

It should be noted however that due to the sheer number of possible factor
combinations, a single research investigation is unable to cover all possible scenarios.
This research has investigated the primary factors and provided a comprehensive
understanding of the nature of those factors. The results will however not necessarily
hold true for every scenario.

Further research should be conducted exploring the impact of other variables such as

Server memory & other hardware specification – The considerable impact of the hard
disk platform has been shown in this research, however this is only one of many
variables in hardware selection. The Lookup method is especially impacted by the
available memory due to its requirement to load the complete dimension into
memory, however the impact on the other methods is not explored by this research.
The exponential nature of the performance curves, as presented in Figure 22 and
Figure 23, indicate that scalability is likely to be impacted by hardware constraints
Changing the size of the base data set – The data set in this research used a static 50m
records. It’s possible that smaller or larger data sets may provide different results,
especially when tested in conjunction with the available server memory, and the
width/size of each record.
Storage Area Network (SAN) storage – This research used local storage for both
hardware platforms, HDD Raid 10 and SSD, in order to provide an isolated test
environment. The impact of the storage platform has been proven; it would therefore
be of interest to explore different storage platforms. It’s common for data warehouse
in the real world to use storage area networks, which exhibit their own unique
performance characteristics.
Solid State Storage – The solid state device used in this research was a relatively low
performance card compared to some that are now available from a variety of

60
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

manufacturers. Fusion IO now offer a very wide range of cards including an Octal card
which offers performance up to 8 times that of the card used in this project. This is
likely to exaggerate the HDD/SSD differences considerably, and may expose
performance characteristics not revealed by this research. Fusion IO is also only one of
many enterprise NAND/SSD storage providers including X-IO and Violin, each of which
offer different performance characteristics.
Splitting the workload onto a number of servers – This research used a single server
to run the ETL process as well as the source and destination databases. These three
elements are often split up onto three separate servers to improve performance
further. This offers an opportunity to benefit from specific performance characteristics
of different load methods, based on the relative performance of the method. For
example the Singleton process relies heavily on the ETL server to manage the load
process, whereas the T-SQL Merge method offloads the bulk of the work to the
database server.
Loading data into multiple partitions – In large data warehouses it is common to
partition fact tables to improve query and load performance. It may also be of benefit
to explore the impact of partitioning dimension data, if the dataset is suitable.
Data throughput characteristics of retrieving data from source systems – The tests
performed in this project sourced the incoming data from a local sold state device in
order to exclude the performance of source data retrieval from the results. It’s
common for source systems to provide data at a rate slower than the capacity of the
ETL mechanism, reducing the impact of ETL method selection.
Derivative or alternative ETL load methods – There are countless enhancements and
alternative methods available aside from the four presented in this research. The use
of third party components, checksums etc. all provide ETL load options not explored in
this project. It would be of interest to take the two best methods identified by this
project (Merge Join and T-SQL Merge) and explore the impact of evolving these
further.
Different toolset – SQL Server Integration Services is only one of a number of toolsets
that can be used for ETL processing, including SAS Data Integration Server, Informatica
PowerCenter, Oracle Data Integrator and IBM InfoSphere. Although the theory behind

61
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

the load process is likely to be similar between different implementations, the

performance specific are likely to vary.

This research has found significant differences in the performance of loading data,
depending on the hardware and method used. It is expected that most of the factors
above are also likely to have an impact on the load performance; some may change
the relative performance of the methods whereas others may not.

Analysing the interaction of the variables present in this research presented somewhat
of a challenge due the sheer number of statistically significant interactions. Increasing
the number of variables further would render statistical analysis even more complex,
so is unlikely to be feasible. It is therefore likely that further research would benefit
from selecting a different subset of the parameters, or an alternative statistical
method adopted.

Given the scope of this research, and taking into account the limitations discussed
above, the findings provide clear guidance to data warehouse architects and
developers on the relative merits of the different load methods. It’s now clear that the
Merge Join and T-SQL Merge methods are equivalent in performance and in most
cases should be considered the only choices; the decision between them can be left to
personal choice or other input factors not considered here.

It’s hoped that the work undertaken here will be of benefit to any organisation looking
to implement a data warehouse, reducing both the cost and duration of development
by providing clear guidelines and reducing the need to perform investigative
prototypes.

It’s also hoped that organisations will benefit from the investigation into the
performance of solid state storage. There is a clear benefit both to new projects, and
also as a remedy for poorly performing systems, for which the use of SSD technology
may be far more cost effective than the redesign and redevelopment of the ETL layer.

62
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

8. References
BECKER, B and KIMBALL, R (2007). Kimball University: Think Critically When Applying
Best Practices. [online]. Last accessed 28 May 2011 at:
http://www.kimballgroup.com/html/articles_search/articles2007/0703IE.html?articleI
D=198700049

BETHKE, Uli (2009). One pass SCD2 load: How to load a Slowly Changing Dimension
Type 2 with one SQL Merge statement in Oracle. [online]. Last accessed 17 12 2010 at:
http://www.business-intelligence-quotient.com/?p=66

Dramatically Increasing SAS DI Studio performance of SCD Type-2 Loader Transforms.

(2010). [online]. Last accessed 18 12 2010 at:
http://www.philihp.com/blog/2010/dramatically-increasing-sas-di-studio-
performance-of-scd-type-2-loader-transforms/

EMBARCADERO (2010). Database Trends Survey. [online]. Last accessed 12 12 2010 at:
http://www.embarcadero.com/reports/database-trends-survey

FUSION-IO (2011). Online University Learns the Power of Fusion-io. [online]. Last
accessed 22 10 2011 at: http://www.fusionio.com/case-studies/online-university/

GAGNON, G (1999). Data warehousing: An overview. PC Magazine, 19 March, 245-246.

HWANG, Mark I and XU, Hongjiang (2007). The Effect of Implementation Factors on
Data Warehousing Success: An Exploratory Study. Journal of Information, Information
Technology, and Organizations, 2, 1-14.

INMON, W. H. (2007). Some straight talk about the costs of data warehousing. Inmon
Consulting.

KIMBALL, R (2004). The Data Warehouse ETL Toolkit : Practical Techniques for
Extracting, Cleaning, Conforming, and Delivering Data. Wiley.

KIMBALL, Ralph (2001). Kimball Design Top #22: Variable Depth Customer Dimensions.
[online]. Last accessed 14 January 2012 at:

63
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2001/KimballDT22Varia
bleDepth.pdf

KIMBALL, R (2008). Slowly Changing Dimension. DM review, 18 (9), 29.

KIMBALL, R and ROSS, M (2002). The Data Warehouse Toolkit. 2nd ed., John Wiley and
Sons.

MCKINSEY GLOBAL INSTITUTE (2011). Big Data: The next frontier for innovation,
competition, and productivity. White Paper, McKinsey Global Institute.

MICROSOFT (2011). Lookup Transformation. [online]. Last accessed 23 10 2011 at:

http://msdn.microsoft.com/en-us/library/ms141821.aspx

MUNDY, J, THORNTHWAITE, W and KIMBALL, R (2006). The Microsoft Data Warehouse

Toolkit. Indianapolis, Wiley.

MUNDY, J, THORNTHWAITE, W and KIMBALL, R (2011). The Microsoft Data Warehouse

Toolkit. 2nd ed., Indianapolis, Wiley Publishing.

MUSLIH, O.K. and SALEH, I.H. (2010). Increasing Database Performance through
Optimizing Structure Query Language Join Statement. Journal of Computer Science, 6
(5), 585-590.

NOVOSELAC, Steve (2009). SSIS - Using Checksums to Load Data into Slowly Changing
Dimensions. [online]. Last accessed 11 March 2012 at:
http://sqlserverpedia.com/wiki/SSIS_-
_Using_Checksum_to_Load_Data_into_Slowly_Changing_Dimensions

OLSEN, David and HAUSER, Karina (2007). Teaching Advanced SQL Skills: Text Bulk
Loading. Journal of Information Systems Education, 18 (4), 399.

PRIYANKARA, Dinesh (2010). SSIS: Replacing SCD Wizard with the MERGE statement.
[online]. Last accessed 11 March 2012 at: http://dinesql.blogspot.com/2010/11/ssis-
replacing-slowly-changing.html

64
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

ROGERSON, Tony (2012). MSc Dissertation: Reporting-Brick (www.reportingbrick.com).

University of Dundee.

ROSS, M and KIMBALL, R (2005). Slowly Changing Dimension Are Not Always as Easy as
1,2,3. Intelligent Enterprise, 8 (3), 41-43.

ROSSUM, Joost van (2011). Slowly Changing Dimension Alternatives. [online]. Last
accessed 22 October 2011 at: http://microsoft-ssis.blogspot.com/2011/01/slowly-
changing-dimension-alternatives.html

RUDESTAM, KJELL, Erik and NEWTON, Rae R (2001). Surviving your dissertation: A
comprehensive guide to content and process. Thousand Oaks, Calif., Sage Publications.

SCHARLOCK, Peter (2008). Increase your SQL Server performance by replacing cursors
with set operations. [online]. Last accessed 14 10 2011 at:
http://blogs.msdn.com/b/sqlprogrammability/archive/2008/03/18/increase-your-sql-
server-performance-by-replacing-cursors-with-set-operations.aspx

SHAW, Steve and BACK, Martin (2010). Pro Oracle Database 11g RAC on Linux. Apress
Academic.

THORNTHWAITE, Warren (2008). Design Tip #107 Using the SQL MERGE Statement for
Slowly Changing Dimension Processing. [online]. Last accessed 17 12 2010 at:
http://www.rkimball.com/html/08dt/KU107_UsingSQL_MERGESlowlyChangingDimens
ion.pdf

VARIOUS (2004). Best method to handle SCD. [online]. Last accessed 11 March 2012 at:
http://www.sqlservercentral.com/Forums/Topic1200461-363-1.aspx

VEERMAN, Erik, LACHEV, Teo and SARKA, Dejan (2009). Microsoft SQL Server 2008 -
Business Intelligence Development and Maintenance. Redmond, Microsoft Press.

WATSON, H. J. and HALEY, B. J. (1997). Data Warehousing: A Framework and Survey of

Current Practices. Journal of Data Warehousing, 2 (1), 10-17.

WATSON, H and WIXOM, B (2007). The Current State of Business Intelligence.

Computer, 40 (9), 96-99.
65
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

WHALEN, Edward, et al. (2006). Microsoft SQL Server 2005 Administrator’s Companion.
Microsoft Press.

WIKIPEDIA (2010). Slowly Changing Dimension. [online]. Last accessed 18 12 2010 at:
http://en.wikipedia.org/wiki/Slowly_changing_dimension

66
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

9. Appendix

Appendix 1. SAS Code – General Linear Model

proc glm data = etlresults;

class methodname hardware;
model results = methodname|hardware|changerows|newrows /ss3 solution;
output out=FITS predicted=P rstudent=E;

proc univariate data=FITS;

histogram E/normal;
qqplot E;
run;

proc gplot;
plot E*P/href=0;
run;
quit;

Appendix 2. SAS Code – General Linear Model (Log)

data etlresults; set etlresults;

logresults = log(results);
run;

proc glm data = etlresults;

class methodname hardware;
model logresults = methodname|hardware|changerows|newrows /ss3 solution;
output out=FITS predicted=P rstudent=E;

proc univariate data=FITS;

histogram E/normal;
qqplot E;
run;

proc gplot;
plot E*P/href=0;
run;
quit;

[Appendix] 1
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 3. SAS Code – General Linear Model (Log, category

variables)

data etlresults; set etlresults;

logresults = log(results);
run;

proc glm data = etlresults;

class methodname hardware changerows newrows;
model logresults = methodname|hardware|changerows|newrows /ss3 solution;
output out=FITS predicted=P rstudent=E;

proc univariate data=FITS;

histogram E/normal;
qqplot E;
run;

proc gplot;
plot E*P/href=0;
run;
quit;

[Appendix] 2
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 4. ANOVA Statistical Results

Source DF Sum of Squares Mean Square F Value Pr > F

Model 199 1852.440533 9.308746 192.77 <.0001

Error 400 19.315949 0.048290

Corrected Total 599 1871.756482

R-Square Coeff Var Root MSE logresults Mean

0.989680 3.315372 0.219750 6.628203

Source DF Type III SS Mean Square F Value Pr > F

MethodName 3 168.3599881 56.1199960 1162.15 <.0001

Hardware 1 111.5579946 111.5579946 2310.17 <.0001

MethodName*Hardware 3 20.1510047 6.7170016 139.10 <.0001

changerows 4 940.6857049 235.1714262 4869.99 <.0001

MethodNam*changerows 12 85.8984729 7.1582061 148.23 <.0001

Hardware*changerows 4 9.1620845 2.2905211 47.43 <.0001

MethodHardwachange 12 12.6322597 1.0526883 21.80 <.0001

newrows 4 235.9692219 58.9923055 1221.63 <.0001

MethodName*newrows 12 88.3301587 7.3608466 152.43 <.0001

Hardware*newrows 4 11.4666393 2.8666598 59.36 <.0001

MethodHardwanewrow 12 3.3383592 0.2781966 5.76 <.0001

changerows*newrows 16 89.3853195 5.5865825 115.69 <.0001

Methodchangenewrow 48 68.3971595 1.4249408 29.51 <.0001

Hardwachangenewrow 16 4.1683444 0.2605215 5.39 <.0001

MethHardchan*newro 48 2.9378214 0.0612046 1.27 0.1178

[Appendix] 3
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 5. SAS Analysis code

data etlresults; set etlresults;

logresults = log(results);
run;

proc format;
value RowOrd 5000000='5000k' 500000='500k' 50000='50k' 5000='5k' 0='Zero';
value $MethOrd 'Join'='zJoin' 'Lookup'='Lookup' 'Singleton'='Singleton'
'Merge'='Merge';
run;

Title 'Detailed Analysis';

proc glm data = etlresults;
class hardware methodname changerows newrows;
model logresults = hardware|methodname|changerows|newrows /ss3 solution;
FORMAT methodname $MethOrd.;
FORMAT changerows RowOrd.;
FORMAT newrows RowOrd.;
lsmeans methodname hardware hardware*methodname newrows*changerows
method*newrows*changerows;
run;

quit;

[Appendix] 4
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 6. ANOVA Results – Method Least Square Means

logresults
MethodName LSMEAN

Lookup 6.69438576

Merge 6.18270936

Singleton 7.46886445

zJoin 6.16685334

[Appendix] 5
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 7. ANOVA Results – Hardware Least Square Means

logresults
Hardware LSMEAN

HDD 7.05939923

SSD 6.19700723

[Appendix] 6
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 8. ANOVA Results – Hardware/Method Least Square

Means

logresults
Hardware MethodName LSMEAN

HDD Lookup 6.90135524

HDD Merge 6.62951264

HDD Singleton 8.18052045

HDD zJoin 6.52620858

SSD Lookup 6.48741629

SSD Merge 5.73590608

SSD Singleton 6.75720844

SSD zJoin 5.80749810

[Appendix] 7
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 9. ANOVA Results – Row Count Least Square Means

logresults
changerows newrows LSMEAN

5000k 5000k 8.86098528

5000k 500k 8.59804533

5000k 50k 8.48847791

5000k 5k 8.44269542

5000k Zero 8.43011856

500k 5000k 7.88269725

500k 500k 7.35467825

500k 50k 7.25502856

500k 5k 7.19890130

500k Zero 7.23317547

50k 5000k 7.76276963

50k 500k 6.65416580

50k 50k 6.08236856

50k 5k 6.00435179

50k Zero 5.95812575

5k 5000k 7.37022561

5k 500k 6.10132129

5k 50k 5.16488665

5k 5k 4.96281769

5k Zero 4.85320691

Zero 5000k 6.88081838

Zero 500k 5.56890023

Zero 50k 4.65900412

Zero 5k 4.26221253

Zero Zero 3.67510242

[Appendix] 8
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 10. ANOVA Results – Method/Row Count Least Square

Means

logresults
MethodName changerows newrows LSMEAN

Lookup 5000k 5000k 8.3790138

Lookup 5000k 500k 8.2956621

Lookup 5000k 50k 8.2972324

Lookup 5000k 5k 8.1367724

Lookup 5000k Zero 8.1780462

Lookup 500k 5000k 7.5312798

Lookup 500k 500k 7.6100263

Lookup 500k 50k 7.6432768

Lookup 500k 5k 7.5483793

Lookup 500k Zero 7.6526958

Lookup 50k 5000k 7.5203225

Lookup 50k 500k 7.0665244

Lookup 50k 50k 6.4766623

Lookup 50k 5k 6.3562486

Lookup 50k Zero 6.3463304

Lookup 5k 5000k 6.9914215

Lookup 5k 500k 6.0073375

Lookup 5k 50k 5.5500444

Lookup 5k 5k 5.4965852

Lookup 5k Zero 5.5387974

Lookup Zero 5000k 5.8247908

Lookup Zero 500k 4.9054753

Lookup Zero 50k 4.6832403

Lookup Zero 5k 4.7222190

Lookup Zero Zero 4.6012595

Merge 5000k 5000k 8.0611437

Merge 5000k 500k 7.7566498

Merge 5000k 50k 7.6264854

Merge 5000k 5k 7.5819476

Merge 5000k Zero 7.5632220

Merge 500k 5000k 7.3648676

Merge 500k 500k 6.8984390

Merge 500k 50k 6.7824013

Merge 500k 5k 6.6621359

Merge 500k Zero 6.7792765

[Appendix] 9
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

logresults
MethodName changerows newrows LSMEAN

Merge 50k 5000k 7.3316355

Merge 50k 500k 6.1182973

Merge 50k 50k 5.6901484

Merge 50k 5k 5.7358364

Merge 50k Zero 5.6780172

Merge 5k 5000k 6.8022395

Merge 5k 500k 5.9069238

Merge 5k 50k 4.7892505

Merge 5k 5k 4.6075410

Merge 5k Zero 4.6134970

Merge Zero 5000k 6.7970334

Merge Zero 500k 5.2954787

Merge Zero 50k 4.1662407

Merge Zero 5k 4.0222447

Merge Zero Zero 3.9367813

Singleton 5000k 5000k 10.9931310

Singleton 5000k 500k 10.4698805

Singleton 5000k 50k 10.3680490

Singleton 5000k 5k 10.3553656

Singleton 5000k Zero 10.3030546

Singleton 500k 5000k 9.4808926

Singleton 500k 500k 8.4699038

Singleton 500k 50k 8.1284349

Singleton 500k 5k 8.0803641

Singleton 500k Zero 8.0204820

Singleton 50k 5000k 9.2317580

Singleton 50k 500k 7.4039275

Singleton 50k 50k 6.4412935

Singleton 50k 5k 6.2271706

Singleton 50k Zero 6.1794251

Singleton 5k 5000k 9.1476191

Singleton 5k 500k 7.0357932

Singleton 5k 50k 5.2898797

Singleton 5k 5k 4.7616561

Singleton 5k Zero 4.3424068

Singleton Zero 5000k 9.0879094

Singleton Zero 500k 7.0076610

Singleton Zero 50k 5.0066602

[Appendix] 10
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

logresults
MethodName changerows newrows LSMEAN

Singleton Zero 5k 3.4869206

Singleton Zero Zero 1.4019721

zJoin 5000k 5000k 8.0106526

zJoin 5000k 500k 7.8699889

zJoin 5000k 50k 7.6621449

zJoin 5000k 5k 7.6966961

zJoin 5000k Zero 7.6761514

zJoin 500k 5000k 7.1537489

zJoin 500k 500k 6.4403438

zJoin 500k 50k 6.4660013

zJoin 500k 5k 6.5047259

zJoin 500k Zero 6.4802476

zJoin 50k 5000k 6.9673625

zJoin 50k 500k 6.0279140

zJoin 50k 50k 5.7213700

zJoin 50k 5k 5.6981516

zJoin 50k Zero 5.6287304

zJoin 5k 5000k 6.5396223

zJoin 5k 500k 5.4552307

zJoin 5k 50k 5.0303721

zJoin 5k 5k 4.9854885

zJoin 5k Zero 4.9181265

zJoin Zero 5000k 5.8135399

zJoin Zero 500k 5.0669859

zJoin Zero 50k 4.7798753

zJoin Zero 5k 4.8174658

zJoin Zero Zero 4.7603967

[Appendix] 11
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 11. SAS Analysis Code – Join and Merge

Title 'Join and Merge';

data etlresults2; set etlresults;
if MethodName='Join' OR MethodName='Merge';
run;
proc glm data = etlresults2;
class methodname hardware changerows newrows;
model logresults = methodname hardware methodname*hardware methodname*changerows
methodname*newrows methodname*hardware*changerows
methodname*hardware*newrows /ss3 solution;
FORMAT changerows RowOrd.;
FORMAT newrows RowOrd.;
run;

quit;

[Appendix] 12
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 12. ANOVA Results – Join and Merge

Source DF Sum of Squares Mean Square F Value Pr > F

Model 35 441.8826822 12.6252195 87.35 <.0001
Error 264 38.1575544 0.1445362
Corrected Total 299 480.0402366

R-Square Coeff Var Root MSE logresults Mean

0.920512 6.156965 0.380179 6.174781

Source DF Type III SS Mean Square F Value Pr > F

MethodName 1 0.0188560 0.0188560 0.13 0.7182
Hardware 1 48.7418668 48.7418668 337.23 <.0001
MethodName*Hardware 1 0.5735370 0.5735370 3.97 0.0474
MethodNam*changerows 8 301.9583629 37.7447954 261.14 <.0001
MethodName*newrows 8 75.4532096 9.4316512 65.25 <.0001
Method*Hardwa*change 8 10.1472853 1.2684107 8.78 <.0001
Method*Hardwa*newrow 8 4.9895645 0.6236956 4.32 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 4.142689750 B 0.13169792 31.46 <.0001
MethodName Join 0.425552708 B 0.18624899 2.28 0.0231
MethodName Merge 0.000000000 B . . .
Hardware HDD 0.464630837 B 0.18624899 2.49 0.0132
Hardware SSD 0.000000000 B . . .
MethodName*Hardware Join HDD -0.054055915 B 0.26339585 -0.21 0.8376
MethodName*Hardware Join SSD 0.000000000 B . . .
MethodName*Hardware Merge HDD 0.000000000 B . . .
MethodName*Hardware Merge SSD 0.000000000 B . . .
MethodNam*changerows Join 5000k 2.704117172 B 0.13882180 19.48 <.0001
MethodNam*changerows Join 500k 1.333582846 B 0.13882180 9.61 <.0001
MethodNam*changerows Join 50k 0.625439344 B 0.13882180 4.51 <.0001
MethodNam*changerows Join 5k 0.147147697 B 0.13882180 1.06 0.2901
MethodNam*changerows Join Zero 0.000000000 B . . .
MethodNam*changerows Merge 5000k 2.806825815 B 0.13882180 20.22 <.0001
MethodNam*changerows Merge 500k 1.479337039 B 0.13882180 10.66 <.0001
MethodNam*changerows Merge 50k 0.777049662 B 0.13882180 5.60 <.0001
MethodNam*changerows Merge 5k 0.298682695 B 0.13882180 2.15 0.0323
MethodNam*changerows Merge Zero 0.000000000 B . . .
MethodName*newrows Join 5000k 1.148798851 B 0.13882180 8.28 <.0001
MethodName*newrows Join 500k 0.328658299 B 0.13882180 2.37 0.0186
MethodName*newrows Join 50k -0.028393304 B 0.13882180 -0.20 0.8381

[Appendix] 13
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Standard
Parameter Estimate Error t Value Pr > |t|
MethodName*newrows Join 5k -0.063072701 B 0.13882180 -0.45 0.6500
MethodName*newrows Join Zero 0.000000000 B . . .
MethodName*newrows Merge 5000k 1.866529177 B 0.13882180 13.45 <.0001
MethodName*newrows Merge 500k 0.834375788 B 0.13882180 6.01 <.0001
MethodName*newrows Merge 50k 0.010436300 B 0.13882180 0.08 0.9401
MethodName*newrows Merge 5k -0.107154821 B 0.13882180 -0.77 0.4409
MethodName*newrows Merge Zero 0.000000000 B . . .
Method*Hardwa*change Join HDD 5000k 0.062713738 B 0.19632367 0.32 0.7496
Method*Hardwa*change Join HDD 500k 0.455555813 B 0.19632367 2.32 0.0211
Method*Hardwa*change Join HDD 50k 0.671227236 B 0.19632367 3.42 0.0007
Method*Hardwa*change Join HDD 5k 0.381935111 B 0.19632367 1.95 0.0528
Method*Hardwa*change Join HDD Zero 0.000000000 B . . .
Method*Hardwa*change Join SSD 5000k 0.000000000 B . . .
Method*Hardwa*change Join SSD 500k 0.000000000 B . . .
Method*Hardwa*change Join SSD 50k 0.000000000 B . . .
Method*Hardwa*change Join SSD 5k 0.000000000 B . . .
Method*Hardwa*change Join SSD Zero 0.000000000 B . . .
Method*Hardwa*change Merge HDD 5000k 0.135016286 B 0.19632367 0.69 0.4922
Method*Hardwa*change Merge HDD 500k 1.149062584 B 0.19632367 5.85 <.0001
Method*Hardwa*change Merge HDD 50k 0.980363107 B 0.19632367 4.99 <.0001
Method*Hardwa*change Merge HDD 5k 0.403303852 B 0.19632367 2.05 0.0409
Method*Hardwa*change Merge HDD Zero 0.000000000 B . . .
Method*Hardwa*change Merge SSD 5000k 0.000000000 B . . .
Method*Hardwa*change Merge SSD 500k 0.000000000 B . . .
Method*Hardwa*change Merge SSD 50k 0.000000000 B . . .
Method*Hardwa*change Merge SSD 5k 0.000000000 B . . .
Method*Hardwa*change Merge SSD Zero 0.000000000 B . . .
Method*Hardwa*newrow Join HDD 5000k -0.289088246 B 0.19632367 -1.47 0.1421
Method*Hardwa*newrow Join HDD 500k -0.098592314 B 0.19632367 -0.50 0.6160
Method*Hardwa*newrow Join HDD 50k 0.135230950 B 0.19632367 0.69 0.4915
Method*Hardwa*newrow Join HDD 5k 0.221695494 B 0.19632367 1.13 0.2598
Method*Hardwa*newrow Join HDD Zero 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 5000k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 500k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 50k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD 5k 0.000000000 B . . .
Method*Hardwa*newrow Join SSD Zero 0.000000000 B . . .
Method*Hardwa*newrow Merge HDD 5000k -0.618608060 B 0.19632367 -3.15 0.0018
Method*Hardwa*newrow Merge HDD 500k -0.306753690 B 0.19632367 -1.56 0.1194
Method*Hardwa*newrow Merge HDD 50k 0.172620281 B 0.19632367 0.88 0.3801
Method*Hardwa*newrow Merge HDD 5k 0.229874256 B 0.19632367 1.17 0.2427
Method*Hardwa*newrow Merge HDD Zero 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD 5000k 0.000000000 B . . .

[Appendix] 14
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Standard
Parameter Estimate Error t Value Pr > |t|
Method*Hardwa*newrow Merge SSD 500k 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD 50k 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD 5k 0.000000000 B . . .
Method*Hardwa*newrow Merge SSD Zero 0.000000000 B . . .

[Appendix] 15
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 13. SAS Code – Numerical model excluding Singleton

data etlresults; set etlresults;

logresults = log(results);
run;

Title ‘Numeric Variable Analysis, excluding Singleton’;

data etlresults2; set etlresults;

if MethodName^='Singleton';
new = newrows/1000;
change = changerows/1000;
run;

proc glm data = etlresults2;

class methodname hardware;
model logresults = methodname|hardware|change|new /ss3;
output out=FITS predicted=P rstudent=E;

Title;

proc univariate data=FITS;

histogram E/normal;
qqplot E;
run;

proc gplot;
plot E*P/href=0;
run;

Title ‘Numeric Variable Analysis, excluding Singleton - reduced’;

proc glm data = etlresults2;

class methodname hardware;
model logresults = methodname hardware change new
methodname*hardware
methodname*new
hardware*new
change*new /ss3 solution;
run;

quit;

[Appendix] 16
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 14. Statistical Results – Reduced numerical model

excluding singleton

Source DF Sum of Squares Mean Square F Value Pr > F

Model 11 495.4147596 45.0377054 77.04 <.0001

Error 438 256.0621373 0.5846168

Corrected Total 449 751.4768970

R-Square Coeff Var Root MSE logresults Mean

0.659255 12.04481 0.764602 6.347983

Source DF Type III SS Mean Square F Value Pr > F

MethodName 2 29.8398670 14.9199335 25.52 <.0001

Hardware 1 50.6327218 50.6327218 86.61 <.0001

change 1 293.8871150 293.8871150 502.70 <.0001

new 1 84.8373264 84.8373264 145.12 <.0001

MethodName*Hardware 2 4.4194417 2.2097208 3.78 0.0236

new*MethodName 2 6.1157690 3.0578845 5.23 0.0057

new*Hardware 1 3.2295127 3.2295127 5.52 0.0192

change*new 1 11.4725527 11.4725527 19.62 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 4.837081325 B 0.10015892 48.29 <.0001

MethodName Join 0.181539650 B 0.13457707 1.35 0.1780

MethodName Lookup 0.909995923 B 0.13457707 6.76 <.0001

MethodName Merge 0.000000000 B . . .

Hardware HDD 0.989965417 B 0.13141760 7.53 <.0001

Hardware SSD 0.000000000 B . . .

change 0.000475907 0.00002123 22.42 <.0001

new 0.000379601 B 0.00003836 9.89 <.0001

MethodName*Hardware Join HDD -0.174896082 B 0.17657735 -0.99 0.3225

MethodName*Hardware Join SSD 0.000000000 B . . .

MethodName*Hardware Lookup HDD -0.479667606 B 0.17657735 -2.72 0.0069

[Appendix] 17
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Standard
Parameter Estimate Error t Value Pr > |t|

MethodName*Hardware Lookup SSD 0.000000000 B . . .

MethodName*Hardware Merge HDD 0.000000000 B . . .

MethodName*Hardware Merge SSD 0.000000000 B . . .

new*MethodName Join -0.000098963 B 0.00004519 -2.19 0.0291

new*MethodName Lookup -0.000142651 B 0.00004519 -3.16 0.0017

new*MethodName Merge 0.000000000 B . . .

new*Hardware HDD -0.000086732 B 0.00003690 -2.35 0.0192

new*Hardware SSD 0.000000000 B . . .

change*new -0.000000042 0.00000001 -4.43 <.0001

[Appendix] 18
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

Appendix 15. Full Test Results

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank
0 HDD Iteration 1 0 0 Singleton 5 1
0 HDD Iteration 1 0 0 Merge 55 2
0 HDD Iteration 1 0 0 Lookup 136 3
0 HDD Iteration 1 0 0 Join 142 4
0 HDD Iteration 2 0 0 Singleton 5 1
0 HDD Iteration 2 0 0 Merge 55 2
0 HDD Iteration 2 0 0 Join 133 3
0 HDD Iteration 2 0 0 Lookup 139 4
0 HDD Iteration 3 0 0 Singleton 5 1
0 HDD Iteration 3 0 0 Merge 59 2
0 HDD Iteration 3 0 0 Lookup 133 3
0 HDD Iteration 3 0 0 Join 139 4
0 SSD Iteration 1 0 0 Singleton 3 1
0 SSD Iteration 1 0 0 Merge 46 2
0 SSD Iteration 1 0 0 Lookup 72 3
0 SSD Iteration 1 0 0 Join 104 4
0 SSD Iteration 2 0 0 Singleton 4 1
0 SSD Iteration 2 0 0 Merge 46 2
0 SSD Iteration 2 0 0 Lookup 71 3
0 SSD Iteration 2 0 0 Join 112 4
0 SSD Iteration 3 0 0 Singleton 3 1
0 SSD Iteration 3 0 0 Merge 48 2
0 SSD Iteration 3 0 0 Lookup 76 3
0 SSD Iteration 3 0 0 Join 83 4
1 HDD Iteration 1 5000 0 Singleton 157 1
1 HDD Iteration 1 5000 0 Merge 167 2
1 HDD Iteration 1 5000 0 Join 200 3
1 HDD Iteration 1 5000 0 Lookup 286 4
1 HDD Iteration 2 5000 0 Singleton 161 1
1 HDD Iteration 2 5000 0 Merge 171 2
1 HDD Iteration 2 5000 0 Join 206 3
1 HDD Iteration 2 5000 0 Lookup 306 4
1 HDD Iteration 3 5000 0 Singleton 154 1
1 HDD Iteration 3 5000 0 Merge 173 2
1 HDD Iteration 3 5000 0 Join 226 3
1 HDD Iteration 3 5000 0 Lookup 341 4
1 SSD Iteration 1 5000 0 Singleton 41 1
1 SSD Iteration 1 5000 0 Merge 62 2
1 SSD Iteration 1 5000 0 Join 120 3
1 SSD Iteration 1 5000 0 Lookup 253 4
1 SSD Iteration 2 5000 0 Singleton 37 1
1 SSD Iteration 2 5000 0 Merge 66 2
1 SSD Iteration 2 5000 0 Join 77 3
1 SSD Iteration 2 5000 0 Lookup 184 4
1 SSD Iteration 3 5000 0 Singleton 35 1
1 SSD Iteration 3 5000 0 Merge 52 2
1 SSD Iteration 3 5000 0 Join 76 3
1 SSD Iteration 3 5000 0 Lookup 195 4
2 HDD Iteration 1 50000 0 Join 543 1
2 HDD Iteration 1 50000 0 Merge 745 2
2 HDD Iteration 1 50000 0 Lookup 815 3
2 HDD Iteration 1 50000 0 Singleton 1454 4
2 HDD Iteration 2 50000 0 Join 482 1
2 HDD Iteration 2 50000 0 Merge 682 2
2 HDD Iteration 2 50000 0 Lookup 743 3
2 HDD Iteration 2 50000 0 Singleton 1475 4
2 HDD Iteration 3 50000 0 Join 500 1
2 HDD Iteration 3 50000 0 Merge 711 2
2 HDD Iteration 3 50000 0 Lookup 755 3
2 HDD Iteration 3 50000 0 Singleton 1453 4
2 SSD Iteration 1 50000 0 Merge 118 1
2 SSD Iteration 1 50000 0 Join 147 2
2 SSD Iteration 1 50000 0 Singleton 151 3
2 SSD Iteration 1 50000 0 Lookup 752 4

[Appendix] 19
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

2 SSD Iteration 2 50000 0 Join 132 1
2 SSD Iteration 2 50000 0 Merge 148 2
2 SSD Iteration 2 50000 0 Singleton 161 3
2 SSD Iteration 2 50000 0 Lookup 316 4
2 SSD Iteration 3 50000 0 Merge 99 1
2 SSD Iteration 3 50000 0 Singleton 167 2
2 SSD Iteration 3 50000 0 Join 183 3
2 SSD Iteration 3 50000 0 Lookup 317 4
3 HDD Iteration 1 500000 0 Join 1487 1
3 HDD Iteration 1 500000 0 Merge 1856 2
3 HDD Iteration 1 500000 0 Lookup 2486 3
3 HDD Iteration 1 500000 0 Singleton 9441 4
3 HDD Iteration 2 500000 0 Join 810 1
3 HDD Iteration 2 500000 0 Merge 1889 2
3 HDD Iteration 2 500000 0 Lookup 2023 3
3 HDD Iteration 2 500000 0 Singleton 9643 4
3 HDD Iteration 3 500000 0 Join 706 1
3 HDD Iteration 3 500000 0 Merge 1881 2
3 HDD Iteration 3 500000 0 Lookup 2393 3
3 HDD Iteration 3 500000 0 Singleton 9312 4
3 SSD Iteration 1 500000 0 Merge 519 1
3 SSD Iteration 1 500000 0 Join 694 2
3 SSD Iteration 1 500000 0 Singleton 916 3
3 SSD Iteration 1 500000 0 Lookup 1887 4
3 SSD Iteration 2 500000 0 Join 433 1
3 SSD Iteration 2 500000 0 Merge 436 2
3 SSD Iteration 2 500000 0 Singleton 1071 3
3 SSD Iteration 2 500000 0 Lookup 1946 4
3 SSD Iteration 3 500000 0 Join 301 1
3 SSD Iteration 3 500000 0 Merge 310 2
3 SSD Iteration 3 500000 0 Singleton 954 3
3 SSD Iteration 3 500000 0 Lookup 1976 4
4 HDD Iteration 1 5000000 0 Merge 2221 1
4 HDD Iteration 1 5000000 0 Join 2580 2
4 HDD Iteration 1 5000000 0 Lookup 3933 3
4 HDD Iteration 1 5000000 0 Singleton 62574 4
4 HDD Iteration 2 5000000 0 Merge 2334 1
4 HDD Iteration 2 5000000 0 Join 2584 2
4 HDD Iteration 2 5000000 0 Lookup 3597 3
4 HDD Iteration 2 5000000 0 Singleton 60081 4
4 HDD Iteration 3 5000000 0 Merge 2746 1
4 HDD Iteration 3 5000000 0 Join 3092 2
4 HDD Iteration 3 5000000 0 Lookup 5567 3
4 HDD Iteration 3 5000000 0 Singleton 61997 4
4 SSD Iteration 1 5000000 0 Merge 1654 1
4 SSD Iteration 1 5000000 0 Join 2248 2
4 SSD Iteration 1 5000000 0 Lookup 2632 3
4 SSD Iteration 1 5000000 0 Singleton 10020 4
4 SSD Iteration 2 5000000 0 Join 1292 1
4 SSD Iteration 2 5000000 0 Merge 1469 2
4 SSD Iteration 2 5000000 0 Lookup 3403 3
4 SSD Iteration 2 5000000 0 Singleton 8690 4
4 SSD Iteration 3 5000000 0 Merge 1476 1
4 SSD Iteration 3 5000000 0 Join 1679 2
4 SSD Iteration 3 5000000 0 Lookup 2895 3
4 SSD Iteration 3 5000000 0 Singleton 9483 4
5 HDD Iteration 1 0 5000 Singleton 48 1
5 HDD Iteration 1 0 5000 Merge 62 2
5 HDD Iteration 1 0 5000 Lookup 134 3
5 HDD Iteration 1 0 5000 Join 137 4
5 HDD Iteration 2 0 5000 Singleton 36 1
5 HDD Iteration 2 0 5000 Merge 57 2
5 HDD Iteration 2 0 5000 Lookup 133 3
5 HDD Iteration 2 0 5000 Join 135 4
5 HDD Iteration 3 0 5000 Singleton 70 1
5 HDD Iteration 3 0 5000 Merge 90 2
5 HDD Iteration 3 0 5000 Join 207 3
5 HDD Iteration 3 0 5000 Lookup 219 4
5 SSD Iteration 1 0 5000 Singleton 20 1
[Appendix] 20
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

5 SSD Iteration 1 0 5000 Merge 45 2
5 SSD Iteration 1 0 5000 Lookup 92 3
5 SSD Iteration 1 0 5000 Join 118 4
5 SSD Iteration 2 0 5000 Singleton 24 1
5 SSD Iteration 2 0 5000 Merge 45 2
5 SSD Iteration 2 0 5000 Lookup 77 3
5 SSD Iteration 2 0 5000 Join 92 4
5 SSD Iteration 3 0 5000 Singleton 21 1
5 SSD Iteration 3 0 5000 Merge 47 2
5 SSD Iteration 3 0 5000 Lookup 73 3
5 SSD Iteration 3 0 5000 Join 86 4
6 HDD Iteration 1 5000 5000 Merge 171 1
6 HDD Iteration 1 5000 5000 Join 213 2
6 HDD Iteration 1 5000 5000 Singleton 240 3
6 HDD Iteration 1 5000 5000 Lookup 316 4
6 HDD Iteration 2 5000 5000 Merge 171 1
6 HDD Iteration 2 5000 5000 Join 212 2
6 HDD Iteration 2 5000 5000 Singleton 238 3
6 HDD Iteration 2 5000 5000 Lookup 305 4
6 HDD Iteration 3 5000 5000 Merge 238 1
6 HDD Iteration 3 5000 5000 Join 297 2
6 HDD Iteration 3 5000 5000 Singleton 381 3
6 HDD Iteration 3 5000 5000 Lookup 405 4
6 SSD Iteration 1 5000 5000 Singleton 51 1
6 SSD Iteration 1 5000 5000 Merge 55 2
6 SSD Iteration 1 5000 5000 Join 105 3
6 SSD Iteration 1 5000 5000 Lookup 169 4
6 SSD Iteration 2 5000 5000 Singleton 48 1
6 SSD Iteration 2 5000 5000 Merge 50 2
6 SSD Iteration 2 5000 5000 Join 74 3
6 SSD Iteration 2 5000 5000 Lookup 161 4
6 SSD Iteration 3 5000 5000 Singleton 48 1
6 SSD Iteration 3 5000 5000 Merge 53 2
6 SSD Iteration 3 5000 5000 Join 94 3
6 SSD Iteration 3 5000 5000 Lookup 198 4
7 HDD Iteration 1 50000 5000 Join 523 1
7 HDD Iteration 1 50000 5000 Lookup 695 2
7 HDD Iteration 1 50000 5000 Merge 707 3
7 HDD Iteration 1 50000 5000 Singleton 1126 4
7 HDD Iteration 2 50000 5000 Join 516 1
7 HDD Iteration 2 50000 5000 Lookup 735 2
7 HDD Iteration 2 50000 5000 Merge 740 3
7 HDD Iteration 2 50000 5000 Singleton 1453 4
7 HDD Iteration 3 50000 5000 Join 778 1
7 HDD Iteration 3 50000 5000 Lookup 973 2
7 HDD Iteration 3 50000 5000 Merge 1049 3
7 HDD Iteration 3 50000 5000 Singleton 2098 4
7 SSD Iteration 1 50000 5000 Merge 115 1
7 SSD Iteration 1 50000 5000 Join 141 2
7 SSD Iteration 1 50000 5000 Singleton 176 3
7 SSD Iteration 1 50000 5000 Lookup 443 4
7 SSD Iteration 2 50000 5000 Merge 112 1
7 SSD Iteration 2 50000 5000 Join 133 2
7 SSD Iteration 2 50000 5000 Singleton 167 3
7 SSD Iteration 2 50000 5000 Lookup 438 4
7 SSD Iteration 3 50000 5000 Merge 125 1
7 SSD Iteration 3 50000 5000 Singleton 167 2
7 SSD Iteration 3 50000 5000 Join 179 3
7 SSD Iteration 3 50000 5000 Lookup 379 4
8 HDD Iteration 1 500000 5000 Join 942 1
8 HDD Iteration 1 500000 5000 Merge 2129 2
8 HDD Iteration 1 500000 5000 Lookup 2748 3
8 HDD Iteration 1 500000 5000 Singleton 9455 4
8 HDD Iteration 2 500000 5000 Join 1501 1
8 HDD Iteration 2 500000 5000 Merge 1915 2
8 HDD Iteration 2 500000 5000 Lookup 2629 3
8 HDD Iteration 2 500000 5000 Singleton 9500 4
8 HDD Iteration 3 500000 5000 Join 1189 1
8 HDD Iteration 3 500000 5000 Merge 2246 2
[Appendix] 21
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

8 HDD Iteration 3 500000 5000 Lookup 2301 3
8 HDD Iteration 3 500000 5000 Singleton 12348 4
8 SSD Iteration 1 500000 5000 Merge 287 1
8 SSD Iteration 1 500000 5000 Join 341 2
8 SSD Iteration 1 500000 5000 Singleton 1061 3
8 SSD Iteration 1 500000 5000 Lookup 1700 4
8 SSD Iteration 2 500000 5000 Merge 308 1
8 SSD Iteration 2 500000 5000 Join 553 2
8 SSD Iteration 2 500000 5000 Singleton 1007 3
8 SSD Iteration 2 500000 5000 Lookup 1286 4
8 SSD Iteration 3 500000 5000 Join 281 1
8 SSD Iteration 3 500000 5000 Merge 283 2
8 SSD Iteration 3 500000 5000 Singleton 959 3
8 SSD Iteration 3 500000 5000 Lookup 1285 4
9 HDD Iteration 1 5000000 5000 Merge 2325 1
9 HDD Iteration 1 5000000 5000 Join 2615 2
9 HDD Iteration 1 5000000 5000 Lookup 4164 3
9 HDD Iteration 1 5000000 5000 Singleton 62687 4
9 HDD Iteration 2 5000000 5000 Merge 2357 1
9 HDD Iteration 2 5000000 5000 Join 2419 2
9 HDD Iteration 2 5000000 5000 Lookup 2801 3
9 HDD Iteration 2 5000000 5000 Singleton 59363 4
9 HDD Iteration 3 5000000 5000 Merge 3091 1
9 HDD Iteration 3 5000000 5000 Join 5281 2
9 HDD Iteration 3 5000000 5000 Lookup 5977 3
9 HDD Iteration 3 5000000 5000 Singleton 61131 4
9 SSD Iteration 1 5000000 5000 Join 1397 1
9 SSD Iteration 1 5000000 5000 Merge 1533 2
9 SSD Iteration 1 5000000 5000 Lookup 2969 3
9 SSD Iteration 1 5000000 5000 Singleton 9903 4
9 SSD Iteration 2 5000000 5000 Join 1574 1
9 SSD Iteration 2 5000000 5000 Merge 1627 2
9 SSD Iteration 2 5000000 5000 Lookup 3300 3
9 SSD Iteration 2 5000000 5000 Singleton 9913 4
9 SSD Iteration 3 5000000 5000 Merge 1352 1
9 SSD Iteration 3 5000000 5000 Join 1548 2
9 SSD Iteration 3 5000000 5000 Lookup 2334 3
9 SSD Iteration 3 5000000 5000 Singleton 10128 4
10 HDD Iteration 1 0 50000 Merge 65 1
10 HDD Iteration 1 0 50000 Join 134 2
10 HDD Iteration 1 0 50000 Lookup 134 3
10 HDD Iteration 1 0 50000 Singleton 142 4
10 HDD Iteration 2 0 50000 Merge 67 1
10 HDD Iteration 2 0 50000 Lookup 134 2
10 HDD Iteration 2 0 50000 Join 136 3
10 HDD Iteration 2 0 50000 Singleton 152 4
10 HDD Iteration 3 0 50000 Merge 103 1
10 HDD Iteration 3 0 50000 Join 202 2
10 HDD Iteration 3 0 50000 Lookup 221 3
10 HDD Iteration 3 0 50000 Singleton 430 4
10 SSD Iteration 1 0 50000 Merge 57 1
10 SSD Iteration 1 0 50000 Lookup 80 2
10 SSD Iteration 1 0 50000 Join 82 3
10 SSD Iteration 1 0 50000 Singleton 113 4
10 SSD Iteration 2 0 50000 Merge 53 1
10 SSD Iteration 2 0 50000 Lookup 68 2
10 SSD Iteration 2 0 50000 Join 105 3
10 SSD Iteration 2 0 50000 Singleton 105 4
10 SSD Iteration 3 0 50000 Merge 53 1
10 SSD Iteration 3 0 50000 Lookup 74 2
10 SSD Iteration 3 0 50000 Join 90 3
10 SSD Iteration 3 0 50000 Singleton 101 4
11 HDD Iteration 1 5000 50000 Merge 179 1
11 HDD Iteration 1 5000 50000 Join 205 2
11 HDD Iteration 1 5000 50000 Singleton 241 3
11 HDD Iteration 1 5000 50000 Lookup 303 4
11 HDD Iteration 2 5000 50000 Merge 178 1
11 HDD Iteration 2 5000 50000 Join 208 2
11 HDD Iteration 2 5000 50000 Lookup 310 3
[Appendix] 22
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

11 HDD Iteration 2 5000 50000 Singleton 310 4
11 HDD Iteration 3 5000 50000 Merge 284 1
11 HDD Iteration 3 5000 50000 Join 334 2
11 HDD Iteration 3 5000 50000 Lookup 510 3
11 HDD Iteration 3 5000 50000 Singleton 541 4
11 SSD Iteration 1 5000 50000 Merge 77 1
11 SSD Iteration 1 5000 50000 Join 97 2
11 SSD Iteration 1 5000 50000 Singleton 120 3
11 SSD Iteration 1 5000 50000 Lookup 165 4
11 SSD Iteration 2 5000 50000 Merge 71 1
11 SSD Iteration 2 5000 50000 Join 78 2
11 SSD Iteration 2 5000 50000 Singleton 112 3
11 SSD Iteration 2 5000 50000 Lookup 189 4
11 SSD Iteration 3 5000 50000 Merge 61 1
11 SSD Iteration 3 5000 50000 Singleton 112 2
11 SSD Iteration 3 5000 50000 Join 119 3
11 SSD Iteration 3 5000 50000 Lookup 194 4
12 HDD Iteration 1 50000 50000 Join 503 1
12 HDD Iteration 1 50000 50000 Merge 706 2
12 HDD Iteration 1 50000 50000 Lookup 755 3
12 HDD Iteration 1 50000 50000 Singleton 1570 4
12 HDD Iteration 2 50000 50000 Join 513 1
12 HDD Iteration 2 50000 50000 Merge 712 2
12 HDD Iteration 2 50000 50000 Lookup 830 3
12 HDD Iteration 2 50000 50000 Singleton 1590 4
12 HDD Iteration 3 50000 50000 Join 851 1
12 HDD Iteration 3 50000 50000 Merge 1180 2
12 HDD Iteration 3 50000 50000 Lookup 1236 3
12 HDD Iteration 3 50000 50000 Singleton 2375 4
12 SSD Iteration 1 50000 50000 Merge 107 1
12 SSD Iteration 1 50000 50000 Join 140 2
12 SSD Iteration 1 50000 50000 Singleton 241 3
12 SSD Iteration 1 50000 50000 Lookup 592 4
12 SSD Iteration 2 50000 50000 Merge 98 1
12 SSD Iteration 2 50000 50000 Join 144 2
12 SSD Iteration 2 50000 50000 Singleton 201 3
12 SSD Iteration 2 50000 50000 Lookup 275 4
12 SSD Iteration 3 50000 50000 Merge 108 1
12 SSD Iteration 3 50000 50000 Join 183 2
12 SSD Iteration 3 50000 50000 Singleton 212 3
12 SSD Iteration 3 50000 50000 Lookup 597 4
13 HDD Iteration 1 500000 50000 Join 799 1
13 HDD Iteration 1 500000 50000 Merge 1873 2
13 HDD Iteration 1 500000 50000 Lookup 1995 3
13 HDD Iteration 1 500000 50000 Singleton 9476 4
13 HDD Iteration 2 500000 50000 Join 1458 1
13 HDD Iteration 2 500000 50000 Merge 2034 2
13 HDD Iteration 2 500000 50000 Lookup 2624 3
13 HDD Iteration 2 500000 50000 Singleton 9346 4
13 HDD Iteration 3 500000 50000 Join 1281 1
13 HDD Iteration 3 500000 50000 Merge 2923 2
13 HDD Iteration 3 500000 50000 Lookup 3590 3
13 HDD Iteration 3 500000 50000 Singleton 13022 4
13 SSD Iteration 1 500000 50000 Join 219 1
13 SSD Iteration 1 500000 50000 Merge 298 2
13 SSD Iteration 1 500000 50000 Singleton 1160 3
13 SSD Iteration 1 500000 50000 Lookup 1364 4
13 SSD Iteration 2 500000 50000 Merge 266 1
13 SSD Iteration 2 500000 50000 Join 410 2
13 SSD Iteration 2 500000 50000 Singleton 1092 3
13 SSD Iteration 2 500000 50000 Lookup 2021 4
13 SSD Iteration 3 500000 50000 Join 527 1
13 SSD Iteration 3 500000 50000 Merge 534 2
13 SSD Iteration 3 500000 50000 Singleton 1038 3
13 SSD Iteration 3 500000 50000 Lookup 1593 4
14 HDD Iteration 1 5000000 50000 Join 2437 1
14 HDD Iteration 1 5000000 50000 Merge 2536 2
14 HDD Iteration 1 5000000 50000 Lookup 3569 3
14 HDD Iteration 1 5000000 50000 Singleton 61807 4
[Appendix] 23
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

14 HDD Iteration 2 5000000 50000 Merge 2636 1
14 HDD Iteration 2 5000000 50000 Join 2640 2
14 HDD Iteration 2 5000000 50000 Lookup 4077 3
14 HDD Iteration 2 5000000 50000 Singleton 62870 4
14 HDD Iteration 3 5000000 50000 Join 2519 1
14 HDD Iteration 3 5000000 50000 Merge 2599 2
14 HDD Iteration 3 5000000 50000 Lookup 5874 3
14 HDD Iteration 3 5000000 50000 Singleton 61398 4
14 SSD Iteration 1 5000000 50000 Join 1523 1
14 SSD Iteration 1 5000000 50000 Merge 1690 2
14 SSD Iteration 1 5000000 50000 Lookup 2909 3
14 SSD Iteration 1 5000000 50000 Singleton 9591 4
14 SSD Iteration 2 5000000 50000 Merge 1749 1
14 SSD Iteration 2 5000000 50000 Join 2063 2
14 SSD Iteration 2 5000000 50000 Lookup 4413 3
14 SSD Iteration 2 5000000 50000 Singleton 9843 4
14 SSD Iteration 3 5000000 50000 Merge 1453 1
14 SSD Iteration 3 5000000 50000 Join 1815 2
14 SSD Iteration 3 5000000 50000 Lookup 3805 3
14 SSD Iteration 3 5000000 50000 Singleton 10269 4
15 HDD Iteration 1 0 500000 Lookup 152 1
15 HDD Iteration 1 0 500000 Join 180 2
15 HDD Iteration 1 0 500000 Merge 213 3
15 HDD Iteration 1 0 500000 Singleton 1041 4
15 HDD Iteration 2 0 500000 Lookup 158 1
15 HDD Iteration 2 0 500000 Join 172 2
15 HDD Iteration 2 0 500000 Merge 268 3
15 HDD Iteration 2 0 500000 Singleton 1121 4
15 HDD Iteration 3 0 500000 Lookup 245 1
15 HDD Iteration 3 0 500000 Join 267 2
15 HDD Iteration 3 0 500000 Merge 336 3
15 HDD Iteration 3 0 500000 Singleton 2955 4
15 SSD Iteration 1 0 500000 Lookup 109 1
15 SSD Iteration 1 0 500000 Join 128 2
15 SSD Iteration 1 0 500000 Merge 143 3
15 SSD Iteration 1 0 500000 Singleton 899 4
15 SSD Iteration 2 0 500000 Lookup 105 1
15 SSD Iteration 2 0 500000 Join 111 2
15 SSD Iteration 2 0 500000 Merge 148 3
15 SSD Iteration 2 0 500000 Singleton 776 4
15 SSD Iteration 3 0 500000 Lookup 90 1
15 SSD Iteration 3 0 500000 Join 136 2
15 SSD Iteration 3 0 500000 Merge 155 3
15 SSD Iteration 3 0 500000 Singleton 757 4
16 HDD Iteration 1 5000 500000 Join 342 1
16 HDD Iteration 1 5000 500000 Merge 405 2
16 HDD Iteration 1 5000 500000 Lookup 459 3
16 HDD Iteration 1 5000 500000 Singleton 1119 4
16 HDD Iteration 2 5000 500000 Join 302 1
16 HDD Iteration 2 5000 500000 Lookup 392 2
16 HDD Iteration 2 5000 500000 Merge 394 3
16 HDD Iteration 2 5000 500000 Singleton 1183 4
16 HDD Iteration 3 5000 500000 Join 411 1
16 HDD Iteration 3 5000 500000 Merge 536 2
16 HDD Iteration 3 5000 500000 Lookup 630 3
16 HDD Iteration 3 5000 500000 Singleton 3226 4
16 SSD Iteration 1 5000 500000 Join 139 1
16 SSD Iteration 1 5000 500000 Merge 168 2
16 SSD Iteration 1 5000 500000 Lookup 321 3
16 SSD Iteration 1 5000 500000 Singleton 833 4
16 SSD Iteration 2 5000 500000 Join 158 1
16 SSD Iteration 2 5000 500000 Lookup 402 2
16 SSD Iteration 2 5000 500000 Merge 439 3
16 SSD Iteration 2 5000 500000 Singleton 781 4
16 SSD Iteration 3 5000 500000 Join 176 1
16 SSD Iteration 3 5000 500000 Lookup 308 2
16 SSD Iteration 3 5000 500000 Merge 391 3
16 SSD Iteration 3 5000 500000 Singleton 776 4
17 HDD Iteration 1 50000 500000 Merge 558 1
[Appendix] 24
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

17 HDD Iteration 1 50000 500000 Join 559 2
17 HDD Iteration 1 50000 500000 Lookup 1078 3
17 HDD Iteration 1 50000 500000 Singleton 2389 4
17 HDD Iteration 2 50000 500000 Join 564 1
17 HDD Iteration 2 50000 500000 Merge 619 2
17 HDD Iteration 2 50000 500000 Lookup 1183 3
17 HDD Iteration 2 50000 500000 Singleton 2430 4
17 HDD Iteration 3 50000 500000 Merge 772 1
17 HDD Iteration 3 50000 500000 Join 927 2
17 HDD Iteration 3 50000 500000 Lookup 1757 3
17 HDD Iteration 3 50000 500000 Singleton 4702 4
17 SSD Iteration 1 50000 500000 Merge 170 1
17 SSD Iteration 1 50000 500000 Join 226 2
17 SSD Iteration 1 50000 500000 Singleton 960 3
17 SSD Iteration 1 50000 500000 Lookup 1169 4
17 SSD Iteration 2 50000 500000 Join 327 1
17 SSD Iteration 2 50000 500000 Merge 454 2
17 SSD Iteration 2 50000 500000 Singleton 864 3
17 SSD Iteration 2 50000 500000 Lookup 1070 4
17 SSD Iteration 3 50000 500000 Join 236 1
17 SSD Iteration 3 50000 500000 Merge 426 2
17 SSD Iteration 3 50000 500000 Singleton 867 3
17 SSD Iteration 3 50000 500000 Lookup 925 4
18 HDD Iteration 1 500000 500000 Join 697 1
18 HDD Iteration 1 500000 500000 Merge 1687 2
18 HDD Iteration 1 500000 500000 Lookup 2243 3
18 HDD Iteration 1 500000 500000 Singleton 10280 4
18 HDD Iteration 2 500000 500000 Join 807 1
18 HDD Iteration 2 500000 500000 Merge 1820 2
18 HDD Iteration 2 500000 500000 Lookup 2391 3
18 HDD Iteration 2 500000 500000 Singleton 10290 4
18 HDD Iteration 3 500000 500000 Join 1208 1
18 HDD Iteration 3 500000 500000 Lookup 1857 2
18 HDD Iteration 3 500000 500000 Merge 2585 3
18 HDD Iteration 3 500000 500000 Singleton 14775 4
18 SSD Iteration 1 500000 500000 Join 281 1
18 SSD Iteration 1 500000 500000 Merge 363 2
18 SSD Iteration 1 500000 500000 Lookup 1520 3
18 SSD Iteration 1 500000 500000 Singleton 1760 4
18 SSD Iteration 2 500000 500000 Join 585 1
18 SSD Iteration 2 500000 500000 Merge 624 2
18 SSD Iteration 2 500000 500000 Lookup 2041 3
18 SSD Iteration 2 500000 500000 Singleton 2170 4
18 SSD Iteration 3 500000 500000 Merge 526 1
18 SSD Iteration 3 500000 500000 Join 542 2
18 SSD Iteration 3 500000 500000 Singleton 1971 3
18 SSD Iteration 3 500000 500000 Lookup 2188 4
19 HDD Iteration 1 5000000 500000 Join 2639 1
19 HDD Iteration 1 5000000 500000 Merge 2819 2
19 HDD Iteration 1 5000000 500000 Lookup 3502 3
19 HDD Iteration 1 5000000 500000 Singleton 59929 4
19 HDD Iteration 2 5000000 500000 Join 2503 1
19 HDD Iteration 2 5000000 500000 Lookup 2812 2
19 HDD Iteration 2 5000000 500000 Merge 2924 3
19 HDD Iteration 2 5000000 500000 Singleton 57683 4
19 HDD Iteration 3 5000000 500000 Merge 2859 1
19 HDD Iteration 3 5000000 500000 Join 3841 2
19 HDD Iteration 3 5000000 500000 Lookup 5314 3
19 HDD Iteration 3 5000000 500000 Singleton 62249 4
19 SSD Iteration 1 5000000 500000 Merge 1755 1
19 SSD Iteration 1 5000000 500000 Join 1882 2
19 SSD Iteration 1 5000000 500000 Lookup 3034 3
19 SSD Iteration 1 5000000 500000 Singleton 10127 4
19 SSD Iteration 2 5000000 500000 Merge 1813 1
19 SSD Iteration 2 5000000 500000 Join 2829 2
19 SSD Iteration 2 5000000 500000 Lookup 5553 3
19 SSD Iteration 2 5000000 500000 Singleton 10305 4
19 SSD Iteration 3 5000000 500000 Merge 2173 1
19 SSD Iteration 3 5000000 500000 Join 2381 2
[Appendix] 25
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

19 SSD Iteration 3 5000000 500000 Lookup 4691 3
19 SSD Iteration 3 5000000 500000 Singleton 10617 4
20 HDD Iteration 1 0 5000000 Lookup 466 1
20 HDD Iteration 1 0 5000000 Join 516 2
20 HDD Iteration 1 0 5000000 Merge 808 3
20 HDD Iteration 1 0 5000000 Singleton 9840 4
20 HDD Iteration 2 0 5000000 Join 297 1
20 HDD Iteration 2 0 5000000 Lookup 321 2
20 HDD Iteration 2 0 5000000 Merge 1142 3
20 HDD Iteration 2 0 5000000 Singleton 9803 4
20 HDD Iteration 3 0 5000000 Join 338 1
20 HDD Iteration 3 0 5000000 Lookup 528 2
20 HDD Iteration 3 0 5000000 Merge 1180 3
20 HDD Iteration 3 0 5000000 Singleton 9905 4
20 SSD Iteration 1 0 5000000 Join 299 1
20 SSD Iteration 1 0 5000000 Lookup 406 2
20 SSD Iteration 1 0 5000000 Merge 779 3
20 SSD Iteration 1 0 5000000 Singleton 7925 4
20 SSD Iteration 2 0 5000000 Lookup 227 1
20 SSD Iteration 2 0 5000000 Join 421 2
20 SSD Iteration 2 0 5000000 Merge 821 3
20 SSD Iteration 2 0 5000000 Singleton 7912 4
20 SSD Iteration 3 0 5000000 Lookup 207 1
20 SSD Iteration 3 0 5000000 Join 216 2
20 SSD Iteration 3 0 5000000 Merge 739 3
20 SSD Iteration 3 0 5000000 Singleton 7955 4
21 HDD Iteration 1 5000 5000000 Join 814 1
21 HDD Iteration 1 5000 5000000 Merge 910 2
21 HDD Iteration 1 5000 5000000 Lookup 1233 3
21 HDD Iteration 1 5000 5000000 Singleton 10746 4
21 HDD Iteration 2 5000 5000000 Merge 820 1
21 HDD Iteration 2 5000 5000000 Join 824 2
21 HDD Iteration 2 5000 5000000 Lookup 882 3
21 HDD Iteration 2 5000 5000000 Singleton 10810 4
21 HDD Iteration 3 5000 5000000 Join 875 1
21 HDD Iteration 3 5000 5000000 Merge 1002 2
21 HDD Iteration 3 5000 5000000 Lookup 1068 3
21 HDD Iteration 3 5000 5000000 Singleton 10679 4
21 SSD Iteration 1 5000 5000000 Join 590 1
21 SSD Iteration 1 5000 5000000 Lookup 817 2
21 SSD Iteration 1 5000 5000000 Merge 1022 3
21 SSD Iteration 1 5000 5000000 Singleton 8070 4
21 SSD Iteration 2 5000 5000000 Join 654 1
21 SSD Iteration 2 5000 5000000 Merge 861 2
21 SSD Iteration 2 5000 5000000 Lookup 1268 3
21 SSD Iteration 2 5000 5000000 Singleton 8115 4
21 SSD Iteration 3 5000 5000000 Join 485 1
21 SSD Iteration 3 5000 5000000 Merge 807 2
21 SSD Iteration 3 5000 5000000 Lookup 1373 3
21 SSD Iteration 3 5000 5000000 Singleton 7939 4
22 HDD Iteration 1 50000 5000000 Join 1186 1
22 HDD Iteration 1 50000 5000000 Lookup 1558 2
22 HDD Iteration 1 50000 5000000 Merge 1652 3
22 HDD Iteration 1 50000 5000000 Singleton 12531 4
22 HDD Iteration 2 50000 5000000 Join 1348 1
22 HDD Iteration 2 50000 5000000 Merge 1508 2
22 HDD Iteration 2 50000 5000000 Lookup 2214 3
22 HDD Iteration 2 50000 5000000 Singleton 12979 4
22 HDD Iteration 3 50000 5000000 Join 1573 1
22 HDD Iteration 3 50000 5000000 Merge 1912 2
22 HDD Iteration 3 50000 5000000 Lookup 2390 3
22 HDD Iteration 3 50000 5000000 Singleton 12556 4
22 SSD Iteration 1 50000 5000000 Join 671 1
22 SSD Iteration 1 50000 5000000 Merge 961 2
22 SSD Iteration 1 50000 5000000 Lookup 1688 3
22 SSD Iteration 1 50000 5000000 Singleton 8217 4
22 SSD Iteration 2 50000 5000000 Join 804 1
22 SSD Iteration 2 50000 5000000 Lookup 1433 2
22 SSD Iteration 2 50000 5000000 Merge 1820 3
[Appendix] 26
Performance comparison of techniques to load type 2 slowly changing dimensions in a Kimball style data warehouse

TestNumber Hardware Iteration ChangeRows NewRows MethodName DurationSec Rank

22 SSD Iteration 2 50000 5000000 Singleton 8262 4
22 SSD Iteration 3 50000 5000000 Join 1054 1
22 SSD Iteration 3 50000 5000000 Merge 1527 2
22 SSD Iteration 3 50000 5000000 Lookup 1979 3
22 SSD Iteration 3 50000 5000000 Singleton 8136 4
23 HDD Iteration 1 500000 5000000 Lookup 1492 1
23 HDD Iteration 1 500000 5000000 Join 2046 2
23 HDD Iteration 1 500000 5000000 Merge 2166 3
23 HDD Iteration 1 500000 5000000 Singleton 19388 4
23 HDD Iteration 2 500000 5000000 Join 1452 1
23 HDD Iteration 2 500000 5000000 Lookup 1585 2
23 HDD Iteration 2 500000 5000000 Merge 2125 3
23 HDD Iteration 2 500000 5000000 Singleton 19849 4
23 HDD Iteration 3 500000 5000000 Join 1663 1
23 HDD Iteration 3 500000 5000000 Merge 2870 2
23 HDD Iteration 3 500000 5000000 Lookup 3701 3
23 HDD Iteration 3 500000 5000000 Singleton 20216 4
23 SSD Iteration 1 500000 5000000 Join 800 1
23 SSD Iteration 1 500000 5000000 Merge 1194 2
23 SSD Iteration 1 500000 5000000 Lookup 2695 3
23 SSD Iteration 1 500000 5000000 Singleton 8993 4
23 SSD Iteration 2 500000 5000000 Merge 815 1
23 SSD Iteration 2 500000 5000000 Join 820 2
23 SSD Iteration 2 500000 5000000 Lookup 1234 3
23 SSD Iteration 2 500000 5000000 Singleton 8976 4
23 SSD Iteration 3 500000 5000000 Merge 1208 1
23 SSD Iteration 3 500000 5000000 Join 1350 2
23 SSD Iteration 3 500000 5000000 Lookup 1448 3
23 SSD Iteration 3 500000 5000000 Singleton 8939 4
24 HDD Iteration 1 5000000 5000000 Merge 3218 1
24 HDD Iteration 1 5000000 5000000 Join 3517 2
24 HDD Iteration 1 5000000 5000000 Lookup 4902 3
24 HDD Iteration 1 5000000 5000000 Singleton 69760 4
24 HDD Iteration 2 5000000 5000000 Join 3432 1
24 HDD Iteration 2 5000000 5000000 Lookup 4754 2
24 HDD Iteration 2 5000000 5000000 Merge 4754 3
24 HDD Iteration 2 5000000 5000000 Singleton 67245 4
24 HDD Iteration 3 5000000 5000000 Join 4902 1
24 HDD Iteration 3 5000000 5000000 Merge 5141 2
24 HDD Iteration 3 5000000 5000000 Lookup 7633 3
24 HDD Iteration 3 5000000 5000000 Singleton 66951 4
24 SSD Iteration 1 5000000 5000000 Join 2318 1
24 SSD Iteration 1 5000000 5000000 Merge 2383 2
24 SSD Iteration 1 5000000 5000000 Lookup 3471 3
24 SSD Iteration 1 5000000 5000000 Singleton 16371 4
24 SSD Iteration 2 5000000 5000000 Merge 2311 1
24 SSD Iteration 2 5000000 5000000 Join 2393 2
24 SSD Iteration 2 5000000 5000000 Lookup 3121 3
24 SSD Iteration 2 5000000 5000000 Singleton 17308 4
24 SSD Iteration 3 5000000 5000000 Join 2279 1
24 SSD Iteration 3 5000000 5000000 Merge 2338 2
24 SSD Iteration 3 5000000 5000000 Lookup 3539 3
24 SSD Iteration 3 5000000 5000000 Singleton 16924 4

[Appendix] 27

Power BI Guide
100% (4)
Power BI Guide
122 pages
ETL Testing - PPT
No ratings yet
ETL Testing - PPT
77 pages
Havij SQL Injection Help English PDF
No ratings yet
Havij SQL Injection Help English PDF
43 pages
Wancerz
No ratings yet
Wancerz
2 pages
History Management of Data - Slowly Changing Dimensions: Marek Wancerz, Paweł Wancerz
No ratings yet
History Management of Data - Slowly Changing Dimensions: Marek Wancerz, Paweł Wancerz
3 pages
SCD Types
No ratings yet
SCD Types
5 pages
Slowly Changing Dimensions Specification A Relational Algebra Approach
No ratings yet
Slowly Changing Dimensions Specification A Relational Algebra Approach
6 pages
An Efficient Approach For Data Indexing in Datawarehousing and Datamining
No ratings yet
An Efficient Approach For Data Indexing in Datawarehousing and Datamining
7 pages
Slowly Changing Dimension DW
No ratings yet
Slowly Changing Dimension DW
3 pages
Cost Based Optimization
No ratings yet
Cost Based Optimization
14 pages
Slowly Changing Dimension (SCD'S) : Submitted By: BALAJI K
No ratings yet
Slowly Changing Dimension (SCD'S) : Submitted By: BALAJI K
14 pages
Slowly Changing Dimensions
No ratings yet
Slowly Changing Dimensions
5 pages
Slowly Changing Dimensions Specification A Relational Algebra Approach
No ratings yet
Slowly Changing Dimensions Specification A Relational Algebra Approach
7 pages
Indexing Techniquesto Enhancethe Performance
No ratings yet
Indexing Techniquesto Enhancethe Performance
10 pages
SQL Age - DWH Interview Questions
No ratings yet
SQL Age - DWH Interview Questions
1 page
Types of SCD With Example
No ratings yet
Types of SCD With Example
30 pages
Data Warehouse Concepts
No ratings yet
Data Warehouse Concepts
11 pages
Data Warehousing and Data Mining
No ratings yet
Data Warehousing and Data Mining
10 pages
What Are Slowly Changing Dimensions
No ratings yet
What Are Slowly Changing Dimensions
2 pages
SCD 20II 20implementation 20in 20datastage 207.X
No ratings yet
SCD 20II 20implementation 20in 20datastage 207.X
7 pages
Article in Press: Handling Slowly Changing Dimensions in Data Warehouses
No ratings yet
Article in Press: Handling Slowly Changing Dimensions in Data Warehouses
10 pages
Z Data Warehouse Concepts
No ratings yet
Z Data Warehouse Concepts
19 pages
Datawarehouse Concepts
No ratings yet
Datawarehouse Concepts
7 pages
Enhancement Techniques For Data
No ratings yet
Enhancement Techniques For Data
19 pages
Using The Mapping Wizards
No ratings yet
Using The Mapping Wizards
38 pages
Abinitio Vijay - 8553385664
No ratings yet
Abinitio Vijay - 8553385664
28 pages
DWH Indexes
No ratings yet
DWH Indexes
11 pages
Data Warehousing Strategy Is Used To Enhance The Functionality of
No ratings yet
Data Warehousing Strategy Is Used To Enhance The Functionality of
5 pages
How Evolution of Database Led To Data Mining
No ratings yet
How Evolution of Database Led To Data Mining
10 pages
DWDM Unit 1 (Lecture 1)
No ratings yet
DWDM Unit 1 (Lecture 1)
7 pages
Dat Mining Module1
No ratings yet
Dat Mining Module1
15 pages
Data Ware Hose Fundamentals
No ratings yet
Data Ware Hose Fundamentals
13 pages
Slowly Changing Dimentions (SCD) - Type 1, Type 2, Type 3
No ratings yet
Slowly Changing Dimentions (SCD) - Type 1, Type 2, Type 3
3 pages
Novel Authomatic Algoritm For Normalization
No ratings yet
Novel Authomatic Algoritm For Normalization
16 pages
Introduction To Data Warehousing
No ratings yet
Introduction To Data Warehousing
46 pages
SCD Types Olap-Oltp
No ratings yet
SCD Types Olap-Oltp
3 pages
Acceptance Testing and ETL Process j8Mus6Ctvj
No ratings yet
Acceptance Testing and ETL Process j8Mus6Ctvj
19 pages
Slowly Changing Dimension: Rahma Hassan
No ratings yet
Slowly Changing Dimension: Rahma Hassan
11 pages
What Are Slowly Changing Dimensions
No ratings yet
What Are Slowly Changing Dimensions
2 pages
Lecture 6-Data Mining and Warehousing
No ratings yet
Lecture 6-Data Mining and Warehousing
7 pages
Cb-Pattern Trees: Identifying Distributed Nodes Where Materialization Is Available
No ratings yet
Cb-Pattern Trees: Identifying Distributed Nodes Where Materialization Is Available
12 pages
Understanding Slowly Changing Dimensions SCD in Data Warehousing by Mainak Das Python in Plain English
No ratings yet
Understanding Slowly Changing Dimensions SCD in Data Warehousing by Mainak Das Python in Plain English
13 pages
SCD Types
No ratings yet
SCD Types
23 pages
Lecture 02 Data Warehouses
No ratings yet
Lecture 02 Data Warehouses
3 pages
Data Warehousing and Data Mining 3rd Class Second Course: Dr. Khalil I. Ghathwan
No ratings yet
Data Warehousing and Data Mining 3rd Class Second Course: Dr. Khalil I. Ghathwan
32 pages
Data - Mining - and - Data by Manyindo
No ratings yet
Data - Mining - and - Data by Manyindo
3 pages
In The Star Schema Design
No ratings yet
In The Star Schema Design
11 pages
Database Pioneer Rethinks The Best Way To Organize Data
No ratings yet
Database Pioneer Rethinks The Best Way To Organize Data
1 page
Study Material For Interview
No ratings yet
Study Material For Interview
47 pages
Enhanced Method For Huge Data Insertion: Abstract
No ratings yet
Enhanced Method For Huge Data Insertion: Abstract
4 pages
Data Warehousing Interview Questions
No ratings yet
Data Warehousing Interview Questions
6 pages
Data Warehouse: What, Why and How ?
No ratings yet
Data Warehouse: What, Why and How ?
25 pages
Exam - 1: October 5, 2016 Exam - 2: November 23, 2016 Quiz - 2: October 26, 2016 Quiz - 3: November 9, 2016
No ratings yet
Exam - 1: October 5, 2016 Exam - 2: November 23, 2016 Quiz - 2: October 26, 2016 Quiz - 3: November 9, 2016
14 pages
DWM Lab Manual
No ratings yet
DWM Lab Manual
17 pages
ETL Testing - InterviewQuestion PDF
No ratings yet
ETL Testing - InterviewQuestion PDF
21 pages
SQL SERVER - Data Warehousing Interview Questions and Answers - Part 1
No ratings yet
SQL SERVER - Data Warehousing Interview Questions and Answers - Part 1
7 pages
Microsoft Data Warehouse Design Considerations
No ratings yet
Microsoft Data Warehouse Design Considerations
36 pages
Data Warehouse Basics (Lec. Notes 1)
No ratings yet
Data Warehouse Basics (Lec. Notes 1)
5 pages
DWH Tables
No ratings yet
DWH Tables
8 pages
Pictorial Essay: Postpneumonectomy Complications
No ratings yet
Pictorial Essay: Postpneumonectomy Complications
9 pages
Working With SAS System Date and Time Functions: Andrew H. Karp
No ratings yet
Working With SAS System Date and Time Functions: Andrew H. Karp
70 pages
Pneumonectomy Management PDF
No ratings yet
Pneumonectomy Management PDF
6 pages
SAS Customer Intelligence 360 For Dummies: Fariba Bat-Haee, SAS Institute Inc., Cary, NC
No ratings yet
SAS Customer Intelligence 360 For Dummies: Fariba Bat-Haee, SAS Institute Inc., Cary, NC
15 pages
Sas Customer Intelligence Solutions 103116
No ratings yet
Sas Customer Intelligence Solutions 103116
2 pages
Decrypt Stored Procedures
No ratings yet
Decrypt Stored Procedures
7 pages
ML Studio Overview
No ratings yet
ML Studio Overview
1 page
Qie Install Guide
No ratings yet
Qie Install Guide
26 pages
Pe PRN212 Fa24
No ratings yet
Pe PRN212 Fa24
3 pages
Msbi Interview - Questions
No ratings yet
Msbi Interview - Questions
45 pages
Fullstack Engineer Resume
No ratings yet
Fullstack Engineer Resume
5 pages
Import DataTable To Excel C#
No ratings yet
Import DataTable To Excel C#
4 pages
Reporting Services 2014 PDF
No ratings yet
Reporting Services 2014 PDF
130 pages
Skills Measured by Exam 70-315 Course 2310 Course 2389 Course 2640
No ratings yet
Skills Measured by Exam 70-315 Course 2310 Course 2389 Course 2640
4 pages
Convert Access Database To SQL Server
No ratings yet
Convert Access Database To SQL Server
11 pages
Aradial Overview
No ratings yet
Aradial Overview
29 pages
SQL Server Reporting Services (SSRS) PDF
100% (1)
SQL Server Reporting Services (SSRS) PDF
2,489 pages
Creating A Database Backup in SQL Server 2012
No ratings yet
Creating A Database Backup in SQL Server 2012
16 pages
Anusha
No ratings yet
Anusha
12 pages
Benjamin Nevarez Microsoft SQL Server 2017 On Linux McGraw Hill Education - 2018
No ratings yet
Benjamin Nevarez Microsoft SQL Server 2017 On Linux McGraw Hill Education - 2018
319 pages
SQL 2005/2008 DBA (Database Administrator) : Kebutuhan: 1 Orang
No ratings yet
SQL 2005/2008 DBA (Database Administrator) : Kebutuhan: 1 Orang
4 pages
Safend Data Protection Suite 3.4.5 Installation Guide
No ratings yet
Safend Data Protection Suite 3.4.5 Installation Guide
76 pages
TC 10 1 Release Bulletin PDF
No ratings yet
TC 10 1 Release Bulletin PDF
500 pages
Editing and Updating Data in Gridview Using Sqldatasource Control - Part 17
No ratings yet
Editing and Updating Data in Gridview Using Sqldatasource Control - Part 17
2 pages
Whatsup Gold 2023.0 Release Notes 11-9-2023
No ratings yet
Whatsup Gold 2023.0 Release Notes 11-9-2023
22 pages
SQLServer2000 - R3 Com SAP
No ratings yet
SQLServer2000 - R3 Com SAP
17 pages
Programming Ado - CHM
No ratings yet
Programming Ado - CHM
257 pages
HMIi Wincc v6 2 Getting Started en
100% (1)
HMIi Wincc v6 2 Getting Started en
120 pages
Hospital Management System: A Project Work Submitted To The
No ratings yet
Hospital Management System: A Project Work Submitted To The
66 pages
Lib Net Access Project Proposal
No ratings yet
Lib Net Access Project Proposal
2 pages
Datareader in C#: John Hudai Godel
No ratings yet
Datareader in C#: John Hudai Godel
6 pages
Create Custom MDX Query
No ratings yet
Create Custom MDX Query
23 pages
CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL)
No ratings yet
CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL)
5 pages