Top 10 Methods To Improve ETL Performance Using SSIS
Extraction, Transformation, and Load (ETL) is the backbone of any data warehouse. In the data
warehouse world, data is managed by the ETL process, which consists of three steps: Extraction,
pulling/acquiring data from the sources; Transformation, changing the data into the required format; and
Load, pushing the data to the destination, generally a data warehouse or a data mart.
We will discuss how you can easily improve ETL performance, or design a high-performing ETL
system, with the help of SSIS. For a better understanding, I will divide the ten methods into two
categories: first, SSIS package design-time considerations, and second, configuring the property
values of the components available in the SSIS package.
#2, Extract only the required data; pull only the required set of data from any table or file. Avoid the
tendency to pull everything available at the source on the assumption that you will use it in the future; it
eats up network bandwidth, consumes system resources (I/O and CPU), requires extra storage, and
degrades the overall performance of the ETL system.
If your ETL system is really dynamic in nature and your requirements change frequently, it would be
better to consider other design approaches, such as metadata-driven ETL, rather than designing to pull
everything in at one time.
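For example, with an OLE DB Source it is better to switch the access mode to a SQL command and trim both the columns and the rows. A minimal sketch, with hypothetical table and column names:

    -- Instead of "SELECT *" (or the "Table or view" access mode),
    -- select only the columns and rows the ETL actually needs:
    SELECT  OrderID,
            CustomerID,
            OrderDate,
            OrderAmount
    FROM    dbo.SalesOrder
    WHERE   OrderDate >= '2014-01-01'   -- e.g. an incremental load window
      AND   OrderDate <  '2014-02-01';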
#3, Avoid the use of asynchronous transformation components; SSIS is a rich tool with a set of
transformation components for achieving complex tasks during ETL execution, but at the same time
these components cost you a lot if they are not used properly.
Two categories of transformation components are available in SSIS: synchronous and asynchronous.
Synchronous transformations are those components which process each row and push it down to the
next component/destination; they use the already allocated buffer memory and don't require additional
memory, because there is a direct relation between the input and output rows and each row fits
completely into the allocated memory. Components like Lookup, Derived Column, and Data
Conversion fall into this category.
Asynchronous transformations are those components which first store data in buffer memory and then
process operations like Sort and Aggregate. Additional buffer memory is required to complete the task,
and until that buffer memory is available the component holds the entire data set in memory and
blocks the transaction; these are also known as blocking transformations. To complete the task, the
SSIS engine (the data flow pipeline engine) allocates extra buffer memory, which is additional
overhead for the ETL system. Components like Sort, Aggregate, Merge, and Merge Join fall into this
category.
Overall, you should avoid asynchronous transformations, but if you get into a situation where you have
no other choice then you must be aware of how to deal with the available property values of these
components. I'll discuss them later in this article.
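If you only need a Sort because a downstream Merge or Merge Join expects sorted input, a common alternative is to sort at the source and tell SSIS the data is already sorted. A minimal sketch, with hypothetical table and column names:

    -- Sort at the source instead of using the blocking Sort transformation;
    -- SQL Server (and its indexes) can usually do this far more cheaply.
    SELECT  CustomerID,
            CustomerName,
            Region
    FROM    dbo.Customer
    ORDER BY CustomerID;

After this, open the source's Advanced Editor and set IsSorted = True on the output and SortKeyPosition = 1 on CustomerID, so a downstream Merge Join accepts the input without an SSIS Sort.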
#4, Make optimum use of events in event handlers; to track package execution progress or take any
other appropriate action on a specific event, SSIS provides a set of events. Events are very useful, but
excessive use of events adds overhead to ETL execution. Here, you need to validate the need for
each event handler before enabling it in the SSIS package.
#5, Be aware of the destination table schema when working with a huge volume of data. You need to
think twice when you have to pull a huge volume of data from the source and push it into a data
warehouse or data mart. You may see performance issues when trying to push huge data volumes into
the destination with a combination of insert, update and delete (DML) operations, as the destination
table may have clustered or non-clustered indexes, which can cause a lot of data shuffling in memory
due to the DML operations.
If the ETL process has performance issues due to a huge number of DML operations on an indexed
table, you need to make appropriate changes in the ETL design, like dropping the existing indexes in
the pre-execution phase and re-creating them all in the post-execution phase. You may find other,
better alternatives to resolve the issue depending on your situation.
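As a sketch of that design, the pre- and post-execution phases can be implemented as Execute SQL Tasks around the data flow; the index and table names here are hypothetical:

    -- Pre-execution phase (Execute SQL Task): drop the index before the heavy DML.
    IF EXISTS (SELECT 1 FROM sys.indexes
               WHERE name = 'IX_FactSales_CustomerID'
                 AND object_id = OBJECT_ID('dbo.FactSales'))
        DROP INDEX IX_FactSales_CustomerID ON dbo.FactSales;

    -- ... the data flow task loads dbo.FactSales here ...

    -- Post-execution phase (Execute SQL Task): re-create the index.
    CREATE NONCLUSTERED INDEX IX_FactSales_CustomerID
        ON dbo.FactSales (CustomerID);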
#6, Control parallel execution by configuring two properties. MaxConcurrentExecutables is a
package-level property with a default value of -1, which means the number of logical processors plus
two; it specifies the maximum number of executables (tasks) that can run concurrently within the
package.

[Screenshot: Package properties]

EngineThreads is a data flow task level property and has a default value of 10, which specifies the
total number of threads that can be created for executing the data flow task.

[Screenshot: Data Flow Task properties]

You can change the default values of these properties to match your ETL needs and the available
resources.
#7, Configure the Data access mode option in the OLE DB Destination. In the SSIS data flow task we
can find the OLE DB Destination, which provides a couple of options for pushing data into the
destination table under Data access mode: first, the "Table or view" option, which inserts one row at a
time; second, the "Table or view - fast load" option, which internally uses a bulk insert statement to
send data into the destination table and generally provides much better performance. Once you
choose the "fast load" option, it gives you more control over the destination table's behavior during the
data push operation, via options like Keep identity, Keep nulls, Table lock and Check constraints.
[Screenshot: OLE DB Destination Editor]
It’s highly recommended that you use the fast load option to push data into the destination table to
improve ETL performance.
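For intuition, the fast load path behaves much like a hand-written T-SQL bulk load with the same options. A rough analogue (the file path and table name are hypothetical, and SSIS actually issues an INSERT BULK through the OLE DB provider rather than this exact statement):

    BULK INSERT dbo.FactSales
    FROM 'C:\etl\factsales.dat'
    WITH (
        TABLOCK,            -- corresponds to the "Table lock" option
        CHECK_CONSTRAINTS,  -- corresponds to the "Check constraints" option
        KEEPIDENTITY,       -- corresponds to the "Keep identity" option
        KEEPNULLS           -- corresponds to the "Keep nulls" option
    );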
#8, Configure Rows per batch and Maximum insert commit size in the OLE DB Destination. These two
settings are important for controlling the performance of tempdb and the transaction log, because with
their default values all the data is pushed into the destination table in a single batch and a single
transaction. This requires excessive use of tempdb and the transaction log, which turns into an ETL
performance issue because of excessive consumption of memory and disk storage.
[Screenshot: OLE DB Destination Editor]
To improve ETL performance you can put a positive integer value in both of these properties, based on
the anticipated data volume, which will help divide the whole set of data into multiple batches; the data
in each batch can then be committed to the destination table as the specified commit size is reached.
This avoids excessive use of tempdb and the transaction log, which helps to improve ETL
performance.
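While tuning these two values it helps to watch transaction log pressure directly. Two standard ways to do so, run in the destination database during a test load:

    -- Log size and percentage used, per database:
    DBCC SQLPERF (LOGSPACE);

    -- Log usage for the current database (SQL Server 2012 and later):
    SELECT  total_log_size_in_bytes / 1048576.0 AS log_size_mb,
            used_log_space_in_bytes / 1048576.0 AS used_log_mb,
            used_log_space_in_percent
    FROM    sys.dm_db_log_space_usage;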
#9, Use the SQL Server Destination in a data flow task. When you want to push data into a local SQL
Server database, it is highly recommended that you use the SQL Server Destination, as it provides
many benefits that overcome the limitations of other options and help you improve ETL performance.
For example, it uses the bulk insert feature that is built into SQL Server but still gives you the option to
apply transformations before loading data into the destination table. Apart from that, it gives you the
option to enable or disable the firing of triggers when loading data, which also helps to reduce ETL
overhead.
[Screenshot: SQL Server Destination data flow component]
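The SQL Server Destination exposes trigger firing as a simple option; with other destinations you can approximate the same effect using pre- and post-execution Execute SQL Tasks. A minimal sketch, with a hypothetical table name:

    -- Pre-execution phase: stop triggers from firing for each loaded row.
    DISABLE TRIGGER ALL ON dbo.FactSales;

    -- ... the data flow task loads dbo.FactSales here ...

    -- Post-execution phase: restore normal trigger behavior.
    ENABLE TRIGGER ALL ON dbo.FactSales;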
#10, Avoid implicit typecasting. When data comes from a flat file, the flat file connection manager treats
all columns as string (DT_STR) data, including numeric columns. As you know, SSIS uses buffer
memory to store the set of data and applies the required transformations before pushing the data into
the destination table. When all columns are string data types, each row requires more space in the
buffer, which reduces ETL performance.
To improve ETL performance you should convert all the numeric columns into the appropriate data
type and avoid implicit conversion, which will help the SSIS engine to accommodate more rows in a
single buffer.
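Inside the data flow, the fix is to set proper data types on the flat file connection manager's columns (or use a Data Conversion transformation early). Another common pattern is to land the file in an all-string staging table and cast explicitly on the way to the final table, so the conversion happens once, in the database. A minimal sketch with hypothetical table and column names:

    -- Move rows from an all-varchar staging table with explicit casts,
    -- rather than relying on implicit conversion at insert time:
    INSERT INTO dbo.FactSales (OrderID, OrderAmount, OrderDate)
    SELECT  CAST(OrderID     AS INT),
            CAST(OrderAmount AS DECIMAL(12, 2)),
            CAST(OrderDate   AS DATE)
    FROM    stg.FactSalesRaw;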
Summary
In this article we explored how ETL performance can be controlled and improved at various points in
the process. These are 10 common ways to improve ETL performance; there may be more methods,
depending on the scenario, through which performance can be improved.
Overall, the categorization helps you identify how to handle each situation. If you are in the design
phase of a data warehouse then you need to concentrate on both categories, but if you're supporting a
legacy system then first work closely on the second category.