Integration Services Tutorial
Overview
SQL Server Integration Services (SSIS) is the integration and ETL (extract, transform, load) tool in the Microsoft Data Platform stack. SSIS is typically used in data warehousing
scenarios, but can also be used in common data integration use cases or just to move
data around. SSIS is used behind the scenes in the Maintenance Plans of SQL Server and
in the Import/Export wizard.
Some of the things SSIS can do:
Transferring data from a source to a destination. This is done in memory and you can perform data manipulation tasks on the data while it is in memory, which makes SSIS one of the faster tools on the market.
You can perform simple FTP tasks.
SSIS can send emails to notify people.
SSIS is capable of robust error and event handling.
You can define a workflow with constraints to conditionally execute certain tasks.
And if all that isn’t enough, you can always extend SSIS with .NET code.
We will go through a number of topics in order to create our package.
Overview
In this section, we’ll briefly discuss the history of the Integration Services product and the
tools you would use to create SSIS projects.
History
Before SSIS, SQL Server came with Data Transformation Services (DTS), which was part
of SQL Server 7 and 2000. For SQL Server 2005, the teams at Microsoft decided to
revamp DTS. Ultimately, they ended up with a replacement for DTS instead of just an upgrade, and because it was such a drastic change, it was decided to name the product Integration Services instead of DTS. This name change came late in the product development cycle and that’s why some objects still refer to DTS, for example the command line tools DTEXEC (to execute SSIS packages) and DTUTIL (to deploy packages to a server).
Integration Services was launched with SQL Server 2005 and the most basic core
functionality is still the same today. It was a drastic change from DTS and it quickly became a popular ETL tool due to its speed, flexibility and support for various sources.
With SQL Server 2008, lots of performance improvements were made to SSIS and new
sources were introduced as well. SQL Server 2008 R2 didn’t introduce any noticeable
changes for SSIS.
SQL Server 2012 was a major release for SSIS. It introduced the concept of the project
deployment model, where entire projects with their packages are deployed to a server,
instead of individual packages. The SSIS of SQL Server 2005 and 2008 is now referred to
as the (legacy) package deployment model. SSIS 2012 made it easier to configure
packages and it came with a centralized storage and management utility: the catalog. We’ll
dive deeper into those topics later on in the tutorial.
SQL Server 2014 didn’t include any changes for SSIS itself, but new sources and transformations were added to the product on the side. This was done through separate downloads on CodePlex (an open-source code website) or through the SQL Server Feature Pack.
Examples are the Azure feature pack (to connect to cloud sources and objects) and
the balanced data distributor (to divide your data stream into multiple pipelines).
In SQL Server 2016 there were some updates to the SSIS product. Instead of deploying entire projects, you can now deploy packages individually again. There are additional
sources – especially cloud and big data sources – and some important changes were
made to the catalog. You can find an overview of all new features here and here.
During all these years, SSIS has built itself a reputation for being a stable, robust and fast
ETL tool with support for many sources. However, it’s still mainly an on-premises solution; there is – at the time of writing – no real cloud alternative.
Each SSIS version is tied to a version of Visual Studio:
SSIS 2005 – VS 2005. Templates were called Business Intelligence Development Studio (BIDS).
SSIS 2012 – VS 2010. Templates renamed to SQL Server Data Tools (SSDT).
SSIS 2016 – VS 2015. Database tools and business intelligence tools are combined into one single product: SSDT. Separate download.
If you want to follow along in this SSIS tutorial for SQL Server 2016, you can download the
latest version of SSDT here. Make sure you download the templates for Visual Studio
2015. At the time of writing, SQL Server 2017 hasn’t been released yet, but with the latest
version of SSDT for Visual Studio 2015 (SSDT 17.2) you can already develop projects for
SQL Server 2017.
Since SQL Server 2016, it’s possible to develop projects for earlier versions of SSIS within
the same version of Visual Studio. In the latest version, you can develop projects for SQL
Server 2017, 2016, 2014 and 2012. The tip Backwards Compatibility in SQL Server Data
Tools for Integration Services explains the concept in more detail.
Overview
Like any ETL tool, Integration Services is all about moving and transforming data. In this tutorial, we’ll want to extract data from a certain source and write it to a destination. In many cases, either the source or the destination will be a relational database,
such as SQL Server. In this tutorial, we’ll use the Wide World Importers sample database.
Explanation
We will set up databases that can be used for testing and learning more about SSIS.
In this section of the tutorial, we will restore the Wide World Importers database from a
backup, as it is the easiest option. If you want to learn more about Wide World Importers,
you can check out these tips.
You can download a backup of the Wide World Importers database from here. This
backup currently contains data from 2013-01-01 till 2016-05-31. Make sure you have a
SQL Server instance available (installing and configuring SQL Server is not part of this
tutorial). Since there are new features present in this backup, you need SQL Server 2016
or later. If you upgrade to SP1, you can use all features that were previously Enterprise
edition only, such as compression or columnstore indexes. You can find more information
on which features are present in which edition in this overview. You could also use SQL
Server 2016 Developer edition, which is free and has all the features available.
We are going to restore the backup using SQL Server Management Studio (SSMS). Right-
click on the Databases node and select Restore Database…
Next, we have to choose the backup file we downloaded from GitHub. By default, the
explorer shows the folder that has been configured as the default backup folder for the
SQL Server instance. Either move your backup file to that folder, or navigate to the
directory where you have saved the .bak file.
Click OK twice until you’re back in the Restore Database menu. We have one more thing
to do before we can restore the backup. Go to the Files pane.
In Files, choose to relocate all of the database files to the default SQL Server folders
(which you can configure during the SQL Server set-up).
Now you can click on OK to start the restore procedure. Depending on your machine, this
might take some time. To restore the Wide World Importers data warehouse, you can
follow the exact same steps. You can find the .bak file here.
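If you prefer T-SQL over the SSMS dialogs, a restore along these lines should also work. This is only a sketch: the backup path and the logical file names are assumptions, so list them first with RESTORE FILELISTONLY and adjust the MOVE targets to your instance’s folders.

-- Inspect the logical file names inside the backup (path is an assumption)
RESTORE FILELISTONLY
FROM DISK = N'C:\Backups\WideWorldImporters-Full.bak';

-- Restore the database, relocating each file
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\Backups\WideWorldImporters-Full.bak'
WITH
    MOVE N'WWI_Primary' TO N'C:\Data\WideWorldImporters.mdf',
    MOVE N'WWI_UserData' TO N'C:\Data\WideWorldImporters_UserData.ndf',
    MOVE N'WWI_Log' TO N'C:\Data\WideWorldImporters.ldf',
    MOVE N'WWI_InMemory_Data_1' TO N'C:\Data\WideWorldImporters_InMemory_Data_1',
    RECOVERY;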
If you have followed all the steps, you should now have two new databases in your SQL Server instance: WideWorldImporters and WideWorldImportersDW.
Overview
In this chapter, we’re going to create our SSIS project. For this, we need Visual Studio
2015. You need to download SQL Server Data Tools 2015, which will install a shell of
Visual Studio 2015. If you want a full-blown Visual Studio, so you can also tackle other
types of projects such as R projects or .NET projects, you can download Visual Studio 2015
Community Edition (which is free if you subscribe to Visual Studio Dev Essentials). With
the latest version of SSDT 2015 (SSDT 17.2 at the time of writing), you can create SSIS
projects for SQL Server 2012, 2014, 2016 and 2017.
Keep in mind that if you want source control integration with Team Foundation Server, you
need the full-blown Visual Studio. There is no Team Foundation Explorer plug-in for Visual
Studio 2015, so you can’t use the shell of SSDT.
Creating a Project
Start Visual Studio. If it’s the first time, you might get a prompt asking which settings Visual
Studio should use. You can pick the Business Intelligence settings. When Visual Studio
has started, go to File > New > Project.
In the New Project menu, enter a name for the project and specify a location to save the
project.
When you create a project, Visual Studio will create a solution first and add the project to
that solution. By default, the solution has the same name as the project. If you want to add
multiple projects to one solution, you might want to change the solution name. If you have
source control integrated into Visual Studio, you will have an extra checkbox asking you if
you want to add the project to source control.
When you click OK, the solution and the project will be created and an empty package will
be added to the project. You can view the project structure in the Solution
Explorer window:
When there’s only one project, the solution will not be displayed.
Let’s take a look at our development environment for creating SSIS packages. Keep in
mind that most of the windows are dockable, which means you can move them around, so
it’s possible you do not have the exact same view as in this screenshot.
1. This is your canvas. Here you drag items from the toolbox and you connect them with each
other to create a workflow. This will be discussed in more detail in the next sections of the
tutorial. The package canvas has multiple tabs:
1. The control flow. Here you can have multiple tasks which you can connect with each other.
The control flow is important as it defines what your package actually does.
2. The data flow. This is a special task of the control flow. Here you move data around between
sources and destinations, and you can transform the data while it is in memory.
3. Parameters. You can define parameters to make your package more flexible.
4. Event Handlers. These are special “control flow”-like canvasses where you can define tasks that will only execute if a specific event occurs. Event handlers fall outside the scope of this tutorial.
5. Package Explorer. A tree-view of all the objects inside your package.
2. The SSIS Toolbox. Here you can find all tasks and transformations for the control and data
flow. You can drag them from the toolbox into the canvas. There’s also another window just
called “Toolbox”. It’s used for other types of projects such as Reporting Services, so don’t confuse it with the SSIS Toolbox. If you can’t find the SSIS Toolbox, right-click on the canvas
and select SSIS Toolbox from the context menu.
3. The connection managers. A connection manager defines a connection to a specific object.
This can be a flat file, a database, a folder and so on. Tasks and transformations use a
connection manager to create a connection to the object.
4. The Solution Explorer. A tree view of all the objects in the project or solution.
5. The properties window. Here you can view and change the properties of almost all objects
within an SSIS package.
6. The toolbars. The most important item is the green arrow, which you can use to start the
debugger. The debugger will execute the SSIS package within Visual Studio.
Everything mentioned here will be explained in more detail in the following sections of the tutorial.
There’s only one window missing from this view: the variables. When you create your first
SSIS project, this window is hidden. You can right-click on the canvas and select Variables
to open the window.
Variables are used to make your package more flexible and change properties on the fly
when a package is running. The difference between parameters and variables is that
parameters cannot change value once the package has started executing, while variables
can. Parameters are used as input for the package before it starts.
Let’s go to the next section to learn more about the control flow.
Overview
Now that we’ve created our SSIS project in the previous chapter, it’s time to start to
explore the control flow and its abilities. The control flow allows you to execute different
tasks and organize a workflow between the tasks. In this section, we’ll give an overview of
the objects you can add to the control flow.
SSIS Tasks
In an SSIS package, you can add tasks to the control flow. A task is a unit of work and you
have different kinds of tasks to perform different kinds of work. Explaining all tasks would
take us too far in this tutorial, so here’s an overview of some of the most common tasks:
Execute SQL Task: These tasks will execute a SQL statement against a relational database.
Data Flow Task: These special tasks can read data from one or more sources,
transform the data while in memory and write it out against one or more
destinations. We’ll describe the data flow in more detail in the next sections of the
tutorial.
Analysis Services Processing Task: You can use this task to process objects of
an SSAS cube or Tabular model.
Execute Package Task: With this task, you can execute other packages from
within the same project. You can also pass variable values to the called package.
Execute Process Task: Allows you to call an executable (.exe). You can specify
command line parameters. With this task, you can for example unzip files, execute
batch scripts and so on.
File System Task: This task can perform manipulations in the file system, such as
moving files, renaming files, deleting files, and creating directories et cetera.
FTP Task: Allows you to perform basic FTP functionalities. However, this task is
limited because it doesn’t support FTPS or SFTP.
Script Task: This task is essentially a blank canvas. You can write .NET code (C#
or VB) that performs any task you want.
Send Mail Task: Here you can send an email. Ideal for notifying users that your package is done running or that something went wrong.
In the screenshot, you can see an Execute SQL Task that has been added to the control
flow:
There are of course more tasks. Some are for working with Azure or big data systems,
others are for performing DBA tasks (and are essentially the building blocks of SQL Server
Maintenance Plans). You can find an overview in the documentation.
SSIS Containers
Next to tasks, you also have containers. These give you more power over how tasks are
executed. You can add one or more tasks to a single container.
For Loop Container: With this container, you can execute all tasks inside for a
fixed number of executions. This is equivalent to for loops in a programming
language.
Foreach Loop Container: This container doesn’t execute a fixed number of times like the for loop; instead, the number of executions is determined by a collection. This can be for example the number of files in a directory or the number of rows in a table. This makes this container more flexible than the For Loop Container.
Sequence Container: This container simply groups tasks together. The tasks will
execute together. This container is useful to split your control flow into logical units
of work.
Here you can see a couple of tasks inside a sequence container. When you execute the sequence container, all three tasks will start at the same time.
It’s time to start building an SSIS package. In this chapter, we’ll add tasks to the control
flow and learn how you can start the debugger to execute the package. We’ll also look at how the execution of different tasks can be related to each other.
Drag an Execute SQL Task from the SSIS Toolbox onto the control flow. You can see there’s a red error icon on the task. That’s because we haven’t defined a database connection yet.
Double click the task to open it. In the editor, open the connection dropdown and click
on <New Connection…>.
If you have already created connection managers, you can pick one from the list in the
next window. However, you can also create a new one by clicking the New… button at the
bottom.
This will open a connection manager editor. You need to enter the server name and select
a database from the dropdown list. You can also optionally specify a username and
password if you don’t want to use Windows Authentication.
Click OK two times to go back to the Execute SQL Task editor. You can either directly type
a SQL statement in the SQLStatement property or you can click on the ellipsis to open up
the editor. This editor is basically a notepad editor and it has no additional functionality.
You are most likely better off writing SQL statements in Management Studio and copy
pasting them into the editor. Let’s enter a very basic statement: SELECT 1.
We can now run the package to test our Execute SQL Task. You can click on the green
arrow or just hit F5. This will start the debugger which will run the package.
When the task has finished, you will see a green icon in the corner of the task. You can
click on the stop icon in the task bar to stop the debugger or you can click on the sentence
below the connection manager window.
When the package is running, an extra tab is added called Progress. Here you can see all
of the informational messages, errors and warnings generated by the SSIS package as
well as timing information.
When the debugger stops, the Progress tab is renamed to Execution Results.
You can create a precedence constraint by selecting the first task and dragging the green arrow onto the other task. Now when we execute the package, the first task will be executed and then the other.
The green arrow signifies a “Success” precedence constraint, which means the second
task will only be executed if the first task is successful. You can change the behavior of the
precedence constraint by double clicking on the arrow:
You can change the precedence constraint to “Failure”, which means the second task will only be executed if the first task fails. With “Completion”, the second task will execute once the first task has finished, but it doesn’t matter if the first task was successful or not.
When you have multiple arrows going into one single task, you can change the constraint
to AND or OR. With AND, all tasks need to be successful before the task starts. With OR,
only one task needs to be successful. In the following screenshot, only one of the two top
tasks must finish successfully so the last task can start.
With precedence constraints and containers, you can create complex workflows:
Overview
In this section, we will introduce the Integration Services Data Flow. It’s one of the more
important features of SSIS and one of the reasons SSIS is considered one of the fastest
ETL tools. We’ll also give an overview of the more important transformations you can do in the data flow.
The data flow is a special task of the control flow. It needs a canvas of its own, so there’s
an extra tab for the data flow, right next to the control flow.
The data flow is a construct where you can read data from various sources into the
memory of the machine that is executing the SSIS package. While the data is in memory,
you can perform different kinds of transformations. Because it’s in memory, these are very
fast. After the transformations, the data is written to one or more destinations (a flat file, an
Excel file, a database, etc.). In most cases, not all data is read into memory at once (although this is possible if you use certain kinds of transformations); instead, the data is read into buffers. Once a buffer is filled by the source component, it is passed on to the next transformation, which does its logic on the buffer. Then the buffer is passed to the following transformation and so on, until it is written to the destination. You can imagine the data flow as a pipeline with data flowing through.
To create a data flow, you can drag it from the SSIS toolbox to the control flow canvas.
Another option is to simply go to the data flow tab, where you will be greeted with the
following message:
Clicking the link will create a new data flow task for you. You end up with an empty
canvas, just like in the Control Flow.
As you can see in the screenshot above, the SSIS toolbox will change once you go to the
data flow canvas. All the tasks are now replaced with transformations, sources and
destinations for the data flow. At the top, you also have a dropdown box that lets you
easily switch between multiple data flows if you have any.
There are other types of sources and destinations available. You can take a look at
the documentation to learn more. You also have the possibility to use a .NET script
component to make your own source or destination. This is similar to a .NET script task;
you can use C# or VB, but now there are special methods and classes included to handle
the buffers of the data flow.
In the next two sections of the tutorial, we’ll configure a source, some transformations and
a destination.
Overview
In this section, we’ll get our hands dirty in the data flow. We’ll read data from our sample database and look at the tools we can use to inspect this data.
Drag the Source Assistant from the SSIS Toolbox onto the data flow canvas. In the assistant, double-click on New… to create a new connection manager, while SQL Server is still selected as the source type. In the connection manager editor, enter the server name and select the WideWorldImporters database. Click OK.
The assistant will now put an OLE DB Source component on the data flow and create a
connection manager. If you want to re-use the same connection manager across different
packages, you can right-click the connection manager and choose “Convert to Project
Connection”. This will upgrade the connection manager to the project level, where it is
shared between all packages of the project.
Double click on the OLE DB Source to open its editor. There are different options to read
data from the database:
However, it’s almost always better to write a SQL statement instead of using the dropdown
(table or view option) to select a table. With the dropdown, you select all rows and all
columns and you do not have the option to do some transformations using the SQL
language (such as grouping, sorting and aggregating the data). Change the option to SQL
command. This will give you a text box where you can enter your T-SQL statement. If you
want, you can use a graphical query builder to construct your statement, but most of the
time it’s easier to just write it in Management Studio and copy paste it in the source
component. You can use the following SQL statement:
SELECT
 [CityID]
,[CityName]
,[StateProvinceID]
,[LatestRecordedPopulation]
FROM [WideWorldImporters].[Application].[Cities];
This selects all the cities from the WideWorldImporters database. The source table is
a system-versioned table, so we get the latest data when we execute this statement.
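As an aside, because the table is temporal, you could also retrieve the data as it existed at an earlier point in time with the FOR SYSTEM_TIME clause. A quick sketch; the date is arbitrary:

SELECT [CityID], [CityName], [StateProvinceID], [LatestRecordedPopulation]
FROM [WideWorldImporters].[Application].[Cities]
FOR SYSTEM_TIME AS OF '2015-01-01';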
When you copy and paste the SQL statement into the source component, you can hit
preview to take a look at the data:
In the Columns tab, you can inspect all the columns returned by the query defined in the first tab. You can deselect columns to remove them from the output and you can rename columns as well, although it’s better to do these manipulations directly in the query.
Every column has a data type associated with it. The data flow expects that this metadata
doesn’t change. If you would change the data type in the source table (for example change
cityID to a date if that were possible), the data flow would throw an error. Sometimes SSIS
doesn’t realize though metadata has changed. In that case, you can just deselect all
columns and select them again (using the checkbox right next to Name) to quickly refresh
the metadata of all columns.
Click OK to close the editor. To be able to run the data flow and see the data flowing
through, you need to add one more transformation. Let’s use the Multicast as a dummy.
Connect the source component to the Multicast with the blue arrow.
The arrows are not exactly precedence constraints like in the control flow. They tell the data flow in which direction the data flows. You have two types: the normal blue output arrow and the red arrow. The red arrow is the error output of the transformation. If some rows have an error (for example a data type mismatch in the source), you can redirect them to another destination so you can inspect them later. If you click on the source again, you can see the red arrow.
You can find more information about error handling in the tip How to serialize error logging
in SSIS.
Finally, right-click on the blue arrow and choose “Enable Data Viewer”.
This will add some sort of “debug” window on your output path. When the data flow runs,
you can inspect the rows in the current memory buffer. Let’s start the package. The first
buffer contains 9,637 rows and they are shown in the data viewer.
You can copy the data to inspect them in another tool, such as Excel for example. To fetch
the next buffer, click on the little green arrow in the data viewer. When you close the data
viewer, the data flow will run till all the rows have been fetched from the source. In total,
37,940 rows are read from the source:
In the next chapter, we’ll add some transformations to enrich the data.
Overview
In this section, we build further upon the data flow created in the previous section. We will
add an extra column using the Derived Column transformation and fetch extra data using
the Lookup component.
Drag a Derived Column transformation from the SSIS Toolbox onto the data flow and open its editor. In the editor, you can drag variables, parameters and existing columns to the expression. On the right, you also have a library of functions available. You can also choose whether you want to replace existing columns or add a new column:
Let’s add a new column that contains the date of today (using the GETDATE() function)
and trim the existing CityName column. You can drag the column name and
the Trim function to the editor:
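The two expressions end up looking like the following; the name of the new column (LoadDate here) is just an example, so pick whatever fits your naming conventions:

GETDATE()       (added as a new column, e.g. LoadDate)
TRIM(CityName)  (replacing the existing CityName column)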
The Derived Column transformation is very powerful, but the one-line expression editor
can be frustrating. Let’s add a multicast and a data viewer to inspect the results:
Next, add a Lookup component to the data flow; we will use it to fetch the StateProvinceName for each city. In the General pane of its editor, there are a couple of important settings:
Cache mode. This defines how the reference dataset is loaded. With full cache, the entire dataset is loaded into memory at the start of the data flow. This allows for very quick matching between the datasets. With partial cache, only a part of the dataset is loaded into memory. If there’s a cache miss, the data will be fetched from the database and put in the cache, possibly evicting older data. With no cache, nothing is loaded into memory and a query needs to be sent to the database for every row, which is quite slow.
Connection type. You can either choose a cache connection manager (for when you want to pre-load your reference datasets and use them in multiple packages or data flows) or a regular OLE DB connection manager. The default is a standard OLE DB connection manager. Notice there’s no option to use ADO.NET or ODBC.
No match behavior. Here you specify how the lookup component should behave if
no match was found for a row.
o Ignore failure. The row is passed to the Match Output and the columns from the
reference dataset get NULL values.
o Redirect rows to error output. All rows without a match are sent to the error output
(the red arrow).
o Fail Component. The default, but a bit drastic. The data flow and package will stop if
no match is found.
o Redirect rows to no match output. A new output is created to which all rows without a match are sent.
Let’s set this option to “Redirect rows to no match output”. In the next tab, you need to
define the reference dataset. You can use the dropdown box to select a table, but just as
with the source, a SQL statement is preferred. Select only the columns you need to make
the match and of course the columns you want to return. We can use this T-SQL
statement:
SELECT
[StateProvinceID]
,[StateProvinceName]
FROM [Application].[StateProvinces];
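Conceptually, the Lookup behaves like a left outer join between the data flow and this reference query. In T-SQL, the match we’re about to configure would look roughly like this; rows where StateProvinceName comes back NULL correspond to the no match output:

SELECT c.[CityID]
      ,c.[CityName]
      ,c.[StateProvinceID]
      ,c.[LatestRecordedPopulation]
      ,sp.[StateProvinceName]
FROM [Application].[Cities] AS c
LEFT OUTER JOIN [Application].[StateProvinces] AS sp
    ON sp.[StateProvinceID] = c.[StateProvinceID];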
The last tab we need to edit is the Columns tab, where we specify how the matching will take place and which columns we want to return. You need to drag the key columns from the input columns onto the key columns of the lookup columns. Then you can check each column from the lookup columns that you want returned; in our case the StateProvinceName.
Close the editor. When we now attach a multicast to the lookup component, we can
choose which output we want:
Let’s attach multicasts on both outputs, combined with data viewers so we can test the
lookup component:
As you can see, the StateProvinceName column has been added to the buffer and there
were no rows sent to the no match output.
Overview
In this chapter, we are going to write the data to a destination. Make sure you have
finished the previous section of the tutorial to have a finished data flow.
Drag the Destination Assistant from the SSIS Toolbox onto the data flow, select the SQL Server destination and double-click on New… to create a new connection manager. We will write the data to the WideWorldImportersDW database.
Click OK twice to close the editors. The destination assistant will add a new OLE DB
Destination to the canvas. Connect the Lookup to this destination with the Lookup Match
Output:
Open up the destination editor. Make sure the correct connection manager is selected
and Table or view – Fast Load is selected as data access mode. You can select a table
from the dropdown menu (for the destination it’s fine to use the dropdown), but we are
going to create a new table first.
Click on New… next to the dropdown. This will open up an editor with the CREATE TABLE
statement, based on the metadata of the data flow.
If you click OK, the table will be created in the database specified by the connection
manager. You might want to change the table name first though.
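For our data flow, the generated statement will look something along these lines. The table name (DimCity) is just an example, and the exact data types may differ since SSIS derives them from the data flow metadata:

CREATE TABLE [DimCity] (
    [CityID] int,
    [CityName] nvarchar(50),
    [StateProvinceID] int,
    [LatestRecordedPopulation] bigint,
    [LoadDate] datetime,
    [StateProvinceName] nvarchar(50)
)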
Make sure the new table is selected in the dropdown menu. Leave all the other settings
as-is. In the Mapping pane, we can map columns from the input to the columns of the
destination table. Since all columns have the same name, they are mapped automatically.
You can map columns by dragging them from the left list to the right, or you can map them in the grid below. If your columns have the same name (recommended) but they haven’t been mapped automatically, you can right-click anywhere in the space above the grid and select Map Items by Matching Names from the context menu. This will save you quite some time with bigger tables.
When the mapping is finished, you can click OK to close the editor. The data flow is now
finished.
To make the package re-runnable, we can add an Execute SQL Task to the control flow that runs before the data flow. Open up its editor, choose the WideWorldImportersDW connection manager and type the SQL statement to truncate the table:
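Assuming the destination table was named DimCity, as in our example, the statement is simply:

TRUNCATE TABLE [dbo].[DimCity];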
When we take a look at the destination table, we can see 37,940 rows have been inserted.
In the following chapters of the tutorial, we’ll learn how we can deploy our package to the
server and how we can execute it over there.
Overview
Now that our SSIS package development is finished, we can deploy it to the server. There
we can schedule and execute the package as well.
Right-click on the project in the Solution Explorer and choose Deploy. This will start the SSIS deployment wizard. Keep in mind this will deploy the entire project, with all packages included. If you want to deploy an individual package, you can right-click on the package itself and choose Deploy (since SSIS 2016).
In the first step of the wizard, we need to choose the destination (several steps are
skipped since we started the wizard from Visual Studio). Enter the server name and make
sure the SSIS catalog has already been created on that server. If you want, you can also
create a folder to store the project in.
At the next step, you get an overview of the actions the wizard will take. Hit Deploy to start
the deployment.
The project has now been deployed to the server and you can find it in the catalog:
To execute the package, right-click on it in the catalog and select Execute…. You will be taken to a dialog where you can edit certain properties, such as the connection managers, parameters if any, the amount of logging and so on.
Click on OK to start the execution of the package. A pop-up will open asking you if you want to open one of the catalog’s built-in reports.
Click Yes. This will take you to the Overview report, where you can see the package has executed successfully.
To learn more about the catalog reports, check out the tip Reporting with the SQL Server
Integration Services Catalog.
To schedule the package, we create a new job in SQL Server Agent. In the General pane, enter a name for the job, choose an owner and optionally enter a description:
In the job step configuration, you can enter a name for the step. Choose the SQL Server
Integration Services Package type, enter the name of the server and select the package.
In the configuration tab, you can optionally set more properties, just like when executing a package manually. Click OK to save the job step. In the Schedules tab, you can define one or more schedules to execute the package at predefined points in time. Click New… to create a new schedule. In the schedule editor, you can choose between multiple types of schedules: daily, weekly or monthly. You can also schedule packages to run only once. In the example below we have scheduled the job to run every day at 1AM, except in the weekend.
Click OK twice to exit the editors. The job is now created and scheduled.
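If you’d rather script the schedule than click through the dialogs, the same weekdays-at-1AM schedule can be attached with the msdb stored procedures. This is a sketch: the job name below is hypothetical, so replace it with the name you gave your job.

USE msdb;
GO
EXEC dbo.sp_add_jobschedule
    @job_name = N'Load WideWorldImporters',  -- hypothetical job name
    @name = N'Weekdays at 1AM',
    @freq_type = 8,               -- weekly
    @freq_interval = 62,          -- day bitmask: Mon(2)+Tue(4)+Wed(8)+Thu(16)+Fri(32)
    @freq_recurrence_factor = 1,  -- every week
    @active_start_time = 010000;  -- HHMMSS, i.e. 01:00:00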
Overview
In the last chapter of this tutorial we’ll look at a couple of performance optimizations you
can implement in your SSIS packages. After all, you want to move data around as quickly
as possible.
Think about whether you want to perform tasks in SSIS or whether you can do them somewhere else. For example, sorting data will be faster in SQL Server T-SQL code than in SSIS.
Perform tasks in parallel if possible, but don’t overdo it. Going parallel can improve performance, but this is heavily influenced by the available memory and the number of processors. There is a certain overhead to parallelism; if there’s too much parallelism, the system will go slower instead of faster. Carefully test to find the optimum balance.
Most performance issues are related to the data flow. As with the control flow, think about whether SSIS or transformations in SQL will be faster. Try to visualize the data flow as a pipeline with data flowing through. You want to maximize the flow rate to get data to the destination as quickly as possible. There are two important properties you can set to influence the memory buffers: DefaultBufferMaxRows and DefaultBufferSize.
The actual buffer size will be determined by which of the two properties is reached first.
You can set AutoAdjustBufferSize to True to make sure that the specified number of rows
in DefaultBufferMaxRows is always met.
Some guidelines:
Don’t use the dropdown box to select the source table. Write a SQL statement and include
filtering, grouping and sorting in the SQL code.
Only select columns you actually need.
Keep the data types of the columns small. The more rows you can fit in a single memory
buffer, the better.
SSIS Transforming Data Performance Optimizations
Don’t use blocking transformations (e.g. sort and aggregate component). They read all data
in memory before even sending one single row to the output. Asynchronous
transformations are to be avoided as well since they modify the memory buffer. You can
find a good overview in this blog post.
Avoid the Slowly Changing Dimension Wizard. It uses the OLE DB Command, which
executes SQL statements row-by-row, which is slow and results in excessive logging.
Don’t use the OLE DB Command, as stated in the previous point.
SSIS Writing Data Performance Optimizations
Writing data is typically the slowest part of the process. Here are some tips to optimize the
process:
The OLE DB Destination is the fastest adapter for SQL Server at the moment, provided you use the Fast Load option.
Make sure you use a table lock (which is enabled by default).
To speed up inserts, you can disable constraints and drop and recreate indexes.
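A sketch of that pattern, which you would wrap around the data flow in Execute SQL Tasks; the table and index names are illustrative:

-- Before the data flow: stop checking constraints and disable a non-clustered index
ALTER TABLE [dbo].[DimCity] NOCHECK CONSTRAINT ALL;
ALTER INDEX [IX_DimCity_StateProvinceID] ON [dbo].[DimCity] DISABLE;

-- ... the data flow loads the table here ...

-- After the data flow: rebuild the index and re-validate the constraints
ALTER INDEX [IX_DimCity_StateProvinceID] ON [dbo].[DimCity] REBUILD;
ALTER TABLE [dbo].[DimCity] WITH CHECK CHECK CONSTRAINT ALL;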
For instance, suppose we are working with stock market data, and every day we are getting billions of rows in .csv (comma separated values) format. Our task is to copy the data inside this .csv file to a SQL Server database table every day. We usually have two approaches to do this bulk load in SSIS:
Drag and drop a Data Flow Task, and inside the data flow drag and drop a Flat File Source and an OLE DB Destination and copy the data. This approach is useful if we want to perform any SSIS transformations.
Use the SSIS Bulk Insert Task. This approach is more powerful compared to the previous one because, internally, the Bulk Insert Task uses the Bulk Copy (BCP) operation, which is very fast in SQL Server.
The Bulk Insert Task exposes several options:
CodePage: Specify the code page of the data in the data file. Generally used for other languages.
DataFileType: Specify the data-type value to use in the load operation.
BatchSize: Specify the number of rows in a batch. The default is the entire data file. If you set BatchSize to zero, the data is loaded in a single batch. For instance, if we set the batch size to 100, then each batch acts as one transaction, and if the task fails after some time, the batches that already loaded successfully will not be rolled back.
LastRow: Specify the last row to copy.
FirstRow: Specify the first row from which copying starts.
SortedData: Specify the ORDER BY clause in the bulk insert statement.
The default is false.
MaxErrors: Specify the maximum number of errors that can occur before the bulk insert operation is canceled. A value of 0 indicates that an infinite number of errors are allowed.
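These options map closely onto the T-SQL BULK INSERT statement the task issues behind the scenes. A minimal sketch, with a hypothetical file and table:

BULK INSERT [dbo].[StockQuotes]  -- hypothetical destination table
FROM 'C:\Imports\quotes.csv'     -- hypothetical file path
WITH (
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2,        -- skip a header row
    BATCHSIZE = 100000,  -- each batch commits as its own transaction
    MAXERRORS = 10       -- cancel the load after ten bad rows
);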
Check constraints: Checks the column data.
Enable identity insert: Select to insert existing values into an identity column.
Table lock: Select to lock the table during the bulk insert.
Introduction
An SSIS package’s control flow is useful for executing multiple tasks and designing a workflow for execution. A container in the control flow plays an essential role in workflow design. We can see the following containers in the SSIS Toolbox:
Sequence Container
The sequence container in SSIS is useful for grouping tasks together. We can split the control flow into multiple logical units using it. We will explore the Sequence container further in this article.
We can define variables under the scope of tasks inside a sequence container
It follows a parent-child relationship with the underlying tasks. We can change a property of a sequence container (the parent), and it is propagated to the tasks inside it (the children)
It provides flexibility to manage the tasks in a container
Suppose you have a control flow for executing the following SQL tasks daily:
Currently, we have a similar procedure on each task that runs daily. It is also running on a fixed schedule
by SQL Server Agent. Now, due to some business requirements, your development team created
separate stored procedures for each day of the week.
We could create separate SSIS packages for each day and schedule separate SQL Server Agent jobs, but that increases the complexity of managing the packages:
Sequence Container in SSIS package solves this problem for us. Let’s explore the solution.
Drag a sequence container from the SSIS toolbox to the control flow area. Currently, it does not have
any tasks associated with it:
Now, drag and drop SQL task 1 inside the Sunday container. You get the following error message that a connected task cannot be moved to a new container, because SQL task 1 is connected to other tasks using precedence constraints:
We can either remove the precedence constraints, or select all SQL tasks together and move them into the container as one. Once we select all the tasks together, you can see bold outlines for each task:
Now, move them together inside the Sunday sequence container and resize the container so that we can fit another sequence container on the screen as well. I have renamed the tasks and given them shorter names:
Make similar copies of the sequence container in the SSIS package for the rest of the week with
appropriate scripts.
Note: We are not covering the configuration of individual tasks inside the container. You should have
basic SSIS knowledge before using this article.
Now, my SSIS package looks like below with a Sequence container in SSIS for each day of the week.
Currently, if we execute the SSIS package, it will execute each sequence container individually.
In the following screenshot, we can see that for each sequence container, task 1 fails and this marks the whole container as failed:
It did not execute task 3 because task 3 has multiple precedence constraints and, by default, all inputs to a task should be true.
Changing the constraint to a logical OR changes the solid precedence lines to dotted lines. Fix the issue and execute the package again, and we can see each sequence container runs the tasks inside it:
Now, we need to execute the Sequence container based on the day of the week. For this, right-click on
the package and add a variable:
Click on Add variable and provide a name and data type for the variable. By default, the variable scope is at the package level. We will use this variable to hold the current day of the week:
Add a new Execute SQL Task and rename it to find the day of the week:
Double-click on this task, and it opens the editor window. Make the following changes in this editor:
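The main change is the SQLStatement, combined with setting the ResultSet property to Single row. The exact query in the screenshot may differ, but it is a one-liner along these lines:

SELECT DATENAME(WEEKDAY, GETDATE()) AS [Day];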
This query uses the DATENAME function and GETDATE function to find today’s day of the week. For
example, it returns Wednesday for 27/11/2019.
Navigate to the Result set and map the query output with the SSIS variable:
Click OK and join the precedence constraint from the SQL task to the Sunday sequence container. Double-click the constraint, set its evaluation operation to Expression and use the following expression:
@[User::Day]=="Sunday"
You can click on the test to verify the expression. It gives the following message for successful
validation:
Click OK, and you can see the following configuration for precedence constraint with Sunday sequence
container in SSIS:
Similarly, add precedence constraints from the SQL task to the respective sequence containers. Make sure to change the expression for the particular day of the week. You can refer to the following table for the expressions:
Monday @[User::Day]=="Monday"
Tuesday @[User::Day]=="Tuesday"
Wednesday @[User::Day]=="Wednesday"
Thursday @[User::Day]=="Thursday"
Friday @[User::Day]=="Friday"
Saturday @[User::Day]=="Saturday"
For example, I am running this package on 27/11/2019, which is a Wednesday. Let’s execute the SSIS package. It should execute only the sequence container for Wednesday:
Here we go. In the following screenshot, we can see that only the Wednesday Sequence container in
SSIS is executed:
Disabling a Sequence container greys out the tasks inside it as well:
We can design nested Sequence containers as well. In the following screenshot, we added a Sequence container inside the Sunday Sequence container. Once task 2 is successful, it triggers the nested container’s execution:
We can collapse or expand a Sequence container in SSIS package with a click on the arrow as shown
below:
We can configure sequence container properties as well. A few useful properties are:
Conclusion
In this article, we demonstrated the Sequence container in the SSIS package. It is useful for combining tasks and defining the package workflow. You should practice using this container as per your requirements.