Lab4 SSIS DQS
Lab 4
SQL Server Data Quality Services
December 2023
Objectives
• Profiling Data using SSIS
• Cleansing data using SSIS and DQS
At the end of this lab, you are required to deliver a report of your findings and observations (Take a step back
to understand the general steps for using SSIS for Data Quality).
SSIS provides a few components that can help you not only understand your data but also automate the
data cleansing process and run it in batch mode. In this lab we will examine two SSIS tasks: Data Profiling
task and DQS Cleansing component.
The Data Profiling task provides data profiling functionality inside the process of extracting, transforming, and
loading data. By using the Data Profiling task, you can achieve the following benefits:
• Prevent data quality problems before they are introduced into the data warehouse / data stack.
The Data Profiling task, however, works only with data that is stored in SQL Server; it does not work with third-party or file-based data sources.
SQL SERVER - DATA QUALITY SERVICES LAB
Step 1: Configuring the Data Profiling Task
You use the Data Profiling task to configure the profiles that you want to compute. You then run the package that contains the Data Profiling task to compute the profiles. The task saves the profile output in XML format to a file or a package variable.
Step 2: Reviewing the Profiles that the Data Profiling Task Computes
To view the data profiles that the Data Profiling task computes, you send the output to a file, and then
you use the Data Profile Viewer. This viewer is a stand-alone utility that displays the profile output in
both summary and detail format with optional Drill-Down capability.
You can use the SSIS Data Profiling task for profiling the following:
• Column Length Distribution: This helps you find strings of unexpected length.
• Column Null Ratio: Use this to find the percentage of NULLs in a column.
• Column Pattern: This is a very powerful profile that expresses patterns in strings as regular
expressions and then calculates the distribution of these regular expressions.
• Column Statistics: This gives you the minimum, maximum, average, and standard deviation for
numeric columns, and the minimum and maximum for datetime columns.
• Column Value Distribution: This gives you the distribution of values for discrete columns.
• Candidate Key: This profile gives you the percentage of unique values in columns, thus helping
you identify columns that are candidates for keys.
• Functional Dependency: This profile reports how much the values in a dependent column
depend on the values in a set of determinant columns.
• Value Inclusion: This profile finds the extent to which column values from one table have
corresponding values in a set of column values of another table, thus helping you find potential
foreign keys.
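To make these profile definitions concrete, the Column Null Ratio profile, for example, corresponds conceptually to a query like the following (the table and column names here are illustrative only, not part of the task's actual implementation):

```sql
-- Conceptual equivalent of the Column Null Ratio profile.
-- Table and column names are illustrative.
SELECT 100.0 * SUM(CASE WHEN EmailAddress IS NULL THEN 1 ELSE 0 END)
       / COUNT(*) AS NullRatioPercent
FROM dbo.CustomersDirty;
```

The task computes such statistics for you and writes them to the XML output, so you never need to write these queries by hand.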
In this exercise, we will use the Data Profiling task to find inaccurate data in the CustomersDirty view we created in the previous lab, within the DQS_STAGING_DATA database.
1. Open Visual Studio (or SQL Server Data Tools (SSDT) for older versions). Create a new SSIS project and
solution.
2. Drag the Data Profiling task from the SSIS Toolbox (it should be in the Common Tasks group) to the
control flow working area. Right-click it and select Edit.
3. On the General tab, use the Destination Property drop-down list to select New File Connection.
4. In the File Connection Manager Editor window, change the usage type to Create File. In the File text box,
type the file name ProfilingCustomers.xml.
5. When you are back in the Data Profiling Task Editor, on the General tab, change the OverwriteDestination
property to True to make it possible to re-execute the package multiple times (otherwise you will get an
error saying that the destination file already exists when the package next executes).
6. In the lower-right corner of the Data Profiling Task Editor, on the General tab, click the Quick Profile
button.
7. In the Simple Table Quick Profiling Form dialog box, click the New button to create a new ADO.NET
connection. The Data Profiling task accepts only ADO.NET connections.
8. Connect to your SQL Server instance by using Windows authentication, and select the
DQS_STAGING_DATA database. Click OK to return to the Simple Table Quick Profiling Form dialog box.
9. Select the CustomersDirty view in the Table Or View drop-down list. Leave the first four check boxes
selected, as they are by default. Clear the Candidate Key Profile check box, and select the Column
Pattern Profile check box.
10. In the Data Profiling Task Editor window, in the Profile Type list on the right, select different profiles and
check their settings. Change the Column property for the Column Value Distribution Profile Request from
(*) to Occupation (you are going to profile this column only). Change the ValueDistributionOption property
for this request to All-Values. In addition, change the value for the Column property of the Column Pattern
Profile Request from (*) to EmailAddress. Click OK.
11. Save the project, and then execute the package.
12. When the execution finishes, check whether the XML file appeared in the folder you chose in step 4.
Viewing and analyzing the data profiles is the next step in the data profiling process. To view the data profiles,
you configure the Data Profiling task to send its output to a file, and then you use the stand-alone Data
Profile Viewer. To open the Data Profile Viewer, do one of the following.
• Right-click the Data Profiling task in the SSIS Designer, and then click Edit. Click Open Profile Viewer
on the General page of the Data Profiling Task Editor.
• In the folder <drive>:\Program Files (x86)\Microsoft SQL Server\110\DTS\Binn or <drive>:\Program
Files\Microsoft SQL Server\110\DTS\Binn (the exact path depends on your SQL Server version), run DataProfileViewer.exe.
The viewer uses multiple panes to display the profiles that you requested and the computed results, with
optional details and Drill-Down capability:
• Profiles pane: The Profiles pane displays the profiles that were requested in the Data Profile task. To view
the computed results for the profile, select the profile in the Profiles pane and the results will appear in the
other panes of the viewer.
• Results pane: The Results pane uses a single row to summarize the computed results of the profile. For
example, if you request a Column Length Distribution Profile, this row includes the minimum and maximum
length, and the row count. For most profiles, you can select this row in the Results pane to see additional
detail in the optional Details pane.
• Details pane: For most profile types, the Details pane displays additional information about the profile
results selected in the Results pane. For example, if you request a Column Length Distribution Profile, the
Details pane displays each column length that was found. The pane also displays the number and
percentage of rows in which the column value has that column length.
For the three profile types that are computed against more than one column (Candidate Key, Functional
Dependency, and Value Inclusion), the Details pane displays violations of the expected relationship. For
example, if you request a Candidate Key Profile, the Details pane displays duplicate values that violate the
uniqueness of the candidate key.
If the data source that is used to compute the profile is available, you can double-click a row in the Details
pane to see the matching rows of data in the Drill-Down pane.
• Drill-Down pane: You can double-click a row in the Details pane to see the matching rows of data in the
Drill-Down pane when the following conditions are true: The data source that is used to compute the profile
is available and you have permission to view the data.
To connect to the source database for a Drill-Down request, the Data Profile Viewer uses Windows
Authentication and the credentials of the current user. The Data Profile Viewer does not use the connection
information that is stored in the package that ran the Data Profiling task.
1. Open the Data Profile Viewer, navigate to the ProfilingCustomers.xml file, and open it. Now you can
start analyzing the results.
2. On the left, in the Profiles pane, select, for example, the Column Value Distribution Profiles. In the upper-
right pane, select the Occupation column. In the middle-right window, you should see the distribution for
the Occupation attribute. Click the value that has very low frequency (the Profesional value). Find the drill-
down button in the upper-right corner of the middle-right window. Click it, and in the lower-right pane,
check the row with this suspicious value.
3. Check the Column Pattern Profiles. Note that for the EmailAddress column, the Data Profiling task shows
you the regular expression patterns for this column. Note that these two regular expressions are the
regular expressions you used when you prepared a DQS knowledge base in the previous Labs.
4. Also check the other profiles. When you are done checking, close the Data Profile Viewer.
What information have you been able to glean from the data profiling you just performed on your
dirty DimCustomer data?
SSIS incorporates a DQS Cleansing transformation which uses Data Quality Services (DQS) to correct data.
To run the DQS Cleansing transformation, we need to create the DQS knowledge base in advance.
Some advanced configuration options for the DQS Cleansing transformations are:
• Standardize output: Use this option to standardize the output as you define in domain settings. You can
standardize strings to uppercase or lowercase, or capitalize each word. In addition, the values are
standardized to the leading value.
• Confidence: If you select this option, you also get the confidence level for a correction or a suggestion.
• Appended data: This setting is valid only if you are using a reference data provider. Some reference data
providers can include additional information, such as geographic coordinates when you check an address.
You get this additional information in the field called Appended Data.
• Appended data schema: If you append a reference provider’s data, you get information about the
schema for this data in this field.
The DQS Cleansing transformation, like other transformations, includes an error output allowing you to
handle potential row-level errors.
In this exercise, we will try to improve the quality of our data by using the DQS Cleansing transformation in
SSIS. First we will need to prepare our Clean and Dirty Data and create a couple of tables we will need later.
1. Open SSMS, connect to your SQL Server instance, open a new query window, and change the context
to the DQS_STAGING_DATA database.
2. Create a table for clean customer data. Name it CustomersCleanT. Include only columns for the
customer key, full name, and street address. Use the following code.
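The referenced code is not reproduced in this handout; a minimal sketch of the table definition, with the column types being assumptions, might look like this:

```sql
-- Sketch: clean customers table (column types are assumptions)
CREATE TABLE dbo.CustomersCleanT
(
  CustomerKey   INT NOT NULL PRIMARY KEY,
  FullName      NVARCHAR(200) NULL,
  StreetAddress NVARCHAR(200) NULL
);
```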
3. Populate the table with every tenth customer from the DimCustomer table from the AdventureWorksDW
database by using the following query.
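The referenced query is not reproduced here; one possible version, assuming the AdventureWorksDW2012 database name and the DimCustomer columns FirstName, LastName, and AddressLine1, is sketched below:

```sql
-- Sketch: load every tenth customer (CustomerKey divisible by 10).
-- Database name and name-concatenation logic are assumptions.
INSERT INTO dbo.CustomersCleanT
  (CustomerKey, FullName, StreetAddress)
SELECT CustomerKey,
       FirstName + N' ' + LastName AS FullName,
       AddressLine1 AS StreetAddress
FROM AdventureWorksDW2012.dbo.DimCustomer
WHERE CustomerKey % 10 = 0;
```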
4. Create a table with a structure similar to the one for CustomersCleanT and call it CustomersDirtyT. Add
two integer columns to this table, called Updated and CleanCustomerKey. The first will be used by
the query that makes the data dirty, and the second to populate the table with the customer key from
the clean table after identity mapping (the process of linking records from an input data source to
corresponding records in a reference data source).
Populate this table with the same data, but multiply the CustomerKey column by -1 and populate the
Updated column with zero. Do not populate the CleanCustomerKey column. This will allow you to track
the accuracy of matches in later exercises. Use the following code.
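The referenced code is not reproduced here; a sketch of the table definition and initial load, with column types assumed to mirror CustomersCleanT, could be:

```sql
-- Sketch: dirty customers table and initial load
-- (column types are assumptions; CleanCustomerKey stays NULL for now)
CREATE TABLE dbo.CustomersDirtyT
(
  CustomerKey      INT NOT NULL PRIMARY KEY,
  FullName         NVARCHAR(200) NULL,
  StreetAddress    NVARCHAR(200) NULL,
  Updated          INT NULL,
  CleanCustomerKey INT NULL
);

INSERT INTO dbo.CustomersDirtyT
  (CustomerKey, FullName, StreetAddress, Updated)
SELECT CustomerKey * (-1), FullName, StreetAddress, 0
FROM dbo.CustomersCleanT;
```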
To create our dirty data, we will execute the queries in the createDiryData.sql file.
5. Check the dirty data after the changes. A little more than 40 percent of the rows should be updated. Because
there is randomness in the updates, you get a different number of rows and different rows updated every
time you run the code. You can check the changes with the following query.
SELECT
C.FullName
,D.FullName
,C.StreetAddress
,D.StreetAddress
,D.Updated
FROM dbo.CustomersCleanT AS C
INNER JOIN dbo.CustomersDirtyT AS D
ON C.CustomerKey = D.CustomerKey * (-1)
WHERE C.FullName <> D.FullName
OR C.StreetAddress <> D.StreetAddress
ORDER BY D.Updated DESC;
6. Finally, update the row for the customer with a key equal to -11010. Set the FullName to Jacquelyn
Suarez and the StreetAddress to the all-lowercase value 7800 corrinne ct. This gives you a row whose
street address can be corrected with the DQS Cleansing transformation in the next practice. Use the following code.
UPDATE dbo.CustomersDirtyT
SET FullName = N'Jacquelyn Suarez',
StreetAddress = N'7800 corrinne ct.'
WHERE CustomerKey = -11010;
7. Create a new table in the DQS_STAGING_DATA database in the dbo schema and name it
CustomersDirtyMatchT. Use the following code.
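The referenced code is not reproduced here; a sketch follows. The Record_Status column is included to receive the DQS "Record Status" output mentioned in the destination-mapping steps later in this lab; its type and length, like the others, are assumptions:

```sql
-- Sketch: destination table for matched rows
-- (Record_Status receives the DQS "Record Status" output column;
--  column types and lengths are assumptions)
CREATE TABLE dbo.CustomersDirtyMatchT
(
  CustomerKey      INT NOT NULL PRIMARY KEY,
  FullName         NVARCHAR(200) NULL,
  StreetAddress    NVARCHAR(200) NULL,
  Updated          INT NULL,
  CleanCustomerKey INT NULL,
  Record_Status    NVARCHAR(100) NULL
);
```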
8. Add another new table in the dbo schema and name it CustomersDirtyNoMatchT. Use the same
schema as for the previous table.
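Following the same assumed schema as the previous sketch, the second table could be created like this:

```sql
-- Sketch: same assumed schema as CustomersDirtyMatchT, different name
CREATE TABLE dbo.CustomersDirtyNoMatchT
(
  CustomerKey      INT NOT NULL PRIMARY KEY,
  FullName         NVARCHAR(200) NULL,
  StreetAddress    NVARCHAR(200) NULL,
  Updated          INT NULL,
  CleanCustomerKey INT NULL,
  Record_Status    NVARCHAR(100) NULL
);
```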
Now that our data and tables are prepared, we will create an SSIS flow to clean the dirty data.
1. Create a new package in your integration project from the first exercise. You can name the package
DQSCleansing.
2. Drag a data flow task to the control flow working area. Click the Data Flow tab to open the data flow
working area.
3. Right-click the Connection Managers folder in Solution Explorer and select New Connection Manager.
4. Select the OLEDB connection manager type and click Add. In the Configure OLE DB Connection
Manager window, click New.
5. Select Native OLE DB\SQL Server Native Client 11.0 Provider. Provide the name of your SQL Server
instance and authentication information, and select the DQS_STAGING_DATA database. Click OK.
When you are back in the Configure OLE DB Connection Manager window, click OK.
6. Add an OLE DB source to your data flow. Rename it CustomersDirty. Double-click it to open the
editor. Select the CustomersDirtyT table. Click the Columns tab in the left pane to get the column
mappings. Check the mappings and click OK.
7. In the SSIS Toolbox, expand Other Transforms. Drag the DQS Cleansing transformation to the data flow.
Connect it to the CustomersDirty data source with the normal data flow (gray arrow). Rename the
transformation CleanseStreetAddress. Double-click it to open the editor.
8. In the DQS Cleansing Transformation Editor, make sure that the Connection Manager tab is selected.
Then click New. Enter your server name in the DQS Cleansing Connection Manager pop-up window and
then click OK.
9. Select the Customers knowledge base we created in Lab 2. Click the Mapping tab. In the Available
Input Columns list, select the check box near the StreetAddress column. Then map it to the
StreetAddress domain and rename the StreetAddress_Output column to StreetAddress. We are going
to use this column with corrected data in the downstream flow.
10. Click the Advanced tab. Select the Confidence and Reason check boxes. Then click OK.
11. In preparation for our next lab (Matching Data with DQS), we will perform exact matches. Drag the
Lookup transformation to the working area and connect it to the DQS Cleansing transformation. Name it
exact matches and double-click it to open the editor.
12. In the Lookup Transformation Editor, click the Connection tab in the left pane. Use the connection
manager for the DQS_STAGING_DATA database. Select the dbo.CustomersCleanT table. Then click the
Columns tab.
13. Drag the FullName and StreetAddress columns from Available Input Columns onto the columns with the
same name in the Available Lookup Columns table. Select the check box near the CustomerKey column
in the Available Lookup Columns table. In the Lookup Operation field in the grid in the bottom part of the
editor, select the Replace ‘CleanCustomerKey’ option. Rename the output alias CleanCustomerKey.
14. Click the General tab. For Specify How To Handle Rows With No Matching Entities, select the Redirect
Rows To No Match Output option. Then click OK to close the Lookup Transformation Editor.
15. Drag two Multicast transformations to the working area. Rename the first one match and the second
one no match. Connect them to the Lookup transformation, the first by using the Lookup Match Output
and the second by using the Lookup No Match Output. We do not need to multicast the data for now;
however, we are going to expand the package later.
16. Add a new OLE DB destination and rename it CustomersDirtyMatch. Connect it to the Match Multicast
transformation. Double-click it to open the editor. Select the dbo.CustomersDirtyMatchT table. Click the
Mappings tab to check the mappings. Note that the last column is ignored on the input side. Click the
<Ignore> value and select the Record Status input column to map it to the Record_Status output
column. Click OK.
17. Add a new OLE DB destination and rename it CustomersDirtyNoMatch. Connect it to the No Match
Multicast transformation. Double-click it to open the editor. Select the dbo.CustomersDirtyNoMatchT
table. Click the Mappings tab to check the mappings. Note that the last column is ignored on the input
side. Click the <Ignore> value and select the Record Status input column to map it to the Record_Status
output column. Click OK.
18. Save the project. Execute the package and resolve any errors. When the package executes successfully,
check the content of the dbo.CustomersDirtyMatchT and dbo.CustomersDirtyNoMatchT tables. Check
the corrected StreetAddress column value for the customer with CustomerKey equal to -11010.
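The checks in the step above can be sketched with queries like the following. Which table row -11010 lands in depends on whether the exact lookup on FullName and StreetAddress succeeded after cleansing, so both tables are queried here; the Record_Status column name assumes the destination-table sketches used earlier:

```sql
-- Illustrative post-execution checks (assumes the table sketches above)
SELECT COUNT(*) AS MatchedRows   FROM dbo.CustomersDirtyMatchT;
SELECT COUNT(*) AS UnmatchedRows FROM dbo.CustomersDirtyNoMatchT;

-- Inspect the cleansed street address for the customer updated earlier
SELECT CustomerKey, FullName, StreetAddress, Record_Status
FROM dbo.CustomersDirtyNoMatchT
WHERE CustomerKey = -11010;
```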
Summarize the general steps to follow to cleanse data using SSIS and DQS, and illustrate them using the
exercise above. Describe and comment on the results of executing your data flow.