ADF Scenarios
Step 6: By default, the aggregate transformation outputs only the columns used in its group-by and aggregate expressions and blocks every other column. Here, however, we are using the aggregate to filter out non-distinct rows, so we need every column from the original dataset. To do this, go to the aggregate settings and choose the column pattern.
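For reference, the resulting transformation can also be expressed in data flow script. The snippet below is only a rough sketch and assumes the group-by was configured in the earlier steps as a hash across all columns; YourSource, mycols and DistinctRows are illustrative names. The column pattern matches every column ($$) and keeps its first value:

YourSource aggregate(groupBy(mycols = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows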
Step 7: That's all you need to do to find distinct rows in your data. Click on the Data preview tab to see the result; you can see that the duplicate rows have been removed.
Step 8: A row count is just another aggregate transformation. To create a row count, go to the Aggregate settings and use the function count(1). This will create a running count of every row.
4. How to capture and persist Azure Data Factory pipeline errors to an Azure SQL Database table
To recap the tables needed for this process, I have included the diagram below, which illustrates how the pipeline_parameter, pipeline_log, and pipeline_errors tables are interconnected.
The following script creates the pipeline_parameter table, in which column parameter_id is the primary key. Note that this table drives the meta-data ETL approach.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
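-- A minimal sketch of the pipeline_parameter table: parameter_id is the primary key;
-- src_name and dst_name are assumed from the item().src_name / item().dst_name properties
-- used later in the pipeline parameters. Adjust columns and data types to your own
-- metadata-driven design.
CREATE TABLE [dbo].[pipeline_parameter](
    [parameter_id] [int] IDENTITY(1,1) NOT NULL,
    [src_name]     [nvarchar](500) NULL,   -- source table/object name
    [dst_name]     [nvarchar](500) NULL,   -- destination table/object name
    CONSTRAINT [PK_pipeline_parameter] PRIMARY KEY CLUSTERED ([parameter_id] ASC)
)
GO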
The following script creates the pipeline_log table for capturing the Data Factory success logs. In this table, column log_id is the primary key and column parameter_id is a foreign key referencing column parameter_id of the pipeline_parameter table.
SET QUOTED_IDENTIFIER ON
GO
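-- A minimal sketch of the pipeline_log table: log_id is the primary key and parameter_id is
-- a foreign key to pipeline_parameter. The audit columns mirror the stored procedure
-- parameters listed later in this section; the data types are assumptions.
CREATE TABLE [dbo].[pipeline_log](
    [log_id]               [int] IDENTITY(1,1) NOT NULL,
    [parameter_id]         [int] NULL,
    [DataFactory_Name]     [nvarchar](500) NULL,
    [Pipeline_Name]        [nvarchar](500) NULL,
    [RunId]                [nvarchar](500) NULL,
    [Source]               [nvarchar](500) NULL,
    [Destination]          [nvarchar](500) NULL,
    [TriggerType]          [nvarchar](500) NULL,
    [TriggerTime]          [nvarchar](500) NULL,
    [rowsCopied]           [nvarchar](500) NULL,
    [RowsRead]             [nvarchar](500) NULL,
    [copyDuration_in_secs] [nvarchar](500) NULL,
    -- ...remaining audit columns (parallel copies, integration runtime, source/sink types,
    -- execution status, start/end times and detailed durations) follow the same pattern...
    CONSTRAINT [PK_pipeline_log] PRIMARY KEY CLUSTERED ([log_id] ASC),
    CONSTRAINT [FK_pipeline_log_pipeline_parameter] FOREIGN KEY ([parameter_id])
        REFERENCES [dbo].[pipeline_parameter] ([parameter_id])
)
GO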
The following script creates the pipeline_errors table for capturing the Data Factory error details from failed pipeline activities. In this table, column error_id is the primary key and column parameter_id is a foreign key referencing column parameter_id of the pipeline_parameter table.
SET QUOTED_IDENTIFIER ON
GO
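-- A minimal sketch of the pipeline_errors table: error_id is the primary key and
-- parameter_id is a foreign key to pipeline_parameter. The error columns mirror the stored
-- procedure parameters listed later in this section; the data types are assumptions.
CREATE TABLE [dbo].[pipeline_errors](
    [error_id]         [int] IDENTITY(1,1) NOT NULL,
    [parameter_id]     [int] NULL,
    [DataFactory_Name] [nvarchar](500) NULL,
    [Pipeline_Name]    [nvarchar](500) NULL,
    [RunId]            [nvarchar](500) NULL,
    [Source]           [nvarchar](500) NULL,
    [Destination]      [nvarchar](500) NULL,
    [Execution_Status] [nvarchar](500) NULL,
    [ErrorCode]        [nvarchar](500) NULL,
    [ErrorDescription] [nvarchar](max) NULL,
    [ErrorLoggedTime]  [nvarchar](500) NULL,
    [FailureType]      [nvarchar](500) NULL,
    -- ...remaining columns (trigger details, copy metrics) follow the same pattern...
    CONSTRAINT [PK_pipeline_errors] PRIMARY KEY CLUSTERED ([error_id] ASC),
    CONSTRAINT [FK_pipeline_errors_pipeline_parameter] FOREIGN KEY ([parameter_id])
        REFERENCES [dbo].[pipeline_parameter] ([parameter_id])
)
GO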
With the tables in place, we can create the necessary stored procedures. Let's begin with the following script, which creates a stored procedure to update the pipeline_log table with data from a successful pipeline run. Note that this stored procedure will be called from the Data Factory pipeline at run-time.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
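-- A minimal sketch of the success-logging procedure. The procedure name is assumed by
-- symmetry with the UpdateErrorTable procedure referenced later; only a subset of the
-- parameters from the list later in this section is shown, the rest follow the same pattern.
CREATE PROCEDURE [dbo].[UpdateLogTable]
    @DataFactory_Name     nvarchar(500),
    @Pipeline_Name        nvarchar(500),
    @RunId                nvarchar(500),
    @Source               nvarchar(500),
    @Destination          nvarchar(500),
    @TriggerType          nvarchar(500),
    @TriggerTime          nvarchar(500),
    @rowsCopied           nvarchar(500),
    @RowsRead             nvarchar(500),
    @copyDuration_in_secs nvarchar(500)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[pipeline_log]
        (DataFactory_Name, Pipeline_Name, RunId, Source, Destination,
         TriggerType, TriggerTime, rowsCopied, RowsRead, copyDuration_in_secs)
    VALUES
        (@DataFactory_Name, @Pipeline_Name, @RunId, @Source, @Destination,
         @TriggerType, @TriggerTime, @rowsCopied, @RowsRead, @copyDuration_in_secs);
END
GO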
Next, the following script creates a stored procedure to update the pipeline_errors table with detailed error data from a failed pipeline run. Note that this stored procedure will also be called from the Data Factory pipeline at run-time.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
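-- A minimal sketch of the UpdateErrorTable procedure. Only a subset of the parameters from
-- the error parameter list later in this section is shown; the rest follow the same pattern.
CREATE PROCEDURE [dbo].[UpdateErrorTable]
    @DataFactory_Name nvarchar(500),
    @Pipeline_Name    nvarchar(500),
    @RunId            nvarchar(500),
    @Source           nvarchar(500),
    @Destination      nvarchar(500),
    @Execution_Status nvarchar(500),
    @ErrorCode        nvarchar(500),
    @ErrorDescription nvarchar(max),
    @ErrorLoggedTime  nvarchar(500),
    @FailureType      nvarchar(500)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[pipeline_errors]
        (DataFactory_Name, Pipeline_Name, RunId, Source, Destination,
         Execution_Status, ErrorCode, ErrorDescription, ErrorLoggedTime, FailureType)
    VALUES
        (@DataFactory_Name, @Pipeline_Name, @RunId, @Source, @Destination,
         @Execution_Status, @ErrorCode, @ErrorDescription, @ErrorLoggedTime, @FailureType);
END
GO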
Recall from the earlier process of loading SQL Server objects to ADLS Gen2 that we used a source SQL Server table, moved it to Data Lake Storage Gen2, and ultimately loaded it into Synapse DW. Based on this process, we will need to test a known error within the Data Factory pipeline. It is known that, generally, a varchar(max) column containing 8000+ characters will fail when being loaded into Synapse DW, since varchar(max) is an unsupported data type there. This makes it a good test case, so I have created a table, dbo.MyErrorTable, containing a varchar(max) column.
Within dbo.MyErrorTable I added a large block of text, randomly chosen sample text from Roma: The Novel of Ancient Rome by Steven Saylor. After doing some editing of the text, I confirmed that col1 contains 8001 words, which is sure to fail my Azure Data Factory pipeline and trigger a record to be created in the pipeline_errors table.
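If you want to reproduce a similar failing table yourself, a rough sketch such as the following would work; the actual demonstration table was populated with the book text rather than generated characters.

CREATE TABLE [dbo].[MyErrorTable](
    [col1] [varchar](max) NULL
)
GO

-- Insert a value far larger than 8,000 characters to force the failure on the Synapse load
INSERT INTO [dbo].[MyErrorTable] ([col1])
VALUES (REPLICATE(CAST('sample text ' AS varchar(max)), 1000));
GO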
Next, we add the source tables to the pipeline_parameter table. For this demonstration I have added the error table that we created in the previous step, along with a regular table that we would expect to succeed. We can then see that the destination ADLS Gen2 container has both of the tables in snappy compressed parquet format. As an additional verification step, we can see that the folder contains the expected parquet file.
An earlier process, which loads Data Lake files into Azure Synapse Analytics using Azure Data Factory, covers the details of how to build this pipeline. To recap the process, the select query within the Lookup activity gets the list of parquet files that need to be loaded to Synapse DW and then passes them to the ForEach loop activity, which loads them into Synapse DW while auto-creating the tables. If the Copy-Table activity succeeds, it will log the pipeline run data to the pipeline_log table. However, if the Copy-Table activity fails, it will log the pipeline error details to the pipeline_errors table.
Below are the stored procedure parameters that update the pipeline_log table, along with the source of each value:
Name Value
DataFactory_Name @{pipeline().DataFactory}
Pipeline_Name @{pipeline().Pipeline}
RunId @{pipeline().RunId}
Source @{item().src_name}
Destination @{item().dst_name}
TriggerType @{pipeline().TriggerType}
TriggerId @{pipeline().TriggerId}
TriggerName @{pipeline().TriggerName}
TriggerTime @{pipeline().TriggerTime}
rowsCopied @{activity('Copy-Table').output.rowsCopied}
RowsRead @{activity('Copy-Table').output.rowsRead}
No_ParallelCopies @{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs @{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime @{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type @{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type @{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status @{activity('Copy-Table').output.executionDetails[0].status}
CopyActivity_Start_Time @{activity('Copy-Table').output.executionDetails[0].start}
CopyActivity_End_Time @{utcnow()}
CopyActivity_queuingDuration_in_secs @{activity('Copy-Table').output.executionDetails[0].detailedDurations.queuingDuration}
CopyActivity_transferDuration_in_secs @{activity('Copy-Table').output.executionDetails[0].detailedDurations.transferDuration}
Configure Stored Procedure to Update the Error Table
The last stored procedure within the ForEach loop activity is the UpdateErrorTable stored procedure that we created earlier; it will be called by the on-failure stored procedure activity. Below are the stored procedure parameters that update the pipeline_errors table, along with the source of each value:
Name Value
DataFactory_Name @{pipeline().DataFactory}
Pipeline_Name @{pipeline().Pipeline}
RunId @{pipeline().RunId}
Source @{item().src_name}
Destination @{item().dst_name}
TriggerType @{pipeline().TriggerType}
TriggerId @{pipeline().TriggerId}
TriggerName @{pipeline().TriggerName}
TriggerTime @{pipeline().TriggerTime}
No_ParallelCopies @{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs @{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime @{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type @{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type @{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status @{activity('Copy-Table').output.executionDetails[0].status}
ErrorCode @{activity('Copy-Table').error.errorCode}
ErrorDescription @{activity('Copy-Table').error.message}
ErrorLoggedTime @utcnow()
FailureType @concat(activity('Copy-Table').error.message,'failureType:',activity('Copy-Table').error.failureType)
As we can see from the debug mode output log, one table succeeded and the other failed, as expected.
Verify the Results
Finally, let's verify the results in the pipeline_log table. As we can see, the pipeline_log table captured a record for the successful table load. And the pipeline_errors table now has one record for MyErrorTable, along with the detailed error information.
As a final check, when I navigate to the Synapse DW, I can see that both tables have been auto-created, despite the fact that one load failed and one succeeded. However, data was only loaded into MyTable; MyErrorTable was created but contains no data.
How to implement Slowly Changing Dimension Type 1 in Azure Data Factory
In this article, I will talk about how we can implement a slowly changing dimension (SCD Type 1).
One question that may arise in your mind is: what is a slowly changing dimension, and why do we categorize it as Type 1, Type 2, and Type 3? I will give you a high-level overview. Let's try to understand this with the diagram below.
SCD types
So, this is a technique for loading data incrementally from one system (the source) into another system (the sink). How the load is performed depends on the organisation, that is, whether they want to preserve history or not. It totally depends on the situation and the business needs.
Now, imagine the data in the source has changed and we have loaded that data into our stage table (STG_Customer). For ID 1, the country changed from India to London. ID 2 and ID 3 remain the same, whereas ID 4 and ID 5 are new records.
Let's run the data flow one more time. This time, because Cust_Number is already present for ID 1, the flow will update the record for ID 1, whereas ID 4 and ID 5 are new records and will be inserted into the DIM_CUSTOMER table; ID 2 and ID 3 will remain unchanged.
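For reference, the upsert behaviour described above, updating matched Cust_Number records and inserting new ones, corresponds to the following T-SQL MERGE. This is only an illustrative sketch: Cust_Name is an assumed column, and in the actual implementation the same logic is handled inside the mapping data flow (for example with an Alter Row transformation) rather than by hand-written SQL.

MERGE INTO [dbo].[DIM_CUSTOMER] AS dim
USING [dbo].[STG_Customer]      AS stg
    ON dim.Cust_Number = stg.Cust_Number
WHEN MATCHED THEN
    -- SCD Type 1: overwrite the existing attribute values, no history is preserved
    UPDATE SET dim.Cust_Name = stg.Cust_Name,
               dim.Country   = stg.Country
WHEN NOT MATCHED BY TARGET THEN
    -- New customer numbers (ID 4 and ID 5 in the example) become new dimension rows
    INSERT (Cust_Number, Cust_Name, Country)
    VALUES (stg.Cust_Number, stg.Cust_Name, stg.Country);
GO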
Q: How do you copy data from Azure Blob Storage (a text file) to an Azure SQL Database table?
Ans:
Set up the Data Factory pipeline which will be used to copy data from the blob
storage to the Azure SQL Database.
1.Navigate to the Azure Portal, open your Data Factory, and select the Author & Monitor option.
2.From the Data Factory overview page, select the Copy Data option.
3.Enter a unique name for the copy activity and select whether to schedule it or execute it once -> Click Next
4.Select the type of source data store to link to in order to construct a Linked Service
(Azure Blob Storage) and click Continue
5.A new Linked Service window will open -> enter a unique name and other details -
> test the connection and then click Create.
6.Specify the input data file or folder (dataset), choosing to copy the files recursively if you need to copy many files, and click Next.
7.Select Azure SQL Database from the list of New Linked Services, then click
Continue to configure the new Linked service.
8.Fill out all of the information in the New Linked Service window -> Test
Connection -> Create
9.Create the sink dataset with the destination database table specified.
10.Copy Data Tool -> Settings -> Next -> Review all the copy configuration from
the Summary Window -> Next -> Finish
The Data Factory will create all pipeline components before running the pipeline. Executing
the pipeline involves running the copy activity, which copies the data from the input file in
Azure Blob Storage and writes it to the Azure SQL Database table.
Question 1: Assume that you are a data engineer for company ABC. The company wants to migrate from their on-premises environment to the Microsoft Azure cloud, and you will probably use Azure Data Factory for this purpose. You have created a pipeline that copies the data of one table from on-premises to the Azure cloud. What are the necessary steps you need to take to ensure this pipeline gets executed successfully?
The company has made a very good decision in moving from the traditional on-premises database to the cloud. As we have to move data from the on-premises location to the cloud, we need an integration runtime, because the AutoResolve integration runtime provided by Azure Data Factory cannot connect to on-premises systems. Hence, as step 1, we should create our own self-hosted integration runtime. This can be done in two ways:
The first way is to have a virtual machine ready in the cloud and install the integration runtime on it.
The second way is to take a machine on the on-premises network and install the integration runtime there.
Go to the Azure Data Factory portal. In the Manage tab, select Integration runtimes.
1.Create a self-hosted integration runtime by simply giving general information like a name and description.
2.Create an Azure VM (if you already have one, you can skip this step).
3.Download the integration runtime software on the Azure virtual machine and install it.
4.Copy the auto-generated key from step 1 and paste it into the newly installed integration runtime on the Azure VM.
You can follow this link for a detailed step-by-step guide to the process of installing the self-hosted integration runtime: How to Install Self-Hosted Integration Runtime on Azure vm – AzureLib.
Once your integration runtime is ready, we move on to linked service creation. Create the linked service which connects to your data source, using the integration runtime created above.
After this we will create the pipeline. Your pipeline will have a copy activity where the source is the database available at the on-premises location, while the sink is the database available in the cloud.
Question 2: Assume that you are working for company ABC as a data engineer. You have successfully created a pipeline needed for migration, and it is working fine in your development environment. How would you deploy this pipeline to production with minimal or no changes?
When you create a pipeline for migration, or for any other purpose like ETL, most of the time it will use a data source. In the scenario above we are doing a migration, so the pipeline definitely uses a data source on the source side and, similarly, a data source on the destination side, and we need to move the data from source to destination. As described in the question, the data engineer has developed the pipeline successfully in the development environment, so it is safe to assume that both the source-side and destination-side data sources are pointing to the development environment. The pipeline will have a copy activity which uses datasets, with the help of linked services, for the source and the sink.
A linked service provides the way to connect to a data source by supplying details like the server address, port number, username, password, key, or other credential-related information.
In this case, our linked services are probably pointing to the development environment only. Before the production deployment we may need to do a couple of other deployments as well, such as to a testing or UAT environment. Hence we need to design our Azure Data Factory pipeline components in such a way that the environment-related information is provided dynamically, as parameters. There should be no hard-coding of this kind of information.
We need to create an ARM template for our pipeline. The ARM template needs to have a definition for all the constituents of the pipeline: linked services, datasets, activities, and the pipeline itself.
Once the ARM template is ready, it should be checked in to the Git repository. A lead or admin will create the DevOps pipeline, which takes this ARM template and a parameter file as input. The DevOps pipeline will deploy the ARM template and create all the resources, such as the linked services, datasets, activities, and your data pipeline, in the production environment.
Question 3: Assume that you have around 1 TB of data stored in Azure Blob Storage. This data is in multiple CSV files. You are asked to do a couple of transformations on this data, as per the business logic and needs, before moving it into the staging container. How would you plan and architect the solution for this scenario? Explain with details.
First of all, we need to analyze the situation. If you look closely at the size of the data, you will find that it is very large. Doing the transformations directly on such a huge volume of data could be a very cumbersome and time-consuming process, so we should think about a big data processing mechanism where we can leverage parallel and distributed computing. Here we have two choices:
1.We can use Hadoop MapReduce, through the HDInsight capability, for doing the transformation.
2.We can use Spark, through Azure Databricks, for doing the transformation on such a huge scale of data.
Of these two, Spark on Azure Databricks is the better choice because Spark is much faster than Hadoop due to in-memory computation. So let's choose Azure Databricks as the option.
Next, we need to create the pipeline in Azure Data Factory. The pipeline should use a Databricks Notebook activity. We can write all the business-related transformation logic in the Spark notebook, which can be written in Python, Scala, SQL, or R.
When you execute the pipeline, it triggers the Azure Databricks notebook, and your transformation logic runs as defined in the notebook. In the notebook itself, you can write the logic to store the output in the blob storage staging area.
Question 4: Assume that you have an IoT device enabled on your vehicle. This device sends data from the vehicle every hour, and the data is stored in a blob storage location in Microsoft Azure. You have to move this data from the storage location into a SQL database. How would you design the solution? Explain your reasoning.
This looks like a typical incremental load scenario. As described in the problem statement, the IoT device writes data to the storage location every hour. It is most likely that the device is sending JSON data to the cloud storage (as most IoT devices generate data in JSON format), and it probably writes a new JSON file every time data is sent from the device to the cloud. Hence we will have a number of files in the storage location, generated on an hourly basis, and we need to pull these files into the Azure SQL database.
We need to create a pipeline in Azure Data Factory which does the incremental load, and we can use the conventional high watermark file mechanism to solve this problem.
The high watermark design is as follows:
1.Create a file named, let's say, HighWaterMark.txt and store it somewhere in Azure blob storage. In this file we will put the start date and time.
2.Now create the pipeline in Azure Data Factory. The pipeline's first activity is a Lookup activity, which reads the date from HighWaterMark.txt.
3.Add one more Lookup activity, which returns the current date and time.
4.Add a Copy activity to the pipeline which pulls the JSON files having a created timestamp greater than the high watermark date. In the sink, push the read data into the Azure SQL database.
5.After the Copy activity, add another Copy activity which updates the HighWaterMark.txt file with the current date and time obtained in step 3.
6.Add a trigger to execute this pipeline on an hourly basis.
That’s how we can design the incremental data load solution for the above described
scenario.
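To make steps 2 to 5 concrete, the Copy activity's source can filter the blobs on their last-modified time using the two Lookup outputs. The snippet below is only a sketch: the activity names LookupOldWaterMark and LookupCurrentTime, and the column names Prop_0 (the default column name when a delimited text file has no header row) and CurrentTime, are assumptions rather than fixed names.

Copy activity source settings (Blob storage dataset), Filter by last modified:
  Start time (UTC): @{activity('LookupOldWaterMark').output.firstRow.Prop_0}
  End time (UTC):   @{activity('LookupCurrentTime').output.firstRow.CurrentTime}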