ADF Scenarios
Step 6: By default, the aggregate transformation outputs only the columns used in its group-by and aggregate expressions and blocks every other column. Here, however, we are using the aggregate to filter out non-distinct rows, so we need every column from the original dataset. To do this, go to the aggregate settings and choose the column pattern.
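For reference, the resulting transformation can also be expressed in data flow script. The snippet below is only a rough sketch and assumes the group-by was configured in the earlier steps as a hash across all columns; YourSource, mycols and DistinctRows are illustrative names. The column pattern matches every column ($$) and keeps its first value:

YourSource aggregate(groupBy(mycols = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows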
Step 7: That's all you need to do to find distinct rows in your data. Click on the Data preview tab to see the result; you can see that the duplicate rows have been removed.
Step 8: A row count is just another aggregate transformation. To create a row count, go to the Aggregate settings and use the function count(1). This will create a running count of every row.
4. How to capture and persist Azure Data Factory pipeline errors to an Azure SQL Database table
To recap the tables needed for this process, I have included the diagram below, which illustrates how the pipeline_parameter, pipeline_log, and pipeline_errors tables are interconnected.
The following script creates the pipeline_parameter table, in which column parameter_id is the primary key. Note that this table drives the meta-data ETL approach.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
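-- A minimal sketch of the pipeline_parameter table: parameter_id is the primary key;
-- src_name and dst_name are assumed from the item().src_name / item().dst_name properties
-- used later in the pipeline parameters. Adjust columns and data types to your own
-- metadata-driven design.
CREATE TABLE [dbo].[pipeline_parameter](
    [parameter_id] [int] IDENTITY(1,1) NOT NULL,
    [src_name]     [nvarchar](500) NULL,   -- source table/object name
    [dst_name]     [nvarchar](500) NULL,   -- destination table/object name
    CONSTRAINT [PK_pipeline_parameter] PRIMARY KEY CLUSTERED ([parameter_id] ASC)
)
GO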
The following script creates the pipeline_log table for capturing the Data Factory success logs. In this table, column log_id is the primary key and column parameter_id is a foreign key referencing column parameter_id of the pipeline_parameter table.
SET QUOTED_IDENTIFIER ON
GO
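-- A minimal sketch of the pipeline_log table: log_id is the primary key and parameter_id is
-- a foreign key to pipeline_parameter. The audit columns mirror the stored procedure
-- parameters listed later in this section; the data types are assumptions.
CREATE TABLE [dbo].[pipeline_log](
    [log_id]               [int] IDENTITY(1,1) NOT NULL,
    [parameter_id]         [int] NULL,
    [DataFactory_Name]     [nvarchar](500) NULL,
    [Pipeline_Name]        [nvarchar](500) NULL,
    [RunId]                [nvarchar](500) NULL,
    [Source]               [nvarchar](500) NULL,
    [Destination]          [nvarchar](500) NULL,
    [TriggerType]          [nvarchar](500) NULL,
    [TriggerTime]          [nvarchar](500) NULL,
    [rowsCopied]           [nvarchar](500) NULL,
    [RowsRead]             [nvarchar](500) NULL,
    [copyDuration_in_secs] [nvarchar](500) NULL,
    -- ...remaining audit columns (parallel copies, integration runtime, source/sink types,
    -- execution status, start/end times and detailed durations) follow the same pattern...
    CONSTRAINT [PK_pipeline_log] PRIMARY KEY CLUSTERED ([log_id] ASC),
    CONSTRAINT [FK_pipeline_log_pipeline_parameter] FOREIGN KEY ([parameter_id])
        REFERENCES [dbo].[pipeline_parameter] ([parameter_id])
)
GO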
The following script creates the pipeline_errors table for capturing the Data Factory error details from failed pipeline activities. In this table, column error_id is the primary key and column parameter_id is a foreign key referencing column parameter_id of the pipeline_parameter table.
SET QUOTED_IDENTIFIER ON
GO
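-- A minimal sketch of the pipeline_errors table: error_id is the primary key and
-- parameter_id is a foreign key to pipeline_parameter. The error columns mirror the stored
-- procedure parameters listed later in this section; the data types are assumptions.
CREATE TABLE [dbo].[pipeline_errors](
    [error_id]         [int] IDENTITY(1,1) NOT NULL,
    [parameter_id]     [int] NULL,
    [DataFactory_Name] [nvarchar](500) NULL,
    [Pipeline_Name]    [nvarchar](500) NULL,
    [RunId]            [nvarchar](500) NULL,
    [Source]           [nvarchar](500) NULL,
    [Destination]      [nvarchar](500) NULL,
    [Execution_Status] [nvarchar](500) NULL,
    [ErrorCode]        [nvarchar](500) NULL,
    [ErrorDescription] [nvarchar](max) NULL,
    [ErrorLoggedTime]  [nvarchar](500) NULL,
    [FailureType]      [nvarchar](500) NULL,
    -- ...remaining columns (trigger details, copy metrics) follow the same pattern...
    CONSTRAINT [PK_pipeline_errors] PRIMARY KEY CLUSTERED ([error_id] ASC),
    CONSTRAINT [FK_pipeline_errors_pipeline_parameter] FOREIGN KEY ([parameter_id])
        REFERENCES [dbo].[pipeline_parameter] ([parameter_id])
)
GO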
With the tables in place, we can create the necessary stored procedures. Let's begin with the following script, which creates a stored procedure to update the pipeline_log table with data from a successful pipeline run. Note that this stored procedure will be called from the Data Factory pipeline at run-time.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
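-- A minimal sketch of the success-logging procedure. The procedure name is assumed by
-- symmetry with the UpdateErrorTable procedure referenced later; only a subset of the
-- parameters from the list later in this section is shown, the rest follow the same pattern.
CREATE PROCEDURE [dbo].[UpdateLogTable]
    @DataFactory_Name     nvarchar(500),
    @Pipeline_Name        nvarchar(500),
    @RunId                nvarchar(500),
    @Source               nvarchar(500),
    @Destination          nvarchar(500),
    @TriggerType          nvarchar(500),
    @TriggerTime          nvarchar(500),
    @rowsCopied           nvarchar(500),
    @RowsRead             nvarchar(500),
    @copyDuration_in_secs nvarchar(500)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[pipeline_log]
        (DataFactory_Name, Pipeline_Name, RunId, Source, Destination,
         TriggerType, TriggerTime, rowsCopied, RowsRead, copyDuration_in_secs)
    VALUES
        (@DataFactory_Name, @Pipeline_Name, @RunId, @Source, @Destination,
         @TriggerType, @TriggerTime, @rowsCopied, @RowsRead, @copyDuration_in_secs);
END
GO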
Next, the following script creates a stored procedure to update the pipeline_errors table with detailed error data from a failed pipeline run. Note that this stored procedure will also be called from the Data Factory pipeline at run-time.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
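-- A minimal sketch of the UpdateErrorTable procedure. Only a subset of the parameters from
-- the error parameter list later in this section is shown; the rest follow the same pattern.
CREATE PROCEDURE [dbo].[UpdateErrorTable]
    @DataFactory_Name nvarchar(500),
    @Pipeline_Name    nvarchar(500),
    @RunId            nvarchar(500),
    @Source           nvarchar(500),
    @Destination      nvarchar(500),
    @Execution_Status nvarchar(500),
    @ErrorCode        nvarchar(500),
    @ErrorDescription nvarchar(max),
    @ErrorLoggedTime  nvarchar(500),
    @FailureType      nvarchar(500)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[pipeline_errors]
        (DataFactory_Name, Pipeline_Name, RunId, Source, Destination,
         Execution_Status, ErrorCode, ErrorDescription, ErrorLoggedTime, FailureType)
    VALUES
        (@DataFactory_Name, @Pipeline_Name, @RunId, @Source, @Destination,
         @Execution_Status, @ErrorCode, @ErrorDescription, @ErrorLoggedTime, @FailureType);
END
GO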
Recall from the earlier process of loading SQL Server objects to ADLS Gen2 that we used a source SQL Server table, moved it to Data Lake Storage Gen2, and ultimately loaded it into Synapse DW. Based on this process, we will need to test a known error within the Data Factory pipeline. It is known that, generally, a varchar(max) column containing 8000+ characters will fail when being loaded into Synapse DW, since varchar(max) is an unsupported data type there. This makes it a good test case, so I have created a table, dbo.MyErrorTable, containing a varchar(max) column.
Within dbo.MyErrorTable I added a large block of text, randomly chosen sample text from Roma: The Novel of Ancient Rome by Steven Saylor. After doing some editing of the text, I confirmed that col1 contains 8001 words, which is sure to fail my Azure Data Factory pipeline and trigger a record to be created in the pipeline_errors table.
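If you want to reproduce a similar failing table yourself, a rough sketch such as the following would work; the actual demonstration table was populated with the book text rather than generated characters.

CREATE TABLE [dbo].[MyErrorTable](
    [col1] [varchar](max) NULL
)
GO

-- Insert a value far larger than 8,000 characters to force the failure on the Synapse load
INSERT INTO [dbo].[MyErrorTable] ([col1])
VALUES (REPLICATE(CAST('sample text ' AS varchar(max)), 1000));
GO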
Next, we add the source tables to the pipeline_parameter table. For this demonstration I have added the error table that we created in the previous step, along with a regular table that we would expect to succeed. We can then see that the destination ADLS Gen2 container has both of the tables in snappy compressed parquet format. As an additional verification step, we can see that the folder contains the expected parquet file.
An earlier process, which loads Data Lake files into Azure Synapse Analytics using Azure Data Factory, covers the details of how to build this pipeline. To recap the process, the select query within the Lookup activity gets the list of parquet files that need to be loaded to Synapse DW and then passes them to the ForEach loop activity, which loads them into Synapse DW while auto-creating the tables. If the Copy-Table activity succeeds, it will log the pipeline run data to the pipeline_log table. However, if the Copy-Table activity fails, it will log the pipeline error details to the pipeline_errors table.
Below are the stored procedure parameters that update the pipeline_log table, along with the source of each value:
Name Value
DataFactory_Name @{pipeline().DataFactory}
Pipeline_Name @{pipeline().Pipeline}
RunId @{pipeline().RunId}
Source @{item().src_name}
Destination @{item().dst_name}
TriggerType @{pipeline().TriggerType}
TriggerId @{pipeline().TriggerId}
TriggerName @{pipeline().TriggerName}
TriggerTime @{pipeline().TriggerTime}
rowsCopied @{activity('Copy-Table').output.rowsCopied}
RowsRead @{activity('Copy-Table').output.rowsRead}
No_ParallelCopies @{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs @{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime @{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type @{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type @{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status @{activity('Copy-Table').output.executionDetails[0].status}
CopyActivity_Start_Time @{activity('Copy-Table').output.executionDetails[0].start}
CopyActivity_End_Time @{utcnow()}
CopyActivity_queuingDuration_in_secs @{activity('Copy-Table').output.executionDetails[0].detailedDurations.queuingDuration}
CopyActivity_transferDuration_in_secs @{activity('Copy-Table').output.executionDetails[0].detailedDurations.transferDuration}
Configure Stored Procedure to Update the Error Table
The last stored procedure within the ForEach loop activity is the UpdateErrorTable stored procedure that we created earlier; it will be called by the on-failure stored procedure activity. Below are the stored procedure parameters that update the pipeline_errors table, along with the source of each value:
Name Value
DataFactory_Name @{pipeline().DataFactory}
Pipeline_Name @{pipeline().Pipeline}
RunId @{pipeline().RunId}
Source @{item().src_name}
Destination @{item().dst_name}
TriggerType @{pipeline().TriggerType}
TriggerId @{pipeline().TriggerId}
TriggerName @{pipeline().TriggerName}
TriggerTime @{pipeline().TriggerTime}
No_ParallelCopies @{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs @{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime @{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type @{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type @{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status @{activity('Copy-Table').output.executionDetails[0].status}
ErrorCode @{activity('Copy-Table').error.errorCode}
ErrorDescription @{activity('Copy-Table').error.message}
ErrorLoggedTime @utcnow()
FailureType @concat(activity('Copy-Table').error.message,'failureType:',activity('Copy-Table').error.failureType)
As we can see from the debug mode output log, one table succeeded and the other failed, as expected.
Verify the Results
Finally, let's verify the results in the pipeline_log table. As we can see, the pipeline_log table captured a record for the successful table load. And the pipeline_errors table now has one record for MyErrorTable, along with the detailed error information.
As a final check, when I navigate to the Synapse DW, I can see that both tables have been auto-created, despite the fact that one load failed and one succeeded. However, data was only loaded into MyTable; MyErrorTable was created but contains no data.
How to implement Slowly Changing Dimension Type 1 in Azure Data Factory
In this article, I will talk about how we can implement a slowly changing dimension (SCD Type 1).
One question that may arise in your mind is: what is a slowly changing dimension, and why do we categorize it as Type 1, Type 2, and Type 3? I will give you a high-level overview. Let's try to understand this with the diagram below.
SCD types
So, this is a technique for loading data incrementally from one system (the source) into another system (the sink). How the load is performed depends on the organisation, that is, whether they want to preserve history or not. It totally depends on the situation and the business needs.
Now, imagine the data in the source has changed and we have loaded that data into our stage table (STG_Customer). For ID 1, the country changed from India to London. ID 2 and ID 3 remain the same, whereas ID 4 and ID 5 are new records.
Let's run the data flow one more time. This time, because Cust_Number is already present for ID 1, the flow will update the record for ID 1, whereas ID 4 and ID 5 are new records and will be inserted into the DIM_CUSTOMER table; ID 2 and ID 3 will remain unchanged.
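For reference, the upsert behaviour described above, updating matched Cust_Number records and inserting new ones, corresponds to the following T-SQL MERGE. This is only an illustrative sketch: Cust_Name is an assumed column, and in the actual implementation the same logic is handled inside the mapping data flow (for example with an Alter Row transformation) rather than by hand-written SQL.

MERGE INTO [dbo].[DIM_CUSTOMER] AS dim
USING [dbo].[STG_Customer]      AS stg
    ON dim.Cust_Number = stg.Cust_Number
WHEN MATCHED THEN
    -- SCD Type 1: overwrite the existing attribute values, no history is preserved
    UPDATE SET dim.Cust_Name = stg.Cust_Name,
               dim.Country   = stg.Country
WHEN NOT MATCHED BY TARGET THEN
    -- New customer numbers (ID 4 and ID 5 in the example) become new dimension rows
    INSERT (Cust_Number, Cust_Name, Country)
    VALUES (stg.Cust_Number, stg.Cust_Name, stg.Country);
GO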
Q: How do you copy data from Azure Blob Storage (a text file) to an Azure SQL Database table?
Ans:
Set up the Data Factory pipeline which will be used to copy data from the blob
storage to the Azure SQL Database.
1.Navigate to the Azure Portal, open your Data Factory, and select the Author & Monitor option.
2.From the Data Factory overview page, select the Copy Data option.
3.Enter a unique name for the copy activity and select whether to schedule it or execute it once -> Click Next
4.Select the type of source data store to link to in order to construct a Linked Service
(Azure Blob Storage) and click Continue
5.A new Linked Service window will open -> enter a unique name and other details -
> test the connection and then click Create.
6.Specify the input data file or folder (dataset), choosing to copy the files recursively if you need to copy many files, and click Next.
7.Select Azure SQL Database from the list of New Linked Services, then click
Continue to configure the new Linked service.
8.Fill out all of the information in the New Linked Service window -> Test
Connection -> Create
9.Create the sink dataset with the destination database table specified.
10.Copy Data Tool -> Settings -> Next -> Review all the copy configuration from
the Summary Window -> Next -> Finish
The Data Factory will create all pipeline components before running the pipeline. Executing
the pipeline involves running the copy activity, which copies the data from the input file in
Azure Blob Storage and writes it to the Azure SQL Database table.
Question 1: Assume that you are a data engineer for company ABC. The company wants to migrate from their on-premises environment to the Microsoft Azure cloud, and you will probably use Azure Data Factory for this purpose. You have created a pipeline that copies the data of one table from on-premises to the Azure cloud. What are the necessary steps you need to take to ensure this pipeline gets executed successfully?
The company has made a very good decision in moving from the traditional on-premises database to the cloud. As we have to move data from the on-premises location to the cloud, we need an integration runtime, because the AutoResolve integration runtime provided by Azure Data Factory cannot connect to on-premises systems. Hence, as step 1, we should create our own self-hosted integration runtime. This can be done in two ways:
The first way is to have a virtual machine ready in the cloud and install the integration runtime on it.
The second way is to take a machine on the on-premises network and install the integration runtime there.
Go to the Azure Data Factory portal. In the Manage tab, select Integration runtimes.
1.Create a self-hosted integration runtime by simply giving general information like a name and description.
2.Create an Azure VM (if you already have one, you can skip this step).
3.Download the integration runtime software on the Azure virtual machine and install it.
4.Copy the auto-generated key from step 1 and paste it into the newly installed integration runtime on the Azure VM.
You can follow this link for a detailed step-by-step guide to the process of installing the self-hosted integration runtime: How to Install Self-Hosted Integration Runtime on Azure vm – AzureLib.
Once your integration runtime is ready, we move on to linked service creation. Create the linked service which connects to your data source, using the integration runtime created above.
After this we will create the pipeline. Your pipeline will have a copy activity where the source is the database available at the on-premises location, while the sink is the database available in the cloud.
Question 2: Assume that you are working for company ABC as a data engineer. You have successfully created a pipeline needed for migration, and it is working fine in your development environment. How would you deploy this pipeline to production with minimal or no changes?
When you create a pipeline for migration, or for any other purpose like ETL, most of the time it will use a data source. In the scenario above we are doing a migration, so the pipeline definitely uses a data source on the source side and, similarly, a data source on the destination side, and we need to move the data from source to destination. As described in the question, the data engineer has developed the pipeline successfully in the development environment, so it is safe to assume that both the source-side and destination-side data sources are pointing to the development environment. The pipeline will have a copy activity which uses datasets, with the help of linked services, for the source and the sink.
A linked service provides the way to connect to a data source by supplying details like the server address, port number, username, password, key, or other credential-related information.
In this case, our linked services are probably pointing to the development environment only. Before the production deployment we may need to do a couple of other deployments as well, such as to a testing or UAT environment. Hence we need to design our Azure Data Factory pipeline components in such a way that the environment-related information is provided dynamically, as parameters. There should be no hard-coding of this kind of information.
We need to create an ARM template for our pipeline. The ARM template needs to have a definition for all the constituents of the pipeline: linked services, datasets, activities, and the pipeline itself.
Once the ARM template is ready, it should be checked in to the Git repository. A lead or admin will create the DevOps pipeline, which takes this ARM template and a parameter file as input. The DevOps pipeline will deploy the ARM template and create all the resources, such as the linked services, datasets, activities, and your data pipeline, in the production environment.
Question 3: Assume that you have around 1 TB of data stored in Azure Blob Storage. This data is in multiple CSV files. You are asked to do a couple of transformations on this data, as per the business logic and needs, before moving it into the staging container. How would you plan and architect the solution for this scenario? Explain with details.
First of all, we need to analyze the situation. If you look closely at the size of the data, you will find that it is very large. Doing the transformations directly on such a huge volume of data could be a very cumbersome and time-consuming process, so we should think about a big data processing mechanism where we can leverage parallel and distributed computing. Here we have two choices:
1.We can use Hadoop MapReduce, through the HDInsight capability, for doing the transformation.
2.We can use Spark, through Azure Databricks, for doing the transformation on such a huge scale of data.
Of these two, Spark on Azure Databricks is the better choice because Spark is much faster than Hadoop due to in-memory computation. So let's choose Azure Databricks as the option.
Next, we need to create the pipeline in Azure Data Factory. The pipeline should use a Databricks Notebook activity. We can write all the business-related transformation logic in the Spark notebook, which can be written in Python, Scala, SQL, or R.
When you execute the pipeline, it triggers the Azure Databricks notebook, and your transformation logic runs as defined in the notebook. In the notebook itself, you can write the logic to store the output in the blob storage staging area.
Question 4: Assume that you have an IoT device enabled on your vehicle. This device sends data from the vehicle every hour, and the data is stored in a blob storage location in Microsoft Azure. You have to move this data from the storage location into a SQL database. How would you design the solution? Explain your reasoning.
This looks like a typical incremental load scenario. As described in the problem statement, the IoT device writes data to the storage location every hour. It is most likely that the device is sending JSON data to the cloud storage (as most IoT devices generate data in JSON format), and it probably writes a new JSON file every time data is sent from the device to the cloud. Hence we will have a number of files in the storage location, generated on an hourly basis, and we need to pull these files into the Azure SQL database.
We need to create a pipeline in Azure Data Factory which does the incremental load, and we can use the conventional high watermark file mechanism to solve this problem.
The high watermark design is as follows:
1.Create a file named, let's say, HighWaterMark.txt and store it somewhere in Azure blob storage. In this file we will put the start date and time.
2.Now create the pipeline in Azure Data Factory. The pipeline's first activity is a Lookup activity, which reads the date from HighWaterMark.txt.
3.Add one more Lookup activity, which returns the current date and time.
4.Add a Copy activity to the pipeline which pulls the JSON files having a created timestamp greater than the high watermark date. In the sink, push the read data into the Azure SQL database.
5.After the Copy activity, add another Copy activity which updates the HighWaterMark.txt file with the current date and time obtained in step 3.
6.Add a trigger to execute this pipeline on an hourly basis.
That’s how we can design the incremental data load solution for the above described
scenario.
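To make steps 2 to 5 concrete, the Copy activity's source can filter the blobs on their last-modified time using the two Lookup outputs. The snippet below is only a sketch: the activity names LookupOldWaterMark and LookupCurrentTime, and the column names Prop_0 (the default column name when a delimited text file has no header row) and CurrentTime, are assumptions rather than fixed names.

Copy activity source settings (Blob storage dataset), Filter by last modified:
  Start time (UTC): @{activity('LookupOldWaterMark').output.firstRow.Prop_0}
  End time (UTC):   @{activity('LookupCurrentTime').output.firstRow.CurrentTime}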