
1. Creating an Azure Data Factory using the Azure portal


Step 1: Find "Create a resource" and search for "Data Factory". Click the create icon.
Step 2: Give your data factory a name, select your resource group, choose a location, and choose the version you would like.
Step 3: Click on Create.
Your data factory is now ready to be filled with pipelines and data.

2. Copy data from a SQL Server database to Azure Blob storage


•Create a data factory.
•Create a self-hosted integration runtime.
•Create SQL Server and Azure Storage linked services.
•Create SQL Server and Azure Blob datasets.
•Create a pipeline with a copy activity to move the data.
•Start a pipeline run.
•Monitor the pipeline run.

3. ADF’s Mapping Data Flows – How do you get distinct rows and row counts from the data?
Step 1: Create an Azure Data Factory pipeline.
Step 2: Add a data flow activity and name it “DistinctRows”.
Step 3: Go to Settings and add a new data flow. In the Source Settings tab, add a source transformation and connect it to one of your datasets.
Step 4: The Projection tab allows you to change the column data types. Here I have changed my Emp ID column to Integer.
Step 5: In the Data preview tab you can see your data.
Step 6: Add an Aggregate transformation named “DistinctRows”. In the Group by settings, choose which column or combination of columns will make up the key(s) ADF uses to determine distinct rows; in this demo I pick “Emp ID” as my key column.

Step 7: The inherent nature of the aggregate transformation is to drop all columns not used in the aggregate. But here we are using the aggregate to filter out non-distinct rows, so we need every column from the original dataset. To do this, go to the aggregate settings and choose a column pattern.

Step 8: That’s all you need to do to find distinct rows in your data. Click on the Data preview tab to see the result; you can see the duplicate rows have been removed.
Step 9: The row count is just another aggregate transformation. To create a row count, go to the Aggregate settings and use the function count(1), which produces a count of every row.
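For reference, the output of this data flow is roughly what the following T-SQL would produce against a hypothetical dbo.Employees source table keyed on Emp_ID (the table and column names here are illustrative, not taken from the demo dataset):

-- Illustrative only: dbo.Employees stands in for the data flow source.
-- Distinct rows: keep one row per Emp_ID key, mirroring the aggregate with a column pattern.
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Emp_ID ORDER BY Emp_ID) AS rn
    FROM dbo.Employees
) AS deduped
WHERE rn = 1;

-- Row count: the equivalent of an aggregate that uses count(1).
SELECT COUNT(1) AS row_count
FROM dbo.Employees;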
4. How to capture and persist Azure Data Factory pipeline errors to an Azure SQL Database table

To recap the tables needed for this process, I have included the diagram below, which illustrates how the pipeline_parameter, pipeline_log, and pipeline_errors tables are interconnected.

Create a Parameter Table


The following script will create the pipeline_parameter table with column parameter_id as the primary key. Note that this table drives the metadata-driven ETL approach.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_parameter](
[PARAMETER_ID] [int] IDENTITY(1,1) NOT NULL,
[server_name] [nvarchar](500) NULL,
[src_type] [nvarchar](500) NULL,
[src_schema] [nvarchar](500) NULL,
[src_db] [nvarchar](500) NULL,
[src_name] [nvarchar](500) NULL,
[dst_type] [nvarchar](500) NULL,
[dst_name] [nvarchar](500) NULL,
[include_pipeline_flag] [nvarchar](500) NULL,
[partition_field] [nvarchar](500) NULL,
[process_type] [nvarchar](500) NULL,
[priority_lane] [nvarchar](500) NULL,
[pipeline_date] [nvarchar](500) NULL,
[pipeline_status] [nvarchar](500) NULL,
[load_synapse] [nvarchar](500) NULL,
[load_frequency] [nvarchar](500) NULL,
[dst_folder] [nvarchar](500) NULL,
[file_type] [nvarchar](500) NULL,
[lake_dst_folder] [nvarchar](500) NULL,
[spark_flag] [nvarchar](500) NULL,
[dst_schema] [nvarchar](500) NULL,
[distribution_type] [nvarchar](500) NULL,
[load_sqldw_etl_pipeline_date] [datetime] NULL,
[load_sqldw_etl_pipeline_status] [nvarchar](500) NULL,
[load_sqldw_curated_pipeline_date] [datetime] NULL,
[load_sqldw_curated_pipeline_status] [nvarchar](500) NULL,
[load_delta_pipeline_date] [datetime] NULL,
[load_delta_pipeline_status] [nvarchar](500) NULL,
PRIMARY KEY CLUSTERED
(
[PARAMETER_ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

Create a Log Table


This next script will create the pipeline_log table for capturing the Data Factory success logs. In this table, column log_id is the primary key and column parameter_id is a foreign key with a reference to column parameter_id from the pipeline_parameter table.


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_log](
[LOG_ID] [int] IDENTITY(1,1) NOT NULL,
[PARAMETER_ID] [int] NULL,
[DataFactory_Name] [nvarchar](500) NULL,
[Pipeline_Name] [nvarchar](500) NULL,
[RunId] [nvarchar](500) NULL,
[Source] [nvarchar](500) NULL,
[Destination] [nvarchar](500) NULL,
[TriggerType] [nvarchar](500) NULL,
[TriggerId] [nvarchar](500) NULL,
[TriggerName] [nvarchar](500) NULL,
[TriggerTime] [nvarchar](500) NULL,
[rowsCopied] [nvarchar](500) NULL,
[DataRead] [int] NULL,
[No_ParallelCopies] [int] NULL,
[copyDuration_in_secs] [nvarchar](500) NULL,
[effectiveIntegrationRuntime] [nvarchar](500) NULL,
[Source_Type] [nvarchar](500) NULL,
[Sink_Type] [nvarchar](500) NULL,
[Execution_Status] [nvarchar](500) NULL,
[CopyActivity_Start_Time] [nvarchar](500) NULL,
[CopyActivity_End_Time] [nvarchar](500) NULL,
[CopyActivity_queuingDuration_in_secs] [nvarchar](500) NULL,
[CopyActivity_transferDuration_in_secs] [nvarchar](500) NULL,
CONSTRAINT [PK_pipeline_log] PRIMARY KEY CLUSTERED
(
[LOG_ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[pipeline_log] WITH CHECK ADD FOREIGN KEY([PARAMETER_ID])
REFERENCES [dbo].[pipeline_parameter] ([PARAMETER_ID])
ON UPDATE CASCADE
GO

Create an Error Table


This next script will create a pipeline_errors table which will be used to capture the Data Factory error details from failed pipeline activities. In this table, column error_id is the primary key and column parameter_id is a foreign key with a reference to column parameter_id from the pipeline_parameter table.


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_errors](
[error_id] [int] IDENTITY(1,1) NOT NULL,
[parameter_id] [int] NULL,
[DataFactory_Name] [nvarchar](500) NULL,
[Pipeline_Name] [nvarchar](500) NULL,
[RunId] [nvarchar](500) NULL,
[Source] [nvarchar](500) NULL,
[Destination] [nvarchar](500) NULL,
[TriggerType] [nvarchar](500) NULL,
[TriggerId] [nvarchar](500) NULL,
[TriggerName] [nvarchar](500) NULL,
[TriggerTime] [nvarchar](500) NULL,
[No_ParallelCopies] [int] NULL,
[copyDuration_in_secs] [nvarchar](500) NULL,
[effectiveIntegrationRuntime] [nvarchar](500) NULL,
[Source_Type] [nvarchar](500) NULL,
[Sink_Type] [nvarchar](500) NULL,
[Execution_Status] [nvarchar](500) NULL,
[ErrorDescription] [nvarchar](max) NULL,
[ErrorCode] [nvarchar](500) NULL,
[ErrorLoggedTime] [nvarchar](500) NULL,
[FailureType] [nvarchar](500) NULL,
CONSTRAINT [PK_pipeline_error] PRIMARY KEY CLUSTERED
(
[error_id] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[pipeline_errors] WITH CHECK ADD FOREIGN KEY([parameter_id])
REFERENCES [dbo].[pipeline_parameter] ([PARAMETER_ID])
ON UPDATE CASCADE
GO

Create a Stored Procedure to Update the Log Table


Now that we have all the necessary SQL tables in place, we can begin creating a few necessary stored procedures. Let’s begin with the following script, which will create a stored procedure to update the pipeline_log table with data from the successful pipeline run. Note that this stored procedure will be called from the Data Factory pipeline at runtime.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[sp_UpdateLogTable]
@DataFactory_Name VARCHAR(250),
@Pipeline_Name VARCHAR(250),
@RunId VARCHAR(250),
@Source VARCHAR(300),
@Destination VARCHAR(300),
@TriggerType VARCHAR(300),
@TriggerId VARCHAR(300),
@TriggerName VARCHAR(300),
@TriggerTime VARCHAR(500),
@rowsCopied VARCHAR(300),
@DataRead INT,
@No_ParallelCopies INT,
@copyDuration_in_secs VARCHAR(300),
@effectiveIntegrationRuntime VARCHAR(300),
@Source_Type VARCHAR(300),
@Sink_Type VARCHAR(300),
@Execution_Status VARCHAR(300),
@CopyActivity_Start_Time VARCHAR(500),
@CopyActivity_End_Time VARCHAR(500),
@CopyActivity_queuingDuration_in_secs VARCHAR(500),
@CopyActivity_transferDuration_in_secs VARCHAR(500)
AS
INSERT INTO [pipeline_log]
(
[DataFactory_Name]
,[Pipeline_Name]
,[RunId]
,[Source]
,[Destination]
,[TriggerType]
,[TriggerId]
,[TriggerName]
,[TriggerTime]
,[rowsCopied]
,[DataRead]
,[No_ParallelCopies]
,[copyDuration_in_secs]
,[effectiveIntegrationRuntime]
,[Source_Type]
,[Sink_Type]
,[Execution_Status]
,[CopyActivity_Start_Time]
,[CopyActivity_End_Time]
,[CopyActivity_queuingDuration_in_secs]
,[CopyActivity_transferDuration_in_secs]
)
VALUES
(
@DataFactory_Name
,@Pipeline_Name
,@RunId
,@Source
,@Destination
,@TriggerType
,@TriggerId
,@TriggerName
,@TriggerTime
,@rowsCopied
,@DataRead
,@No_ParallelCopies
,@copyDuration_in_secs
,@effectiveIntegrationRuntime
,@Source_Type
,@Sink_Type
,@Execution_Status
,@CopyActivity_Start_Time
,@CopyActivity_End_Time
,@CopyActivity_queuingDuration_in_secs
,@CopyActivity_transferDuration_in_secs
)
GO
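Before wiring the procedure into the pipeline, it can be smoke-tested manually. The call below is only a sketch with made-up values; in the real pipeline the stored procedure activity supplies these parameters from the pipeline and Copy-Table activity outputs.

-- Manual smoke test with illustrative values only.
EXEC dbo.sp_UpdateLogTable
    @DataFactory_Name = 'adf-demo',
    @Pipeline_Name = 'pl_copy_sql_to_synapse',
    @RunId = '00000000-0000-0000-0000-000000000000',
    @Source = 'dbo.MyTable',
    @Destination = 'etl.MyTable',
    @TriggerType = 'Manual',
    @TriggerId = 'manual-trigger-id',
    @TriggerName = 'Sandbox',
    @TriggerTime = '2023-01-01T00:00:00Z',
    @rowsCopied = '100',
    @DataRead = 1024,
    @No_ParallelCopies = 1,
    @copyDuration_in_secs = '12',
    @effectiveIntegrationRuntime = 'AutoResolveIntegrationRuntime',
    @Source_Type = 'SqlServerSource',
    @Sink_Type = 'SqlDWSink',
    @Execution_Status = 'Succeeded',
    @CopyActivity_Start_Time = '2023-01-01T00:00:00Z',
    @CopyActivity_End_Time = '2023-01-01T00:00:12Z',
    @CopyActivity_queuingDuration_in_secs = '2',
    @CopyActivity_transferDuration_in_secs = '10';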

Create a Stored Procedure to Update the Errors Table


Next, let’s run the following script, which will create a stored procedure to update the pipeline_errors table with detailed error data from the failed pipeline run. Note that this stored procedure will be called from the Data Factory pipeline at runtime.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[sp_UpdateErrorTable]
@DataFactory_Name [nvarchar](500) NULL,
@Pipeline_Name [nvarchar](500) NULL,
@RunId [nvarchar](500) NULL,
@Source [nvarchar](500) NULL,
@Destination [nvarchar](500) NULL,
@TriggerType [nvarchar](500) NULL,
@TriggerId [nvarchar](500) NULL,
@TriggerName [nvarchar](500) NULL,
@TriggerTime [nvarchar](500) NULL,
@No_ParallelCopies [int] NULL,
@copyDuration_in_secs [nvarchar](500) NULL,
@effectiveIntegrationRuntime [nvarchar](500) NULL,
@Source_Type [nvarchar](500) NULL,
@Sink_Type [nvarchar](500) NULL,
@Execution_Status [nvarchar](500) NULL,
@ErrorDescription [nvarchar](max) NULL,
@ErrorCode [nvarchar](500) NULL,
@ErrorLoggedTime [nvarchar](500) NULL,
@FailureType [nvarchar](500) NULL
AS
INSERT INTO [pipeline_errors]
(
[DataFactory_Name],
[Pipeline_Name],
[RunId],
[Source],
[Destination],
[TriggerType],
[TriggerId],
[TriggerName],
[TriggerTime],
[No_ParallelCopies],
[copyDuration_in_secs],
[effectiveIntegrationRuntime],
[Source_Type],
[Sink_Type],
[Execution_Status],
[ErrorDescription],
[ErrorCode],
[ErrorLoggedTime],
[FailureType]
)
VALUES
(
@DataFactory_Name,
@Pipeline_Name,
@RunId,
@Source,
@Destination,
@TriggerType,
@TriggerId,
@TriggerName,
@TriggerTime,
@No_ParallelCopies,
@copyDuration_in_secs,
@effectiveIntegrationRuntime,
@Source_Type,
@Sink_Type,
@Execution_Status,
@ErrorDescription,
@ErrorCode,
@ErrorLoggedTime,
@FailureType
)
GO

Create a Source Error SQL Table


Recall from my previous article, Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS Gen2, that we used a source SQL Server table that we then moved to Data Lake Storage Gen2 and ultimately into Synapse DW. Based on this process, we will need to test a known error within the Data Factory pipeline and process. It is known that generally a varchar(max) datatype containing at least 8000+ characters will fail when being loaded into Synapse DW, since varchar(max) is an unsupported data type. This seems like a good use case for an error test.


The following table, dbo.MyErrorTable, contains two columns, with col1 being the varchar(max) datatype.
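The article shows this table in a screenshot rather than as a script; a minimal sketch of what the DDL could look like is below (the second column name and its type are assumptions):

-- Assumed shape of the error-test source table: an id column plus the large varchar(max) column.
CREATE TABLE dbo.MyErrorTable (
    id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    col1 VARCHAR(MAX) NULL  -- holds the 8000+ character text expected to fail the Synapse load
);
GO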

Within dbo.MyErrorTable I have added a large block of text; I decided to randomly choose sample text from Roma: The Novel of Ancient Rome by Steven Saylor. After doing some editing of the text, I confirmed that col1 contains 8001 words, which is sure to fail my Azure Data Factory pipeline and trigger a record to be created in the pipeline_errors table.

Add Records to Parameter Table


Now that we’ve identified the source SQL tables to run through the process, I’ll add them to the pipeline_parameter table. For this demonstration I have added the error table that we created in the previous step, along with a regular table that we would expect to succeed, to demonstrate both a success and a failure in the end-to-end logging process.
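The screenshot of the parameter rows is not reproduced here; as a rough sketch, the two entries might be inserted like this (only a subset of columns is shown and every value is illustrative):

-- Illustrative rows only; the real pipeline_parameter rows populate many more columns.
INSERT INTO dbo.pipeline_parameter
    (server_name, src_type, src_schema, src_db, src_name, dst_type, dst_name, include_pipeline_flag)
VALUES
    ('OnPremSQL01', 'sql_server', 'dbo', 'DemoDB', 'MyTable',      'adls_gen2', 'MyTable',      'Y'),
    ('OnPremSQL01', 'sql_server', 'dbo', 'DemoDB', 'MyErrorTable', 'adls_gen2', 'MyErrorTable', 'Y');
GO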


Verify the Azure Data Lake Storage Gen2 Folders and Files
After running the pipeline to load my SQL tables to Azure Data Lake Storage Gen2, we can see that the destination ADLS Gen2 container now has both of the tables in snappy-compressed parquet format.
As an additional verification step, we can see that the folder contains the expected parquet file.

Configure the Pipeline Lookup Activity


It’s now time to build and configure the ADF pipeline. My previous article, Load Data Lake files into Azure Synapse Analytics Using Azure Data Factory, covers the details on how to build this pipeline. To recap the process, the select query within the lookup gets the list of parquet files that need to be loaded to Synapse DW and then passes them on to the ForEach loop, which will load the parquet files to Synapse DW.
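The exact lookup query is covered in the referenced article; as a rough sketch of the idea, it could be a simple select over the parameter table, for example (the load_synapse filter is an assumption):

-- Rough sketch of a lookup query against the parameter table; the real query may differ.
SELECT src_schema,
       src_name,
       dst_schema,
       dst_name,
       dst_folder,
       file_type
FROM dbo.pipeline_parameter
WHERE load_synapse = 'Y';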


Configure the Pipeline ForEach Loop Activity
The ForEach loop contains the Copy-Table activity, which takes the parquet files and loads them to Synapse DW while auto-creating the tables. If the Copy-Table activity succeeds, it will log the pipeline run data to the pipeline_log table. However, if the Copy-Table activity fails, it will log the pipeline error details to the pipeline_errors table.

Configure Stored Procedure to Update the Log Table


Notice that the UpdateLogTable stored procedure that we created earlier will be called by the success stored procedure activity.
Below are the stored procedure parameters that will update the pipeline_log table; they can be imported directly from the stored procedure.
The following values will need to be entered into the stored procedure parameter values.

Name Values
DataFactory_Name @{pipeline().DataFactory}
Pipeline_Name @{pipeline().Pipeline}
RunId @{pipeline().RunId}
Source @{item().src_name}
Destination @{item().dst_name}
TriggerType @{pipeline().TriggerType}
TriggerId @{pipeline().TriggerId}
TriggerName @{pipeline().TriggerName}
TriggerTime @{pipeline().TriggerTime}
rowsCopied @{activity('Copy-Table').output.rowsCopied}
RowsRead @{activity('Copy-Table').output.rowsRead}
No_ParallelCopies @{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs @{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime @{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type @{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type @{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status @{activity('Copy-Table').output.executionDetails[0].status}
CopyActivity_Start_Time @{activity('Copy-Table').output.executionDetails[0].start}
CopyActivity_End_Time @{utcnow()}
CopyActivity_queuingDuration_in_secs @{activity('Copy-Table').output.executionDetails[0].detailedDurations.queuingDuration}
CopyActivity_transferDuration_in_secs @{activity('Copy-Table').output.executionDetails[0].detailedDurations.transferDuration}
Configure Stored Procedure to Update the Error Table
The last stored procedure within the ForEach loop activity is the UpdateErrorTable stored procedure that we created earlier; it will be called by the failure stored procedure activity.
Below are the stored procedure parameters that will update the pipeline_errors table; they can be imported directly from the stored procedure.
The following values will need to be entered into the stored procedure parameter values.

Name Value
DataFactory_Name @{pipeline().DataFactory}
Pipeline_Name @{pipeline().Pipeline}
RunId @{pipeline().RunId}
Source @{item().src_name}
Destination @{item().dst_name}
TriggerType @{pipeline().TriggerType}
TriggerId @{pipeline().TriggerId}
TriggerName @{pipeline().TriggerName}
TriggerTime @{pipeline().TriggerTime}
No_ParallelCopies @{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs @{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime @{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type @{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type @{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status @{activity('Copy-Table').output.executionDetails[0].status}
ErrorCode @{activity('Copy-Table').error.errorCode}
ErrorDescription @{activity('Copy-Table').error.message}
ErrorLoggedTime @utcnow()
FailureType @concat(activity('Copy-Table').error.message,'failureType:',activity('Copy-Table').error.failureType)

Run the Pipeline


Now that we have configured the pipeline, it is time to run it. As we can see from the debug-mode output log, one table succeeded and the other failed, as expected.
Verify the Results
Finally, let’s verify the results in the pipeline_log table. As we can see, the pipeline_log table has captured one log entry containing the source, MyTable.

And the pipeline_errors table now has one record for MyErrorTable, along with detailed error codes, descriptions, messages, and more.
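The same checks can be run directly against the logging database, for example:

-- Quick verification of the logging tables after the pipeline run.
SELECT TOP (10) *
FROM dbo.pipeline_log
ORDER BY LOG_ID DESC;

SELECT TOP (10) error_id, Pipeline_Name, Source, ErrorCode, ErrorDescription, FailureType
FROM dbo.pipeline_errors
ORDER BY error_id DESC;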

As a final check, when I navigate to the Synapse DW, I can see that both tables have been auto-created, despite the fact that one failed and one succeeded. However, data was only loaded into MyTable, since MyErrorTable contains no data.
How to implement Slowly Changing Dimension Type 1 in Azure Data Factory
In this article, I will talk about how we can implement a slowly changing dimension (SCD Type 1).
One question that may arise in your mind is: what is a slowly changing dimension, and why do we categorize it as Type 1, Type 2, and Type 3?

I will give you a high-level overview of this. Let’s try to understand this diagram.
SCD types
So, this is a technique for loading data from one system (source) to another system (sink) incrementally. It also depends on the organisation how they want to perform this data load, meaning whether or not they want to preserve history. It totally depends on the situation and business needs.

Let’s see SCD type 1 in action.

I have a stage table (STG_Customer) where I have data that I need to insert into my dimension table (DIM_Customer).
Note: Data in the stage table is copied from the source system (ADLS Gen2). As per the design, we first move the data from the source to the staging area, then move the data from the staging area to the main table (used by the customer).

Let’s see what we have in our dimension table (DIM_Customer)

As the DIM_Customer table has no records, all the records from the stage table will be inserted into the dimension table.
To implement this we need to design a data flow inside our data factory like this.

Dataflow to implement SCD type1

Once we run this data flow, it will check the key column (Cust_Number) to perform the data insertion or data update logic and carry out an upsert operation.
Once I ran this data flow, since we had no data in DIM_Customer, it inserted the data into DIM_Customer from STG_Customer. As a result, all three records were copied from STG_Customer to the DIM_Customer table.

Now, imagine the data in the source changed and we loaded that data into our stage table (STG_Customer). For ID 1, the country changed from India to London. ID 2 and ID 3 remain the same, whereas ID 4 and ID 5 are new records.
Let’s run the data flow one more time. This time, as Cust_Number is already present for ID 1, it will update the record for ID 1, whereas ID 4 and ID 5 are new records and will be inserted into the DIM_Customer table; ID 2 and ID 3 will remain unchanged.
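If you prefer to see the equivalent logic in T-SQL, the upsert the data flow performs is essentially a MERGE keyed on Cust_Number. A minimal sketch follows, assuming the customer attributes are Cust_Name and Country (the column names other than Cust_Number are assumptions):

-- SCD Type 1: overwrite changed attributes, insert new customers, keep no history.
MERGE dbo.DIM_Customer AS tgt
USING dbo.STG_Customer AS src
    ON tgt.Cust_Number = src.Cust_Number
WHEN MATCHED THEN
    UPDATE SET tgt.Cust_Name = src.Cust_Name,
               tgt.Country   = src.Country
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Cust_Number, Cust_Name, Country)
    VALUES (src.Cust_Number, src.Cust_Name, src.Country);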

Q: How to copy data from an Azure Blob Storage (text file) to an Azure SQL Database table?
Ans:
Set up the Data Factory pipeline which will be used to copy data from the blob
storage to the Azure SQL Database.

1.Navigate to the Azure Portal and select the Author & Monitor option.

2.Click on the Create Pipeline / Copy Data option.

3.Enter a unique name for the copy activity and select whether to schedule or
execute it once for the Copy Data option -> Click Next

4.Select the type of source data store to link to in order to construct a Linked Service
(Azure Blob Storage) and click Continue

5.A new Linked Service window will open -> enter a unique name and other details -
> test the connection and then click Create.
6.If you need to copy many files recursively, specify the input data file or folder
(DataSet) and click Next.

7.Select Azure SQL Database from the list of New Linked Services, then click
Continue to configure the new Linked service.

8.Fill out all of the information in the New Linked Service window -> Test
Connection -> Create

9.Create the sink dataset with the destination database table specified (a sketch of such a table is shown after these steps).

10.Copy Data Tool -> Settings -> Next -> Review all the copy configuration from
the Summary Window -> Next -> Finish

The Data Factory will create all pipeline components before running the pipeline. Executing
the pipeline involves running the copy activity, which copies the data from the input file in
Azure Blob Storage and writes it to the Azure SQL Database table.
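For step 9, the destination table must exist (or be auto-created) before the copy runs. A minimal sketch of such a sink table for a simple three-column text file is shown below; the table name, columns, and types are purely illustrative and should match the actual file layout.

-- Illustrative sink table for the copy activity; adjust the columns to the CSV/text file layout.
CREATE TABLE dbo.BlobImport (
    Id       INT           NOT NULL,
    Name     NVARCHAR(200) NULL,
    LoadDate DATETIME2     NULL
);
GO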

Question 1: Assume that you are a data engineer for company ABC. The company wants to migrate from their on-premises environment to the Microsoft Azure cloud, and you will use Azure Data Factory for this purpose. You have created a pipeline that copies data of one table from on-premises to the Azure cloud. What are the necessary steps you need to take to ensure this pipeline gets executed successfully?
The company has made a very good decision in moving to the cloud from the traditional on-premises database. As we have to move the data from the on-premises location to the cloud, we need to have an Integration Runtime created; the auto-resolve integration runtime provided by Azure Data Factory cannot connect to your on-premises network. Hence, in step 1 we should create our own self-hosted integration runtime. This can be done in two ways:
The first way: have a virtual machine ready in the cloud and install the integration runtime on it ourselves.
The second way: take a machine on the on-premises network and install the integration runtime there.

Go to the Azure Data Factory portal. In the Manage tab, select Integration runtimes.
1.Create a self-hosted integration runtime by simply giving general information like the name and description.
2.Create an Azure VM (if you already have one, you can skip this step).
3.Download the integration runtime software on the Azure virtual machine and install it.
4.Copy the auto-generated key from step 1 and paste it into the newly installed integration runtime on the Azure VM.
You can follow this link for a detailed step-by-step guide on how to install a self-hosted integration runtime: How to Install Self-Hosted Integration Runtime on Azure vm – AzureLib.
Once your integration runtime is ready, we move on to linked service creation. Create the linked service which connects to your data source, and for this use the integration runtime created above.
After this we will create the pipeline. Your pipeline will have a copy activity where the source should be the database available at the on-premises location, while the sink would be the database available in the cloud.

Question 2: Assume that you are working for company ABC as a data engineer. You have successfully created a pipeline needed for migration, and it is working fine in your development environment. How would you deploy this pipeline to production without making any, or only minimal, changes?
When you create a pipeline for migration, or for any other purpose like ETL, most of the time it will use a data source. In the above-mentioned scenario we are doing a migration, hence it is definitely using a data source on the source side and, similarly, a data source on the destination side, and we need to move the data from source to destination. As described in the question, the data engineer has developed the pipeline successfully in the development environment. Hence it is safe to assume that both the source-side and destination-side data sources are probably pointing to the development environment only. The pipeline would have a copy activity which uses datasets, with the help of linked services, for the source and sink.
A linked service provides the way to connect to the data source by providing the data source details like the server address, port number, username, password, key, or other credential-related information.
In this case, our linked services are probably pointing to the development environment only. Before the production deployment we may also need to do a couple of other deployments, such as deployments to a testing or UAT environment. Hence we need to design our Azure Data Factory pipeline components in such a way that we can provide the environment-related information dynamically, as parameters. There should be no hard-coding of this kind of information.
We need to create the ARM template for our pipeline. The ARM template needs to have a definition for all the constituents of the pipeline, like linked services, datasets, activities, and the pipeline itself.
Once the ARM template is ready, it should be checked in to the Git repository. A lead or admin will create the DevOps pipeline which takes this ARM template and a parameter file as input. The DevOps pipeline will deploy the ARM template and create all the resources, like linked services, datasets, activities, and your data pipeline, in the production environment.
Question 3: Assume that you have around 1 TB of data stored in Azure Blob Storage. This data is in multiple CSV files. You are asked to do a couple of transformations on this data, as per business logic and needs, before moving this data into the staging container. How would you plan and architect the solution for this given scenario? Explain with details.
First of all, we need to analyze the situation. If you look closely at the size of the data, you find that it is very large. Directly doing the transformation on such a huge amount of data could be a very cumbersome and time-consuming process, hence we should think about a big data processing mechanism where we can leverage the advantages of parallel and distributed computing. Here we have two choices:
1.We can use Hadoop MapReduce, through the HDInsight capability, for doing the transformation.
2.We can also think of using Spark, through Azure Databricks, for doing the transformation on such a huge scale of data.
Out of these two, Spark on Azure Databricks is the better choice because Spark is much faster than Hadoop due to in-memory computation. So let’s choose Azure Databricks as the option.
Next we need to create the pipeline in Azure Data Factory. The pipeline should use a Databricks notebook as an activity. We can write all the business-related transformation logic in the Spark notebook; the notebook can be written in Python, Scala, or Java. When you execute the pipeline, it will trigger the Azure Databricks notebook, and your analytics and transformation logic runs as you defined it in the notebook. In the notebook itself, you can write the logic to store the output into the blob storage staging area.

Question 4: Assume that you have an IoT device enabled on your vehicle. This device sends data from the vehicle every hour, and it is getting stored in a Blob Storage location in Microsoft Azure. You have to move this data from that storage location into a SQL database. How would you design the solution? Explain with reasons.
This looks like a typical incremental load scenario. As described in the problem statement, the IoT device writes data to the location every hour. It is most likely that this device is sending JSON data to the cloud storage (as most IoT devices generate data in JSON format), and it is probably writing a new JSON file every time data is sent from the device to the cloud.
Hence we will have a number of files available in the storage location, generated on an hourly basis, and we need to pull these files into the Azure SQL database.
We need to create a pipeline in Azure Data Factory which does the incremental load. We can use the conventional high watermark file mechanism for solving this problem. The high watermark design is as follows (a SQL-table variant is sketched after these steps):
1.Create a file named, let’s say, HighWaterMark.txt and store it somewhere in Azure Blob Storage. In this file we will put a start date and time.
2.Now create the pipeline in Azure Data Factory. The pipeline has its first activity defined as a lookup activity, which will read the date from HighWaterMark.txt.
3.Add one more lookup activity which will return the current date and time.
4.Add a copy activity to the pipeline which will pull the JSON files having a created timestamp greater than the high watermark date. In the sink, push the read data into the Azure SQL database.
5.After the copy activity, add another copy activity which will update the high watermark file with the current date and time generated in step 3.
6.Add a trigger to execute this pipeline on an hourly basis.
That’s how we can design the incremental data load solution for the scenario described above.
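The design above keeps the watermark in a text file; a common variant keeps it in a small SQL table instead, which the same lookup/copy/update pattern can use. A minimal sketch of that variant is below (all names are illustrative):

-- Variant: watermark stored in SQL instead of a blob text file.
CREATE TABLE dbo.HighWaterMark (
    source_name    NVARCHAR(200) NOT NULL PRIMARY KEY,
    watermark_time DATETIME2     NOT NULL
);
GO

INSERT INTO dbo.HighWaterMark (source_name, watermark_time)
VALUES ('vehicle_iot_json', '1900-01-01T00:00:00');
GO

-- Lookup activity: read the current watermark value.
-- SELECT watermark_time FROM dbo.HighWaterMark WHERE source_name = 'vehicle_iot_json';

-- Called by the pipeline after a successful copy to advance the watermark
-- to the current date/time captured by the second lookup.
CREATE PROCEDURE dbo.usp_UpdateWatermark
    @source_name    NVARCHAR(200),
    @watermark_time DATETIME2
AS
UPDATE dbo.HighWaterMark
SET watermark_time = @watermark_time
WHERE source_name = @source_name;
GO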
