
Copy Activity in ADF
SRIRAM SUNDAR A

Pipeline Copy Activity
Data Factory
A data factory is a system or platform designed to manage and automate the
flow of data between various sources and destinations. It serves as a central
hub for orchestrating data pipelines, which are workflows that extract,
transform, and load (ETL) or process data from its source to its destination.

Azure Data Factory


Azure Data Factory (ADF) is a cloud-based data integration service provided
by Microsoft Azure. It allows you to create, schedule, and orchestrate data
workflows, enabling seamless data movement and transformation across
various data sources and destinations.
Procedure for creating a pipeline Copy activity
Create Linked Services:
● Create a Linked Service for Oracle: Go to your ADF instance in the
Azure portal, navigate to the "Author" tab, and click on
"Connections". Select "New connection" and choose the Oracle
option. Provide connection details like server address, authentication
method, username, password, and database name.
● Create a Linked Service for SQL Server: Similarly, create a Linked
Service for your SQL Server database. Provide the necessary
connection information, including server address, authentication
method, username, password, and database name.
Create Datasets:
● Define a Dataset for Oracle: In the "Author" tab, click on "Datasets"
and select "New dataset". Choose the "Oracle" option and specify
the connection details. Select the table(s) you want to copy data
from.
● Define a Dataset for SQL Server: Create a new dataset for your SQL
Server database. Choose the appropriate SQL Server option and
specify the connection details. Select the destination table(s) where
you want to copy the data (a scripted sketch of these linked services and datasets follows this list).
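
The portal steps above can also be scripted. The sketch below uses the azure-mgmt-datafactory Python SDK (with azure-identity for authentication) to create comparable Oracle and SQL Server linked services and datasets. It is a minimal sketch, not the definitive method: the subscription, resource group, factory, table, and connection details are placeholders, and exact model names can differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, OracleLinkedService, SqlServerLinkedService,
    DatasetResource, OracleTableDataset, SqlServerTableDataset,
    LinkedServiceReference,
)

# Placeholder identifiers -- replace with your own subscription, resource group, and factory.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Linked service for the Oracle source (connection details are placeholders).
oracle_ls = LinkedServiceResource(
    properties=OracleLinkedService(
        connection_string="Host=<oracle-host>;Port=1521;Sid=<sid>;User Id=<user>;Password=<password>;"
    )
)
client.linked_services.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "OracleLS", oracle_ls)

# Linked service for the SQL Server sink.
sql_ls = LinkedServiceResource(
    properties=SqlServerLinkedService(
        connection_string="Data Source=<sql-server>;Initial Catalog=<database>;User ID=<user>;Password=<password>;"
    )
)
client.linked_services.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "SqlServerLS", sql_ls)

# Dataset pointing at the Oracle source table (table name is illustrative).
oracle_ds = DatasetResource(
    properties=OracleTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="OracleLS"
        ),
        table_name="EMPLOYEES",
    )
)
client.datasets.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "OracleEmployees", oracle_ds)

# Dataset pointing at the SQL Server destination table.
sql_ds = DatasetResource(
    properties=SqlServerTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="SqlServerLS"
        ),
        table_name="dbo.Employees",
    )
)
client.datasets.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "SqlEmployees", sql_ds)
```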
Create Copy Activity:
● In the "Author" tab, go to "Pipelines" and click on "New pipeline".
● Drag and drop the "Copy Data" activity onto the pipeline canvas.
● Configure Source and Sink: In the "Properties" pane of the Copy
Data activity, select the Oracle dataset as the source and the SQL
Server dataset as the sink. Configure any necessary settings such as
column mappings or data type conversions.
● Publish and Trigger the Pipeline:
● Click on "Publish" to save your changes and publish the pipeline.
● Optionally, trigger the pipeline manually to run it immediately, or set
up a trigger to run it at specified intervals (an equivalent SDK sketch follows below).
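
The same copy pipeline can be created and run programmatically. A minimal sketch, assuming the dataset names from the previous sketch ("OracleEmployees" and "SqlEmployees"); the pipeline name is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, OracleSource, SqlSink, DatasetReference,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders -- use your own values
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Copy activity: read from the Oracle dataset, write to the SQL Server dataset
# (dataset names assume the sketch from the previous step).
copy_activity = CopyActivity(
    name="CopyOracleToSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="OracleEmployees")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SqlEmployees")],
    source=OracleSource(),
    sink=SqlSink(),
)

pipeline = PipelineResource(activities=[copy_activity])
client.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "CopyOracleToSqlPipeline", pipeline)

# Resources created through the SDK take effect immediately; the "Publish" step applies
# to changes made in the ADF Studio UI. Trigger a run on demand:
run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "CopyOracleToSqlPipeline")
print("Started run:", run.run_id)
```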
Mapping Data Flow

Data Flow
● The Data Flow feature in Azure Data Factory allows you to develop
graphical data transformation logic that can be executed as
activities in ADF pipelines.
● Your data flow executes on a scaled-out Azure Databricks (Spark)
cluster for distributed data processing.
● ADF internally handles all the code translation, Spark optimization,
and execution of the transformations.
Mapping Data Flow
(For combining two tables)
● Access Azure Data Factory: Go to the Azure portal and navigate to
your Azure Data Factory instance.
● Open Data Flows: Inside your Data Factory instance, go to the
"Author" tab.
● Create a New Data Flow: Click on the "+" button and select "Data
flow" from the dropdown menu.
● Name Your Data Flow: Give your data flow a meaningful name that
reflects its purpose, such as "Join Tables Data Flow".
● Add Source Data: Inside the data flow canvas, click on "Add Source"
and choose the source dataset representing the first table you want
to join. Configure the settings for the source dataset, including any
filters or transformations if necessary.
● Add Sink Data: Similarly, click on "Add Sink" and select the sink
dataset representing the destination where you want to write the
joined data. This could be a database table or file storage.
● Add Join Transformation: Drag and drop the "Join" transformation
from the list of transformations onto the data flow canvas. Connect
the outputs of the two source streams to the "Join" transformation.
● Configure Join Settings: Double-click on the "Join" transformation to
configure its settings. Choose the join type (e.g., inner join, left outer
join, etc.) and specify the join conditions based on the columns from
both source datasets.
● Map Output Columns: After configuring the join transformation,
you need to map the columns from both source datasets to the
output dataset. Click on the arrow between the join transformation
and the sink dataset to open the mapping interface. Map the
columns from the source datasets to the corresponding columns in
the sink dataset.
● Preview Data and Validate: Before finalizing your data flow, it's a
good practice to preview the data to ensure that the join operation
is producing the expected results. Click on the "Debug" button to run
a debug session of the data flow and inspect the output data.

● Publish Changes: Once you're satisfied with the data flow, click on
the "Publish all" button to save your changes and publish the data
flow to your Data Factory instance.
● Add Data Flow Activity to Pipeline: Navigate to the "Pipelines"
section in your Data Factory instance and open the pipeline where
you want to use the data flow. Drag and drop the "Data Flow"
activity onto the pipeline canvas and link it to any preceding or
succeeding activities as needed.
● Configure Data Flow Activity: Double-click on the data flow activity
to configure its settings. Select the data flow you created from the
dropdown menu and configure any additional settings such as
runtime properties or triggers (see the sketch after this list).
● Publish Pipeline: Once you've configured the data flow activity
within your pipeline, click on the "Publish all" button to save your
changes and publish the pipeline to your Data Factory instance.
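
If you script pipelines instead of using the canvas, the data flow activity can be attached with the same Python SDK. A minimal sketch, assuming a data flow named "JoinTablesDataFlow" already exists in the factory; all names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Execute Data Flow activity referencing the data flow built on the canvas.
df_activity = ExecuteDataFlowActivity(
    name="RunJoinTablesDataFlow",
    data_flow=DataFlowReference(type="DataFlowReference", reference_name="JoinTablesDataFlow"),
)

pipeline = PipelineResource(activities=[df_activity])
client.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "JoinTablesPipeline", pipeline)
```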
Settings in ADF
Language
● ADF supports multiple languages for the user interface within the
Azure portal. Users can select their preferred language from a list of
supported languages, which typically includes major languages such
as English, Spanish, French, German, etc.
● To change the language in the Azure portal (where ADF is
managed), navigate to the portal settings and select the desired
language.
Regional format
● The regional format settings in ADF are inherited from the Azure portal
settings.
● This includes the format for dates, times, currency, and numerical values
displayed within the ADF interface.
● Users can configure their preferred regional format in the Azure portal
settings, which will apply to all Azure services, including ADF.
● To change the regional format in the Azure portal, navigate to the portal
settings and adjust the regional format settings as needed.
TRIGGERS
Triggers in Azure Data Factory
● Triggers let you execute your pipelines automatically.
● Triggers determine when a pipeline execution needs to be kicked off.
● Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger).
● Multiple triggers can kick off a single pipeline, or a single trigger can kick off multiple pipelines.
Types of Triggers
Schedule Trigger
● A trigger that invokes a pipeline on a wall-clock schedule.
● This trigger type allows you to specify a recurring schedule based on time intervals such as hourly, daily, weekly, or monthly. You can set start and end dates, recurrence patterns, and time zones as per your requirement.
● For example: like an alarm set for 6 AM from Monday to Friday, a schedule trigger can run a pipeline every weekday at 6 AM.
How to Create Schedule Trigger
● Switch to the Edit tab in Data Factory or the Integrate tab in Azure Synapse.
● Select Trigger on the menu, then select New/Edit.
● On the Add Triggers page, select Choose trigger..., then select +New.
● To specify an end date and time, select Specify an End Date and specify Ends On, then select OK. There is a cost associated with each pipeline run, so if you are testing, you may want to ensure that the pipeline is triggered only a couple of times. However, ensure that there is enough time for the pipeline to run between the publish time and the end time.
● Select Publish all to publish the changes. Until you publish the changes, the trigger doesn't start triggering the pipeline runs.
● Switch to the Pipeline runs tab on the left, then select Refresh to refresh the list. You will see the pipeline runs triggered by the schedule trigger. Notice the values in the Triggered By column. If you use the Trigger Now option, you will see the manual trigger run in the list.
● Switch to the Trigger Runs / Schedule view (an SDK sketch of an equivalent schedule trigger follows below).
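
The same kind of schedule trigger can be defined in code. A minimal sketch with the azure-mgmt-datafactory Python SDK, assuming the pipeline "CopyOracleToSqlPipeline" from the earlier sketch; the daily 06:00 UTC recurrence and the one-week end date are only illustrative.

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Run the pipeline once a day; start shortly after creation and stop after a week,
# so a test setup only produces a handful of (billable) runs.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=15),
    end_time=datetime.utcnow() + timedelta(days=7),
    time_zone="UTC",
)

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="CopyOracleToSqlPipeline"
                )
            )
        ],
    )
)

client.triggers.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "DailyCopyTrigger", trigger)
# Triggers are created in a stopped state; start this one explicitly.
client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "DailyCopyTrigger").result()
```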
Tumbling Window Trigger

Tumbling window triggers are used when you need to process data in time-based windows, such as daily or hourly aggregations. You define a window size and offset to partition the data into distinct intervals for processing.

Advantage of Tumbling Window Trigger over Schedule Trigger

A schedule trigger only fires for present and future times, whereas a tumbling window trigger can also run pipelines for past time windows, which makes it suitable for backfilling historical data.
How to Create Tumbling Window Trigger
● To create a tumbling window trigger in the Azure portal, select
the Triggers tab, and then select New.
● After the trigger configuration pane opens, select Tumbling
Window, and then define your tumbling window trigger properties.
● When you're done, select Save (an SDK sketch follows below).
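
A minimal SDK sketch of a tumbling window trigger, again assuming the pipeline "CopyOracleToSqlPipeline" from earlier; the hourly window, the backdated start time, and the windowStart/windowEnd pipeline parameters are illustrative assumptions.

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, TumblingWindowTrigger, TriggerPipelineReference, PipelineReference,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One-hour windows starting two days in the past, so earlier windows are backfilled.
# Unlike other trigger types, a tumbling window trigger points at exactly one pipeline.
trigger = TriggerResource(
    properties=TumblingWindowTrigger(
        pipeline=TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="CopyOracleToSqlPipeline"
            ),
            # Window boundaries can be passed to the pipeline, assuming it declares
            # windowStart/windowEnd parameters (hypothetical names).
            parameters={
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime",
            },
        ),
        frequency="Hour",
        interval=1,
        start_time=datetime.utcnow() - timedelta(days=2),
        max_concurrency=2,   # how many windows may run in parallel
    )
)

client.triggers.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "HourlyWindowTrigger", trigger)
client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "HourlyWindowTrigger").result()
```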
Tumbling Window Trigger Dependency
● To create a dependency on a trigger, select Trigger > Advanced > New,
and then choose the trigger to depend on with the appropriate offset
and size. Select Finish and publish the changes for the dependencies to
take effect.
● Dependency Offset
● Dependency Size
● Self Dependency

Event-based Triggers

Data integration scenarios often require customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account.

Advantage of Event-Based Trigger over a Tumbling Window Trigger

The advantage of using an Event-Based Trigger over a Tumbling Window Trigger in Azure Data Factory lies in its ability to initiate pipeline runs in response to specific external events, providing a more dynamic and reactive approach to data processing.
How to Create Event based Triggers
● Switch to the Edit tab in Data Factory, or the Integrate tab in Azure
Synapse.
● Select Trigger on the menu, then select New/Edit.
● On the Add Triggers page, select Choose trigger..., then
select +New.
● Select the trigger type Storage events.
● Select your storage account from the Azure
subscription dropdown or manually using its Storage
account resource ID. Choose which container you wish
the events to occur on. Container selection is required,
but be mindful that selecting all containers can lead to
a large number of events.
● The Blob path begins with and Blob path ends with properties allow
you to specify the containers, folders, and blob names for which you
want to receive events.
● Select whether your trigger will respond to a Blob created event,
Blob deleted event, or both. In your specified storage location, each
event will trigger the Data Factory and Synapse pipelines associated
with the trigger.
● Select whether or not your trigger ignores blobs with zero bytes.
● After you configure your trigger, click on Next: Data preview. This screen
shows the existing blobs matched by your storage event trigger
configuration. Make sure you've set specific filters; configuring filters that are
too broad can match a large number of files created/deleted and may
significantly impact your cost. Once your filter conditions have been
verified, click Finish (an SDK sketch of an equivalent storage event trigger follows below).
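
For completeness, here is a minimal sketch of a comparable storage event trigger created with the azure-mgmt-datafactory Python SDK. The storage account resource ID, container, path filters, and pipeline name ("ProcessNewFilePipeline") are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, BlobEventsTrigger, TriggerPipelineReference, PipelineReference,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

# Resource ID of the storage account whose events should fire the trigger (placeholder).
STORAGE_ACCOUNT_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/my-rg"
    "/providers/Microsoft.Storage/storageAccounts/mystorageacct"
)

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

trigger = TriggerResource(
    properties=BlobEventsTrigger(
        scope=STORAGE_ACCOUNT_ID,
        events=["Microsoft.Storage.BlobCreated"],        # and/or Microsoft.Storage.BlobDeleted
        blob_path_begins_with="/input/blobs/incoming/",  # container "input", folder "incoming"
        blob_path_ends_with=".csv",                      # keep filters specific to limit events
        ignore_empty_blobs=True,                         # skip zero-byte blobs
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="ProcessNewFilePipeline"
                )
            )
        ],
    )
)

client.triggers.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "BlobCreatedTrigger", trigger)
client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "BlobCreatedTrigger").result()
```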
● To attach a pipeline to this trigger, go to the pipeline canvas, click
Trigger, and select New/Edit. When the side nav appears, click the
Choose trigger... dropdown and select the trigger you created.
Click Next: Data preview to confirm the configuration, then Next to
validate the data preview.
● If your pipeline has parameters, you can specify them on the trigger run
parameters side nav. The storage event trigger captures the folder path and
file name of the blob into the
properties @triggerBody().folderPath and @triggerBody().fileName. To
use the values of these properties in a pipeline, you must map the
properties to pipeline parameters. After mapping the properties to
parameters, you can access the values captured by the trigger through
the @pipeline().parameters.parameterName expression throughout the
pipeline (a minimal sketch of this mapping follows the expressions below).
● @triggerBody().folderPath
● @triggerBody().fileName
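
A minimal sketch of that mapping in code, assuming a pipeline that declares sourceFolder and sourceFile string parameters (hypothetical names); the trigger passes the captured folder path and file name into those parameters.

```python
from azure.mgmt.datafactory.models import TriggerPipelineReference, PipelineReference

# Map the values captured by the storage event trigger onto (hypothetical) pipeline parameters.
pipeline_ref = TriggerPipelineReference(
    pipeline_reference=PipelineReference(
        type="PipelineReference", reference_name="ProcessNewFilePipeline"
    ),
    parameters={
        "sourceFolder": "@triggerBody().folderPath",  # folder path captured by the trigger
        "sourceFile": "@triggerBody().fileName",      # file name captured by the trigger
    },
)

# Inside the pipeline, read the values back as:
#   @pipeline().parameters.sourceFolder
#   @pipeline().parameters.sourceFile
```

This pipeline reference, with its parameters, is what would go into the trigger's pipelines list shown in the earlier storage event trigger sketch.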
Thank You
