ADF INTERVIEW Q&A

INTRODUCTION:

"Good morning, I’m Ramya. I completed my graduation from Anna


University and have 3 years of experience as an Azure Data Factory
developer at TCS. I worked on BFSI projects for European clients,
particularly in the insurance domain.
In my role, I was responsible for generating reports by integrating
data from on-premises databases and APIs. I used SQL for data
filtering, implemented ADF activities like Copy Activity and Lookup,
and leveraged Self-hosted Integration Runtime for secure
connections. The processed data was delivered to Azure Blob Storage
in CSV and TXT formats, enabling efficient reporting for the client.
I’m passionate about building scalable data solutions, and I believe
my skills and experience align well with the requirements of this
role."

BASIC LEVEL QUESTIONS


1)What is the main difference between a Pipeline and an Activity in Azure
Data Factory?
Answer:
"In Azure Data Factory, a Pipeline is like a plan that manages the overall
workflow, while an Activity is an individual step or task in that plan.
For example, in one of my projects, I moved data from an on-premises SQL Server (managed through SQL Server Management Studio) to Azure SQL Database. I created a pipeline to automate the process, and inside the pipeline I used a Copy Activity to handle the data transfer. However, I didn’t move raw data directly; I optimized the process using SQL conditions within the Copy Activity's source query.
For example, I used a SQL query in the Copy Activity to join two tables, filter
only the active customers using a WHERE clause, and calculate their total
purchases using GROUP BY before transferring the data to Azure SQL.
This approach ensured that only clean and relevant data was loaded into Azure
SQL, reducing the need for additional transformations downstream. So, in this
workflow:
 The Pipeline managed the entire process from source to destination.
 The Activity handled the specific task of executing the SQL query and
moving the data.
By leveraging SQL conditions in the Copy Activity, I ensured the pipeline was
not just efficient but also aligned with the business requirements."

These conditions included the following (a combined query sketch follows the list):


1. JOINs: To combine data from multiple tables.
2. WHERE Clauses: To filter rows based on specific conditions.
3. ROW_NUMBER(): To add unique row numbers for partitioning or
deduplication.
4. GROUP BY: To aggregate data, such as calculating totals or averages.
5. ORDER BY: To sort data before loading it into the destination.
6. CASE Statements: To create conditional columns directly in the query.
7. DATEDIFF() and DATEPART(): For date-based filtering and calculations,
such as selecting records from the past month.
8. ISNULL/COALESCE: To handle null values by providing default values.
9. TOP Clause: To limit the number of records for testing or partial loads.
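To make this concrete, here is a combined sketch of such a source query. The table and column names (Customers, Orders, IsActive, and so on) are hypothetical and only illustrate the pattern, not the exact project query:

-- Hypothetical Copy Activity source query combining several of the conditions above.
SELECT TOP (10000)                                       -- TOP: limit rows for a test or partial load
    c.CustomerID,
    ISNULL(c.Region, 'Unknown')             AS Region,   -- ISNULL: default value for missing regions
    SUM(o.OrderAmount)                      AS TotalPurchases,      -- GROUP BY aggregate
    CASE WHEN SUM(o.OrderAmount) > 10000
         THEN 'Premium' ELSE 'Standard' END AS CustomerTier         -- CASE: conditional column
FROM dbo.Customers AS c
INNER JOIN dbo.Orders AS o                               -- JOIN: combine two tables
    ON o.CustomerID = c.CustomerID
WHERE c.IsActive = 1                                     -- WHERE: only active customers
  AND DATEDIFF(DAY, o.OrderDate, GETDATE()) <= 30        -- DATEDIFF: records from the past month
GROUP BY c.CustomerID, ISNULL(c.Region, 'Unknown')
ORDER BY TotalPurchases DESC;                            -- ORDER BY: sort before loading

Because the filtering and aggregation happen at the source, only clean, relevant rows ever leave SQL Server.
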
2)What is the difference between a Linked Service and a Dataset in Azure Data Factory?
Linked Service
 A Linked Service is like a connection string or a configuration that defines the connection information to a data source or destination.
 It specifies the credentials, endpoint, or server details needed to connect to external data systems like Azure Blob Storage, SQL Database, or an on-premises database.
 Think of it as a bridge that connects ADF to the data store.
 Example: If you want to connect to an Azure SQL Database, the Linked Service will contain the server name, database name, and authentication details.
Dataset
 A Dataset represents the data structure or data entity (like a table, file,
or folder) that you want to work with within the data store defined by
the Linked Service.
 It refers to the actual data you are reading from or writing to.
Analogy (IMPORTANT):
You can use an analogy to make it relatable:
 Linked Service is like a connection to a library (you know the address
and credentials to access it).
 Dataset is like a specific book or shelf in that library (the actual data you
want to access).

Real-World Example
In one of my projects, I needed to copy data from an on-premises SQL Server to
Azure Blob Storage:
 I first created a Linked Service to connect to the SQL Server using
credentials and connection strings.
 Then, I defined a Dataset for the specific table in the SQL Server that I
wanted to move. Similarly, I created a Dataset for the destination,
specifying the container and file format in Azure Blob Storage.
In this setup:
 The Linked Service established the connection to SQL Server and Blob
Storage.
 The Datasets defined the specific data being moved and its structure.
Summary
 Linked Service: Connects to the data store (e.g., Azure SQL, Blob
Storage).
 Dataset: Specifies the data (e.g., table, file) within the store to be used in
your activities.
This separation ensures flexibility. For instance, the same Linked Service can be
used by multiple datasets pointing to different tables or files in the same data
store."
3)What is the purpose of a Trigger in Azure Data Factory? What are the types
of triggers, and what are the differences between them?
Answer:
In Azure Data Factory, a Trigger is used to automatically start a pipeline without
manual intervention. It helps automate data processes and makes sure the
pipeline runs when it's needed, based on specific conditions or schedules. You
can think of a trigger as the "start button" for your pipeline.
Types of Triggers in Azure Data Factory:
1. Scheduled Trigger (What you've used):
o This trigger runs the pipeline at a specific time or on a recurring
schedule (e.g., every day at midnight or every hour).
o Example: If you need to run a data load job every day at 6 AM,
you'd use a Scheduled Trigger.
2. Tumbling Window Trigger:
o This one is used when you need to process data in time-based
chunks (e.g., every hour, every day). It ensures that the pipeline
runs at regular intervals, and each run processes a specific time
window of data.
o Example: You might use this trigger to process hourly data files
where the pipeline needs to handle each hour’s data one by one,
making sure no window overlaps.
3. Event-Based Trigger:
o This trigger starts the pipeline when a specific event happens,
such as a new file being uploaded to Azure Blob Storage.
o Example: If you want to trigger a pipeline every time a new CSV
file is uploaded, you'd use an Event-Based Trigger.
Key Differences:
 Scheduled Trigger: Runs at a specific time or interval you define.
 Tumbling Window Trigger: Runs on fixed time intervals, but it processes
data for each window of time separately.
 Event-Based Trigger: Starts the pipeline when a specific event happens,
like a new file appearing in a storage location.
Simple Example:
In your case, where you've only used a Scheduled Trigger:
 You might have set it to run every night at midnight to load data into a
database.
 If you wanted to switch to a Tumbling Window Trigger, you could
process data in chunks (like hourly logs), where each run processes the
data for that exact hour and waits for the next window (see the query sketch after this section).
Each trigger type is useful for different needs, but the Scheduled Trigger is the
most straightforward and commonly used for time-based tasks.
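For example (a minimal sketch, not from the original projects: the table dbo.EventLogs and the EventTime column are hypothetical, and the @{...} parts are ADF dynamic-content expressions whose values ADF supplies for tumbling window runs, which can also be mapped to pipeline parameters), a Copy Activity source query for a Tumbling Window Trigger could look like this:

-- Each tumbling window run copies exactly one window of data, with no overlaps.
-- ADF replaces the @{...} expressions with the window start/end before the query runs.
SELECT EventId, EventTime, Payload
FROM dbo.EventLogs
WHERE EventTime >= '@{formatDateTime(trigger().outputs.windowStartTime, 'yyyy-MM-dd HH:mm:ss')}'
  AND EventTime <  '@{formatDateTime(trigger().outputs.windowEndTime, 'yyyy-MM-dd HH:mm:ss')}';

Because each window is processed exactly once, no hour of data is skipped or double-counted.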

4)What is the purpose of the "Debug" mode in Azure Data Factory?


The "Debug" mode in Azure Data Factory allows you to test and validate your
Pipeline in a controlled environment before deploying it to production. When
you run a Pipeline in Debug mode, you can:
1. Test the Pipeline: Verify that the Pipeline runs successfully and performs the
expected data transformations and movements.
2. Identify and fix errors: Debug mode helps you detect and troubleshoot
errors, such as data type mismatches, connection issues, or invalid
configurations.
3. Validate data: Inspect the data being processed and verify that it's correct
and consistent with your expectations.
By using the Debug mode, you can ensure that your Pipeline is working
correctly and make any necessary adjustments before deploying it to
production.
5)What is the purpose of the "Data Flow" activity in Azure Data Factory?
(Need to watch YouTube video for understanding)
Answer :
The Data Flow activity in Azure Data Factory (ADF) allows you to design and
perform complex data transformations at scale without writing code. It
provides a graphical interface to build data transformation logic, making it
easier to manipulate and shape your data as it moves from one location to
another.
Purpose of Data Flow:
 Data Transformation:
It enables you to transform data using a visual interface. You can perform
tasks like filtering, aggregating, joining, splitting, and mapping data
between different sources and destinations.
For example, you might use Data Flow to clean up raw data, remove
duplicates, or join tables from different sources.
 Scalability:
Data Flows run on Azure's managed Spark environment, which means
they can scale easily to handle large volumes of data efficiently. This is
useful when you need to process complex data transformation tasks
without worrying about infrastructure management.
 Code-Free Interface:
Data Flow provides a low-code, drag-and-drop interface to design
transformations, making it accessible even to users who are not familiar
with writing code.
Key Features of Data Flow:
1. Source and Sink Transformation:
o You can configure sources (e.g., SQL Server, Azure Blob Storage,
etc.) and sinks (destination systems where the transformed data
will be loaded) in Data Flow.
2. Transformations:
o Includes operations like Filter, Join, Aggregate, Sort, Derived
Column, and Conditional Split to manipulate data.
o For example, you could use the Filter transformation to exclude
records based on certain conditions or use the Join transformation
to merge data from different sources.
3. Debugging:
o Data Flow includes a debugging feature, which helps you run a
preview of your transformations before executing the full pipeline.
This makes it easier to test and troubleshoot transformations in
real-time.
4. Reusable Logic:
o You can design a Data Flow and use it in multiple pipelines, making
it reusable across various workflows, which improves efficiency.
Real-World Example:
Let’s say you’re working on a data migration project. You might need to load
data from an on-premises SQL database into Azure SQL Database, but the data
requires some transformations, like:
 Filtering out invalid records,
 Aggregating sales data by region,
 Joining customer data from a different table,
 Renaming columns for consistency.
With the Data Flow activity, you can visually design each of these steps and
apply them in a single transformation task, without writing custom code.
Summary:
In simple terms, Data Flow in Azure Data Factory is a visual and scalable way to
design data transformations for large datasets. It’s a no-code/low-code solution
that helps you efficiently shape data as it moves through your pipeline, making
it a powerful tool for data engineers who need to transform data without
manual coding.

6) What is the difference between Azure Data Factory (ADF) and Azure Synapse Analytics (formerly Azure SQL Data Warehouse)? (Need to watch YouTube video for understanding)
What is Azure Synapse Analytics?

Azure Synapse Analytics is a cloud-based enterprise data warehouse and analytics service. It's designed to help organizations integrate and analyze large amounts of data from various sources, such as relational databases, NoSQL databases, and big data stores.

Key features of Azure Synapse Analytics:

1. Enterprise data warehouse: Azure Synapse Analytics provides a scalable and secure data warehouse for storing and analyzing large amounts of data.
2. Big data analytics: It supports big data analytics workloads, including data
ingestion, processing, and analysis.
3. Data integration: Azure Synapse Analytics provides built-in data integration
capabilities, allowing you to integrate data from various sources.
4. Machine learning and AI: It supports machine learning and AI workloads,
enabling you to build and deploy predictive models.
Key differences between ADF and Azure Synapse Analytics:

1. Purpose: ADF is primarily used for ETL and data integration, while Azure
Synapse Analytics is designed for enterprise data warehousing, big data
analytics, and machine learning.
2. Data processing: ADF is focused on data movement and transformation,
while Azure Synapse Analytics provides advanced data processing capabilities,
including data warehousing and big data analytics.
3. Scalability: Both ADF and Azure Synapse Analytics are scalable, but Azure
Synapse Analytics is designed to handle much larger data volumes and more
complex analytics workloads.
7)What is the purpose of the "Integration Runtime" in Azure Data Factory?
In Azure Data Factory, the Integration Runtime (IR) is the compute environment that handles the execution of data movements, transformations, and other activities. There are three main types of IRs, each suited for different use cases:
1. Azure Integration Runtime (Azure IR):
o This is a cloud-based compute environment that allows you to run
pipelines and perform data movements within Azure or between
Azure and other cloud services.
o It is ideal for scenarios where all your data sources and
destinations are in the cloud (e.g., moving data between Azure
SQL Database and Azure Blob Storage).
2. Azure-SSIS Integration Runtime (Azure SSIS IR):
o This is a specialized IR designed to run SQL Server Integration
Services (SSIS) packages in the cloud.
o It provides a managed environment for executing SSIS workloads,
which is helpful when migrating on-premises SSIS packages to
Azure without needing to rewrite them.
o For example, if your organization has existing ETL processes built
using SSIS, you can run them in Azure with minimal changes by
using this IR.
3. Self-Hosted Integration Runtime (Self-Hosted IR):
o This is an on-premises compute environment that allows you to
execute pipelines and access data sources that are behind firewalls
or in private networks.
o It is typically used in hybrid scenarios, where you need to move or
transform data between on-premises systems and cloud services.
o For example, if you need to move data from an on-premises SQL
Server to an Azure data warehouse, the Self-Hosted IR would
securely handle that data movement.

Key Takeaway:
 The Azure IR is cloud-based, the Azure SSIS IR is designed specifically for
SSIS workloads in the cloud, and the Self-Hosted IR is used for hybrid
cloud scenarios, allowing you to work with on-premises data sources.
This answer should demonstrate your understanding of the different IR
types and how each serves a specific purpose in the Azure Data Factory
ecosystem.
Question 8: Can you explain the concept of Staging Area in Azure Data
Factory, its purpose, and provide a real-time example?
Answer: "In Azure Data Factory, a Staging Area is a temporary storage location
used to stage data before loading it into its final destination. It's a critical
component of the data integration process.

"The Staging Area is needed for several reasons:

1. Data Quality Checks: It allows for data quality checks, data validation, and
data cleansing before loading the data into the final destination.
2. Data Transformation: It enables data transformation, data aggregation, and
data formatting before loading the data into the final destination.
3. Performance Optimization: It helps optimize data loading performance by
breaking down large datasets into smaller, more manageable chunks.

"The purpose of the Staging Area is to:

1. Improve Data Quality: Ensure that the data is accurate, complete, and
consistent before loading it into the final destination.
2. Increase Efficiency: Optimize data loading performance and reduce the risk
of data loading errors.
3. Enhance Flexibility: Provide a flexible and scalable data integration solution
that can handle changing data volumes and formats.

"Here's a real-time example:

"In my previous project, we worked with a retail client who wanted to integrate
data from their e-commerce platform, ERP system, and social media channels
into a centralized data warehouse for analytics and reporting.
"We used Azure Data Factory to design a data pipeline that extracted data from
these sources, transformed it into a standardized format, and loaded it into the
data warehouse.

"However, we encountered issues with data quality and inconsistencies during


the data loading process. To address this, we implemented a Staging Area in
Azure Data Factory using Azure Blob Storage.

"The Staging Area allowed us to temporarily store the extracted data, perform
data quality checks, and transform the data into the required format before
loading it into the data warehouse.
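To illustrate the kind of cleansing that runs against staged data (a minimal sketch with hypothetical names such as stg_Orders, dw.FactOrders, and LoadTimestamp, not the exact project code), a deduplication-and-load step might look like this:

-- Keep only the latest staged record per OrderID, apply basic quality rules,
-- then load the cleansed rows into the warehouse table.
;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderID
                              ORDER BY LoadTimestamp DESC) AS rn
    FROM stg_Orders
)
INSERT INTO dw.FactOrders (OrderID, CustomerID, OrderAmount, OrderDate)
SELECT OrderID,
       CustomerID,
       ISNULL(OrderAmount, 0),        -- default missing amounts
       CAST(OrderDate AS DATE)        -- normalize date values
FROM ranked
WHERE rn = 1                          -- drop duplicate records
  AND ISNULL(OrderAmount, 0) >= 0;    -- simple data quality rule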

"By using the Staging Area, we were able to improve data quality, reduce
errors, and increase the overall efficiency of the data pipeline.

"In terms of metrics, we saw a 30% reduction in data loading errors and a 25%
improvement in data processing time. The client was extremely satisfied with
the results, and we were able to deliver a scalable and reliable data integration
solution using Azure Data Factory and the Staging Area."
Question 9: Suppose you're working with a client who has an on-premises
SQL Server database that they want to migrate to Azure Synapse Analytics.
However, the database contains sensitive customer information that needs to
be encrypted during transit. How would you design a data pipeline in Azure
Data Factory to accomplish this task?

- You correctly identified the need to create a Self-Hosted Integration Runtime (SHIR) to connect to the on-premises SQL Server database.
- You mentioned the need for a linked service with authentication, which is
correct. However, you could have elaborated on the types of authentication
that can be used, such as Windows Authentication or SQL Authentication.
- You correctly mentioned the need for a dataset to represent the data source.
To take your answer to the next level, here are some suggestions:

- Provide more details about the SHIR, such as how it enables secure data
transfer between the on-premises database and Azure.
- Explain the importance of authentication in the linked service and how it
ensures secure access to the data source.
- Mention any additional considerations, such as data encryption, network
security, or compliance requirements.

Here's an example of how you could elaborate on your answer:

"To migrate the on-premises SQL Server database to Azure Synapse Analytics,
we need to create a Self-Hosted Integration Runtime (SHIR) to connect to the
on-premises database. This enables secure data transfer between the on-
premises database and Azure.

"We also need to create a linked service with authentication to ensure secure
access to the data source. This can be done using Windows Authentication or
SQL Authentication.

"Additionally, we need to create a dataset to represent the data source. This


dataset will define the structure and schema of the data.

"Finally, we need to consider additional security measures, such as data


encryption, network security, and compliance requirements, to ensure the
secure transfer of sensitive customer information."
10)What is the difference between a Copy Activity and a Data Flow Activity in
Azure Data Factory?
Copy Activity:
1. Purpose: Copy Activity is used to copy data from a source to a destination,
with minimal transformations.
2. Data Movement: Copy Activity is optimized for data movement, allowing for
fast and efficient copying of data between sources.
3. Transformations: Only limited transformations are available, such as column mapping, format conversion, and filtering or aggregating via the source query.
4. Cost: Copy Activity is a cost-effective option, as it only charges for the
amount of data moved.
5. Performance: Copy Activity provides high-performance data movement, with
throughput rates of up to 1.5 GB/s.

Data Flow Activity:

1. Purpose: Data Flow Activity is used to transform and process data, with
advanced data transformations and data quality checks.
2. Data Transformation: Data Flow Activity provides a visual interface for
creating data flows, allowing for complex data transformations, data validation,
and data quality checks.
3. Compute Resources: Data Flow Activity uses compute resources (e.g., Azure
Databricks, Azure Synapse Analytics) to process data, which can result in higher
costs.
4. Cost: Data Flow Activity is more expensive than Copy Activity, as it charges
for both data processing and compute resources.
5. Performance: Data Flow Activity provides high-performance data processing,
with throughput rates of up to 10 GB/s. However, performance can vary
depending on the complexity of the data transformations and the compute
resources used.

Comparison Summary:
| | Copy Activity | Data Flow Activity |
| --- | --- | --- |
| Purpose | Data movement with minimal transformations | Data transformation and processing with advanced transformations |
| Cost | Cost-effective, only charges for data moved | More expensive, charges for data processing and compute resources |
| Performance | High-performance data movement, up to 1.5 GB/s | High-performance data processing, up to 10 GB/s |
| Transformations | Limited transformations available | Advanced data transformations and data quality checks available |

In summary:

- Copy Activity is ideal for simple data movement scenarios, with minimal
transformations, and provides high-performance data movement.
- Data Flow Activity is suitable for complex data transformation scenarios, with
advanced data quality checks and processing requirements, and provides high-
performance data processing.
- Cost and performance are important considerations, with Copy Activity being
generally more cost-effective and Data Flow Activity providing higher
performance for complex data transformations.

11)What are the different types of datasets in Azure Data Factory, and how
do you choose the right one for your data integration scenario?

In Azure Data Factory, a dataset is a named representation of data that can be
used as input or output for an activity. There are three types of datasets:

1. Linked Service Dataset: This type of dataset is linked to a specific data store,
such as Azure Blob Storage, Azure Data Lake Storage, or an on-premises
database.

Example: Suppose we're working with a retail company that wants to integrate
data from their e-commerce platform (stored in Azure Blob Storage) with their
ERP system (stored in an on-premises database). We can create a Linked
Service Dataset to connect to each of these data stores.

2. Inline Dataset: This type of dataset is defined directly within the pipeline, without referencing an external data store.

Example: Suppose we're working with a marketing company that wants to create a pipeline to process customer feedback data. We can create an Inline Dataset to define the structure of the customer feedback data directly within the pipeline.

3. Parameterized Dataset: This type of dataset allows you to parameterize the dataset definition, making it reusable across multiple pipelines.

Example: Suppose we're working with a financial services company that wants
to create a pipeline to process daily transaction data. We can create a
Parameterized Dataset to define the structure of the transaction data, and then
reuse this dataset across multiple pipelines.

When choosing the right dataset type, consider the following factors:
- Data source: Is the data stored in a cloud-based data store or an on-premises
database?
- Data complexity: Is the data simple or complex, requiring additional
processing or transformation?
- Reusability: Do you need to reuse the dataset across multiple pipelines?

By considering these factors and choosing the right dataset type, you can
create efficient and scalable data pipelines in Azure Data Factory.

Important words:

- Linked Service Dataset


- Inline Dataset
- Parameterized Dataset
- Data source
- Data complexity
- Reusability

REAL INTERVIEW QUESTIONS IN ADF


1) How do you move data from on-premises to the cloud in Azure Data
Factory?

In one of my previous projects, I moved data from an on-premises SQL Server to Azure SQL Database.
 I installed the Self-Hosted IR on a local server.
 Configured Linked Services for both the SQL Server and Azure SQL
Database.
 Used a Copy Data activity in the pipeline to move the data, applying a
SQL query to filter only the required rows.
 Monitored the pipeline to ensure data transfer accuracy and optimized
the process by enabling parallelism for large tables.
1. Setting Up a Self-Hosted Integration Runtime (IR):
Since on-premises data is typically behind a firewall, I would configure a Self-
Hosted Integration Runtime (IR) to securely connect the on-premises
environment to Azure Data Factory.
This IR acts as a bridge between the on-premises system and Azure, ensuring
secure data movement.
Example: If I’m working with an on-premises SQL Server, I would install the Self-
Hosted IR on a machine that has access to the database.
2. Configuring Linked Services:
Next, I would create Linked Services in Azure Data Factory to define the source
and destination connections.
For the source, I’d configure a Linked Service for the on-premises SQL Server,
using the Self-Hosted IR for connectivity.
For the destination, I’d set up a Linked Service for the cloud storage or
database, such as Azure Blob Storage or Azure SQL Database.
3. Designing the Pipeline:
I’d create a pipeline in Azure Data Factory to orchestrate the data movement.
The pipeline would include:
A Copy Data activity to extract data from the on-premises source and load it
into the cloud destination.
I could also include data transformation steps if required, using either SQL
queries in the Copy activity or a Data Flow activity.
4. Optimizing Data Transfer:
To ensure efficient data transfer:
I’d use staging if needed (e.g., staging data in Azure Blob Storage before loading it into Azure SQL).
I’d enable compression or use partitioning to handle large datasets effectively.
If the source supports it, I’d use incremental data loads to transfer only new or changed data instead of the entire dataset (a watermark-based query sketch follows these steps).
5. Monitoring and Debugging: I’d use the Monitor tab in Azure Data Factory to
track the pipeline execution, identify any errors, and ensure the data transfer
completes successfully.
During development, I’d test the pipeline using Debug mode to validate the
setup and fix any issues.
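As a sketch of the incremental (watermark-based) load mentioned in step 4: the table dbo.Customers, the ModifiedDate column, and the LastWatermark parameter below are hypothetical; in practice, the watermark value is usually read with a Lookup Activity and passed to the Copy Activity as a pipeline parameter.

-- Copy only rows changed since the previous run.
-- ADF substitutes @{pipeline().parameters.LastWatermark} with the stored
-- watermark value before the query is sent to SQL Server.
SELECT CustomerID, CustomerName, ModifiedDate
FROM dbo.Customers
WHERE ModifiedDate > '@{pipeline().parameters.LastWatermark}';
-- After a successful copy, a Stored Procedure activity would update the stored
-- watermark to the MAX(ModifiedDate) of this run.
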
2) How do you transfer data between two on-premises databases (from On-Premises DB1 to On-Premises DB2) using Azure Data Factory (ADF)?

When transferring data between two on-premises databases (from On-Premises Database 1 to On-Premises Database 2) using Azure Data Factory (ADF), there are certain technical limitations and considerations to keep in mind, especially when dealing with the Self-Hosted Integration Runtime (SHIR). Let’s break it down and walk through the solution.
Key Challenge: SHIR and On-Premises Data Movement
 SHIR can only be used for one data movement operation at a time. This
means that, technically, you cannot directly move data from one on-
premises database to another using a single SHIR instance
simultaneously.
 However, there is a way to work around this limitation by using Azure as
an intermediary (e.g., Azure Blob Storage or Azure SQL Database) to
stage the data before moving it to the second on-premises database.

Steps to Transfer Data from On-Premises DB1 to On-Premises DB2 Using ADF
1. Set Up the Self-Hosted Integration Runtime (SHIR)
 First, install and configure the Self-Hosted Integration Runtime (SHIR) on
a machine that can access both on-premises databases (DB1 and DB2).
 SHIR will be used to move data from On-Premises Database 1 to Azure
(e.g., Azure Blob Storage or Azure SQL Database) and then from Azure to
On-Premises Database 2.
2. Create Linked Services for Both On-Premises Databases
 Linked Service for On-Premises DB1: This will define the connection to
On-Premises Database 1.
 Linked Service for On-Premises DB2: This will define the connection to
On-Premises Database 2.
 Both Linked Services will use the SHIR you set up earlier to access the on-
premises databases.
3. Create Datasets for Source and Destination
 Dataset for DB1 (Source): This will represent the table or query in On-
Premises Database 1 that you want to copy.
 Dataset for Azure (Staging): This dataset will represent the Azure Blob
Storage or Azure SQL Database where the data will be temporarily
stored.
 Dataset for DB2 (Destination): This will represent the table in On-
Premises Database 2 where the data will be copied.
4. Create a Pipeline with Copy Activities
You will need to create a pipeline with two Copy Activities:
1. Copy Data from On-Premises DB1 to Azure (Staging):
o Use a Copy Activity to copy data from On-Premises Database 1 to
Azure (e.g., Azure Blob Storage or Azure SQL Database).
o Configure the source to use the Dataset for DB1 and the sink to
use the Dataset for Azure.
2. Copy Data from Azure to On-Premises DB2:
o Use another Copy Activity to copy data from Azure (the staging
area) to On-Premises Database 2.
o Configure the source to use the Dataset for Azure and the sink to
use the Dataset for DB2.
5. Trigger and Monitor the Pipeline
 Trigger the pipeline to run on a schedule or manually.
 Monitor the execution of the pipeline to ensure data is transferred
successfully.

Why Use Azure as an Intermediary?


 SHIR Limitation: As mentioned earlier, SHIR can only be used for one
data movement task at a time. This means you cannot directly move
data from On-Premises Database 1 to On-Premises Database 2 in a
single copy activity.
 Staging in Azure: By using Azure Blob Storage or Azure SQL Database as
an intermediary, you can break the process into two distinct steps:
o First, move the data from On-Premises DB1 to Azure.
o Then, move the data from Azure to On-Premises DB2.
This method ensures that data can flow securely and efficiently between the
two on-premises databases, even though SHIR can only handle one operation
at a time.

Alternative: Use Multiple SHIRs


If you absolutely need to perform both operations (DB1 to DB2)
simultaneously, you could use two separate SHIR instances—one for each
database. Here's how:
1. Install a second SHIR on a machine that can access On-Premises
Database 2.
2. Create two Linked Services:
o One for On-Premises Database 1 (using the first SHIR).
o One for On-Premises Database 2 (using the second SHIR).
3. Use two Copy Activities in parallel:
o Copy data from DB1 to Azure using the first SHIR.
o Copy data from Azure to DB2 using the second SHIR.
This approach involves more setup and management, but it allows you to
bypass the SHIR limitation of handling only one task at a time.
Final Explanation for the Interviewer
"To transfer data between two on-premises databases using Azure Data
Factory, I would use the Self-Hosted Integration Runtime (SHIR) to connect to
both databases. Since SHIR can only handle one data movement operation at
a time, I would stage the data in Azure (using Azure Blob Storage or Azure
SQL Database) as an intermediary. First, I would copy data from On-Premises
Database 1 to Azure, and then from Azure to On-Premises Database 2. This
method ensures that data flows securely and efficiently between the two on-
premises databases."

This solution ensures you're considering both the technical limitations and the most efficient way to achieve the data transfer.

3)How would you design a pipeline in ADF to move data from On-Premises to
Azure SQL Database, ensuring it runs only on the last working day of the
month?


Answer:
To schedule a pipeline that runs on the last working day of each month, where
the last working day may differ (e.g., weekends or holidays), I would use the
following approach:

Step 1: Use a Custom Logic to Identify the Last Working Day


 Since ADF's built-in scheduler does not directly account for "working
days" or holidays, I will rely on an external logic or metadata source to
determine the last working day.
 I will create a control table or an external file (like CSV or JSON) that
contains a list of last working days for each month.
Example Table (Last_Working_Day):
| Month | Year | Last_Working_Day |
| --- | --- | --- |
| January | 2024 | 2024-01-31 |
| February | 2024 | 2024-02-29 |
| March | 2024 | 2024-03-28 |
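A minimal sketch of how such a control table could be created and populated (the names mirror the example above; the dates are the illustrative values from the table and would normally come from the business calendar):

-- Control table that drives the "last working day" check.
CREATE TABLE dbo.Last_Working_Day (
    [Month]          VARCHAR(20) NOT NULL,   -- e.g. 'January'
    [Year]           INT         NOT NULL,
    Last_Working_Day DATE        NOT NULL
);

-- Populated in advance (e.g. once a year) and updated if holidays change.
INSERT INTO dbo.Last_Working_Day ([Month], [Year], Last_Working_Day)
VALUES ('January',  2024, '2024-01-31'),
       ('February', 2024, '2024-02-29'),
       ('March',    2024, '2024-03-28');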

Step 2: ADF Pipeline Design


1. Create Linked Services:
o Connect to the On-Premises Database using a Self-Hosted
Integration Runtime (SHIR).
o Connect to the Azure SQL Database.
2. Create Datasets:
o Dataset 1: Source data (On-Premises).
o Dataset 2: Sink data (Azure SQL Database).
o Dataset 3: Control table or metadata file containing the last
working day.
3. Build the Pipeline:
o Lookup Activity:
 Use a Lookup Activity to query the control table (or file) and
retrieve the last working day for the current month.
 SQL Query Example (matching the month name and year against the control table):
SELECT Last_Working_Day
FROM dbo.Last_Working_Day
WHERE [Month] = DATENAME(MONTH, GETDATE()) AND [Year] = YEAR(GETDATE())
o If Condition Activity:
 Use the If Condition activity to check if the current date
matches the last working day returned by the Lookup
Activity.
 Use the expression (formatting both sides as dates so the comparison is reliable):
@equals(formatDateTime(utcnow(), 'yyyy-MM-dd'), formatDateTime(activity('Lookup1').output.firstRow.Last_Working_Day, 'yyyy-MM-dd'))
o Copy Activity:
 If the condition is true, trigger the Copy Activity to move
data from the On-Premises Database to Azure SQL
Database.

Step 3: Trigger the Pipeline


 Use an ADF Schedule Trigger to run the pipeline daily at a specific time
(e.g., 11:00 PM).
 The pipeline will check if the current day is the last working day using the
control table and only execute the Copy Activity when the condition is
met.

Step 4: Monitor and Optimize


 Use the Monitor tab in ADF to track pipeline runs and ensure the data
transfer is successful.
 Enable retry logic and error handling to account for failures.
Key Points to Impress the Interviewer:
1. Custom Logic: Using a control table or metadata file to handle dynamic
"last working day" logic.
2. Conditional Execution: Combining Lookup Activity and If Condition to
ensure the pipeline runs only on the correct day.
3. Daily Trigger: Running the pipeline daily but executing the Copy Activity
only on the last working day.
4. Scalability: The solution can easily adapt to changing holidays or business
rules by updating the control table.

Summary for the Interviewer:


"I would use a control table or metadata file to store the last working day for
each month. The pipeline would include a Lookup Activity to fetch the date and
an If Condition Activity to validate if the current day matches the last working
day. I would schedule the pipeline to run daily, and the condition ensures the
Copy Activity executes only on the last working day. This approach is dynamic,
scalable, and allows for easy updates to accommodate holidays or other
business rules."

This answer shows your understanding of ADF's conditional logic, triggers, and ability to handle real-world scheduling challenges.
1. Difference Between Tumbling Window Trigger and Scheduled Trigger
| Feature | Scheduled Trigger | Tumbling Window Trigger |
| --- | --- | --- |
| Purpose | Runs pipelines at specific times (e.g., daily, hourly). | Runs pipelines at fixed intervals, with a strict dependency on previous runs completing. |
| Overlap | Allows overlapping executions. | Does not allow overlaps (runs are sequential). |
| Backfill | Not natively supported (manual workaround needed). | Supports backfilling for past time windows. |
| Flexibility | Simple to use and configure. | Requires careful setup to handle dependencies. |
| Time Awareness | Runs at specific calendar-based times. | Runs based on window duration. |

2. Why Self-Hosted Integration Runtime (SHIR) is Needed for On-Premises Databases
a) Network Connectivity
 On-premises databases are usually hosted in a private network behind
firewalls and cannot be accessed directly over the public internet.
 The Azure Integration Runtime operates only in the cloud and cannot
connect to private, on-premises resources.
 SHIR is installed within your private network (e.g., on a VM or server)
and acts as a bridge between ADF in the cloud and your on-premises
database.

b) Secure Data Movement


 The SHIR securely moves data between on-premises systems and Azure
using encrypted communication.
 It uses outbound-only connections (port 443 for HTTPS) to Azure, so you
don't need to open inbound ports on your firewall, which keeps your
environment secure.

c) No Direct Access Without SHIR


 Without SHIR, Azure Data Factory has no way to reach your on-premises
database because:
o Your database is not exposed to the public internet.
o Firewalls and private IPs block access from Azure cloud services.
3. Why Not Azure Integration Runtime?
 Azure Integration Runtime works only for cloud-to-cloud data
movement (e.g., from Azure Blob Storage to Azure SQL Database).
 It does not have access to on-premises systems because it operates
entirely in the cloud and cannot connect to private networks.

4. Key Role of SHIR


The Self-Hosted Integration Runtime solves the on-premises connectivity
challenge by:
1. Being installed inside the on-premises network.
2. Establishing a secure outbound connection to Azure Data Factory.
3. Acting as a proxy to read and write data between on-premises systems
and Azure.

5. Final Explanation (For the Interviewer)


"The Self-Hosted Integration Runtime (SHIR) is required to connect to on-
premises databases because those systems are typically hosted in a private
network behind firewalls. The SHIR is installed within the on-premises
network and acts as a secure bridge between Azure Data Factory and the
database. Other Integration Runtimes, like the Azure Integration Runtime,
cannot connect to on-premises systems because they operate only in the
cloud and do not have access to private networks."
