Az Questions

Q---ADF

Azure Data Factory (ADF) is a cloud-based data integration service provided by Microsoft Azure that lets users create data-driven workflows to move and transform data between different sources. Azure offers ADF as a robust, adaptable platform for building, orchestrating, and managing data pipelines. With Azure Data Factory, businesses can gather, transform, and analyze data from multiple sources, enabling streamlined, flexible, data-driven decision-making.

Q---ADF components

The key components of ADF are pipelines, activities, datasets, linked services, triggers, and integration runtimes, which together enable efficient data integration, transformation, and movement.

1. Pipelines

Design pipelines to orchestrate and automate data movement and transformation, and use them to group related activities into a single unit of work for seamless data integration.

2. Activities

Data-driven workflows in ADF are built from activities covering data movement, data transformation, and orchestration/control flow. Activities process data on a range of compute services to perform tasks such as data cleansing, enrichment, and aggregation, and their outputs can feed BI tools for reporting and analysis. When choosing activities, align them with specific business requirements to maximize their impact on data-driven operations and decision-making.

3. Datasets

Define data stores: identify the disparate data sources to be integrated, including historical data for ingestion.

Ingest the data: use ADF to ingest data from various disparate data sources, such as databases, file systems, and cloud services.

Data transformation: orchestrate the movement and transformation of data from multiple disparate data sources.

4. Linked Services

Create a new linked service by defining its type and connection properties.

Specify the data stores or compute resources the linked service connects to.

Set up the authentication method, such as OAuth or key-based authentication.

5. Triggers

Schedule triggers run data integration processes at specified times.

Tumbling window triggers process data-driven workflows at regular, fixed-size intervals.

Event-based triggers start data transformation or data movement in response to specific events or conditions.

6. Integration Runtimes

Integration runtimes provide the compute infrastructure on which activities in a data factory run.

They supply secure, isolated compute for data movement, transformation, and orchestration, and provide connectivity across the various data stores, compute services, and BI applications referenced by linked services.
Q---ADF pipeline creation process/development

Here is a step-by-step guide to creating an Azure Data Factory (ADF) pipeline:

Step 1: Plan Your Pipeline
- Define the purpose of your pipeline
- Identify the data sources and destinations
- Determine the transformations needed

Step 2: Create a New Pipeline
- Log in to the Azure portal
- Navigate to your Azure Data Factory instance
- Click "Author" and then "Pipeline"
- Choose a template or start from scratch

Step 3: Add Activities
- Drag and drop activities from the toolbox onto the canvas
- Configure each activity:
  - Source and destination datasets
  - Transformation settings
  - Connection properties

Step 4: Configure Data Flows
- Create data flows to define the data transformation logic
- Use the data flow canvas to design the transformation steps
- Configure data flow properties, such as source and destination datasets

Step 5: Add Dependencies
- Define dependencies between activities
- Use the "Depends on" property to specify the order of execution

Step 6: Debug and Test
- Use the "Debug" button to test individual activities
- Run the entire pipeline in debug mode to test it end to end

Step 7: Publish and Deploy
- Publish the pipeline to the Azure Data Factory service
- Deploy the pipeline to a production environment

Step 8: Monitor and Maintain
- Monitor pipeline execution and performance
- Troubleshoot issues using the ADF monitoring tools
- Update and maintain the pipeline as needed

Additionally, here are some best practices to keep in mind:
- Use meaningful names and descriptions for pipelines, activities, and datasets
- Use parameters and variables to make pipelines more dynamic
- Use data flows to simplify complex transformations
- Test and validate pipelines thoroughly before deploying to production
- Monitor pipeline performance and optimize as needed
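The same steps can also be automated from code. Below is a minimal, illustrative sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, dataset, and pipeline names are placeholders, the two datasets and their linked services are assumed to already exist, and model constructors can vary slightly between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
)

# Placeholder names -- replace with your own subscription, resource group, and factory.
subscription_id = "<subscription-id>"
rg_name = "my-rg"
df_name = "my-data-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Step 3: a single Copy activity that moves data between two existing datasets.
copy_activity = CopyActivity(
    name="CopyFromBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SinkDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Step 7: publish the pipeline definition to the data factory.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyPipeline", pipeline)

# Trigger an on-demand run (Step 8 would then monitor this run).
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyPipeline", parameters={})
print("Started run:", run.run_id)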
Q-----ADF pipeline failure issues: explain one or two issues, their root cause, and the fix

ADF pipeline failure issues are problems or errors that occur during the execution of an Azure Data Factory (ADF) pipeline and prevent it from completing successfully. These failures can result in:

1. Data processing interruptions
2. Data loss or corruption
3. Inconsistent data
4. Delayed or missed deadlines
5. Increased costs due to re-runs or manual interventions

Example 1:
Issue: "Failed to read data from source" error in a Copy Data activity.
Root Cause: Incorrect credentials or insufficient permissions for the source dataset.
Fix:
1. Verify the credentials used for the source dataset connection.
2. Ensure the credentials have the necessary permissions to read data from the source.
3. Update the credentials or permissions as needed.
4. Re-run the pipeline.

Example 2:
Issue: "Invalid column name" error in a Data Flow activity.
Root Cause: A mismatch between the column names in the source data and the column names defined in the Data Flow schema.
Fix:
1. Verify the column names in the source data.
2. Update the Data Flow schema to match the column names in the source data.
3. Re-map the columns in the Data Flow activity as needed.
4. Re-run the pipeline.

Additional troubleshooting tips:
- Check the ADF pipeline run history and activity logs for detailed error messages.
- Validate data types, formats, and schema consistency throughout the pipeline.
- Test individual activities and data flows in isolation to identify the root cause.
- Review and update pipeline dependencies and data flows as needed.

Common ADF pipeline failure issues include:
- Connection and credential issues
- Data type and format mismatches
- Schema inconsistencies
- Dependency and data flow errors
- Resource constraints and timeouts

By identifying the root cause and applying the appropriate fix, you can resolve ADF pipeline failures and ensure reliable data processing.
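The run history and activity-level errors mentioned in the troubleshooting tips can also be pulled programmatically. A rough sketch with the azure-mgmt-datafactory Python SDK; all names and the run ID are placeholders:

from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

subscription_id = "<subscription-id>"
rg_name, df_name = "my-rg", "my-data-factory"
run_id = "<failed-pipeline-run-id>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Overall status of the pipeline run (Succeeded, Failed, Cancelled, ...).
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_id)
print(pipeline_run.status, pipeline_run.message)

# Drill into the individual activity runs to find the failing activity and its error.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    rg_name, df_name, run_id, filters
)
for activity_run in activity_runs.value:
    if activity_run.status == "Failed":
        print(activity_run.activity_name, activity_run.error)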
Q----Types of integration runtimes

Azure Data Factory offers three types of integration runtime (IR):

Azure integration runtime: a fully managed, elastic compute that connects to Azure and non-Azure data sources over public endpoints.

Self-hosted integration runtime: connects to data sources in an on-premises network or inside a virtual network.

Azure-SSIS integration runtime: dedicated compute for natively running SQL Server Integration Services (SSIS) packages, covering lift-and-shift data integration scenarios.

Q---Triggers

Note that Azure Data Factory itself supports three trigger types (schedule, tumbling window, and event-based triggers); the event hub, HTTP, and Cosmos DB triggers listed here are Azure Functions triggers.

Event hub trigger
Designed to ingest large amounts of data from many mobile or IoT devices; useful for scalable Azure Functions.

Schedule trigger
Runs based on a schedule defined in the trigger, with options for minutes, hours, days, weeks, and months.

Tumbling window trigger
Executes data pipelines at a specified time slice or periodic interval. It is useful when copying or migrating historical data.

HTTP trigger
Allows functions to be invoked by HTTP requests, which can be used to build API endpoints and handle incoming requests.

Pipeline triggers
In YAML pipelines, different versions of the pipeline in different branches can affect which version of the pipeline's triggers is evaluated.

Trigger concurrency
This limit specifies the maximum number of Logic App instances that can run in parallel.

Azure Cosmos DB trigger
Creates a function that runs in response to document changes, based on provided values. For example, a function template can write the number of documents and the first document ID to the logs.
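For ADF specifically, triggers can also be created from code. A hedged sketch using the azure-mgmt-datafactory Python SDK that attaches a 15-minute schedule trigger to an existing pipeline; all names are placeholders, and exact model constructors and operation names (start vs. begin_start) vary slightly between SDK versions:

from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

subscription_id = "<subscription-id>"
rg_name, df_name = "my-rg", "my-data-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Run the existing "CopyPipeline" every 15 minutes for the next week.
recurrence = ScheduleTriggerRecurrence(
    frequency="Minute",
    interval=15,
    start_time=datetime.utcnow(),
    end_time=datetime.utcnow() + timedelta(days=7),
    time_zone="UTC",
)
trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopyPipeline"),
        parameters={},
    )],
)
adf_client.triggers.create_or_update(rg_name, df_name, "Every15Minutes",
                                     TriggerResource(properties=trigger))

# Triggers are created in a stopped state; start it explicitly
# (in older SDK versions this is adf_client.triggers.start).
adf_client.triggers.begin_start(rg_name, df_name, "Every15Minutes").result()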
Q----Linked services and types

In Azure Data Factory (ADF), a linked service defines the connection information (much like a connection string) for a data store or compute service, so that data from various sources can be used in data integration workflows.

All linked service types support parameterization.

Natively supported in the UI: when authoring a linked service in the UI, the service provides a built-in parameterization experience for the following linked service types. In the linked service creation/edit blade, you can find options to add new parameters and add dynamic content. A few of the supported types are: Azure Blob Storage, Azure Databricks Delta Lake, Azure Data Explorer, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Databricks, Azure File Storage, Azure Function, Azure Key Vault, Azure SQL Database, Azure SQL Managed Instance, Azure Table Storage, etc.
Q---Transformation types

In Azure Data Factory (ADF), several transformation types can be used to manipulate and process data. Here are some of the most common transformation types in ADF:

1. Aggregate: performs aggregation operations such as SUM, AVG, MAX, and MIN on data.
2. Alter Row: marks individual rows in a dataset for insert, update, upsert, or delete based on conditions.
3. Append: combines two or more datasets into a single dataset.
4. Copy: copies data from a source to a destination without modifying it.
5. Data Conversion: converts data types, formats, or encoding.
6. Data Validation: validates data against a set of rules or constraints.
7. Derived Column: creates new columns based on existing columns.
8. Filter: filters data based on conditions.
9. Join: combines data from two or more datasets based on a common column.
10. Lookup: looks up values in a reference dataset.
11. Pivot: rotates data from rows to columns.
12. Sort: sorts data in ascending or descending order.
13. Split: splits a single dataset into multiple datasets.
14. Union: combines two or more datasets into a single dataset.
15. Unpivot: rotates data from columns to rows.
16. Upsert: updates existing data or inserts new data if it doesn't exist.
17. XML Transformation: transforms XML data into a different format.

These transformation types can be used individually or in combination to create complex data processing pipelines in ADF.
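ADF mapping data flows are configured visually and executed on Spark. As a rough, illustrative parallel only, several of the transformation types above (Filter, Derived Column, Join, Aggregate) can be expressed in PySpark; this assumes an existing SparkSession named spark and made-up sample data:

from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [(1, "A", 100.0), (2, "B", 250.0), (3, "A", 40.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("A", "Alice"), ("B", "Bob")], ["customer_id", "name"]
)

result = (
    orders
    .filter(F.col("amount") > 50)                            # Filter
    .withColumn("amount_with_tax", F.col("amount") * 1.1)    # Derived Column
    .join(customers, on="customer_id", how="inner")          # Join
    .groupBy("customer_id", "name")
    .agg(F.sum("amount_with_tax").alias("total_amount"))     # Aggregate
)
result.show()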
Q----Lookup activity

The Lookup activity can retrieve a dataset from any of the data sources supported by Data Factory and Synapse pipelines. You can use it to dynamically determine which objects to operate on in a subsequent activity, instead of hard-coding the object name. Some examples of such objects are files and tables.

Q---Copy and Get Metadata activity difference

In Azure, the Copy activity can preserve metadata, while the Get Metadata activity reads the metadata of a data item:

Copy activity
When copying files as-is from Amazon S3, Azure Data Lake Storage Gen2, Azure Blob storage, or Azure Files to Azure Data Lake Storage Gen2, Azure Blob storage, or Azure Files in binary format, the Preserve option can be found on the Copy activity > Settings tab. This option lets the service carry metadata along with the files, automatically replacing invalid characters with underscores when preserving it. The Copy activity can also preserve ACLs when copying from Azure Data Lake Storage Gen1/Gen2.

Get Metadata activity
This activity returns information about a data item's metadata, such as its file size, last modified date, and the files and folders inside a parent folder. For a table, it can return, for example, the physical name of a column, its data type, and other structural information.
Q--SCD types and explanation

Slowly Changing Dimensions (SCD) describes the strategies used to handle changes in the value of an attribute of a dimension row. Each SCD type has its own characteristics and use cases. Here are the common SCD types:

Type 0: Also known as a Fixed Dimension; changes are not allowed and the dimension never changes.

Type 1: Also known as Update Value in Place; the record is updated directly and no history of previous values is kept. It is suitable when historical data is not needed.

Type 2: Also known as Row Versioning or Preserve History (upsert); current and historical records are kept in the same table. Changes are tracked as version records with current flags, effective dates, and other metadata. This type is preferred when a versioned history must be maintained.

Type 3: Also known as Previous Value Column; changes to a specific attribute are tracked by adding a column that holds the previous value, which is overwritten as further changes occur. It is useful for tracking a limited number of changes efficiently, but has limited usability and is less popular than Types 1 and 2.

Type 4: Also known as History Table; the dimension table shows only the current value, while all changes are tracked in a separate history table.
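As an illustration of Type 1 behavior (update in place, no history), an upsert against a Delta dimension table can be sketched with the Delta Lake Python API; the table and column names are made up and spark is an existing SparkSession. A Type 2 load would instead close out the current row (end date / current flag) and insert a new version rather than overwriting.

from delta.tables import DeltaTable

# Incoming changes for a hypothetical customer dimension.
updates = spark.createDataFrame(
    [(1, "Alice", "Berlin"), (4, "Dan", "Oslo")],
    ["customer_id", "name", "city"],
)

dim = DeltaTable.forName(spark, "dim_customer")

# SCD Type 1: overwrite matched rows, insert new ones; no history is kept.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"name": "s.name", "city": "s.city"})
    .whenNotMatchedInsert(values={
        "customer_id": "s.customer_id", "name": "s.name", "city": "s.city",
    })
    .execute()
)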

Incremental load explanation

Q----Azure Databricks

Azure Databricks is a Microsoft Azure platform for big data analytics and AI that helps users build, deploy, share, and maintain enterprise-grade data, analytics, and AI solutions. It is built on Apache Spark and runs as a managed service in the Azure cloud.

Q----Issues in ADB

Common issues in Azure Databricks include:
- Invalid credentials
- Secure connection / SSL problems
- Microsoft Entra ID credential errors
- Timeout errors
- 404 errors
- Detached HEAD state in Repos
- Notebook name conflicts
- Errors suggesting that a repo be recloned

Q---Some Python code

Here is a program that finds the prime numbers up to a given number n:

def find_primes(n):
    primes = []
    for possiblePrime in range(2, n + 1):
        isPrime = True
        for num in range(2, int(possiblePrime ** 0.5) + 1):
            if possiblePrime % num == 0:
                isPrime = False
                break
        if isPrime:
            primes.append(possiblePrime)
    return primes

# Example usage:
n = 100
print(find_primes(n))

Explanation:

1. The function find_primes(n) takes an integer n as input and returns a list of the prime numbers up to n.
2. The outer loop iterates over all numbers from 2 to n (inclusive).
3. For each number, the inner loop checks whether it has any divisor between 2 and its square root. If it does, the number is not prime, so isPrime is set to False and the loop breaks.
4. If the inner loop completes without finding any divisor, isPrime remains True and the number is added to the list of primes.
5. The function returns the list of primes.

Q---How to find out the errors

Here are some ways to find and handle errors in Azure Data Factory (ADF):

Monitor the log
Go to the Monitor tab in ADF Studio, select Pipeline runs, and choose the run you want to inspect. Hover over the area next to an activity name to see icons with links to the activity's input, output, and other details.

Query ADF logs
Go to portal.azure.com, select your Data Factory resource, and scroll down to the Monitoring section in the left pane. Select Logs to open the query pane.

Handle error rows
When writing data to a database sink in ADF data flows, you can set the sink's error row handling to "Continue on error". This is an automated method that doesn't require custom logic in the data flow.

Invoke a shared error handling or logging step
If previous activities fail, you can build a pipeline that runs multiple activities in parallel and includes an If Condition activity that contains the error handling steps. Connect activities to the condition activity using the "Upon Completion" path.

Q----CDC

In Azure SQL Database, a change data capture scheduler replaces the SQL Server Agent jobs that capture and clean up change data for the source tables. The scheduler runs the capture and cleanup processes automatically within the scope of the database, ensuring reliability and performance without external dependencies.
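CDC is enabled per database and per table with the standard system stored procedures. A hedged sketch that issues them from Python with pyodbc; the connection string, schema, and table name are placeholders:

import pyodbc

# Placeholder connection string to an Azure SQL Database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<server>.database.windows.net;"
    "DATABASE=<db>;UID=<user>;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Enable CDC at the database level, then for one source table.
cursor.execute("EXEC sys.sp_cdc_enable_db")
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Orders',
         @role_name     = NULL
""")

# Changes can then be read from the generated cdc.fn_cdc_get_all_changes_* functions.
conn.close()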
Q---How to find out duplicates using Python code and SQL

There are multiple ways to check whether duplicates exist in a Python list:
- Compare the length of the list with the length of the set built from it; if they differ, the list contains duplicates.
- Walk through the list, appending each element to a second list the first time it is seen; if an element is already there, it is a duplicate.
- Check list.count() for each element; a count greater than 1 means the element is duplicated.

Below is sample code to find duplicates in a Python list:

l = [1, 2, 3, 4, 5, 2, 3, 4, 7, 9, 5]
l1 = []
for i in l:
    if i not in l1:
        l1.append(i)
    else:
        print(i, end=' ')
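The other two approaches mentioned above (set length and per-element counts) can be written as short sketches:

from collections import Counter

l = [1, 2, 3, 4, 5, 2, 3, 4, 7, 9, 5]

# Approach 1: if the set is shorter than the list, duplicates exist.
has_duplicates = len(set(l)) != len(l)
print(has_duplicates)  # True

# Approach 2: count each element once and keep those seen more than once.
duplicates = [value for value, count in Counter(l).items() if count > 1]
print(duplicates)  # [2, 3, 4, 5]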
In SQL, you can find duplicate records with a subquery (or a self-join). For example, the following subquery returns every row whose value in column_name occurs more than once:

SELECT *
FROM table_name
WHERE column_name IN (
    SELECT column_name
    FROM table_name
    GROUP BY column_name
    HAVING COUNT(*) > 1
);

Q---Auto scaling

Azure autoscale is a service that automatically scales applications and resources based on demand. It can add and remove resources to handle application load, scaling out and in (horizontally). Autoscale can be based on metrics, schedules, or rules:

Metrics
Such as CPU usage, queue length, and available memory. For example, you can scale a Virtual Machine Scale Set based on the amount of traffic on a firewall.

Schedule
Such as time patterns in your load, or scaling rules for peak business hours.

Rules
Such as conditions that define the direction of scaling, the amount to scale by, and more. For example, you can set a rule to increase the instance count by 1 when the resource's CPU percentage exceeds 70%.

You can configure autoscale settings in the Azure portal:
1. Open the Azure portal
2. Search for and select Azure Monitor
3. Select Autoscale
4. Select a resource to scale
5. Select Custom autoscale
6. Enter a name and resource group
7. Select Scale based on a metric
8. Keep the default values and select Add

You can also manage autoscaling using the Azure CLI, the REST APIs, Azure Resource Manager templates, or the Python SDK, in addition to the browser-based Azure portal.
Q---Time travel

Delta Lake time travel supports querying previous table versions based on a timestamp or a table version (as recorded in the transaction log). You can use time travel for applications such as the following:

- Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This can be useful for debugging or auditing, especially in regulated industries.
- Writing complex temporal queries.
- Fixing mistakes in your data.
- Providing snapshot isolation for a set of queries against fast-changing tables.
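A short PySpark sketch of time travel reads, assuming an existing SparkSession named spark, a Delta table at a made-up path, and a table named events; the version number and timestamp are placeholders:

# Read a Delta table path as of an earlier version recorded in the transaction log.
df_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("/mnt/delta/events")
)

# Or as of a point in time.
df_yesterday = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load("/mnt/delta/events")
)

# The same is available in SQL for a named table.
spark.sql("SELECT COUNT(*) FROM events VERSION AS OF 5").show()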
Q---Delta table

Delta tables are the default data table format in Azure Databricks and are part of the open-source Delta Lake framework. They are designed to handle streaming and batch data at scale, and are often used for data lakes where data is ingested in large batches or via streaming. Delta tables are built on Delta Lake, which provides a transactional storage layer on top of cloud object storage.

Delta tables have many features, including:
- ACID transactions
- Time travel
- Event history
- Flexibility to change content
- Multiple readers and writers
- An ordered record of transactions (the transaction log)

Delta tables also offer data versioning, schema enforcement, performance optimizations, distributed metadata, and streaming support. Azure Databricks stores all Delta Lake table data and metadata in cloud object storage, and configurations can be set at the table level or within the Spark session.

Q---Delta Live Tables

Delta Live Tables manages the flow of data between many Delta tables, simplifying ETL development and management for data engineers. The pipeline is the main unit of execution for Delta Live Tables. Delta Live Tables offers declarative pipeline development, improved data reliability, and cloud-scale production operations. Users can perform both batch and streaming operations on the same table, and the data is immediately available for querying. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. Delta Live Tables Enhanced Autoscaling can handle streaming workloads that are spiky and unpredictable.
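A minimal, illustrative Delta Live Tables pipeline definition in Python; this only runs inside a DLT pipeline on Databricks, and the storage path, table names, and expectation are made up:

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested from cloud storage")
def raw_events():
    return spark.read.format("json").load("/mnt/raw/events")

@dlt.table(comment="Cleaned events, with bad rows dropped")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def clean_events():
    return dlt.read("raw_events").where(col("event_type").isNotNull())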
Q---Vacuuming a table

The VACUUM command is an essential maintenance tool for stable and optimal database performance. It improves query performance by recovering the space occupied by "dead tuples" (dead rows) in a table, which are left behind by records that have been deleted or updated (for example, a delete followed by an insert).

When a vacuum process runs, the space occupied by these dead tuples is marked as reusable by other tuples. Vacuum databases regularly to remove dead rows.

The VACUUM command can be run in different modes, for example:

VACUUM FULL: re-sorts the data and reclaims the disk space occupied by deleted rows.
VACUUM SORT ONLY: re-sorts the data but does not reclaim disk space from deleted rows.
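In a Databricks/Delta Lake context, the analogous maintenance command is VACUUM on a Delta table, which removes data files that are no longer referenced by the table. A sketch, assuming a SparkSession named spark and a Delta table named events:

# Remove files that are no longer referenced and are older than the retention window.
spark.sql("VACUUM events")                    # default retention: 7 days (168 hours)
spark.sql("VACUUM events RETAIN 240 HOURS")   # keep 10 days of history instead

# Preview what would be deleted without actually removing anything.
spark.sql("VACUUM events DRY RUN").show()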
Q---Table A has 10 records and table B has 20 records; what is the join output?

SELECT *
FROM A
FULL OUTER JOIN B
    ON A.common_column = B.common_column;

Replace:
- A and B with the actual table names
- common_column with the column that exists in both tables and is used for joining

Note:
- SELECT * selects all columns from both tables; you can list individual columns instead if needed.
- The ON clause specifies the joining condition.
- FULL OUTER JOIN returns all records from both tables, with NULLs where there is no match. Assuming the join column is unique in both tables, the result has between 20 rows (every A row matches a B row) and 30 rows (no rows match).

Q--Difference between TRUNCATE and DROP

In SQL, the TRUNCATE and DROP commands are both used to remove data, but they have different results:

TRUNCATE
Removes all rows from a table but keeps the table's structure, columns, constraints, and indexes. It is a Data Definition Language (DDL) statement that is faster than DELETE but cannot remove specific rows based on a condition. TRUNCATE cannot be used on a table that is referenced by foreign key constraints, and an accidental TRUNCATE can generally be rolled back only if it was executed inside an explicit transaction.

DROP
Removes the entire table, along with all its data, indexes, triggers, and structure. It is a drastic operation that cannot be undone once committed, so you should back up your data before executing it. When using DROP, you need to drop referencing foreign key constraints manually, and there is no option to recover the data afterwards. DROP is typically slower than TRUNCATE because it removes the entire table definition and its indexes.

Q----Restrict null values (answer using COALESCE)

COALESCE provides a simple way to evaluate multiple expressions and return the first non-null expression, based simply on the order of the expression list. You can specify your own value to use in the event that all of your expression parameters evaluate to NULL. In situations where a non-NULL value is required, COALESCE provides a way to guarantee one.

Syntax:

COALESCE ( expression [ ,...n ] )

The simple example below uses COALESCE to return an alternative value for a column that IS NULL. The following T-SQL query returns the [Tier] for a customer, or 'NONE' if the customer's [Tier] IS NULL:

SELECT [Name], COALESCE([Tier], 'NONE') AS [Tier]
FROM [dbo].[Customer]
Q---How do you get notified? (answer: e.g., using triggers, email notifications, Teams channels)

Q---CI/CD pipeline: what if the pipeline fails? Give at least two errors in a pipeline

When a CI/CD pipeline build fails, it can be caused by a number of reasons, including credential problems. For example, the credentials may no longer be valid or the permissions may have changed. If your organization has a DevOps team, they may be able to help resolve these issues, because they usually have admin access to most services and are familiar with the permissions needed.

Other challenges that Agile teams may face with CI/CD pipelines include:

Flawed tests: tests exist to ensure quick code deployment and high-quality builds; flawed or flaky tests undermine both.

Poor team communication: lack of cooperation and transparency can lead to project failure.

Security threats: these can include insecure code from third-party sources, unauthorized access to source code repositories, and breaches in dev or test environments.

Here are some examples of pipeline errors:
- Job time-out
- Issues downloading code
- Command-line step failures
- File or folder in use errors
- MSBuild failures
- Process stops responding
- Line endings for multiple platforms
- Variables with single quotes appended
- Service connection issues
- Pipeline stops hearing from the agent

Some best practices for securing CI/CD pipelines include:
- Access control, including authentication mechanisms, roles, and permissions
- Code scanning
- Vulnerability management
- Secure environment configurations
