Az Questions

Impacts of ADF pipeline failures:
1. Data processing interruptions
4. Delayed or missed deadlines
5. Increased costs due to re-runs or manual interventions

Example 1:
Issue: "Failed to read data from source" error in a Copy Data activity.
Root Cause: Incorrect credentials or permissions for the source dataset.
Fix:
1. Verify the credentials used for the source dataset connection.
2. Ensure the credentials have the necessary permissions to read data from the source.
3. Update the credentials or permissions as needed.
4. Re-run the pipeline.

Example 2:
Issue: "Invalid column name" error in a Data Flow activity.
Fix:
3. Re-map the columns in the Data Flow activity as needed.

General troubleshooting steps:
- Check the ADF pipeline run history and activity logs for detailed error messages (a programmatic sketch follows at the end of this answer).
- Validate data types, formats, and schema consistency throughout the pipeline.
- Test individual activities and data flows in isolation to identify the root cause.
- Review and update pipeline dependencies and data flows as needed.

Common ADF pipeline failure issues include:
- Connection and credential issues
- Data type and format mismatches
- Schema inconsistencies
- Dependency and data flow errors
- Resource constraints and timeouts

By identifying the root cause and applying the appropriate fix, you can resolve ADF pipeline failures and ensure reliable data processing.
Q----Types of integration runtimes
Azure Data Factory offers three types of integration runtimes (IR):
Azure integration runtime: A fully managed, elastic compute that can connect to Azure and other publicly accessible data sources.
Self-hosted integration runtime: Can connect to data sources in an on-premises network or a virtual network.
Azure-SSIS integration runtime: Designed to natively run SQL Server Integration Services (SSIS) packages in ADF.

Q----Linked services and types
In Azure Data Factory (ADF), a linked service defines a data store or compute service and acts as a connection string used to deliver data from various sources and enable data integration workflows.
All the linked service types are supported for parameterization.
Natively supported in UI: When authoring a linked service in the UI, the service provides a built-in parameterization experience for the following types of linked services. In the linked service creation/edit blade, you can find options to create new parameters and add dynamic content. A few of these types are:
Azure Data Lake Storage Gen2, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Databricks, Azure Key Vault, Azure SQL Database, Azure SQL Managed Instance, Azure Table Storage, etc.

Q---Triggers
Event hub trigger
Designed to ingest large amounts of data from many mobile or IoT devices, this trigger is useful for scalable Azure Functions.
Schedule trigger
This trigger runs based on a schedule defined in the trigger, with options for minutes, hours, weeks, and months.
HTTP trigger
This trigger allows functions to be invoked by HTTP requests, which can be used to build API endpoints and handle incoming HTTP requests (see the sketch at the end of this section).
In YAML pipelines, different versions of the pipeline in different branches can affect which version of the pipeline's triggers are evaluated.
Trigger concurrency
This limit specifies the maximum number of Logic App instances that can run in parallel.
Azure Cosmos DB trigger
This trigger creates a function based on provided values. For example, a function template can write the number of documents and the first document ID to the logs.
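As a minimal illustration of the HTTP trigger mentioned above, here is a sketch of an HTTP-triggered Azure Function using the Python v2 programming model; the route name and greeting logic are illustrative only.

import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)

# Hypothetical route: https://<function-app>.azurewebsites.net/api/hello?name=ADF
@app.route(route="hello")
def hello(req: func.HttpRequest) -> func.HttpResponse:
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!")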
Q---Transformation types
In Azure Data Factory (ADF), there are several transformation types that can be used to manipulate and process data. Here are some of the most common transformation types in ADF:
1. Aggregate: Performs aggregation operations like SUM, AVG, MAX, MIN, etc. on data.
2. Alter Row: Modifies individual rows in a dataset based on conditions.
3. Append: Combines two or more datasets into a single dataset.
4. Copy: Copies data from a source to a destination without modifying it.
5. Data Conversion: Converts data types, formats, or encoding.
6. Data Validation: Validates data against a set of rules or constraints.
7. Derived Column: Creates new columns based on existing columns.
8. Filter: Filters data based on conditions.
9. Join: Combines data from two or more datasets based on a common column.
10. Lookup: Looks up values in a reference dataset.
11. Pivot: Rotates data from rows to columns or vice versa.
12. Sort: Sorts data in ascending or descending order.
13. Split: Splits a single dataset into multiple datasets.
14. Union: Combines two or more datasets into a single dataset.
15. Unpivot: Rotates data from columns to rows.
16. Upsert: Updates existing data or inserts new data if it doesn't exist.
17. XML Transformation: Transforms XML data into a different format.
These transformation types can be used individually or in combination to create complex data processing pipelines in ADF.

Q----Look up activity
Lookup activity can retrieve a dataset from any of the data sources supported by Data Factory and Synapse pipelines. You can use it to dynamically determine which objects to operate on in a subsequent activity, instead of hard coding the object name. Some object examples are files and tables.

Q---Copy and get metadata activity difference
In Azure, the Copy activity can preserve metadata, while the Get Metadata activity deals with the metadata of a data file:
Copy activity
When copying files as-is from Amazon S3, Azure Data Lake Storage Gen2, Azure Blob storage, or Azure Files to Azure Data Lake Storage Gen2, Azure Blob storage, or Azure Files with binary format, the Preserve option can be found in the Copy Activity > Settings tab. This option allows the service to automatically replace invalid characters with underscores when preserving metadata. The Copy activity can also preserve ACLs when copying from Azure Data Lake Storage Gen1/Gen2.
Get Metadata activity
This activity can provide information about a data file's metadata, such as its file size, last modified date, and the files and folders in a main folder. For example, it can show the physical name of a column, its data type, and other information.
Q--SCD types and explanation
Slowly Changing Dimensions (SCD) is a term used to describe the actions taken to support changes in the value of an attribute in a dimension member of a row. Each type of SCD has its own characteristics and use cases. Here are some common SCD types:
Type 0: Also known as a Fixed Dimension, this type does not allow changes and the dimension never changes.
Type 1: Also known as Update Value in Place, this type updates the record directly and does not maintain a record of historical values. It's suitable when historical data is not necessary.
Type 2: Also known as Row Versioning or Preserve History (UPSERT), this type keeps current and historical records in the same file or table. It tracks changes as version records with current flags, active dates, and other metadata. This type is preferred when maintaining a versioned history is important (see the sketch after this list).
Type 3: Also known as Previous Value Column, this type tracks changes to a specific attribute by adding a column to show the previous value. This value is updated as further changes occur. This type is beneficial when tracking a limited number of changes efficiently. However, it has limited usability and is less popular than Types 1 and 2.
Type 4: Also known as History Table, this type shows the current value in the dimension table but tracks all changes in a separate table.
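Below is a minimal PySpark/Delta Lake sketch of a Type 2 load, assuming a Databricks environment where spark is predefined, a hypothetical Delta dimension table dim_customer (with customer_id, name, is_current, effective_date, end_date columns), and a DataFrame updates_df already filtered down to new or changed customers. All names are illustrative, not a definitive implementation.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Illustrative incoming change; in practice updates_df would come from a staging query.
updates_df = spark.createDataFrame([("C001", "Alice Smith")], ["customer_id", "name"])

# Hypothetical existing Type 2 dimension table.
dim = DeltaTable.forName(spark, "dim_customer")

# Step 1: expire the current version of any customer present in the updates.
(dim.alias("t")
    .merge(updates_df.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false",
                            "end_date": "current_date()"})
    .execute())

# Step 2: append the incoming rows as the new current versions.
(updates_df
    .withColumn("is_current", F.lit(True))
    .withColumn("effective_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("dim_customer"))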
Azure Databricks is a Microsoft Azure platform for big data analytics and AI that helps users build, deploy, share, and maintain enterprise-grade data, analytics, and AI solutions. It's built on Apache Spark computing technology, which can run on-premises or in the cloud.

Q----Issues in ADB
Common issues in Azure Databricks include:
- Invalid credentials
- Secure connection / SSL problems
- Microsoft Entra ID credentials errors
- Timeout errors
- 404 errors
- Detached head state
- Notebook name conflicts
- Errors that suggest recloning

Q---Some Python code
Here is a program to find the prime numbers up to a given number:
def find_primes(n):
    primes = []
    for possiblePrime in range(2, n + 1):
        isPrime = True
        # Check divisors up to the square root of the candidate.
        for num in range(2, int(possiblePrime ** 0.5) + 1):
            if possiblePrime % num == 0:
                isPrime = False
                break
        if isPrime:
            primes.append(possiblePrime)
    return primes

# Example usage:
n = 100
print(find_primes(n))

Explanation:
1. The function find_primes(n) takes an integer n as input and returns a list of prime numbers up to n.
2. The outer loop iterates over all numbers from 2 to n (inclusive).
3. For each number, the inner loop checks if it has any divisors other than 1 and itself. If it does, it's not a prime number, so isPrime is set to False and the loop breaks.
4. If the inner loop completes without finding any divisors, isPrime remains True, and the number is added to the list of primes.
5. The function returns the list of primes.

Error handling options in ADF pipelines and data flows:
Handle error rows
When writing data to a database sink in ADF data flows, you can set the sink error row handling to "Continue on Error". This is an automated method that doesn't require custom logic in the data flow.
Invoke a shared error handling or logging step
If previous activities fail, you can build a pipeline that runs multiple activities in parallel and includes an If Condition to contain the error handling steps. Connect activities to the condition activity using the "Upon Completion" path.

Q----CDC
In Azure SQL Database, a change data capture scheduler replaces the SQL Server Agent jobs that capture and clean up change data for the source tables. The scheduler runs the capture and cleanup processes automatically within the scope of the database, ensuring reliability and performance without external dependencies.
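Change data capture is enabled per database and per table with system stored procedures. A minimal sketch, assuming a pyodbc connection to an Azure SQL Database and a hypothetical dbo.Orders table; connection details are placeholders.

import pyodbc

# Placeholder connection string - replace server, database, and credentials.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<database>;"
    "Uid=<user>;Pwd=<password>;Encrypt=yes;",
    autocommit=True)
cursor = conn.cursor()

# Enable CDC at the database level, then for the hypothetical dbo.Orders table.
cursor.execute("EXEC sys.sp_cdc_enable_db")
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Orders',
         @role_name     = NULL
""")

# Captured changes are then readable from the generated change table.
for row in cursor.execute("SELECT TOP 10 * FROM cdc.dbo_Orders_CT"):
    print(row)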
Q---How to find out the errors
Here are some ways to find errors in Azure Data Factory (ADF):
Monitor the log
Go to the Monitoring tab in ADF Studio, select Pipeline runs, and then
choose the run you want to monitor. Hover over the area next to the
Activity name to see icons with links to the pipeline's input, output, and
other details.
Query ADF logs
Go to portal.azure.com, open your data factory resource, and then scroll down to the Monitoring section in the left pane. Select Logs to open the query pane.
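If diagnostic settings send ADF logs to a Log Analytics workspace (resource-specific mode), the same logs can be queried programmatically. A minimal sketch using the azure-monitor-query package; the workspace ID is a placeholder and the table/column names assume resource-specific ADF logging.

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Failed pipeline runs over the last day (ADFPipelineRun is the resource-specific table).
query = """
ADFPipelineRun
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Status
| order by TimeGenerated desc
"""

response = client.query_workspace("<log-analytics-workspace-id>", query,
                                  timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)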
Q---How to find out duplicates using python code and sql
There are multiple ways to check whether duplicates exist in a Python list:
- The length of the list and the length of its set are different.
- Check each element against a collection of seen values: if it is already there, it is a duplicate; if not, append it.
- Check list.count() for each element.
Below is sample code to find duplicates in a Python list:
l = [1, 2, 3, 4, 5, 2, 3, 4, 7, 9, 5]
l1 = []
for i in l:
    if i not in l1:
        l1.append(i)
    else:
        print(i, end=' ')

In SQL, you can find duplicate records by using a subquery or a self-join. For example, with a subquery:
SELECT * FROM table_name WHERE column_name IN (SELECT column_name FROM table_name GROUP BY column_name HAVING COUNT(*) > 1);
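As an alternative to the loop above, the standard library's collections.Counter can report which values occur more than once:

from collections import Counter

l = [1, 2, 3, 4, 5, 2, 3, 4, 7, 9, 5]

# Count occurrences, then keep the values that appear more than once.
duplicates = [value for value, count in Counter(l).items() if count > 1]
print(duplicates)  # [2, 3, 4, 5]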
Q---Auto scaling
Azure autoscale is a service that automatically scales applications and resources based on demand. It can add and remove resources to handle application load, and can scale in and out, or horizontally. Autoscale can be based on metrics, schedules, or rules:
Metrics
Such as CPU usage, queue length, and available memory. For example, you can scale a Virtual Machine Scale Set based on the amount of traffic on a firewall.
Schedule
Such as time patterns in your load, or scaling rules for peak business hours.
Rules
Such as conditions that define the direction of scaling, the amount to scale by, and more. For example, you can set a rule to increase the instance count by 1 when the resource's CPU percentage exceeds 70%.
You can configure autoscale settings in the Azure portal:
1. Open the Azure portal.
2. Search for and select Azure Monitor.
3. Select Autoscale.
4. Select a resource to scale.
5. Select Custom autoscale.
6. Enter a name and resource group.
7. Select Scale based on a metric.
8. Keep the default values and select Add.
You can also manage autoscaling using the Azure CLI, the REST APIs, Azure Resource Manager, the Python SDK, or the browser-based Azure portal.
Q---Time travel
Delta Lake time travel supports querying previous table versions based on timestamp or table version (as recorded in the transaction log). You can use time travel for applications such as the following:
- Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries.
- Writing complex temporal queries.
- Fixing mistakes in your data.
- Providing snapshot isolation for a set of queries for fast-changing tables.
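A minimal PySpark sketch of time travel, assuming a Databricks notebook where spark is predefined and a Delta table exists at the hypothetical path /mnt/delta/events (also registered as the table events):

# Read the table as it was at a specific version number...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")

# ...or as it was at a specific point in time (must be within the retention period).
yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/mnt/delta/events")

# The same thing in SQL, via spark.sql:
old = spark.sql("SELECT * FROM events VERSION AS OF 0")
old.show()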
Q---Delta table
Delta tables are the default data table format in Azure Databricks and are part of the open-source Delta Lake data framework. They are designed to handle streaming and batch data on large feeds, and are often used for data lakes where data is ingested in large batches or via streaming. Delta tables are built on top of Delta Lake, which offers a transactional storage layer for cloud-stored data.
Delta tables have many features, including:
- ACID transactions
- Time travel
- Event history
- Flexibility to change content
- Multiple readers and writers
- An ordered record of transactions (the transaction log)
Delta tables also offer data versioning, schema enforcement, performance optimizations, distributed metadata, and streaming support.
Azure Databricks stores all Delta Lake table data and metadata in cloud object storage, and configurations can be set at the table level or within the Spark session.
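A minimal sketch of writing and inspecting a Delta table, again assuming a Databricks notebook with spark predefined; the table name events and the sample rows are illustrative:

from pyspark.sql import Row

# Write a small DataFrame as a managed Delta table (Delta is the default format
# on Databricks, but it is spelled out here for clarity).
df = spark.createDataFrame([Row(id=1, status="new"), Row(id=2, status="done")])
df.write.format("delta").mode("overwrite").saveAsTable("events")

# Every write is recorded in the transaction log; DESCRIBE HISTORY shows the versions.
spark.sql("DESCRIBE HISTORY events").select("version", "operation", "timestamp").show()

# Read it back like any other table.
spark.table("events").show()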
Q---Live delta table
Delta Live Tables manages the flow of data between many Delta tables, thus simplifying the work of data engineers on ETL development and management. The pipeline is the main unit of execution for Delta Live Tables. Delta Live Tables offers declarative pipeline development, improved data reliability, and cloud-scale production operations. Users can perform both batch and streaming operations on the same table, and the data is immediately available for querying. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. Delta Live Tables Enhanced Autoscaling can handle streaming workloads which are spiky and unpredictable.
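A minimal Delta Live Tables sketch in Python, assuming it runs inside a DLT pipeline (where the dlt module and spark are provided) and a hypothetical raw JSON landing path:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events loaded from a hypothetical landing folder")
def raw_events():
    return spark.read.format("json").load("/mnt/landing/events")

@dlt.table(comment="Cleaned events with a basic data quality expectation")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
def clean_events():
    return dlt.read("raw_events").withColumn("ingested_at", F.current_timestamp())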
Q---Vacuuming table
The VACUUM command is an essential maintenance tool for stable and optimal database performance. It improves query performance by recovering space occupied by "dead tuples", or dead rows, in a table caused by records that have been deleted or updated (e.g., a delete followed by an insert).
When a vacuum process runs, the space occupied by these dead tuples is marked as reusable by other tuples. Vacuum databases regularly to remove dead rows.
The VACUUM command can be used in two ways:
VACUUM FULL: Re-sorts the data in a table and reclaims the space left by deleted rows.
VACUUM SORT ONLY: Re-sorts the data, but does not reclaim disk space from deleted rows.
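In the Azure Databricks / Delta Lake context covered elsewhere in these notes, VACUUM has a related but distinct meaning: it removes data files that are no longer referenced by the Delta table and are older than the retention threshold. A minimal sketch, assuming spark is predefined and a Delta table named events exists (illustrative name):

# Preview which files would be deleted, without removing anything.
spark.sql("VACUUM events RETAIN 168 HOURS DRY RUN").show(truncate=False)

# Remove unreferenced files older than the default 7-day retention window.
spark.sql("VACUUM events RETAIN 168 HOURS")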
Q---Table A and B: A table has 10 records, B table has 20 records, joins output
SELECT * FROM A FULL OUTER JOIN B ON A.common_column = B.common_column;
Replace:
- A and B with the actual table names
- common_column with the column name that exists in both tables and is used for joining
Note:
- SELECT * means select all columns from both tables. You can specify individual columns instead, if needed.
- The ON clause specifies the joining condition.
- The FULL OUTER JOIN keyword indicates that we want to include all records from both tables, with null values where there are no matches.
- Assuming common_column is unique in both tables, the full outer join returns between 20 rows (if every row of A matches a row in B: 10 matched plus 10 unmatched from B) and 30 rows (if nothing matches), since unmatched rows from either side are still returned.

Q-- diff b/w truncate and drop
In SQL, the TRUNCATE and DROP commands are both used to remove data from a table, but they have different results:
TRUNCATE
Removes all rows from a table, but keeps the table's structure, columns, constraints, and indexes. It's a Data Definition Language (DDL) statement that's faster than DELETE but can't remove specific rows based on a condition. TRUNCATE cannot be used on a table that is referenced by a foreign key constraint, and in SQL Server it can be rolled back if it is run inside a transaction.
DROP
Removes an entire table, along with all its data, indexes, triggers, and structure. It's a strong move that can't be reversed, so you should back up your data before executing it. When using DROP, you need to drop referencing foreign key constraints first, and there's no option to recover the data. DROP is slower than TRUNCATE because it requires more resources to delete the entire table and its indexes.

Q---- Restrict null values answer using Coalesce
COALESCE provides a simple way to evaluate multiple expressions and determine the first non-null expression based simply on the order of the expression list. You can specify your own value to use in the event all of your expression parameters evaluate to NULL. There are situations where we require a non-NULL value, and COALESCE provides a way for us to ensure that we get one.
Syntax
COALESCE ( expression [ ,...n ] )
This simple example uses COALESCE to return an alternative value for a column that IS NULL. The following T-SQL query returns the [Tier] for a customer, or NONE if the customer [Tier] IS NULL:
SELECT [Name], COALESCE([Tier], 'NONE') AS [Tier]
FROM [dbo].[Customer]
Q--- How you get notified (answer like using triggers, email notification, Teams channels)
Q--- CI/CD pipeline - what if pipelines fail
Give at least 2 errors in pipeline
When a CI/CD pipeline build fails, it can be caused by a number of reasons, including credential problems. For example, the credentials may no longer be valid or the permissions may have changed. If your organization has a DevOps team, they may be able to help resolve these issues because they usually have admin access to most services and are familiar with the permissions needed.
Here are some examples of pipeline errors:
- Job time-out
- Downloading code issues
- Command-line step failures
- File or folder in use errors
- MSBuild failures
- Process stops responding
- Line endings for multiple platforms
- Variables with single quotes appended
- Service connection issues
- Pipeline stops hearing from the agent
Other challenges that Agile teams may face with CI/CD pipelines include:
Flawed tests: Tests are created to ensure quick code deployment and high-quality builds; flawed tests undermine both.
Poor team communication: Lack of cooperation and transparency can lead to project failure.
Security threats: These can include insecure code from third-party sources, unauthorized access to source code repositories, and breaches in dev or test environments.
Some best practices for securing CI/CD pipelines include:
- Access control, including authentication mechanisms, roles, and permissions
- Vulnerability management
- Secure environment configurations
- Code scanning