Data Engineering 101 - Azure Synapse Analytics

Uploaded by

Md Zia

Data Engineering 101: Azure Synapse Analytics

DATA ENGINEERING 101


Azure Synapse
Analytics

Shwetank Singh
GritSetGrow - GSGLearn.com

Azure Synapse Analytics


An integrated analytics service combining big data and
data warehousing, providing the ability to analyze data
across data lakes, data warehouses, and big data
systems.

Use Synapse Studio to create an end-to-end
analytics solution by integrating data from Azure
Data Lake, running transformations using Spark,
and loading the data into a dedicated SQL pool for
reporting and analysis.


Dedicated SQL Pool


A provisioned data warehousing service offering
predictable performance with dedicated resources.
Provides massively parallel processing (MPP) for
large-scale analytics.

Creating a Dedicated SQL Pool:

CREATE DATABASE mydedicatedsqlpool
(EDITION = 'datawarehouse', SERVICE_OBJECTIVE = 'DW1000c');

This script creates a dedicated SQL pool with a
service objective of DW1000c, which sets the amount
of compute (data warehouse units) available to the
pool. The options follow the database name directly,
without a WITH keyword.


Serverless SQL Pool


An on-demand, distributed query engine that allows
querying data stored in Azure Data Lake without
needing to provision any infrastructure.

Querying a CSV file in Azure Data Lake:

SELECT * FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/sample.csv',
FORMAT = 'CSV', PARSER_VERSION = '2.0') AS [result];

This query reads data directly from a CSV file stored
in Azure Data Lake without provisioning a dedicated
SQL pool. PARSER_VERSION = '2.0' lets the serverless
pool infer the columns when no WITH clause lists them.
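
When the same kind of query is needed for many files, it can help to generate the statement text. The following is a minimal Python sketch, not part of any Synapse SDK: build_openrowset_query is a hypothetical helper, TOP 10 is added only to keep exploratory reads small, and the two accepted formats are a simplification for this example.

```python
def build_openrowset_query(file_url: str, file_format: str = "CSV") -> str:
    """Return the text of a serverless SQL pool query that reads one file."""
    fmt = file_format.upper()
    if fmt not in {"CSV", "PARQUET"}:
        raise ValueError(f"unsupported format for this sketch: {file_format}")
    # Double single quotes so the URL stays inside the T-SQL string literal.
    safe_url = file_url.replace("'", "''")
    options = f"FORMAT = '{fmt}'"
    if fmt == "CSV":
        # Parser version 2.0 allows column inference for CSV files.
        options += ", PARSER_VERSION = '2.0'"
    return (
        f"SELECT TOP 10 * FROM OPENROWSET(BULK '{safe_url}', {options}) "
        "AS [result];"
    )


query = build_openrowset_query(
    "https://mydatalake.blob.core.windows.net/data/sample.csv"
)
print(query)
```

The returned string can be pasted into a SQL script in Synapse Studio or sent through any SQL client connected to the serverless endpoint.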


Spark Pool
An Apache Spark cluster integrated with Azure Synapse
for large-scale data processing and machine learning.
Allows running Spark jobs using Scala, Python, SQL,
and R.

Submitting a Spark Job in Synapse:

spark.sql("SELECT col1, SUM(col2) FROM my_table GROUP BY col1").show()

This PySpark code runs a simple aggregation query
on a table registered in the Spark catalog. The
results are displayed within the Synapse notebook
environment.


Synapse Pipelines
An orchestration tool that allows for the creation,
scheduling, and monitoring of data workflows. Enables
integration of data sources and automation of data
movement and transformation tasks.

Building an ETL Pipeline:

1. Create a pipeline that extracts data from an
on-premises SQL Server using the "Copy Data" activity.
2. Apply transformations using the "Data Flow"
activity.
3. Load the transformed data into a dedicated SQL
pool or a data lake.


Synapse Studio
A unified interface that provides tools for data
exploration, transformation, integration, and
visualization. It allows for managing Synapse resources,
running SQL queries and Spark jobs, and building
pipelines.

Running SQL and Spark Jobs:

Use Synapse Studio to open a SQL script and execute
a query like SELECT * FROM SalesData on a
dedicated SQL pool.

Simultaneously, you can open a notebook and run
PySpark code on a Spark pool to transform large
datasets.


Data Integration
Integration of various services such as Azure Data
Factory, Power BI, and Azure Machine Learning with
Synapse Analytics for comprehensive data processing
and analysis.

Integration with Azure Data Factory:

Create a pipeline in Azure Data Factory that ingests
data from multiple sources, processes it in Synapse
(e.g., using a Spark pool for transformation), and
then makes the results available to Power BI for
near-real-time visualization and reporting.


Synapse Notebooks
Interactive notebooks that support multi-language code
execution, enabling data scientists and engineers to
explore data, build models, and collaborate on
data-driven projects.

Using Synapse Notebooks with PySpark:

from pyspark.sql import SparkSession

# A Synapse notebook already provides a session named spark;
# getOrCreate() returns it instead of starting a new one.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.csv('/path/to/data.csv')
df.show()

This example shows how to load and display a CSV
file in Spark.


Synapse Link
A feature that allows for real-time, operational
analytics by enabling seamless connectivity between
Azure Cosmos DB and Azure Synapse Analytics, without
ETL processes.

Real-time Analytics with Synapse Link:

Enable Synapse Link in Azure Cosmos DB to replicate
operational data to Azure Synapse Analytics in near
real-time.

Use Synapse Studio to run analytics on this data as
soon as it arrives, without waiting for scheduled ETL
processes.


Data Flow
A visual, no-code interface within Synapse Pipelines
that allows for complex data transformations, such as
joins, aggregations, and data cleansing, directly within
the pipeline.

Creating a Data Flow:

1. Use the Data Flow designer to define
transformations: drag and drop steps like
"Source," "Aggregate," "Filter," and "Sink."

2. Set up a source to load data from an Azure Data
Lake, perform transformations, and then write the
output to a dedicated SQL pool.
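
The shape of such a flow can be sketched in plain Python. This is only an illustration of the source, filter, aggregate, sink chain described above, not how Synapse executes data flows (those run on Spark); the rows and the function names are made up for the example.

```python
from collections import defaultdict


def source():
    # Hypothetical rows standing in for files read from the data lake.
    return [
        {"region": "east", "amount": 120.0},
        {"region": "west", "amount": -5.0},  # invalid record
        {"region": "east", "amount": 30.0},
        {"region": "west", "amount": 75.0},
    ]


def filter_rows(rows):
    # "Filter" step: keep only rows with a positive amount.
    return [row for row in rows if row["amount"] > 0]


def aggregate(rows):
    # "Aggregate" step: total amount per region.
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def sink(totals):
    # "Sink" step: stand-in for writing to a dedicated SQL pool table.
    return sorted(totals.items())


output = sink(aggregate(filter_rows(source())))
print(output)
```

Each function corresponds to one box in the visual designer, which is why the order of the calls mirrors the left-to-right order of the flow.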


Monitoring & Management


Provides monitoring tools within Synapse Studio for
tracking the performance and health of SQL queries,
pipelines, and Spark jobs. Includes features like alerts,
activity logs, and resource usage metrics.

Monitoring a Pipeline:

In Synapse Studio, navigate to the "Monitor" tab to
track the status and performance of running
pipelines, see detailed logs for each activity, and set
up alerts to notify you of pipeline failures or
performance issues.


Security & Compliance


Ensures data protection with built-in features like
Transparent Data Encryption (TDE), Virtual Network
Service Endpoints, and role-based access control
(RBAC). Compliance tools help meet regulatory
requirements.

Implementing Data Security:

Use TDE to encrypt all data in your dedicated SQL
pool:

ALTER DATABASE mydatabase SET ENCRYPTION ON;

Apply RBAC to restrict access to sensitive tables,
allowing only authorized users to query or modify
the data.


Workload Management
Techniques for managing and optimizing resource
allocation, query performance, and concurrency in
dedicated SQL pools. It includes features like workload
groups and resource classes.

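Configuring Workload Management:

CREATE WORKLOAD GROUP high_priority WITH
(MIN_PERCENTAGE_RESOURCE = 25,
CAP_PERCENTAGE_RESOURCE = 100,
REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5,
IMPORTANCE = HIGH);

Assign a critical workload to the "high_priority"
group (through a workload classifier) to ensure it
gets the necessary resources and runs efficiently,
even during peak usage. CAP_PERCENTAGE_RESOURCE and
REQUEST_MIN_RESOURCE_GRANT_PERCENT are required
options; the values shown are examples. IMPORTANCE
takes an unquoted keyword.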


Data Lake Integration


Seamlessly integrates with Azure Data Lake Storage
(ADLS) to provide a scalable, secure, and cost-effective
solution for storing and analyzing large datasets in a
distributed environment.

Querying Data in ADLS:

SELECT * FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/transactions.parquet',
FORMAT = 'PARQUET') AS [result];

Query Parquet files stored in ADLS using serverless
SQL without moving data into a dedicated SQL pool.


Data Partitioning
Dividing large tables into smaller, more manageable
partitions to improve query performance and
scalability. Typically used in dedicated SQL pools to
handle large datasets efficiently.

Implementing Partitioning:

CREATE TABLE Sales ([Date] DATE, Amount FLOAT) WITH
(DISTRIBUTION = ROUND_ROBIN,
PARTITION ([Date] RANGE RIGHT FOR VALUES
('2021-01-01', '2022-01-01')));

This script creates a table partitioned by date. The
two boundary values produce three partitions: dates
before 2021-01-01, dates in 2021, and dates from
2022-01-01 onward.
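
To see how boundary values map rows to partitions, here is a small Python sketch. It assumes the two boundaries from the example (2021-01-01 and 2022-01-01) and a rule in which each boundary value belongs to the partition on its right, which T-SQL calls RANGE RIGHT; partition_number is a hypothetical helper, not a Synapse function.

```python
from bisect import bisect_right
from datetime import date

# Boundary values taken from the example table definition.
BOUNDARIES = [date(2021, 1, 1), date(2022, 1, 1)]


def partition_number(d: date) -> int:
    """Return the 1-based partition for d.

    bisect_right counts the boundaries that are <= d, so a date equal
    to a boundary lands in the partition to the right of that boundary.
    """
    return bisect_right(BOUNDARIES, d) + 1


print(partition_number(date(2020, 6, 30)))  # before the first boundary
print(partition_number(date(2021, 1, 1)))   # exactly on the first boundary
print(partition_number(date(2023, 3, 15)))  # after the last boundary
```

A query that filters on a date range only needs to read the partitions whose numbers fall inside that range, which is where the performance benefit comes from.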


PolyBase
A technology that enables the querying of external data
stored in sources like Hadoop or Azure Blob Storage as
if it were within a relational database.

Querying External Data with PolyBase:

CREATE EXTERNAL TABLE myexternaldata (id INT, name VARCHAR(50))
WITH ( LOCATION = '/data/', DATA_SOURCE = my_blob_storage,
FILE_FORMAT = myfileformat);

This script creates an external table pointing to the
data/ folder in Blob Storage. LOCATION is a path
relative to the external data source; the storage
account URL itself is defined once, in my_blob_storage.


Materialized Views
Precomputed views that store the results of a query
physically, allowing for faster retrieval times and
reduced query execution time by eliminating the need
to recompute results.

Creating a Materialized View:

CREATE MATERIALIZED VIEW mv_sales
WITH (DISTRIBUTION = HASH(product_id))
AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM dbo.Sales GROUP BY product_id;

This script creates a materialized view that
precomputes total sales by product, speeding up
future queries on this data. In a dedicated SQL pool
the DISTRIBUTION option is required.
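
The trade-off behind a materialized view can be sketched in Python: the stored totals are updated when rows arrive, so reads become a lookup instead of a scan. SalesWithView is a hypothetical class written for this illustration; the engine maintains real materialized views internally and differently.

```python
class SalesWithView:
    """Toy base table plus a precomputed total-per-product 'view'."""

    def __init__(self):
        self.rows = []      # base table: (product_id, sales_amount)
        self.mv_sales = {}  # stored result: product_id -> total_sales

    def insert(self, product_id, sales_amount):
        self.rows.append((product_id, sales_amount))
        # Keep the stored total in step with the base table.
        current = self.mv_sales.get(product_id, 0.0)
        self.mv_sales[product_id] = current + sales_amount

    def total_from_base(self, product_id):
        # Recomputes the answer by scanning every row.
        return sum(amount for pid, amount in self.rows if pid == product_id)

    def total_from_view(self, product_id):
        # Reads the precomputed answer.
        return self.mv_sales.get(product_id, 0.0)


table = SalesWithView()
for pid, amount in [(1, 20.0), (2, 5.0), (1, 12.5), (3, 9.0)]:
    table.insert(pid, amount)

print(table.total_from_base(1), table.total_from_view(1))
```

Both calls return the same total; the difference is that the scan grows with the table while the lookup does not, at the cost of extra work on every insert.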


Columnstore Indexes
A type of index optimized for read-heavy workloads in
large datasets, providing significant compression and
performance improvements for analytical queries in
dedicated SQL pools.

Creating a Columnstore Index:

CREATE CLUSTERED COLUMNSTORE INDEX idx_sales ON Sales;

This script creates a columnstore index on the Sales
table, improving query performance for large-scale
analytical queries by compressing data and
reducing I/O.
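
Why column-wise storage compresses well can be shown with a deliberately simplified Python sketch. It uses run-length encoding as a stand-in: real columnstore indexes combine several encodings, and the sample rows here are invented for the illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical Sales rows as a rowstore would hold them: (region, year).
rows = [("east", 2021)] * 4 + [("west", 2021)] * 3 + [("west", 2022)] * 2

# Column-wise layout: one list per column.
region_column = [row[0] for row in rows]
year_column = [row[1] for row in rows]

print(run_length_encode(region_column))
print(run_length_encode(year_column))
```

Nine values per column shrink to two pairs each because similar values sit next to each other once the data is stored by column, and a query that needs only one column reads only that column.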


Data Encryption
Techniques to protect sensitive data within Synapse
Analytics, both at rest and in transit, using encryption
methods like TDE and SSL/TLS.

Encrypting Data at Rest:

ALTER DATABASE mydatabase SET ENCRYPTION ON;

Enable TDE to ensure that all data in your dedicated
SQL pool is encrypted at rest, providing an
additional layer of security for sensitive information.


Scaling and Performance


Methods and tools to scale Synapse resources up or
down based on demand, optimize query performance
through indexing, partitioning, and statistics
management, and ensure efficient use of resources.

Scaling a Dedicated SQL Pool:

ALTER DATABASE mydedicatedsqlpool MODIFY
(SERVICE_OBJECTIVE = 'DW2000c');

This script scales up a dedicated SQL pool to
DW2000c, providing more resources for better
performance during high-demand periods.


Data Governance
Features and integrations that enable data cataloging,
classification, and governance within Synapse Analytics,
ensuring data is managed according to organizational
policies and regulatory requirements.

Implementing Data Governance with Azure Purview:

Integrate Azure Purview (since renamed Microsoft
Purview) with Synapse Analytics to automatically
catalog and classify data assets.

Apply data masking policies to protect sensitive
information based on classification.


Integration with Power BI


Allows seamless connectivity between Synapse
Analytics and Power BI for data visualization and
reporting, enabling real-time analytics and sharing
insights across the organization.

Connecting Power BI to Synapse:

Use the "DirectQuery" mode in Power BI to connect
to a Synapse dedicated SQL pool.

Create dashboards that refresh in real-time as data
changes in Synapse, enabling up-to-the-minute
reporting and analytics for business users.


Streaming Data
Capabilities within Synapse Analytics to ingest, process,
and analyze real-time streaming data from sources like
IoT devices or event hubs, allowing for timely
decision-making and analytics.

Processing Streaming Data:

Use Azure Stream Analytics to ingest data from an
IoT hub, process the data in real-time, and output
the results to an Azure Synapse dedicated SQL pool
for further analysis and reporting.

Use Synapse notebooks for advanced analytics on
the processed data.


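Cross-DW Query
The ability to execute queries across multiple Synapse
workspaces or Azure services using T-SQL, allowing for
comprehensive analysis without needing to move data
between environments.

Executing a Cross-DW Query:

SELECT *
FROM [workspace1].[database1].[dbo].[table1]
UNION ALL
SELECT *
FROM [workspace2].[database2].[dbo].[table2];

This query illustrates pulling data from two different
Synapse workspaces and combining it into a single
result set for analysis. Treat the four-part names as
illustrative: serverless SQL pools resolve three-part
names ([database].[schema].[table]) within one
workspace, and dedicated SQL pools do not support
cross-database queries, so data in another workspace
is usually reached through external tables or
OPENROWSET over shared storage.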

Workspace Management
Centralized management of Synapse Analytics
resources, including SQL pools, Spark pools, pipelines,
and linked services, all within a single workspace.

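Creating a New Workspace:

Use the Azure portal or CLI to create a new Synapse
workspace:

az synapse workspace create --name myWorkspace \
  --resource-group myResourceGroup --location eastus \
  --storage-account myStorageAccount \
  --file-system myFileSystem \
  --sql-admin-login-user myAdmin \
  --sql-admin-login-password myPassword

The --storage-account and --file-system values
(placeholder names here) identify the ADLS Gen2
account and container that back the workspace.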


Linked Services
A configuration used within Synapse Analytics to define
the connection information for external data sources,
such as databases, data lakes, and other cloud services.

Creating a Linked Service:

Use Synapse Studio to create a linked service to
Azure Data Lake Storage. For SQL pools, the
equivalent connection details are defined in T-SQL,
starting with a database scoped credential:

CREATE DATABASE SCOPED CREDENTIAL myCredential
WITH IDENTITY = 'myIdentity', SECRET = 'mySecret';

Reference this credential from an external data source.


Synapse Roles
Role-based access control (RBAC) in Synapse Analytics
to manage permissions and access to resources,
ensuring only authorized users can access or modify
certain resources.

Assigning Roles:

GRANT CONTROL ON DATABASE::[myDatabase]
TO [myUser];

This script grants the CONTROL permission (a
permission, not a role) to a specific user, giving
them full access to the database within the Synapse
workspace.


Data Masking
A security feature that hides sensitive data in query
results, displaying masked values to users who do not
have the necessary permissions to view the original
data.

Applying Data Masking:

ALTER TABLE Employees
ALTER COLUMN SSN ADD MASKED
WITH (FUNCTION = 'partial(0,"XXX-XX-",4)');

This script masks the Social Security Number (SSN)
column in the Employees table, showing only the last
4 digits to unauthorized users. The first argument of
partial() is the number of leading characters to
expose, so it is 0 here.
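
The three arguments of partial() are easy to mix up, so here is a Python sketch of the semantics: the number of leading characters to expose, the padding string, and the number of trailing characters to expose. partial_mask is a hypothetical re-implementation for illustration, and its handling of values too short for the mask is an assumption of this sketch rather than the engine's documented behavior.

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Expose the first `prefix` and last `suffix` characters of value."""
    if prefix + suffix >= len(value):
        # Assumption for this sketch: too-short values show only the padding.
        return padding
    head = value[:prefix]
    tail = value[len(value) - suffix:] if suffix else ""
    return head + padding + tail


ssn = "123-45-6789"
print(partial_mask(ssn, 0, "XXX-XX-", 4))  # exposes only the last 4 digits
print(partial_mask(ssn, 1, "XXX-XX-", 4))  # also exposes the first digit
```

A prefix of 1 leaks the first digit of the SSN, so a prefix of 0 is the setting that matches "only the last 4 digits."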


Azure Synapse RBAC


Role-Based Access Control (RBAC) specific to Azure
Synapse, which helps in managing permissions and
access at different levels of the Synapse environment.

Assigning a Role to a User:

az synapse role assignment create \
  --workspace-name myWorkspace \
  --role "Synapse Contributor" \
  --assignee "[email protected]"

This command assigns the Synapse Contributor role
to a specific user in the specified Synapse
workspace.


Elastic Query
Allows querying across multiple databases and Synapse
instances, providing the ability to execute distributed
queries that span different data sources within Synapse
Analytics.

Executing an Elastic Query:

EXEC sp_execute_remote N'RemoteDbServer',
N'SELECT * FROM RemoteDatabase.dbo.Table';

This call sends the query to a remote database and
returns its rows, enabling cross-database querying.
The first argument must be the name of an external
data source that points to the remote database, not
a server name.


SQL On-Demand Queries


Queries executed on-demand using serverless SQL
pools in Synapse Analytics, allowing for flexible,
scalable querying without the need for dedicated
infrastructure.

Running an On-Demand Query:

SELECT * FROM OPENROWSET(
BULK 'https://mystorageaccount.blob.core.windows.net/data/file.csv',
FORMAT = 'CSV', PARSER_VERSION = '2.0') AS data;

This query reads data from a CSV file in Azure Data
Lake using the serverless SQL pool, without requiring
any dedicated resources. PARSER_VERSION = '2.0' lets
the pool infer the columns when no WITH clause lists
them.


Data Distribution
Strategies for distributing data across nodes in a
Synapse dedicated SQL pool to optimize performance
and resource utilization, including hash, round-robin,
and replicated distributions.

Creating a Table with Hash Distribution:

CREATE TABLE Sales (ProductID INT, SalesAmount DECIMAL(10,2))
WITH (DISTRIBUTION = HASH(ProductID));

This script creates a table with hash distribution
based on the ProductID column, optimizing query
performance for large datasets.
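
The difference between hash and round-robin placement can be sketched in Python. A dedicated SQL pool spreads each table over 60 distributions; everything else here is an assumption for illustration, in particular the use of CRC32, which only stands in for the engine's internal hash function.

```python
import zlib
from collections import defaultdict

NUM_DISTRIBUTIONS = 60  # fixed number of distributions in a dedicated SQL pool


def hash_distribution(product_id: int) -> int:
    # CRC32 is a stand-in; the engine's real hash function is internal.
    return zlib.crc32(str(product_id).encode("utf-8")) % NUM_DISTRIBUTIONS


def round_robin_distribution(row_index: int) -> int:
    # Rows are dealt out in turn, regardless of their content.
    return row_index % NUM_DISTRIBUTIONS


product_ids = [101, 202, 101, 303, 202, 101]  # hypothetical ProductID values

hash_placement = defaultdict(set)
round_robin_placement = defaultdict(set)
for index, pid in enumerate(product_ids):
    hash_placement[pid].add(hash_distribution(pid))
    round_robin_placement[pid].add(round_robin_distribution(index))

for pid in sorted(hash_placement):
    print(pid, sorted(hash_placement[pid]), sorted(round_robin_placement[pid]))
```

With hash distribution every row for a given ProductID lands in one distribution, so joins and GROUP BY on that column avoid moving data between nodes; round-robin scatters the same ProductID across several distributions.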


Azure Key Vault Integration


Integration with Azure Key Vault to securely manage
and store secrets, keys, and certificates used by
Synapse Analytics for encryption, authentication, and
secure data access.

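Accessing Secrets from Key Vault:

CREATE DATABASE SCOPED CREDENTIAL myCredential
WITH IDENTITY = 'Managed Identity';

T-SQL cannot read a Key Vault secret directly, and
SECRET accepts only a literal value, not a subquery.
This script instead creates a database-scoped
credential that uses the workspace managed identity,
so no secret is stored in the database. Secrets kept
in Key Vault are referenced from linked services and
pipelines through a Key Vault linked service.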


SQL Data Discovery & Classification


A feature that identifies, classifies, and labels sensitive
data within a Synapse dedicated SQL pool, helping to
comply with data protection regulations and policies.

Classifying Data:

ADD SENSITIVITY CLASSIFICATION TO dbo.Customers.SSN
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Sensitive');

This script labels the SSN column as confidential.
The sys.sensitivity_classifications catalog view is
read-only and is used to list existing
classifications, not to set them.


Pipeline Parameters
Variables that are passed into Synapse Pipelines to
dynamically control the behavior of activities, allowing
for flexible, reusable pipeline designs.

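Using Pipeline Parameters:

Define a parameter named inputFileName on the
pipeline and reference it in expressions as:

@pipeline().parameters.inputFileName

Use it in a Copy Data activity to dynamically set the
source file name based on the parameter value. Pass
the value to a dataset parameter (here a hypothetical
fileName) and build the path inside the dataset with
concat(), since the expression language has no +
operator for strings:

@concat(dataset().path, '/', dataset().fileName)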
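
What the expression resolves to at run time can be mimicked in Python. This is a sketch of the idea only: resolve_source_path is a hypothetical function, and the folder and file names are invented; the pipeline service evaluates real expressions itself.

```python
def resolve_source_path(dataset_path: str, parameters: dict) -> str:
    """Join the dataset folder and the inputFileName parameter value."""
    file_name = parameters["inputFileName"]
    return "/".join([dataset_path.rstrip("/"), file_name])


# The same pipeline definition reused for two runs with different values.
for file_name in ["sales_2023.csv", "sales_2024.csv"]:
    print(resolve_source_path("raw/sales", {"inputFileName": file_name}))
```

Only the parameter value changes between runs, which is what makes a single parameterized pipeline reusable across many files.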


Synapse SQL Pool Resource Classes


Predefined resource allocations for queries in a
Synapse dedicated SQL pool, helping to control the
amount of memory and CPU used by different queries
based on their importance and resource needs.

Assigning a Resource Class:

EXEC sp_addrolemember 'xlargerc', 'myUser';

This command assigns the "xlargerc" resource class
to a user, giving them access to more memory and
CPU resources for running large queries in the
dedicated SQL pool.


Synapse SQL Pool Workload Isolation


Techniques for isolating workloads in Synapse SQL
pools to ensure that high-priority queries are not
impacted by lower-priority workloads, using workload
groups and classification rules.

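Isolating Workloads:

CREATE WORKLOAD GROUP high_priority
WITH (MIN_PERCENTAGE_RESOURCE = 50,
CAP_PERCENTAGE_RESOURCE = 100,
REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5);

CREATE WORKLOAD CLASSIFIER high_priority_classifier WITH
(WORKLOAD_GROUP = 'high_priority', MEMBERNAME = 'myUser');

This script creates a workload group that reserves
50% of the pool's resources, then creates a
classifier that routes requests from myUser to that
group. CAP_PERCENTAGE_RESOURCE and
REQUEST_MIN_RESOURCE_GRANT_PERCENT are required
options; the values shown are examples.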
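
The effect of the percentages on concurrency is simple arithmetic, sketched here in Python. The group reserves 50% as in the example; the per-request grant of 5% and the cap of 100% are assumed values for illustration, and the function names are hypothetical.

```python
def guaranteed_concurrency(min_percentage_resource: float,
                           request_min_grant_percent: float) -> int:
    """Requests that can always run at once inside the reserved share."""
    return int(min_percentage_resource // request_min_grant_percent)


def max_concurrency(cap_percentage_resource: float,
                    request_min_grant_percent: float) -> int:
    """Requests that can run at once if the group reaches its cap."""
    return int(cap_percentage_resource // request_min_grant_percent)


# MIN_PERCENTAGE_RESOURCE = 50; the other two values are assumptions.
print(guaranteed_concurrency(50, 5))
print(max_concurrency(100, 5))
```

Raising the per-request grant gives each query more memory but lowers both numbers, so the setting is a trade-off between query size and how many queries the group can run side by side.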


Data Auditing
The process of tracking and recording access to and
modifications of data within Synapse Analytics to
ensure compliance with security policies and regulatory
requirements.

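Enabling Auditing:

ALTER DATABASE myDatabase
SET AUDIT ACTION GROUP =
'SCHEMA_OBJECT_ACCESS_GROUP';

Read this as pseudocode for the intent (audit all
schema object access actions within a Synapse
dedicated SQL pool) rather than as a statement to
run: ALTER DATABASE has no SET AUDIT ACTION GROUP
option. Auditing for a dedicated SQL pool is enabled
through the Azure portal, PowerShell, or the Azure
CLI, and the recorded access events are then
available for security and compliance purposes.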

Azure Synapse Hybrid Tables


A design pattern that combines a clustered
columnstore index with secondary rowstore
(nonclustered) indexes on the same table, so that
both large analytical scans and selective lookups
perform well in Synapse dedicated SQL pools.

Creating a Hybrid Table:

CREATE TABLE Orders (OrderID INT, OrderDate DATETIME)
WITH (DISTRIBUTION = HASH(OrderID),
CLUSTERED COLUMNSTORE INDEX);

CREATE INDEX ix_orders_orderdate ON Orders (OrderDate);

This script creates a table with a clustered
columnstore index and adds a rowstore index on
OrderDate. A table can have only one clustered index,
so the rowstore index is created separately as a
nonclustered index.


Query Performance Tuning


Techniques and tools for optimizing the performance of
queries in Synapse Analytics, including index
optimization, query rewriting, and statistics
management.

Tuning a Query:

UPDATE STATISTICS Sales;

Run this command to update statistics on the Sales
table, improving query performance by ensuring the
query optimizer has accurate information about the
data distribution.

Use Query Store to monitor and analyze query
performance.


PolyBase External Tables


A feature that allows creating external tables in
Synapse Analytics that reference data stored in external
sources, enabling seamless integration and querying of
external data alongside local data.

Creating an External Table with PolyBase:

CREATE EXTERNAL TABLE myExternalTable
(ID INT, Name VARCHAR(50))
WITH (LOCATION = '/data/',
DATA_SOURCE = myDataSource,
FILE_FORMAT = myFileFormat);

This script creates an external table referencing data
stored in HDFS. LOCATION is a path relative to the
external data source; the hdfs:// address of the
cluster is defined in myDataSource.

Synapse Query Store


A feature that captures query performance data within
a Synapse dedicated SQL pool, allowing for monitoring
and tuning of query performance over time.

Using Query Store:

Enable Query Store:

ALTER DATABASE myDatabase SET QUERY_STORE = ON;

Use Synapse Studio to view query performance
data, identify slow-running queries, and make
adjustments to improve performance based on
historical data captured by Query Store.


Synapse Auto Scaling


A feature that automatically adjusts compute
resources in Synapse Analytics based on the workload,
ensuring optimal performance and cost efficiency.
Serverless SQL pools scale on their own and expose no
scale settings; Apache Spark pools offer a
configurable autoscale option.

Enabling Auto Scaling:

Use the Azure portal or Synapse Studio to enable
autoscale on an Apache Spark pool:

Set the minimum and maximum number of nodes based
on expected workloads, and let the system
automatically add or remove nodes as needed
to handle changes in demand.
