Azure Synapse Course Presentation
Ramesh Retnasamy
Data Engineer/ Machine Learning Engineer
https://www.linkedin.com/in/ramesh-retnasamy/
About this course
Synapse Pipelines
Report
Discovery
Analysts
Solution Architecture – Spark Pool
Synapse Pipelines
Report
Discovery
Analysts
Solution Architecture – Dedicated SQL Pool
Synapse Pipelines
Report
Discovery
Analysts
Solution Architecture - Synapse Link
University students
Data Architects
Who is this course not for
You are not interested in a hands-on learning approach
Azure Account
Our Commitments
Data Warehouse
Operational Data
ETL
External Data
Data Warehouse / Mart
Data Consumers
Data Sources
Data Warehouse
Scalability
Ingest Transform
Data Sources
Data Science / ML workloads
BI Reports
Modern Data Warehouse
Operational Data
Ingest    Explore & Prepare    Transform & Enrich    Model & Serve    Visualize
External Data
Azure Databricks    Azure Databricks
Difficult to monitor
No Serverless option
Synapse Workspace
Serverless SQL
Pool
SQL Admin User
Role - Storage Blob Data Contributor
Workspace
Container
Project Overview
Project Requirements
Solution Architecture
Data Overview – NYC Taxi Trips
Data Overview – NYC Taxis
Yellow Taxis
Green Taxis
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
NYC Taxi Data Files Overview
Rate Code
Trip Data
Payment Type
Calendar Vendor
NYC Taxi Data Files Overview
Trip Data
Taxi Demand
Operational Reporting
Scheduling Requirements
Synapse Pipelines
Report
Discovery
Analysts
Solution Architecture – Spark Pool
Synapse Pipelines
Report
Discovery
Analysts
Solution Architecture – Dedicated SQL Pool
Synapse Pipelines
Report
Discovery
Analysts
Solution Architecture - Synapse Link
Cost Control
T-SQL Support
Serverless SQL Pool
Azure Synapse
Development / Monitoring / Management & Security
Azure Storage
Serverless SQL Pool – Key Features
Serverless
Robust
Not a storage
Synapse Link
JSON
Parquet
Delta Lake
Cosmos
SQL API
MongoDB API
Dataverse
SQL Server 2022 (Preview)
Serverless SQL Pool – Use Cases
Data Transformation
Serverless SQL Pool – Who is it for?
Data Engineers
Data Scientists
Header Row
Field Terminator
Row Terminator
User/
Application
Select
OPENROWSET
SELECT *
FROM OPENROWSET(
    BULK 'blob file path',
    FORMAT = ['CSV' | 'PARQUET' | 'DELTA']
) AS [file]
Azure Storage
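To make the OPENROWSET syntax above concrete, here is a minimal sketch of querying a delimited file from the serverless SQL pool, using the Header Row, Field Terminator and Row Terminator options named on the slide. The storage account name is the one that appears later in this deck (synapsecoursedl); the `<container>` placeholder and the file name `taxi_zone.csv` are assumptions made for illustration only.

```sql
-- Sketch only: <container> and taxi_zone.csv are placeholders/assumptions.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://synapsecoursedl.dfs.core.windows.net/<container>/raw/taxi_zone.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',   -- HEADER_ROW requires parser version 2.0
        HEADER_ROW = TRUE,        -- take column names from the first row
        FIELDTERMINATOR = ',',    -- column separator
        ROWTERMINATOR = '\n'      -- line separator
    ) AS [result];
```

Adding a `WITH (...)` clause after the closing parenthesis would let you state column names and data types explicitly instead of relying on inferred types.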
Section Overview - Query JSON Files
Standard JSON
Classic JSON
Query JSON Files
JSON_VALUE
OPENROWSET
CSV Parser
OPENJSON
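The four pieces listed above (OPENROWSET, the CSV parser, JSON_VALUE and OPENJSON) typically combine as in the sketch below for a line-delimited JSON file. The file path and the property names `payment_type` and `payment_type_desc` are assumptions for illustration; the slides do not show the file's structure.

```sql
-- Sketch only: file path and JSON property names are assumptions.
-- The CSV parser reads each line as a single NVARCHAR(MAX) column because the
-- field terminator and quote character (0x0b) are not expected in the data.
SELECT
    JSON_VALUE(jsonDoc, '$.payment_type')      AS payment_type,
    JSON_VALUE(jsonDoc, '$.payment_type_desc') AS payment_type_desc
FROM OPENROWSET(
        BULK 'https://synapsecoursedl.dfs.core.windows.net/<container>/raw/payment_type.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0a'
    )
    WITH (jsonDoc NVARCHAR(MAX)) AS payment_type_file;

-- The same file parsed with OPENJSON, which returns typed columns in one step.
SELECT j.payment_type, j.payment_type_desc
FROM OPENROWSET(
        BULK 'https://synapsecoursedl.dfs.core.windows.net/<container>/raw/payment_type.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0a'
    )
    WITH (jsonDoc NVARCHAR(MAX)) AS payment_type_file
CROSS APPLY OPENJSON(jsonDoc)
    WITH (
        payment_type      SMALLINT    '$.payment_type',
        payment_type_desc VARCHAR(20) '$.payment_type_desc'
    ) AS j;
```

JSON_VALUE is convenient for pulling out a few scalar properties; OPENJSON tends to read better when many properties or nested arrays are involved.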
Databases
Schemas
Supported Views
Stored Procedures
Tables
Triggers
Not Supported
Materialized views
DML statements
Missing values
Invalid data
Trip Data
User
File Types
Code Repetition
Trip Data
Rate Code
Trip Data
Payment Type
Calendar Vendor
Database Objects - Types
User/
Application
OPENROWSET
Creates an external table on the data already present in the storage.
Selected data will be copied to the location specified in the table definition.
User/
Application
External Table
Credential
External Table
Data source type
External Table
Data Compression
External Table
Format Options
External Data Source
External File Format
Data Lake Storage
<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
    | STRING_DELIMITER = string_delimiter
    | FIRST_ROW = integer
    | USE_TYPE_DEFAULT = { TRUE | FALSE }
    | ENCODING = {'UTF8' | 'UTF16'}
    | PARSER_VERSION = {'parser_version'}
}
External File Format (Example)
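A minimal sketch of an external file format for delimited text, using the format options listed earlier. The object name `csv_file_format` and the specific option values are assumptions chosen for illustration, not values taken from the course.

```sql
-- Sketch only: the name csv_file_format and the option values are hypothetical.
CREATE EXTERNAL FILE FORMAT csv_file_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,              -- start reading at row 2, skipping a header row
        USE_TYPE_DEFAULT = FALSE,   -- keep missing values as NULL
        ENCODING = 'UTF8',
        PARSER_VERSION = '2.0'
    )
);
```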
Reject Options
Rate Code
Trip Data
Payment Type
Calendar Vendor
Create External Table
Database
(nyc_taxi_ldw)
External Table
(taxi_zone)
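Putting the pieces together, the sketch below creates the `taxi_zone` external table in the `nyc_taxi_ldw` database shown above. The data source name `nyc_taxi_src`, the file format name `csv_file_format`, the `<container>` placeholder, the file path and the column list are all assumptions; adjust them to the actual file.

```sql
-- Sketch only: object names, <container>, file path and columns are assumptions.
USE nyc_taxi_ldw;
GO

-- Create the data source and file format only if they are not already present.
IF NOT EXISTS (SELECT 1 FROM sys.external_data_sources WHERE name = 'nyc_taxi_src')
    CREATE EXTERNAL DATA SOURCE nyc_taxi_src
    WITH (LOCATION = 'https://synapsecoursedl.dfs.core.windows.net/<container>');

IF NOT EXISTS (SELECT 1 FROM sys.external_file_formats WHERE name = 'csv_file_format')
    CREATE EXTERNAL FILE FORMAT csv_file_format
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2));

-- The external table stores only metadata; the data stays in the file.
CREATE EXTERNAL TABLE dbo.taxi_zone (
    location_id  SMALLINT,
    borough      VARCHAR(15),
    [zone]       VARCHAR(50),
    service_zone VARCHAR(15)
)
WITH (
    LOCATION = 'raw/taxi_zone.csv',
    DATA_SOURCE = nyc_taxi_src,
    FILE_FORMAT = csv_file_format
);
GO

SELECT TOP 10 * FROM dbo.taxi_zone;
```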
Create External Table – Delimited Text
Rate Code
Trip Data
Payment Type
Calendar Vendor
Create External Table – Delimited Text (Assignment)
Rate Code
Trip Data
Payment Type
Calendar Vendor
Create External Table – Parquet
Rate Code
Trip Data
Payment Type
Calendar Vendor
Create External Table – Delta (Assignment)
Rate Code
Trip Data
Payment Type
Calendar Vendor
Views
User/
Application
View
Select
View
Select
Select
Rate Code
Trip Data
Payment Type
Calendar Vendor
Create View – Assignment
Rate Code
Trip Data
Payment Type
Calendar Vendor
Partition Pruning
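One common way to get partition pruning in a serverless SQL pool is to expose the folder names as columns with `filepath()` and filter on them. The sketch below assumes the `raw/trip_data/year=.../month=...` folder layout shown later in this deck; the view name and the `<container>` placeholder are hypothetical.

```sql
-- Sketch only: view name and <container> are assumptions.
USE nyc_taxi_ldw;
GO

CREATE VIEW dbo.vw_trip_data_csv
AS
SELECT
    result.filepath(1) AS [year],    -- value matched by the first wildcard
    result.filepath(2) AS [month],   -- value matched by the second wildcard
    result.*
FROM OPENROWSET(
        BULK 'https://synapsecoursedl.dfs.core.windows.net/<container>/raw/trip_data/year=*/month=*/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [result];
GO

-- Filtering on the filepath columns lets the pool read only the matching folders.
SELECT COUNT(1) AS trip_count
FROM dbo.vw_trip_data_csv
WHERE [year] = '2020' AND [month] = '01';
```

Because the serverless pool bills by data processed, reading one month's folder instead of the whole dataset should also reduce cost.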
Synapse Pipelines
Report
Discovery
Analysts
Section Overview - Data Ingestion
Create Views
Data Transformation
As
SELECT
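A minimal sketch of CREATE EXTERNAL TABLE AS SELECT (CETAS), which writes the result of a SELECT to storage as Parquet and registers an external table over it. The object names, the `<container>` placeholder and the source file path are assumptions. Writing to storage needs write permission on the target, and the target folder generally needs to be empty before the statement runs.

```sql
-- Sketch only: object names, <container> and file paths are assumptions.
USE nyc_taxi_ldw;
GO

IF NOT EXISTS (SELECT 1 FROM sys.external_data_sources WHERE name = 'nyc_taxi_src')
    CREATE EXTERNAL DATA SOURCE nyc_taxi_src
    WITH (LOCATION = 'https://synapsecoursedl.dfs.core.windows.net/<container>');

IF NOT EXISTS (SELECT 1 FROM sys.external_file_formats WHERE name = 'parquet_file_format')
    CREATE EXTERNAL FILE FORMAT parquet_file_format
    WITH (FORMAT_TYPE = PARQUET,
          DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');

-- Read the raw CSV and write it out as Parquet under silver/taxi_zone.
CREATE EXTERNAL TABLE dbo.taxi_zone_silver
WITH (
    DATA_SOURCE = nyc_taxi_src,
    LOCATION = 'silver/taxi_zone',
    FILE_FORMAT = parquet_file_format
)
AS
SELECT *
FROM OPENROWSET(
        BULK 'raw/taxi_zone.csv',
        DATA_SOURCE = 'nyc_taxi_src',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS src;
```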
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transform to Parquet Format (Assignment)
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transform to Parquet Format (Assignment)
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transform to Parquet Format from JSON
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transform to Parquet Format from JSON (Assignment)
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transform to Parquet Format – Partitioned Files
Rate Code
Trip Data
Payment Type
Calendar Vendor
Stored Procedures
Reuse of code
Easier maintenance
Improved security
Stored Procedures - Limitations
T-SQL support
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview#t-sql-support
Reduced implementation
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-stored-procedures#limitations
Transform to Parquet Format – Partitioned Files
raw/trip_data       CETAS    silver/trip_data
year=2020                    *.parquet
month=01/*.csv
month=02/*.csv
…
month=12/*.csv
year=2021
month=01/*.csv
…
Transform to Parquet Format – Partitioned Files
raw/trip_data silver/trip_data
year=2020 year=2020
month=01/*.csv month=01/*.parquet
month=02/*.csv month=02/*.parquet
?
… …
month=12/*.csv month=12/*.parquet
year=2021 year=2021
month=01/*.csv month=01/*.parquet
… …
Transform to Parquet Format – Partitioned Files
raw/trip_data silver/trip_data
year=2020 year=2020
… CETAS .. …
year=2021 year=2021
… CETAS .. …
Transform to Parquet Format – Partitioned Files
raw/trip_data       Stored Procedure    silver/trip_data
year=2020                               year=2020
month=01/*.csv      Exec SP 2020 01     month=01/*.parquet
month=02/*.csv      Exec SP 2020 02     month=02/*.parquet
…                   Exec SP …           …
month=12/*.csv      Exec SP 2020 12     month=12/*.parquet
year=2021                               year=2021
month=01/*.csv      Exec SP 2021 01     month=01/*.parquet
…                   Exec SP …           …
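The per-partition approach in the diagram above can be sketched as a stored procedure that runs one CETAS per year and month through dynamic SQL. This is an illustration under stated assumptions, not the course's actual procedure: the procedure name, the data source `nyc_taxi_src` and the file format `parquet_file_format` are hypothetical and must already exist in the database.

```sql
-- Sketch only: procedure name, nyc_taxi_src and parquet_file_format are
-- assumptions; the data source and file format are expected to exist already.
CREATE OR ALTER PROCEDURE dbo.usp_silver_trip_data
    @year  VARCHAR(4),
    @month VARCHAR(2)
AS
BEGIN
    DECLARE @table_name NVARCHAR(128) = CONCAT('trip_data_', @year, '_', @month);
    DECLARE @sql NVARCHAR(MAX);

    -- CETAS for one partition: raw CSV folder in, silver Parquet folder out.
    SET @sql = N'
        CREATE EXTERNAL TABLE dbo.' + QUOTENAME(@table_name) + N'
        WITH (
            DATA_SOURCE = nyc_taxi_src,
            LOCATION = ''silver/trip_data/year=' + @year + N'/month=' + @month + N''',
            FILE_FORMAT = parquet_file_format
        )
        AS
        SELECT *
        FROM OPENROWSET(
                BULK ''raw/trip_data/year=' + @year + N'/month=' + @month + N'/*.csv'',
                DATA_SOURCE = ''nyc_taxi_src'',
                FORMAT = ''CSV'',
                PARSER_VERSION = ''2.0'',
                HEADER_ROW = TRUE
            ) AS src;';
    EXEC sp_executesql @sql;

    -- Drop only the table metadata; the Parquet files stay in storage.
    SET @sql = N'DROP EXTERNAL TABLE dbo.' + QUOTENAME(@table_name) + N';';
    EXEC sp_executesql @sql;
END;
GO

EXEC dbo.usp_silver_trip_data '2020', '01';
EXEC dbo.usp_silver_trip_data '2020', '02';
```

Dropping the external table after each run leaves the partition folders in place, so a single view or table over `silver/trip_data` can read all of them. In real use the parameters should be validated before being concatenated into dynamic SQL.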
Transform to Parquet Format – Partitioned Files
raw/trip_data       Stored Procedure    silver/trip_data
year=2020                               year=2020
month=01/*.csv      Exec SP 2020 01     month=01/*.parquet    X
month=02/*.csv      Exec SP 2020 02     month=02/*.parquet    X
…                   Exec SP …           …                     X
month=12/*.csv      Exec SP 2020 12     month=12/*.parquet    X
year=2021                               year=2021
month=01/*.csv      Exec SP 2021 01     month=01/*.parquet    X
…                   Exec SP …           …                     X
Transform to Parquet Format – Partitioned Files
Synapse Pipelines
Report
Discovery
Analysts
Section Overview - Data Transformation
Project Requirements
Create View
Business Requirements (1)
Able to read data efficiently for specific months from aggregated data
Year Month
Gold Layer
Year
Month
Borough
Trip Data
Trip Date
Trip Day
Rate Code
Trip Data
Payment Type
Calendar Vendor
Gold Layer
Year Trip Data
Able to read data efficiently for specific months from aggregated data
Rate Code
Trip Data
Payment Type
Calendar Vendor
Section Overview – Synapse Pipelines
Synapse Pipelines
Report
Discovery
Analysts
Section Overview - Synapse Pipelines
Overview
Components
Creating Pipelines
Dynamic Pipelines
Pipeline Dependencies
Creating Triggers
Synapse Pipelines Overview
What are Synapse Pipelines
Multi Cloud
SaaS Apps
Data Formats
On Premises
The Data Problem
Ingest    Transform/Analyze    Publish
Data Consumers
Data Sources
Data Integration
SaaS connectors
Multi-cloud support
On-premises support
https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences
Synapse Integration Pipeline Components
Trigger Trigger
Pipeline Pipeline
Dataset Dataset
Storage: ADLS, SQL Database
Compute: Synapse Pool, Azure Databricks
Bronze to Silver Layer Transformation
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transform to Parquet Format – Taxi Zone
raw/taxi_zone    silver/taxi_zone
DROP IF Exists + CETAS
Transform to Parquet Format – Taxi Zone
raw/taxi_zone    silver/taxi_zone
DELETE File If Exists + DROP Table If Exists + CETAS
Pipeline
Delete Activity    Script Activity
Delete Activity
Pipeline
Delete Activity    Script Activity
Dataset (raw/taxi_zone)
Linked Service
Storage Account
(ADLS Gen2)
synapsecoursedl
Script Activity
Pipeline
Delete Activity    Script Activity
Dataset (raw/taxi_zone)
Pipeline
Delete Activity    Stored Procedure Activity
Dataset (raw/taxi_zone)
Rate Code
Trip Data
Payment Type
Calendar Vendor
Bronze to Silver Layer Transformation
Rate Code
Trip Data
Payment Type
Calendar Vendor
Transformation Pipeline – Taxi Zone
Trigger
Dataset (silver/taxi_zone)
Pipeline (Calendar)
Delete Activity    Stored Procedure Activity
Dataset (silver/calendar)
Pipeline
Delete Activity    Stored Procedure Activity
Delete Activity    Stored Procedure Activity
Dataset (silver/Calendar)
Dataset (silver/Taxi Zone)
Variable (folder_path)
Pipeline Variable (usp_name)
Dataset
(folder_path parameter)
ForEach Pipeline
Variable (Array)
Activity
Dataset
(folder_path parameter)
Script Activity
(Create Silver View)
Section Overview – Spark Pool
Notebooks Overview
Spark Pool
Apache Spark
Apache Spark is a lightning-fast unified analytics engine for big data
processing and machine learning
Spark Core
Scala Python Java R
Delta Lake
Scalability
Spark Pool
Pre-loaded libraries
Integration with 3rd party IDEs
Support for C#
Integration with Serverless SQL Pool
Azure Synapse Analytics – Spark Pool
Use Cases
Spark Pool    Serverless SQL Pool
Data Consumers
Data Sources
Metadata Replication - Database
Spark Pool    Serverless SQL Pool
demo_db    demo_db
Meta store
Metadata for Spark Pool    Replicate    Metadata for Serverless SQL Pool
Spark Pool    Serverless SQL Pool
demo_db.demo_table    demo_db.dbo.demo_table
Meta store
Metadata for Spark Pool    Replicate    Metadata for Serverless SQL Pool
silver/trip_data_green    Aggregate the silver data    Create Spark Table with aggregates    gold/trip_data_green_agg
Integration Between Spark Pool & Serverless SQL Pool
Spark Pool    Serverless SQL Pool
nyc_taxi_ldw_spark.trip_data_green_agg    nyc_taxi_ldw_spark.dbo.trip_data_green_agg
Meta store
Metadata for Spark Pool    Replicate    Metadata for Serverless SQL Pool
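Once the Spark table's metadata has been replicated, it can be queried from the serverless SQL pool under the `dbo` schema, as the diagram above indicates. A minimal sketch, assuming the Spark database `nyc_taxi_ldw_spark` and table `trip_data_green_agg` from the slide have been created and synchronized:

```sql
-- Sketch only: run against the serverless SQL pool after the Spark table exists.
-- Replication is asynchronous, so a new table may take a short while to appear.
SELECT TOP 10 *
FROM nyc_taxi_ldw_spark.dbo.trip_data_green_agg;
```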
Create Reports
Synapse – Power BI Integration
Azure Synapse
Development / Monitoring / Management & Security
Power BI Desktop
Power BI Workspace
Azure Active Directory User
Power BI User
Power BI Workspace
Cosmos DB - GA
Dataverse - GA
SQL Server 2022 – Private Preview
Synapse Link for Cosmos DB
Overview
External Tables
Copy Command
DMS
User or
Control Node
Application
Storage
……………………. 60 distributions
Dedicated SQL Pool - Architecture
Compute
T-SQL
DMS
User or
Control Node
Application
DMS
Compute Node
All 60 distributions
Storage
……………………. 60 distributions
Dedicated SQL Pool - Architecture
Compute
T-SQL
DMS
User or
Control Node
Application
Storage
……………………. 60 distributions
Dedicated SQL Pool - Architecture
Compute
T-SQL
Replicate
Storage
……………………. 60 distributions
Dedicated SQL Pool – Use Cases
Predictable Performance
Dedicated SQL Pool – Price & Performance
Performance level    Compute nodes    Distributions per compute node    Memory per data warehouse (GB)
DW100c               1                60                                60
DW200c               1                60                                120
DW300c               1                60                                180
DW400c               1                60                                240
DW500c               1                60                                300
DW1000c              2                30                                600
DW1500c              3                20                                900
DW2000c              4                15                                1200
DW2500c              5                12                                1500
DW3000c              6                10                                1800
DW5000c              10               6                                 3000
DW6000c              12               5                                 3600
DW7500c              15               4                                 4500
DW10000c             20               3                                 6000
DW15000c             30               2                                 9000
DW30000c             60               1                                 18000
Synapse – Dedicated SQL Pool
Use Cases
Synapse – Dedicated SQL Pool
Azure Synapse
Development / Monitoring / Management & Security
Data Consumers
Data Sources
Azure Synapse Analytics
Data Consumers
Data Sources
Spark Pool
Requirement
CTAS
COPY
Command
Internal Table
trip_data_green
Dedicated SQL Pool
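The load path named above (COPY command, then CTAS into an internal table) could look like the sketch below in a dedicated SQL pool. The staging table name, the `<container>` and `<path-to-parquet-files>` placeholders, the managed-identity credential and the round-robin distribution are assumptions for illustration; only the target table name `trip_data_green` comes from the slide.

```sql
-- Sketch only (dedicated SQL pool): staging table name, path placeholders,
-- credential choice and distribution are assumptions.

-- 1. Load the Parquet files into a staging table with the COPY command.
COPY INTO dbo.trip_data_green_stage
FROM 'https://synapsecoursedl.dfs.core.windows.net/<container>/<path-to-parquet-files>/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    AUTO_CREATE_TABLE = 'ON'   -- let COPY create the staging table from the file schema
);

-- 2. Build the internal table with CTAS, choosing distribution and index.
CREATE TABLE dbo.trip_data_green
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.trip_data_green_stage;
```

Round-robin is a neutral default for a first load; a hash-distributed or replicated table may suit the query pattern better once the join and filter columns are known.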
Congratulations!
&
Thank you
Feedback
Ratings & Review
Thank you
&
Good Luck!