Azure Synapse Course Presentation

This document provides an overview of Azure Synapse Analytics. It discusses how modern data warehouses have evolved from traditional data warehouses to incorporate data lakes and support a wider range of workloads. It compares architectures with and without Azure Synapse Analytics, noting that Synapse brings together disparate services into a single workspace with shared monitoring, management, security and metadata.

Uploaded by

saok

About Me

Ramesh Retnasamy
Data Engineer/ Machine Learning Engineer

https://www.linkedin.com/in/ramesh-retnasamy/
About this course

Azure Synapse Analytics

Limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics.
Azure Synapse Analytics
[Diagram: Azure Synapse workspace. Shared across the workspace: Development / Monitoring / Management & Security. Data Integration: Synapse Pipelines, Synapse Data Flows. Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool. Data Visualization: Power BI. Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link.]
NYC Taxi
Cloud Data Platform
Solution Architecture – Serverless SQL Pool
Solution Architecture – Spark Pool
Solution Architecture – Dedicated SQL Pool
[Diagram shown for each of the three pools: Synapse Pipelines orchestrate three stages (Ingest, Transform, Present) across NYC Taxi Data, Bronze Layer, Silver Layer and Gold Layer. Outputs: Report and Discovery. Consumers: Analysts.]
Solution Architecture - Synapse Link
[Diagram: NYC Taxi Device Data is written to an Azure Cosmos DB Container. Inside the container, the Transactional Store (row store optimized for transactional reads and writes) is auto-synced to the Analytical Store (column store optimized for analytical queries). Azure Synapse Link / HTAP connects the Analytical Store to Azure Synapse Analytics, where a Spark Pool serves Machine Learning and Big Data Analytics and a Serverless SQL Pool serves Data Exploration and BI Reporting.]
NYC Taxi Cloud Data Platform
Who is this course for

University students

IT Developers from other disciplines

AWS/ GCP/ On-prem Data Engineers

Data Architects
Who is this course not for
You are not interested in a hands-on learning approach

Your only focus is Azure Data Engineering Certification

You want to learn Data Flows

You are looking for in-depth knowledge of dedicated SQL pool

You are looking to learn dimensional data modelling & implementation
Pre-requisites

All code and step-by-step instructions provided

Basic SQL knowledge required

Basic Python programming knowledge required

Cloud fundamentals will be beneficial, but not mandatory

Azure Account
Our Commitments

Ask questions, we will answer

Keeping the course up to date

Udemy lifetime access

Udemy 30-day money-back guarantee


Introduction to Azure Synapse Analytics
Azure Synapse Analytics - Introduction

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics.
Azure Synapse Analytics - Introduction

Data Warehouse

Emergence of Data Lakes

Modern Data Warehouse Architecture

Modern Data Warehouse – Without Azure Synapse Analytics

Modern Data Warehouse - With Azure Synapse Analytics


Data Warehouse
[Diagram: Data Sources (Operational Data, External Data) -> ETL -> Data Warehouse / Mart -> Data Consumers]
Data Warehouse

Lack of support for unstructured data

Longer to ingest new data

Proprietary data formats

Scalability

Expensive to store data

Lack of support for ML/ AI workloads


Data Lake
[Diagram: Data Sources (Operational Data, External Data) -> Ingest -> Data Lake -> Transform -> Data Lake -> Data Warehouse / Mart -> BI Reports. Data Science / ML workloads read from the Data Lake.]
Modern Data Warehouse
[Diagram: Data Sources (Operational Data, External Data) flow through five stages. Ingest: Azure Data Factory. Explore & Prepare: ADF Data Flows, Azure Databricks. Transform & Enrich: ADF Data Flows, Azure Databricks. Model & Serve: Azure SQL Data Warehouse. Visualize: Power BI. Data Lake Storage: Azure Data Lake Gen2.]
Modern Data Warehouse
[Diagram: the same services grouped by function. Development / Monitoring / Management & Security sit above all groups. Data Integration: Azure Data Factory, ADF Data Flows. Compute: Azure Databricks, Azure SQL Data Warehouse. Data Visualization: Power BI. Storage: Azure Data Lake Storage Gen2.]
Modern Data Warehouse

Too many services/ workspaces

Difficult to monitor

Management & Security overhead

No Serverless option

Metadata not shared


Azure Synapse Analytics
[Diagram sequence: the Modern Data Warehouse services are placed inside a single Azure Synapse workspace with shared Development / Monitoring / Management & Security. Step by step, Azure Data Factory and ADF Data Flows are replaced in the diagram by Synapse Pipelines and Synapse Data Flows, Azure Databricks by Spark Pool, and Azure SQL Data Warehouse by Dedicated SQL Pool. Serverless SQL Pool is then added to Compute, and Meta store and Synapse Link are added alongside Azure Data Lake Storage Gen2 in Storage. Power BI remains under Data Visualization.]
Azure Synapse Analytics – Preview Features

Data Explorer Pool

Synapse Link for SQL Server 2022

Many more features


Azure Synapse Analytics
Azure Synapse Analytics is a limitless analytics service that brings together
data integration, enterprise data warehousing and big data analytics.
Create Synapse Analytics Workspace
Lab
Create Synapse Analytics Workspace
[Diagram: a User Subscription containing a User Managed Resource Group and an Azure Managed Resource Group. Components shown: Synapse Workspace, Serverless SQL Pool, SQL Admin User, Primary ADLS Gen2 Storage Account with a Workspace Container, and the role Storage Blob Data Contributor.]
Project Overview

What is NYC Taxi

NYC Taxi data source & datasets

Prepare the data for the project

Project Requirements

Solution Architecture
Data Overview – NYC Taxi Trips
Data Overview – NYC Taxis

Yellow Taxis

Green Taxis

For Hire Vehicles

High Volume For Hire Vehicles

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
NYC Taxi Data Files Overview

Taxi Zone – CSV
Trip Type – TSV
Rate Code – JSON
Payment Type – JSON
Trip Data – CSV, Parquet, Delta
Calendar – CSV
Vendor – CSV (quoted)
Import NYC Taxi Data to Data Lake
Project Requirements
Data Discovery

Data exploration capability on the raw data

Schema applied to the raw data

Discovery using T-SQL

Discovery using pay-per-query model


Data Ingestion

Ingested data to be stored as Parquet

Ingested data to be stored as tables/ views

Ability to query the ingested data using SQL

Ingestion using pay-per-query model


Data Transformation
Join the key information required for reporting to create a new table.

Join the key information required for analysis to create a new table.

Must be able to analyze the transformed data via T-SQL

Transformed data must be stored in columnar format (i.e., Parquet)
Reporting Requirements

Taxi Demand

Credit Card Campaign

Operational Reporting
Scheduling Requirements

Scheduled to run at regular intervals

Ability to monitor pipelines

Ability to re-run failed pipelines

Ability to set up alerts on failures


Solution Architecture
[Diagram: Azure Synapse workspace. Shared across the workspace: Development / Monitoring / Management & Security. Data Integration: Synapse Pipelines, Synapse Data Flows. Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool. Data Visualization: Power BI. Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link.]
Solution Architecture – Serverless SQL Pool
Solution Architecture – Spark Pool
Solution Architecture – Dedicated SQL Pool
[Diagram shown for each of the three pools: Synapse Pipelines orchestrate three stages (Ingest, Transform, Present) across NYC Taxi Data, Bronze Layer, Silver Layer and Gold Layer. Outputs: Report and Discovery. Consumers: Analysts.]
Solution Architecture - Synapse Link
[Diagram: Transactional Data is written to an Azure Cosmos DB Container. Inside the container, the Transactional Store (row store optimized for transactional reads and writes) is auto-synced to the Analytical Store (column store optimized for analytical queries). Azure Synapse Link / HTAP connects the Analytical Store to Azure Synapse Analytics, where a Spark Pool serves Machine Learning and Big Data Analytics and a Serverless SQL Pool serves Data Exploration and BI Reporting.]
Section Overview – Serverless SQL Pool

Serverless SQL Pool Architecture

Features & Use Cases

Cost Control

Connecting to Azure Data Studio

T-SQL Support
Serverless SQL Pool

Serverless SQL pool is a serverless distributed query engine that you can use to query data over your data lake using T-SQL.
Serverless SQL Pool - Architecture
[Diagram: a User or Application sends T-SQL to the Control Node of the Compute layer (Polaris, a distributed query processing engine). The Control Node sits above several Compute Nodes, which read from Azure Storage.]
Polaris paper: https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf
Serverless SQL Pool – Key Features
Serverless

Distributed query engine

Robust

Query using T-SQL

Pay-per-query pricing model

Not a storage service

Synapse Link

Query Spark tables


Serverless SQL Pool – Supported Data Sources
Azure Storage Account: Delimited (CSV, TSV etc), JSON, Parquet, Delta Lake

Cosmos DB: SQL API, MongoDB API

Dataverse

SQL Server 2022 (Preview)
Serverless SQL Pool – Use Cases

Discovery & Exploration

Logical Data Warehouse

Data Transformation
Serverless SQL Pool – Who is it for?

Data Engineers

Data Scientists

Data Analysts / BI Developers


Serverless SQL Pool
Cost Management
Cost Calculation
Billed for Data Processed:
Amount of data read from storage
Amount of data in intermediate results
Amount of data written to storage

Data Processed is rounded up to the nearest MB per query

Minimum of 10 MB per query

Currently $6.25 per 1 TB
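
The billing rules on this slide can be sketched as a small estimator. This is an illustrative sketch rather than an official pricing calculator: it assumes rounding is upward to the next whole MB, that 1 TB is 1024^4 bytes, and it takes the $6.25 per TB figure from the slide as a default parameter. The function names are hypothetical.

```python
import math

MB = 1024 ** 2  # bytes in one MB (assumption: binary units)
TB = 1024 ** 4  # bytes in one TB (assumption: binary units)


def billable_mb(bytes_processed: int, minimum_mb: int = 10) -> int:
    """Data processed by one query, in whole MB, with the per-query minimum applied."""
    rounded_mb = math.ceil(bytes_processed / MB)  # assumption: round up
    return max(rounded_mb, minimum_mb)


def estimate_cost(bytes_per_query, price_per_tb: float = 6.25) -> float:
    """Estimated charge for a batch of queries (default price taken from the slide)."""
    total_mb = sum(billable_mb(b) for b in bytes_per_query)
    return total_mb * MB / TB * price_per_tb
```

Under these assumptions a query that touches only a few KB is still billed as 10 MB, which is why many tiny exploratory queries can cost more than their data volume suggests.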


Cost Control
Serverless SQL Pool
Cost Management
Lab
Azure Data Studio
Section Overview - Query CSV Files

Header Row

Field Terminator

Row Terminator

Quoted Files and escaping characters


Query CSV Files

OPENROWSET function overview

Query using OPENROWSET function

Data Types & Collations

Query subset of columns

Quoted strings & Escape Char


Serverless SQL
Pool
Query Tab Separated Values (TSV) file
OPENROWSET Function
[Diagram: User/Application -> Select -> OPENROWSET -> Azure Storage]

SELECT *
FROM OPENROWSET(BULK 'blob file path',
                FORMAT = ['CSV' | 'PARQUET' | 'DELTA']
               ) AS [file]
Section Overview - Query JSON Files

Line Delimited JSON

Single Line JSON

Standard JSON

Classic JSON
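Query JSON Files
[Diagram: OPENROWSET (CSV Parser) reads the JSON text; JSON_VALUE and OPENJSON extract values from it.]

OPENROWSET settings by JSON layout:
Line-delimited JSON: FIELDTERMINATOR – 0x0b, FIELDQUOTE – 0x0b
Standard JSON: FIELDTERMINATOR – 0x0b, FIELDQUOTE – 0x0b, ROWTERMINATOR – 0x0b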
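
The difference between the two JSON layouts can be illustrated outside Synapse. The sketch below is plain Python, not the serverless SQL engine: in line-delimited JSON each line is a complete document, so splitting on newlines is safe, while a standard JSON document spans several lines and has to be read as one unit. The sample records and function names are hypothetical.

```python
import json

# Hypothetical sample data in the two layouts named on the slide.
LINE_DELIMITED = (
    '{"payment_type": 1, "description": "Credit card"}\n'
    '{"payment_type": 2, "description": "Cash"}\n'
)
STANDARD = """[
  {"rate_code_id": 1, "rate_code": "Standard rate"},
  {"rate_code_id": 2, "rate_code": "JFK"}
]"""


def parse_line_delimited(text: str) -> list:
    """Each non-empty line is a complete JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_standard(text: str) -> list:
    """The whole text is one JSON document (here, an array of objects)."""
    return json.loads(text)
```

Splitting the standard layout on newlines would hand the parser fragments such as a lone bracket, which mirrors why that layout needs a different row terminator setting.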
Serverless SQL Pool – T-SQL Support

Supported:
Databases
Schemas
Views
Stored Procedures
Inline table-valued functions
External Resources – data sources, file formats and tables

Not Supported:
Tables
Triggers
Materialized views
DML statements
DDL statements other than ones related to views and security

Security:
Logins and users
Credentials to control access to storage accounts
Grant, deny, and revoke permissions per object level
Azure Active Directory integration

Additional Features:
CETAS - CREATE EXTERNAL TABLE AS SELECT
Extension to OPENROWSET to aid querying data in data lake
Serverless SQL Pool – Monitoring Queries
Lab
Discovery & Exploration
Lab
Serverless SQL – Data Discovery

Identify the volume of the data
  Total record count
  Record count per day/ week/ month

Quality of data
  Duplicates
  Missing values
  Invalid data

Ability to join datasets (e.g. keys exist)

Ability to get business value
  Right columns exist
  Transformations
  Aggregations

Identify additional data required


Serverless SQL – Data Discovery

Identify duplicates in data

Check for missing data values

Invalid/ Unexpected data in columns

Join data from multiple files

Summarize/ Aggregate data

Apply some transforms
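
The discovery checks listed above (duplicates, missing values, summaries) can be sketched on a handful of in-memory rows. This is a minimal Python illustration of the checks themselves, not the T-SQL used in the labs; the sample rows and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical rows standing in for a raw taxi zone file.
ROWS = [
    {"location_id": 1, "borough": "EWR"},
    {"location_id": 2, "borough": "Queens"},
    {"location_id": 2, "borough": "Queens"},
    {"location_id": 3, "borough": ""},
    {"location_id": 4, "borough": None},
]


def duplicate_keys(rows, key):
    """Key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_count(rows, column):
    """Number of rows where the column is absent, None or empty."""
    return sum(1 for row in rows if row.get(column) in (None, ""))


def count_by(rows, column):
    """Record count per distinct value of the column."""
    return dict(Counter(row[column] for row in rows))
```

Each helper corresponds to a query pattern: a GROUP BY with HAVING COUNT(*) > 1 for duplicates, a filter on NULL or empty values for missing data, and a plain GROUP BY for the summary.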


Data Virtualization
Data virtualization is a logical data layer that allows us to combine data from multiple
sources at query time without having to write complex ETL pipelines to load the data.
Database Objects – Why?
[Diagram: User -> OPENROWSET Function -> Data Lake Storage]

Storage Account and File details

File Types

Column Names and Data Types

Code Repetition

Ability to query from BI tools
Database Objects - Types
[Diagram: User/Application -> External Tables, Views, OPENROWSET -> Data Lake Storage]


External Table

Create External Table
Creates an external table on the data already present in the storage.
Metadata only change

Create External Table As Select (CETAS)
Selected data will be copied to the location specified in the table definition.
Metadata change + Data copied


External Table
[Diagram: User/Application -> External Table -> External Data Source and External File Format -> Data Lake Storage]

External Data Source

Location of the storage

Credential

Data source type


External Data Source
CREATE EXTERNAL DATA SOURCE <Data Source Name>
WITH
(
    LOCATION = <folder path URI> ,
    CREDENTIAL = <Credential Name> ,
    TYPE = {HADOOP}
) ;

CREATE EXTERNAL DATA SOURCE nyc_taxi_data
WITH
(
    LOCATION = 'abfss://nyc-taxi-[email protected]',
    CREDENTIAL = nyc_taxi_data_cred
) ;
External File Format

Format Type

Data Compression

Format Options


External File Format
CREATE EXTERNAL FILE FORMAT <file format name>
WITH
(
    FORMAT_TYPE = {DELIMITEDTEXT | PARQUET | DELTA}
    [ , DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
                         | 'org.apache.hadoop.io.compress.SnappyCodec' ]
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
);

<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
    | STRING_DELIMITER = string_delimiter
    | FIRST_ROW = integer
    | USE_TYPE_DEFAULT = { TRUE | FALSE }
    | ENCODING = {'UTF8' | 'UTF16'}
    | PARSER_VERSION = {'parser_version'}
}
External File Format (Example)

CREATE EXTERNAL FILE FORMAT csv_file_format
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = FALSE,
        ENCODING = 'UTF8',
        PARSER_VERSION = '2.0'
    )
);


External Table

External Data Source Name

File Location

External File Format Name

Column Name/ Data Type

Read Options

Reject Options


External Table
CREATE EXTERNAL TABLE <[database_name].[schema_name].table_name>
( <column name> <data type> )
WITH (
    LOCATION = '<folder_or_filepath>',
    DATA_SOURCE = <external_data_source_name>,
    FILE_FORMAT = <external_file_format_name>
    [, TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}']
    [, <reject_options>]
)

<reject_options> ::=
{
    | REJECT_TYPE = value,
    | REJECT_VALUE = reject_value,
    | REJECT_SAMPLE_VALUE = reject_sample_value,
    | REJECTED_ROW_LOCATION = '/REJECT_Directory'
}
External Table
CREATE EXTERNAL TABLE demo.ldw.taxi_zone
(
    location_id  SMALLINT,
    borough      VARCHAR(15),
    zone         VARCHAR(50),
    service_zone VARCHAR(15)
)
WITH (
    LOCATION = 'raw/taxi_zone.csv',
    DATA_SOURCE = nyc_taxi_data,
    FILE_FORMAT = csv_file_format,
    REJECT_VALUE = 10,
    REJECTED_ROW_LOCATION = 'rejections/ldw/taxi_zone'
);
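
The reject options can be read as a tolerance for bad rows: rows that do not fit the declared columns are set aside, and the read fails once more than REJECT_VALUE rows have been rejected. The Python sketch below illustrates that idea only and is not how the engine is implemented. The validation rule, the column length limits and the function names are hypothetical.

```python
def is_smallint(text: str) -> bool:
    """True when the text parses as an integer within the SMALLINT range."""
    try:
        return -32768 <= int(text) <= 32767
    except ValueError:
        return False


def valid_taxi_zone(row, max_lengths=(15, 50, 15)) -> bool:
    """Hypothetical check: first field is a SMALLINT, text fields fit their limits."""
    location_id, *texts = row
    return is_smallint(location_id) and all(
        len(value) <= limit for value, limit in zip(texts, max_lengths)
    )


def load_with_reject_limit(rows, validate, reject_value):
    """Split rows into accepted and rejected; fail once the limit is exceeded."""
    accepted, rejected = [], []
    for row in rows:
        if validate(row):
            accepted.append(row)
        else:
            rejected.append(row)
            if len(rejected) > reject_value:
                raise ValueError(f"more than {reject_value} rows rejected")
    return accepted, rejected
```

In this sketch the rejected list plays the role of the files written under REJECTED_ROW_LOCATION, so the bad rows can be inspected after the read.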


Create External Table
Lab
NYC Taxi Data Files Overview
[Diagram: NYC Taxi data files: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor]
Create External Table
[Diagram: the Database (nyc_taxi_ldw) contains an External Data Source (nyc_taxi_src), an External File Format (csv_file_format) and a Database Schema (bronze), which together support the External Table (taxi_zone).]
Create External Table – Delimited Text
Create External Table – Delimited Text (Assignment)
Create External Table – Parquet
Create External Table – Delta (Assignment)
[Diagram shown on each of these slides: NYC Taxi data files: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor]
Views

Virtual tables with rows and columns

Data in a view is defined by a SELECT statement
Views
[Diagram: User/Application -> View -> Select -> OPENROWSET or External Table -> Data Lake Storage]


Create View
CREATE VIEW [ schema_name . ] view_name
    [ ( column_name [ ,...n ] ) ]
AS
<select_statement> [;]


Create View - OPENROWSET
CREATE VIEW bronze.vw_vendor
AS
SELECT *
FROM OPENROWSET(
    BULK 'vendor.csv',
    DATA_SOURCE = 'nyc_taxi_data_raw',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS vendor;


Create View – External Table
CREATE VIEW bronze.vw_taxi_zone_brooklyn
AS
SELECT location_id, zone, service_zone
FROM bronze.taxi_zone
WHERE borough = 'Brooklyn';


Create View
Lab
NYC Taxi Data Files Overview
Create View – Assignment
[Diagram shown on both slides: NYC Taxi data files: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor]
Partition Pruning

External Tables cannot prune partitions.

A combination of views and OPENROWSET can be used to prune partitions
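
Partition pruning means reading only the folders whose partition values match the filter instead of scanning every file. The Python sketch below illustrates the idea on a list of file paths, assuming the year=YYYY/month=MM folder layout used for the trip data. It is a conceptual illustration, not the serverless SQL mechanism; the file names and the prune function are hypothetical.

```python
import re

# Matches the year=YYYY/month=MM folder convention.
PARTITION = re.compile(r"year=(\d{4})/month=(\d{2})/")


def prune(paths, year, month=None):
    """Keep only the paths whose partition folders match the requested values."""
    selected = []
    for path in paths:
        match = PARTITION.search(path)
        if match is None:
            continue  # not a partitioned path
        file_year, file_month = match.groups()
        if file_year == year and month in (None, file_month):
            selected.append(path)
    return selected
```

With a filter on year 2020 and month 01, only one folder of files is read and the remaining partitions are skipped, which reduces the amount of data processed.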


External Tables vs Views
Section Overview - Data Ingestion
[Diagram: Synapse Pipelines orchestrate three stages (Ingest, Transform, Present) across NYC Taxi Data, Bronze Layer, Silver Layer and Gold Layer. Outputs: Report and Discovery. Consumers: Analysts.]
Section Overview - Data Ingestion

CETAS Statement Overview

Transform Delimited to Parquet

Transform JSON to Parquet

Challenges Processing Partitioned Data

Stored Procedures Introduction

Transform Partitioned Data

Create Views
Data Transformation

Write the data in Columnar format

Remove unwanted columns

Transform semi-structured data (e.g. JSON)

Store pre-aggregated data

Build a traditional data warehouse


Create External Table As Select (CETAS)
[Diagram: a SELECT (plus joins, aggregations, transformations) over OPENROWSET, a View or an External Table on Data Lake Storage (raw) feeds CREATE EXTERNAL TABLE AS SELECT, which uses an External Data Source and an External File Format to create an External Table on Data Lake Storage (transformed).]


CREATE EXTERNAL TABLE AS SELECT (CETAS)

CREATE EXTERNAL TABLE {[database_name .] [ schema_name . ] table_name }
    [(column_name [,...n ] ) ]
WITH (
    LOCATION = 'path_to_folder',
    DATA_SOURCE = external_data_source_name,
    FILE_FORMAT = external_file_format_name
)
AS
<select_statement>
[;]
CREATE EXTERNAL TABLE AS SELECT (CETAS)

CREATE EXTERNAL TABLE transformed.taxi_zone
WITH (
LOCATION = 'transformed/taxi_zone',
DATA_SOURCE = nyc_taxi_data,
FILE_FORMAT = parquet_file_format
)
AS
SELECT *
FROM raw.taxi_zone ;
CREATE EXTERNAL TABLE AS SELECT (CETAS)
CREATE EXTERNAL TABLE transformed.taxi_zone
WITH (
LOCATION = 'transformed/taxi_zone',
DATA_SOURCE = nyc_taxi_data,
FILE_FORMAT = parquet_file_format
)
AS
SELECT *
FROM
OPENROWSET(
BULK 'abfss://[email protected]/raw/taxi_zone.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
FIRSTROW = 2
)
WITH (
location_id SMALLINT 1,
borough VARCHAR(15) 2,
zone VARCHAR(50) 3,
service_zone VARCHAR(15) 4
) AS [result];
CREATE EXTERNAL TABLE AS SELECT (CETAS)

CREATE EXTERNAL TABLE transformed.taxi_borough
WITH (
LOCATION = 'transformed/taxi_borough',
DATA_SOURCE = nyc_taxi_data,
FILE_FORMAT = parquet_file_format
)
AS
SELECT borough, COUNT(1) AS number_of_zones
FROM raw.taxi_zone
GROUP BY borough;
Transform to Parquet Format
Transform to Parquet Format (Assignment)
Transform to Parquet Format from JSON
Transform to Parquet Format from JSON (Assignment)
Transform to Parquet Format – Partitioned Files
[Diagram shown on each of these slides: NYC Taxi data files: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor]
Stored Procedures

Group of T-SQL statements stored in the database

Accept input parameters & Return output parameters

Can execute another stored procedure


Stored Procedures - Example

CREATE PROCEDURE usp_test @borough VARCHAR(15)
AS
BEGIN
    SELECT *
    FROM bronze.taxi_zone
    WHERE borough = @borough;
END;

EXEC usp_test @borough = 'Queens';


Stored Procedures - Benefits

Reuse of code

Easier maintenance

Improved security
Stored Procedures - Limitations

T-SQL support
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview#t-sql-support

Reduced implementation
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-stored-procedures#limitations
Transform to Parquet Format – Partitioned Files
[Diagram: a single CETAS reads raw/trip_data (year=2020 with month=01/*.csv, month=02/*.csv, ... month=12/*.csv; year=2021 with month=01/*.csv) and writes silver/trip_data/*.parquet without partition folders.]


Transform to Parquet Format – Partitioned Files
[Diagram: the desired result, marked with a question mark: silver/trip_data keeps the same year=/month= folder structure as raw/trip_data, with *.parquet files in place of *.csv files.]
Transform to Parquet Format – Partitioned Files
[Diagram: one CETAS statement per partition. CETAS 1 converts year=2020/month=01, CETAS 2 converts month=02, and so on up to CETAS 12 for month=12; CETAS 13 converts year=2021/month=01, and so on. Each statement writes the matching silver/trip_data partition as *.parquet.]
Transform to Parquet Format – Partitioned Files
[Diagram: one Stored Procedure executed once per partition. Exec SP 2020 01 converts raw/trip_data year=2020/month=01/*.csv to silver/trip_data year=2020/month=01/*.parquet, followed by Exec SP 2020 02, ... Exec SP 2020 12, then Exec SP 2021 01 for year=2021/month=01, and so on.]

Transform to Parquet Format – Partitioned Files
[Diagram: the same per-partition stored procedure layout, with an X marked against each stored procedure execution.]
Transform to Parquet Format – Partitioned Files

Stored Procedure – CETAS to transform data

Stored Procedure – DROP External Tables

Execute stored procedure for each partition

Create view with partitioned columns
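
Executing the stored procedure once per partition means producing one EXEC call for every year and month. The Python sketch below generates those calls as text for an inclusive range of months. It is only an illustration: the procedure name passed in is hypothetical, and the parameter order (year, then month) is an assumption based on the "Exec SP 2020 01" labels in the diagrams.

```python
def partition_exec_statements(proc_name, start, end):
    """Build one EXEC statement per (year, month) from start to end, inclusive."""
    statements = []
    year, month = start
    while (year, month) <= end:
        statements.append(f"EXEC {proc_name} '{year}', '{month:02d}';")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return statements
```

Calling it with a hypothetical procedure name and the range (2020, 11) to (2021, 1) yields three statements, for 2020-11, 2020-12 and 2021-01.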


Section Overview - Data Transformation
[Diagram: Synapse Pipelines orchestrate three stages (Ingest, Transform, Present) across NYC Taxi Data, Bronze Layer, Silver Layer and Gold Layer. Outputs: Report and Discovery. Consumers: Analysts.]
Section Overview - Data Transformation

Project Requirements

Data for Campaign Analysis

Data for Taxi Demand

Create View
Business Requirements (1)

Campaign to encourage credit card payments

Trips made using credit card/ cash payments

Payment behaviour during days of the week/ weekend

Payment behaviour between boroughs


Non Functional Requirements

Reporting data to be pre-aggregated for better performance

Pre-aggregate data for each year/ month partition in isolation

Able to read data efficiently for specific months from aggregated data

Minimize the number of aggregated tables created


Business Requirements – columns identified

Cash Trip Count, Card Trip Count, Trip Date, Trip Day, Trip Day Week End Ind, Borough

Non Functional Requirements – columns identified

Year, Month
Gold Layer

Trip Data
Year
Month
Borough
Trip Date
Trip Day
Trip Day Week End Ind
Cash Trip Count
Card Trip Count


Silver Layer
[Diagram: silver layer data sets: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor]
Gold Layer – Trip Data

Column                   Source (Silver Layer)
Year                     Trip Data
Month                    Trip Data
Borough                  Taxi Zone
Trip Date                Trip Data
Trip Day                 Calendar
Trip Day Week End Ind    Derived from Trip Day
Cash Trip Count          Payment Type + Trip Data
Card Trip Count          Payment Type + Trip Data
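
The column-to-source mapping can be sketched as a small aggregation over in-memory rows. This is a Python illustration of the logic only, not the T-SQL used in the course. It simplifies one point, deriving Trip Day from the date instead of joining to the Calendar data set, and all field names, lookup dictionaries and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def build_gold_rows(trips, zone_to_borough, payment_types):
    """Count cash and card trips per (trip date, borough)."""
    counts = defaultdict(lambda: [0, 0])  # [cash, card]
    for trip in trips:
        borough = zone_to_borough[trip["pickup_location_id"]]  # Taxi Zone lookup
        description = payment_types[trip["payment_type"]]      # Payment Type lookup
        key = (trip["trip_date"], borough)
        if description == "Cash":
            counts[key][0] += 1
        elif description == "Credit card":
            counts[key][1] += 1
    rows = []
    for (trip_date, borough), (cash, card) in sorted(counts.items()):
        rows.append({
            "year": trip_date.year,
            "month": trip_date.month,
            "borough": borough,
            "trip_date": trip_date,
            "trip_day": DAY_NAMES[trip_date.weekday()],  # simplification: no Calendar join
            "trip_day_weekend_ind": "Y" if trip_date.weekday() >= 5 else "N",
            "cash_trip_count": cash,
            "card_trip_count": card,
        })
    return rows
```

Keeping year and month on every output row is what allows the aggregated data to be written and read back one partition at a time.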
Business Requirements (2)
Identify taxi demand

Demand based on borough

Demand based on day of the week/ weekend

Demand based on trip type (i.e., Street hail/ Despatch)

Trip distance, trip duration, total fare amount etc per day/ borough
Non Functional Requirements

Reporting data to be pre-aggregated for better performance

Pre-aggregate data for each year/ month partition in isolation

Able to read data efficiently for specific months from aggregated data

Minimize the number of aggregated tables created


Business Requirements (2) – columns identified

Borough, Trip Date, Trip Day, Trip Day Week End Ind, Despatch Trip Count, Street-hail Trip Count, Trip Distance, Trip Duration, Total Fare Amount
Section Overview – Synapse Pipelines

Synapse Pipelines

Ingest / Transform / Present

NYC Taxi Data -> Bronze Layer -> Silver Layer -> Gold Layer -> Report

Discovery

Analysts
Section Overview - Synapse Pipelines

Overview

Components

Creating Pipelines

Variables & Parameters

Dynamic Pipelines

Pipeline Dependencies

Creating Triggers
Synapse Pipelines Overview
What are Synapse Pipelines

A fully managed, serverless data integration & orchestration service


The Data Problem

Multi Cloud

SaaS Apps

Data Formats

On Premises
The Data Problem

Data Sources -> Ingest -> Transform/ Analyze -> Publish -> Data Consumers
Data Integration

100+ Native connectors

SaaS connectors

Multi-cloud support

On-premises support

Serverless & Auto Scale

Control flow activities


Data Orchestration
Synapse Data Flows

Synapse Dedicated SQL Pool Scripts

Synapse Serverless SQL Pool Scripts

Synapse Spark Notebooks

Azure Databricks Notebooks

Azure HDInsight Scripts

Azure Machine Learning Pipelines


Schedule & Monitor

Schedule to run pipelines

Monitoring within Synapse

Alerting within Synapse

Ability to monitor/ alert from outside of Synapse




Azure Data Factory vs Synapse Pipelines

Share the same codebase, with minor differences

https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences
Synapse Integration Pipeline Components

Trigger

Pipeline

Activity

Dataset

Linked Service

Storage: ADLS, SQL Database

Compute: Synapse Pool, Azure Databricks
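The component hierarchy above can be modelled as plain data, which may help when reading the diagrams that follow. This is only an illustrative sketch: the class and field names are invented here and do not correspond to any Synapse SDK; the account and workspace names are the ones used later in the course.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkedService:
    name: str      # connection definition
    target: str    # the storage or compute it points at

@dataclass
class Dataset:
    path: str
    linked_service: LinkedService   # a dataset reaches storage via a linked service

@dataclass
class Activity:
    name: str
    dataset: Optional[Dataset] = None               # storage-facing activities
    linked_service: Optional[LinkedService] = None  # compute-facing activities

@dataclass
class Pipeline:
    name: str
    activities: List[Activity] = field(default_factory=list)

@dataclass
class Trigger:
    name: str
    pipelines: List[Pipeline] = field(default_factory=list)

adls = LinkedService("ls_adls", "Storage Account (ADLS Gen2): synapsecoursedl")
sql = LinkedService("ls_serverless", "Serverless SQL Pool: Built-in @ synapse-course-ws")

pipeline = Pipeline("taxi_zone", [
    Activity("Delete", dataset=Dataset("raw/taxi_zone", adls)),
    Activity("Script", linked_service=sql),
])
trigger = Trigger("daily", [pipeline])

def targets(trigger: Trigger) -> List[str]:
    """List the storage/compute target each activity ultimately reaches."""
    result = []
    for p in trigger.pipelines:
        for a in p.activities:
            ls = a.dataset.linked_service if a.dataset else a.linked_service
            result.append(f"{a.name} -> {ls.target}")
    return result
```

Calling `targets(trigger)` walks Trigger, Pipeline, Activity, Dataset and Linked Service in the same order as the diagram.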
Bronze to Silver Layer Transformation

Taxi Zone Trip Type

Rate Code

Trip Data

Payment Type

Calendar Vendor
Transform to Parquet Format – Taxi Zone

raw/taxi_zone -> CETAS -> silver/taxi_zone


Transform to Parquet Format – Taxi Zone

raw/taxi_zone -> DROP IF EXISTS + CETAS -> silver/taxi_zone
Transform to Parquet Format – Taxi Zone

raw/taxi_zone -> DELETE File If Exists, then DROP Table If Exists + CETAS -> silver/taxi_zone
Transform to Parquet Format – Taxi Zone

Pipeline: Delete Activity (DELETE File If Exists) -> Script Activity (DROP Table If Exists + CETAS)
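The SQL that the Script Activity runs could look roughly like the text produced by the sketch below. The schema, external data source and file format names (`silver`, `bronze`, `nyc_taxi_src`, `parquet_file_format`) are assumptions, not names confirmed by these slides. The files are deleted in a separate step because dropping an external table removes only its metadata, and CETAS expects to write to an empty location.

```python
def build_cetas_script(schema, table, location, source_query,
                       data_source="nyc_taxi_src",           # hypothetical name
                       file_format="parquet_file_format"):   # hypothetical name
    """Return T-SQL text: drop the external table if it exists, then CETAS."""
    qualified = f"{schema}.{table}"
    return "\n".join([
        f"IF OBJECT_ID('{qualified}') IS NOT NULL",
        f"    DROP EXTERNAL TABLE {qualified};",
        "",
        f"CREATE EXTERNAL TABLE {qualified}",
        "WITH (",
        f"    DATA_SOURCE = {data_source},",
        f"    LOCATION = '{location}',",
        f"    FILE_FORMAT = {file_format}",
        ")",
        "AS",
        f"{source_query};",
    ])

script = build_cetas_script(
    "silver", "taxi_zone", "silver/taxi_zone",
    "SELECT * FROM bronze.taxi_zone",   # hypothetical source object
)
```

How the two statements are batched depends on the client that submits them; the sketch only builds the text.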
Delete Activity

Pipeline: Delete Activity, Script Activity

Delete Activity -> Dataset (raw/taxi_zone) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl
Script Activity

Pipeline: Delete Activity, Script Activity

Delete Activity -> Dataset (raw/taxi_zone) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Script Activity -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Transformation Pipeline – Taxi Zone

Trigger -> Pipeline: Delete Activity, Script Activity

Delete Activity -> Dataset (raw/taxi_zone) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Script Activity -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Transformation Pipeline – Taxi Zone

Trigger -> Pipeline: Delete Activity, Stored Procedure Activity

Delete Activity -> Dataset (raw/taxi_zone) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Stored Procedure Activity -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Bronze to Silver Layer Transformation

Taxi Zone Trip Type

Rate Code

Trip Data

Payment Type

Calendar Vendor
Transformation Pipeline – Taxi Zone

Trigger -> Pipeline (Taxi Zone): Delete Activity, Stored Procedure Activity

Delete Activity -> Dataset (silver/taxi_zone) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Stored Procedure Activity -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Transformation Pipeline – Calendar

Trigger -> Pipeline (Calendar): Delete Activity, Stored Procedure Activity

Delete Activity -> Dataset (silver/calendar) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Stored Procedure Activity -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Transformation Pipeline

Trigger -> Pipeline: Delete Activity + Stored Procedure Activity (Taxi Zone), Delete Activity + Stored Procedure Activity (Calendar)

Delete Activities -> Dataset (silver/Taxi Zone), Dataset (silver/Calendar) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Stored Procedure Activities -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Dynamic Pipeline - Parameters

Available in Pipelines, Datasets, Linked Services & Data Flows

Pass external values from one component to the other

The value cannot be changed inside the component

Unlocks reusability of a component
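The parameter behaviour listed above (a value passed in from outside, read-only inside the component, enabling reuse) can be mimicked with a frozen dataclass. This is only an analogy in Python, not the way Synapse implements parameters; the class and function names are invented for this sketch.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class DatasetParameters:
    """Values supplied by the caller; read-only inside the component."""
    folder_path: str

def delete_target(params: DatasetParameters) -> str:
    # One reusable component; behaviour depends only on the passed parameter
    return f"delete files under {params.folder_path}"

taxi_zone = DatasetParameters(folder_path="silver/taxi_zone")
calendar = DatasetParameters(folder_path="silver/calendar")

try:
    taxi_zone.folder_path = "silver/other"   # attempt to change it inside
    changed = True
except FrozenInstanceError:
    changed = False                          # the value cannot be changed
```

The same `delete_target` component serves both tables, which is the reusability the slide refers to.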


Dynamic Pipeline - Parameters & Variables

Trigger -> Pipeline: Variable (folder_path), Variable (usp_name)

Delete Activity (Dynamic) -> Dataset (folder_path parameter) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Stored Procedure Activity (Dynamic) -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Dynamic Pipeline - Parameters & Variables

Trigger -> Pipeline: Variable (Array) -> ForEach Activity

Delete Activity (Dynamic) -> Dataset (folder_path parameter) -> Linked Service -> Storage Account (ADLS Gen2): synapsecoursedl

Stored Procedure Activity (Dynamic) -> Linked Service -> Serverless SQL Pool: Built-in @ synapse-course-ws
Transform to Parquet Format – Partitioned Files

raw/trip_data                  Stored Procedure    silver/trip_data
year=2020/month=01/*.csv       Exec SP 2020 01     year=2020/month=01/*.parquet
year=2020/month=02/*.csv       Exec SP 2020 02     year=2020/month=02/*.parquet
…                              Exec SP             …
year=2020/month=12/*.csv       Exec SP 2020 12     year=2020/month=12/*.parquet
year=2021/month=01/*.csv       Exec SP 2021 01     year=2021/month=01/*.parquet
…                              Exec SP             …
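The "one stored procedure call per year/ month partition" pattern above can be sketched as a loop that emits one EXEC statement per partition. The procedure name `silver.usp_silver_trip_data` and its two string arguments are hypothetical; the slides only show "Exec SP" with a year and a month.

```python
def partitions(start, end):
    """Yield (year, month) as zero-padded strings from start to end inclusive.

    start and end are (year, month) integer tuples.
    """
    year, month = start
    while (year, month) <= end:
        yield f"{year:04d}", f"{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1

def exec_statements(proc_name, start, end):
    """Build one EXEC statement per year/month partition."""
    return [f"EXEC {proc_name} '{y}', '{m}';" for y, m in partitions(start, end)]

# Hypothetical procedure name; covers 2020-01 through 2021-01
statements = exec_statements("silver.usp_silver_trip_data", (2020, 1), (2021, 1))
```

Each statement touches a single partition folder, so a failed month can be rerun without reprocessing the others.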

Transform to Parquet Format – Partitioned Files

Pipeline

Script Activity (Get partition year/ month)

ForEach Activity (ForEach partition year/ month): Delete Activity, Stored Procedure Activity

Script Activity (Create Silver View)
Section Overview – Spark Pool

Spark Pool Overview

Create Spark Pool

Notebooks Overview

Spark Pool Integration with Serverless SQL Pool

Create Spark table

Create Synapse Pipeline


Azure Synapse Analytics – Spark Pool

Azure Synapse: Development / Monitoring / Management & Security

Data Integration: Synapse Data Flows, Synapse Pipelines
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Azure Synapse Analytics – Spark Pool

Apache Spark
Apache Spark is a lightning-fast unified analytics engine for big data processing and machine learning

100% Open source under Apache License

Simple and easy to use APIs

In-memory processing engine

Distributed computing platform

Unified engine which supports SQL, streaming, ML and graph processing
Apache Spark Architecture

Spark SQL, Spark Streaming, Spark ML, Spark Graph

DataFrame / Dataset APIs

Spark SQL Engine: Catalyst Optimizer, Tungsten

Spark Core: Scala, Python, Java, R

Resilient Distributed Dataset (RDD)

Spark Standalone, YARN, Apache Mesos, Kubernetes


Azure Synapse Analytics – Spark Pool

Ease of pool creation

Use of notebooks

Delta Lake

Scalability

Pre-loaded libraries

Integration with 3rd party IDEs

Support for C#

Integration with Serverless SQL Pool
Azure Synapse Analytics – Spark Pool
Use Cases

Data Preparation/ Data Transformation

Machine Learning


Create Spark Pool
Lab
Notebooks Overview
Lab
Azure Synapse Analytics

Azure Synapse: Development / Monitoring / Management & Security

Data Integration: Synapse Data Flows, Synapse Pipelines
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Azure Synapse Analytics

Data Sources -> Spark Pool, Serverless SQL Pool -> Data Consumers
Metadata Replication - Database

Spark Pool: demo_db

Serverless SQL Pool: demo_db

Meta store: Metadata for Spark Pool -> Replicate -> Metadata for Serverless SQL Pool

Azure Data Lake Storage Gen2


Metadata Replication - Table

Spark Pool: demo_db.demo_table

Serverless SQL Pool: demo_db.dbo.demo_table

Meta store: Metadata for Spark Pool -> Replicate -> Metadata for Serverless SQL Pool

Azure Data Lake Storage Gen2


Metadata Replication

Metadata replicated asynchronously

Supports only Parquet & CSV backed tables

Replicated tables cannot be updated by Serverless SQL Pool

Secured at the underlying storage level

Database names have to be unique across Spark Pools
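The naming and format rules from the replication slides above can be captured in a few lines of Python. This is a sketch of the rules as stated on the slides (database name kept, `dbo` schema inserted, only Parquet and CSV backed tables replicated); the function name is invented, and it does not call any Synapse API.

```python
from typing import Optional

SUPPORTED_FORMATS = {"parquet", "csv"}   # per the slide above

def serverless_name(spark_table: str, file_format: str) -> Optional[str]:
    """Map a Spark 'db.table' name to the Serverless SQL Pool 'db.dbo.table' name.

    Returns None when the table's format is not replicated.
    """
    if file_format.lower() not in SUPPORTED_FORMATS:
        return None
    database, table = spark_table.split(".", 1)
    return f"{database}.dbo.{table}"

replicated = serverless_name("demo_db.demo_table", "parquet")
skipped = serverless_name("demo_db.demo_table", "json")
```

Because replication is asynchronous, a table created in the Spark Pool may take a short while to appear under the mapped name.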


Integration Between Spark Pool & Serverless SQL Pool

silver/trip_data_green -> Aggregate the silver data -> Create Spark Table with aggregates -> gold/trip_data_green_agg
Integration Between Spark Pool & Serverless SQL Pool

Spark Pool: nyc_taxi_ldw_spark.trip_data_green_agg

Serverless SQL Pool: nyc_taxi_ldw_spark.dbo.trip_data_green_agg

Meta store: Metadata for Spark Pool -> Replicate -> Metadata for Serverless SQL Pool

Azure Data Lake Storage Gen2


Synapse – Power BI Integration
Section Overview – Power BI Integration

Power BI Integration Overview

Connecting from Power BI Desktop

Publishing to Power BI Workspace

Power BI Integration within Synapse Studio

Create Reports
Synapse – Power BI Integration

Azure Synapse: Development / Monitoring / Management & Security

Data Integration: Synapse Data Flows, Synapse Pipelines
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Pre-requisites

Power BI Desktop

Power BI Workspace
Power BI Workspace
Azure Active Directory User

Power BI User

Power BI Workspace

Access to the Synapse resource group

Access to Synapse Studio

Access to the Synapse Primary Storage Account


Project Requirements
Business Requirements (1)

Campaign to encourage credit card payments

Trips made using credit card/ cash payments

Payment behaviour during days of the week/ weekend

Payment behaviour between boroughs


Business Requirements (2)
Identify taxi demand

Demand based on borough

Demand based on day of the week/ weekend

Demand based on trip type (i.e., Street hail/ Despatch)

Trip distance, trip duration, total fare amount etc. per day/ borough
Section Overview – Synapse Link

Synapse Link Overview

Synapse Link for Cosmos DB Overview


Synapse Link
Create Cosmos DB Service

Query using Serverless SQL Pool

Query using Spark Pool


Synapse Link
Why do we need Synapse Link?

Row Oriented Column Oriented


Synapse Link

Row Oriented Column Oriented

Cosmos DB - GA
Dataverse - GA
SQL Server 2022 – Private Preview
Synapse Link for Cosmos DB

Azure Cosmos DB Container

Transactional Store: row store optimized for transactional reads and writes

Analytical Store: column store optimized for analytical queries

Transactional Data -> Transactional Store -> Auto Sync -> Analytical Store

Azure Synapse Link / HTAP

Azure Synapse Analytics: Spark Pool, Serverless SQL Pool

Machine Learning, Bigdata Analytics, Data Exploration, BI Reporting
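The difference between the row-oriented transactional store and the column-oriented analytical store can be shown with a toy example. The documents below are made up, and the dictionaries only imitate the two layouts; this is not how Cosmos DB stores data internally.

```python
# Hypothetical documents, stored one complete record at a time (row oriented)
rows = [
    {"id": 1, "borough": "Queens", "fare": 12.5},
    {"id": 2, "borough": "Bronx", "fare": 8.0},
    {"id": 3, "borough": "Queens", "fare": 20.0},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per field (column oriented)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Transactional access: fetch one whole record, cheap in the row layout
point_lookup = rows[1]

# Analytical access: scan a single field across all records,
# cheap in the column layout because other fields are never touched
total_fare = sum(columns["fare"])
```

Keeping both layouts in sync automatically is what lets analytical queries run without touching the transactional store.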
Synapse Link for Cosmos DB - Benefits

Less Complex Solution

Fully managed service

Near Real Time Analytics

No impact to Transactional System

Automatic Schema Inference

Native Integration between Synapse and Cosmos


Synapse Link for Cosmos DB - Limitations

Only SQL API and MongoDB API supported

Dedicated SQL Pool – Not Supported

Limited support for existing Cosmos DB containers

Backup and Restore of Analytical Store – Not Available

Custom Partitioning – In Preview


Project Overview

NYC Taxis are fitted with a device to manage taxi hires

Device sends data to Cosmos DB every minute

Create a Cosmos DB analytical store using Synapse Link

Ability to query/ process the analytical store using Spark Pool

Ability to query and make data available to Power BI using Serverless SQL Pool
Project Solution

Azure Cosmos DB Container

Transactional Store: row store optimized for transactional reads and writes

Analytical Store: column store optimized for analytical queries

Taxi Device Heartbeat Data -> Transactional Store -> Auto Sync -> Analytical Store

Azure Synapse Link / HTAP

Azure Synapse Analytics: Spark Pool, Serverless SQL Pool

Machine Learning, Bigdata Analytics, Data Exploration, BI Reporting
Section Overview – Dedicated SQL Pool

Overview

Create Dedicated SQL Pool

External Tables

Copy Command

Connecting from Azure Data Studio

Connecting from Power BI


Dedicated SQL Pool

Dedicated SQL pool (formerly SQL DW) is a distributed query engine that you can use to perform high performance big data analytics using T-SQL.
Dedicated SQL Pool - Architecture

Compute

User or Application -> T-SQL -> Control Node (DMS)

Massively Parallel Processing Engine (MPP)

Compute Node (DMS), Compute Node (DMS), Compute Node (DMS)

Storage

60 distributions
Dedicated SQL Pool - Architecture

Compute

User or Application -> T-SQL -> Control Node (DMS)

Massively Parallel Processing Engine (MPP)

Compute Node (DMS): all 60 distributions

Storage

60 distributions
Dedicated SQL Pool - Architecture

Compute

User or Application -> T-SQL -> Control Node (DMS)

Massively Parallel Processing Engine (MPP)

Compute Node (DMS): 20 distributions
Compute Node (DMS): 20 distributions
Compute Node (DMS): 20 distributions

Storage

60 distributions
Dedicated SQL Pool - Architecture

Compute

User or Application -> T-SQL -> Control Node (DMS)

Massively Parallel Processing Engine (MPP)

Compute Node (DMS): 20 distributions
Compute Node (DMS): 20 distributions
Compute Node (DMS): 20 distributions

Distribution Methods: Round Robin, Hash, Replicate

Storage

60 distributions
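The three distribution methods named above can be simulated in a few lines. This is a conceptual sketch only: `zlib.crc32` stands in for the engine's internal hash function, and the contiguous mapping of distributions to compute nodes is an assumption made to keep the arithmetic visible.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60   # fixed number of distributions in a dedicated SQL pool

def round_robin(row_index: int) -> int:
    """Round Robin: rows are spread evenly, with no regard to their content."""
    return row_index % DISTRIBUTIONS

def hash_distribution(key) -> int:
    """Hash: the same key value always lands in the same distribution.

    crc32 is a stand-in; the real engine uses its own hash function.
    """
    return zlib.crc32(str(key).encode("utf-8")) % DISTRIBUTIONS

def replicate(rows, compute_nodes: int):
    """Replicate: every compute node holds a full copy of a (small) table."""
    return {node: list(rows) for node in range(compute_nodes)}

def compute_node(distribution: int, compute_nodes: int) -> int:
    """Assumed contiguous assignment of distributions to compute nodes."""
    return distribution // (DISTRIBUTIONS // compute_nodes)

rr_counts = Counter(round_robin(i) for i in range(600))
copies = replicate(["Queens", "Bronx"], compute_nodes=3)
```

With three compute nodes each node serves 20 distributions, matching the diagram; with one node it serves all 60.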
Dedicated SQL Pool – Use Cases

Traditional Data Warehouses (Facts/ Dims)

Larger Data Warehouses (Greater than 1TB)

Instant response times

Predictable Performance
Dedicated SQL Pool – Price & Performance

Based on Data Warehouse Units (DWUs)

DWU relates to CPU, memory, and IO


Dedicated SQL Pool - DWU
Performance level (DWU)   Compute nodes   Distributions per compute node   Memory per data warehouse (GB)

DW100c                    1               60                               60
DW200c                    1               60                               120
DW300c                    1               60                               180
DW400c                    1               60                               240
DW500c                    1               60                               300
DW1000c                   2               30                               600
DW1500c                   3               20                               900
DW2000c                   4               15                               1200
DW2500c                   5               12                               1500
DW3000c                   6               10                               1800
DW5000c                   10              6                                3000
DW6000c                   12              5                                3600
DW7500c                   15              4                                4500
DW10000c                  20              3                                6000
DW15000c                  30              2                                9000
DW30000c                  60              1                                18000
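The table follows a regular pattern, which the sketch below checks: compute nodes times distributions per node is always 60, and memory grows by 60 GB per 100 DWU. The formulas are read off the rows above, not taken from a published specification, so treat them as a description of this table only.

```python
DISTRIBUTIONS = 60

# (DWU, compute nodes, distributions per node, memory GB), copied from the table
DWU_TABLE = [
    (100, 1, 60, 60), (200, 1, 60, 120), (300, 1, 60, 180),
    (400, 1, 60, 240), (500, 1, 60, 300), (1000, 2, 30, 600),
    (1500, 3, 20, 900), (2000, 4, 15, 1200), (2500, 5, 12, 1500),
    (3000, 6, 10, 1800), (5000, 10, 6, 3000), (6000, 12, 5, 3600),
    (7500, 15, 4, 4500), (10000, 20, 3, 6000), (15000, 30, 2, 9000),
    (30000, 60, 1, 18000),
]

def compute_nodes(dwu: int) -> int:
    """One node up to DW500c, then one node per 500 DWU (pattern in the table)."""
    return max(1, dwu // 500)

def distributions_per_node(dwu: int) -> int:
    """The 60 distributions are shared evenly among the compute nodes."""
    return DISTRIBUTIONS // compute_nodes(dwu)

def memory_gb(dwu: int) -> int:
    """60 GB of memory per 100 DWU (pattern in the table)."""
    return dwu * 60 // 100

derived = [(d, compute_nodes(d), distributions_per_node(d), memory_gb(d))
           for d, _, _, _ in DWU_TABLE]
```

At DW30000c each compute node owns exactly one distribution, which is why the table stops there.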
Synapse – Dedicated SQL Pool
Use Cases
Synapse – Dedicated SQL Pool

Azure Synapse: Development / Monitoring / Management & Security

Data Integration: Synapse Data Flows, Synapse Pipelines
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Azure Synapse Analytics

Data Sources -> Synapse Pipelines -> Dedicated SQL Pool -> Data Consumers
Azure Synapse Analytics

Data Sources -> Synapse Pipelines -> Serverless SQL Pool -> Dedicated SQL Pool -> Data Consumers
Azure Synapse Analytics

Data Sources -> Synapse Pipelines -> Spark Pool -> Dedicated SQL Pool -> Data Consumers
Azure Synapse Analytics

Data Sources -> Synapse Pipelines -> Serverless SQL Pool / Spark Pool -> Dedicated SQL Pool -> Data Consumers
Requirement

Layers: Bronze Layer -> Silver Layer -> Gold Layer (trip_data_green) -> Dedicated SQL Pool Table -> Power BI -> Data Consumers

Engines: Serverless SQL Pool, Serverless SQL Pool, Dedicated SQL Pool
Solution

Layers: Bronze Layer -> Silver Layer -> Gold Layer (trip_data_green) -> Dedicated SQL Pool Table -> Power BI -> Data Consumers

Engines: Serverless SQL Pool, Serverless SQL Pool, Dedicated SQL Pool

Dedicated SQL Pool: External Table trip_data_green -> CTAS -> Internal Table trip_data_green
Solution

Layers: Bronze Layer -> Silver Layer -> Gold Layer (trip_data_green) -> Dedicated SQL Pool Table -> Power BI -> Data Consumers

Engines: Serverless SQL Pool, Serverless SQL Pool, Dedicated SQL Pool

Dedicated SQL Pool: COPY Command -> Internal Table trip_data_green
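The two loading options shown in the Solution slides, CTAS from an external table and the COPY command, could produce T-SQL along the lines of the text built below. The schema names (`dbo`, `ext`), the container name in the URL, the round-robin distribution and the managed identity credential are all assumptions for illustration; only the table name trip_data_green and the storage account synapsecoursedl come from the slides.

```python
def ctas_statement(target: str, external_table: str,
                   distribution: str = "ROUND_ROBIN") -> str:
    """CTAS: create an internal table from a query over an external table."""
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (DISTRIBUTION = {distribution}, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {external_table};"
    )

def copy_statement(target: str, source_url: str) -> str:
    """COPY: load files straight into an existing internal table."""
    return (
        f"COPY INTO {target}\n"
        f"FROM '{source_url}'\n"
        "WITH (FILE_TYPE = 'PARQUET', "
        "CREDENTIAL = (IDENTITY = 'Managed Identity'));"
    )

# Hypothetical schema and container names
ctas_sql = ctas_statement("dbo.trip_data_green", "ext.trip_data_green")
copy_sql = copy_statement(
    "dbo.trip_data_green",
    "https://synapsecoursedl.dfs.core.windows.net/nyc-taxi-data/gold/trip_data_green/*.parquet",
)
```

CTAS needs the external table to exist first, while COPY reads the files directly, which is the main practical difference between the two slides.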
Congratulations!
&
Thank you
Feedback
Ratings & Review
Thank you
&
Good Luck!
