Azure Synapse Course Presentation

This document provides an overview of Azure Synapse Analytics. It discusses how modern data warehouses have evolved from traditional data warehouses to incorporate data lakes and support a wider range of workloads. It compares architectures with and without Azure Synapse Analytics, noting that Synapse brings together disparate services into a single workspace with shared monitoring, management, security and metadata.

About Me

Ramesh Retnasamy
Data Engineer/ Machine Learning Engineer

https://www.linkedin.com/in/ramesh-retnasamy/
About this course

Azure Synapse Analytics

Limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics.
Azure Synapse Analytics
Azure Synapse: Development / Monitoring / Management & Security
Data Integration: Synapse Pipelines, Synapse Data Flows
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
NYC Taxi
Cloud Data Platform
Solution Architecture – Serverless SQL Pool
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)

Solution Architecture – Spark Pool
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)

Solution Architecture – Dedicated SQL Pool
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)
Solution Architecture - Synapse Link
Source: NYC Taxi Device Data
Azure Cosmos DB Container
Transactional Store: row store optimized for transactional reads and writes
Analytical Store: column store optimized for analytical queries
Auto Sync: Transactional Store → Analytical Store
Azure Synapse Link / HTAP: Analytical Store → Azure Synapse Analytics
Spark Pool: Machine Learning, Bigdata Analytics
Serverless SQL Pool: Data Exploration, BI Reporting
NYC Taxi Cloud Data Platform
Who is this course for

University students

IT Developers from other disciplines

AWS/ GCP/ On-prem Data Engineers

Data Architects
Who is this course not for
You are not interested in hands-on learning approach

Your only focus is Azure Data Engineering Certification

You want to learn Data Flows

You are looking for in-depth knowledge of dedicated SQL pool

You are looking to learn dimensional data modelling & implementation
Pre-requisites

All code and step-by-step instructions provided

Basic SQL knowledge required

Basic Python programming knowledge required

Cloud fundamentals will be beneficial, but not mandatory

Azure Account
Our Commitments

Ask questions, we will answer :)

Keeping the course up to date

Udemy lifetime access

Udemy 30-day money-back guarantee

Introduction to Azure Synapse Analytics
Azure Synapse Analytics - Introduction

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics.
Azure Synapse Analytics - Introduction

Data Warehouse

Emergence of Data Lakes

Modern Data Warehouse Architecture

Modern Data Warehouse – Without Azure Synapse Analytics

Modern Data Warehouse - With Azure Synapse Analytics

Data Warehouse
Data Sources (Operational Data, External Data) → ETL → Data Warehouse / Mart → Data Consumers
Data Warehouse

Lack of support for unstructured data

Longer to ingest new data

Proprietary data formats

Scalability

Expensive to store data

Lack of support for ML/ AI workloads

Data Lake
Data Sources (Operational Data, External Data) → Ingest → Data Lake → Transform → Data Lake → Data Warehouse / Mart
Consumers: Data Science/ ML workloads, BI Reports
Modern Data Warehouse
Data Sources: Operational Data, External Data
Ingest: Azure Data Factory
Explore & Prepare: ADF – Data Flows, Azure Databricks
Transform & Enrich: ADF – Data Flows, Azure Databricks
Model & Serve: Azure SQL Data Warehouse
Visualize: Power BI
Data Lake Storage: Azure Data Lake Gen2
Modern Data Warehouse
Development / Monitoring / Management & Security
Data Integration: Azure Data Factory, ADF Data Flows
Compute: Azure Databricks, Azure SQL Data Warehouse
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2
Modern Data Warehouse

Too many services/ workspaces

Difficult to monitor

Management & Security overhead

No Serverless option

Metadata not shared

Azure Synapse Analytics
Azure Synapse: Development / Monitoring / Management & Security
Data Integration: Azure Data Factory is replaced by Synapse Pipelines; ADF Data Flows are replaced by Synapse Data Flows
Compute: Azure Databricks is replaced by Spark Pool; Azure SQL Data Warehouse is replaced by Dedicated SQL Pool; Serverless SQL Pool is added
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, with Meta store and Synapse Link added
Azure Synapse Analytics – Preview Features

Data Explorer Pool

Synapse Link for SQL Server 2022

Many more features

Azure Synapse Analytics
Azure Synapse Analytics is a limitless analytics service that brings together
data integration, enterprise data warehousing and big data analytics.
Create Synapse Analytics Workspace
Lab
Create Synapse Analytics Workspace
User Subscription
User Managed Resource Group
Synapse Workspace: Serverless SQL Pool, SQL Admin User
Primary ADLS Gen2 Storage Account: Workspace Container
Role - Storage Blob Data Contributor
Azure Managed Resource Group
Project Overview

What is NYC Taxi

NYC Taxi data source & datasets

Prepare the data for the project

Project Requirements

Solution Architecture
Data Overview – NYC Taxi Trips
Data Overview – NYC Taxis

Yellow Taxis

Green Taxis

For Hire Vehicles

High Volume For Hire Vehicles

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
NYC Taxi Data Files Overview
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
NYC Taxi Data Files Overview
Taxi Zone: CSV
Calendar: CSV
Vendor: CSV (quoted)
Trip Type: TSV
Rate Code: JSON
Payment Type: JSON
Trip Data: CSV, Parquet, Delta
Import NYC Taxi Data to Data Lake
Project Requirements
Data Discovery

Data exploration capability on the raw data

Schema applied to the raw data

Discovery using T-SQL

Discovery using pay-per-query model

Data Ingestion

Ingested data to be stored as Parquet

Ingested data to be stored as tables/ views

Ability to query the ingested data using SQL

Ingestion using pay-per-query model

Data Transformation
Join the key information required for reporting to create a new table.

Join the key information required for analysis to create a new table.

Must be able to analyze the transformed data via T-SQL

Transformed data must be stored in columnar format (i.e., Parquet)
Reporting Requirements

Taxi Demand

Credit Card Campaign

Operational Reporting
Scheduling Requirements

Scheduled to run at regular intervals

Ability to monitor pipelines

Ability to re-run failed pipelines

Ability to set up alerts on failures

Solution Architecture
Azure Synapse: Development / Monitoring / Management & Security
Data Integration: Synapse Pipelines, Synapse Data Flows
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Solution Architecture – Serverless SQL Pool
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)

Solution Architecture – Spark Pool
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)

Solution Architecture – Dedicated SQL Pool
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)
Solution Architecture - Synapse Link
Source: Transactional Data
Azure Cosmos DB Container
Transactional Store: row store optimized for transactional reads and writes
Analytical Store: column store optimized for analytical queries
Auto Sync: Transactional Store → Analytical Store
Azure Synapse Link / HTAP: Analytical Store → Azure Synapse Analytics
Spark Pool: Machine Learning, Bigdata Analytics
Serverless SQL Pool: Data Exploration, BI Reporting
Azure Synapse Analytics
Azure Synapse: Development / Monitoring / Management & Security
Data Integration: Synapse Pipelines, Synapse Data Flows
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Section Overview – Serverless SQL Pool

Serverless SQL Pool Architecture

Features & Use Cases

Cost Control

Connecting to Azure Data Studio

T-SQL Support
Serverless SQL Pool
Azure Synapse: Development / Monitoring / Management & Security
Data Integration: Synapse Pipelines, Synapse Data Flows
Compute: Spark Pool, Dedicated SQL Pool, Serverless SQL Pool
Data Visualization: Power BI
Storage: Azure Data Lake Storage Gen2, Meta store, Synapse Link
Serverless SQL Pool

Serverless SQL pool is a serverless distributed query engine that you can use to query data over your data lake using T-SQL.
Serverless SQL Pool - Architecture
User or Application → T-SQL → Control Node → Compute Nodes → Azure Storage
Compute: Polaris - Distributed Query Processing Engine
https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf
Serverless SQL Pool – Key Features
Serverless

Distributed query engine

Robust

Query using T-SQL

Pay-per-query pricing model

Not a storage

Synapse Link

Query Spark tables

Serverless SQL Pool – Supported Data Sources
Azure Storage Account: Delimited (CSV, TSV etc), JSON, Parquet, Delta Lake
Cosmos DB: SQL API, MongoDB API
Dataverse
SQL Server 2022 (Preview)
Serverless SQL Pool – Use Cases

Discovery & Exploration

Logical Data Warehouse

Data Transformation
Serverless SQL Pool – Who is it for?

Data Engineers

Data Scientists

Data Analysts / BI Developers

Serverless SQL Pool
Cost Management
Cost Calculation
Billed for Data Processed
Amount of data read from storage
Amount of data in intermediate results
Amount of data written to storage

Data Processed rounded to the nearest MB

Minimum of 10MB per query

Currently $6.25 per 1TB
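
The billing rules above can be turned into a rough per-query estimate. The following is a minimal Python sketch, not an official calculator: the function names are hypothetical, the price is the figure quoted on the slide (actual pricing varies by region and over time), and it assumes that rounding means rounding up to a whole MB and that 1 TB is 1024 * 1024 MB.

```python
import math

# Figures taken from the slide; treat them as inputs, not as current pricing.
PRICE_PER_TB_USD = 6.25
MIN_BILLED_MB = 10
BYTES_PER_MB = 1024 * 1024
MB_PER_TB = 1024 * 1024  # assumption: binary units


def billed_mb(data_processed_bytes: int) -> int:
    """Round the processed volume up to whole MB and apply the 10 MB minimum."""
    whole_mb = math.ceil(data_processed_bytes / BYTES_PER_MB)
    return max(whole_mb, MIN_BILLED_MB)


def estimated_cost_usd(data_processed_bytes: int) -> float:
    """Estimated charge for a single query under the assumptions above."""
    return billed_mb(data_processed_bytes) / MB_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    for size in (1, 50 * BYTES_PER_MB, 1024 ** 4):
        print(size, billed_mb(size), estimated_cost_usd(size))
```

Data Processed here is the sum of the three amounts listed on the slide (read, intermediate, written), so a query that scans little but writes a large result is still billed for the written volume.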

Cost Control
Serverless SQL Pool
Cost Management
Lab
Azure Data Studio
Section Overview - Query CSV Files

Header Row

Field Terminator

Row Terminator

Quoted Files and escaping characters

Query CSV Files

OPENROWSET function overview

Query using OPENROWSET function

Data Types & Collations

Query subset of columns

Quoted strings & Escape Char

Serverless SQL
Pool
Query Tab Separated Values (TSV) file
OPENROWSET Function
User/ Application → Select → OPENROWSET → Azure Storage

SELECT *
FROM OPENROWSET(BULK 'blob file path',
    FORMAT = ['CSV' | 'PARQUET' | 'DELTA']
) AS [file]
Section Overview - Query JSON Files

Line Delimited JSON

Single Line JSON

Standard JSON

Classic JSON
Query JSON Files
OPENROWSET (CSV Parser), then JSON_VALUE or OPENJSON

Line-delimited JSON: FIELDTERMINATOR – 0x0b, FIELDQUOTE – 0x0b
Standard JSON: FIELDTERMINATOR – 0x0b, FIELDQUOTE – 0x0b, ROWTERMINATOR – 0x0b
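
The two settings differ because of where one JSON document ends. A minimal Python sketch of the same distinction, using hypothetical sample data: with line-delimited JSON each line is a complete document, while standard JSON is one document spanning the whole file, which is why the row terminator is also overridden in that case.

```python
import json

# Hypothetical samples; the field names are illustrative only.
LINE_DELIMITED = (
    '{"payment_type": 1, "description": "Credit card"}\n'
    '{"payment_type": 2, "description": "Cash"}\n'
)
STANDARD = (
    '[{"rate_code_id": 1, "rate_code": "Standard rate"},\n'
    ' {"rate_code_id": 2, "rate_code": "JFK"}]'
)


def parse_line_delimited(text):
    """One JSON document per line: parse every non-empty line on its own."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_standard(text):
    """One JSON document for the whole file: parse the full text at once."""
    return json.loads(text)


if __name__ == "__main__":
    print(parse_line_delimited(LINE_DELIMITED))
    print(parse_standard(STANDARD))
```

Parsing the standard sample line by line would fail, because neither of its lines is valid JSON on its own.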
Serverless SQL Pool – T-SQL Support
Supported
Databases
Schemas
Views
Stored Procedures
Inline table value functions
External Resources – data sources, file formats and tables

Serverless SQL Pool – T-SQL Support
Not Supported
Tables
Triggers
Materialized views
DML statements
DDL statements other than ones related to views and security

Serverless SQL Pool – T-SQL Support
Security
Logins and users
Credentials to control access to storage accounts
Grant, deny, and revoke permissions per object level
Azure Active Directory integration

Serverless SQL Pool – T-SQL Support
Additional Features
CETAS - CREATE EXTERNAL TABLE AS SELECT
Extension to OPENROWSET to aid querying data in data lake
Serverless SQL Pool – Monitoring Queries
Lab
Discovery & Exploration
Lab
Serverless SQL – Data Discovery
Identify the volume of the data: total record count; record count per day/ week/ month

Quality of data: duplicates; missing values; invalid data

Ability to join datasets (e.g. keys exist)

Ability to get business value: right columns exist; transformations; aggregations

Identify additional data required

Serverless SQL – Data Discovery

Identify duplicates in data

Check for missing data values

Invalid/ Unexpected data in columns

Join data from multiple files

Summarize/ Aggregate data

Apply some transforms
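
The first few checks in this list can be illustrated on a tiny sample. This is a minimal Python sketch under stated assumptions: the CSV content and the `profile` helper are hypothetical, and in the course itself these checks are run as T-SQL queries against the data lake rather than in Python.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a raw lookup file.
SAMPLE = """location_id,borough,zone
1,EWR,Newark Airport
2,Queens,Jamaica Bay
2,Queens,Jamaica Bay
3,,Allerton
"""


def profile(csv_text, key):
    """Return the total row count, duplicated key values and missing values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    key_counts = Counter(row[key] for row in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    missing = {col: sum(1 for row in rows if not row[col]) for col in columns}
    return {"total": len(rows), "duplicate_keys": duplicate_keys, "missing": missing}


if __name__ == "__main__":
    print(profile(SAMPLE, "location_id"))
```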

Data Virtualization
Data virtualization is a logical data layer that allows us to combine data from multiple
sources at query time without having to write complex ETL pipelines to load the data.
NYC Taxi Data Files Overview
Taxi Zone: CSV
Calendar: CSV
Vendor: CSV (quoted)
Trip Type: TSV
Rate Code: JSON
Payment Type: JSON
Trip Data: CSV, Parquet, Delta
Database Objects – Why?
User → OPENROWSET Function → Data Lake Storage

Storage Account and File details

File Types

Column Names and Data Types

Code Repetition

Ability to query from BI tools
NYC Taxi Data Files Overview
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
Database Objects - Types
User/ Application → External Tables / Views (OPENROWSET) → Data Lake Storage

External Table

Create External Table
Creates an external table on the data already present in the storage.
Metadata only change

Create External Table As Select (CETAS)
Selected data will be copied to the location specified in the table definition.
Metadata change + Data copied

External Table
User/ Application → External Table → Data Lake Storage
External Table uses: External Data Source, External File Format

External Data Source
Location of the storage
Credential
Data source type

External Data Source
CREATE EXTERNAL DATA SOURCE <Data Source Name>
WITH
(
    LOCATION = <folder path URI> ,
    CREDENTIAL = <Credential Name> ,
    TYPE = {HADOOP}
) ;

CREATE EXTERNAL DATA SOURCE nyc_taxi_data
WITH
(
    LOCATION = 'abfss://nyc-taxi-[email protected]',
    CREDENTIAL = nyc_taxi_data_cred
) ;
External File Format
Format Type
Data Compression
Format Options

External File Format
CREATE EXTERNAL FILE FORMAT <file format name>
WITH
(
    FORMAT_TYPE = {DELIMITEDTEXT | PARQUET | DELTA}
    [ , DATA_COMPRESSION =
          'org.apache.hadoop.io.compress.GzipCodec'
        | 'org.apache.hadoop.io.compress.SnappyCodec']
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
);

<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
    | STRING_DELIMITER = string_delimiter
    | FIRST_ROW = integer
    | USE_TYPE_DEFAULT = { TRUE | FALSE }
    | ENCODING = {'UTF8' | 'UTF16'}
    | PARSER_VERSION = {'parser_version'}
}

External File Format (Example)
CREATE EXTERNAL FILE FORMAT csv_file_format
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = FALSE,
        ENCODING = 'UTF8',
        PARSER_VERSION = '2.0'
    )
);

External Table
External Data Source Name
File Location
External File Format Name
Column Name/ Data Type
Read Options
Reject Options

External Table
CREATE EXTERNAL TABLE
<[database_name].[schema_name].table_name>
( <column name> <data type> )
WITH (
    LOCATION = '<folder_or_filepath>',
    DATA_SOURCE = <external_data_source_name>,
    FILE_FORMAT = <external_file_format_name>
    [, TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}']
    [, <reject_options>]
)

<reject_options> ::=
{
    | REJECT_TYPE = value,
    | REJECT_VALUE = reject_value,
    | REJECT_SAMPLE_VALUE = reject_sample_value,
    | REJECTED_ROW_LOCATION = '/REJECT_Directory'
}
External Table
CREATE EXTERNAL TABLE demo.ldw.taxi_zone
(
    location_id SMALLINT,
    borough VARCHAR(15),
    zone VARCHAR(50),
    service_zone VARCHAR(15)
)
WITH (
    LOCATION = 'raw/taxi_zone.csv',
    DATA_SOURCE = nyc_taxi_data,
    FILE_FORMAT = csv_file_format,
    REJECT_VALUE = 10,
    REJECTED_ROW_LOCATION = 'rejections/ldw/taxi_zone'
);

Create External Table
Lab
NYC Taxi Data Files Overview
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
Create External Table
Database (nyc_taxi_ldw)
External Data Source (nyc_taxi_src)
External File Format (csv_file_format)
Database Schema (bronze)
External Table (taxi_zone)
Create External Table – Delimited Text
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Create External Table – Delimited Text (Assignment)
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Create External Table – Parquet
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Create External Table – Delta (Assignment)
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
Views

Virtual tables with rows and columns

Data in a view is defined by a SELECT statement
Views
User/ Application → View → Select → OPENROWSET / External Table → Data Lake Storage

Create View
CREATE VIEW [ schema_name . ] view_name
[ ( column_name [ ,...n ] ) ]
AS
<select_statement> [;]

Create View - OPENROWSET
CREATE VIEW bronze.vw_vendor
AS
SELECT *
FROM OPENROWSET(
    BULK 'vendor.csv',
    DATA_SOURCE = 'nyc_taxi_data_raw',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS vendor;

Create View – External Table
CREATE VIEW bronze.vw_taxi_zone_brooklyn
AS
SELECT location_id, zone, service_zone
FROM bronze.taxi_zone
WHERE borough = 'Brooklyn';

Create View
Lab
NYC Taxi Data Files Overview
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Create View – Assignment
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
Partition Pruning

External Tables cannot prune partitions.

Combination of Views & OPENROWSET can be used to prune partitions
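
What pruning buys can be shown outside SQL. The following minimal Python sketch assumes a year=YYYY/month=MM folder layout like the one used for the trip data; the `prune` function and the sample paths are hypothetical. It only mimics the effect of filtering on folder names before any file is read, and does not show how the serverless engine implements it.

```python
import re

# Assumed folder convention: .../year=YYYY/month=MM/<file>
PARTITION_RE = re.compile(r"year=(\d{4})/month=(\d{2})/")


def prune(paths, year, month=None):
    """Keep only the files whose partition folders match the requested year/month."""
    selected = []
    for path in paths:
        match = PARTITION_RE.search(path)
        if not match:
            continue  # not a partitioned path, skip it
        file_year, file_month = match.groups()
        if file_year == year and (month is None or file_month == month):
            selected.append(path)
    return selected


if __name__ == "__main__":
    files = [
        "raw/trip_data/year=2020/month=01/part-0.csv",
        "raw/trip_data/year=2020/month=02/part-0.csv",
        "raw/trip_data/year=2021/month=01/part-0.csv",
    ]
    print(prune(files, "2020", "01"))
```

Only the selected files would need to be scanned, which matters under a pay-per-query model where the charge depends on the data processed.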

External Tables vs Views
Section Overview - Data Ingestion
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)
Section Overview - Data Ingestion

CETAS Statement Overview

Transform Delimited to Parquet

Transform JSON to Parquet

Challenges Processing Partitioned Data

Stored Procedures Introduction

Transform Partitioned Data

Create Views
Data Transformation

Write the data in Columnar format

Remove unwanted columns

Transform semi-structured data (e.g. JSON)

Store pre-aggregated data

Build a traditional data warehouse

Create External Table As Select (CETAS)
Source: SELECT (+ joins, aggs, trans) over OPENROWSET / View / External Table on Data Lake Storage (raw)
Target: Create External Table, with External Data Source and External File Format, on Data Lake Storage (transformed)

CREATE EXTERNAL TABLE AS SELECT (CETAS)

CREATE EXTERNAL TABLE {[database_name .] [ schema_name . ] table_name }
[(column_name [,...n ] ) ]
WITH (
    LOCATION = 'path_to_folder',
    DATA_SOURCE = external_data_source_name,
    FILE_FORMAT = external_file_format_name
)
AS
<select_statement>
[;]
CREATE EXTERNAL TABLE AS SELECT (CETAS)

CREATE EXTERNAL TABLE transformed.taxi_zone

WITH (
LOCATION = 'transformed/taxi_zone',
DATA_SOURCE = nyc_taxi_data,
FILE_FORMAT = parquet_file_format
)
AS
SELECT *
FROM raw.taxi_zone ;
CREATE EXTERNAL TABLE AS SELECT (CETAS)
CREATE EXTERNAL TABLE transformed.taxi_zone
WITH (
LOCATION = 'transformed/taxi_zone',
DATA_SOURCE = nyc_taxi_data,
FILE_FORMAT = parquet_file_format
)
AS
SELECT *
FROM
OPENROWSET(
BULK 'abfss://[email protected]/raw/taxi_zone.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
FIRSTROW = 2
)
WITH (
location_id SMALLINT 1,
borough VARCHAR(15) 2,
zone VARCHAR(50) 3,
service_zone VARCHAR(15) 4
) AS [result];
CREATE EXTERNAL TABLE AS SELECT (CETAS)

CREATE EXTERNAL TABLE transformed.taxi_borough

WITH (
LOCATION = 'transformed/taxi_borough',
DATA_SOURCE = nyc_taxi_data,
FILE_FORMAT = parquet_file_format
)
AS
SELECT borough, COUNT(1) AS number_of_zones
FROM raw.taxi_zone
GROUP BY borough;
Transform to Parquet Format
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Transform to Parquet Format (Assignment)
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Transform to Parquet Format from JSON
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Transform to Parquet Format from JSON (Assignment)
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor

Transform to Parquet Format – Partitioned Files
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
Stored Procedures

Group of T-SQL statements stored in the database

Accept input parameters & Return output parameters

Can execute another stored procedure

Stored Procedures - Example

CREATE PROCEDURE usp_test @borough VARCHAR(15)
AS
BEGIN
    SELECT *
    FROM bronze.taxi_zone
    WHERE borough = @borough;
END;

EXEC usp_test @borough = 'Queens';

Stored Procedures - Benefits

Reuse of code

Easier maintenance

Improved security
Stored Procedures - Limitations

T-SQL support
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview#t-sql-support

Reduced implementation
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-stored-procedures#limitations
Transform to Parquet Format – Partitioned Files
Source: raw/trip_data/year=2020/month=01/*.csv, month=02/*.csv, … month=12/*.csv; year=2021/month=01/*.csv, …
Single CETAS: raw/trip_data → silver/trip_data/*.parquet

Transform to Parquet Format – Partitioned Files
Target: silver/trip_data/year=2020/month=01/*.parquet, month=02/*.parquet, … month=12/*.parquet; year=2021/month=01/*.parquet, …
How to get from raw/trip_data to the same folder structure in silver/trip_data?

Transform to Parquet Format – Partitioned Files
One CETAS per partition: CETAS 1 (year=2020/month=01), CETAS 2 (year=2020/month=02), … CETAS 12 (year=2020/month=12), CETAS 13 (year=2021/month=01), …

Transform to Parquet Format – Partitioned Files
Stored Procedure executed once per partition: Exec SP 2020 01, Exec SP 2020 02, … Exec SP 2020 12, Exec SP 2021 01, …
External table dropped (X) after each Exec SP
Transform to Parquet Format – Partitioned Files

Stored Procedure – CETAS to transform data

Stored Procedure – DROP External Tables

Execute stored procedure for each partition

Create view with partitioned columns
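
Running the stored procedure once per partition means producing one call per year/month pair. A minimal Python sketch of that loop follows; the procedure name `silver.usp_silver_trip_data` and its `@year` / `@month` parameters are hypothetical, and the script only builds the statements as text rather than executing them.

```python
def partition_exec_statements(procedure, start, end):
    """Build one EXEC statement per (year, month) from start to end, inclusive."""
    year, month = start
    statements = []
    while (year, month) <= end:
        statements.append(
            f"EXEC {procedure} @year = '{year}', @month = '{month:02d}';"
        )
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return statements


if __name__ == "__main__":
    # Hypothetical procedure name and date range.
    for stmt in partition_exec_statements(
        "silver.usp_silver_trip_data", (2020, 1), (2021, 6)
    ):
        print(stmt)
```

Zero-padding the month keeps the generated values aligned with folder names such as month=01.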

Section Overview - Data Transformation
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)
Section Overview - Data Transformation

Project Requirements

Data for Campaign Analysis

Data for Taxi Demand

Create View
Business Requirements (1)

Campaign to encourage credit card payments

Trips made using credit card/ cash payments

Payment behaviour during days of the week/ weekend

Payment behaviour between boroughs

Non Functional Requirements

Reporting data to be pre-aggregated for better performance

Pre-aggregate data for each year/ month partition in isolation

Able to read data efficiently for specific months from aggregated data

Minimize the number of aggregated tables created

Columns required by the Business Requirements: Cash Trip Count, Card Trip Count, Trip Date, Trip Day, Trip Day Week End Ind, Borough

Columns required by the Non Functional Requirements: Year, Month
Gold Layer
Trip Data: Year, Month, Borough, Trip Date, Trip Day, Trip Day Week End Ind, Cash Trip Count, Card Trip Count

Silver Layer: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor
Gold Layer (Trip Data): column and its source in the Silver Layer
Year: Trip Data
Month: Trip Data
Borough: Taxi Zone
Trip Date: Trip Data
Trip Day: Calendar
Trip Day Week End Ind: Derived from Trip Day
Cash Trip Count: Payment Type + Trip Data
Card Trip Count: Payment Type + Trip Data
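
The shape of the resulting gold table can be sketched on a few rows. This minimal Python example assumes the silver data has already been joined into (trip date, borough, payment description) tuples; the sample rows, the payment labels, the Y/N weekend flag and the output key names are all hypothetical, and the course builds this table with T-SQL rather than Python.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical, already-joined silver rows: (trip_date, borough, payment)
TRIPS = [
    (date(2020, 1, 4), "Queens", "Cash"),
    (date(2020, 1, 4), "Queens", "Credit card"),
    (date(2020, 1, 4), "Queens", "Credit card"),
    (date(2020, 1, 6), "Bronx", "Cash"),
]


def aggregate(trips):
    """Count cash and card trips per (trip_date, borough) and add calendar columns."""
    counts = defaultdict(lambda: {"cash_trip_count": 0, "card_trip_count": 0})
    for trip_date, borough, payment in trips:
        if payment == "Cash":
            counts[(trip_date, borough)]["cash_trip_count"] += 1
        elif payment == "Credit card":
            counts[(trip_date, borough)]["card_trip_count"] += 1
    rows = []
    for (trip_date, borough), c in sorted(counts.items()):
        rows.append({
            "year": trip_date.year,
            "month": trip_date.month,
            "borough": borough,
            "trip_date": trip_date,
            "trip_day": DAY_NAMES[trip_date.weekday()],
            # weekday() returns 5 for Saturday and 6 for Sunday
            "trip_day_weekend_ind": "Y" if trip_date.weekday() >= 5 else "N",
            **c,
        })
    return rows


if __name__ == "__main__":
    for row in aggregate(TRIPS):
        print(row)
```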
Business Requirements (2)
Identify taxi demand

Demand based on borough

Demand based on day of the week/ weekend

Demand based on trip type (i.e., Street hail/ Despatch)

Trip distance, trip duration, total fare amount etc per day/ borough
Non Functional Requirements

Reporting data to be pre-aggregated for better performance

Pre-aggregate data for each year/ month partition in isolation

Able to read data efficiently for specific months from aggregated data

Minimize the number of aggregated tables created

Columns required by Business Requirements (2): Borough, Trip Date, Trip Day, Trip Day Week End Ind, Despatch Trip Count, Street-hail Trip Count, Trip Distance, Trip Duration, Total Fare Amount

Silver Layer: Taxi Zone, Trip Type, Rate Code, Trip Data, Payment Type, Calendar, Vendor
Section Overview – Synapse Pipelines
Orchestration: Synapse Pipelines
Stages: Ingest, Transform, Present
Data flow: NYC Taxi Data → Bronze Layer → Silver Layer → Gold Layer
Outputs: Report, Discovery (Analysts)
Section Overview - Synapse Pipelines

Overview

Components

Creating Pipelines

Variables & Parameters

Dynamic Pipelines

Pipeline Dependencies

Creating Triggers
Synapse Pipelines Overview
What are Synapse Pipelines

A fully managed, serverless data integration & orchestration service

The Data Problem

Multi Cloud

SaaS Apps

Data Formats

On Premises
The Data Problem
Data Sources → Ingest → Transform/ Analyze → Publish → Data Consumers
Data Integration

100+ Native connectors

SaaS connectors

Multi-cloud support

On-premises support

Serverless & Auto Scale

Control flow activities

Data Orchestration
Synapse Data Flows

Synapse Dedicated SQL Pool Scripts

Synapse Serverless SQL Pool Scripts

Synapse Spark Notebooks

Azure Databricks Notebooks

Azure HDInsight Scripts

Azure Machine Learning Pipelines

Schedule & Monitor

Schedule to run pipelines

Monitoring within Synapse

Alerting within Synapse

Ability to monitor/ alert from outside of Synapse

Azure Data Factory vs Synapse Pipelines

Share the same codebase, with minor differences

https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences
Synapse Integration Pipeline Components
Trigger → Pipeline → Activity → Dataset → Linked Service
Linked Services connect to Storage (ADLS, SQL Database) and Compute (Synapse Pool, Azure Databricks)
Bronze to Silver Layer Transformation
Files: Trip Data, Taxi Zone, Trip Type, Rate Code, Payment Type, Calendar, Vendor
Transform to Parquet Format – Taxi Zone
raw/taxi_zone → CETAS → silver/taxi_zone
Before CETAS: DROP Table If Exists
Before CETAS: DELETE File If Exists in silver/taxi_zone
Pipeline: Delete Activity (DELETE File If Exists) → Script Activity (DROP Table If Exists + CETAS)