ADF Course Deck V2
Ramesh Retnasamy
Data Engineer/ Architect
https://www.linkedin.com/in/ramesh-retnasamy/
About this course
Azure storage solutions
Azure HDInsight
Reporting Technologies
PowerBI
Covid-19 Prediction/ Reporting
Data Warehouse
Data Transformation
Data Sources
Data Lake
Data Integration/ Transformation/ Orchestration
Who is this course for
University students
Data Architects
Data Scientists
Who is this course not for
Azure Account
Our Commitments
4. Ingestion - Blob
7. Data Flow (2)
9. HDInsight
10. Databricks
DevOps
12. Orchestration
15. CI/CD
Multi Cloud
SaaS Apps
Data Formats
On Premises
The Data Problem
Data Sources
Ingest
Transform/Analyze
Publish
Data Consumers
What is Azure Data Factory
Serverless
A fully managed, serverless data integration solution for ingesting, preparing and
transforming all of your data at scale.
What Azure Data Factory Is Not
Data Sources
ECDC Website
Confirmed cases
Mortality
Testing Numbers
Eurostat Website
Population by age
Solution Architecture
Population Data (Azure Blob Storage)
Azure Data Lake Storage Gen2
Transformation
Azure Data Lake Storage Gen2
Azure SQL Database
ML Models
Storage Solutions
Key Factors to Consider
Structured
Structure of the data
Semi-Structured
Unstructured
File Storage
Disk Storage
Table Storage
Queue Storage
Azure Data Lake
Enhance Management
Azure Cosmos DB
Globally distributed
Multi Model
High Throughput
Storage solutions used in this course
Data Factory
HDInsight Cluster
Creating Azure Free Account
Creating Azure Data Factory
Creating Azure Storage Account
Creating Azure Data Lake Gen2
Creating Azure SQL Database
Data Ingestion
Data Ingestion - Module Overview
(Population by Age)
Data Ingestion – Population Data
Copy Activity
Linked Services
Datasets
Pipeline
Validation Activity
If Condition Activity
Web Activity
Delete Activity
Trigger
Copy Activity
Pipeline
Source File (Zipped TSV)
Source Data Set
Copy Activity
Sink Data Set
Target File (TSV)
Copy Activity
From Azure Blob Storage
1 Azure Blob Storage Linked Service: ls_ablob_covidreportingsa
2 Source Data Set (Zipped TSV): ds_population_raw_gz
3 Azure Data Lake Linked Service: ls_adls_covidreportingdl
4 Sink Data Set (TSV): ds_population_raw_tsv
5 Pipeline: pl_ingest_population_data
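The numbered pieces above can be sketched as the JSON that Data Factory stores for a pipeline. This is a simplified illustration written as a Python dictionary, not the full definition generated by ADF Studio. The data set and pipeline names come from the slide; the activity name and the source/sink types are assumptions.

```python
import json

# Copy activity wiring the source data set to the sink data set.
# "Copy Population Data" is a hypothetical activity name.
copy_activity = {
    "name": "Copy Population Data",
    "type": "Copy",
    "inputs": [{"referenceName": "ds_population_raw_gz", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "ds_population_raw_tsv", "type": "DatasetReference"}],
    "typeProperties": {
        # Assumed: both data sets are delimited text (TSV).
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "DelimitedTextSink"},
    },
}

# The pipeline is the container that holds the activity.
pipeline = {
    "name": "pl_ingest_population_data",
    "properties": {"activities": [copy_activity]},
}

print(json.dumps(pipeline, indent=2))
```

The linked services do not appear here because each data set, not the activity, refers to its linked service.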
Event Trigger
Schedule Trigger: runs on a calendar/clock
Data Ingestion – ECDC Data
ECDC Data Overview
Pipeline Variables
Pipeline Parameters
Lookup Activity
URL - https://www.ecdc.europa.eu/en/covid-19/data
Data Ingestion
URL - https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19
Copy Activity – Cases & Deaths Data
1 HTTP Linked Service: ls_http_opendata_ecdc_europa_eu
2 Source Data Set (CSV): ds_cases_deaths_raw_csv_http
3 Azure Data Lake Linked Service: ls_adls_covidreportingdl
4 Sink Data Set (CSV): ds_cases_deaths_raw_csv_dl
5 Pipeline: pl_ingest_cases_deaths_data
1 HTTP Linked Service
2 Source Data Set (CSV): ds_hospital_admissions_raw_csv_http
3 Azure Data Lake Linked Service: ls_adls_covidreportingdl
4 Sink Data Set (CSV): ds_hospital_admissions_raw_csv_dl
5 Pipeline: pl_ingest_hospital_admissions_data
Variables are internal values set inside a pipeline. Their value can be changed
inside the pipeline using the Set Variable or Append Variable activity.
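As a rough sketch of how this looks in a pipeline definition, the dictionary below declares one variable and changes it with a Set Variable activity. A parameter is included for contrast: it is supplied when the run starts and cannot be changed during the run. The pipeline, parameter, variable and activity names are all hypothetical.

```python
# Hypothetical pipeline with one parameter and one variable.
pipeline = {
    "name": "pl_variable_demo",
    "properties": {
        # Parameters: passed in from outside, read-only during the run.
        "parameters": {"sourceRelativeURL": {"type": "String"}},
        # Variables: declared here, changed by activities inside the pipeline.
        "variables": {"sinkFileName": {"type": "String", "defaultValue": "unknown.csv"}},
        "activities": [
            {
                "name": "Set Sink File Name",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "sinkFileName",
                    "value": {
                        "value": "@concat(pipeline().parameters.sourceRelativeURL, '.csv')",
                        "type": "Expression",
                    },
                },
            }
        ],
    },
}


def undeclared_variables(pipeline_definition):
    """Return variables that a Set Variable activity writes but the pipeline never declares."""
    properties = pipeline_definition["properties"]
    declared = set(properties.get("variables", {}))
    written = {
        activity["typeProperties"]["variableName"]
        for activity in properties["activities"]
        if activity["type"] == "SetVariable"
    }
    return written - declared


print(undeclared_variables(pipeline))
```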
Differences
Source: https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv
Sink: raw/ecdc/case_distribution.csv
Source: https://opendata.ecdc.europa.eu/covid19/hospitalicuadmissionrates/csv/data.csv
Sink: raw/ecdc/hospital_admission.csv
Source: https://opendata.ecdc.europa.eu/covid19/testing/csv
Sink: raw/ecdc/testing.csv
Source: https://www.ecdc.europa.eu/sites/default/files/documents/data_response_graphs_0.csv
Sink: raw/ecdc/country_response.csv
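One way to drive a single parameterised pipeline from this list is to keep the source and sink pairs in a small JSON file that a Lookup activity reads. The sketch below builds such entries in Python. The key names (sourceBaseURL, sourceRelativeURL, sinkPath) are assumptions made for illustration, not names taken from the course.

```python
import json
from urllib.parse import urlsplit

# Source URL and sink path pairs, in the order listed above.
SOURCES_TO_SINKS = [
    ("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv",
     "raw/ecdc/case_distribution.csv"),
    ("https://opendata.ecdc.europa.eu/covid19/hospitalicuadmissionrates/csv/data.csv",
     "raw/ecdc/hospital_admission.csv"),
    ("https://opendata.ecdc.europa.eu/covid19/testing/csv",
     "raw/ecdc/testing.csv"),
    ("https://www.ecdc.europa.eu/sites/default/files/documents/data_response_graphs_0.csv",
     "raw/ecdc/country_response.csv"),
]


def to_lookup_entry(url, sink_path):
    """Split a URL into a base URL (linked service) and a relative URL (data set)."""
    parts = urlsplit(url)
    return {
        "sourceBaseURL": f"{parts.scheme}://{parts.netloc}",
        "sourceRelativeURL": parts.path.lstrip("/"),
        "sinkPath": sink_path,
    }


entries = [to_lookup_entry(url, sink) for url, sink in SOURCES_TO_SINKS]
print(json.dumps(entries, indent=2))
```

Splitting the URL this way shows that the four files come from two different hosts, so the base URL has to be a parameter too if one linked service is to serve all of them.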
Data Flows (1) - Module Overview
(Cases & Deaths File)
Data Flow – Cases & Deaths Data
Data Flow Overview
Requirement
Source Transformation
Filter Transformation
Select Transformation
Pivot Transformation
Lookup Transformation
Sink Transformation
Create Pipeline
Data Flows
Code free data transformations
Features
Benefits from Data Factory scheduling and monitoring capabilities
Data Flows
Types
Only available in some regions
https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview#available-regions
Transform Cases & Deaths Data
Europe
Raw columns: continent, population, indicator, daily_count, date, rate_14_day, source
Transformed columns: country_code_3_digit, population, cases_count, deaths_count, reported_date (Rename), source
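The filter, pivot and rename steps can be sketched in plain Python to make the intended output concrete. This is not how a Mapping Data Flow runs (it executes on Spark); it only mirrors the logic. The indicator strings and the sample rows are made up for illustration, and the country code lookup is skipped by assuming each row already carries country_code_3_digit.

```python
from collections import defaultdict

# Assumed indicator values; check the raw file for the exact strings.
INDICATOR_TO_COLUMN = {"confirmed cases": "cases_count", "deaths": "deaths_count"}


def transform_cases_deaths(rows):
    """Keep Europe only, pivot indicator into two count columns, rename date."""
    grouped = defaultdict(dict)
    for row in rows:
        if row["continent"] != "Europe":  # Filter transformation
            continue
        column = INDICATOR_TO_COLUMN.get(row["indicator"])
        if column is None:
            continue
        key = (row["country_code_3_digit"], row["population"], row["date"], row["source"])
        grouped[key][column] = grouped[key].get(column, 0) + row["daily_count"]  # Pivot

    result = []
    for code, population, reported_date, source in sorted(grouped):
        counts = grouped[(code, population, reported_date, source)]
        result.append({
            "country_code_3_digit": code,
            "population": population,
            "cases_count": counts.get("cases_count"),
            "deaths_count": counts.get("deaths_count"),
            "reported_date": reported_date,  # renamed from date
            "source": source,
        })
    return result


# Made-up sample rows, not real ECDC figures.
sample = [
    {"country_code_3_digit": "AAA", "continent": "Europe", "population": 100,
     "indicator": "confirmed cases", "daily_count": 5, "date": "2020-04-01",
     "rate_14_day": 1.0, "source": "example"},
    {"country_code_3_digit": "AAA", "continent": "Europe", "population": 100,
     "indicator": "deaths", "daily_count": 1, "date": "2020-04-01",
     "rate_14_day": 0.1, "source": "example"},
    {"country_code_3_digit": "BBB", "continent": "Asia", "population": 200,
     "indicator": "confirmed cases", "daily_count": 7, "date": "2020-04-01",
     "rate_14_day": 2.0, "source": "example"},
]

for record in transform_cases_deaths(sample):
    print(record)
```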
Data Flows (2) - Module Overview
(Hospital Admissions File)
Data Flow – Hospital Admissions Data
Requirement
Source Transformation
Select Transformation
Lookup Transformation
Pivot Transformation
Sink Transformation
Aggregate Transformation
Sort Transformation
Join Transformation
Create Pipeline
Hospital Admissions Data
Raw File from ECDC
Column Name
country
indicator
date
year_week
value
source
url
Transformed Daily File
Column Name
country
country_code_2_digit (Lookup)
country_code_3_digit (Lookup)
population (Lookup)
reported_date
hospital_occupancy_count
icu_occupancy_count
source
Transformed Weekly File
Column Name
country
country_code_2_digit (Lookup)
country_code_3_digit (Lookup)
population (Lookup)
reported_year_week (transformed)
reported_week_start_date (Lookup)
reported_week_end_date (Lookup)
new_hospital_occupancy_count
new_icu_occupancy_count
source
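The weekly file gets its week start and end dates from a lookup. To show what that lookup has to return, the sketch below computes the same dates directly. It assumes year_week values look like 2020-W14 and follow ISO week numbering (weeks start on Monday); both are assumptions to check against the raw file. The function name is hypothetical, and this calculation stands in for the lookup rather than reproducing it.

```python
from datetime import date, timedelta


def week_bounds(year_week):
    """Return (start_date, end_date) for an ISO year-week string such as '2020-W14'."""
    year_text, week_text = year_week.split("-W")
    # Day 1 of an ISO week is Monday.
    start = date.fromisocalendar(int(year_text), int(week_text), 1)
    return start, start + timedelta(days=6)


start, end = week_bounds("2020-W14")
print(start.isoformat(), end.isoformat())
```

`date.fromisocalendar` needs Python 3.8 or later.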
Source Transformation
Assignment
Select Transformation
Assignment
Remove url
Select only required fields (i.e. remove additional fields from lookup)
Pivot Transformation
Assignment
Select & Sink Transformation
Assignment
Data Flow Execution
Assignment
HDInsight Activity - Module Overview
(Testing File)
HDInsight Activity – Testing File
Creating HDInsight Cluster
HDInsight UI Overview
Transformation Requirement
Creating Pipeline
Raw File from ECDC
Column Name
country
country_code (Remove)
Year_week
new_cases
test_done
population
testing_rate
positivity_rate
testing_data_source
Transformed File
Column Name
country
country_code_2_digit (lookup)
country_code_3_digit (lookup)
reported_year_week
reported_week_start_date (lookup)
reported_week_end_date (lookup)
new_cases
test_done
population
testing_rate
positivity_rate
testing_data_source
Databricks Activity - Module Overview
(Population File)
Databricks Activity – Population File
Create Databricks Service
Transformation Requirements
Creating Pipeline
Databricks Environment Set-up
Creating Databricks Service
Creating Databricks Cluster
What is a cluster?
Driver Node
Databricks Runtime
Copy Data to SQL
Copy Testing
Copy Activity – Data Lake to SQL
Cases and Deaths Data
Copy Activity – Data Lake to SQL
Hospital Admissions Daily Data
Assignment
Copy Activity – Data Lake to SQL
Testing Data
Data Orchestration
Data Orchestration Requirements
Activities only run once the upstream dependency has been satisfied
Custom-made Solution
Data Orchestration
Option 1 – Parent Pipeline
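A parent pipeline expresses the upstream dependency with Execute Pipeline activities chained by dependsOn. The sketch below builds such a definition as a Python dictionary. pl_ingest_population_data comes from earlier in the deck; the parent pipeline name, the processing pipeline name and the activity names are hypothetical.

```python
import json


def execute_pipeline(activity_name, pipeline_name, depends_on=()):
    """Build an Execute Pipeline activity that runs only after its upstream activities succeed."""
    return {
        "name": activity_name,
        "type": "ExecutePipeline",
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in depends_on
        ],
        "typeProperties": {
            "pipeline": {"referenceName": pipeline_name, "type": "PipelineReference"},
            # Wait for the child to finish so the dependency reflects its real outcome.
            "waitOnCompletion": True,
        },
    }


parent = {
    "name": "pl_execute_population_pipelines",  # hypothetical
    "properties": {
        "activities": [
            execute_pipeline("Execute Ingest Population", "pl_ingest_population_data"),
            execute_pipeline(
                "Execute Process Population",
                "pl_process_population_data",  # hypothetical
                depends_on=["Execute Ingest Population"],
            ),
        ]
    },
}

print(json.dumps(parent, indent=2))
```

Only the parent pipeline needs a trigger in this arrangement; the child pipelines run when the parent calls them.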
Data Orchestration
Option 2 – Trigger Dependency
Azure Data Factory - Monitoring
What to Monitor
Creating Alerts
Reporting on Metrics
Log Analytics
Integration runtime
Trigger runs
Pipeline runs
Activity runs
Data Factory Monitor
Module Overview
Continuous Integration / Continuous Delivery
ADF – Test
Release
Ops
Siloed
Dev
DevOps - Introduction
Automation
Continuous Improvement
Continuous Integration / Continuous Delivery
Continuous Improvement
Improve Monitor
Continuous Integration / Continuous Delivery - ADF
Option 1 – Manual Integration / Automated Delivery
Option 2 – Fully Automated Solution
CI/CD Option 1 – Using ADF Publish
Live Mode
Developer 1, Developer 2: ADF - Studio, Debug
ADF – Repo: Pipelines, Data Flows, Datasets, Linked Services, Triggers
Git Mode
Developer 1: ADF - Studio
Git: feature branch 1, feature branch 2, merge, master branch, adf_publish branch
ADF – Dev
ARM Template
Release
Azure DevOps
Boards
Repos
Pipelines
Test Plans
Artifacts
Azure DevOps
Organization
Project: Boards, Repos, Pipelines, Artifacts
Project: Boards, Repos, Pipelines, Artifacts
Developer
Azure DevOps Environment Set-up
Azure Data Factory Set-up
ADF – Test
ADF – Prod
Developer 1: ADF Dev - Studio (Git Mode)
Git: feature branch 1, feature branch 2, merge, master branch, adf_publish branch
main branch
ADF Dev - Repo
ARM Template
Continuous Delivery
Continuous Integration (CI)
Continuous Delivery (CD)
Developer 1: ADF - Studio (Git Mode)
Git: feature branch 1, feature branch 2, merge, master branch, adf_publish branch
ADF – Dev
ARM Template
Release
Test Stage
ADF – Test
ADF – Prod
CI/CD Option 1 – Using ADF Publish
Manual Build
Developer 1: ADF - Studio (Git Mode)
Git: feature branch 1, feature branch 2, merge, master branch, adf_publish branch
ADF – Dev
ARM Template
Release
ADF – Prod
CI/CD Option 2 – Using Build Pipeline
Automated Build
Continuous Integration (CI)
Continuous Deployment (CD)
Developer 1: ADF - Studio (Git Mode)
Git: feature branch 1, feature branch 2, merge, master branch
Build Pipeline
ADF – Dev
ADF – Prod
CI/CD Scenario – Data Lake Access
ADF – Dev
DL – Dev
DL – Test
DL – Prod
Access Key
Service Principal
System Assigned Managed Identity
User Assigned Managed Identity
Data Lake Storage Set-up
Env   Data Factory Name    Resource Group Name   Data Lake Name   GIT Enabled
dev   dev-ci-cd-demo-adf   dev-ci-cd-demo-rg     devcicddemodl    Y
Data Lake Access via System Assigned Managed Identity
Data Lake Access via Access Keys
Data Lake Access via Access Keys – Option 1
KV – Dev
ADF – Dev
DL – Dev
Env   Data Factory Name    Resource Group Name   Data Lake Name   Key Vault Name      GIT Enabled
dev   dev-ci-cd-demo-adf   dev-ci-cd-demo-rg     devcicddemodl    dev-ci-cd-demo-kv   Y
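When the same ARM template is released to several environments, the environment-specific names are usually passed in as parameter overrides. The sketch below generates an ARM parameters file per environment. Only the dev names come from the table above; it is an assumption that test and prod follow the same naming pattern. factoryName is the standard parameter in the template that ADF generates, while dataLakeUrl and keyVaultBaseUrl are hypothetical parameter names (the real ones depend on the linked service names).

```python
import json

# Dev names as listed in the table above.
DEV = {
    "factory": "dev-ci-cd-demo-adf",
    "data_lake": "devcicddemodl",
    "key_vault": "dev-ci-cd-demo-kv",
}


def names_for(env):
    """Derive resource names for an environment, assuming the dev naming pattern holds."""
    return {
        "factory": f"{env}-ci-cd-demo-adf",
        "data_lake": f"{env}cicddemodl",
        "key_vault": f"{env}-ci-cd-demo-kv",
    }


def arm_parameters(env):
    """Build an ARM deployment parameters document for one environment."""
    names = names_for(env)
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "factoryName": {"value": names["factory"]},
            # Hypothetical parameter names for the linked service endpoints.
            "dataLakeUrl": {"value": f"https://{names['data_lake']}.dfs.core.windows.net"},
            "keyVaultBaseUrl": {"value": f"https://{names['key_vault']}.vault.azure.net/"},
        },
    }


for env in ("test", "prod"):
    print(json.dumps(arm_parameters(env), indent=2))
```

Keeping secrets such as access keys in Key Vault means only the vault URL changes between environments, not the template itself.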