
ADF Course Deck V2

This document provides an overview of an Azure Data Factory course that teaches ingestion, transformation, and exploration of Covid-19 data. The course covers Azure storage solutions like Azure SQL Database, Azure Blob Storage, and Azure Data Lake Storage Gen2. It also covers data transformation technologies like Azure Data Factory, Azure Databricks, and Power BI for reporting. The document outlines the course structure, prerequisites, and commitments to students. It provides a sample project on building a data lake and data warehouse for Covid-19 prediction and reporting using various Azure services.

Uploaded by

ravikumar lanka

About Me

Ramesh Retnasamy
Data Engineer/ Architect

https://www.linkedin.com/in/ramesh-retnasamy/
About this course
Azure storage solutions

Azure SQL Database

Azure Blob Storage

Azure Data Lake Storage Gen2

Other Bigdata Solutions


Azure Data Factory (ADF)
Azure Databricks

Azure HDInsight

Reporting Technologies

PowerBI
Covid-19 Prediction/ Reporting
Data Warehouse

Data Transformation

Data Sources

Data Lake
Data Integration/ Transformation/ Orchestration
Who is this course for

University students

IT Developers from other disciplines

AWS/ GCP/ On-prem Data Engineers

Data Architects

Data Scientists
Who is this course not for

Your main focus is not learning Azure Data Factory

You are not interested in hands-on learning approach

Your only focus is Azure Data Engineering Certification


Pre-requisites

No prior knowledge assumed

Cloud fundamentals would be beneficial, but not necessary

Basic knowledge of SQL would be beneficial, but not necessary

Azure Account
Our Commitments

Ask questions, I will answer

Keeping the course up to date

Udemy lifetime access

Udemy 30-day money-back guarantee


Course Structure

2. Overviews
3. Environment Set-up

Ingestion
4. Ingestion - Blob
5. Ingestion - HTTP

Transformation
6. Data Flow (1)
7. Data Flow (2)
8. Data Prep
9. HDInsight
10. Databricks

Copy
11. Copy to SQL

12. Orchestration
13. Monitoring

Exploration
14. Power BI

DevOps
15. CI/CD
16. CI/CD Scenarios

Covid-19 Prediction/ Reporting
Azure Data Factory Overview
What is Azure Data Factory

A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
The Data Problem

Multi Cloud

SaaS Apps

Data Formats

On Premises
The Data Problem

Data Sources -> Ingest -> Transform/ Analyze -> Publish -> Data Consumers
What is Azure Data Factory

Fully Managed Service

Serverless

Data Integration Service

Data Transformation Service

Data Orchestration Service

A fully managed, serverless data integration solution for ingesting, preparing and
transforming all of your data at scale.
What Azure Data Factory Is Not

Data Migration Tool

Data Streaming Service

Suitable for Complex Data Transformations

Data Storage Service


Project Overview
Covid-19 Prediction/ Reporting

Data Lake
Data Lake to be built with the following data to aid Data Scientists to predict the spread of the virus/ mortality

Confirmed cases

Mortality

Hospitalization/ ICU Cases

Testing Numbers

Country’s population by age group


Data Warehouse
Data Warehouse to be built with the following data to aid Reporting on Trends

Confirmed cases

Mortality

Hospitalization/ ICU Cases

Testing Numbers
Data Sources

ECDC Website

Confirmed cases

Mortality

Hospitalization/ ICU Cases

Testing Numbers

Eurostat Website

Population by age
Solution Architecture
Solution Architecture

Data Sources: ECDC COVID-19 data (HTTP Connector); Population Data (Azure Blob Storage)
Ingest: Azure Data Lake Storage Gen2
Transform/ Analyze (Transformation): Azure Data Lake Storage Gen2
Publish: Azure SQL Database
ML Models
Storage Solutions
Key Factors to Consider

Structure of the data
Structured
Semi-Structured
Unstructured

Operational needs
How often is the data accessed?
How quickly do we need to serve?
Need to run simple queries?
Need to run heavy analytical workload?
Accessed from multiple regions?


Azure Databases

Azure SQL Database

Azure Database for MySQL

Azure Database for PostgreSQL

Azure Database for MariaDB

VM Images with Oracle, SQL Server etc.


Azure Storage Account
Blob Storage

File Storage

Disk Storage

Table Storage

Queue Storage
Azure Data Lake

Azure Data Lake Storage Gen2

Enhance Performance

Enhance Management

Better Security
Azure Cosmos DB

Globally distributed

Multi Model

High Throughput
Storage solutions used in this course

Azure SQL Database

Azure Blob Storage

Azure Data Lake Storage Gen2


Environment set-up
Environment set-up
Azure Subscription

Data Factory

Blob Storage Account

Data Lake Storage Gen2

Azure SQL Database

Azure Databricks Cluster

HD Insight Cluster
Creating Azure Free Account
Creating Azure Data Factory
Creating Azure Storage Account
Creating Azure Data Lake Gen2
Creating Azure SQL Database
Data Ingestion
Data Ingestion - Module Overview
(Population by Age)
Data Ingestion – Population Data
(Solution Architecture diagram: ECDC COVID-19 data via HTTP Connector and Population Data in Azure Blob Storage -> Ingest -> Azure Data Lake Storage Gen2 -> Transform/ Analyze -> Azure Data Lake Storage Gen2 -> Publish -> Azure SQL Database; ML Models)
Data Ingestion – Population Data
Copy Activity

Linked Services

Datasets

Pipeline

Validation Activity

If Condition Activity

Web Activity

Get Metadata Activity

Delete Activity

Trigger
Copy Activity

Azure Blob Storage -> Azure Data Lake

Copy Activity

Ingest "population by age" for all EU countries into the Data Lake to support the machine learning models to predict increase in Covid-19 mortality rates
Copy Activity

Copy & Extract

Source
Storage Account – covidreportingsa
Container – population
File - population_by_age.tsv.gz

Target
Storage Account – covidreportingdl
Container – raw
File - population/population_by_age.tsv

Data Sourced from - https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00010


Copy Activity

Pipeline: Source Data Set -> Copy Activity -> Sink Data Set

Source
Azure Blob Storage (Linked Service)
Source File (Zipped TSV)
Source Data Set

Sink
Azure Data Lake (Linked Service)
Target File (TSV)
Sink Data Set
Copy Activity
From Azure Blob Storage

1 Source Linked Service (Azure Blob Storage): ls_ablob_covidreportingsa
2 Source Data Set (Zipped TSV): ds_population_raw_gz
3 Sink Linked Service (Azure Data Lake): ls_adls_covidreportingdl
4 Sink Data Set (TSV): ds_population_raw_tsv
5 Pipeline: pl_ingest_population_data
6 Copy Activity: Copy Population Data

Source File
Storage Account: covidreportingsa
Container: population
File: population_by_age.tsv.gz

Target File
Storage Account: covidreportingdl
Container: raw
File: population/population_by_age.tsv
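
As a rough illustration of how the six named objects fit together, the sketch below builds an abbreviated pipeline definition as a Python dictionary and prints it as JSON. The object names come from the slide; the `DelimitedTextSource` and `DelimitedTextSink` types are an assumption based on the TSV format, and the definition that ADF Studio actually generates contains more properties than shown here.

```python
import json


def dataset_ref(name):
    """Build a reference to a dataset by name, as used in activity inputs/outputs."""
    return {"referenceName": name, "type": "DatasetReference"}


# Abbreviated, hand-written pipeline definition (assumption: not ADF Studio output).
pipeline = {
    "name": "pl_ingest_population_data",
    "properties": {
        "activities": [
            {
                "name": "Copy Population Data",
                "type": "Copy",
                # Source dataset: zipped TSV in Azure Blob Storage
                "inputs": [dataset_ref("ds_population_raw_gz")],
                # Sink dataset: TSV in Azure Data Lake
                "outputs": [dataset_ref("ds_population_raw_tsv")],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The linked services (ls_ablob_covidreportingsa and ls_adls_covidreportingdl) do not appear in the pipeline itself; each dataset points at its linked service, and the pipeline only references the datasets.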
Handling Real World Scenarios
Scenario 1
Execute Copy Activity when the file becomes available
Scenario 2
Execute Copy Activity only if file contents are as expected
Scenario 3
Delete the source file on successful copy
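
The control flow behind the three scenarios can be sketched in plain Python on a local file system. This is only an analogy, not Data Factory code: in the pipeline these steps would presumably map to the Validation, Get Metadata / If Condition, and Delete activities listed earlier. The function name, the column-count check, and the sample data are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def ingest_if_valid(source_gz: Path, target_tsv: Path, expected_columns: int) -> bool:
    """Copy and extract the source file only when all three scenarios are satisfied."""
    # Scenario 1: run the copy only when the file is available.
    if not source_gz.exists():
        return False

    with gzip.open(source_gz, "rt", encoding="utf-8") as handle:
        text = handle.read()

    # Scenario 2: run the copy only if the contents are as expected
    # (here: the header row has the expected number of tab-separated columns).
    lines = text.splitlines()
    header = lines[0] if lines else ""
    if len(header.split("\t")) != expected_columns:
        return False

    # Copy & extract: write the unzipped content to the target location.
    target_tsv.parent.mkdir(parents=True, exist_ok=True)
    target_tsv.write_text(text, encoding="utf-8")

    # Scenario 3: delete the source file on successful copy.
    source_gz.unlink()
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "population_by_age.tsv.gz"
    dst = Path(tmp) / "raw" / "population" / "population_by_age.tsv"
    with gzip.open(src, "wt", encoding="utf-8") as handle:
        handle.write("a\tb\tc\n1\t2\t3\n")  # made-up sample data
    copied = ingest_if_valid(src, dst, expected_columns=3)
    source_deleted = not src.exists()
    target_written = dst.exists()

print(copied, source_deleted, target_written)
```

One difference worth noting: a Validation activity waits for the file to appear, whereas this sketch simply returns False when the file is missing.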
Scheduling Pipeline Execution

Triggers
Schedule Trigger
Tumbling Window Trigger
Event Trigger

Schedule Trigger
Runs on a calendar/ clock
Supports periodic and specific times
Trigger to Pipeline is many to many
Can only be scheduled for a future time to start

Tumbling Window Trigger
Runs at periodic intervals
Windows are fixed sized, non-overlapping
Can be scheduled for the past windows/ slices
Trigger to Pipeline is one to one

Event Trigger
Runs in response to events
Events can be creation or deletion of Blobs/ Files
Trigger to Pipeline is many to many
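
To make "fixed sized, non-overlapping" windows concrete, here is a minimal sketch that enumerates the windows between a start and an end time. It models only the windowing idea, not the trigger service itself; the function name and the dates are illustrative.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Return consecutive (window_start, window_end) pairs of equal size.

    Each window begins exactly where the previous one ends, so windows
    never overlap. The start may lie in the past, which corresponds to
    scheduling a trigger for past windows/ slices.
    """
    windows = []
    window_start = start
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


windows = tumbling_windows(
    start=datetime(2021, 1, 1),
    end=datetime(2021, 1, 2),
    size=timedelta(hours=6),
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

With a six-hour size over one day this yields four windows, each of which would correspond to one pipeline run.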
Data Ingestion - Module Overview
(ECDC Data)
Data Ingestion – ECDC Data
(Solution Architecture diagram: ECDC COVID-19 data via HTTP Connector and Population Data in Azure Blob Storage -> Ingest -> Azure Data Lake Storage Gen2 -> Transform/ Analyze -> Azure Data Lake Storage Gen2 -> Publish -> Azure SQL Database; ML Models)
Data Ingestion – ECDC Data
ECDC Data Overview

Create Initial Pipeline

Pipeline Variables

Pipeline Parameters

Lookup Activity

For Each Activity

Linked Service Parameters

Metadata driven pipeline


Recent Changes to ECDC Data
Recent Changes to ECDC Data

Granularity of the data changed from daily to weekly

File structure is also different as a result

Use GIT Repo - https://github.com/cloudboxacademy/covid19


Data Ingestion

HTTP -> Azure Data Lake


Data Ingestion Requirements

Covid-19 new cases and deaths by Country

Covid-19 Hospital admissions & ICU cases

Covid-19 Testing Numbers

Country Response to Covid-19

URL - https://www.ecdc.europa.eu/en/covid-19/data
Data Ingestion

Case & Deaths Data

URL - https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19
Copy Activity – Case & Deaths Data

1 Source Linked Service (HTTP): ls_http_opendata_ecdc_europa_eu
2 Source Data Set (CSV): ds_cases_deaths_raw_csv_http
3 Sink Linked Service (Azure Data Lake): ls_adls_covidreportingdl
4 Sink Data Set (CSV): ds_cases_deaths_raw_csv_dl
5 Pipeline: pl_ingest_cases_deaths_data
6 Copy Activity: Copy Cases And Deaths Data

Source File
URL: https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv

Target File
Storage Account: covidreportingdl
Container: raw
File: ecdc/cases_deaths.csv
Copy Activity – Hospital Admission Data

1 Source Linked Service (HTTP): ls_http_opendata_ecdc_europa_eu
2 Source Data Set (CSV): ds_hospital_admissions_raw_csv_http
3 Sink Linked Service (Azure Data Lake): ls_adls_covidreportingdl
4 Sink Data Set (CSV): ds_hospital_admissions_raw_csv_dl
5 Pipeline: pl_ingest_hospital_admissions_data
6 Copy Activity: Copy Hospital Admissions Data

Source File
URL: https://opendata.ecdc.europa.eu/covid19/hospitalicuadmissionrates/csv/data.csv

Target File
Storage Account: covidreportingdl
Container: raw
File: ecdc/hospital_admissions.csv
Parameters & Variables

Parameters are external values passed into pipelines, datasets or linked services. The value cannot be changed inside a pipeline.

Variables are internal values set inside a pipeline. The value can be changed inside the pipeline using the Set Variable or Append Variable activity.
Differences

Source
https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv
https://opendata.ecdc.europa.eu/covid19/hospitalicuadmissionrates/csv/data.csv
https://opendata.ecdc.europa.eu/covid19/testing/csv
https://www.ecdc.europa.eu/sites/default/files/documents/data_response_graphs_0.csv

Sink
raw/ecdc/case_distribution.csv
raw/ecdc/hospital_admission.csv
raw/ecdc/testing.csv
raw/ecdc/country_response.csv
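
Because the four ingestions differ only in source URL and sink file, they lend themselves to a metadata-driven pipeline: one parameterised copy, fed by a list of rows. The sketch below derives such rows from the URLs above. The key names (`sourceBaseURL`, `sourceRelativeURL`, `sinkFileName`) are hypothetical and not confirmed by the deck; the split into base and relative URL reflects the fact that the last source sits on a different host, which is presumably where linked service parameters come in.

```python
from urllib.parse import urlsplit

# (source URL, sink file name) pairs taken from the Differences slide.
SOURCES = [
    ("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv", "case_distribution.csv"),
    ("https://opendata.ecdc.europa.eu/covid19/hospitalicuadmissionrates/csv/data.csv", "hospital_admission.csv"),
    ("https://opendata.ecdc.europa.eu/covid19/testing/csv", "testing.csv"),
    ("https://www.ecdc.europa.eu/sites/default/files/documents/data_response_graphs_0.csv", "country_response.csv"),
]


def to_lookup_rows(sources):
    """Split each URL into a base URL and a relative URL, one row per file."""
    rows = []
    for url, sink_file in sources:
        parts = urlsplit(url)
        rows.append(
            {
                "sourceBaseURL": f"{parts.scheme}://{parts.netloc}",  # hypothetical key
                "sourceRelativeURL": parts.path.lstrip("/"),  # hypothetical key
                "sinkFileName": sink_file,  # hypothetical key
            }
        )
    return rows


rows = to_lookup_rows(SOURCES)
for row in rows:
    # In a pipeline, a For Each activity would run one copy per row.
    print(row)
```

A Lookup activity could read rows of this shape from a file, and a For Each activity could pass each row's values as parameters to a single copy.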
Data Flows (1) - Module Overview
(Cases & Deaths File)
Data Flow – Cases & Deaths Data
(Solution Architecture diagram: ECDC COVID-19 data via HTTP Connector and Population Data in Azure Blob Storage -> Ingest -> Azure Data Lake Storage Gen2 -> Transform/ Analyze -> Azure Data Lake Storage Gen2 -> Publish -> Azure SQL Database; ML Models)
Data Flow – Cases & Deaths Data
Data Flow Overview

Requirement

Source Transformation

Filter Transformation

Select Transformation

Pivot Transformation

Lookup Transformation

Sink Transformation

Create Pipeline
Data Flows - Features

Code free data transformations

Executed on Data Factory managed Databricks Spark clusters

Benefits from Data Factory scheduling and monitoring capabilities

Data Flows - Types

Data Flows - Limitations

Only available in some regions
https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview#available-regions

Limited set of connectors available
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-source#supported-sources

Not suitable for very complex logic
Transform Cases & Deaths Data

Raw File from ECDC
Column Name
country
country_code
continent
population
indicator
daily_count
date
rate_14_day
source

Transformed File (Europe Only)
Column Name
country
country_code_2_digit (Lookup)
country_code_3_digit
population
cases_count
deaths_count
reported_date (Rename)
source
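
The transformation from the raw to the transformed column list combines a filter (Europe only), a pivot (indicator rows become cases_count and deaths_count columns), a lookup (2-digit country code) and a rename (date to reported_date). The plain-Python sketch below shows that logic on made-up rows. It is not Data Flow code, and the indicator values "confirmed cases" and "deaths", the lookup table, and all figures are assumptions for illustration.

```python
# Made-up sample rows using the raw column names (rate_14_day omitted for brevity).
RAW_ROWS = [
    {"country": "France", "country_code": "FRA", "continent": "Europe", "population": 100,
     "indicator": "confirmed cases", "daily_count": 5, "date": "2020-03-01", "source": "ECDC"},
    {"country": "France", "country_code": "FRA", "continent": "Europe", "population": 100,
     "indicator": "deaths", "daily_count": 1, "date": "2020-03-01", "source": "ECDC"},
    {"country": "Brazil", "country_code": "BRA", "continent": "America", "population": 200,
     "indicator": "confirmed cases", "daily_count": 9, "date": "2020-03-01", "source": "ECDC"},
]

# Hypothetical lookup from 3-digit to 2-digit country code.
COUNTRY_LOOKUP = {"FRA": "FR"}


def transform(rows, lookup):
    """Filter to Europe, pivot indicator rows into columns, add the looked-up code."""
    pivoted = {}
    for row in rows:
        if row["continent"] != "Europe":  # Filter: Europe only
            continue
        key = (row["country_code"], row["date"])
        out = pivoted.setdefault(key, {
            "country": row["country"],
            "country_code_2_digit": lookup.get(row["country_code"]),  # Lookup
            "country_code_3_digit": row["country_code"],
            "population": row["population"],
            "cases_count": 0,
            "deaths_count": 0,
            "reported_date": row["date"],  # Rename: date -> reported_date
            "source": row["source"],
        })
        # Pivot: one output column per indicator value
        if row["indicator"] == "confirmed cases":
            out["cases_count"] += row["daily_count"]
        elif row["indicator"] == "deaths":
            out["deaths_count"] += row["daily_count"]
    return list(pivoted.values())


transformed = transform(RAW_ROWS, COUNTRY_LOOKUP)
for record in transformed:
    print(record)
```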
Data Flows (2) - Module Overview
(Hospital Admissions File)
Data Flow – Hospital Admissions Data
Requirement

Source Transformation

Select Transformation

Lookup Transformation

Pivot Transformation

Sink Transformation

Conditional Split Transformation

Derived Column Transformation

Aggregate Transformation

Sort Transformation

Join Transformation

Create Pipeline
Hospital Admissions Data

Raw File from ECDC (Source)
Column Name
country
indicator
date
year_week
value
source
url

Transformed Daily File
Column Name
country
country_code_2_digit (Lookup)
country_code_3_digit (Lookup)
population (Lookup)
reported_date
hospital_occupancy_count
icu_occupancy_count
source

Transformed Weekly File
Column Name
country
country_code_2_digit (Lookup)
country_code_3_digit (Lookup)
population (Lookup)
reported_year_week (transformed)
reported_week_start_date (Lookup)
reported_week_end_date (Lookup)
new_hospital_occupancy_count
new_icu_occupancy_count
source
Source Transformation
Assignment
Select Transformation
Assignment

Remove url

Rename date to reported_date

Rename year_week to reported_year_week


Lookup Transformation
Assignment

Lookup country file

Select only required fields (i.e. remove additional fields from lookup)
Pivot Transformation
Assignment
Select & Sink Transformation
Assignment
Data Flow Execution
Assignment
HDInsight Activity - Module Overview
(Testing File)
HDInsight Activity – Testing File
(Solution Architecture diagram: ECDC COVID-19 data via HTTP Connector and Population Data in Azure Blob Storage -> Ingest -> Azure Data Lake Storage Gen2 -> Transform/ Analyze -> Azure Data Lake Storage Gen2 -> Publish -> Azure SQL Database; ML Models)
HDInsight Activity – Testing File
Creating HDInsight Cluster

HDInsight UI Overview

Transformation Requirement

Hive Script Walk-through

Creating Pipeline

Delete HDInsight Cluster


Creating HDInsight Cluster
Testing Data

Raw File from ECDC
Column Name
country
country_code (Remove)
year_week
new_cases
test_done
population
testing_rate
positivity_rate
testing_data_source

Transformed File
Column Name
country
country_code_2_digit (Lookup)
country_code_3_digit (Lookup)
reported_year_week
reported_week_start_date (Lookup)
reported_week_end_date (Lookup)
new_cases
test_done
population
testing_rate
positivity_rate
testing_data_source
Databricks Activity - Module Overview
(Population File)
Databricks Activity – Population File
(Solution Architecture diagram: ECDC COVID-19 data via HTTP Connector and Population Data in Azure Blob Storage -> Ingest -> Azure Data Lake Storage Gen2 -> Transform/ Analyze -> Azure Data Lake Storage Gen2 -> Publish -> Azure SQL Database; ML Models)
Databricks Activity – Population File
Create Databricks Service

Create Databricks Cluster

Mount Storage Accounts

Transformation Requirements

Creating Pipeline
Databricks Environment Set-up
Creating Databricks Service
Creating Databricks Cluster
What is a cluster?

Driver Node (Databricks Runtime)

Worker Node, Worker Node, Worker Node

Cluster Types

All Purpose/ Interactive Clusters

Job Clusters
Mounting Data Lake Storage
Mounting Data Lake Storage

Create Azure Service Principal

Grant access for data lake to Azure Service Principal

Create the mount in Databricks using the Service Principal
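
The third step can be sketched as follows. The helper functions only assemble the OAuth settings and the `abfss://` source URI, so they run anywhere; the actual mount call is left as a comment because `dbutils` exists only inside a Databricks notebook. The placeholder credentials, the mount point, and the helper names are hypothetical, and in practice the client secret would come from a secret scope rather than a literal.

```python
def adls_oauth_configs(client_id, client_secret, tenant_id):
    """Build the Spark configuration for service principal (OAuth) access to ADLS Gen2."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_source(container, storage_account):
    """Return the abfss URI for a container in a Data Lake Storage Gen2 account."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"


# Placeholder values; replace with the service principal's real details.
configs = adls_oauth_configs("<client-id>", "<client-secret>", "<tenant-id>")
source = mount_source("raw", "covidreportingdl")
print(source)

# Inside a Databricks notebook the mount itself would look roughly like this
# (the mount point is an illustrative choice):
# dbutils.fs.mount(
#     source=source,
#     mount_point="/mnt/covidreportingdl/raw",
#     extra_configs=configs,
# )
```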


Transform Population By Age Data
Transform Population By Age Data

Raw File
Column Name
indic_de,geo\time
2008
2009
2010
2011
…
2018
2019

Transformed File
Column Name
country (Lookup)
country_code_2_digit (Substr)
country_code_3_digit (Lookup)
population (Lookup)
age_group_0_14
age_group_25_49
age_group_50_64
age_group_65_79
age_group_80_max
Data Factory Pipeline
Copy Data to Azure SQL
Copy Data to SQL
(Solution Architecture diagram: ECDC COVID-19 data via HTTP Connector and Population Data in Azure Blob Storage -> Ingest -> Azure Data Lake Storage Gen2 -> Transform/ Analyze -> Azure Data Lake Storage Gen2 -> Publish -> Azure SQL Database; ML Models)
Copy Data to SQL

Copy Cases & Deaths

Copy Hospital Admissions

Copy Testing
Copy Activity – Data Lake to SQL
Cases and Deaths Data
Copy Activity – Data Lake to SQL
Hospital Admissions Daily Data

Assignment
Copy Activity – Data Lake to SQL
Testing Data
Data Orchestration
Data Orchestration Requirements

Pipeline executions are fully automated

Pipelines run at regular intervals or on an event occurring

Activities only run once the upstream dependency has been satisfied

Easier to monitor for execution progress and issues


Data Factory Capability

Dependency between activities inside a pipeline

Dependency between pipelines within a parent pipeline

Dependency between triggers [Only tumbling window triggers]

Custom-made Solution
Data Orchestration
Option 1 – Parent Pipeline
Data Orchestration
Option 2 – Trigger Dependency
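
Both options come down to running pipelines in dependency order. The sketch below expresses that idea with Python's standard `graphlib` module (Python 3.9 or later). Only pl_ingest_cases_deaths_data appears in the deck; the other two pipeline names are hypothetical stand-ins for the transformation and copy-to-SQL steps.

```python
from graphlib import TopologicalSorter

# Each pipeline maps to the set of pipelines that must succeed before it runs.
# The "process" and "sqlize" names are hypothetical.
dependencies = {
    "pl_process_cases_deaths_data": {"pl_ingest_cases_deaths_data"},
    "pl_sqlize_cases_deaths_data": {"pl_process_cases_deaths_data"},
}

# static_order() yields every pipeline after all of its upstream dependencies.
order = list(TopologicalSorter(dependencies).static_order())
for position, name in enumerate(order, start=1):
    print(position, name)
```

In a parent pipeline the same ordering would be drawn as success dependencies between Execute Pipeline activities; with tumbling window triggers it would be declared as dependencies between the triggers.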
Azure Data Factory - Monitoring
Azure Data Factory - Monitoring
What to Monitor

Data Factory Monitoring

Creating Alerts

Recovery From Failure

Reporting on Metrics

Azure Monitor Introduction

Log Analytics

Azure Data Factory Analytics


Monitoring
What do we want to monitor
Azure Data Factory Resource

Integration runtime

Trigger runs

Pipeline runs

Activity runs
Data Factory Monitor

Ability to monitor status of pipeline/ triggers

Can be used to re-run failed pipelines/ triggers

Ability to send alerts from base level metrics

Provides base level metrics and logs

Pipeline runs are stored only for 45 days


Azure Monitor

Ability to route the diagnostic data to other storage solutions

Provides richer diagnostic data

Ability to write complex queries and custom reporting

Ability to report across multiple data factories


Data Factory Monitor
Azure Monitor
Reporting via Power BI
Reporting via Power BI
Introduction to Power BI Desktop

Review the Covid-19 pre-built Report


Power BI Desktop Overview
Continuous Integration / Continuous Delivery
(CI / CD)

Module Overview
Continuous Integration / Continuous Delivery

ADF – Dev ADF – Test ADF – Prod


Continuous Integration / Continuous Delivery

ADF – Dev -> Git -> Build -> Artefacts -> Release -> ADF – Test -> Release -> ADF – Prod

Continuous Integration / Continuous Delivery
(CI / CD)
DevOps - Introduction

Ops

Siloed
Dev
DevOps - Introduction

Dev Devops Ops


DevOps - Characteristics

Collaboration, trust and transparency

Agile Development Approach

Continuous Integration/ Delivery

Automation

Continuous Improvement
Continuous Integration / Continuous Delivery

Plan -> Code -> Build -> Test -> Release -> Deploy -> Monitor -> Improve
Phases: Continuous Integration; Continuous Delivery/ Deployment; Continuous Improvement

Continuous Integration / Continuous Delivery - ADF

(Same lifecycle diagram, annotated with JSON and ARM Template)

Option 1 – Manual Integration / Automated Delivery

Option 2 – Fully Automated Solution
CI/CD Option 1 – Using ADF Publish

Live Mode
Developer 1 and Developer 2 each work in ADF - Studio (Debug) directly against the ADF – Repo
Objects: Pipelines, Data Flows, Datasets, Linked Services, Triggers
CI/CD Option 1 – Using ADF Publish

Git Mode (Git: Github / Azure Devops Repos)

Continuous Integration (CI)
Developer 1 (ADF - Studio) works on feature branch 1
Developer 2 (ADF - Studio) works on feature branch 2
Feature branches merge into the master branch
ADF – Dev publishes the ARM Template to the adf_publish branch

Continuous Deployment (CD)
Release of the ARM Template to ADF – Test and ADF – Prod
Azure DevOps
Azure DevOps

Boards

Repos

Pipelines

Test Plans

Artifacts
Azure DevOps

Developer -> Azure DevOps Organization -> Project(s)
Each Project contains: Boards, Repos, Pipelines, Test Plans, Artifacts
Azure DevOps Environment Set-up
Azure Data Factory Set-up
Azure Data Factory Set-up

ADF – Dev (Git) -> ARM Template -> Release -> ADF – Test, ADF – Prod
Azure Data Factory Set-up

Env    Data Factory Name     Resource Group Name   GIT Enabled
dev    dev-ci-cd-demo-adf    dev-ci-cd-demo-rg     Y
test   test-ci-cd-demo-adf   test-ci-cd-demo-rg    N
prod   prod-ci-cd-demo-adf   prod-ci-cd-demo-rg    N



ADF Git Configuration
Git Configuration

Git Mode
Developer 1 (ADF Dev - Studio) works on feature branch 1
Developer 2 (ADF Dev - Studio) works on feature branch 2
Feature branches merge into the master branch
ADF Dev - Repo publishes the ARM Template to the adf_publish branch
Continuous Integration - Code

1. Create feature branch
2. Develop pipeline
3. Debug pipeline
4. Create Pull Request
5. Review & Approve Pull Request
6. Merge to collaboration branch (main branch)

(Diagram: Developer -> ADF Dev - Studio -> Git)
Continuous Integration - Build

1. Create feature branch
2. Develop pipeline
3. Debug pipeline
4. Create Pull Request
5. Review & Approve Pull Request
6. Merge to collaboration branch (main branch)
7. Publish the changes (ARM Template in the adf_publish branch)

(Diagram: Developer -> ADF Dev - Studio -> Git -> ADF Dev - Repo)
Continuous Delivery
Continuous Delivery

Git (master branch, adf_publish branch) -> ARM Template -> Release -> ADF – Test -> Approve -> Release -> ADF – Prod
Continuous Delivery – Release Pipeline

Git (master branch, adf_publish branch) -> ARM Template -> Release

Test Stage
Pre Deployment Script -> ARM Deployment -> Post Deployment Script -> ADF – Test

Prod Stage
Pre Deployment Script -> ARM Deployment -> Post Deployment Script -> ADF – Prod
Release Pipeline - Issues

Test Stage
ARM Template -> Release -> ARM Deployment -> ADF – Test

Issues
Deleted Objects
Active Triggers

Release Pipeline – Pre & Post Deployment

Git (master branch, adf_publish branch) -> ARM Template -> Release

Test Stage
Pre Deployment Script -> ARM Deployment -> Post Deployment Script -> ADF – Test
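
The two issues (deleted objects and active triggers) explain why the deployment is wrapped in pre and post deployment scripts. The sketch below captures one plausible version of that logic as set arithmetic: stop active triggers before deploying, then delete objects that no longer exist in the template and restart the triggers that remain. It is a simplified illustration with made-up object names, not the script used in the course.

```python
def plan_deployment(deployed, template, active_triggers):
    """Work out the pre and post deployment actions for a target Data Factory.

    deployed:        names of objects currently in the target factory
    template:        names of objects defined in the ARM template
    active_triggers: names of triggers currently started in the target factory
    """
    return {
        # Pre deployment: active triggers are stopped before deploying.
        "stop_before_deploy": sorted(active_triggers),
        # Post deployment: an ARM deployment does not remove objects that were
        # deleted in Dev, so anything not in the template is deleted explicitly.
        "delete_after_deploy": sorted(deployed - template),
        # Post deployment: restart the triggers that still exist.
        "start_after_deploy": sorted(active_triggers & template),
    }


plan = plan_deployment(
    deployed={"pl_a", "pl_old", "tr_daily", "tr_old"},  # made-up names
    template={"pl_a", "pl_new", "tr_daily"},
    active_triggers={"tr_daily", "tr_old"},
)
for step, names in plan.items():
    print(step, names)
```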
Release Pipeline – Deployment to Test

Test Stage
Pre Deployment Script -> ARM Deployment -> Post Deployment Script -> ADF – Test

Release Pipeline – Deployment to Prod

Prod Stage
Pre Deployment Script -> ARM Deployment -> Post Deployment Script -> ADF – Prod
CI/CD Option 1 – Using ADF Publish

Manual Build
Feature branches merge into the master branch (Git Mode)
ADF – Dev publishes the ARM Template to the adf_publish branch
Release deploys the ARM Template to ADF – Test and ADF – Prod
CI/CD Option 2 – Using Build Pipeline

Automated Build

Continuous Integration (CI)
Developer 1 (ADF - Studio) works on feature branch 1
Developer 2 (ADF - Studio) works on feature branch 2
Feature branches merge into the master branch (Git Mode)
Build Pipeline produces the ARM Template

Continuous Deployment (CD)
Release Pipeline deploys the ARM Template to ADF – Dev, ADF – Test and ADF – Prod
CI/CD Scenario – Data Lake Access

ADF – Dev -> Build -> ARM Template -> Release -> ADF – Test -> Approve -> Release -> ADF – Prod

Each Data Factory accesses its own Data Lake
ADF – Dev: DL – Dev
ADF – Test: DL – Test
ADF – Prod: DL – Prod

Access options
Access Key
Service Principal
System Assigned Managed Identity
User Assigned Managed Identity
Data Lake Storage Set-up

Env    Data Factory Name     Resource Group Name   Data Lake Name    GIT Enabled
dev    dev-ci-cd-demo-adf    dev-ci-cd-demo-rg     devcicddemodl     Y
test   test-ci-cd-demo-adf   test-ci-cd-demo-rg    testcicddemodl    N
prod   prod-ci-cd-demo-adf   prod-ci-cd-demo-rg    prodcicddemodl    N


Data Lake Access via System Assigned Managed Identity

(Same deployment diagram: ADF – Dev, ADF – Test and ADF – Prod access DL – Dev, DL – Test and DL – Prod respectively)
Data Lake Access via Access Keys

(Same deployment diagram: ADF – Dev, ADF – Test and ADF – Prod access DL – Dev, DL – Test and DL – Prod respectively)

Data Lake Access via Access Keys – Option 1

Key Vaults: KV – Test (for DL – Test), KV – Prod (for DL – Prod)

Data Lake Access via Access Keys – Option 2

Key Vaults: KV – Dev (for DL – Dev), KV – Test (for DL – Test), KV – Prod (for DL – Prod)

Key Vault Set-up

Env    Data Factory Name     Resource Group Name   Data Lake Name    Key Vault Name       GIT Enabled
dev    dev-ci-cd-demo-adf    dev-ci-cd-demo-rg     devcicddemodl     dev-ci-cd-demo-kv    Y
test   test-ci-cd-demo-adf   test-ci-cd-demo-rg    testcicddemodl    test-ci-cd-demo-kv   N
prod   prod-ci-cd-demo-adf   prod-ci-cd-demo-rg    prodcicddemodl    prod-ci-cd-demo-kv   N
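
Since the resource names in the table differ only by environment prefix, the per-environment values that a release stage needs can be derived from the environment name. The sketch below does that; the dictionary keys are illustrative and are not the parameter names of a generated ARM template. The URL formats are the standard endpoints for Data Lake Storage Gen2 and Key Vault.

```python
ENVIRONMENTS = ["dev", "test", "prod"]


def environment_settings(env):
    """Derive the resource names and endpoints for one environment."""
    return {
        "factoryName": f"{env}-ci-cd-demo-adf",
        "resourceGroup": f"{env}-ci-cd-demo-rg",
        # Storage account names cannot contain hyphens, hence the compact form.
        "dataLakeUrl": f"https://{env}cicddemodl.dfs.core.windows.net",
        "keyVaultUrl": f"https://{env}-ci-cd-demo-kv.vault.azure.net/",
        # Only the dev factory is connected to Git.
        "gitEnabled": env == "dev",
    }


settings = {env: environment_settings(env) for env in ENVIRONMENTS}
for env, values in settings.items():
    print(env, values)
```

Values like these would typically be supplied as parameter overrides in each release stage, so that the same ARM Template can be deployed to Test and Prod unchanged.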


Congratulations!
&
Thank you
Feedback
Ratings & Review
Thank you
&
Good Luck!
Version History
