Azure Data Factory: PCSM - 2

The document provides an overview of Azure Data Factory and its capabilities for data integration, management, and analytics. It highlights features such as support for various data types, security measures, scalability, and cost-effectiveness, as well as integration with other Azure services. Additionally, it discusses the architecture of Azure Data Lake Storage Gen2 and its advantages for big data workloads.

PCSM - Advanced Database Services 2:

Introduction to Azure Data Factory and Its Application Scenarios


Getting value from data is not easy

Data silos · incongruent data types · performance constraints · complexity of solutions · multi-cloud environments · rising costs

Getting real value from data

One data bus for all data · support for diverse data types · limitless, elastic scale · familiar tools and ecosystem · low TCO

On-premises, hybrid, Azure


Where should the data live?
What makes a great data lake:

• Massive scale: PB scale, data accessible everywhere, growth on demand
• Granular, multi-layered security: granular security and protection against accidental data loss
• Optimized for maximum performance: lightning-quick job execution
• Easy to integrate: supports multiple methods of data ingress, processing, egress and visualization
• Cost effective: a cloud economic model with the ability to intelligently manage costs
• Rich data management and governance
Azure Data Lake Storage Gen2
A "no-compromise" data lake: secure, performant, massively scalable data lake storage that combines the cost and scale profile of object storage with the performance and analytics feature set of data lake storage

Secure
✓ Support for fine-grained ACLs, protecting data at the file and folder level
✓ Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration

Easy to manage
✓ Automated Lifecycle Policy Management
✓ Object-level tiering

Fast
✓ Atomic directory operations mean jobs complete faster

Scalable
✓ No limits on data store size
✓ Global footprint (50+ regions)

Cost effective
✓ Object store pricing levels
✓ File system operations minimize the transactions required for job completion

Integration ready
✓ Optimized for Spark and Hadoop analytic engines
✓ Tightly integrated with Azure end-to-end analytics solutions
Namespace hierarchy: Subscription (1…n) → Storage Account (1…n) → Container (1…n) → Blobs

http://<StorageAccount>.blob.core.windows.net/<Container>/<Blob>
ADLS Gen2 Architecture

[Diagram: The Blob API and ADLS API sit on top of a HIERARCHICAL FILE SYSTEM, which adds Security, Performance Enhancements, and Scale and Cost Effectiveness on top of Blob Storage.]

Common Blob Storage foundation: Object Tiering and Lifecycle Policy Management · AAD Integration, RBAC, Storage Account Security · HA/DR support through ZRS and RA-GRS · Data Governance and Management
Convergence of two Storage Services

Azure Blob Storage:
• Global scale - all Azure regions
• Full BCDR capabilities
• Tiered - Hot/Cool/Archive
• Cost efficient
• Large partner ecosystem

Azure Data Lake Store:
• Built for Hadoop
• Hierarchical namespace
• ACLs, AAD and RBAC
• Performance tuned for big data
• Very high scale capacity and throughput

→ Azure Data Lake Storage Gen2
ADLS supports direct query acceleration
Enables unique capabilities for apps and workloads that do not yet use the Object or HDFS protocols

• Blob API - object data: server backups, archive storage, semi-structured data
• ADLS Gen2 API - analytics data: Hadoop file system, file and folder hierarchy, granular ACLs, atomic file transactions
• NFS v3 - file data: HPC data, applications using NFS v3 against large, sequentially read data sets

Common Blob Storage foundation: Object Tiering and Lifecycle Policy Management · AAD Integration, RBAC, Storage Account Security · HA/DR support through ZRS and RA-GRS

Public cloud object storage access through NFS v3 is an industry first

The cornerstone of end-to-end analytics

INGEST → STORE → PREP & TRAIN → MODEL & SERVE

[Diagram: Logs (unstructured), media (unstructured), files (unstructured), and business/custom apps (structured) are INGESTed by Azure Data Factory into Azure Data Lake Storage Gen2 (STORE); Azure Databricks (including SparkR) handles PREP & TRAIN; Azure SQL Data Warehouse, Azure Analysis Services, Cosmos DB, Power BI, and custom apps handle MODEL & SERVE.]
Fully integrated with Azure
Data Security
• Consistent AAD-based OAuth authentication
• Azure RBAC and POSIX-compliant ACLs
• Integrates with analytics frameworks for end-user authorization
• Encryption at rest: Customer or Microsoft managed keys
• Encryption in transit: TLS
• Transport-level protection: VNet service endpoints
Business Continuity/Disaster Recovery
• Data Redundancy Options: LRS, ZRS, GRS, RA-GRS & GZRS (Soon)
• Data failover options: automatic, customer controlled
• Analytics cluster failover options

Management/Monitoring
• Azure Monitor

Data Governance
• Azure Purview (preview)
Data organization

• Example of data organization and access control for a platform
  • Raw folder restricted to writers (uploaders, ADF pipelines, etc.)
  • Staged folder restricted to non-production services
  • Production folder restricted to production services

• Example of data organization and access control for a business
  • Raw folder restricted to users or groups generating data (loggers, transactional systems)
  • Staged folder restricted to developers
  • Production folder restricted to data scientists, business analysts, decision makers, etc.

• Data can be organized as time series
  • /raw/eventlogs/2019/01/15/…..
  • /raw/iotevents/2019/01/15/….
  • /staging/curatedevents/2019/01/….
  • /prod/incidents/2019/01/15/…
  • /prod/salesdata/2019/01/15/….
Data organized in a single centralized account

• Single account owned by a customer/tenant
• Contains data for several teams or departments within the customer/tenant
• Access can be controlled using RBAC roles per department
• More granular access control is possible via POSIX-style ACLs at the file and folder level
• Billed to the customer (subscription)
• Isolation at the account level

abfss://<dept1-filesystem>@<account>.dfs.core.windows.net/raw/eventlogs/2019/01/15/evendata.txt
abfss://<dept2-filesystem>@<account>.dfs.core.windows.net/raw/eventlogs/2019/01/15/evendata.txt
Data organized in multiple accounts

• Separate accounts per department
• Stricter isolation, since departments are in different accounts
• Billing can be to the customer or the department:
  • Accounts in the same subscription get billed to one BU
  • Accounts in different subscriptions get billed to different BUs
• Can be further split into separate accounts for further isolation
• Will need to attach/mount multiple accounts or filesystems to analytics engines like HDInsight and Databricks

abfss://<filesystem>@<dept1-account>.dfs.core.windows.net/eventlogs/2019/01/15/evendata.txt
abfss://<filesystem>@<dept2-account>.dfs.core.windows.net/eventlogs/2019/01/15/evendata.txt
Premium tier - Status: GA (Nov 2020)

How fast is the Premium tier?

➢ Hadoop DFSIO benchmark: ADLS premium is 2.8x faster for writes and 1.6x faster for reads in per-CPU-core throughput via Databricks.
➢ HBase workload: 40% lower latencies.
➢ TPC-DS benchmark: ADLS standard-tier latencies are ~1.5-5x higher than the premium tier.
➢ Synapse: SQL serverless 1 TB TPC-H queries are up to 33% faster on the premium tier.
➢ Interactive workloads - e.g. the Microsoft Edge team, doing interactive analytics on device-grain metric data reads, found the premium tier to be 3x faster.
➢ ML/AI workloads - a pharmaceutical customer tested the premium tier and observed low and consistent latencies, higher read throughput, fewer job failures, and cost savings due to reduced compute spending.
➢ A bioinformatics customer scaled up their data processing pipeline for a genomics study using the premium tier (also saw a 2.4x egress speedup) and is using premium as a staging/caching layer.

When should the Premium tier be used? Cost effectiveness:

➢ The premium tier is cost effective for transaction-heavy workloads (TPS/TB > 35).
• In the cost-comparison table, columns represent the number of transactions in a month and rows represent the percentage of transactions that are reads.
• Cell values show the percentage of cost reduction for a given read percentage and transaction count.
• E.g., in East US 2, when transactions exceed 90M and 70% are reads, the premium tier is cheaper than the hot tier.

Tiers: Premium · Hot · Cool · Archive
Data migration
ADLS Gen2 & Big Data

Big Data Use Cases: Ingest & ETL → Streaming Analytics & Machine Learning → Data Aggregation → Presentation

[Diagram: Azure services across these stages - Functions, Monitor, App Insights, Log Analytics, Event Hubs, IoT Hub, Stream Analytics, Machine Learning, Data Factory, Batch, Azure HDInsight, Search, Data Warehouse, Power BI, CDN - all on a Blob Storage foundation.]

Blob Storage Pillars: Open & Interoperable · Manageable & Cost Efficient · Scalable & Performant · Secure & Compliant · Durable & Available
AZURE DATA FACTORY
A fully managed data integration service in the cloud

PRODUCTIVE
✓ Drag & drop UI
✓ Codeless data movement

HYBRID
✓ Orchestrate where your data lives
✓ Lift SSIS packages to Azure

SCALABLE
✓ Serverless scalability with no infrastructure to manage

TRUSTED
✓ Certified compliant data movement
AZURE DATA FACTORY

Data integration service: serverless, elastically scalable, hybrid

Data Movement and Transformation @Scale
• Cloud & hybrid, with 90+ connectors provided
• Up to 2 GB/s; ETL/ELT in the cloud

Hybrid Pipeline Model
• Seamlessly spans on-premises, Azure, other clouds & SaaS
• Runs on demand, on schedule, on data availability, or on event

Author & Monitor
• Programmability with a multi-language SDK
• Visual tools

SSIS Package Execution
• Lift existing SQL Server ETL to Azure
• Use existing tools (SSMS, SSDT)
AZURE DATA FACTORY
Modernizing the enterprise data warehouse at scale

Integration through Azure Data Factory

[Diagram: Sources - LOB, CRM, Graph, Image, Social, IoT - from cloud and on-premises are INGESTed by ADF (data orchestration, scheduling and monitoring) into Azure Data Lake / Azure Storage (STORE); HDInsight, data lakes and Machine Learning handle PREP & TRANSFORM; Azure SQL DW and Azure Analysis Services handle MODEL & SERVE for apps and insights, all inside a VNet.]
Seamlessly migrate SQL Server Integration Services (SSIS) packages to Azure

[Diagram: The SSIS Integration Runtime inside Azure Data Factory runs SSIS packages as cloud ETL against cloud data sources and, through a VNET, against on-premises data sources (SQL Server); package metadata lives in SQL DB Managed Instance.]
Hybrid and multi-cloud data integration

Azure Data Factory: PaaS data integration

[Diagram: Author, orchestrate and monitor with Azure Data Factory across on-prem, SaaS apps and public clouds, feeding data-driven applications, data science and machine learning models, and analytical dashboards using Power BI.]
AZURE DATA FACTORY OVERVIEW

Audiences: Data Engineer · Data Scientist · Citizen Integrator

[Diagram: Data sources (on-premises, multicloud, SaaS) → Ingest (90+ pre-built connectors) → Prepare, transform, predict & enrich (code-free or code-centric) → Serve (Azure Data Lake) → Visualize, with orchestration and monitoring across all stages.]

GET THE MOST OUT OF YOUR ANALYTICS

[Diagram: The same flow, with the transformation options spelled out - code-free wrangling data flows and mapping data flows, or code-centric processing on Databricks, HDInsight, Azure ML, and more.]
Azure Data Factory Concepts

Pipelines · Activities · Triggers · Linked Services · Datasets · Integration Runtime

Azure Data Factory: ETL/ELT · Visual UI · Code support · Control flow · SSIS execution

Datasets, Activities, Pipelines

Pipelines and activities in Azure Data Factory

https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities
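To make the concepts concrete, here is a minimal sketch of a pipeline JSON with a single copy activity, following the schema in the linked doc; the pipeline and dataset names are hypothetical placeholders:

{
    "name": "MinimalCopyPipeline",
    "properties": {
        "description": "One copy activity moving data between two datasets",
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "BlobInputDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "SqlOutputDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}

The datasets referenced here are defined separately, as shown in the following sections.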
Datasets and linked services

Datasets and linked services in Azure Data Factory


https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services
Linked Service – JSON – where the data lives
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value":
"DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Datasets & Linked services


https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services
Dataset – JSON – which data
{
    "name": "DatasetSample",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "MyAzureSqlLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}

Datasets & Linked services


https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services
Schema Mapping
Column Mapping - JSON

Source dataset:

{
    "name": "SqlServerInput",
    "properties": {
        "structure": [
            { "name": "UserId" },
            { "name": "Name" },
            { "name": "Group" }
        ],
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "SqlServerLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "SourceTable"
        }
    }
}

Sink dataset:

{
    "name": "AzureSqlOutput",
    "properties": {
        "structure": [
            { "name": "MyUserId" },
            { "name": "MyName" },
            { "name": "MyGroup" }
        ],
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "SinkTable"
        }
    }
}

Schema mapping in copy activity


https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping
Copy Activity - Column Mapping - JSON
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [
{
"referenceName": "SqlServerInput",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlOutput",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": { "type": "SqlSource" },
"sink": { "type": "SqlSink" },
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
}
}
}

Schema mapping in copy activity


https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping
Demo: creating the basic ADF concepts
[Diagram: Azure Data Factory at the center, connected to SaaS applications (ServiceNow, Salesforce, Marketo, Dynamics), big data sources (Amazon S3 and Redshift, SAP HANA and Table, HDFS), storage and NoSQL (Google Cloud Storage and BigQuery, MongoDB), file sources (File System, FTP/SFTP), enterprise data warehouses (Netezza, Oracle Exadata, Teradata), and Azure data services (Azure SQL DB, Azure Synapse Analytics, Azure Cosmos DB, Azure Storage).]
Access all your data
• 90+ connectors & growing
• Azure IR available in 30+ regions
• Hybrid connectivity using self-hosted IR: on-prem & VNet

Azure: Azure Blob Storage, Azure Data Lake Store, Azure SQL DB, Azure SQL DW, Azure Cosmos DB, Azure DB for MySQL, Azure DB for PostgreSQL, Azure Search, Azure Table Storage, Azure File Storage
Database: Amazon Redshift, SQL Server, Oracle, MySQL, Netezza, PostgreSQL, SAP BW, SAP HANA, Google BigQuery, Informix, Sybase, DB2, Greenplum, MariaDB, Microsoft Access, Drill, Hive, Phoenix, HBase, Presto, Impala, Spark, Vertica
File Storage: Amazon S3, File System, FTP, SFTP, HDFS
NoSQL: Couchbase, Cassandra, MongoDB
Services and Apps: Dynamics 365, Dynamics CRM, SAP C4C, SAP ECC, Oracle CRM, Oracle Service Cloud, Oracle Responsys, Oracle Eloqua, Salesforce, Salesforce Service Cloud, Salesforce ExactTarget, ServiceNow, HubSpot, Marketo, Zendesk, Zoho CRM, Atlassian Jira, Amazon Marketplace, Magento, Concur, PayPal, QuickBooks Online, Shopify, Xero, GE Historian, Square
Generic: HTTP, OData, ODBC

* Supported file formats: CSV, AVRO, ORC, Parquet, JSON
Code-free transformation
Code-free transformation at scale with Mapping Data Flow

• Construct and perform ETL and ELT processes in an intuitive environment
• Spend less time on design, testing, and maintenance, while focusing on business logic
• Data cleansing, transformation, aggregation, and conversion
• Cloud scale via Spark execution
• Resilient, easily built data flows

Code-centric
Fully contained operationalization for code-centric computing

• Azure Databricks: Notebook, Jar, Python
• Azure HDInsight: Hive, Pig, Spark, MapReduce, Streaming
• Synapse SQL Pool, Azure SQL DB & SQL Server: Stored Procedure
• Machine Learning: Batch Execution, Update Resource
• Azure Functions: Function calls
• Azure Batch: Custom Executable
Wrangling data flows:
• Perform code-free, agile data preparation at scale via Spark execution
• Get started easily with a familiar, Microsoft Excel-like UI
• Choose from a variety of wrangling functions like combining tables, adding columns, and reducing rows
• Create, debug, schedule, and monitor the wrangling data flow in a one-stop shop
Monitoring:
Lower your cost of deployment, testing, and maintenance by monitoring pipeline and activity runs with pre-built templates and UI-based views

• Query runs with a rich language
• Reference operational lineage between parent-child pipelines
• Integrate Azure Monitor for diagnostics logging, metrics, alerts, and events
• Restate pipelines and activities

Orchestration

[Diagram: A trigger starts a pipeline; the pipeline is a workflow of activities with control flow. Connectivity comes from linked services executing on a self-hosted Integration Runtime or the Azure Integration Runtime. Legend: command and control vs. data.]

[Diagram: Example flow - a trigger (event, wall clock, or on demand) starts My Pipeline 1; a For Each loop runs Activity 1 and Activity 2; on success (passing params) Activity 3 and Activity 4 run and My Pipeline 2 is invoked; on error (passing params) an "On Error" branch runs its own activities.]

Azure Data Factory Updated Flexible Application Model

[Diagram: Triggers produce trigger runs, which start pipeline runs. A pipeline groups activities - data movement, data transformation, and dispatch - which produce activity runs. Activities consume datasets and linked services, and execute on an integration runtime.]
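As a concrete illustration of the model, here is a minimal sketch of a wall-clock trigger definition that starts a pipeline once a day; the pipeline name and start time are hypothetical:

{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2019-11-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "MyPipeline1",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}

Each trigger run produces one pipeline run, which in turn produces one activity run per executed activity.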
Demo
• E - extract transaction data to ADLS
• T(L?) - aggregate the data in ADLS into an analyzable form
Common data flow scenarios
Metadata Validation Rules

https://www.youtube.com/watch?v=E_UD3R-VpYE

DATA PROFILING

https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251
ADF Integration Runtime (IR)

▪ ADF compute environment with multiple capabilities:
  - Activity dispatch & monitoring
  - Data movement
  - SSIS package execution
▪ To integrate data flow and control flow across the enterprise's hybrid cloud, customers can instantiate multiple IR instances for different network environments:
  - On premises (similar to DMG in ADF V1)
  - In public cloud
  - Inside a VNet
▪ Brings a consistent provisioning and monitoring experience across network environments

Self-Hosted IR: data movement & activity dispatch on-prem, in the cloud, and in a VNET
Azure IR: data movement & activity dispatch in the Azure public network, plus SSIS; VNET coming soon

[Diagram: The portal application & SDK (UX & SDK) talk to the Azure Data Factory Service, which coordinates IRs spanning on-premises apps & data and cloud apps, services & data.]


UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data


UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data


UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data


How to create and configure an Azure Integration Runtime

Create an Azure integration runtime - Azure Data Factory & Azure Synapse | Microsoft Docs
PowerShell · VNET
Compliance: HIPAA/HITECH · ISO/IEC 27001 · ISO/IEC 27018 · CSA STAR
Data migration scenarios
Data migration solutions
Ingest data using ADF to bootstrap your analytics workload

KEY SCENARIO: Data migration for data lake & EDW
1. Big data workload migration from AWS S3, on-prem Hadoop File System, etc.
2. EDW migration from Oracle Exadata, Netezza, Teradata, AWS Redshift, etc.
WHY ADF:
• Tuned for perf & scale: PBs for data lake migration, tens of TB for EDW migration
• Cost effective: serverless, PAYG
• Support for initial snapshot & incremental catch-up

KEY SCENARIO: Data ingestion for cloud ETL
1. Load as-is from a variety of data stores
2. Stage for data prep and rich transformation
3. Publish to a DW for reporting or an OLTP store for app consumption
WHY ADF:
• Rich built-in connectors: file stores, RDBMS, NoSQL
• Hybrid connectivity: on-prem, other public clouds, VNet/VPC
• Enterprise-grade security: AAD auth, AKV integration
• Developer productivity: code-free authoring, CI/CD
• Single-pane-of-glass monitoring & Azure Monitor integration
Online (ADF) or offline (Data Box)?

Data size \ bandwidth | 50 Mbps | 100 Mbps | 200 Mbps | 500 Mbps | 1 Gbps | 10 Gbps
1 GB | 2.7 min | 1.4 min | 0.7 min | 0.3 min | 0.1 min | 0.0 min
10 GB | 27.3 min | 13.7 min | 6.8 min | 2.7 min | 1.3 min | 0.1 min
100 GB | 4.6 hrs | 2.3 hrs | 1.1 hrs | 0.5 hrs | 0.2 hrs | 0.0 hrs
1 TB | 46.6 hrs | 23.3 hrs | 11.7 hrs | 4.7 hrs | 2.3 hrs | 0.2 hrs
10 TB | 19.4 days | 9.7 days | 4.9 days | 1.9 days | 0.9 days | 0.1 days
100 TB | 194.2 days | 97.1 days | 48.5 days | 19.4 days | 9.5 days | 0.9 days
1 PB | 64.7 mo | 32.4 mo | 16.2 mo | 6.5 mo | 3.2 mo | 0.3 mo
10 PB | 647.3 mo | 323.6 mo | 161.8 mo | 64.7 mo | 31.6 mo | 3.2 mo

Short transfer times favor online transfer with ADF; very large datasets over slow links favor offline transfer with Data Box.
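The table follows directly from transfer time ≈ data size / effective bandwidth. A sanity check on one cell (ignoring protocol overhead, which accounts for the slightly higher figures above):

\[ t \approx \frac{8S}{B}, \qquad \text{e.g. 1 TB at 100 Mbps: } t = \frac{8 \times 10^{12}\ \text{bits}}{100 \times 10^{6}\ \text{bits/s}} = 8 \times 10^{4}\ \text{s} \approx 22\ \text{hours}, \]

in line with the 23.3 hrs shown once real-world link efficiency is factored in.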
Customer cases

Scenario: data migration from Amazon S3 to Azure Storage. Migrate an existing analytics workload from AWS into Azure; over time, modernize the workload using Azure data services.

CASE #1
Requirements:
• Minimize migration duration
• Maximize load throughput
Result:
• 2 PB of data from S3 → Blob in <11 days
• Initial load: 1.9 PB, avg. throughput 2.1 GB/s
• Incremental: 221 TB, avg. throughput 3.6 GB/s

CASE #2
Requirements:
• Data travels over a dedicated connection rather than the public internet
• Optimize AWS network egress charges via AWS Direct Connect
Result:
• 1 PB from S3 → Blob over a 10 Gbps private link
• Avg. throughput 787 MB/s
Customer cases (continued)

CASE #3
Scenario:
• Standardize analytics solutions across the enterprise, following the typical MDW pattern
• Lake hydration from on-prem Netezza
Requirements:
• Specified migration window: Monday-Friday, 18:00-08:00
• Netezza overhead limit: max 8 concurrent DB connections
Result:
• 25 TB on-prem Netezza to ADLS
• Total migration duration: ~3 weeks

CASE #4
Scenario:
• Build a modern multi-platform DW, bringing data silos across BUs into a central data lake
Requirements:
• Better elasticity and scale for big data processing
• Specialized tools for specialized jobs: different data and analytics tools supporting various roles (data engineers, data scientists, data integrators)
Result:
• Ingest data from various sources - SFTP, SAP, Amazon S3, Google BigQuery, SQL Server, Oracle, Teradata, files, APIs, etc.
Technical considerations for data migration
Building the solution - process

Understand the scenario and the workload:
• Connectivity: What are the source and sink stores? Which format?
• Network: What is the network requirement?
• Data loading pattern: One-time historical or incremental copy?
• Scale: What is the data volume, # of objects (tables/files), and size distribution?

Identify key criteria - can ADF meet these requirements?
• Security requirements
• Performance expectations
• Special needs
Evaluation - connectivity protocols

ACCESS ALL YOUR DATA - 90+ BUILT-IN CONNECTORS & GROWING

Azure (15): Blob Storage, Cosmos DB – SQL API, Cosmos DB – MongoDB API, ADLS Gen1, ADLS Gen2, Data Explorer, Database for MariaDB, Database for MySQL, Database for PostgreSQL, File Storage, SQL Database, SQL Database MI, SQL Data Warehouse, Search Index, Table Storage

Database & DW (26): Amazon Redshift, DB2, Drill, Google BigQuery, Greenplum, HBase, Hive, Impala, Informix, MariaDB, Microsoft Access, MySQL, Netezza, Oracle, Phoenix, PostgreSQL, Presto, SAP BW Open Hub, SAP BW MDX, SAP HANA, SAP Table, Spark, SQL Server, Sybase, Teradata, Vertica

File Storage (6): Amazon S3, File System, FTP, Google Cloud Storage, HDFS, SFTP

File Formats (6): AVRO, Binary, Delimited Text, JSON, ORC, Parquet

NoSQL (3): Cassandra, Couchbase, MongoDB

Services & Apps (28): Amazon MWS, CDS for Apps, Concur, Dynamics 365, Dynamics AX, Dynamics CRM, Google AdWords, HubSpot, Jira, Magento, Marketo, Office 365, Oracle Eloqua, Oracle Responsys, Oracle Service Cloud, PayPal, QuickBooks, Salesforce, SF Service Cloud, SF Marketing Cloud, SAP C4C, SAP ECC, ServiceNow, Shopify, Square, Web Table, Xero, Zoho

Generic (4): HTTP, OData, ODBC, REST
Connector "extensibility"

Not on the supported list? No worries:

DATA STORE TYPE → SOLUTION

Database/DW:
✓ Use the generic ODBC connector for both read and write

SaaS apps:
✓ Check if it provides RESTful APIs - use the generic REST connector
✓ Check if it has an OData feed - use the generic OData connector

Others:
✓ Check if you can first load into any supported data store as staging (e.g. File/FTP/SFTP/S3), then let ADF pick up from there
Evaluation - network configuration

Azure Integration Runtime: managed, serverless, and pay-as-you-go
• Specify how much horsepower to use for each copy via Data Integration Units (DIUs)
• A DIU is a combination of CPU, memory, and network resource allocation
• Default behavior is based on your data pattern: larger file sizes & file counts get larger DIUs
• You can set DIU = 2, 4, 8, …, 256

Self-hosted Integration Runtime: a component installed on an on-prem machine or a VM in the cloud
• Touchless: the latest version is automatically pushed down to the machine during downtime
• HA and scale-out: register up to 4 nodes for each self-hosted IR
• Active-active mode: requests are dispatched to nodes using round-robin
• Single-node concurrency: configure the # of concurrent activity runs; the default is determined by IR CPU/memory

[Diagram: Pipelines run on the Azure IR (a managed VM pool) between cloud data stores, or on a self-hosted IR between on-prem and cloud data stores.]
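Both knobs surface as copy activity properties. A minimal sketch (dataset names are hypothetical; dataIntegrationUnits applies on the Azure IR, parallelCopies on either IR type):

{
    "name": "TunedCopy",
    "type": "Copy",
    "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "BlobSink" },
        "dataIntegrationUnits": 32,
        "parallelCopies": 8
    }
}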
Connecting to Azure over the public internet
Self-hosted IR deployed on premises

[Diagram: A self-hosted IR inside the corporate firewall boundary connects to Data Factory and to data stores - Azure Storage, Azure SQL DB, Azure SQL DW, HDInsight, Databricks - protected by VNet ACLs.]
Connecting to an Azure VNet over a dedicated line
Self-hosted IR deployed on an Azure VM

[Diagram: Data stores behind the corporate firewall reach a self-hosted IR on an Azure VM through ExpressRoute (private peering); the IR reaches Azure Storage, Azure SQL DB, and Azure SQL DW through VNet service endpoints and VNet ACLs, alongside HDInsight and Databricks inside the Azure Virtual Network.]
CLOUD-TO-CLOUD PRIVATE PEERING
Self-hosted IR deployed on an Azure VM

[Diagram: An AWS S3 store in an AWS VPC connects via AWS Direct Connect and ExpressRoute private peering (router based) to a self-hosted IR on an Azure VM in an Azure Virtual Network, which reaches Azure Storage through VNet service endpoints and VNet ACLs.]
Network configuration options and recommendations

On-prem to cloud:
• Self-hosted IR on an Azure VM, data transfer via ExpressRoute - the de facto choice for enterprises! Most secure; dedicated network with predictable performance
• Self-hosted IR on prem, data transfer via the public internet - well suited for ISVs! Low friction and effort

Cloud to cloud:
• Azure IR, data transfer via the public internet - well suited for most scenarios; low friction and effort
• Self-hosted IR on an Azure VM, data transfer via private peering - most secure; dedicated network bandwidth; reduces AWS egress charges
Evaluation - security
Designing password-free pipelines

AUTHENTICATION
• Authenticate to Azure data services using Managed Identity (aka MSI) or a Service Principal
• Authenticate to external data stores using basic auth, key-based auth, or Windows auth, where applicable

CREDENTIAL MANAGEMENT
• Recommended: store credentials in Azure Key Vault and reference the key vault secret from ADF
• Alternatively: supply the credential inline in the Linked Service and ADF will encrypt it at rest:
  - Cloud creds are encrypted by ADF-managed certs
  - On-prem creds are encrypted by Windows DPAPI
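With the recommended Key Vault path, the linked service stores no secret at all - only a reference. A sketch, assuming a Key Vault linked service named AzureKeyVaultLinkedService and a secret named SqlConnectionString already exist (both names hypothetical):

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "SqlConnectionString"
            }
        }
    }
}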
Data security

DATA ENCRYPTION IN TRANSIT
Data is transferred through a secure HTTPS/TLS (v1.2) channel if the data store supports HTTPS or TLS.

DATA ENCRYPTION AT REST
Data is encrypted at rest by leveraging the data store's encryption mechanisms:
- Transparent Data Encryption (TDE) in Azure SQL Data Warehouse and Azure SQL Database
- Storage Service Encryption (SSE) for Azure Storage
- Automatic encryption for ADLS
- And so on
Evaluation - data migration patterns and best practices

Retrieving data from file-based sources

ADF copies/parses multiple files in parallel (determined by "parallel copy") across Azure IR Data Integration Units or self-hosted IR nodes; within each file, data is handled in chunks concurrently.

Incremental data pattern → Recommended configuration

1. Folders/files with date/time in the name
   e.g. path root/folder/2019/11/06, file name <prefix>_20191106_<suffix>
   → Dynamic folder/file/prefix/wildcard* + tumbling window trigger:
   root/folder/@{formatDateTime(pipeline().parameters.windowStartTime, 'yyyy/MM/dd')}
   *_@{formatDateTime(pipeline().parameters.windowStartTime, 'yyyyMMdd')}_*

2. Identify newly created/updated files by last modified time
   e.g. between 2019/11/06 and 2019/11/07
   → Dynamic last-modified-time window + tumbling window trigger:
   modifiedDatetimeStart: @pipeline().parameters.windowStartTime
   modifiedDatetimeEnd: @pipeline().parameters.windowEndTime
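The windowStartTime/windowEndTime parameters above are supplied by the tumbling window trigger. A minimal sketch of that wiring, using the documented trigger schema (the pipeline name is hypothetical; a daily window is shown):

{
    "name": "DailyTumblingWindowTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 24,
            "startTime": "2019-11-06T00:00:00Z",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "IncrementalFileCopyPipeline",
                "type": "PipelineReference"
            },
            "parameters": {
                "windowStartTime": "@trigger().outputs.windowStartTime",
                "windowEndTime": "@trigger().outputs.windowEndTime"
            }
        }
    }
}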
Retrieving data from non-file sources (DB/DW, NoSQL, SaaS)

ADF issues the specified query to the source to retrieve the data.

Out-of-box optimization for Oracle, Teradata, Netezza, SAP Table, and SAP BW via Open Hub:
• Built-in parallel copy by partitions to boost performance for large table migration/ingestion, similar to Sqoop.
• Options of range partitioning and each data store's native partitioning mechanism.

[Diagram: A single Copy activity execution with Parallel Copy = 4 splits the source table into ranges on a partition column (e.g. rows up to 10000, 10001-30000, 30001-50000, 50001-70000, ...) and copies the ranges concurrently.]
Retrieving data from non-file sources (DB/DW, NoSQL, SaaS)

Incremental data pattern → Recommended configuration

1. Data has a timestamp column (last modified time)
   → Dynamic query + tumbling window trigger, e.g.
   SELECT * FROM MyTable
   WHERE LastModifiedDate >= @{formatDateTime(pipeline().parameters.windowStartTime, 'yyyy/MM/dd')}
   AND LastModifiedDate < @{formatDateTime(pipeline().parameters.windowEndTime, 'yyyy/MM/dd')}

2. Data has an incremental column, e.g. ID (high watermark)
   → Control table/file + high watermark: get the old watermark → copy the delta → get the new watermark → update the watermark

3. Data is small in size, e.g. dimension data
   → Full copy and overwrite

4. Source has change tracking
   → Supported for Azure SQL DB and SQL Server; leverage sys.change_tracking_tables as the high-watermark value
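For pattern 2, the copy activity's source query is assembled from the watermark lookups. A sketch, assuming two preceding Lookup activities named LookupOldWatermark and LookupNewWatermark (hypothetical names) that each return a single row, and a source table with incremental column ID:

{
    "name": "CopyDelta",
    "type": "Copy",
    "inputs": [ { "referenceName": "SourceTableDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "SqlSource",
            "sqlReaderQuery": {
                "value": "SELECT * FROM MyTable WHERE ID > @{activity('LookupOldWatermark').output.firstRow.Watermark} AND ID <= @{activity('LookupNewWatermark').output.firstRow.Watermark}",
                "type": "Expression"
            }
        },
        "sink": { "type": "BlobSink" }
    }
}

A Stored Procedure activity would then write the new watermark back to the control table, so the next run picks up where this one ended.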
Building the solution - process

Choose a solution and conduct functionality tests:
• Connectivity: test the built-in connectors - reach the data store using the desired auth, pull the expected data out, and convert it into the expected output format.
• Network: Azure IR vs self-hosted IR, on-prem or in an Azure VM
• Security: secure credential handling, encryption in transit/at rest
• Data loading pattern: choose the historical and incremental copy solution based on the workload/need

Get started:
• Copy Data Tool to solve moderate needs, or to quickly get familiar with the config
• Solution gallery for advanced scenarios, e.g. partitioned load, incremental load with high watermark
Building the solution - process

Performance-test and tune to handle data at scale:
1. Pick a representative workload
2. Obtain a throughput baseline and compare it against the performance expectation
3. Identify performance bottlenecks
4. Tune configurations to optimize
5. Scale to the entire dataset

UNDERSTAND HOW ADF COPY SCALES

[Diagram: Pipelines provide flexible control flow & scheduling to scale out (concurrency, partitions). The Azure IR is elastic managed infrastructure for handling data at scale (configurable DIUs per run) between cloud data stores and Azure data stores. The self-hosted IR is customer-managed infrastructure with its own scaling options (machine power, concurrency) for on-prem data stores.]
Performance monitoring and tuning

Follow the performance tips.

More execution details are surfaced, e.g.:
• Queue time - scale the IR
• Time to first byte - tweak the query
• More to come

Also reported: used DIUs, used parallel copies, PolyBase true/false, staged copy with metrics, etc.

Copy activity performance and scalability guide
Example: data lake migration from AWS S3 to Azure ADLS

Suggested pattern - track progress in an Azure SQL DB control table:

PartitionID | Prefix | Status
1 | booking/year=2019/month=01 | Success
2 | booking/year=2019/month=02 | Success
3 | booking/year=2019/month=03 | Failed

Look up the control table → for each partition in the control table, run the copy pipeline → update the control table.

Source bucket: /bucket/<prefix>/files → destination container: /container/<prefix>/files
Example: load and process data from Netezza into ADLS

Control table:

PartitionID | TableName | QueryStmt | Status
1 | T1 | SELECT * from T1 where… | Success
1 | T1 | SELECT * from T1 where… | Success
1 | T2 | SELECT * from T2 where… | Success
2 | T3 | SELECT * from T3 where… | Failed

A parent pipeline iterates over each BatchID; a copy & process pipeline per batch iterates over each data portion in the current batch (T1, T2, … T14), lands it in the raw folder (/YYYY/MM/DD/data.csv), cleanses & aggregates it into the processed folder (/YYYY/MM/DD/files), and updates the control table.
Hybrid data integration service enabling code-free ETL

Industry-leading data ingestion · Visual, no code · Hybrid · Pay only for what you use · Managed SSIS

A productive & trusted hybrid data integration service that simplifies ETL with any data, from any source, at scale.
Data integration + data governance

AZURE DATA FACTORY
Code-free, serverless data integration + governed, trusted data
Leverage active metadata and semantics; extract ETL lineage

AZURE PURVIEW
Data Map: data assets · lineage · classifications · business context · …
Audiences: data producers and consumers · data engineers and SMEs · data officers

Governance capabilities:
• Data Catalog: business glossary, data catalog
• Data Sharing: intra, inter
• Data Quality: assessment, cleaning
• Master Data: reference data management, master record management
• Data Use Governance: data policy, access governance
• Data Privacy: privacy operations, risk assessment, loss prevention

Publish, discover and curate

Data discovery and classification: on-prem and multi-cloud · operational, analytical, SaaS
Links & Resources:

Quick-start: Use the Copy Data tool to copy data:
https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-copy-data-tool

Azure Data Factory Documentation:
https://docs.microsoft.com/en-us/azure/data-factory

Introduction to Azure Data Factory:
https://docs.microsoft.com/en-us/azure/data-factory/introduction

Compare Azure Data Factory with Data Factory version 1:
https://docs.microsoft.com/en-us/azure/data-factory/compare-versions

Azure Data Factory - Code Samples:
https://azure.microsoft.com/en-us/resources/samples/?service=data-factory&sort=0

Azure Data Factory Updates:
https://azure.microsoft.com/en-us/updates/?product=data-factory

Copy Activity performance and tuning guide:
https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
THANK YOU
