Azure Data Factory: PCSM - 2
Rich data management and governance
Azure Data Lake Storage Gen2
"不折不扣"数据湖:安全、性能好、可大规模扩展的数据湖存储,将对象存储的成本和规模配置文件与数据湖存
储的性能和分析功能集相结合
https://<StorageAccount>.blob.core.windows.net/<Container>/<Blob>
ADLS Gen2 Architecture
Blob Storage: convergence of two storage services
• Object tiering and lifecycle policy management
• AAD integration, RBAC, storage account security
• HA/DR support through ZRS and RA-GRS
• Data governance and management
[Diagram: unstructured logs and files plus structured data from business/custom apps land in the Azure Data Lake Storage Gen2 store]
The cornerstone of end-to-end analytics
[Diagram: unstructured logs and media plus structured data from business/custom apps are stored in Azure Data Lake Storage Gen2, then served to Azure Analysis Services and Power BI]
Fully integrated with Azure
Data Security
• Consistent AAD-based OAuth authentication
• Azure RBAC and POSIX-compliant ACLs
• Integrates with analytics frameworks for end-user authorization
• Encryption at rest: Customer or Microsoft managed keys
• Encryption in transit: TLS
• Transport-level protection: VNet service endpoints
Business Continuity/Disaster Recovery
• Data Redundancy Options: LRS, ZRS, GRS, RA-GRS & GZRS (Soon)
• Data failover options: automatic, customer controlled
• Analytics cluster failover options
Management/Monitoring
• Azure Monitor
Data Governance
• Azure Purview (preview)
数据组织
• More granular access control is possible via POSIX-style ACLs at the file and folder level (see the sketch after the example paths below)
abfss://[email protected]/raw/eventlogs/2019/01/15/evendata.txt
Data organized across multiple accounts
abfss://[email protected]/eventlogs/2019/01/15/evendata.txt
GA: Nov 2020
How fast is the Premium tier?
➢ Hadoop DFSIO benchmark: ADLS premium is 2.8x (write) and 1.6x (read) faster in per-CPU-core throughput via Databricks.
➢ HBase workload: 40% lower latencies.
➢ TPC-DS benchmark: ADLS standard latencies are ~1.5-5x higher than the premium tier.
➢ Synapse: SQL serverless 1 TB TPC-H queries are up to 33% faster on the premium tier.
➢ Interactive workloads: the Microsoft Edge team, running interactive analytics reads on device-grain metric data, found the premium tier to be 3x faster.
➢ ML/AI workloads: a pharmaceutical customer tested the premium tier and observed low, consistent latencies, higher read throughput, fewer job failures, and lower overall cost thanks to reduced compute spending.
➢ A bioinformatics customer scaled up their data processing pipeline for a genomics study on the premium tier (also saw a 2.4x egress speedup) and uses premium as a staging/caching layer.
When should the Premium tier be used?
Cost Effectiveness
➢ The premium tier is cost-effective for transaction-heavy workloads (TPS/TB > 35).
• In the cost-comparison matrix, columns represent the number of transactions per month and rows represent the percentage of transactions that are reads.
• Cell values show the cost reduction (in percent) for that read percentage and transaction volume.
• E.g., in East US 2, when transactions exceed 90M and 70% of them are reads, the premium tier is cheaper than the hot tier (the arithmetic is sketched below).
Tiers: Premium (GA: Nov 2020), Hot, Cool, Archive
Data migration
ADLS Gen2 & Big Data
Big Data Use Cases
[Diagram: Ingest & ETL (Event Hubs, IoT Hub, Data Factory, Monitor, App Insights, Log Analytics) → Streaming (Stream Analytics, Functions) → Analytics & Machine Learning (Machine Learning, HDInsight, Batch) → Data Aggregation (Data Warehouse) → Presentation (Search, Power BI, CDN)]
Integration through Azure Data Factory
[Diagram: LOB, graph, image, and IoT data from on-premises sources is integrated through Azure Data Factory into a VNet and Azure Analysis Services]
Seamlessly migrate SQL Server Integration Services (SSIS) packages to Azure
[Diagram: Microsoft SQL Server Integration Services is lifted from on-premises into an Azure VNET in the cloud, while continuing to reach on-premises data sources and SQL Server]
Hybrid and multi-cloud data integration
[Diagram: data from on-prem systems, SaaS apps, and public clouds feeds data science and machine learning models and analytical dashboards using Power BI]
AZURE DATA FACTORY OVERVIEW
Audiences: Data Engineer, Data Scientist, Citizen Integrator
[Diagram: data sources (incl. SaaS data) → Ingest → Prepare, transform, predict & enrich → Serve (Azure Data Lake) → Visualize]
GET THE MOST OUT OF YOUR ANALYTICS
Azure Data Factory Concepts
Core concepts: Pipelines, Activities, Triggers
ETL / ELT
[Diagram: ETL/ELT across stores: Google Cloud Storage and BigQuery, Dynamics, MongoDB, file systems, FTP/SFTP, Oracle Exadata, Teradata, NoSQL stores, Azure SQL DB, Azure Synapse Analytics, Azure Cosmos DB, and other Azure data services]
Access all your data
• 90+ connectors & growing
• Azure IR available in 30+ regions
• Hybrid connectivity using self-hosted IR: on-prem & VNet
Azure: Blob Storage, Data Lake Store, SQL DB, SQL DW
Database: Amazon Redshift, SQL Server, Oracle, MySQL, Netezza, PostgreSQL, SAP BW, SAP HANA
File storage: Amazon S3, File System, FTP, SFTP
NoSQL: Couchbase, Cassandra, MongoDB
Services and apps: Dynamics 365, Salesforce, Dynamics CRM, Salesforce Service Cloud, SAP C4C, ServiceNow, Oracle CRM, Hubspot
Generic: HTTP, OData, ODBC
[Diagram: a Trigger starts a Pipeline, which is a chain of Activities; Linked Services and Datasets describe the data; Data Movement, Data Transformation, and Dispatch activities execute on a Self-hosted or Azure Integration Runtime; monitoring surfaces Trigger Runs and Activity Runs]
Demo
• E: extract transactional data into ADLS
• T(L?): aggregate the data into an analysis-ready ADLS layer
Common data flow scenarios
Metadata Validation Rules
https://www.youtube.com/watch?v=E_UD3R-VpYE
DATA PROFILING
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251
ADF Integration Runtime (IR)
▪ ADF compute environment with multiple capabilities:
  - Activity dispatch & monitoring
  - Data movement
  - SSIS package execution
▪ To integrate data flow and control flow across the enterprise's hybrid cloud, customers can instantiate multiple IR instances for different network environments:
  - On premises (similar to the DMG in ADF V1)
  - In the public cloud
  - Inside a VNet
▪ Brings a consistent provisioning and monitoring experience across network environments
[Diagram labels: Portal, Application & SDK, Azure Data Factory Service, Self-Hosted IR, Azure IR]
Create an Azure integration runtime - Azure Data Factory & Azure Synapse | Microsoft Docs
PowerShell
Network & compliance: VNET; HIPAA/HITECH, ISO/IEC 27001, ISO/IEC 27018, CSA STAR
Data migration scenarios
Data migration solutions
Ingest data using ADF to bootstrap your analytics workload
Data migration for data lake & EDW
1. Big data workload migration from AWS S3, on-prem Hadoop File System, etc.
2. EDW migration from Oracle Exadata, Netezza, Teradata, AWS Redshift, etc.
• Tuned for perf & scale: PBs for data lake migration, tens of TB for EDW migration
• Cost effective: serverless, PAYG
• Support for initial snapshot & incremental catch-up
Data ingestion for cloud ETL
1. Load as-is from a variety of data stores
2. Stage for data prep and rich transformation
3. Publish to a DW for reporting or an OLTP store for app consumption
• Rich built-in connectors: file stores, RDBMS, NoSQL
• Hybrid connectivity: on-prem, other public clouds, VNet/VPC
• Enterprise-grade security: AAD auth, AKV integration
• Developer productivity: code-free authoring, CI/CD
• Single-pane-of-glass monitoring & Azure Monitor integration
Online (ADF) or offline (Data Box)?
Data size \ bandwidth 50 Mbps 100 Mbps 200 Mbps 500 Mbps 1 Gbps 10 Gbps
1GB 2.7 min 1.4 min 0.7 min 0.3 min 0.1 min 0.0 min
10GB 27.3 min 13.7 min 6.8 min 2.7 min 1.3 min 0.1 min
100GB 4.6 hrs 2.3 hrs 1.1 hrs 0.5 hrs 0.2 hrs 0.0 hrs
1TB 46.6 hrs 23.3 hrs 11.7 hrs 4.7 hrs 2.3 hrs 0.2 hrs
10TB 19.4 days 9.7 days 4.9 days 1.9 days 0.9 days 0.1 days
100TB 194.2 days 97.1 days 48.5 days 19.4 days 9.5 days 0.9 days
1PB 64.7 mo 32.4 mo 16.2 mo 6.5 mo 3.2 mo 0.3 mo
10PB 647.3 mo 323.6 mo 161.8 mo 64.7 mo 31.6 mo 3.2 mo
[In the original table, shading marks which size/bandwidth combinations suit online transfer (ADF) versus offline transfer (Data Box)]
Customer cases
Scenario: data migration from Amazon S3 to Azure Storage. Migrate the existing analytics workload from AWS into Azure; over time, modernize the workload using Azure data services.
CASE #1
Requirements:
• Minimize migration duration
• Maximize load throughput
Result:
• 2 PB of data from S3 → Blob in <11 days
• Initial load: 1.9 PB, avg. throughput 2.1 GB/s
• Incremental: 221 TB, avg. throughput 3.6 GB/s
CASE #2
Requirements:
• Move data over a private connection rather than the public internet
• Optimize AWS network egress charges via AWS Direct Connect
Result:
• 1 PB from S3 → Blob over a 10 Gbps private link
• Avg. throughput 787 MB/s
Customer cases (continued)
CASE #3
Scenario:
• Standardize the analytics solution across the enterprise following the typical MDW pattern
• Lake hydration from on-prem Netezza
Requirements:
• Migration window restricted to Monday to Friday, 18:00-08:00
• Netezza overhead limit: max 8 concurrent DB connections
Result:
• 25 TB from on-prem Netezza to ADLS
• Total migration duration: ~3 weeks
CASE #4
Scenario:
• Build a modern multi-platform DW, bringing data silos across BUs into a central data lake
Requirements:
• Better elasticity and scale when processing big data
• Specialized tools for specialized work: different data and analytics tools supporting different roles (data engineers, data scientists, data integrators)
Result:
• Ingest data from various sources: SFTP, SAP, Amazon S3, Google BigQuery, SQL Server, Oracle, Teradata, files, APIs, etc.
Technical considerations for data migration
Building the solution: process
Azure: Cosmos DB (SQL API), Cosmos DB (MongoDB API), ADLS Gen1, ADLS Gen2, Data Explorer, Database for MariaDB, Table Storage
Database: DB2, Drill, Google BigQuery, Greenplum, HBase, Hive, Phoenix, PostgreSQL, Presto, SAP BW Open Hub, SAP BW MDX, SAP HANA
File storage: File System, FTP, Google Cloud Storage, HDFS, SFTP
File format: Binary, Delimited Text, JSON, ORC, Parquet
NoSQL: Couchbase, MongoDB
Services and apps: CDS for Apps, Concur, Dynamics 365, Dynamics AX, Dynamics CRM, Google AdWords, PayPal, QuickBooks, Salesforce, SF Service Cloud, SF Marketing Cloud, SAP C4C
Generic: OData, ODBC, REST
Connector "extensibility"
Not on the supported list? Don't worry:
• Database/DW: ✓ use the generic ODBC connector for both read and write
[Diagram: Data Factory reaches on-premises data stores through a self-hosted IR deployed inside the corporate firewall boundary]
Connect to the Azure VNET over a dedicated line
Self-hosted IR deployed on Azure VM
[Diagram: on-prem data stores connect via ExpressRoute private peering, and AWS via a private link, to a self-hosted IR on an Azure VM inside a VNet; VNet service endpoints and VNet ACLs secure access to HDInsight and Databricks]
Network configuration options and recommendations
Cloud to cloud (most secure):
• Self-hosted IR on an Azure VM
• Dedicated network bandwidth
• Data transfer via private peering
• Reduced AWS egress charges
Assessment: Security
Designing passwordless pipelines
• Data in transit moves through a secure channel using HTTPS/TLS (v1.2) whenever the data store supports HTTPS or TLS.
• Data at rest is encrypted by leveraging each data store's own encryption mechanisms:
  - Transparent Data Encryption (TDE) in Azure SQL Data Warehouse and Azure SQL Database
  - Storage Service Encryption (SSE) for Azure Storage
  - Automatic encryption for ADLS
  - And so on
Assessment: data migration patterns and best practices
Retrieving data from file-based sources
Incremental data pattern (2): identify newly created/updated files by last modified time, e.g., between 2019/11/06 and 2019/11/07.
Recommended configuration: a dynamic last-modified-time filter plus a tumbling window trigger (a sketch follows), e.g.:
  modifiedDatetimeStart: @pipeline().parameters.windowStartTime
  modifiedDatetimeEnd: @pipeline().parameters.windowEndTime
Retrieving data from non-file sources (DB/DW, NoSQL, SaaS)
ADF issues the specified query to the source to retrieve the data.
Out-of-box optimization for Oracle, Teradata, Netezza, SAP Table, and SAP BW via Open Hub:
• Built-in parallel copy by partitions to boost performance for large table migration/ingestion, similar to Sqoop (a range-partition sketch follows below).
• Options of range partitioning or each data store's native partition mechanism.
[Diagram: a single Copy Activity execution with Parallel Copy = 4 splits the source table on PartitionCol (boundaries around 10000/10001, 30000/30001, 50000/50001, 70000/70001) and copies the partitions concurrently]
Retrieving data from non-file sources (DB/DW, NoSQL, SaaS)
Incremental data pattern (1): data has a timestamp column, e.g., last modified time.
Recommended configuration: a dynamic query plus a tumbling window trigger, e.g.:
  SELECT * FROM MyTable
  WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.windowStartTime, 'yyyy/MM/dd')}'
    AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.windowEndTime, 'yyyy/MM/dd')}'
Get started:
• Copy Data Tool to cover moderate needs, or to quickly get familiar with the configuration
• Solution gallery for advanced scenarios, e.g., partitioned load and incremental load with a high watermark (see the sketch below)
Building the solution: process
Azure IR
More: DIUs used, parallel copies used, PolyBase true/false, staged copy with metrics, etc.
Copy Pipeline
Data Producers and Consumers, Data Engineers and SMEs, Data Officers
Data governance capabilities:
• Data Catalog: business glossary, data catalog
• Data Sharing: intra, inter
• Data Quality: assessment, cleaning
• Master Data: reference data management, master record management
• Data Use Governance: data policy, access governance, loss prevention
• Data Privacy: privacy operations, risk assessment
Data discovery and classification