0% found this document useful (0 votes)

9 views105 pages

Azure Data Factory: PCSM - 2

The document provides an overview of Azure Data Factory and its capabilities for data integration, management, and analytics. It highlights features such as support for various data types, security measures, scalability, and cost-effectiveness, as well as integration with other Azure services. Additionally, it discusses the architecture of Azure Data Lake Storage Gen2 and its advantages for big data workloads.

Uploaded by

morlanaveenvenkatsai

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views105 pages

Azure Data Factory: PCSM - 2

Uploaded by

morlanaveenvenkatsai

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 105

PCSM - 高级数据库服务 2:

Azure Data Factory介绍及应用场景

从数据中获取价值存在
并非易事

数据孤岛不一致的解决方案的复杂多云环境成本上升

数据类型性
从数据中获取真实价值

Data silos Incongruent Pe r f o r m a n c e Complexity of Rising costs

data types constraints solutions

所有数据的一个支持不同类型的数据无限熟悉的工具和生低

数据总线数据，弹性伸缩态系统 TCO

On-premises, hybrid, Azure

数据放哪里？
一个优秀的数据湖
大规模精细、多层安全优化以达到最大性易于集成成本效益
能

PB Scale, data Granular security Lightning Supports multiple Cloud

accessible and protection quick job methods of data economic
everywhere, against execution ingress, model with the
growth on accidental data processing, egress ability to
demand loss and visualization intelligently
manage costs

丰富的数据管理和治理
Azure Data Lake Storage Gen2
"不折不扣"数据湖：安全、性能好、可大规模扩展的数据湖存储，将对象存储的成本和规模配置文件与数据湖存
储的性能和分析功能集相结合

安全易于管理快扩展性经济高效集成就绪

✓ Support for fine-

✓ Automated ✓ Atomic directory ✓ No limits on ✓ Object store ✓ Optimized for Spark
grained ACLs,
Lifecycle Policy operations data store size pricing levels and Hadoop
protecting data at the
Management means jobs Analytic Engines
file and folder level
complete faster
✓ Global footprint ✓ File system
✓ Multi-layered
✓ Object Level (50+ regions) operations ✓ Tightly integrated
protection via at-rest
tiering minimize with Azure end to
Storage Service
transactions end analytics
encryption and Azure
required for job solutions
Active Directory
completion
integration
Subscription 1 Storage Account 1 Container 1

Subscription 2 Storage Account 2 Container 2 Blobs

Subscription...n Storage Account...n Container…n

http://<StorageAccount>.blob.core.windows.net/<Container>/<Blob>
ADLS Gen2 Architecture

Blob API ADLS API

HIERARCHICAL FILE SYSTEM

Performance Scale and Cost

Security
Enhancements Effectiveness

Blob Storage
Object Tiering and Lifecycle Policy AAD Integration, RBAC, Storage HA/DR support through ZRS and Data Governance and
Management Account Security RA-GRS Management
Convergence of two Storage Ser vices

Azure Blob Storage Azure Data Lake Store

Global scale – All Azure regions Built for Hadoop

Full BCDR capabilities Hierarchical namespace
Tiered - Hot/Cool/Archive ACLs, AAD and RBAC
Cost Efficient Performance tuned for big data
Large partner ecosystem Very high scale capacity and throughput
Azure Data Lake Storage Gen2
ADLS 支持直接查询加速
Enables unique capabilities for apps and workloads that do not yet use the Object or HDFS protocols

Blob API ADLS Gen2 API NFS v3

Object Data Analytics Data File Data

Server Backups, Archive Hadoop File System, File HPC Data, Applications
Storage, Semi-structured and Folder Hierarchy, using NFS v3 against large
Data Granular ACLS Atomic File sequentially read data sets
Transactions

Common Blob Storage Foundation

Object Tiering and Lifecycle AAD Integration, RBAC, HA/DR support through ZRS
Policy Management Storage Account Security and RA-GRS

Public Cloud Object storage access through NFS v3 is an industry first

端到端分析的基石

INGEST PREP & TRAIN MODEL & SERVE

Logs (unstructured)

Media (unstructured) Azure Analysis Power BI

Services

Azure Databricks Azure SQL

Azure Data Factory Data Warehouse

Files (unstructured)

STORE
Business/custom apps
Azure Data Lake Storage Gen2
(structured)
端到端分析的基石

INGEST PREP & TRAIN MODEL & SERVE

Logs (unstructured)
Azure Analysis Power BI
Services

Media (unstructured)

Azure Databricks Azure SQL

Azure Data Factory Data Warehouse
SparkR

Files (unstructured) Cosmos DB

Apps

STORE
Business/custom apps
Azure Data Lake Storage Gen2
(structured)
与Azure完全集成模式
Data Security
• Consistent AAD-based OAuth authentication
• Azure RBAC and POSIX-compliant ACLs
• Integrates with analytics frameworks for end-user authorization
• Encryption at rest: Customer or Microsoft managed keys
• Encryption in transit: TLS
• Transport-level protection: VNet service endpoints
Business Continuity/Disaster Recovery
• Data Redundancy Options: LRS, ZRS, GRS, RA-GRS & GZRS (Soon)
• Data failover options: automatic, customer controlled
• Analytics cluster failover options

Management/Monitoring
• Azure Monitor

Data Governance
• Azure Purview (preview)
数据组织

• Example of data organization and access control for a platform

• Raw folder restricted to writers (uploaders, ADF pipelines etc)
• Staged folder restricted to non-production services
• Production folder restricted to production services

• Example of data organization and access control for a business

• Raw folder restricted to users or groups generating data (loggers,
transactional systems)
• Staged folder restricted to developers
• Production folder restricted to data scientists, business analysts,
decision makers etc..

• Data can be time series organized

• /raw/eventlogs/2019/01/15/…..
• /raw/iotevents/2019/01/15/….
• /staging/curatedevents/2019/01/….
• /prod/incidents/2019/01/15/…
• /prod/salesdata/2019/01/15/….
在一个集中帐户中组织的数据

• Single account owned by a customer/tenant

• Contains data for several teams or departments within the

customer/tenant

• Access can be controlled using RBAC roles per department

• More granular access control possible via POSIX style ACL’s at the file and
folder level

• Billed to the customer (subscription)

• Isolation at the account level

abfss://[email protected]/raw/eventlogs/2019/01/15/evendata.txt
abfss://[email protected]/raw/eventlogs/2019/01/15/evendata.txt
在多个帐户中组织的数据

• Separate accounts per department

• Stricter isolation since departments are in different accounts

• Billing can be to the customer or department

• Accounts in same subscription get billed to one BU
• Accounts in different subscriptions get billed to different BU’s

• Can be further split into separate accounts for further isolation

• Will need to attach / mount multiple accounts or filesystems to the

analytics engines like HDInsight, Databricks

abfss://[email protected]/eventlogs/2019/01/15/evendata.txt
abfss://[email protected]/eventlogs/2019/01/15/evendata.txt
GA: Nov 2020
How fast is the Premium tier?

➢ Hadoop DFSIO benchmark: ADLS Premium is 2.8X and 1.6X faster for write and read per CPU-core throughput via
• Databricks.
➢ HBase workload: 40% faster latencies.
➢ TPC-DS benchmark: ADLS standard latencies are ~1.5-5x higher compared to premium tier.
• ➢ Synapse: SQL serverless 1 TB TPCH queries are up-to 33% faster on premium tier.
➢ Interactive workloads – e.g. The Microsoft Edge team doing interactive analytics on device grain metric data reads
found premium tier to be 3x faster.
➢ ML/AI workload – A pharmaceutical customer tested premium tier and observed low and consistent latencies, higher
read throughput, less job failures using Premium and cost savings due to savings in compute spending.
When should Premium tier be used? ➢ A bioinformatics customer scaled up their data process pipeline for genomics study using premium tier (also saw a 2.4x
➢
egress speedup and are using premium as a staging/caching layer.

Cost Effectiveness

➢ ➢ Premium tier is cost effective for transaction heavy workloads (TPS/TB > 35).

• Columns represents the number of transactions in a month, rows represent percentage of transactions that are reads.
➢ • Cell values show percentage of cost reduction associated with a read transaction percentage and the number of transactions executed.
• E.g., in East US 2, when transactions exceed 90M, and 70% are read transaction, premium tier is cheaper than hot tier.

Premium
Status: GA (Nov 2020) Hot
Cool
Archive
数据迁移
ADLS Gen2 & Big Data
Big Data Use Cases
Ingest & ETL Streaming Analytics & Machine Learning Data Aggregation Presentation

Functions
Monitor Event Hubs Machine
App Insights Log Analytics Stream Learning Search Power BI
Analytics Data Warehouse

IoT Hub
Data Factory CDN
Batch

Azure HDInsight

Blob Storage Pillars

Open & Manageable & Cost Scalable & Secure & Compliant Durable & Available
Interoperable Efficient Performant
Manageable & Scalable & Secure &
Cost Efficient Performant Compliant
A Z U R E D ATA FA C T O R Y
云中全托管理的数据集成服务

PRODUCTIVE HYBRID SCALABLE TRUSTED

✓ Drag & Drop UI ✓ Orchestrate where ✓ Serverless scalability ✓ Certified compliant

your data lives with no infrastructure Data Movement
✓ Codeless Data to manage
Movement ✓ Lift SSIS packages
to Azure
A Z U R E D ATA FA C TO R Y

数据集成服务：无服务器、弹性扩展、混合 Data Movement and Transformation @Scale

Cloud & Hybrid w/ 90+ connectors provided
Up to 2GB/s, ETL/ELT in the cloud

Hybrid Pipeline Model

Seamlessly span: on premise, Azure, other clouds & SaaS
Run on-demand, scheduled, data-availability or on event

Author & Monitor

Programmability w/ multi-language SDK
Visual Tools

SSIS Package Execution

Lift existing SQL Server ETL to Azure
Use existing tools (SSMS, SSDT)
A Z U R E D ATA FA C TO R Y
大规模现代化企业数据仓库

通过 Azure 数据工厂集成
LOB

INGEST STORE PREP & MODEL & Cloud

CRM
TRANSFORM SERVE

Graph
Azure Analysis Services

Image VNet

Social Data orchestration, Azure SQL DW,

Azure Data Lake Data Transformations
scheduling HDInsight, Data Apps and
Azure Storage Machine Learning
and monitoring Lakes Insights

IoT On-premise
将 SQL 服务器集成服务（SSIS）包无缝迁移到 Azure

Azure Data Factory

SSIS Integration Runtime

Cloud data sources SSIS Cloud ETL SQL DB Managed Instance

Cloud
On-premises
VNET

Microsoft
SQL Server
Integration Services
On-Premise data sources SQL Server
混合和多云数据集成

Azure Data Factory

PaaS Data Integration
DATA DRIVEN
APPLICATIONS

Author, orchestrate and monitor with Azure Data Factory

DATA SCIENCE
AND MACHINE
On-Prem SaaS Apps Public Cloud LEARNING
MODELS

ANALYTICAL
DASHBOARDS
USING POWER BI
A Z U R E D ATA FA C TO R Y O V E R V I E W
Audiences
Data Engineer Data Scientist Citizen Integrator

Data sources Ingest Prepare, transform, predict & enrich Serve Visualize

On-premises data 90 + Pre-built Code-free

connectors

Multicloud data Code-centric

Orchestrate and Monitor

SaaS data
Azure Data Lake
G E T T H E M O S T O U T O F Y O U R A N A LY T I C S
Audiences
Data Engineer Data Scientist Citizen Integrator

Data sources Ingest Prepare, transform, predict & enrich Serve Visualize

Wrangling data flow Mapping data

flow

On-premises data 90 + Pre-built Code-free

connectors OR
Code-centric

Multicloud data Databricks … Azure ML

HDInsight

Orchestrate and Monitor

SaaS data
Azure Data Lake
Azure Data Factory 概念

Pipelines

Activities Triggers

Linked Services Datasets Integration Runtime

Azure Data Factory

ETL / ELT

Visual UI Code Support

Control Flow SSIS Execution

Datasets, Activities, Pipelines

Pipelines and activities in Azure Data Factory

https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities
Datasets and linked services

Datasets and linked services in Azure Data Factory

https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services
Linked Service – JSON – 数据在哪
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value":
"DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Datasets & Linked services

https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services
Dataset – JSON – 哪些数据
{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "MyAzureSqlLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties":
{
"tableName": "MyTable"
},
}
}

Datasets & Linked services

https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services
Schema Mapping
Column Mapping - JSON
{ {
"name": "SqlServerInput", "name": "AzureSqlOutput",
"properties": { "properties": {
"structure": "structure":
[ [
{ "name": "UserId"}, { "name": "MyUserId"},
{ "name": "Name"}, { "name": "MyName" },
{ "name": "Group"} { "name": "MyGroup"}
], ],
"type": "SqlServerTable", "type": "AzureSqlTable",
"linkedServiceName": { "linkedServiceName": {
"referenceName": "SqlServerLinkedService", "referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference" "type": "LinkedServiceReference"
}, },
"typeProperties": { "typeProperties": {
"tableName": "SourceTable" "tableName": "SinkTable"
} }
} }
} }

Schema mapping in copy activity

https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping
Copy Activity - Column Mapping - JSON
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [
{
"referenceName": "SqlServerInput",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlOutput",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": { "type": "SqlSource" },
"sink": { "type": "SqlSink" },
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
}
}
}

Schema mapping in copy activity

https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping
Demo: 创建ADF基本概念
ServiceNow
Amazon S3 SAP HANA
and Redshift and Table
Salesforce
HDFS
Marketo
Big data
SaaS sources
applications

Google Cloud
Dynamics
Storage and
BigQuery
NoSQL

Mongo DB

File Azure Data

Enterprise data
Factory warehouses
Netezza

File
System
Azure Data
Services
FTP/SFTP
Oracle
Exadata
Azure SQL
DB Teradata
Azure
Synapse Azure Cosmos DB
Analytics Storage
访问所有数据
• 90+ connectors & growing
• Azure IR available in 30+ regions
• Hybrid connectivity using self-hosted IR: on-prem & VNet
Azure Database File Storage NoSQL Services and Apps Generic
Azure Blob Storage Amazon Redshift SQL Server Amazon S3 Couchbase Dynamics 365 Salesforce HTTP
Azure Data Lake Salesforce Service
Oracle MySQL File System Cassandra Dynamics CRM OData
Store Cloud
Azure SQL DB Netezza PostgreSQL FTP MongoDB SAP C4C ServiceNow ODBC
Azure SQL DW SAP BW SAP HANA SFTP Oracle CRM Hubspot

Azure Cosmos DB Google BigQuery Informix HDFS Oracle Service

Marketo
Cloud
Azure DB for
Sybase DB2 SAP ECC Oracle Responsys
MySQL
Azure DB for
Greenplum MariaDB Zendesk Oracle Eloqua
PostgreSQL

Azure Search Microsoft Access Drill Salesforce

Zoho CRM
ExactTarget
Azure Table Amazon
Hive Phoenix Atlassian Jira
Storage Marketplace
Azure File Storage Hbase Presto Megento Concur
Impala Spark PayPal QuickBooks Online
Vertica Shopify Xero
* Supported file formats: CSV, AVRO, ORC, Parquet, JSON GE Historian Square
无代码转换
Code-free transformation at scale with Mapping Data Flow

Construct and perform ETL and ELT

processes in an intuitive environment

Spend less time on design, testing, and

maintenance, while focusing on business
logic

• Data cleansing, transformation,

aggregation, and conversion

• Cloud scale via Spark execution

• Resilient, easily built data flows

以代码为中心
Fully contained operationalization for code-centric computing

Azure Databricks Azure HDInsight Synapse SQL Pool,

Notebook, Jar, Python Hive, Pig, Spark, MapReduce, Azure SQL DB & SQL Server
Streaming Stored Procedure

Machine Learning Azure Functions Azure Batch

Batch Execution, Update Resource Function calls Custom Executable
• Perform code-free, agile data preparation at
scale via Spark execution

• Get started easily with familiar Microsoft

Excel like UI

• Choose from a variety of wrangling

functions like combining tables, adding
columns, and reducing rows

• Create, debug, schedule, and monitor the

wrangling data flow in a one-stop-shop
Lower your cost of deployment, testing, and
maintenance by monitoring pipeline and
activity runs with pre-built templates and
UI-based views

• Query Runs with rich language

• Reference operational lineage between

parent-child pipelines

• Integrate Azure Monitor for diagnostics

loggings, metrics and alerts, anF events.

• Restate pipeline and activities

流程编排

Trigger Pipeline

Activity Activity …
Activity Activity

Activity

Workflow Control flow

Connectivity

Self-hosted Azure
Integration Runtime Integration Runtime

Linked service Legend

Command and Control
Data
My Pipeline 1

For Each… My Pipeline 2

Trigger Success, Success, Activity 3 Activity 1
params params
Activity 1 Activity 2
Event Activity 4
Wall Clock
Activity 2
On Demand
Error,
param
s “On Error”
… …
Activity 1
Azure Data Factory Updated Flexible Application Model

Triggers
Trigger Runs

Pipeline Runs Pipeline

Linked
Service Data Movement
Activity
Integration
Runtime
Activity Data
Dataset Transformation
Dispatch
Activity
Activity Runs
Demo
• E - 抽取交易数据到ADLS
• T(L?) - 聚合数据到可分析的ADLS
常见数据流方案
Metadata Validation Rules

https://www.youtube.com/watch?v=E_UD3R-VpYE
D ATA P R O F I L I N G

https://techcommunity.microsoft.com/t5/azure-data-
factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251
ADF Integration Runtime (IR)
▪ ADF compute environment with multiple capabilities:
Portal Application & SDK - Activity dispatch & monitoring
- Data movement
- SSIS package execution
▪ To integrate data flow and control flow across the
enterprises’ hybrid cloud, customer can instantiate
Azure Data Factory Service
multiple IR instances for different network environments:
- On premises (similar to DMG in ADF V1)
- In public cloud
- Inside VNet
▪ Bring a consistent provision and monitoring experience
across the network environments

Self-Hosted IR Azure IR

Data Movement & Activity Data Movement & Activity

Dispatch on-prem, Cloud, Dispatch In Azure Public
VNET Network, SSIS
VNET coming soon
UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data

UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data

UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data

UX & SDK

Azure Data Factory Service

On Premises Apps & Data Cloud Apps, Svcs & Data

如何创建和配置 Azure 集成运行时

创建 Azure 集成运行时 - Azure Data Factory & Azure Synapse | Microsoft Docs
PowerShell
VNET
HIPAA/HITECH
ISO/IEC 27001
ISO/IEC 27018
CSA STAR
数据迁移的场景
数据迁移方案
Ingest data using ADF to bootstrap your analytics workload

KEY SCENARIO WHY ADF

Data migration for data lake & EDW • Tuned for perf & scale: PBs for data lake migration, tens of TB for
EDW migration
1. Big data workload migration from AWS S3, on-
prem Hadoop File System, etc • Cost effective: serverless, PAYG
2. EDW migration from Oracle Exadata, Netezza, • Support for initial snapshot & incremental catch-up
Teradata, AWS Redshift, etc

Data ingestion for cloud ETL • Rich built-in connectors: file stores, RDBMS, NoSQL.
1. Load as-is from variety of data stores • Hybrid connectivity: on-prem, other public clouds, VNet/VPC
2. Stage for data prep and rich transformation • Enterprise grade security: AAD auth, AKV integration
3. Publish to DW for reporting or OLTP store for • Developer productivity: code-free authoring, CICD
app consumption • Single-pane-of-class monitoring & Azure Monitor integration
在线（ A D F ）还是离线（ D ATA B O X ）？

Data size \ bandwidth 50 Mbps 100 Mbps 200 Mbps 500 Mbps 1 Gbps 10 Gbps

1GB 2.7 min 1.4 min 0.7 min 0.3 min 0.1 min 0.0 min
10GB 27.3 min 13.7 min 6.8 min 2.7 min 1.3 min 0.1 min
100GB 4.6 hrs 2.3 hrs 1.1 hrs 0.5 hrs 0.2 hrs 0.0 hrs
1TB 46.6 hrs 23.3 hrs 11.7 hrs 4.7 hrs 2.3 hrs 0.2 hrs
10TB 19.4 days 9.7 days 4.9 days 1.9 days 0.9 days 0.1 days
100TB 194.2 days 97.1 days 48.5 days 19.4 days 9.5 days 0.9 days
1PB 64.7 mo 32.4 mo 16.2 mo 6.5 mo 3.2 mo 0.3 mo
10PB 647.3 mo 323.6 mo 161.8 mo 64.7 mo 31.6 mo 3.2 mo

Online Offline
客户案例

Scenario: Data migration from Amazon S3 to Azure Storage. Migrate existing analytics workload from
AWS into Azure, over time modernize workload using Azure data services.

CASE #1 CASE #2
Requirements: Requirements:
• 最小迁移持续时间 • 数据通过专有连接而不是公共互联网;
• 最大负载吞吐量 • 通过 AWS 直接连接优化 AWS 的网络出口费用

Result: Result:
• 2PB data from S3 → Blob in <11 days • 1PB from S3 → Blob over 10Gbps private link
• Initial load: 1.9 PB, avg. throughput 2.1 GB/s • Avg. throughput 787 MB/s
• Incremental: 221 TB, avg. throughput 3.6 GB/s
客户案例（续）

CASE #3 CASE #4
Scenario: Scenario:
• 按照典型的 MDW 模式标准化整个企业的分析解 • 构建一个现代多平台 DW，将跨越 BUs 的数据孤
决方案岛汇集到中央数据湖中.
• Lake hydration from on-prem Netezza
Requirements:
Requirements:
• 处理大数据时具有更好的弹性和扩展.
• 指定迁移窗口从周一到周五的 18：00-08：00
• 术业有专攻，不同的数据和分析工具，支持各种
• Netezza的开销限制：最大 8 个并发 Db 连接角色（数据工程师、数据科学家、数据集成商）.

Result:
Result:
• 25TB on-prem Netezza to ADLS
• Ingest data from various sources – SFTP, SAP,
• Total migration duration: ~3 weeks Amazon S3, Google BigQuery, SQL Server, Oracle,
Teradata, File, APIs, etc.
数据迁移的
技术考虑
构建解决方案 - 流程

Understand the scenario and the workload:

• Connectivity: What are the source and sink stores? Which format?
• Network: What is the network requirement? Can ADF meet
• Data loading pattern: One-time historical or incremental copy? these requirements?
• Scale: What is the data volume, # of objects (tables/files) and size distribution?
Identify key criteria:
• Security requirement
• Performance expectation
• Special need
评估 - 连接协议
A C C E S S A L L Y O U R D ATA - 9 0 + B U I LT - I N
CO N N E C TO R S & G R O W I N G
Azure (15) Database & DW (26) File Storage (6) File Formats (6) NoSQL (3) Services & Apps (28) Generic (4)
Blob Storage Amazon Redshift Oracle Amazon S3 AVRO Cassandra Amazon MWS Oracle Service Cloud HTTP

Cosmos DB – SQL API DB2 Phoenix File System Binary Couchbase CDS for Apps PayPal OData

Cosmos DB – MongoDB API Drill PostgreSQL FTP Delimited Text MongoDB Concur QuickBooks ODBC

ADLS Gen1 Google BigQuery Presto Google Cloud Storage JSON Dynamics 365 Salesforce REST

ADLS Gen2 Greenplum SAP BW Open Hub HDFS ORC Dynamics AX SF Service Cloud

Data Explorer HBase SAP BW MDX SFTP Parquet Dynamics CRM SF Marketing Cloud

Database for MariaDB Hive SAP HANA Google AdWords SAP C4C

Database for MySQL Impala SAP Table HubSpot SAP ECC

Database for PostgreSQL Informix Spark Jira ServiceNow

File Storage MariaDB SQL Server Magento Shopify

SQL Database Microsoft Access Sybase Marketo Square

SQL Database MI MySQL Teradata Office 365 Web Table

SQL Data Warehouse Netezza Vertica Oracle Eloqua Xero

Search Index Oracle Responsys Zoho

Table Storage
连接器"可扩展性"

不在支持列表中？不用担心：

DATA STORE TYPE SOLUTION

Database/DW ✓ Use generic ODBC connector for both read and write

✓ Check if it provides RESTful APIs, use generic REST connector

SaaS apps
✓ Check if it has OData feed, use generic OData connector

✓ Check if you can load to any supported data stores as staging

Others .....
first, e.g. File/FTP/SFTP/S3, then let ADF pick up from there
评估 - 网络配置
• Azure Integration Runtime: managed,
serverless, and pay-as-you-go
Pipeline
• Specify how much horsepower to use for
each copy by Data Integration Units
(DIUs)

• DIU is a combination of CPU, memory, and

network resource allocation.
Azure IR
• Default behavior based on your data VM VM VM VM
pattern – larger file size & file count, larger
DIUs. VM VM VM VM

• You can set DIU = 2, 4, 8, …, 256

Cloud Data Stores Cloud Data Stores
• Self-hosted Integration Runtime:
component installed on machine on-prem
or VM in cloud
Pipeline

• Touchless: latest version automatically

pushed down to machine during downtime

• HA and scale-out: register up to 4 nodes

for each self-hosted IR.

• Active-active mode: requests are

dispatched to nodes using round-robin.
Self-hosted IR
• Single-node concurrency: configure # of
concurrent activity runs, default behavior is
determined based on IR CPU/memory On-prem
Cloud Data Stores
Data Stores
通过公共互联网连接到 AZURE
Self-hosted IR deployed on premises

Data Factory

Self-hosted IR
on premises

Azure Storage Azure SQL DB Azure SQL DW HDInsight Databricks

Data stores
VNet ACL

Corporate
Firewall Boundary
通过专线连接到 AZURE VNET
Self-hosted IR deployed on Azure VM

Data Azure Storage Azure SQL DB Azure SQL DW

Factory
VNet ACL

VNet
service
endpoints

Express Route
Data stores (private peering) Self-hosted IR on
Azure VM
HDInsight Databricks

Corporate Azure Virtual

Firewall Boundary Network
C L O U D - T O - C L O U D P R I VAT E P E E R I N G
Self-hosted IR deployed on Azure VM

Data Factory Azure Storage

VNet ACL
VNet
service
endpoints

Direct Connect Express Route

AWS S3 store Private
Peering
(router Self-hosted IR on
based) Azure VM

AWS VPC Azure Virtual

Network

AWS
网络配置选项和建议

De-facto choice for enterprises!

Self-hosted IR on Azure VM
Most secure
Data transfer via ER Dedicated network w/ predicable perf
On-prem to cloud
Self-hosted IR on prem Well suited for ISVs!
Data transfer via public internet Low friction and effort

Azure IR Well suited for most scenarios

Data transfer via public internet Low friction and effort

Cloud to cloud
Most secure
Self-hosted IR on Azure VM
Dedicated network BW
Data transfer via private peering
Reduce AWS egress charge
评估 - Security
设计无密码管道

AUTHENTICATION CREDENTIAL MANAGEMENT

Authenticate to Azure data services: Recommended: store credentials in Azure Key

Use Managed Identify (aka MSI) or Service Vault and reference key vault secret from ADF
Principal

Authenticate to external data stores: Alternatively: supply credential inline in Linked

Use basic auth, key-based auth, or Windows Service and ADF will encrypt it at rest:
auth, where applicable - Cloud creds are encrypted by ADF-managed
certs
- On-prem creds are encrypted by Windows
DPAPI
数据安全性

DATA ENCRYPTION IN TRANSIT DATA ENCRYPTION AT REST

Data is transferred through secure channel Data is encrypted at rest by leveraging data store’s
HTTPS/TLS (V1.2) if data store supports encryption mechanisms:
HTTPS or TLS. - Transparent Data Encryption (TDE) in Azure SQL
Data Warehouse and Azure SQL Database
- Storage Service Encryption (SSE) for Azure Storage
- Automatic encryption for ADLS
- And so on
评估 - 数据迁移模式和最佳实践
从文件源检索数据

ADF in parallel copy/parse multiple files (determined by “parallel

copy”) across Azure IR Data Integration Units or Self-hosted IR
nodes; within each file, data is handled by chunks concurrently.

增量数据模式建议配置

Folders/files with date/time in name Dynamic folder/file/prefix/wildcard* + tumbling window trigger

1 e.g.
path: root/folder/2019/11/06 root/folder/@{formatDateTime(pipeline().parameters.windowStartTime, 'yyyy/MM/dd’)
file name: <prefix>_20191106_<suffix> *_@{formatDateTime(pipeline().parameters.windowStartTime, 'yyyyMMdd’)_*

Identify newly created/updated files Dynamic last modified time tumbling window trigger
2 by last modified time
e.g. modifiedDatetimeStart: @pipeline().parameters.windowStartTime
between 2019/11/06-2019/11/07 modifiedDatetimeEnd: @pipeline().parameters.windowEndTime
从非文件源检索数据
（DB/DW、 NOSQL、SAAS）
ADF issue the specified query to source to retrieve the data.

Source data
Single Copy Activity execution
Out-of-box optimization for Oracle, C1 C2 PartitionCol
e.g. set Parallel Copy = 4
Teradata, Netezza, SAP Table, SAP BW … … … 10000
via Open Hub: … … … 10001
… … … …
• Built-in parallel copy by partitions to
boost performance for large table … … … 30000
migration/ingestion, similar as Sqoop. … … … 30001
… … … …
• Options of range partition and native
… … … 50000
partition mechanism per data store.
… … … 50001
… … … …
… … … 70000
… … … 70001
… … … …
......
… … … …
从非文件源检索数据
（DB/DW、 NOSQL、SAAS）
增量数据模式建议配置
Dynamic query + tumbling window trigger
1
Data has timestamp column e.g. SELECT * FROM MyTable
last modified time WHERE LastModifiedDate >= @{formatDateTime(pipeline().parameters.windowStartTime, 'yyyy/MM/dd’)
AND LastModifiedDate < @{formatDateTime(pipeline().parameters.windowEndTime, 'yyyy/MM/dd’)

Control table/file + high watermark

Get old watermark
Update watermark
Data has incremental column e.g. Copy delta
2
ID (high watermark)
Get new watermark

Data is small in size

3 Full copy and overwrite
e.g. dimension data
Support for Azure SQL DB and SQL Server.
4 Source has change tracking
Leverage sys.change_tracking_tables as high watermark value
构建解决方案 - 流程

Choose a solution and conduct test on functionality:

• Connectivity: test out built-in connectors – reach data store using desired auth, pull the expected data
out and convert into the expected output format.
• Network: Azure IR vs Self-hosted IR, on-prem or in Azure VM
• Security: secure credential handling, encryption in transit/at rest
• Data loading pattern: choose the historical and incremental copy solution based on the workload/need

Get started:
• Copy Data Tool to solve moderate need, or quickly get familiar with the config
• Solution gallery for advanced scenarios, e.g. partitioned load, incremental load with high watermark
构建解决方案 - 流程

Performance test and tuning to handle data at scale:

1. Pick a representative workload
2. Obtain throughput baseline and compare to the performance expectation
3. Identify performance bottlenecks
4. Tune configurations to optimize
5. Scale to entire dataset
U N D E R S TA N D H O W A D F C O P Y S C A L E S

Flexible control flow &

Pipeline
scheduling to scale out.
Pipeline
(concurrency, partitions)

Azure IR

Cloud Data Stores Azure Data Stores

On-prem Data Stores Self-hosted IR

Elastic managed infra to
handle data at scale.
(configurable DIUs per run)

Customer managed infra

with scaling options.
(powerfulness, concurrency)
性能监控和优化

Follow performance tips

More execution details, e.g.

• Queue time – scale IR
• Time to first byte – tweak query
• More to come

More: used DIUs, used parallel copies, PolyBase true/false, staged copy w/ metrics, etc.

Copy activity performance and scalability guide

示例：从AWS S3 的数据湖迁移 •A ZURE ADLS
Azure SQL DB Control Table Suggestions:
PartitionID Prefix Status
1 booking/year=2019/month=01 Success
2 booking/year=2019/month=02 Success
3 booking/year=2019/month=03 Failed Update
control table
Look up
control table
For Each partition in control table…

Copy Pipeline

Source Bucket: Destination Container:

/bucket/<prefix>/files /container/<prefix>/files
示例：来自 NETEZZA 的负载和过程数据 • ADLS
Control Table
PartitionID TableName QueryStmt Status
1 T1 SELECT * from T1 where… Success
1 T1 SELECT * from T1 where… Success

1 T2 SELECT * from T2 where… Success

2 T3 SELECT * from T3 where… Failed
Update
control table
Parent Pipeline Copy & Process Pipeline per batch

For Each BatchID For Each data portion in current batch…

T1 T2 … T14 Raw Folder:

/YYYY/MM/DD/data.c Processed Folder: (cleanse & aggregate)
/YYYY/MM/DD/files
sv
用于启用无代码 ETL 的混合数据集成服务

Industry Visual Hybrid Pay only for Managed SSIS

leading data No Code what you use
ingestion

Productive & trusted hybrid data integration service

that simplifies ETL with any data, from any source, at scale.
数据集成 + 数据治理
AZURE DATA FACTORY
Code-free, Serverless Data Integration + Governed, Trusted Data

Leverage Active Metadata, Extract ETL Lineage

and semantics
AZURE PURVIEW

Data Map Data assets Lineage Classifications Business context …

6
Data Producers 6 6
Data Engineers Data
and Consumers and SMEs Officers

Data Use
Data Catalog Data Sharing Data Quality Master Data Data Privacy
Governance
Business glossary Intra Assessment Ref. data management Data policy Privacy operations
Data catalog Inter Cleaning Master record management Access governance Risk assessment
Loss prevention

Publish, discover and curate

数据发现与分类

On-prem and multi-cloud Operational, analytical, SaaS

Links & Resources:
Quick-start: Use the Copy Data tool to copy data:
https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-copy-data-tool

Azure Data Factory Documentation:

https://docs.microsoft.com/en-us/azure/data-factory

Introduction to Azure Data Factory:

https://docs.microsoft.com/en-us/azure/data-factory/introduction

Compare Azure Data Factory with Data Factory version 1:

https://docs.microsoft.com/en-us/azure/data-factory/compare-versions

Azure Data Factory - Code Samples:

https://azure.microsoft.com/en-us/resources/samples/?service=data-factory&sort=0

Azure Data Factory Updates:

https://azure.microsoft.com/en-us/updates/?product=data-factory

Copy Activity performance and tuning guide:

https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
THANK YOU

Azure DATA Fatcory
No ratings yet
Azure DATA Fatcory
2,982 pages
Azure Data Factory
No ratings yet
Azure Data Factory
3,167 pages
Testing On Your Host Machine
No ratings yet
Testing On Your Host Machine
5 pages
ADF Copy Data
No ratings yet
ADF Copy Data
85 pages
Azure Data Factory Interview Questions
0% (1)
Azure Data Factory Interview Questions
14 pages
Microsoft Modern Data Estate
No ratings yet
Microsoft Modern Data Estate
48 pages
Warner DP 203 Slides
100% (1)
Warner DP 203 Slides
91 pages
ADF Course Deck
No ratings yet
ADF Course Deck
88 pages
Azure Data Platform End2End - 2day
100% (2)
Azure Data Platform End2End - 2day
108 pages
O Reilly Data Lake Bootcamp Day 11694182865124
No ratings yet
O Reilly Data Lake Bootcamp Day 11694182865124
46 pages
09 - Azure Data Engineering Cheatsheet
No ratings yet
09 - Azure Data Engineering Cheatsheet
37 pages
Start To Finish With Azure Data Factory
100% (2)
Start To Finish With Azure Data Factory
30 pages
Data Factory
100% (2)
Data Factory
26 pages
Azure Data Lake and U-SQL
No ratings yet
Azure Data Lake and U-SQL
51 pages
Data Factory, Data Integration
No ratings yet
Data Factory, Data Integration
2,034 pages
Report Blood Bank Management System DBMS
100% (1)
Report Blood Bank Management System DBMS
33 pages
Data Lake Storage
100% (1)
Data Lake Storage
237 pages
Fanuc Cimplicity I Cell Operator
100% (1)
Fanuc Cimplicity I Cell Operator
739 pages
Azure DataEngineer Course Outline
No ratings yet
Azure DataEngineer Course Outline
4 pages
Etap - Lfa PDF
No ratings yet
Etap - Lfa PDF
5 pages
Design of Sense Amplifier For Coupling Suppression by Using
100% (1)
Design of Sense Amplifier For Coupling Suppression by Using
28 pages
Azure Data Factory Interview Questions and Aswers
No ratings yet
Azure Data Factory Interview Questions and Aswers
5 pages
Azure Datalake
No ratings yet
Azure Datalake
8 pages
f4b7901ed5e5f9106a3a82eea2e2f003
No ratings yet
f4b7901ed5e5f9106a3a82eea2e2f003
3,614 pages
Data Lake Slides
No ratings yet
Data Lake Slides
83 pages
Adf Syllabus
No ratings yet
Adf Syllabus
12 pages
Data Architectures in Azure For Analytics & Big Data: October 20, 2018
No ratings yet
Data Architectures in Azure For Analytics & Big Data: October 20, 2018
26 pages
Exam 70-445 Prep
No ratings yet
Exam 70-445 Prep
56 pages
OReilly DP 203 Slide Deck
No ratings yet
OReilly DP 203 Slide Deck
93 pages
Skill Reports Merged
No ratings yet
Skill Reports Merged
77 pages
Data All Delivering Them DW With Azure 202003224202063744
No ratings yet
Data All Delivering Them DW With Azure 202003224202063744
92 pages
EssentialsOfAzureDataLakeStorageGen2 MelissaCoates
No ratings yet
EssentialsOfAzureDataLakeStorageGen2 MelissaCoates
41 pages
Aniruddha BigDataandAnalytics
No ratings yet
Aniruddha BigDataandAnalytics
33 pages
Blob L100
No ratings yet
Blob L100
22 pages
Information Assurance and Security II Lectures Quizzes and Activities Compress 1
No ratings yet
Information Assurance and Security II Lectures Quizzes and Activities Compress 1
26 pages
Cs301 Solved Subjective Final Term by Junaid
No ratings yet
Cs301 Solved Subjective Final Term by Junaid
39 pages
AzureDataLake WhatWhyHow MelissaCoates
No ratings yet
AzureDataLake WhatWhyHow MelissaCoates
66 pages
SDC - Synapse Analytics
No ratings yet
SDC - Synapse Analytics
23 pages
Azure Data Factory Deck 1
No ratings yet
Azure Data Factory Deck 1
59 pages
ADE Project Along With CI - CD Pipeline
No ratings yet
ADE Project Along With CI - CD Pipeline
36 pages
GITEX Global Cybersecurity - Agenda
No ratings yet
GITEX Global Cybersecurity - Agenda
20 pages
Trabajo) Lake
No ratings yet
Trabajo) Lake
13 pages
Session 6 - Azure Case Study - Covid 19
No ratings yet
Session 6 - Azure Case Study - Covid 19
42 pages
Azure Storage
No ratings yet
Azure Storage
9 pages
Azure Data Factory Microsoft Fabric
No ratings yet
Azure Data Factory Microsoft Fabric
14 pages
Azure ADF
No ratings yet
Azure ADF
22 pages
Cloud MCQ's
No ratings yet
Cloud MCQ's
1 page
Data Analyst Azure PowerBI Syllabus
No ratings yet
Data Analyst Azure PowerBI Syllabus
35 pages
Azure Services
No ratings yet
Azure Services
4 pages
ADE - 7 - 30AM - Frame 3
No ratings yet
ADE - 7 - 30AM - Frame 3
1 page
Whiz Cheat Sheet DP 203 v2
No ratings yet
Whiz Cheat Sheet DP 203 v2
42 pages
FSP 3000
No ratings yet
FSP 3000
16 pages
Selection Tool PDF
No ratings yet
Selection Tool PDF
63 pages
Nectarchat Mobile Application V1.0
No ratings yet
Nectarchat Mobile Application V1.0
15 pages
E Translation
No ratings yet
E Translation
49 pages
HL7 Version 2 XML Encoding Rules, Release 2
No ratings yet
HL7 Version 2 XML Encoding Rules, Release 2
50 pages
Introduction To Simio
No ratings yet
Introduction To Simio
98 pages
Web Technologies (Topic - 04 SGML)
No ratings yet
Web Technologies (Topic - 04 SGML)
11 pages
Broadband Multiplay
No ratings yet
Broadband Multiplay
26 pages
Certificate in Computing (Cic) : Tbrm-End Examination Clune, 2008
No ratings yet
Certificate in Computing (Cic) : Tbrm-End Examination Clune, 2008
20 pages
Word Processing Teachers Note Section 2 Part I
No ratings yet
Word Processing Teachers Note Section 2 Part I
30 pages
VeNus Manual For Vendor (EN) - Final
No ratings yet
VeNus Manual For Vendor (EN) - Final
27 pages
(Share) Simple Rsi Bull - Bear Strategy
No ratings yet
(Share) Simple Rsi Bull - Bear Strategy
7 pages
Experiment No 2: AIM: Program To Implement Binary Search
No ratings yet
Experiment No 2: AIM: Program To Implement Binary Search
21 pages
AZ - 900 Part 5
No ratings yet
AZ - 900 Part 5
11 pages
MD Lab 5
No ratings yet
MD Lab 5
8 pages
Workshop01 - Answer
No ratings yet
Workshop01 - Answer
8 pages
E Mesh Monitor Brochure
No ratings yet
E Mesh Monitor Brochure
4 pages
117 2080 PL-300-1
No ratings yet
117 2080 PL-300-1
7 pages
MCA Rtu Syllabuss
No ratings yet
MCA Rtu Syllabuss
6 pages
Logs and Data Purge
No ratings yet
Logs and Data Purge
5 pages
Comparators in Java
No ratings yet
Comparators in Java
3 pages
Wireless Communication Interfaces
No ratings yet
Wireless Communication Interfaces
2 pages
Advanced Apache Tez Techniques: Definitive Reference for Developers and Engineers
From Everand
Advanced Apache Tez Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Apache Arrow Dataset in Practice: The Complete Guide for Developers and Engineers
From Everand
Apache Arrow Dataset in Practice: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Iceberg Table Formats and Analytics: Definitive Reference for Developers and Engineers
From Everand
Iceberg Table Formats and Analytics: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Azure Container Apps Deployment and Architecture: The Complete Guide for Developers and Engineers
From Everand
Azure Container Apps Deployment and Architecture: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Amazon RDS Architecture and Administration: Definitive Reference for Developers and Engineers
From Everand
Amazon RDS Architecture and Administration: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Comprehensive Guide to Hive Architecture and Query Language: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to Hive Architecture and Query Language: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Azure Synapse Analytics Solutions: Definitive Reference for Developers and Engineers
From Everand
Azure Synapse Analytics Solutions: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Azure Kubernetes Service Essentials: Definitive Reference for Developers and Engineers
From Everand
Azure Kubernetes Service Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Amazon EMR Solutions in Cloud Computing: Definitive Reference for Developers and Engineers
From Everand
Amazon EMR Solutions in Cloud Computing: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Couchbase Essentials: Definitive Reference for Developers and Engineers
From Everand
Couchbase Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Realm Database in Mobile Application Development: Definitive Reference for Developers and Engineers
From Everand
Realm Database in Mobile Application Development: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Efficient Data Preparation with AWS Glue DataBrew: Definitive Reference for Developers and Engineers
From Everand
Efficient Data Preparation with AWS Glue DataBrew: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Mastering Apache Iceberg: Managing Big Data in a Modern Data Lake
From Everand
Mastering Apache Iceberg: Managing Big Data in a Modern Data Lake
Robert Johnson
No ratings yet
Aurora Database Design and Architecture: Definitive Reference for Developers and Engineers
From Everand
Aurora Database Design and Architecture: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Lakes & Pipelines: A Modern Azure Guide
From Everand
Data Lakes & Pipelines: A Modern Azure Guide
Kameron Hussain
No ratings yet
Dataproc Administration and Engineering Solutions: Definitive Reference for Developers and Engineers
From Everand
Dataproc Administration and Engineering Solutions: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Azure Data Demystified: From SQL to Synapse
From Everand
Azure Data Demystified: From SQL to Synapse
Kameron Hussain
No ratings yet

Azure Data Factory: PCSM - 2

Uploaded by

Azure Data Factory: PCSM - 2

Uploaded by

PCSM - 高级数据库服务 2:

Azure Data Factory介绍及应用场景

数据孤岛 不一致的 解决方案的复杂 多云环境 成本上升

Data silos Incongruent Pe r f o r m a n c e Complexity of Rising costs

所有数据的一个 支持不同类型的数据 无限 熟悉的工具和生 低

On-premises, hybrid, Azure

PB Scale, data Granular security Lightning Supports multiple Cloud

安全 易于管理 快 扩展性 经济高效 集成就绪

✓ Support for fine-

Subscription 2 Storage Account 2 Container 2 Blobs

Subscription...n Storage Account...n Container…n

Blob API ADLS API

HIERARCHICAL FILE SYSTEM

Performance Scale and Cost

Azure Blob Storage Azure Data Lake Store

Global scale – All Azure regions Built for Hadoop

Blob API ADLS Gen2 API NFS v3

Object Data Analytics Data File Data

Common Blob Storage Foundation

Public Cloud Object storage access through NFS v3 is an industry first

INGEST PREP & TRAIN MODEL & SERVE

Media (unstructured) Azure Analysis Power BI

Azure Databricks Azure SQL

INGEST PREP & TRAIN MODEL & SERVE

Azure Databricks Azure SQL

Files (unstructured) Cosmos DB

• Example of data organization and access control for a platform

• Example of data organization and access control for a business

• Data can be time series organized

• Single account owned by a customer/tenant

• Contains data for several teams or departments within the

• Access can be controlled using RBAC roles per department

• Billed to the customer (subscription)

• Isolation at the account level

• Separate accounts per department

• Stricter isolation since departments are in different accounts

• Billing can be to the customer or department

• Can be further split into separate accounts for further isolation

• Will need to attach / mount multiple accounts or filesystems to the

Blob Storage Pillars

PRODUCTIVE HYBRID SCALABLE TRUSTED

✓ Drag & Drop UI ✓ Orchestrate where ✓ Serverless scalability ✓ Certified compliant

数据集成服务：无服务器、弹性扩展、混合 Data Movement and Transformation @Scale

Hybrid Pipeline Model

Author & Monitor

SSIS Package Execution

INGEST STORE PREP & MODEL & Cloud

Social Data orchestration, Azure SQL DW,

Azure Data Factory

SSIS Integration Runtime

Cloud data sources SSIS Cloud ETL SQL DB Managed Instance

Azure Data Factory

Author, orchestrate and monitor with Azure Data Factory

On-premises data 90 + Pre-built Code-free

Multicloud data Code-centric

Orchestrate and Monitor

Wrangling data flow Mapping data

On-premises data 90 + Pre-built Code-free

Multicloud data Databricks … Azure ML

Orchestrate and Monitor

Linked Services Datasets Integration Runtime

Visual UI Code Support

Control Flow SSIS Execution

Pipelines and activities in Azure Data Factory

Datasets and linked services in Azure Data Factory

Datasets & Linked services

Datasets & Linked services

Schema mapping in copy activity

Schema mapping in copy activity

File Azure Data

Azure Cosmos DB Google BigQuery Informix HDFS Oracle Service

Azure Search Microsoft Access Drill Salesforce

Construct and perform ETL and ELT

Spend less time on design, testing, and

• Data cleansing, transformation,

• Cloud scale via Spark execution

数据孤岛不一致的解决方案的复杂多云环境成本上升

所有数据的一个支持不同类型的数据无限熟悉的工具和生低

安全易于管理快扩展性经济高效集成就绪